Recipe Synthesize Fake Data
Tier: Intermediate
Commands used: stats, describegpt, synthesize
Anchor dataset: a small customers.csv you build inline (so the recipe runs without external downloads)
New in: qsv 20.1.0
You need to share a CSV with a partner, a contractor, an interview candidate, or a public demo — but it contains real customer data. You want the shape of the data preserved (distributions, null rates, value cardinality, column types) and the content faked-but-realistic (email columns get real-looking emails, state_abbr columns get real US state abbreviations, zip_code columns get five digits).
Concretely:
- the synthetic file should statistically resemble the source — so models trained on it behave similarly, dashboards look the same, query plans hit the same indexes
- no real records leak through — every row is independently generated
- output is reproducible — re-running with the same seed produces the same fake file
- the workflow stays on your machine — no source data leaves the host
A tiny inline-built customers.csv is enough to demonstrate every moving part. Run:
cat > customers.csv <<'EOF'
customer_id,first_name,last_name,email,phone,street_address,city,state_abbr,zip_code,signup_date,plan,monthly_spend_usd,is_active
C-0001,Alice,Nguyen,alice.nguyen@example.com,415-555-0102,742 Evergreen Terrace,San Francisco,CA,94110,2024-03-12,Pro,49.00,true
C-0002,Bob,Garcia,bob.g@example.org,212-555-0188,221B Baker Street,New York,NY,10001,2024-04-02,Free,0.00,true
C-0003,Carol,Tanaka,carol.t@example.com,503-555-0144,1600 Pennsylvania Ave NW,Portland,OR,97205,2024-04-19,Pro,49.00,false
C-0004,Dan,Müller,dan.m@example.net,,500 South Buena Vista St,Burbank,CA,91521,2024-05-05,Enterprise,499.00,true
C-0005,Eve,Williams,eve.w@example.com,617-555-0123,1 Cyclotron Rd,Berkeley,CA,94720,2024-05-28,Pro,49.00,true
C-0006,Frank,O'Connor,,415-555-0181,1 Hacker Way,Menlo Park,CA,94025,2024-06-14,Free,0.00,true
EOF

The shape: 13 columns, a mix of ID / PII / categorical / numeric / boolean / date / nullable. customer_id is all-unique, so qsv tags it unique_id deterministically; no LLM guess is required. The workflow shown on these 6 rows is the same for a 100 million-row source.
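Before moving on, it can help to confirm the file has the shape described above. The following is a minimal sketch using plain awk rather than qsv; csv_shape is a hypothetical helper name, and the naive comma split assumes no quoted fields containing commas or newlines (true for the sample file, not for CSV in general).

```shell
# csv_shape FILE: print the column count (taken from the header row)
# and the number of data rows.
# Assumption: no quoted fields with embedded commas or newlines.
csv_shape() {
  awk -F',' 'NR == 1 { cols = NF }
             END { printf "%d columns, %d data rows\n", cols, NR - 1 }' "$1"
}

# Only runs when the sample file from this step is present.
if [ -f customers.csv ]; then
  csv_shape customers.csv
fi
```

For the customers.csv built above, this should report 13 columns and 6 data rows.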
qsv stats --everything --cardinality --infer-dates --infer-boolean --stats-jsonl customers.csv

This writes customers.stats.csv and customers.stats.csv.data.jsonl next to the source. Every downstream command in this recipe (describegpt, synthesize, and any verification frequency / stats run) reuses that cache. See Stats Cache & Caching for the deeper reasoning.
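A quick way to confirm the cache landed where later steps expect it is to test for the two files by name. This is a sketch only: stats_cache_files is a hypothetical helper, and the file names are taken from the customers.csv example in this recipe, so adjust them if your qsv version names the cache differently.

```shell
# stats_cache_files CSV: report whether the two stats cache files
# exist next to CSV. Returns non-zero if either one is missing.
stats_cache_files() {
  stem=${1%.csv}
  missing=0
  for f in "$stem.stats.csv" "$stem.stats.csv.data.jsonl"; do
    if [ -f "$f" ]; then
      echo "found   $f"
    else
      echo "missing $f"
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   stats_cache_files customers.csv || echo "re-run qsv stats first"
```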
qsv describegpt customers.csv \
--dictionary \
--infer-content-type \
--two-pass \
--format JSON \
-u http://localhost:11434/v1 \
--model deepseek-r1:14b \
-o customers.dict.json

What the flags do:
- --dictionary: emit a Data Dictionary (one entry per column).
- --infer-content-type: ask the LLM to tag each column with a semantic Content Type from the curated 47-token vocabulary (email, phone, street_address, city, state_abbr, zip_code, first_name, last_name, …). customer_id gets unique_id deterministically, before the LLM sees it.
- --two-pass: re-run the LLM over the full first-pass dictionary so it can spot field relationships. This is what tells synthesize that state_abbr + zip_code belong together, so the synthesized state_abbr will be a real US state abbreviation that matches the row's zip_code. Doubles dictionary LLM cost; worth it when realism matters.
- --format JSON: synthesize needs the JSON form.
- -u … --model …: point at a local Ollama / Jan / LM Studio so the data stays on-device. For OpenAI-compatible API endpoints, set QSV_LLM_APIKEY instead.
For a hosted endpoint:
QSV_LLM_APIKEY="$OPENAI_API_KEY" qsv describegpt customers.csv \
--dictionary --infer-content-type --two-pass --format JSON \
-o customers.dict.json

qsv synthesize customers.csv \
--dictionary customers.dict.json \
--locale en \
--seed 42 \
-n 1000 \
> customers-fake.csv

What happens:
- --dictionary customers.dict.json plugs in the Content Types from step 2, so columns get realistic fakes from fake-rs (real-looking emails, names, addresses, phone numbers).
- --locale en picks US-flavored fakes. Switch to fr_fr, de_de, ja_jp, pt_br, etc. for regional flavors (14 locales supported).
- --seed 42 makes the run fully reproducible: same seed, same input, same output.
- -n 1000 produces 1,000 synthetic rows from the 6-row source. Numeric and date columns are reproduced from quartile buckets of the source's distribution, so a larger N just gives a denser sample of the same shape.
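The reproducibility claim for --seed is easy to check mechanically: run the same command twice and compare the bytes. The sketch below assumes nothing about qsv beyond the flags already shown in this step; is_reproducible is a hypothetical helper that works for any command writing to stdout.

```shell
# is_reproducible CMD [ARGS...]: run the command twice, capture stdout
# each time, and succeed only if both runs are byte-identical.
is_reproducible() {
  run_a=$(mktemp) || return 2
  run_b=$(mktemp) || { rm -f "$run_a"; return 2; }
  "$@" > "$run_a" && "$@" > "$run_b" && cmp -s "$run_a" "$run_b"
  rc=$?
  rm -f "$run_a" "$run_b"
  return "$rc"
}

# Only runs when qsv and the inputs from the earlier steps are present.
if command -v qsv >/dev/null 2>&1 && [ -f customers.csv ] && [ -f customers.dict.json ]; then
  is_reproducible qsv synthesize customers.csv \
    --dictionary customers.dict.json --locale en --seed 42 -n 1000 \
    && echo "same seed, same output"
fi
```

A failure here would suggest that something other than the seed (input file, dictionary, or qsv version) changed between runs.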
qsv stats --everything --cardinality customers-fake.csv | qsv color

You should see:
- the same per-column types as the source
- is_active cardinality of 2 (true/false); categorical columns are reproduced by frequency-weighted sampling
- monthly_spend_usd mean / median in the same neighborhood as the source
- null rates on phone and email that match the source proportions
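The null-rate expectation can also be checked outside qsv. This is a rough sketch with plain awk; null_rate is a hypothetical helper, and the naive comma split assumes no quoted fields containing commas, which may not hold for every faked address or name.

```shell
# null_rate FILE COLUMN: fraction of data rows whose COLUMN is empty,
# printed with four decimals.
# Assumption: no quoted fields with embedded commas.
null_rate() {
  awk -F',' -v name="$2" '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == name) col = i
      if (!col) { print "no such column: " name > "/dev/stderr"; exit 1 }
      next
    }
    $col == "" { empty++ }
    END { if (col && NR > 1) printf "%.4f\n", empty / (NR - 1) }
  ' "$1"
}

# Compare source and synthetic files when both are present.
if [ -f customers.csv ] && [ -f customers-fake.csv ]; then
  for col in phone email; do
    echo "$col: source=$(null_rate customers.csv "$col") fake=$(null_rate customers-fake.csv "$col")"
  done
fi
```

In the 6-row source, phone and email are each empty in one row, so the source rate is 0.1667 for both; the synthetic rates should land close to that.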
Spot-check by eye:
qsv slice --start 0 --len 5 customers-fake.csv | qsv color

The email column should contain real-looking emails (not String#1), state_abbr should hold 2-letter US state codes that match their row's zip_code, and customer_id should be unique per row.
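Beyond eyeballing, a coarse leak check is to count synthetic rows that are byte-for-byte copies of a source row. The sketch below uses awk only; count_verbatim_rows is a hypothetical helper. It detects whole-row copies and nothing subtler: individual values such as plan names will recur by design, and a row that differs only in quoting would not be caught.

```shell
# count_verbatim_rows SOURCE FAKE: number of data rows in FAKE that are
# identical to some data row in SOURCE. Header rows are skipped.
count_verbatim_rows() {
  awk 'FNR == 1 { file++; next }          # header of each file: advance the file counter
       file == 1 { seen[$0] = 1; next }   # remember every source row
       ($0 in seen) { hits++ }            # synthetic row that matches a source row exactly
       END { print hits + 0 }' "$1" "$2"
}

# Only runs when both files are present.
if [ -f customers.csv ] && [ -f customers-fake.csv ]; then
  echo "verbatim source rows in synthetic file: $(count_verbatim_rows customers.csv customers-fake.csv)"
fi
```

A result of 0 is the expected outcome; anything else deserves a closer look before sharing the file.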
qsv synthesize customers.csv \
--dictionary customers.dict.json \
--consistent-fakes \
--seed 42 \
-n 1000 \
> customers-fake.csv

Use --consistent-fakes for deidentified synthesis where you want stable joins on the faked columns: the same source value always produces the same fake (for structured-faker columns with bounded cardinality). This is useful when you want to share a synthetic CSV that two parties can still join on a faked key.
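To make the idea of a consistent fake concrete, here is a conceptual sketch of a deterministic value-to-fake mapping. It is not qsv's implementation: the pool of surnames, the consistent_fake function, and the cksum-based hashing are all assumptions chosen purely for illustration.

```shell
# Hypothetical pool of fake surnames; a real faker draws from a far larger set.
FAKE_POOL="Archer Bennett Castillo Dvorak Eriksen Fontaine"

# consistent_fake SEED VALUE: map VALUE to one pool entry, so that the
# same seed and value always yield the same fake.
consistent_fake() {
  seed=$1
  value=$2
  # cksum prints "<crc> <byte-count>"; keep only the CRC.
  crc=$(printf '%s:%s' "$seed" "$value" | cksum | cut -d' ' -f1)
  set -- $FAKE_POOL            # pool words become the positional parameters
  shift $(( crc % $# ))        # skip a hash-determined number of entries
  printf '%s\n' "$1"
}

# The same input maps to the same fake on every call:
consistent_fake 42 "Nguyen"
consistent_fake 42 "Nguyen"
```

With a small pool like this one, different source values can collide on the same fake; that is a property of this toy, not a statement about how qsv behaves.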
qsv synthesize customers.csv --seed 42 -n 1000 > customers-fake.csv

Without a dictionary, synthesize falls back to pure type/frequency-based generation: distributions and null rates are preserved, but PII columns get generic strings instead of realistic fakes. This is useful when the source is non-sensitive and you just need shape-faithful test data.
QSV_LLM_APIKEY="$OPENAI_API_KEY" qsv synthesize customers.csv \
--infer-content-type \
--seed 42 \
-n 1000 \
> customers-fake.csv

This is the one-shot equivalent of steps 2 + 3. Note that it runs describegpt --dictionary --infer-content-type internally, without --two-pass; for cross-field consistency, generate the dictionary explicitly as in step 2.
qsv synthesize customers.csv --dictionary customers.dict.json --locale ja_jp --seed 42 -n 1000 > customers-jp.csv
qsv synthesize customers.csv --dictionary customers.dict.json --locale fr_fr --seed 42 -n 1000 > customers-fr.csv

Supported locales: en, fr_fr, de_de, it_it, pt_br, pt_pt, ja_jp, zh_cn, zh_tw, ar_sa, cy_gb, fa_ir, nl_nl, tr_tr. Sparse locales fall back to en data for missing categories.
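When several locales are needed, the two commands above generalize to a loop over one saved dictionary. The following is a sketch under stated assumptions: synth_locales and the QSV override variable are hypothetical names introduced here, and the output naming (source stem plus the full locale code) differs slightly from the customers-jp.csv style used above.

```shell
# QSV is a hypothetical override hook for this sketch; it defaults to the real binary.
QSV=${QSV:-qsv}

# synth_locales SOURCE DICT LOCALE...: write one synthetic file per locale,
# reusing the same dictionary, seed, and row count each time.
synth_locales() {
  src=$1
  dict=$2
  shift 2
  for loc in "$@"; do
    out="${src%.csv}-${loc}.csv"
    "$QSV" synthesize "$src" --dictionary "$dict" --locale "$loc" --seed 42 -n 1000 > "$out" \
      || { echo "synthesize failed for locale $loc" >&2; return 1; }
    echo "wrote $out"
  done
}

# Example:
#   synth_locales customers.csv customers.dict.json ja_jp fr_fr de_de
```

Because the dictionary is built once, each extra locale costs only another synthesize run, with no further LLM call.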
- synthesize is a single source-read plus an N-row write, so cost scales linearly with -n. On a 100 M-row source, the source-read dominates; use index to multithread it.
- The expensive step is the LLM call in describegpt. --two-pass roughly doubles dictionary cost (two passes over the dictionary); opt in when accuracy matters more than throughput. For interactive development, start with single-pass, then turn on --two-pass for the final dictionary.
- The Data Dictionary is reusable. Run describegpt once, save customers.dict.json, then re-run synthesize with different --seed / --locale / -n values for free.
- Columns are generated independently; there's no per-row LLM call. The synthetic CSV writes at near-CSV-writer speed.
- No correlation modeling by default. Numeric columns are generated from per-column distributions; if monthly_spend_usd and plan are correlated in the source, synthesize won't preserve that correlation. Categorical-only correlations (e.g. state_abbr ↔ zip_code) can be preserved via describegpt --two-pass, which clusters related fields in the dictionary; see AI & Documentation → --two-pass.
- High-cardinality faker columns (cardinality above the internal 100,000 cap) generate a fresh fake per row, so distinct count is approximate rather than exact for those columns.
- Structured fakers ignore length stats. A length-stat truncation rule only applies to unstructured text columns (lorem_*, free_text, no-faker fallback). Structured fakers (email, uuid, phone, address parts) emit their natural format unmodified; truncating them would break the format.
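The first limitation can be observed directly by comparing a grouped mean in the source and the synthetic file. This is a minimal awk sketch, not a qsv feature; mean_by is a hypothetical helper, and it assumes no quoted fields containing commas.

```shell
# mean_by FILE GROUP_COL VALUE_COL: mean of VALUE_COL for each distinct
# GROUP_COL value, one "group mean" line per group, sorted by group.
# Assumption: no quoted fields with embedded commas.
mean_by() {
  awk -F',' -v g="$2" -v v="$3" '
    NR == 1 {
      for (i = 1; i <= NF; i++) { if ($i == g) gc = i; if ($i == v) vc = i }
      if (!gc || !vc) { print "column not found" > "/dev/stderr"; exit 1 }
      next
    }
    $vc != "" { sum[$gc] += $vc; n[$gc]++ }
    END { for (k in n) printf "%s %.2f\n", k, sum[k] / n[k] }
  ' "$1" | sort
}

# Compare per-plan spend in the source and the synthetic file when both exist.
if [ -f customers.csv ] && [ -f customers-fake.csv ]; then
  echo "source:";    mean_by customers.csv plan monthly_spend_usd
  echo "synthetic:"; mean_by customers-fake.csv plan monthly_spend_usd
fi
```

In the source, plan fully determines spend (Enterprise 499.00, Free 0.00, Pro 49.00). Since columns are generated independently, the per-plan means in the synthetic file should be expected to drift toward each other.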
- AI & Documentation → synthesize
- AI & Documentation → describegpt: Content Types, --two-pass, --markdown-template
- /docs/help/synthesize.md: full flag reference
- /docs/help/describegpt.md
- src/cmd/synthesize/faker_map.rs: every Content Type token and the faker it maps to
- Stats Cache & Caching: why step 1 makes step 3 cheap
- qsv 20.1.0 release notes: the "Synthetic Data" release
- tests/test_synthesize.rs: the in-repo test suite is itself a worked-example gallery