Recipe: Synthesize Fake Data

Tier: Intermediate
Commands used: stats, describegpt, synthesize
Anchor dataset: a small customers.csv you build inline (so the recipe runs without external downloads)
New in: qsv 20.1.0

Problem

You need to share a CSV with a partner, a contractor, an interview candidate, or a public demo — but it contains real customer data. You want the shape of the data preserved (distributions, null rates, value cardinality, column types) and the content faked-but-realistic (email columns get real-looking emails, state_abbr columns get real US state abbreviations, zip_code columns get five digits).

Concretely:

the synthetic file should statistically resemble the source — so models trained on it behave similarly, dashboards look the same, query plans hit the same indexes
no real records leak through — every row is independently generated
output is reproducible — re-running with the same seed produces the same fake file
the workflow stays on your machine — no source data leaves the host

Data

A tiny inline-built customers.csv is enough to demonstrate every moving part. Run:

cat > customers.csv <<'EOF'
customer_id,first_name,last_name,email,phone,street_address,city,state_abbr,zip_code,signup_date,plan,monthly_spend_usd,is_active
C-0001,Alice,Nguyen,alice.nguyen@example.com,415-555-0102,742 Evergreen Terrace,San Francisco,CA,94110,2024-03-12,Pro,49.00,true
C-0002,Bob,Garcia,bob.g@example.org,212-555-0188,221B Baker Street,New York,NY,10001,2024-04-02,Free,0.00,true
C-0003,Carol,Tanaka,carol.t@example.com,503-555-0144,1600 Pennsylvania Ave NW,Portland,OR,97205,2024-04-19,Pro,49.00,false
C-0004,Dan,Müller,dan.m@example.net,,500 South Buena Vista St,Burbank,CA,91521,2024-05-05,Enterprise,499.00,true
C-0005,Eve,Williams,eve.w@example.com,617-555-0123,1 Cyclotron Rd,Berkeley,CA,94720,2024-05-28,Pro,49.00,true
C-0006,Frank,O'Connor,,415-555-0181,1 Hacker Way,Menlo Park,CA,94025,2024-06-14,Free,0.00,true
EOF

The shape: 13 columns, mixing ID, PII, categorical, numeric, boolean, date, and nullable fields. customer_id is all-unique, so qsv tags it unique_id deterministically, with no LLM guess required. The workflow is the same whether the source has 6 rows or 100 million.
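
As a quick sanity check on the shape described above, the following sketch counts fields per line with awk. It assumes no field contains a quoted comma, which holds for this sample but not for CSV in general. field_counts is a hypothetical helper name, not a qsv command.

```shell
# Hypothetical helper: print each distinct per-line field count and the number
# of lines that have it. Assumes no quoted commas (true for the sample file).
field_counts() {
  awk -F',' '{ n[NF]++ } END { for (k in n) print k, n[k] }' "$1" | sort -n
}

# Usage:
#   field_counts customers.csv
```

For the sample file this should print a single line, 13 7: thirteen fields on each of the seven lines (header plus six records). More than one output line would mean a row is ragged.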

Solution

1. Build the stats cache once

qsv stats --everything --cardinality --infer-dates --infer-boolean --stats-jsonl customers.csv

This drops customers.stats.csv and customers.stats.csv.data.jsonl next to the source. Every downstream command in this recipe — describegpt, synthesize, and any verification frequency / stats run — reuses that cache. See Stats Cache & Caching for the deeper reasoning.
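
Before moving on, it can help to confirm that the two cache files named above actually exist. The sketch below only checks for those two file names as given in this step; check_stats_cache is a hypothetical helper, and it says nothing about whether the cache is fresh.

```shell
# Hypothetical helper: report whether the cache files named in this step sit
# next to the given CSV. Returns non-zero if either is missing or empty.
check_stats_cache() {
  stem="${1%.csv}"
  rc=0
  for f in "$stem.stats.csv" "$stem.stats.csv.data.jsonl"; do
    if [ -s "$f" ]; then
      echo "ok       $f"
    else
      echo "missing  $f"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   check_stats_cache customers.csv
```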

2. Generate a Content-Type-tagged Data Dictionary with `describegpt --two-pass`

qsv describegpt customers.csv \
  --dictionary \
  --infer-content-type \
  --two-pass \
  --format JSON \
  -u http://localhost:11434/v1 \
  --model deepseek-r1:14b \
  -o customers.dict.json

What the flags do:

--dictionary — emit a Data Dictionary (one entry per column).
--infer-content-type — ask the LLM to tag each column with a semantic Content Type from the curated 47-token vocabulary (email, phone, street_address, city, state_abbr, zip_code, first_name, last_name, …). customer_id gets unique_id deterministically, before the LLM sees it.
--two-pass — re-run the LLM over the full first-pass dictionary so it can spot field relationships. This is what tells synthesize that state_abbr + zip_code belong together, so the synthesized state_abbr will be a real US state abbreviation that matches the row's zip_code. Doubles dictionary LLM cost — worth it when realism matters.
--format JSON — synthesize needs the JSON form.
-u … --model … — point at a local Ollama / Jan / LM Studio so the data stays on-device. For OpenAI-compatible API endpoints, set QSV_LLM_APIKEY instead.

For a hosted endpoint:

QSV_LLM_APIKEY="$OPENAI_API_KEY" qsv describegpt customers.csv \
  --dictionary --infer-content-type --two-pass --format JSON \
  -o customers.dict.json

3. Synthesize the fake CSV

qsv synthesize customers.csv \
  --dictionary customers.dict.json \
  --locale en \
  --seed 42 \
  -n 1000 \
  > customers-fake.csv

What happens:

--dictionary customers.dict.json plugs in the Content Types from step 2, so columns get realistic fakes from fake-rs (real-looking emails, names, addresses, phone numbers).
--locale en picks US-flavored fakes. Switch to fr_fr, de_de, ja_jp, pt_br, etc. for regional flavors (14 locales supported).
--seed 42 — fully reproducible: same seed, same input, same output.
-n 1000 — produce 1,000 synthetic rows from the 6-row source. Numeric and date columns are reproduced from quartile buckets of the source's distribution, so larger N just gets a denser sample of the same shape.
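
The reproducibility claim for --seed is easy to test rather than trust. A minimal sketch, assuming the command writes its result to stdout as in the example above; is_reproducible is a hypothetical helper name.

```shell
# Hypothetical helper: run the given command twice and succeed only if both
# runs exit cleanly and write byte-identical output to stdout.
is_reproducible() {
  first=$(mktemp)
  second=$(mktemp)
  if "$@" > "$first" && "$@" > "$second" && cmp -s "$first" "$second"; then
    rc=0
  else
    rc=1
  fi
  rm -f "$first" "$second"
  return "$rc"
}

# Usage (flags copied from the command in this step):
#   is_reproducible qsv synthesize customers.csv \
#     --dictionary customers.dict.json --locale en --seed 42 -n 1000 \
#     && echo "same seed, same output"
```

Comparing bytes with cmp is deliberately strict: any difference in row order or formatting counts as a failure.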

4. Verify the shape matches

qsv stats --everything --cardinality customers-fake.csv | qsv color

You should see:

the same per-column types as the source
is_active cardinality of 2 (true / false) — categorical columns are reproduced by frequency-weighted sampling
monthly_spend_usd mean / median in the same neighborhood as the source
the null rate on phone and email matches the source proportions
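
To put a number on the null-rate item in the list above, one option is to pull a single column out with qsv select and count empty cells with awk. This is a sketch: null_rate is a hypothetical helper, and it treats both a blank line and a bare "" as an empty cell.

```shell
# Hypothetical helper: read a single-column CSV on stdin (header on line 1)
# and print the share of empty cells, to three decimals.
null_rate() {
  awk 'NR > 1 { total++; if ($0 == "" || $0 == "\"\"") empty++ }
       END { if (total) printf "%.3f\n", empty / total; else print "n/a" }'
}

# Usage:
#   qsv select phone customers.csv      | null_rate
#   qsv select phone customers-fake.csv | null_rate
```

In the six-row source, phone and email each have one empty cell, so the source-side figure is 0.167. The synthetic figure should land near that value, not necessarily on it.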

Spot-check by eye:

qsv slice --start 0 --len 5 customers-fake.csv | qsv color

The email column should contain real-looking addresses (not String#1), state_abbr should hold 2-letter US state codes that match each row's zip_code, and customer_id should be unique per row.
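
Eyeballing five rows will not catch a leaked value. A rough complement, sketched below, is to list any values a PII column has in common between the source and the synthetic file. shared_values is a hypothetical helper; the intermediate file names are placeholders. An overlap is not proof of a leak (a faker can produce a common name by chance), and categorical columns such as plan are expected to overlap.

```shell
# Hypothetical helper: print values that occur in both single-column CSV
# files. Header lines and empty cells are skipped.
shared_values() {
  src_vals=$(mktemp)
  fake_vals=$(mktemp)
  awk 'NR > 1 && $0 != "" && $0 != "\"\""' "$1" | LC_ALL=C sort -u > "$src_vals"
  awk 'NR > 1 && $0 != "" && $0 != "\"\""' "$2" | LC_ALL=C sort -u > "$fake_vals"
  LC_ALL=C comm -12 "$src_vals" "$fake_vals"
  rm -f "$src_vals" "$fake_vals"
}

# Usage:
#   qsv select email customers.csv      > src-email.csv
#   qsv select email customers-fake.csv > fake-email.csv
#   shared_values src-email.csv fake-email.csv   # no output means no overlap
```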

Variations

Stable source → fake mapping (`--consistent-fakes`)

qsv synthesize customers.csv \
  --dictionary customers.dict.json \
  --consistent-fakes \
  --seed 42 \
  -n 1000 \
  > customers-fake.csv

Use --consistent-fakes for de-identified synthesis where joins on the faked columns must stay stable: the same source value always produces the same fake (for structured-faker columns with bounded cardinality). This is useful when two parties need to join synthetic CSVs on a faked key.

Skip the LLM entirely

qsv synthesize customers.csv --seed 42 -n 1000 > customers-fake.csv

Without a dictionary, synthesize falls back to pure type/frequency-based generation: distributions and null rates are preserved, but PII columns get generic strings instead of realistic fakes. Useful when the source is non-sensitive and you just need shape-faithful test data.

Let `synthesize` build the dictionary itself

QSV_LLM_APIKEY="$OPENAI_API_KEY" qsv synthesize customers.csv \
  --infer-content-type \
  --seed 42 \
  -n 1000 \
  > customers-fake.csv

One-shot equivalent of steps 2 + 3. Note that this runs describegpt --dictionary --infer-content-type internally, without --two-pass. For cross-field consistency, generate the dictionary explicitly as in step 2.

Locale switching

qsv synthesize customers.csv --dictionary customers.dict.json --locale ja_jp --seed 42 -n 1000 > customers-jp.csv
qsv synthesize customers.csv --dictionary customers.dict.json --locale fr_fr --seed 42 -n 1000 > customers-fr.csv

Supported locales: en, fr_fr, de_de, it_it, pt_br, pt_pt, ja_jp, zh_cn, zh_tw, ar_sa, cy_gb, fa_ir, nl_nl, tr_tr. Sparse locales fall back to en data for missing categories.
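
When several locales are needed, the two commands above generalize to a loop. The sketch below is a dry run: it only prints one command per locale, reusing the flags shown in this section, so the list can be reviewed before piping it to sh. locale_commands and the customers-<locale>.csv output naming are illustrative choices, not qsv conventions.

```shell
# Hypothetical helper: print one synthesize command per locale argument.
# Nothing is executed here; pipe the output to sh to actually run it.
locale_commands() {
  for loc in "$@"; do
    printf 'qsv synthesize customers.csv --dictionary customers.dict.json --locale %s --seed 42 -n 1000 > customers-%s.csv\n' "$loc" "$loc"
  done
}

locale_commands en fr_fr de_de ja_jp
```

Because the dictionary is reused, running the printed commands makes no further LLM calls.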

Performance notes

synthesize is a single source-read plus an N-row write, so the write cost scales linearly with -n. On a 100-million-row source, the source-read dominates; use index to multithread it.
The expensive step is the LLM call in describegpt. --two-pass roughly doubles dictionary cost (two passes over the dictionary) — opt in when accuracy matters more than throughput. For interactive development, start with single-pass, then turn on --two-pass for the final dictionary.
The Data Dictionary is reusable. Run describegpt once, save customers.dict.json, then re-run synthesize with different --seed / --locale / -n values for free.
Columns are generated independently — there's no per-row LLM call. The synthetic CSV writes at near-CSV-writer speed.

Caveats

No correlation modeling by default. Numeric columns are generated from per-column distributions; if monthly_spend_usd and plan are correlated in the source, synthesize won't preserve that correlation. Categorical-only correlations (e.g. state_abbr ↔ zip_code) can be preserved via describegpt --two-pass, which clusters related fields in the dictionary — see AI & Documentation → --two-pass.
High-cardinality faker columns (cardinality above the internal 100,000 cap) generate a fresh fake per row, so distinct count is approximate rather than exact for those columns.
Structured fakers ignore length stats. A length-stat truncation rule only applies to unstructured text columns (lorem_*, free_text, no-faker fallback). Structured fakers (email, uuid, phone, address parts) emit their natural format unmodified — truncating them would break the format.
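
The first caveat is easy to observe on the sample data. The sketch below counts distinct plan / monthly_spend_usd combinations; pair_counts is a hypothetical helper that simply tallies identical lines from a two-column qsv select.

```shell
# Hypothetical helper: count identical data lines on stdin (header skipped)
# and print them most frequent first, as "<count> <line>".
pair_counts() {
  awk 'NR > 1 { n[$0]++ } END { for (k in n) print n[k], k }' | sort -rn
}

# Usage:
#   qsv select plan,monthly_spend_usd customers.csv      | pair_counts
#   qsv select plan,monthly_spend_usd customers-fake.csv | pair_counts
```

In the source, every plan maps to exactly one spend value (Pro with 49.00, Free with 0.00, Enterprise with 499.00). If the synthetic file shows combinations such as Free with 499.00, that is the lost correlation this caveat describes.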
