Recipe Synthesize Fake Data

Joel Natividad edited this page May 18, 2026 · 1 revision

Recipe: Synthesize Fake Data

Tier: Intermediate
Commands used: stats, describegpt, synthesize
Anchor dataset: a small customers.csv you build inline (so the recipe runs without external downloads)
New in: qsv 20.1.0

Problem

You need to share a CSV with a partner, a contractor, an interview candidate, or a public demo — but it contains real customer data. You want the shape of the data preserved (distributions, null rates, value cardinality, column types) and the content faked-but-realistic (email columns get real-looking emails, state_abbr columns get real US state abbreviations, zip_code columns get five digits).

Concretely:

  • the synthetic file should statistically resemble the source — so models trained on it behave similarly, dashboards look the same, query plans hit the same indexes
  • no real records leak through — every row is independently generated
  • output is reproducible — re-running with the same seed produces the same fake file
  • the workflow stays on your machine — no source data leaves the host

Data

A tiny inline-built customers.csv is enough to demonstrate every moving part. Run:

cat > customers.csv <<'EOF'
customer_id,first_name,last_name,email,phone,street_address,city,state_abbr,zip_code,signup_date,plan,monthly_spend_usd,is_active
C-0001,Alice,Nguyen,alice.nguyen@example.com,415-555-0102,742 Evergreen Terrace,San Francisco,CA,94110,2024-03-12,Pro,49.00,true
C-0002,Bob,Garcia,bob.g@example.org,212-555-0188,221B Baker Street,New York,NY,10001,2024-04-02,Free,0.00,true
C-0003,Carol,Tanaka,carol.t@example.com,503-555-0144,1600 Pennsylvania Ave NW,Portland,OR,97205,2024-04-19,Pro,49.00,false
C-0004,Dan,Müller,dan.m@example.net,,500 South Buena Vista St,Burbank,CA,91521,2024-05-05,Enterprise,499.00,true
C-0005,Eve,Williams,eve.w@example.com,617-555-0123,1 Cyclotron Rd,Berkeley,CA,94720,2024-05-28,Pro,49.00,true
C-0006,Frank,O'Connor,,415-555-0181,1 Hacker Way,Menlo Park,CA,94025,2024-06-14,Free,0.00,true
EOF

The shape: 13 columns and 6 rows, mixing ID, PII, categorical, numeric, boolean, date, and nullable fields. customer_id is all-unique, so qsv tags it unique_id deterministically, with no LLM guess required. The workflow is the same whether the source has 6 rows or 100 million.
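
A quick way to confirm the file landed as described is to count rows and columns with plain awk. This is a minimal sketch, not part of qsv: csv_shape is a hypothetical helper, and it assumes no quoted fields containing commas or newlines, which holds for this dataset but not for CSV in general. It runs against a throwaway demo file so it works on its own.

```shell
# Hypothetical helper: report data rows and column count of a simple CSV.
# Assumption: no quoted fields with embedded commas or newlines.
csv_shape() {
  awk -F',' 'NR == 1 { cols = NF } END { printf "%d rows, %d columns\n", NR - 1, cols }' "$1"
}

# Demo on a two-row throwaway file so the snippet runs without customers.csv.
demo=$(mktemp)
printf 'id,name,plan\n1,Alice,Pro\n2,Bob,Free\n' > "$demo"
shape=$(csv_shape "$demo")
echo "$shape"
rm -f "$demo"
```

Running csv_shape customers.csv on the file built above should report 6 rows and 13 columns.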

Solution

1. Build the stats cache once

qsv stats --everything --cardinality --infer-dates --infer-boolean --stats-jsonl customers.csv

This drops customers.stats.csv and customers.stats.csv.data.jsonl next to the source. Every downstream command in this recipe — describegpt, synthesize, and any verification frequency / stats run — reuses that cache. See Stats Cache & Caching for the deeper reasoning.
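
Before running the later steps, it can help to confirm that both cache files exist and are not older than the source. The sketch below is an assumption about what a usable cache looks like (present, non-empty, not older than the CSV); qsv applies its own validity rules, and stats_cache_ready is a hypothetical helper, not a qsv command. The demo uses throwaway files in a temp directory.

```shell
# Hypothetical helper: succeed only if both cache files named above exist,
# are non-empty, and are not older than the source CSV.
stats_cache_ready() {
  src=$1
  stem=${src%.csv}
  for f in "$stem.stats.csv" "$stem.stats.csv.data.jsonl"; do
    [ -s "$f" ] || return 1            # missing or empty cache file
    [ "$src" -nt "$f" ] && return 1    # source changed after the cache was written
  done
  return 0
}

# Demo with stand-in files so the snippet runs without qsv.
tmp=$(mktemp -d)
printf 'a\n1\n' > "$tmp/demo.csv"
touch -t 202401010000 "$tmp/demo.csv"   # backdate the source
if stats_cache_ready "$tmp/demo.csv"; then before=ready; else before=missing; fi
printf 'placeholder\n' > "$tmp/demo.stats.csv"
printf '{}\n' > "$tmp/demo.stats.csv.data.jsonl"
if stats_cache_ready "$tmp/demo.csv"; then after=ready; else after=missing; fi
echo "$before -> $after"
rm -rf "$tmp"
```

After step 1, stats_cache_ready customers.csv should succeed in the directory that holds the source file.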

2. Generate a Content-Type-tagged Data Dictionary with describegpt --two-pass

qsv describegpt customers.csv \
  --dictionary \
  --infer-content-type \
  --two-pass \
  --format JSON \
  -u http://localhost:11434/v1 \
  --model deepseek-r1:14b \
  -o customers.dict.json

What the flags do:

  • --dictionary — emit a Data Dictionary (one entry per column).
  • --infer-content-type — ask the LLM to tag each column with a semantic Content Type from the curated 47-token vocabulary (email, phone, street_address, city, state_abbr, zip_code, first_name, last_name, …). customer_id gets unique_id deterministically, before the LLM sees it.
  • --two-pass — re-run the LLM over the full first-pass dictionary so it can spot field relationships. This is what tells synthesize that state_abbr + zip_code belong together, so the synthesized state_abbr will be a real US state abbreviation that matches the row's zip_code. Doubles dictionary LLM cost — worth it when realism matters.
  • --format JSON: synthesize needs the JSON form.
  • -u … --model … — point at a local Ollama / Jan / LM Studio so the data stays on-device. For OpenAI-compatible API endpoints, set QSV_LLM_APIKEY instead.

For a hosted endpoint:

QSV_LLM_APIKEY="$OPENAI_API_KEY" qsv describegpt customers.csv \
  --dictionary --infer-content-type --two-pass --format JSON \
  -o customers.dict.json

3. Synthesize the fake CSV

qsv synthesize customers.csv \
  --dictionary customers.dict.json \
  --locale en \
  --seed 42 \
  -n 1000 \
  > customers-fake.csv

What happens:

  • --dictionary customers.dict.json plugs in the Content Types from step 2, so columns get realistic fakes from fake-rs (real-looking emails, names, addresses, phone numbers).
  • --locale en picks US-flavored fakes. Switch to fr_fr, de_de, ja_jp, pt_br, etc. for regional flavors (14 locales supported).
  • --seed 42 — fully reproducible: same seed, same input, same output.
  • -n 1000 — produce 1,000 synthetic rows from the 6-row source. Numeric and date columns are reproduced from quartile buckets of the source's distribution, so larger N just gets a denser sample of the same shape.
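
The reproducibility claim for --seed can be checked by running the same command twice and comparing checksums of the output. The sketch below uses a hypothetical helper, is_reproducible, and demonstrates it with stand-in commands so it runs without qsv installed.

```shell
# Hypothetical helper: run a command twice and succeed only if stdout
# is byte-identical both times (compared via cksum).
is_reproducible() {
  first=$("$@" | cksum) || return 2
  second=$("$@" | cksum) || return 2
  [ "$first" = "$second" ]
}

# Stand-in 1: a command with fixed output.
if is_reproducible printf 'id,plan\n1,Pro\n'; then fixed=same; else fixed=differs; fi

# Stand-in 2: a command whose output changes on every run
# (it appends a line to a counter file, then prints the line count).
counter=$(mktemp)
if is_reproducible sh -c 'echo x >> "$0"; wc -l < "$0"' "$counter"; then drift=same; else drift=differs; fi
rm -f "$counter"

echo "fixed output: $fixed, changing output: $drift"
```

With qsv installed, the same helper wraps the step 3 command directly: is_reproducible qsv synthesize customers.csv --dictionary customers.dict.json --locale en --seed 42 -n 1000 && echo reproducible.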

4. Verify the shape matches

qsv stats --everything --cardinality customers-fake.csv | qsv color

You should see:

  • the same per-column types as the source
  • is_active cardinality of 2 (true / false) — categorical columns are reproduced by frequency-weighted sampling
  • monthly_spend_usd mean / median in the same neighborhood as the source
  • the null rate on phone and email matches the source proportions
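
To put a number on the null-rate check, the fraction of empty values in one column can be computed with awk and compared between source and fake. This is a rough sketch: null_rate is a hypothetical helper, it treats only empty strings as null, and it assumes no quoted fields with embedded commas. The demo data is made up for illustration.

```shell
# Hypothetical helper: print the fraction of empty values in a named column.
# usage: null_rate FILE COLUMN
null_rate() {
  awk -F',' -v col="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
    idx && $idx == "" { empty++ }
    END {
      if (!idx || NR < 2) exit 1       # unknown column or no data rows
      printf "%.3f\n", empty / (NR - 1)
    }
  ' "$1"
}

# Demo on a throwaway four-row file.
demo=$(mktemp)
printf 'email,phone\na@example.com,\n,555-0100\nb@example.com,\nc@example.com,555-0102\n' > "$demo"
email_rate=$(null_rate "$demo" email)
phone_rate=$(null_rate "$demo" phone)
echo "email=$email_rate phone=$phone_rate"
rm -f "$demo"
```

In the six-row source, phone and email each have one empty value, so null_rate customers.csv phone gives roughly 0.167, and the same call on customers-fake.csv should land close to that.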

Spot-check by eye:

qsv slice --start 0 --len 5 customers-fake.csv | qsv color

The email column should contain real-looking addresses (not placeholders such as String#1), state_abbr should hold two-letter US state codes that match each row's zip_code, and customer_id should be unique per row.
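
Part of that spot check can be automated. The sketch below counts rows where state_abbr is not two uppercase letters, zip_code is not five digits, or customer_id repeats. It is a format check only: it does not verify that a state actually matches its zip code, which would need a lookup table. check_formats is a hypothetical helper and the demo rows are made up.

```shell
# Hypothetical helper: count format violations in the three spot-check columns.
# Assumption: no quoted fields with embedded commas.
check_formats() {
  awk -F',' '
    NR == 1 {
      for (i = 1; i <= NF; i++) idx[$i] = i
      s = idx["state_abbr"]; z = idx["zip_code"]; c = idx["customer_id"]
      next
    }
    {
      if ($s !~ /^[A-Z][A-Z]$/) bad_state++
      if ($z !~ /^[0-9][0-9][0-9][0-9][0-9]$/) bad_zip++
      if (seen[$c]++) dup_id++          # true from the second occurrence on
    }
    END { printf "bad_state=%d bad_zip=%d dup_id=%d\n", bad_state, bad_zip, dup_id }
  ' "$1"
}

# Demo: one lowercase state, one four-digit zip, one repeated ID.
demo=$(mktemp)
printf 'customer_id,state_abbr,zip_code\nC-1,CA,94110\nC-2,ny,10001\nC-2,OR,9720\n' > "$demo"
report=$(check_formats "$demo")
echo "$report"
rm -f "$demo"
```

On a healthy customers-fake.csv all three counters would be expected to read zero.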

Variations

Stable source → fake mapping (--consistent-fakes)

qsv synthesize customers.csv \
  --dictionary customers.dict.json \
  --consistent-fakes \
  --seed 42 \
  -n 1000 \
  > customers-fake.csv

For deidentified synthesis where you want stable joins on the faked columns: with --consistent-fakes, the same source value always produces the same fake (for structured-faker columns with bounded cardinality). Useful when you want to share a synthetic CSV that two parties can still join on a faked key.
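
The property that makes joins work is that the mapping is a pure function of the seed and the source value. The sketch below illustrates that idea only; it is not how qsv implements --consistent-fakes. stable_fake is a hypothetical function that derives a placeholder address from a CRC checksum.

```shell
# Illustration only, not qsv's algorithm: derive a fake deterministically
# from (seed, source value) so equal inputs always yield equal fakes.
# usage: stable_fake SEED VALUE
stable_fake() {
  n=$(printf '%s:%s' "$1" "$2" | cksum | cut -d' ' -f1)
  printf 'user%05d@example.com\n' "$((n % 100000))"
}

a1=$(stable_fake 42 alice.nguyen@example.com)
a2=$(stable_fake 42 alice.nguyen@example.com)
echo "first:  $a1"
echo "second: $a2"
```

Because both parties would compute the same fake for the same source value, the faked column can still serve as a join key without exposing the original.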

Skip the LLM entirely

qsv synthesize customers.csv --seed 42 -n 1000 > customers-fake.csv

Without a dictionary, synthesize falls back to pure type/frequency-based generation: distributions and null rates are preserved, but PII columns get generic strings instead of realistic fakes. Useful when the source is non-sensitive and you just need shape-faithful test data.

Let synthesize build the dictionary itself

QSV_LLM_APIKEY="$OPENAI_API_KEY" qsv synthesize customers.csv \
  --infer-content-type \
  --seed 42 \
  -n 1000 \
  > customers-fake.csv

One-shot equivalent of steps 2 + 3. Note this runs describegpt --dictionary --infer-content-type internally (without --two-pass — for cross-field consistency, generate the dictionary explicitly as in step 2).

Locale switching

qsv synthesize customers.csv --dictionary customers.dict.json --locale ja_jp --seed 42 -n 1000 > customers-jp.csv
qsv synthesize customers.csv --dictionary customers.dict.json --locale fr_fr --seed 42 -n 1000 > customers-fr.csv

Supported locales: en, fr_fr, de_de, it_it, pt_br, pt_pt, ja_jp, zh_cn, zh_tw, ar_sa, cy_gb, fa_ir, nl_nl, tr_tr. Sparse locales fall back to en data for missing categories.
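
Generating several locales from one dictionary is a simple loop. The sketch below wraps the commands above in a hypothetical gen_locales function that only prints each command by default; setting RUN=1 executes them, which assumes qsv, customers.csv, and customers.dict.json are all available in the current directory.

```shell
# Hypothetical wrapper around the synthesize calls shown above.
# Dry run by default; set RUN=1 to execute (requires qsv and the dictionary).
gen_locales() {
  for loc in "$@"; do
    out="customers-$loc.csv"
    if [ "${RUN:-0}" = "1" ]; then
      qsv synthesize customers.csv --dictionary customers.dict.json \
        --locale "$loc" --seed 42 -n 1000 > "$out"
    else
      echo "qsv synthesize customers.csv --dictionary customers.dict.json --locale $loc --seed 42 -n 1000 > $out"
    fi
  done
}

plan=$(gen_locales en fr_fr ja_jp)
echo "$plan"
```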

Performance notes

  • synthesize is a single source-read plus an N-row write, so cost scales linearly with -n. On a 100 M-row source, the source-read dominates — use index to multithread it.
  • The expensive step is the LLM call in describegpt. --two-pass roughly doubles dictionary cost (two passes over the dictionary) — opt in when accuracy matters more than throughput. For interactive development, start with single-pass, then turn on --two-pass for the final dictionary.
  • The Data Dictionary is reusable. Run describegpt once, save customers.dict.json, then re-run synthesize with different --seed / --locale / -n values for free.
  • Columns are generated independently — there's no per-row LLM call. The synthetic CSV writes at near-CSV-writer speed.

Caveats

  • No correlation modeling by default. Numeric columns are generated from per-column distributions; if monthly_spend_usd and plan are correlated in the source, synthesize won't preserve that correlation. Categorical-only correlations (e.g. state_abbr + zip_code) can be preserved via describegpt --two-pass, which clusters related fields in the dictionary. See AI & Documentation → --two-pass.
  • High-cardinality faker columns (cardinality above the internal 100,000 cap) generate a fresh fake per row, so distinct count is approximate rather than exact for those columns.
  • Structured fakers ignore length stats. A length-stat truncation rule only applies to unstructured text columns (lorem_*, free_text, no-faker fallback). Structured fakers (email, uuid, phone, address parts) emit their natural format unmodified — truncating them would break the format.
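
The first caveat can be made visible by counting fake rows whose combination of two columns never occurs in the source, such as a Free plan with a 499.00 spend. This is a sketch under the same simple-CSV assumption (no quoted commas); novel_pairs is a hypothetical helper and the demo files are made up.

```shell
# Hypothetical helper: count rows in FAKE whose (COL_A, COL_B) pair
# does not appear anywhere in SOURCE.
# usage: novel_pairs SOURCE FAKE COL_A COL_B
novel_pairs() {
  awk -F',' -v a="$3" -v b="$4" '
    FNR == 1 {                          # header row of either file
      ia = 0; ib = 0
      for (i = 1; i <= NF; i++) {
        if ($i == a) ia = i
        if ($i == b) ib = i
      }
      next
    }
    NR == FNR { seen[$ia "|" $ib] = 1; next }   # first file: record pairs
    !(($ia "|" $ib) in seen) { n++ }            # second file: count unseen pairs
    END { print n + 0 }
  ' "$1" "$2"
}

# Demo: two of the four fake rows pair a plan with a spend the source never had.
tmp=$(mktemp -d)
printf 'plan,monthly_spend_usd\nPro,49.00\nFree,0.00\nEnterprise,499.00\n' > "$tmp/src.csv"
printf 'plan,monthly_spend_usd\nPro,49.00\nFree,499.00\nFree,0.00\nEnterprise,0.00\n' > "$tmp/fake.csv"
novel=$(novel_pairs "$tmp/src.csv" "$tmp/fake.csv" plan monthly_spend_usd)
echo "novel pairs: $novel"
rm -rf "$tmp"
```

A non-zero count on real output is expected given independent column generation; it only matters when downstream consumers rely on the relationship between the two columns.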
