feat(synthesize): generate statistically-faithful synthetic CSVs by jqnatividad · Pull Request #3854 · dathere/qsv

jqnatividad · 2026-05-15T03:09:04Z

What

Adds qsv synthesize <input.csv> — a new command that generates a synthetic CSV
that is statistically faithful to a source CSV. It runs stats + frequency +
count on the source and emits N rows that reproduce the source's per-column
attributes, optionally using a Data Dictionary's semantic Content Types to pick
realistic fake-rs fakers.

Use cases: test fixtures, demos, sharing dataset shape without real/PII data.

How it works

Per-column generator model, picked by construction-time precedence:

* FrequencyWeighted: when frequency fully enumerates the column (no
  "Other" catch-all bucket), sample the real value set with real frequency
  weights. Reproduces cardinality, weights, and repetition structure exactly.
  Wins even over a faker mapping (a state column with 50 enumerated values
  emits the real 50).
* Faker: when the dictionary content_type maps to a fake-rs faker.
  Bounded-cardinality columns pre-generate a fixed pool of distinct fake
  values once and sample from it (the consistency mechanism: the same logical
  value maps to the same fake value).
* NumericQuantile / DateQuantile: quartile-bucketed sampling
  reproduces the shape of the distribution, not just [min, max].
* Boolean / LoremFallback / Empty: safety nets.
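
The precedence above can be sketched as a small selection function. This is a minimal illustration, not the code in generator.rs: `ColumnProfile`, `StatsType`, `GeneratorKind`, and `pick_generator` are hypothetical names, and the ordering among the three safety nets is an assumption.

```rust
// Hypothetical sketch of construction-time precedence; not qsv's actual code.

/// Column type as inferred by `stats` (simplified, assumed set of variants).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum StatsType { Integer, Float, Date, DateTime, Boolean, String, Null }

/// Assumed summary of what is known about one source column.
#[derive(Debug)]
struct ColumnProfile {
    /// `frequency` listed every distinct value (no "Other" bucket).
    fully_enumerated: bool,
    /// The dictionary content_type maps to a faker.
    has_faker_mapping: bool,
    stats_type: StatsType,
}

#[derive(Debug, PartialEq)]
enum GeneratorKind {
    FrequencyWeighted,
    Faker,
    NumericQuantile,
    DateQuantile,
    Boolean,
    LoremFallback,
    Empty,
}

fn pick_generator(p: &ColumnProfile) -> GeneratorKind {
    // 1. A fully enumerated column wins, even over a faker mapping.
    if p.fully_enumerated {
        return GeneratorKind::FrequencyWeighted;
    }
    // 2. Otherwise use a faker when the dictionary provides one.
    if p.has_faker_mapping {
        return GeneratorKind::Faker;
    }
    // 3. Quantile generators for numbers and dates, then the safety nets
    //    (the order of the safety nets is an assumption of this sketch).
    match p.stats_type {
        StatsType::Integer | StatsType::Float => GeneratorKind::NumericQuantile,
        StatsType::Date | StatsType::DateTime => GeneratorKind::DateQuantile,
        StatsType::Boolean => GeneratorKind::Boolean,
        StatsType::Null => GeneratorKind::Empty,
        StatsType::String => GeneratorKind::LoremFallback,
    }
}

fn main() {
    // A "state" column: enumerated by frequency and also mapped to a faker.
    let state = ColumnProfile {
        fully_enumerated: true,
        has_faker_mapping: true,
        stats_type: StatsType::String,
    };
    println!("{:?}", pick_generator(&state));
}
```

Deciding once per column at construction time keeps the per-row loop free of branching on column metadata.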

Null ratios reproduced per column. --seed makes output fully reproducible
(single master StdRng threads through both selection logic and every faker
call). Cross-column correlation is out of scope for v1 — columns generated
independently.
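
A rough sketch of how a seeded generator can reproduce a null ratio and frequency weights deterministically. To stay dependency-free it uses a small SplitMix64 PRNG as a stand-in for the seeded `StdRng` the PR describes; `FrequencyWeighted`, `synthesize_column`, and the empty-string representation of null are assumptions made for illustration.

```rust
// Stand-in PRNG (SplitMix64) so the sketch needs no external crates.
// The PR itself threads a seeded `StdRng` through every generator.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1), built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Hypothetical frequency-weighted column: real values with their source counts.
struct FrequencyWeighted {
    values: Vec<(String, u64)>,
    total: u64,
    null_ratio: f64,
}

impl FrequencyWeighted {
    fn new(values: Vec<(String, u64)>, null_ratio: f64) -> Self {
        let total: u64 = values.iter().map(|(_, count)| *count).sum();
        assert!(total > 0, "need at least one non-zero count");
        FrequencyWeighted { values, total, null_ratio }
    }

    fn next(&self, rng: &mut SplitMix64) -> String {
        // Emit a null (empty field here) with probability `null_ratio`.
        if rng.next_f64() < self.null_ratio {
            return String::new();
        }
        // Walk the cumulative counts; modulo bias is negligible for a sketch.
        let mut pick = rng.next_u64() % self.total;
        for (value, count) in &self.values {
            if pick < *count {
                return value.clone();
            }
            pick -= *count;
        }
        unreachable!("pick is always below the sum of counts")
    }
}

/// One master RNG per run: the same seed yields the same column.
fn synthesize_column(gen: &FrequencyWeighted, seed: u64, rows: usize) -> Vec<String> {
    let mut rng = SplitMix64(seed);
    (0..rows).map(|_| gen.next(&mut rng)).collect()
}

fn main() {
    let gen = FrequencyWeighted::new(
        vec![("CA".to_string(), 3), ("NY".to_string(), 1)],
        0.25,
    );
    let first = synthesize_column(&gen, 42, 8);
    let second = synthesize_column(&gen, 42, 8);
    println!("{:?}", first);
    println!("identical with same seed: {}", first == second);
}
```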

CLI surface

qsv synthesize [options] <input>

synthesize options:
    --dictionary <file>    Data Dictionary JSON from
                           `describegpt --dictionary --infer-content-type --format JSON`.
    --infer-content-type   Build the dictionary on the fly via describegpt
                           (needs QSV_LLM_APIKEY).
    -n, --rows <n>         Number of synthetic rows.       [default: 100]
    --seed <n>             RNG seed for reproducible output.
    --locale <loc>         fake-rs locale.                 [default: EN]
    --freq-limit <n>       Frequency pool depth.           [default: 100]
    --stats-options <arg>  Extra options for the internal `stats` run.
    -j, --jobs <arg>       Jobs for `stats` / `frequency`.

Common options:
    -h, --help, -o, --output, -d, --delimiter

Examples

# Pure statistical synthesis — no dictionary needed
qsv synthesize data.csv -n 1000 --seed 42 > synthetic.csv

# Layer in semantic fakers from a pre-made Data Dictionary
qsv describegpt data.csv --dictionary --infer-content-type --format JSON -o dict.json
qsv synthesize data.csv --dictionary dict.json -n 1000 > synthetic.csv

# Let synthesize build the dictionary itself (needs an LLM API key)
qsv synthesize data.csv --infer-content-type -n 1000 > synthetic.csv

Implementation notes

* New module src/cmd/synthesize/{mod,dictionary,faker_map,generator}.rs.
  Reuses util::run_qsv_cmd to shell out to stats / frequency / count
  (the same primitive describegpt::perform_analysis uses); no refactor of
  describegpt itself was needed.
* Bumped a handful of pub(super) items in src/cmd/describegpt/dictionary.rs
  to pub(crate) (StatsRecord, FrequencyRecord, parse_stats_csv,
  parse_frequency_csv, CONTENT_TYPE_VOCAB) plus the dictionary submodule
  itself, so synthesize can consume them. No behavior changes in describegpt.
* fake = "5" (features: derive, chrono, uuid, random_color) added as
  an optional dep. New synthesize feature gates it and is wired into
  distrib_features + qsvmcp (not lite, not datapusher_plus). Verified
  via cargo tree -i rand: fake v5.1.0 uses rand 0.10 (same major as qsv),
  so its fake_with_rng accepts qsv's seeded StdRng directly; no
  cross-version RNG bridging needed.
* Worked around a determinism bug in fake-rs v5.1.0's
  Dummy<IP<L>> for String (the impl ignores the passed RNG); ip_address
  uses IPv4 directly instead (same a.b.c.d format, deterministic).
* Made the help-markdown and MCP-skills generators support module-dir
  commands (src/cmd/<name>/mod.rs), not just flat src/cmd/<name>.rs.
* Regen also picked up genuine doc drift in qsv-stats.json,
  qsv-frequency.json, and docs/help/frequency.md (Apache DataSketches
  big-endian notes), which is included.

Out of scope / future work

* Cross-column correlation (independent generation in v1).
* Multi-locale support (each fake-rs locale is a distinct type, so
  multi-locale dispatch needs macro expansion). --locale is accepted but only
  EN is supported; non-EN errors with a clear message.
* --preserve-values mode that streams the source CSV and keys a real
  HashMap<original, fake> per column (the literal "same original → same
  fake" behavior, requires source re-read and ties output row count to source
  row count). The ColumnGenerator::Faker variant is already shaped to slot
  this in without restructuring.
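
For illustration only, the proposed --preserve-values behavior could look roughly like the sketch below. Nothing here exists in the PR: `PreserveValues`, `pseudonymize`, and the counter-based fake generator are hypothetical stand-ins for a per-column `HashMap<original, fake>` fed by a real faker.

```rust
use std::collections::HashMap;

/// Hypothetical per-column mapper: remembers the fake chosen for each original.
struct PreserveValues<F: FnMut() -> String> {
    mapping: HashMap<String, String>,
    make_fake: F,
}

impl<F: FnMut() -> String> PreserveValues<F> {
    fn new(make_fake: F) -> Self {
        PreserveValues { mapping: HashMap::new(), make_fake }
    }

    /// Returns the fake for `original`, generating one only on first sight.
    fn map(&mut self, original: &str) -> String {
        // Split the borrow so the closure can use `make_fake` while the map
        // is mutably borrowed by `entry`.
        let PreserveValues { mapping, make_fake } = self;
        mapping
            .entry(original.to_owned())
            .or_insert_with(|| make_fake())
            .clone()
    }
}

/// Streams a source column once; output length equals source length.
fn pseudonymize(source: &[&str]) -> Vec<String> {
    // Stand-in for a real faker: a counter makes the output predictable.
    let mut n = 0u32;
    let mut column = PreserveValues::new(move || {
        n += 1;
        format!("person-{}", n)
    });
    source.iter().map(|v| column.map(v)).collect()
}

fn main() {
    let fakes = pseudonymize(&["alice", "bob", "alice"]);
    println!("{:?}", fakes); // ["person-1", "person-2", "person-1"]
}
```

The sketch makes the two costs named above visible: the source values must be read row by row, and the output has exactly as many rows as the source.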

Tests

* 12 unit tests in src/cmd/synthesize/ cover the faker map (every vocab
  token has a mapping or is in {category, unknown}; deterministic with
  seed), the generator model (null-ratio reproduction, frequency-weighted
  ratios, quantile-in-range, faker-pool cardinality, date-in-range, seed
  reproducibility), and dictionary loading (JSON null fields normalize to
  unknown).
* 9 integration tests in tests/test_synthesize.rs cover the end-to-end
  command: no-dictionary basic shape, seed reproducibility, --rows,
  dictionary-driven faker for high-cardinality column, null-ratio over 5000
  rows, --output, and the three error paths (bad locale, zero rows,
  missing input).
* cargo clippy -F all_features clean for synthesize.

Files

New:

src/cmd/synthesize/{mod,dictionary,faker_map,generator}.rs
tests/test_synthesize.rs
docs/help/synthesize.md, .claude/skills/qsv/qsv-synthesize.json (generated)

Modified:

Cargo.toml, Cargo.lock — fake dep + synthesize feature
src/cmd/describegpt/dictionary.rs, src/cmd/describegpt.rs — visibility bumps only
src/cmd/mod.rs, src/main.rs, tests/tests.rs — gated registration
src/help_markdown_gen.rs, src/mcp_skills_gen.rs — module-dir support
README.md, docs/help/TableOfContents.md — command row + TOC
docs/help/frequency.md, .claude/skills/qsv/qsv-{stats,frequency}.json — drift catches from regen

🤖 Generated with Claude Code

…faithful synthetic CSVs

`qsv synthesize <input.csv>` runs `stats` + `frequency` + `count` on the source and emits N rows of synthetic data that reproduce the source's per-column attributes:

* Categorical / low-cardinality columns are reproduced by frequency-weighted sampling of the *real* value set, with cardinality, weights and repetition structure preserved exactly.
* Numeric and date/datetime columns use quartile-bucketed generation so the *shape* of the distribution is preserved, not just `[min, max]`.
* Null ratios are reproduced per column.
* `--seed` makes output fully reproducible (single master `StdRng` threads through both selection logic and every faker call).
* `--dictionary <file>` layers in semantic Content Types from `describegpt --dictionary --infer-content-type`; each token maps to a `fake-rs` faker (40-token vocabulary covers names, emails, addresses, UUIDs, etc.). Bounded-cardinality faker columns sample from a fixed pre-generated pool of distinct fake values (the consistency mechanism, so a given logical value maps consistently).
* `--infer-content-type` runs `describegpt` internally to build the dictionary on the fly (needs `QSV_LLM_APIKEY`).

Cross-column correlation is explicitly out of scope for v1; columns are generated independently.

Implementation notes:

* New module `src/cmd/synthesize/{mod,dictionary,faker_map,generator}.rs`. Reuses `util::run_qsv_cmd` to shell out to `stats` / `frequency` / `count` (the same primitive `describegpt::perform_analysis` uses); no refactor of describegpt itself was needed.
* Bumped a handful of `pub(super)` items in `src/cmd/describegpt/dictionary.rs` to `pub(crate)` (`StatsRecord`, `FrequencyRecord`, `parse_stats_csv`, `parse_frequency_csv`, `CONTENT_TYPE_VOCAB`) and made the `dictionary` submodule `pub(crate)` so `synthesize` can consume them.
* `fake = "5"` (features: `derive`, `chrono`, `uuid`, `random_color`) added as an optional dep. The new `synthesize` feature gates the dep and is wired into `distrib_features` and `qsvmcp` (not `lite`, not `datapusher_plus`). `fake` v5.1.0 uses `rand 0.10` (same as qsv), so its `fake_with_rng` accepts qsv's seeded `StdRng` directly; no RNG version-bridging needed.
* Worked around a determinism bug in fake-rs's `Dummy<IP<L>> for String` (ignores the passed RNG); `ip_address` uses `IPv4` directly instead.
* Made the help-markdown and MCP-skills generators support module-dir commands (`src/cmd/<name>/mod.rs`), not just flat `src/cmd/<name>.rs`.
* Regen also picked up genuine doc drift in `qsv-stats.json`, `qsv-frequency.json`, and `docs/help/frequency.md` (Apache DataSketches big-endian notes), which is included.

v1 is English-only; `--locale` is reserved for future multi-locale support (each fake-rs locale is a distinct type, so multi-locale dispatch needs macro expansion).

Tests: 12 unit tests + 9 integration tests, all passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
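
As a rough illustration of quartile-bucketed generation: pick one of the four buckets between min, q1, q2, q3, and max with equal probability, then sample uniformly inside it. This is a sketch under assumptions, not the code in generator.rs; `NumericQuantile` here is a simplified hypothetical type, and SplitMix64 stands in for the seeded `StdRng`.

```rust
// Stand-in PRNG (SplitMix64); the PR uses a seeded `StdRng` instead.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Hypothetical numeric generator driven by the five-number summary.
struct NumericQuantile {
    /// Bucket edges: min, q1, q2 (median), q3, max.
    edges: [f64; 5],
}

impl NumericQuantile {
    /// Rejects non-finite or out-of-order edges up front.
    fn new(min: f64, q1: f64, q2: f64, q3: f64, max: f64) -> Option<Self> {
        let edges = [min, q1, q2, q3, max];
        let finite = edges.iter().all(|v| v.is_finite());
        let ordered = edges.windows(2).all(|w| w[0] <= w[1]);
        if finite && ordered {
            Some(NumericQuantile { edges })
        } else {
            None
        }
    }

    fn next(&self, rng: &mut SplitMix64) -> f64 {
        // Each quartile bucket holds about 25% of the source rows,
        // so choose a bucket uniformly, then a point inside it.
        let bucket = (rng.next_u64() % 4) as usize;
        let (lo, hi) = (self.edges[bucket], self.edges[bucket + 1]);
        // Clamp guards against floating-point rounding past the upper edge.
        (lo + rng.next_f64() * (hi - lo)).min(hi)
    }
}

fn main() {
    // A right-skewed column: half the rows sit at or below 2.0.
    let gen = NumericQuantile::new(0.0, 1.0, 2.0, 10.0, 1000.0)
        .expect("edges are finite and ordered");
    let mut rng = SplitMix64(42);
    let sample: Vec<f64> = (0..5).map(|_| gen.next(&mut rng)).collect();
    println!("{:?}", sample);
}
```

Sampling uniformly over `[min, max]` alone would put almost every value of this skewed column above 10.0; bucketing keeps roughly half of the output at or below the median.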

codacy-production · 2026-05-15T03:10:29Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

🟢 Metrics 160 complexity


Copilot

Pull request overview

Adds a new qsv synthesize command (feature-gated) to generate statistically-faithful synthetic CSVs, optionally layering semantic fakers from a describegpt data dictionary. Also updates the help/MCP-skill generators to support module-directory commands (src/cmd/<name>/mod.rs) and regenerates related docs/skill JSON.

Changes:

Introduce src/cmd/synthesize/ with dictionary loading/inference, faker mapping, and per-column generator model.
Wire the new command into CLI dispatch, docs/TOC/README, MCP skills generation, and integration test registration.
Add fake as an optional dependency behind a new synthesize feature; include it in distrib_features/qsvmcp.

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 3 comments.

File	Description
tests/tests.rs	Registers synthesize integration tests under `synthesize + feature_capable`.
tests/test_synthesize.rs	Adds end-to-end integration coverage for synthesize CLI behaviors and error paths.
src/mcp_skills_gen.rs	Adds synthesize to MCP categories/command list and supports module-dir command sources.
src/main.rs	Adds synthesize to help text and command dispatch enum (feature-gated).
src/help_markdown_gen.rs	Supports module-dir command sources for help markdown generation.
src/cmd/synthesize/mod.rs	Implements CLI, runs internal `stats`/`frequency`/`count`, builds generators, and emits CSV.
src/cmd/synthesize/generator.rs	Implements per-column generator strategies (frequency-weighted, faker, quantile, etc.).
src/cmd/synthesize/faker_map.rs	Maps `content_type` tokens to `fake-rs` generators (EN-only v1).
src/cmd/synthesize/dictionary.rs	Loads or infers per-field `content_type` mappings from dictionary JSON or describegpt.
src/cmd/mod.rs	Registers the new synthesize module (feature-gated).
src/cmd/describegpt/dictionary.rs	Exposes dictionary/stat/frequency parsing pieces for reuse by synthesize.
src/cmd/describegpt.rs	Makes describegpt `dictionary` module visible to crate for reuse.
README.md	Adds synthesize to the command list.
docs/help/TableOfContents.md	Adds synthesize entry to help TOC.
docs/help/synthesize.md	Adds generated synthesize help page.
docs/help/frequency.md	Doc regen drift update (DataSketches big-endian note).
Cargo.toml	Adds optional `fake` dep and `synthesize` feature; wires into feature groups.
Cargo.lock	Locks new dependency graph for `fake` and transitive deps.
.claude/skills/qsv/qsv-synthesize.json	Generated MCP skill definition for synthesize.
.claude/skills/qsv/qsv-stats.json	Regenerated skills JSON reflecting stats doc drift.
.claude/skills/qsv/qsv-frequency.json	Regenerated skills JSON reflecting frequency doc drift.

15 DevSkim code-scanning alerts on the synthesize PR, all false positives:

* DS148264 ("weak/non-cryptographic RNG"): 12 hits on `StdRng::seed_from_u64` in *test code*. The whole point of these tests is determinism with a fixed seed; synthesize generates *fake data*, not security tokens. Added `// DevSkim: ignore DS148264` inline (matches the existing pattern in `select.rs:142`, `sample.rs:406`, `sort.rs:202-214`, `pragmastat.rs:668`).
* DS126858 ("weak/broken hash algorithm"): 3 hits on the variable name `cmd2` in `tests/test_synthesize.rs`. The substring `md2` matched DevSkim's legacy-hash regex (md2/md4/md5/sha0/sha1). Renamed `cmd1`/`cmd2` → `first_cmd`/`second_cmd`, which is clearer anyway, with no behavior change.

Tests: 12 unit + 9 integration, all still passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…mpling

Three valid issues caught by Copilot's PR review:

* `--jobs` was passed to `qsv count`, which doesn't accept it. Split the analysis args into `delim_only_args` (for `count`, which accepts `-d` but not `--jobs`) and `analysis_common_args` (for `stats` / `frequency`, which accept both). `index` continues to get no extra args.
* Best-effort indexing was effectively dead code: `Config::index_files()` returns `Ok(None)` (not `Err`) when no index exists, so `.is_err()` was false for the common unindexed case and the `qsv index` subprocess never fired. Switched to `!matches!(..., Ok(Some(_)))` so we now actually create the index when one is missing. Confirmed end-to-end: the "Indexed input" status line now appears for unindexed inputs.
* `Date` columns severely under-sampled the max date: `stats` min/max/q* values for `Date` are at midnight, so sampling uniformly over epoch seconds put only a single tick of the max-day in the last bucket (the rest of that day fell outside any bucket). Now `parse_epoch` returns *whole days since the UNIX epoch* for `Date` columns (and still seconds for `DateTime`); the `DateQuantile::next` arm samples a whole day uniformly within the bucket and multiplies back to seconds for formatting. `DateTime` sampling is unchanged.

Tests: 12 unit + 9 integration, all passing. Verified `qsv synthesize ... --jobs 4` works end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
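
The indexing bug described above is a general `Result<Option<T>, E>` pitfall, sketched here in isolation. `IndexHandle` and this `index_files` are hypothetical stand-ins that only mimic the described return shape; they are not qsv's `Config::index_files()`.

```rust
/// Hypothetical stand-in for an opened index.
#[derive(Debug)]
struct IndexHandle;

/// Mimics the described contract: `Ok(None)` means "no index exists",
/// while `Err` is reserved for real failures such as I/O errors.
fn index_files(has_index: bool) -> Result<Option<IndexHandle>, String> {
    if has_index {
        Ok(Some(IndexHandle))
    } else {
        Ok(None)
    }
}

/// Buggy check: a missing index is `Ok(None)`, so this is false
/// and the indexing step never runs.
fn needs_index_buggy(result: &Result<Option<IndexHandle>, String>) -> bool {
    result.is_err()
}

/// Fixed check: anything other than "an index was found" triggers indexing.
fn needs_index_fixed(result: &Result<Option<IndexHandle>, String>) -> bool {
    !matches!(result, Ok(Some(_)))
}

fn main() {
    let unindexed = index_files(false);
    println!("buggy: {}", needs_index_buggy(&unindexed)); // false
    println!("fixed: {}", needs_index_fixed(&unindexed)); // true
}
```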

…tions

5 findings on the initial commit (4d706e1): 1 HIGH already addressed in 899de97 (index_files() Ok(None) check), 4 still open:

* MEDIUM (generator.rs `parse_f64`): `parse_f64("NaN")` returned `Some(f64::NAN)`, which slipped past both the `hi < lo` short-circuit and the quartile sanity check, and would then panic in `rng.random_range`. Added `.filter(|v| v.is_finite())` so non-finite endpoints fail at parse time. Same path now rejects `±Infinity`.
* LOW (USAGE / docs): "logical value maps consistently" was unqualified, but for cardinality > CARDINALITY_POOL_CAP (100k) we generate a fresh fake per row. Spelled this out in the USAGE.
* LOW (mod.rs USAGE Examples block): the second example had two `$ qsv` lines under one comment, so the MCP-skill generator gave both the same description. Split into two comments ("First, generate the Data Dictionary with describegpt" / "Then layer in semantic fakers from the dictionary") and regenerated qsv-synthesize.json + synthesize.md.
* LOW (qsv-synthesize.json missing `--jobs`): not addressed by design. `--jobs` is intentionally skipped from MCP skills as "infrastructure setting" by mcp_skills_gen.rs:367; verified the same flag is omitted from qsv-stats.json (29 options, no jobs), qsv-frequency.json (35 options, no jobs). This is a project-wide convention, not a synthesize-specific bug.

Tests: 12 unit + 9 integration, all passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
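
A minimal sketch of the `parse_f64` fix, showing why a NaN endpoint gets past an ordering check: every comparison involving NaN is false. The function body follows the `.filter(|v| v.is_finite())` change described above; `range_looks_valid` is a hypothetical helper added only to demonstrate the failure mode.

```rust
/// Parses a stats value, rejecting NaN and infinities at parse time.
fn parse_f64(s: &str) -> Option<f64> {
    s.trim().parse::<f64>().ok().filter(|v| v.is_finite())
}

/// Hypothetical ordering check of the kind a NaN slips past:
/// `hi < lo` is false whenever either side is NaN.
fn range_looks_valid(lo: f64, hi: f64) -> bool {
    !(hi < lo)
}

fn main() {
    // Rust's float parser accepts "NaN", so the unfiltered parse succeeds.
    let raw: f64 = "NaN".parse().expect("NaN is a valid f64 literal");
    println!("unfiltered NaN passes range check: {}", range_looks_valid(raw, 10.0));

    // With the filter, the bad endpoint never reaches the sampler.
    println!("filtered: {:?}", parse_f64("NaN")); // None
    println!("filtered: {:?}", parse_f64("1.5")); // Some(1.5)
}
```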

Declare a mutable String outside the loop and reuse it to avoid repeated allocations, adding #[allow(unused_assignments)] to suppress warnings. Replace the match on faker_map::content_type_to_value(...) with a direct call using the ? operator to return None early on failure, reducing nesting and simplifying control flow in build_faker_pool.

The prior commit hoisted `let mut value = String::new()` outside the loop under the banner of "reuse String buffer", but `content_type_to_value` returns a freshly-allocated `String` each call (the underlying fake-rs fakers always materialize an owned `String`). Reassigning the hoisted variable just drops the previous allocation and moves in the new one, so no allocations are amortized. The `#[allow(unused_assignments)]` and the extra block were also pure noise.

Dropped the hoist, the attribute, and the block. Behavior unchanged:

    let value = faker_map::content_type_to_value(content_type, rng)?;
    if seen.insert(value.clone()) {
        pool.push(value);
    }

Actual buffer reuse would require changing `content_type_to_value` to write into a `&mut String`, but that wouldn't help either: fake-rs's `fake_with_rng::<String, _>` still allocates internally on every call.

Tests: 12 unit + 9 integration, all passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
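
For context, the surrounding pool-building loop might look roughly like the sketch below. Only the two-statement core comes from the commit message; the signature, the `max_attempts` cap, and the counter-based `content_type_to_value` stand-in (in place of the RNG-driven faker_map function) are assumptions.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for `faker_map::content_type_to_value`: returns a
/// fresh owned String per call, or None for an unmapped content type.
/// It cycles through only five distinct values to force duplicates.
fn content_type_to_value(content_type: &str, counter: &mut u32) -> Option<String> {
    if content_type != "email" {
        return None;
    }
    *counter += 1;
    Some(format!("user{}@example.com", *counter % 5))
}

/// Collects up to `target` distinct fake values, giving up after
/// `max_attempts` draws (the cap is an assumption of this sketch).
fn build_faker_pool(
    content_type: &str,
    target: usize,
    max_attempts: usize,
    counter: &mut u32,
) -> Option<Vec<String>> {
    let mut seen = HashSet::new();
    let mut pool = Vec::with_capacity(target);
    let mut attempts = 0;
    while pool.len() < target && attempts < max_attempts {
        attempts += 1;
        // `?` returns None early when the content type has no faker.
        let value = content_type_to_value(content_type, counter)?;
        // `insert` is true only for values not seen before.
        if seen.insert(value.clone()) {
            pool.push(value);
        }
    }
    Some(pool)
}

fn main() {
    let mut counter = 0;
    let pool = build_faker_pool("email", 3, 50, &mut counter);
    println!("{:?}", pool);
}
```

The `clone` is what the set and the pool each need to own a copy; declaring `value` inside the loop costs nothing extra because each call already returns a new allocation.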

github-advanced-security AI found potential problems May 15, 2026

jqnatividad requested a review from Copilot May 15, 2026 03:18

Copilot started reviewing on behalf of jqnatividad May 15, 2026 03:18

Copilot AI reviewed May 15, 2026

Comment thread src/cmd/synthesize/mod.rs

Comment thread src/cmd/synthesize/mod.rs Outdated

Comment thread src/cmd/synthesize/generator.rs

jqnatividad and others added 5 commits May 15, 2026 06:40

jqnatividad merged commit 3df17e5 into master May 15, 2026
26 of 27 checks passed

jqnatividad deleted the synthesize-command branch May 15, 2026 11:47

jqnatividad mentioned this pull request May 15, 2026

synthesize command: schema-informed synthetic data generator #235

Closed

jqnatividad merged 6 commits into master from synthesize-command

3 participants
