feat(synthesize): generate statistically-faithful synthetic CSVs #3854

Merged
jqnatividad merged 6 commits into master from synthesize-command
May 15, 2026

@jqnatividad
Collaborator

What

Adds qsv synthesize <input.csv> — a new command that generates a synthetic CSV
that is statistically faithful to a source CSV. It runs stats + frequency +
count on the source and emits N rows that reproduce the source's per-column
attributes, optionally using a Data Dictionary's semantic Content Types to pick
realistic fake-rs fakers.

Use cases: test fixtures, demos, sharing dataset shape without real/PII data.

How it works

Per-column generator model, picked by construction-time precedence:

  1. FrequencyWeighted — when frequency fully enumerates the column (no
    "Other" catch-all bucket), sample the real value set with real frequency
    weights. Reproduces cardinality, weights, and repetition structure exactly.
    Wins even over a faker mapping (a state column with 50 enumerated values
    emits the real 50).
  2. Faker — when the dictionary content_type maps to a fake-rs faker.
    Bounded-cardinality columns pre-generate a fixed pool of distinct fake
    values once and sample from it (the consistency mechanism — same logical
    value maps to the same fake value).
  3. NumericQuantile / DateQuantile — quartile-bucketed sampling
    reproduces the shape of the distribution, not just [min, max].
  4. Boolean / LoremFallback / Empty safety nets.
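
The precedence above can be sketched as a single selection function. This is a minimal illustration, not the code in generator.rs: `ColumnProfile`, its fields, the data-type labels, and the relative order of the three safety nets are all assumptions made for the example.

```rust
/// Hypothetical per-column summary; the field names are illustrative only.
struct ColumnProfile {
    /// True when `frequency` listed every distinct value (no "Other" bucket).
    fully_enumerated: bool,
    /// True when the dictionary content_type maps to a faker.
    has_faker_mapping: bool,
    /// Inferred data type label (assumed spelling).
    data_type: &'static str,
    /// True when the source column holds no values at all.
    all_empty: bool,
}

#[derive(Debug, PartialEq)]
enum GeneratorKind {
    FrequencyWeighted,
    Faker,
    NumericQuantile,
    DateQuantile,
    Boolean,
    LoremFallback,
    Empty,
}

/// Walks the precedence list top to bottom and returns the first match.
fn pick_generator(col: &ColumnProfile) -> GeneratorKind {
    // 1. A fully enumerated column wins, even over a faker mapping.
    if col.fully_enumerated {
        return GeneratorKind::FrequencyWeighted;
    }
    // 2. Otherwise a dictionary-driven faker, if one is mapped.
    if col.has_faker_mapping {
        return GeneratorKind::Faker;
    }
    // 3 and 4. Quantile generators by type, then the safety nets.
    match col.data_type {
        "Integer" | "Float" => GeneratorKind::NumericQuantile,
        "Date" | "DateTime" => GeneratorKind::DateQuantile,
        "Boolean" => GeneratorKind::Boolean,
        _ if col.all_empty => GeneratorKind::Empty,
        _ => GeneratorKind::LoremFallback,
    }
}
```

With this shape, a `state` column that has both a faker mapping and a fully enumerated frequency table resolves to `FrequencyWeighted`, matching item 1.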

Null ratios reproduced per column. --seed makes output fully reproducible
(single master StdRng threads through both selection logic and every faker
call). Cross-column correlation is out of scope for v1 — columns generated
independently.
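
As a rough, dependency-free sketch of how a null ratio, frequency weights, and a single seed can combine: the `SplitMix64` generator below is only a stand-in for the seeded `StdRng`, and `sample_cell` / `synthesize_column` are hypothetical names, not functions from this PR.

```rust
/// Tiny deterministic PRNG used here only as a stand-in for a seeded StdRng.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1) built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Draws one cell: empty with probability `null_ratio`, otherwise a value
/// picked in proportion to its observed count.
fn sample_cell<'a>(
    rng: &mut SplitMix64,
    null_ratio: f64,
    weighted: &[(&'a str, u64)],
) -> &'a str {
    if rng.next_f64() < null_ratio {
        return "";
    }
    let total: u64 = weighted.iter().map(|(_, count)| count).sum();
    if total == 0 {
        return "";
    }
    // Modulo keeps the sketch short; it carries a negligible bias.
    let mut ticket = rng.next_u64() % total;
    for &(value, count) in weighted {
        if ticket < count {
            return value;
        }
        ticket -= count;
    }
    "" // not reached when total > 0
}

/// One RNG, created once from the seed, drives every decision for the column.
fn synthesize_column(
    seed: u64,
    rows: usize,
    null_ratio: f64,
    weighted: &[(&str, u64)],
) -> Vec<String> {
    let mut rng = SplitMix64(seed);
    (0..rows)
        .map(|_| sample_cell(&mut rng, null_ratio, weighted).to_string())
        .collect()
}
```

Because the null decision and the value pick both pull from the same generator, two runs with the same seed produce identical columns.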

CLI surface

qsv synthesize [options] <input>

synthesize options:
    --dictionary <file>    Data Dictionary JSON from
                           `describegpt --dictionary --infer-content-type --format JSON`.
    --infer-content-type   Build the dictionary on the fly via describegpt
                           (needs QSV_LLM_APIKEY).
    -n, --rows <n>         Number of synthetic rows.       [default: 100]
    --seed <n>             RNG seed for reproducible output.
    --locale <loc>         fake-rs locale.                 [default: EN]
    --freq-limit <n>       Frequency pool depth.           [default: 100]
    --stats-options <arg>  Extra options for the internal `stats` run.
    -j, --jobs <arg>       Jobs for `stats` / `frequency`.

Common options:
    -h, --help, -o, --output, -d, --delimiter

Examples

# Pure statistical synthesis — no dictionary needed
qsv synthesize data.csv -n 1000 --seed 42 > synthetic.csv

# Layer in semantic fakers from a pre-made Data Dictionary
qsv describegpt data.csv --dictionary --infer-content-type --format JSON -o dict.json
qsv synthesize data.csv --dictionary dict.json -n 1000 > synthetic.csv

# Let synthesize build the dictionary itself (needs an LLM API key)
qsv synthesize data.csv --infer-content-type -n 1000 > synthetic.csv

Implementation notes

  • New module src/cmd/synthesize/{mod,dictionary,faker_map,generator}.rs.
  • Reuses util::run_qsv_cmd to shell out to stats / frequency / count
    (the same primitive describegpt::perform_analysis uses) — no refactor of
    describegpt itself was needed.
  • Bumped a handful of pub(super) items in src/cmd/describegpt/dictionary.rs
    to pub(crate) (StatsRecord, FrequencyRecord, parse_stats_csv,
    parse_frequency_csv, CONTENT_TYPE_VOCAB) plus the dictionary submodule
    itself, so synthesize can consume them. No behavior changes in describegpt.
  • fake = "5" (features: derive, chrono, uuid, random_color) added as
    an optional dep. New synthesize feature gates it and is wired into
    distrib_features + qsvmcp (not lite, not datapusher_plus). Verified
    via cargo tree -i rand: fake v5.1.0 uses rand 0.10 (same major as qsv),
    so its fake_with_rng accepts qsv's seeded StdRng directly — no
    cross-version RNG bridging needed.
  • Worked around a determinism bug in fake-rs v5.1.0's
    Dummy<IP<L>> for String (the impl ignores the passed RNG); ip_address
    uses IPv4 directly instead — same a.b.c.d format, deterministic.
  • Made the help-markdown and MCP-skills generators support module-dir
    commands (src/cmd/<name>/mod.rs), not just flat src/cmd/<name>.rs.
  • Regen also picked up genuine doc drift in qsv-stats.json,
    qsv-frequency.json, and docs/help/frequency.md (Apache DataSketches
    big-endian notes) — included.
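
To illustrate what "deterministic" means for the `ip_address` workaround, here is a small sketch in which every octet comes from the caller's RNG, so the same seed always yields the same address. It does not use fake-rs at all; `SplitMix64` and `fake_ipv4` are hypothetical stand-ins for the seeded `StdRng` and the faker call.

```rust
use std::net::Ipv4Addr;

/// Tiny deterministic PRNG used here only as a stand-in for a seeded StdRng.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Builds an `a.b.c.d` string from 32 bits of the passed RNG.
/// Nothing here touches thread-local or OS randomness.
fn fake_ipv4(rng: &mut SplitMix64) -> String {
    let bits = (rng.next_u64() >> 32) as u32;
    Ipv4Addr::from(bits).to_string()
}
```

A faker that ignores the passed RNG would break exactly this property: two runs with the same seed could print different addresses.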

Out of scope / future work

  • Cross-column correlation (independent generation in v1).
  • Multi-locale support (each fake-rs locale is a distinct type, so multi-
    locale dispatch needs macro expansion). --locale is accepted but only
    EN is supported; non-EN errors with a clear message.
  • --preserve-values mode that streams the source CSV and keys a real
    HashMap<original, fake> per column (the literal "same original → same
    fake" behavior, requires source re-read and ties output row count to source
    row count). The ColumnGenerator::Faker variant is already shaped to slot
    this in without restructuring.
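
A possible shape for the `--preserve-values` idea, sketched under assumptions: `ValueMapper` is a hypothetical type, and the closure stands in for whatever faker the column would use. The first time an original value is seen it gets a fresh fake; every later occurrence reuses it.

```rust
use std::collections::HashMap;

/// Hypothetical per-column mapper from original values to fake values.
struct ValueMapper<F: FnMut() -> String> {
    seen: HashMap<String, String>,
    make_fake: F,
}

impl<F: FnMut() -> String> ValueMapper<F> {
    fn new(make_fake: F) -> Self {
        Self { seen: HashMap::new(), make_fake }
    }

    /// Returns the fake assigned to `original`, creating one on first sight.
    fn map(&mut self, original: &str) -> &str {
        if !self.seen.contains_key(original) {
            let fake = (self.make_fake)();
            self.seen.insert(original.to_string(), fake);
        }
        &self.seen[original]
    }
}
```

This sketch does not stop two different originals from receiving the same fake if the faker happens to repeat itself; a real implementation would need to decide whether that matters.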

Tests

  • 12 unit tests in src/cmd/synthesize/ cover the faker map (every vocab
    token has a mapping or is in {category, unknown}; deterministic with
    seed), the generator model (null-ratio reproduction, frequency-weighted
    ratios, quantile-in-range, faker-pool cardinality, date-in-range, seed
    reproducibility), and dictionary loading (JSON null fields normalize to
    unknown).
  • 9 integration tests in tests/test_synthesize.rs cover the end-to-end
    command: no-dictionary basic shape, seed reproducibility, --rows,
    dictionary-driven faker for high-cardinality column, null-ratio over 5000
    rows, --output, and the three error paths (bad locale, zero rows,
    missing input).
  • cargo clippy -F all_features clean for synthesize.

Files

New:

  • src/cmd/synthesize/{mod,dictionary,faker_map,generator}.rs
  • tests/test_synthesize.rs
  • docs/help/synthesize.md, .claude/skills/qsv/qsv-synthesize.json (generated)

Modified:

  • Cargo.toml, Cargo.lock: fake dep + synthesize feature
  • src/cmd/describegpt/dictionary.rs, src/cmd/describegpt.rs — visibility bumps only
  • src/cmd/mod.rs, src/main.rs, tests/tests.rs — gated registration
  • src/help_markdown_gen.rs, src/mcp_skills_gen.rs — module-dir support
  • README.md, docs/help/TableOfContents.md — command row + TOC
  • docs/help/frequency.md, .claude/skills/qsv/qsv-{stats,frequency}.json — drift catches from regen

🤖 Generated with Claude Code

…faithful synthetic CSVs

`qsv synthesize <input.csv>` runs `stats` + `frequency` + `count` on the source
and emits N rows of synthetic data that reproduce the source's per-column
attributes:

* Categorical / low-cardinality columns are reproduced by frequency-weighted
  sampling of the *real* value set — cardinality, weights and repetition
  structure preserved exactly.
* Numeric and date/datetime columns use quartile-bucketed generation so the
  *shape* of the distribution is preserved, not just `[min, max]`.
* Null ratios are reproduced per column.
* `--seed` makes output fully reproducible (single master `StdRng` threads
  through both selection logic and every faker call).
* `--dictionary <file>` layers in semantic Content Types from
  `describegpt --dictionary --infer-content-type` — each token maps to a
  `fake-rs` faker (40-token vocabulary covers names, emails, addresses,
  UUIDs, etc.). Bounded-cardinality faker columns sample from a fixed
  pre-generated pool of distinct fake values (the consistency mechanism, so a
  given logical value maps consistently).
* `--infer-content-type` runs `describegpt` internally to build the dictionary
  on the fly (needs `QSV_LLM_APIKEY`).

Cross-column correlation is explicitly out of scope for v1; columns are
generated independently.
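
A minimal sketch of quartile-bucketed generation for a numeric column, assuming the five-number summary is already available. `Quartiles`, `sample_quartile_bucket`, and the `SplitMix64` stand-in for the seeded `StdRng` are illustrative names, not the code in generator.rs.

```rust
/// Tiny deterministic PRNG used here only as a stand-in for a seeded StdRng.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1) built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Five-number summary of the source column.
struct Quartiles {
    min: f64,
    q1: f64,
    median: f64,
    q3: f64,
    max: f64,
}

/// Picks one of the four quartile buckets with equal probability (each
/// holds about a quarter of the source rows), then a uniform value inside it.
fn sample_quartile_bucket(rng: &mut SplitMix64, q: &Quartiles) -> f64 {
    let edges = [q.min, q.q1, q.median, q.q3, q.max];
    let bucket = (rng.next_u64() % 4) as usize;
    let (lo, hi) = (edges[bucket], edges[bucket + 1]);
    lo + rng.next_f64() * (hi - lo)
}
```

Compared with sampling uniformly over `[min, max]`, a skewed column (say min 0, median 20, max 1000) still gets about half of its synthetic values at or below 20.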

Implementation notes:

* New module `src/cmd/synthesize/{mod,dictionary,faker_map,generator}.rs`.
  Reuses `util::run_qsv_cmd` to shell out to `stats` / `frequency` / `count`
  (the same primitive `describegpt::perform_analysis` uses) — no refactor of
  describegpt itself was needed.
* Bumped a handful of `pub(super)` items in `src/cmd/describegpt/dictionary.rs`
  to `pub(crate)` (`StatsRecord`, `FrequencyRecord`, `parse_stats_csv`,
  `parse_frequency_csv`, `CONTENT_TYPE_VOCAB`) and made the `dictionary`
  submodule `pub(crate)` so `synthesize` can consume them.
* `fake = "5"` (features: `derive`, `chrono`, `uuid`, `random_color`) added
  as an optional dep. The new `synthesize` feature gates the dep and is wired
  into `distrib_features` and `qsvmcp` (not `lite`, not `datapusher_plus`).
  `fake` v5.1.0 uses `rand 0.10` (same as qsv), so its `fake_with_rng` accepts
  qsv's seeded `StdRng` directly — no RNG version-bridging needed.
* Worked around a determinism bug in fake-rs's `Dummy<IP<L>> for String`
  (ignores the passed RNG); `ip_address` uses `IPv4` directly instead.
* Made the help-markdown and MCP-skills generators support module-dir
  commands (`src/cmd/<name>/mod.rs`), not just flat `src/cmd/<name>.rs`.
* Regen also picked up genuine doc drift in `qsv-stats.json`,
  `qsv-frequency.json`, and `docs/help/frequency.md` (Apache DataSketches
  big-endian notes) — included.

v1 is English-only; `--locale` is reserved for future multi-locale support
(each fake-rs locale is a distinct type, so multi-locale dispatch needs
macro expansion).

Tests: 12 unit tests + 9 integration tests, all passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production

codacy-production Bot commented May 15, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

🟢 Metrics 160 complexity

Metric Results
Complexity 160

Contributor

Copilot AI left a comment

Pull request overview

Adds a new qsv synthesize command (feature-gated) to generate statistically-faithful synthetic CSVs, optionally layering semantic fakers from a describegpt data dictionary. Also updates the help/MCP-skill generators to support module-directory commands (src/cmd/<name>/mod.rs) and regenerates related docs/skill JSON.

Changes:

  • Introduce src/cmd/synthesize/ with dictionary loading/inference, faker mapping, and per-column generator model.
  • Wire the new command into CLI dispatch, docs/TOC/README, MCP skills generation, and integration test registration.
  • Add fake as an optional dependency behind a new synthesize feature; include it in distrib_features/qsvmcp.

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/tests.rs | Registers synthesize integration tests under synthesize + feature_capable. |
| tests/test_synthesize.rs | Adds end-to-end integration coverage for synthesize CLI behaviors and error paths. |
| src/mcp_skills_gen.rs | Adds synthesize to MCP categories/command list and supports module-dir command sources. |
| src/main.rs | Adds synthesize to help text and command dispatch enum (feature-gated). |
| src/help_markdown_gen.rs | Supports module-dir command sources for help markdown generation. |
| src/cmd/synthesize/mod.rs | Implements CLI, runs internal stats/frequency/count, builds generators, and emits CSV. |
| src/cmd/synthesize/generator.rs | Implements per-column generator strategies (frequency-weighted, faker, quantile, etc.). |
| src/cmd/synthesize/faker_map.rs | Maps content_type tokens to fake-rs generators (EN-only v1). |
| src/cmd/synthesize/dictionary.rs | Loads or infers field content_type mappings from dictionary JSON or describegpt. |
| src/cmd/mod.rs | Registers the new synthesize module (feature-gated). |
| src/cmd/describegpt/dictionary.rs | Exposes dictionary/stat/frequency parsing pieces for reuse by synthesize. |
| src/cmd/describegpt.rs | Makes describegpt dictionary module visible to crate for reuse. |
| README.md | Adds synthesize to the command list. |
| docs/help/TableOfContents.md | Adds synthesize entry to help TOC. |
| docs/help/synthesize.md | Adds generated synthesize help page. |
| docs/help/frequency.md | Doc regen drift update (DataSketches big-endian note). |
| Cargo.toml | Adds optional fake dep and synthesize feature; wires into feature groups. |
| Cargo.lock | Locks new dependency graph for fake and transitive deps. |
| .claude/skills/qsv/qsv-synthesize.json | Generated MCP skill definition for synthesize. |
| .claude/skills/qsv/qsv-stats.json | Regenerated skills JSON reflecting stats doc drift. |
| .claude/skills/qsv/qsv-frequency.json | Regenerated skills JSON reflecting frequency doc drift. |

jqnatividad and others added 5 commits May 15, 2026 06:40
15 DevSkim code-scanning alerts on the synthesize PR, all false positives:

* DS148264 ("weak/non-cryptographic RNG") — 12 hits on `StdRng::seed_from_u64`
  in *test code*. The whole point of these tests is determinism with a fixed
  seed; synthesize generates *fake data*, not security tokens. Added
  `// DevSkim: ignore DS148264` inline (matches the existing pattern in
  `select.rs:142`, `sample.rs:406`, `sort.rs:202-214`, `pragmastat.rs:668`).

* DS126858 ("weak/broken hash algorithm") — 3 hits on the variable name `cmd2`
  in `tests/test_synthesize.rs`. The substring `md2` matched DevSkim's
  legacy-hash regex (md2/md4/md5/sha0/sha1). Renamed `cmd1`/`cmd2` →
  `first_cmd`/`second_cmd` — clearer anyway, no behavior change.

Tests: 12 unit + 9 integration, all still passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mpling

Three valid issues caught by Copilot's PR review:

* `--jobs` was passed to `qsv count`, which doesn't accept it. Split the
  analysis args into `delim_only_args` (for `count`, which accepts `-d` but
  not `--jobs`) and `analysis_common_args` (for `stats` / `frequency`, which
  accept both). `index` continues to get no extra args.

* Best-effort indexing was effectively dead code: `Config::index_files()`
  returns `Ok(None)` (not `Err`) when no index exists, so `.is_err()` was
  false for the common unindexed case and the `qsv index` subprocess never
  fired. Switched to `!matches!(..., Ok(Some(_)))` so we now actually create
  the index when one is missing. Confirmed end-to-end — the "Indexed input"
  status line now appears for unindexed inputs.

* `Date` columns severely under-sampled the max date: `stats` min/max/q*
  values for `Date` are at midnight, so sampling uniformly over epoch seconds
  put only a single tick of the max-day in the last bucket (the rest of that
  day fell outside any bucket). Now `parse_epoch` returns *whole days since
  the UNIX epoch* for `Date` columns (and still seconds for `DateTime`); the
  `DateQuantile::next` arm samples a whole day uniformly within the bucket
  and multiplies back to seconds for formatting. `DateTime` sampling is
  unchanged.
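
The `Date` fix in the last bullet can be sketched as follows. This is an illustration under assumptions, not the code in generator.rs: the function names are hypothetical, `SplitMix64` stands in for the seeded `StdRng`, and treating the bucket's upper day as inclusive is an assumption of the sketch.

```rust
const SECS_PER_DAY: i64 = 86_400;

/// Tiny deterministic PRNG used here only as a stand-in for a seeded StdRng.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Converts a midnight timestamp (seconds since the UNIX epoch) to whole days.
fn day_from_secs(secs: i64) -> i64 {
    secs.div_euclid(SECS_PER_DAY)
}

/// Samples a whole day uniformly from `[lo_day, hi_day]` and converts it
/// back to seconds at midnight, ready for date formatting.
fn sample_date_secs(rng: &mut SplitMix64, lo_day: i64, hi_day: i64) -> i64 {
    if hi_day < lo_day {
        return lo_day * SECS_PER_DAY; // degenerate bucket
    }
    let span = (hi_day - lo_day + 1) as u64; // inclusive day count
    let day = lo_day + (rng.next_u64() % span) as i64;
    day * SECS_PER_DAY
}
```

Working in days gives the max date the same weight as any other day in its bucket, whereas sampling over seconds leaves it a single tick out of 86,400 per preceding day.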

Tests: 12 unit + 9 integration, all passing. Verified `qsv synthesize ...
--jobs 4` works end-to-end.
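
The indexing fix in the second bullet comes down to how a `Result<Option<_>, _>` is tested. A small sketch, with `IndexHandle` and both helper functions as hypothetical stand-ins for the real `Config::index_files()` call site:

```rust
/// Hypothetical stand-in for whatever an index lookup hands back.
struct IndexHandle;

type Lookup = Result<Option<IndexHandle>, String>;

/// Original check: only an `Err` triggers indexing, so the common
/// "no index yet" case (`Ok(None)`) is silently skipped.
fn needs_index_buggy(lookup: &Lookup) -> bool {
    lookup.is_err()
}

/// Fixed check: anything other than "an index is present" triggers indexing.
fn needs_index_fixed(lookup: &Lookup) -> bool {
    !matches!(lookup, Ok(Some(_)))
}
```

For `Ok(None)` the first function returns `false` and the second returns `true`, which is the difference between never indexing and indexing when the index is missing.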

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tions

5 findings on the initial commit (4d706e1) — 1 HIGH already addressed in
899de97 (index_files() Ok(None) check), 4 still open:

* MEDIUM (generator.rs `parse_f64`): `parse_f64("NaN")` returned
  `Some(f64::NAN)`, which slipped past both the `hi < lo` short-circuit and
  the quartile sanity check, and would then panic in `rng.random_range`.
  Added `.filter(|v| v.is_finite())` so non-finite endpoints fail at parse
  time. Same path now rejects `±Infinity`.

* LOW (USAGE / docs): "logical value maps consistently" was unqualified, but
  for cardinality > CARDINALITY_POOL_CAP (100k) we generate a fresh fake per
  row. Spelled this out in the USAGE.

* LOW (mod.rs USAGE Examples block): the second example had two `$ qsv`
  lines under one comment, so the MCP-skill generator gave both the same
  description. Split into two comments — "First, generate the Data Dictionary
  with describegpt" / "Then layer in semantic fakers from the dictionary" —
  and regenerated qsv-synthesize.json + synthesize.md.

* LOW (qsv-synthesize.json missing `--jobs`): not addressed by design.
  `--jobs` is intentionally skipped from MCP skills as "infrastructure
  setting" by mcp_skills_gen.rs:367 — verified the same flag is omitted from
  qsv-stats.json (29 options, no jobs), qsv-frequency.json (35 options, no
  jobs). This is a project-wide convention, not a synthesize-specific bug.
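
For the `parse_f64` finding, a minimal sketch of the finite-only filter and of why `NaN` gets past an ordering check. The trimming step and the `range_check_passes` helper are assumptions added for the example; only the `.filter(|v| v.is_finite())` call is taken from the description above.

```rust
/// Parses a numeric endpoint, rejecting NaN and infinities at parse time.
fn parse_f64(raw: &str) -> Option<f64> {
    raw.trim().parse::<f64>().ok().filter(|v| v.is_finite())
}

/// Illustrative guard in the style of a `hi < lo` short-circuit.
/// Every comparison with NaN is false, so NaN endpoints pass this check.
fn range_check_passes(lo: f64, hi: f64) -> bool {
    !(hi < lo)
}
```

Rust's standard parser accepts strings such as `"NaN"` and `"inf"`, so without the filter those values would reach the range-sampling call.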

Tests: 12 unit + 9 integration, all passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Declare a mutable String outside the loop and reuse it to avoid repeated allocations, adding #[allow(unused_assignments)] to suppress warnings. Replace the match on faker_map::content_type_to_value(...) with a direct call using the ? operator to return None early on failure, reducing nesting and simplifying control flow in build_faker_pool.
The prior commit hoisted `let mut value = String::new()` outside the loop
under the banner of "reuse String buffer", but `content_type_to_value`
returns a freshly-allocated `String` each call (the underlying fake-rs
fakers always materialize an owned `String`). Reassigning the hoisted
variable just drops the previous allocation and moves in the new one —
no allocations are amortized. The `#[allow(unused_assignments)]` and the
extra block were also pure noise.

Dropped the hoist, the attribute, and the block. Behavior unchanged:

    let value = faker_map::content_type_to_value(content_type, rng)?;
    if seen.insert(value.clone()) {
        pool.push(value);
    }

Actual buffer reuse would require changing `content_type_to_value` to
write into a `&mut String`, but that wouldn't help either — fake-rs's
`fake_with_rng::<String, _>` still allocates internally on every call.
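
For context, a self-contained sketch of the pool-building loop around that snippet. Everything other than the three quoted lines is assumed: the stub `content_type_to_value` takes a counter instead of an RNG and deliberately repeats values, and the attempt cap is an illustrative guard against fakers that cannot produce enough distinct values.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for `faker_map::content_type_to_value`.
/// Cycles through five addresses so that duplicates occur.
fn content_type_to_value(content_type: &str, counter: &mut u64) -> Option<String> {
    *counter += 1;
    match content_type {
        "email" => Some(format!("user{}@example.com", *counter % 5)),
        _ => None, // unmapped content type
    }
}

/// Collects up to `target` distinct fake values; returns None as soon as
/// the content type turns out to have no faker.
fn build_faker_pool(content_type: &str, target: usize, counter: &mut u64) -> Option<Vec<String>> {
    let mut seen = HashSet::new();
    let mut pool = Vec::with_capacity(target);
    let max_attempts = target.saturating_mul(10); // assumed retry budget
    for _ in 0..max_attempts {
        if pool.len() >= target {
            break;
        }
        // Each call yields a freshly allocated String, so there is no
        // buffer to reuse across iterations.
        let value = content_type_to_value(content_type, counter)?;
        if seen.insert(value.clone()) {
            pool.push(value);
        }
    }
    Some(pool)
}
```

The one remaining `clone` is what lets the same value live in both the dedup set and the pool.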

Tests: 12 unit + 9 integration, all passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 3df17e5 into master May 15, 2026
26 of 27 checks passed
@jqnatividad jqnatividad deleted the synthesize-command branch May 15, 2026 11:47