feat(synthesize): generate statistically-faithful synthetic CSVs#3854
Merged
Conversation
…faithful synthetic CSVs
`qsv synthesize <input.csv>` runs `stats` + `frequency` + `count` on the source
and emits N rows of synthetic data that reproduce the source's per-column
attributes:
* Categorical / low-cardinality columns are reproduced by frequency-weighted
sampling of the *real* value set — cardinality, weights and repetition
structure preserved exactly.
* Numeric and date/datetime columns use quartile-bucketed generation so the
*shape* of the distribution is preserved, not just `[min, max]`.
* Null ratios are reproduced per column.
* `--seed` makes output fully reproducible (single master `StdRng` threads
through both selection logic and every faker call).
* `--dictionary <file>` layers in semantic Content Types from
`describegpt --dictionary --infer-content-type` — each token maps to a
`fake-rs` faker (40-token vocabulary covers names, emails, addresses,
UUIDs, etc.). Bounded-cardinality faker columns sample from a fixed
pre-generated pool of distinct fake values (the consistency mechanism, so a
given logical value maps consistently).
* `--infer-content-type` runs `describegpt` internally to build the dictionary
on the fly (needs `QSV_LLM_APIKEY`).
Cross-column correlation is explicitly out of scope for v1; columns are
generated independently.
Implementation notes:
* New module `src/cmd/synthesize/{mod,dictionary,faker_map,generator}.rs`.
Reuses `util::run_qsv_cmd` to shell out to `stats` / `frequency` / `count`
(the same primitive `describegpt::perform_analysis` uses) — no refactor of
describegpt itself was needed.
* Bumped a handful of `pub(super)` items in `src/cmd/describegpt/dictionary.rs`
to `pub(crate)` (`StatsRecord`, `FrequencyRecord`, `parse_stats_csv`,
`parse_frequency_csv`, `CONTENT_TYPE_VOCAB`) and made the `dictionary`
submodule `pub(crate)` so `synthesize` can consume them.
* `fake = "5"` (features: `derive`, `chrono`, `uuid`, `random_color`) added
as an optional dep. The new `synthesize` feature gates the dep and is wired
into `distrib_features` and `qsvmcp` (not `lite`, not `datapusher_plus`).
`fake` v5.1.0 uses `rand 0.10` (same as qsv), so its `fake_with_rng` accepts
qsv's seeded `StdRng` directly — no RNG version-bridging needed.
* Worked around a determinism bug in fake-rs's `Dummy<IP<L>> for String`
(ignores the passed RNG); `ip_address` uses `IPv4` directly instead.
* Made the help-markdown and MCP-skills generators support module-dir
commands (`src/cmd/<name>/mod.rs`), not just flat `src/cmd/<name>.rs`.
* Regen also picked up genuine doc drift in `qsv-stats.json`,
`qsv-frequency.json`, and `docs/help/frequency.md` (Apache DataSketches
big-endian notes) — included.
v1 is English-only; `--locale` is reserved for future multi-locale support
(each fake-rs locale is a distinct type, so multi-locale dispatch needs
macro expansion).
Tests: 12 unit tests + 9 integration tests, all passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 160 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Contributor
There was a problem hiding this comment.
Pull request overview
Adds a new qsv synthesize command (feature-gated) to generate statistically-faithful synthetic CSVs, optionally layering semantic fakers from a describegpt data dictionary. Also updates the help/MCP-skill generators to support module-directory commands (src/cmd/<name>/mod.rs) and regenerates related docs/skill JSON.
Changes:
- Introduce
src/cmd/synthesize/with dictionary loading/inference, faker mapping, and per-column generator model. - Wire the new command into CLI dispatch, docs/TOC/README, MCP skills generation, and integration test registration.
- Add
fakeas an optional dependency behind a newsynthesizefeature; include it indistrib_features/qsvmcp.
Reviewed changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/tests.rs | Registers synthesize integration tests under synthesize + feature_capable. |
| tests/test_synthesize.rs | Adds end-to-end integration coverage for synthesize CLI behaviors and error paths. |
| src/mcp_skills_gen.rs | Adds synthesize to MCP categories/command list and supports module-dir command sources. |
| src/main.rs | Adds synthesize to help text and command dispatch enum (feature-gated). |
| src/help_markdown_gen.rs | Supports module-dir command sources for help markdown generation. |
| src/cmd/synthesize/mod.rs | Implements CLI, runs internal stats/frequency/count, builds generators, and emits CSV. |
| src/cmd/synthesize/generator.rs | Implements per-column generator strategies (frequency-weighted, faker, quantile, etc.). |
| src/cmd/synthesize/faker_map.rs | Maps content_type tokens to fake-rs generators (EN-only v1). |
| src/cmd/synthesize/dictionary.rs | Loads/infer field content_type mappings from dictionary JSON or describegpt. |
| src/cmd/mod.rs | Registers the new synthesize module (feature-gated). |
| src/cmd/describegpt/dictionary.rs | Exposes dictionary/stat/frequency parsing pieces for reuse by synthesize. |
| src/cmd/describegpt.rs | Makes describegpt dictionary module visible to crate for reuse. |
| README.md | Adds synthesize to the command list. |
| docs/help/TableOfContents.md | Adds synthesize entry to help TOC. |
| docs/help/synthesize.md | Adds generated synthesize help page. |
| docs/help/frequency.md | Doc regen drift update (DataSketches big-endian note). |
| Cargo.toml | Adds optional fake dep and synthesize feature; wires into feature groups. |
| Cargo.lock | Locks new dependency graph for fake and transitive deps. |
| .claude/skills/qsv/qsv-synthesize.json | Generated MCP skill definition for synthesize. |
| .claude/skills/qsv/qsv-stats.json | Regenerated skills JSON reflecting stats doc drift. |
| .claude/skills/qsv/qsv-frequency.json | Regenerated skills JSON reflecting frequency doc drift. |
15 DevSkim code-scanning alerts on the synthesize PR, all false positives:
* DS148264 ("weak/non-cryptographic RNG") — 12 hits on `StdRng::seed_from_u64`
in *test code*. The whole point of these tests is determinism with a fixed
seed; synthesize generates *fake data*, not security tokens. Added
`// DevSkim: ignore DS148264` inline (matches the existing pattern in
`select.rs:142`, `sample.rs:406`, `sort.rs:202-214`, `pragmastat.rs:668`).
* DS126858 ("weak/broken hash algorithm") — 3 hits on the variable name `cmd2`
in `tests/test_synthesize.rs`. The substring `md2` matched DevSkim's
legacy-hash regex (md2/md4/md5/sha0/sha1). Renamed `cmd1`/`cmd2` →
`first_cmd`/`second_cmd` — clearer anyway, no behavior change.
Tests: 12 unit + 9 integration, all still passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mpling Three valid issues caught by Copilot's PR review: * `--jobs` was passed to `qsv count`, which doesn't accept it. Split the analysis args into `delim_only_args` (for `count`, which accepts `-d` but not `--jobs`) and `analysis_common_args` (for `stats` / `frequency`, which accept both). `index` continues to get no extra args. * Best-effort indexing was effectively dead code: `Config::index_files()` returns `Ok(None)` (not `Err`) when no index exists, so `.is_err()` was false for the common unindexed case and the `qsv index` subprocess never fired. Switched to `!matches!(..., Ok(Some(_)))` so we now actually create the index when one is missing. Confirmed end-to-end — the "Indexed input" status line now appears for unindexed inputs. * `Date` columns severely under-sampled the max date: `stats` min/max/q* values for `Date` are at midnight, so sampling uniformly over epoch seconds put only a single tick of the max-day in the last bucket (the rest of that day fell outside any bucket). Now `parse_epoch` returns *whole days since the UNIX epoch* for `Date` columns (and still seconds for `DateTime`); the `DateQuantile::next` arm samples a whole day uniformly within the bucket and multiplies back to seconds for formatting. `DateTime` sampling is unchanged. Tests: 12 unit + 9 integration, all passing. Verified `qsv synthesize ... --jobs 4` works end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tions 5 findings on the initial commit (4d706e1) — 1 HIGH already addressed in 899de97 (index_files() Ok(None) check), 4 still open: * MEDIUM (generator.rs `parse_f64`): `parse_f64("NaN")` returned `Some(f64::NAN)`, which slipped past both the `hi < lo` short-circuit and the quartile sanity check, and would then panic in `rng.random_range`. Added `.filter(|v| v.is_finite())` so non-finite endpoints fail at parse time. Same path now rejects `±Infinity`. * LOW (USAGE / docs): "logical value maps consistently" was unqualified, but for cardinality > CARDINALITY_POOL_CAP (100k) we generate a fresh fake per row. Spelled this out in the USAGE. * LOW (mod.rs USAGE Examples block): the second example had two `$ qsv` lines under one comment, so the MCP-skill generator gave both the same description. Split into two comments — "First, generate the Data Dictionary with describegpt" / "Then layer in semantic fakers from the dictionary" — and regenerated qsv-synthesize.json + synthesize.md. * LOW (qsv-synthesize.json missing `--jobs`): not addressed by design. `--jobs` is intentionally skipped from MCP skills as "infrastructure setting" by mcp_skills_gen.rs:367 — verified the same flag is omitted from qsv-stats.json (29 options, no jobs), qsv-frequency.json (35 options, no jobs). This is a project-wide convention, not a synthesize-specific bug. Tests: 12 unit + 9 integration, all passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Declare a mutable String outside the loop and reuse it to avoid repeated allocations, adding #[allow(unused_assignments)] to suppress warnings. Replace the match on faker_map::content_type_to_value(...) with a direct call using the ? operator to return None early on failure, reducing nesting and simplifying control flow in build_faker_pool.
The prior commit hoisted `let mut value = String::new()` outside the loop
under the banner of "reuse String buffer", but `content_type_to_value`
returns a freshly-allocated `String` each call (the underlying fake-rs
fakers always materialize an owned `String`). Reassigning the hoisted
variable just drops the previous allocation and moves in the new one —
no allocations are amortized. The `#[allow(unused_assignments)]` and the
extra block were also pure noise.
Dropped the hoist, the attribute, and the block. Behavior unchanged:
let value = faker_map::content_type_to_value(content_type, rng)?;
if seen.insert(value.clone()) {
pool.push(value);
}
Actual buffer reuse would require changing `content_type_to_value` to
write into a `&mut String`, but that wouldn't help either — fake-rs's
`fake_with_rng::<String, _>` still allocates internally on every call.
Tests: 12 unit + 9 integration, all passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds
qsv synthesize <input.csv>— a new command that generates a synthetic CSVthat is statistically faithful to a source CSV. It runs
stats+frequency+counton the source and emits N rows that reproduce the source's per-columnattributes, optionally using a Data Dictionary's semantic Content Types to pick
realistic fake-rs fakers.
Use cases: test fixtures, demos, sharing dataset shape without real/PII data.
How it works
Per-column generator model, picked by construction-time precedence:
FrequencyWeighted— whenfrequencyfully enumerates the column (no"Other" catch-all bucket), sample the real value set with real frequency
weights. Reproduces cardinality, weights, and repetition structure exactly.
Wins even over a faker mapping (a
statecolumn with 50 enumerated valuesemits the real 50).
Faker— when the dictionarycontent_typemaps to afake-rsfaker.Bounded-cardinality columns pre-generate a fixed pool of distinct fake
values once and sample from it (the consistency mechanism — same logical
value maps to the same fake value).
NumericQuantile/DateQuantile— quartile-bucketed samplingreproduces the shape of the distribution, not just
[min, max].Boolean/LoremFallback/Emptysafety nets.Null ratios reproduced per column.
--seedmakes output fully reproducible(single master
StdRngthreads through both selection logic and every fakercall). Cross-column correlation is out of scope for v1 — columns generated
independently.
CLI surface
Examples
Implementation notes
src/cmd/synthesize/{mod,dictionary,faker_map,generator}.rs.util::run_qsv_cmdto shell out tostats/frequency/count(the same primitive
describegpt::perform_analysisuses) — no refactor ofdescribegpt itself was needed.
pub(super)items insrc/cmd/describegpt/dictionary.rsto
pub(crate)(StatsRecord,FrequencyRecord,parse_stats_csv,parse_frequency_csv,CONTENT_TYPE_VOCAB) plus thedictionarysubmoduleitself, so
synthesizecan consume them. No behavior changes in describegpt.fake = "5"(features:derive,chrono,uuid,random_color) added asan optional dep. New
synthesizefeature gates it and is wired intodistrib_features+qsvmcp(notlite, notdatapusher_plus). Verifiedvia
cargo tree -i rand:fakev5.1.0 usesrand 0.10(same major as qsv),so its
fake_with_rngaccepts qsv's seededStdRngdirectly — nocross-version RNG bridging needed.
Dummy<IP<L>> for String(the impl ignores the passed RNG);ip_addressuses
IPv4directly instead — samea.b.c.dformat, deterministic.commands (
src/cmd/<name>/mod.rs), not just flatsrc/cmd/<name>.rs.qsv-stats.json,qsv-frequency.json, anddocs/help/frequency.md(Apache DataSketchesbig-endian notes) — included.
Out of scope / future work
locale dispatch needs macro expansion).
--localeis accepted but onlyENis supported; non-EN errors with a clear message.--preserve-valuesmode that streams the source CSV and keys a realHashMap<original, fake>per column (the literal "same original → samefake" behavior, requires source re-read and ties output row count to source
row count). The
ColumnGenerator::Fakervariant is already shaped to slotthis in without restructuring.
Tests
src/cmd/synthesize/cover the faker map (every vocabtoken has a mapping or is in
{category, unknown}; deterministic withseed), the generator model (null-ratio reproduction, frequency-weighted
ratios, quantile-in-range, faker-pool cardinality, date-in-range, seed
reproducibility), and dictionary loading (JSON
nullfields normalize tounknown).tests/test_synthesize.rscover the end-to-endcommand: no-dictionary basic shape, seed reproducibility,
--rows,dictionary-driven faker for high-cardinality column, null-ratio over 5000
rows,
--output, and the three error paths (bad locale, zero rows,missing input).
cargo clippy -F all_featuresclean for synthesize.Files
New:
src/cmd/synthesize/{mod,dictionary,faker_map,generator}.rstests/test_synthesize.rsdocs/help/synthesize.md,.claude/skills/qsv/qsv-synthesize.json(generated)Modified:
Cargo.toml,Cargo.lock—fakedep +synthesizefeaturesrc/cmd/describegpt/dictionary.rs,src/cmd/describegpt.rs— visibility bumps onlysrc/cmd/mod.rs,src/main.rs,tests/tests.rs— gated registrationsrc/help_markdown_gen.rs,src/mcp_skills_gen.rs— module-dir supportREADME.md,docs/help/TableOfContents.md— command row + TOCdocs/help/frequency.md,.claude/skills/qsv/qsv-{stats,frequency}.json— drift catches from regen🤖 Generated with Claude Code