feat(synthesize): --consistent-fakes for stable source→fake mapping #3865
Merged
When set, structured-faker columns (first_name, email, city, etc.) with bounded cardinality build a deterministic source-value -> fake-value map at build time. The output then samples a source value by its real frequency and emits the paired fake: the same source value always maps to the same fake, while the source distribution is preserved.

Opt-in via --consistent-fakes (default off, no behavior change for existing users).

The new path runs BEFORE try_frequency_weighted for structured fakers, so it overrides the "emit real values when frequency-enumerated" default. That is the whole point of the flag: emit fakes (deidentified), not real values, with a stable mapping.

Has no effect on unstructured columns (lorem_*, free_text, unknown), all-unique columns, or non-faker columns.

Useful for deidentified synthesis where you want stable joins on the faked columns.

Also regenerated MCP skill JSONs via --update-mcp-skills; the qsv-describegpt.json delta is unrelated upstream drift from prior commits brought back in sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
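The build-then-sample idea described in this commit message can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not the code from the PR: `SplitMix64`, `FAKE_NAMES`, and `MappedColumn` are hypothetical stand-ins (the real command uses its own seeded RNG and the fake-rs fakers), and the "20 attempts per source value" budget is borrowed from the attempt loop mentioned later in this PR.

```rust
use std::collections::HashSet;

/// Tiny deterministic PRNG (SplitMix64) so the sketch needs no external crates.
/// Assumption: this is only a stand-in for whatever seeded RNG qsv uses.
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical stand-in for a structured faker such as `first_name`.
const FAKE_NAMES: [&str; 8] = [
    "John", "Emma", "Liam", "Olivia", "Noah", "Ava", "Lucas", "Mia",
];

/// A column whose source values are paired with fakes once, at build time.
pub struct MappedColumn {
    /// (source value, paired fake, source frequency)
    pub mapping: Vec<(String, String, u64)>,
    total: u64,
}

impl MappedColumn {
    /// Pair every source value with a distinct fake that is not itself a
    /// source value. Returns `None` if no fake can be found within the
    /// attempt budget, or if all frequencies are zero.
    pub fn build(freqs: &[(&str, u64)], rng: &mut SplitMix64) -> Option<Self> {
        let sources: HashSet<&str> = freqs.iter().map(|&(s, _)| s).collect();
        let mut used: HashSet<&'static str> = HashSet::new();
        let mut mapping = Vec::with_capacity(freqs.len());
        let max_attempts = freqs.len() * 20;
        for &(source, count) in freqs {
            let mut chosen = None;
            for _ in 0..max_attempts {
                let idx = (rng.next_u64() % FAKE_NAMES.len() as u64) as usize;
                let candidate = FAKE_NAMES[idx];
                // Reject fakes that equal a real value or were already handed out.
                if !sources.contains(candidate) && used.insert(candidate) {
                    chosen = Some(candidate);
                    break;
                }
            }
            mapping.push((source.to_string(), chosen?.to_string(), count));
        }
        let total: u64 = mapping.iter().map(|(_, _, c)| *c).sum();
        if total == 0 {
            return None;
        }
        Some(Self { mapping, total })
    }

    /// Pick a source value in proportion to its frequency, emit its fake.
    pub fn sample(&self, rng: &mut SplitMix64) -> &str {
        let mut ticket = rng.next_u64() % self.total;
        for (_, fake, count) in &self.mapping {
            if ticket < *count {
                return fake;
            }
            ticket -= *count;
        }
        unreachable!("ticket is always below the total count")
    }
}
```

Building twice with the same seed yields the same pairing, which is the property that makes joins on the faked column stable; sampling by cumulative count is what keeps the source distribution.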
Up to standards ✅🟢 Issues
| Metric | Results |
|---|---|
| Complexity | 40 |
Pull request overview
Adds an opt-in --consistent-fakes mode to qsv synthesize to produce deterministic source→fake mappings for structured faker columns when the column is fully enumerated by frequency, enabling stable joins across runs while preserving the source distribution.
Changes:
- Introduces `ColumnGenerator::FakerMapped` + `try_faker_mapped` and inserts it ahead of `FrequencyWeighted` when `--consistent-fakes` is enabled.
- Threads a new `--consistent-fakes` CLI flag through synthesize argument parsing into generator construction, and updates help artifacts.
- Adds integration tests covering mapping stability, distribution preservation, seed reproducibility, default behavior when the flag is off, and unstructured-text passthrough.
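The selection order described in this PR (mapped fakes first, then frequency-weighted real values, then a plain faker) can be pictured as a small decision function. The sketch below is an assumption-laden illustration: `GeneratorKind`, `ColumnFacts`, and `choose` are hypothetical names, and only the three variants named in this PR are modeled, not the full `ColumnGenerator` enum.

```rust
/// The three generator variants named in this PR (the real enum has more).
#[derive(Debug, PartialEq)]
pub enum GeneratorKind {
    FakerMapped,
    FrequencyWeighted,
    Faker,
}

/// Hypothetical summary of what is known about a column at build time.
pub struct ColumnFacts {
    /// Content type is a structured faker (first_name, email, city, ...).
    pub structured_faker: bool,
    /// Every distinct source value and its count is available.
    pub frequency_enumerated: bool,
}

/// Apply the precedence: the mapped path is tried first, but only when the
/// flag is set; otherwise the pre-existing order is unchanged.
pub fn choose(facts: &ColumnFacts, consistent_fakes: bool) -> GeneratorKind {
    if consistent_fakes && facts.structured_faker && facts.frequency_enumerated {
        GeneratorKind::FakerMapped
    } else if facts.frequency_enumerated {
        GeneratorKind::FrequencyWeighted
    } else {
        GeneratorKind::Faker
    }
}
```

With the flag off, the first branch can never fire, which mirrors the "no behavior change for existing users" guarantee.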
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/cmd/synthesize/mod.rs` | Adds `--consistent-fakes` flag, plumbs it into generator construction. |
| `src/cmd/synthesize/generator.rs` | Implements `FakerMapped` generator path and selection precedence updates. |
| `tests/test_synthesize.rs` | Adds end-to-end tests for consistent fakes behavior and regressions. |
| `docs/help/synthesize.md` | Regenerated help docs to include the new flag. |
| `.claude/skills/qsv/qsv-synthesize.json` | Regenerated MCP skill JSON for synthesize, including new flag and updated locale docs. |
| `.claude/skills/qsv/qsv-describegpt.json` | Regenerated MCP skill JSON drift sync (includes `--two-pass` docs). |
…lues

* `try_faker_mapped`: add `CARDINALITY_POOL_CAP` guard so a user passing `--freq-limit 0` on a high-cardinality column does not allocate a huge `HashSet`/`Vec` or spin in a `source_count * 20` attempt loop. Above the cap, fall through to the regular `Faker` variant (per-row gen, no pre-allocated map). Matches `build_faker_pool`'s existing safety bound.
* `synthesize_consistent_fakes_*` tests: prefix source values with `SRC_` (e.g. `SRC_Michael`) so they cannot collide with anything the `first_name` faker emits, keeping the "no real value leaked" assertion robust to fake-rs locale-dataset / version drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
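The cap guard from this commit can be sketched as a pre-check that runs before any map is allocated. This is a hedged illustration only: the value of `CARDINALITY_POOL_CAP` is not stated in this PR, so the `10_000` below is a placeholder, and `Plan` and `plan_for` are hypothetical names.

```rust
/// Placeholder value: the PR names `CARDINALITY_POOL_CAP` but does not
/// state its value here, so this number is an assumption.
pub const CARDINALITY_POOL_CAP: usize = 10_000;

/// Outcome of the guard for one column.
#[derive(Debug, PartialEq)]
pub enum Plan {
    /// Build the source -> fake map, trying at most `max_attempts` draws.
    Mapped { max_attempts: usize },
    /// Too many distinct values: generate a fresh fake per row instead.
    PerRowFaker,
}

/// Decide up front, from the distinct-value count alone, whether the
/// pre-allocated map is affordable.
pub fn plan_for(source_count: usize) -> Plan {
    if source_count > CARDINALITY_POOL_CAP {
        // Above the cap: no map, no attempt loop.
        return Plan::PerRowFaker;
    }
    Plan::Mapped {
        // The commit message describes a `source_count * 20` attempt budget.
        max_attempts: source_count.saturating_mul(20),
    }
}
```

Checking the count before allocating is the design point: the bound limits both memory (the set and vector) and time (the attempt loop) with a single comparison.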
Summary
- Adds `--consistent-fakes` to `qsv synthesize`: an opt-in mode where structured-faker columns (`first_name`, `email`, `city`, `street_name`, etc.) with bounded cardinality build a deterministic `source_value → fake_value` map at build time. The same source value always emits the same fake; the source frequency distribution is preserved.
- New `FakerMapped` `ColumnGenerator` variant + `try_faker_mapped` helper. Runs before `try_frequency_weighted` for structured fakers when the flag is set. That overrides the "emit real values when frequency-enumerated" default, which is the whole point: emit fakes, not real values, with stable mapping.
- Has no effect on unstructured columns (`lorem_*`, `free_text`, `unknown`), all-unique columns, or non-faker columns.

Useful for deidentified synthesis where you want stable joins on the faked columns (e.g. `customer_id`'s synthesized email always points to the same fake email across runs).

Example
Source CSV with `Michael × 3, Sarah × 2, Tom × 1`, content type `first_name`.

Output preserves the 3:2:1 distribution but every "Michael" maps to the same fake (e.g. "John"), every "Sarah" to the same fake (e.g. "Emma"), etc., and the mapping is `--seed`-reproducible across runs.

Construction precedence (new)
1. `FakerMapped` (new, only when `--consistent-fakes` is set + structured faker + frequency-enumerated)
2. `FrequencyWeighted`: emits real values with real weights
3. `Faker`: bounded-pool or per-row fresh fake

Files changed
- `src/cmd/synthesize/mod.rs`: `--consistent-fakes` USAGE entry, `Args` field, thread into `ColumnGenerator::build`.
- `src/cmd/synthesize/generator.rs`: `FakerMapped` variant, `try_faker_mapped` helper, `build()` gate, `next()` arm, module-level docstring updated, 8 inline test call-sites adjusted.
- `tests/test_synthesize.rs`: 5 new tests.
- `docs/help/synthesize.md` + `.claude/skills/qsv/qsv-synthesize.json`: regenerated via `qsv --generate-help-md` and `qsv --update-mcp-skills`.
- `.claude/skills/qsv/qsv-describegpt.json`: unrelated upstream drift from prior commits brought back in sync by `--update-mcp-skills` (same atomic regen step).

Test plan
- `cargo build --locked --bin qsv -F all_features`: passes
- `cargo +nightly fmt`: clean
- `cargo clippy --bin qsv -F all_features`: no warnings
- `cargo t synthesize -F all_features`: 53 tests pass (35 unit + 18 integration including 5 new)
  - `synthesize_consistent_fakes_stable_mapping`: distinct source values map to distinct fakes; no real values leak
  - `synthesize_consistent_fakes_distribution_preserved`: 6/4/2 source distribution reproduced within ±15% over 1200 rows
  - `synthesize_consistent_fakes_seed_reproducible`: same seed → byte-identical output
  - `synthesize_consistent_fakes_off_preserves_default`: without the flag, frequency-enumeration still emits real values (regression guard)
  - `synthesize_consistent_fakes_unstructured_passthrough`: `free_text` columns still hit the length-stat truncation path
- `qsv synthesize --help` shows the new flag

🤖 Generated with Claude Code