feat(synthesize): --consistent-fakes for stable source→fake mapping #3865

Merged: jqnatividad merged 2 commits into master from synthesize-consistent-fakes on May 17, 2026

@jqnatividad
Collaborator

Summary

  • Adds --consistent-fakes to qsv synthesize: an opt-in mode where structured-faker columns (first_name, email, city, street_name, etc.) with bounded cardinality build a deterministic source_value → fake_value map at build time. The same source value always emits the same fake; the source frequency distribution is preserved.
  • Implemented as a new FakerMapped ColumnGenerator variant + try_faker_mapped helper. Runs before try_frequency_weighted for structured fakers when the flag is set — that overrides the "emit real values when frequency-enumerated" default, which is the whole point: emit fakes, not real values, with stable mapping.
  • Default OFF — zero behavior change for existing users. Has no effect on unstructured columns (lorem_*, free_text, unknown), all-unique columns, or non-faker columns.

Useful for deidentified synthesis where you want stable joins on the faked columns (e.g. customer_id's synthesized email always points to the same fake email across runs).

Example

Source CSV with Michael × 3, Sarah × 2, Tom × 1, content type first_name:

$ qsv synthesize data.csv --infer-content-type --consistent-fakes \
    --seed 42 -n 6 -o out.csv

Output preserves the 3:2:1 distribution but every "Michael" maps to the same fake (e.g. "John"), every "Sarah" to the same fake (e.g. "Emma"), etc. — and the mapping is --seed-reproducible across runs.
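
The behavior in this example can be sketched in a few lines of Rust. This is a minimal illustration, not the code in generator.rs: `SplitMix64`, `FAKE_NAMES`, `build_map` and `synthesize` are hypothetical names, the eight-name pool stands in for the real first_name faker, and the retry budget of 64 is arbitrary. It only shows the shape of the technique: build the source → fake map once from a seeded RNG, then sample source values by their counts and emit the paired fake.

```rust
use std::collections::{HashMap, HashSet};

/// Small deterministic PRNG (SplitMix64). A stand-in for whatever seeded RNG
/// the real implementation uses; chosen here only to avoid external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical pool standing in for a `first_name` faker.
const FAKE_NAMES: [&str; 8] = [
    "John", "Emma", "Liam", "Olivia", "Noah", "Ava", "Lucas", "Mia",
];

/// Assign each distinct source value its own fake, drawn from the seeded RNG.
/// Returns None if a unique fake cannot be found within the retry budget.
fn build_map(freqs: &[(&str, u64)], rng: &mut SplitMix64) -> Option<HashMap<String, String>> {
    let sources: HashSet<&str> = freqs.iter().map(|(s, _)| *s).collect();
    let mut used: HashSet<&str> = HashSet::new();
    let mut map = HashMap::new();
    for (source, _) in freqs {
        let mut assigned = false;
        for _ in 0..64 {
            let fake = FAKE_NAMES[(rng.next_u64() % FAKE_NAMES.len() as u64) as usize];
            // Reject fakes that equal a real source value or are already taken.
            if !sources.contains(fake) && used.insert(fake) {
                map.insert(source.to_string(), fake.to_string());
                assigned = true;
                break;
            }
        }
        if !assigned {
            return None;
        }
    }
    Some(map)
}

/// Emit `rows` fakes: pick a source value weighted by its count, then emit
/// the fake paired with it.
fn synthesize(freqs: &[(&str, u64)], seed: u64, rows: usize) -> Option<Vec<String>> {
    let mut rng = SplitMix64(seed);
    let map = build_map(freqs, &mut rng)?;
    let total: u64 = freqs.iter().map(|(_, n)| *n).sum();
    if total == 0 {
        return None;
    }
    let mut out = Vec::with_capacity(rows);
    for _ in 0..rows {
        let mut pick = rng.next_u64() % total;
        for (source, n) in freqs {
            if pick < *n {
                out.push(map[*source].clone());
                break;
            }
            pick -= *n;
        }
    }
    Some(out)
}
```

With counts of 3, 2 and 1 for three source values, two calls to `synthesize` with the same seed return identical rows, and no source value appears in the output because `build_map` rejects any fake that equals a source value.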

Construction precedence (new)

  1. FakerMapped (new, only when --consistent-fakes is set + structured faker + frequency-enumerated)
  2. FrequencyWeighted — emits real values with real weights
  3. Faker — bounded-pool or per-row fresh fake
  4. Type-based — numeric/date quantile, boolean, lorem fallback
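
The precedence above can be read as a first-match-wins chain. The sketch below is a simplified model under stated assumptions: `GeneratorKind`, `ColumnProfile` and `choose_generator` are hypothetical names, and the three boolean fields compress conditions that the real `build()` and its `try_*` helpers presumably evaluate in more detail.

```rust
/// Hypothetical labels for the four construction paths listed above.
#[derive(Debug, PartialEq)]
enum GeneratorKind {
    FakerMapped,
    FrequencyWeighted,
    Faker,
    TypeBased,
}

/// Hypothetical, simplified summary of what is known about a column.
struct ColumnProfile {
    /// Content type is a structured faker (first_name, email, city, ...).
    structured_faker: bool,
    /// Content type maps to some faker (structured or not).
    has_faker: bool,
    /// Distinct values and their counts were fully enumerated.
    frequency_enumerated: bool,
}

/// First matching rule wins, mirroring the numbered precedence list.
fn choose_generator(col: &ColumnProfile, consistent_fakes: bool) -> GeneratorKind {
    if consistent_fakes && col.structured_faker && col.frequency_enumerated {
        GeneratorKind::FakerMapped
    } else if col.frequency_enumerated {
        GeneratorKind::FrequencyWeighted
    } else if col.has_faker {
        GeneratorKind::Faker
    } else {
        GeneratorKind::TypeBased
    }
}
```

In this model, the same frequency-enumerated first_name column resolves to `FrequencyWeighted` with the flag off and to `FakerMapped` with it on, which is the override described in the summary.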

Files changed

  • src/cmd/synthesize/mod.rs: --consistent-fakes USAGE entry, Args field, thread into ColumnGenerator::build.
  • src/cmd/synthesize/generator.rs: FakerMapped variant, try_faker_mapped helper, build() gate, next() arm, module-level docstring updated, 8 inline test call-sites adjusted.
  • tests/test_synthesize.rs: 5 new tests.
  • docs/help/synthesize.md + .claude/skills/qsv/qsv-synthesize.json: regenerated via qsv --generate-help-md and qsv --update-mcp-skills.
  • .claude/skills/qsv/qsv-describegpt.json: unrelated upstream drift from prior commits brought back in sync by --update-mcp-skills (same atomic regen step).

Test plan

  • cargo build --locked --bin qsv -F all_features: passes
  • cargo +nightly fmt: clean
  • cargo clippy --bin qsv -F all_features: no warnings
  • cargo t synthesize -F all_features: 53 tests pass (35 unit + 18 integration, including 5 new)
  • New tests:
    • synthesize_consistent_fakes_stable_mapping: distinct source values map to distinct fakes; no real values leak
    • synthesize_consistent_fakes_distribution_preserved: 6/4/2 source distribution reproduced within ±15% over 1200 rows
    • synthesize_consistent_fakes_seed_reproducible: same seed → byte-identical output
    • synthesize_consistent_fakes_off_preserves_default: without the flag, frequency-enumeration still emits real values (regression guard)
    • synthesize_consistent_fakes_unstructured_passthrough: free_text columns still hit the length-stat truncation path
  • Manual: qsv synthesize --help shows the new flag
  • Help/MCP skill JSONs regenerated and verified

🤖 Generated with Claude Code

When set, structured-faker columns (first_name, email, city, etc.) with
bounded cardinality build a deterministic source-value -> fake-value map
at build time. The output then samples a source value by its real
frequency and emits the paired fake — same source value always maps to
the same fake, while the source distribution is preserved.

Opt-in via --consistent-fakes (default off, no behavior change for
existing users). The new path runs BEFORE try_frequency_weighted for
structured fakers so it overrides the "emit real values when frequency-
enumerated" default — that's the whole point of the flag: emit fakes
(deidentified), not real values, with a stable mapping. Has no effect
on unstructured columns (lorem_*, free_text, unknown), all-unique
columns, or non-faker columns.

Useful for deidentified synthesis where you want stable joins on the
faked columns.

Also regenerated MCP skill JSONs via --update-mcp-skills; the
qsv-describegpt.json delta is unrelated upstream drift from prior
commits brought back in sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production

codacy-production Bot commented May 17, 2026

Up to standards ✅

🟢 Issues: 0 new issues

🟢 Metrics: complexity 40


Copilot AI (Contributor) left a comment

Pull request overview

Adds an opt-in --consistent-fakes mode to qsv synthesize to produce deterministic source→fake mappings for structured faker columns when the column is fully enumerated by frequency, enabling stable joins across runs while preserving the source distribution.

Changes:

  • Introduces ColumnGenerator::FakerMapped + try_faker_mapped and inserts it ahead of FrequencyWeighted when --consistent-fakes is enabled.
  • Threads a new --consistent-fakes CLI flag through synthesize argument parsing into generator construction, and updates help artifacts.
  • Adds integration tests covering mapping stability, distribution preservation, seed reproducibility, default behavior when the flag is off, and unstructured-text passthrough.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/cmd/synthesize/mod.rs | Adds --consistent-fakes flag, plumbs it into generator construction. |
| src/cmd/synthesize/generator.rs | Implements FakerMapped generator path and selection precedence updates. |
| tests/test_synthesize.rs | Adds end-to-end tests for consistent fakes behavior and regressions. |
| docs/help/synthesize.md | Regenerated help docs to include the new flag. |
| .claude/skills/qsv/qsv-synthesize.json | Regenerated MCP skill JSON for synthesize, including new flag and updated locale docs. |
| .claude/skills/qsv/qsv-describegpt.json | Regenerated MCP skill JSON drift sync (includes --two-pass docs). |

Comment thread src/cmd/synthesize/generator.rs
Comment thread tests/test_synthesize.rs Outdated
…lues

* `try_faker_mapped`: add `CARDINALITY_POOL_CAP` guard so a user passing
  `--freq-limit 0` on a high-cardinality column does not allocate a huge
  `HashSet`/`Vec` or spin in a `source_count * 20` attempt loop. Above
  the cap, fall through to the regular `Faker` variant (per-row gen, no
  pre-allocated map). Matches `build_faker_pool`'s existing safety bound.
* `synthesize_consistent_fakes_*` tests: prefix source values with `SRC_`
  (e.g. `SRC_Michael`) so they cannot collide with anything the
  `first_name` faker emits, keeping the "no real value leaked" assertion
  robust to fake-rs locale-dataset / version drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
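
The guard described in the first bullet of this commit can be sketched as follows. This is an illustration under assumptions, not the merged code: the cap value of 10,000 is a placeholder (the actual `CARDINALITY_POOL_CAP` value is not stated here), `try_build_fake_pool` is a hypothetical name, and returning `None` when the attempt budget runs out is an assumed behavior. The `source_count * 20` budget is taken from the commit message.

```rust
use std::collections::HashSet;

/// Placeholder value; the real constant's value is not given in this PR text.
const CARDINALITY_POOL_CAP: usize = 10_000;

/// Try to collect `source_count` distinct fakes from `next_fake`.
/// Returns None when the caller should fall through to per-row generation:
/// either the column is above the cap, or the attempt budget was exhausted.
fn try_build_fake_pool(
    source_count: usize,
    mut next_fake: impl FnMut() -> String,
) -> Option<Vec<String>> {
    // Guard first, so nothing is allocated for oversized columns.
    if source_count == 0 || source_count > CARDINALITY_POOL_CAP {
        return None;
    }
    // Bounded retry budget: at most 20 draws per required fake.
    let max_attempts = source_count.saturating_mul(20);
    let mut seen = HashSet::with_capacity(source_count);
    let mut pool = Vec::with_capacity(source_count);
    for _ in 0..max_attempts {
        let fake = next_fake();
        if seen.insert(fake.clone()) {
            pool.push(fake);
            if pool.len() == source_count {
                return Some(pool);
            }
        }
    }
    None
}
```

Checking the cap before any allocation is what keeps a `--freq-limit 0` run on a high-cardinality column cheap: the function returns immediately and the column is handled by the per-row path instead.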
@jqnatividad merged commit 0ce1213 into master on May 17, 2026
26 of 28 checks passed
@jqnatividad deleted the synthesize-consistent-fakes branch on May 17, 2026 04:32