feat(synthesize): preserve inter-column relationships #3888
`synthesize` previously generated every column independently, so synthetic
rows were individually plausible but jointly unrealistic — a Created Date
could land after a Closed Date, impossible city/state/zip combinations
appeared, and correlated measures drifted apart.
Add a relationship framework: the data dictionary may now declare a
`relationships` array whose named columns are generated jointly by a
GroupGenerator instead of independent per-column generators. Three classes:
* joint — frequency-weighted sampling of whole observed value-tuples;
only real co-occurring combinations are emitted
* ordered — anchor column + non-negative gaps learned from the source,
so a monotonic chain (created_date <= closed_date) always holds
* correlated — Gaussian copula over each column's own quartile marginal,
reproducing Spearman correlation without distorting marginals
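The `joint` class described in the list above can be sketched as follows. This is a minimal illustration, not the PR's code: `JointSampler`, its methods, and the `SplitMix64` stand-in PRNG are hypothetical names, and the real `GroupGenerator` may store and sample tuples differently.

```rust
use std::collections::HashMap;

/// SplitMix64: a tiny seeded PRNG, used here only as a stand-in for the real RNG.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical joint sampler: distinct observed tuples plus cumulative counts.
struct JointSampler {
    tuples: Vec<Vec<String>>,
    cumulative: Vec<u64>,
}

impl JointSampler {
    /// Count whole tuples in the source rows. Returns None for an empty source.
    fn learn(rows: &[Vec<String>]) -> Option<Self> {
        let mut counts: HashMap<&[String], u64> = HashMap::new();
        for row in rows {
            *counts.entry(row.as_slice()).or_insert(0) += 1;
        }
        // Sort so the layout does not depend on HashMap iteration order,
        // which keeps output reproducible for a given seed.
        let mut entries: Vec<(&[String], u64)> = counts.into_iter().collect();
        entries.sort();
        if entries.is_empty() {
            return None;
        }
        let (mut tuples, mut cumulative, mut total) = (Vec::new(), Vec::new(), 0u64);
        for (tuple, count) in entries {
            total += count;
            tuples.push(tuple.to_vec());
            cumulative.push(total);
        }
        Some(Self { tuples, cumulative })
    }

    /// Draw one whole tuple with probability proportional to its observed count.
    fn sample(&self, rng: &mut SplitMix64) -> &[String] {
        let total = *self.cumulative.last().expect("non-empty by construction");
        let pick = rng.next_u64() % total; // modulo bias is negligible for a sketch
        let idx = self.cumulative.partition_point(|&c| c <= pick);
        &self.tuples[idx]
    }
}
```

Because only whole tuples are stored, a sampler learned from `("Austin", "TX")` and `("Boston", "MA")` can never emit `("Austin", "MA")`.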
New flags: --no-relationships, --joint-cardinality-cap, --correlation-threshold,
--strict-relationships. Output stays fully reproducible with --seed; the copula
math (Cholesky-with-ridge, erf, Box-Muller) is hand-rolled — no new dependency.
Relationships are read from a hand-authored or describegpt-produced dictionary;
automatic LLM inference of the array is a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
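The copula pieces named in this commit message (Cholesky-with-ridge, `erf`, Box-Muller) fit together roughly as sketched below. All function names, the `SplitMix64` stand-in PRNG, and the choice of the Abramowitz-Stegun 7.1.26 approximation for `erf` are assumptions made for illustration; the PR's actual implementation may differ. The sketch stops at correlated uniforms and does not show how each uniform is mapped through a column's own marginal.

```rust
use std::f64::consts::{PI, SQRT_2};

/// SplitMix64: a tiny seeded PRNG, used here only as a stand-in for the real RNG.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
    /// Uniform in the open interval (0, 1), so ln() below is always finite.
    fn next_f64(&mut self) -> f64 {
        ((self.next_u64() >> 12) as f64 + 0.5) / (1u64 << 52) as f64
    }
}

/// Box-Muller: turn two uniforms into one standard normal draw.
fn standard_normal(rng: &mut SplitMix64) -> f64 {
    let (u1, u2) = (rng.next_f64(), rng.next_f64());
    (-2.0 * u1.ln()).sqrt() * (2.0 * PI * u2).cos()
}

/// erf via Abramowitz-Stegun 7.1.26 (absolute error about 1.5e-7).
fn erf(x: f64) -> f64 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    let poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
        - 0.284496736) * t + 0.254829592) * t;
    sign * (1.0 - poly * (-x * x).exp())
}

/// Standard normal CDF expressed through erf.
fn normal_cdf(z: f64) -> f64 {
    0.5 * (1.0 + erf(z / SQRT_2))
}

/// Lower-triangular Cholesky factor of (corr + ridge * I).
/// Returns None if the matrix is still not positive definite.
fn cholesky_with_ridge(corr: &[Vec<f64>], ridge: f64) -> Option<Vec<Vec<f64>>> {
    let n = corr.len();
    let mut l = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..=i {
            let mut sum = corr[i][j] + if i == j { ridge } else { 0.0 };
            for k in 0..j {
                sum -= l[i][k] * l[j][k];
            }
            if i == j {
                if sum <= 0.0 {
                    return None;
                }
                l[i][j] = sum.sqrt();
            } else {
                l[i][j] = sum / l[j][j];
            }
        }
    }
    Some(l)
}

/// One draw of correlated uniforms: z = L * e, rescaled to unit variance, then u = Phi(z).
fn correlated_uniforms(l: &[Vec<f64>], rng: &mut SplitMix64) -> Vec<f64> {
    let n = l.len();
    let e: Vec<f64> = (0..n).map(|_| standard_normal(rng)).collect();
    (0..n)
        .map(|i| {
            let z: f64 = (0..=i).map(|k| l[i][k] * e[k]).sum();
            // The ridge inflates the variance slightly; divide it back out.
            let sd = (0..=i).map(|k| l[i][k] * l[i][k]).sum::<f64>().sqrt();
            normal_cdf(z / sd)
        })
        .collect()
}
```

Each uniform would then be pushed through the corresponding column's inverse marginal, which is why the marginals themselves are left undistorted while the rank correlation carries over.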
`describegpt --dictionary --infer-content-type` now asks the LLM to detect inter-column relationships and emits them as a top-level `relationships` array in the dictionary JSON, the same array `synthesize` consumes to preserve inter-column structure.
The dictionary prompt explains the three relationship kinds (joint, ordered, correlated) and asks for a `relationships` array alongside the per-field output. `parse_llm_relationships` extracts and structurally validates each entry against the real field names, dropping unknown kinds, unknown columns, and groups with fewer than two members. `synthesize` re-validates every relationship against the data before use, so this stage only guarantees structural soundness.
Relationships flow through both the single-pass and `--two-pass` dictionary paths; two-pass takes them from the relationship-aware first pass. `format_dictionary_json` gains a `relationships` parameter and emits the array only when non-empty, so dictionaries without relationships stay byte-identical to the previous output.
Tested against a live local LLM (gated behind QSV_TEST_DESCRIBEGPT).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
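The structural validation described for `parse_llm_relationships` might look roughly like the sketch below. The `Relationship` struct, its field names, and `validate_relationships` are hypothetical, and the sketch makes one assumption the commit message does not settle: unknown columns are removed from a group first, and the group is dropped only if fewer than two members remain.

```rust
/// Hypothetical in-memory form of one `relationships` entry.
#[derive(Debug, Clone, PartialEq)]
struct Relationship {
    kind: String,
    columns: Vec<String>,
}

impl Relationship {
    fn new(kind: &str, columns: &[&str]) -> Self {
        Self {
            kind: kind.to_string(),
            columns: columns.iter().map(|c| c.to_string()).collect(),
        }
    }
}

/// The three relationship kinds named in the commit message.
const KNOWN_KINDS: [&str; 3] = ["joint", "ordered", "correlated"];

/// Keep only structurally sound entries:
/// - the kind must be one of the known kinds,
/// - columns that are not real field names are removed (assumption),
/// - groups left with fewer than two members are dropped.
fn validate_relationships(
    candidates: Vec<Relationship>,
    field_names: &[&str],
) -> Vec<Relationship> {
    candidates
        .into_iter()
        .filter_map(|mut rel| {
            if !KNOWN_KINDS.contains(&rel.kind.as_str()) {
                return None;
            }
            rel.columns.retain(|c| field_names.contains(&c.as_str()));
            if rel.columns.len() < 2 {
                return None;
            }
            Some(rel)
        })
        .collect()
}
```

A check like this only looks at names and shapes; whether an `ordered` chain actually holds in the data is left to the re-validation that `synthesize` performs.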
Up to standards ✅🟢 Issues
| Metric | Results |
|---|---|
| Complexity | 187 |
Pull request overview
Adds an inter-column “relationship” framework to qsv synthesize, allowing specified column groups to be generated jointly (functional-dependency tuples, ordered chains, and correlated numeric groups) instead of independently. This improves realism of synthetic rows and extends describegpt --dictionary --infer-content-type to optionally infer and emit a relationships array in the generated dictionary.
Changes:
- Implement relationship resolution + a deterministic row-emission schedule, with new `GroupGenerator` implementations for `joint`, `ordered`, and `correlated`.
- Extend `synthesize` dictionary handling to load/infer `relationships`, add new CLI flags controlling relationship behavior, and update docs/README.
- Add integration/unit tests covering relationship behavior and dictionary relationship emission.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tests/test_synthesize.rs | Adds integration tests for joint/ordered/correlated relationship behavior and flags. |
| tests/test_describegpt.rs | Adds an integration test validating relationships appear structurally in dictionary JSON output. |
| src/cmd/synthesize/relationships.rs | New module: resolves dictionary relationships, learns parameters from the source CSV, builds emit schedule. |
| src/cmd/synthesize/mod.rs | Wires relationship scheduling into synthesize; adds new CLI flags and updates help text. |
| src/cmd/synthesize/group_generator.rs | New module: emits multi-column tuples for joint/ordered/correlated groups. |
| src/cmd/synthesize/generator.rs | Exposes helper functions (parse_f64, parse_epoch, bucket builders, null helpers) for relationship code reuse. |
| src/cmd/synthesize/dictionary.rs | Extends dictionary parsing/loading/inference to include relationships. |
| src/cmd/describegpt/formatters.rs | Adds optional emission of top-level relationships in dictionary JSON output. |
| src/cmd/describegpt/dictionary.rs | Adds parse_llm_relationships to extract/validate relationship declarations from LLM output. |
| src/cmd/describegpt.rs | Plumbs inferred relationships through dictionary build/format paths (including two-pass mode). |
| resources/describegpt_defaults.toml | Updates dictionary prompt template to ask the LLM for inter-column relationships. |
| README.md | Updates synthesize description to mention relationship preservation. |
| docs/help/synthesize.md | Updates help docs with relationship semantics and new flags. |
| .claude/skills/qsv/qsv-synthesize.json | Adds new synthesize flags/examples to the skill metadata. |
…ptions
Copilot review feedback on PR #3888:
- ordered groups: collapse `OrderedKind::{Integer,Float}` into a single `Numeric` domain and track per-member integer formatting via `is_int: Vec<bool>`, so a mixed Integer/Float ordered group no longer drifts an Integer column to a float string (matches how `Correlated` already works).
- correlated groups: replace the mean group-level `null_ratio` with a per-member `null_ratios` vector drawn independently, so each column keeps its own marginal null ratio and nulls no longer all co-occur.
- synthesize USAGE: put the `--joint-cardinality-cap` / `--correlation-threshold` / `--strict-relationships` descriptions on the same line as the flag so `--update-mcp-skills` extracts them; regenerated the skill JSON and help docs.
- tests: relationship tests now use `read_stdout_on_success`, and the describegpt relationships test asserts a zero exit before parsing stdout, so a failing run can't be masked by parseable output. Added a mixed Integer/Float ordered test covering the per-member typing fix.
`ordered` still nulls the whole chain when the anchor is null. That is intentional: a monotonic chain `m[i] = m[i-1] + gap` cannot have a non-null member after a null predecessor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
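The ordered-chain behaviour discussed in this commit (a single numeric domain, per-member `is_int` formatting, and a null anchor nulling the whole chain) can be illustrated with the sketch below. `OrderedGroup` and `emit` are hypothetical names. Two details are this sketch's own assumptions rather than anything the commit states: nulls are rendered as empty strings, and integer members are rounded up so that rounding can never break the ordering.

```rust
/// Hypothetical numeric ordered group. Member 0 is the anchor;
/// `is_int[i]` says whether member i is formatted as an integer.
struct OrderedGroup {
    is_int: Vec<bool>,
}

impl OrderedGroup {
    /// Build one row of the chain: m[0] = anchor, m[i] = m[i-1] + gap.
    /// `gaps` must hold one gap per non-anchor member.
    fn emit(&self, anchor: Option<f64>, gaps: &[f64]) -> Vec<String> {
        let n = self.is_int.len();
        assert_eq!(gaps.len() + 1, n, "need one gap per non-anchor member");

        // A null anchor nulls the whole chain: no member can follow a null predecessor.
        let Some(mut value) = anchor else {
            return vec![String::new(); n];
        };

        let mut out = Vec::with_capacity(n);
        for i in 0..n {
            if i > 0 {
                // Clamp defensively so the chain can never decrease.
                value += gaps[i - 1].max(0.0);
            }
            if self.is_int[i] {
                // Assumption: round up and carry the rounded value forward,
                // so an integer member is never below its predecessor.
                value = value.ceil();
                out.push(format!("{}", value as i64));
            } else {
                out.push(format!("{value}"));
            }
        }
        out
    }
}
```

With `is_int = [true, false, true]`, an anchor of `10.0` and gaps `[2.5, 0.0]` yield `10`, `12.5`, `13`: the integer columns stay integer strings while the float column keeps its fraction.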
Problem
`synthesize` generated every column independently, so synthetic rows were individually plausible but jointly unrealistic: a `Created Date` could land after a `Closed Date`, impossible `city`/`state`/`zip` combinations appeared, and correlated measures drifted apart. The USAGE text even admitted "cross-column correlation is not modeled."
What this does
Adds a relationship framework. The data dictionary may now declare a `relationships` array whose named columns are generated jointly by a `GroupGenerator` instead of independent per-column generators. Three relationship classes, by `kind`:
- `joint`: frequency-weighted sampling of whole observed value-tuples, so only real co-occurring combinations are emitted
- `ordered`: monotonic chains (`created_date <= closed_date`, `subtotal <= total`), generated from an anchor column plus gaps learned from the source and kept `>= 0` so the order always holds
- `correlated`: Gaussian copula over each column's own quartile marginal, reproducing Spearman correlation without distorting marginals
Relationships come from the dictionary's `relationships` array, which is either hand-authored or inferred by `describegpt`: `describegpt --dictionary --infer-content-type` now asks the LLM to detect relationships and emits the array in the dictionary JSON. `synthesize` re-validates every relationship against the real data before use.
The copula math (Cholesky-with-ridge, `erf`, Box-Muller) is hand-rolled, with no new dependency. Output stays fully reproducible with `--seed`; `--no-relationships` reproduces the legacy independent-generation behaviour exactly.
New `synthesize` flags
- `--no-relationships`: disable relationship modeling
- `--joint-cardinality-cap <n>`: distinct-tuple cap for `joint` groups (default 100000)
- `--correlation-threshold <f>`: minimum |Spearman| for a pair to stay in a `correlated` group (default 0.3)
- `--strict-relationships`: abort instead of degrading when a relationship fails validation
Testing
`synthesize` integration tests + 3 new `describegpt` unit tests + 1 live-LLM integration test (gated behind `QSV_TEST_DESCRIBEGPT`). `synthesize --infer-content-type` on 1000 rows produced 0 `created>closed` violations, 0 `subtotal>total` violations, and only real `city`/`state` pairs.
🤖 Generated with Claude Code
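The `--correlation-threshold` check described above (minimum |Spearman| for a pair to stay in a `correlated` group) can be sketched as a rank correlation followed by a comparison. The function names here are hypothetical, and the sketch assumes ties receive average ranks and that a pair with a constant column is simply not kept; the PR may handle both cases differently.

```rust
/// Ranks starting at 1, with tied values sharing their average rank.
fn average_ranks(values: &[f64]) -> Vec<f64> {
    let mut order: Vec<usize> = (0..values.len()).collect();
    order.sort_by(|&a, &b| values[a].total_cmp(&values[b]));
    let mut ranks = vec![0.0; values.len()];
    let mut i = 0;
    while i < order.len() {
        // Extend j over the run of values equal to the one at position i.
        let mut j = i;
        while j + 1 < order.len() && values[order[j + 1]] == values[order[i]] {
            j += 1;
        }
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for &idx in &order[i..=j] {
            ranks[idx] = avg;
        }
        i = j + 1;
    }
    ranks
}

/// Pearson correlation; None when lengths differ, n < 2, or a side is constant.
fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let (mx, my) = (x.iter().sum::<f64>() / n, y.iter().sum::<f64>() / n);
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx) * (a - mx);
        syy += (b - my) * (b - my);
    }
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }
    Some(sxy / (sxx * syy).sqrt())
}

/// Spearman correlation = Pearson correlation of the ranks.
fn spearman(x: &[f64], y: &[f64]) -> Option<f64> {
    pearson(&average_ranks(x), &average_ranks(y))
}

/// A pair stays in a correlated group only if |Spearman| reaches the threshold.
fn keep_pair(x: &[f64], y: &[f64], threshold: f64) -> bool {
    spearman(x, y).map_or(false, |rho| rho.abs() >= threshold)
}
```

Because the comparison uses the absolute value, a strongly negative relationship passes the default threshold of 0.3 just as a strongly positive one does.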