feat(synthesize): preserve inter-column relationships by jqnatividad · Pull Request #3888 · dathere/qsv

jqnatividad · 2026-05-22T16:03:37Z

Problem

synthesize generated every column independently, so synthetic rows were individually plausible but jointly unrealistic — a Created Date could land after a Closed Date, impossible city/state/zip combinations appeared, and correlated measures drifted apart. The USAGE text even admitted "cross-column correlation is not modeled."

What this does

Adds a relationship framework. The data dictionary may now declare a relationships array whose named columns are generated jointly by a GroupGenerator instead of independent per-column generators. Three relationship classes:

| `kind` | Preserves | Mechanism |
|---|---|---|
| `joint` | functional dependencies (city/state/zip) | frequency-weighted sampling of whole observed value-tuples; only real co-occurring combinations are emitted |
| `ordered` | monotonic chains (`created_date <= closed_date`, `subtotal <= total`) | anchor column + non-negative gaps learned from the source; gaps clamped `>= 0` so the order always holds |
| `correlated` | numeric correlation | Gaussian copula over each column's own quartile marginal; reproduces Spearman correlation without distorting marginals |
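
The `joint` mechanism in the table can be illustrated with a minimal sketch. This is not the PR's `GroupGenerator` code: `JointSampler`, `SplitMix64`, and their methods are hypothetical names, and the small PRNG is only there so the example needs no external crate. It counts whole observed tuples and draws one in proportion to its count, so only combinations present in the source can be emitted.

```rust
use std::collections::HashMap;

/// Tiny deterministic PRNG (SplitMix64), used only to keep the sketch dependency-free.
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical joint sampler: each observed tuple plus a running (cumulative) count.
pub struct JointSampler {
    tuples: Vec<Vec<String>>,
    cumulative: Vec<u64>,
    total: u64,
}

impl JointSampler {
    /// Count whole value-tuples seen in the source rows; `None` if there are no rows.
    pub fn learn(rows: &[Vec<String>]) -> Option<Self> {
        let mut counts: HashMap<&Vec<String>, u64> = HashMap::new();
        for row in rows {
            *counts.entry(row).or_insert(0) += 1;
        }
        if counts.is_empty() {
            return None;
        }
        // Sort so the result does not depend on HashMap iteration order.
        let mut entries: Vec<(&Vec<String>, u64)> = counts.into_iter().collect();
        entries.sort();
        let (mut tuples, mut cumulative, mut total) = (Vec::new(), Vec::new(), 0u64);
        for (tuple, count) in entries {
            total += count;
            tuples.push(tuple.clone());
            cumulative.push(total);
        }
        Some(Self { tuples, cumulative, total })
    }

    /// Draw one observed tuple with probability proportional to its frequency.
    pub fn sample(&self, rng: &mut SplitMix64) -> &[String] {
        // The modulo introduces a negligible bias, acceptable for a sketch.
        let pick = rng.next_u64() % self.total;
        let idx = self.cumulative.partition_point(|&c| c <= pick);
        &self.tuples[idx]
    }
}
```

Because the sampler only ever returns stored tuples, a city/state pair that never co-occurred in the source cannot appear in the output. A cap on distinct tuples (as `--joint-cardinality-cap` provides) would bound the memory held in `tuples`.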

Relationships come from the dictionary's relationships array, which is either hand-authored or inferred by describegpt: describegpt --dictionary --infer-content-type now asks the LLM to detect relationships and emits the array in the dictionary JSON. synthesize re-validates every relationship against the real data before use.

The copula math (Cholesky-with-ridge, erf, Box-Muller) is hand-rolled — no new dependency. Output stays fully reproducible with --seed; --no-relationships reproduces the legacy independent-generation behaviour exactly.
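
As a rough illustration of the three hand-rolled pieces named above, here is a minimal sketch of a Gaussian-copula draw. It is written under assumptions and is not the PR's code: every function name is hypothetical, the erf approximation shown (Abramowitz-Stegun 7.1.26) is one common choice that the PR may not use, the ridge schedule is invented, and the "quartile marginal" is assumed to be a piecewise-linear inverse CDF through min, Q1, median, Q3, max.

```rust
use std::f64::consts::{PI, SQRT_2};

/// erf via Abramowitz-Stegun 7.1.26 (absolute error around 1.5e-7).
pub fn erf(x: f64) -> f64 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.327_591_1 * x);
    let poly = t * (0.254_829_592
        + t * (-0.284_496_736
            + t * (1.421_413_741 + t * (-1.453_152_027 + t * 1.061_405_429))));
    sign * (1.0 - poly * (-x * x).exp())
}

/// Standard normal CDF expressed through erf.
pub fn normal_cdf(z: f64) -> f64 {
    0.5 * (1.0 + erf(z / SQRT_2))
}

/// Box-Muller: two uniforms in (0, 1] become two independent standard normals.
pub fn box_muller(u1: f64, u2: f64) -> (f64, f64) {
    let r = (-2.0 * u1.ln()).sqrt();
    (r * (2.0 * PI * u2).cos(), r * (2.0 * PI * u2).sin())
}

/// Lower-triangular Cholesky factor of a symmetric matrix. On a non-positive
/// pivot, retry with a larger ridge added to the diagonal (assumed schedule).
pub fn cholesky_with_ridge(a: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = a.len();
    let mut ridge = 0.0;
    loop {
        let mut l = vec![vec![0.0; n]; n];
        let mut ok = true;
        'outer: for i in 0..n {
            for j in 0..=i {
                let dot: f64 = (0..j).map(|k| l[i][k] * l[j][k]).sum();
                if i == j {
                    let d = a[i][i] + ridge - dot;
                    if d <= 0.0 {
                        ok = false;
                        break 'outer;
                    }
                    l[i][j] = d.sqrt();
                } else {
                    l[i][j] = (a[i][j] - dot) / l[j][j];
                }
            }
        }
        if ok {
            return l;
        }
        ridge = if ridge == 0.0 { 1e-10 } else { ridge * 10.0 };
    }
}

/// Piecewise-linear inverse CDF through [min, q1, median, q3, max].
pub fn quartile_quantile(q: &[f64; 5], u: f64) -> f64 {
    let pos = u.clamp(0.0, 1.0) * 4.0;
    let i = (pos.floor() as usize).min(3);
    q[i] + (pos - i as f64) * (q[i + 1] - q[i])
}

/// One correlated row: independent normals, mixed by L, mapped to uniforms
/// by the normal CDF, then pushed through each column's own marginal.
pub fn correlated_row(l: &[Vec<f64>], marginals: &[[f64; 5]], indep: &[f64]) -> Vec<f64> {
    (0..l.len())
        .map(|i| {
            let z: f64 = (0..=i).map(|k| l[i][k] * indep[k]).sum();
            quartile_quantile(&marginals[i], normal_cdf(z))
        })
        .collect()
}
```

Since each column passes through its own inverse CDF, the marginals are untouched by the mixing step, and because that mapping is monotone the rank (Spearman) structure of the latent normals carries over to the output.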

New `synthesize` flags

--no-relationships — disable relationship modeling
--joint-cardinality-cap <n> — distinct-tuple cap for joint groups (default 100000)
--correlation-threshold <f> — minimum |Spearman| for a pair to stay in a correlated group (default 0.3)
--strict-relationships — abort instead of degrading when a relationship fails validation
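
To make the `--correlation-threshold` check concrete, the sketch below computes a Spearman rank correlation (Pearson correlation of average ranks) and compares its absolute value to a threshold. The function names are hypothetical, and whether the real comparison is `>=` or `>` is an assumption; only the 0.3 default comes from the flag description.

```rust
/// Average ranks (1-based); tied values share the mean of their positions.
pub fn average_ranks(values: &[f64]) -> Vec<f64> {
    let mut order: Vec<usize> = (0..values.len()).collect();
    order.sort_by(|&a, &b| values[a].total_cmp(&values[b]));
    let mut ranks = vec![0.0; values.len()];
    let mut i = 0;
    while i < order.len() {
        // Extend j over the run of values tied with position i.
        let mut j = i;
        while j + 1 < order.len() && values[order[j + 1]] == values[order[i]] {
            j += 1;
        }
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for &idx in &order[i..=j] {
            ranks[idx] = avg;
        }
        i = j + 1;
    }
    ranks
}

/// Pearson correlation; `None` for mismatched lengths, n < 2, or a constant input.
pub fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    let n = x.len();
    if n < 2 || n != y.len() {
        return None;
    }
    let mx = x.iter().sum::<f64>() / n as f64;
    let my = y.iter().sum::<f64>() / n as f64;
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx) * (a - mx);
        syy += (b - my) * (b - my);
    }
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }
    Some(sxy / (sxx * syy).sqrt())
}

/// Spearman correlation: Pearson applied to the average ranks.
pub fn spearman(x: &[f64], y: &[f64]) -> Option<f64> {
    pearson(&average_ranks(x), &average_ranks(y))
}

/// Assumed rule: a pair stays in a correlated group when |Spearman| >= threshold.
pub fn keeps_pair(x: &[f64], y: &[f64], threshold: f64) -> bool {
    spearman(x, y).map_or(false, |r| r.abs() >= threshold)
}
```

With the default of 0.3, a monotone pair such as `x` and `x * x` (for positive `x`) is kept, while a pair whose ranks are uncorrelated is dropped from the group.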

Testing

13 new synthesize integration tests + 3 new describegpt unit tests + 1 live-LLM integration test (gated behind QSV_TEST_DESCRIBEGPT).
All 33 synthesize + 81 describegpt unit tests pass; clippy clean.
Verified end-to-end against a local LLM (LM Studio): synthesize --infer-content-type on 1000 rows produced 0 created>closed violations, 0 subtotal>total violations, and only real city/state pairs.

🤖 Generated with Claude Code

`synthesize` previously generated every column independently, so synthetic rows were individually plausible but jointly unrealistic: a Created Date could land after a Closed Date, impossible city/state/zip combinations appeared, correlated measures drifted apart.

Add a relationship framework: the data dictionary may now declare a `relationships` array whose named columns are generated jointly by a GroupGenerator instead of independent per-column generators. Three classes:

* joint: frequency-weighted sampling of whole observed value-tuples; only real co-occurring combinations are emitted
* ordered: anchor column + non-negative gaps learned from the source, so a monotonic chain (created_date <= closed_date) always holds
* correlated: Gaussian copula over each column's own quartile marginal, reproducing Spearman correlation without distorting marginals

New flags: --no-relationships, --joint-cardinality-cap, --correlation-threshold, --strict-relationships. Output stays fully reproducible with --seed; the copula math (Cholesky-with-ridge, erf, Box-Muller) is hand-rolled, with no new dependency.

Relationships are read from a hand-authored or describegpt-produced dictionary; automatic LLM inference of the array is a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`describegpt --dictionary --infer-content-type` now asks the LLM to detect inter-column relationships and emits them as a top-level `relationships` array in the dictionary JSON, the same array `synthesize` consumes to preserve inter-column structure.

The dictionary prompt explains the three relationship kinds (joint, ordered, correlated) and asks for a `relationships` array alongside the per-field output. `parse_llm_relationships` extracts and structurally validates each entry against the real field names, dropping unknown kinds, unknown columns, and groups with fewer than two members. `synthesize` re-validates every relationship against the data before use, so this stage only guarantees structural soundness.

Relationships flow through both the single-pass and `--two-pass` dictionary paths; two-pass takes them from the relationship-aware first pass. `format_dictionary_json` gains a `relationships` parameter and emits the array only when non-empty, so dictionaries without relationships stay byte-identical to the previous output.

Tested against a live local LLM (gated behind QSV_TEST_DESCRIBEGPT).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
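
The structural validation described for `parse_llm_relationships` might look roughly like the sketch below. It is an illustration only: the `Relationship` struct, its `kind`/`columns` field names, and `validate_relationships` are hypothetical, and it assumes unknown columns are removed from a group before the two-member check rather than rejecting the whole group outright.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory form of one entry of the `relationships` array.
#[derive(Debug, Clone, PartialEq)]
pub struct Relationship {
    pub kind: String,
    pub columns: Vec<String>,
}

/// The three relationship kinds named in the PR.
const KNOWN_KINDS: [&str; 3] = ["joint", "ordered", "correlated"];

/// Keep only structurally sound entries:
/// - drop entries whose kind is not one of the three known kinds,
/// - drop column names that are not real field names (assumed behaviour),
/// - drop groups left with fewer than two members.
pub fn validate_relationships(
    candidates: Vec<Relationship>,
    field_names: &[&str],
) -> Vec<Relationship> {
    let known: HashSet<&str> = field_names.iter().copied().collect();
    candidates
        .into_iter()
        .filter_map(|mut rel| {
            if !KNOWN_KINDS.contains(&rel.kind.as_str()) {
                return None;
            }
            // Column order is preserved, which matters for `ordered` chains.
            rel.columns.retain(|c| known.contains(c.as_str()));
            (rel.columns.len() >= 2).then_some(rel)
        })
        .collect()
}
```

This stage checks names and shapes only. Whether the declared relationship actually holds in the data is a separate question, which the commit message leaves to `synthesize`.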

codacy-production · 2026-05-22T16:05:22Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

🟢 Metrics 187 complexity

Metric Results

Complexity 187

Copilot

Pull request overview

Adds an inter-column “relationship” framework to qsv synthesize, allowing specified column groups to be generated jointly (functional-dependency tuples, ordered chains, and correlated numeric groups) instead of independently. This improves realism of synthetic rows and extends describegpt --dictionary --infer-content-type to optionally infer and emit a relationships array in the generated dictionary.

Changes:

Implement relationship resolution + a deterministic row-emission schedule, with new GroupGenerator implementations for joint, ordered, and correlated.
Extend synthesize dictionary handling to load/infer relationships, add new CLI flags controlling relationship behavior, and update docs/README.
Add integration/unit tests covering relationship behavior and dictionary relationship emission.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.

Show a summary per file

| File | Description |
|---|---|
| tests/test_synthesize.rs | Adds integration tests for `joint`/`ordered`/`correlated` relationship behavior and flags. |
| tests/test_describegpt.rs | Adds an integration test validating `relationships` appear structurally in dictionary JSON output. |
| src/cmd/synthesize/relationships.rs | New module: resolves dictionary relationships, learns parameters from the source CSV, builds emit schedule. |
| src/cmd/synthesize/mod.rs | Wires relationship scheduling into synthesize; adds new CLI flags and updates help text. |
| src/cmd/synthesize/group_generator.rs | New module: emits multi-column tuples for `joint`/`ordered`/`correlated` groups. |
| src/cmd/synthesize/generator.rs | Exposes helper functions (`parse_f64`, `parse_epoch`, bucket builders, null helpers) for relationship code reuse. |
| src/cmd/synthesize/dictionary.rs | Extends dictionary parsing/loading/inference to include `relationships`. |
| src/cmd/describegpt/formatters.rs | Adds optional emission of top-level `relationships` in dictionary JSON output. |
| src/cmd/describegpt/dictionary.rs | Adds `parse_llm_relationships` to extract/validate relationship declarations from LLM output. |
| src/cmd/describegpt.rs | Plumbs inferred relationships through dictionary build/format paths (including two-pass mode). |
| resources/describegpt_defaults.toml | Updates dictionary prompt template to ask the LLM for inter-column relationships. |
| README.md | Updates synthesize description to mention relationship preservation. |
| docs/help/synthesize.md | Updates help docs with relationship semantics and new flags. |
| .claude/skills/qsv/qsv-synthesize.json | Adds new synthesize flags/examples to the skill metadata. |

…ptions

Copilot review feedback on PR #3888:

- ordered groups: collapse `OrderedKind::{Integer,Float}` into a single `Numeric` domain and track per-member integer formatting via `is_int: Vec<bool>`, so a mixed Integer/Float ordered group no longer drifts an Integer column to a float string (matches how `Correlated` already works).
- correlated groups: replace the mean group-level `null_ratio` with a per-member `null_ratios` vector drawn independently, so each column keeps its own marginal null ratio and nulls no longer all co-occur.
- synthesize USAGE: put the `--joint-cardinality-cap` / `--correlation-threshold` / `--strict-relationships` descriptions on the same line as the flag so `--update-mcp-skills` extracts them; regenerated the skill JSON and help docs.
- tests: relationship tests now use `read_stdout_on_success`, and the describegpt relationships test asserts a zero exit before parsing stdout, so a failing run can't be masked by parseable output. Added a mixed Integer/Float ordered test covering the per-member typing fix.

`ordered` still nulls the whole chain when the anchor is null. That is intentional: a monotonic chain `m[i] = m[i-1] + gap` cannot have a non-null member after a null predecessor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
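
The chain rule `m[i] = m[i-1] + gap` and the per-member `is_int` formatting from this commit message can be sketched as follows. This is a simplified illustration, not the PR's code: `emit_ordered` is a hypothetical name, nulls are rendered as empty strings by assumption, and rounding integer members up with `ceil` (carrying the rounded value forward) is this sketch's own way of keeping the order intact after formatting.

```rust
/// Emit one row of an `ordered` group as strings.
/// `gaps[i - 1]` is the learned gap between member `i - 1` and member `i`;
/// `is_int[i]` says whether member `i` is formatted as an integer.
pub fn emit_ordered(anchor: Option<f64>, gaps: &[f64], is_int: &[bool]) -> Vec<String> {
    let n = gaps.len() + 1; // the anchor plus one member per gap
    assert_eq!(is_int.len(), n, "one is_int flag per member");

    // A null anchor nulls the whole chain: no member can follow a null predecessor.
    let Some(mut value) = anchor else {
        return vec![String::new(); n];
    };

    let mut out = Vec::with_capacity(n);
    for (i, &int) in is_int.iter().enumerate() {
        if i > 0 {
            // Clamp the gap at zero so the chain never decreases.
            value += gaps[i - 1].max(0.0);
        }
        if int {
            // Round up and carry the rounded value forward, so integer
            // formatting cannot push a member below its predecessor.
            value = value.ceil();
            out.push(format!("{}", value as i64));
        } else {
            out.push(value.to_string());
        }
    }
    out
}
```

For example, an anchor of `10` with gaps `[2.5, -3.0]` and flags `[true, false, true]` yields `10`, `12.5`, `13`: the negative gap is clamped to zero and the integer member is rounded up rather than printed as a float string.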

jqnatividad and others added 2 commits May 22, 2026 08:03

jqnatividad requested a review from Copilot May 22, 2026 16:06

Copilot started reviewing on behalf of jqnatividad May 22, 2026 16:06

Copilot AI reviewed May 22, 2026

jqnatividad merged commit c5dfb20 into master May 22, 2026
29 checks passed

jqnatividad deleted the synthesize-relationships branch May 22, 2026 17:38

jqnatividad mentioned this pull request May 22, 2026

fix(describegpt): honor QSV_LLM_BASE_URL env var; repair stale describegpt tests #3889

Merged

jqnatividad merged 3 commits into master from synthesize-relationships
