
feat(synthesize): preserve inter-column relationships #3888

Merged
jqnatividad merged 3 commits into master from synthesize-relationships
May 22, 2026

Conversation

@jqnatividad
Collaborator

Problem

synthesize generated every column independently, so synthetic rows were individually plausible but jointly unrealistic — a Created Date could land after a Closed Date, impossible city/state/zip combinations appeared, and correlated measures drifted apart. The USAGE text even admitted "cross-column correlation is not modeled."

What this does

Adds a relationship framework. The data dictionary may now declare a relationships array whose named columns are generated jointly by a GroupGenerator instead of independent per-column generators. Three relationship classes:

| kind | Preserves | Mechanism |
|---|---|---|
| joint | functional dependencies (city/state/zip) | frequency-weighted sampling of whole observed value-tuples; only real co-occurring combinations are emitted |
| ordered | monotonic chains (created_date <= closed_date, subtotal <= total) | anchor column + non-negative gaps learned from the source; gaps clamped >= 0 so the order always holds |
| correlated | numeric correlation | Gaussian copula over each column's own quartile marginal; reproduces Spearman correlation without distorting marginals |
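
A minimal sketch of the `joint` mechanism, assuming tuples are plain string vectors. `JointSampler` and `SplitMix64` are hypothetical names introduced for illustration and are not taken from the qsv source. It counts each observed tuple, then draws whole tuples in proportion to those counts, so a combination that never occurred in the source cannot be emitted.

```rust
use std::collections::HashMap;

/// Tiny deterministic PRNG (SplitMix64) so the sketch needs no external crate.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical joint-group sampler: observed tuples plus running count totals.
struct JointSampler {
    tuples: Vec<Vec<String>>,
    cumulative: Vec<u64>,
}

impl JointSampler {
    /// Count every distinct tuple in the source rows (assumes `rows` is non-empty).
    fn learn(rows: &[Vec<String>]) -> Self {
        let mut counts: HashMap<&Vec<String>, u64> = HashMap::new();
        for row in rows {
            *counts.entry(row).or_insert(0) += 1;
        }
        // Sort so the layout (and therefore seeded output) is deterministic.
        let mut entries: Vec<(&Vec<String>, u64)> = counts.into_iter().collect();
        entries.sort();
        let (mut tuples, mut cumulative, mut total) = (Vec::new(), Vec::new(), 0u64);
        for (tuple, count) in entries {
            total += count;
            tuples.push(tuple.clone());
            cumulative.push(total);
        }
        JointSampler { tuples, cumulative }
    }

    /// Draw one whole tuple with probability proportional to its observed count.
    fn sample(&self, rng: &mut SplitMix64) -> &[String] {
        let total = *self.cumulative.last().expect("no tuples learned");
        // Modulo introduces a negligible bias; acceptable for a sketch.
        let pick = rng.next_u64() % total;
        let idx = self.cumulative.partition_point(|&c| c <= pick);
        &self.tuples[idx]
    }
}
```

Sampling whole tuples trades novelty for validity: the output can only recombine rows, never columns, which is presumably why a distinct-tuple cap is needed for high-cardinality groups.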

Relationships come from the dictionary's relationships array, which is either hand-authored or inferred by describegpt: describegpt --dictionary --infer-content-type now asks the LLM to detect relationships and emits the array in the dictionary JSON. synthesize re-validates every relationship against the real data before use.

The copula math (Cholesky-with-ridge, erf, Box-Muller) is hand-rolled — no new dependency. Output stays fully reproducible with --seed; --no-relationships reproduces the legacy independent-generation behaviour exactly.
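
The copula pipeline named above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's code: every function name is hypothetical, the "quartile marginal" is assumed to be a five-number summary (min, Q1, median, Q3, max) inverted by linear interpolation, the `erf` uses the Abramowitz and Stegun 7.1.26 approximation, and the input matrix is taken to be the latent Gaussian correlation.

```rust
use std::f64::consts::{PI, SQRT_2};

/// Tiny deterministic PRNG (SplitMix64) returning uniforms in [0, 1).
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_f64(&mut self) -> f64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        ((z ^ (z >> 31)) >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// erf via Abramowitz & Stegun 7.1.26 (absolute error about 1.5e-7).
fn erf(x: f64) -> f64 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    let poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
        - 0.284496736) * t + 0.254829592) * t;
    sign * (1.0 - poly * (-x * x).exp())
}

/// Plain Cholesky factorisation; None if the matrix is not positive definite.
fn cholesky(a: &[Vec<f64>]) -> Option<Vec<Vec<f64>>> {
    let n = a.len();
    let mut l = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..=i {
            let sum: f64 = (0..j).map(|k| l[i][k] * l[j][k]).sum();
            if i == j {
                let d = a[i][i] - sum;
                if d <= 0.0 {
                    return None;
                }
                l[i][j] = d.sqrt();
            } else {
                l[i][j] = (a[i][j] - sum) / l[j][j];
            }
        }
    }
    Some(l)
}

/// Retry with a growing diagonal ridge until the factorisation succeeds.
fn cholesky_with_ridge(a: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut ridge = 0.0;
    loop {
        let mut m = a.to_vec();
        for i in 0..m.len() {
            m[i][i] += ridge;
        }
        if let Some(l) = cholesky(&m) {
            return l;
        }
        ridge = if ridge == 0.0 { 1e-8 } else { ridge * 10.0 };
    }
}

/// Box-Muller transform: one standard normal draw from two uniforms.
fn standard_normal(rng: &mut SplitMix64) -> f64 {
    let u1 = 1.0 - rng.next_f64(); // shift to (0, 1] so ln() is finite
    let u2 = rng.next_f64();
    (-2.0 * u1.ln()).sqrt() * (2.0 * PI * u2).cos()
}

/// Piecewise-linear inverse CDF over [min, Q1, median, Q3, max].
fn inverse_cdf(q: &[f64; 5], u: f64) -> f64 {
    let pos = u.clamp(0.0, 1.0) * 4.0;
    let i = (pos.floor() as usize).min(3);
    q[i] + (pos - i as f64) * (q[i + 1] - q[i])
}

/// One correlated row: z = L * e, u = Phi(z), x = F^{-1}(u) per column.
fn sample_row(l: &[Vec<f64>], marginals: &[[f64; 5]], rng: &mut SplitMix64) -> Vec<f64> {
    let e: Vec<f64> = (0..l.len()).map(|_| standard_normal(rng)).collect();
    (0..l.len())
        .map(|i| {
            let z: f64 = (0..=i).map(|k| l[i][k] * e[k]).sum();
            let u = 0.5 * (1.0 + erf(z / SQRT_2));
            inverse_cdf(&marginals[i], u)
        })
        .collect()
}
```

Because each column is pushed through its own inverse CDF, the dependence structure comes only from the shared normal draws, which is what lets the marginals stay untouched.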

New synthesize flags

  • --no-relationships — disable relationship modeling
  • --joint-cardinality-cap <n> — distinct-tuple cap for joint groups (default 100000)
  • --correlation-threshold <f> — minimum |Spearman| for a pair to stay in a correlated group (default 0.3)
  • --strict-relationships — abort instead of degrading when a relationship fails validation
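
To make the `--correlation-threshold` check concrete, here is a hedged sketch of a Spearman test on one column pair. The helper names (`ranks`, `pearson`, `spearman`, `keeps_pair`) are hypothetical, and whether the real comparison is `>=` or `>` is an assumption.

```rust
/// Average ranks, 1-based; tied values share the mean of their positions.
fn ranks(values: &[f64]) -> Vec<f64> {
    let mut order: Vec<usize> = (0..values.len()).collect();
    order.sort_by(|&a, &b| values[a].total_cmp(&values[b]));
    let mut out = vec![0.0; values.len()];
    let mut i = 0;
    while i < order.len() {
        // Extend j over the run of values equal to the one at position i.
        let mut j = i;
        while j + 1 < order.len() && values[order[j + 1]] == values[order[i]] {
            j += 1;
        }
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for &idx in &order[i..=j] {
            out[idx] = avg;
        }
        i = j + 1;
    }
    out
}

/// Pearson correlation of two equal-length, non-constant slices.
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx) * (a - mx);
        syy += (b - my) * (b - my);
    }
    sxy / (sxx * syy).sqrt()
}

/// Spearman correlation is Pearson correlation computed on the ranks.
fn spearman(x: &[f64], y: &[f64]) -> f64 {
    pearson(&ranks(x), &ranks(y))
}

/// Assumed rule: a pair stays in the group when |Spearman| meets the threshold.
fn keeps_pair(x: &[f64], y: &[f64], threshold: f64) -> bool {
    spearman(x, y).abs() >= threshold
}
```

With the default of 0.3, a monotone pair such as `x` and `x^3` is kept (Spearman is 1 even though the relation is nonlinear), while a pair whose ranks are unrelated is dropped from the group.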

Testing

  • 13 new synthesize integration tests + 3 new describegpt unit tests + 1 live-LLM integration test (gated behind QSV_TEST_DESCRIBEGPT).
  • All 33 synthesize + 81 describegpt unit tests pass; clippy clean.
  • Verified end-to-end against a local LLM (LM Studio): synthesize --infer-content-type on 1000 rows produced 0 created>closed violations, 0 subtotal>total violations, and only real city/state pairs.

🤖 Generated with Claude Code

jqnatividad and others added 2 commits May 22, 2026 08:03
`synthesize` previously generated every column independently, so synthetic
rows were individually plausible but jointly unrealistic — a Created Date
could land after a Closed Date, impossible city/state/zip combinations
appeared, and correlated measures drifted apart.

Add a relationship framework: the data dictionary may now declare a
`relationships` array whose named columns are generated jointly by a
GroupGenerator instead of independent per-column generators. Three classes:

  * joint      — frequency-weighted sampling of whole observed value-tuples;
                 only real co-occurring combinations are emitted
  * ordered    — anchor column + non-negative gaps learned from the source,
                 so a monotonic chain (created_date <= closed_date) always holds
  * correlated — Gaussian copula over each column's own quartile marginal,
                 reproducing Spearman correlation without distorting marginals
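
A small sketch of the `ordered` idea, assuming integer (for example epoch-second) values. `OrderedChain` is a hypothetical type, and resampling observed anchors and gaps is a simplification chosen here; the commit does not say how gaps are modeled beyond being learned from the source and kept non-negative.

```rust
/// Tiny deterministic PRNG (SplitMix64) so the sketch needs no external crate.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform-ish index in 0..n (n must be > 0; modulo bias ignored).
    fn pick(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Hypothetical ordered-group model: anchor values plus per-step gaps.
struct OrderedChain {
    anchors: Vec<i64>,
    gaps: Vec<Vec<i64>>, // gaps[k] holds observed (member k+1 - member k)
}

impl OrderedChain {
    /// Learn from non-empty source rows; each row lists the chain members in order.
    fn learn(rows: &[Vec<i64>]) -> Self {
        let width = rows.first().map_or(0, |r| r.len());
        let anchors = rows.iter().map(|r| r[0]).collect();
        let gaps: Vec<Vec<i64>> = (1..width)
            // Clamp at zero so a dirty source row cannot teach a negative gap.
            .map(|k| rows.iter().map(|r| (r[k] - r[k - 1]).max(0)).collect())
            .collect();
        OrderedChain { anchors, gaps }
    }

    /// Emit one row: a sampled anchor, then cumulative non-negative gaps.
    fn sample(&self, rng: &mut SplitMix64) -> Vec<i64> {
        let mut row = vec![self.anchors[rng.pick(self.anchors.len())]];
        for step in &self.gaps {
            let gap = step[rng.pick(step.len())];
            let prev = *row.last().unwrap();
            row.push(prev + gap);
        }
        row
    }
}
```

Since every later member is the previous member plus a value that is never negative, the ordering holds by construction rather than by rejection or post-hoc repair.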

New flags: --no-relationships, --joint-cardinality-cap, --correlation-threshold,
--strict-relationships. Output stays fully reproducible with --seed; the copula
math (Cholesky-with-ridge, erf, Box-Muller) is hand-rolled — no new dependency.

Relationships are read from a hand-authored or describegpt-produced dictionary;
automatic LLM inference of the array is a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`describegpt --dictionary --infer-content-type` now asks the LLM to detect
inter-column relationships and emits them as a top-level `relationships`
array in the dictionary JSON — the same array `synthesize` consumes to
preserve inter-column structure.

The dictionary prompt explains the three relationship kinds (joint, ordered,
correlated) and asks for a `relationships` array alongside the per-field
output. `parse_llm_relationships` extracts and structurally validates each
entry against the real field names — dropping unknown kinds, unknown
columns, and groups with fewer than two members. `synthesize` re-validates
every relationship against the data before use, so this stage only
guarantees structural soundness.
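
As a rough illustration of this structural pass (not the actual `parse_llm_relationships`), the sketch below filters candidate entries against known field names. The `Relationship` struct, its `columns` field, and the choice to drop an unknown column from a group before checking the two-member minimum are all assumptions.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory form of one `relationships` entry.
#[derive(Debug, Clone, PartialEq)]
struct Relationship {
    kind: String,
    columns: Vec<String>,
}

/// Keep only entries with a known kind and at least two known columns.
fn validate_structure(candidates: Vec<Relationship>, field_names: &[&str]) -> Vec<Relationship> {
    const KINDS: [&str; 3] = ["joint", "ordered", "correlated"];
    let known: HashSet<&str> = field_names.iter().copied().collect();
    candidates
        .into_iter()
        .filter_map(|mut rel| {
            if !KINDS.contains(&rel.kind.as_str()) {
                return None; // unknown kind: drop the whole entry
            }
            // Assumed behaviour: unknown columns are removed from the group.
            rel.columns.retain(|c| known.contains(c.as_str()));
            // A group needs at least two members to express a relationship.
            (rel.columns.len() >= 2).then_some(rel)
        })
        .collect()
}
```

Nothing here looks at values, which matches the division of labour described above: this stage checks shape only, and `synthesize` decides later whether the data actually supports each relationship.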

Relationships flow through both the single-pass and `--two-pass` dictionary
paths; two-pass takes them from the relationship-aware first pass.
`format_dictionary_json` gains a `relationships` parameter and emits the
array only when non-empty, so dictionaries without relationships stay
byte-identical to the previous output.

Tested against a live local LLM (gated behind QSV_TEST_DESCRIBEGPT).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production

codacy-production Bot commented May 22, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 187 complexity

Metric Results
Complexity 187

View in Codacy


Contributor

Copilot AI left a comment


Pull request overview

Adds an inter-column “relationship” framework to qsv synthesize, allowing specified column groups to be generated jointly (functional-dependency tuples, ordered chains, and correlated numeric groups) instead of independently. This improves realism of synthetic rows and extends describegpt --dictionary --infer-content-type to optionally infer and emit a relationships array in the generated dictionary.

Changes:

  • Implement relationship resolution + a deterministic row-emission schedule, with new GroupGenerator implementations for joint, ordered, and correlated.
  • Extend synthesize dictionary handling to load/infer relationships, add new CLI flags controlling relationship behavior, and update docs/README.
  • Add integration/unit tests covering relationship behavior and dictionary relationship emission.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.

| File | Description |
|---|---|
| tests/test_synthesize.rs | Adds integration tests for joint/ordered/correlated relationship behavior and flags. |
| tests/test_describegpt.rs | Adds an integration test validating relationships appear structurally in dictionary JSON output. |
| src/cmd/synthesize/relationships.rs | New module: resolves dictionary relationships, learns parameters from the source CSV, builds emit schedule. |
| src/cmd/synthesize/mod.rs | Wires relationship scheduling into synthesize; adds new CLI flags and updates help text. |
| src/cmd/synthesize/group_generator.rs | New module: emits multi-column tuples for joint/ordered/correlated groups. |
| src/cmd/synthesize/generator.rs | Exposes helper functions (parse_f64, parse_epoch, bucket builders, null helpers) for relationship code reuse. |
| src/cmd/synthesize/dictionary.rs | Extends dictionary parsing/loading/inference to include relationships. |
| src/cmd/describegpt/formatters.rs | Adds optional emission of top-level relationships in dictionary JSON output. |
| src/cmd/describegpt/dictionary.rs | Adds parse_llm_relationships to extract/validate relationship declarations from LLM output. |
| src/cmd/describegpt.rs | Plumbs inferred relationships through dictionary build/format paths (including two-pass mode). |
| resources/describegpt_defaults.toml | Updates dictionary prompt template to ask the LLM for inter-column relationships. |
| README.md | Updates synthesize description to mention relationship preservation. |
| docs/help/synthesize.md | Updates help docs with relationship semantics and new flags. |
| .claude/skills/qsv/qsv-synthesize.json | Adds new synthesize flags/examples to the skill metadata. |

Comment thread src/cmd/synthesize/relationships.rs Outdated
Comment thread src/cmd/synthesize/relationships.rs
Comment thread src/cmd/synthesize/relationships.rs Outdated
Comment thread tests/test_synthesize.rs Outdated
Comment thread .claude/skills/qsv/qsv-synthesize.json
Comment thread .claude/skills/qsv/qsv-synthesize.json Outdated
Comment thread .claude/skills/qsv/qsv-synthesize.json Outdated
Comment thread tests/test_describegpt.rs Outdated
…ptions

Copilot review feedback on PR #3888:

- ordered groups: collapse `OrderedKind::{Integer,Float}` into a single
  `Numeric` domain and track per-member integer formatting via `is_int:
  Vec<bool>`, so a mixed Integer/Float ordered group no longer drifts an
  Integer column to a float string (matches how `Correlated` already works).
- correlated groups: replace the mean group-level `null_ratio` with a
  per-member `null_ratios` vector drawn independently, so each column keeps
  its own marginal null ratio and nulls no longer all co-occur.
- synthesize USAGE: put the `--joint-cardinality-cap` /
  `--correlation-threshold` / `--strict-relationships` descriptions on the
  same line as the flag so `--update-mcp-skills` extracts them; regenerated
  the skill JSON and help docs.
- tests: relationship tests now use `read_stdout_on_success`, and the
  describegpt relationships test asserts a zero exit before parsing stdout,
  so a failing run can't be masked by parseable output. Added a mixed
  Integer/Float ordered test covering the per-member typing fix.
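
The first two fixes can be pictured with a short sketch. The function names are hypothetical, and rounding (rather than truncating) integer members is an assumption; only the `is_int: Vec<bool>` and per-member `null_ratios` shapes come from the commit message.

```rust
/// Tiny deterministic PRNG (SplitMix64) returning uniforms in [0, 1).
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_f64(&mut self) -> f64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        ((z ^ (z >> 31)) >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Format each member of a numeric group according to its own column type,
/// so an Integer column in a mixed group never receives a fractional string.
fn format_members(values: &[f64], is_int: &[bool]) -> Vec<String> {
    values
        .iter()
        .zip(is_int)
        .map(|(v, &int)| {
            if int {
                (v.round() as i64).to_string() // assumed: round to nearest
            } else {
                v.to_string()
            }
        })
        .collect()
}

/// Draw one independent null decision per member from its own ratio,
/// instead of one shared draw from the group's mean ratio.
fn null_mask(null_ratios: &[f64], rng: &mut SplitMix64) -> Vec<bool> {
    null_ratios.iter().map(|&p| rng.next_f64() < p).collect()
}
```

Independent draws mean a column that is 40% null and one that is 2% null each keep those rates, where a single shared draw would have made both null on the same rows.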

`ordered` still nulls the whole chain when the anchor is null — that is
intentional: a monotonic chain `m[i] = m[i-1] + gap` cannot have a non-null
member after a null predecessor.
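
The recurrence `m[i] = m[i-1] + gap` and its null behaviour can be sketched with `Option`; `build_chain` is a hypothetical helper written for this illustration, not a function from the PR.

```rust
/// Build an ordered chain from an optional anchor and a list of gaps.
/// A null (None) anchor makes every later member null as well, because
/// each member is defined only in terms of its predecessor.
fn build_chain(anchor: Option<i64>, gaps: &[i64]) -> Vec<Option<i64>> {
    let mut chain = Vec::with_capacity(gaps.len() + 1);
    let mut prev = anchor;
    chain.push(prev);
    for &gap in gaps {
        // Clamp the gap at zero so the chain never decreases.
        prev = prev.map(|p| p + gap.max(0));
        chain.push(prev);
    }
    chain
}
```

For example, `build_chain(Some(10), &[5, 0, 2])` yields 10, 15, 15, 17, while `build_chain(None, &[5, 1])` yields three nulls.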

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit c5dfb20 into master May 22, 2026
29 checks passed
@jqnatividad jqnatividad deleted the synthesize-relationships branch May 22, 2026 17:38

2 participants