feat(describegpt): deterministic `unique_id` Content Type for ALL_UNIQUE fields by jqnatividad · Pull Request #3862 · dathere/qsv

jqnatividad · 2026-05-16T17:26:55Z

Summary

Adds a unique_id token to the Content Type vocabulary used by describegpt --infer-content-type, classifying fields where every row has a distinct non-null value (primary keys, surrogate keys, sequence numbers).
Detection is deterministic: generate_code_based_dictionary keys off qsv's <ALL_UNIQUE> frequency sentinel (which frequency emits exactly when cardinality == rowcount), so no separate rowcount lookup is needed. combine_dictionary_entries preserves the pre-set value so code-derived facts always win over the LLM's guess.
Aligns synthesize's faker map: unique_id is added to NON_FAKER_TOKENS so the vocab-coverage invariant passes and synthesize falls back to type+min/max generation (a dedicated per-row-unique generator is left as a follow-up).

Test plan

cargo test -F all_features --bin qsv — 194 unit tests pass, including 3 new dictionary tests:
- generate_marks_all_unique_field_as_unique_id
- generate_skips_unique_id_when_infer_content_type_off
- combine_preserves_preset_unique_id_over_llm_value
cargo test -F all_features --test tests test_describegpt — 64 integration tests pass.
cargo test -F all_features --test tests test_synthesize — 12 integration tests pass.
cargo clippy --bin qsv -F all_features — clean on touched files.
qsv --generate-help-md re-run; only docs/help/describegpt.md updated, reflecting the new help text.

🤖 Generated with Claude Code

…QUE fields Adds a "unique_id" token to the Content Type vocabulary used by `describegpt --infer-content-type`, and deterministically classifies any field whose `cardinality == rowcount` (i.e. every row carries a distinct non-null value — primary keys, surrogate keys, sequence numbers) with that token, overriding whatever the LLM returned for the field. Detection keys off qsv's `<ALL_UNIQUE>` frequency sentinel, so no separate rowcount lookup is needed. `combine_dictionary_entries` now preserves any pre-set `content_type` so code-derived facts always win over LLM guesses. The prompt template and `--infer-content-type` help text were updated to document the override. In synthesize's faker map, `unique_id` is added to `NON_FAKER_TOKENS` (alongside `category` / `unknown`) so the vocab-coverage invariant passes; synthesize falls back to type+min/max generation for these fields. A dedicated per-row-unique generator is left for a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codacy-production · 2026-05-16T17:27:34Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 17 complexity

Metric Results

Complexity 17

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

Copilot

Pull request overview

Adds deterministic unique_id handling to describegpt --infer-content-type and aligns synthesize’s content-type fallback behavior.

Changes:

Adds unique_id to the describegpt Content Type vocabulary.
Pre-sets unique_id for fields with <ALL_UNIQUE> frequency rows and preserves that value over LLM output.
Updates synthesize non-faker handling and generated help text.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`src/cmd/describegpt/dictionary.rs`	Adds `unique_id` vocabulary, deterministic dictionary generation, merge behavior, and unit tests.
`src/cmd/describegpt.rs`	Passes `--infer-content-type` state into dictionary generation and updates CLI help.
`src/cmd/synthesize/faker_map.rs`	Treats `unique_id` as a non-faker token.
`resources/describegpt_defaults.toml`	Updates the LLM prompt guidance for `unique_id`.
`docs/help/describegpt.md`	Updates generated help documentation for the new behavior.

Comments suppressed due to low confidence (1)

src/cmd/describegpt/dictionary.rs:485

This check treats any frequency row whose value is literally <ALL_UNIQUE> as the sentinel. A real column whose repeated value is <ALL_UNIQUE> would be mislabeled unique_id, while a user-supplied --freq-options --all-unique-text ... sentinel would be missed; include the sentinel semantics (for example rank/percentage plus cardinality == count) instead of only exact text.

        let content_type =
            if infer_content_type && field_frequencies.iter().any(|f| f.value == "<ALL_UNIQUE>") {

Three Copilot review findings on #3862: 1. dictionary.rs: adding `unique_id` to the vocab let the LLM smuggle it into `parse_llm_dictionary_response`, which `combine_dictionary_entries` would then copy onto fields whose code-generated `content_type` was empty — breaking the "deterministic-only" contract for non-ALL_UNIQUE fields. Now rejected at both layers: - parser drops literal/cased `unique_id` from LLM output (returns "") - combine refuses to copy `unique_id` from `LlmDictField` regardless Prompt template updated to tell the LLM not to use `unique_id`. Added two tests covering parser rejection (incl. casing) and combine guard. 2. faker_map.rs module docs: updated to list `unique_id` alongside `category`/`unknown` in the non-faker set and to explain why (no per-row-unique fake-rs faker). 3. faker_map.rs NON_FAKER_TOKENS comment: corrected the inaccurate "sequential integers" claim — `build_numeric` samples from min/max and can emit duplicates, so `unique_id` round-trips through the dictionary but synthesize's output is not guaranteed unique today. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous detection — `field_frequencies.iter().any(|f| f.value == "<ALL_UNIQUE>")` — had two real issues flagged in review: 1. False positives: a field whose values literally contain the string "<ALL_UNIQUE>" (a constant of that value, or one bucket among many) would be misclassified as unique_id. 2. False negatives: `frequency --all-unique-text <CUSTOM>` would emit a different sentinel string and never match. Replaced with a structural check on the frequency table: stats.cardinality > 1 AND stats.nullcount == 0 AND field_frequencies.len() == 1 AND field_frequencies[0].count == stats.cardinality This invariant is exactly what qsv frequency emits for ALL_UNIQUE (one row, count == row_count == cardinality), so it: - doesn't reference the literal sentinel text - excludes HIGH_CARDINALITY (single row but count > cardinality) - excludes a constant whose value happens to be "<ALL_UNIQUE>" (cardinality == 1) - excludes a mixed field with "<ALL_UNIQUE>" among its values (more than one frequency row) - enforces the no-nulls part of the semantic contract explicitly Added 4 tests covering each rejection path and the custom-sentinel-text acceptance path. Updated doc comments on CONTENT_TYPE_VOCAB and generate_code_based_dictionary to drop the now-inaccurate "matches the <ALL_UNIQUE> sentinel" wording. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jqnatividad · 2026-05-16T18:28:53Z

Re: Copilot's suppressed (low-confidence) comment on dictionary.rs:485

The comment flagged a real issue worth fixing even though Copilot suppressed it: the original detection — field_frequencies.iter().any(|f| f.value == "<ALL_UNIQUE>") — was text-coupled to frequency's default sentinel and could produce both false positives and false negatives.

Replaced with a structural check in d727221d3 (no literal text matching):

let is_all_unique = stats_record.cardinality > 1
    && stats_record.nullcount == 0
    && field_frequencies.len() == 1
    && field_frequencies[0].count == stats_record.cardinality;

This invariant is exactly what qsv frequency guarantees for ALL_UNIQUE rows (one row, count == row_count == cardinality), so it now:

Scenario	Old check	New check
Genuine ALL_UNIQUE field	✓	✓
HIGH_CARDINALITY (same single-row, 100% shape)	matched (bug)	excluded — `cardinality < count`
Constant field with value literally `<ALL_UNIQUE>`	false positive	excluded — `cardinality == 1`
Mixed field with `<ALL_UNIQUE>` as one bucket	false positive	excluded — `len() > 1`
Custom `frequency --all-unique-text` sentinel	false negative	detected — text-independent
Field with nulls but unique non-nulls	relied on freq to skip	explicitly rejected

4 tests added covering each rejection path plus the custom-sentinel-text acceptance path:

generate_does_not_mislabel_constant_field_with_all_unique_value
generate_detects_unique_id_with_custom_all_unique_text
generate_does_not_misclassify_high_cardinality_as_unique_id
generate_does_not_mislabel_unique_id_when_nulls_present

jqnatividad requested a review from Copilot May 16, 2026 17:27

Copilot started reviewing on behalf of jqnatividad May 16, 2026 17:28 View session

Copilot AI reviewed May 16, 2026

View reviewed changes

Comment thread src/cmd/describegpt/dictionary.rs

Comment thread src/cmd/synthesize/faker_map.rs

Comment thread src/cmd/synthesize/faker_map.rs Outdated

jqnatividad and others added 2 commits May 16, 2026 13:44

jqnatividad merged commit cbe9abc into master May 16, 2026
18 checks passed

jqnatividad deleted the describegpt-unique-id-content-type branch May 16, 2026 18:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(describegpt): deterministic `unique_id` Content Type for ALL_UNIQUE fields#3862

feat(describegpt): deterministic `unique_id` Content Type for ALL_UNIQUE fields#3862
jqnatividad merged 3 commits into
masterfrom
describegpt-unique-id-content-type

jqnatividad commented May 16, 2026

Uh oh!

codacy-production Bot commented May 16, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jqnatividad commented May 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jqnatividad commented May 16, 2026

Summary

Test plan

Uh oh!

codacy-production Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Up to standards ✅

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jqnatividad commented May 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codacy-production Bot commented May 16, 2026 •

edited

Loading