Skip to content

feat(describegpt): deterministic unique_id Content Type for ALL_UNIQUE fields#3862

Merged
jqnatividad merged 3 commits into
masterfrom
describegpt-unique-id-content-type
May 16, 2026
Merged

feat(describegpt): deterministic unique_id Content Type for ALL_UNIQUE fields#3862
jqnatividad merged 3 commits into
masterfrom
describegpt-unique-id-content-type

Conversation

@jqnatividad
Copy link
Copy Markdown
Collaborator

Summary

  • Adds a unique_id token to the Content Type vocabulary used by describegpt --infer-content-type, classifying fields where every row has a distinct non-null value (primary keys, surrogate keys, sequence numbers).
  • Detection is deterministic: generate_code_based_dictionary keys off qsv's <ALL_UNIQUE> frequency sentinel (which frequency emits exactly when cardinality == rowcount), so no separate rowcount lookup is needed. combine_dictionary_entries preserves the pre-set value so code-derived facts always win over the LLM's guess.
  • Aligns synthesize's faker map: unique_id is added to NON_FAKER_TOKENS so the vocab-coverage invariant passes and synthesize falls back to type+min/max generation (a dedicated per-row-unique generator is left as a follow-up).

Test plan

  • cargo test -F all_features --bin qsv — 194 unit tests pass, including 3 new dictionary tests:
    • generate_marks_all_unique_field_as_unique_id
    • generate_skips_unique_id_when_infer_content_type_off
    • combine_preserves_preset_unique_id_over_llm_value
  • cargo test -F all_features --test tests test_describegpt — 64 integration tests pass.
  • cargo test -F all_features --test tests test_synthesize — 12 integration tests pass.
  • cargo clippy --bin qsv -F all_features — clean on touched files.
  • qsv --generate-help-md re-run; only docs/help/describegpt.md updated, reflecting the new help text.

🤖 Generated with Claude Code

…QUE fields

Adds a "unique_id" token to the Content Type vocabulary used by
`describegpt --infer-content-type`, and deterministically classifies any
field whose `cardinality == rowcount` (i.e. every row carries a distinct
non-null value — primary keys, surrogate keys, sequence numbers) with
that token, overriding whatever the LLM returned for the field. Detection
keys off qsv's `<ALL_UNIQUE>` frequency sentinel, so no separate
rowcount lookup is needed.

`combine_dictionary_entries` now preserves any pre-set `content_type` so
code-derived facts always win over LLM guesses. The prompt template and
`--infer-content-type` help text were updated to document the override.

In synthesize's faker map, `unique_id` is added to `NON_FAKER_TOKENS`
(alongside `category` / `unknown`) so the vocab-coverage invariant
passes; synthesize falls back to type+min/max generation for these
fields. A dedicated per-row-unique generator is left for a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad requested a review from Copilot May 16, 2026 17:27
@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented May 16, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 17 complexity

Metric Results
Complexity 17

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds deterministic unique_id handling to describegpt --infer-content-type and aligns synthesize’s content-type fallback behavior.

Changes:

  • Adds unique_id to the describegpt Content Type vocabulary.
  • Pre-sets unique_id for fields with <ALL_UNIQUE> frequency rows and preserves that value over LLM output.
  • Updates synthesize non-faker handling and generated help text.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/cmd/describegpt/dictionary.rs Adds unique_id vocabulary, deterministic dictionary generation, merge behavior, and unit tests.
src/cmd/describegpt.rs Passes --infer-content-type state into dictionary generation and updates CLI help.
src/cmd/synthesize/faker_map.rs Treats unique_id as a non-faker token.
resources/describegpt_defaults.toml Updates the LLM prompt guidance for unique_id.
docs/help/describegpt.md Updates generated help documentation for the new behavior.
Comments suppressed due to low confidence (1)

src/cmd/describegpt/dictionary.rs:485

  • This check treats any frequency row whose value is literally <ALL_UNIQUE> as the sentinel. A real column whose repeated value is <ALL_UNIQUE> would be mislabeled unique_id, while a user-supplied --freq-options --all-unique-text ... sentinel would be missed; include the sentinel semantics (for example rank/percentage plus cardinality == count) instead of only exact text.
        let content_type =
            if infer_content_type && field_frequencies.iter().any(|f| f.value == "<ALL_UNIQUE>") {

Comment thread src/cmd/describegpt/dictionary.rs
Comment thread src/cmd/synthesize/faker_map.rs
Comment thread src/cmd/synthesize/faker_map.rs Outdated
jqnatividad and others added 2 commits May 16, 2026 13:44
Three Copilot review findings on #3862:

1. dictionary.rs: adding `unique_id` to the vocab let the LLM smuggle it
   into `parse_llm_dictionary_response`, which `combine_dictionary_entries`
   would then copy onto fields whose code-generated `content_type` was
   empty — breaking the "deterministic-only" contract for non-ALL_UNIQUE
   fields. Now rejected at both layers:
   - parser drops literal/cased `unique_id` from LLM output (returns "")
   - combine refuses to copy `unique_id` from `LlmDictField` regardless
   Prompt template updated to tell the LLM not to use `unique_id`. Added
   two tests covering parser rejection (incl. casing) and combine guard.

2. faker_map.rs module docs: updated to list `unique_id` alongside
   `category`/`unknown` in the non-faker set and to explain why
   (no per-row-unique fake-rs faker).

3. faker_map.rs NON_FAKER_TOKENS comment: corrected the inaccurate
   "sequential integers" claim — `build_numeric` samples from min/max
   and can emit duplicates, so `unique_id` round-trips through the
   dictionary but synthesize's output is not guaranteed unique today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous detection — `field_frequencies.iter().any(|f| f.value ==
"<ALL_UNIQUE>")` — had two real issues flagged in review:

1. False positives: a field whose values literally contain the string
   "<ALL_UNIQUE>" (a constant of that value, or one bucket among many)
   would be misclassified as unique_id.
2. False negatives: `frequency --all-unique-text <CUSTOM>` would emit a
   different sentinel string and never match.

Replaced with a structural check on the frequency table:
    stats.cardinality > 1
    AND stats.nullcount == 0
    AND field_frequencies.len() == 1
    AND field_frequencies[0].count == stats.cardinality

This invariant is exactly what qsv frequency emits for ALL_UNIQUE
(one row, count == row_count == cardinality), so it:
- doesn't reference the literal sentinel text
- excludes HIGH_CARDINALITY (single row but count > cardinality)
- excludes a constant whose value happens to be "<ALL_UNIQUE>"
  (cardinality == 1)
- excludes a mixed field with "<ALL_UNIQUE>" among its values
  (more than one frequency row)
- enforces the no-nulls part of the semantic contract explicitly

Added 4 tests covering each rejection path and the custom-sentinel-text
acceptance path. Updated doc comments on CONTENT_TYPE_VOCAB and
generate_code_based_dictionary to drop the now-inaccurate "matches the
<ALL_UNIQUE> sentinel" wording.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad
Copy link
Copy Markdown
Collaborator Author

Re: Copilot's suppressed (low-confidence) comment on dictionary.rs:485

The comment flagged a real issue worth fixing even though Copilot suppressed it: the original detection — field_frequencies.iter().any(|f| f.value == "<ALL_UNIQUE>") — was text-coupled to frequency's default sentinel and could produce both false positives and false negatives.

Replaced with a structural check in d727221d3 (no literal text matching):

let is_all_unique = stats_record.cardinality > 1
    && stats_record.nullcount == 0
    && field_frequencies.len() == 1
    && field_frequencies[0].count == stats_record.cardinality;

This invariant is exactly what qsv frequency guarantees for ALL_UNIQUE rows (one row, count == row_count == cardinality), so it now:

Scenario Old check New check
Genuine ALL_UNIQUE field
HIGH_CARDINALITY (same single-row, 100% shape) matched (bug) excluded — cardinality < count
Constant field with value literally <ALL_UNIQUE> false positive excluded — cardinality == 1
Mixed field with <ALL_UNIQUE> as one bucket false positive excluded — len() > 1
Custom frequency --all-unique-text sentinel false negative detected — text-independent
Field with nulls but unique non-nulls relied on freq to skip explicitly rejected

4 tests added covering each rejection path plus the custom-sentinel-text acceptance path:

  • generate_does_not_mislabel_constant_field_with_all_unique_value
  • generate_detects_unique_id_with_custom_all_unique_text
  • generate_does_not_misclassify_high_cardinality_as_unique_id
  • generate_does_not_mislabel_unique_id_when_nulls_present

@jqnatividad jqnatividad merged commit cbe9abc into master May 16, 2026
18 checks passed
@jqnatividad jqnatividad deleted the describegpt-unique-id-content-type branch May 16, 2026 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants