feat(describegpt): deterministic unique_id Content Type for ALL_UNIQUE fields #3862
…QUE fields

Adds a "unique_id" token to the Content Type vocabulary used by `describegpt --infer-content-type`, and deterministically classifies any field whose `cardinality == rowcount` (i.e. every row carries a distinct non-null value: primary keys, surrogate keys, sequence numbers) with that token, overriding whatever the LLM returned for the field.

Detection keys off qsv's `<ALL_UNIQUE>` frequency sentinel, so no separate rowcount lookup is needed. `combine_dictionary_entries` now preserves any pre-set `content_type` so code-derived facts always win over LLM guesses. The prompt template and `--infer-content-type` help text were updated to document the override.

In synthesize's faker map, `unique_id` is added to `NON_FAKER_TOKENS` (alongside `category` / `unknown`) so the vocab-coverage invariant passes; synthesize falls back to type+min/max generation for these fields. A dedicated per-row-unique generator is left for a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
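The "code-derived facts always win" rule can be illustrated with a minimal sketch. The types `CodeEntry` and `LlmEntry` and the function `merge_content_type` are hypothetical stand-ins invented for this illustration; they are not the actual signatures in `dictionary.rs`, and an empty string is assumed to mean "not set by code".

```rust
/// Hypothetical, simplified stand-in for a code-generated dictionary entry.
#[derive(Debug, Clone, PartialEq)]
pub struct CodeEntry {
    pub name: String,
    /// Assumption: an empty string means "no content type was pre-set by code".
    pub content_type: String,
}

/// Hypothetical stand-in for the LLM's answer for one field.
#[derive(Debug, Clone, PartialEq)]
pub struct LlmEntry {
    pub name: String,
    pub content_type: String,
}

/// Merge rule sketched from the commit message: a pre-set, code-derived
/// content type is preserved; the LLM value is used only when code set nothing.
pub fn merge_content_type(code: &CodeEntry, llm: Option<&LlmEntry>) -> String {
    if !code.content_type.is_empty() {
        // Deterministic classification wins over the LLM's guess.
        return code.content_type.clone();
    }
    // No pre-set value: fall back to the LLM, or to "" if it gave none.
    llm.map(|l| l.content_type.clone()).unwrap_or_default()
}
```

With this rule, a field pre-set to `unique_id` keeps that token even if the LLM answered `category`, while a field with no pre-set value takes the LLM's answer unchanged.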
Up to standards ✅🟢 Issues
| Metric | Results |
|---|---|
| Complexity | 17 |
Pull request overview
Adds deterministic unique_id handling to describegpt --infer-content-type and aligns synthesize’s content-type fallback behavior.
Changes:
- Adds `unique_id` to the describegpt Content Type vocabulary.
- Pre-sets `unique_id` for fields with `<ALL_UNIQUE>` frequency rows and preserves that value over LLM output.
- Updates synthesize non-faker handling and generated help text.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/cmd/describegpt/dictionary.rs | Adds unique_id vocabulary, deterministic dictionary generation, merge behavior, and unit tests. |
| src/cmd/describegpt.rs | Passes --infer-content-type state into dictionary generation and updates CLI help. |
| src/cmd/synthesize/faker_map.rs | Treats unique_id as a non-faker token. |
| resources/describegpt_defaults.toml | Updates the LLM prompt guidance for unique_id. |
| docs/help/describegpt.md | Updates generated help documentation for the new behavior. |
Comments suppressed due to low confidence (1)
src/cmd/describegpt/dictionary.rs:485
- This check treats any frequency row whose value is literally `<ALL_UNIQUE>` as the sentinel. A real column whose repeated value is `<ALL_UNIQUE>` would be mislabeled `unique_id`, while a user-supplied `--freq-options --all-unique-text ...` sentinel would be missed; include the sentinel semantics (for example rank/percentage plus `cardinality == count`) instead of only exact text.
let content_type =
if infer_content_type && field_frequencies.iter().any(|f| f.value == "<ALL_UNIQUE>") {
Three Copilot review findings on #3862:

1. dictionary.rs: adding `unique_id` to the vocab let the LLM smuggle it into `parse_llm_dictionary_response`, which `combine_dictionary_entries` would then copy onto fields whose code-generated `content_type` was empty, breaking the "deterministic-only" contract for non-ALL_UNIQUE fields. Now rejected at both layers:
   - parser drops literal/cased `unique_id` from LLM output (returns "")
   - combine refuses to copy `unique_id` from `LlmDictField` regardless

   Prompt template updated to tell the LLM not to use `unique_id`. Added two tests covering parser rejection (incl. casing) and combine guard.
2. faker_map.rs module docs: updated to list `unique_id` alongside `category`/`unknown` in the non-faker set and to explain why (no per-row-unique fake-rs faker).
3. faker_map.rs NON_FAKER_TOKENS comment: corrected the inaccurate "sequential integers" claim. `build_numeric` samples from min/max and can emit duplicates, so `unique_id` round-trips through the dictionary but synthesize's output is not guaranteed unique today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
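The two-layer rejection described in finding 1 might look roughly like the sketch below. The function names `sanitize_llm_content_type` and `combine_content_type` are hypothetical and invented here; the real parser and combine code in `dictionary.rs` may be structured differently. The whitespace trimming and the ASCII case-insensitive comparison are assumptions about what "literal/cased" covers.

```rust
/// Token reserved for deterministic, code-derived classification (named in the PR).
const UNIQUE_ID: &str = "unique_id";

/// Hypothetical parser-side layer: returns "" when the LLM emitted the reserved
/// token in any ASCII casing, otherwise the trimmed value.
pub fn sanitize_llm_content_type(raw: &str) -> String {
    let trimmed = raw.trim();
    if trimmed.eq_ignore_ascii_case(UNIQUE_ID) {
        // The LLM is not allowed to assign the reserved token.
        String::new()
    } else {
        trimmed.to_string()
    }
}

/// Hypothetical combine-side layer: a pre-set code value always wins, and the
/// reserved token is never copied from the LLM even if the parser let it through.
pub fn combine_content_type(code_value: &str, llm_value: &str) -> String {
    if !code_value.is_empty() {
        return code_value.to_string();
    }
    if llm_value.trim().eq_ignore_ascii_case(UNIQUE_ID) {
        return String::new();
    }
    llm_value.to_string()
}
```

Checking at both layers is redundant by design: either guard alone keeps `unique_id` off fields that code did not classify, so a regression in one does not break the "deterministic-only" contract.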
The previous detection — `field_frequencies.iter().any(|f| f.value ==
"<ALL_UNIQUE>")` — had two real issues flagged in review:
1. False positives: a field whose values literally contain the string
"<ALL_UNIQUE>" (a constant of that value, or one bucket among many)
would be misclassified as unique_id.
2. False negatives: `frequency --all-unique-text <CUSTOM>` would emit a
different sentinel string and never match.
Replaced with a structural check on the frequency table:
stats.cardinality > 1
AND stats.nullcount == 0
AND field_frequencies.len() == 1
AND field_frequencies[0].count == stats.cardinality
This invariant is exactly what qsv frequency emits for ALL_UNIQUE
(one row, count == row_count == cardinality), so it:
- doesn't reference the literal sentinel text
- excludes HIGH_CARDINALITY (single row but count > cardinality)
- excludes a constant whose value happens to be "<ALL_UNIQUE>"
(cardinality == 1)
- excludes a mixed field with "<ALL_UNIQUE>" among its values
(more than one frequency row)
- enforces the no-nulls part of the semantic contract explicitly
Added 4 tests covering each rejection path and the custom-sentinel-text
acceptance path. Updated doc comments on CONTENT_TYPE_VOCAB and
generate_code_based_dictionary to drop the now-inaccurate "matches the
<ALL_UNIQUE> sentinel" wording.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
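As an illustration of the rejection paths listed above, here is a self-contained sketch of the structural check. `FieldStats`, `FrequencyRow`, and `row` are simplified, hypothetical types and helpers invented for this example; only the four conditions themselves come from the commit message.

```rust
/// Hypothetical, simplified per-field stats (only the fields the check needs).
pub struct FieldStats {
    pub cardinality: u64,
    pub nullcount: u64,
}

/// Hypothetical, simplified row of a frequency table for one field.
pub struct FrequencyRow {
    pub value: String,
    pub count: u64,
}

/// Convenience constructor used only in this sketch.
pub fn row(value: &str, count: u64) -> FrequencyRow {
    FrequencyRow { value: value.to_string(), count }
}

/// Structural check from the commit message: more than one distinct value,
/// no nulls, exactly one frequency row, and that row's count equals cardinality.
/// The sentinel text itself is never inspected.
pub fn is_all_unique(stats: &FieldStats, freqs: &[FrequencyRow]) -> bool {
    stats.cardinality > 1
        && stats.nullcount == 0
        && freqs.len() == 1
        // Safe to index: `&&` short-circuits unless there is exactly one row.
        && freqs[0].count == stats.cardinality
}
```

Because the check never reads `FrequencyRow::value`, a custom `--all-unique-text` sentinel is accepted and a literal `<ALL_UNIQUE>` data value is not mistaken for the sentinel.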
Re: Copilot's suppressed (low-confidence) comment on

The comment flagged a real issue worth fixing even though Copilot suppressed it: the original detection

Replaced with a structural check in

let is_all_unique = stats_record.cardinality > 1
    && stats_record.nullcount == 0
    && field_frequencies.len() == 1
    && field_frequencies[0].count == stats_record.cardinality;

This invariant is exactly what qsv

4 tests added covering each rejection path plus the custom-sentinel-text acceptance path:
Summary
- Adds a `unique_id` token to the Content Type vocabulary used by `describegpt --infer-content-type`, classifying fields where every row has a distinct non-null value (primary keys, surrogate keys, sequence numbers).
- `generate_code_based_dictionary` keys off qsv's `<ALL_UNIQUE>` frequency sentinel (which `frequency` emits exactly when `cardinality == rowcount`), so no separate rowcount lookup is needed.
- `combine_dictionary_entries` preserves the pre-set value so code-derived facts always win over the LLM's guess.
- `unique_id` is added to `NON_FAKER_TOKENS` so the vocab-coverage invariant passes and synthesize falls back to type+min/max generation (a dedicated per-row-unique generator is left as a follow-up).

Test plan
- `cargo test -F all_features --bin qsv`: 194 unit tests pass, including 3 new dictionary tests:
  - `generate_marks_all_unique_field_as_unique_id`
  - `generate_skips_unique_id_when_infer_content_type_off`
  - `combine_preserves_preset_unique_id_over_llm_value`
- `cargo test -F all_features --test tests test_describegpt`: 64 integration tests pass.
- `cargo test -F all_features --test tests test_synthesize`: 12 integration tests pass.
- `cargo clippy --bin qsv -F all_features`: clean on touched files.
- `qsv --generate-help-md` re-run; only `docs/help/describegpt.md` updated, reflecting the new help text.

🤖 Generated with Claude Code