feat(describegpt): add --format semanticmd output #3933
Add a new `--format semanticmd` value to describegpt that emits the Data Dictionary as a Semantic Markdown document (https://semanticmd.org/): human-readable markdown with light, agent-parseable conventions that a companion converter turns into JSON. The default Markdown format is unchanged.

Like jsonschema, the format is dictionary-centric: it requires the dictionary inference phase (--dictionary/--all) and rejects --prompt. The --description result becomes the `# Dataset` description (attribution footer stripped) and --tags are embedded in the YAML frontmatter (requested as a clean JSON array).

Rendering uses a user-overridable MiniJinja template (semanticmd_md_body_template in describegpt_md_defaults.toml). The document emits a Dataset section, a Schema table (backticked codes, `required` prefix for non-null columns) with a heuristically-inferred Primary key, per-column subsections (Validation/Choices), and a Resource section with Statistics and per-column Frequency tables.

Frequency tables carry Choice, Frequency, Percentage and Rank, sourced from qsv's own frequency computation via a new structured `freq_details` field on DictionaryEntry (the flat `examples` string only retained value+count). The field is `#[serde(default)]` for cache compatibility and survives the two-pass merge. Aggregation buckets (Other…/(NULL)…) render with a blank rank.

Includes unit + integration tests, regenerated help docs, and an NYC 311 example doc linked from the README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
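To make the blank-rank rule concrete, here is a minimal sketch of rendering a Frequency table. The `FreqRow` struct, its field names, and the prefix test in `is_aggregation_bucket` are assumptions for illustration only; the actual `freq_details` element and the MiniJinja template in this PR may be shaped differently.

```rust
/// Hypothetical stand-in for one structured frequency row. The real
/// `freq_details` element may use different field names and types.
#[derive(Debug, Clone, PartialEq)]
pub struct FreqRow {
    pub value: String,
    pub count: u64,
    pub percentage: f64,
    pub rank: Option<u32>,
}

/// Assumption: aggregation buckets are recognisable by their label prefix.
fn is_aggregation_bucket(value: &str) -> bool {
    value.starts_with("Other") || value.starts_with("(NULL)")
}

/// Render rows as a Markdown table with Choice, Frequency, Percentage and Rank.
/// Aggregation buckets always get an empty Rank cell.
/// Note: cell values are not pipe-escaped in this sketch.
pub fn frequency_table(rows: &[FreqRow]) -> String {
    let mut out =
        String::from("| Choice | Frequency | Percentage | Rank |\n|---|---|---|---|\n");
    for row in rows {
        let rank = match row.rank {
            Some(r) if !is_aggregation_bucket(&row.value) => r.to_string(),
            _ => String::new(),
        };
        out.push_str(&format!(
            "| {} | {} | {:.2} | {} |\n",
            row.value, row.count, row.percentage, rank
        ));
    }
    out
}
```

Passing a regular value and an `Other (12)` bucket to `frequency_table` yields a ranked row for the former and an empty Rank cell for the latter.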
- Primary key inference: replace the estimated max(cardinality+null_count) row-count heuristic with the deterministic `<ALL_UNIQUE>` examples sentinel (+ null_count == 0), so a merely highest-cardinality column is never falsely inferred as a primary key (Medium).
- Markdown tables: pipe-escape column names, the primary key, and resource_name in Schema/Statistics/Primary-key cells so a header like `category|raw` can't break the tables; headings keep the literal name (Medium).
- Frontmatter tags: emit each tag as a YAML scalar, double-quoting and escaping values that need it (colons, spaces, #, quotes, newlines) while leaving plain lowercase_underscore tags bare (Low).

Adds regression tests for pipe-in-header escaping, high-cardinality non-unique PK rejection, and YAML tag escaping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
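The two escaping fixes above can be sketched as follows. The function names `escape_table_cell` and `yaml_tag_scalar` are hypothetical, not the helpers added in this PR. The reserved-word check is an extra assumption the commit message does not mention: it keeps a tag such as `true` from being parsed as a boolean.

```rust
/// Escape `|` so a value cannot terminate a Markdown table cell early.
pub fn escape_table_cell(text: &str) -> String {
    text.replace('|', "\\|")
}

/// Emit a tag as a YAML scalar: bare when it is a plain
/// lowercase_underscore word, double-quoted and escaped otherwise.
pub fn yaml_tag_scalar(tag: &str) -> String {
    // Assumption: words YAML parsers may read as booleans or null get quoted.
    const RESERVED: [&str; 9] =
        ["true", "false", "null", "yes", "no", "on", "off", "y", "n"];

    let starts_with_letter = tag
        .chars()
        .next()
        .map_or(false, |c| c.is_ascii_lowercase());
    let all_plain = tag
        .chars()
        .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_');

    if starts_with_letter && all_plain && !RESERVED.contains(&tag) {
        return tag.to_string();
    }

    // Double-quoted style: escape backslashes, quotes and control characters.
    let mut quoted = String::with_capacity(tag.len() + 2);
    quoted.push('"');
    for c in tag.chars() {
        match c {
            '\\' => quoted.push_str("\\\\"),
            '"' => quoted.push_str("\\\""),
            '\n' => quoted.push_str("\\n"),
            '\r' => quoted.push_str("\\r"),
            '\t' => quoted.push_str("\\t"),
            other => quoted.push(other),
        }
    }
    quoted.push('"');
    quoted
}
```

With this sketch, `category|raw` becomes `category\|raw` inside a table cell, `public_safety` stays bare, and `nyc: 311` is emitted as `"nyc: 311"`.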
The previous fix inferred the primary key from `examples == "<ALL_UNIQUE>"`, but that sentinel is overloaded: generate_code_based_dictionary sets it for any frequency row at 100% (including constant-value and HIGH_CARDINALITY columns that are explicitly NOT unique ids), so a non-null non-unique column could still be emitted as the SemanticMd primary key.

Carry the deterministic `is_all_unique` classification (cardinality == rowcount, no nulls, single freq row with count == cardinality) onto DictionaryEntry as a new `is_unique_id` field (#[serde(default)] for cache compatibility) and infer the primary key from it instead of the examples sentinel.

Adds a regression that builds entries through generate_code_based_dictionary with HIGH_CARDINALITY and constant-value frequency rows and asserts neither is inferred as a primary key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
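As an illustration of inferring the key from the structural flag rather than the sentinel, consider the sketch below. The `Entry` struct is a trimmed-down, hypothetical stand-in for DictionaryEntry. Returning no key when several columns qualify is an assumption, not something the commit message specifies.

```rust
/// Hypothetical, trimmed-down dictionary entry; the real DictionaryEntry
/// has more fields and may name these differently.
#[derive(Debug, Clone, Default)]
pub struct Entry {
    pub name: String,
    pub null_count: u64,
    /// May hold the overloaded "<ALL_UNIQUE>" sentinel; deliberately ignored.
    pub examples: String,
    /// Structural unique-id classification carried on the entry.
    pub is_unique_id: bool,
}

/// Infer the primary key from `is_unique_id` only.
/// Assumption: when more than one column qualifies, the result is
/// treated as ambiguous and no key is reported.
pub fn infer_primary_key(entries: &[Entry]) -> Option<&str> {
    let mut candidates = entries
        .iter()
        .filter(|e| e.is_unique_id && e.null_count == 0);
    let first = candidates.next()?;
    if candidates.next().is_some() {
        return None;
    }
    Some(first.name.as_str())
}
```

A column whose `examples` string is `<ALL_UNIQUE>` but whose `is_unique_id` flag is false is never selected by this function, which is the behavior the regression test targets.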
The structural `is_all_unique` detector required only a single frequency row with `count == cardinality`. Truncated or custom frequency data (e.g. `--limit 1 --no-other` or a `file:` frequency CSV) can emit one top row whose count coincidentally equals the column cardinality while `percentage < 100`, which would mark a non-unique column as `unique_id` and (since #2664) infer it as the SemanticMd primary key.

Add a `percentage ≈ 100.0` guard so the lone row must cover the whole column. This also tightens the pre-existing content_type unique_id stamping.

Adds a regression for a single-row frequency where count == cardinality but percentage < 100.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
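A minimal sketch of the tightened detector, under stated assumptions: the signature, the `FreqSummaryRow` type, and the tolerance constant are hypothetical, and the real `is_all_unique` in dictionary.rs may compare the percentage differently.

```rust
/// Hypothetical frequency row reduced to the two fields the check needs.
pub struct FreqSummaryRow {
    pub count: u64,
    pub percentage: f64,
}

/// Assumed tolerance for the "percentage ≈ 100.0" comparison.
const PCT_TOLERANCE: f64 = 1e-6;

/// A column is a unique id only if every row holds a distinct non-null value
/// and its single frequency row accounts for the whole column.
pub fn is_all_unique(
    cardinality: u64,
    row_count: u64,
    null_count: u64,
    freq_rows: &[FreqSummaryRow],
) -> bool {
    if cardinality != row_count || null_count != 0 {
        return false;
    }
    match freq_rows {
        // Exactly one row: its count must match the cardinality AND it must
        // cover ~100% of the column, which rules out truncated input.
        [only] => {
            only.count == cardinality
                && (only.percentage - 100.0).abs() < PCT_TOLERANCE
        }
        _ => false,
    }
}
```

The case the regression covers is a lone row with `count == cardinality` but `percentage` below 100: the count check passes, and the percentage guard rejects it.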
Up to standards ✅🟢 Issues
| Metric | Results |
|---|---|
| Complexity | 12 |
Pull request overview
Adds a new --format semanticmd output mode to describegpt that emits the inferred Data Dictionary as a Semantic Markdown document (frontmatter + Dataset/Schema/Resource sections + per-column subsections + Frequency tables). The format is dictionary-centric (incompatible with --prompt, requires --dictionary/--all), folds --description into the # Dataset body and --tags into the YAML frontmatter, and is rendered via a user-overridable MiniJinja template. New structured freq_details and is_unique_id fields on DictionaryEntry back richer Frequency tables and deterministic primary-key inference; the <ALL_UNIQUE> content-type heuristic is also tightened to require percentage == 100.0 to avoid false positives on truncated frequency input.
Changes:
- New `SemanticMd` `OutputFormat` variant wired through dispatch, validation, finalize, and a new `render_semanticmd_body` + template defaults.
- `DictionaryEntry` gains `freq_details` and `is_unique_id`; `generate_code_based_dictionary` populates them and tightens the unique-id detector.
- Tests and docs (README, TableOfContents, describegpt help, NYC 311 example) updated for the new format.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/cmd/describegpt.rs | Adds SemanticMd variant, render/finalize paths, tag-frontmatter + attribution helpers, CLI validation, tests. |
| src/cmd/describegpt/formatters.rs | Adds semanticmd_type, SemanticMd* render-data structs, build_semanticmd_data, frequency row mapping, and unit tests. |
| src/cmd/describegpt/dictionary.rs | Adds FreqDetail, is_unique_id; tightens is_all_unique to require percentage == 100; adds regression test. |
| resources/describegpt_md_defaults.toml | New semanticmd_md_body_template default. |
| tests/test_describegpt.rs | Integration tests for validation rejection and end-to-end semanticmd dictionary render. |
| docs/help/describegpt.md, docs/help/TableOfContents.md, README.md | Help/README updates referencing the new format and example doc. |
| docs/describegpt/nyc311-describegpt-semanticmd.md | New example output document. |
Closes #3735.
Adds a new `--format semanticmd` value to `describegpt` that emits the Data Dictionary as a Semantic Markdown document: human-readable markdown with light, agent-parseable conventions that a companion converter turns into JSON. The default `Markdown` format is unchanged.

Behavior

Like `jsonschema`, the format is dictionary-centric:
- Requires the dictionary inference phase (`--dictionary`/`--all`); rejects `--prompt`.
- The `--description` result becomes the `# Dataset` description (attribution footer stripped).
- `--tags` are embedded in the YAML frontmatter (requested as a clean JSON array, each tag emitted as a safe YAML scalar).

The document is rendered via a user-overridable MiniJinja template (`semanticmd_md_body_template` in `resources/describegpt_md_defaults.toml`) and contains:
- `# Dataset` section (frontmatter + description + Resource/Schema/Title table)
- `# Schema` table (backticked, pipe-escaped codes, `required` prefix for non-null columns) plus a deterministically-inferred Primary key
- `## Column` subsections with `### Validation` (numeric mins) and `### Choices` (enumerations)
- `# Resource` section with `## Statistics` and per-column `### Frequency` tables carrying Choice, Frequency, Percentage and Rank

Percentage/Rank come from qsv's own `frequency` computation via a new structured `freq_details` field on `DictionaryEntry` (the flat `examples` string only retained value+count). Primary-key inference uses the deterministic structural unique-id signal (cardinality == rowcount, no nulls, single frequency row at 100%), never a row-count estimate or the overloaded `<ALL_UNIQUE>` sentinel, so constant-value, HIGH_CARDINALITY, and truncated-frequency columns are not falsely flagged.

Testing

Tests cover the `--dictionary` requirement, `--prompt` incompatibility, full document structure, frequency percentage/rank, pipe-in-header escaping, YAML tag escaping, and primary-key edge cases (ambiguous, high-cardinality, constant, truncated frequency).

🤖 Generated with Claude Code