feat(describegpt): add --format semanticmd output by jqnatividad · Pull Request #3933 · dathere/qsv

jqnatividad · 2026-06-02T12:55:24Z

Closes #3735.

Adds a new --format semanticmd value to describegpt that emits the Data Dictionary as a Semantic Markdown document — human-readable markdown with light, agent-parseable conventions that a companion converter turns into JSON. The default Markdown format is unchanged.

Behavior

Like jsonschema, the format is dictionary-centric:

Requires the dictionary inference phase (--dictionary / --all); rejects --prompt.
The --description result becomes the # Dataset description (attribution footer stripped).
--tags are embedded in the YAML frontmatter (requested as a clean JSON array, each tag emitted as a safe YAML scalar).
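
The tag handling in the last item can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `yaml_tag_scalar` is a hypothetical helper name, and quoting YAML reserved words such as `true` or `null` is an extra assumption that the description does not mention.

```rust
/// Hypothetical helper: render one tag as a YAML scalar.
/// Plain lowercase_underscore tags stay bare; anything else is
/// double-quoted with backslash escapes.
fn yaml_tag_scalar(tag: &str) -> String {
    // Assumption: bare words that YAML could read as bool/null get quoted too.
    const RESERVED: [&str; 7] = ["true", "false", "null", "yes", "no", "on", "off"];

    let starts_with_letter = tag.chars().next().map_or(false, |c| c.is_ascii_lowercase());
    let only_plain_chars = tag
        .chars()
        .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_');

    if starts_with_letter && only_plain_chars && !RESERVED.contains(&tag) {
        return tag.to_string();
    }

    // Everything else (colons, spaces, '#', quotes, newlines, ...) is quoted.
    let mut out = String::with_capacity(tag.len() + 2);
    out.push('"');
    for c in tag.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            _ => out.push(c),
        }
    }
    out.push('"');
    out
}
```

With this sketch, `public_safety` is emitted bare, while `nyc: 311` becomes `"nyc: 311"`.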

The document is rendered via a user-overridable MiniJinja template (semanticmd_md_body_template in resources/describegpt_md_defaults.toml) and contains:

A # Dataset section (frontmatter + description + Resource/Schema/Title table)
A # Schema table — backticked, pipe-escaped codes, required prefix for non-null columns — plus a deterministically-inferred Primary key
Per-column ## Column subsections with ### Validation (numeric mins) and ### Choices (enumerations)
A # Resource section with ## Statistics and per-column ### Frequency tables carrying Choice, Frequency, Percentage and Rank

Percentage/Rank come from qsv's own frequency computation via a new structured freq_details field on DictionaryEntry (the flat examples string only retained value+count). Primary-key inference uses the deterministic structural unique-id signal (cardinality == rowcount, no nulls, single frequency row at 100%) — never a row-count estimate or the overloaded <ALL_UNIQUE> sentinel — so constant-value, HIGH_CARDINALITY, and truncated-frequency columns are not falsely flagged.

Testing

Unit + integration tests for: rejection without --dictionary, --prompt incompatibility, full document structure, frequency percentage/rank, pipe-in-header escaping, YAML tag escaping, and primary-key edge cases (ambiguous, high-cardinality, constant, truncated frequency).
Regenerated help docs and added an NYC 311 example doc linked from the README.
Verified end-to-end against a local LLM.

🤖 Generated with Claude Code

Add a new `--format semanticmd` value to describegpt that emits the Data Dictionary as a Semantic Markdown document (https://semanticmd.org/): human-readable markdown with light, agent-parseable conventions that a companion converter turns into JSON. The default Markdown format is unchanged.

Like jsonschema, the format is dictionary-centric: it requires the dictionary inference phase (--dictionary/--all) and rejects --prompt. The --description result becomes the `# Dataset` description (attribution footer stripped) and --tags are embedded in the YAML frontmatter (requested as a clean JSON array). Rendering uses a user-overridable MiniJinja template (semanticmd_md_body_template in describegpt_md_defaults.toml).

The document emits a Dataset section, a Schema table (backticked codes, `required` prefix for non-null columns) with a heuristically-inferred Primary key, per-column subsections (Validation/Choices), and a Resource section with Statistics and per-column Frequency tables.

Frequency tables carry Choice, Frequency, Percentage and Rank, sourced from qsv's own frequency computation via a new structured `freq_details` field on DictionaryEntry (the flat `examples` string only retained value+count). The field is `#[serde(default)]` for cache compatibility and survives the two-pass merge. Aggregation buckets (Other…/(NULL)…) render with a blank rank.

Includes unit + integration tests, regenerated help docs, and an NYC 311 example doc linked from the README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
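
The Frequency table layout from this commit message, including the blank rank for aggregation buckets, might look something like the sketch below. `FreqRow` and `frequency_table` are hypothetical names, not the `freq_details` type added here. Modelling an aggregation bucket as `rank: None` and printing percentages with two decimals are both assumptions.

```rust
/// Hypothetical stand-in for one structured frequency entry.
struct FreqRow {
    value: String,
    count: u64,
    percentage: f64,
    /// Assumption: `None` marks an aggregation bucket, which has no rank.
    rank: Option<u64>,
}

/// Render a markdown Frequency table with Choice, Frequency, Percentage, Rank.
fn frequency_table(rows: &[FreqRow]) -> String {
    let mut out = String::from("| Choice | Frequency | Percentage | Rank |\n");
    out.push_str("| --- | --- | --- | --- |\n");
    for row in rows {
        // Aggregation buckets render with an empty Rank cell.
        let rank = row.rank.map(|n| n.to_string()).unwrap_or_default();
        out.push_str(&format!(
            "| {} | {} | {:.2} | {} |\n",
            row.value, row.count, row.percentage, rank
        ));
    }
    out
}
```

The sketch leaves cell escaping out on purpose; values containing `|` would need the same pipe-escaping the Schema table uses.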

- Primary key inference: replace the estimated max(cardinality+null_count) row-count heuristic with the deterministic `<ALL_UNIQUE>` examples sentinel (+ null_count == 0), so a merely highest-cardinality column is never falsely inferred as a primary key (Medium).
- Markdown tables: pipe-escape column names, the primary key, and resource_name in Schema/Statistics/Primary-key cells so a header like `category|raw` can't break the tables; headings keep the literal name (Medium).
- Frontmatter tags: emit each tag as a YAML scalar, double-quoting and escaping values that need it (colons, spaces, #, quotes, newlines) while leaving plain lowercase_underscore tags bare (Low).

Adds regression tests for pipe-in-header escaping, high-cardinality non-unique PK rejection, and YAML tag escaping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
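
The pipe-escaping fix for table cells can be illustrated with a small sketch. Both function names are hypothetical and the real implementation may live in the template or in a filter; the only behaviour taken from the commit message is that `|` is escaped in cells while headings keep the literal name.

```rust
/// Hypothetical helper: make text safe for a markdown table cell by
/// escaping the column separator.
fn escape_table_cell(text: &str) -> String {
    text.replace('|', "\\|")
}

/// Hypothetical helper: a Schema "code" cell, backticked and pipe-escaped.
fn schema_code_cell(column: &str) -> String {
    format!("`{}`", escape_table_cell(column))
}

/// Headings keep the literal column name, since a pipe cannot break a heading.
fn column_heading(column: &str) -> String {
    format!("## {}", column)
}
```

For a header named `category|raw`, the cell becomes `` `category\|raw` `` while the heading stays `## category|raw`.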

The previous fix inferred the primary key from `examples == "<ALL_UNIQUE>"`, but that sentinel is overloaded: generate_code_based_dictionary sets it for any frequency row at 100% (including constant-value and HIGH_CARDINALITY columns that are explicitly NOT unique ids), so a non-null non-unique column could still be emitted as the SemanticMd primary key.

Carry the deterministic `is_all_unique` classification (cardinality == rowcount, no nulls, single freq row with count == cardinality) onto DictionaryEntry as a new `is_unique_id` field (#[serde(default)] for cache compatibility) and infer the primary key from it instead of the examples sentinel.

Adds a regression that builds entries through generate_code_based_dictionary with HIGH_CARDINALITY and constant-value frequency rows and asserts neither is inferred as a primary key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The structural `is_all_unique` detector required only a single frequency row with `count == cardinality`. Truncated or custom frequency data (e.g. `--limit 1 --no-other` or a `file:` frequency CSV) can emit one top row whose count coincidentally equals the column cardinality while `percentage < 100`, which would mark a non-unique column as `unique_id` and (since #2664) infer it as the SemanticMd primary key.

Add a `percentage ≈ 100.0` guard so the lone row must cover the whole column. This also tightens the pre-existing content_type unique_id stamping.

Adds a regression for a single-row frequency where count == cardinality but percentage < 100.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
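
Putting the conditions from the last two commits together, the tightened detector could be sketched as below. `TopFreq` and `is_all_unique_sketch` are hypothetical names, the argument list is simplified, and the `1e-6` tolerance used for "≈ 100.0" is an assumption.

```rust
/// Hypothetical stand-in for one frequency row of a column.
struct TopFreq {
    count: u64,
    percentage: f64,
}

/// Sketch of the structural unique-id signal: cardinality == rowcount,
/// no nulls, and a single frequency row that covers the whole column.
fn is_all_unique_sketch(cardinality: u64, rowcount: u64, null_count: u64, freq: &[TopFreq]) -> bool {
    if cardinality != rowcount || null_count != 0 {
        return false;
    }
    match freq {
        // Exactly one row, its count equals the cardinality, and (the new
        // guard) it accounts for ~100% of the column.
        [only] => only.count == cardinality && (only.percentage - 100.0).abs() < 1e-6,
        _ => false,
    }
}
```

A lone row with `count == cardinality` but `percentage` of 40.0, as truncated or custom frequency data can produce, is rejected by the last condition even when the other checks pass.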

codacy-production · 2026-06-02T12:56:52Z

Up to standards ✅

🟢 Issues: 0 new issues

🟢 Metrics: complexity 12

Copilot

Pull request overview

Adds a new --format semanticmd output mode to describegpt that emits the inferred Data Dictionary as a Semantic Markdown document (frontmatter + Dataset/Schema/Resource sections + per-column subsections + Frequency tables). The format is dictionary-centric (incompatible with --prompt, requires --dictionary/--all), folds --description into the # Dataset body and --tags into the YAML frontmatter, and is rendered via a user-overridable MiniJinja template. New structured freq_details and is_unique_id fields on DictionaryEntry back richer Frequency tables and deterministic primary-key inference; the <ALL_UNIQUE> content-type heuristic is also tightened to require percentage == 100.0 to avoid false positives on truncated frequency input.

Changes:

New SemanticMd OutputFormat variant wired through dispatch, validation, finalize, and a new render_semanticmd_body + template defaults.
DictionaryEntry gains freq_details and is_unique_id; generate_code_based_dictionary populates them and tightens the unique-id detector.
Tests and docs (README, TableOfContents, describegpt help, NYC 311 example) updated for the new format.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/cmd/describegpt.rs | Adds `SemanticMd` variant, render/finalize paths, tag-frontmatter + attribution helpers, CLI validation, tests. |
| src/cmd/describegpt/formatters.rs | Adds `semanticmd_type`, `SemanticMd*` render-data structs, `build_semanticmd_data`, frequency row mapping, and unit tests. |
| src/cmd/describegpt/dictionary.rs | Adds `FreqDetail`, `is_unique_id`; tightens `is_all_unique` to require percentage == 100; adds regression test. |
| resources/describegpt_md_defaults.toml | New `semanticmd_md_body_template` default. |
| tests/test_describegpt.rs | Integration tests for validation rejection and end-to-end semanticmd dictionary render. |
| docs/help/describegpt.md, docs/help/TableOfContents.md, README.md | Help/README updates referencing the new format and example doc. |
| docs/describegpt/nyc311-describegpt-semanticmd.md | New example output document. |

jqnatividad and others added 4 commits June 2, 2026 08:03

github-advanced-security AI found potential problems Jun 2, 2026

Comment thread src/cmd/describegpt.rs Dismissed

Comment thread src/cmd/describegpt.rs Dismissed

jqnatividad requested a review from Copilot June 2, 2026 12:56

Copilot started reviewing on behalf of jqnatividad June 2, 2026 12:56

Copilot AI reviewed Jun 2, 2026

jqnatividad merged commit 12ceb8c into master Jun 2, 2026
17 of 20 checks passed

jqnatividad deleted the describegpt-semantic-md branch June 2, 2026 13:01
