
feat(describegpt): add --format semanticmd output#3933

Merged
jqnatividad merged 4 commits into master from describegpt-semantic-md
Jun 2, 2026

Conversation

@jqnatividad
Collaborator

Closes #3735.

Adds a new --format semanticmd value to describegpt that emits the Data Dictionary as a Semantic Markdown document — human-readable markdown with light, agent-parseable conventions that a companion converter turns into JSON. The default Markdown format is unchanged.

Behavior

Like jsonschema, the format is dictionary-centric:

  • Requires the dictionary inference phase (--dictionary / --all); rejects --prompt.
  • The --description result becomes the # Dataset description (attribution footer stripped).
  • --tags are embedded in the YAML frontmatter (requested as a clean JSON array, each tag emitted as a safe YAML scalar).
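
The flag rules above can be sketched as a small validation function. This is a minimal illustration, not the code in `src/cmd/describegpt.rs`: the `Flags` struct, the `validate` function, the order of the checks, and the error strings are all assumptions made for the example.

```rust
/// Hypothetical, simplified view of the output formats.
#[derive(Debug, Clone, Copy, PartialEq)]
enum OutputFormat {
    Markdown,
    JsonSchema,
    SemanticMd,
}

/// Hypothetical subset of the parsed CLI flags.
struct Flags {
    format: OutputFormat,
    dictionary: bool,
    all: bool,
    prompt: Option<String>,
}

/// Reject flag combinations that a dictionary-centric format cannot serve.
/// Only the semanticmd rules are modeled here.
fn validate(flags: &Flags) -> Result<(), String> {
    if flags.format != OutputFormat::SemanticMd {
        return Ok(());
    }
    if flags.prompt.is_some() {
        // Assumed wording; the real error message may differ.
        return Err("--format semanticmd cannot be combined with --prompt".to_string());
    }
    if !(flags.dictionary || flags.all) {
        return Err("--format semanticmd requires --dictionary or --all".to_string());
    }
    Ok(())
}
```

Checking the flags before any inference call means an invalid combination fails fast, without contacting the LLM.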

The document is rendered via a user-overridable MiniJinja template (semanticmd_md_body_template in resources/describegpt_md_defaults.toml) and contains:

  • A # Dataset section (frontmatter + description + Resource/Schema/Title table)
  • A # Schema table — backticked, pipe-escaped codes, required prefix for non-null columns — plus a deterministically-inferred Primary key
  • Per-column ## Column subsections with ### Validation (numeric mins) and ### Choices (enumerations)
  • A # Resource section with ## Statistics and per-column ### Frequency tables carrying Choice, Frequency, Percentage and Rank
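
As a rough sketch of the backticked, pipe-escaped cells mentioned above, the helpers below escape `|` so that a column name cannot split a Markdown table row. The function names `escape_table_cell` and `code_cell` are hypothetical, and the sketch handles only pipes; the actual helpers in the PR may cover more cases.

```rust
/// Escape a value for use inside a Markdown table cell.
/// Backslash-escapes `|` so it is not read as a column separator.
fn escape_table_cell(value: &str) -> String {
    value.replace('|', "\\|")
}

/// Render a column name as a backticked, pipe-escaped table cell.
/// (Pipes need escaping even inside a code span within a table.)
fn code_cell(name: &str) -> String {
    format!("`{}`", escape_table_cell(name))
}
```

With these helpers a header such as `category|raw` stays in one cell instead of splitting the row.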

Percentage/Rank come from qsv's own frequency computation via a new structured freq_details field on DictionaryEntry (the flat examples string only retained value+count). Primary-key inference uses the deterministic structural unique-id signal (cardinality == rowcount, no nulls, single frequency row at 100%) — never a row-count estimate or the overloaded <ALL_UNIQUE> sentinel — so constant-value, HIGH_CARDINALITY, and truncated-frequency columns are not falsely flagged.
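
The structural unique-id signal described above can be sketched as a single predicate. The struct and field names (`FreqRow`, `ColumnSummary`) and the tolerance `PCT_EPSILON` are assumptions for illustration; only the three conditions themselves come from the description.

```rust
/// Hypothetical, simplified frequency row (field names are assumptions).
struct FreqRow {
    count: u64,
    percentage: f64,
}

/// Hypothetical per-column summary.
struct ColumnSummary {
    cardinality: u64,
    null_count: u64,
    freq_rows: Vec<FreqRow>,
}

/// Tolerance for "percentage is effectively 100"; the value is an assumption.
const PCT_EPSILON: f64 = 1e-6;

/// True only when the column is structurally a unique id:
/// cardinality == row count, no nulls, and exactly one frequency row
/// whose count equals the cardinality and which covers ~100% of the column.
fn is_unique_id(col: &ColumnSummary, row_count: u64) -> bool {
    if col.null_count != 0 || col.cardinality != row_count {
        return false;
    }
    match col.freq_rows.as_slice() {
        [only] => {
            only.count == col.cardinality
                && (only.percentage - 100.0).abs() < PCT_EPSILON
        }
        _ => false,
    }
}
```

Under these assumptions a constant-value column fails the cardinality check, and a truncated single-row frequency fails the percentage check, so neither is flagged.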

Testing

  • Unit + integration tests for: rejection without --dictionary, --prompt incompatibility, full document structure, frequency percentage/rank, pipe-in-header escaping, YAML tag escaping, and primary-key edge cases (ambiguous, high-cardinality, constant, truncated frequency).
  • Regenerated help docs and added an NYC 311 example doc linked from the README.
  • Verified end-to-end against a local LLM.

🤖 Generated with Claude Code

jqnatividad and others added 4 commits June 2, 2026 08:03
Add a new `--format semanticmd` value to describegpt that emits the Data
Dictionary as a Semantic Markdown document (https://semanticmd.org/) — human
readable markdown with light, agent-parseable conventions that a companion
converter turns into JSON. The default Markdown format is unchanged.

Like jsonschema, the format is dictionary-centric: it requires the dictionary
inference phase (--dictionary/--all) and rejects --prompt. The --description
result becomes the `# Dataset` description (attribution footer stripped) and
--tags are embedded in the YAML frontmatter (requested as a clean JSON array).

Rendering uses a user-overridable MiniJinja template (semanticmd_md_body_template
in describegpt_md_defaults.toml). The document emits a Dataset section, a Schema
table (backticked codes, `required` prefix for non-null columns) with a
heuristically-inferred Primary key, per-column subsections (Validation/Choices),
and a Resource section with Statistics and per-column Frequency tables.

Frequency tables carry Choice, Frequency, Percentage and Rank, sourced from
qsv's own frequency computation via a new structured `freq_details` field on
DictionaryEntry (the flat `examples` string only retained value+count). The
field is `#[serde(default)]` for cache compatibility and survives the two-pass
merge. Aggregation buckets (Other…/(NULL)…) render with a blank rank.
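
A minimal sketch of how one Frequency table row could be rendered from a structured entry, with a blank rank for aggregation buckets. The `FreqDetail` shape shown here is assumed, not taken from the PR: the field names, the integer rank type, the two-decimal percentage, and the example bucket label are all illustrative choices.

```rust
/// Hypothetical shape of one structured frequency entry.
struct FreqDetail {
    value: String,
    count: u64,
    percentage: f64,
    /// `None` for aggregation buckets, which carry no rank.
    rank: Option<u32>,
}

/// Render one `| Choice | Frequency | Percentage | Rank |` table row.
fn frequency_row(d: &FreqDetail) -> String {
    // Buckets have no rank, so the cell is left blank.
    let rank = d.rank.map(|r| r.to_string()).unwrap_or_default();
    format!(
        "| {} | {} | {:.2} | {} |",
        d.value.replace('|', "\\|"), // keep pipes from splitting the row
        d.count,
        d.percentage,
        rank
    )
}
```

Modeling the rank as an `Option` keeps the "no rank" case explicit instead of relying on a sentinel number.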

Includes unit + integration tests, regenerated help docs, and an NYC 311
example doc linked from the README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Primary key inference: replace the estimated max(cardinality+null_count)
  row-count heuristic with the deterministic `<ALL_UNIQUE>` examples sentinel
  (+ null_count == 0), so a merely highest-cardinality column is never falsely
  inferred as a primary key (Medium).
- Markdown tables: pipe-escape column names, the primary key, and resource_name
  in Schema/Statistics/Primary-key cells so a header like `category|raw` can't
  break the tables; headings keep the literal name (Medium).
- Frontmatter tags: emit each tag as a YAML scalar, double-quoting and escaping
  values that need it (colons, spaces, #, quotes, newlines) while leaving plain
  lowercase_underscore tags bare (Low).

Adds regression tests for pipe-in-header escaping, high-cardinality non-unique
PK rejection, and YAML tag escaping.
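
The tag-escaping rule can be illustrated with the sketch below: plain lowercase_underscore tags stay bare, and everything else is emitted as a double-quoted YAML scalar. The function names are hypothetical. Quoting reserved words such as `null` and requiring a leading letter are extra assumptions added here so that a bare tag cannot be read as a non-string YAML value; control characters other than newline, carriage return, and tab are not handled.

```rust
/// True when the tag can be emitted bare: lowercase letters, digits and
/// underscores only, starting with a letter, and not a YAML keyword.
/// The keyword list and leading-letter rule are assumptions of this sketch.
fn is_plain_tag(tag: &str) -> bool {
    const RESERVED: [&str; 9] =
        ["true", "false", "null", "yes", "no", "on", "off", "y", "n"];
    let starts_with_letter = tag.chars().next().map_or(false, |c| c.is_ascii_lowercase());
    starts_with_letter
        && tag.chars().all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_')
        && !RESERVED.contains(&tag)
}

/// Emit a tag as a YAML scalar, double-quoting and escaping when needed.
fn yaml_tag_scalar(tag: &str) -> String {
    if is_plain_tag(tag) {
        return tag.to_string();
    }
    let mut out = String::with_capacity(tag.len() + 2);
    out.push('"');
    for c in tag.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            other => out.push(other),
        }
    }
    out.push('"');
    out
}
```

Quoting by default and leaving only an allow-listed shape bare is the conservative choice: an unexpected character leads to a quoted scalar rather than broken frontmatter.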

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The previous fix inferred the primary key from `examples == "<ALL_UNIQUE>"`,
but that sentinel is overloaded: generate_code_based_dictionary sets it for any
frequency row at 100% — including constant-value and HIGH_CARDINALITY columns
that are explicitly NOT unique ids — so a non-null non-unique column could still
be emitted as the SemanticMd primary key.

Carry the deterministic `is_all_unique` classification (cardinality == rowcount,
no nulls, single freq row with count == cardinality) onto DictionaryEntry as a
new `is_unique_id` field (#[serde(default)] for cache compatibility) and infer
the primary key from it instead of the examples sentinel.

Adds a regression that builds entries through generate_code_based_dictionary
with HIGH_CARDINALITY and constant-value frequency rows and asserts neither is
inferred as a primary key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The structural `is_all_unique` detector required only a single frequency row
with `count == cardinality`. Truncated or custom frequency data (e.g.
`--limit 1 --no-other` or a `file:` frequency CSV) can emit one top row whose
count coincidentally equals the column cardinality while `percentage < 100`,
which would mark a non-unique column as `unique_id` and (since #2664) infer it
as the SemanticMd primary key.

Add a `percentage ≈ 100.0` guard so the lone row must cover the whole column.
This also tightens the pre-existing content_type unique_id stamping. Adds a
regression for a single-row frequency where count == cardinality but
percentage < 100.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/cmd/describegpt.rs Dismissed
Comment thread src/cmd/describegpt.rs Dismissed
@jqnatividad jqnatividad requested a review from Copilot June 2, 2026 12:56
@codacy-production

Up to standards ✅

🟢 Issues: 0 new issues

🟢 Metrics: complexity 12

Contributor

Copilot AI left a comment


Pull request overview

Adds a new --format semanticmd output mode to describegpt that emits the inferred Data Dictionary as a Semantic Markdown document (frontmatter + Dataset/Schema/Resource sections + per-column subsections + Frequency tables). The format is dictionary-centric (incompatible with --prompt, requires --dictionary/--all), folds --description into the # Dataset body and --tags into the YAML frontmatter, and is rendered via a user-overridable MiniJinja template. New structured freq_details and is_unique_id fields on DictionaryEntry back richer Frequency tables and deterministic primary-key inference; the <ALL_UNIQUE> content-type heuristic is also tightened to require percentage == 100.0 to avoid false positives on truncated frequency input.

Changes:

  • New SemanticMd OutputFormat variant wired through dispatch, validation, finalize, and a new render_semanticmd_body + template defaults.
  • DictionaryEntry gains freq_details and is_unique_id; generate_code_based_dictionary populates them and tightens the unique-id detector.
  • Tests and docs (README, TableOfContents, describegpt help, NYC 311 example) updated for the new format.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/cmd/describegpt.rs | Adds SemanticMd variant, render/finalize paths, tag-frontmatter + attribution helpers, CLI validation, tests. |
| src/cmd/describegpt/formatters.rs | Adds semanticmd_type, SemanticMd* render-data structs, build_semanticmd_data, frequency row mapping, and unit tests. |
| src/cmd/describegpt/dictionary.rs | Adds FreqDetail, is_unique_id; tightens is_all_unique to require percentage == 100; adds regression test. |
| resources/describegpt_md_defaults.toml | New semanticmd_md_body_template default. |
| tests/test_describegpt.rs | Integration tests for validation rejection and end-to-end semanticmd dictionary render. |
| docs/help/describegpt.md, docs/help/TableOfContents.md, README.md | Help/README updates referencing the new format and example doc. |
| docs/describegpt/nyc311-describegpt-semanticmd.md | New example output document. |

@jqnatividad jqnatividad merged commit 12ceb8c into master Jun 2, 2026
17 of 20 checks passed
@jqnatividad jqnatividad deleted the describegpt-semantic-md branch June 2, 2026 13:01

Development

Successfully merging this pull request may close these issues.

describegpt: Use Semantic Markdown (WIP)

3 participants