feat(describegpt): richer semanticmd Data Dictionary for agents & catalogs by jqnatividad · Pull Request #3935 · dathere/qsv

jqnatividad · 2026-06-02T16:49:55Z

Context

describegpt --format semanticmd emitted a thin Data Dictionary (label/description, a min/max/cardinality/null table, choices, frequency). It was readable for humans but weak for the two use-cases we actually want to serve:

An AI agent generating Jupyter notebooks to query the dataset — no grain, no dimension/measure classification, no join keys, no example queries, no rich stats.
An AI agent correlating/joining the dataset with OTHER datasets in a catalog — no catalog-wide semantic identity on columns, so it couldn't tell that Incident Zip here and a ZIP column elsewhere denote the same thing.

This PR redesigns the format to serve both, generated deterministically from qsv stats/frequency plus the existing LLM label/description pass (every statistic surfaced was already computed).

What's new

Per column

Concept — a catalog-wide, namespaced semantic identity (geo.zip_code, time.event_timestamp, id.surrogate_key, nyc.bbl, …) from a new CONCEPT_VOCAB. Two columns in different datasets sharing a concept are join candidates (shared-vocabulary linking — no explicit foreign keys).
Role — dimension / measure / identifier / timestamp.
Join hint (PK / FK?) + cardinality class (1:1 / N:1) + nullability.
Quality flags — placeholder-dates, PII, PII-location, sparse.
Rich statistics block (mean, median, stddev, q1/q3, skew, inner fences, sparsity) and extended validation (numeric min/max; text length).
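
The shared-vocabulary linking described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Column` struct, `is_linkable`, and `join_candidates` are hypothetical names, and the rule that empty or `unknown` concepts never link is an assumption about what a check like `is_linkable_concept` might do.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified view of a dictionary column: a name and a concept ID.
#[derive(Debug, Clone)]
pub struct Column {
    pub name: String,
    pub concept: String,
}

/// Assumption: empty or "unknown" concepts carry no identity and never link.
pub fn is_linkable(concept: &str) -> bool {
    !concept.is_empty() && concept != "unknown"
}

/// Pair up columns from two datasets that share a linkable concept.
/// Returns (concept, left column name, right column name), sorted.
pub fn join_candidates(left: &[Column], right: &[Column]) -> Vec<(String, String, String)> {
    // Index the right-hand dataset by concept ID.
    let mut by_concept: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for col in right.iter().filter(|c| is_linkable(&c.concept)) {
        by_concept
            .entry(col.concept.as_str())
            .or_default()
            .push(col.name.as_str());
    }

    // Look up each left-hand column's concept in that index.
    let mut out = Vec::new();
    for col in left.iter().filter(|c| is_linkable(&c.concept)) {
        if let Some(names) = by_concept.get(col.concept.as_str()) {
            for name in names {
                out.push((col.concept.clone(), col.name.clone(), (*name).to_string()));
            }
        }
    }
    out.sort();
    out
}
```

With an `Incident Zip` column in one dataset and a `ZIP` column in another, both tagged `geo.zip_code`, the function returns that single pair; columns tagged `unknown` on both sides are not matched.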

Per dataset

A grain statement (one row = one …).
A self-contained front-matter envelope: row_count, temporal_coverage, WGS84 spatial, and provenance via new --ds-source / --ds-updated / --ds-license flags (no DCAT/profile dependency).
A sorted concepts: index (cheap catalog-scan surface).
A # Example Queries section — ready-to-run DuckDB SQL + pandas, seeded from the inferred roles (group-by per dimension, monthly time-bucket, measure summary, and a cross-dataset join template keyed on a shared concept).
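
As a rough illustration of how example queries could be seeded from inferred roles, the sketch below maps each role to one DuckDB query template. The `Role` enum mirrors the four roles listed above, but `seed_query` and the exact SQL text are assumptions for illustration, not the templates the PR emits. It also assumes `table` and `col` arrive already escaped for SQL.

```rust
/// Analytical roles named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Dimension,
    Measure,
    Identifier,
    Timestamp,
}

/// Seed one illustrative DuckDB query for a column, based on its role.
/// Assumption: `table` and `col` are already safely escaped SQL fragments.
pub fn seed_query(table: &str, col: &str, role: Role) -> Option<String> {
    match role {
        // Group-by per dimension.
        Role::Dimension => Some(format!(
            "SELECT {col}, COUNT(*) AS n FROM {table} GROUP BY 1 ORDER BY n DESC;"
        )),
        // Monthly time bucket.
        Role::Timestamp => Some(format!(
            "SELECT date_trunc('month', {col}) AS month, COUNT(*) AS n FROM {table} GROUP BY 1 ORDER BY 1;"
        )),
        // Measure summary.
        Role::Measure => Some(format!(
            "SELECT MIN({col}), AVG({col}), MAX({col}) FROM {table};"
        )),
        // Identifiers get no aggregate query in this sketch.
        Role::Identifier => None,
    }
}
```

Because the output depends only on the role and the column name, the same dictionary always yields the same queries, which is the property the "deterministic" wording above refers to.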

How it works

concept/role added to DictionaryEntry + LlmDictField (#[serde(default)] for cache back-compat). concept is deterministically seeded from content_type and refined by the LLM; parse_llm_grain mirrors parse_llm_relationships.
--format semanticmd auto-enables --infer-content-type so Concept/Role/grain populate by default ("SemanticMd implies --infer-content-type"). The --two-pass refine pass inherits Concept/Role/grain from the first pass via the baseline-preserving merge, so the refine prompt (and its synced Rust const) is untouched.
semanticmd retains a richer addl_cols stats set by default so the per-column statistics block has data without --addl-cols.
dictionary_prompt asks the LLM for role/concept per field and a top-level grain.

Also fixes a pre-existing strip_attribution_block bug that left a dangling *Attribution: line when folding the --description response into the semanticmd doc (its test only covered the ---\nGenerated by form, not the real *Attribution: Generated by …* markdown footer).
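
For readers unfamiliar with the two footer shapes, a stripping routine that handles both might look like the sketch below. It is not the actual `strip_attribution_block` implementation; `strip_attribution` is a hypothetical stand-in, and the matching rules (cut at the first `*Attribution:` line, or at a `---` rule directly followed by `Generated by`) are assumptions.

```rust
/// Drop a trailing attribution footer from an LLM response.
/// Handles two footer shapes: a line starting with "*Attribution:" and
/// a "---" rule immediately followed by a "Generated by" line.
pub fn strip_attribution(text: &str) -> String {
    let lines: Vec<&str> = text.lines().collect();
    let mut cut = lines.len();
    for (i, line) in lines.iter().enumerate() {
        let t = line.trim_start();
        let rule_form = t == "---"
            && lines
                .get(i + 1)
                .map_or(false, |next| next.trim_start().starts_with("Generated by"));
        if t.starts_with("*Attribution:") || rule_form {
            cut = i;
            break;
        }
    }
    // Keep everything before the footer and drop trailing blank lines.
    lines[..cut].join("\n").trim_end().to_string()
}
```

A test that covers only one of the two shapes would miss the dangling-line case, which is the gap the fix above describes.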

Reference example

docs/describegpt/nyc311-describegpt-semanticmd.md was regenerated with a local LLM (openai/gpt-oss-20b) — authentic output, not hand-authored.

Testing

All 77 describegpt unit + integration tests pass (cargo test -F all_features describegpt), including new tests for role/concept parsing & validation, parse_llm_grain, deterministic concept/role seeding, is_linkable_concept, and the rewritten semanticmd render assertions.
qsv and qsvmcp compile clean; src clippy clean; cargo +nightly fmt applied.
Help (docs/help/describegpt.md) and the MCP skill JSON regenerated.

Notes / deviations from the original plan (all reduce risk, preserve intent)

Gating uses the existing --infer-content-type flag rather than a separate --no-infer-concepts (avoids threading a second boolean through ~15 call sites). The terse path remains --format markdown.
The refine prompt is intentionally left unchanged; two-pass still gets Concept/Role/grain from the first pass.
A --concept-vocab extensibility flag is deferred; CONCEPT_VOCAB is built-in for now (the existing --tag-vocab loader is the natural future hook).

🤖 Generated with Claude Code

…alogs

Redesign `describegpt --format semanticmd` into a reference Semantic Markdown Data Dictionary aimed at two use-cases: AI agents generating notebooks/queries over a dataset, and agents correlating/joining datasets across a catalog.

Per-column additions:
- Concept: catalog-wide, namespaced semantic identity (geo.*, time.*, id.*, org.*, pii.*, nyc.*, ...) from a new CONCEPT_VOCAB. Matching concept IDs across datasets are join candidates (shared-vocabulary linking, no explicit FKs).
- Role: dimension / measure / identifier / timestamp.
- Join hint (PK / FK?) + cardinality class (1:1 / N:1) + nullability.
- Data-quality flags (placeholder-dates, PII, PII-location, sparse).
- Richer stats block (mean, median, stddev, q1/q3, skew, fences, sparsity) and extended validation (numeric min/max, text length).

Dataset-level additions: grain statement, self-contained front-matter envelope (row_count, temporal_coverage, WGS84 spatial, source/updated/license via new --ds-source/--ds-updated/--ds-license flags), a sorted concepts index, and a deterministic "# Example Queries" section (DuckDB SQL + pandas) seeded from the roles.

Implementation:
- concept/role on DictionaryEntry + LlmDictField (#[serde(default)] for cache back-compat); deterministically seeded from content_type and refined by the LLM; parse_llm_grain mirrors parse_llm_relationships.
- --format semanticmd auto-enables --infer-content-type so concept/role/grain populate by default; the refine pass inherits them via the baseline merge.
- semanticmd retains a richer addl_cols stats set by default.
- dictionary_prompt asks for role/concept and a top-level grain.

Also fixes strip_attribution_block leaving a dangling "*Attribution:" line when folding the --description response into the semanticmd doc.

The nyc311 gold example was regenerated with a local LLM (openai/gpt-oss-20b).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codacy-production · 2026-06-02T16:51:16Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues


🟢 Metrics 83 complexity


Resolve three findings from roborev job 2668:
- Medium (frontmatter YAML escaping): register `yaml_scalar` as a Mini Jinja filter and route every semanticmd frontmatter string scalar (id, title, grain, ds_source/updated/license, temporal_coverage + spatial column/value fields, concept index) through it, so quotes/colons/newlines/leading indicators can't produce invalid or structurally different frontmatter. Same helper already used for `tags:`.
- Medium (concept backfill): after the LLM content_type merge, coerce_role_concept now backfills an empty concept from concept_from_content_type(content_type) before falling back to "unknown", in both single-pass and baseline/refine merge paths. Recovers the deterministic mapping when a model (or older cached response) returns a valid content_type but omits concept.
- Low (example-query escaping): escape the table path as a SQL string literal, column names as SQL identifiers (doubled `"`), and pandas references as Python string literals; the pandas read_csv path uses a Python-escaped resource name.

Adds unit tests for the backfill and the SQL/pandas escaping helpers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
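
The three escaping rules in the Low finding can be sketched as small helpers. The function names below are hypothetical (the commit does not name its helpers), and the Python escaping covers only backslashes, double quotes, and common control characters, which is an assumption about scope.

```rust
/// Escape a value as a SQL string literal: wrap in single quotes, double any `'`.
pub fn sql_string_literal(s: &str) -> String {
    format!("'{}'", s.replace('\'', "''"))
}

/// Escape a column name as a SQL identifier: wrap in double quotes, double any `"`.
pub fn sql_identifier(s: &str) -> String {
    format!("\"{}\"", s.replace('"', "\"\""))
}

/// Escape a value as a double-quoted Python string literal.
pub fn python_string_literal(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for ch in s.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}
```

Keeping the three contexts separate matters because a character that is harmless in one (a single quote inside a SQL identifier) terminates the literal in another.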

Resolve two Medium findings from the prior fix:
- dictionary.rs (concept backfill): coerce_role_concept now backfills the deterministic concept mapping when the concept is empty OR the literal "unknown" (a valid vocab token a model may emit), so a response with content_type "zip_code" and concept "unknown" still recovers geo.zip_code.
- describegpt.rs (yaml_scalar): tighten the YAML scalar emitter so values that look plain but are YAML implicit-typed (123, 1.5, true/false/yes/no/null/~, YYYY-MM-DD timestamps) are double-quoted, preventing frontmatter string values (e.g. updated dates, all-digit ids/tags) from being parsed as numbers, bools, nulls, or dates. Shared with the tags: block, so numeric/typed tags are now quoted too.

Adds a yaml_scalar implicit-type test and an "unknown"-concept backfill test; updates the tags frontmatter test to expect the quoted numeric tag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
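
The backfill rule can be illustrated with a small sketch. `seed_concept` is a hypothetical stand-in for `concept_from_content_type` and contains only the one mapping the commit message mentions (`zip_code` to `geo.zip_code`); `backfill_concept` is likewise a hypothetical name, not the signature of `coerce_role_concept`.

```rust
/// Hypothetical slice of the content_type -> concept mapping.
/// Only the `zip_code` entry is taken from the commit message.
pub fn seed_concept(content_type: &str) -> Option<&'static str> {
    match content_type {
        "zip_code" => Some("geo.zip_code"),
        _ => None,
    }
}

/// Keep a meaningful LLM-supplied concept; otherwise fall back to the
/// deterministic mapping, and only then to "unknown".
pub fn backfill_concept(llm_concept: &str, content_type: &str) -> String {
    let trimmed = llm_concept.trim();
    if !trimmed.is_empty() && trimmed != "unknown" {
        return trimmed.to_string();
    }
    seed_concept(content_type).unwrap_or("unknown").to_string()
}
```

Treating the literal `unknown` the same as an empty string is the point of this follow-up: a model that emits `unknown` alongside a valid content_type should not block the deterministic mapping.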

Resolve the Low finding: is_yaml_implicit_typed (used by yaml_scalar for semantic-md frontmatter and tags) relied on Rust i64/f64 parsing, which misses YAML integer spellings that still pass the plain-scalar char check: radix prefixes (0xFF, 0o17, 0b1010) and `_` digit separators (1_000, 1_000.5). Those would be emitted bare and parsed by YAML consumers as numbers instead of strings. Now detected and quoted, while non-numeric identifiers (v_1, 0xZZ_label) stay plain.

Extends the yaml_scalar test accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
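
A detection routine along the lines described in these two commits might look like the sketch below. It is an approximation under stated assumptions, not the real `is_yaml_implicit_typed`: the function names are hypothetical, the keyword list is limited to the tokens named above, and it errs on the side of quoting (for example, it also quotes `1e5`).

```rust
/// True for strings shaped like YYYY-MM-DD.
fn is_date(s: &str) -> bool {
    let b = s.as_bytes();
    b.len() == 10
        && b.iter().enumerate().all(|(i, c)| {
            if i == 4 || i == 7 { *c == b'-' } else { c.is_ascii_digit() }
        })
}

/// True for 0x / 0o / 0b integers, allowing a sign and `_` separators.
fn is_radix_int(s: &str) -> bool {
    let body = s.strip_prefix('+').or_else(|| s.strip_prefix('-')).unwrap_or(s);
    let (digits, radix) = if let Some(d) = body.strip_prefix("0x") {
        (d, 16)
    } else if let Some(d) = body.strip_prefix("0o") {
        (d, 8)
    } else if let Some(d) = body.strip_prefix("0b") {
        (d, 2)
    } else {
        return false;
    };
    let cleaned: String = digits.chars().filter(|c| *c != '_').collect();
    !cleaned.is_empty() && cleaned.chars().all(|c| c.is_digit(radix))
}

/// Would a YAML consumer likely read this plain scalar as a non-string?
pub fn looks_yaml_typed(s: &str) -> bool {
    let lower = s.to_ascii_lowercase();
    if matches!(lower.as_str(), "true" | "false" | "yes" | "no" | "null" | "~") {
        return true;
    }
    if is_date(s) || is_radix_int(s) {
        return true;
    }
    // Decimal integers and floats, tolerating `_` digit separators.
    let cleaned: String = s.chars().filter(|c| *c != '_').collect();
    let starts_numeric = s
        .chars()
        .next()
        .map_or(false, |c| c.is_ascii_digit() || matches!(c, '+' | '-' | '.'));
    starts_numeric && (cleaned.parse::<i64>().is_ok() || cleaned.parse::<f64>().is_ok())
}

/// Double-quote implicitly typed values; leave everything else untouched.
/// Typed values contain no quotes or backslashes, so plain wrapping is safe.
pub fn quote_if_typed(s: &str) -> String {
    if looks_yaml_typed(s) { format!("\"{s}\"") } else { s.to_string() }
}
```

The `starts_numeric` guard is what keeps identifiers such as `v_1` plain: stripping underscores alone would not turn them into numbers, but the guard also stops words like `nan` from matching via Rust's `f64` parser.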

Copilot

Pull request overview

This PR redesigns describegpt --format semanticmd to produce a richer, more agent-usable Semantic Markdown “data dictionary” by adding catalog-wide concept identifiers, analytical roles, dataset grain/envelope metadata, richer per-column stats/validation blocks, and seeded example queries. It also tightens frontmatter YAML handling and improves attribution stripping so generated docs fold cleanly.

Changes:

Added concept + role vocabularies, deterministic seeding/merging logic, linkable-concept detection, and dataset-level grain parsing in the describegpt dictionary pipeline.
Expanded SemanticMd rendering to include dataset envelope (row count, grain, temporal/spatial coverage, provenance), per-column join/quality/stats/validation blocks, and example queries.
Improved YAML scalar escaping/quoting for SemanticMd frontmatter and fixed attribution block stripping edge cases.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| src/cmd/describegpt/formatters.rs | Builds richer SemanticMd render model (join hints, quality flags, stats/validation blocks, row_count inference, example queries). |
| src/cmd/describegpt/dictionary.rs | Adds concept/role vocab + parsing/merging, deterministic seeding, and dataset grain extraction. |
| src/cmd/describegpt.rs | Wires SemanticMd envelope fields into templates, adds `--ds-*` flags, YAML scalar filter, richer default addl-cols for SemanticMd, and attribution stripping fix. |
| resources/describegpt_md_defaults.toml | Updates SemanticMd Mini-Jinja template to render the new envelope/schema/column blocks and example queries. |
| resources/describegpt_defaults.toml | Updates dictionary prompt to request `role`, `concept`, and top-level `grain`. |
| docs/help/describegpt.md | Documents SemanticMd enrichments and new `--ds-*` flags. |
| docs/describegpt/nyc311-describegpt-semanticmd.md | Regenerated reference SemanticMd output example. |
| .claude/skills/qsv/qsv-describegpt.json | Updates skill metadata and flags to include SemanticMd + `--ds-*`. |

…ation gating + tests) (#3936)

* address review: semanticmd stats/validation gating & reference frontmatter

- has_stats now considers stats.sparsity, so a column whose only retained stat is sparsity (custom --addl-cols-list) still renders the Statistics block
- Validation length constraint renders gracefully when only one of min_length/max_length is present, avoiding an empty ### Validation block
- reference SemanticMd doc frontmatter now matches yaml_scalar output: URL, spaced title, license, and YYYY-MM-DD date quoted; plain lat/lon unquoted

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(describegpt): cover semanticmd sparsity-only stats & single-sided length (job 2672)

Adds regression tests for the review-driven edge cases in 818e868:
- formatters: build_semanticmd_entry sets has_stats for a numeric column whose only retained stat is sparsity, and has_validation for a text column with only one of min_length/max_length
- describegpt: a full semanticmd render asserting the ### Statistics block appears for a sparsity-only numeric column and the elif length branches emit `- Length >= N` / `- Length <= N` (never an empty ### Validation block)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
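
The single-sided length handling can be pictured as a four-way match. In the PR this logic lives in the SemanticMd template's `elif` branches; the Rust function below is only an illustrative equivalent with a hypothetical name, and the wording of the two-sided case is an assumption (only the `>=` and `<=` forms are quoted in the commit message).

```rust
/// Render a text-length validation bullet, or `None` when neither bound is known.
/// Returning `None` lets the caller skip the "### Validation" heading entirely.
pub fn length_constraint(min_len: Option<usize>, max_len: Option<usize>) -> Option<String> {
    match (min_len, max_len) {
        // Assumed wording for the two-sided case.
        (Some(lo), Some(hi)) => Some(format!("- Length between {lo} and {hi}")),
        (Some(lo), None) => Some(format!("- Length >= {lo}")),
        (None, Some(hi)) => Some(format!("- Length <= {hi}")),
        (None, None) => None,
    }
}
```

Modelling the "no bounds" case as `None` rather than an empty string is one way to guarantee that an empty Validation block is never emitted.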

jqnatividad and others added 3 commits June 2, 2026 13:11

jqnatividad requested a review from Copilot June 2, 2026 19:16

Copilot started reviewing on behalf of jqnatividad June 2, 2026 19:16

Copilot AI reviewed Jun 2, 2026


Comment thread src/cmd/describegpt/formatters.rs

Comment thread resources/describegpt_md_defaults.toml

Comment thread docs/describegpt/nyc311-describegpt-semanticmd.md

jqnatividad merged commit 1fe11cc into master Jun 2, 2026
29 of 30 checks passed

jqnatividad deleted the feat/semanticmd-reference-dictionary branch June 2, 2026 20:01

jqnatividad mentioned this pull request Jun 2, 2026

fix(describegpt): semanticmd review follow-ups for #3935 (stats/validation gating + tests) #3936

Merged

jqnatividad merged 4 commits into master from feat/semanticmd-reference-dictionary
