feat(describegpt): richer semanticmd Data Dictionary for agents & catalogs #3935
…alogs

Redesign `describegpt --format semanticmd` into a reference Semantic Markdown Data Dictionary aimed at two use-cases: AI agents generating notebooks/queries over a dataset, and agents correlating/joining datasets across a catalog.

Per-column additions:

- Concept: catalog-wide, namespaced semantic identity (geo.*, time.*, id.*, org.*, pii.*, nyc.*, ...) from a new CONCEPT_VOCAB. Matching concept IDs across datasets are join candidates (shared-vocabulary linking, no explicit FKs).
- Role: dimension / measure / identifier / timestamp.
- Join hint (PK / FK?) + cardinality class (1:1 / N:1) + nullability.
- Data-quality flags (placeholder-dates, PII, PII-location, sparse).
- Richer stats block (mean, median, stddev, q1/q3, skew, fences, sparsity) and extended validation (numeric min/max, text length).

Dataset-level additions: grain statement, self-contained front-matter envelope (row_count, temporal_coverage, WGS84 spatial, source/updated/license via new --ds-source/--ds-updated/--ds-license flags), a sorted concepts index, and a deterministic "# Example Queries" section (DuckDB SQL + pandas) seeded from the roles.

Implementation:

- concept/role on DictionaryEntry + LlmDictField (#[serde(default)] for cache back-compat); deterministically seeded from content_type and refined by the LLM; parse_llm_grain mirrors parse_llm_relationships.
- --format semanticmd auto-enables --infer-content-type so concept/role/grain populate by default; the refine pass inherits them via the baseline merge.
- semanticmd retains a richer addl_cols stats set by default.
- dictionary_prompt asks for role/concept and a top-level grain.

Also fixes strip_attribution_block leaving a dangling "*Attribution:" line when folding the --description response into the semanticmd doc.

The nyc311 gold example was regenerated with a local LLM (openai/gpt-oss-20b).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
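The shared-vocabulary linking idea in this commit can be illustrated with a minimal sketch. It is not the PR's implementation: `is_linkable` and `join_candidates` are hypothetical names, and the rule that empty or `unknown` concepts never link is an assumption made for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: a concept is treated as linkable only when it is
/// non-empty and not the literal "unknown" (assumption for this sketch).
fn is_linkable(concept: &str) -> bool {
    !concept.is_empty() && concept != "unknown"
}

/// Given (column, concept) pairs for two datasets, return
/// (concept, left_column, right_column) triples that share a concept ID.
fn join_candidates<'a>(
    left: &[(&'a str, &'a str)],
    right: &[(&'a str, &'a str)],
) -> Vec<(&'a str, &'a str, &'a str)> {
    // Index the right-hand dataset's columns by concept ID.
    let mut by_concept: BTreeMap<&'a str, Vec<&'a str>> = BTreeMap::new();
    for &(col, concept) in right {
        if is_linkable(concept) {
            by_concept.entry(concept).or_default().push(col);
        }
    }
    // Every left column whose concept appears on the right is a candidate.
    let mut out = Vec::new();
    for &(col, concept) in left {
        if let Some(cols) = by_concept.get(concept) {
            for &rcol in cols {
                out.push((concept, col, rcol));
            }
        }
    }
    out
}
```

With `("Incident Zip", "geo.zip_code")` on one side and `("ZIP", "geo.zip_code")` on the other, the sketch reports a single candidate pair keyed on `geo.zip_code`; columns tagged `unknown` are never paired.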
Up to standards ✅🟢 Issues
| Metric | Results |
|---|---|
| Complexity | 83 |
Resolve three findings from roborev job 2668:

- Medium (frontmatter YAML escaping): register `yaml_scalar` as a MiniJinja filter and route every semanticmd frontmatter string scalar (id, title, grain, ds_source/updated/license, temporal_coverage + spatial column/value fields, concept index) through it, so quotes/colons/newlines/leading indicators can't produce invalid or structurally different frontmatter. Same helper already used for `tags:`.
- Medium (concept backfill): after the LLM content_type merge, coerce_role_concept now backfills an empty concept from concept_from_content_type(content_type) before falling back to "unknown", in both single-pass and baseline/refine merge paths. Recovers the deterministic mapping when a model (or older cached response) returns a valid content_type but omits concept.
- Low (example-query escaping): escape the table path as a SQL string literal, column names as SQL identifiers (doubled `"`), and pandas references as Python string literals; the pandas read_csv path uses a Python-escaped resource name.

Adds unit tests for the backfill and the SQL/pandas escaping helpers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
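The three escaping rules from the Low finding can be sketched roughly as follows. The function names are hypothetical and the Python escaper covers only the common cases (backslash, double quote, newline, carriage return, tab); the helpers in the PR may differ.

```rust
/// SQL string literal: wrap in single quotes, double any embedded `'`.
fn sql_string_literal(s: &str) -> String {
    format!("'{}'", s.replace('\'', "''"))
}

/// SQL identifier: wrap in double quotes, double any embedded `"`.
fn sql_identifier(s: &str) -> String {
    format!("\"{}\"", s.replace('"', "\"\""))
}

/// Double-quoted Python string literal with backslash escapes.
fn python_string_literal(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            _ => out.push(c),
        }
    }
    out.push('"');
    out
}
```

Keeping the three contexts in separate helpers means a column name such as `Incident "Zip"` is escaped differently depending on whether it lands in a SQL identifier position or inside a pandas `df["..."]` reference.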
Resolve two Medium findings from the prior fix:

- dictionary.rs (concept backfill): coerce_role_concept now backfills the deterministic concept mapping when the concept is empty OR the literal "unknown" (a valid vocab token a model may emit), so a response with content_type "zip_code" and concept "unknown" still recovers geo.zip_code.
- describegpt.rs (yaml_scalar): tighten the YAML scalar emitter so values that look plain but are YAML implicit-typed (123, 1.5, true/false/yes/no/null/~, YYYY-MM-DD timestamps) are double-quoted, preventing frontmatter string values (e.g. updated dates, all-digit ids/tags) from being parsed as numbers, bools, nulls, or dates. Shared with the tags: block, so numeric/typed tags are now quoted too.

Adds a yaml_scalar implicit-type test and an "unknown"-concept backfill test; updates the tags frontmatter test to expect the quoted numeric tag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
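A minimal sketch of the backfill rule described in the first finding, under stated assumptions: `lookup_concept` is a hypothetical stand-in for the real content_type mapping and carries only the `zip_code` entry named above, and `backfill_concept` is an illustrative name rather than the PR's `coerce_role_concept`.

```rust
/// Hypothetical one-entry stand-in for the content_type -> concept mapping.
fn lookup_concept(content_type: &str) -> Option<&'static str> {
    match content_type {
        "zip_code" => Some("geo.zip_code"),
        _ => None,
    }
}

/// Keep a meaningful LLM-supplied concept; otherwise fall back to the
/// deterministic mapping, and only then to "unknown".
fn backfill_concept(concept: &str, content_type: &str) -> String {
    let trimmed = concept.trim();
    // Both the empty string and the literal "unknown" trigger the backfill.
    if !trimmed.is_empty() && trimmed != "unknown" {
        return trimmed.to_string();
    }
    lookup_concept(content_type).unwrap_or("unknown").to_string()
}
```

Treating `unknown` the same as an empty value is what lets a response with content_type `zip_code` and concept `unknown` still resolve to `geo.zip_code`.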
Resolve the Low finding: is_yaml_implicit_typed (used by yaml_scalar for semantic-md frontmatter and tags) relied on Rust i64/f64 parsing, which misses YAML integer spellings that still pass the plain-scalar char check: radix prefixes (0xFF, 0o17, 0b1010) and `_` digit separators (1_000, 1_000.5). Those would be emitted bare and parsed by YAML consumers as numbers instead of strings. Now detected and quoted, while non-numeric identifiers (v_1, 0xZZ_label) stay plain.

Extends the yaml_scalar test accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
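The detection described here might look roughly like the sketch below. It is an approximation, not the PR's `is_yaml_implicit_typed`: the function names are hypothetical, keyword matching is assumed to be case-insensitive, and the check errs on the side of quoting (for example, anything Rust's `f64` parser accepts after stripping `_`).

```rust
/// True for a bare YYYY-MM-DD shape (digits are not range-checked).
fn is_iso_date(s: &str) -> bool {
    let b = s.as_bytes();
    b.len() == 10
        && b[4] == b'-'
        && b[7] == b'-'
        && b.iter()
            .enumerate()
            .all(|(i, c)| i == 4 || i == 7 || c.is_ascii_digit())
}

/// Hypothetical check: would a YAML consumer read this plain scalar as a
/// bool, null, number, or date rather than a string?
fn looks_yaml_typed(s: &str) -> bool {
    let lower = s.to_ascii_lowercase();
    if matches!(lower.as_str(), "true" | "false" | "yes" | "no" | "null" | "~") {
        return true;
    }
    // Radix-prefixed integers (0xFF, 0o17, 0b1010), allowing `_` separators.
    for (prefix, radix) in [("0x", 16), ("0o", 8), ("0b", 2)] {
        if let Some(digits) = lower.strip_prefix(prefix) {
            let cleaned: String = digits.chars().filter(|c| *c != '_').collect();
            return !cleaned.is_empty() && cleaned.chars().all(|c| c.is_digit(radix));
        }
    }
    // Decimal ints/floats with `_` digit separators (1_000, 1_000.5).
    let cleaned: String = s.chars().filter(|c| *c != '_').collect();
    let starts_numeric =
        s.starts_with(|c: char| c.is_ascii_digit() || matches!(c, '+' | '-' | '.'));
    if starts_numeric && cleaned.parse::<f64>().is_ok() {
        return true;
    }
    is_iso_date(s)
}
```

Values for which this returns `true` would be double-quoted by the emitter, while identifiers such as `v_1` or `0xZZ_label` fall through every branch and stay plain.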
Pull request overview
This PR redesigns describegpt --format semanticmd to produce a richer, more agent-usable Semantic Markdown “data dictionary” by adding catalog-wide concept identifiers, analytical roles, dataset grain/envelope metadata, richer per-column stats/validation blocks, and seeded example queries. It also tightens frontmatter YAML handling and improves attribution stripping so generated docs fold cleanly.
Changes:
- Added `concept` + `role` vocabularies, deterministic seeding/merging logic, linkable-concept detection, and dataset-level `grain` parsing in the describegpt dictionary pipeline.
- Expanded SemanticMd rendering to include dataset envelope (row count, grain, temporal/spatial coverage, provenance), per-column join/quality/stats/validation blocks, and example queries.
- Improved YAML scalar escaping/quoting for SemanticMd frontmatter and fixed attribution block stripping edge cases.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/cmd/describegpt/formatters.rs | Builds richer SemanticMd render model (join hints, quality flags, stats/validation blocks, row_count inference, example queries). |
| src/cmd/describegpt/dictionary.rs | Adds concept/role vocab + parsing/merging, deterministic seeding, and dataset grain extraction. |
| src/cmd/describegpt.rs | Wires SemanticMd envelope fields into templates, adds --ds-* flags, YAML scalar filter, richer default addl-cols for SemanticMd, and attribution stripping fix. |
| resources/describegpt_md_defaults.toml | Updates SemanticMd MiniJinja template to render the new envelope/schema/column blocks and example queries. |
| resources/describegpt_defaults.toml | Updates dictionary prompt to request role, concept, and top-level grain. |
| docs/help/describegpt.md | Documents SemanticMd enrichments and new --ds-* flags. |
| docs/describegpt/nyc311-describegpt-semanticmd.md | Regenerated reference SemanticMd output example. |
| .claude/skills/qsv/qsv-describegpt.json | Updates skill metadata and flags to include SemanticMd + --ds-*. |
…ation gating + tests) (#3936)

* address review: semanticmd stats/validation gating & reference frontmatter

  - has_stats now considers stats.sparsity, so a column whose only retained stat is sparsity (custom --addl-cols-list) still renders the Statistics block
  - Validation length constraint renders gracefully when only one of min_length/max_length is present, avoiding an empty ### Validation block
  - reference SemanticMd doc frontmatter now matches yaml_scalar output: URL, spaced title, license, and YYYY-MM-DD date quoted; plain lat/lon unquoted

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(describegpt): cover semanticmd sparsity-only stats & single-sided length (job 2672)

  Adds regression tests for the review-driven edge cases in 818e868:

  - formatters: build_semanticmd_entry sets has_stats for a numeric column whose only retained stat is sparsity, and has_validation for a text column with only one of min_length/max_length
  - describegpt: a full semanticmd render asserting the ### Statistics block appears for a sparsity-only numeric column and the elif length branches emit `- Length >= N` / `- Length <= N` (never an empty ### Validation block)

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
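The two gating fixes can be sketched as below. This is an illustration only: in the PR the length branches live in the template, `ColumnStats` is a hypothetical two-field struct, and the wording of the both-bounds line is an assumption (only the `- Length >= N` / `- Length <= N` forms are stated above).

```rust
/// Hypothetical, trimmed-down stats record for one column.
#[derive(Default)]
struct ColumnStats {
    mean: Option<f64>,
    sparsity: Option<f64>,
}

/// The Statistics block renders if any retained stat is present,
/// including a column whose only retained stat is sparsity.
fn has_stats(stats: &ColumnStats) -> bool {
    stats.mean.is_some() || stats.sparsity.is_some()
}

/// Render the length constraint line, or None when neither bound exists
/// (so no empty Validation block is emitted).
fn length_constraint(min_length: Option<u64>, max_length: Option<u64>) -> Option<String> {
    match (min_length, max_length) {
        // Assumed wording for the two-sided case.
        (Some(lo), Some(hi)) => Some(format!("- Length between {lo} and {hi}")),
        (Some(lo), None) => Some(format!("- Length >= {lo}")),
        (None, Some(hi)) => Some(format!("- Length <= {hi}")),
        (None, None) => None,
    }
}
```

Returning `Option<String>` makes the "nothing to render" case explicit, which is the condition the regression tests above guard against.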
Context
`describegpt --format semanticmd` emitted a thin Data Dictionary (label/description, a min/max/cardinality/null table, choices, frequency). It was readable for humans but weak for the two use-cases we actually want to serve:

- AI agents generating notebooks/queries over a dataset.
- Agents correlating/joining datasets across a catalog, e.g. recognizing that `Incident Zip` here and a `ZIP` column elsewhere denote the same thing.

This PR redesigns the format to serve both, generated deterministically from qsv `stats`/`frequency` plus the existing LLM label/description pass (every statistic surfaced was already computed).

What's new
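Per column

- Concept: catalog-wide, namespaced semantic identity (`geo.zip_code`, `time.event_timestamp`, `id.surrogate_key`, `nyc.bbl`, …) from a new `CONCEPT_VOCAB`. Two columns in different datasets sharing a concept are join candidates (shared-vocabulary linking, no explicit foreign keys).
- Role: `dimension` / `measure` / `identifier` / `timestamp`.
- Join hint (`PK` / `FK?`) + cardinality class (`1:1` / `N:1`) + nullability.
- Data-quality flags: `placeholder-dates`, `PII`, `PII-location`, `sparse`.
- Richer stats block (mean, median, stddev, q1/q3, skew, fences, sparsity) and extended validation (numeric min/max, text length).

Per dataset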
- Grain statement (`one row = one …`).
- Self-contained front-matter envelope: `row_count`, `temporal_coverage`, WGS84 `spatial`, and provenance via new `--ds-source` / `--ds-updated` / `--ds-license` flags (no DCAT/`profile` dependency).
- Sorted `concepts:` index (cheap catalog-scan surface).
- Deterministic `# Example Queries` section: ready-to-run DuckDB SQL + pandas, seeded from the inferred roles (group-by per dimension, monthly time-bucket, measure summary, and a cross-dataset join template keyed on a shared concept).

How it works
- `concept`/`role` added to `DictionaryEntry` + `LlmDictField` (`#[serde(default)]` for cache back-compat). `concept` is deterministically seeded from `content_type` and refined by the LLM; `parse_llm_grain` mirrors `parse_llm_relationships`.
- `--format semanticmd` auto-enables `--infer-content-type` so Concept/Role/grain populate by default ("SemanticMd implies --infer-content-type"). The `--two-pass` refine pass inherits Concept/Role/grain from the first pass via the baseline-preserving merge, so the refine prompt (and its synced Rust const) is untouched.
- semanticmd retains a richer `addl_cols` stats set by default so the per-column statistics block has data without `--addl-cols`.
- `dictionary_prompt` asks the LLM for `role`/`concept` per field and a top-level `grain`.

Also fixes a pre-existing `strip_attribution_block` bug that left a dangling `*Attribution:` line when folding the `--description` response into the semanticmd doc (its test only covered the `---\nGenerated by` form, not the real `*Attribution: Generated by …*` markdown footer).

Reference example
`docs/describegpt/nyc311-describegpt-semanticmd.md` was regenerated with a local LLM (`openai/gpt-oss-20b`): authentic output, not hand-authored.

Testing
- Tests run via `cargo test -F all_features describegpt`, including new tests for role/concept parsing & validation, `parse_llm_grain`, deterministic concept/role seeding, `is_linkable_concept`, and the rewritten semanticmd render assertions.
- `qsv` and `qsvmcp` compile clean; src clippy clean; `cargo +nightly fmt` applied.
- Help docs (`docs/help/describegpt.md`) and the MCP skill JSON regenerated.

Notes / deviations from the original plan (all reduce risk, preserve intent)
- Concept inference is tied to the `--infer-content-type` flag rather than a separate `--no-infer-concepts` (avoids threading a second boolean through ~15 call sites). The terse path remains `--format markdown`.
- The `--concept-vocab` extensibility flag is deferred; `CONCEPT_VOCAB` is built-in for now (the existing `--tag-vocab` loader is the natural future hook).

🤖 Generated with Claude Code
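As a rough illustration of how the role-seeded example queries listed under "Per dataset" could be produced, the sketch below emits one DuckDB group-by per dimension column. The `Role` enum, `quote_ident`, and `group_by_queries` are hypothetical names and the SQL shape is an assumption; the generator in the PR also covers time buckets, measure summaries, and join templates.

```rust
/// Analytical roles named in the PR description.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum Role {
    Dimension,
    Measure,
    Identifier,
    Timestamp,
}

/// Quote a column name as a SQL identifier (embedded `"` doubled).
fn quote_ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}

/// Emit one `GROUP BY` query per dimension column, in input order, so the
/// output is deterministic for a given dictionary.
fn group_by_queries(table_path: &str, columns: &[(&str, Role)]) -> Vec<String> {
    // The file path is embedded as a SQL string literal.
    let table = format!("'{}'", table_path.replace('\'', "''"));
    columns
        .iter()
        .filter(|(_, role)| *role == Role::Dimension)
        .map(|(name, _)| {
            let col = quote_ident(name);
            format!("SELECT {col}, COUNT(*) AS n FROM {table} GROUP BY {col} ORDER BY n DESC;")
        })
        .collect()
}
```

Because the queries depend only on column names and roles, regenerating the document from the same dictionary yields the same section without another LLM call.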