
feat(describegpt): richer semanticmd Data Dictionary for agents & catalogs #3935

Merged: jqnatividad merged 4 commits into master from feat/semanticmd-reference-dictionary on Jun 2, 2026
@jqnatividad
Collaborator

Context

describegpt --format semanticmd emitted a thin Data Dictionary (label/description, a min/max/cardinality/null table, choices, frequency). It was readable for humans but weak for the two use-cases we actually want to serve:

  1. An AI agent generating Jupyter notebooks to query the dataset: the old output gave it no grain, no dimension/measure classification, no join keys, no example queries, and no rich stats.
  2. An AI agent correlating/joining the dataset with OTHER datasets in a catalog: columns had no catalog-wide semantic identity, so the agent couldn't tell that Incident Zip here and a ZIP column elsewhere denote the same thing.

This PR redesigns the format to serve both, generated deterministically from qsv stats/frequency plus the existing LLM label/description pass (every statistic surfaced was already computed).

What's new

Per column

  • Concept — a catalog-wide, namespaced semantic identity (geo.zip_code, time.event_timestamp, id.surrogate_key, nyc.bbl, …) from a new CONCEPT_VOCAB. Two columns in different datasets sharing a concept are join candidates (shared-vocabulary linking — no explicit foreign keys).
  • Role: dimension / measure / identifier / timestamp.
  • Join hint (PK / FK?) + cardinality class (1:1 / N:1) + nullability.
  • Quality flags: placeholder-dates, PII, PII-location, sparse.
  • Rich statistics block (mean, median, stddev, q1/q3, skew, inner fences, sparsity) and extended validation (numeric min/max; text length).
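
The shared-vocabulary linking described in the Concept bullet can be sketched in Rust as follows. This is a minimal illustration, not the PR's implementation: `ColumnRef`, `is_linkable`, and `join_candidates` are hypothetical names, and treating an empty or `unknown` concept as non-linkable is an assumption.

```rust
use std::collections::BTreeMap;

/// One column as it might appear in a dataset's dictionary (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnRef {
    pub dataset: String,
    pub column: String,
    pub concept: String,
}

/// Assumed rule: placeholder concepts carry no identity and are never linked.
pub fn is_linkable(concept: &str) -> bool {
    !concept.is_empty() && concept != "unknown"
}

/// Group columns by concept and keep only concepts seen in two or more datasets.
pub fn join_candidates(columns: &[ColumnRef]) -> BTreeMap<String, Vec<ColumnRef>> {
    let mut by_concept: BTreeMap<String, Vec<ColumnRef>> = BTreeMap::new();
    for col in columns.iter().filter(|c| is_linkable(&c.concept)) {
        by_concept
            .entry(col.concept.clone())
            .or_default()
            .push(col.clone());
    }
    // A concept that appears in only one dataset offers nothing to join against.
    by_concept.retain(|_, cols| {
        let first = &cols[0].dataset;
        cols.iter().any(|c| &c.dataset != first)
    });
    by_concept
}
```

Given `Incident Zip` in one dataset and `ZIP` in another, both tagged `geo.zip_code`, the returned map would hold a single `geo.zip_code` entry listing both columns. The `BTreeMap` keeps the concept keys sorted, which makes the result stable across runs.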

Per dataset

  • A grain statement (one row = one …).
  • A self-contained front-matter envelope: row_count, temporal_coverage, WGS84 spatial, and provenance via new --ds-source / --ds-updated / --ds-license flags (no DCAT/profile dependency).
  • A sorted concepts: index (cheap catalog-scan surface).
  • A # Example Queries section — ready-to-run DuckDB SQL + pandas, seeded from the inferred roles (group-by per dimension, monthly time-bucket, measure summary, and a cross-dataset join template keyed on a shared concept).
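
As a rough sketch of how example queries could be seeded from inferred roles, the Rust below emits one DuckDB statement per column depending on its role. The `Role`, `Column`, and `seed_queries` names are hypothetical, the SQL shapes are illustrative guesses rather than the templates the PR ships, and `table` is assumed to be an already-safe table expression.

```rust
/// Analytical role of a column, mirroring the four roles listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Role {
    Dimension,
    Measure,
    Identifier,
    Timestamp,
}

/// Hypothetical minimal column descriptor.
pub struct Column<'a> {
    pub name: &'a str,
    pub role: Role,
}

/// Double-quote a SQL identifier, doubling any embedded double quotes.
fn ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}

/// Emit one illustrative DuckDB query per dimension, measure, or timestamp column.
/// `table` is assumed to be an already-safe table expression.
pub fn seed_queries(table: &str, columns: &[Column]) -> Vec<String> {
    let mut out = Vec::new();
    for col in columns {
        let c = ident(col.name);
        match col.role {
            Role::Dimension => out.push(format!(
                "SELECT {c}, COUNT(*) AS n FROM {table} GROUP BY {c} ORDER BY n DESC;"
            )),
            Role::Measure => out.push(format!(
                "SELECT MIN({c}), AVG({c}), MAX({c}) FROM {table};"
            )),
            Role::Timestamp => out.push(format!(
                "SELECT date_trunc('month', {c}) AS month, COUNT(*) AS n FROM {table} GROUP BY month ORDER BY month;"
            )),
            // Identifiers are join keys, not something to aggregate over.
            Role::Identifier => {}
        }
    }
    out
}
```

Because the output depends only on column names and roles, the same inputs always produce the same queries, which is what makes such a section deterministic.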

How it works

  • concept/role added to DictionaryEntry + LlmDictField (#[serde(default)] for cache back-compat). concept is deterministically seeded from content_type and refined by the LLM; parse_llm_grain mirrors parse_llm_relationships.
  • --format semanticmd auto-enables --infer-content-type so Concept/Role/grain populate by default ("SemanticMd implies --infer-content-type"). The --two-pass refine pass inherits Concept/Role/grain from the first pass via the baseline-preserving merge, so the refine prompt (and its synced Rust const) is untouched.
  • semanticmd retains a richer addl_cols stats set by default so the per-column statistics block has data without --addl-cols.
  • dictionary_prompt asks the LLM for role/concept per field and a top-level grain.

Also fixes a pre-existing strip_attribution_block bug that left a dangling *Attribution: line when folding the --description response into the semanticmd doc (its test only covered the ---\nGenerated by form, not the real *Attribution: Generated by …* markdown footer).
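
To make the two footer forms concrete, here is a hedged Rust sketch of a stripper that handles both. It is not the actual `strip_attribution_block`; the function name `strip_attribution`, the exact footer wording it matches, and the decision to also drop a dangling `---` rule are all assumptions.

```rust
/// True when a line looks like one of the two attribution footer forms.
fn is_footer(line: &str) -> bool {
    let t = line.trim();
    t.starts_with("*Attribution:") || t.starts_with("Generated by")
}

/// Remove a trailing attribution footer, plus any blank lines or `---` rule
/// left dangling above it. Text without a footer is only right-trimmed.
pub fn strip_attribution(text: &str) -> String {
    let mut lines: Vec<&str> = text.trim_end().lines().collect();
    let footer_found = lines.last().map_or(false, |l| is_footer(*l));
    if !footer_found {
        return text.trim_end().to_string();
    }
    lines.pop();
    while lines.last().map_or(false, |l| {
        let t = l.trim();
        t.is_empty() || t == "---"
    }) {
        lines.pop();
    }
    lines.join("\n")
}
```

Checking for the footer first means a document that merely ends in a horizontal rule is left alone.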

Reference example

docs/describegpt/nyc311-describegpt-semanticmd.md was regenerated with a local LLM (openai/gpt-oss-20b) — authentic output, not hand-authored.

Testing

  • All 77 describegpt unit + integration tests pass (cargo test -F all_features describegpt), including new tests for role/concept parsing & validation, parse_llm_grain, deterministic concept/role seeding, is_linkable_concept, and the rewritten semanticmd render assertions.
  • qsv and qsvmcp compile clean; src clippy clean; cargo +nightly fmt applied.
  • Help (docs/help/describegpt.md) and the MCP skill JSON regenerated.

Notes / deviations from the original plan (all reduce risk, preserve intent)

  • Gating uses the existing --infer-content-type flag rather than a separate --no-infer-concepts (avoids threading a second boolean through ~15 call sites). The terse path remains --format markdown.
  • The refine prompt is intentionally left unchanged; two-pass still gets Concept/Role/grain from the first pass.
  • A --concept-vocab extensibility flag is deferred; CONCEPT_VOCAB is built-in for now (the existing --tag-vocab loader is the natural future hook).

🤖 Generated with Claude Code

…alogs

Redesign `describegpt --format semanticmd` into a reference Semantic Markdown
Data Dictionary aimed at two use-cases: AI agents generating notebooks/queries
over a dataset, and agents correlating/joining datasets across a catalog.

Per-column additions:
- Concept: catalog-wide, namespaced semantic identity (geo.*, time.*, id.*,
  org.*, pii.*, nyc.*, ...) from a new CONCEPT_VOCAB. Matching concept IDs across
  datasets are join candidates (shared-vocabulary linking, no explicit FKs).
- Role: dimension / measure / identifier / timestamp.
- Join hint (PK / FK?) + cardinality class (1:1 / N:1) + nullability.
- Data-quality flags (placeholder-dates, PII, PII-location, sparse).
- Richer stats block (mean, median, stddev, q1/q3, skew, fences, sparsity) and
  extended validation (numeric min/max, text length).

Dataset-level additions: grain statement, self-contained front-matter envelope
(row_count, temporal_coverage, WGS84 spatial, source/updated/license via new
--ds-source/--ds-updated/--ds-license flags), a sorted concepts index, and a
deterministic "# Example Queries" section (DuckDB SQL + pandas) seeded from the
roles.

Implementation:
- concept/role on DictionaryEntry + LlmDictField (#[serde(default)] for cache
  back-compat); deterministically seeded from content_type and refined by the
  LLM; parse_llm_grain mirrors parse_llm_relationships.
- --format semanticmd auto-enables --infer-content-type so concept/role/grain
  populate by default; the refine pass inherits them via the baseline merge.
- semanticmd retains a richer addl_cols stats set by default.
- dictionary_prompt asks for role/concept and a top-level grain.

Also fixes strip_attribution_block leaving a dangling "*Attribution:" line when
folding the --description response into the semanticmd doc.

The nyc311 gold example was regenerated with a local LLM (openai/gpt-oss-20b).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codacy-production

codacy-production Bot commented Jun 2, 2026

Up to standards ✅

🟢 Issues: 0 new issues

🟢 Metrics: complexity 83


jqnatividad and others added 3 commits June 2, 2026 13:11
Resolve three findings from roborev job 2668:

- Medium (frontmatter YAML escaping): register `yaml_scalar` as a MiniJinja
  filter and route every semanticmd frontmatter string scalar (id, title, grain,
  ds_source/updated/license, temporal_coverage + spatial column/value fields,
  concept index) through it, so quotes/colons/newlines/leading indicators can't
  produce invalid or structurally different frontmatter. Same helper already used
  for `tags:`.

- Medium (concept backfill): after the LLM content_type merge, coerce_role_concept
  now backfills an empty concept from concept_from_content_type(content_type) before
  falling back to "unknown", in both single-pass and baseline/refine merge paths.
  Recovers the deterministic mapping when a model (or older cached response) returns
  a valid content_type but omits concept.

- Low (example-query escaping): escape the table path as a SQL string literal,
  column names as SQL identifiers (doubled `"`), and pandas references as Python
  string literals; the pandas read_csv path uses a Python-escaped resource name.

Adds unit tests for the backfill and the SQL/pandas escaping helpers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
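
The three escaping rules named in the Low finding above (SQL string literal, SQL identifier, Python string literal) can be illustrated with the following Rust helpers. The function names are hypothetical and the set of Python escapes handled is an assumption; the PR's own helpers may differ.

```rust
/// Escape a value as a SQL string literal: single-quoted, embedded `'` doubled.
pub fn sql_string_literal(s: &str) -> String {
    format!("'{}'", s.replace('\'', "''"))
}

/// Escape a column name as a SQL identifier: double-quoted, embedded `"` doubled.
pub fn sql_identifier(s: &str) -> String {
    format!("\"{}\"", s.replace('"', "\"\""))
}

/// Escape a value as a double-quoted Python string literal.
pub fn python_string_literal(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for ch in s.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            _ => out.push(ch),
        }
    }
    out.push('"');
    out
}
```

Keeping the three contexts in separate functions avoids the common mistake of reusing SQL quoting inside a pandas snippet, where the backslash, not the doubled quote, is the escape character.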
Resolve two Medium findings from the prior fix:

- dictionary.rs (concept backfill): coerce_role_concept now backfills the
  deterministic concept mapping when the concept is empty OR the literal
  "unknown" (a valid vocab token a model may emit), so a response with
  content_type "zip_code" and concept "unknown" still recovers geo.zip_code.

- describegpt.rs (yaml_scalar): tighten the YAML scalar emitter so values that
  look plain but are YAML implicit-typed (123, 1.5, true/false/yes/no/null/~,
  YYYY-MM-DD timestamps) are double-quoted, preventing frontmatter string values
  (e.g. updated dates, all-digit ids/tags) from being parsed as numbers, bools,
  nulls, or dates. Shared with the tags: block, so numeric/typed tags are now
  quoted too.

Adds a yaml_scalar implicit-type test and an "unknown"-concept backfill test;
updates the tags frontmatter test to expect the quoted numeric tag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
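
The backfill rule from the dictionary.rs item above can be sketched as below. The function names echo the commit message, but the signatures are assumptions, `backfill_concept` is a hypothetical stand-in for the relevant part of `coerce_role_concept`, and the only mapping shown is the `zip_code` to `geo.zip_code` pair that the message itself cites.

```rust
/// Assumed shape of the deterministic content_type to concept mapping.
/// Only the pair cited in the commit message is included here.
pub fn concept_from_content_type(content_type: &str) -> Option<&'static str> {
    match content_type {
        "zip_code" => Some("geo.zip_code"),
        _ => None,
    }
}

/// Keep a meaningful LLM concept; otherwise backfill from content_type,
/// and fall back to "unknown" only when no mapping exists.
pub fn backfill_concept(llm_concept: &str, content_type: &str) -> String {
    let trimmed = llm_concept.trim();
    if !trimmed.is_empty() && trimmed != "unknown" {
        return trimmed.to_string();
    }
    concept_from_content_type(content_type)
        .unwrap_or("unknown")
        .to_string()
}
```

Treating the literal `unknown` the same as an empty string is the point of the second fix: a model that emits the placeholder token no longer blocks the deterministic mapping.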
Resolve the Low finding: is_yaml_implicit_typed (used by yaml_scalar for
semantic-md frontmatter and tags) relied on Rust i64/f64 parsing, which misses
YAML integer spellings that still pass the plain-scalar char check — radix
prefixes (0xFF, 0o17, 0b1010) and `_` digit separators (1_000, 1_000.5). Those
would be emitted bare and parsed by YAML consumers as numbers instead of
strings. Now detected and quoted, while non-numeric identifiers (v_1,
0xZZ_label) stay plain. Extends the yaml_scalar test accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
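
A simplified Rust sketch of the implicit-type detection described in the last two commits follows. It is an approximation under stated assumptions: `is_implicit_typed` and `quote_if_implicit_typed` are hypothetical names, the keyword list is the one from the commit message, and the separate plain-scalar safety check (colons, quotes, leading indicators) is deliberately out of scope here.

```rust
/// Numbers as YAML consumers may read them, including radix prefixes and `_` separators.
fn is_yaml_number(s: &str) -> bool {
    let body = s.strip_prefix(|c: char| c == '+' || c == '-').unwrap_or(s);
    for &(prefix, radix) in [("0x", 16u32), ("0o", 8), ("0b", 2)].iter() {
        if let Some(digits) = body.strip_prefix(prefix) {
            let d: String = digits.chars().filter(|c| *c != '_').collect();
            return !d.is_empty() && d.chars().all(|c| c.is_digit(radix));
        }
    }
    let plain: String = body.chars().filter(|c| *c != '_').collect();
    let starts_numeric = plain
        .chars()
        .next()
        .map_or(false, |c| c.is_ascii_digit() || c == '.');
    starts_numeric && plain.parse::<f64>().is_ok()
}

/// Matches the YYYY-MM-DD shape only; it does not validate the calendar date.
fn is_iso_date(s: &str) -> bool {
    let b = s.as_bytes();
    b.len() == 10
        && b.iter().enumerate().all(|(i, c)| {
            if i == 4 || i == 7 { *c == b'-' } else { c.is_ascii_digit() }
        })
}

/// True when a bare scalar would be parsed as a bool, null, number, or date.
pub fn is_implicit_typed(s: &str) -> bool {
    let lower = s.to_ascii_lowercase();
    let keyword = matches!(lower.as_str(), "true" | "false" | "yes" | "no" | "null" | "~");
    keyword || is_yaml_number(s) || is_iso_date(s)
}

/// Double-quote implicitly typed values; such values contain no quotes or
/// backslashes, so no further escaping is needed in this narrow case.
pub fn quote_if_implicit_typed(s: &str) -> String {
    if is_implicit_typed(s) {
        format!("\"{}\"", s)
    } else {
        s.to_string()
    }
}
```

Under this sketch, `0xFF` and `1_000.5` are quoted while `v_1` and `0xZZ_label` stay plain, matching the behavior the commit message describes. The sketch errs toward over-quoting, which is harmless for string-valued frontmatter.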
Contributor

Copilot AI left a comment

Pull request overview

This PR redesigns describegpt --format semanticmd to produce a richer, more agent-usable Semantic Markdown “data dictionary” by adding catalog-wide concept identifiers, analytical roles, dataset grain/envelope metadata, richer per-column stats/validation blocks, and seeded example queries. It also tightens frontmatter YAML handling and improves attribution stripping so generated docs fold cleanly.

Changes:

  • Added concept + role vocabularies, deterministic seeding/merging logic, linkable-concept detection, and dataset-level grain parsing in the describegpt dictionary pipeline.
  • Expanded SemanticMd rendering to include dataset envelope (row count, grain, temporal/spatial coverage, provenance), per-column join/quality/stats/validation blocks, and example queries.
  • Improved YAML scalar escaping/quoting for SemanticMd frontmatter and fixed attribution block stripping edge cases.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `src/cmd/describegpt/formatters.rs` | Builds richer SemanticMd render model (join hints, quality flags, stats/validation blocks, row_count inference, example queries). |
| `src/cmd/describegpt/dictionary.rs` | Adds concept/role vocab + parsing/merging, deterministic seeding, and dataset grain extraction. |
| `src/cmd/describegpt.rs` | Wires SemanticMd envelope fields into templates, adds --ds-* flags, YAML scalar filter, richer default addl-cols for SemanticMd, and attribution stripping fix. |
| `resources/describegpt_md_defaults.toml` | Updates SemanticMd MiniJinja template to render the new envelope/schema/column blocks and example queries. |
| `resources/describegpt_defaults.toml` | Updates dictionary prompt to request role, concept, and top-level grain. |
| `docs/help/describegpt.md` | Documents SemanticMd enrichments and new --ds-* flags. |
| `docs/describegpt/nyc311-describegpt-semanticmd.md` | Regenerated reference SemanticMd output example. |
| `.claude/skills/qsv/qsv-describegpt.json` | Updates skill metadata and flags to include SemanticMd + --ds-*. |

Comment thread src/cmd/describegpt/formatters.rs
Comment thread resources/describegpt_md_defaults.toml
Comment thread docs/describegpt/nyc311-describegpt-semanticmd.md
@jqnatividad jqnatividad merged commit 1fe11cc into master Jun 2, 2026
29 of 30 checks passed
@jqnatividad jqnatividad deleted the feat/semanticmd-reference-dictionary branch June 2, 2026 20:01
jqnatividad added a commit that referenced this pull request Jun 2, 2026
…ation gating + tests) (#3936)

* address review: semanticmd stats/validation gating & reference frontmatter

- has_stats now considers stats.sparsity, so a column whose only retained
  stat is sparsity (custom --addl-cols-list) still renders the Statistics block
- Validation length constraint renders gracefully when only one of
  min_length/max_length is present, avoiding an empty ### Validation block
- reference SemanticMd doc frontmatter now matches yaml_scalar output:
  URL, spaced title, license, and YYYY-MM-DD date quoted; plain lat/lon unquoted

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
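
The two gating fixes above can be sketched in Rust as follows. `Stats`, `has_stats`, and `length_constraint` here are simplified, hypothetical stand-ins for the real render model; the `- Length >= N` / `- Length <= N` wording comes from the follow-up test description, while the both-bounds wording is a placeholder assumption.

```rust
/// Simplified stand-in for the retained per-column statistics.
#[derive(Debug, Default)]
pub struct Stats {
    pub mean: Option<f64>,
    pub stddev: Option<f64>,
    pub sparsity: Option<f64>,
}

/// The Statistics block renders when any stat survives, sparsity included.
pub fn has_stats(s: &Stats) -> bool {
    s.mean.is_some() || s.stddev.is_some() || s.sparsity.is_some()
}

/// Render the text-length constraint, tolerating a missing bound on either side.
/// Returns None when neither bound exists, so no empty block is emitted.
pub fn length_constraint(min_length: Option<u64>, max_length: Option<u64>) -> Option<String> {
    match (min_length, max_length) {
        // Placeholder wording for the two-sided case (assumption).
        (Some(min), Some(max)) => Some(format!("- Length between {} and {}", min, max)),
        (Some(min), None) => Some(format!("- Length >= {}", min)),
        (None, Some(max)) => Some(format!("- Length <= {}", max)),
        (None, None) => None,
    }
}
```

Returning `Option<String>` lets the caller decide whether to print the `### Validation` heading at all, which is one way to rule out an empty block.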

* test(describegpt): cover semanticmd sparsity-only stats & single-sided length (job 2672)

Adds regression tests for the review-driven edge cases in 818e868:
- formatters: build_semanticmd_entry sets has_stats for a numeric column
  whose only retained stat is sparsity, and has_validation for a text column
  with only one of min_length/max_length
- describegpt: a full semanticmd render asserting the ### Statistics block
  appears for a sparsity-only numeric column and the elif length branches
  emit `- Length >= N` / `- Length <= N` (never an empty ### Validation block)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>