AI and Documentation

Joel Natividad edited this page Jun 3, 2026 · 10 revisions

Tier: Intermediate · Commands covered: describegpt, synthesize

Note

Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.

The flagship here is describegpt — a neuro-symbolic data dictionary generator and SQL-RAG chat assistant. "Neuro-symbolic" means the heavy lifting (column types, ranges, cardinalities) is done deterministically by qsv's stats and frequency caches; the LLM only fills in human-friendly labels and descriptions. The result: hallucination-resistant data documentation, often produced against a local LLM (Ollama / Jan / LM Studio).

Since 20.1.0, describegpt can also tag each column with a semantic Content Type from a curated 47-token vocabulary (email, phone, street_address, job_title, credit_card, ipv6_address, …). Those tags then drive the new synthesize command, which generates a statistically-faithful fake CSV that mirrors the source's distributions and null rates while substituting realistic fakes for sensitive columns.

For the deep-dive, see docs/Describegpt.md and the output gallery (Markdown, JSON, TOON, Semantic Markdown, Spanish, Mandarin, and SQL-RAG examples).

Note

Looking for color or pro? color moved to Selection & Inspection → color (it's the colorized cousin of table). pro moved to Integrations → qsv pro bridge. Neither has an LLM/AI surface — they're listed here historically.

Quick decision table

| If you want to… | Use | Notes |
| --- | --- | --- |
| Generate a data dictionary for a CSV | describegpt --dictionary | Outputs deterministic stats + LLM-written labels |
| Tag each column with a semantic Content Type | describegpt --dictionary --infer-content-type | 47-token vocabulary; unique_id deterministic for ALL-UNIQUE cols (20.1.0+) |
| Sharpen labels with cross-field awareness | describegpt --all --two-pass | Second LLM pass relates fields (e.g. street_no + street_name + city + zip → mailing address) (20.1.0+) |
| Generate description + tags + dictionary | describegpt --all | The full "FAIRify" mode |
| Emit the data dictionary as a JSON Schema | describegpt --dictionary --format jsonschema | Draft 2020-12; qsv/LLM extras under x-qsv |
| Emit the data dictionary as Semantic Markdown | describegpt --dictionary --format semanticmd | Agent-parseable markdown (semanticmd.org) a converter turns into JSON (20.2.0+) |
| Ask a natural-language question about a CSV | describegpt --prompt "..." | SQL-RAG sub-mode kicks in when needed |
| Customize the Markdown layout | describegpt --markdown-template my.toml | MiniJinja TOML templates per inference kind (20.1.0+) |
| Get multilingual descriptions (Spanish, Mandarin, …) | describegpt --lang ... | LLM-driven; quality varies by model |
| Use a local LLM (Ollama / Jan / LM Studio) | describegpt -u http://localhost:11434/v1 --model ... | Recommended for sensitive data |
| Generate a statistically-faithful fake CSV | synthesize | Preserves distributions + null rates; substitutes realistic fakes (20.1.0+) |

describegpt

Calls any OpenAI-compatible LLM endpoint with a configurable MiniJinja-templated prompt. The prompt is fed pre-computed summary statistics and frequency distribution from qsv's stats / frequency caches — that's the "symbolic" half. The LLM generates only the natural-language labels and descriptions — that's the "neuro" half. Result: deterministic, reproducible documentation that doesn't hallucinate column names or invent ranges.
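To make the split concrete, here is a minimal Python sketch of the idea, not qsv's actual code. The helper names (summarize_columns, build_prompt), the treatment of empty strings as nulls, and the prompt wording are all assumptions made for illustration. The statistics are computed without any model involved, and the prompt only asks for wording.

```python
import csv
import io
import json

def summarize_columns(csv_text: str) -> dict:
    """Compute per-column facts deterministically (the "symbolic" half).

    Assumption for this sketch: an empty string counts as a null, and
    min/max are plain string comparisons.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        non_null = [v for v in values if v != ""]
        summary[name] = {
            "null_count": len(values) - len(non_null),
            "cardinality": len(set(non_null)),
            "min": min(non_null) if non_null else None,
            "max": max(non_null) if non_null else None,
        }
    return summary

def build_prompt(summary: dict) -> str:
    """Ask the model only for labels and descriptions (the "neuro" half)."""
    return (
        "For each column below, write a short Label and Description. "
        "Do not change any statistic.\n" + json.dumps(summary, indent=2)
    )

SAMPLE = "id,borough\n1,QUEENS\n2,BRONX\n3,\n4,QUEENS\n"
stats = summarize_columns(SAMPLE)
prompt = build_prompt(stats)
```

Because the numbers are computed before the model is consulted, rerunning with a different model can change the prose but not the statistics.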

Two modes

  • Dictionary mode (--dictionary, --description, --tags, --all): generates documentation for the whole dataset.
  • Chat / RAG mode (--prompt "..."): answers a natural-language question. If the answer needs more than stats+frequency, qsv enters SQL-RAG sub-mode, in which it generates a SQL query, runs it against the data (DuckDB if QSV_DUCKDB_PATH is set, Polars SQL otherwise), and returns the deterministic answer.

Example: full data dictionary against OpenAI's default model

qsv describegpt data.csv --api-key "$OPENAI_API_KEY" --all
# Writes data.dictionary.json (or similar) and prints a Markdown summary

Example: against a local Ollama instance (no API key, no data leaves your machine)

qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  -u http://localhost:11434/v1 \
  --model deepseek-r1:14b \
  --dictionary

Example: chat — ask a question the stats alone can answer

qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  --prompt "What is the most common complaint type?"

The LLM consults the frequency table (already cached) and answers without writing any SQL.

Example: chat with SQL-RAG sub-mode

qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  --prompt "What are the top 10 complaint types by community board and borough by year?"

Stats alone can't answer this (it needs a group-by across community board × borough × year), so qsv:

  1. Builds a small random sample as additional LLM context.
  2. Asks the LLM to write a SQL query that answers the question.
  3. Runs the query with DuckDB (if QSV_DUCKDB_PATH is set) or Polars SQL.
  4. Returns the actual result, not a guess. See docs/describegpt/nyc311-describegpt-prompt.md for a worked example with the resulting CSV.
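The four steps above can be sketched in Python as follows. This is an illustration under loud assumptions: sqlite3 stands in for DuckDB / Polars SQL, fake_llm_write_sql is a hypothetical stub that returns a canned query instead of calling a model, and the table and rows are made up.

```python
import random
import sqlite3

# Made-up rows: (complaint_type, borough, year)
ROWS = [
    ("Noise", "QUEENS", 2019),
    ("Noise", "QUEENS", 2019),
    ("Heating", "BRONX", 2019),
    ("Noise", "BRONX", 2020),
    ("Heating", "BRONX", 2020),
    ("Heating", "BRONX", 2020),
]

def fake_llm_write_sql(question: str, sample_rows: list) -> str:
    """Hypothetical stand-in for the LLM call: returns a canned query."""
    return (
        "SELECT borough, year, complaint_type, COUNT(*) AS n "
        "FROM complaints GROUP BY borough, year, complaint_type "
        "ORDER BY n DESC, borough, year, complaint_type LIMIT 3"
    )

def answer(question: str) -> list:
    conn = sqlite3.connect(":memory:")  # stand-in engine, not DuckDB/Polars
    conn.execute(
        "CREATE TABLE complaints (complaint_type TEXT, borough TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO complaints VALUES (?, ?, ?)", ROWS)
    # Step 1: a small random sample gives the model a feel for the data.
    sample = random.Random(0).sample(ROWS, 3)
    # Step 2: the model writes SQL; it does not produce the answer itself.
    sql = fake_llm_write_sql(question, sample)
    # Steps 3 and 4: the engine runs the query and its rows are the answer.
    result = conn.execute(sql).fetchall()
    conn.close()
    return result

top = answer("Top complaint types by borough and year?")
```

The point of the design is that the returned rows come from the query engine, so the same query on the same data gives the same answer regardless of the model's wording.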

Example: iterative SQL-RAG session refinement

The Allegheny property sales session shows three rounds of refinement against the dataset. The final query produces a most-expensive-listings CSV — deterministic, reproducible.

Example: multilingual output

qsv describegpt NYC_311.csv --all --lang es > nyc311-describegpt-spanish.md
qsv describegpt NYC_311.csv --all --lang zh > nyc311-describegpt-mandarin.md

(See docs/describegpt/ for actual Spanish and Mandarin outputs.)

Example: controlled tag vocabulary (avoid LLM tag drift)

qsv describegpt NYC_311.csv \
  --tags \
  --tag-vocab data-tag-vocabulary.csv > nyc311-tags.md

--tag-vocab constrains the LLM to choose from a list of approved tags — useful for CKAN harmonization.

Output formats

describegpt emits Markdown (default), JSON, TOON (Toon Format, a compact JSON encoding designed for LLM prompts), and two dictionary-centric formats described below: JSONSchema and SemanticMd. See docs/describegpt/ for examples of each.

JSON Schema output (--format jsonschema)

describegpt --format jsonschema emits the Data Dictionary as a JSON Schema (draft 2020-12) document instead of Markdown/JSON/TOON. Each column becomes a property carrying its inferred type plus the LLM-written Label and Description. If the description inference also ran, it becomes the schema's top-level description; tags, if generated, land under x-qsv.tags.

qsv- and LLM-specific metadata that JSON Schema has no native slot for — cardinality, null_count, weighted example counts, semantic Content Type, and the extra stats columns — is preserved under a single x-qsv annotation object per property. Unknown x- keywords are ignored by conforming validators (per the 2020-12 spec), so the schema still validates real data while keeping qsv's richer context for tooling that knows to look.

# emit a JSON Schema for customers.csv, enriched with LLM labels & descriptions
qsv describegpt customers.csv --dictionary --format jsonschema -o customers.schema.json

# then validate the data against it with qsv's own validator
qsv validate customers.csv customers.schema.json
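For orientation, the following Python sketch shows roughly what such a schema could look like and why the x-qsv annotations do not interfere with validation. It is illustrative only: the column, the key names inside x-qsv, and the choice of title for the Label are assumptions based on the fields named above, not a copy of real describegpt output.

```python
import json

# Illustrative shape only; key names inside "x-qsv" are assumptions.
schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "description": "Customer master list (dataset-level description).",
    "additionalProperties": False,
    "properties": {
        "customer_id": {
            "type": "integer",
            "title": "Customer ID",
            "description": "Surrogate key assigned at signup.",
            "x-qsv": {
                "cardinality": 1000,
                "null_count": 0,
                "content_type": "unique_id",
            },
        },
    },
    "x-qsv": {"tags": ["customers", "crm"]},
}

def strip_extensions(node, is_property_map=False):
    """Drop x- keywords, leaving the part a validator acts on."""
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if is_property_map:
                # Keys here are column names, so keep them even if one
                # happens to start with "x-".
                out[key] = strip_extensions(value)
            elif not key.startswith("x-"):
                out[key] = strip_extensions(
                    value, is_property_map=(key == "properties")
                )
        return out
    if isinstance(node, list):
        return [strip_extensions(item) for item in node]
    return node

core = strip_extensions(schema)
print(json.dumps(core, indent=2))
```

A validator that ignores unknown keywords effectively sees core, while tooling that understands x-qsv can read the richer original.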

The jsonschema format requires the dictionary phase (--dictionary or --all); the --prompt chat mode is not supported. Two flags fine-tune the output:

  • --allow-extra-cols — emit additionalProperties: true at the schema root (default is false — strict).
  • --strict-dates — emit format: date / date-time for Date/DateTime columns. Off by default because qsv's date inference is permissive (it accepts strings like June 27, 1968) while JSON Schema's date formats require RFC 3339 — set it only when your columns are guaranteed RFC 3339. Mirrors the same flag on the schema command.
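Before turning on --strict-dates, it can help to confirm that a date column really is RFC 3339. The Python sketch below is one way to check the full-date form (YYYY-MM-DD); the helper names are hypothetical, and it treats empty strings as nulls that are skipped.

```python
import re
from datetime import date

FULL_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def is_rfc3339_full_date(text: str) -> bool:
    """True only for YYYY-MM-DD strings that name a real calendar day."""
    if not FULL_DATE.fullmatch(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True

def safe_for_strict_dates(values) -> bool:
    """Check every non-empty value in a column."""
    return all(is_rfc3339_full_date(v) for v in values if v != "")

ok_column = ["1968-06-27", "", "2020-12-23"]
loose_column = ["1968-06-27", "June 27, 1968"]
```

Here safe_for_strict_dates(ok_column) passes, while loose_column fails because of the permissive "June 27, 1968" value.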

This pairs naturally with validate — see Recipe: JSON Schema Validation for the validation side.

Semantic Markdown output (--format semanticmd, new in 20.2.0)

describegpt --format semanticmd emits the Data Dictionary as a Semantic Markdown document — human-readable markdown with a few lightweight, agent-parseable conventions that a companion converter turns into structured JSON. It's the middle ground between freeform English and a formal schema: a person reads it as a normal markdown report, while an agent gets concrete structural cues.

The document is organized into sections:

  • YAML frontmatter — dataset id / title / row_count / grain, an optional temporal_coverage and spatial envelope, a sorted concept index, and --tags (when run). String scalars are quoted when needed so YAML consumers don't re-type them — a 2020-12-23 value stays a string, a source URL stays text.
  • # Dataset — the --description result (when run), a ## Grain statement ("one row = …"), and a Resource / Schema / Title table.
  • # Schema — a Column / Type / Role / Concept / Join? / Null / Label table (codes wrapped in backticks, required prefix for non-null columns) and a heuristically-inferred Primary key (only when exactly one column is structurally unique — cardinality == rowcount, no nulls).
  • ## Column subsections — the LLM Description plus bold-key bullets for Concept, Role, Join, and Quality; a ### Validation block (numeric ranges or text length ranges), ### Choices (low-cardinality enumerations), and a rich ### Statistics block (mean, median, quartiles, skewness, inner fences, sparsity).
  • # Resource — a ## Statistics table (Column / Min / Max / Cardinality / Null Count) and per-column ### Frequency tables carrying Choice, Frequency, Percentage, and Rank (sourced from qsv's own frequency computation; aggregation buckets like Other… / (NULL)… render with a blank rank).

What makes it agent- & catalog-ready: every column carries a catalog-wide Concept ID (e.g. geo.zip_code, id.surrogate_key) for cross-dataset join discovery, an analytical Role (dimension / measure / identifier / timestamp), join keys with a 1:1 / N:1 cardinality class, and data-quality flags (PII, PII-location, sparse, placeholder-dates). Because Concept, Role, and grain need semantic context, semanticmd implies --infer-content-type.

Three optional flags populate the frontmatter (SemanticMd-only; omitted when unset, ignored by other formats): --ds-source (source / provenance), --ds-updated (last-updated date), and --ds-license (license).

# emit a Semantic Markdown data dictionary, enriched with LLM labels, description & tags
qsv describegpt customers.csv --all --format semanticmd \
  --ds-source "https://data.example.gov/catalog/customers" \
  --ds-updated 2026-01-15 \
  --ds-license "CC-BY-4.0" \
  -o customers-dictionary.md

Like jsonschema, the semanticmd format requires the dictionary phase (--dictionary or --all) and does not support the --prompt chat mode. When the other inference phases run, the --description result becomes the # Dataset description and --tags are embedded in the YAML frontmatter. Rendering goes through a user-overridable MiniJinja template (semanticmd_md_body_template, see --markdown-template below). See nyc311-describegpt-semanticmd.md for a full example.

Converting to JSON — the datadict.yaml schema

The emitted document's front matter opens with semantic-md: datadict.yaml, a pointer telling the semantic-md converter which schema turns the markdown into structured JSON. qsv ships that schema at docs/describegpt/datadict.yaml — a semantic-md schema that maps each section of the data dictionary (Dataset, Schema + per-column subsections, Resource statistics, and Frequencies) onto a JSON object. Inline-code cells (`column`, `concept`, `resource` …) are re-extracted so their values are stored without literal backticks.

Run the conversion with the semantic-md Python package, and you get the machine-readable JSON committed alongside the markdown as nyc311-describegpt-semanticmd.json — the same content an agent or data catalog would consume, minus the LLM round-trip:

# pip install semantic-md==0.0.2 mistletoe==1.5.1 jsonpatch==1.33 pyyaml==6.0.3
from semantic_md import convert as smd

front, body = smd.md_parse_front_matter(open("nyc311-describegpt-semanticmd.md").read())
schema = smd.Schema.read(open("datadict.yaml").read())
data = smd.to_json(smd.md_parse_body(body, schema), schema)   # -> the JSON object

A regenerate-and-verify check, check_semanticmd.py (modeled on qsv --generate-help-md), converts the markdown in-process, asserts a few structural invariants, and diffs the result against the committed JSON artifact — so a behavior change in datadict.yaml, the semantic-md package, or the semanticmd template fails loudly instead of drifting unnoticed. Run python3 check_semanticmd.py to verify, or --update to refresh the artifact. The toolchain is fully version-pinned for reproducibility.
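The general shape of such a regenerate-and-verify check can be sketched as below. This is not the contents of check_semanticmd.py: convert is a hypothetical stand-in for the real markdown-to-JSON conversion, and the file names are made up. The demo at the bottom uses a temporary directory so it can run anywhere.

```python
import difflib
import json
import sys
import tempfile
from pathlib import Path

def convert(markdown_text: str) -> dict:
    """Hypothetical stand-in for the conversion: collects top-level headings."""
    titles = [
        line[2:].strip()
        for line in markdown_text.splitlines()
        if line.startswith("# ")
    ]
    return {"sections": titles}

def check(md_path: Path, golden_path: Path, update: bool = False) -> bool:
    data = convert(md_path.read_text())
    # Structural invariant: the document must have at least one section.
    assert data["sections"], "expected at least one section"
    actual = json.dumps(data, indent=2, sort_keys=True) + "\n"
    if update:
        golden_path.write_text(actual)  # refresh the committed artifact
        return True
    expected = golden_path.read_text()
    if actual == expected:
        return True
    diff = difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        "committed", "regenerated", lineterm="",
    )
    print("\n".join(diff), file=sys.stderr)  # fail loudly with the diff
    return False

with tempfile.TemporaryDirectory() as tmp:
    md, golden = Path(tmp, "dict.md"), Path(tmp, "dict.json")
    md.write_text("# Dataset\n\n# Schema\n")
    refreshed = check(md, golden, update=True)   # like --update
    unchanged = check(md, golden)                # passes: nothing drifted
    md.write_text("# Dataset\n")
    drifted = check(md, golden)                  # fails: output changed
```

Serializing with sort_keys and a fixed indent keeps the comparison stable, so a failing check reflects a real behavior change rather than formatting noise.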

Note: datadict.yaml carries a small two-rule workaround for an upstream semantic-md quirk where a {var|md} run reaching end-of-document under-counts consumed tokens by one (dropping qsv's trailing attribution paragraph). The extra trailer rule can be removed once that is fixed upstream.

Configurable prompts

The default prompt templates are in resources/describegpt_defaults.toml. Copy and edit to fit your organization's documentation style — describegpt --prompt-file my-prompts.toml ....

Content Types — semantic column labels (new in 20.1.0)

Pass --infer-content-type and describegpt asks the LLM to tag each column with a semantic Content Type from a curated 47-token vocabulary that covers people (first_name, last_name, full_name, job_title, …), addresses (street_no, street_name, street_address, city, state, state_abbr, zip_code, country, …), contact info (email, phone, username), technical identifiers (ipv4_address, ipv6_address, mac_address, uuid, domain, url), payment (credit_card, iban, currency_code), and more (industry, profession, license_plate, …). The full token list and the faker each token maps to live in src/cmd/synthesize/faker_map.rs.

A Content Type column is appended to the emitted Data Dictionary. Those tags drive synthesize's realistic-fake generation but they're also useful on their own as auto-generated documentation.

Deterministic unique_id. Columns whose cardinality equals their row count (primary keys, surrogate keys, UUIDs, sequence numbers) are tagged unique_id directly by qsv, before the LLM ever sees them, so the label is 100% reproducible and doesn't drift between LLM versions.

qsv describegpt customers.csv --dictionary --infer-content-type --format JSON -o customers.dict.json
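As a rough Python illustration of the deterministic unique_id step (not qsv's code; pretag_unique_ids and the stats shape are assumptions), the all-unique columns can be tagged from statistics alone, leaving only the remaining columns for the model to classify:

```python
def pretag_unique_ids(column_stats: dict, row_count: int) -> tuple[dict, list]:
    """Split columns into those tagged from stats and those left for the LLM."""
    tagged = {}
    needs_llm = []
    for name, stats in column_stats.items():
        if stats["cardinality"] == row_count:
            tagged[name] = "unique_id"  # decided without any model call
        else:
            needs_llm.append(name)
    return tagged, needs_llm

stats = {
    "customer_id": {"cardinality": 1000},
    "email": {"cardinality": 990},
    "state": {"cardinality": 48},
}
tagged, needs_llm = pretag_unique_ids(stats, row_count=1000)
```

In this made-up example only email and state would need a Content Type from the model.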

--two-pass cross-field refinement (new in 20.1.0)

A first-pass LLM call sees each column in isolation, so it may label a column as merely "a 2-letter string" without realizing that the next column is a zip_code (which makes the first one a state_abbr). Pass --two-pass and describegpt runs a second LLM call that takes the whole first-pass Data Dictionary as JSON context, then refines each field's Label, Description, and (when --infer-content-type is set) Content Type with cross-field awareness. This roughly doubles the dictionary LLM cost, so opt in when accuracy matters more than throughput.

This is also the option that unlocks cross-column consistency in synthesize (a synthetic state_abbr will be a real US state abbreviation matched to the row's zip_code).

qsv describegpt customers.csv --all --two-pass --infer-content-type \
  -u http://localhost:11434/v1 --model deepseek-r1:14b
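The two-pass flow can be sketched in Python with stubs in place of real model calls. Everything here is hypothetical: fake_first_pass and fake_second_pass use hard-coded rules purely to show where cross-field context enters, and the field names (name, max_length, content_type) are made up for the example.

```python
import json

def fake_first_pass(column: dict) -> str:
    """Hypothetical stub: judges one column with no view of its neighbors."""
    if column["name"] == "zip":
        return "zip_code"
    return "unknown"  # a 2-letter string alone is ambiguous

def fake_second_pass(dictionary_json: str) -> dict:
    """Hypothetical stub: refines tags using the whole first-pass dictionary."""
    fields = json.loads(dictionary_json)
    types = {f["name"]: f["content_type"] for f in fields}
    if "zip_code" in types.values():
        for f in fields:
            if f["content_type"] == "unknown" and f["max_length"] == 2:
                # A 2-letter column next to a zip code is likely a state.
                types[f["name"]] = "state_abbr"
    return types

columns = [
    {"name": "st", "max_length": 2},
    {"name": "zip", "max_length": 5},
]
first = [{**c, "content_type": fake_first_pass(c)} for c in columns]
refined = fake_second_pass(json.dumps(first))
```

The second call receives the first pass's output as JSON, which is what lets it relate st to zip.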

--markdown-template for customizable Markdown output (new in 20.1.0)

Override the Markdown layout that describegpt emits with a TOML file of MiniJinja templates. Six optional template fields are available: dictionary_md_template, description_md_template, tags_md_template, custom_prompt_md_template, the per-field dictionary_md_body_template that fills the dictionary wrapper's {{ llm_response }}, and semanticmd_md_body_template (the whole --format semanticmd document, added in 20.2.0). Any omitted field falls back to the embedded default, so a minimal TOML can override just the one template you want to change. Custom Jinja filters (pipe_escape, br_replace, human_count, dict_cell, humanize_examples) are pre-registered and documented inline in resources/describegpt_md_defaults.toml.

qsv describegpt customers.csv --all --markdown-template team-style.toml > customers-dictionary.md

Lower LLM cost in 20.1.0

The default description and tags prompts now inline {{ dictionary }} directly into the system prompt instead of re-sending the dictionary as a separate chat message. This measurably cuts token usage on multi-phase (--all) runs.

See also: /docs/help/describegpt.md, docs/Describegpt.md, Output gallery, Stats Cache & Caching, SQL & Polars, Claude Cowork Plugin, MCP Server, Cookbook → Stats → Insights, Cookbook → Synthesize Fake Data.

synthesize

New in 20.1.0. Generate a statistically-faithful synthetic CSV from a real one — same value mix, same distribution shape, same null rate — without any of the original records. Useful for sharing test data, populating staging environments, building demos, and benchmarking pipelines without leaking real customer data.

synthesize runs stats and frequency against the source under the hood, then emits N rows that reproduce per-column attributes:

  • Categorical / low-cardinality columns are rebuilt by frequency-weighted sampling of the real value set — cardinality, weights, and repetition structure preserved exactly.
  • Numeric and date/datetime columns are reproduced with quartile buckets, so the shape of the distribution (not just min/max) is preserved.
  • Null ratios are matched per column.
  • Unstructured text columns that carry string-length stats (min/max/avg/stddev) get values truncated to the source's character-length distribution.
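Two of these ideas, frequency-weighted sampling with a matched null ratio and quartile-bucket sampling for numeric columns, can be sketched in Python as follows. This is a simplified illustration, not synthesize's algorithm: the function names, the uniform draw within each quartile bucket, and the sample data are assumptions.

```python
import random

def sample_categorical(freq: dict, null_ratio: float, n: int,
                       rng: random.Random) -> list:
    """Draw n values weighted by observed frequency, with nulls mixed in."""
    values = list(freq)
    weights = list(freq.values())
    out = []
    for _ in range(n):
        if rng.random() < null_ratio:
            out.append("")  # empty string stands for a null
        else:
            out.append(rng.choices(values, weights=weights, k=1)[0])
    return out

def sample_numeric(quartiles: tuple, n: int, rng: random.Random) -> list:
    """quartiles = (min, q1, median, q3, max); each bucket holds 25% of rows."""
    buckets = list(zip(quartiles, quartiles[1:]))
    out = []
    for _ in range(n):
        low, high = rng.choice(buckets)     # each bucket is equally likely
        out.append(rng.uniform(low, high))  # assumption: uniform inside it
    return out

FREQ = {"QUEENS": 60, "BRONX": 30, "BROOKLYN": 10}
rng = random.Random(42)
boroughs = sample_categorical(FREQ, 0.1, 1000, rng)
ages = sample_numeric((18, 27, 35, 48, 90), 1000, rng)

# Same seed, same output: the basis of reproducible synthesis.
boroughs_again = sample_categorical(FREQ, 0.1, 1000, random.Random(42))
```

Picking a quartile bucket first is what preserves the distribution's shape: a long upper tail (48 to 90 here) still receives only a quarter of the rows.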

Layer in a Data Dictionary from describegpt --dictionary --infer-content-type --format JSON and each column's Content Type picks a realistic fake-rs faker — email, phone, street_address, uuid, etc. — for non-enumerable columns. --locale picks from 14 fake-rs locales (en, fr_fr, de_de, it_it, pt_br, pt_pt, ja_jp, zh_cn, zh_tw, ar_sa, cy_gb, fa_ir, nl_nl, tr_tr).

Important

Cross-column correlation is not modeled by default. Columns are generated independently. Use describegpt --two-pass to let the LLM detect related fields (e.g. city ↔ zip_code) and refine Content Types; synthesize will then keep those relationships consistent in the generated rows.

Example: pure statistical synthesis — no LLM needed

qsv synthesize data.csv -n 1000 --seed 42 > synthetic.csv

The --seed flag makes the output fully reproducible — same seed, same file, every time.

Example: realistic fakes via a Content-Type-tagged dictionary

# 1. Build the dictionary once (local LLM keeps the data on-device)
qsv describegpt customers.csv --dictionary --infer-content-type --format JSON \
  -u http://localhost:11434/v1 --model deepseek-r1:14b -o customers.dict.json

# 2. Synthesize 10,000 rows that mirror customers.csv's shape, with realistic fakes
qsv synthesize customers.csv --dictionary customers.dict.json \
  --locale en --seed 42 -n 10000 > customers-fake.csv

Example: one-shot — let synthesize build the dictionary itself

qsv synthesize data.csv --infer-content-type --seed 42 -n 5000 > synthetic.csv

Requires an LLM API key in QSV_LLM_APIKEY (or -u <local-endpoint> for a local LLM).

Example: stable source → fake mapping (--consistent-fakes)

qsv synthesize customers.csv --dictionary customers.dict.json \
  --consistent-fakes --seed 42 -n 10000 > customers-fake.csv

For structured-faker columns with bounded cardinality, the same source value always produces the same fake — useful for deidentified synthesis where you want stable joins on the faked columns.
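One common way to get a stable source-to-fake mapping is to derive the fake from a hash of the seed and the source value. The Python sketch below shows that general technique; it is not necessarily how --consistent-fakes is implemented, and fake_email with its tiny name and domain lists is hypothetical.

```python
import hashlib
import random

FIRST_NAMES = ["alex", "sam", "kim", "lee"]
DOMAINS = ["example.com", "example.org"]

def fake_email(source_value: str, seed: int) -> str:
    """Map a source value to a fake that is stable for a given seed."""
    digest = hashlib.sha256(f"{seed}:{source_value}".encode()).digest()
    # Seed a private generator from the hash, so the result depends only
    # on (seed, source_value) and not on call order.
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    name = rng.choice(FIRST_NAMES)
    number = rng.randrange(1000)
    return f"{name}{number:03d}@{rng.choice(DOMAINS)}"

orders = ["ann@real.test", "bob@real.test", "ann@real.test"]
faked = [fake_email(value, seed=42) for value in orders]
```

Both occurrences of ann@real.test map to the same fake, so joins on the faked column still line up. With a fake space this small, two different source values can collide; a real faker would draw from a much larger space.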

See also: /docs/help/synthesize.md, describegpt Content Types, Cookbook → Synthesize Fake Data, Stats Cache & Caching, src/cmd/synthesize/faker_map.rs — Content Type → faker mapping.
