AI and Documentation
Tier: Intermediate
Commands covered: describegpt, synthesize
Note
Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.
The flagship here is describegpt — a neuro-symbolic data dictionary generator and SQL-RAG chat assistant. "Neuro-symbolic" means the heavy lifting (column types, ranges, cardinalities) is done deterministically by qsv's stats and frequency caches; the LLM only fills in human-friendly labels and descriptions. The result: hallucination-resistant data documentation, often produced against a local LLM (Ollama / Jan / LM Studio).
Since 20.1.0, describegpt can also tag each column with a semantic Content Type from a curated 47-token vocabulary (email, phone, street_address, job_title, credit_card, ipv6_address, …). Those tags then drive the new synthesize command, which generates a statistically-faithful fake CSV that mirrors the source's distributions and null rates while substituting realistic fakes for sensitive columns.
For the deep-dive, see docs/Describegpt.md and the output gallery (Markdown, JSON, TOON, Semantic Markdown, Spanish, Mandarin, and SQL-RAG examples).
Note
Looking for color or pro? color moved to Selection & Inspection → color (it's the colorized cousin of table). pro moved to Integrations → qsv pro bridge. Neither has an LLM/AI surface — they're listed here historically.
| If you want to… | Use | Notes |
|---|---|---|
| Generate a data dictionary for a CSV | `describegpt --dictionary` | Outputs deterministic stats + LLM-written labels |
| Tag each column with a semantic Content Type | `describegpt --dictionary --infer-content-type` | 47-token vocabulary; unique_id deterministic for ALL-UNIQUE cols (20.1.0+) |
| Sharpen labels with cross-field awareness | `describegpt --all --two-pass` | Second LLM pass relates fields (e.g. street_no + street_name + city + zip → mailing address) (20.1.0+) |
| Generate description + tags + dictionary | `describegpt --all` | The full "FAIRify" mode |
| Emit the data dictionary as a JSON Schema | `describegpt --dictionary --format jsonschema` | Draft 2020-12; qsv/LLM extras under x-qsv |
| Emit the data dictionary as Semantic Markdown | `describegpt --dictionary --format semanticmd` | Agent-parseable markdown (semanticmd.org) a converter turns into JSON (20.2.0+) |
| Ask a natural-language question about a CSV | `describegpt --prompt "..."` | SQL-RAG sub-mode kicks in when needed |
| Customize the Markdown layout | `describegpt --markdown-template my.toml` | MiniJinja TOML templates per inference kind (20.1.0+) |
| Get multilingual descriptions (Spanish, Mandarin, …) | `describegpt --lang ...` | LLM-driven; quality varies by model |
| Use a local LLM (Ollama / Jan / LM Studio) | `describegpt -u http://localhost:11434/v1 --model ...` | Recommended for sensitive data |
| Generate a statistically-faithful fake CSV | `synthesize` | Preserves distributions + null rates; substitutes realistic fakes (20.1.0+) |
describegpt calls any OpenAI-compatible LLM endpoint with a configurable MiniJinja-templated prompt. The prompt is fed pre-computed summary statistics and frequency distributions from qsv's stats / frequency caches; that is the "symbolic" half. The LLM generates only the natural-language labels and descriptions; that is the "neuro" half. The result is deterministic, reproducible documentation that doesn't hallucinate column names or invent ranges.
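To make the split concrete, here is a minimal Python sketch of the idea, not qsv's implementation. The function names (`profile_column`, `build_prompt`), the sample data, and the prompt wording are all hypothetical; qsv computes far richer statistics and uses its own MiniJinja prompt templates. The point is only that the numbers are computed by code and the LLM is asked for wording alone.

```python
import csv
import io
import json
from collections import Counter

def profile_column(values):
    """Deterministically summarize one column (the 'symbolic' half)."""
    non_null = [v for v in values if v != ""]
    summary = {
        "cardinality": len(set(non_null)),
        "null_count": len(values) - len(non_null),
        "top_values": Counter(non_null).most_common(3),
    }
    if not non_null:
        summary["type"] = "NULL"
        return summary
    try:
        numbers = [float(v) for v in non_null]
    except ValueError:
        summary["type"] = "String"
    else:
        summary["type"] = "Float"
        summary["min"], summary["max"] = min(numbers), max(numbers)
    return summary

def build_prompt(column_name, summary):
    """Ask the LLM only for wording; the facts are already fixed (hypothetical prompt)."""
    return (
        f"Column {column_name!r} has these precomputed facts: "
        f"{json.dumps(summary, sort_keys=True)}. "
        "Write a short human-friendly label and description. "
        "Do not restate or alter the numbers."
    )

# Tiny made-up sample; an empty cell stands for a null
SAMPLE = "borough,complaints\nBRONX,12\nQUEENS,7\nBRONX,\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
columns = {name: [row[name] for row in rows] for name in rows[0]}
profiles = {name: profile_column(vals) for name, vals in columns.items()}
prompt = build_prompt("borough", profiles["borough"])
```

Because `profiles` never passes through the model, re-running the sketch on the same file yields the same statistics regardless of which LLM writes the labels.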
- Dictionary mode (`--dictionary`, `--description`, `--tags`, `--all`): generates documentation for the whole dataset.
- Chat / RAG mode (`--prompt "..."`): answers a natural-language question. If the answer needs more than stats + frequency, qsv enters SQL-RAG sub-mode: it generates a SQL query, runs it against the data (DuckDB if `QSV_DUCKDB_PATH` is set, Polars SQL otherwise), and returns the deterministic answer.
```bash
qsv describegpt data.csv --api-key "$OPENAI_API_KEY" --all
# Writes data.dictionary.json (or similar) and prints a Markdown summary
```

```bash
qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  -u http://localhost:11434/v1 \
  --model deepseek-r1:14b \
  --dictionary
```

```bash
qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  --prompt "What is the most common complaint type?"
```

The LLM consults the frequency table (already cached) and answers without writing any SQL.
```bash
qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  --prompt "What are the top 10 complaint types by community board and borough by year?"
```

Stats alone can't answer this (it needs a group-by × borough × year), so qsv:
- Builds a small random sample as additional LLM context.
- Asks the LLM to write a SQL query that answers the question.
- Runs the query with DuckDB (if `QSV_DUCKDB_PATH` is set) or Polars SQL.
- Returns the actual result, not a guess.

See `docs/describegpt/nyc311-describegpt-prompt.md` for a worked example with the resulting CSV.
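The SQL-RAG steps above can be sketched in a few lines of Python. This is an illustration under loud assumptions: `sqlite3` stands in for DuckDB / Polars SQL, `fake_llm_write_sql` is a hypothetical stub that returns a canned query instead of calling a model, and the rows are made up. What it shows is the shape of the flow: the model only writes the query, and the answer comes from executing it.

```python
import sqlite3

def fake_llm_write_sql(question, sample_rows):
    """Hypothetical stand-in for the LLM call: returns a canned query."""
    return (
        "SELECT borough, complaint_type, COUNT(*) AS n "
        "FROM data GROUP BY borough, complaint_type "
        "ORDER BY n DESC, borough, complaint_type LIMIT 2"
    )

def answer_with_sql(question, rows):
    """Load rows, let the 'LLM' write SQL, then return the executed result."""
    conn = sqlite3.connect(":memory:")  # sqlite3 is a stand-in engine here
    conn.execute("CREATE TABLE data (borough TEXT, complaint_type TEXT)")
    conn.executemany("INSERT INTO data VALUES (?, ?)", rows)
    sample = rows[:2]  # the text describes a random sample; fixed here for brevity
    sql = fake_llm_write_sql(question, sample)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result

# Made-up rows for illustration only
ROWS = [
    ("BRONX", "Noise"), ("BRONX", "Noise"), ("BRONX", "Heating"),
    ("QUEENS", "Noise"), ("QUEENS", "Heating"),
    ("QUEENS", "Heating"), ("QUEENS", "Heating"),
]
top = answer_with_sql("Top complaint types by borough?", ROWS)
```

Since the result is produced by the query engine, the same question against the same data gives the same rows, whatever the model's phrasing.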
The Allegheny property sales session shows three rounds of refinement against the dataset. The final query produces a most-expensive-listings CSV — deterministic, reproducible.
```bash
qsv describegpt NYC_311.csv --all --lang es > nyc311-describegpt-spanish.md
qsv describegpt NYC_311.csv --all --lang zh > nyc311-describegpt-mandarin.md
```

(See docs/describegpt/ for actual Spanish and Mandarin outputs.)

```bash
qsv describegpt NYC_311.csv \
  --tags \
  --tag-vocab data-tag-vocabulary.csv > nyc311-tags.md
```

`--tag-vocab` constrains the LLM to choose from a list of approved tags, which is useful for CKAN harmonization.
describegpt emits Markdown (default), JSON, TOON (Toon Format — a compact JSON encoding designed for LLM prompts), JSONSchema, and SemanticMd (both dictionary-centric — see below). See docs/describegpt/ for examples of each.
describegpt --format jsonschema emits the Data Dictionary as a JSON Schema (draft 2020-12) document instead of Markdown/JSON/TOON. Each column becomes a property carrying its inferred type plus the LLM-written Label and Description. If the description inference also ran, it becomes the schema's top-level description; tags, if generated, land under x-qsv.tags.
qsv- and LLM-specific metadata that JSON Schema has no native slot for — cardinality, null_count, weighted example counts, semantic Content Type, and the extra stats columns — is preserved under a single x-qsv annotation object per property. Unknown x- keywords are ignored by conforming validators (per the 2020-12 spec), so the schema still validates real data while keeping qsv's richer context for tooling that knows to look.
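As a rough illustration of how the `x-qsv` annotations sit beside standard keywords, the Python sketch below builds a hand-written schema fragment and reads it two ways. The exact layout is an assumption: the key names inside `x-qsv` (such as `content_type`), the use of `title` for the Label, and the sample values are hypothetical, and qsv's real output may differ. Only `x-qsv.tags`, the per-property `x-qsv` object, and the draft 2020-12 target come from the description above.

```python
# Hand-written, hypothetical schema fragment (not actual qsv output)
schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "description": "Customer master list.",
    "properties": {
        "email": {
            "type": "string",
            "title": "Email address",
            "description": "Primary contact email.",
            "x-qsv": {"cardinality": 9800, "null_count": 12, "content_type": "email"},
        }
    },
    "additionalProperties": False,
    "x-qsv": {"tags": ["customers", "contacts"]},
}

def strip_extensions(node, is_property_map=False):
    """Return a copy of the schema with every x- keyword removed.

    Keys directly under "properties" are column names, not keywords,
    so they are kept even if a column happens to be named "x-something".
    """
    if isinstance(node, dict):
        return {
            key: strip_extensions(
                value, is_property_map=(key == "properties" and not is_property_map)
            )
            for key, value in node.items()
            if is_property_map or not key.startswith("x-")
        }
    if isinstance(node, list):
        return [strip_extensions(item) for item in node]
    return node

def content_types(schema):
    """What qsv-aware tooling might read back out of the annotations."""
    return {
        name: prop.get("x-qsv", {}).get("content_type")
        for name, prop in schema["properties"].items()
    }

plain = strip_extensions(schema)      # the view a generic validator effectively uses
semantic = content_types(schema)      # the view annotation-aware tooling can use
```

The two views come from the same document, which is why one file can serve both validation and richer tooling.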
```bash
# emit a JSON Schema for customers.csv, enriched with LLM labels & descriptions
qsv describegpt customers.csv --dictionary --format jsonschema -o customers.schema.json
# then validate the data against it with qsv's own validator
qsv validate customers.csv customers.schema.json
```

The jsonschema format requires the dictionary phase (`--dictionary` or `--all`); the `--prompt` chat mode is not supported. Two flags fine-tune the output:
- `--allow-extra-cols`: emit `additionalProperties: true` at the schema root (default is `false`, i.e. strict).
- `--strict-dates`: emit `format: date` / `date-time` for Date/DateTime columns. Off by default because qsv's date inference is permissive (it accepts strings like `June 27, 1968`) while JSON Schema's date formats require RFC 3339; set it only when your columns are guaranteed RFC 3339. Mirrors the same flag on the `schema` command.
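The reason `--strict-dates` is opt-in can be seen in a small Python sketch. Both checkers are hypothetical helpers written for this illustration: `is_permissive_date` is a toy stand-in for lenient date inference (qsv's real inference recognizes many more layouts), and `is_rfc3339_date` approximates the `full-date` form that JSON Schema's `format: date` expects.

```python
import re
from datetime import datetime

RFC3339_FULL_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def is_rfc3339_date(text):
    """True only for a real calendar date written as YYYY-MM-DD."""
    if not RFC3339_FULL_DATE.fullmatch(text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:  # e.g. February 30th
        return False
    return True

def is_permissive_date(text):
    """Toy lenient check: accepts a few common layouts (illustrative only)."""
    for layout in ("%Y-%m-%d", "%B %d, %Y", "%m/%d/%Y"):
        try:
            datetime.strptime(text, layout)
            return True
        except ValueError:
            continue
    return False

value = "June 27, 1968"
inferred_as_date = is_permissive_date(value)   # lenient inference says "date"
passes_strict = is_rfc3339_date(value)         # strict format check rejects it
```

A column full of values like this would be typed as a Date by lenient inference yet fail a schema that declares `format: date`, which is the mismatch the flag's default avoids.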
This pairs naturally with validate — see Recipe: JSON Schema Validation for the validation side.
describegpt --format semanticmd emits the Data Dictionary as a Semantic Markdown document — human-readable markdown with a few lightweight, agent-parseable conventions that a companion converter turns into structured JSON. It's the middle ground between freeform English and a formal schema: a person reads it as a normal markdown report, while an agent gets concrete structural cues.
The document is organized into sections:
- YAML frontmatter: dataset `id` / `title` / `row_count` / `grain`, an optional `temporal_coverage` and `spatial` envelope, a sorted concept index, and `--tags` (when run). String scalars are quoted when needed so YAML consumers don't re-type them: a `2020-12-23` value stays a string, a `source` URL stays text.
- `# Dataset`: the `--description` result (when run), a `## Grain` statement ("one row = …"), and a `Resource / Schema / Title` table.
- `# Schema`: a `Column / Type / Role / Concept / Join? / Null / Label` table (codes wrapped in backticks, `required` prefix for non-null columns) and a heuristically-inferred Primary key (only when exactly one column is structurally unique: `cardinality == rowcount`, no nulls).
- `## Column` subsections: the LLM Description plus bold-key bullets for Concept, Role, Join, and Quality; a `### Validation` block (numeric ranges or text length ranges), `### Choices` (low-cardinality enumerations), and a rich `### Statistics` block (mean, median, quartiles, skewness, inner fences, sparsity).
- `# Resource`: a `## Statistics` table (Column / Min / Max / Cardinality / Null Count) and per-column `### Frequency` tables carrying Choice, Frequency, Percentage, and Rank (sourced from qsv's own `frequency` computation; aggregation buckets like `Other…` / `(NULL)…` render with a blank rank).
What makes it agent- & catalog-ready: every column carries a catalog-wide Concept ID (e.g. geo.zip_code, id.surrogate_key) for cross-dataset join discovery, an analytical Role (dimension / measure / identifier / timestamp), join keys with a 1:1 / N:1 cardinality class, and data-quality flags (PII, PII-location, sparse, placeholder-dates). Because Concept, Role, and grain need semantic context, semanticmd implies --infer-content-type.
Three optional flags populate the frontmatter (SemanticMd-only; omitted when unset, ignored by other formats): --ds-source (source / provenance), --ds-updated (last-updated date), and --ds-license (license).
```bash
# emit a Semantic Markdown data dictionary, enriched with LLM labels, description & tags
qsv describegpt customers.csv --all --format semanticmd \
  --ds-source "https://data.example.gov/catalog/customers" \
  --ds-updated 2026-01-15 \
  --ds-license "CC-BY-4.0" \
  -o customers-dictionary.md
```

Like jsonschema, the semanticmd format requires the dictionary phase (`--dictionary` or `--all`) and does not support the `--prompt` chat mode. When the other inference phases run, the `--description` result becomes the `# Dataset` description and `--tags` are embedded in the YAML frontmatter. Rendering goes through a user-overridable MiniJinja template (`semanticmd_md_body_template`, see `--markdown-template` below). See nyc311-describegpt-semanticmd.md for a full example.
The emitted document's front matter opens with semantic-md: datadict.yaml, a pointer telling the semantic-md converter which schema turns the markdown into structured JSON. qsv ships that schema at docs/describegpt/datadict.yaml — a semantic-md schema that maps each section of the data dictionary (Dataset, Schema + per-column subsections, Resource statistics, and Frequencies) onto a JSON object. Inline-code cells (`column`, `concept`, `resource` …) are re-extracted so their values are stored without literal backticks.
Run the conversion with the semantic-md Python package, and you get the machine-readable JSON committed alongside the markdown as nyc311-describegpt-semanticmd.json — the same content an agent or data catalog would consume, minus the LLM round-trip:
```python
# pip install semantic-md==0.0.2 mistletoe==1.5.1 jsonpatch==1.33 pyyaml==6.0.3
from pathlib import Path

from semantic_md import convert as smd

# Split the YAML front matter from the markdown body, then convert the body
front, body = smd.md_parse_front_matter(Path("nyc311-describegpt-semanticmd.md").read_text())
schema = smd.Schema.read(Path("datadict.yaml").read_text())
data = smd.to_json(smd.md_parse_body(body, schema), schema)  # -> the JSON object
```

A regenerate-and-verify check, `check_semanticmd.py` (modeled on `qsv --generate-help-md`), converts the markdown in-process, asserts a few structural invariants, and diffs the result against the committed JSON artifact. A behavior change in `datadict.yaml`, the semantic-md package, or the semanticmd template therefore fails loudly instead of drifting unnoticed. Run `python3 check_semanticmd.py` to verify, or pass `--update` to refresh the artifact. The toolchain is fully version-pinned for reproducibility.
Note: `datadict.yaml` carries a small two-rule workaround for an upstream semantic-md quirk where a `{var|md}` run reaching end-of-document under-counts consumed tokens by one (dropping qsv's trailing attribution paragraph). The extra trailer rule can be removed once that is fixed upstream.
The default prompt templates are in resources/describegpt_defaults.toml. Copy and edit to fit your organization's documentation style — describegpt --prompt-file my-prompts.toml ....
Pass --infer-content-type and describegpt asks the LLM to tag each column with a semantic Content Type from a curated 47-token vocabulary that covers people (first_name, last_name, full_name, job_title, …), addresses (street_no, street_name, street_address, city, state, state_abbr, zip_code, country, …), contact info (email, phone, username), technical identifiers (ipv4_address, ipv6_address, mac_address, uuid, domain, url), payment (credit_card, iban, currency_code), and more (industry, profession, license_plate, …). The full token list and the faker each token maps to live in src/cmd/synthesize/faker_map.rs.
A Content Type column is appended to the emitted Data Dictionary. Those tags drive synthesize's realistic-fake generation but they're also useful on their own as auto-generated documentation.
Deterministic unique_id. Columns whose cardinality equals their row count (primary keys, surrogate keys, UUIDs, sequence numbers) are tagged unique_id directly by qsv, before the LLM ever sees them — so the label is 100 % reproducible and doesn't drift between LLM versions.
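The rule is simple enough to sketch in Python. This is an illustration of the stated condition (cardinality equals row count), not qsv's code: `is_all_unique` and `pretag_columns` are hypothetical names, and the sketch treats an empty cell as an ordinary value, which may not match how qsv handles nulls.

```python
def is_all_unique(values):
    """True when the number of distinct values equals the number of rows."""
    return len(values) > 0 and len(set(values)) == len(values)

def pretag_columns(columns):
    """Tag all-unique columns up front; None means 'leave it to the LLM'."""
    return {
        name: ("unique_id" if is_all_unique(values) else None)
        for name, values in columns.items()
    }

# Made-up columns for illustration
columns = {
    "customer_id": ["1001", "1002", "1003"],
    "state": ["NY", "NY", "CA"],
}
pretags = pretag_columns(columns)
```

Because the check is pure counting, it returns the same tag on every run, which is what keeps the `unique_id` label independent of the model in use.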
```bash
qsv describegpt customers.csv --dictionary --infer-content-type --format JSON -o customers.dict.json
```

A first-pass LLM call sees each column in isolation, so it can label a column as nothing more than "a 2-letter string column" without realizing that the next column is a zip_code (which makes the first one a state_abbr). Pass --two-pass and describegpt runs a second LLM call that takes the whole first-pass Data Dictionary as JSON context, then refines each field's Label, Description, and (when --infer-content-type is set) Content Type with cross-field awareness. This roughly doubles dictionary LLM cost, so opt in when accuracy matters more than throughput.
This is also the option that unlocks cross-column consistency in synthesize (a synthetic state_abbr will be a real US state abbreviation matched to the row's zip_code).
```bash
qsv describegpt customers.csv --all --two-pass --infer-content-type \
  -u http://localhost:11434/v1 --model deepseek-r1:14b
```

Override the Markdown layout that describegpt emits with a TOML file of MiniJinja templates. There are six optional template fields: `dictionary_md_template`, `description_md_template`, `tags_md_template`, `custom_prompt_md_template`, the per-field `dictionary_md_body_template` that fills the dictionary wrapper's `{{ llm_response }}`, and `semanticmd_md_body_template` (the whole `--format semanticmd` document, added in 20.2.0). Any omitted field falls back to the embedded default, so a minimal TOML can override just the one template you want to change. Custom Jinja filters (`pipe_escape`, `br_replace`, `human_count`, `dict_cell`, `humanize_examples`) are pre-registered and documented inline in resources/describegpt_md_defaults.toml.

```bash
qsv describegpt customers.csv --all --markdown-template team-style.toml > customers-dictionary.md
```

The default description and tags prompts now inline `{{ dictionary }}` directly into the system prompt instead of re-sending the dictionary as a separate chat message. This measurably cuts token usage on multi-phase (`--all`) runs.
See also: /docs/help/describegpt.md, docs/Describegpt.md, Output gallery, Stats Cache & Caching, SQL & Polars, Claude Cowork Plugin, MCP Server, Cookbook → Stats → Insights, Cookbook → Synthesize Fake Data.
New in 20.1.0. Generate a statistically-faithful synthetic CSV from a real one — same value mix, same distribution shape, same null rate — without any of the original records. Useful for sharing test data, populating staging environments, building demos, and benchmarking pipelines without leaking real customer data.
synthesize runs stats and frequency against the source under the hood, then emits N rows that reproduce per-column attributes:
- Categorical / low-cardinality columns are rebuilt by frequency-weighted sampling of the real value set — cardinality, weights, and repetition structure preserved exactly.
- Numeric and date/datetime columns are reproduced with quartile buckets, so the shape of the distribution (not just min/max) is preserved.
- Null ratios are matched per column.
- Unstructured text columns that carry string-length stats (min/max/avg/stddev) get values truncated to the source's character-length distribution.
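The per-column ideas in the list above can be approximated in plain Python. This is a sketch under stated assumptions, not the algorithm synthesize uses: the function names are hypothetical, nulls are modeled as empty strings that are sampled like any other value (so the null rate is matched only approximately), and the quartile step picks one of four equally likely buckets and draws uniformly inside it.

```python
import random
import statistics
from collections import Counter

def synthesize_categorical(source, n, seed):
    """Frequency-weighted sampling of the real value set (nulls included)."""
    rng = random.Random(seed)
    counts = Counter(source)  # "" stands for null, so its share carries over
    values = list(counts)
    weights = [counts[v] for v in values]
    return rng.choices(values, weights=weights, k=n)

def synthesize_numeric(source, n, seed):
    """Draw from quartile buckets so the distribution's shape is kept."""
    rng = random.Random(seed)
    data = sorted(source)
    # method="inclusive" keeps the cut points inside [min, max]
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    edges = [data[0], q1, q2, q3, data[-1]]
    out = []
    for _ in range(n):
        bucket = rng.randrange(4)  # each quartile holds about 25% of the rows
        out.append(rng.uniform(edges[bucket], edges[bucket + 1]))
    return out

# Made-up source columns for illustration
fake_status = synthesize_categorical(["open", "open", "closed", ""], 100, seed=42)
fake_amounts = synthesize_numeric([1, 2, 3, 4, 10], 50, seed=42)
```

Seeding a private `random.Random` instance is what makes repeated runs with the same seed return identical output in this sketch.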
Layer in a Data Dictionary from describegpt --dictionary --infer-content-type --format JSON and each column's Content Type picks a realistic fake-rs faker — email, phone, street_address, uuid, etc. — for non-enumerable columns. --locale picks from 14 fake-rs locales (en, fr_fr, de_de, it_it, pt_br, pt_pt, ja_jp, zh_cn, zh_tw, ar_sa, cy_gb, fa_ir, nl_nl, tr_tr).
Important
Cross-column correlation is not modeled by default. Columns are generated independently. Use describegpt --two-pass to let the LLM detect related fields (e.g. city ↔ zip_code) and refine Content Types — synthesize will then keep those relationships consistent in the generated rows.
Example: pure statistical synthesis (no LLM needed)

```bash
qsv synthesize data.csv -n 1000 --seed 42 > synthetic.csv
```

The `--seed` flag makes the output fully reproducible: same seed, same file, every time.
Example: realistic fakes via a Content-Type-tagged dictionary
```bash
# 1. Build the dictionary once (local LLM keeps the data on-device)
qsv describegpt customers.csv --dictionary --infer-content-type --format JSON \
  -u http://localhost:11434/v1 --model deepseek-r1:14b -o customers.dict.json
# 2. Synthesize 10,000 rows that mirror customers.csv's shape, with realistic fakes
qsv synthesize customers.csv --dictionary customers.dict.json \
  --locale en --seed 42 -n 10000 > customers-fake.csv
```

Example: one-shot (let synthesize build the dictionary itself)

```bash
qsv synthesize data.csv --infer-content-type --seed 42 -n 5000 > synthetic.csv
```

Requires an LLM API key in `QSV_LLM_APIKEY` (or `-u <local-endpoint>` for a local LLM).
Example: stable source → fake mapping (--consistent-fakes)
```bash
qsv synthesize customers.csv --dictionary customers.dict.json \
  --consistent-fakes --seed 42 -n 10000 > customers-fake.csv
```

For structured-faker columns with bounded cardinality, the same source value always produces the same fake. This is useful for deidentified synthesis where you want stable joins on the faked columns.
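One way to get a stable source-to-fake mapping is to derive the fake from a hash of the seed and the source value. The Python sketch below shows that general technique only; how `--consistent-fakes` is actually implemented is not described here, and `consistent_fake` and the name pool are hypothetical.

```python
import hashlib

# Hypothetical pool of replacement values
FAKE_FIRST_NAMES = ["Alex", "Blake", "Casey", "Devon", "Emery", "Finley"]

def consistent_fake(source_value, seed, pool=FAKE_FIRST_NAMES):
    """Map a source value to a pool entry via a seeded hash (stable across runs)."""
    digest = hashlib.sha256(f"{seed}:{source_value}".encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]

# The same real value appears in two tables; both get the same fake,
# so a join on the faked column still lines up.
orders = [("Maria", "order-1"), ("Chen", "order-2")]
tickets = [("Maria", "ticket-9")]
fake_orders = [(consistent_fake(name, 42), ref) for name, ref in orders]
fake_tickets = [(consistent_fake(name, 42), ref) for name, ref in tickets]
```

With a pool this small, two different source values can collide on the same fake; a real mapping would need a pool at least as large as the column's cardinality to avoid that.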
See also: /docs/help/synthesize.md, describegpt Content Types, Cookbook → Synthesize Fake Data, Stats Cache & Caching, src/cmd/synthesize/faker_map.rs — Content Type → faker mapping.
- Command Reference (index)
- Claude Cowork Plugin: qsv as 15 skills + 3 agents for Claude Code
- MCP Server: qsv as a Model Context Protocol server
- qsv pro Spotlight: desktop GUI companion
- docs/Describegpt.md
- SQL & Polars: `describegpt` SQL-RAG runs through `sqlp` / DuckDB
- Stats Cache & Caching: what powers `describegpt`'s and `synthesize`'s symbolic half
- Cookbook → Stats → Insights
- Cookbook → Synthesize Fake Data
- External Resources: "Have we achieved ACI?" blog series
qsv — GitHub · Releases · Discussions · qsv pro · Try it online · Benchmarks · datHere · DeepWiki · Dual-licensed MIT / Unlicense