AI and Documentation
Tier: Intermediate
Commands covered: describegpt, synthesize
Note
Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.
The flagship here is describegpt — a neuro-symbolic data dictionary generator and SQL-RAG chat assistant. "Neuro-symbolic" means the heavy lifting (column types, ranges, cardinalities) is done deterministically by qsv's stats and frequency caches; the LLM only fills in human-friendly labels and descriptions. The result: hallucination-resistant data documentation, often produced against a local LLM (Ollama / Jan / LM Studio).
Since 20.1.0, describegpt can also tag each column with a semantic Content Type from a curated 47-token vocabulary (email, phone, street_address, job_title, credit_card, ipv6_address, …). Those tags then drive the new synthesize command, which generates a statistically-faithful fake CSV that mirrors the source's distributions and null rates while substituting realistic fakes for sensitive columns.
For the deep-dive, see docs/Describegpt.md and the output gallery (Markdown, JSON, TOON, Spanish, Mandarin, and SQL-RAG examples).
Note
Looking for color or pro? color moved to Selection & Inspection → color (it's the colorized cousin of table). pro moved to Integrations → qsv pro bridge. Neither has an LLM/AI surface — they're listed here historically.
| If you want to… | Use | Notes |
|---|---|---|
| Generate a data dictionary for a CSV | describegpt --dictionary | Outputs deterministic stats + LLM-written labels |
| Tag each column with a semantic Content Type | describegpt --dictionary --infer-content-type | 47-token vocabulary; unique_id deterministic for ALL-UNIQUE cols (20.1.0+) |
| Sharpen labels with cross-field awareness | describegpt --all --two-pass | Second LLM pass relates fields (e.g. street_no + street_name + city + zip → mailing address) (20.1.0+) |
| Generate description + tags + dictionary | describegpt --all | The full "FAIRify" mode |
| Emit the data dictionary as a JSON Schema | describegpt --dictionary --format jsonschema | Draft 2020-12; qsv/LLM extras under x-qsv |
| Ask a natural-language question about a CSV | describegpt --prompt "..." | SQL-RAG sub-mode kicks in when needed |
| Customize the Markdown layout | describegpt --markdown-template my.toml | MiniJinja TOML templates per inference kind (20.1.0+) |
| Get multilingual descriptions (Spanish, Mandarin, …) | describegpt --lang ... | LLM-driven; quality varies by model |
| Use a local LLM (Ollama / Jan / LM Studio) | describegpt -u http://localhost:11434/v1 --model ... | Recommended for sensitive data |
| Generate a statistically-faithful fake CSV | synthesize | Preserves distributions + null rates; substitutes realistic fakes (20.1.0+) |
describegpt calls any OpenAI-compatible LLM endpoint with a configurable MiniJinja-templated prompt. The prompt is fed pre-computed summary statistics and frequency distributions from qsv's stats / frequency caches: that's the "symbolic" half. The LLM generates only the natural-language labels and descriptions: that's the "neuro" half. The result is documentation whose facts (column names, types, ranges) are deterministic and reproducible, because the LLM never gets the chance to hallucinate column names or invent ranges.
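To make the split concrete, here is a minimal Python sketch of the idea. It is not qsv's implementation: the function names (`column_facts`, `build_prompt`), the sample data, and the prompt wording are all hypothetical. The point is only that the facts are computed by ordinary code, and the model is asked for prose around them.

```python
import csv
import io
import json

# Hypothetical three-row dataset; an empty cell stands for a null.
SAMPLE_CSV = """id,borough,age
1,BRONX,34
2,QUEENS,
3,BRONX,51
"""

def column_facts(csv_text):
    """Compute per-column facts deterministically (the "symbolic" half)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    facts = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        non_null = [v for v in values if v != ""]
        try:
            numbers = [float(v) for v in non_null]
        except ValueError:
            numbers = []  # not a numeric column, so no range is reported
        facts[name] = {
            "cardinality": len(set(non_null)),
            "null_count": len(values) - len(non_null),
            "min": min(numbers) if numbers else None,
            "max": max(numbers) if numbers else None,
        }
    return facts

def build_prompt(facts):
    """Wrap the computed facts in an instruction for the LLM (the "neuro" half)."""
    return (
        "Write a label and a one-sentence description for each column. "
        "Do not change or add any statistics.\n"
        + json.dumps(facts, indent=2, sort_keys=True)
    )

facts = column_facts(SAMPLE_CSV)
prompt = build_prompt(facts)
print(prompt)
```

Running the sketch twice on the same input yields the same `facts`, whichever model later writes the labels.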
- Dictionary mode (--dictionary, --description, --tags, --all): generates documentation for the whole dataset.
- Chat / RAG mode (--prompt "..."): answers a natural-language question. If the answer needs more than stats + frequency, qsv enters SQL-RAG sub-mode: it generates a SQL query, runs it against the data (DuckDB if QSV_DUCKDB_PATH is set, Polars SQL otherwise), and returns the deterministic answer.
qsv describegpt data.csv --api-key "$OPENAI_API_KEY" --all
# Writes data.dictionary.json (or similar) and prints a Markdown summary

qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  -u http://localhost:11434/v1 \
  --model deepseek-r1:14b \
  --dictionary

qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  --prompt "What is the most common complaint type?"

The LLM consults the frequency table (already cached) and answers without writing any SQL.

qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  --prompt "What are the top 10 complaint types by community board and borough by year?"

Stats alone can't answer this (it needs a group-by × borough × year), so qsv:
- Builds a small random sample as additional LLM context.
- Asks the LLM to write a SQL query that answers the question.
- Runs the query with DuckDB (if QSV_DUCKDB_PATH is set) or Polars SQL.
- Returns the actual result, not a guess.

See docs/describegpt/nyc311-describegpt-prompt.md for a worked example with the resulting CSV.
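The SQL-RAG steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not what qsv runs: `sqlite3` stands in for DuckDB / Polars SQL, the six rows are made up, and `llm_written_sql` is a hand-written stand-in for the query an LLM would return.

```python
import sqlite3

# Made-up rows: (complaint_type, borough, year).
rows = [
    ("Noise", "BRONX", 2019),
    ("Noise", "BRONX", 2019),
    ("Heating", "BRONX", 2019),
    ("Noise", "QUEENS", 2020),
    ("Heating", "QUEENS", 2020),
    ("Heating", "QUEENS", 2020),
]

# Hypothetical stand-in for the SQL the LLM would write for the question.
llm_written_sql = """
SELECT borough, year, complaint_type, COUNT(*) AS n
FROM complaints
GROUP BY borough, year, complaint_type
ORDER BY borough, year, n DESC, complaint_type
"""

def run_query(sql, data):
    """Load the data into an in-memory database and execute the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE complaints (complaint_type TEXT, borough TEXT, year INTEGER)"
        )
        conn.executemany("INSERT INTO complaints VALUES (?, ?, ?)", data)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()

# The answer is produced by the SQL engine, not by the model.
result = run_query(llm_written_sql, rows)
for row in result:
    print(row)
```

The model's only job here is to produce the query text; the counts come from executing it, so re-running the same query on the same data returns the same rows.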
The Allegheny property sales session shows three rounds of refinement against the dataset. The final query produces a most-expensive-listings CSV — deterministic, reproducible.
qsv describegpt NYC_311.csv --all --lang es > nyc311-describegpt-spanish.md
qsv describegpt NYC_311.csv --all --lang zh > nyc311-describegpt-mandarin.md

(See docs/describegpt/ for actual Spanish and Mandarin outputs.)
qsv describegpt NYC_311.csv \
  --tags \
  --tag-vocab data-tag-vocabulary.csv > nyc311-tags.md

--tag-vocab constrains the LLM to choose from a list of approved tags, which is useful for CKAN harmonization.
describegpt emits Markdown (default), JSON, TOON (Token-Oriented Object Notation, a compact encoding of JSON data designed for LLM prompts), and JSON Schema (see below). See docs/describegpt/ for examples of each.
describegpt --format jsonschema emits the Data Dictionary as a JSON Schema (draft 2020-12) document instead of Markdown/JSON/TOON. Each column becomes a property carrying its inferred type plus the LLM-written Label and Description. If the description inference also ran, it becomes the schema's top-level description; tags, if generated, land under x-qsv.tags.
qsv- and LLM-specific metadata that JSON Schema has no native slot for — cardinality, null_count, weighted example counts, semantic Content Type, and the extra stats columns — is preserved under a single x-qsv annotation object per property. Unknown x- keywords are ignored by conforming validators (per the 2020-12 spec), so the schema still validates real data while keeping qsv's richer context for tooling that knows to look.
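As an illustration of how tooling might read those annotations, here is a small Python sketch. The schema below is hand-written and hypothetical: the numbers are invented, and both the `content_type` key name and the use of `title` / `description` for the Label and Description are assumptions, not confirmed output of describegpt. Only `x-qsv`, `cardinality`, and `null_count` are named on this page.

```python
# Hypothetical schema in the shape described above; values are invented.
schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "email": {
            "type": "string",
            "title": "Email",  # assumed slot for the LLM-written Label
            "description": "Customer contact email.",
            "x-qsv": {"cardinality": 9812, "null_count": 14, "content_type": "email"},
        },
        "age": {
            "type": "integer",
            "title": "Age",
            "x-qsv": {"cardinality": 71, "null_count": 0},
        },
    },
}

def qsv_extras(schema):
    """Collect the per-property x-qsv objects for tooling that knows to look."""
    return {
        name: prop["x-qsv"]
        for name, prop in schema.get("properties", {}).items()
        if "x-qsv" in prop
    }

def strip_extensions(node):
    """Drop every x-* keyword, leaving roughly what a validator acts on.

    Caveat: this sketch would also drop a real column whose name starts
    with "x-"; a production tool would need to treat "properties" specially.
    """
    if isinstance(node, dict):
        return {
            key: strip_extensions(value)
            for key, value in node.items()
            if not key.startswith("x-")
        }
    if isinstance(node, list):
        return [strip_extensions(item) for item in node]
    return node

extras = qsv_extras(schema)
validation_view = strip_extensions(schema)
print(extras["email"]["null_count"])
print(validation_view["properties"]["email"])
```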
# emit a JSON Schema for customers.csv, enriched with LLM labels & descriptions
qsv describegpt customers.csv --dictionary --format jsonschema -o customers.schema.json
# then validate the data against it with qsv's own validator
qsv validate customers.csv customers.schema.json

The jsonschema format requires the dictionary phase (--dictionary or --all); the --prompt chat mode is not supported. Two flags fine-tune the output:
- --allow-extra-cols: emit additionalProperties: true at the schema root (default is false, i.e. strict).
- --strict-dates: emit format: date / date-time for Date/DateTime columns. Off by default because qsv's date inference is permissive (it accepts strings like June 27, 1968) while JSON Schema's date formats require RFC 3339. Set it only when your columns are guaranteed RFC 3339. Mirrors the same flag on the schema command.
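Before turning on --strict-dates, it can help to check whether a column really is RFC 3339. The sketch below is a hypothetical pre-flight check in Python, not part of qsv; it covers only the `date` format (RFC 3339 full-date, `YYYY-MM-DD`), and the function names are made up for illustration.

```python
import re
from datetime import date

# RFC 3339 full-date: four-digit year, two-digit month, two-digit day.
_FULL_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")

def is_rfc3339_full_date(value):
    """Return True only for a real calendar date written as YYYY-MM-DD.

    Note: Python's date type rejects year 0000, so this sketch does too.
    """
    match = _FULL_DATE.fullmatch(value)
    if match is None:
        return False
    try:
        date(int(match[1]), int(match[2]), int(match[3]))
    except ValueError:
        return False  # right shape, but not a real date (e.g. February 30)
    return True

def safe_for_strict_dates(values):
    """True when every non-empty value in the column is an RFC 3339 full-date."""
    return all(is_rfc3339_full_date(v) for v in values if v != "")

print(is_rfc3339_full_date("1968-06-27"))
print(is_rfc3339_full_date("June 27, 1968"))
print(safe_for_strict_dates(["1968-06-27", "", "June 27, 1968"]))
```

A column containing `June 27, 1968` is a valid date to qsv's permissive inference but would fail a validator that enforces `format: date`, which is why the flag is opt-in.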
This pairs naturally with validate — see Recipe: JSON Schema Validation for the validation side.
The default prompt templates are in resources/describegpt_defaults.toml. Copy and edit to fit your organization's documentation style — describegpt --prompt-file my-prompts.toml ....
Pass --infer-content-type and describegpt asks the LLM to tag each column with a semantic Content Type from a curated 47-token vocabulary that covers people (first_name, last_name, full_name, job_title, …), addresses (street_no, street_name, street_address, city, state, state_abbr, zip_code, country, …), contact info (email, phone, username), technical identifiers (ipv4_address, ipv6_address, mac_address, uuid, domain, url), payment (credit_card, iban, currency_code), and more (industry, profession, license_plate, …). The full token list and the faker each token maps to live in src/cmd/synthesize/faker_map.rs.
A Content Type column is appended to the emitted Data Dictionary. Those tags drive synthesize's realistic-fake generation but they're also useful on their own as auto-generated documentation.
Deterministic unique_id. Columns whose cardinality equals their row count (primary keys, surrogate keys, UUIDs, sequence numbers) are tagged unique_id directly by qsv, before the LLM ever sees them, so the label is 100% reproducible and doesn't drift between LLM versions.
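The rule is simple enough to state in code. The following Python sketch shows the cardinality-equals-row-count test; it is an illustration, not qsv's implementation, and the names `is_unique_id` and `pretag` as well as the sample columns are hypothetical. How qsv treats nulls in this check is not stated on this page, so the sketch simply counts every cell as a value.

```python
def is_unique_id(values):
    """A column is all-unique when its cardinality equals its row count."""
    return len(values) > 0 and len(set(values)) == len(values)

def pretag(columns):
    """Tag all-unique columns before any LLM call; leave the rest untagged."""
    return {
        name: ("unique_id" if is_unique_id(values) else None)
        for name, values in columns.items()
    }

# Made-up columns for illustration.
columns = {
    "customer_id": ["c-001", "c-002", "c-003", "c-004"],
    "state_abbr": ["NY", "NJ", "NY", "CA"],
}

tags = pretag(columns)
print(tags)
```

Because the tag depends only on counts, it comes out the same on every run, and only the columns left as `None` would need an LLM-assigned Content Type.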
qsv describegpt customers.csv --dictionary --infer-content-type --format JSON -o customers.dict.json

A first-pass LLM call sees each column in isolation, so it can settle for a generic label such as "2-letter string column" without realizing that the next column is a zip_code, which would make the first one a state_abbr. Pass --two-pass and describegpt runs a second LLM call that takes the whole first-pass Data Dictionary as JSON context, then refines each field's Label, Description, and (when --infer-content-type is set) Content Type with cross-field awareness. This roughly doubles dictionary LLM cost, so opt in when accuracy matters more than throughput.
This is also the option that unlocks cross-column consistency in synthesize (a synthetic state_abbr will be a real US state abbreviation matched to the row's zip_code).
qsv describegpt customers.csv --all --two-pass --infer-content-type \
  -u http://localhost:11434/v1 --model deepseek-r1:14b

Override the Markdown layout that describegpt emits with a TOML file of MiniJinja templates. There are five optional template fields: dictionary_md_template, description_md_template, tags_md_template, custom_prompt_md_template, and the per-field dictionary_md_body_template that fills the dictionary wrapper's {{ llm_response }}. Any omitted field falls back to the embedded default, so a minimal TOML can override just the one template you want to change. Custom Jinja filters (pipe_escape, br_replace, human_count, dict_cell, humanize_examples) are pre-registered and documented inline in resources/describegpt_md_defaults.toml.
qsv describegpt customers.csv --all --markdown-template team-style.toml > customers-dictionary.md

The default description and tags prompts now inline {{ dictionary }} directly into the system prompt instead of re-sending the dictionary as a separate chat message. This measurably cuts token usage on multi-phase (--all) runs.
See also: /docs/help/describegpt.md, docs/Describegpt.md, Output gallery, Stats Cache & Caching, SQL & Polars, Claude Cowork Plugin, MCP Server, Cookbook → Stats → Insights, Cookbook → Synthesize Fake Data.
New in 20.1.0. Generate a statistically-faithful synthetic CSV from a real one — same value mix, same distribution shape, same null rate — without any of the original records. Useful for sharing test data, populating staging environments, building demos, and benchmarking pipelines without leaking real customer data.
synthesize runs stats and frequency against the source under the hood, then emits N rows that reproduce per-column attributes:
- Categorical / low-cardinality columns are rebuilt by frequency-weighted sampling of the real value set — cardinality, weights, and repetition structure preserved exactly.
- Numeric and date/datetime columns are reproduced with quartile buckets, so the shape of the distribution (not just min/max) is preserved.
- Null ratios are matched per column.
- Unstructured text columns that carry string-length stats (min/max/avg/stddev) get values truncated to the source's character-length distribution.
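The per-column strategies above can be approximated in a short Python sketch. This is a simplified illustration, not synthesize's algorithm: the helper names are hypothetical, nulls are represented as `None` and sampled like any other value (so the null ratio is matched in expectation rather than exactly), and numeric values are drawn uniformly inside a randomly chosen quartile bucket.

```python
import random
import statistics
from collections import Counter

def synth_categorical(source, n, rng):
    """Frequency-weighted sampling from the real value set (None = null)."""
    counts = Counter(source)
    values = list(counts)
    weights = [counts[v] for v in values]
    return rng.choices(values, weights=weights, k=n)

def synth_numeric(source, n, rng):
    """Draw values so each quartile bucket receives about a quarter of them."""
    q1, q2, q3 = statistics.quantiles(source, n=4, method="inclusive")
    edges = [min(source), q1, q2, q3, max(source)]
    out = []
    for _ in range(n):
        bucket = rng.randrange(4)  # each quartile holds ~25% of the source
        out.append(rng.uniform(edges[bucket], edges[bucket + 1]))
    return out

# Made-up source columns: 25% nulls in `borough`, a skewed numeric column.
borough = ["BRONX"] * 5 + ["QUEENS"] * 1 + [None] * 2
amount = [10, 11, 12, 13, 14, 15, 40, 100]

# A fixed seed makes the whole run repeatable.
rng = random.Random(42)
fake_borough = synth_categorical(borough, 1000, rng)
fake_amount = synth_numeric(amount, 1000, rng)

null_ratio = fake_borough.count(None) / len(fake_borough)
print(round(null_ratio, 2), min(fake_amount), max(fake_amount))
```

Sampling by quartile bucket is what keeps the skew: half of the generated amounts fall below the source median even though the maximum is far away, which a plain uniform draw between min and max would not reproduce.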
Layer in a Data Dictionary from describegpt --dictionary --infer-content-type --format JSON and each column's Content Type picks a realistic fake-rs faker — email, phone, street_address, uuid, etc. — for non-enumerable columns. --locale picks from 14 fake-rs locales (en, fr_fr, de_de, it_it, pt_br, pt_pt, ja_jp, zh_cn, zh_tw, ar_sa, cy_gb, fa_ir, nl_nl, tr_tr).
Important
Cross-column correlation is not modeled by default. Columns are generated independently. Use describegpt --two-pass to let the LLM detect related fields (e.g. city ↔ zip_code) and refine Content Types — synthesize will then keep those relationships consistent in the generated rows.
Example: pure statistical synthesis — no LLM needed
qsv synthesize data.csv -n 1000 --seed 42 > synthetic.csv

The --seed flag makes the output fully reproducible: same seed, same file, every time.
Example: realistic fakes via a Content-Type-tagged dictionary
# 1. Build the dictionary once (local LLM keeps the data on-device)
qsv describegpt customers.csv --dictionary --infer-content-type --format JSON \
-u http://localhost:11434/v1 --model deepseek-r1:14b -o customers.dict.json
# 2. Synthesize 10,000 rows that mirror customers.csv's shape, with realistic fakes
qsv synthesize customers.csv --dictionary customers.dict.json \
  --locale en --seed 42 -n 10000 > customers-fake.csv

Example: one-shot, letting synthesize build the dictionary itself
qsv synthesize data.csv --infer-content-type --seed 42 -n 5000 > synthetic.csv

This requires an LLM API key in QSV_LLM_APIKEY (or -u <local-endpoint> for a local LLM).
Example: stable source → fake mapping (--consistent-fakes)
qsv synthesize customers.csv --dictionary customers.dict.json \
  --consistent-fakes --seed 42 -n 10000 > customers-fake.csv

For structured-faker columns with bounded cardinality, the same source value always produces the same fake, which is useful for deidentified synthesis where you want stable joins on the faked columns.
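One common way to get a stable source-to-fake mapping is to derive the fake from a hash of the seed and the source value. The Python sketch below shows that general technique; it is an assumption about how such a mapping can work, not a description of how --consistent-fakes is implemented, and the pool of fake emails and the function name are hypothetical.

```python
import hashlib

# Hypothetical pool of pre-generated fake values.
FAKE_EMAILS = [
    "ava.reed@example.com",
    "ben.ortiz@example.com",
    "cara.lin@example.com",
    "dev.shah@example.com",
    "eli.moore@example.com",
]

def consistent_fake(source_value, seed, pool):
    """Map a source value to a fake that depends only on (seed, value)."""
    digest = hashlib.sha256(f"{seed}:{source_value}".encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]

# The same real email appears in two tables; both rows get the same fake,
# so a join on the faked column still lines up.
orders_fake = consistent_fake("jane@corp.test", 42, FAKE_EMAILS)
tickets_fake = consistent_fake("jane@corp.test", 42, FAKE_EMAILS)
print(orders_fake == tickets_fake)
```

With a pool this small, two different source values can land on the same fake; a real tool would need a large enough pool or a collision-handling step if one-to-one mapping matters.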
See also: /docs/help/synthesize.md, describegpt Content Types, Cookbook → Synthesize Fake Data, Stats Cache & Caching, src/cmd/synthesize/faker_map.rs — Content Type → faker mapping.
- Command Reference (index)
- Claude Cowork Plugin — qsv as 15 skills + 3 agents for Claude Code
- MCP Server — qsv as a Model Context Protocol server
- qsv pro Spotlight — desktop GUI companion
- docs/Describegpt.md
- SQL & Polars: describegpt SQL-RAG runs through sqlp / DuckDB
- Stats Cache & Caching: what powers describegpt's and synthesize's symbolic half
- Cookbook → Stats → Insights
- Cookbook → Synthesize Fake Data
- External Resources — "Have we achieved ACI?" blog series