AI and Documentation

Joel Natividad edited this page May 18, 2026 · 12 revisions

Tier: Intermediate
Commands covered: describegpt, color, pro, synthesize

Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.

The flagship here is describegpt — a neuro-symbolic data dictionary generator and SQL-RAG chat assistant. "Neuro-symbolic" means the heavy lifting (column types, ranges, cardinalities) is done deterministically by qsv's stats and frequency caches; the LLM only fills in human-friendly labels and descriptions. The result: hallucination-resistant data documentation, often produced against a local LLM (Ollama / Jan / LM Studio).

Since 20.1.0, describegpt can also tag each column with a semantic Content Type from a curated 47-token vocabulary (email, phone, street_address, job_title, credit_card, ipv6_address, …). Those tags then drive the new synthesize command, which generates a statistically-faithful fake CSV that mirrors the source's distributions and null rates while substituting realistic fakes for sensitive columns.

For the deep-dive, see docs/Describegpt.md and the output gallery (Markdown, JSON, TOON, Spanish, Mandarin, and SQL-RAG examples).

color is the colorized-table cousin of table. pro bridges to the qsv pro desktop app.

Quick decision table

| If you want to… | Use | Notes |
| --- | --- | --- |
| Generate a data dictionary for a CSV | describegpt --dictionary | Outputs deterministic stats + LLM-written labels |
| Tag each column with a semantic Content Type | describegpt --dictionary --infer-content-type | 47-token vocabulary; unique_id deterministic for ALL-UNIQUE cols (20.1.0+) |
| Sharpen labels with cross-field awareness | describegpt --all --two-pass | Second LLM pass relates fields (e.g. street_no + street_name + city + zip → mailing address) (20.1.0+) |
| Generate description + tags + dictionary | describegpt --all | The full "FAIRify" mode |
| Ask a natural-language question about a CSV | describegpt --prompt "..." | SQL-RAG sub-mode kicks in when needed |
| Customize the Markdown layout | describegpt --markdown-template my.toml | MiniJinja TOML templates per inference kind (20.1.0+) |
| Get multilingual descriptions (Spanish, Mandarin, …) | describegpt --lang ... | LLM-driven; quality varies by model |
| Use a local LLM (Ollama / Jan / LM Studio) | describegpt -u http://localhost:11434/v1 --model ... | Recommended for sensitive data |
| Generate a statistically-faithful fake CSV | synthesize | Preserves distributions + null rates; substitutes realistic fakes (20.1.0+) |
| Pretty colorized table for the terminal | color | Auto-detects light/dark theme, fits to terminal width |
| Open a file in csvlens via qsv pro | qsv pro lens | Requires qsv pro running |
| Import a file into qsv pro's Workflow | qsv pro workflow | Requires qsv pro running |

describegpt

Calls any OpenAI-compatible LLM endpoint with a configurable MiniJinja-templated prompt. The prompt is fed pre-computed summary statistics and frequency distributions from qsv's stats / frequency caches; that's the "symbolic" half. The LLM generates only the natural-language labels and descriptions; that's the "neuro" half. Result: the column names, types, and ranges in the documentation are deterministic and reproducible, because the LLM never computes them and so cannot hallucinate a column or invent a range.
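A minimal sketch of that split, written in Python rather than taken from qsv's code: ordinary deterministic code computes the column facts, and the prompt only asks for wording. The sample data, the summarize and build_prompt helpers, and the prompt text are all hypothetical; describegpt's real prompts live in its TOML templates.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data, only for illustration.
SAMPLE_CSV = """city,population
Austin,961855
Boston,675647
Austin,961855
Denver,715522
"""

def summarize(csv_text):
    """Symbolic half: compute per-column facts with no LLM involved."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for name in rows[0].keys():
        values = [r[name] for r in rows]
        col = {
            "cardinality": len(set(values)),
            "top_values": Counter(values).most_common(2),
        }
        try:
            numbers = [float(v) for v in values]
            col.update(type="Float", min=min(numbers), max=max(numbers))
        except ValueError:
            col["type"] = "String"
        summary[name] = col
    return summary

def build_prompt(summary):
    """Neuro half: the LLM would receive the facts and be asked only for wording."""
    lines = [
        "Write a label and a description for each column.",
        "Use the statistics as given; do not change them.",
    ]
    for name, col in summary.items():
        lines.append(f"- {name}: {col}")
    return "\n".join(lines)

summary = summarize(SAMPLE_CSV)
prompt = build_prompt(summary)
```

Because the numbers are fixed before the model is called, two runs against different models can differ in phrasing but not in the reported type, range, or cardinality.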

Two modes

  • Dictionary mode (--dictionary, --description, --tags, --all): generates documentation for the whole dataset.
  • Chat / RAG mode (--prompt "..."): answers a natural-language question. If the answer needs more than stats+frequency, qsv enters SQL-RAG sub-mode: it generates a SQL query, runs it against the data (DuckDB if QSV_DUCKDB_PATH is set, Polars SQL otherwise), and returns the deterministic answer.

Example: full data dictionary against OpenAI's default model

qsv describegpt data.csv --api-key "$OPENAI_API_KEY" --all
# Writes data.dictionary.json (or similar) and prints a Markdown summary

Example: against a local Ollama instance (no API key, no data leaves your machine)

qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  -u http://localhost:11434/v1 \
  --model deepseek-r1:14b \
  --dictionary

Example: chat — ask a question the stats alone can answer

qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  --prompt "What is the most common complaint type?"

The LLM consults the frequency table (already cached) and answers without writing any SQL.

Example: chat with SQL-RAG sub-mode

qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  --prompt "What are the top 10 complaint types by community board and borough by year?"

Stats alone can't answer this (it needs a group-by across community board, borough, and year), so qsv:

  1. Builds a small random sample as additional LLM context.
  2. Asks the LLM to write a SQL query that answers the question.
  3. Runs the query with DuckDB (if QSV_DUCKDB_PATH is set) or Polars SQL.
  4. Returns the actual result, not a guess. See docs/describegpt/nyc311-describegpt-prompt.md for a worked example with the resulting CSV.
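The four steps above can be sketched as follows. This is an illustration of the flow, not qsv's implementation: Python's built-in sqlite3 stands in for DuckDB / Polars SQL, fake_llm_write_sql is a hypothetical stub that returns a canned query instead of calling a model, and the first two rows stand in for the random sample.

```python
import sqlite3

# Hypothetical miniature of the 311 data: (complaint_type, borough, year).
ROWS = [
    ("Noise", "BROOKLYN", 2019),
    ("Noise", "BROOKLYN", 2019),
    ("Heating", "BRONX", 2019),
    ("Noise", "BRONX", 2020),
]

def fake_llm_write_sql(question, sample_rows):
    """Hypothetical stand-in for the LLM call: returns a canned SQL query."""
    return (
        "SELECT borough, year, complaint_type, COUNT(*) AS n "
        "FROM data GROUP BY borough, year, complaint_type "
        "ORDER BY n DESC, borough, year, complaint_type"
    )

def answer(question):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data (complaint_type TEXT, borough TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO data VALUES (?, ?, ?)", ROWS)
    sample = ROWS[:2]                           # step 1: sample as LLM context
    sql = fake_llm_write_sql(question, sample)  # step 2: the LLM writes SQL
    result = conn.execute(sql).fetchall()       # step 3: the engine runs it
    conn.close()
    return result                               # step 4: computed rows, not a guess

result = answer("Top complaint types by borough and year?")
```

The point of the design is that the model only proposes the query; the rows in the answer come from the SQL engine.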

Example: iterative SQL-RAG session refinement

The Allegheny property sales session shows three rounds of refinement against the dataset. The final query produces a most-expensive-listings CSV — deterministic, reproducible.

Example: multilingual output

qsv describegpt NYC_311.csv --all --lang es > nyc311-describegpt-spanish.md
qsv describegpt NYC_311.csv --all --lang zh > nyc311-describegpt-mandarin.md

(See docs/describegpt/ for actual Spanish and Mandarin outputs.)

Example: controlled tag vocabulary (avoid LLM tag drift)

qsv describegpt NYC_311.csv \
  --tags \
  --tag-vocab data-tag-vocabulary.csv > nyc311-tags.md

--tag-vocab constrains the LLM to choose from a list of approved tags — useful for CKAN harmonization.

Output formats

describegpt emits Markdown (default), JSON, and TOON (Toon Format — a compact JSON encoding designed for LLM prompts). See docs/describegpt/ for examples of each.

Configurable prompts

The default prompt templates are in resources/describegpt_defaults.toml. Copy and edit them to fit your organization's documentation style, then pass the edited file with describegpt --prompt-file my-prompts.toml ....

Content Types — semantic column labels (new in 20.1.0)

Pass --infer-content-type and describegpt asks the LLM to tag each column with a semantic Content Type from a curated 47-token vocabulary that covers people (first_name, last_name, full_name, job_title, …), addresses (street_no, street_name, street_address, city, state, state_abbr, zip_code, country, …), contact info (email, phone, username), technical identifiers (ipv4_address, ipv6_address, mac_address, uuid, domain, url), payment (credit_card, iban, currency_code), and more (industry, profession, license_plate, …). The full token list and the faker each token maps to lives in src/cmd/synthesize/faker_map.rs.

A Content Type column is appended to the emitted Data Dictionary. Those tags drive synthesize's realistic-fake generation but they're also useful on their own as auto-generated documentation.

Deterministic unique_id. Columns whose cardinality equals their row count (primary keys, surrogate keys, UUIDs, sequence numbers) are tagged unique_id directly by qsv, before the LLM ever sees them, so the label is 100% reproducible and doesn't drift between LLM versions.

qsv describegpt customers.csv --dictionary --infer-content-type --format JSON -o customers.dict.json
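The deterministic unique_id rule (cardinality equals row count) is simple enough to sketch directly. The Python below is an illustration under assumed sample data; qsv itself reads these figures from its stats cache rather than rescanning the file, and unique_id_columns is a hypothetical helper name.

```python
import csv
import io

# Hypothetical sample data, only for illustration.
CSV_TEXT = """customer_id,state,plan
1001,NY,basic
1002,NY,pro
1003,CA,basic
"""

def unique_id_columns(csv_text):
    """Return the columns whose distinct-value count equals the row count."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return []
    row_count = len(rows)
    tagged = []
    for name in rows[0].keys():
        cardinality = len({r[name] for r in rows})
        if cardinality == row_count:
            tagged.append(name)
    return tagged

tagged = unique_id_columns(CSV_TEXT)
```

No model is consulted, so the same file always yields the same set of unique_id columns.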

--two-pass cross-field refinement (new in 20.1.0)

A first-pass LLM call sees each column in isolation, so it can mislabel a 2-letter string column without realizing that the next column is a zip_code (which makes the first one a state_abbr). Pass --two-pass and describegpt runs a second LLM call that takes the whole first-pass Data Dictionary as JSON context, then refines each field's Label, Description, and (when --infer-content-type is set) Content Type with cross-field awareness. This roughly doubles the dictionary's LLM cost, so opt in when accuracy matters more than throughput.

This is also the option that unlocks cross-column consistency in synthesize (a synthetic state_abbr will be a real US state abbreviation matched to the row's zip_code).

qsv describegpt customers.csv --all --two-pass --infer-content-type \
  -u http://localhost:11434/v1 --model deepseek-r1:14b
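To make the two-pass data flow concrete, here is a toy Python sketch in which simple rules stand in for both LLM calls. It is not how describegpt decides anything; the first_pass and second_pass helpers and the placeholder labels two_letter_code and text are hypothetical and are not part of the Content Type vocabulary.

```python
def first_pass(columns):
    """Label each column in isolation (stand-in for the first LLM call)."""
    labels = {}
    for name, sample in columns.items():
        if len(sample) == 5 and sample.isdigit():
            labels[name] = "zip_code"
        elif len(sample) == 2 and sample.isalpha():
            labels[name] = "two_letter_code"  # ambiguous without context
        else:
            labels[name] = "text"
    return labels

def second_pass(labels):
    """Refine labels with the whole first-pass dictionary as context."""
    refined = dict(labels)
    if "zip_code" in labels.values():
        for name, label in labels.items():
            if label == "two_letter_code":
                # A 2-letter code sitting next to a zip code is
                # most plausibly a state abbreviation.
                refined[name] = "state_abbr"
    return refined

columns = {"st": "NY", "zip": "10001", "notes": "call back"}
pass_one = first_pass(columns)
pass_two = second_pass(pass_one)
```

The second function can only make its correction because it receives every column's first-pass label at once, which is the property the second LLM call relies on.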

--markdown-template for customizable Markdown output (new in 20.1.0)

Override the Markdown layout that describegpt emits with a TOML file of MiniJinja templates. There are five optional template fields: dictionary_md_template, description_md_template, tags_md_template, custom_prompt_md_template, and the per-field dictionary_md_body_template that fills the dictionary wrapper's {{ llm_response }}. Any omitted field falls back to the embedded default, so a minimal TOML can override just the one template you want to change. Custom Jinja filters (pipe_escape, br_replace, human_count, dict_cell, humanize_examples) are pre-registered and documented inline in resources/describegpt_md_defaults.toml.

qsv describegpt customers.csv --all --markdown-template team-style.toml > customers-dictionary.md

Lower LLM cost in 20.1.0

The default description and tags prompts now inline {{ dictionary }} directly into the system prompt instead of re-sending the dictionary as a separate chat message. Measurably cuts token usage on multi-phase (--all) runs.

See also: /docs/help/describegpt.md, docs/Describegpt.md, Output gallery, Stats Cache & Caching, SQL & Polars, Claude Cowork Plugin, MCP Server, Cookbook → Stats → Insights, Cookbook → Synthesize Fake Data.

color

table with colors. Same elastic-tab alignment, but with color-coded data types (string vs number vs date), terminal-fit truncation, and theme auto-detection (light vs dark). Loads the entire CSV into memory — pair with slice or sample for large files.

The polars feature lets color also display Arrow, Avro, Parquet, JSON arrays, and JSONL.

Example: colorized top-10 cities by population

qsv search --select Country '^us$' wcp.csv \
  | qsv sort --select Population --numeric --reverse \
  | qsv slice --len 10 \
  | qsv color

Example: force colors when piping (or running in CI)

QSV_FORCE_COLOR=1 qsv stats wcp.csv | qsv color | less -R

Example: override terminal theme detection

QSV_THEME=DARK qsv color wcp.csv

Example: browse a Parquet file (polars feature)

qsv to parquet outdir/ wcp.csv
qsv color outdir/wcp.parquet

See also: /docs/help/color.md, table (uncolored alternative), lens (interactive viewer), Environment Variables (QSV_FORCE_COLOR, QSV_THEME, QSV_TERMWIDTH).

pro

Bridges to the qsv pro API. qsv pro must be running on the same machine. Two subcommands:

  • lens — opens a CSV in csvlens inside an Alacritty window (Windows only).
  • workflow — imports a file into qsv pro's Workflow panel.

Example: send a CSV to qsv pro's Workflow

qsv pro workflow data.csv

Example: open a CSV in csvlens via qsv pro (Windows)

qsv pro lens data.csv

The Workflow subcommand accepts CSV, TSV, SSV, TAB, XLSX, XLS, XLSB, XLSM, ODS — auto-conversion happens inside qsv pro.

For everything qsv pro offers beyond this command, see qsv pro Spotlight and qsvpro.dathere.com.

See also: /docs/help/pro.md, qsv pro Spotlight, lens — qsv's built-in interactive viewer, Integrations.

synthesize

New in 20.1.0. Generate a statistically-faithful synthetic CSV from a real one — same value mix, same distribution shape, same null rate — without any of the original records. Useful for sharing test data, populating staging environments, building demos, and benchmarking pipelines without leaking real customer data.

synthesize runs stats and frequency against the source under the hood, then emits N rows that reproduce per-column attributes:

  • Categorical / low-cardinality columns are rebuilt by frequency-weighted sampling of the real value set — cardinality, weights, and repetition structure preserved exactly.
  • Numeric and date/datetime columns are reproduced with quartile buckets, so the shape of the distribution (not just min/max) is preserved.
  • Null ratios are matched per column.
  • Unstructured text columns that carry string-length stats (min/max/avg/stddev) get values truncated to the source's character-length distribution.
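The first three bullets can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not synthesize's actual algorithm: nulls are treated as just another categorical value, and numeric values are drawn uniformly within a randomly chosen quartile bucket. The function names are hypothetical.

```python
import random
import statistics

def synth_categorical(values, n, rng):
    """Frequency-weighted sampling of the real value set.

    None is treated as one more value, so the null rate carries over.
    """
    distinct = sorted(set(values), key=lambda v: (v is None, str(v)))
    weights = [values.count(v) for v in distinct]
    return rng.choices(distinct, weights=weights, k=n)

def synth_numeric(values, n, rng):
    """Pick one of four quartile buckets, then a uniform value inside it."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    edges = [min(values), q1, q2, q3, max(values)]
    out = []
    for _ in range(n):
        bucket = rng.randrange(4)  # each quartile holds about 25% of the data
        out.append(rng.uniform(edges[bucket], edges[bucket + 1]))
    return out

# Hypothetical source column: 60% "a", 30% "b", 10% null.
source = ["a"] * 6 + ["b"] * 3 + [None]
fake = synth_categorical(source, 1000, random.Random(42))
amounts = synth_numeric(list(range(1, 11)), 500, random.Random(42))
```

Sampling by bucket rather than between min and max is what keeps a skewed distribution skewed: a long tail occupies one wide bucket but still receives only about a quarter of the generated values.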

Layer in a Data Dictionary from describegpt --dictionary --infer-content-type --format JSON and each column's Content Type picks a realistic fake-rs faker — email, phone, street_address, uuid, etc. — for non-enumerable columns. --locale picks from 14 fake-rs locales (en, fr_fr, de_de, it_it, pt_br, pt_pt, ja_jp, zh_cn, zh_tw, ar_sa, cy_gb, fa_ir, nl_nl, tr_tr).

Cross-column correlation is not modeled by default. Columns are generated independently. Use describegpt --two-pass to let the LLM detect related fields (e.g. city and zip_code) and refine Content Types; synthesize will then keep those relationships consistent in the generated rows.
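A small Python sketch shows why independent generation breaks related fields. The joint sampler is only one simple way to keep a pair consistent and is an assumption for illustration; this page does not describe the mechanism synthesize uses.

```python
import random

# Hypothetical source rows of (city, zip_code).
SOURCE = [("Austin", "78701"), ("Boston", "02108"), ("Denver", "80202")]

def independent(rows, n, rng):
    """Sample each column on its own, ignoring which values co-occur."""
    cities = [city for city, _ in rows]
    zips = [zip_code for _, zip_code in rows]
    return [(rng.choice(cities), rng.choice(zips)) for _ in range(n)]

def joint(rows, n, rng):
    """Sample whole (city, zip_code) pairs so the relationship survives."""
    return [rng.choice(rows) for _ in range(n)]

valid = set(SOURCE)
ind = independent(SOURCE, 200, random.Random(42))
jnt = joint(SOURCE, 200, random.Random(42))
mismatched = sum(pair not in valid for pair in ind)
```

Each column of the independent output still has the right marginal distribution, yet many rows pair a city with another city's zip code.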

Example: pure statistical synthesis — no LLM needed

qsv synthesize data.csv -n 1000 --seed 42 > synthetic.csv

The --seed flag makes the output fully reproducible — same seed, same file, every time.

Example: realistic fakes via a Content-Type-tagged dictionary

# 1. Build the dictionary once (local LLM keeps the data on-device)
qsv describegpt customers.csv --dictionary --infer-content-type --format JSON \
  -u http://localhost:11434/v1 --model deepseek-r1:14b -o customers.dict.json

# 2. Synthesize 10,000 rows that mirror customers.csv's shape, with realistic fakes
qsv synthesize customers.csv --dictionary customers.dict.json \
  --locale en --seed 42 -n 10000 > customers-fake.csv

Example: one-shot — let synthesize build the dictionary itself

qsv synthesize data.csv --infer-content-type --seed 42 -n 5000 > synthetic.csv

Requires an LLM API key in QSV_LLM_APIKEY (or -u <local-endpoint> for a local LLM).

Example: stable source → fake mapping (--consistent-fakes)

qsv synthesize customers.csv --dictionary customers.dict.json \
  --consistent-fakes --seed 42 -n 10000 > customers-fake.csv

For structured-faker columns with bounded cardinality, the same source value always produces the same fake — useful for deidentified synthesis where you want stable joins on the faked columns.
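One way to obtain a stable source-to-fake mapping is to derive the fake from a hash of the seed and the source value. The Python sketch below illustrates that idea only; it is an assumption, not a description of how --consistent-fakes is implemented, and consistent_fake_email is a hypothetical helper.

```python
import hashlib
import random

def consistent_fake_email(source_value, seed):
    """Map a source value to a fake email that is stable for a given seed."""
    digest = hashlib.sha256(f"{seed}:{source_value}".encode("utf-8")).digest()
    # Seed a private generator from the hash so the result depends only on
    # (seed, source_value), not on row order or on other columns.
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return f"user{rng.randrange(1_000_000):06d}@example.com"

# Hypothetical tables that share a customer email as their join key.
orders = ["alice@corp.example", "bob@corp.example", "alice@corp.example"]
tickets = ["alice@corp.example"]

fake_orders = [consistent_fake_email(v, 42) for v in orders]
fake_tickets = [consistent_fake_email(v, 42) for v in tickets]
```

Because the same source value yields the same fake in both tables, a join on the faked column still matches the rows it matched in the real data.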

See also: /docs/help/synthesize.md, describegpt Content Types, Cookbook → Synthesize Fake Data, Stats Cache & Caching, src/cmd/synthesize/faker_map.rs — Content Type → faker mapping.
