Recipe: Stats to Insights
Tier: Intermediate
Commands used: stats, moarstats, frequency, pragmastat, describegpt, sqlp
Anchor dataset: NYC 311 (1M-row sample), with Allegheny property sales as a secondary anchor
You have a CSV. You want to go from raw numbers to a narrative — a data dictionary, a description, top insights, anomaly callouts, and answers to natural-language questions — without writing a single line of analytics code.
qsv has a five-step pipeline for this: stats → moarstats → pragmastat → frequency → describegpt. Each step produces deterministic outputs that the next step reuses; the LLM only writes the human-readable narrative on top.
ls resources/test/NYC_311_SR_2010-2020-sample-1M.csv
# Or:
curl -LO https://raw.githubusercontent.com/wiki/dathere/qsv/files/nyc311samp.csv
# Plus an Ollama / LM Studio / Jan instance running locally with a model loaded
# Example: ollama pull gpt-oss-20b

qsv stats --everything --infer-dates --infer-boolean --stats-jsonl \
NYC_311_SR_2010-2020-sample-1M.csv > base.stats.csv
ls NYC_311_SR_2010-2020-sample-1M.stats.*
# .stats.csv .stats.csv.data.jsonl

Two sidecar files are written: a human-readable .stats.csv and a machine-readable .stats.csv.data.jsonl. Every "smart" command after this (frequency, schema, pragmastat, pivotp, sqlp, scoresql, describegpt) picks up the JSONL automatically.
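Because the machine-readable sidecar is JSON Lines (one JSON object per line), it is easy to inspect from a script. The following is a minimal Python sketch that assumes nothing about the real schema: the `field` and `type` keys in the demo records are hypothetical placeholders, not qsv's actual column names.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Demo with a throwaway file; the keys below are hypothetical, not qsv's schema.
demo_lines = [
    '{"field": "Borough", "type": "String"}',
    "",
    '{"field": "Unique Key", "type": "Integer"}',
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("\n".join(demo_lines) + "\n")
    demo_path = tmp.name

records = load_jsonl(demo_path)
os.remove(demo_path)
print(len(records), sorted(records[0]))
```

To look at a real cache, point `load_jsonl` at the .stats.csv.data.jsonl file and print the keys of the first record before relying on any of them.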
qsv moarstats NYC_311_SR_2010-2020-sample-1M.csv
# Updates NYC_311_SR_2010-2020-sample-1M.stats.csv with:
# - Pearson's 2nd skewness
# - quartile coefficient of dispersion
# - z-scores of min/max/mode
# - W3C XSD datatype mapping
# - ...

These extras matter when your data has outliers or asymmetric distributions, which is exactly the case for NYC 311 (TAT has a heavy right tail).
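To make three of these measures concrete, here is a small Python sketch using their textbook definitions. It is an illustration only: the sample values are made up, and the exact conventions moarstats uses (sample versus population standard deviation, quartile interpolation method) are assumptions that may differ.

```python
import statistics


def pearson_second_skewness(values):
    """Pearson's 2nd skewness: 3 * (mean - median) / standard deviation."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    stdev = statistics.stdev(values)  # sample standard deviation (assumption)
    return 3 * (mean - median) / stdev


def quartile_coefficient_of_dispersion(values):
    """Quartile coefficient of dispersion: (Q3 - Q1) / (Q3 + Q1)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / (q3 + q1)


def z_score(x, values):
    """Number of standard deviations that x lies from the mean."""
    return (x - statistics.fmean(values)) / statistics.stdev(values)


# Hypothetical resolution times in days, with one very slow request.
days_to_close = [1, 2, 2, 3, 3, 4, 5, 40]

skew = pearson_second_skewness(days_to_close)
qcd = quartile_coefficient_of_dispersion(days_to_close)
z_max = z_score(max(days_to_close), days_to_close)

print(f"skewness={skew:.2f} qcd={qcd:.2f} z(max)={z_max:.2f}")
```

A positive skewness together with a large z-score for the maximum is the signature of a heavy right tail: the single slow request pulls the mean well above the median.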
qsv pragmastat \
--select 'Unique Key' \
NYC_311_SR_2010-2020-sample-1M.csv

Pragmastat appends seven ps_* columns to the stats cache, including the Hodges-Lehmann center, the Shamos spread, and 95% confidence bounds for both. Hodges-Lehmann tolerates up to 29% data corruption, which makes it much more reliable than the mean for skewed distributions.
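The two estimators are simple to state, and a brute-force Python sketch shows why they resist corruption. This uses the textbook pairwise definitions and made-up numbers; it is quadratic in the input size, omits the confidence bounds, and is not how pragmastat is implemented.

```python
from itertools import combinations, combinations_with_replacement
from statistics import fmean, median


def hodges_lehmann_center(values):
    """Median of all pairwise averages (x_i + x_j) / 2 with i <= j."""
    return median(
        (a + b) / 2 for a, b in combinations_with_replacement(values, 2)
    )


def shamos_spread(values):
    """Median of all pairwise absolute differences |x_i - x_j| with i < j."""
    return median(abs(a - b) for a, b in combinations(values, 2))


clean = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
# Corrupt 2 of 10 values (20%, below the roughly 29% breakdown point).
corrupted = clean[:8] + [10_000, 20_000]

print("clean:    ", fmean(clean), hodges_lehmann_center(clean), shamos_spread(clean))
print("corrupted:", fmean(corrupted), hodges_lehmann_center(corrupted), shamos_spread(corrupted))
```

With these inputs the mean jumps from 14.5 to 3010.8, while the Hodges-Lehmann center only moves from 14.5 to 15.0.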
For the highly-skewed Allegheny property sales:
qsv stats -E --infer-dates --stats-jsonl allegheny_property_sales.csv
qsv pragmastat --select 'Sale Price' allegheny_property_sales.csv
# Result columns include ps_center (robust median-of-pairwise-averages),
# ps_spread (robust dispersion), and confidence bounds.

Pragmastat will reveal that the robust center of Sale Price differs from the mean, a diagnostic for skewness.
qsv frequency \
--select 'Borough,Complaint Type,Agency,Status' \
--limit 10 \
--json \
NYC_311_SR_2010-2020-sample-1M.csv > nyc311_freqs.json

JSON output is LLM-friendly. The frequency cache (.freq.csv.data.jsonl) is also written when --frequency-jsonl is set, allowing later commands to reuse it.
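Conceptually, a limited frequency table is a top-N count per column with the remainder rolled up. The Python sketch below illustrates that idea on a made-up list of boroughs; the `value`, `count`, and `percentage` keys and the "Other" label are illustrative choices, not qsv's actual JSON schema.

```python
import json
from collections import Counter


def top_values(values, limit=10):
    """Return the `limit` most frequent values, plus an "Other" bucket."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = [
        {"value": value, "count": count, "percentage": round(100 * count / total, 2)}
        for value, count in counts.most_common(limit)
    ]
    remainder = total - sum(row["count"] for row in rows)
    if remainder:
        rows.append(
            {
                "value": "Other",
                "count": remainder,
                "percentage": round(100 * remainder / total, 2),
            }
        )
    return rows


# Made-up sample column.
boroughs = ["BROOKLYN"] * 4 + ["QUEENS"] * 3 + ["MANHATTAN"] * 2 + ["BRONX"]
table = top_values(boroughs, limit=2)
print(json.dumps(table, indent=2))
```

A compact structure like this is what makes frequency output convenient as LLM context: a handful of rows summarizes a column of any length.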
qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
--all \
-u http://localhost:11434/v1 \
--model gpt-oss-20b \
> nyc311-describegpt.md

--all produces description + tags + dictionary. The LLM only writes the human-friendly labels and descriptions; the deterministic statistical context comes from your stats and frequency caches.
For examples, see docs/describegpt/ — there are pre-generated outputs for NYC 311 in Markdown, JSON, TOON, Spanish, and Mandarin.
qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
--prompt "What is the most common complaint type in Brooklyn?"

If the question can be answered from stats + frequency alone, the LLM does so directly. If it needs more (group-by × time, percentiles, joins), qsv enters SQL-RAG sub-mode:
qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
--prompt "What are the top 10 complaint types by community board and borough by year?"

qsv:
- Adds a small random data sample as LLM context.
- Asks the LLM to write SQL.
- Runs the SQL against the data using DuckDB (if QSV_DUCKDB_PATH is set) or Polars SQL.
- Returns the deterministic answer.
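The flow in the list above can be sketched in a few lines of Python. Everything here is a stand-in: `fake_llm_write_sql` is a hypothetical stub that returns hard-coded SQL instead of calling a model, SQLite replaces DuckDB / Polars SQL, and the table and column names are invented for the demo. The point is the division of labour: the model only writes the query, and the answer comes from executing it.

```python
import random
import sqlite3


def fake_llm_write_sql(question, sample_rows):
    """Hypothetical stand-in for the LLM call; a real model would write this SQL."""
    return (
        "SELECT complaint_type, COUNT(*) AS n FROM requests "
        "WHERE borough = 'BROOKLYN' GROUP BY complaint_type "
        "ORDER BY n DESC, complaint_type LIMIT 1"
    )


def answer_with_sql(rows, question, sample_size=3, seed=0):
    """Sample the data, ask the (stub) LLM for SQL, run it, return the result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE requests (borough TEXT, complaint_type TEXT)")
    conn.executemany("INSERT INTO requests VALUES (?, ?)", rows)

    # 1. A small random data sample serves as LLM context.
    sample = random.Random(seed).sample(rows, min(sample_size, len(rows)))
    # 2. The LLM writes SQL (stubbed here).
    sql = fake_llm_write_sql(question, sample)
    # 3. The SQL runs against the full data, so the answer is deterministic.
    result = conn.execute(sql).fetchall()
    conn.close()
    return sql, result


rows = [
    ("BROOKLYN", "Noise"),
    ("BROOKLYN", "Noise"),
    ("BROOKLYN", "Heating"),
    ("QUEENS", "Heating"),
    ("QUEENS", "Heating"),
]
sql, result = answer_with_sql(rows, "Most common complaint type in Brooklyn?")
print(sql)
print(result)
```

Because the final numbers come from the SQL engine rather than from the model's text, rerunning the same query on the same data always yields the same answer.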
See docs/describegpt/nyc311-describegpt-prompt.md for a worked example with the resulting CSV.
Sessions let you refine. See docs/describegpt/allegheny_discussion3.md — three rounds of "what about ... " arrive at a final query that produces the most-expensive listings CSV.
qsv describegpt NYC_311.csv --all --lang es > nyc311-es.md
qsv describegpt NYC_311.csv --all --lang zh > nyc311-zh.md

qsv describegpt NYC_311.csv --tags \
--tag-vocab nyc-open-data-tag-vocabulary.csv > nyc311-tags.md

--tag-vocab constrains the LLM to choose from a curated list, so tags do not drift.
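The idea of a controlled vocabulary can be illustrated with a small post-hoc filter in Python. This is only a sketch of the concept, with made-up tags; it is not how describegpt enforces the vocabulary, and the function name `constrain_tags` is hypothetical.

```python
def constrain_tags(candidate_tags, vocabulary):
    """Keep only tags found in the vocabulary, in canonical spelling, without repeats."""
    canonical = {term.strip().lower(): term for term in vocabulary}
    kept = []
    seen = set()
    for tag in candidate_tags:
        key = tag.strip().lower()
        if key in canonical and key not in seen:
            kept.append(canonical[key])
            seen.add(key)
    return kept


# Made-up vocabulary and model output.
vocabulary = ["noise", "Sanitation", "housing"]
llm_tags = ["Noise ", "sanitation", "Potholes!", "noise"]

tags = constrain_tags(llm_tags, vocabulary)
print(tags)
```

Anything outside the curated list ("Potholes!" here) is dropped, and spelling variants collapse to the canonical entry, which is what keeps tags comparable across datasets.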
qsv describegpt NYC_311.csv --all --format markdown # default
qsv describegpt NYC_311.csv --all --format json
qsv describegpt NYC_311.csv --all --format toon # compact JSON for LLM context

qsv sqlp NYC_311_SR_2010-2020-sample-1M.csv \
"SELECT \"Complaint Type\", COUNT(*) AS n
FROM _t_1
WHERE Borough = 'BROOKLYN'
GROUP BY \"Complaint Type\"
ORDER BY n DESC LIMIT 10"

qsv scoresql NYC_311_SR_2010-2020-sample-1M.csv \
"SELECT * FROM nyc311"
# Anti-pattern warning: SELECT * without LIMIT

- The full pipeline (steps 1–5) on the 1M-row NYC 311 sample completes in ~10–15 seconds on an M2 Pro, with the bulk of that being LLM latency in step 5.
- stats itself takes ~0.7 s with -E on 1M rows. Pre-populating the cache (--stats-jsonl) is essentially free compared with running without it, because downstream commands save more than the extra write cost.
- Local LLMs (Ollama/Jan/LM Studio) are slower than cloud LLMs but keep sensitive data on-premise.
- describegpt outputs to JSON / TOON instead of Markdown when piped to other LLM tooling; TOON is a compact JSON encoding designed for token efficiency.
- Aggregation & Statistics
- AI & Documentation (describegpt deep-dive)
- SQL & Polars (sqlp and scoresql)
- Stats Cache & Caching
- docs/Describegpt.md
- docs/describegpt/ (pre-generated outputs in many formats)
- Claude Cowork Plugin (alternative path: the same workflow in Claude Code)
- Recipe: Build a Data Pipeline (automates this into a markdown-report pipeline)