Recipe Stats to Insights

Recipe: Stats → Insights

Tier: Intermediate Commands used: stats, moarstats, frequency, pragmastat, describegpt, sqlp Anchor dataset: NYC 311 (1M-row sample), with allegheny property sales as a secondary anchor

Problem

You have a CSV. You want to go from raw numbers to a narrative — a data dictionary, a description, top insights, anomaly callouts, and answers to natural-language questions — without writing a single line of analytics code.

qsv has a four-step pipeline for this: stats → moarstats → pragmastat → describegpt. Each step produces deterministic outputs that the next step reuses, with the LLM only writing the human-readable narrative on top.

Data

ls resources/test/NYC_311_SR_2010-2020-sample-1M.csv
# Or:
curl -LO https://raw.githubusercontent.com/wiki/dathere/qsv/files/nyc311samp.csv

# Plus an Ollama / LM Studio / Jan instance running locally with a model loaded
# Example: ollama pull gpt-oss-20b

Solution

1. Base stats — 48 metrics, 0.7 seconds

qsv stats --everything --infer-dates --infer-boolean --stats-jsonl \
  NYC_311_SR_2010-2020-sample-1M.csv > base.stats.csv

ls NYC_311_SR_2010-2020-sample-1M.csv.*
# .stats.csv  .stats.csv.data.jsonl

Two sidecar files are written: a human-readable .stats.csv and a machine-readable .data.jsonl. Every "smart" command after this (frequency, schema, pragmastat, pivotp, sqlp scoresql, describegpt) picks up the JSONL automatically.

2. Extended stats — 25+ more measures via moarstats

qsv moarstats NYC_311_SR_2010-2020-sample-1M.csv
# Updates NYC_311_SR_2010-2020-sample-1M.stats.csv with:
#  - Pearson's 2nd skewness
#  - quartile coefficient of dispersion
#  - z-scores of min/max/mode
#  - W3C XSD datatype mapping
#  - ...

These extras matter when your data has outliers or asymmetric distributions — exactly the case for NYC 311 (TAT has a heavy right tail).

3. Robust stats with pragmastat (heavy-tailed-data-aware)

qsv pragmastat \
  --select 'Unique Key' \
  NYC_311_SR_2010-2020-sample-1M.csv

Pragmastat appends seven ps_* columns to the stats cache: Hodges-Lehmann center, Shamos spread, plus 95% confidence bounds for both. Hodges-Lehmann tolerates up to 29% data corruption — much more reliable than mean for skewed distributions.

For the highly-skewed Allegheny property sales:

qsv stats -E --infer-dates --stats-jsonl allegheny_property_sales.csv
qsv pragmastat --select 'Sale Price' allegheny_property_sales.csv
# Result columns include ps_center (robust median-of-pairwise-averages),
# ps_spread (robust dispersion), and confidence bounds.

Pragmastat will reveal that the robust center of Sale Price differs from the mean — diagnostic for skewness.

4. Frequency for categorical columns

qsv frequency \
  --select 'Borough,Complaint Type,Agency,Status' \
  --limit 10 \
  --json \
  NYC_311_SR_2010-2020-sample-1M.csv > nyc311_freqs.json

JSON output is LLM-friendly. The frequency cache (.freq.csv.data.jsonl) is also written when --frequency-jsonl is set, allowing later commands to reuse it.

5. Generate a data dictionary with describegpt

qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  --all \
  -u http://localhost:11434/v1 \
  --model gpt-oss-20b \
  > nyc311-describegpt.md

--all produces description + tags + dictionary. The LLM only writes the human-friendly labels and descriptions; the deterministic statistical context comes from your stats and frequency caches.

For examples, see docs/describegpt/ — there are pre-generated outputs for NYC 311 in Markdown, JSON, TOON, Spanish, and Mandarin.

6. Chat — ask a natural-language question

qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  --prompt "What is the most common complaint type in Brooklyn?"

If the question can be answered from stats + frequency alone, the LLM does so directly. If it needs more (group-by × time, percentiles, joins), qsv enters SQL-RAG sub-mode:

qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  --prompt "What are the top 10 complaint types by community board and borough by year?"

qsv:

Adds a small random data sample as LLM context.
Asks the LLM to write SQL.
Runs the SQL against the data using DuckDB (if QSV_DUCKDB_PATH is set) or Polars SQL.
Returns the deterministic answer.

See docs/describegpt/nyc311-describegpt-prompt.md for a worked example with the resulting CSV.

7. Iterative refinement — SQL-RAG session

Sessions let you refine. See docs/describegpt/allegheny_discussion3.md — three rounds of "what about ... " arrive at a final query that produces the most-expensive listings CSV.

Variations

Multilingual data dictionaries

qsv describegpt NYC_311.csv --all --lang es > nyc311-es.md
qsv describegpt NYC_311.csv --all --lang zh > nyc311-zh.md

Controlled tag vocabulary (CKAN-compatible)

qsv describegpt NYC_311.csv --tags \
  --tag-vocab nyc-open-data-tag-vocabulary.csv > nyc311-tags.md

Constrains the LLM to choose from a curated list — no tag drift.

Output format

qsv describegpt NYC_311.csv --all --format markdown   # default
qsv describegpt NYC_311.csv --all --format json
qsv describegpt NYC_311.csv --all --format toon       # compact JSON for LLM context

Direct SQL query without LLM (when you already know what you want)

qsv sqlp NYC_311_SR_2010-2020-sample-1M.csv \
  "SELECT \"Complaint Type\", COUNT(*) AS n
   FROM nyc311
   WHERE Borough = 'BROOKLYN'
   GROUP BY \"Complaint Type\"
   ORDER BY n DESC LIMIT 10"

Score the SQL query for performance before running

qsv scoresql NYC_311_SR_2010-2020-sample-1M.csv \
  "SELECT * FROM nyc311"
# Anti-pattern warning: SELECT * without LIMIT

Performance notes

The full pipeline (steps 1–5) on the 1M-row NYC 311 sample completes in ~10–15 seconds on an M2 Pro, with the bulk of that being LLM latency in step 5.
stats itself is ~0.7 s for stats(-E) on 1M rows. Pre-populating the cache (--stats-jsonl) is essentially free vs. running without it because downstream commands save more than the extra write cost.
Local LLMs (Ollama/Jan/LM Studio) are slower than cloud LLMs but keep sensitive data on-premise.
describegpt outputs to JSON / TOON instead of Markdown when piped to other LLM tooling — TOON is a compact JSON encoding designed for token efficiency.

Recipe Stats to Insights

Recipe: Stats → Insights

Problem

Data

Solution

1. Base stats — 48 metrics, 0.7 seconds

2. Extended stats — 25+ more measures via moarstats

3. Robust stats with pragmastat (heavy-tailed-data-aware)

4. Frequency for categorical columns

5. Generate a data dictionary with describegpt

6. Chat — ask a natural-language question

7. Iterative refinement — SQL-RAG session

Variations

Multilingual data dictionaries

Controlled tag vocabulary (CKAN-compatible)

Output format

Direct SQL query without LLM (when you already know what you want)

Score the SQL query for performance before running

Performance notes

See also

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Get Started

Command Reference

Cookbook

Tuning & Internals

Ecosystem

Reference