Skip to content

Recipe Stats to Insights

Joel Natividad edited this page Jun 3, 2026 · 3 revisions

Recipe: Stats → Insights

Tier: Intermediate Commands used: stats, moarstats, frequency, pragmastat, describegpt, sqlp Anchor dataset: NYC 311 (1M-row sample), with allegheny property sales as a secondary anchor

Problem

You have a CSV. You want to go from raw numbers to a narrative — a data dictionary, a description, top insights, anomaly callouts, and answers to natural-language questions — without writing a single line of analytics code.

qsv has a four-step pipeline for this: statsmoarstatspragmastatdescribegpt. Each step produces deterministic outputs that the next step reuses, with the LLM only writing the human-readable narrative on top.

Data

ls resources/test/NYC_311_SR_2010-2020-sample-1M.csv
# Or:
curl -LO https://raw.githubusercontent.com/wiki/dathere/qsv/files/nyc311samp.csv

# Plus an Ollama / LM Studio / Jan instance running locally with a model loaded
# Example: ollama pull gpt-oss-20b

Solution

1. Base stats — 48 metrics, 0.7 seconds

qsv stats --everything --infer-dates --infer-boolean --stats-jsonl \
  NYC_311_SR_2010-2020-sample-1M.csv > base.stats.csv

ls NYC_311_SR_2010-2020-sample-1M.csv.*
# .stats.csv  .stats.csv.data.jsonl

Two sidecar files are written: a human-readable .stats.csv and a machine-readable .data.jsonl. Every "smart" command after this (frequency, schema, pragmastat, pivotp, sqlp scoresql, describegpt) picks up the JSONL automatically.

2. Extended stats — 25+ more measures via moarstats

qsv moarstats NYC_311_SR_2010-2020-sample-1M.csv
# Updates NYC_311_SR_2010-2020-sample-1M.stats.csv with:
#  - Pearson's 2nd skewness
#  - quartile coefficient of dispersion
#  - z-scores of min/max/mode
#  - W3C XSD datatype mapping
#  - ...

These extras matter when your data has outliers or asymmetric distributions — exactly the case for NYC 311 (TAT has a heavy right tail).

3. Robust stats with pragmastat (heavy-tailed-data-aware)

qsv pragmastat \
  --select 'Unique Key' \
  NYC_311_SR_2010-2020-sample-1M.csv

Pragmastat appends seven ps_* columns to the stats cache: Hodges-Lehmann center, Shamos spread, plus 95% confidence bounds for both. Hodges-Lehmann tolerates up to 29% data corruption — much more reliable than mean for skewed distributions.

For the highly-skewed Allegheny property sales:

qsv stats -E --infer-dates --stats-jsonl allegheny_property_sales.csv
qsv pragmastat --select 'Sale Price' allegheny_property_sales.csv
# Result columns include ps_center (robust median-of-pairwise-averages),
# ps_spread (robust dispersion), and confidence bounds.

Pragmastat will reveal that the robust center of Sale Price differs from the mean — diagnostic for skewness.

4. Frequency for categorical columns

qsv frequency \
  --select 'Borough,Complaint Type,Agency,Status' \
  --limit 10 \
  --json \
  NYC_311_SR_2010-2020-sample-1M.csv > nyc311_freqs.json

JSON output is LLM-friendly. The frequency cache (.freq.csv.data.jsonl) is also written when --frequency-jsonl is set, allowing later commands to reuse it.

5. Generate a data dictionary with describegpt

qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  --all \
  -u http://localhost:11434/v1 \
  --model gpt-oss-20b \
  > nyc311-describegpt.md

--all produces description + tags + dictionary. The LLM only writes the human-friendly labels and descriptions; the deterministic statistical context comes from your stats and frequency caches.

For examples, see docs/describegpt/ — there are pre-generated outputs for NYC 311 in Markdown, JSON, TOON, Spanish, and Mandarin.

6. Chat — ask a natural-language question

qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  --prompt "What is the most common complaint type in Brooklyn?"

If the question can be answered from stats + frequency alone, the LLM does so directly. If it needs more (group-by × time, percentiles, joins), qsv enters SQL-RAG sub-mode:

qsv describegpt NYC_311_SR_2010-2020-sample-1M.csv \
  --prompt "What are the top 10 complaint types by community board and borough by year?"

qsv:

  1. Adds a small random data sample as LLM context.
  2. Asks the LLM to write SQL.
  3. Runs the SQL against the data using DuckDB (if QSV_DUCKDB_PATH is set) or Polars SQL.
  4. Returns the deterministic answer.

See docs/describegpt/nyc311-describegpt-prompt.md for a worked example with the resulting CSV.

7. Iterative refinement — SQL-RAG session

Sessions let you refine. See docs/describegpt/allegheny_discussion3.md — three rounds of "what about ... " arrive at a final query that produces the most-expensive listings CSV.

Variations

Multilingual data dictionaries

qsv describegpt NYC_311.csv --all --lang es > nyc311-es.md
qsv describegpt NYC_311.csv --all --lang zh > nyc311-zh.md

Controlled tag vocabulary (CKAN-compatible)

qsv describegpt NYC_311.csv --tags \
  --tag-vocab nyc-open-data-tag-vocabulary.csv > nyc311-tags.md

Constrains the LLM to choose from a curated list — no tag drift.

Output format

qsv describegpt NYC_311.csv --all --format markdown   # default
qsv describegpt NYC_311.csv --all --format json
qsv describegpt NYC_311.csv --all --format toon       # compact JSON for LLM context

Direct SQL query without LLM (when you already know what you want)

qsv sqlp NYC_311_SR_2010-2020-sample-1M.csv \
  "SELECT \"Complaint Type\", COUNT(*) AS n
   FROM nyc311
   WHERE Borough = 'BROOKLYN'
   GROUP BY \"Complaint Type\"
   ORDER BY n DESC LIMIT 10"

Score the SQL query for performance before running

qsv scoresql NYC_311_SR_2010-2020-sample-1M.csv \
  "SELECT * FROM nyc311"
# Anti-pattern warning: SELECT * without LIMIT

Performance notes

  • The full pipeline (steps 1–5) on the 1M-row NYC 311 sample completes in ~10–15 seconds on an M2 Pro, with the bulk of that being LLM latency in step 5.
  • stats itself is ~0.7 s for stats(-E) on 1M rows. Pre-populating the cache (--stats-jsonl) is essentially free vs. running without it because downstream commands save more than the extra write cost.
  • Local LLMs (Ollama/Jan/LM Studio) are slower than cloud LLMs but keep sensitive data on-premise.
  • describegpt outputs to JSON / TOON instead of Markdown when piped to other LLM tooling — TOON is a compact JSON encoding designed for token efficiency.

See also

Clone this wiki locally