-
Notifications
You must be signed in to change notification settings - Fork 104
Recipe Build a Data Pipeline
Tier: Advanced
Commands used: input, safenames, stats, schema, validate, sqlp, pivotp, template, to
Anchor dataset: Allegheny County property sales (resources/test/allegheny_property_sales*.csv)
You want to take a raw CSV all the way from "what is this thing?" to "publishable analytics report" in a single repeatable pipeline. The output should be:
- clean (normalized headers, validated against a schema)
- fast (uses the stats cache + Polars schema for speed)
- analytical (SQL queries, pivoted summaries)
- human-readable (a Markdown report ready to ship)
- machine-consumable (Parquet output for downstream systems)
This recipe is a template — copy it, swap the dataset, and you have a production pipeline.
ls resources/test/allegheny_property_sales*.csv
# allegheny_property_sales.csv
# allegheny_property_sales.csv.idx (pre-built index)
# allegheny_property_sales.stats.csv (pre-built stats)If you're not in the qsv repo:
curl -LO https://raw.githubusercontent.com/dathere/qsv/master/resources/test/allegheny_property_sales.csv# 1.1: Normalize encoding, trim whitespace, skip preamble
qsv input --auto-skip --trim-headers --trim-fields \
allegheny_property_sales.csv > stage1_input.csv
# 1.2: Make column names DB-safe
qsv safenames stage1_input.csv > stage1_clean.csv
# 1.3: Index it (one-time cost, payoff downstream)
qsv index stage1_clean.csv# 2.1: Full stats with cardinality + date inference; write the sidecar JSONL
qsv stats --everything --infer-dates --infer-boolean --stats-jsonl \
stage1_clean.csv > stage2_stats.csv
# 2.2: Generate a Polars schema for fast sqlp/joinp/pivotp
qsv schema --polars stage1_clean.csv
ls stage1_clean.*
# .csv .idx .stats.csv .stats.csv.data.jsonl .pschema.json# 3.1: Generate a JSON Schema from representative data
qsv schema stage1_clean.csv # produces stage1_clean.csv.schema.json
# 3.2: Hand-edit the schema (in a real pipeline, this is checked in to git)
# For this recipe, the generated schema is good enough.
# 3.3: Validate the data against the schema
qsv validate stage1_clean.csv stage1_clean.csv.schema.json
# Exit 0 = clean. Non-zero = .invalid.csv + .validation-errors.tsv produced.# 4.1: Polars SQL aggregation — sales by year and property class
qsv sqlp stage1_clean.csv \
"SELECT
strftime('%Y', sale_date) AS year,
property_class,
COUNT(*) AS sales,
SUM(sale_price) AS total_revenue,
AVG(sale_price) AS avg_price,
MIN(sale_price) AS min_price,
MAX(sale_price) AS max_price
FROM stage1_clean
WHERE sale_price > 0
GROUP BY year, property_class
ORDER BY year DESC, total_revenue DESC" \
> stage4_summary.csv
# 4.2: Pivot for a wide year-over-year view
qsv pivotp property_class stage4_summary.csv \
--index year --values total_revenue --agg sum \
> stage4_pivot.csv
# 4.3: Score-before-run is great for production queries — checks for anti-patterns
qsv scoresql stage1_clean.csv \
"SELECT * FROM stage1_clean WHERE sale_price > 1000000" \
> stage4_query_score.txtA MiniJinja template that renders the per-year summary as Markdown:
{# report.j2 — one row per year/class #}
# Allegheny Property Sales — {{ year }} — {{ property_class }}
- **Sales count:** {{ sales | human_count }}
- **Total revenue:** ${{ total_revenue | format_float(2) }}
- **Average price:** ${{ avg_price | format_float(2) }}
- **Price range:** ${{ min_price | format_float(2) }} – ${{ max_price | format_float(2) }}
---qsv template --template-file report.j2 \
reports_outdir/ \
--outfilename '{{ year }}_{{ property_class }}.md' \
stage4_summary.csv
ls reports_outdir/
# 2024_Residential.md 2024_Commercial.md 2023_Residential.md ...Or render everything into one combined Markdown report:
qsv template --template-file report.j2 - stage4_summary.csv > full_report.md# 6.1: Parquet for downstream analytics (DuckDB, Spark, etc.)
qsv to parquet publish/ stage1_clean.csv stage4_summary.csv stage4_pivot.csv
ls publish/
# stage1_clean.parquet stage4_summary.parquet stage4_pivot.parquet
# 6.2: A Data Package descriptor for FAIR data sharing
qsv to datapackage --stats datapackage.json stage1_clean.csv# Exit 1 if anything is wrong; bash-pipe-safe
set -euo pipefail
qsv validate stage1_clean.csv stage1_clean.csv.schema.json || {
echo "::error::Validation failed"; exit 1;
}
test -f publish/stage1_clean.parquet || {
echo "::error::Parquet output missing"; exit 1;
}
echo "Pipeline OK."For exploratory analysis, swap the hand-written SQL for a natural-language prompt:
qsv describegpt stage1_clean.csv \
--prompt "Show me the top 10 most expensive sales per year." \
> stage4_summary.mdqsv writes the SQL itself in SQL-RAG sub-mode and runs it via DuckDB or Polars SQL.
# Makefile fragment
stage1_clean.csv: raw.csv
qsv input --auto-skip $< | qsv safenames - > $@
stage1_clean.csv.idx: stage1_clean.csv
qsv index $<
stage2_stats.csv stage1_clean.stats.csv.data.jsonl: stage1_clean.csv stage1_clean.csv.idx
qsv stats --everything --infer-dates --stats-jsonl $< > stage2_stats.csv
stage1_clean.csv.schema.json: stage1_clean.csv stage1_clean.stats.csv.data.jsonl
qsv schema $<
stage4_summary.csv: stage1_clean.csv stage1_clean.csv.schema.json
qsv validate $< $<.schema.json
qsv sqlp $< 'SELECT ...' > $@
publish: stage4_summary.csv
qsv to parquet publish/ $<Each step is incremental and cached — make reruns only what changed.
# .github/workflows/data-pipeline.yml
name: Data pipeline
on:
push:
paths: ['data/raw/*.csv']
jobs:
pipeline:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: curl -L https://github.com/dathere/qsv/releases/latest/download/qsv-x86_64-unknown-linux-gnu.zip -o qsv.zip
- run: unzip qsv.zip && sudo install qsv /usr/local/bin/
- run: make publish
- uses: actions/upload-artifact@v4
with:
name: publish
path: publish/- The stats cache + Polars schema accelerate steps 3, 4, and 5 substantially. On the Allegheny dataset (~mid-sized), the full pipeline runs in seconds.
- For multi-GB inputs, swap
sort/dedup→extsort/extdedup, and addQSV_AUTOINDEX_SIZEto your environment for automatic indexing. - Stage 7 is the "production gate" — make it part of your CI. The exit codes from
qsv validatemake this easy. - The Polars schema (
.pschema.json) is the biggest unexplored optimization for most users. Generate it once, check it into git, everysqlp/joinp/pivotpbecomes faster.
- SQL & Polars — sqlp, pivotp, scoresql
- Validation & Schema — schema, validate
- Stats Cache & Caching — the speed-up secret
- Conversion & I/O → to — Parquet, Data Package
- Scripting (Luau / Python) → template — MiniJinja
- Recipe: Stats → Insights — describegpt SQL-RAG alternative
- Recipe: Larger-than-RAM CSV — same pipeline with ext-* variants
- Integrations — CI/CD, DuckDB, Python notebooks
qsv — GitHub · Releases · Discussions · qsv pro · Try it online · Benchmarks · datHere · DeepWiki · Dual-licensed MIT / Unlicense
Edit this page: Contributing to the Wiki
Home · Why qsv? · Tier legend
- All Commands (index)
- Selection & Inspection
- Transform & Reshape
- Aggregation & Statistics
- Joins & Set Ops
- SQL & Polars
- Validation & Schema
- Metadata Profiling (profile)
- Conversion & I/O
- Geospatial
- HTTP & Web
- Get & Disk Cache
- Scripting (Luau / Python)
- Indexing, Compression & Diff
- AI & Documentation