-
Notifications
You must be signed in to change notification settings - Fork 104
SQL and Polars
Tier: Advanced
Commands covered: sqlp, joinp, pivotp, scoresql
Note
Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.
These commands are powered by the Polars vectorized query engine. They handle larger-than-RAM data, are multithreaded by default, and accept Polars's PostgreSQL dialect of SQL. If you reach for pandas or duckdb for CSV analytics, this is the page that replaces both for many workflows.
The joinp command is also covered in Joins & Set Ops; this page focuses on sqlp, pivotp, and scoresql.
| If you want to… | Use | Notes |
|---|---|---|
| Run a SQL query across one or many CSVs / Parquet / JSONL / Arrow | sqlp |
Polars SQL (PostgreSQL dialect), larger-than-RAM, multithreaded |
| Score a SQL query BEFORE running it | scoresql |
Plan analysis, anti-pattern detection, cache freshness |
| Pivot a CSV (wide ↔ long) with smart aggregation | pivotp |
Group-by mode when <on-cols> is omitted |
| Join 2+ CSVs (asof, non-equi, larger-than-RAM) | joinp |
See Joins & Set Ops |
Polars SQL — a PostgreSQL dialect — over CSV, Parquet, JSONL, and Arrow files. Polars converts the query into a lazy expression and streams the execution. Output can be CSV, JSON, JSONL, Parquet, Arrow IPC, or AVRO. Returns (rows, cols) to stderr.
The optimization secret: feed it a Polars schema (.pschema.json generated by schema --polars) — Polars then knows column types without an inference scan. Combined with a stats cache and scoresql, sqlp becomes the most performant SQL engine for CSV-shaped data we know of.
Polars SQL follows the PostgreSQL dialect. The function/keyword reference is written for Python Polars, so expect minor syntax differences. For the full flag list see
/docs/help/sqlp.md; the subsections below are the workflow view of what the SQL layer can do.
Each input file is registered as a table named after its file stem — wcp.csv → wcp, country_continent.csv → country_continent. You can also address inputs positionally as _t_1, _t_2, … (1-based, in argument order). Quote column names that contain spaces with double quotes (escaped as \" inside a double-quoted shell string).
# named table + positional alias refer to the same file
qsv sqlp wcp.csv "SELECT \"AccentCity\", Population FROM wcp ORDER BY Population DESC LIMIT 20"
qsv sqlp wcp.csv "SELECT \"AccentCity\", Population FROM _t_1 ORDER BY Population DESC LIMIT 20"Input flexibility:
-
stdin — pass
-as the input; the table is namedstdin. - a directory — every file inside is read as input.
-
an
.infile-listfile — treated as a newline-delimited list of input paths. -
snappy
.szfiles — auto-decompressed on read;--output result.csv.szauto-compresses.
# stdin as a table, joined against a file
cat wcp.csv | qsv sqlp - country_continent.csv \
"SELECT stdin.AccentCity, country_continent.continent
FROM stdin JOIN country_continent ON stdin.Country = country_continent.country"read_csv, read_ndjson, read_parquet, and read_ipc table functions let you read — and mix — formats inline. Pair them with the SKIP_INPUT sentinel to bypass qsv's CSV preprocessing entirely and hand parsing straight to Polars' multithreaded, mem-mapped reader (dramatically faster, at the cost of CSV-parser configurability — comma delimiter only, no CSV comments).
# join a CSV against a Parquet file inline
qsv sqlp wcp.csv \
"SELECT * FROM wcp
JOIN read_parquet('country_continent.parquet') AS cc ON wcp.Country = cc.country"
# SKIP_INPUT: let Polars read the CSV directly — fastest path for a single-file scan
qsv sqlp SKIP_INPUT "SELECT * FROM read_csv('wcp.csv') ORDER BY Population DESC LIMIT 100"
# mix NDJSON + Arrow with no CSV input at all
qsv sqlp SKIP_INPUT \
"SELECT * FROM read_ndjson('events.jsonl') AS t1
JOIN read_ipc('dims.arrow') AS t2 ON t1.id = t2.id"read_csv also decompresses .gz, .zst, and .zlib files transparently:
qsv sqlp SKIP_INPUT "SELECT * FROM read_csv('boston311-100.csv.gz')"
qsv sqlp SKIP_INPUT "SELECT * FROM read_csv('boston311-100.csv.zst')"SKIP_INPUT also works when the query needs no input at all — handy for testing expressions:
qsv sqlp SKIP_INPUT "SELECT 1 AS one, '2' AS two, 3.0 AS three"Polars SQL supports inner / left / right / full joins, USING, NATURAL JOIN, LEFT SEMI / RIGHT ANTI joins, and UNION [ALL] BY NAME.
# inner join with a post-join filter
qsv sqlp wcp.csv country_continent.csv \
"SELECT wcp.AccentCity, wcp.Population, country_continent.continent
FROM wcp
JOIN country_continent ON wcp.Country = country_continent.country
WHERE wcp.Population > 1000000
ORDER BY wcp.Population DESC"
# NATURAL JOIN — match on identically-named columns automatically
qsv sqlp data1.csv data2.csv data3.csv \
"SELECT COLUMNS('^[^:]+$') FROM data1 NATURAL JOIN data2 NATURAL JOIN data3 ORDER BY COMPANY_ID"
# chained SEMI / ANTI joins
qsv sqlp tbl_a.csv tbl_b.csv tbl_c.csv \
"SELECT * FROM tbl_a
RIGHT ANTI JOIN tbl_b USING (b)
LEFT SEMI JOIN tbl_c USING (c)"
# stack yearly exports with UNION ALL BY NAME (column order need not match)
qsv sqlp nyc311-2023.csv nyc311-2024.csv \
"SELECT * FROM nyc311_2023 UNION ALL BY NAME SELECT * FROM nyc311_2024"Beyond standard =, <, IN, and LIKE, Polars SQL adds a rich operator set:
| Operator | Meaning |
|---|---|
^@ |
starts-with |
~~ / ~~*
|
LIKE / case-insensitive ILIKE
|
!~~ / !~~*
|
NOT LIKE / case-insensitive NOT ILIKE
|
~ / ~*
|
regex contains (case-sensitive / insensitive) |
!~ / !~*
|
regex does-not-contain (case-sensitive / insensitive) |
<=> |
spaceship: -1 / 0 / 1 for < / == / >
|
# starts-with as a predicate and as a projected boolean
qsv sqlp wcp.csv "SELECT * FROM wcp WHERE City ^@ 'San'"
qsv sqlp wcp.csv "SELECT City, City ^@ 'San' AS starts_with_san FROM wcp"
# ILIKE-style ends-with, case-insensitive
qsv sqlp wcp.csv "SELECT City, City ~~* '%burg' AS ends_with_burg FROM wcp"
# regex operators combined with a numeric filter
qsv sqlp wcp.csv "SELECT * FROM wcp WHERE City ~ '^San' AND Population > 10000"
qsv sqlp wcp.csv "SELECT * FROM wcp WHERE City !~* 'ville$' AND Population > 10000"
# regexp_like(string, pattern [, flags]) — flags: 'c' (default), 'i', 'm'
qsv sqlp wcp.csv "SELECT * FROM wcp WHERE regexp_like(City, '^san', 'i')"
# regexp using a pattern from another column
qsv sqlp data.csv "SELECT idx, val FROM data WHERE val regexp pattern_col"SELECT * ILIKE projects only the columns whose names match the wildcard — useful for grabbing a family of columns without listing them:
# return every column whose name matches '%a%e%' (e.g. Name, State), case-insensitive
qsv sqlp wcp.csv "SELECT * ILIKE '%a%e%' FROM wcp ORDER BY Country"# CASE (searched and simple forms)
qsv sqlp wcp.csv \
"SELECT City,
CASE WHEN Population > 1000000 THEN 'mega'
WHEN Population > 100000 THEN 'large'
ELSE 'small' END AS tier
FROM wcp"
# COALESCE + NULLIF to default empty strings
qsv sqlp wcp.csv "SELECT COALESCE(NULLIF(Region, ''), 'unknown') AS region FROM wcp"
# string functions
qsv sqlp wcp.csv "SELECT lower(City), substr(Country, 1, 2) FROM wcp WHERE starts_with(City, 'San')"
# FILTER clause on aggregates — per-aggregate predicates in one GROUP BY pass
qsv sqlp sales.csv \
"SELECT region,
SUM(amount) AS total,
SUM(amount) FILTER (WHERE amount > 100) AS total_over_100,
COUNT(*) FILTER (WHERE status = 'refunded') AS refund_count
FROM sales GROUP BY region ORDER BY region"Dollar-quoting (PostgreSQL syntax) avoids escaping quotes and apostrophes inside literals. In a double-quoted shell string the $ must be escaped as \$:
qsv sqlp tweets.csv "SELECT * FROM tweets WHERE handle = \$\$Diane's@Twitter\$\$"
# named tag form for literals containing both quote styles
qsv sqlp data.csv 'SELECT * FROM data WHERE col1 = $tag$Diane'\''s "horse"$tag$'# Common Table Expression
qsv sqlp wcp.csv \
"WITH big AS (SELECT * FROM wcp WHERE Population >= 1000000)
SELECT * FROM big WHERE STARTS_WITH(City, 'S')"
# correlated-style IN subquery
qsv sqlp wcp.csv country_continent.csv \
"SELECT * FROM wcp
WHERE Country IN (SELECT country FROM country_continent WHERE continent = 'EU')"
# window function: rank cities within each country
qsv sqlp wcp.csv \
"SELECT City, Country, Population,
ROW_NUMBER() OVER (PARTITION BY Country ORDER BY Population DESC) AS rank_in_country
FROM wcp"If the sql argument ends in .sql, it's read as a script: queries separated by ;, single-line -- comments allowed, and the result of the last query is returned. Earlier queries typically build temporary tables consumed by later ones. In long scripts, TRUNCATE TABLE <name>; frees a temp table's memory before the script ends.
-- pipeline.sql — staged build, last SELECT is the result
CREATE TABLE eu AS
SELECT country FROM country_continent WHERE continent = 'EU';
CREATE TABLE eu_cities AS
SELECT w.* FROM wcp w JOIN eu ON w.Country = eu.country;
TRUNCATE TABLE eu; -- free intermediate memory
SELECT Country, COUNT(*) AS cities, MAX(Population) AS biggest
FROM eu_cities GROUP BY Country ORDER BY cities DESC;qsv sqlp wcp.csv country_continent.csv pipeline.sql --format json --output eu_summary.json--format accepts csv (default), json, jsonl, parquet, arrow, and avro. Columnar formats take --compression (and Parquet adds --compress-level and --statistics).
qsv sqlp wcp.csv "SELECT * FROM wcp WHERE Population > 1000000" \
--format parquet --compression zstd --compress-level 9 --statistics \
--output big_cities.parquet
# snappy round-trip via .sz extension (auto compress/decompress)
qsv sqlp wcp.csv.sz "SELECT * FROM wcp WHERE Population > 10000" --output result.csv.szCSV-output date/number formatting is controlled with --datetime-format, --date-format, --time-format, --float-precision, --decimal-comma, and --wnull-value.
--cache-schema writes a <file>.pschema.json next to each input and reuses it on later runs (skipping inference). The file is editable — change a column's type to cast it. For example, force a Float32 to a Decimal with explicit precision/scale by replacing "Float32" with {"Decimal": [10, 3]}. Valid types include Boolean, the sized ints/uints, Float32/64, String, Date, Datetime, Duration, Time, Categorical, Decimal, and Enum. If a .pschema.json already exists, sqlp uses it automatically — even without --cache-schema.
# first run infers + caches the schema; edit big_cities.pschema.json to cast, then re-run
qsv sqlp --cache-schema wcp.csv "SELECT * FROM wcp WHERE Population > 1000000"For date parsing on input, add --try-parsedates (falls back to strings on failure). --infer-len controls how many rows are scanned for inference (0 = full scan; for a single non-script CSV, qsv counts rows instead of forcing a full lazy scan).
# inspect the optimized query plan without running it
qsv sqlp wcp.csv "EXPLAIN SELECT * FROM wcp WHERE Population > 10000 ORDER BY Population DESC LIMIT 20"For out-of-memory situations use --streaming or --low-memory (both slower). --no-optimizations disables non-default optimizations when a query errors. Run scoresql first to catch anti-patterns and estimate cost before executing.
See also: /docs/help/sqlp.md, Polars SQL reference, scoresql, pivotp, joinp, Recipe: Build a Data Pipeline.
Analyze a SQL query against the stats / frequency / moarstats caches and the query plan — before you run it. Output is a human-readable performance report (default) or JSON. Supports both Polars (default) and DuckDB modes for the underlying plan analysis.
Caches are auto-generated when missing: qsv stats --everything --stats-jsonl and qsv frequency --frequency-jsonl if their outputs aren't already there.
Example: score a join query before running it
qsv scoresql wcp.csv country_continent.csv \
"SELECT wcp.AccentCity, country_continent.continent
FROM wcp JOIN country_continent ON wcp.Country = country_continent.country"Report includes: query plan (EXPLAIN output), type optimization warnings, join key cardinality, filter selectivity, anti-pattern detection (SELECT *, missing LIMIT, cartesian joins, …), and cache freshness.
Example: get JSON output for CI gating
qsv scoresql --json wcp.csv "SELECT * FROM wcp" | jq '.score, .warnings'A SELECT * without LIMIT will earn an anti-pattern warning.
Example: use DuckDB for plan analysis instead of Polars
qsv scoresql --duckdb nyc311.csv "SELECT * FROM nyc311 WHERE status = 'active'"Example: score a query loaded from a .sql file
qsv scoresql data.csv complex_query.sqlOnly the last query in the file is scored.
See also: /docs/help/scoresql.md, sqlp, stats, frequency, Stats Cache & Caching.
Pivot or group-and-aggregate CSV data using Polars. Output is CSV; stdin is not supported. The command has two modes, selected by whether you pass the <on-cols> positional argument:
-
Pivot mode —
qsv pivotp <on-cols> <input>:--indexcolumn(s) become rows, the distinct<on-cols>values become new columns, and--valuesis aggregated into each cell. At least one of--index/--valuesmust be given (the unspecified one defaults to "all remaining columns"). -
Group-by mode —
qsv pivotp <input>(no<on-cols>): a plainGROUP BYaggregation.--indexis required (it's the group key); if--valuesis omitted you get a singlecountcolumn.
For the full flag list see
/docs/help/pivotp.md; the subsections below are the workflow view.
# Complaint Type × Year, counting Unique Keys per cell
qsv pivotp 'Complaint Type' NYC_311_SR_2010-2020-sample-1M.csv \
--index YearReported \
--values 'Unique Key' \
--agg len \
--output nyc311_pivot.csv(Assumes a YearReported column — derive it first with datefmt --formatstr '%Y' --new-column YearReported.)
# Sale Year × Property Class, average Sale Price per cell
qsv pivotp 'Property Class' allegheny_property_sales.csv \
--index 'Sale Year' \
--values 'Sale Price' \
--agg meanPivot on multiple columns and/or aggregate multiple value columns — the generated column names join the parts with --col-separator (default _):
qsv pivotp 'Property Class,Sale Year' allegheny_property_sales.csv \
--index Municipality \
--values 'Sale Price,Lot Area' \
--agg mean --col-separator '::'# count rows per Borough (no --values → a single "count" column)
qsv pivotp NYC_311_SR_2010-2020-sample-1M.csv \
--index Borough \
--output borough_counts.csv
# sum/mean per composite group
qsv pivotp allegheny_property_sales.csv \
--index 'Municipality,Property Class' \
--values 'Sale Price' \
--agg mean--agg (default smart) accepts:
| Function | Result |
|---|---|
first / last
|
first / last value encountered |
sum / min / max / mean / median
|
the usual reductions |
quantile@<p> (alias q@<p>) |
linear-interpolation quantile at p ∈ [0, 1]
|
len |
count of values |
item |
the single value in the group — errors if there's more than one (pivot mode only) |
smart |
auto-pick based on the value column's type & statistics |
Since 20.1.0, q@0.5 equals median for even-length groups, and q@0.95 / q@0.99 are the canonical high-percentile choices for latency / SLA work:
# p95 latency per Borough
qsv pivotp --index Borough --values latency_ms --agg q@0.95 nyc311.csvImportant
Group-by mode does not support --agg item or --agg none.
With --agg smart (the default), pivotp reads the stats cache for the value column and picks an aggregation from its data type and distribution — so you usually don't have to think about it. In group-by mode smart resolves to len (count). In pivot mode it considers type, cardinality, sparsity, sign distribution, and sort order; when richer stats are present it goes further:
- a prior
stats --everythingadds skewness and mode count, -
moarstatsadds outlier profile, MAD/stddev and mean/median divergence, and quartile dispersion, -
moarstats --advancedadds kurtosis, bimodality, entropy, and the Gini coefficient.
In practice: a clean numeric column → sum; heavy-tailed / outlier-laden / skewed → median; mixed-sign → mean (to avoid sum cancellation); sparse or near-uniform → len; a single distinct value → item. The chosen function and the reasoning are printed to stderr (silence with -q). Smart only engages with a single value column; with multiple value columns it falls back to first.
# add a grand-total row (sums every numeric, non-index column)
qsv pivotp 'Property Class' allegheny_property_sales.csv \
--index 'Sale Year' --values 'Sale Price' --agg sum --grand-total
# subtotal after each first-index group (requires 2+ index columns), custom label
qsv pivotp 'Property Class' allegheny_property_sales.csv \
--index 'Municipality,Sale Year' --values 'Sale Price' --agg sum \
--subtotal --total-label 'Subtotal'--grand-total needs at least one index column; --subtotal needs at least two (the label lands in the second index column) and is pivot-mode only.
-
--maintain-order— preserve input column order (pivot) / group order (group-by) instead of the faster unspecified order. -
--sort-columns— sort the generated pivot columns alphabetically (pivot mode only). -
--validate— check the<on-cols>cardinality first and warn (>1,000 result columns) or error (>100,000) before doing the work; great for guarding against an accidental high-cardinality pivot (pivot mode only). -
--try-parsedates,--infer-len <n>,--decimal-comma(needs an alternate--delimiter), and--ignore-errorscontrol CSV parsing.
# cardinality check before committing to a wide pivot
qsv pivotp 'Complaint Type' nyc311.csv --index Borough --values 'Unique Key' --validatepivotp automatically picks up a <filename>.pschema.json (from schema --polars) when present — unless you override inference with a non-default --infer-len. That lets Polars skip the type-inference scan.
See also: /docs/help/pivotp.md, sqlp — alternative via GROUP BY and PIVOT, stats / moarstats — feed smart aggregation, transpose, schema --polars.
See Joins & Set Ops → joinp for the full treatment. The headline capabilities for SQL-style work:
- Asof joins — time-series-aware, matches each left row to the nearest prior right row.
-
Non-equi joins — range-based matches via a Polars SQL
WHEREclause with_left/_rightcolumn suffixes. -
Pre-join filtering —
--filter-left/--filter-rightevaluate Polars SQL before the join, cheaper than filtering after. -
Column coalescing —
--coalescemerges identically-named columns post-join.
- Command Reference (index)
-
Joins & Set Ops —
joinpin depth -
Aggregation & Statistics —
statscache feedsscoresql -
Validation & Schema —
schema --polarsproduces.pschema.jsonfor sqlp/joinp/pivotp - Stats Cache & Caching
- Polars SQL docs
- Cookbook → Build a Data Pipeline
- Cookbook → Larger-than-RAM CSV
qsv — GitHub · Releases · Discussions · qsv pro · Try it online · Benchmarks · datHere · DeepWiki · Dual-licensed MIT / Unlicense
Edit this page: Contributing to the Wiki
Home · Why qsv? · Tier legend
- All Commands (index)
- Selection & Inspection
- Transform & Reshape
- Aggregation & Statistics
- Joins & Set Ops
- SQL & Polars
- Validation & Schema
- Metadata Profiling (profile)
- Conversion & I/O
- Geospatial
- HTTP & Web
- Get & Disk Cache
- Scripting (Luau / Python)
- Indexing, Compression & Diff
- AI & Documentation