SQL and Polars

SQL & Polars

Tier: Advanced Commands covered: sqlp, joinp, pivotp, scoresql

Note

Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.

These commands are powered by the Polars vectorized query engine. They handle larger-than-RAM data, are multithreaded by default, and accept Polars's PostgreSQL dialect of SQL. If you reach for pandas or duckdb for CSV analytics, this is the page that replaces both for many workflows.

The joinp command is also covered in Joins & Set Ops; this page focuses on sqlp, pivotp, and scoresql.

Quick decision table

If you want to…	Use	Notes
Run a SQL query across one or many CSVs / Parquet / JSONL / Arrow	`sqlp`	Polars SQL (PostgreSQL dialect), larger-than-RAM, multithreaded
Score a SQL query BEFORE running it	`scoresql`	Plan analysis, anti-pattern detection, cache freshness
Pivot a CSV (wide ↔ long) with smart aggregation	`pivotp`	Group-by mode when `<on-cols>` is omitted
Join 2+ CSVs (asof, non-equi, larger-than-RAM)	`joinp`	See Joins & Set Ops

`sqlp`

Polars SQL — a PostgreSQL dialect — over CSV, Parquet, JSONL, and Arrow files. Polars converts the query into a lazy expression and streams the execution. Output can be CSV, JSON, JSONL, Parquet, Arrow IPC, or AVRO. Returns (rows, cols) to stderr.

The optimization secret: feed it a Polars schema (.pschema.json generated by schema --polars) — Polars then knows column types without an inference scan. Combined with a stats cache and scoresql, sqlp becomes the most performant SQL engine for CSV-shaped data we know of.

Polars SQL follows the PostgreSQL dialect. The function/keyword reference is written for Python Polars, so expect minor syntax differences. For the full flag list see /docs/help/sqlp.md; the subsections below are the workflow view of what the SQL layer can do.

How tables get their names

Each input file is registered as a table named after its file stem — wcp.csv → wcp, country_continent.csv → country_continent. You can also address inputs positionally as _t_1, _t_2, … (1-based, in argument order). Quote column names that contain spaces with double quotes (escaped as \" inside a double-quoted shell string).

# named table + positional alias refer to the same file
qsv sqlp wcp.csv "SELECT \"AccentCity\", Population FROM wcp ORDER BY Population DESC LIMIT 20"
qsv sqlp wcp.csv "SELECT \"AccentCity\", Population FROM _t_1 ORDER BY Population DESC LIMIT 20"

Input flexibility:

stdin — pass - as the input; the table is named stdin.
a directory — every file inside is read as input.
an .infile-list file — treated as a newline-delimited list of input paths.
snappy .sz files — auto-decompressed on read; --output result.csv.sz auto-compresses.

# stdin as a table, joined against a file
cat wcp.csv | qsv sqlp - country_continent.csv \
  "SELECT stdin.AccentCity, country_continent.continent
   FROM stdin JOIN country_continent ON stdin.Country = country_continent.country"

Multi-format input via table functions

read_csv, read_ndjson, read_parquet, and read_ipc table functions let you read — and mix — formats inline. Pair them with the SKIP_INPUT sentinel to bypass qsv's CSV preprocessing entirely and hand parsing straight to Polars' multithreaded, mem-mapped reader (dramatically faster, at the cost of CSV-parser configurability — comma delimiter only, no CSV comments).

# join a CSV against a Parquet file inline
qsv sqlp wcp.csv \
  "SELECT * FROM wcp
   JOIN read_parquet('country_continent.parquet') AS cc ON wcp.Country = cc.country"

# SKIP_INPUT: let Polars read the CSV directly — fastest path for a single-file scan
qsv sqlp SKIP_INPUT "SELECT * FROM read_csv('wcp.csv') ORDER BY Population DESC LIMIT 100"

# mix NDJSON + Arrow with no CSV input at all
qsv sqlp SKIP_INPUT \
  "SELECT * FROM read_ndjson('events.jsonl') AS t1
   JOIN read_ipc('dims.arrow') AS t2 ON t1.id = t2.id"

read_csv also decompresses .gz, .zst, and .zlib files transparently:

qsv sqlp SKIP_INPUT "SELECT * FROM read_csv('boston311-100.csv.gz')"
qsv sqlp SKIP_INPUT "SELECT * FROM read_csv('boston311-100.csv.zst')"

SKIP_INPUT also works when the query needs no input at all — handy for testing expressions:

qsv sqlp SKIP_INPUT "SELECT 1 AS one, '2' AS two, 3.0 AS three"

Joins & set operations

Polars SQL supports inner / left / right / full joins, USING, NATURAL JOIN, LEFT SEMI / RIGHT ANTI joins, and UNION [ALL] BY NAME.

# inner join with a post-join filter
qsv sqlp wcp.csv country_continent.csv \
  "SELECT wcp.AccentCity, wcp.Population, country_continent.continent
   FROM wcp
   JOIN country_continent ON wcp.Country = country_continent.country
   WHERE wcp.Population > 1000000
   ORDER BY wcp.Population DESC"

# NATURAL JOIN — match on identically-named columns automatically
qsv sqlp data1.csv data2.csv data3.csv \
  "SELECT COLUMNS('^[^:]+$') FROM data1 NATURAL JOIN data2 NATURAL JOIN data3 ORDER BY COMPANY_ID"

# chained SEMI / ANTI joins
qsv sqlp tbl_a.csv tbl_b.csv tbl_c.csv \
  "SELECT * FROM tbl_a
     RIGHT ANTI JOIN tbl_b USING (b)
     LEFT SEMI JOIN tbl_c USING (c)"

# stack yearly exports with UNION ALL BY NAME (column order need not match)
qsv sqlp nyc311-2023.csv nyc311-2024.csv \
  "SELECT * FROM nyc311_2023 UNION ALL BY NAME SELECT * FROM nyc311_2024"

Filtering: pattern, regex & comparison operators

Beyond standard =, <, IN, and LIKE, Polars SQL adds a rich operator set:

Operator	Meaning
`^@`	starts-with
`~~` / `~~*`	`LIKE` / case-insensitive `ILIKE`
`!~~` / `!~~*`	`NOT LIKE` / case-insensitive `NOT ILIKE`
`~` / `~*`	regex contains (case-sensitive / insensitive)
`!~` / `!~*`	regex does-not-contain (case-sensitive / insensitive)
`<=>`	spaceship: `-1` / `0` / `1` for `<` / `==` / `>`

# starts-with as a predicate and as a projected boolean
qsv sqlp wcp.csv "SELECT * FROM wcp WHERE City ^@ 'San'"
qsv sqlp wcp.csv "SELECT City, City ^@ 'San' AS starts_with_san FROM wcp"

# ILIKE-style ends-with, case-insensitive
qsv sqlp wcp.csv "SELECT City, City ~~* '%burg' AS ends_with_burg FROM wcp"

# regex operators combined with a numeric filter
qsv sqlp wcp.csv "SELECT * FROM wcp WHERE City ~ '^San' AND Population > 10000"
qsv sqlp wcp.csv "SELECT * FROM wcp WHERE City !~* 'ville$' AND Population > 10000"

# regexp_like(string, pattern [, flags]) — flags: 'c' (default), 'i', 'm'
qsv sqlp wcp.csv "SELECT * FROM wcp WHERE regexp_like(City, '^san', 'i')"

# regexp using a pattern from another column
qsv sqlp data.csv "SELECT idx, val FROM data WHERE val regexp pattern_col"

SELECT * ILIKE projects only the columns whose names match the wildcard — useful for grabbing a family of columns without listing them:

# return every column whose name matches '%a%e%' (e.g. Name, State), case-insensitive
qsv sqlp wcp.csv "SELECT * ILIKE '%a%e%' FROM wcp ORDER BY Country"

Expressions, aggregates & string functions

# CASE (searched and simple forms)
qsv sqlp wcp.csv \
  "SELECT City,
     CASE WHEN Population > 1000000 THEN 'mega'
          WHEN Population > 100000  THEN 'large'
          ELSE 'small' END AS tier
   FROM wcp"

# COALESCE + NULLIF to default empty strings
qsv sqlp wcp.csv "SELECT COALESCE(NULLIF(Region, ''), 'unknown') AS region FROM wcp"

# string functions
qsv sqlp wcp.csv "SELECT lower(City), substr(Country, 1, 2) FROM wcp WHERE starts_with(City, 'San')"

# FILTER clause on aggregates — per-aggregate predicates in one GROUP BY pass
qsv sqlp sales.csv \
  "SELECT region,
     SUM(amount) AS total,
     SUM(amount) FILTER (WHERE amount > 100)        AS total_over_100,
     COUNT(*)    FILTER (WHERE status = 'refunded') AS refund_count
   FROM sales GROUP BY region ORDER BY region"

Dollar-quoting (PostgreSQL syntax) avoids escaping quotes and apostrophes inside literals. In a double-quoted shell string the $ must be escaped as \$:

qsv sqlp tweets.csv "SELECT * FROM tweets WHERE handle = \$\$Diane's@Twitter\$\$"
# named tag form for literals containing both quote styles
qsv sqlp data.csv 'SELECT * FROM data WHERE col1 = $tag$Diane'\''s "horse"$tag$'

CTEs, subqueries & window functions

# Common Table Expression
qsv sqlp wcp.csv \
  "WITH big AS (SELECT * FROM wcp WHERE Population >= 1000000)
   SELECT * FROM big WHERE STARTS_WITH(City, 'S')"

# correlated-style IN subquery
qsv sqlp wcp.csv country_continent.csv \
  "SELECT * FROM wcp
   WHERE Country IN (SELECT country FROM country_continent WHERE continent = 'EU')"

# window function: rank cities within each country
qsv sqlp wcp.csv \
  "SELECT City, Country, Population,
     ROW_NUMBER() OVER (PARTITION BY Country ORDER BY Population DESC) AS rank_in_country
   FROM wcp"

SQL scripts (multi-query)

If the sql argument ends in .sql, it's read as a script: queries separated by ;, single-line -- comments allowed, and the result of the last query is returned. Earlier queries typically build temporary tables consumed by later ones. In long scripts, TRUNCATE TABLE <name>; frees a temp table's memory before the script ends.

-- pipeline.sql — staged build, last SELECT is the result
CREATE TABLE eu AS
  SELECT country FROM country_continent WHERE continent = 'EU';

CREATE TABLE eu_cities AS
  SELECT w.* FROM wcp w JOIN eu ON w.Country = eu.country;

TRUNCATE TABLE eu;          -- free intermediate memory

SELECT Country, COUNT(*) AS cities, MAX(Population) AS biggest
FROM eu_cities GROUP BY Country ORDER BY cities DESC;

qsv sqlp wcp.csv country_continent.csv pipeline.sql --format json --output eu_summary.json

Output formats & compression

--format accepts csv (default), json, jsonl, parquet, arrow, and avro. Columnar formats take --compression (and Parquet adds --compress-level and --statistics).

qsv sqlp wcp.csv "SELECT * FROM wcp WHERE Population > 1000000" \
  --format parquet --compression zstd --compress-level 9 --statistics \
  --output big_cities.parquet

# snappy round-trip via .sz extension (auto compress/decompress)
qsv sqlp wcp.csv.sz "SELECT * FROM wcp WHERE Population > 10000" --output result.csv.sz

CSV-output date/number formatting is controlled with --datetime-format, --date-format, --time-format, --float-precision, --decimal-comma, and --wnull-value.

Schema caching & type casting

--cache-schema writes a <file>.pschema.json next to each input and reuses it on later runs (skipping inference). The file is editable — change a column's type to cast it. For example, force a Float32 to a Decimal with explicit precision/scale by replacing "Float32" with {"Decimal": [10, 3]}. Valid types include Boolean, the sized ints/uints, Float32/64, String, Date, Datetime, Duration, Time, Categorical, Decimal, and Enum. If a .pschema.json already exists, sqlp uses it automatically — even without --cache-schema.

# first run infers + caches the schema; edit big_cities.pschema.json to cast, then re-run
qsv sqlp --cache-schema wcp.csv "SELECT * FROM wcp WHERE Population > 1000000"

For date parsing on input, add --try-parsedates (falls back to strings on failure). --infer-len controls how many rows are scanned for inference (0 = full scan; for a single non-script CSV, qsv counts rows instead of forcing a full lazy scan).

Performance & debugging

# inspect the optimized query plan without running it
qsv sqlp wcp.csv "EXPLAIN SELECT * FROM wcp WHERE Population > 10000 ORDER BY Population DESC LIMIT 20"

For out-of-memory situations use --streaming or --low-memory (both slower). --no-optimizations disables non-default optimizations when a query errors. Run scoresql first to catch anti-patterns and estimate cost before executing.

See also: /docs/help/sqlp.md, Polars SQL reference, scoresql, pivotp, joinp, Recipe: Build a Data Pipeline.

`scoresql`

Analyze a SQL query against the stats / frequency / moarstats caches and the query plan — before you run it. Output is a human-readable performance report (default) or JSON. Supports both Polars (default) and DuckDB modes for the underlying plan analysis.

Caches are auto-generated when missing: qsv stats --everything --stats-jsonl and qsv frequency --frequency-jsonl if their outputs aren't already there.

Example: score a join query before running it

qsv scoresql wcp.csv country_continent.csv \
  "SELECT wcp.AccentCity, country_continent.continent
   FROM wcp JOIN country_continent ON wcp.Country = country_continent.country"

Report includes: query plan (EXPLAIN output), type optimization warnings, join key cardinality, filter selectivity, anti-pattern detection (SELECT *, missing LIMIT, cartesian joins, …), and cache freshness.

Example: get JSON output for CI gating

qsv scoresql --json wcp.csv "SELECT * FROM wcp" | jq '.score, .warnings'

A SELECT * without LIMIT will earn an anti-pattern warning.

Example: use DuckDB for plan analysis instead of Polars

qsv scoresql --duckdb nyc311.csv "SELECT * FROM nyc311 WHERE status = 'active'"

Example: score a query loaded from a .sql file

qsv scoresql data.csv complex_query.sql

Only the last query in the file is scored.

See also: /docs/help/scoresql.md, sqlp, stats, frequency, Stats Cache & Caching.

`pivotp`

Pivot or group-and-aggregate CSV data using Polars. Output is CSV; stdin is not supported. The command has two modes, selected by whether you pass the <on-cols> positional argument:

Pivot mode — qsv pivotp <on-cols> <input>: --index column(s) become rows, the distinct <on-cols> values become new columns, and --values is aggregated into each cell. At least one of --index / --values must be given (the unspecified one defaults to "all remaining columns").
Group-by mode — qsv pivotp <input> (no <on-cols>): a plain GROUP BY aggregation. --index is required (it's the group key); if --values is omitted you get a single count column.

For the full flag list see /docs/help/pivotp.md; the subsections below are the workflow view.

Pivot mode

# Complaint Type × Year, counting Unique Keys per cell
qsv pivotp 'Complaint Type' NYC_311_SR_2010-2020-sample-1M.csv \
  --index YearReported \
  --values 'Unique Key' \
  --agg len \
  --output nyc311_pivot.csv

(Assumes a YearReported column — derive it first with datefmt --formatstr '%Y' --new-column YearReported.)

# Sale Year × Property Class, average Sale Price per cell
qsv pivotp 'Property Class' allegheny_property_sales.csv \
  --index 'Sale Year' \
  --values 'Sale Price' \
  --agg mean

Pivot on multiple columns and/or aggregate multiple value columns — the generated column names join the parts with --col-separator (default _):

qsv pivotp 'Property Class,Sale Year' allegheny_property_sales.csv \
  --index Municipality \
  --values 'Sale Price,Lot Area' \
  --agg mean --col-separator '::'

Group-by mode

# count rows per Borough (no --values → a single "count" column)
qsv pivotp NYC_311_SR_2010-2020-sample-1M.csv \
  --index Borough \
  --output borough_counts.csv

# sum/mean per composite group
qsv pivotp allegheny_property_sales.csv \
  --index 'Municipality,Property Class' \
  --values 'Sale Price' \
  --agg mean

Aggregation functions

--agg (default smart) accepts:

Function	Result
`first` / `last`	first / last value encountered
`sum` / `min` / `max` / `mean` / `median`	the usual reductions
`quantile@<p>` (alias `q@<p>`)	linear-interpolation quantile at `p ∈ [0, 1]`
`len`	count of values
`item`	the single value in the group — errors if there's more than one (pivot mode only)
`smart`	auto-pick based on the value column's type & statistics

Since 20.1.0, q@0.5 equals median for even-length groups, and q@0.95 / q@0.99 are the canonical high-percentile choices for latency / SLA work:

# p95 latency per Borough
qsv pivotp --index Borough --values latency_ms --agg q@0.95 nyc311.csv

Important

Group-by mode does not support --agg item or --agg none.

Smart aggregation

With --agg smart (the default), pivotp reads the stats cache for the value column and picks an aggregation from its data type and distribution — so you usually don't have to think about it. In group-by mode smart resolves to len (count). In pivot mode it considers type, cardinality, sparsity, sign distribution, and sort order; when richer stats are present it goes further:

a prior stats --everything adds skewness and mode count,
moarstats adds outlier profile, MAD/stddev and mean/median divergence, and quartile dispersion,
moarstats --advanced adds kurtosis, bimodality, entropy, and the Gini coefficient.

In practice: a clean numeric column → sum; heavy-tailed / outlier-laden / skewed → median; mixed-sign → mean (to avoid sum cancellation); sparse or near-uniform → len; a single distinct value → item. The chosen function and the reasoning are printed to stderr (silence with -q). Smart only engages with a single value column; with multiple value columns it falls back to first.

Totals & subtotals

# add a grand-total row (sums every numeric, non-index column)
qsv pivotp 'Property Class' allegheny_property_sales.csv \
  --index 'Sale Year' --values 'Sale Price' --agg sum --grand-total

# subtotal after each first-index group (requires 2+ index columns), custom label
qsv pivotp 'Property Class' allegheny_property_sales.csv \
  --index 'Municipality,Sale Year' --values 'Sale Price' --agg sum \
  --subtotal --total-label 'Subtotal'

--grand-total needs at least one index column; --subtotal needs at least two (the label lands in the second index column) and is pivot-mode only.

Ordering, validation & parsing

--maintain-order — preserve input column order (pivot) / group order (group-by) instead of the faster unspecified order.
--sort-columns — sort the generated pivot columns alphabetically (pivot mode only).
--validate — check the <on-cols> cardinality first and warn (>1,000 result columns) or error (>100,000) before doing the work; great for guarding against an accidental high-cardinality pivot (pivot mode only).
--try-parsedates, --infer-len <n>, --decimal-comma (needs an alternate --delimiter), and --ignore-errors control CSV parsing.

# cardinality check before committing to a wide pivot
qsv pivotp 'Complaint Type' nyc311.csv --index Borough --values 'Unique Key' --validate

pivotp automatically picks up a <filename>.pschema.json (from schema --polars) when present — unless you override inference with a non-default --infer-len. That lets Polars skip the type-inference scan.

See also: /docs/help/pivotp.md, sqlp — alternative via GROUP BY and PIVOT, stats / moarstats — feed smart aggregation, transpose, schema --polars.

`joinp` (cross-reference)

See Joins & Set Ops → joinp for the full treatment. The headline capabilities for SQL-style work:

Asof joins — time-series-aware, matches each left row to the nearest prior right row.
Non-equi joins — range-based matches via a Polars SQL WHERE clause with _left / _right column suffixes.
Pre-join filtering — --filter-left / --filter-right evaluate Polars SQL before the join, cheaper than filtering after.
Column coalescing — --coalesce merges identically-named columns post-join.

SQL and Polars

SQL & Polars

Quick decision table

sqlp

How tables get their names

Multi-format input via table functions

Joins & set operations

Filtering: pattern, regex & comparison operators

Expressions, aggregates & string functions

CTEs, subqueries & window functions

SQL scripts (multi-query)

Output formats & compression

Schema caching & type casting

Performance & debugging

scoresql

pivotp

Pivot mode

Group-by mode

Aggregation functions

Smart aggregation

Totals & subtotals

Ordering, validation & parsing

joinp (cross-reference)

See also

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Get Started

Command Reference

`sqlp`

`scoresql`

`pivotp`

`joinp` (cross-reference)