Skip to content
Joel Natividad edited this page May 30, 2026 · 6 revisions

SQL & Polars

Tier: Advanced Commands covered: sqlp, joinp, pivotp, scoresql

Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.

These commands are powered by the Polars vectorized query engine. They handle larger-than-RAM data, are multithreaded by default, and accept Polars's PostgreSQL dialect of SQL. If you reach for pandas or duckdb for CSV analytics, this is the page that replaces both for many workflows.

The joinp command is also covered in Joins & Set Ops; this page focuses on sqlp, pivotp, and scoresql.

Quick decision table

If you want to… Use Notes
Run a SQL query across one or many CSVs / Parquet / JSONL / Arrow sqlp Polars SQL (PostgreSQL dialect), larger-than-RAM, multithreaded
Score a SQL query BEFORE running it scoresql Plan analysis, anti-pattern detection, cache freshness
Pivot a CSV (wide ↔ long) with smart aggregation pivotp Group-by mode when <on-cols> is omitted
Join 2+ CSVs (asof, non-equi, larger-than-RAM) joinp See Joins & Set Ops

sqlp

Polars SQL — a PostgreSQL dialect — over CSV, Parquet, JSONL, and Arrow files. Polars converts the query into a lazy expression and streams the execution. Output can be CSV, JSON, JSONL, Parquet, Arrow IPC, or AVRO. Returns (rows, cols) to stderr.

The optimization secret: feed it a Polars schema (.pschema.json generated by schema --polars) — Polars then knows column types without an inference scan. Combined with a stats cache and scoresql, sqlp becomes the most performant SQL engine for CSV-shaped data we know of.

Polars SQL follows the PostgreSQL dialect. The function/keyword reference is written for Python Polars, so expect minor syntax differences. For the full flag list see /docs/help/sqlp.md; the subsections below are the workflow view of what the SQL layer can do.

How tables get their names

Each input file is registered as a table named after its file stemwcp.csvwcp, country_continent.csvcountry_continent. You can also address inputs positionally as _t_1, _t_2, … (1-based, in argument order). Quote column names that contain spaces with double quotes (escaped as \" inside a double-quoted shell string).

# named table + positional alias refer to the same file
qsv sqlp wcp.csv "SELECT \"AccentCity\", Population FROM wcp ORDER BY Population DESC LIMIT 20"
qsv sqlp wcp.csv "SELECT \"AccentCity\", Population FROM _t_1 ORDER BY Population DESC LIMIT 20"

Input flexibility:

  • stdin — pass - as the input; the table is named stdin.
  • a directory — every file inside is read as input.
  • an .infile-list file — treated as a newline-delimited list of input paths.
  • snappy .sz files — auto-decompressed on read; --output result.csv.sz auto-compresses.
# stdin as a table, joined against a file
cat wcp.csv | qsv sqlp - country_continent.csv \
  "SELECT stdin.AccentCity, country_continent.continent
   FROM stdin JOIN country_continent ON stdin.Country = country_continent.country"

Multi-format input via table functions

read_csv, read_ndjson, read_parquet, and read_ipc table functions let you read — and mix — formats inline. Pair them with the SKIP_INPUT sentinel to bypass qsv's CSV preprocessing entirely and hand parsing straight to Polars' multithreaded, mem-mapped reader (dramatically faster, at the cost of CSV-parser configurability — comma delimiter only, no CSV comments).

# join a CSV against a Parquet file inline
qsv sqlp wcp.csv \
  "SELECT * FROM wcp
   JOIN read_parquet('country_continent.parquet') AS cc ON wcp.Country = cc.country"

# SKIP_INPUT: let Polars read the CSV directly — fastest path for a single-file scan
qsv sqlp SKIP_INPUT "SELECT * FROM read_csv('wcp.csv') ORDER BY Population DESC LIMIT 100"

# mix NDJSON + Arrow with no CSV input at all
qsv sqlp SKIP_INPUT \
  "SELECT * FROM read_ndjson('events.jsonl') AS t1
   JOIN read_ipc('dims.arrow') AS t2 ON t1.id = t2.id"

read_csv also decompresses .gz, .zst, and .zlib files transparently:

qsv sqlp SKIP_INPUT "SELECT * FROM read_csv('boston311-100.csv.gz')"
qsv sqlp SKIP_INPUT "SELECT * FROM read_csv('boston311-100.csv.zst')"

SKIP_INPUT also works when the query needs no input at all — handy for testing expressions:

qsv sqlp SKIP_INPUT "SELECT 1 AS one, '2' AS two, 3.0 AS three"

Joins & set operations

Polars SQL supports inner / left / right / full joins, USING, NATURAL JOIN, LEFT SEMI / RIGHT ANTI joins, and UNION [ALL] BY NAME.

# inner join with a post-join filter
qsv sqlp wcp.csv country_continent.csv \
  "SELECT wcp.AccentCity, wcp.Population, country_continent.continent
   FROM wcp
   JOIN country_continent ON wcp.Country = country_continent.country
   WHERE wcp.Population > 1000000
   ORDER BY wcp.Population DESC"

# NATURAL JOIN — match on identically-named columns automatically
qsv sqlp data1.csv data2.csv data3.csv \
  "SELECT COLUMNS('^[^:]+$') FROM data1 NATURAL JOIN data2 NATURAL JOIN data3 ORDER BY COMPANY_ID"

# chained SEMI / ANTI joins
qsv sqlp tbl_a.csv tbl_b.csv tbl_c.csv \
  "SELECT * FROM tbl_a
     RIGHT ANTI JOIN tbl_b USING (b)
     LEFT SEMI JOIN tbl_c USING (c)"

# stack yearly exports with UNION ALL BY NAME (column order need not match)
qsv sqlp nyc311-2023.csv nyc311-2024.csv \
  "SELECT * FROM nyc311_2023 UNION ALL BY NAME SELECT * FROM nyc311_2024"

Filtering: pattern, regex & comparison operators

Beyond standard =, <, IN, and LIKE, Polars SQL adds a rich operator set:

Operator Meaning
^@ starts-with
~~ / ~~* LIKE / case-insensitive ILIKE
!~~ / !~~* NOT LIKE / case-insensitive NOT ILIKE
~ / ~* regex contains (case-sensitive / insensitive)
!~ / !~* regex does-not-contain (case-sensitive / insensitive)
<=> spaceship: -1 / 0 / 1 for < / == / >
# starts-with as a predicate and as a projected boolean
qsv sqlp wcp.csv "SELECT * FROM wcp WHERE City ^@ 'San'"
qsv sqlp wcp.csv "SELECT City, City ^@ 'San' AS starts_with_san FROM wcp"

# ILIKE-style ends-with, case-insensitive
qsv sqlp wcp.csv "SELECT City, City ~~* '%burg' AS ends_with_burg FROM wcp"

# regex operators combined with a numeric filter
qsv sqlp wcp.csv "SELECT * FROM wcp WHERE City ~ '^San' AND Population > 10000"
qsv sqlp wcp.csv "SELECT * FROM wcp WHERE City !~* 'ville$' AND Population > 10000"

# regexp_like(string, pattern [, flags]) — flags: 'c' (default), 'i', 'm'
qsv sqlp wcp.csv "SELECT * FROM wcp WHERE regexp_like(City, '^san', 'i')"

# regexp using a pattern from another column
qsv sqlp data.csv "SELECT idx, val FROM data WHERE val regexp pattern_col"

SELECT * ILIKE projects only the columns whose names match the wildcard — useful for grabbing a family of columns without listing them:

# return every column whose name matches '%a%e%' (e.g. Name, State), case-insensitive
qsv sqlp wcp.csv "SELECT * ILIKE '%a%e%' FROM wcp ORDER BY Country"

Expressions, aggregates & string functions

# CASE (searched and simple forms)
qsv sqlp wcp.csv \
  "SELECT City,
     CASE WHEN Population > 1000000 THEN 'mega'
          WHEN Population > 100000  THEN 'large'
          ELSE 'small' END AS tier
   FROM wcp"

# COALESCE + NULLIF to default empty strings
qsv sqlp wcp.csv "SELECT COALESCE(NULLIF(Region, ''), 'unknown') AS region FROM wcp"

# string functions
qsv sqlp wcp.csv "SELECT lower(City), substr(Country, 1, 2) FROM wcp WHERE starts_with(City, 'San')"

# FILTER clause on aggregates — per-aggregate predicates in one GROUP BY pass
qsv sqlp sales.csv \
  "SELECT region,
     SUM(amount) AS total,
     SUM(amount) FILTER (WHERE amount > 100)        AS total_over_100,
     COUNT(*)    FILTER (WHERE status = 'refunded') AS refund_count
   FROM sales GROUP BY region ORDER BY region"

Dollar-quoting (PostgreSQL syntax) avoids escaping quotes and apostrophes inside literals. In a double-quoted shell string the $ must be escaped as \$:

qsv sqlp tweets.csv "SELECT * FROM tweets WHERE handle = \$\$Diane's@Twitter\$\$"
# named tag form for literals containing both quote styles
qsv sqlp data.csv 'SELECT * FROM data WHERE col1 = $tag$Diane'\''s "horse"$tag$'

CTEs, subqueries & window functions

# Common Table Expression
qsv sqlp wcp.csv \
  "WITH big AS (SELECT * FROM wcp WHERE Population >= 1000000)
   SELECT * FROM big WHERE STARTS_WITH(City, 'S')"

# correlated-style IN subquery
qsv sqlp wcp.csv country_continent.csv \
  "SELECT * FROM wcp
   WHERE Country IN (SELECT country FROM country_continent WHERE continent = 'EU')"

# window function: rank cities within each country
qsv sqlp wcp.csv \
  "SELECT City, Country, Population,
     ROW_NUMBER() OVER (PARTITION BY Country ORDER BY Population DESC) AS rank_in_country
   FROM wcp"

SQL scripts (multi-query)

If the sql argument ends in .sql, it's read as a script: queries separated by ;, single-line -- comments allowed, and the result of the last query is returned. Earlier queries typically build temporary tables consumed by later ones. In long scripts, TRUNCATE TABLE <name>; frees a temp table's memory before the script ends.

-- pipeline.sql — staged build, last SELECT is the result
CREATE TABLE eu AS
  SELECT country FROM country_continent WHERE continent = 'EU';

CREATE TABLE eu_cities AS
  SELECT w.* FROM wcp w JOIN eu ON w.Country = eu.country;

TRUNCATE TABLE eu;          -- free intermediate memory

SELECT Country, COUNT(*) AS cities, MAX(Population) AS biggest
FROM eu_cities GROUP BY Country ORDER BY cities DESC;
qsv sqlp wcp.csv country_continent.csv pipeline.sql --format json --output eu_summary.json

Output formats & compression

--format accepts csv (default), json, jsonl, parquet, arrow, and avro. Columnar formats take --compression (and Parquet adds --compress-level and --statistics).

qsv sqlp wcp.csv "SELECT * FROM wcp WHERE Population > 1000000" \
  --format parquet --compression zstd --compress-level 9 --statistics \
  --output big_cities.parquet

# snappy round-trip via .sz extension (auto compress/decompress)
qsv sqlp wcp.csv.sz "SELECT * FROM wcp WHERE Population > 10000" --output result.csv.sz

CSV-output date/number formatting is controlled with --datetime-format, --date-format, --time-format, --float-precision, --decimal-comma, and --wnull-value.

Schema caching & type casting

--cache-schema writes a <file>.pschema.json next to each input and reuses it on later runs (skipping inference). The file is editable — change a column's type to cast it. For example, force a Float32 to a Decimal with explicit precision/scale by replacing "Float32" with {"Decimal": [10, 3]}. Valid types include Boolean, the sized ints/uints, Float32/64, String, Date, Datetime, Duration, Time, Categorical, Decimal, and Enum. If a .pschema.json already exists, sqlp uses it automatically — even without --cache-schema.

# first run infers + caches the schema; edit big_cities.pschema.json to cast, then re-run
qsv sqlp --cache-schema wcp.csv "SELECT * FROM wcp WHERE Population > 1000000"

For date parsing on input, add --try-parsedates (falls back to strings on failure). --infer-len controls how many rows are scanned for inference (0 = full scan; for a single non-script CSV, qsv counts rows instead of forcing a full lazy scan).

Performance & debugging

# inspect the optimized query plan without running it
qsv sqlp wcp.csv "EXPLAIN SELECT * FROM wcp WHERE Population > 10000 ORDER BY Population DESC LIMIT 20"

For out-of-memory situations use --streaming or --low-memory (both slower). --no-optimizations disables non-default optimizations when a query errors. Run scoresql first to catch anti-patterns and estimate cost before executing.

See also: /docs/help/sqlp.md, Polars SQL reference, scoresql, pivotp, joinp, Recipe: Build a Data Pipeline.

scoresql

Analyze a SQL query against the stats / frequency / moarstats caches and the query plan — before you run it. Output is a human-readable performance report (default) or JSON. Supports both Polars (default) and DuckDB modes for the underlying plan analysis.

Caches are auto-generated when missing: qsv stats --everything --stats-jsonl and qsv frequency --frequency-jsonl if their outputs aren't already there.

Example: score a join query before running it

qsv scoresql wcp.csv country_continent.csv \
  "SELECT wcp.AccentCity, country_continent.continent
   FROM wcp JOIN country_continent ON wcp.Country = country_continent.country"

Report includes: query plan (EXPLAIN output), type optimization warnings, join key cardinality, filter selectivity, anti-pattern detection (SELECT *, missing LIMIT, cartesian joins, …), and cache freshness.

Example: get JSON output for CI gating

qsv scoresql --json wcp.csv "SELECT * FROM wcp" | jq '.score, .warnings'

A SELECT * without LIMIT will earn an anti-pattern warning.

Example: use DuckDB for plan analysis instead of Polars

qsv scoresql --duckdb nyc311.csv "SELECT * FROM nyc311 WHERE status = 'active'"

Example: score a query loaded from a .sql file

qsv scoresql data.csv complex_query.sql

Only the last query in the file is scored.

See also: /docs/help/scoresql.md, sqlp, stats, frequency, Stats Cache & Caching.

pivotp

Pivot CSV data using Polars. Two modes:

  • Pivot mode — when you pass <on-cols>: one or more index columns become rows, the <on-cols> values become new columns, and a values column is aggregated.
  • Group-by mode — when <on-cols> is omitted: a simple GROUP BY aggregation. Counting rows per group is the canonical use case.

Smart aggregation auto-selection picks sum, mean, first, or len based on the data type of the values column (or you can specify explicitly via --agg).

Example: pivot NYC 311 by Complaint Type × Year (count)

qsv pivotp 'Complaint Type' NYC_311_SR_2010-2020-sample-1M.csv \
  --index 'YearReported' \
  --values 'Unique Key' \
  --agg len \
  --output nyc311_pivot.csv

(Assumes a YearReported column — derive it first with datefmt--formatstr '%Y' --new-column YearReported.)

Example: group-by mode — count rows per Borough

qsv pivotp NYC_311_SR_2010-2020-sample-1M.csv \
  --index 'Borough' \
  --output borough_counts.csv

Example: pivot Allegheny property sales: Sale Year × Property Class, average Sale Price

qsv pivotp 'Property Class' allegheny_property_sales.csv \
  --index 'Sale Year' \
  --values 'Sale Price' \
  --agg mean

Example: p95 latency per Borough (new in 20.1.0)

qsv pivotp --index Borough \
  --values latency_ms \
  --agg q@0.95 \
  nyc311.csv

Since 20.1.0, --agg accepts quantile@<p> (alias q@<p>) for linear-interpolation quantile aggregation at any probability p ∈ [0, 1]. q@0.5 is equivalent to median for even-length groups; q@0.95 / q@0.99 are the canonical "high percentile" choices for latency / SLA work.

pivotp will pick up a <filename>.pschema.json if one exists (generated by schema --polars). That tells Polars the column types without re-inferring.

See also: /docs/help/pivotp.md, sqlp — alternative via GROUP BY and PIVOT, transpose, schema --polars.

joinp (cross-reference)

See Joins & Set Ops → joinp for the full treatment. The headline capabilities for SQL-style work:

  • Asof joins — time-series-aware, matches each left row to the nearest prior right row.
  • Non-equi joins — range-based matches via a Polars SQL WHERE clause with _left / _right column suffixes.
  • Pre-join filtering--filter-left / --filter-right evaluate Polars SQL before the join, cheaper than filtering after.
  • Column coalescing--coalesce merges identically-named columns post-join.

See also

Clone this wiki locally