-
Notifications
You must be signed in to change notification settings - Fork 104
Transform and Reshape
Tier: Intermediate
Commands covered: apply, applydp, behead, cat, datefmt, edit, enum, explode, fill, fixlengths, fmt, implode, input, pseudo, rename, replace, reverse, safenames, sort, sortcheck, transpose
Note
Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.
These are the pre-analysis commands. Most data pipelines start by cleaning column names, fixing encoding, dropping bad rows, normalizing dates, and sorting — before any stats, joins, or validation.
| If you want to… | Use | Notes |
|---|---|---|
| Normalize encoding / skip preamble / strip BOM | input |
Should be the first command in most pipelines |
| Make column names DB- or CKAN-safe | safenames |
Lowercase, snake_case, dedupe, length cap |
| Rename specific columns | rename |
Pair-wise or all-at-once |
| Drop columns the cheap way | select '!cols' |
See Selection & Inspection → select |
| Run NLP / dates / currency / sentiment transformations | apply |
The Swiss-army knife |
Slim version of apply for DataPusher+ |
applydp |
qsvdp only |
| Format dates to ISO 8601 | datefmt |
19 input formats auto-detected |
| Regex replace across fields | replace |
Multithreaded with an index |
| Pseudonymize a column (stable mapping) | pseudo |
Same input value always maps to the same ID |
| Forward-fill missing values | fill |
With --groupby to scope to groups |
| Add row numbers / UUIDs / a constant column | enum |
Supports uuid4 and uuid7 |
| Split a multi-valued column into rows | explode |
E.g., "red|blue|green" → 3 rows |
| Collapse rows by key, joining a column | implode |
The inverse of explode
|
| Reshape: long ↔ wide | transpose |
--long N flips wide to long |
| Concatenate CSVs (rows or columns) | cat |
Supports directory & .infile-list input |
| Repair ragged rows | fixlengths |
Pads short rows; truncates long ones |
| Strip the header row | behead |
For piping headerless data |
| Change delimiter / line endings / quoting | fmt |
RFC 4180 by default; supports ASCII RS/US delimited |
| Edit a single cell at row × column | edit |
Surgical fix |
| Sort by one or more columns | sort |
Lex / natural / numeric / random / unique |
| Verify a file is already sorted | sortcheck |
Cheap precondition for dedup --sorted
|
| Reverse row order | reverse |
Constant memory with an index |
The "normalize my CSV before anything else" command. Handles non-UTF-8 encoding, BOM, preamble lines, custom quoting, custom escape characters, and trims whitespace.
Example: a legacy export with 3 preamble lines and ISO-8859-1 encoding
# Step 1: transcode to UTF-8 (input itself is lossy by default)
iconv -f ISO-8859-1 -t UTF-8 legacy.csv -o legacy-utf8.csv
# Step 2: skip the preamble, trim whitespace
qsv input --skip-lines 3 --trim-headers --trim-fields legacy-utf8.csv > clean.csvExample: a CSV with # comment lines and ragged whitespace
qsv input --comment '#' --trim-fields data.csv > clean.csvExample: auto-detect and skip a preamble
qsv input --auto-skip exports/2024-q4.csv > q4.csv--auto-skip sniffs the file for preamble records — useful for inconsistent vendor exports.
See also: /docs/help/input.md, sniff, Troubleshooting → UTF-8.
Make column headers safe for PostgreSQL, CKAN, BigQuery, and most other databases: lowercase, snake_case, deduplicate, cap at 60 bytes. Built for DataPusher+ but generally useful.
Example: clean headers from a NYC Open Data export
qsv safenames "311_Service_Requests.csv" > clean_headers.csvHeader "Complaint Type" becomes complaint_type; "Incident Address" becomes incident_address; "_id" is reserved by CKAN and gets renamed to reserved__id.
Example: audit how many unsafe headers exist (without rewriting)
qsv safenames --mode V exports/messy.csv
# stderr: 6 header/s
# 1 duplicate/s: "c1:2"
# 4 unsafe header/s: ["12_col", "Col with Embedded Spaces", "", "Column!@Invalid+Chars"]Example: JSON audit for CI/CD gates
qsv safenames --mode J exports/messy.csv | jq '.unsafe_headers | length'See also: /docs/help/safenames.md, Recipe: CKAN Integration, applydp.
Bulk rename columns. Takes a positional argument that is the new comma-separated header list (which replaces all headers), or --pairwise old1,new1,old2,new2 for targeted renames.
Example: rename two specific columns
qsv rename --pairwise 'AccentCity,City,Population,Pop' wcp.csv > renamed.csvExample: replace all column names with generic _col_1, _col_2, …
qsv rename _all_generic data.csvExample: add headers to a headerless CSV
qsv rename 'id,timestamp,user,action' --no-headers events.csvSee also: /docs/help/rename.md, safenames, select.
The big toolbox: run a chain of operations (trim, upper, lower, len, replace, mtrim, regex_replace, …) against one or more columns. Has NLP submodules for sentiment, profanity, similarity, eudex phonetic hashing, language detection, and name-gender inference. Has a dateadd and datesub for date arithmetic. Has a geocode submodule that complements the dedicated geocode command.
Example: clean a name column — trim, title-case, drop parentheticals
qsv apply operations trim,titlecase 'Full Name' contacts.csv \
| qsv apply operations mtrim 'Full Name' --comparand '()' > clean_contacts.csvExample: sentiment analysis on NYC 311 complaint descriptors
qsv apply sentiment 'Descriptor' \
--new-column sentiment_score \
NYC_311_SR_2010-2020-sample-1M.csv > with_sentiment.csvExample: phonetic hash for fuzzy name matching
qsv apply eudex 'Last Name' --new-column phonetic_key voters.csv > voters_keyed.csvExample: censor profanity in a feedback column
qsv apply operations censor 'Feedback' feedback.csv > clean_feedback.csvExample: name-gender inference for demographic reporting (with disclaimers — see genai-disclaimer)
qsv apply gender 'First Name' --new-column inferred_gender voters.csvSee also: /docs/help/apply.md, applydp — slimmed-down for CKAN, luau, py.
apply without the NLP/AI features — purpose-built for DataPusher+. Available only in the qsvdp binary variant. Use this when you need DataPusher+-compatible operations (trim, upper, lower, replace, mtrim, cast, calc) in a tiny binary.
Example: typical CKAN cleanup pipeline
qsvdp applydp operations trim,lower 'Email' contacts.csv \
| qsvdp applydp operations cast 'Amount' --comparand integer > clean.csvSee also: /docs/help/applydp.md, Binary Variants, Recipe: CKAN Integration.
Format dates. Recognizes 19 input formats automatically — no need to specify the input format.
Example: normalize NYC 311 "Created Date" to ISO 8601
qsv datefmt 'Created Date' NYC_311_SR_2010-2020-sample-1M.csv > iso.csvExample: extract year-month into a new column for grouping
qsv datefmt 'Created Date' \
--formatstr '%Y-%m' \
--new-column year_month \
NYC_311_SR_2010-2020-sample-1M.csv > with_ym.csvExample: format all columns ending in _date (case-insensitive)
qsv datefmt '/(?i)_date$/' transactions.csv > transactions_iso.csvExample: ISO week number for weekly reporting
qsv datefmt 'Created Date' --formatstr '%V' --new-column iso_week NYC_311.csvSee also: /docs/help/datefmt.md, chrono strftime specifiers, Recipe: Date Enrichment.
Regex find-and-replace. Multithreaded with an index. Applies to each field individually.
Example: redact emails from a public release
qsv replace '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' '<EMAIL>' raw.csv > public.csvExample: normalize phone formats
qsv replace '\((\d{3})\)\s*(\d{3})-(\d{4})' '$1-$2-$3' contacts.csv > normalized.csvExample: case-insensitive cleanup of "N/A" sentinels into empty cells
qsv replace -i '^(N/A|NA|null|none|-)$' '' messy.csv > cleaned.csvSee also: /docs/help/replace.md, search, apply operations regex_replace.
Deterministic pseudonymization: the same input value always gets the same identifier, even across rows.
Example: replace customer names with stable IDs for analytics export
qsv pseudo 'Customer Name' \
--start 1000 --increment 1 \
--formatstr 'CUST-{}' \
transactions.csv > pseudonymized.csvMary → CUST-1000, John → CUST-1001, Mary again → CUST-1000 (same ID).
Example: pseudonymize multiple columns
qsv pseudo 'Patient ID' --formatstr 'PT-{}' raw.csv \
| qsv pseudo 'Provider' --formatstr 'PRV-{}' > deidentified.csvFor irreversible (one-way) hashing, see blake3.
See also: /docs/help/pseudo.md, enum, blake3, Cookbook → Clean & Normalize.
Fill empty values. Forward-fill (default), first-value-fill, backfill, or constant default. Optional --groupby scopes the fill within group boundaries.
Example: forward-fill a sparse "Region" column inside each "Country" group
qsv fill --groupby Country Region wcp.csv > filled.csvExample: replace all empty Population cells with 0
qsv fill --default 0 Population wcp.csv > zerofilled.csvExample: backfill missing dates from the first valid value
qsv fill --backfill 'Open Date' events.csv > backfilled.csvSee also: /docs/help/fill.md, apply, Recipe: Clean & Normalize.
Add a column of incremental numbers, UUIDs, or a constant value. Also can copy an existing column.
Example: add a UUID v7 (time-ordered) to NYC 311 rows for downstream joins
qsv enum --uuid7 --new-column row_id NYC_311_SR_2010-2020-sample-1M.csv > with_id.csvExample: tag every row with a constant "import_batch" identifier
qsv enum --constant 'batch-2024-12-01' --new-column batch raw.csv > tagged.csvExample: duplicate a column for a downstream tool that wants both raw and normalized
qsv enum --copy 'Email' --new-column email_raw \
contacts.csv \
| qsv apply operations lower 'Email' > normalized.csvSee also: /docs/help/enum.md, pseudo, blake3.
Split a column with multi-valued cells into multiple rows.
Example: a CKAN dataset's "Tags" field is |-delimited
# tags column: "energy|renewable|solar"
qsv explode Tags '|' datasets.csv > one_tag_per_row.csvAfter exploding, you can run frequency Tags to see the tag distribution.
Example: explode after splitting on commas (note: needs careful quoting)
qsv explode 'Skills' ',' resumes.csv > one_skill_per_row.csvSee also: /docs/help/explode.md, implode — the inverse, frequency.
Inverse of explode: collapse rows sharing a key into a single row, joining a value column with a separator. Drops all other columns by default. Use --sorted if input is already key-sorted to stream with bounded memory.
Example: rebuild a comma-separated tags column from exploded rows
qsv implode -k dataset_id -v Tag ',' exploded.csv > recombined.csvExample: aggregate after group-wise stats
# Get a per-Country list of cities in wcp
qsv implode --sorted -k Country -v AccentCity '; ' wcp_sorted.csv > cities_by_country.csvSee also: /docs/help/implode.md, explode, sort (often a prerequisite).
Swap rows and columns. Use --multipass for large datasets that don't fit in memory. Use --long N to convert wide-format data into long-format with column-name identifiers.
Example: convert a quarterly pivot (Region × Quarter columns) into long format
qsv transpose --long 'Region' quarterly_pivot.csv > long.csvExample: rotate a small wcp slice for inspection
qsv slice --len 1 wcp.csv | qsv transposeSee also: /docs/help/transpose.md, pivotp — the Polars-powered alternative.
Concatenate CSVs by row or by column. Supports a directory or an .infile-list as input.
Example: concatenate 12 monthly NYC 311 exports
qsv cat rows nyc311-2024-*.csv -o nyc311-2024.csvExample: same, but tag each row with its source filename
qsv cat rowskey --group fname --group-name source_file \
nyc311-2024-*.csv -o nyc311-2024-tagged.csvExample: a side-by-side join via column-wise concatenation (when row order matches)
# leftside.csv has Country,City; rightside.csv has Country,Population
qsv cat columns leftside.csv rightside.csv -o side_by_side.csvExample: concat every CSV in a directory
qsv cat rows exports/2024/ -o full_year.csvSee also: /docs/help/cat.md, join, sqlp (use SQL UNION ALL if you want filtering+concat in one step).
Pad short rows with empty fields or truncate long ones. Use when a CSV is ragged.
Example: a legacy ERP export with inconsistent row widths
qsv fixlengths legacy-erp.csv > fixed.csvBy default it pads up to the widest row. Use --length N to force a specific width.
See also: /docs/help/fixlengths.md, input.
Change output delimiter, line endings, or quoting style. Use to write TSV / SSV / Excel-friendly CSV / ASCII-delimited output.
Example: TSV output
qsv fmt --out-delimiter '\t' data.csv > data.tsvExample: Excel-friendly CSV (CRLF endings + quote everything)
qsv fmt --crlf --quote-always data.csv > excel-friendly.csvExample: ASCII RS/US delimited output (for tools that can't handle commas in fields)
qsv fmt --ascii data.csv > data.asciiSee also: /docs/help/fmt.md, Conversion & I/O.
Drop the header row. Useful before piping into commands that expect headerless input or before concatenating multiple files that share headers.
Example: combine month-by-month exports, keeping only the first header
qsv headers --just-names exports/2024-01.csv > combined.csv
for f in exports/2024-*.csv; do
qsv behead "$f" >> combined.csv
done(Prefer qsv cat rows exports/2024-*.csv — it does the same thing in one step. Use behead when you need more custom control.)
See also: /docs/help/behead.md, cat, headers.
Change a single cell at the given row and column. Surgical for those "fix one bad value" moments.
Example: correct a typo in a known cell
# Set row 0 (first data row), "color" column, to "green"
qsv edit items.csv color 0 green > fixed.csvcolor can be the column name or 0-based index. If it's a number, it's interpreted as an index.
See also: /docs/help/edit.md, replace, apply.
Sort by one or more columns: lexicographically, numerically, naturally (item1, item2, item10), with reverse and unique modes, or randomly.
Example: top-10 cities by population (used in Getting Started)
qsv search --select Country '^us$' wcp.csv \
| qsv sort --select Population --numeric --reverse \
| qsv slice --len 10Example: natural sort on parcel keys like BBL-1-100-12
qsv sort --select BBL --natural parcels.csv > sorted_parcels.csvExample: random shuffle with a seed (reproducible)
qsv sort --random --seed 42 wcp.csv > shuffled.csvExample: sort then dedup in a single pipeline
qsv sort --select Email contacts.csv | qsv dedup --select Email --sorted > unique_contacts.csvFor files > RAM, use extsort.
See also: /docs/help/sort.md, sortcheck, extsort.
Verify a CSV is sorted (cheap precondition for dedup --sorted and the --sorted flag on implode).
Example: gate dedup --sorted on a precondition
qsv sortcheck --numeric file.csv && qsv dedup --numeric --sorted file.csvExample: emit JSON sort stats for CI
qsv sortcheck --json --select id file.csvSee also: /docs/help/sortcheck.md, sort, dedup.
Reverse row order. With an index, runs in constant memory. Unlike sort --reverse, this preserves the order of rows with the same key (it just flips the file).
Example: reverse a date-sorted log to put newest events first
qsv index events.csv
qsv reverse events.csv > events_newest_first.csvSee also: /docs/help/reverse.md, sort --reverse, index.
- Command Reference (index)
- Selection & Inspection — the preceding category in a typical pipeline
- Aggregation & Statistics — comes after Transform
- Cookbook → Clean & Normalize
- Cookbook → Date Enrichment
- Cookbook → CKAN Integration
- Performance Tuning
qsv — GitHub · Releases · Discussions · qsv pro · Try it online · Benchmarks · datHere · DeepWiki · Dual-licensed MIT / Unlicense
Edit this page: Contributing to the Wiki
Home · Why qsv? · Tier legend
- All Commands (index)
- Selection & Inspection
- Transform & Reshape
- Aggregation & Statistics
- Joins & Set Ops
- SQL & Polars
- Validation & Schema
- Metadata Profiling (profile)
- Conversion & I/O
- Geospatial
- HTTP & Web
- Get & Disk Cache
- Scripting (Luau / Python)
- Indexing, Compression & Diff
- AI & Documentation