Skip to content

Transform and Reshape

Joel Natividad edited this page May 30, 2026 · 3 revisions

Transform & Reshape

Tier: Intermediate Commands covered: apply, applydp, behead, cat, datefmt, edit, enum, explode, fill, fixlengths, fmt, implode, input, pseudo, rename, replace, reverse, safenames, sort, sortcheck, transpose

Note

Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.

These are the pre-analysis commands. Most data pipelines start by cleaning column names, fixing encoding, dropping bad rows, normalizing dates, and sorting — before any stats, joins, or validation.

Quick decision table

If you want to… Use Notes
Normalize encoding / skip preamble / strip BOM input Should be the first command in most pipelines
Make column names DB- or CKAN-safe safenames Lowercase, snake_case, dedupe, length cap
Rename specific columns rename Pair-wise or all-at-once
Drop columns the cheap way select '!cols' See Selection & Inspection → select
Run NLP / dates / currency / sentiment transformations apply The Swiss-army knife
Slim version of apply for DataPusher+ applydp qsvdp only
Format dates to ISO 8601 datefmt 19 input formats auto-detected
Regex replace across fields replace Multithreaded with an index
Pseudonymize a column (stable mapping) pseudo Same input value always maps to the same ID
Forward-fill missing values fill With --groupby to scope to groups
Add row numbers / UUIDs / a constant column enum Supports uuid4 and uuid7
Split a multi-valued column into rows explode E.g., "red|blue|green" → 3 rows
Collapse rows by key, joining a column implode The inverse of explode
Reshape: long ↔ wide transpose --long N flips wide to long
Concatenate CSVs (rows or columns) cat Supports directory & .infile-list input
Repair ragged rows fixlengths Pads short rows; truncates long ones
Strip the header row behead For piping headerless data
Change delimiter / line endings / quoting fmt RFC 4180 by default; supports ASCII RS/US delimited
Edit a single cell at row × column edit Surgical fix
Sort by one or more columns sort Lex / natural / numeric / random / unique
Verify a file is already sorted sortcheck Cheap precondition for dedup --sorted
Reverse row order reverse Constant memory with an index

input

The "normalize my CSV before anything else" command. Handles non-UTF-8 encoding, BOM, preamble lines, custom quoting, custom escape characters, and trims whitespace.

Example: a legacy export with 3 preamble lines and ISO-8859-1 encoding

# Step 1: transcode to UTF-8 (input itself is lossy by default)
iconv -f ISO-8859-1 -t UTF-8 legacy.csv -o legacy-utf8.csv

# Step 2: skip the preamble, trim whitespace
qsv input --skip-lines 3 --trim-headers --trim-fields legacy-utf8.csv > clean.csv

Example: a CSV with # comment lines and ragged whitespace

qsv input --comment '#' --trim-fields data.csv > clean.csv

Example: auto-detect and skip a preamble

qsv input --auto-skip exports/2024-q4.csv > q4.csv

--auto-skip sniffs the file for preamble records — useful for inconsistent vendor exports.

See also: /docs/help/input.md, sniff, Troubleshooting → UTF-8.

safenames

Make column headers safe for PostgreSQL, CKAN, BigQuery, and most other databases: lowercase, snake_case, deduplicate, cap at 60 bytes. Built for DataPusher+ but generally useful.

Example: clean headers from a NYC Open Data export

qsv safenames "311_Service_Requests.csv" > clean_headers.csv

Header "Complaint Type" becomes complaint_type; "Incident Address" becomes incident_address; "_id" is reserved by CKAN and gets renamed to reserved__id.

Example: audit how many unsafe headers exist (without rewriting)

qsv safenames --mode V exports/messy.csv
# stderr: 6 header/s
# 1 duplicate/s: "c1:2"
# 4 unsafe header/s: ["12_col", "Col with Embedded Spaces", "", "Column!@Invalid+Chars"]

Example: JSON audit for CI/CD gates

qsv safenames --mode J exports/messy.csv | jq '.unsafe_headers | length'

See also: /docs/help/safenames.md, Recipe: CKAN Integration, applydp.

rename

Bulk rename columns. Takes a positional argument that is the new comma-separated header list (which replaces all headers), or --pairwise old1,new1,old2,new2 for targeted renames.

Example: rename two specific columns

qsv rename --pairwise 'AccentCity,City,Population,Pop' wcp.csv > renamed.csv

Example: replace all column names with generic _col_1, _col_2, …

qsv rename _all_generic data.csv

Example: add headers to a headerless CSV

qsv rename 'id,timestamp,user,action' --no-headers events.csv

See also: /docs/help/rename.md, safenames, select.

apply

The big toolbox: run a chain of operations (trim, upper, lower, len, replace, mtrim, regex_replace, …) against one or more columns. Has NLP submodules for sentiment, profanity, similarity, eudex phonetic hashing, language detection, and name-gender inference. Has a dateadd and datesub for date arithmetic. Has a geocode submodule that complements the dedicated geocode command.

Example: clean a name column — trim, title-case, drop parentheticals

qsv apply operations trim,titlecase 'Full Name' contacts.csv \
  | qsv apply operations mtrim 'Full Name' --comparand '()' > clean_contacts.csv

Example: sentiment analysis on NYC 311 complaint descriptors

qsv apply sentiment 'Descriptor' \
  --new-column sentiment_score \
  NYC_311_SR_2010-2020-sample-1M.csv > with_sentiment.csv

Example: phonetic hash for fuzzy name matching

qsv apply eudex 'Last Name' --new-column phonetic_key voters.csv > voters_keyed.csv

Example: censor profanity in a feedback column

qsv apply operations censor 'Feedback' feedback.csv > clean_feedback.csv

Example: name-gender inference for demographic reporting (with disclaimers — see genai-disclaimer)

qsv apply gender 'First Name' --new-column inferred_gender voters.csv

See also: /docs/help/apply.md, applydp — slimmed-down for CKAN, luau, py.

applydp

apply without the NLP/AI features — purpose-built for DataPusher+. Available only in the qsvdp binary variant. Use this when you need DataPusher+-compatible operations (trim, upper, lower, replace, mtrim, cast, calc) in a tiny binary.

Example: typical CKAN cleanup pipeline

qsvdp applydp operations trim,lower 'Email' contacts.csv \
  | qsvdp applydp operations cast 'Amount' --comparand integer > clean.csv

See also: /docs/help/applydp.md, Binary Variants, Recipe: CKAN Integration.

datefmt

Format dates. Recognizes 19 input formats automatically — no need to specify the input format.

Example: normalize NYC 311 "Created Date" to ISO 8601

qsv datefmt 'Created Date' NYC_311_SR_2010-2020-sample-1M.csv > iso.csv

Example: extract year-month into a new column for grouping

qsv datefmt 'Created Date' \
  --formatstr '%Y-%m' \
  --new-column year_month \
  NYC_311_SR_2010-2020-sample-1M.csv > with_ym.csv

Example: format all columns ending in _date (case-insensitive)

qsv datefmt '/(?i)_date$/' transactions.csv > transactions_iso.csv

Example: ISO week number for weekly reporting

qsv datefmt 'Created Date' --formatstr '%V' --new-column iso_week NYC_311.csv

See also: /docs/help/datefmt.md, chrono strftime specifiers, Recipe: Date Enrichment.

replace

Regex find-and-replace. Multithreaded with an index. Applies to each field individually.

Example: redact emails from a public release

qsv replace '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' '<EMAIL>' raw.csv > public.csv

Example: normalize phone formats

qsv replace '\((\d{3})\)\s*(\d{3})-(\d{4})' '$1-$2-$3' contacts.csv > normalized.csv

Example: case-insensitive cleanup of "N/A" sentinels into empty cells

qsv replace -i '^(N/A|NA|null|none|-)$' '' messy.csv > cleaned.csv

See also: /docs/help/replace.md, search, apply operations regex_replace.

pseudo

Deterministic pseudonymization: the same input value always gets the same identifier, even across rows.

Example: replace customer names with stable IDs for analytics export

qsv pseudo 'Customer Name' \
  --start 1000 --increment 1 \
  --formatstr 'CUST-{}' \
  transactions.csv > pseudonymized.csv

Mary → CUST-1000, John → CUST-1001, Mary again → CUST-1000 (same ID).

Example: pseudonymize multiple columns

qsv pseudo 'Patient ID' --formatstr 'PT-{}' raw.csv \
  | qsv pseudo 'Provider' --formatstr 'PRV-{}' > deidentified.csv

For irreversible (one-way) hashing, see blake3.

See also: /docs/help/pseudo.md, enum, blake3, Cookbook → Clean & Normalize.

fill

Fill empty values. Forward-fill (default), first-value-fill, backfill, or constant default. Optional --groupby scopes the fill within group boundaries.

Example: forward-fill a sparse "Region" column inside each "Country" group

qsv fill --groupby Country Region wcp.csv > filled.csv

Example: replace all empty Population cells with 0

qsv fill --default 0 Population wcp.csv > zerofilled.csv

Example: backfill missing dates from the first valid value

qsv fill --backfill 'Open Date' events.csv > backfilled.csv

See also: /docs/help/fill.md, apply, Recipe: Clean & Normalize.

enum

Add a column of incremental numbers, UUIDs, or a constant value. Also can copy an existing column.

Example: add a UUID v7 (time-ordered) to NYC 311 rows for downstream joins

qsv enum --uuid7 --new-column row_id NYC_311_SR_2010-2020-sample-1M.csv > with_id.csv

Example: tag every row with a constant "import_batch" identifier

qsv enum --constant 'batch-2024-12-01' --new-column batch raw.csv > tagged.csv

Example: duplicate a column for a downstream tool that wants both raw and normalized

qsv enum --copy 'Email' --new-column email_raw \
  contacts.csv \
  | qsv apply operations lower 'Email' > normalized.csv

See also: /docs/help/enum.md, pseudo, blake3.

explode

Split a column with multi-valued cells into multiple rows.

Example: a CKAN dataset's "Tags" field is |-delimited

# tags column: "energy|renewable|solar"
qsv explode Tags '|' datasets.csv > one_tag_per_row.csv

After exploding, you can run frequency Tags to see the tag distribution.

Example: explode after splitting on commas (note: needs careful quoting)

qsv explode 'Skills' ',' resumes.csv > one_skill_per_row.csv

See also: /docs/help/explode.md, implode — the inverse, frequency.

implode

Inverse of explode: collapse rows sharing a key into a single row, joining a value column with a separator. Drops all other columns by default. Use --sorted if input is already key-sorted to stream with bounded memory.

Example: rebuild a comma-separated tags column from exploded rows

qsv implode -k dataset_id -v Tag ',' exploded.csv > recombined.csv

Example: aggregate after group-wise stats

# Get a per-Country list of cities in wcp
qsv implode --sorted -k Country -v AccentCity '; ' wcp_sorted.csv > cities_by_country.csv

See also: /docs/help/implode.md, explode, sort (often a prerequisite).

transpose

Swap rows and columns. Use --multipass for large datasets that don't fit in memory. Use --long N to convert wide-format data into long-format with column-name identifiers.

Example: convert a quarterly pivot (Region × Quarter columns) into long format

qsv transpose --long 'Region' quarterly_pivot.csv > long.csv

Example: rotate a small wcp slice for inspection

qsv slice --len 1 wcp.csv | qsv transpose

See also: /docs/help/transpose.md, pivotp — the Polars-powered alternative.

cat

Concatenate CSVs by row or by column. Supports a directory or an .infile-list as input.

Example: concatenate 12 monthly NYC 311 exports

qsv cat rows nyc311-2024-*.csv -o nyc311-2024.csv

Example: same, but tag each row with its source filename

qsv cat rowskey --group fname --group-name source_file \
  nyc311-2024-*.csv -o nyc311-2024-tagged.csv

Example: a side-by-side join via column-wise concatenation (when row order matches)

# leftside.csv has Country,City; rightside.csv has Country,Population
qsv cat columns leftside.csv rightside.csv -o side_by_side.csv

Example: concat every CSV in a directory

qsv cat rows exports/2024/ -o full_year.csv

See also: /docs/help/cat.md, join, sqlp (use SQL UNION ALL if you want filtering+concat in one step).

fixlengths

Pad short rows with empty fields or truncate long ones. Use when a CSV is ragged.

Example: a legacy ERP export with inconsistent row widths

qsv fixlengths legacy-erp.csv > fixed.csv

By default it pads up to the widest row. Use --length N to force a specific width.

See also: /docs/help/fixlengths.md, input.

fmt

Change output delimiter, line endings, or quoting style. Use to write TSV / SSV / Excel-friendly CSV / ASCII-delimited output.

Example: TSV output

qsv fmt --out-delimiter '\t' data.csv > data.tsv

Example: Excel-friendly CSV (CRLF endings + quote everything)

qsv fmt --crlf --quote-always data.csv > excel-friendly.csv

Example: ASCII RS/US delimited output (for tools that can't handle commas in fields)

qsv fmt --ascii data.csv > data.ascii

See also: /docs/help/fmt.md, Conversion & I/O.

behead

Drop the header row. Useful before piping into commands that expect headerless input or before concatenating multiple files that share headers.

Example: combine month-by-month exports, keeping only the first header

qsv headers --just-names exports/2024-01.csv > combined.csv
for f in exports/2024-*.csv; do
  qsv behead "$f" >> combined.csv
done

(Prefer qsv cat rows exports/2024-*.csv — it does the same thing in one step. Use behead when you need more custom control.)

See also: /docs/help/behead.md, cat, headers.

edit

Change a single cell at the given row and column. Surgical for those "fix one bad value" moments.

Example: correct a typo in a known cell

# Set row 0 (first data row), "color" column, to "green"
qsv edit items.csv color 0 green > fixed.csv

color can be the column name or 0-based index. If it's a number, it's interpreted as an index.

See also: /docs/help/edit.md, replace, apply.

sort

Sort by one or more columns: lexicographically, numerically, naturally (item1, item2, item10), with reverse and unique modes, or randomly.

Example: top-10 cities by population (used in Getting Started)

qsv search --select Country '^us$' wcp.csv \
  | qsv sort --select Population --numeric --reverse \
  | qsv slice --len 10

Example: natural sort on parcel keys like BBL-1-100-12

qsv sort --select BBL --natural parcels.csv > sorted_parcels.csv

Example: random shuffle with a seed (reproducible)

qsv sort --random --seed 42 wcp.csv > shuffled.csv

Example: sort then dedup in a single pipeline

qsv sort --select Email contacts.csv | qsv dedup --select Email --sorted > unique_contacts.csv

For files > RAM, use extsort.

See also: /docs/help/sort.md, sortcheck, extsort.

sortcheck

Verify a CSV is sorted (cheap precondition for dedup --sorted and the --sorted flag on implode).

Example: gate dedup --sorted on a precondition

qsv sortcheck --numeric file.csv && qsv dedup --numeric --sorted file.csv

Example: emit JSON sort stats for CI

qsv sortcheck --json --select id file.csv

See also: /docs/help/sortcheck.md, sort, dedup.

reverse

Reverse row order. With an index, runs in constant memory. Unlike sort --reverse, this preserves the order of rows with the same key (it just flips the file).

Example: reverse a date-sorted log to put newest events first

qsv index events.csv
qsv reverse events.csv > events_newest_first.csv

See also: /docs/help/reverse.md, sort --reverse, index.

See also

Clone this wiki locally