Recipe Inspect Unknown CSV

Joel Natividad edited this page Jun 3, 2026 · 4 revisions

Recipe: Inspect an Unknown CSV

Tier: Beginner
Commands used: sniff, headers, count, stats, frequency, sample, table, flatten
Anchor dataset: wcp.csv (World Cities Population, 2.7M rows / 124 MB)

Problem

A file lands on your desk. You don't know the delimiter, the columns, the row count, the data types, or whether anything's weird. You'd like to characterize it in under 30 seconds before deciding what to do next.

Data

# Get wcp.csv if you don't have it
curl -LO https://raw.githubusercontent.com/wiki/dathere/qsv/files/wcp.zip
unzip wcp.zip

Or substitute any CSV from your own work — the recipe is dataset-agnostic.

Solution

Six commands, run in order.

1. Sniff — what is this file?

qsv sniff wcp.csv
Delimiter:        ,
Header Row:       true
Preamble Rows:    0
Quote Char:       none
Flexible:         false
Is UTF8:          true
Average Record Length: 33
Num Records:      2699354
Content Length:   ...
Fields:
  0:  Country: String
  1:  City: String
  2:  AccentCity: String
  3:  Region: String
  4:  Population: Integer (nullable)
  5:  Latitude: Float
  6:  Longitude: Float

sniff samples the first ~1000 rows and infers the delimiter, header row, quoting, encoding, and per-column types shown above. For a remote URL it works without downloading the whole file.

2. Headers — exact column names

qsv headers wcp.csv

You already saw them in sniff's output, but headers --just-names is the clean machine-readable form for shell pipelines.
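
As a sketch of that pipeline use, the helper below checks whether a file has a column with an exact name. The function name has_column is hypothetical; the only qsv call it relies on is headers --just-names, which prints one column name per line.

```shell
# Hypothetical helper: succeed (exit 0) if FILE has a column named exactly COLUMN.
# Usage: has_column FILE COLUMN
has_column() {
  # grep -F: literal match, -x: match the whole line, -q: no output, status only.
  qsv headers --just-names "$1" | grep -Fxq -- "$2"
}
```

For example, `has_column wcp.csv Population && echo "Population column present"` prints the message only when the header matches exactly, including case.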

3. Count — exact row count

qsv count wcp.csv
# 2699354

If the file is huge and you'll keep coming back to it, build an index now:

qsv index wcp.csv     # ~2 seconds for wcp; ~14s for 15 GB NYC 311

After indexing, count / sample / slice are near-instant. The index is written next to the file (wcp.csv.idx), so it remains usable across sessions until the CSV changes.
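
If you script this step, a small guard avoids rebuilding an index that is already current. This is a minimal sketch: ensure_index is a hypothetical name, and it assumes the index lives at FILE.idx beside the CSV.

```shell
# Hypothetical helper: build the index only when it is missing or older than the CSV.
# Usage: ensure_index FILE
ensure_index() {
  file=$1
  idx="$file.idx"   # assumption: qsv index writes FILE.idx next to FILE
  # find -newer prints the CSV path only if the CSV was modified after the index.
  if [ ! -f "$idx" ] || [ -n "$(find "$file" -newer "$idx")" ]; then
    qsv index "$file"
  fi
}
```

Calling `ensure_index wcp.csv` at the top of a script keeps repeated runs cheap, since the indexing cost is paid only when the data has changed.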

4. Stats — full profile in under a second

qsv stats wcp.csv | qsv table

48 metrics per column: min, max, mean, stddev, sum, type, null count, sortiness, max-precision, etc. Runs in ~0.7 s on wcp.csv. Adds a wcp.stats.csv sidecar that downstream commands reuse.

For the deluxe version:

qsv stats --everything wcp.csv | qsv table

Adds cardinality, modes, antimodes, median, quartiles, IQR, fences, skewness, MAD, percentiles. Now you know which columns are categorical (low cardinality) vs continuous (high cardinality).
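
When the full stats table is too wide to read, a narrower view helps with the categorical-versus-continuous call. The sketch below is an assumption-laden illustration: column_overview is a hypothetical name, it uses qsv select (not otherwise part of this recipe), and it assumes the stats output columns are named field, type, cardinality, and nullcount.

```shell
# Hypothetical helper: one row per column showing type, cardinality, and null count.
# Usage: column_overview FILE
column_overview() {
  # --cardinality adds only the cardinality metric instead of everything.
  # Column names passed to select are assumed to match the stats header.
  qsv stats --cardinality "$1" |
    qsv select field,type,cardinality,nullcount |
    qsv table
}
```

Run as `column_overview wcp.csv`; columns with a small cardinality relative to the row count are the natural candidates for frequency in the next step.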

5. Frequency — top values per column

qsv frequency --select Country --limit 10 wcp.csv | qsv table
field    value  count   percentage  rank
Country  us     141989  5.26        1
Country  in     150535  5.58        ...
Country  de     107054  3.97        ...
Country  fr     78812   2.92        ...
...
Country  Other  1939456 71.84       0

Use --limit -2 to keep only values appearing 2 or more times (drops singletons), --asc for least-frequent first, or --ignore-case to merge case-variants.

For a JSON-formatted version (great for piping to jq or feeding an LLM):

qsv frequency --select Country --limit 10 --json wcp.csv

6. Sample + table — eyeball some rows

qsv sample --seed 42 5 wcp.csv | qsv table
Country  City         AccentCity   Region  Population  Latitude    Longitude
de       lehrte       Lehrte       06      43831       52.3666667  10.0
br       jaguariuna   Jaguariúna   27               -22.7041667  -46.9858333
...

For a single row in vertical (one-column-per-line) form:

qsv slice -i 100000 wcp.csv | qsv flatten
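
The six steps above can be bundled into one shell function for repeated use. This is a sketch rather than an official qsv script: inspect_csv is a hypothetical name, and frequency runs without --select so that it works on any file, not just wcp.csv.

```shell
# Hypothetical wrapper: run the six inspection steps in order on one file.
# Usage: inspect_csv FILE
inspect_csv() {
  file=$1
  # Each step stops the run if its (last) command fails.
  echo "== sniff ==";     qsv sniff "$file"                            || return 1
  echo "== headers ==";   qsv headers "$file"                          || return 1
  echo "== count ==";     qsv count "$file"                            || return 1
  echo "== stats ==";     qsv stats "$file" | qsv table                || return 1
  echo "== frequency =="; qsv frequency --limit 10 "$file" | qsv table || return 1
  echo "== sample ==";    qsv sample --seed 42 5 "$file" | qsv table   || return 1
}
```

Usage: `inspect_csv wcp.csv | less`. The fixed seed keeps the sampled rows the same between runs, which makes it easier to compare output before and after cleaning a file.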

Variations

Sniff a remote file without downloading

qsv sniff https://example.com/large.csv

For remote files, sniff uses HTTP Range requests to fetch only the first ~1000 rows rather than the whole file.

Detect MIME type only (no schema inference) — useful in CKAN harvesting

qsv sniff --no-infer https://example.com/data.xlsx
# Returns just the detected MIME type, file size, last-modified

Add a natural-language summary via describegpt

qsv describegpt --all wcp.csv \
  -u http://localhost:11434/v1 \
  --model gpt-oss-20b

This produces a Markdown data dictionary, description, and tags — using stats / frequency cached above as deterministic context. See AI & Documentation.

Pretty-print the stats in a colorized terminal

qsv stats wcp.csv | qsv color

Same as qsv table but with theme-detecting colors. See color.

Performance notes

  • stats runs in ~0.7 s on wcp.csv (2.7M rows) on an M2 Pro. With an index, it is slightly faster and uses all cores.
  • sniff takes roughly constant time regardless of file size, because by default it samples only the first ~1000 rows. If those rows aren't representative, raise the sample size, e.g. --sample 0.20 samples 20% of the file (at the cost of a longer run).
  • frequency with the stats cache present completes in under a second for wcp.csv. The cache short-circuits all-unique columns (like ID columns) — these would otherwise blow up memory.
  • The qsv index step is a one-time cost (~14 s on a 15 GB file). After that, this whole recipe runs in well under 5 s for files up to ~30 GB.
