Recipe: Inspect Unknown CSV
Tier: Beginner
Commands used: sniff, headers, count, stats, frequency, sample, table, flatten
Anchor dataset: wcp.csv (World Cities Population, 2.7M rows / 124 MB)
A file lands on your desk. You don't know the delimiter, the columns, the row count, the data types, or whether anything's weird. You'd like to characterize it in under 30 seconds before deciding what to do next.
# Get wcp.csv if you don't have it
curl -LO https://raw.githubusercontent.com/wiki/dathere/qsv/files/wcp.zip
unzip wcp.zip
Or substitute any CSV from your own work; the recipe is dataset-agnostic.
Six commands, run in order.
qsv sniff wcp.csv
Delimiter: ,
Header Row: true
Preamble Rows: 0
Quote Char: none
Flexible: false
Is UTF8: true
Average Record Length: 33
Num Records: 2699354
Content Length: ...
Fields:
0: Country: String
1: City: String
2: AccentCity: String
3: Region: String
4: Population: Integer (nullable)
5: Latitude: Float
6: Longitude: Float
sniff samples the first ~1000 rows and infers the delimiter, header row, quoting, encoding, and field types shown above. For a remote URL it works without downloading the whole file.
qsv headers wcp.csv
You already saw them in sniff's output, but headers --just-names is the clean machine-readable form for shell pipelines.
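As one way to use that machine-readable form, the sketch below runs a command once per column name. The helper name `each_column` is hypothetical and not part of qsv; the only qsv call is `headers --just-names`, and the sketch assumes it prints one name per line.

```shell
# Hypothetical helper (not a qsv command): call a command once per column name.
# Usage: each_column FILE COMMAND [ARGS...]
each_column() {
  file="$1"; shift
  # Assumes --just-names prints one column name per line, with no index prefix.
  qsv headers --just-names "$file" | while IFS= read -r col; do
    # </dev/null keeps the command from swallowing the remaining column names.
    "$@" "$col" </dev/null
  done
}
```

For example, `each_column wcp.csv echo column:` would print one `column: <name>` line per column.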
qsv count wcp.csv
# 2699354
If the file is huge and you'll keep coming back to it, build an index now:
qsv index wcp.csv # ~2 seconds for wcp; ~14s for 15 GB NYC 311
After indexing, count / sample / slice are instant for the rest of this session.
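If the recipe is run repeatedly, the index step can be skipped when an index already exists. This is a minimal sketch: `ensure_index` is a hypothetical helper, and it assumes the index is written next to the input as `FILE.idx`. It does not check whether the CSV changed after the index was built.

```shell
# Hypothetical helper: build the index only when FILE.idx is missing.
# Assumption: qsv index writes the index next to the file as FILE.idx.
ensure_index() {
  if [ -f "$1.idx" ]; then
    echo "index already present: $1.idx"
  else
    qsv index "$1"
  fi
}
```

Calling `ensure_index wcp.csv` before `count`, `sample`, or `slice` pays the indexing cost at most once per file.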
qsv stats wcp.csv | qsv table
48 metrics per column: min, max, mean, stddev, sum, type, null count, sortiness, max-precision, etc. Runs in ~0.7 s on wcp.csv. Adds a wcp.stats.csv sidecar that downstream commands reuse.
For the deluxe version:
qsv stats --everything wcp.csv | qsv table
Adds cardinality, modes, antimodes, median, quartiles, IQR, fences, skewness, MAD, percentiles. Now you know which columns are categorical (low cardinality) vs continuous (high cardinality).
qsv frequency --select Country --limit 10 wcp.csv | qsv table
field value count percentage rank
Country us 141989 5.26 1
Country in 150535 5.58 ...
Country de 107054 3.97 ...
Country fr 78812 2.92 ...
...
Country Other 1939456 71.84 0
Use --limit -2 to keep only values appearing 2 or more times (drops singletons), --asc for least-frequent first, or --ignore-case to merge case-variants.
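The flags above can be combined in small wrappers. The function names below are hypothetical and not part of qsv; only the flags named in the text (`--select`, `--limit`, `--asc`, `--ignore-case`) are used.

```shell
# Hypothetical wrappers around the frequency flags described above.

# Least-frequent values first, with case variants merged.
# Usage: rare_values FILE COLUMN [N]   (N defaults to 10)
rare_values() {
  qsv frequency --select "$2" --asc --ignore-case --limit "${3:-10}" "$1"
}

# Only values that occur at least twice (drops singletons).
# Usage: repeated_values FILE COLUMN
repeated_values() {
  qsv frequency --select "$2" --limit -2 "$1"
}
```

For example, `rare_values wcp.csv Country 5 | qsv table` would list the five least common country codes.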
For a JSON-formatted version (great for piping to jq or feeding an LLM):
qsv frequency --select Country --limit 10 --json wcp.csv

qsv sample --seed 42 5 wcp.csv | qsv table
Country City AccentCity Region Population Latitude Longitude
de lehrte Lehrte 06 43831 52.3666667 10.0
br jaguariuna Jaguariúna 27 -22.7041667 -46.9858333
...
For a single row in vertical (one-column-per-line) form:
qsv slice -i 100000 wcp.csv | qsv flatten

qsv sniff https://example.com/large.csv
The whole point of sniff for remote files is it does HTTP Range requests to fetch just the first ~1000 rows.
qsv sniff --no-infer https://example.com/data.xlsx
# Returns just the detected MIME type, file size, last-modified

qsv describegpt --all wcp.csv \
-u http://localhost:11434/v1 \
--model gpt-oss-20b
This produces a Markdown data dictionary, description, and tags, using the stats / frequency results cached above as deterministic context. See AI & Documentation.
qsv stats wcp.csv | qsv color
Same as qsv table but with theme-detecting colors. See color.
- stats runs in ~0.7 s on wcp.csv (2.7M rows) on an M2 Pro. With an index, it is slightly faster and uses all cores.
- sniff is constant time regardless of file size, because it samples only the first ~1000 rows. For files where the first 1000 rows aren't representative, increase the sample with --sample 0.20 (sample 20% of the file).
- frequency with the stats cache present completes in under a second for wcp.csv. The cache short-circuits all-unique columns (like ID columns), which would otherwise blow up memory.
- The qsv index step is a one-time cost (~14 s on a 15 GB file). After that, this whole recipe runs in well under 5 s for files up to ~30 GB.
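To check these timings on your own files, the recipe's six commands can be bundled into one function. This is a sketch, not part of qsv: `inspect_csv` is a hypothetical name, and the default of selecting column `1` for the frequency step assumes qsv's 1-based column selection.

```shell
# Hypothetical wrapper: run the recipe's six commands in order on one file.
# Usage: inspect_csv FILE [COLUMN]
inspect_csv() {
  file="$1"
  col="${2:-1}"   # column for the frequency step; assumed 1-based index or name
  qsv sniff "$file"
  qsv headers "$file"
  qsv count "$file"
  qsv stats "$file" | qsv table
  qsv frequency --select "$col" --limit 10 "$file" | qsv table
  qsv sample --seed 42 5 "$file" | qsv table
}
```

In bash, `time inspect_csv wcp.csv Country` reports the wall-clock time for the whole pass.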
- Selection & Inspection: sniff, headers, count, sample, slice, flatten, table in depth
- Aggregation & Statistics: stats, frequency, pragmastat
- Validation & Schema: for guaranteed (vs. sampled) schema inference
- docs/whirlwind_tour.md: longer-form walkthrough on the same dataset
- Recipe: Stats → Insights (what to do once you have stats)
- Performance Tuning: when to index, when to skip it
- Getting Started: overlaps significantly; this recipe is the "I have an unknown file" angle