-
Notifications
You must be signed in to change notification settings - Fork 104
Validation and Schema
Tier: Intermediate
Commands covered: validate, schema, sniff (see also profile for standards-based metadata generation)
Note
Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.
This is qsv's data quality story. Use schema to infer a JSON Schema 2020-12 spec from representative data, edit the schema by hand to tighten the rules, then validate to enforce it against new data at up to 780,000 rows/sec. Use sniff for fast metadata (delimiter, types, MIME) when you don't need a full schema scan.
For the deep-dive, see docs/Validate.md.
| If you want to… | Use | Notes |
|---|---|---|
| Check a CSV is valid RFC 4180 + UTF-8 |
validate (no schema) |
Run before commands that assume well-formed input (especially stats) |
| Validate against a JSON Schema 2020-12 spec | validate <data.csv> <schema.json> |
Outputs .valid.csv, .invalid.csv, .validation-errors.tsv
|
| Generate a JSON Schema from representative data | schema |
.schema.json sidecar; edit to tighten rules |
Generate a Polars schema for sqlp / joinp / pivotp
|
schema --polars |
.pschema.json sidecar; speeds up Polars commands |
| Quickly probe an unknown CSV / URL | sniff |
Faster than schema; samples the first N rows |
| Generate standards-based metadata (DCAT / Croissant / CKAN) | profile |
.metadata.json; FAIRification for data portals & ML registries |
JSON Schema Validation Draft 2020-12 with the fastest validator in Rust (jsonschema-rs). Up to 780k rows/sec on NYC 311 with the bundled schema.
Without a schema argument, validate just verifies RFC 4180 conformance and UTF-8 — a cheap sanity check before running stats (which assumes well-formed input).
With a schema, three output files are produced:
-
<input>.valid.csv— rows that passed -
<input>.invalid.csv— rows that failed -
<input>.validation-errors.tsv—row_number, field, errorfor every failure
Three custom keywords beyond stock JSON Schema 2020-12:
-
currency— ISO 4217 currency validation (USD,€,¥,$1,000.00,(€100,00),-USD100.00, …) -
dynamicEnum—enumwhose values come from a CSV (local, HTTP/HTTPS,dathere://, orckan://). Use this when the valid-value set is too large to hardcode or it changes over time. -
uniqueCombinedWith— composite-key uniqueness check across columns.
Example: cheap RFC 4180 + UTF-8 sanity check
qsv validate NYC_311_SR_2010-2020-sample-1M.csv
# Exit code 0 = valid; non-zero = problem; details to stderr.Example: validate NYC 311 against the bundled schema
qsv validate \
NYC_311_SR_2010-2020-sample-1M.csv \
resources/test/311_Service_Requests_from_2010_to_Present-2022-03-04.csv.schema.json
ls NYC_311_SR_2010-2020-sample-1M.*
# .valid.csv .invalid.csv .validation-errors.tsvExample: dynamicEnum against a CSV of valid NYC agencies
{
"type": "object",
"properties": {
"Agency": {
"type": "string",
"dynamicEnum": "NYC_agencies.csv"
}
}
}dynamicEnum accepts a local path or a URL — including ckan:// to fetch a resource by CKAN ID.
Example: uniqueCombinedWith for a composite primary key
{
"properties": {
"Created Date": { "uniqueCombinedWith": ["Complaint Type", "Incident Address"] }
}
}This enforces that the combination of those three columns is unique across the file.
Example: currency format validation
{
"properties": {
"Sale Price": { "type": "string", "format": "currency" }
}
}Example: validate a directory of CSVs against one schema
qsv validate exports/2024/ schema.jsonExample: fancy regex (look-around, backreferences) when needed
qsv validate --fancy-regex data.csv strict-schema.jsonBy default validate uses the same fast regex crate as search. --fancy-regex opts into the slower but more expressive fancy-regex crate for schemas that need it.
See also: /docs/help/validate.md, docs/Validate.md — deep dive, schema, Lookup Tables — for dynamicEnum patterns, Recipe: JSON Schema Validation.
Two modes:
-
JSON Schema mode (default) — produces a
<input>.schema.jsonwith inferred types, formats, ranges, and enum / pattern constraints. Edit the file by hand to tighten rules (mark required fields, adddynamicEnumlookups, set customcurrencyformats, …) before using it withvalidate. -
Polars Schema mode (
--polars) — produces a<input>.pschema.jsonconsumed bysqlp/joinp/pivotpto skip type inference. Edit to change inferred Polars data types where needed.
Both reuse the stats cache when available (run qsv stats --cardinality --infer-dates --stats-jsonl <input> first for max speed).
Example: infer a JSON Schema from NYC 311
qsv stats --cardinality --infer-dates --infer-boolean --stats-jsonl \
NYC_311_SR_2010-2020-sample-1M.csv
qsv schema NYC_311_SR_2010-2020-sample-1M.csv
# Produces NYC_311_SR_2010-2020-sample-1M.csv.schema.jsonThe generated schema has inferred types, min/max ranges (for numerics), date formats, enum values (for low-cardinality columns), regex patterns (when --pattern-columns is used), and the qsv-specific currency, dynamicEnum, uniqueCombinedWith custom keywords where appropriate.
Example: re-validate the source data — schema should always pass for the data it was inferred from
qsv validate \
NYC_311_SR_2010-2020-sample-1M.csv \
NYC_311_SR_2010-2020-sample-1M.csv.schema.json
# Exit code 0; no .invalid.csv or .validation-errors.tsv producedExample: Polars schema for speeding up sqlp
qsv schema --polars NYC_311_SR_2010-2020-sample-1M.csv
# Produces NYC_311_SR_2010-2020-sample-1M.pschema.json
# Now sqlp / joinp / pivotp pick it up automatically
qsv sqlp NYC_311_SR_2010-2020-sample-1M.csv \
"SELECT Borough, COUNT(*) FROM nyc311 GROUP BY Borough"Example: validate a manually-fine-tuned schema is itself valid JSON Schema
qsv validate schema fine-tuned-schema.jsonSee also: /docs/help/schema.md, validate, JSON Schema 2020-12 spec, stats, Stats Cache & Caching.
(Also covered in Selection & Inspection → sniff. Mentioned here because it's the lightweight cousin of schema for validation workflows.)
sniff samples the first N rows (default 1,000) to infer delimiter, header row, preamble rows, quote character, encoding, average record length, total row count, and per-field types & names. Works on local files and remote URLs (without downloading the whole file).
With the magika feature compiled in, it also identifies 200+ file types via Google's AI-based content detector — useful for CKAN harvesting.
Example: probe a remote dataset before downloading
qsv sniff https://opendata.cityofnewyork.us/data.csv
# Reports delimiter, headers, content length, estimated row count, field typesExample: MIME-type-only probe for fast URL health checks (CKAN-style)
qsv sniff --no-infer https://example.com/data.xlsxExample: stricter sampling for an oddly-shaped CSV
qsv sniff --sample 0.20 --delimiter '|' weird.csvFor guaranteed (non-sampled) type inference, use stats or schema — they scan the entire file.
- Command Reference (index)
-
Aggregation & Statistics → stats — guaranteed type inference (vs.
sniff's sample) -
SQL & Polars —
schema --polarsproduces the.pschema.jsonconsumed here -
Metadata Profiling (
profile) — standards-based metadata generation & validation (DCAT / Croissant / CKAN) -
Lookup Tables —
dynamicEnumagainst local files, HTTP URLs, dathere, CKAN -
docs/Validate.md— canonical reference for custom keywords - Cookbook → JSON Schema Validation
- Cookbook → Clean & Normalize
qsv — GitHub · Releases · Discussions · qsv pro · Try it online · Benchmarks · datHere · DeepWiki · Dual-licensed MIT / Unlicense
Edit this page: Contributing to the Wiki
Home · Why qsv? · Tier legend
- All Commands (index)
- Selection & Inspection
- Transform & Reshape
- Aggregation & Statistics
- Joins & Set Ops
- SQL & Polars
- Validation & Schema
- Metadata Profiling (profile)
- Conversion & I/O
- Geospatial
- HTTP & Web
- Get & Disk Cache
- Scripting (Luau / Python)
- Indexing, Compression & Diff
- AI & Documentation