Skip to content

Validation and Schema

Joel Natividad edited this page May 30, 2026 · 4 revisions

Validation & Schema

Tier: Intermediate Commands covered: validate, schema, sniff (see also profile for standards-based metadata generation)

Note

Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.

This is qsv's data quality story. Use schema to infer a JSON Schema 2020-12 spec from representative data, edit the schema by hand to tighten the rules, then validate to enforce it against new data at up to 780,000 rows/sec. Use sniff for fast metadata (delimiter, types, MIME) when you don't need a full schema scan.

For the deep-dive, see docs/Validate.md.

Quick decision table

If you want to… Use Notes
Check a CSV is valid RFC 4180 + UTF-8 validate (no schema) Run before commands that assume well-formed input (especially stats)
Validate against a JSON Schema 2020-12 spec validate <data.csv> <schema.json> Outputs .valid.csv, .invalid.csv, .validation-errors.tsv
Generate a JSON Schema from representative data schema .schema.json sidecar; edit to tighten rules
Generate a Polars schema for sqlp / joinp / pivotp schema --polars .pschema.json sidecar; speeds up Polars commands
Quickly probe an unknown CSV / URL sniff Faster than schema; samples the first N rows
Generate standards-based metadata (DCAT / Croissant / CKAN) profile .metadata.json; FAIRification for data portals & ML registries

validate

JSON Schema Validation Draft 2020-12 with the fastest validator in Rust (jsonschema-rs). Up to 780k rows/sec on NYC 311 with the bundled schema.

Without a schema argument, validate just verifies RFC 4180 conformance and UTF-8 — a cheap sanity check before running stats (which assumes well-formed input).

With a schema, three output files are produced:

  • <input>.valid.csv — rows that passed
  • <input>.invalid.csv — rows that failed
  • <input>.validation-errors.tsvrow_number, field, error for every failure

Three custom keywords beyond stock JSON Schema 2020-12:

  • currency — ISO 4217 currency validation (USD, , ¥, $1,000.00, (€100,00), -USD100.00, …)
  • dynamicEnumenum whose values come from a CSV (local, HTTP/HTTPS, dathere://, or ckan://). Use this when the valid-value set is too large to hardcode or it changes over time.
  • uniqueCombinedWith — composite-key uniqueness check across columns.

Example: cheap RFC 4180 + UTF-8 sanity check

qsv validate NYC_311_SR_2010-2020-sample-1M.csv
# Exit code 0 = valid; non-zero = problem; details to stderr.

Example: validate NYC 311 against the bundled schema

qsv validate \
  NYC_311_SR_2010-2020-sample-1M.csv \
  resources/test/311_Service_Requests_from_2010_to_Present-2022-03-04.csv.schema.json
ls NYC_311_SR_2010-2020-sample-1M.*
# .valid.csv  .invalid.csv  .validation-errors.tsv

Example: dynamicEnum against a CSV of valid NYC agencies

{
  "type": "object",
  "properties": {
    "Agency": {
      "type": "string",
      "dynamicEnum": "NYC_agencies.csv"
    }
  }
}

dynamicEnum accepts a local path or a URL — including ckan:// to fetch a resource by CKAN ID.

Example: uniqueCombinedWith for a composite primary key

{
  "properties": {
    "Created Date":       { "uniqueCombinedWith": ["Complaint Type", "Incident Address"] }
  }
}

This enforces that the combination of those three columns is unique across the file.

Example: currency format validation

{
  "properties": {
    "Sale Price": { "type": "string", "format": "currency" }
  }
}

Example: validate a directory of CSVs against one schema

qsv validate exports/2024/ schema.json

Example: fancy regex (look-around, backreferences) when needed

qsv validate --fancy-regex data.csv strict-schema.json

By default validate uses the same fast regex crate as search. --fancy-regex opts into the slower but more expressive fancy-regex crate for schemas that need it.

See also: /docs/help/validate.md, docs/Validate.md — deep dive, schema, Lookup Tables — for dynamicEnum patterns, Recipe: JSON Schema Validation.

schema

Two modes:

  • JSON Schema mode (default) — produces a <input>.schema.json with inferred types, formats, ranges, and enum / pattern constraints. Edit the file by hand to tighten rules (mark required fields, add dynamicEnum lookups, set custom currency formats, …) before using it with validate.
  • Polars Schema mode (--polars) — produces a <input>.pschema.json consumed by sqlp / joinp / pivotp to skip type inference. Edit to change inferred Polars data types where needed.

Both reuse the stats cache when available (run qsv stats --cardinality --infer-dates --stats-jsonl <input> first for max speed).

Example: infer a JSON Schema from NYC 311

qsv stats --cardinality --infer-dates --infer-boolean --stats-jsonl \
  NYC_311_SR_2010-2020-sample-1M.csv
qsv schema NYC_311_SR_2010-2020-sample-1M.csv
# Produces NYC_311_SR_2010-2020-sample-1M.csv.schema.json

The generated schema has inferred types, min/max ranges (for numerics), date formats, enum values (for low-cardinality columns), regex patterns (when --pattern-columns is used), and the qsv-specific currency, dynamicEnum, uniqueCombinedWith custom keywords where appropriate.

Example: re-validate the source data — schema should always pass for the data it was inferred from

qsv validate \
  NYC_311_SR_2010-2020-sample-1M.csv \
  NYC_311_SR_2010-2020-sample-1M.csv.schema.json
# Exit code 0; no .invalid.csv or .validation-errors.tsv produced

Example: Polars schema for speeding up sqlp

qsv schema --polars NYC_311_SR_2010-2020-sample-1M.csv
# Produces NYC_311_SR_2010-2020-sample-1M.pschema.json

# Now sqlp / joinp / pivotp pick it up automatically
qsv sqlp NYC_311_SR_2010-2020-sample-1M.csv \
  "SELECT Borough, COUNT(*) FROM nyc311 GROUP BY Borough"

Example: validate a manually-fine-tuned schema is itself valid JSON Schema

qsv validate schema fine-tuned-schema.json

See also: /docs/help/schema.md, validate, JSON Schema 2020-12 spec, stats, Stats Cache & Caching.

sniff

(Also covered in Selection & Inspection → sniff. Mentioned here because it's the lightweight cousin of schema for validation workflows.)

sniff samples the first N rows (default 1,000) to infer delimiter, header row, preamble rows, quote character, encoding, average record length, total row count, and per-field types & names. Works on local files and remote URLs (without downloading the whole file).

With the magika feature compiled in, it also identifies 200+ file types via Google's AI-based content detector — useful for CKAN harvesting.

Example: probe a remote dataset before downloading

qsv sniff https://opendata.cityofnewyork.us/data.csv
# Reports delimiter, headers, content length, estimated row count, field types

Example: MIME-type-only probe for fast URL health checks (CKAN-style)

qsv sniff --no-infer https://example.com/data.xlsx

Example: stricter sampling for an oddly-shaped CSV

qsv sniff --sample 0.20 --delimiter '|' weird.csv

For guaranteed (non-sampled) type inference, use stats or schema — they scan the entire file.

See also

Clone this wiki locally