Validation and Schema

Validation & Schema

Tier: Intermediate Commands covered: validate, schema, sniff (see also profile for standards-based metadata generation)

Note

Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.

This is qsv's data quality story. Use schema to infer a JSON Schema 2020-12 spec from representative data, edit the schema by hand to tighten the rules, then validate to enforce it against new data at up to 780,000 rows/sec. Use sniff for fast metadata (delimiter, types, MIME) when you don't need a full schema scan.

For the deep-dive, see docs/Validate.md.

Quick decision table

If you want to…	Use	Notes
Check a CSV is valid RFC 4180 + UTF-8	`validate` (no schema)	Run before commands that assume well-formed input (especially `stats`)
Validate against a JSON Schema 2020-12 spec	`validate <data.csv> <schema.json>`	Outputs `.valid.csv`, `.invalid.csv`, `.validation-errors.tsv`
Generate a JSON Schema from representative data	`schema`	`.schema.json` sidecar; edit to tighten rules
Generate a Polars schema for `sqlp` / `joinp` / `pivotp`	`schema --polars`	`.pschema.json` sidecar; speeds up Polars commands
Quickly probe an unknown CSV / URL	`sniff`	Faster than schema; samples the first N rows
Generate standards-based metadata (DCAT / Croissant / CKAN)	`profile`	`.metadata.json`; FAIRification for data portals & ML registries

`validate`

JSON Schema Validation Draft 2020-12 with the fastest validator in Rust (jsonschema-rs). Up to 780k rows/sec on NYC 311 with the bundled schema.

Without a schema argument, validate just verifies RFC 4180 conformance and UTF-8 — a cheap sanity check before running stats (which assumes well-formed input).

With a schema, three output files are produced:

<input>.valid.csv — rows that passed
<input>.invalid.csv — rows that failed
<input>.validation-errors.tsv — row_number, field, error for every failure

Three custom keywords beyond stock JSON Schema 2020-12:

currency — ISO 4217 currency validation (USD, €, ¥, $1,000.00, (€100,00), -USD100.00, …)
dynamicEnum — enum whose values come from a CSV (local, HTTP/HTTPS, dathere://, or ckan://). Use this when the valid-value set is too large to hardcode or it changes over time.
uniqueCombinedWith — composite-key uniqueness check across columns.

Example: cheap RFC 4180 + UTF-8 sanity check

qsv validate NYC_311_SR_2010-2020-sample-1M.csv
# Exit code 0 = valid; non-zero = problem; details to stderr.

Example: validate NYC 311 against the bundled schema

qsv validate \
  NYC_311_SR_2010-2020-sample-1M.csv \
  resources/test/311_Service_Requests_from_2010_to_Present-2022-03-04.csv.schema.json
ls NYC_311_SR_2010-2020-sample-1M.*
# .valid.csv  .invalid.csv  .validation-errors.tsv

Example: dynamicEnum against a CSV of valid NYC agencies

{
  "type": "object",
  "properties": {
    "Agency": {
      "type": "string",
      "dynamicEnum": "NYC_agencies.csv"
    }
  }
}

dynamicEnum accepts a local path or a URL — including ckan:// to fetch a resource by CKAN ID.

Example: uniqueCombinedWith for a composite primary key

{
  "properties": {
    "Created Date":       { "uniqueCombinedWith": ["Complaint Type", "Incident Address"] }
  }
}

This enforces that the combination of those three columns is unique across the file.

Example: currency format validation

{
  "properties": {
    "Sale Price": { "type": "string", "format": "currency" }
  }
}

Example: validate a directory of CSVs against one schema

qsv validate exports/2024/ schema.json

Example: fancy regex (look-around, backreferences) when needed

qsv validate --fancy-regex data.csv strict-schema.json

By default validate uses the same fast regex crate as search. --fancy-regex opts into the slower but more expressive fancy-regex crate for schemas that need it.

See also: /docs/help/validate.md, docs/Validate.md — deep dive, schema, Lookup Tables — for dynamicEnum patterns, Recipe: JSON Schema Validation.

`schema`

Two modes:

JSON Schema mode (default) — produces a <input>.schema.json with inferred types, formats, ranges, and enum / pattern constraints. Edit the file by hand to tighten rules (mark required fields, add dynamicEnum lookups, set custom currency formats, …) before using it with validate.
Polars Schema mode (--polars) — produces a <input>.pschema.json consumed by sqlp / joinp / pivotp to skip type inference. Edit to change inferred Polars data types where needed.

Both reuse the stats cache when available (run qsv stats --cardinality --infer-dates --stats-jsonl <input> first for max speed).

Example: infer a JSON Schema from NYC 311

qsv stats --cardinality --infer-dates --infer-boolean --stats-jsonl \
  NYC_311_SR_2010-2020-sample-1M.csv
qsv schema NYC_311_SR_2010-2020-sample-1M.csv
# Produces NYC_311_SR_2010-2020-sample-1M.csv.schema.json

The generated schema has inferred types, min/max ranges (for numerics), date formats, enum values (for low-cardinality columns), regex patterns (when --pattern-columns is used), and the qsv-specific currency, dynamicEnum, uniqueCombinedWith custom keywords where appropriate.

Example: re-validate the source data — schema should always pass for the data it was inferred from

qsv validate \
  NYC_311_SR_2010-2020-sample-1M.csv \
  NYC_311_SR_2010-2020-sample-1M.csv.schema.json
# Exit code 0; no .invalid.csv or .validation-errors.tsv produced

Example: Polars schema for speeding up sqlp

qsv schema --polars NYC_311_SR_2010-2020-sample-1M.csv
# Produces NYC_311_SR_2010-2020-sample-1M.pschema.json

# Now sqlp / joinp / pivotp pick it up automatically
qsv sqlp NYC_311_SR_2010-2020-sample-1M.csv \
  "SELECT Borough, COUNT(*) FROM nyc311 GROUP BY Borough"

Example: validate a manually-fine-tuned schema is itself valid JSON Schema

qsv validate schema fine-tuned-schema.json

See also: /docs/help/schema.md, validate, JSON Schema 2020-12 spec, stats, Stats Cache & Caching.

`sniff`

(Also covered in Selection & Inspection → sniff. Mentioned here because it's the lightweight cousin of schema for validation workflows.)

sniff samples the first N rows (default 1,000) to infer delimiter, header row, preamble rows, quote character, encoding, average record length, total row count, and per-field types & names. Works on local files and remote URLs (without downloading the whole file).

With the magika feature compiled in, it also identifies 200+ file types via Google's AI-based content detector — useful for CKAN harvesting.

Example: probe a remote dataset before downloading

qsv sniff https://opendata.cityofnewyork.us/data.csv
# Reports delimiter, headers, content length, estimated row count, field types

Example: MIME-type-only probe for fast URL health checks (CKAN-style)

qsv sniff --no-infer https://example.com/data.xlsx

Example: stricter sampling for an oddly-shaped CSV

qsv sniff --sample 0.20 --delimiter '|' weird.csv

For guaranteed (non-sampled) type inference, use stats or schema — they scan the entire file.

Validation and Schema

Validation & Schema

Quick decision table

validate

schema

sniff

See also

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Get Started

Command Reference

Cookbook

Tuning & Internals

Ecosystem

Reference

Legacy

Clone this wiki locally

`validate`

`schema`

`sniff`