Recipe CKAN Integration
Tier: Intermediate
Commands used: safenames, applydp (in qsvdp), to postgres, to sqlite, jsonl, sniff, validate, describegpt
Anchor dataset: any CKAN instance — examples use demo.ckan.org, catalog.data.gov, data.cnra.ca.gov
You manage a CKAN data portal — or you ingest data from one. qsv has a CKAN-aware integration surface that handles:
- pulling dataset / resource / user / org metadata into CSVs
- preparing CSVs for the CKAN Datastore (safe column names, reserved fields, length caps)
- pushing CSVs into Postgres for DataPusher+ ingestion
- validating CSVs against schemas before publishing
- generating data dictionaries
This recipe expands the original Cookbook entry. The original short snippets are preserved at the bottom of Cookbook.
A live CKAN instance plus ckanapi (Python) and jq installed:
pipx install ckanapi
brew install jq # or apt install jq
ckanapi -r https://demo.ckan.org dump datasets --all | qsv jsonl > datasets.csv
ckanapi -r https://demo.ckan.org dump users --all | qsv jsonl > users.csv
ckanapi -r https://demo.ckan.org dump groups --all | qsv jsonl > groups.csv
ckanapi -r https://demo.ckan.org dump organizations --all | qsv jsonl > organizations.csv
ckanapi emits JSONL; qsv jsonl flattens it to CSV.
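The four dumps above differ only in the entity name, so they can be wrapped in a loop. This is a minimal sketch, assuming ckanapi and qsv are on PATH; the function name dump_ckan_metadata and its output-directory argument are hypothetical and not part of either tool.

```shell
# Hypothetical helper: dump every CKAN entity type to its own CSV.
# Usage: dump_ckan_metadata https://demo.ckan.org ./metadata
dump_ckan_metadata() {
  ckan_url="$1"
  out_dir="${2:-.}"
  mkdir -p "$out_dir" || return 1
  for entity in datasets users groups organizations; do
    # ckanapi emits JSONL on stdout; qsv jsonl flattens it to CSV.
    # Stop at the first entity whose conversion fails.
    ckanapi -r "$ckan_url" dump "$entity" --all \
      | qsv jsonl > "$out_dir/$entity.csv" || return 1
  done
}
```

Writing into a dedicated directory keeps the metadata CSVs apart from the data files produced later in the recipe.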
ckanapi -r https://catalog.data.gov action package_show \
id=low-altitude-aerial-imagery-obtained-with-unmanned-aerial-systems-uas-flights-over-black-beach \
| jq -c '.resources[]' \
| qsv jsonl > resources.csv
ckanapi -r https://data.cnra.ca.gov action package_show id="wellstar-oil-and-gas-wells1" \
> wellstar.json
cat wellstar.json \
| jq -c '.resources[] | select(.name=="CSV") | .url' \
| xargs -L 1 wget -O wellstar.csv
qsv stats --everything wellstar.csv > wellstar-stats.csv
# resources.csv has columns: id, name, url, format, ...
qsv select url resources.csv \
| qsv behead \
| xargs -I {} qsv sniff --no-infer --json {} > catalog_health.jsonl
sniff --no-infer returns only the MIME type, content length, and last-modified date, so it stays fast even for stale or moved resources.
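The pipeline above collects the sniff output but does not record which URL failed. A minimal sketch of a loop that keeps the URL next to a pass/fail flag follows. It assumes qsv sniff exits non-zero when a resource cannot be fetched; the function name check_urls and the tab-separated output format are hypothetical.

```shell
# Hypothetical helper: read one URL per line on stdin,
# print "<url><TAB>ok" or "<url><TAB>failed" for each.
# Usage: qsv select url resources.csv | qsv behead | check_urls > url_status.tsv
check_urls() {
  while IFS= read -r url; do
    # Skip blank lines
    [ -n "$url" ] || continue
    # Assumption: a non-zero exit status means the resource could not be sniffed.
    # </dev/null keeps the probe from consuming the URL list on stdin.
    if qsv sniff --no-infer --json "$url" </dev/null >/dev/null 2>&1; then
      printf '%s\tok\n' "$url"
    else
      printf '%s\tfailed\n' "$url"
    fi
  done
}
```

Filtering the result with grep failed gives a list of resources to review.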
qsv safenames raw_export.csv > step1.csv
safenames enforces:
- lowercase, snake_case, alphanumeric + underscore only
- ≤ 60 bytes (snapped to UTF-8 character boundary)
- duplicates get numeric suffixes
- columns named _id (reserved by CKAN) are renamed to reserved__id
- columns starting with _ get an unsafe_ prefix
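To try these rules on a throwaway file, the sketch below writes a small CSV whose four headers are the ones quoted in the sample audit message in this recipe: "12_col", "Col with Spaces!", an empty name, and "_id". The function name make_unsafe_sample, the default file name, and the data rows are hypothetical.

```shell
# Hypothetical fixture: a CSV with four deliberately unsafe headers.
# Usage: make_unsafe_sample unsafe_sample.csv
#        qsv safenames --mode V unsafe_sample.csv
make_unsafe_sample() {
  out="${1:-unsafe_sample.csv}"
  # Header row: leading digit, spaces and punctuation, empty name, reserved _id
  printf '%s\n' \
    '12_col,Col with Spaces!,,_id' \
    '1,alpha,x,100' \
    '2,beta,y,200' > "$out"
}
```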
qsv safenames --mode V raw_export.csv # audit mode, no rewrite
# stderr: 4 unsafe header/s: ["12_col", "Col with Spaces!", "", "_id"]
If you're inside the qsvdp variant (the slim DataPusher+ build):
qsvdp applydp operations trim,lower email step1.csv \
| qsvdp applydp operations cast amount --comparand integer \
> step2.csv
If you're using the full qsv binary, use apply with the same operations:
qsv apply operations trim,lower email step1.csv \
| qsv apply operations cast amount --comparand integer \
> step2.csv
qsv to postgres 'postgresql://datapusher:secret@localhost:5432/datastore' step2.csv
# Creates a table named after the CSV file stem
DataPusher+ takes it from there.
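Before handing off to DataPusher+, it can be worth confirming that every row arrived. A minimal sketch, assuming psql is installed, the table landed in the default schema, and the file stem is already a valid table name; the function name verify_load is hypothetical.

```shell
# Hypothetical helper: compare the CSV row count with the loaded table.
# Usage: verify_load step2.csv 'postgresql://datapusher:secret@localhost:5432/datastore'
verify_load() {
  csv="$1"
  conn="$2"
  # The table is named after the CSV file stem (step2.csv -> step2)
  table="$(basename "$csv" .csv)"
  # qsv count prints the number of data rows, excluding the header
  csv_rows="$(qsv count "$csv")" || return 1
  # -A: unaligned output, -t: tuples only, so psql prints just the number
  db_rows="$(psql "$conn" -At -c "SELECT count(*) FROM \"$table\"")" || return 1
  if [ "$csv_rows" = "$db_rows" ]; then
    echo "ok: $table has $db_rows rows"
  else
    echo "mismatch: $csv has $csv_rows rows, $table has $db_rows" >&2
    return 1
  fi
}
```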
qsv to datapackage --stats datapackage.json step2.csv
--stats embeds qsv stats output in the Data Package descriptor, giving downstream consumers schema and range information without an extra round-trip.
qsv describegpt step2.csv \
--all \
--tag-vocab ckan_tag_vocabulary.csv \
-u http://localhost:11434/v1 \
--model deepseek-r1:14b \
> step2_dictionary.md
- --all produces description + tags + dictionary
- --tag-vocab constrains tags to a curated CKAN-aligned vocabulary
- Local Ollama keeps sensitive data on-premise
See AI & Documentation → describegpt for prompt customization.
qsv schema step2.csv # produces step2.csv.schema.json
# Edit the schema to tighten rules, add dynamicEnum lookups, ...
qsv validate step2.csv step2.csv.schema.json
See Recipe: JSON Schema Validation.
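In a publishing script, the validate step can act as a gate. A minimal sketch follows, assuming qsv validate exits non-zero on invalid input and leaves a report named after the input file with a .validation-errors.tsv suffix; both points are assumptions to confirm against your qsv version, and the function name validate_or_report is hypothetical.

```shell
# Hypothetical publish gate: validate, and on failure show the first errors.
# Usage: validate_or_report step2.csv step2.csv.schema.json && echo "safe to publish"
validate_or_report() {
  csv="$1"
  schema="$2"
  if qsv validate "$csv" "$schema"; then
    echo "valid: $csv"
    return 0
  fi
  # Assumption: failed validation writes this report next to the input file
  report="$csv.validation-errors.tsv"
  if [ -f "$report" ]; then
    head -n 10 "$report" >&2
  fi
  return 1
}
```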
In a schema, you can require that an agency column contains only values that appear in a CKAN-hosted lookup CSV:
{
"properties": {
"agency": {
"type": "string",
"dynamicEnum": "ckan://nyc-agencies-resource-id"
}
}
}
qsv resolves the ckan:// URL via the CKAN action API. See Lookup Tables.
Use sniff --no-infer against every CKAN resource URL to detect dead links, MIME type mismatches, and content-length regressions — without downloading anything.
qsv sniff --no-infer https://example.com/data.xlsx
- safenames runs in O(headers), so it is effectively instantaneous regardless of file size.
- applydp / apply are streaming.
- to postgres uses Postgres's COPY FROM under the hood, which loads millions of rows per minute on a local DB.
- For CKAN catalog health-checks at scale, parallelize with xargs -P 8 against sniff --no-infer.
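The parallel health-check mentioned in the last point can be sketched as a small wrapper around xargs. It assumes an xargs that supports -P (GNU and BSD both do, although the option is not in POSIX); the function name sniff_parallel is hypothetical.

```shell
# Hypothetical helper: read URLs on stdin, probe up to N at once (default 8).
# Usage: qsv select url resources.csv | qsv behead | sniff_parallel 8 > catalog_health.jsonl
sniff_parallel() {
  workers="${1:-8}"
  # -P runs up to $workers probes concurrently; -I {} passes one URL per probe
  xargs -P "$workers" -I {} qsv sniff --no-infer --json {}
}
```

Results arrive in completion order rather than input order, so join on the URL rather than on line position when comparing against resources.csv.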
- Transform & Reshape → safenames
- Transform & Reshape → applydp
- Conversion & I/O → to: Postgres / SQLite / Data Package
- Validation & Schema: dynamicEnum against CKAN URLs
- Binary Variants: qsvdp for DataPusher+
- DataPusher+
- ckanapi
- Integrations
- Cookbook (legacy)
qsv — GitHub · Releases · Discussions · qsv pro · Try it online · Benchmarks · datHere · DeepWiki · Dual-licensed MIT / Unlicense