Recipe CKAN Integration

Recipe: CKAN Integration

Tier: Intermediate Commands used: safenames, applydp (in qsvdp), to postgres, to sqlite, jsonl, sniff, validate, describegpt Anchor dataset: any CKAN instance — examples use demo.ckan.org, catalog.data.gov, data.cnra.ca.gov

Problem

You manage a CKAN data portal — or you ingest data from one. qsv has a CKAN-aware integration surface that handles:

pulling dataset / resource / user / org metadata into CSVs
preparing CSVs for the CKAN Datastore (safe column names, reserved fields, length caps)
pushing CSVs into Postgres for DataPusher+ ingestion
validating CSVs against schemas before publishing
generating data dictionaries

This recipe expands the original Cookbook entry. The original short snippets are preserved at the bottom of Cookbook.

Data

A live CKAN instance plus ckanapi (Python) and jq installed:

pipx install ckanapi
brew install jq      # or apt install jq

Solution — pull metadata FROM CKAN

1. Dump all datasets / users / groups / orgs as CSV

ckanapi -r https://demo.ckan.org dump datasets --all      | qsv jsonl > datasets.csv
ckanapi -r https://demo.ckan.org dump users    --all      | qsv jsonl > users.csv
ckanapi -r https://demo.ckan.org dump groups   --all      | qsv jsonl > groups.csv
ckanapi -r https://demo.ckan.org dump organizations --all | qsv jsonl > organizations.csv

ckanapi emits JSONL; qsv jsonl flattens it to CSV.

2. Get the resource list for a single dataset

ckanapi -r https://catalog.data.gov action package_show \
  id=low-altitude-aerial-imagery-obtained-with-unmanned-aerial-systems-uas-flights-over-black-beach \
  | jq -c '.resources[]' \
  | qsv jsonl > resources.csv

3. Download the latest CSV of a named resource and run stats on it

ckanapi -r https://data.cnra.ca.gov action package_show id="wellstar-oil-and-gas-wells1" \
  > wellstar.json

cat wellstar.json \
  | jq -c '.resources[] | select(.name=="CSV") | .url' \
  | xargs -L 1 wget -O wellstar.csv

qsv stats --everything wellstar.csv > wellstar-stats.csv

4. Health-check a whole CKAN catalog without downloading every resource

# resources.csv has columns: id, name, url, format, ...
qsv select url resources.csv \
  | qsv behead \
  | xargs -I {} qsv sniff --no-infer --json {} > catalog_health.jsonl

sniff --no-infer returns just the MIME type, content length, and last-modified — fast even for stale or moved resources.

Solution — push CSVs TO CKAN

5. Make column names CKAN-Datastore-safe

qsv safenames raw_export.csv > step1.csv

safenames enforces:

lowercase, snake_case, alphanumeric + underscore only
≤ 60 bytes (snapped to UTF-8 character boundary)
duplicates get numeric suffixes
columns named _id (reserved by CKAN) are renamed to reserved__id
columns starting with _ get an unsafe_ prefix

qsv safenames --mode V raw_export.csv     # audit mode, no rewrite
# stderr: 4 unsafe header/s: ["12_col", "Col with Spaces!", "", "_id"]

6. Clean values with `applydp` (DataPusher+ optimized)

If you're inside the qsvdp variant (the slim DataPusher+ build):

qsvdp applydp operations trim,lower email step1.csv \
  | qsvdp applydp operations cast amount --comparand integer \
  > step2.csv

If you're using the full qsv binary, use apply with the same operations:

qsv apply operations trim,lower email step1.csv \
  | qsv apply operations cast amount --comparand integer \
  > step2.csv

7. Load into Postgres for DataPusher+

qsv to postgres 'postgresql://datapusher:secret@localhost:5432/datastore' step2.csv
# Creates a table named after the CSV file stem

DataPusher+ takes it from there.

8. Generate a Frictionless Data Package

qsv to datapackage --stats datapackage.json step2.csv

--stats embeds qsv stats output in the Data Package descriptor — gives downstream consumers schema + range information without an extra round-trip.

Solution — generate a CKAN-ready data dictionary

qsv describegpt step2.csv \
  --all \
  --tag-vocab ckan_tag_vocabulary.csv \
  -u http://localhost:11434/v1 \
  --model deepseek-r1:14b \
  > step2_dictionary.md

--all produces description + tags + dictionary
--tag-vocab constrains tags to a curated CKAN-aligned vocabulary
Local Ollama keeps sensitive data on-premise

See AI & Documentation → describegpt for prompt customization.

Variations

Validate before publishing

qsv schema step2.csv         # produces step2.csv.schema.json
# Edit the schema to tighten rules, add dynamicEnum lookups, ...
qsv validate step2.csv step2.csv.schema.json

See Recipe: JSON Schema Validation.

Use `dynamicEnum` against a CKAN resource

In a schema, validate that a agency column only contains values that appear in a CKAN-hosted lookup CSV:

{
  "properties": {
    "agency": {
      "type": "string",
      "dynamicEnum": "ckan://nyc-agencies-resource-id"
    }
  }
}

qsv resolves the ckan:// URL via the CKAN action API. See Lookup Tables.

`sniff` as a CKAN harvester accelerator

Use sniff --no-infer against every CKAN resource URL to detect dead links, MIME type mismatches, and content-length regressions — without downloading anything.

qsv sniff --no-infer https://example.com/data.xlsx

Performance notes

safenames runs in O(headers) — instantaneous regardless of file size.
applydp / apply are streaming.
to postgres uses Postgres's COPY FROM under the hood — millions of rows per minute on a local DB.
For CKAN catalog health-checks at scale, parallelize with xargs -P 8 against sniff --no-infer.

Recipe CKAN Integration

Recipe: CKAN Integration

Problem

Data

Solution — pull metadata FROM CKAN

1. Dump all datasets / users / groups / orgs as CSV

2. Get the resource list for a single dataset

3. Download the latest CSV of a named resource and run stats on it

4. Health-check a whole CKAN catalog without downloading every resource

Solution — push CSVs TO CKAN

5. Make column names CKAN-Datastore-safe

6. Clean values with applydp (DataPusher+ optimized)

7. Load into Postgres for DataPusher+

8. Generate a Frictionless Data Package

Solution — generate a CKAN-ready data dictionary

Variations

Validate before publishing

Use dynamicEnum against a CKAN resource

sniff as a CKAN harvester accelerator

Performance notes

See also

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Get Started

Command Reference

Cookbook

Tuning & Internals

Ecosystem

6. Clean values with `applydp` (DataPusher+ optimized)

Use `dynamicEnum` against a CKAN resource

`sniff` as a CKAN harvester accelerator