Skip to content

Recipe CKAN Integration

Joel Natividad edited this page May 13, 2026 · 2 revisions

Recipe: CKAN Integration

Tier: Intermediate Commands used: safenames, applydp (in qsvdp), to postgres, to sqlite, jsonl, sniff, validate, describegpt Anchor dataset: any CKAN instance — examples use demo.ckan.org, catalog.data.gov, data.cnra.ca.gov

Problem

You manage a CKAN data portal — or you ingest data from one. qsv has a CKAN-aware integration surface that handles:

  • pulling dataset / resource / user / org metadata into CSVs
  • preparing CSVs for the CKAN Datastore (safe column names, reserved fields, length caps)
  • pushing CSVs into Postgres for DataPusher+ ingestion
  • validating CSVs against schemas before publishing
  • generating data dictionaries

This recipe expands the original Cookbook entry. The original short snippets are preserved at the bottom of Cookbook.

Data

A live CKAN instance plus ckanapi (Python) and jq installed:

pipx install ckanapi
brew install jq      # or apt install jq

Solution — pull metadata FROM CKAN

1. Dump all datasets / users / groups / orgs as CSV

ckanapi -r https://demo.ckan.org dump datasets --all      | qsv jsonl > datasets.csv
ckanapi -r https://demo.ckan.org dump users    --all      | qsv jsonl > users.csv
ckanapi -r https://demo.ckan.org dump groups   --all      | qsv jsonl > groups.csv
ckanapi -r https://demo.ckan.org dump organizations --all | qsv jsonl > organizations.csv

ckanapi emits JSONL; qsv jsonl flattens it to CSV.

2. Get the resource list for a single dataset

ckanapi -r https://catalog.data.gov action package_show \
  id=low-altitude-aerial-imagery-obtained-with-unmanned-aerial-systems-uas-flights-over-black-beach \
  | jq -c '.resources[]' \
  | qsv jsonl > resources.csv

3. Download the latest CSV of a named resource and run stats on it

ckanapi -r https://data.cnra.ca.gov action package_show id="wellstar-oil-and-gas-wells1" \
  > wellstar.json

cat wellstar.json \
  | jq -c '.resources[] | select(.name=="CSV") | .url' \
  | xargs -L 1 wget -O wellstar.csv

qsv stats --everything wellstar.csv > wellstar-stats.csv

4. Health-check a whole CKAN catalog without downloading every resource

# resources.csv has columns: id, name, url, format, ...
qsv select url resources.csv \
  | qsv behead \
  | xargs -I {} qsv sniff --no-infer --json {} > catalog_health.jsonl

sniff --no-infer returns just the MIME type, content length, and last-modified — fast even for stale or moved resources.

Solution — push CSVs TO CKAN

5. Make column names CKAN-Datastore-safe

qsv safenames raw_export.csv > step1.csv

safenames enforces:

  • lowercase, snake_case, alphanumeric + underscore only
  • ≤ 60 bytes (snapped to UTF-8 character boundary)
  • duplicates get numeric suffixes
  • columns named _id (reserved by CKAN) are renamed to reserved__id
  • columns starting with _ get an unsafe_ prefix
qsv safenames --mode V raw_export.csv     # audit mode, no rewrite
# stderr: 4 unsafe header/s: ["12_col", "Col with Spaces!", "", "_id"]

6. Clean values with applydp (DataPusher+ optimized)

If you're inside the qsvdp variant (the slim DataPusher+ build):

qsvdp applydp operations trim,lower email step1.csv \
  | qsvdp applydp operations cast amount --comparand integer \
  > step2.csv

If you're using the full qsv binary, use apply with the same operations:

qsv apply operations trim,lower email step1.csv \
  | qsv apply operations cast amount --comparand integer \
  > step2.csv

7. Load into Postgres for DataPusher+

qsv to postgres 'postgresql://datapusher:secret@localhost:5432/datastore' step2.csv
# Creates a table named after the CSV file stem

DataPusher+ takes it from there.

8. Generate a Frictionless Data Package

qsv to datapackage --stats datapackage.json step2.csv

--stats embeds qsv stats output in the Data Package descriptor — gives downstream consumers schema + range information without an extra round-trip.

Solution — generate a CKAN-ready data dictionary

qsv describegpt step2.csv \
  --all \
  --tag-vocab ckan_tag_vocabulary.csv \
  -u http://localhost:11434/v1 \
  --model deepseek-r1:14b \
  > step2_dictionary.md
  • --all produces description + tags + dictionary
  • --tag-vocab constrains tags to a curated CKAN-aligned vocabulary
  • Local Ollama keeps sensitive data on-premise

See AI & Documentation → describegpt for prompt customization.

Variations

Validate before publishing

qsv schema step2.csv         # produces step2.csv.schema.json
# Edit the schema to tighten rules, add dynamicEnum lookups, ...
qsv validate step2.csv step2.csv.schema.json

See Recipe: JSON Schema Validation.

Use dynamicEnum against a CKAN resource

In a schema, validate that a agency column only contains values that appear in a CKAN-hosted lookup CSV:

{
  "properties": {
    "agency": {
      "type": "string",
      "dynamicEnum": "ckan://nyc-agencies-resource-id"
    }
  }
}

qsv resolves the ckan:// URL via the CKAN action API. See Lookup Tables.

sniff as a CKAN harvester accelerator

Use sniff --no-infer against every CKAN resource URL to detect dead links, MIME type mismatches, and content-length regressions — without downloading anything.

qsv sniff --no-infer https://example.com/data.xlsx

Performance notes

  • safenames runs in O(headers) — instantaneous regardless of file size.
  • applydp / apply are streaming.
  • to postgres uses Postgres's COPY FROM under the hood — millions of rows per minute on a local DB.
  • For CKAN catalog health-checks at scale, parallelize with xargs -P 8 against sniff --no-infer.

See also

Clone this wiki locally