Recipe Fetch and Cache

Recipe: Fetch & Cache

Tier: Advanced Commands used: fetch, fetchpost, --disk-cache, --redis-cache, --jaq Anchor datasets: NOAA GHCN-Daily (primary), GitHub stargazers (secondary)

Problem

Enriching a CSV with data from an external HTTP API, one row at a time, is the classic shape of:

looking up weather for each row's date
pulling GitHub repo metadata
bulk OCR / ML inference
geocoding via a paid service (when local geocode misses)
pricing lookups
compliance / sanctions list checks

Bare curl per row works but blows up under rate limits, doesn't cache, and serializes. qsv fetch and fetchpost handle all of this with HTTP/2 flow control, RateLimit-header-aware throttling, four cache options, and jaq for response extraction.

Data

NOAA GHCN-Daily (primary)

A list of weather stations and the dates we want to look up for each:

cat <<'EOF' > stations.csv
station_id,name
USW00094728,New York Central Park
USC00301309,Boston
USW00012842,Tampa
USW00023174,Los Angeles
USW00094846,Chicago O'Hare
EOF

NOAA serves each station's daily history at: https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/<STATION_ID>.csv

GitHub stargazers (secondary)

A list of qsv-related repos for which we want star counts:

cat <<'EOF' > repos.csv
owner_repo
dathere/qsv
dathere/datapusher-plus
dathere/qsv-lookup-tables
BurntSushi/xsv
johnkerl/miller
EOF

Solution

1. Fetch — GET each row's URL (NOAA)

qsv fetch \
  --url-template "https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/{station_id}.csv" \
  --new-column raw_csv \
  --disk-cache \
  --disk-cache-dir ~/.qsv-cache/noaa \
  stations.csv > stations_with_data.csv

--url-template substitutes column values into the URL template. --disk-cache persists responses to ~/.qsv-cache/noaa/ keyed by URL — a second run is instant for already-fetched stations.

The raw_csv column contains the full NOAA response (a CSV-shaped text blob) per row. That's not always what you want — usually you want to extract specific fields.

2. Fetch + jaq — extract fields from JSON responses (GitHub)

qsv fetch \
  --url-template 'https://api.github.com/repos/{owner_repo}' \
  --http-header "Authorization: Bearer $GITHUB_TOKEN" \
  --new-column stars \
  --jaq '.stargazers_count' \
  --disk-cache \
  --disk-cache-dir ~/.qsv-cache/github \
  repos.csv > repos_with_stars.csv

The --jaq filter parses the JSON response and extracts the value at .stargazers_count. The new column contains just the number.

3. Multiple jaq projections in one fetch

qsv fetch \
  --url-template 'https://api.github.com/repos/{owner_repo}' \
  --http-header "Authorization: Bearer $GITHUB_TOKEN" \
  --new-column metrics \
  --jaq '{ stars: .stargazers_count, forks: .forks_count, open_issues: .open_issues_count }' \
  --disk-cache \
  repos.csv > repos_with_metrics.csv

The metrics column contains a small JSON object per row. Use qsv json afterward to flatten.

4. Rate-limit handling

qsv fetch \
  --url-template 'https://api.example.com/lookup/{key}' \
  --rate-limit 10 \
  --new-column response \
  --jaq '.result' \
  data.csv > enriched.csv

--rate-limit 10 caps you at 10 RPS. Additionally, qsv auto-throttles when the server returns RFC RateLimit headers — no extra config needed.

5. Redis cache for cross-machine / CI runs

qsv fetch \
  --url-template 'https://api.example.com/{id}' \
  --redis-cache \
  --redis-cache-conn 'redis://my-redis:6379' \
  --new-column response \
  data.csv > enriched.csv

Two CI runners hitting the same API will share cache hits via Redis — saves API quota AND time.

6. Per-row report (which calls succeeded?)

qsv fetch \
  --url-template 'https://api.example.com/{id}' \
  --report \
  data.csv > report.tsv
# report.tsv has: row, url, status, elapsed_ms, ...

--report mode skips the main output and writes a per-call audit TSV. Useful for diagnostics.

7. fetchpost — POST per row

You want to send each row to an OCR endpoint. Build a JSON payload with MiniJinja:

{# ocr_payload.j2 #}
{
  "model": "ocr-v2",
  "image_url": {{ image_url | tojson }},
  "language": {{ language | tojson }}
}

qsv fetchpost https://ocr.example.com/extract \
  --payload-tpl ocr_payload.j2 \
  --new-column text \
  --jaq '.text' \
  --rate-limit 5 \
  --disk-cache \
  images.csv > images_with_text.csv

For simple form-encoded POSTs (no template needed):

qsv fetchpost https://api.example.com/submit \
  name,email,score \
  --new-column response_id \
  --jaq '.id' \
  responses.csv > with_ids.csv

The columns name,email,score become name=...&email=...&score=... form fields.

Variations

Conditional fetch — only fetch rows that need it

# Split the input: rows that need fetching vs rows that already have data
qsv search --select 'stars' '^$' repos.csv > needs_fetch.csv
qsv search --select 'stars' '^$' --invert-match repos.csv > already_fetched.csv

# Fetch only the rows that need it
qsv fetch --url-template '...' needs_fetch.csv > newly_fetched.csv

# Merge back
qsv cat rows already_fetched.csv newly_fetched.csv > combined.csv

Test mode — preview first 5 calls

qsv slice --len 5 data.csv | qsv fetch --url-template '...' -

Run a quick preview against the first 5 rows before scaling up.

Chain fetch → describegpt for AI-driven enrichment

qsv fetch --url-template 'https://api.example.com/{id}' \
  --new-column raw \
  data.csv \
  | qsv describegpt --prompt "Summarize each raw response in one sentence" -

(In practice you'd save the intermediate and run describegpt on a sample, but the pattern is composable.)

Use a local LLM endpoint for inference

qsv fetchpost http://localhost:11434/api/generate \
  --payload-tpl ollama_payload.j2 \
  --new-column summary \
  --jaq '.response' \
  rows.csv > with_summaries.csv

Where ollama_payload.j2 is:

{
  "model": "gpt-oss-20b",
  "prompt": "Summarize: {{ text | tojson }}",
  "stream": false
}

Cache management

Cache	Best for	Reset / inspect
In-memory LRU (default)	One-shot runs, small datasets	Process-lifetime; lost on exit
Disk (`--disk-cache`)	Repeated runs against stable APIs	`rm -rf ~/.qsv-cache/fetch/<subdir>`
Redis (`--redis-cache`)	Distributed / CI runs	`redis-cli FLUSHDB` on the cache DB
No cache (`--no-cache`)	Live data, pricing, stock	n/a

For disk-cache TTL and Redis connection strings, see docs/Fetch.md.

Performance notes

HTTP/2 flow control means qsv adaptively raises and lowers in-flight requests based on the server's flow window.
RateLimit-header-aware throttling kicks in automatically when the server signals limits — no manual tuning needed for compliant APIs.
Disk cache hits are essentially free; Redis cache hits are a single round-trip to Redis (microseconds locally, low ms over LAN).
For very chatty APIs, parallelize via qsv split → xargs -P; cache hits will still dedupe across worker processes if you use the disk or Redis cache.

Recipe Fetch and Cache

Recipe: Fetch & Cache

Problem

Data

NOAA GHCN-Daily (primary)

GitHub stargazers (secondary)

Solution

1. Fetch — GET each row's URL (NOAA)

2. Fetch + jaq — extract fields from JSON responses (GitHub)

3. Multiple jaq projections in one fetch

4. Rate-limit handling

5. Redis cache for cross-machine / CI runs

6. Per-row report (which calls succeeded?)

7. fetchpost — POST per row

Variations

Conditional fetch — only fetch rows that need it

Test mode — preview first 5 calls

Chain fetch → describegpt for AI-driven enrichment

Use a local LLM endpoint for inference

Cache management

Performance notes

See also

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Get Started

Command Reference

Cookbook

Tuning & Internals