Skip to content

Recipe Fetch and Cache

Joel Natividad edited this page May 13, 2026 · 2 revisions

Recipe: Fetch & Cache

Tier: Advanced Commands used: fetch, fetchpost, --disk-cache, --redis-cache, --jaq Anchor datasets: NOAA GHCN-Daily (primary), GitHub stargazers (secondary)

Problem

Enriching a CSV with data from an external HTTP API, one row at a time, is the classic shape of:

  • looking up weather for each row's date
  • pulling GitHub repo metadata
  • bulk OCR / ML inference
  • geocoding via a paid service (when local geocode misses)
  • pricing lookups
  • compliance / sanctions list checks

Bare curl per row works but blows up under rate limits, doesn't cache, and serializes. qsv fetch and fetchpost handle all of this with HTTP/2 flow control, RateLimit-header-aware throttling, four cache options, and jaq for response extraction.

Data

NOAA GHCN-Daily (primary)

A list of weather stations and the dates we want to look up for each:

cat <<'EOF' > stations.csv
station_id,name
USW00094728,New York Central Park
USC00301309,Boston
USW00012842,Tampa
USW00023174,Los Angeles
USW00094846,Chicago O'Hare
EOF

NOAA serves each station's daily history at: https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/<STATION_ID>.csv

GitHub stargazers (secondary)

A list of qsv-related repos for which we want star counts:

cat <<'EOF' > repos.csv
owner_repo
dathere/qsv
dathere/datapusher-plus
dathere/qsv-lookup-tables
BurntSushi/xsv
johnkerl/miller
EOF

Solution

1. Fetch — GET each row's URL (NOAA)

qsv fetch \
  --url-template "https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/{station_id}.csv" \
  --new-column raw_csv \
  --disk-cache \
  --disk-cache-dir ~/.qsv-cache/noaa \
  stations.csv > stations_with_data.csv

--url-template substitutes column values into the URL template. --disk-cache persists responses to ~/.qsv-cache/noaa/ keyed by URL — a second run is instant for already-fetched stations.

The raw_csv column contains the full NOAA response (a CSV-shaped text blob) per row. That's not always what you want — usually you want to extract specific fields.

2. Fetch + jaq — extract fields from JSON responses (GitHub)

qsv fetch \
  --url-template 'https://api.github.com/repos/{owner_repo}' \
  --http-header "Authorization: Bearer $GITHUB_TOKEN" \
  --new-column stars \
  --jaq '.stargazers_count' \
  --disk-cache \
  --disk-cache-dir ~/.qsv-cache/github \
  repos.csv > repos_with_stars.csv

The --jaq filter parses the JSON response and extracts the value at .stargazers_count. The new column contains just the number.

3. Multiple jaq projections in one fetch

qsv fetch \
  --url-template 'https://api.github.com/repos/{owner_repo}' \
  --http-header "Authorization: Bearer $GITHUB_TOKEN" \
  --new-column metrics \
  --jaq '{ stars: .stargazers_count, forks: .forks_count, open_issues: .open_issues_count }' \
  --disk-cache \
  repos.csv > repos_with_metrics.csv

The metrics column contains a small JSON object per row. Use qsv json afterward to flatten.

4. Rate-limit handling

qsv fetch \
  --url-template 'https://api.example.com/lookup/{key}' \
  --rate-limit 10 \
  --new-column response \
  --jaq '.result' \
  data.csv > enriched.csv

--rate-limit 10 caps you at 10 RPS. Additionally, qsv auto-throttles when the server returns RFC RateLimit headers — no extra config needed.

5. Redis cache for cross-machine / CI runs

qsv fetch \
  --url-template 'https://api.example.com/{id}' \
  --redis-cache \
  --redis-cache-conn 'redis://my-redis:6379' \
  --new-column response \
  data.csv > enriched.csv

Two CI runners hitting the same API will share cache hits via Redis — saves API quota AND time.

6. Per-row report (which calls succeeded?)

qsv fetch \
  --url-template 'https://api.example.com/{id}' \
  --report \
  data.csv > report.tsv
# report.tsv has: row, url, status, elapsed_ms, ...

--report mode skips the main output and writes a per-call audit TSV. Useful for diagnostics.

7. fetchpost — POST per row

You want to send each row to an OCR endpoint. Build a JSON payload with MiniJinja:

{# ocr_payload.j2 #}
{
  "model": "ocr-v2",
  "image_url": {{ image_url | tojson }},
  "language": {{ language | tojson }}
}
qsv fetchpost https://ocr.example.com/extract \
  --payload-tpl ocr_payload.j2 \
  --new-column text \
  --jaq '.text' \
  --rate-limit 5 \
  --disk-cache \
  images.csv > images_with_text.csv

For simple form-encoded POSTs (no template needed):

qsv fetchpost https://api.example.com/submit \
  name,email,score \
  --new-column response_id \
  --jaq '.id' \
  responses.csv > with_ids.csv

The columns name,email,score become name=...&email=...&score=... form fields.

Variations

Conditional fetch — only fetch rows that need it

# Split the input: rows that need fetching vs rows that already have data
qsv search --select 'stars' '^$' repos.csv > needs_fetch.csv
qsv search --select 'stars' '^$' --invert-match repos.csv > already_fetched.csv

# Fetch only the rows that need it
qsv fetch --url-template '...' needs_fetch.csv > newly_fetched.csv

# Merge back
qsv cat rows already_fetched.csv newly_fetched.csv > combined.csv

Test mode — preview first 5 calls

qsv slice --len 5 data.csv | qsv fetch --url-template '...' -

Run a quick preview against the first 5 rows before scaling up.

Chain fetch → describegpt for AI-driven enrichment

qsv fetch --url-template 'https://api.example.com/{id}' \
  --new-column raw \
  data.csv \
  | qsv describegpt --prompt "Summarize each raw response in one sentence" -

(In practice you'd save the intermediate and run describegpt on a sample, but the pattern is composable.)

Use a local LLM endpoint for inference

qsv fetchpost http://localhost:11434/api/generate \
  --payload-tpl ollama_payload.j2 \
  --new-column summary \
  --jaq '.response' \
  rows.csv > with_summaries.csv

Where ollama_payload.j2 is:

{
  "model": "gpt-oss-20b",
  "prompt": "Summarize: {{ text | tojson }}",
  "stream": false
}

Cache management

Cache Best for Reset / inspect
In-memory LRU (default) One-shot runs, small datasets Process-lifetime; lost on exit
Disk (--disk-cache) Repeated runs against stable APIs rm -rf ~/.qsv-cache/fetch/<subdir>
Redis (--redis-cache) Distributed / CI runs redis-cli FLUSHDB on the cache DB
No cache (--no-cache) Live data, pricing, stock n/a

For disk-cache TTL and Redis connection strings, see docs/Fetch.md.

Performance notes

  • HTTP/2 flow control means qsv adaptively raises and lowers in-flight requests based on the server's flow window.
  • RateLimit-header-aware throttling kicks in automatically when the server signals limits — no manual tuning needed for compliant APIs.
  • Disk cache hits are essentially free; Redis cache hits are a single round-trip to Redis (microseconds locally, low ms over LAN).
  • For very chatty APIs, parallelize via qsv splitxargs -P; cache hits will still dedupe across worker processes if you use the disk or Redis cache.

See also

Clone this wiki locally