Recipe Fetch and Cache
Tier: Advanced
Commands used: fetch, fetchpost, --disk-cache, --redis-cache, --jaq
Anchor datasets: NOAA GHCN-Daily (primary), GitHub stargazers (secondary)
Enriching a CSV with data from an external HTTP API, one row at a time, is the classic shape of tasks such as:
- looking up weather for each row's date
- pulling GitHub repo metadata
- bulk OCR / ML inference
- geocoding via a paid service (when local geocode misses)
- pricing lookups
- compliance / sanctions list checks
A bare curl call per row works, but it breaks under rate limits, caches nothing, and runs requests one at a time. qsv fetch and fetchpost handle all of this with HTTP/2 flow control, RateLimit-header-aware throttling, four cache options, and jaq for response extraction.
A list of weather stations and the dates we want to look up for each:
cat <<'EOF' > stations.csv
station_id,name
USW00094728,New York Central Park
USC00301309,Boston
USW00012842,Tampa
USW00023174,Los Angeles
USW00094846,Chicago O'Hare
EOF
NOAA serves each station's daily history at:
https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/<STATION_ID>.csv
A list of qsv-related repos for which we want star counts:
cat <<'EOF' > repos.csv
owner_repo
dathere/qsv
dathere/datapusher-plus
dathere/qsv-lookup-tables
BurntSushi/xsv
johnkerl/miller
EOF
qsv fetch \
--url-template "https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/{station_id}.csv" \
--new-column raw_csv \
--disk-cache \
--disk-cache-dir ~/.qsv-cache/noaa \
stations.csv > stations_with_data.csv
--url-template substitutes column values into the URL template. --disk-cache persists responses to ~/.qsv-cache/noaa/ keyed by URL, so a second run is instant for already-fetched stations.
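To see what the template substitution amounts to before spending any requests, the expanded URLs can be previewed with plain shell tools. This is only an illustrative sketch, not how qsv implements it: it uses awk as a stand-in, works on a two-row copy of stations.csv in a temporary directory, and assumes no quoted commas in the CSV.

```shell
# Preview the URLs that a {station_id} template would expand to.
# awk is a stand-in for illustration; qsv itself is not involved here.
workdir=$(mktemp -d)
cat <<'EOF' > "$workdir/stations.csv"
station_id,name
USW00094728,New York Central Park
USC00301309,Boston
EOF

base='https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access'

# Skip the header row, then substitute column 1 into the URL pattern.
urls=$(awk -F, -v base="$base" 'NR > 1 { print base "/" $1 ".csv" }' "$workdir/stations.csv")
printf '%s\n' "$urls"
```

Each printed URL is also what a URL-keyed cache would use to identify the response, which is why re-running with the same station IDs can reuse earlier results.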
The raw_csv column contains the full NOAA response (a CSV-shaped text blob) per row. That's not always what you want — usually you want to extract specific fields.
qsv fetch \
--url-template 'https://api.github.com/repos/{owner_repo}' \
--http-header "Authorization: Bearer $GITHUB_TOKEN" \
--new-column stars \
--jaq '.stargazers_count' \
--disk-cache \
--disk-cache-dir ~/.qsv-cache/github \
repos.csv > repos_with_stars.csv
The --jaq filter parses the JSON response and extracts the value at .stargazers_count. The new column contains just the number.
qsv fetch \
--url-template 'https://api.github.com/repos/{owner_repo}' \
--http-header "Authorization: Bearer $GITHUB_TOKEN" \
--new-column metrics \
--jaq '{ stars: .stargazers_count, forks: .forks_count, open_issues: .open_issues_count }' \
--disk-cache \
repos.csv > repos_with_metrics.csv
The metrics column contains a small JSON object per row. Use qsv json afterward to flatten.
qsv fetch \
--url-template 'https://api.example.com/lookup/{key}' \
--rate-limit 10 \
--new-column response \
--jaq '.result' \
data.csv > enriched.csv
--rate-limit 10 caps you at 10 requests per second (RPS). Additionally, qsv auto-throttles when the server returns RFC RateLimit headers, with no extra config needed.
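As a rough way to reason about what a requests-per-second cap means for run time, the sketch below computes the spacing between requests and the earliest start offsets of the first few calls. It assumes evenly spaced requests, which is a simplification; qsv's actual limiter may behave differently (for example, by allowing short bursts). The helper names interval_for_rate and schedule are hypothetical.

```shell
# Seconds between request starts for a given requests-per-second cap.
# Assumes perfectly even spacing; an illustration only.
interval_for_rate() {
  awk -v rps="$1" 'BEGIN { printf "%.3f\n", 1 / rps }'
}

# Earliest start offset (in seconds) of the first N requests under the cap.
schedule() {
  awk -v rps="$1" -v n="$2" 'BEGIN {
    for (i = 0; i < n; i++) printf "request %d at +%.1fs\n", i + 1, i / rps
  }'
}

interval=$(interval_for_rate 10)
echo "interval: ${interval}s"
schedule 10 3
```

By the same arithmetic, 1,000 uncached rows at 10 RPS need at least about 100 seconds, which is a useful lower bound when sizing a job.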
qsv fetch \
--url-template 'https://api.example.com/{id}' \
--redis-cache \
--redis-cache-conn 'redis://my-redis:6379' \
--new-column response \
data.csv > enriched.csv
Two CI runners hitting the same API will share cache hits via Redis, which saves both API quota and time.
qsv fetch \
--url-template 'https://api.example.com/{id}' \
--report \
data.csv > report.tsv
# report.tsv has: row, url, status, elapsed_ms, ...
--report mode skips the main output and writes a per-call audit TSV. Useful for diagnostics.
You want to send each row to an OCR endpoint. Build a JSON payload with MiniJinja:
{# ocr_payload.j2 #}
{
"model": "ocr-v2",
"image_url": {{ image_url | tojson }},
"language": {{ language | tojson }}
}
qsv fetchpost https://ocr.example.com/extract \
--payload-tpl ocr_payload.j2 \
--new-column text \
--jaq '.text' \
--rate-limit 5 \
--disk-cache \
images.csv > images_with_text.csv
For simple form-encoded POSTs (no template needed):
qsv fetchpost https://api.example.com/submit \
name,email,score \
--new-column response_id \
--jaq '.id' \
responses.csv > with_ids.csv
The columns name,email,score become name=...&email=...&score=... form fields.
# Split the input: rows that need fetching vs rows that already have data
qsv search --select 'stars' '^$' repos.csv > needs_fetch.csv
qsv search --select 'stars' '^$' --invert-match repos.csv > already_fetched.csv
# Fetch only the rows that need it
qsv fetch --url-template '...' needs_fetch.csv > newly_fetched.csv
# Merge back
qsv cat rows already_fetched.csv newly_fetched.csv > combined.csv
qsv slice --len 5 data.csv | qsv fetch --url-template '...' -
Run a quick preview against the first 5 rows before scaling up.
qsv fetch --url-template 'https://api.example.com/{id}' \
--new-column raw \
data.csv \
| qsv describegpt --prompt "Summarize each raw response in one sentence" -
(In practice you'd save the intermediate and run describegpt on a sample, but the pattern is composable.)
qsv fetchpost http://localhost:11434/api/generate \
--payload-tpl ollama_payload.j2 \
--new-column summary \
--jaq '.response' \
rows.csv > with_summaries.csv
Where ollama_payload.j2 is:
{
"model": "gpt-oss-20b",
"prompt": {{ ("Summarize: " ~ text) | tojson }},
"stream": false
}
| Cache | Best for | Reset / inspect |
|---|---|---|
| In-memory LRU (default) | One-shot runs, small datasets | Process-lifetime; lost on exit |
| Disk (--disk-cache) | Repeated runs against stable APIs | rm -rf ~/.qsv-cache/fetch/<subdir> |
| Redis (--redis-cache) | Distributed / CI runs | redis-cli FLUSHDB on the cache DB |
| No cache (--no-cache) | Live data, pricing, stock | n/a |
For disk-cache TTL and Redis connection strings, see docs/Fetch.md.
- HTTP/2 flow control means qsv adaptively raises and lowers in-flight requests based on the server's flow window.
- RateLimit-header-aware throttling kicks in automatically when the server signals limits — no manual tuning needed for compliant APIs.
- Disk cache hits are essentially free; Redis cache hits are a single round-trip to Redis (microseconds locally, low ms over LAN).
- For very chatty APIs, parallelize via qsv split → xargs -P; cache hits will still dedupe across worker processes if you use the disk or Redis cache.
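The cache behavior described in the notes above can be pictured with a small URL-keyed disk cache in plain shell. This is a hedged sketch rather than qsv's implementation: fake_fetch is a hypothetical stand-in for a real HTTP request, the cache key is a cksum of the URL (a real cache would use a stronger hash to avoid collisions), and there is no TTL or locking.

```shell
cache_dir=$(mktemp -d)
fetch_log="$cache_dir/fetch.log"   # one line per simulated network call
: > "$fetch_log"

# Hypothetical stand-in for a real HTTP request.
fake_fetch() {
  echo "$1" >> "$fetch_log"
  printf 'response for %s\n' "$1"
}

# Return the cached body for a URL, fetching only on a cache miss.
cached_fetch() {
  url=$1
  key=$(printf '%s' "$url" | cksum | cut -d' ' -f1)
  file="$cache_dir/$key.body"
  if [ ! -f "$file" ]; then
    fake_fetch "$url" > "$file"
  fi
  cat "$file"
}

cached_fetch 'https://api.example.com/1' > /dev/null
cached_fetch 'https://api.example.com/1' > /dev/null   # served from disk
cached_fetch 'https://api.example.com/2' > /dev/null

# Number of simulated network calls actually made.
wc -l < "$fetch_log"
```

Because the cache lives on disk and is keyed only by URL, any worker process pointed at the same directory would see the same hits, which is the deduplication effect the last bullet relies on.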
- HTTP & Web: every flag explained
- docs/Fetch.md: canonical reference
- jaq: jq-like JSON query language
- MiniJinja: payload template engine
- Recipe: Multi-Table Joins (asof-join the fetched weather to NYC 311)
- Recipe: Geographic Enrichment (local geocoding as the cheap alternative to API-based)
- Stats Cache & Caching: the wider qsv caching story