-
Notifications
You must be signed in to change notification settings - Fork 104
Recipe Geographic Enrichment
Tier: Intermediate
Commands used: geocode, luau, apply, lookup tables
Anchor dataset: NYC 311 Service Requests with Latitude / Longitude columns
You have a CSV with raw lat/lon coordinates (or partial addresses). You want to enrich each row with:
- nearest city name
- US state and county (with FIPS codes for Census joins)
- a higher-resolution neighborhood — Neighborhood Tabulation Area (NTA) for NYC, ZCTA for the US, etc.
- fallback values when the local Geonames index misses (small towns, rural areas)
All without round-tripping to a paid API per row.
For this recipe, use the NYC 311 bundled sample:
curl -LO https://raw.githubusercontent.com/wiki/dathere/qsv/files/nyc311samp.csv
# Or use resources/test/NYC_311_SR_2010-2020-sample-1M.csv from the qsv repoThe Location column has values like "(40.6782, -73.9442)" — a (lat, lon) pair. Latitude / Longitude are also available as separate columns.
# Trigger the Geonames index download (first run only)
qsv geocode suggestnow "Brooklyn"
# Optional: download MaxMind GeoLite2-City.mmdb for IP lookups
# (free, needs an account at https://www.maxmind.com)
# Copy to ~/.qsv-cache/ or point at it with QSV_GEOIP2_FILENAMEqsv geocode reverse Location \
--new-column 'nearest_city,admin1,admin2' \
--formatstr "%name,%admin1,%admin2" \
--country US \
nyc311samp.csv > step1.csv%name = city name; %admin1 = state (NY); %admin2 = county (Brooklyn, Queens, …).
Result: ~360,000 records/sec on an M2 Pro thanks to caching + multithreading.
qsv geocode reverse Location \
-f "%dyncols: {city:name},{state:admin1},{county:admin2},{state_fips:us_state_fips_code},{county_fips:us_county_fips_code}" \
--country US \
step1.csv > step2.csv%dyncols: expands to multiple new columns. Each key:format pair becomes a column named on the left and formatted on the right. State FIPS = 36 for NY; county FIPS = 047 for Kings (Brooklyn).
Reverse-geocoding gets you to the city/county level. For NYC NTAs you need a lookup table (e.g., a CSV of NTA names by zip code, or a polygon-based service).
-- nta_enrich.lua
BEGIN {
qsv_register_lookup("nta_by_zip", "https://raw.githubusercontent.com/dathere/qsv-lookup-tables/main/nyc_zip_to_nta.csv")
}!
local zip = string.match(_['Incident Address'] or '', "%d%d%d%d%d")
if zip and nta_by_zip[zip] then
return nta_by_zip[zip].nta_name
else
return _['county'] or "Unknown" -- fallback
endqsv luau map nta -x -f nta_enrich.lua step2.csv > step3.csvIf the lookup misses (small towns, rural ZIPs), the fallback returns county from step 1.
qsv geocode countryinfo country_code visitors.csv > with_country_info.csv
# Adds: name, capital, area, population, languages, currency, …# input has a "city" column with values like "Brooklyn", "san jose"
qsv geocode suggest city \
--new-column 'lat,lon' \
--formatstr "%latitude,%longitude" \
--country US \
cities.csv > with_coords.csvgeocode suggest uses Jaro-Winkler fuzzy matching, so "san jose" and "San Jose" both find the same record.
geocode suggest / reverse are city-level — they can't resolve a full street address like 1600 Pennsylvania Ave NW. When you need address-level geocoding, the opencage subcommand calls the online OpenCage API (forward and reverse, auto-detected per row).
# get a free key (2,500 lookups/day) at https://opencagedata.com/users/sign_up
export QSV_OPENCAGE_API_KEY=your-key-here
# forward-geocode street addresses to coordinates
qsv geocode opencage street_address --new-column coordinates -f %location addresses.csv > with_coords.csv
# reverse-geocode "(lat, lon)" coordinates to a full formatted address
qsv geocode opencage Location --reverse --new-column address nyc311samp.csv > with_address.csvResults are cached on disk (OpenCage's terms permit caching), so re-runs and duplicate queries don't re-hit the API. Calls are rate-limited (--rate-limit, default 1/sec) — reach for opencage for address-level quality, and offline suggest / reverse for city-level throughput. See Geospatial → geocode.
qsv geocode suggestnow "Manhattan"
qsv geocode reversenow "40.7589, -73.9851"
qsv geocode countryinfonow USqsv geocode iplookup --new-column country ip_col logs.csv > with_country.csvqsv apply operations wkt_point Latitude Longitude --new-column geometry step3.csv > step4.csv
qsv geoconvert step4.csv csv geojson --geometry geometry > nyc311.geojson
# Open in QGIS or upload to https://geojson.io# Rebuild the local index from the cities5000 Geonames file (population > 5,000)
qsv geocode index-update --cities-url https://download.geonames.org/export/dump/cities5000.zipNow qsv geocode suggest matches towns down to population 5,000 (~53k cities vs 26k).
-
360,000 records/sec with caching and multithreading on an M2 Pro. Cache is per-process LRU; for cross-run persistence, use
--cache-dir(it persists on disk in~/.qsv-cache/). - The Geonames index lives in
~/.qsv-cache/and is ~12 MB by default (cities15000). For cities5000 it's ~50 MB. - For massive datasets (10M+ rows), pair with
qsv indexand split into chunks for parallel processing:
qsv index huge.csv
qsv split chunks/ --size 1000000 -j 8 huge.csv
ls chunks/ | xargs -P 4 -I {} qsv geocode reverse Location ... chunks/{} -o enriched/{}
qsv cat rows enriched/*.csv > enriched_full.csv- Geospatial → geocode — every flag explained
- Geospatial → geoconvert — CSV ↔ GeoJSON / SHP / KML
- Scripting (Luau / Python) → luau — lookup tables and BEGIN/MAIN/END blocks
-
Lookup Tables —
qsv_register_lookupdeep-dive - qsv-lookup-tables — companion repo for community lookup CSVs
docs/help/geocode.md- Cookbook (legacy) — original short snippets
qsv — GitHub · Releases · Discussions · qsv pro · Try it online · Benchmarks · datHere · DeepWiki · Dual-licensed MIT / Unlicense
Edit this page: Contributing to the Wiki
Home · Why qsv? · Tier legend
- All Commands (index)
- Selection & Inspection
- Transform & Reshape
- Aggregation & Statistics
- Joins & Set Ops
- SQL & Polars
- Validation & Schema
- Metadata Profiling (profile)
- Conversion & I/O
- Geospatial
- HTTP & Web
- Get & Disk Cache
- Scripting (Luau / Python)
- Indexing, Compression & Diff
- AI & Documentation