Skip to content

Recipe Geographic Enrichment

Joel Natividad edited this page May 21, 2026 · 4 revisions

Recipe: Geographic Enrichment

Tier: Intermediate Commands used: geocode, luau, apply, lookup tables Anchor dataset: NYC 311 Service Requests with Latitude / Longitude columns

Problem

You have a CSV with raw lat/lon coordinates (or partial addresses). You want to enrich each row with:

  • nearest city name
  • US state and county (with FIPS codes for Census joins)
  • a higher-resolution neighborhood — Neighborhood Tabulation Area (NTA) for NYC, ZCTA for the US, etc.
  • fallback values when the local Geonames index misses (small towns, rural areas)

All without round-tripping to a paid API per row.

Data

For this recipe, use the NYC 311 bundled sample:

curl -LO https://raw.githubusercontent.com/wiki/dathere/qsv/files/nyc311samp.csv
# Or use resources/test/NYC_311_SR_2010-2020-sample-1M.csv from the qsv repo

The Location column has values like "(40.6782, -73.9442)" — a (lat, lon) pair. Latitude / Longitude are also available as separate columns.

Setup

# Trigger the Geonames index download (first run only)
qsv geocode suggestnow "Brooklyn"

# Optional: download MaxMind GeoLite2-City.mmdb for IP lookups
# (free, needs an account at https://www.maxmind.com)
# Copy to ~/.qsv-cache/ or point at it with QSV_GEOIP2_FILENAME

Solution

1. Reverse-geocode each (lat, lon) to the nearest city + Borough + County

qsv geocode reverse Location \
  --new-column 'nearest_city,admin1,admin2' \
  --formatstr "%name,%admin1,%admin2" \
  --country US \
  nyc311samp.csv > step1.csv

%name = city name; %admin1 = state (NY); %admin2 = county (Brooklyn, Queens, …).

Result: ~360,000 records/sec on an M2 Pro thanks to caching + multithreading.

2. Pull US FIPS codes for Census joins (dyncols)

qsv geocode reverse Location \
  -f "%dyncols: {city:name},{state:admin1},{county:admin2},{state_fips:us_state_fips_code},{county_fips:us_county_fips_code}" \
  --country US \
  step1.csv > step2.csv

%dyncols: expands to multiple new columns. Each key:format pair becomes a column named on the left and formatted on the right. State FIPS = 36 for NY; county FIPS = 047 for Kings (Brooklyn).

3. Add high-resolution NTA via a lookup table (for unmatched cells)

Reverse-geocoding gets you to the city/county level. For NYC NTAs you need a lookup table (e.g., a CSV of NTA names by zip code, or a polygon-based service).

-- nta_enrich.lua
BEGIN {
  qsv_register_lookup("nta_by_zip", "https://raw.githubusercontent.com/dathere/qsv-lookup-tables/main/nyc_zip_to_nta.csv")
}!

local zip = string.match(_['Incident Address'] or '', "%d%d%d%d%d")
if zip and nta_by_zip[zip] then
  return nta_by_zip[zip].nta_name
else
  return _['county'] or "Unknown"   -- fallback
end
qsv luau map nta -x -f nta_enrich.lua step2.csv > step3.csv

If the lookup misses (small towns, rural ZIPs), the fallback returns county from step 1.

4. Add country information

qsv geocode countryinfo country_code visitors.csv > with_country_info.csv
# Adds: name, capital, area, population, languages, currency, …

Variations

Forward geocode (city name → lat/lon)

# input has a "city" column with values like "Brooklyn", "san jose"
qsv geocode suggest city \
  --new-column 'lat,lon' \
  --formatstr "%latitude,%longitude" \
  --country US \
  cities.csv > with_coords.csv

geocode suggest uses Jaro-Winkler fuzzy matching, so "san jose" and "San Jose" both find the same record.

Online geocoding for full street addresses (OpenCage)

geocode suggest / reverse are city-level — they can't resolve a full street address like 1600 Pennsylvania Ave NW. When you need address-level geocoding, the opencage subcommand calls the online OpenCage API (forward and reverse, auto-detected per row).

# get a free key (2,500 lookups/day) at https://opencagedata.com/users/sign_up
export QSV_OPENCAGE_API_KEY=your-key-here

# forward-geocode street addresses to coordinates
qsv geocode opencage street_address --new-column coordinates -f %location addresses.csv > with_coords.csv

# reverse-geocode "(lat, lon)" coordinates to a full formatted address
qsv geocode opencage Location --reverse --new-column address nyc311samp.csv > with_address.csv

Results are cached on disk (OpenCage's terms permit caching), so re-runs and duplicate queries don't re-hit the API. Calls are rate-limited (--rate-limit, default 1/sec) — reach for opencage for address-level quality, and offline suggest / reverse for city-level throughput. See Geospatial → geocode.

One-shot lookups from the shell

qsv geocode suggestnow "Manhattan"
qsv geocode reversenow "40.7589, -73.9851"
qsv geocode countryinfonow US

IP-based geolocation (needs MaxMind)

qsv geocode iplookup --new-column country ip_col logs.csv > with_country.csv

Convert lat/lon to WKT geometry for GeoJSON export

qsv apply operations wkt_point Latitude Longitude --new-column geometry step3.csv > step4.csv
qsv geoconvert step4.csv csv geojson --geometry geometry > nyc311.geojson
# Open in QGIS or upload to https://geojson.io

Use a different Geonames city set (smaller cities)

# Rebuild the local index from the cities5000 Geonames file (population > 5,000)
qsv geocode index-update --cities-url https://download.geonames.org/export/dump/cities5000.zip

Now qsv geocode suggest matches towns down to population 5,000 (~53k cities vs 26k).

Performance notes

  • 360,000 records/sec with caching and multithreading on an M2 Pro. Cache is per-process LRU; for cross-run persistence, use --cache-dir (it persists on disk in ~/.qsv-cache/).
  • The Geonames index lives in ~/.qsv-cache/ and is ~12 MB by default (cities15000). For cities5000 it's ~50 MB.
  • For massive datasets (10M+ rows), pair with qsv index and split into chunks for parallel processing:
qsv index huge.csv
qsv split chunks/ --size 1000000 -j 8 huge.csv
ls chunks/ | xargs -P 4 -I {} qsv geocode reverse Location ... chunks/{} -o enriched/{}
qsv cat rows enriched/*.csv > enriched_full.csv

See also

Clone this wiki locally