
A coarse splitting geocoder in Scala, based primarily on geonames data

David Blackman blackmad authored April 17, 2014

A coarse, splitting geocoder and reverse geocoder in Scala -- Prebuilt indexes and binaries are available at twofishes.net. Discussion happens on Google Groups.

What is a Geocoder?

A geocoder is a piece of software that translates strings into coordinates, for example "New York, NY" into "40.74, -74.0". This is an implementation of a coarse geocoder (city level, meaning it can't understand street addresses) that also supports splitting (breaking off the non-geocoded part of the query and returning it in the final response).

Overview

This geocoder was designed around the geonames data, which is relatively small, and easy to parse in a short amount of time in a single thread without much post-processing. Geonames is a collection of simple text files that represent political features across the world. The geonames data has the nice property that all the features are listed with stable identifiers for their parents, the bigger political features that contain them (rego park -> queens county -> new york state -> united states). In one pass, we can build a database where each entry is a feature with a list of names for indexing, names for display, and a list of parents.
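
As a rough illustration of that single pass, here is a minimal Python sketch. The three-column rows, the Feature class and the function names are all made up for this example; real geonames files have many more columns, and this is not the twofishes indexer.

```python
from dataclasses import dataclass, field

# Hypothetical tab-separated rows: id, name, parent id (empty for the root).
# These rows are invented; they only mimic the "stable parent id" property.
ROWS = """\
1\tUnited States\t
2\tNew York State\t1
3\tQueens County\t2
4\tRego Park\t3
"""


@dataclass
class Feature:
    feature_id: str
    display_name: str
    index_names: list = field(default_factory=list)
    parent_id: str = ""


def load_features(text):
    """Single pass over the rows: build one Feature per line, keyed by id."""
    features = {}
    for line in text.splitlines():
        feature_id, name, parent_id = line.split("\t")
        features[feature_id] = Feature(feature_id, name, [name.lower()], parent_id)
    return features


def parents_of(features, feature_id):
    """Follow parent ids upward, smallest containing feature first."""
    chain = []
    parent_id = features[feature_id].parent_id
    while parent_id:
        chain.append(features[parent_id].display_name)
        parent_id = features[parent_id].parent_id
    return chain


features = load_features(ROWS)
print(parents_of(features, "4"))
# ['Queens County', 'New York State', 'United States']
```

Because every row carries its parent's id, the parent list can be resolved by plain lookups after the pass, with no spatial computation.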

The Data

Geonames is great, but not perfect. Southeast Asia doesn't have the most comprehensive coverage. Geonames doesn't have bounding boxes, so we add some of those from http://code.flickr.com/blog/2011/01/08/flickr-shapefiles-public-dataset-2-0/ where possible.

Geonames is licensed under CC-BY http://www.geonames.org/. They take a pretty liberal interpretation of this and just ask for attribution on an about page if you make use of the data. Flickr shapefiles are public domain.

Reverse Geocoding and Polygons

To enable reverse geocoding in twofishes, you need to add polygon data to the inputs. Geonames does not distribute polygons, nor does the twofishes distribution contain shapefiles. Shapefiles must be in the EPSG:4326 projection.

I will add automated scripts for this soon. For now, if you have shapefiles that map to existing geonames features and want to put them into twofishes, run shape-gn-matchr.py on them: it writes a copy of your shapefile with an extra property that is the geonameid of the matching feature.

examples:

US place (locality) data -- ftp://ftp2.census.gov/geo/tiger/TIGER2010/PLACE/2010/

    ~/shputils/shape-gn-matchr.py --shp_name_keys=NAME10 tl_2010_35_place10.shp gn-tl_2010_35_place10.shp

US county data -- ftp://ftp2.census.gov/geo/tiger/TIGER2010/COUNTY/2010/

    ../shputils/shape-gn-matchr.py --dbname=gis --shp_name_keys=NAME10 --allowed_gn_classes='' --allowed_gn_codes=ADM2 --fallback_allowed_gn_classes='' --fallback_allowed_gn_codes='' tl_2010_us_county10.shp gn-us-adm2.shp

MX locality data -- http://blog.diegovalle.net/2013/02/download-shapefiles-of-mexico.html

    ogr2ogr -t_srs EPSG:4326 mx-4326.shp MUNICIPIOS.shp
    ./shputils/shape-gn-matchr.py --dbname=gis --shp_name_keys=NOM_MUN mx-4326.shp gn-mx-localities.shp

Requirements

  • Java (jre and jdk)
  • Mongo
  • curl
  • unzip

First time setup

Data import

  • mongod --dbpath /local/directory/for/output/
  • ./init-database.sh # drops existing table and creates indexes
  • If you want to import countries: ./parse.py -c US /output/dir (Note that you can specify list of countries separating them by comma: US,GB,RU)
  • If you want to import world: ./parse.py -w /output/dir

Serving

  • ./serve.py -p 8080 /output/dir – Where /output/dir will contain a subdirectory whose name will be the date of the most recent build, for example 2013-02-25-01-08-23.803740. You need to point to this subdirectory or to a folder called latest which is created during the build process (in the twofishes directory) and is a symlink to the most recent dated subdirectory.
  • The server should respond to finagle-thrift on the port specified (8080 by default) and to HTTP requests on the next port up, for example http://localhost:8081/?query=rego+park+ny or http://localhost:8081/static/geocoder.html#rego+park
  • If you want to run vanilla thrift-rpc (not finagle), use ./sbt "server/run-main com.foursquare.twofishes.GeocodeThriftServer --port 8080 --hfile_basepath ." instead.

NOTE: mongod is not required for serving, only for index building.

A better option is to run "./sbt server/assembly" and then use the resulting server/target/server-assembly-VERSION.jar. Serve that with java -jar JARFILE --hfile_basepath /directory

Troubleshooting

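If you see a Java OutOfMemory error at startup, you may need to raise the kernel's limit on the number of memory-mapped areas per process.

On Linux (as root; note that there are no spaces around the "="): sysctl -w vm.max_map_count=131072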

Talking to the Server
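
The Serving section says the server answers HTTP requests on the port after the thrift port, at URLs of the form http://localhost:8081/?query=rego+park+ny. A minimal Python sketch of a client for that endpoint follows. The function names and defaults are made up for this example, and the response format is not described in this README, so the body is returned as undecoded text.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def geocode_url(query, host="localhost", thrift_port=8080):
    """Build the HTTP query URL; HTTP is served one port above thrift."""
    http_port = thrift_port + 1
    return "http://%s:%d/?%s" % (host, http_port, urlencode({"query": query}))


def geocode(query, **kwargs):
    """Fetch the raw response body. Requires a running twofishes server."""
    with urlopen(geocode_url(query, **kwargs)) as response:
        return response.read().decode("utf-8")


print(geocode_url("rego park ny"))
# http://localhost:8081/?query=rego+park+ny
```

Only geocode_url is exercised here; geocode("rego park ny") needs a server started with ./serve.py as described above.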

Technical Details

I use mongo to save state during the index building phase (so that, for instance, we can parse the alternateNames file, which adds name+lang pairs to features defined in a separate file, or add the flickr bounding boxes). A final pass goes over the database, dereferences ids, and outputs some hadoop mapfiles and hfiles. Those output files are all that is required for serving the data.

If we were doing heavier processing on the incoming data, a mapreduce that spits out hfiles might make more sense.

When we parse a query, we do a rough recursive descent parse, starting from the left. If the geocoder is being used to split geographic queries like "pizza new york", we expect the "what" to be on the left. All of the features found in a parse must be parents of the smallest feature in that parse.

The geocoder may currently return multiple valid parses, but it only returns the longest possible ones. For "Springfield, US" it returns multiple features that match the query (there are dozens of Springfields in the US). It will not return a parse of "Springfield" near "US" with only US geocoded if it can find a longer parse, but it will return multiple valid interpretations of the longest parse.
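
To make the parsing rules above concrete, here is a toy Python sketch, not the twofishes parser. The gazetteer, the feature ids and the function names are invented for this example, and it assumes the smallest feature is written first in the query (as in "rego park ny").

```python
# Hypothetical gazetteer: name -> list of (feature id, ids of its parents).
GAZETTEER = {
    "springfield": [("springfield-il", {"il", "us"}),
                    ("springfield-ma", {"ma", "us"})],
    "new york": [("nyc", {"ny", "us"}), ("ny", {"us"})],
    "il": [("il", {"us"})],
    "us": [("us", set())],
}


def parse_from(tokens, start, smallest_parents):
    """Return every way to read all of tokens[start:] as place names.

    smallest_parents is None until the first (smallest) feature is chosen;
    after that, every further feature must be one of its parents.
    """
    if start == len(tokens):
        return [[]]
    parses = []
    for end in range(start + 1, len(tokens) + 1):
        name = " ".join(tokens[start:end])
        for feature_id, parents in GAZETTEER.get(name, []):
            if smallest_parents is None:
                rest = parse_from(tokens, end, parents)
            elif feature_id in smallest_parents:
                rest = parse_from(tokens, end, smallest_parents)
            else:
                continue
            parses.extend([feature_id] + tail for tail in rest)
    return parses


def geocode(query):
    """Return (the "what" on the left, all interpretations of the longest parse)."""
    tokens = query.lower().replace(",", " ").split()
    # Try the longest place phrase first and stop at the first hit, so
    # shorter parses are never returned when a longer one exists.
    for split in range(len(tokens)):
        parses = parse_from(tokens, split, None)
        if parses:
            return " ".join(tokens[:split]), parses
    return query, []


print(geocode("pizza new york"))
# ('pizza', [['nyc'], ['ny']])
print(geocode("Springfield, US"))
# ('', [['springfield-il', 'us'], ['springfield-ma', 'us']])
```

In the second call, both Springfields survive because "us" is a parent of each, while a parse that geocodes only "US" is never produced because the longer parse succeeds first.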

Performance

Twofishes can handle 100s of queries a second at < 5ms/query on average.

Point reverse geocoding is absurdly performant -- 1000s of queries a second at < 1ms/query.

Future

I'd like to integrate more data from OSM and possibly an entire build solely from OSM. I'd also like to get supplemental data from the Foursquare database where possible. If I was feeling more US-centric, I'd parse the TIGER-line data for US polygons, but I'm expecting those to mostly be in OSM.

Also US-centric are zillow neighborhood polygons, also CC-by-SA. I might add an "attribution" field to the response for certain datasources. I'm not looking forward to writing a conflater with precedence for overlapping features from different data sets.

Me

David Blackman blackmad@foursquare.com

Contributors

Many thanks to:

  • @nsanch for cleaning up lots of this code and working out lots of performance issues
  • @jorgeo, who has been helping me tune the java performance of this since day 1, and worked out lots of issues with spindle in the process
  • @slackhappy, who has yet to make a contribution to the codebase, but has spent a lot of time reasoning about our internal deployments of twofishes

Unrelated

These are the two fishes I grilled the night I started coding the original python implementation https://twitter.com/#!/whizziwig/statuses/154431957630066688
