Running your own Geocoder in Amazon EC2

This guide explains how to run your own high-volume geocoder inside the Amazon EC2 cloud in less than 5 minutes.

Overview

We have put together a fully installed server that includes the TIGER database, the geocoder, and a node.js REST JSON API app that wraps all of that functionality. It gives you a simple boxed solution, so you can start geocoding as if you were calling the Google Maps API. The server is available as an AMI (Amazon Machine Image): launch an EC2 instance (EC2 is Amazon's server cloud) from this image and your geocoding API is up and running.

The following is currently installed on the AMI:

Ubuntu 14.04.1 LTS (Trusty)
Postgres 9.3
PostGIS 2.1
TIGER 2013 database
Zillow Neighborhoods Shapefiles
Redis 2.6
Node.js 0.10.32

All of this comes free of charge, as each component is open source. Consult each project's licensing terms to ensure they do not conflict with your own policies.

AMI Instance Sizing

The TIGER database is about 100GiB in size and includes all US states. Geocoding is 100% a Postgres task, so the database takes most of the load. If you geocode across the whole US, you ideally want the full 100GiB loaded into RAM. Response times vary greatly depending on whether the requested data is already in RAM or has to be read from disk; in the latter case, disk latency dominates the response time.

In short, we sized the TIGER geocoder installation with 1x 30GiB SSD for the OS and 4x 30GiB SSDs for the Postgres database, placed in RAID 0 to maximize SSD read throughput. The SSDs are available from Amazon EBS.

When choosing an EC2 instance type, focus on RAM rather than on the number of vCPUs or disk storage. We recommend the Memory Optimized instance types in the R3 family.

r3.large is the recommended minimum
r3.xlarge is our ideal target

You can certainly install on less powerful instance types such as t2.medium, but the RPS (requests per second) throughput will be greatly reduced. Geocoding intersections is particularly heavy because the operation reads all matches for the two streets and tests whether their shapes intersect. This can yield several thousand matches that need to be intersected and tested.

Launching a TIGER Geocoder Instance

To get started, launch a new EC2 instance as described in the AWS guide.

At step 4, choose "AWS Marketplace" and search for: TIGER Geocoder

Complete the rest of the steps as outlined in the AWS Guide.

Types of installation and use

The TIGER Geocoder AMI can be used in two modes:

As a TIGER Postgres geocoder database only. In this mode you simply connect to Postgres on port 5432 and execute queries against the TIGER database directly as you see fit. The benefit in this case is having a full TIGER database and PostGIS support in your application.
As an HTTP REST JSON API server. In this mode you leverage the extra code we put on the server to get a full API-level interface accessible on HTTP port 80. You can issue REST requests against the API in a similar manner to the Google Maps API. See the details on the REST API methods. The advantage here is that you don't need to write your own Postgres queries. Geocoding responses are also automatically cached in the local Redis server and reused when the identical address is queried again: Redis responds in 1-2 ms, versus 100-400 ms for TIGER. In addition, the geocoded response returns the TIGER geoids, that is, ids for each geo part (city, state, county, etc.). This is useful for cross-referencing with external data such as the Census. There is a fallback to the Google Maps API in case TIGER can't resolve a certain address or the accuracy makes the result unusable.

Security and Amazon EC2

By default, when you launch the AMI inside EC2 you will be asked to configure security: which ports are open and who has access to them. TIGER Geocoder opens two ports: 5432 for Postgres access and 80 for REST API access. You control access to these ports via your security policy.

If you choose to make the Postgres port accessible from outside, note that you will need to change the default Postgres security model in the /etc/postgresql/9.3/main/pg_hba.conf file to allow remote connections. Uncomment the following line:

# IPv4 local connections:

#host geocoder tiger all password

The Ubuntu installation has no public access. You will need to connect using the PEM key issued under your EC2 account. This is the standard security procedure that EC2 recommends.

Adjusting Postgres parameters

Based on the EC2 instance size you choose, you need to make a few adjustments to tell Postgres how much memory it has available. More is always better when dealing with geocoding; however, based on our load tests, 15GB of RAM is a decent minimum and 30GB of RAM is ideal for high-volume daily jobs.

Once you connect to your instance as superuser, edit: /etc/postgresql/9.3/main/postgresql.conf

shared_buffers = INSTANCE RAM * 0.125 (example: 4GB). Min 900MB, Max 4GB
effective_cache_size = INSTANCE RAM * 0.70 (example 24GB). This has a huge effect on response time.

Everything else is already tuned.
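
As a rough illustration of the two formulas above, the following Python sketch derives both values from the instance RAM. The function name and the choice to work in whole megabytes are assumptions made for this example; only the 0.125 and 0.70 factors and the 900MB/4GB bounds come from this guide.

```python
def postgres_memory_settings(instance_ram_mb):
    """Return suggested postgresql.conf values for the given instance RAM (in MB)."""
    # shared_buffers: 12.5% of RAM, clamped to the 900MB..4GB range given above.
    shared_buffers_mb = max(900, min(instance_ram_mb // 8, 4096))
    # effective_cache_size: 70% of RAM (integer math avoids float rounding).
    effective_cache_size_mb = instance_ram_mb * 70 // 100
    return {
        "shared_buffers": "{}MB".format(shared_buffers_mb),
        "effective_cache_size": "{}MB".format(effective_cache_size_mb),
    }


if __name__ == "__main__":
    # Example: an instance with 30GB of RAM (30 * 1024 MB).
    for name, value in sorted(postgres_memory_settings(30 * 1024).items()):
        print("{} = {}".format(name, value))
```

For a 30GB instance this yields shared_buffers = 3840MB and effective_cache_size = 21504MB, which can be copied into postgresql.conf.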

NOTE:

A cold-started Postgres has nothing cached, so expect slow responses at first. TIGER loads data and indexes into RAM on a per-state basis, so if you have a very busy geocoding machine, make sure you set up a warm-up script that issues 3-5 queries for addresses in each US state. By default the Postgres query timeout is set to 10 seconds. You might want to relax that, depending on your setup, by adjusting statement_timeout in the conf file.
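
A warm-up script along these lines could be sketched in Python as follows. It only generates SQL text, which you would then pipe into psql. It assumes the database exposes the standard PostGIS TIGER geocoder functions geocode() and pprint_addy(); the sample addresses and the warmup.py file name are placeholders for illustration, not part of the AMI.

```python
# Sample addresses keyed by state; placeholders only. Fill in 3-5 per state.
SAMPLE_ADDRESSES = {
    "DC": ["1600 Pennsylvania Ave NW, Washington, DC 20500"],
    "CA": ["1 Dr Carlton B Goodlett Pl, San Francisco, CA 94102"],
}


def warmup_statements(addresses_by_state, max_results=1):
    """Build one geocode() query per address, ordered by state code."""
    statements = []
    for state in sorted(addresses_by_state):
        for address in addresses_by_state[state]:
            # Double any single quotes so the address is a valid SQL literal.
            quoted = address.replace("'", "''")
            statements.append(
                "SELECT rating, pprint_addy(addy) "
                "FROM geocode('{}', {});".format(quoted, int(max_results))
            )
    return statements


if __name__ == "__main__":
    # Print the SQL so it can be piped into psql.
    print("\n".join(warmup_statements(SAMPLE_ADDRESSES)))
```

Saved as warmup.py, the output can be piped into the geocoder database, for example: python3 warmup.py | sudo -u postgres psql geocoder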

Recently, starting the AMI from Amazon appears to require an ANALYZE against the whole geocoder database to rebuild the statistics for the newly created disk images. This one-time run can take a few hours, but it's time well spent. Temporarily set the query timeout to 0 so the ANALYZE is not cut off, then connect to Postgres and run it as follows:

sudo su postgres
psql
\c geocoder;
analyze verbose;

Adjusting Redis parameters

Redis is used to cache geocoding results when using the REST JSON API. Every time a geocoding request is successfully resolved, the result is also cached in Redis for 3 months. Redis is configured to use its default LRU (Least Recently Used) mechanism to evict records when it runs out of memory. It also persists its memory to disk, in case you need to restart your instance for whatever reason. The only parameter that needs adjusting is how much memory you allow Redis to use. The default is 500MB. To change this, edit /etc/redis/redis.conf:

maxmemory 500mb

When adjusting this, please consider the memory amount you allocated to Postgres as well.
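
One way to sanity-check the split is sketched below in Python. It treats effective_cache_size as the memory you want left available for Postgres caching, then subtracts the Redis limit plus a reserve for the OS and the node.js app. The function name and the 1024MB reserve are assumptions for illustration, not values taken from the AMI.

```python
def memory_headroom_mb(instance_ram_mb, effective_cache_size_mb,
                       redis_maxmemory_mb=500, other_reserve_mb=1024):
    """Return the MB left unassigned; a negative result means over-commitment."""
    return (instance_ram_mb
            - effective_cache_size_mb   # memory Postgres expects for caching
            - redis_maxmemory_mb        # maxmemory from /etc/redis/redis.conf
            - other_reserve_mb)         # assumed reserve for OS and node.js


if __name__ == "__main__":
    # 30GB instance with effective_cache_size at 70% of RAM and default Redis limit.
    print(memory_headroom_mb(30 * 1024, 21504))
```

A negative result suggests lowering maxmemory or effective_cache_size, or moving to a larger instance.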

Dealing with disasters

The good news is that the TIGER Geocoder AMI is a read-only server (except for Redis caching). It does not need to write data to the Postgres database, so if your machine is lost you can simply launch a new instance from the AMI and get going in a few minutes. The only loss is the already geocoded addresses stored in the Redis cache, and that is recoverable: the cache is rebuilt as you resume geocoding.

Troubleshooting

To monitor Postgres activity simply watch the logfile:

tail /var/log/postgresql/postgresql-9.3-main.log -f

If you are using the REST API, monitor the logfile as follows:

tail /var/log/node.log -f

To restart the TIGER Geocoder Service run:

sudo service tiger-geocoder restart
