A Tutorial on Geocoding and Reverse Geocoding with Python


Geocoding-and-Reverse-Geocoding-with-Python

Authorship Disclaimer: This tutorial, titled 'Geocoding and Reverse Geocoding with Python', was originally written by me and submitted to DataCamp on Jan 27, 2018. Since they didn't publish it on their platform, I have decided to publish it here so that someone out there may find it useful.

DataCamp Tutorial - Geocoding and Reverse Geocoding with Python

The increasing use of location-aware data, and of technologies that can give directions relative to a location and access geographically aware data, has given rise to a category of data scientists with strong knowledge of geospatial data: Geo-data Scientists.

In this tutorial, you will discover how to use Python to carry out geocoding tasks. Specifically, you will learn to use the GeoPy, Pandas and Folium Python libraries to complete geocoding tasks. Because this is a geocoding tutorial, the article will cover more of GeoPy than Pandas. If you are not familiar with Pandas, you should definitely consider studying the Pandas Tutorial by Karlijn Willems; this Pandas cheat sheet will also be handy for your learning.

Tutorial Overview

  • What is Geocoding?
  • Geocoding with Python
  • Putting it all together – Bulk Geocoding
  • Accuracy of the Result
  • Mapping Geocoding Result
  • Conclusion


What is Geocoding?

A very common task faced by Geo-data Scientists is the conversion of physical, human-readable addresses of places into latitude and longitude geographical coordinates. This process is known as “Geocoding”, while the reverse case (that is, converting latitude and longitude coordinates into physical addresses) is known as “Reverse Geocoding”. To clarify this explanation, here is an example using the DataCamp USA office address:

Geocoding is converting an address like “Empire State Building 350 5th Ave, Floor 77 New York, NY 10118” to “latitude 40.7484284, longitude -73.9856546”.

Reverse Geocoding is converting “latitude 40.7484284, longitude -73.9856546” to the address “Empire State Building 350 5th Ave, Floor 77 New York, NY 10118”.

Now that you have seen what forward and reverse geocoding look like for a single address, let’s see how they can be done programmatically in Python on a larger dataset by calling some APIs.


Geocoding with Python

There are a good number of Python modules for Geocoding and Reverse Geocoding. In this tutorial, you will use the Python Geocoding Toolbox named GeoPy, which provides support for several popular geocoding web services including Google Geocoding API, OpenStreetMap Nominatim, ESRI ArcGIS, Bing Maps API, etc.

You will make use of the OpenStreetMap Nominatim API because it is open source and can be used without an API key. Note that the public Nominatim service is not unlimited: its usage policy asks for at most one request per second, so bulk requests should be paced accordingly. But first, you need to install the libraries (geopy, pandas and folium) in your Python environment using “pip install geopy pandas folium”.

Let's import the libraries...

In [1]:
# Importing the necessary modules for this tutorial
# Folium Library for visualizing data on interactive map
# Pandas Library for fast, flexible, and expressive data structures

import folium
import pandas as pd
from geopy.geocoders import Nominatim, ArcGIS, GoogleV3  # Geocoder APIs

Note: You don’t have to import all three geocoding APIs (Nominatim, ArcGIS and GoogleV3) from the geopy module. I did so that you can test and compare the results from the different APIs and find out which is more accurate for your specific dataset. To follow along and get familiar with geocoding, use the OpenStreetMap Nominatim API for this article.

To do forward geocoding (convert address to latitude/longitude), you first create a geocoder API object by calling the Nominatim() API class.

In [2]:
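# Recent geopy versions require a custom user_agent for Nominatim;
# "geocoding-tutorial" is a placeholder name, replace it with your own application name.
g = Nominatim(user_agent="geocoding-tutorial")  # You can try out the ArcGIS or GoogleV3 APIs to compare the results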

In the next few lines of code, you will do forward Geocoding and Reverse Geocoding respectively.

In [3]:
# Geocoding - Address to lat/long

n = g.geocode('Empire State Building New York', timeout=10)  # Address to geocode
print(n.latitude, n.longitude)

40.7484284 -73.9856546198733

By calling the geocode() method on the defined API object, you supply an address as the first parameter and get back a result with its corresponding latitude and longitude attributes.

In [4]:
# Reverse Geocoding - lat/long to Address

n = g.reverse((40.7484284, -73.9856546198733), timeout=10)  # Lat, Long to reverse geocode
print(n.address)

Empire State Building, 350, 5th Avenue, Korea Town, Manhattan Community Board 5, New York County, NYC, New York, 10018, United States of America

To reverse the process, you will call the reverse() method on the same API object and supply latitude and longitude coordinate values in that order to obtain their corresponding address attribute.

The process above covers the basics of geocoding a single address and reverse geocoding a single pair of latitude and longitude coordinates using Python.
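One detail worth keeping in mind: geopy's geocode() and reverse() return None when the service finds no match, so reading n.latitude directly would raise an AttributeError in that case. The sketch below shows one way to guard against that. The helper name to_lat_lon and the stand-in object are assumptions made for illustration (the stand-in replaces a real geopy Location so the example runs offline).

```python
from types import SimpleNamespace


def to_lat_lon(location):
    """Return (latitude, longitude), or None when the geocoder found nothing."""
    if location is None:
        return None
    return (location.latitude, location.longitude)


# Hypothetical stand-in for a geopy Location, so no network call is needed here.
hit = SimpleNamespace(latitude=40.7484284, longitude=-73.9856546)

print(to_lat_lon(hit))   # a (lat, long) tuple
print(to_lat_lon(None))  # None, instead of an AttributeError
```

With a real geocoder object you would call it as to_lat_lon(g.geocode(address, timeout=10)).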

Now, let’s process a larger dataset in the next section. You will use the Pandas library for the data handling/wrangling and Folium to subsequently visualize the geocoded result.


Putting it all together – Bulk Geocoding

In the previous section, you geocoded a single place/address: "Empire State Building, New York". Now, you will work with a bulk dataset, broadened to contain a list of similar places (buildings) in New York City.

On this Wikipedia page, there is an awesome list of the tallest buildings in New York City. Unfortunately, the table has no detailed addresses or geographic coordinates of the buildings.

You will fill in this missing data by applying the geocoding technique you learned in the previous section. Specifically, you are going to look at the 'Name' column of the first table on the page, where "Empire State Building" is the third-ranked tallest building.

There are many methods of importing such a tabulated list into a Python environment; in this case, use the pandas read_clipboard() method. Copy the “Rank” and “Name” columns to your clipboard and create a dataframe.

In [5]:
# Create a dataframe from the copied table columns on the clipboard and display its first 10 records

df = pd.read_clipboard()
df.head(10)

Out[5]:
Rank Name
0 1 One World Trade Center
1 2 432 Park Avenue
2 3 Empire State Building
3 4 Bank of America Tower
4 5 Three World Trade Center*
5 6= Chrysler Building
6 6= The New York Times Building
7 8 One57
8 9 Four World Trade Center
9 10 220 Central Park South
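Clipboard access is not always available (for example on a headless server), and it makes the notebook hard to re-run. As a rough alternative, the same two columns can be loaded from tab-separated text. This is only a sketch: the three sample rows are taken from the output above, and the variable name sample is an assumption for illustration.

```python
import io

import pandas as pd

# Tab-separated text, similar to what a copied two-column table looks like.
sample = (
    "Rank\tName\n"
    "1\tOne World Trade Center\n"
    "2\t432 Park Avenue\n"
    "3\tEmpire State Building\n"
)

# read_csv parses the text the same way it would parse a file on disk.
df = pd.read_csv(io.StringIO(sample), sep="\t")
print(df.head())
```

Saving the copied table to a file and reading it with read_csv works the same way and keeps the input data alongside the notebook.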

Just like with any other data science dataset, you should do some clean-up on the data. In particular, remove special characters (such as * “ ? # ‘ \ %) from the input dataset. This will enable the system to read the names correctly without mixing up their meaning.

In [6]:
# Remove all characters except English letters, digits and whitespace
# (regex=True is needed in recent pandas versions, where str.replace treats the pattern literally by default)

df['Name'] = df['Name'].str.replace(r'[^A-Za-z\s0-9]+', '', regex=True)
df.head(10)

Out[6]:
Rank Name
0 1 One World Trade Center
1 2 432 Park Avenue
2 3 Empire State Building
3 4 Bank of America Tower
4 5 Three World Trade Center
5 6= Chrysler Building
6 6= The New York Times Building
7 8 One57
8 9 Four World Trade Center
9 10 220 Central Park South
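One side effect of this pattern is that accented letters are dropped entirely, which is why a name such as "One Dag Hammarskjöld Plaza" (presumably the original spelling) shows up later as "One Dag Hammarskjld Plaza". A possible refinement, sketched below with the standard library only, is to fold accented letters to their ASCII base letter before stripping. The function names strip_special and fold_then_strip are assumptions for illustration, not part of the original notebook.

```python
import re
import unicodedata


def strip_special(name):
    """Same pattern as the notebook: keep English letters, digits and whitespace."""
    return re.sub(r"[^A-Za-z\s0-9]+", "", name)


def fold_then_strip(name):
    """Fold accented letters to ASCII first, then strip what is left."""
    folded = unicodedata.normalize("NFKD", name)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    return strip_special(folded)


print(strip_special("Three World Trade Center*"))
print(strip_special("One Dag Hammarskjöld Plaza"))
print(fold_then_strip("One Dag Hammarskjöld Plaza"))
```

Whether the folded spelling geocodes better than the stripped one depends on the service, so it is worth testing both on your own data.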

Also, the same names may well be in use in other parts of the world. You can help the system understand that you are primarily concerned with buildings in New York City by appending “New York City” to each building name as follows.

In [7]:
# Create a new column "Address_1" to hold the updated building names

df['Address_1'] = (df['Name'] + ', New York City')
df.head(10)

Out[7]:
Rank Name Address_1
0 1 One World Trade Center One World Trade Center, New York City
1 2 432 Park Avenue 432 Park Avenue, New York City
2 3 Empire State Building Empire State Building, New York City
3 4 Bank of America Tower Bank of America Tower, New York City
4 5 Three World Trade Center Three World Trade Center, New York City
5 6= Chrysler Building Chrysler Building, New York City
6 6= The New York Times Building The New York Times Building, New York City
7 8 One57 One57, New York City
8 9 Four World Trade Center Four World Trade Center, New York City
9 10 220 Central Park South 220 Central Park South, New York City

The next step is to loop through each record in the 'Address_1' column and get the corresponding address and geographic coordinates.

In [8]:
add_list = []  # an empty list to hold the geocoded results

for add in df['Address_1']:
    print('Processing .... ', add)

    try:
        n = g.geocode(add, timeout=10)

        data = (add, n.latitude, n.longitude, n.address)
        add_list.append(data)

    except Exception:
        data = (add, "None", "None", "None")
        add_list.append(data)

Processing ....  One World Trade Center, New York City
Processing ....  432 Park Avenue, New York City
Processing ....  Empire State Building, New York City
Processing ....  Bank of America Tower, New York City
Processing ....  Three World Trade Center, New York City
Processing ....  Chrysler Building, New York City
Processing ....  The New York Times Building, New York City
Processing ....  One57, New York City
Processing ....  Four World Trade Center, New York City
Processing ....  220 Central Park South, New York City
Processing ....  70 Pine Street, New York City
Processing ....  30 Park Place, New York City
Processing ....  40 Wall Street, New York City
Processing ....  Citigroup Center, New York City
Processing ....  10 Hudson Yards, New York City
Processing ....  8 Spruce Street, New York City
Processing ....  Trump World Tower, New York City
Processing ....  30 Rockefeller Plaza, New York City
Processing ....  56 Leonard Street, New York City
Processing ....  CitySpire Center, New York City
Processing ....  28 Liberty Street, New York City
Processing ....  4 Times Square, New York City
Processing ....  MetLife Building, New York City
Processing ....  731 Lexington Avenue, New York City
Processing ....  Woolworth Building, New York City
Processing ....  50 West Street, New York City
Processing ....  One Worldwide Plaza, New York City
Processing ....  Madison Square Park Tower, New York City
Processing ....  Carnegie Hall Tower, New York City
Processing ....  383 Madison Avenue, New York City
Processing ....  1717 Broadway, New York City
Processing ....  AXA Equitable Center, New York City
Processing ....  One Penn Plaza, New York City
Processing ....  1251 Avenue of the Americas, New York City
Processing ....  Time Warner Center South Tower, New York City
Processing ....  Time Warner Center North Tower, New York City
Processing ....  200 West Street, New York City
Processing ....  60 Wall Street, New York City
Processing ....  One Astor Plaza, New York City
Processing ....  7 World Trade Center, New York City
Processing ....  One Liberty Plaza, New York City
Processing ....  20 Exchange Place, New York City
Processing ....  200 Vesey Street, New York City
Processing ....  Bertelsmann Building, New York City
Processing ....  Times Square Tower, New York City
Processing ....  Metropolitan Tower, New York City
Processing ....  252 East 57th Street, New York City
Processing ....  100 East 53rd Street, New York City
Processing ....  500 Fifth Avenue, New York City
Processing ....  JP Morgan Chase World Headquarters, New York City
Processing ....  General Motors Building, New York City
Processing ....  3 Manhattan West, New York City
Processing ....  Metropolitan Life Insurance Company Tower, New York City
Processing ....  Americas Tower, New York City
Processing ....  Solow Building, New York City
Processing ....  Marine Midland Building, New York City
Processing ....  55 Water Street, New York City
Processing ....  277 Park Avenue, New York City
Processing ....  5 Beekman, New York City
Processing ....  Morgan Stanley Building, New York City
Processing ....  Random House Tower, New York City
Processing ....  Four Seasons Hotel New York, New York City
Processing ....  1221 Avenue of the Americas, New York City
Processing ....  Lincoln Building, New York City
Processing ....  Barclay Tower, New York City
Processing ....  Paramount Plaza, New York City
Processing ....  Trump Tower, New York City
Processing ....  One Court Square, New York City
Processing ....  Sky, New York City
Processing ....  1 Wall Street, New York City
Processing ....  599 Lexington Avenue, New York City
Processing ....  Silver Towers I, New York City
Processing ....  Silver Towers II, New York City
Processing ....  712 Fifth Avenue, New York City
Processing ....  Chanin Building, New York City
Processing ....  245 Park Avenue, New York City
Processing ....  Sony Tower, New York City
Processing ....  Tower 28, New York City
Processing ....  225 Liberty Street, New York City
Processing ....  1 New York Plaza, New York City
Processing ....  570 Lexington Avenue, New York City
Processing ....  MiMA, New York City
Processing ....  345 Park Avenue, New York City
Processing ....  400 Fifth Avenue, New York City
Processing ....  W R Grace Building, New York City
Processing ....  Home Insurance Plaza, New York City
Processing ....  1095 Avenue of the Americas, New York City
Processing ....  W New York Downtown Hotel and Residences, New York City
Processing ....  101 Park Avenue, New York City
Processing ....  One Dag Hammarskjld Plaza, New York City
Processing ....  Central Park Place, New York City
Processing ....  888 7th Avenue, New York City
Processing ....  Waldorf Astoria New York, New York City
Processing ....  1345 Avenue of the Americas, New York City
Processing ....  Trump Palace Condominiums, New York City
Processing ....  Olympic Tower, New York City
Processing ....  Mercantile Building, New York City
Processing ....  425 Fifth Avenue, New York City
Processing ....  One Madison, New York City
Processing ....  919 Third Avenue, New York City
Processing ....  New York Life Building, New York City
Processing ....  750 7th Avenue, New York City
Processing ....  The Epic, New York City
Processing ....  Eventi, New York City
Processing ....  Tower 49, New York City
Processing ....  555 10th Avenue, New York City
Processing ....  The Hub, New York City
Processing ....  Calyon Building, New York City
Processing ....  Baccarat Hotel and Residences, New York City
Processing ....  250 West 55th Street, New York City
Processing ....  The Orion, New York City
Processing ....  590 Madison Avenue, New York City
Processing ....  11 Times Square, New York City
Processing ....  1166 Avenue of the Americas, New York City
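The loop above sends its requests back to back. Since the public Nominatim service asks clients to stay at or below one request per second, a paced variant may be more polite for bulk jobs. The following is a minimal sketch under stated assumptions: geocode_all and its parameters are hypothetical names, the geocode argument is any callable that behaves like g.geocode, and a fake lookup is used here so the example runs offline. geopy also ships a ready-made RateLimiter helper in geopy.extra.rate_limiter that could be used instead.

```python
import time
from types import SimpleNamespace


def geocode_all(addresses, geocode, min_delay_seconds=1.0):
    """Geocode each address with a pause between calls; misses get None fields."""
    rows = []
    for i, address in enumerate(addresses):
        if i > 0:
            time.sleep(min_delay_seconds)  # pace the requests
        try:
            location = geocode(address)
        except Exception:
            location = None  # treat request errors like a miss
        if location is None:
            rows.append((address, None, None, None))
        else:
            rows.append((address, location.latitude, location.longitude, location.address))
    return rows


# Fake geocoder standing in for g.geocode (assumption, for offline illustration).
known = {
    "Empire State Building, New York City": SimpleNamespace(
        latitude=40.7484, longitude=-73.9857, address="Empire State Building, 350, 5th Avenue"
    )
}
rows = geocode_all(
    ["Empire State Building, New York City", "Three World Trade Center, New York City"],
    known.get,
    min_delay_seconds=0,  # no need to wait when nothing goes over the network
)
print(rows)
```

Storing real None values rather than the string "None" also makes the missing rows easier to detect later with pandas.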

Save the result into a dataframe.

In [9]:
# make a new dataframe to hold the geocoded result

add_list_df = pd.DataFrame(add_list, columns=['Address_1', 'Latitude', 'Longitude', 'Full Address'])
add_list_df.head(10)

Out[9]:
Address_1 Latitude Longitude Full Address
0 One World Trade Center, New York City 40.713 -74.0132 One World Trade Center, 1, Fulton Street, Batt...
1 432 Park Avenue, New York City 40.7615 -73.9719 432 Park Avenue, 432, Manhattan Community Boar...
2 Empire State Building, New York City 40.7484 -73.9857 Empire State Building, 350, 5th Avenue, Korea ...
3 Bank of America Tower, New York City 40.7555 -73.9847 Bank of America Tower, 115, West 42nd Street, ...
4 Three World Trade Center, New York City None None None
5 Chrysler Building, New York City 40.7516 -73.9753 Chrysler Building, East 43rd Street, Tudor Cit...
6 The New York Times Building, New York City 40.7559 -73.9893 The New York Times Building, 620, 8th Avenue, ...
7 One57, New York City 40.7655 -73.9791 One57, West 57th Street, Diamond District, Man...
8 Four World Trade Center, New York City None None None
9 220 Central Park South, New York City 40.767 -73.9806 220 Central Park South, Manhattan Community Bo...

Accuracy of the Result

A quick inspection of the latest data frame reveals that the obtained geographical coordinates of the buildings lie close to the latitude and longitude of New York City (that is: 40°42′46″N, 74°00′21″W). Some buildings were not geocoded (their results were not found). This suggests that the OpenStreetMap Nominatim API has no match for those names as queried, although the loop catches every exception, so a timeout or other request error would be recorded as “None” as well.

Now, you can try another API to check whether it has geocode results for those buildings.

First, use the pandas “loc” method to separate the records whose geocode results were found from those that were not found.

In [10]:
# Extract the records where value of Latitude and Longitude are the same (that is: None)

geocode_found = add_list_df.loc[add_list_df['Latitude'] != add_list_df['Longitude']]

geocode_not_found = add_list_df.loc[add_list_df['Latitude'] == add_list_df['Longitude']]
geocode_not_found

Out[10]:
Address_1 Latitude Longitude Full Address
4 Three World Trade Center, New York City None None None
8 Four World Trade Center, New York City None None None
27 Madison Square Park Tower, New York City None None None
34 Time Warner Center South Tower, New York City None None None
35 Time Warner Center North Tower, New York City None None None
49 JP Morgan Chase World Headquarters, New York City None None None
50 General Motors Building, New York City None None None
71 Silver Towers I, New York City None None None
72 Silver Towers II, New York City None None None
77 Tower 28, New York City None None None
87 W New York Downtown Hotel and Residences, New ... None None None
89 One Dag Hammarskjld Plaza, New York City None None None
92 Waldorf Astoria New York, New York City None None None

There are many ways to get this done. In this case, you simply compare the latitude and longitude columns, knowing that for places in New York City their numeric values can never be the same. Wherever the latitude and longitude cells have the same value, it will be the string value “None”, which means a geocode result wasn’t found for that building’s name.
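Comparing the two columns works here, but it relies on the placeholder string. A more explicit alternative, sketched below, is to coerce the Latitude column to numbers and split on missing values. The three sample rows are abbreviated from the output above, and the names lat and found_mask are assumptions for illustration.

```python
import pandas as pd

# Abbreviated sample rows in the same shape as add_list_df.
add_list_df = pd.DataFrame(
    [
        ("Empire State Building, New York City", 40.7484, -73.9857, "Empire State Building, 350, 5th Avenue"),
        ("Three World Trade Center, New York City", "None", "None", "None"),
        ("Chrysler Building, New York City", 40.7516, -73.9753, "Chrysler Building, East 43rd Street"),
    ],
    columns=["Address_1", "Latitude", "Longitude", "Full Address"],
)

# Anything that is not a number (such as the string "None") becomes NaN.
lat = pd.to_numeric(add_list_df["Latitude"], errors="coerce")
found_mask = lat.notna()

geocode_found = add_list_df[found_mask]
geocode_not_found = add_list_df[~found_mask]
print(geocode_not_found)
```

This version does not depend on latitude and longitude being different, so it would also work for datasets outside New York City.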

Now, you will redefine the geocoder API object to call a different API (the ArcGIS API, for example) by calling the ArcGIS() API class.

In [11]:
g = ArcGIS() # redefine the API object

You can now loop through the “geocode_not_found” data frame to see if you can get some results from the new API.

In [12]:
add_list = []

for add in geocode_not_found['Address_1']:
    print('Processing .... ', add)

    try:
        n = g.geocode(add, timeout=10)

        data = (add, n.latitude, n.longitude, n.address)
        add_list.append(data)

    except Exception:
        data = (add, "None", "None", "None")
        add_list.append(data)

Processing ....  Three World Trade Center, New York City
Processing ....  Four World Trade Center, New York City
Processing ....  Madison Square Park Tower, New York City
Processing ....  Time Warner Center South Tower, New York City
Processing ....  Time Warner Center North Tower, New York City
Processing ....  JP Morgan Chase World Headquarters, New York City
Processing ....  General Motors Building, New York City
Processing ....  Silver Towers I, New York City
Processing ....  Silver Towers II, New York City
Processing ....  Tower 28, New York City
Processing ....  W New York Downtown Hotel and Residences, New York City
Processing ....  One Dag Hammarskjld Plaza, New York City
Processing ....  Waldorf Astoria New York, New York City
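The two passes (Nominatim first, then ArcGIS for the misses) can also be folded into a single fallback step. The sketch below is one possible shape for that, not the notebook's method: geocode_with_fallback is a hypothetical helper, and the two fake services stand in for g.geocode on a Nominatim and an ArcGIS object so the example runs offline.

```python
def geocode_with_fallback(address, geocoders):
    """Try each (name, geocode) pair in order; return the first hit and its source."""
    for name, geocode in geocoders:
        try:
            location = geocode(address)
        except Exception:
            continue  # this service failed, try the next one
        if location is not None:
            return name, location
    return None, None


# Fake services for illustration: the first knows nothing, the second has one entry.
def fake_nominatim(address):
    return None


def fake_arcgis(address):
    table = {"Four World Trade Center, New York City": (40.709900, -74.012090)}
    return table.get(address)


services = [("nominatim", fake_nominatim), ("arcgis", fake_arcgis)]
print(geocode_with_fallback("Four World Trade Center, New York City", services))
print(geocode_with_fallback("Unknown Tower, New York City", services))
```

Keeping the name of the service that answered makes it easier to audit the results later, since different services can differ noticeably in accuracy.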

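Here you can see that ArcGIS returned geocode results for the buildings that the Nominatim API couldn’t retrieve. Note that some of the matched addresses are only loosely related to the query (for example, “General Motors Building” was matched to “GM”), so these results deserve a closer look.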

In [13]:
add_list_df = pd.DataFrame(add_list, columns=['Address_1', 'Latitude', 'Longitude', 'Full Address'])
add_list_df.head(10)
Out[13]:
Address_1 Latitude Longitude Full Address
0 Three World Trade Center, New York City 40.709690 -74.011670 World Trade Center
1 Four World Trade Center, New York City 40.709900 -74.012090 Four World Trade Center
2 Madison Square Park Tower, New York City 40.741500 -73.987580 Madison Square
3 Time Warner Center South Tower, New York City 40.767857 -73.982391 Time Warner Ctr, New York, 10019
4 Time Warner Center North Tower, New York City 40.767857 -73.982391 Time Warner Ctr, New York, 10019
5 JP Morgan Chase World Headquarters, New York City 40.727050 -73.825910 Headquarters
6 General Motors Building, New York City 40.879330 -73.871330 GM
7 Silver Towers I, New York City 40.843822 -73.847128 Silver St, Bronx, New York, 10461
8 Silver Towers II, New York City 40.843822 -73.847128 Silver St, Bronx, New York, 10461
9 Tower 28, New York City 40.593850 -74.186119 28 Towers Ln, Staten Island, New York, 10314

You could also import the latitudes and longitudes as points onto Google Maps to further validate their positional accuracy. When checked this way, more than 95% of the positions appear to be accurately geocoded.
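A quick programmatic sanity check can complement the visual one. The sketch below converts the New York City coordinates quoted earlier (40°42′46″N, 74°00′21″W) to decimal degrees and flags results that fall far from that point, using the haversine great-circle formula. The 10 km threshold and the helper names are assumptions chosen for illustration; the three sample rows come from the ArcGIS output above.

```python
import math


def dms_to_decimal(degrees, minutes, seconds, negative=False):
    """Convert degrees/minutes/seconds to decimal degrees (negative for S or W)."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if negative else value


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


NYC = (dms_to_decimal(40, 42, 46), dms_to_decimal(74, 0, 21, negative=True))
MAX_KM = 10.0  # arbitrary threshold for "suspiciously far", adjust to your data

results = [
    ("Three World Trade Center", 40.709690, -74.011670),
    ("General Motors Building", 40.879330, -73.871330),
    ("Tower 28", 40.593850, -74.186119),
]

for name, lat, lon in results:
    distance = haversine_km(lat, lon, *NYC)
    flag = "check" if distance > MAX_KM else "ok"
    print(f"{name}: {distance:.1f} km from the city reference point ({flag})")
```

A distance check like this only catches gross mismatches; a result can be close to the reference point and still be the wrong building.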


Mapping Geocoding Result

An obvious purpose of geocoding is to visualize places/addresses on a map. Here, you will learn to visualize the “geocode_found” data frame on a simple interactive map using the folium library (recall that you imported the library at the beginning of this tutorial). Folium makes it easy to visualize data that has been manipulated in Python on an interactive LeafletJS map.

In [14]:
# convert Full Address, Latitude and Longitude dataframe columns to list
full_address_list = list(geocode_found['Full Address'])
long_list = list(geocode_found["Longitude"])
lat_list = list(geocode_found["Latitude"])

# create folium map object
geocoded_map = folium.Map(location=[40.7484284, -73.9856546], zoom_start=13)  # location=[Lat, Long]

# loop through the lists and create markers on the map object
for long, lat, address in zip(long_list, lat_list, full_address_list):
    geocoded_map.add_child(folium.Marker(location=[lat, long], popup=address))
    geocoded_map.add_child(folium.CircleMarker(location=[lat, long], popup=address, radius=5,
                                               color='green', fill_color='green', fill_opacity=.2))

# Display the map inline
geocoded_map

Out[14]:
[Interactive folium map showing the geocoded buildings as markers]
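The map above is centred on a hard-coded coordinate pair. For other datasets it may be handier to derive the centre from the data itself. A minimal sketch follows, assuming the latitude and longitude lists contain only numeric values; map_center is a hypothetical helper name and the three sample points are taken from the geocoded output earlier in the tutorial.

```python
from statistics import mean


def map_center(lat_list, long_list):
    """Return [lat, long] at the arithmetic mean of the points."""
    return [mean(lat_list), mean(long_list)]


lats = [40.713, 40.7615, 40.7484]
longs = [-74.0132, -73.9719, -73.9857]

center = map_center(lats, longs)
print(center)
```

The resulting list can be passed as the location argument of folium.Map in place of the fixed coordinates.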

Conclusion

You have just learned about geocoding and reverse geocoding in Python, primarily using the third-party GeoPy module. The knowledge you have gained here will help you locate addresses and places when working on datasets that are amenable to maps. Geocoding is useful for plotting and extracting places/addresses on a map for purposes that may include:

  • To visualize distances such as roads and pipelines,
  • To deliver insight into public health information,
  • To determine voting demographics,
  • To analyze law enforcement and intelligence data, etc.

Be skeptical of your geocoding results. Always inspect actual address match locations against other data sources, like street basemaps. Compare your results against more than one geocoding API if possible. For example, if you geocoded with OpenStreetMap Nominatim, import the results into Google Maps to see if they match its basemap.
