# Reverse-geocode Google location history

In the previous notebook, we clustered the location history data to reduce the size of the data set. This reduced set was saved as 'location-history-clustered.csv'. Now we'll reverse-geocode each lat/long point to a neighborhood, city, state, and country.

First, copy that CSV file and rename the copy 'google-history-to-geocode.csv'. We'll use this copy as our working file for the reverse-geocoding. Because Google limits each IP address to 2,500 requests per day, we might need to process the data set in multiple passes, and the working file keeps the results of earlier passes.
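The copy step can also be scripted. The following is a minimal sketch, not part of the original notebook: the helper name `make_working_copy` is hypothetical, and the `data/` folder in the example call is an assumption based on the `working_file` path used later. It refuses to overwrite an existing working file, since that file may already hold results from an earlier pass.

```python
import shutil
from pathlib import Path

def make_working_copy(src, dst):
    """Copy src to dst unless dst already exists. Return True if a copy was made."""
    src, dst = Path(src), Path(dst)
    if dst.exists():
        # a working file from an earlier pass holds saved results, so leave it alone
        return False
    shutil.copyfile(src, dst)
    return True

# example call (paths are assumptions):
# make_working_copy('data/location-history-clustered.csv',
#                   'data/google-history-to-geocode.csv')
```

Skipping the copy when the destination exists is a deliberate choice here: re-running the notebook on a later day should continue from the saved results rather than start over.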

Sample request: https://maps.googleapis.com/maps/api/geocode/json?latlng=39.9058153,-86.054788
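As an illustration, the sample request above can be assembled from a lat/long pair with the standard library. This is only a sketch: the function name `build_reverse_geocode_url` and the constant are hypothetical, and the notebook itself builds the URL with a plain format string.

```python
from urllib.parse import urlencode

GOOGLE_GEOCODE_URL = 'https://maps.googleapis.com/maps/api/geocode/json'

def build_reverse_geocode_url(lat, lon):
    """Return the reverse-geocoding URL for one lat/long pair."""
    latlng = '{},{}'.format(lat, lon)
    # safe=',' keeps the comma between lat and lon unescaped
    return GOOGLE_GEOCODE_URL + '?' + urlencode({'latlng': latlng}, safe=',')

print(build_reverse_geocode_url(39.9058153, -86.054788))
# https://maps.googleapis.com/maps/api/geocode/json?latlng=39.9058153,-86.054788
```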

In [1]:
import json
import time

import pandas as pd
import requests

## Load the working file for geocoding

In [2]:
pause = 0.1 #google limits you to 10 requests per second
use_second_geocoder = False #only set True on your last pass, if multiple
max_requests = 2500 #how many requests to make of google

working_file = 'data/google-history-to-geocode.csv'

In [3]:
df = pd.read_csv(working_file, encoding='utf-8')
print('{:,} rows in dataset'.format(len(df)))

3,482 rows in dataset


If there are more than 2,500 rows in the dataset, you need to run this notebook multiple times, on separate days, because Google limits you to 2,500 requests per day. Alternatively, fall back on the Nominatim API by setting `use_second_geocoder = True`.
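To make the arithmetic concrete, here is a small sketch of how many daily passes a data set needs under that limit. The helper name `passes_needed` is hypothetical.

```python
import math

def passes_needed(n_rows, daily_limit=2500):
    """Number of daily passes needed to geocode n_rows at daily_limit requests per day."""
    if n_rows <= 0:
        return 0
    return math.ceil(n_rows / daily_limit)

print(passes_needed(3482))  # 2
```

With the 3,482 rows loaded above, two passes are enough when Google is the only geocoder used.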

## Prep the data for geocoding

In [4]:
# create the result columns only if they don't already exist
new_cols = ['city', 'state', 'country', 'geocode_results', 'geocode_results_nominatum']
for col in new_cols:
    if col not in df.columns:
        df[col] = None
        
# drop the locations and timestamp_ms columns if they are still here
cols_to_remove = ['locations', 'timestamp_ms']
for col in cols_to_remove:
    if col in df.columns:
        df.drop(col, axis=1, inplace=True)
        
df.head()

Unnamed: 0,lat,lon,datetime,city,state,country,geocode_results,geocode_results_nominatum,latlng,neighborhood
0,22.310794,114.170237,2015-05-28 05:07:24,,Kowloon,Hong Kong,"{""place_id"": ""ChIJR9ZKTsAABDQRLKH_T7JIO_M"", ""f...",,"22.310794199999997,114.1702368",Yau Ma Tei
1,37.798857,-122.279611,2016-02-13 16:10:43,Oakland,California,United States,"{""place_id"": ""ChIJT3t1CLmAj4AR48M7qyzExFk"", ""f...",,"37.7988567,-122.27961100000002",Downtown Oakland
2,37.862522,-122.275418,2015-04-02 04:18:04,Berkeley,California,United States,"{""place_id"": ""ChIJKTuMaoV-hYAR_BFYzZarN14"", ""f...",,"37.862522399999996,-122.2754184",South Berkeley
3,22.009811,-159.338031,2015-01-19 10:46:41,Lihue,Hawaii,United States,"{""place_id"": ""ChIJkQWKeDMeB3wR2XhA2nIz4Ds"", ""f...",,"22.009811399999997,-159.3380307",
4,16.86057,96.121303,2015-05-21 12:15:19,Yangon,Yangon Region,Myanmar (Burma),"{""place_id"": ""ChIJ1daFCfqUwTARixlqX-_wCJ4"", ""f...",,"16.860569899999998,96.1213028",Mayangone


In [5]:
# put latlng in the format google likes so it's easy to call their api
df['latlng'] = df.apply(lambda row: '{},{}'.format(row['lat'], row['lon']), axis=1)

In [6]:
# if this isn't the first pass through the reverse-geocoder, we will already have some saved results
# they were saved as json strings, so load them as python dicts now to work with them
f = lambda x: json.loads(x) if isinstance(x, str) else x
df['geocode_results'] = df['geocode_results'].map(f)
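The load step here pairs with the dump step at the end of the notebook. The sketch below shows the round trip in isolation, using only the standard library. The function names and the sample dict are made up for illustration.

```python
import json
import math

def to_json_text(value):
    """Serialize anything that is not already a string, keeping non-ASCII characters."""
    return value if isinstance(value, str) else json.dumps(value, ensure_ascii=False)

def from_json_text(value):
    """Parse strings as JSON and pass everything else (such as NaN) through unchanged."""
    return json.loads(value) if isinstance(value, str) else value

result = {'long_name': 'Zürich', 'types': ['locality', 'political']}  # made-up sample
text = to_json_text(result)
print(text)                            # {"long_name": "Zürich", "types": ["locality", "political"]}
print(from_json_text(text) == result)  # True

# a missing result is written as the string 'null' and read back as None
print(to_json_text(None))              # null
print(from_json_text('null'))          # None

# empty CSV cells arrive from pandas as NaN floats, which pass through untouched
print(math.isnan(from_json_text(float('nan'))))  # True
```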

In [7]:
ungeocoded_rows = df[pd.isnull(df['geocode_results']) & pd.isnull(df['geocode_results_nominatum'])]
print('{:,} out of {:,} rows lack reverse-geocode results'.format(len(ungeocoded_rows), len(df)))
print('We will attempt to reverse-geocode up to {:,} rows'.format(max_requests))

993 out of 3,482 rows lack reverse-geocode results
We will attempt to reverse-geocode up to 2,500 rows


## Now reverse-geocode the Google location history to city/country

In [8]:
# pass each row's latlng to the Google API to reverse-geocode it
count_requests = 0
def reverse_geocode(row):
    global count_requests
    if row.name % 100 == 0: print(row.name, end=' ')
    
    # first check if either geocode result column already has data
    if pd.notnull(row['geocode_results']):
        return row['geocode_results']
    elif pd.notnull(row['geocode_results_nominatum']):
        return None
    elif count_requests < max_requests:
        # this row has not yet been reverse geocoded and we have not yet made the max # of requests
        time.sleep(pause)
        url = 'https://maps.googleapis.com/maps/api/geocode/json?latlng={}'
        request = url.format(row['latlng'])
        response = requests.get(request)
        count_requests += 1
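        data = response.json()
        # use .get() in case an error response lacks a 'results' key
        results = data.get('results', [])
        if len(results) > 0:
            return results[0] #if we got results, return the first result
    return None #no result yet: leave this row for a later pass or the second geocoder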
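The function above tracks its request budget in a global counter. As an alternative sketch of the same idea, the budget can be kept local and the network call passed in as a parameter, which makes the logic testable without hitting any API. Everything here (`geocode_with_budget`, `fake_fetch`) is hypothetical and not part of the original notebook.

```python
import time

def geocode_with_budget(points, fetch, max_requests, pause=0.0):
    """Call fetch('lat,lon') for each point until max_requests calls have been made.

    Returns (results, requests_made). Points beyond the budget get None.
    """
    results = []
    requests_made = 0
    for lat, lon in points:
        if requests_made >= max_requests:
            results.append(None)  # budget spent: leave this point for a later pass
            continue
        time.sleep(pause)  # throttle to respect the per-second rate limit
        requests_made += 1
        results.append(fetch('{},{}'.format(lat, lon)))
    return results, requests_made

# stand-in for the real HTTP call, so the sketch runs offline
def fake_fetch(latlng):
    return {'latlng': latlng}

points = [(22.31, 114.17), (37.80, -122.28), (16.86, 96.12)]
results, used = geocode_with_budget(points, fake_fetch, max_requests=2)
print(used)         # 2
print(results[-1])  # None
```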

In [9]:
df['geocode_results'] = df.apply(reverse_geocode, axis=1)

0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100 3200 3300 3400 

In [10]:
ungeocoded_rows = df[pd.isnull(df['geocode_results']) & pd.isnull(df['geocode_results_nominatum'])]
print('{:,} out of {:,} rows lack reverse-geocode results'.format(len(ungeocoded_rows), len(df)))

14 out of 3,482 rows lack reverse-geocode results


## Now parse city, state, country from the results

In [11]:
def get_neighborhood(row):
    if pd.notnull(row['geocode_results']):
        if 'address_components' in row['geocode_results']:
            for component in row['geocode_results']['address_components']:
                if 'neighborhood' in component['types']:
                    return component['long_name']
                elif 'sublocality_level_1' in component['types']:
                    return component['long_name']
                elif 'sublocality_level_2' in component['types']:
                    return component['long_name']                
                
# to find city, return the finest-grain address component 
# google returns these components in order from finest to coarsest grained
def get_city(row):
    if pd.notnull(row['geocode_results']):
        if 'address_components' in row['geocode_results']:
            for component in row['geocode_results']['address_components']:
                if 'locality' in component['types']:
                    return component['long_name']
                elif 'postal_town' in component['types']:
                    return component['long_name']              
                elif 'administrative_area_level_5' in component['types']:
                    return component['long_name']
                elif 'administrative_area_level_4' in component['types']:
                    return component['long_name']
                elif 'administrative_area_level_3' in component['types']:
                    return component['long_name']
                elif 'administrative_area_level_2' in component['types']:
                    return component['long_name']

# to find state, you want the lowest-numbered admin area available (administrative_area_level_1)
# but google lists components from finest to coarsest, so admin areas appear from highest-numbered to lowest
# so you can't just return as soon as you find the first match
# this is the opposite of get_city: here we keep overwriting so the last, coarsest-grain match wins
# otherwise we end up with counties and so forth instead of states
def get_state(row):
    if pd.notnull(row['geocode_results']):
        state = None
        if 'address_components' in row['geocode_results']:
            for component in row['geocode_results']['address_components']:
                if 'administrative_area_level_1' in component['types']:
                    state = component['long_name']
                elif 'administrative_area_level_2' in component['types']:
                    state = component['long_name']
                elif 'administrative_area_level_3' in component['types']:
                    state = component['long_name']
                elif 'locality' in component['types']:
                    state = component['long_name']
        return state

def get_country(row):
    if pd.notnull(row['geocode_results']):
        if 'address_components' in row['geocode_results']:
            for component in row['geocode_results']['address_components']:
                if 'country' in component['types']:
                    return component['long_name']
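The difference between returning the first match and keeping the last match is easier to see on a small example. The sketch below uses a hand-written, heavily abbreviated result dict rather than a real API response, and the helper names `first_component` and `last_component` are hypothetical.

```python
def first_component(result, wanted_types):
    """Return the long_name of the first (finest-grained) component with a wanted type."""
    for component in result.get('address_components', []):
        if any(t in component['types'] for t in wanted_types):
            return component['long_name']
    return None

def last_component(result, wanted_types):
    """Return the long_name of the last (coarsest-grained) component with a wanted type."""
    match = None
    for component in result.get('address_components', []):
        if any(t in component['types'] for t in wanted_types):
            match = component['long_name']
    return match

# hand-written sample, ordered from finest to coarsest like Google's results
sample = {'address_components': [
    {'long_name': 'Downtown Oakland', 'types': ['neighborhood', 'political']},
    {'long_name': 'Oakland', 'types': ['locality', 'political']},
    {'long_name': 'Alameda County', 'types': ['administrative_area_level_2', 'political']},
    {'long_name': 'California', 'types': ['administrative_area_level_1', 'political']},
    {'long_name': 'United States', 'types': ['country', 'political']},
]}

admin_types = ['administrative_area_level_1', 'administrative_area_level_2']
print(first_component(sample, ['locality', 'postal_town']))  # Oakland
print(first_component(sample, admin_types))                  # Alameda County
print(last_component(sample, admin_types))                   # California
```

The second call shows why `get_state` cannot return on the first match: the county comes before the state in the list.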

In [12]:
# now apply our functions to extract neighborhood, city, state, country
df['neighborhood'] = df.apply(get_neighborhood, axis=1)
df['city'] = df.apply(get_city, axis=1)
df['state'] = df.apply(get_state, axis=1)
df['country'] = df.apply(get_country, axis=1)

In [13]:
mask = pd.isnull(df['city']) & pd.isnull(df['state']) & pd.isnull(df['country'])
print('{:,} out of {:,} rows lack city, state, and country'.format(len(df[mask]), len(df)))
ungeocoded_rows = df[pd.isnull(df['geocode_results']) & pd.isnull(df['geocode_results_nominatum'])]
print('{:,} out of {:,} rows lack reverse-geocode results'.format(len(ungeocoded_rows), len(df)))

14 out of 3,482 rows lack city, state, and country
14 out of 3,482 rows lack reverse-geocode results


### If use_second_geocoder is True, use the OSM Nominatim API to reverse-geocode any remaining missing rows

Only do this on the final pass. It is useful for places like Kosovo, for which Google does not return results.

In [14]:
# pass lat/lon to OSM Nominatim to reverse-geocode it
def reverse_geocode_nominatum(label, lat, lon):
    print(label, end=' ')
    time.sleep(max(pause, 1)) #the public Nominatim server's usage policy allows at most 1 request per second
    url = 'https://nominatim.openstreetmap.org/reverse?format=json&lat={}&lon={}&zoom=18&addressdetails=1'
    request = url.format(lat, lon)
    response = requests.get(request)
    data = response.json()
    return data
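The public Nominatim server's usage policy asks clients to identify themselves with a User-Agent or Referer header and to stay at or below one request per second. The sketch below builds (but does not send) such a request with the standard library. The function name, the User-Agent string, and the coordinates are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

NOMINATIM_REVERSE_URL = 'https://nominatim.openstreetmap.org/reverse'

def build_nominatim_request(lat, lon, user_agent):
    """Build a reverse-geocoding request that identifies the calling application."""
    query = urlencode({'format': 'json', 'lat': lat, 'lon': lon,
                       'zoom': 18, 'addressdetails': 1})
    return Request(NOMINATIM_REVERSE_URL + '?' + query,
                   headers={'User-Agent': user_agent})

# placeholder coordinates and User-Agent, for illustration only
req = build_nominatim_request(42.6629, 21.1655, 'location-history-notebook (you@example.com)')
print(req.full_url)
# https://nominatim.openstreetmap.org/reverse?format=json&lat=42.6629&lon=21.1655&zoom=18&addressdetails=1
```

With the `requests` library used in this notebook, the same header can be supplied through the `headers` argument of `requests.get`.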

In [15]:
def parse_nominatum_data(data):
    country = None
    state = None
    city = None
    if isinstance(data, dict):
        if 'address' in data:
            if 'country' in data['address']:
                country = data['address']['country']

            #state
            if 'region' in data['address']:
                state = data['address']['region']
            if 'state' in data['address']:
                state = data['address']['state']

            #city: county is the coarsest fallback; village, then city, override it
            if 'county' in data['address']:
                city = data['address']['county']
            if 'village' in data['address']:
                city = data['address']['village']
            if 'city' in data['address']:
                city = data['address']['city']
    return city, state, country
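The chain of `if` statements above encodes a priority order: later checks overwrite earlier ones. The same precedence can be written as an ordered list of keys, sketched below on the assumption that `county` is meant as the last-resort fallback for city. The helper names and the sample address are hypothetical.

```python
def first_present(address, keys):
    """Return the value of the first key in keys that the address contains."""
    for key in keys:
        if key in address:
            return address[key]
    return None

def parse_address(address):
    """Pick city, state, and country from a Nominatim-style address dict."""
    city = first_present(address, ['city', 'village', 'county'])  # most preferred first
    state = first_present(address, ['state', 'region'])
    country = address.get('country')
    return city, state, country

# made-up address with placeholder names
address = {'village': 'Example Village', 'county': 'Example County',
           'region': 'Example Region', 'country': 'Kosovo'}
print(parse_address(address))
# ('Example Village', 'Example Region', 'Kosovo')
```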

In [16]:
if use_second_geocoder:
    df['geocode_results_nominatum'] = None
    for label, row in df.iterrows():
        if pd.isnull(row['geocode_results']):
            result = reverse_geocode_nominatum(label, row['lat'], row['lon'])
            city, state, country = parse_nominatum_data(result)
            df.loc[label, 'city'] = city
            df.loc[label, 'state'] = state
            df.loc[label, 'country'] = country
            df.loc[label, 'geocode_results_nominatum'] = json.dumps(result, ensure_ascii=False)

228 314 710 1041 1144 1362 1490 1664 1717 2077 2146 3032 3154 3401 

In [17]:
mask = pd.isnull(df['city']) & pd.isnull(df['state']) & pd.isnull(df['country'])
print('{:,} out of {:,} rows lack city, state, and country'.format(len(df[mask]), len(df)))
ungeocoded_rows = df[pd.isnull(df['geocode_results']) & pd.isnull(df['geocode_results_nominatum'])]
print('{:,} out of {:,} rows lack reverse-geocode results'.format(len(ungeocoded_rows), len(df)))

0 out of 3,482 rows lack city, state, and country
0 out of 3,482 rows lack reverse-geocode results


## Done: Save to CSV

In [18]:
# dump the geocode_results to json string before saving so it saves nicely as text
f = lambda x: x if isinstance(x, str) else json.dumps(x, ensure_ascii=False)
df['geocode_results'] = df['geocode_results'].map(f)

In [19]:
# save the entire data set to the working file
df.to_csv(working_file, encoding='utf-8', index=False)

# save the useful columns to a final output file
cols_to_retain = ['datetime', 'neighborhood', 'city', 'state', 'country', 'lat', 'lon']
df[cols_to_retain].to_csv('data/google-location-history.csv', encoding='utf-8', index=False)