# Geocoding data with Python and OpenStreetMap - Improved version for multiple repeated queries

Nominatim only accepts a limited number of queries. When you have a large DataFrame to geolocate, you may send too many requests, and Nominatim stops responding (even worse, it can block you from sending any queries for a few hours).
Our current code not only queries Nominatim for every row in the DataFrame, even though the same locations are repeated several times in the table, but it also queries Nominatim twice per row: once to get the longitude and once to get the latitude.
We will fix this problem by doing the following:

- first, we create a DataFrame containing each distinct location exactly once
- then, we query Nominatim only once per location and geocode it
- finally, we merge the geocoding results back into our original DataFrame
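
The three steps above can be sketched offline on a toy table. Everything here is hypothetical: the `tweets` table, the `fake_geocode` function and its rough coordinates stand in for the real data and for Nominatim, so that the sketch runs without sending any request.

```python
import pandas as pd

# Toy stand-in for the Twitter table: five rows but only two distinct locations.
tweets = pd.DataFrame({
    "user": ["a", "b", "c", "d", "e"],
    "location": ["Paris", "Texas", "Paris", "Paris", "Texas"],
})

# Hypothetical offline geocoder with rough, illustrative coordinates.
FAKE_COORDS = {"Paris": (48.86, 2.35), "Texas": (31.0, -100.0)}
calls = []  # records every lookup so we can count them


def fake_geocode(name):
    calls.append(name)
    return FAKE_COORDS.get(name, (float("nan"), float("nan")))


# Step 1: one row per distinct location.
places = pd.DataFrame({"location": tweets["location"].unique()})

# Step 2: geocode each distinct location exactly once.
coords = [fake_geocode(name) for name in places["location"]]
places["lat"] = [lat for lat, lon in coords]
places["lon"] = [lon for lat, lon in coords]

# Step 3: bring the coordinates back into the original table.
geocoded = tweets.merge(places, on="location", how="left")

print(len(tweets), "rows,", len(calls), "geocoder calls")
```

The counter shows that the geocoder is called once per distinct location rather than once per row.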

A second improvement is to pass our email address as a parameter of the query. This identifies us to Nominatim, so its operators can contact us if our requests cause problems, making it less likely that we end up blacklisted.
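
Identifying ourselves is only part of being a polite client: at the time of writing, the usage policy of the public Nominatim server also asks for no more than one request per second. A minimal sketch of a decorator that spaces out successive calls follows. `throttle`, `min_interval` and `lookup` are hypothetical names, and `lookup` is a placeholder that does not contact Nominatim.

```python
import time
from functools import wraps


def throttle(min_interval):
    """Make successive calls wait until min_interval seconds have passed."""
    def decorator(func):
        last_finished = None  # time.monotonic() value when the previous call ended

        @wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal last_finished
            if last_finished is not None:
                remaining = min_interval - (time.monotonic() - last_finished)
                if remaining > 0:
                    time.sleep(remaining)
            try:
                return func(*args, **kwargs)
            finally:
                last_finished = time.monotonic()

        return wrapper
    return decorator


# 0.2 s keeps the demo quick; for the public Nominatim server use 1.0 or more.
@throttle(min_interval=0.2)
def lookup(name):
    # Placeholder for a real request to Nominatim.
    return name.upper()
```

Decorating the real query function in the same way would keep the notebook under the limit even when the table of locations is long.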

In [None]:
import pandas as pd
import requests
import json

# url to access geolocation data
url = 'https://nominatim.openstreetmap.org/search'

In [None]:
# Read the csv file with the clean Twitter data
df=pd.read_csv('cleanDataFromTwitter.txt')

In the previous example we called Nominatim once for each entry in the table. However, although the table has 71 rows, it contains fewer distinct locations (see the next two commands)

In [None]:
df

In [None]:
df['location'].unique()

In [None]:
# we build an array of distinct locations (loc)
# and then a DataFrame with a single column holding those locations
loc = df['location'].unique()
loc_df = pd.DataFrame(loc,columns=['location'])
loc_df

We define a lambda function that returns the geocoding results in JSON format.
***Make sure you change the email address to your own.***

In [None]:
json_loc = lambda x: requests.get(url, params={'q': x, 'format': 'json', 'email':'croda@aup.edu'}).json()

In [None]:
# example of a simple call to the lambda function
json_loc('texas')
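
The same query can also be written as a named function with a timeout and an explicit HTTP error check, so that a hung or rejected request fails loudly instead of silently producing bad rows. This is only a sketch: `geocode_json`, its `http_get` parameter (there so the function can be tried offline with a stand-in) and the `my-geocoding-notebook` User-Agent string are assumptions, not part of the original notebook.

```python
NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def geocode_json(query, email, http_get=None, timeout=10):
    """Return Nominatim's list of matches for `query` (possibly empty)."""
    if http_get is None:
        # Imported here so the function can be exercised offline
        # by passing a stand-in for requests.get.
        import requests
        http_get = requests.get

    response = http_get(
        NOMINATIM_URL,
        params={"q": query, "format": "json", "email": email},
        # Assumed application name; Nominatim's usage policy asks clients
        # to identify themselves with a User-Agent or Referer header.
        headers={"User-Agent": "my-geocoding-notebook"},
        timeout=timeout,  # give up instead of waiting forever
    )
    response.raise_for_status()  # raise on 4xx/5xx responses
    return response.json()
```

Calling `geocode_json('texas', 'you@example.org')` would then play the role of `json_loc('texas')`.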

We apply the lambda function to the DataFrame containing the distinct locations

In [None]:
loc_df['jason_loc']=loc_df['location'].apply(json_loc)

In [None]:
# this is what it contains
loc_df

We can create a column that tells us how many locations Nominatim found with the name we gave

In [None]:
# explanation only
loc_df['NumberOfLocationsFound']=loc_df['jason_loc'].apply(len)
loc_df

Although Nominatim may find several results for each name we provide, we will assume that the first result is the correct one.
We write two functions that retrieve the latitude and the longitude of each location, respectively

In [None]:
# retrieve the latitude of the first location from the JSON output of Nominatim
def getLat(x):
    # check that at least one location has been found
    if len(x) > 0:
        # return the latitude of the first location (position 0) as a number
        return float(x[0]['lat'])
    # otherwise (no location found) return a real missing value,
    # not the string 'NaN'
    return float('nan')

In [None]:
# retrieve the longitude of the first location from the JSON output of Nominatim
def getLon(x):
    # check that at least one location has been found
    if len(x) > 0:
        # return the longitude of the first location (position 0) as a number
        return float(x[0]['lon'])
    # otherwise (no location found) return a real missing value,
    # not the string 'NaN'
    return float('nan')

Now we apply the two functions to our table of locations

In [None]:
loc_df['lat']=loc_df['jason_loc'].apply(getLat)
loc_df['lon']=loc_df['jason_loc'].apply(getLon)
loc_df
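
A possible variation is a single helper that returns both coordinates at once as floating-point numbers, with a real missing value when nothing was found. The helper name `first_lat_lon` is hypothetical, and the two sample rows are invented; they only mimic the shape of Nominatim's output (a list of dictionaries with `lat` and `lon` stored as strings).

```python
import pandas as pd


def first_lat_lon(results):
    """Return (lat, lon) of the first match as floats, or (nan, nan) if none."""
    if results:
        first = results[0]
        return float(first["lat"]), float(first["lon"])
    return float("nan"), float("nan")


# Invented rows: one location with a match, one without any.
sample = pd.DataFrame({
    "location": ["texas", "nowhere-land"],
    "jason_loc": [[{"lat": "31.25", "lon": "-98.5"}], []],
})

pairs = [first_lat_lon(results) for results in sample["jason_loc"]]
sample["lat"] = [lat for lat, lon in pairs]
sample["lon"] = [lon for lat, lon in pairs]
```

Because both columns are numeric, rows without a match can be found with `sample['lat'].isna()`.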

Finally, we use our loc_df DataFrame to geolocate all the items in our original DataFrame.
There are several ways to do this; those of you familiar with Excel's VLOOKUP may find [this explanation](https://towardsdatascience.com/name-your-favorite-excel-function-and-ill-teach-you-its-pandas-equivalent-7ee4400ada9f) useful

In [None]:
df['lat']=df.location.map(loc_df.set_index('location')['lat'].to_dict())
df['lon']=df.location.map(loc_df.set_index('location')['lon'].to_dict())
df
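
One of those other ways is `DataFrame.merge`, which performs the lookup as a left join and can check that each location appears only once in the lookup table. The sketch below uses two small invented tables (`rows` and `places_lookup`, both hypothetical) rather than the notebook's data.

```python
import pandas as pd

# Invented stand-ins for the original table and the table of geocoded locations.
rows = pd.DataFrame({
    "text": ["t1", "t2", "t3"],
    "location": ["Paris", "Texas", "Paris"],
})
places_lookup = pd.DataFrame({
    "location": ["Paris", "Texas"],
    "lat": [48.86, 31.0],
    "lon": [2.35, -100.0],
})

# Left join: keep every original row and attach the matching coordinates.
# validate="many_to_one" raises an error if a location is duplicated in the lookup.
merged = rows.merge(places_lookup, on="location", how="left",
                    validate="many_to_one")

# Rows whose location had no match end up with missing coordinates.
unmatched = merged[merged["lat"].isna()]
```

With the notebook's tables, the equivalent call would merge `df` with the `location`, `lat` and `lon` columns of `loc_df`.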

The table contains one column, `Unnamed: 0`, that we do not need (it typically appears when a CSV file was saved together with its row index); we remove it next

In [None]:
# remove the Unnamed: 0 column
df.drop(['Unnamed: 0'], inplace=True, axis=1)
df
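
If the `Unnamed: 0` column is simply the row index that was written when the file was saved (an assumption; check your own file), it can also be avoided at read time with the `index_col` parameter of `read_csv`. The sketch uses a small in-memory CSV with invented contents instead of the real file.

```python
import io

import pandas as pd

# Invented CSV text whose first, unnamed column is a saved row index.
csv_text = ",user,location\n0,alice,Paris\n1,bob,Texas\n"

# Default read: the unnamed first column shows up as 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index, so no extra column is created.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```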

### Finally, we save the data to a file so we can import it into QGIS

In [None]:
# write the table to a CSV file, leaving out the row index
df.to_csv("geocodedTwitterData_V2.csv", index=False)