# Converting Descriptive Place Names into Geo-coordinates (Longitudes and Latitudes)!

Many social science datasets record events of interest by place name, and students often find it difficult to convert those names into geo-coordinates (longitudes and latitudes).

This tutorial uses Python's **Geopy** library for geocoding. It has functions that convert addresses or location names into geographic coordinates (latitude and longitude). Geopy provides a simple and consistent interface to interact with multiple geocoding services, such as Google Maps, OpenStreetMap, Bing Maps, and others.

* **Nominatim** is one of those geocoding services. It converts addresses and location names into geographic coordinates (latitude and longitude) using OpenStreetMap data. We will import it in the following cell.

* **GeocoderTimedOut** is the exception that geopy raises when a geocoding request times out. We will import it in the following cell too.

We will take **raw data** from *Anti-Defamation League (ADL)'s H.E.A.T. (Hate, Extremism, Antisemitism, Terrorism) database*, which contains various incidents of hate, extremism, antisemitism and terrorism, and geo-code the location of the events.

In [1]:
#importing libraries
import pandas as pd
import folium
from folium.plugins import HeatMap
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut
from tenacity import retry, stop_after_attempt, wait_fixed, retry_if_exception_type
from tqdm import tqdm

We import the "shootout" events from the H.E.A.T. database for demonstration purposes.

In [2]:
data = pd.read_excel("shootout.xlsx")

Once imported, the data looks like this. It has descriptive city and state names, and we would like to convert them into geo-coordinates for further mapping. 

In [3]:
data

Unnamed: 0.1,Unnamed: 0,id,date,city,state,type,ideology,subideology,group,description,image,latitude,longitude
0,2426,33502,2022-09-11,Walled Lake,MI,Extremist Murder;Extremist/Police Shootout,Right Wing (Other),,,Igor Lanis fatally shot his wife and wounded o...,,42.537811,-83.481048
1,6881,25776,2022-03-31,Dudleyville,AZ,Extremist/Police Shootout,Right Wing (White Supremacist),,,According to the Pinal County Sheriff's Office...,,32.908704,-110.722061
2,9750,20441,2021-12-27,Denver,CO,Extremist Murder;Extremist/Police Shootout,Right Wing (Other),,,Lyndon McLeod went on a multi-location shootin...,,39.739236,-104.984862
3,10915,20464,2021-11-05,Hoschton,GA,Extremist Murder;Extremist/Police Shootout,Right Wing (Anti-Government),,,"Jessica Worsham, an anti-government sovereign ...",,34.096496,-83.761284
4,11647,20390,2021-10-29,Tecumseh,OK,Extremist Murder;Extremist/Police Shootout,Right Wing (Anti-Government),,,"Braedon Chesser, an anti-government extremist ...",,35.257850,-96.936689
...,...,...,...,...,...,...,...,...,...,...,...,...,...
125,35948,710,2009-10-28,Dearborn,MI,Extremist/Police Shootout,Islamist,,Ummah,Law enforcement agents fatally shot Luqman Ame...,,42.322260,-83.176315
126,35969,1948,2009-04-04,Pittsburgh,PA,Extremist Murder;Extremist/Police Shootout,Right Wing (White Supremacist),,,Richard Andrew Poplawski shot and killed three...,,40.441694,-79.990086
127,35976,335,2009-01-21,Brockton,MA,Extremist Murder;Extremist/Police Shootout;Ter...,Right Wing (White Supremacist),,,"Keith Luke, 22, went on a racially motivated s...",,42.083433,-71.018379
128,36006,192,2007-08-10,Bastrop,LA,Extremist Murder;Extremist/Police Shootout,Right Wing (White Supremacist),,Aryan Circle,Aryan Circle member Dennis Clem shot and kille...,,32.776145,-91.908297


**Now we will convert the locations to coordinates (longitude and latitude) using the city and state information.**

For that, we will first add empty columns for latitude and longitude, then write a function to fill them in iteratively using the Nominatim geocoding service.

In [4]:
data['latitude'] = None
data['longitude'] = None

The following code creates a Nominatim geolocator object with a custom user agent ("heat_map_geocoder") and a timeout of 20 seconds. 

The **user_agent** parameter identifies our application to the geocoding service (the name is arbitrary), while the **timeout** parameter specifies the maximum time (in seconds) the geolocator should wait for a response from the service.

In [5]:
geolocator = Nominatim(user_agent="heat_map_geocoder", timeout=20)

The **retry** decorator from the **tenacity** library applies to the **geocode_with_retry** function defined on the line after it.

It adds retry behavior to the function, with the following configuration:

* **stop=stop_after_attempt(10)**: Retry the function up to a maximum of 10 attempts. The servers might be busy during one attempt, so this tells the program to try again.

* **wait=wait_fixed(2)**: Wait 2 seconds between retry attempts. Many public APIs limit how often a client may send requests, and the public OSM Nominatim service does too (its usage policy asks for no more than one request per second).

* **retry=retry_if_exception_type(GeocoderTimedOut)**: Retry the function only if it raises an exception of type GeocoderTimedOut, which is an exception raised by geopy when the geocoding request times out.

The function itself does the following things:

* **def geocode_with_retry(location_str)** defines the **geocode_with_retry** function, which takes a single argument **location_str** representing the location string to be geocoded. We will define this location string in the for loop further below.

* **return geolocator.geocode(location_str)**: passes the location string to Nominatim's geolocator and returns a location object holding the geographic coordinates (latitude and longitude) of the input location, or None if no match is found.

In [6]:
# Retry up to 10 times, 2 seconds apart, but only when the request times out
@retry(stop=stop_after_attempt(10), wait=wait_fixed(2), retry=retry_if_exception_type(GeocoderTimedOut))
def geocode_with_retry(location_str):
    return geolocator.geocode(location_str)
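
To make the decorator's behavior concrete, here is a rough plain-Python sketch of the same retry logic, written without tenacity. The names `retry_call`, `FakeTimeout`, and `flaky_lookup` are hypothetical and exist only for this illustration; this is not how tenacity is implemented internally. One known difference: by default tenacity wraps the final failure in its own `RetryError`, whereas this sketch simply re-raises the original exception.

```python
import time


class FakeTimeout(Exception):
    """Hypothetical stand-in for geopy's GeocoderTimedOut."""


def retry_call(func, arg, attempts=10, wait_seconds=2, retry_on=FakeTimeout):
    """Call func(arg), retrying on `retry_on` up to `attempts` times in total."""
    for attempt in range(1, attempts + 1):
        try:
            return func(arg)
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(wait_seconds)  # fixed pause before the next try


calls = {"count": 0}


def flaky_lookup(place):
    """Hypothetical lookup that times out twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] <= 2:
        raise FakeTimeout(place)
    return (39.739236, -104.984862)


# wait_seconds=0 keeps the demonstration fast
result = retry_call(flaky_lookup, "Denver, CO", wait_seconds=0)
print(result, calls["count"])
```

Any exception other than the one named in `retry_on` is not caught, which mirrors the `retry_if_exception_type(GeocoderTimedOut)` setting: only timeouts trigger another attempt.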


The following code contains a **for loop** that processes each row, geocodes its 'city' and 'state' columns using the geocode_with_retry function, and updates the DataFrame with the resulting latitude and longitude.

The **tqdm** library wraps the loop and displays a progress bar, with elapsed time and processing rate, until every row is processed.

**Note:** This code chunk sometimes fails. The Nominatim geocoding service we use is provided by OpenStreetMap (OSM), and a request can fail when its servers are too busy to process our requests or when our internet connection is unstable.

In [7]:
for index, row in tqdm(data.iterrows(), total=data.shape[0]):
    location = geocode_with_retry(f"{row['city']}, {row['state']}")
    if location:
        data.at[index, 'latitude'] = location.latitude
        data.at[index, 'longitude'] = location.longitude


100%|██████████| 130/130 [01:07<00:00,  1.93it/s]
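
Several events can share the same city and state, so one way to shorten the loop and reduce load on the geocoding service is to geocode each distinct place only once and reuse the result. The sketch below assumes a `geocode_fn` argument that behaves like `geocode_with_retry` (it returns an object with `.latitude` and `.longitude`, or `None`). The names `geocode_unique`, `FakeLocation`, and `fake_geocode` are hypothetical and are used so the example runs without network access.

```python
from collections import namedtuple

# Hypothetical stand-in for the location object that geopy returns.
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])


def geocode_unique(places, geocode_fn):
    """Geocode each distinct (city, state) pair once; return one result per input row."""
    cache = {}
    results = []
    for city, state in places:
        key = (city, state)
        if key not in cache:
            location = geocode_fn(f"{city}, {state}")
            # Store None for places the service could not resolve.
            cache[key] = (location.latitude, location.longitude) if location else None
        results.append(cache[key])
    return results


lookups = []


def fake_geocode(query):
    """Hypothetical offline geocoder that records every query it receives."""
    lookups.append(query)
    known = {"Denver, CO": FakeLocation(39.739236, -104.984862)}
    return known.get(query)


rows = [("Denver", "CO"), ("Nowhere", "ZZ"), ("Denver", "CO")]
coords = geocode_unique(rows, fake_geocode)
print(coords)
print(len(lookups))  # number of queries actually sent
```

With the objects from this tutorial, the call would look something like `geocode_unique(zip(data['city'], data['state']), geocode_with_retry)`.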


Geocoding does not always go smoothly. When a place name is misspelled or spelled differently from the geocoder's records, the geolocator cannot find coordinates for it and leaves that observation empty (missing).

The following code helps us locate those missing values before plotting.

In [8]:
missing_values = data[['latitude', 'longitude', 'type']].isna().sum()
print("Missing values in each column:")
print(missing_values)


Missing values in each column:
latitude     0
longitude    0
type         0
dtype: int64
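
A row can also be geocoded successfully but to the wrong place, for example when a town with the same name exists elsewhere. A quick sanity check is to flag coordinates that fall outside a rough bounding box for the study area. The box below is a deliberately loose assumption covering the United States including Alaska and Hawaii, and `in_bounds` is a hypothetical helper, not part of geopy.

```python
# Loose, assumed bounding box for the United States (including Alaska and Hawaii).
# It ignores the few Aleutian islands that lie across the 180° meridian.
US_BOUNDS = {"min_lat": 18.0, "max_lat": 72.0, "min_lon": -180.0, "max_lon": -66.0}


def in_bounds(lat, lon, bounds=US_BOUNDS):
    """Return True if (lat, lon) lies inside the bounding box."""
    return (bounds["min_lat"] <= lat <= bounds["max_lat"]
            and bounds["min_lon"] <= lon <= bounds["max_lon"])


points = {
    "Denver, CO": (39.739236, -104.984862),  # taken from the table above
    "suspicious row": (51.5, -0.12),         # made-up point outside the box
}
flagged = [name for name, (lat, lon) in points.items() if not in_bounds(lat, lon)]
print(flagged)
```

Rows flagged this way are candidates for manual review rather than proof of an error.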


The following code identifies which locations were not geocoded properly. If they are important for our project, we can find and correct them by hand. A typical cause is a record that combines two localities, such as (i) Fort Walton Beach and (ii) Crestview, into one name that the geolocator cannot process. (In the run shown below, every row was geocoded, so the output is empty.)

In [9]:
missing_locations = data[data['latitude'].isna() | data['longitude'].isna() | data['type'].isna()]
print("Rows with missing latitude, longitude, or type values:")
print(missing_locations[['city', 'state']])


Rows with missing latitude, longitude, or type values:
Empty DataFrame
Columns: [city, state]
Index: []
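
One way to make corrections to problem rows reproducible is to keep a small lookup table of fixes and apply it before geocoding again. In this sketch the `CORRECTIONS` table, the `corrected_query` helper, and the slash-separated city name are all hypothetical; the actual spelling in the H.E.A.T. data may differ, and choosing Fort Walton Beach over Crestview is an arbitrary illustration.

```python
# Hypothetical corrections: maps a (city, state) pair as it might appear in the
# data to the query string that should be sent to the geocoder instead.
CORRECTIONS = {
    ("Fort Walton Beach/Crestview", "FL"): "Fort Walton Beach, FL",
}


def corrected_query(city, state, corrections=CORRECTIONS):
    """Return the geocoding query for a row, using a manual fix when one exists."""
    return corrections.get((city, state), f"{city}, {state}")


fixed = corrected_query("Fort Walton Beach/Crestview", "FL")
unchanged = corrected_query("Denver", "CO")
print(fixed)
print(unchanged)
```

Keeping the fixes in one table documents every manual decision, which is easier to audit than editing the spreadsheet directly.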


If we keep rows with missing values, plotting can raise an error because the program cannot plot what is missing. We can deal with this by dropping the locations that our program could not geolocate.

Best practice, however, is to locate them and make the necessary corrections.

In [10]:
data = data.dropna(subset=['latitude', 'longitude', 'type'])
data

Unnamed: 0.1,Unnamed: 0,id,date,city,state,type,ideology,subideology,group,description,image,latitude,longitude
0,2426,33502,2022-09-11,Walled Lake,MI,Extremist Murder;Extremist/Police Shootout,Right Wing (Other),,,Igor Lanis fatally shot his wife and wounded o...,,42.537811,-83.481048
1,6881,25776,2022-03-31,Dudleyville,AZ,Extremist/Police Shootout,Right Wing (White Supremacist),,,According to the Pinal County Sheriff's Office...,,32.908704,-110.722061
2,9750,20441,2021-12-27,Denver,CO,Extremist Murder;Extremist/Police Shootout,Right Wing (Other),,,Lyndon McLeod went on a multi-location shootin...,,39.739236,-104.984862
3,10915,20464,2021-11-05,Hoschton,GA,Extremist Murder;Extremist/Police Shootout,Right Wing (Anti-Government),,,"Jessica Worsham, an anti-government sovereign ...",,34.096496,-83.761284
4,11647,20390,2021-10-29,Tecumseh,OK,Extremist Murder;Extremist/Police Shootout,Right Wing (Anti-Government),,,"Braedon Chesser, an anti-government extremist ...",,35.25785,-96.936689
...,...,...,...,...,...,...,...,...,...,...,...,...,...
125,35948,710,2009-10-28,Dearborn,MI,Extremist/Police Shootout,Islamist,,Ummah,Law enforcement agents fatally shot Luqman Ame...,,42.32226,-83.176315
126,35969,1948,2009-04-04,Pittsburgh,PA,Extremist Murder;Extremist/Police Shootout,Right Wing (White Supremacist),,,Richard Andrew Poplawski shot and killed three...,,40.441694,-79.990086
127,35976,335,2009-01-21,Brockton,MA,Extremist Murder;Extremist/Police Shootout;Ter...,Right Wing (White Supremacist),,,"Keith Luke, 22, went on a racially motivated s...",,42.083433,-71.018379
128,36006,192,2007-08-10,Bastrop,LA,Extremist Murder;Extremist/Police Shootout,Right Wing (White Supremacist),,Aryan Circle,Aryan Circle member Dennis Clem shot and kille...,,32.776145,-91.908297


### Mapping

**Now we create a Folium map centered at the approximate geographical center of the United States to plot our coordinates.**

* **map_center** = [39.8283, -98.5795] is a list that contains the latitude and longitude coordinates of the approximate center of the United States.
* **base_map** creates a Folium map object, sets the initial center of the map to the coordinates specified in the map_center list, and sets the initial zoom level of the map.

**Note:** A lower zoom level (e.g., 5) shows a larger area, while a higher zoom level (e.g., 15) shows a smaller area with more detail.

In [11]:
map_center = [39.8283, -98.5795]  # Approximate center of the United States
base_map = folium.Map(location=map_center, zoom_start=5)
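
If the data covered a different region, the hard-coded center could be replaced by one computed from the points themselves. The sketch below takes the plain average of the coordinates, which is a reasonable approximation only for points that do not straddle the 180° meridian. `mean_center` is a hypothetical helper, and the three sample points are taken from the table above.

```python
from statistics import fmean


def mean_center(points):
    """Return [mean latitude, mean longitude] for a list of (lat, lon) pairs."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return [fmean(lats), fmean(lons)]


sample = [
    (42.537811, -83.481048),   # Walled Lake, MI
    (32.908704, -110.722061),  # Dudleyville, AZ
    (39.739236, -104.984862),  # Denver, CO
]
center = mean_center(sample)
print(center)
```

The resulting list can be passed as `location=` to `folium.Map` in place of `map_center`.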


The following code uses **grouped_data**, which is useful when the data contains different types of events. We have already filtered for our type of interest to ease the geocoding process, but the data still contains several distinct type values that include the string 'Extremist/Police Shootout'. Grouping is even more useful when the events differ in nature.

For each group, the code takes the latitude and longitude of each observation, adds them to a **heatmap** object, and then adds that to the **base_map**.

It also adds a layer control interface that allows users to interactively toggle the visibility of different map layers on the map.

In [12]:
grouped_data = data.groupby('type')  # useful when the DataFrame contains different types of events

for name, group in grouped_data:
    heat_data = [[row['latitude'], row['longitude']] for index, row in group.iterrows()]
    HeatMap(heat_data, name=name).add_to(base_map)
    
folium.LayerControl().add_to(base_map)

<folium.map.LayerControl at 0x2ac6a004490>

The following code exports the base map with the heatmap layers to an interactive HTML file.

In [13]:
base_map.save('shootout.html')

The following code exports the filtered 'shootout' data to an Excel file for reproducibility purposes.

**Note:** you can also export it as CSV using *data.to_csv('shootout.csv')*.