## Geocoding Script
For heatseek data analysis 2018-2019  
Last updated: Oct 16 2019

### 1. Importing necessary libraries
**Note**: `geopy` and `folium` must be installed separately, as they are not included by default in the Anaconda distribution.

In [1]:
import pandas as pd
import geopy as gp
from geopy.geocoders import GoogleV3
import folium

### 2. Importing data

In [2]:
temp_01 = pd.read_csv("data/data_1_Oct_01_2018_to_Jan_31_2019.csv")
temp_02 = pd.read_csv("data/data_2_Feb_01_2019_to_May_31_2019.csv")
df = pd.concat([temp_01, temp_02])

In [None]:
df.head()

In [None]:
df.info()

### 3. EDA + Cleaning

In [5]:
# checking data (done before the cast, because astype(str) turns NaN into the string "nan")
print(df['address'].nunique()) # only 38 unique addresses
print(df['address'].isna().sum()) # there is no missing address

# changing data type
df.address = df.address.astype(str)

# creating a full address with city, state and country info to improve geocoding accuracy
df = df.assign(add_full = df.address + ", New York City, New York, USA")

# create clean dataset for geocoding (explicit copy so new columns can be added safely)
df_gc = df.drop_duplicates(subset = "add_full").copy()

38
0


### 4. Geocoding
Geocoding requires one API request per address, which is slow and counts against the API quota, so only unique addresses are geocoded. The latitude and longitude are then joined back to the full dataset.  
**Note**: The API key was generated from my (Jolene) Google Cloud account.

In [6]:
# read the API key from file
with open("api/geocoding_api.txt", "r") as api_file:
    api_key = api_file.readline().strip()  # strip the trailing newline
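
A key file like the one read above is easy to commit or share by accident. As a minimal sketch, the helper below looks for the key in an environment variable first and falls back to the file. The function name `load_api_key` and the variable name `GEOCODING_API_KEY` are illustrative assumptions, not part of the original notebook.

```python
import os
from pathlib import Path


def load_api_key(path="api/geocoding_api.txt", env_var="GEOCODING_API_KEY"):
    """Return the API key from an environment variable, else from a file.

    The default path and the variable name are illustrative assumptions.
    """
    key = os.environ.get(env_var)
    if key is None:
        # fall back to the first line of the key file
        lines = Path(path).read_text().splitlines()
        key = lines[0] if lines else ""
    key = key.strip()  # drop the trailing newline and stray spaces
    if not key:
        raise ValueError("API key is empty")
    return key
```

With this in place, `api_key = load_api_key()` could replace the file-reading cell.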

In [None]:
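# geocoding unique addresses
geocoder = GoogleV3(api_key = api_key)

# start from missing values so that failed lookups show up in the isna() check
df_gc['lat'] = float("nan")
df_gc['lon'] = float("nan")
lat_col = df_gc.columns.get_loc('lat')
lon_col = df_gc.columns.get_loc('lon')

for i in range(len(df_gc)):
    address = df_gc.add_full.iloc[i]
    location = geocoder.geocode(address, timeout = 15)
    if location is not None:  # geocode() returns None when nothing matches
        df_gc.iloc[i, lat_col] = location.latitude  # single indexer avoids chained assignment
        df_gc.iloc[i, lon_col] = location.longitude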
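
The loop above makes one network request per address, so a single timeout stops the whole run. A minimal sketch of a retry wrapper with exponential backoff follows. The helper name `call_with_retry` and its parameters are hypothetical and use only the standard library.

```python
import time


def call_with_retry(func, *args, retries=3, base_delay=1.0,
                    exceptions=(Exception,), **kwargs):
    """Call func, retrying on the given exceptions with exponential backoff."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except exceptions:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            # wait base_delay, then 2x, 4x, ... before the next attempt
            time.sleep(base_delay * 2 ** attempt)
```

In the loop above this might be wired in as `call_with_retry(geocoder.geocode, address, timeout = 15, exceptions = (GeocoderTimedOut,))`, with `GeocoderTimedOut` imported from `geopy.exc`.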

In [8]:
print(df_gc.lat.isna().sum())
print(df_gc.lon.isna().sum())

0
0


In [9]:
# join lat, lon to all addresses
address_dict = df_gc[['address', 'lat', 'lon']]

df_full = pd.merge(df, address_dict, how = "left", on = "address")
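
The deduplicate, geocode, and join-back pattern used in this section can be sketched on toy data. The addresses and coordinates below are made up, and the `fake_geocoder` dictionary stands in for real API responses. The `validate` argument of `merge` is optional. It is included here as one way to guard against duplicate keys in the lookup table.

```python
import pandas as pd

# toy stand-in for the sensor readings: several rows per address
readings = pd.DataFrame({
    "address": ["1 Main St", "1 Main St", "2 Oak Ave", "1 Main St"],
    "temp": [68, 70, 65, 71],
})

# hypothetical coordinates standing in for real geocoder responses
fake_geocoder = {
    "1 Main St": (40.70, -74.00),
    "2 Oak Ave": (40.80, -73.95),
}

# look up each distinct address only once
lookup = readings[["address"]].drop_duplicates().reset_index(drop=True)
lookup["lat"] = [fake_geocoder[a][0] for a in lookup["address"]]
lookup["lon"] = [fake_geocoder[a][1] for a in lookup["address"]]

# join the coordinates back to every reading;
# validate="many_to_one" raises if an address appears twice in lookup
geocoded = readings.merge(lookup, how="left", on="address",
                          validate="many_to_one")
```

A left join keeps every row of `readings`, so `geocoded` should have the same number of rows as the input.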

### 5. Export Geocoded Data

In [10]:
# index = False keeps the meaningless row index out of the exported file
df_full.to_csv("data/heatseek_geocoded.csv", index = False)

### 6. Visualization
A simple map to check that the geocoding results make sense.

In [12]:
hs_map = folium.Map(
    location = [40.7128, -74.0060],
    tiles = 'OpenStreetMap',
    zoom_start = 15)

address_dict.apply(lambda row: folium.CircleMarker(location = [row["lat"], row["lon"]]).add_to(hs_map), axis = 1)

hs_map
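
The visual check can be complemented by a numeric one. As a rough sketch, the function below flags any point that falls outside an approximate bounding box for New York City. The box limits are assumed, rounded values and `outside_bounds` is a hypothetical helper, shown here on toy data.

```python
import pandas as pd

# rough bounding box for New York City in degrees (assumed, rounded values)
NYC_BOUNDS = {"lat": (40.47, 40.92), "lon": (-74.27, -73.68)}


def outside_bounds(points, bounds=NYC_BOUNDS):
    """Return the rows of points whose lat/lon fall outside the bounding box."""
    in_lat = points["lat"].between(*bounds["lat"])
    in_lon = points["lon"].between(*bounds["lon"])
    # rows with missing coordinates fail both tests and are returned as well
    return points[~(in_lat & in_lon)]


# toy data: the second point is deliberately far from New York
sample = pd.DataFrame({
    "address": ["A", "B"],
    "lat": [40.7128, 34.0522],
    "lon": [-74.0060, -118.2437],
})
suspect = outside_bounds(sample)
```

Applied to the notebook's data, `outside_bounds(address_dict)` would list the addresses worth checking by hand.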