# Advanced Spatial Analysis
# Module 08: APIs, Geocoding, Geolocation

You'll need a Google API key to use the Google Maps Geocoding API and the Google Places API Web Service. These APIs require you to set up billing info, but we won't use them beyond the free threshold. Complete the following steps before the class session.

  1. Go to the Google API console: https://console.developers.google.com/
  1. Sign in, create a new project for class, then click enable APIs.
  1. Enable the Google Maps Geocoding API and then the Google Places API.
  1. Go to credentials, create an API key, then copy it.
  1. Create a new file (in the same folder as this notebook) called `keys.py` with one line: `google_api_key = 'PASTE-YOUR-KEY-HERE'`

In [None]:
import geopandas as gpd
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import time
from geopy.geocoders import GoogleV3
from shapely.geometry import Point

from keys import google_api_key

%matplotlib inline

In [None]:
# define a pause duration between API requests
pause = 0.1

## 1. Geocoding addresses to lat-long

We will use the Google Maps geocoding API. Documentation: https://developers.google.com/maps/documentation/geocoding/start

In [None]:
locations = pd.DataFrame()
locations['address'] = ['350 5th Ave, New York, NY 10118',
                        '100 Larkin St, San Francisco, CA 94102',
                        'Snell Library, Boston, MA']
locations

In [None]:
# function that accepts an address string, sends it to the Google API, and returns the lat-long API result
def geocode(address):
    time.sleep(pause) # pause for some duration before each request, to not hammer their server
    url = 'https://maps.googleapis.com/maps/api/geocode/json' # api endpoint
    params = {'address': address, 'key': google_api_key} # query parameters: requests URL-encodes them for us
    response = requests.get(url, params=params) # send the request to the server and get the response
    data = response.json() # convert the response json string into a dict
    
    if len(data['results']) > 0: # if google was able to geocode our address, extract lat-long from result
        latitude = data['results'][0]['geometry']['location']['lat']
        longitude = data['results'][0]['geometry']['location']['lng']
        return '{},{}'.format(latitude, longitude) # return lat-long as a string in the format google likes
    return None # no match found: make the missing result explicit
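
Addresses contain spaces, commas, and sometimes characters such as `&` or `#` that have special meaning in a URL, so they should be percent-encoded before being placed in a query string. Here is a minimal sketch of that encoding using only the standard library. The helper name `build_geocode_url` is an assumption for illustration and is not part of this notebook:

```python
from urllib.parse import urlencode

def build_geocode_url(address, api_key):
    # hypothetical helper: encode the query parameters, then append them to the endpoint
    base_url = 'https://maps.googleapis.com/maps/api/geocode/json'
    query = urlencode({'address': address, 'key': api_key})
    return '{}?{}'.format(base_url, query)

# the key below is the placeholder text from keys.py, not a real key
example_url = build_geocode_url('350 5th Ave, New York, NY 10118', 'PASTE-YOUR-KEY-HERE')
print(example_url)
```

Passing the same dict to `requests.get(url, params=...)` has the same effect, because `requests` encodes query parameters itself.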

In [None]:
# test the function (you can provide famous site names instead of addresses)
geocode('Fenway Park, Boston, MA')

In [None]:
# for each value in the address column, geocode it, save results as new df column
locations['latlng'] = locations['address'].map(geocode)
locations

In [None]:
# parse the result into separate lat and lon columns (as floats) for easy mapping
locations['latitude'] = locations['latlng'].map(lambda x: float(x.split(',')[0]))
locations['longitude'] = locations['latlng'].map(lambda x: float(x.split(',')[1]))
locations
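
`geocode` returns `None` when Google finds no match, and calling `.split` on that missing value raises an error. A minimal sketch of a more forgiving parser follows. `split_latlng` is a hypothetical helper name, and the coordinates are illustrative values rather than API output:

```python
def split_latlng(latlng):
    # hypothetical helper: return (lat, lng) as floats, or (None, None) if geocoding failed
    if not isinstance(latlng, str):
        return (None, None)
    lat_text, lng_text = latlng.split(',')
    return (float(lat_text), float(lng_text))

print(split_latlng('42.3384,-71.0880'))  # illustrative coordinates
print(split_latlng(None))                # a failed geocode
```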

In [None]:
# now it's your turn
# create a new pandas series of 3 addresses and use our function to geocode them
# create new variables to contain your work so as to not overwrite the locations df


In [None]:
# now it's your turn
# create a new pandas series of 3 famous site names and use our function to geocode them
# create new variables to contain your work so as to not overwrite the locations df


## 2. Google Places API

We will use Google's Places API to look up places in the vicinity of some location. Documentation: https://developers.google.com/places/web-service/intro

In [None]:
# google places API URL, with placeholders
url_template = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json?keyword={}&location={}&radius={}&key={}'

# what keyword to search for
keyword = 'restaurant'

# define the radius (in meters) for the search
radius = 500

# get the location coordinates (of snell library)
location = locations.loc[2, 'latlng']
location

In [None]:
# add our variables into the url, submit the request to the api, and load the response
url = url_template.format(keyword, location, radius, google_api_key)
response = requests.get(url)
data = response.json()

In [None]:
# how many results did we get?
len(data['results'])
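
Besides `results`, Google's JSON responses carry a `status` field (for example `OK`, `ZERO_RESULTS`, or `REQUEST_DENIED`), and a Places response that has more pages includes a `next_page_token`. Checking these before counting results helps distinguish an area with no matches from a rejected request. A minimal sketch, run here on a hand-written dict rather than a live response. `summarize_response` is a hypothetical helper:

```python
def summarize_response(data):
    # hypothetical helper: pull out the fields worth checking before using the results
    return {
        'status': data.get('status', 'UNKNOWN'),
        'count': len(data.get('results', [])),
        'has_more': 'next_page_token' in data,
    }

# hand-written stand-in for a parsed response, not real API output
sample_response = {
    'status': 'OK',
    'results': [{'name': 'Place A'}, {'name': 'Place B'}],
    'next_page_token': 'example-token',
}
print(summarize_response(sample_response))
print(summarize_response({'status': 'REQUEST_DENIED', 'results': []}))
```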

In [None]:
# inspect the first 3
data['results'][:3]

In [None]:
# turn the results into a dataframe of places
places = pd.DataFrame(data['results'], columns=['name', 'geometry', 'rating', 'vicinity'])
places.head()

In [None]:
# parse out lat-long and return it as a series -> this creates a dataframe of all the results when you .apply()
def parse_coords(geometry):
    if isinstance(geometry, dict):
        lng = geometry['location']['lng']
        lat = geometry['location']['lat']
        return pd.Series({'latitude':lat, 'longitude':lng})
    
# test our function
places['geometry'].head().apply(parse_coords)

In [None]:
# now run our function on the whole dataframe and save the output to 2 new dataframe columns
places[['latitude', 'longitude']] = places['geometry'].apply(parse_coords)
places_clean = places.drop('geometry', axis=1)

In [None]:
# sort the places by rating
places_clean = places_clean.sort_values(by='rating', ascending=False)
places_clean.head()
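
The `radius` parameter is expressed in meters, so distances given in miles need converting first (1 mile is 1609.344 meters). A small sketch of that arithmetic. `miles_to_meters` is a hypothetical helper:

```python
METERS_PER_MILE = 1609.344  # international mile, exact by definition

def miles_to_meters(miles):
    return miles * METERS_PER_MILE

# a quarter-mile search radius, rounded to whole meters
radius_quarter_mile = round(miles_to_meters(0.25))
print(radius_quarter_mile)
```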

In [None]:
# now it's your turn
# find the five highest-rated bars within 1/2 mile of fenway park
# create new variables to contain your work so as to not overwrite places and places_clean


## 3. Reverse geocoding (address lookup)

We'll use Google's reverse geocoding API. Documentation: https://developers.google.com/maps/documentation/geocoding/intro#ReverseGeocoding

You can do this manually, just like in the previous two sections, but parsing Google's address components results is a little more complicated. If we just want addresses, we can let [geopy](https://geopy.readthedocs.io/) call Google's API and parse the response for us.

In [None]:
# for simplicity, we'll use the points from the Places API, but you could load any points dataset here
points = places_clean.loc[:, ['latitude', 'longitude']]
points.head()

In [None]:
# create a column to put lat-long into the format google likes - this just makes it easier to call their API
points['latlng'] = points.apply(lambda row: '{},{}'.format(row['latitude'], row['longitude']), axis=1)
points.head()

In [None]:
# tell geopy to reverse geocode some lat-long string using Google's API and return the address
def reverse_geopy(latlng):
    time.sleep(pause)
    geolocator = GoogleV3(api_key=google_api_key)
    location = geolocator.reverse(latlng, exactly_one=True)
    return location.address if location is not None else None # None if no address was found

In [None]:
# now reverse-geocode the points to addresses
points['address'] = points['latlng'].map(reverse_geopy)
points.head()

#### What if you just want the city or state?
You could try to parse the address strings, but you're relying on them always having a consistent format. This might not be the case if you have international location data. In this case, you should call the API manually and extract the individual address components you are interested in.

In [None]:
# pass the Google API latlng data to reverse geocode it
def reverse_geocode(latlng):
    time.sleep(pause)
    url_template = 'https://maps.googleapis.com/maps/api/geocode/json?latlng={}&key={}'
    url = url_template.format(latlng, google_api_key)
    response = requests.get(url)
    data = response.json()
    if len(data['results']) > 0:
        return data['results'][0] #if we got results, return the first result
    
geocode_results = points['latlng'].map(reverse_geocode)

In [None]:
geocode_results.loc[0]

Now look inside each reverse geocode result to see if `address_components` exists. If it does, look inside each component to find the city or the state. Google uses the abstract term 'locality' for the city name and 'administrative_area_level_1' for the state name, so the same terminology works anywhere in the world.

In [None]:
def get_city(geocode_result):
    # the result is only a dict if the API returned something for this point
    if isinstance(geocode_result, dict) and 'address_components' in geocode_result:
        for address_component in geocode_result['address_components']:
            if 'locality' in address_component['types']:
                return address_component['long_name']

def get_state(geocode_result):
    if isinstance(geocode_result, dict) and 'address_components' in geocode_result:
        for address_component in geocode_result['address_components']:
            if 'administrative_area_level_1' in address_component['types']:
                return address_component['long_name']
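
To make the lookup concrete, here is a sketch that runs the same search on a hand-written result. The dict below imitates the shape of a geocoding result and is not real API output. `find_component` is a hypothetical generalization of the two functions above:

```python
# hand-written sample in the shape of one geocoding result (not real API output)
sample_result = {
    'address_components': [
        {'long_name': 'Boston', 'short_name': 'Boston',
         'types': ['locality', 'political']},
        {'long_name': 'Massachusetts', 'short_name': 'MA',
         'types': ['administrative_area_level_1', 'political']},
    ]
}

def find_component(geocode_result, component_type, name_key='long_name'):
    # return the first component whose types include component_type, else None
    if isinstance(geocode_result, dict):
        for component in geocode_result.get('address_components', []):
            if component_type in component['types']:
                return component[name_key]
    return None

print(find_component(sample_result, 'locality'))
print(find_component(sample_result, 'administrative_area_level_1', name_key='short_name'))
print(find_component(sample_result, 'country'))  # not present in the sample
```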

In [None]:
# now map our functions to extract city and state names
points['city'] = geocode_results.map(get_city)                
points['state'] = geocode_results.map(get_state)
points.head()

In [None]:
# now it's your turn
# write a new function get_neighborhood() to parse the neighborhood name and add it to the points df


## 4. Reverse geocoding to FIPS

We'll use the FCC's Census Block Conversions API to turn lat/long into a block FIPS code. A block FIPS code is 15 digits long and contains, from left to right: the location's 2-digit state code, 3-digit county code, 6-digit census tract code, and 4-digit census block code (the first digit of which is the census block group code). With it, you can join your data to census data at the tract, block group, or block level without doing a spatial join.

  - Documentation: https://geo.fcc.gov/api/census/
  - Example request: https://geo.fcc.gov/api/census/block/find?format=json&latitude=42.340970&longitude=-71.081658
  
You can do similar work with the census geocoder: https://geocoding.geo.census.gov/
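
Because a block FIPS code is fixed-width, each part can be recovered by slicing the string. A minimal sketch follows. The 15-digit value is an illustrative string chosen for the example, not a result returned by the API, and `split_block_fips` is a hypothetical helper:

```python
def split_block_fips(block_fips):
    # slice a 15-digit block FIPS code into its fixed-width parts
    if len(block_fips) != 15 or not block_fips.isdigit():
        raise ValueError('expected a 15-digit string')
    return {
        'state': block_fips[0:2],
        'county': block_fips[2:5],
        'tract': block_fips[5:11],
        'block': block_fips[11:15],
        'block_group': block_fips[11],   # first digit of the block code
        'tract_geoid': block_fips[:11],  # state + county + tract
    }

parts = split_block_fips('250250104001000')  # illustrative value
print(parts)
```

Keep FIPS codes as strings: converting them to integers drops any leading zeros.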

In [None]:
# pass the FCC API lat/long and get FIPS data back - return block fips and county name
def get_fips(row):
    time.sleep(pause)
    url_template = 'https://geo.fcc.gov/api/census/block/find?format=json&latitude={}&longitude={}'
    url = url_template.format(row['latitude'], row['longitude'])
    response = requests.get(url)
    data = response.json()
    
    # return values as a series: when applied, this will create a dataframe with multiple columns
    return pd.Series({'fips_code':data['Block']['FIPS'], 'county':data['County']['name']})

In [None]:
# get block fips code and county name from FCC as new dataframe
fips = points.apply(get_fips, axis=1)

In [None]:
fips.head()

In [None]:
# concatenate to join points df and new fips/county df
points_fips = pd.concat([points, fips], axis=1)
points_fips.head()

In [None]:
# now it's your turn
# take your geocoded series from section 1 and reverse-geocode it to get block fips codes
# then parse out the tract fips code from each row and save as a new series


## 5. Other APIs and Data Portals

We'll use the Cambridge Open Data Portal. Browse the portal for public datasets: https://data.cambridgema.gov/browse

The API is built on Socrata. Documentation: https://dev.socrata.com/

First we'll look at tax assessor data in Cambridge: https://data.cambridgema.gov/Assessing/Assessing-Building-Information-FY2015/crnm-mw9n

### 5.1. Tax assessor data

In [None]:
# define API endpoint
endpoint_url = 'https://data.cambridgema.gov/resource/crnm-mw9n.json'

# request the URL and download its response
response = requests.get(endpoint_url)

# parse the json string into a Python dict
data = response.json()

In [None]:
len(data)

There are more than 1000 rows in the dataset, but we're limited by the API to only 1000 per request. We have to use pagination to get the rest.

In [None]:
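# recursive function to keep requesting more rows until there are no more
def request_data(endpoint_url, limit=1000, offset=0, data=None):
    
    # start a fresh list on each top-level call: a mutable default like data=[]
    # would persist between calls and mix rows from different datasets
    if data is None:
        data = []
    
    url = endpoint_url + '?$limit={limit}&$offset={offset}'
    request_url = url.format(limit=limit, offset=offset)
    response = requests.get(request_url)
    
    rows = response.json()
    data.extend(rows)
    
    # a full page means there may be more rows, so request the next page
    if len(rows) >= limit:
        data = request_data(endpoint_url, limit=limit, offset=offset+limit, data=data)

    return data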
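
The same limit/offset logic can also be written as a loop, which avoids deep recursion on large datasets. The sketch below swaps the HTTP request for a stand-in function so the paging logic can be checked offline. `paginate` and `fake_fetch` are hypothetical names, and the 2,500 fake rows are made up for the demonstration:

```python
def paginate(fetch_page, limit=1000):
    # keep requesting pages until one comes back shorter than the limit
    rows = []
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        rows.extend(page)
        if len(page) < limit:
            break
        offset += limit
    return rows

# stand-in for the API: 2,500 made-up rows served one slice at a time
fake_rows = list(range(2500))

def fake_fetch(limit, offset):
    return fake_rows[offset:offset + limit]

all_rows = paginate(fake_fetch, limit=1000)
print(len(all_rows))
```

If the row count is an exact multiple of the limit, the loop makes one extra request that returns an empty page and then stops.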

In [None]:
# get all the data from the API, using our recursive function
endpoint_url = 'https://data.cambridgema.gov/resource/crnm-mw9n.json'
data = request_data(endpoint_url)
len(data)

In [None]:
# turn the json data into a dataframe
df = pd.DataFrame(data)
df.shape

In [None]:
# what columns are in our data?
df.columns

In [None]:
# inspect the assessed values
df['assessed_value'].dropna().astype(int).describe()

In [None]:
# inspect the years built
built = df['actual_year_built'].dropna().astype(int)
built.describe()

In [None]:
ax = built[built > 1600].hist(bins=20)

In [None]:
# now it's your turn
# what is the mean year built for the 10 properties with the highest assessed value?


#### Now map the data

In [None]:
# downloaded from https://data.cambridgema.gov/api/geospatial/rst6-227j?method=export&format=GeoJSON
parcels = gpd.read_file('data/parcels.geojson')

In [None]:
# merge parcel geometries with assessor dataset
parcels_assess = pd.merge(parcels, df, how='left', left_on='ml', right_on='gis_id')
parcels_assess[['assessed_value', 'land_area', 'living_area']] = parcels_assess[['assessed_value', 'land_area', 'living_area']].astype(float)

In [None]:
# calculate value per sq ft (living area + land area), then drop infinities and nulls
parcels_assess['value_per_area'] = parcels_assess['assessed_value'] / (parcels_assess['living_area'] + parcels_assess['land_area'])
parcels_assess = parcels_assess.replace([np.inf, -np.inf], np.nan).dropna(subset=['value_per_area'])

In [None]:
# clip outliers to the values at the 1st and 99th percentiles
lower = parcels_assess['value_per_area'].quantile(0.01)
upper = parcels_assess['value_per_area'].quantile(0.99)
parcels_assess['value_per_area'] = parcels_assess['value_per_area'].clip(lower=lower, upper=upper)

In [None]:
# map the parcels
fig, ax = plt.subplots(figsize=(10,10), facecolor='k')
ax = parcels_assess.plot(ax=ax, column='value_per_area')
ax.axis('off')
plt.show()

In [None]:
# now it's your turn
# choose another variable from the geodataframe and map it: do you notice any clusters or trends you can explain?


### 5.2. Crash data

https://data.cambridgema.gov/Public-Safety/Police-Department-Crash-Data-Historical/ybny-g9cv

In [None]:
# get all the data from the API, using our recursive function
endpoint_url = 'https://data.cambridgema.gov/resource/39tu-m8zx.json'
data = request_data(endpoint_url)
len(data)

In [None]:
# turn the json data into a dataframe
df = pd.DataFrame(data)
len(df)

In [None]:
# turn all the rows with lat-lng data into a geopandas geoseries of points
df_geo = df[pd.notnull(df['latitude']) & pd.notnull(df['longitude'])]
df_geo = df_geo[['latitude', 'longitude']].astype(float)
crash_points = gpd.GeoSeries(df_geo.apply(lambda row: Point((row['longitude'], row['latitude'])), axis=1))
len(crash_points)

#### Now map it

In [None]:
# shapefiles downloaded from https://www.census.gov/cgi-bin/geo/shapefiles/index.php
cities = gpd.read_file('data/tl_2018_25_place/')
tracts = gpd.read_file('data/tl_2018_25_tract/')

In [None]:
# get cambridge's boundaries
cambridge_polygon = cities[cities['NAME'].str.contains('Cambridge')]['geometry'].iloc[0]
cambridge_polygon

In [None]:
# do our CRSs match? we need them to, to do spatial analysis
cities.crs == tracts.crs

In [None]:
# how many tracts are entirely within cambridge's boundaries?
cambridge_tracts = tracts[tracts.within(cambridge_polygon)]
len(cambridge_tracts)

In [None]:
# map the tracts and the crash points
ax = cambridge_tracts.plot(color='w', edgecolor='gray')
ax = crash_points.plot(ax=ax, color='r', markersize=0.1)
ax.axis('off')
plt.show()

In [None]:
# turn the crash points into a geodataframe and project to the tracts' CRS
gdf_crashes = gpd.GeoDataFrame(geometry=crash_points)
gdf_crashes = gdf_crashes.set_crs('epsg:4326')
gdf_crashes = gdf_crashes.to_crs(cambridge_tracts.crs)

In [None]:
# spatial join tracts to crashes (i.e., assign a tract ID to each crash)
crash_tracts = gpd.sjoin(gdf_crashes, cambridge_tracts, how='left', predicate='intersects')

In [None]:
# which tracts contain the most crashes?
tract_crash_counts = crash_tracts['GEOID'].value_counts()
tract_crash_counts.name = 'crashes'
tract_crash_counts.head()

In [None]:
# merge the crash counts and the tracts
cambridge_tracts = cambridge_tracts.set_index('GEOID')
cambridge_tracts_crashes = pd.merge(cambridge_tracts, tract_crash_counts, how='left', left_index=True, right_index=True)

In [None]:
# how many crashes per square meter?
cambridge_tracts_crashes['crash_density'] = cambridge_tracts_crashes['crashes'] / cambridge_tracts_crashes['ALAND']
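
`ALAND` in the Census TIGER/Line files is land area in square meters, so the resulting densities are very small numbers. Rescaling to crashes per square kilometer makes them easier to read. A sketch of the arithmetic with made-up numbers. `density_per_km2` is a hypothetical helper:

```python
SQ_METERS_PER_SQ_KM = 1_000_000

def density_per_km2(count, area_m2):
    # convert the area to square kilometers, then divide
    return count / (area_m2 / SQ_METERS_PER_SQ_KM)

# hypothetical tract: 150 crashes on 500,000 square meters of land
print(density_per_km2(150, 500_000))
```

On the dataframe, the equivalent is multiplying the `crash_density` column by 1,000,000.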

In [None]:
# map the count of crashes per tract
fig, ax = plt.subplots(facecolor='k')
ax = cambridge_tracts_crashes.plot(ax=ax, column='crashes')
ax.axis('off')
plt.show()

In [None]:
# map the crashes/m2 per tract
fig, ax = plt.subplots(facecolor='k')
ax = cambridge_tracts_crashes.plot(ax=ax, column='crash_density')
ax.axis('off')
plt.show()

## In-class exercise

1. Visit the Cambridge data portal (link provided above) and identify another data set of interest (pick one with spatial data like lat/longs or polygon boundaries)
1. Download it using Python as we did above
1. Clean the data set if necessary and calculate descriptive stats for 2 or more columns
1. Map the data, colored by column values. Do you see any patterns of interest?