# Geocode football stadiums

This notebook uses the Foursquare API to geocode stadiums from their name plus city and state. The dataset covers every NCAA Division I college football stadium in the US, in both the [FBS](https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums) and [FCS](https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FCS_football_stadiums) subdivisions, and comes from Wikipedia. The Foursquare API suits this task because it takes a query (the stadium name) and searches for it near a given city and state.


  1. Register a new Foursquare API app under your account to get a client ID and client secret: https://foursquare.com/developers/apps
  2. Create a file called `keys.py` in the same directory as these IPython notebooks.
  3. Add two lines of code to `keys.py`: first `client_id = 'your-id-here'`, then `client_secret = 'your-secret-here'`.
  4. Run the notebook.
  5. Visualize the geocoded data in [visualize-football-stadiums.ipynb](visualize-football-stadiums.ipynb).

In [1]:
import pandas as pd, re, time, requests, json
from keys import client_id, client_secret

In [2]:
# load the data
df_fbs = pd.read_csv('data/fbs-stadiums-original.csv', dtype=str, encoding='utf-8')
df_fbs['div'] = 'fbs'
df_fcs = pd.read_csv('data/fcs-stadiums-original.csv', dtype=str, encoding='utf-8')
df_fcs['div'] = 'fcs'
df = pd.concat([df_fbs, df_fcs], axis=0)

## Clean up the data set

In [3]:
# fill nans in expanded col, drop rows with nan in team col, reset index, drop image and old index cols
df['expanded'] = df['expanded'].fillna('')
df = df.dropna(subset=['team'])
df = df.reset_index()
df = df.drop(['image', 'index'], axis=1)
df.head()

Unnamed: 0,stadium,city,state,team,conference,capacity,built,expanded,div
0,Michigan Stadium,Ann Arbor,MI,Michigan,Big Ten,"107,601[98]",1927,2015,fbs
1,Beaver Stadium,University Park,PA,Penn State,Big Ten,"106,572[12]",1960,2001,fbs
2,Ohio Stadium,Columbus,OH,Ohio State,Big Ten,"104,944[106]",1922[108],2014[108],fbs
3,Kyle Field,College Station,TX,Texas A&M,SEC,"102,733[76]",1927[76],2015[76],fbs
4,Neyland Stadium,Knoxville,TN,Tennessee,SEC,"102,455[102]",1921[103],2010[103],fbs


In [4]:
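# remove pluses, commas, and any footnotes in square brackets
# (non-greedy match so that two footnotes in one cell don't swallow the text between them)
regex = re.compile(r'\+|,|\[.*?\]')
# only touch string cells so any remaining NaN values pass through unchanged
# note: pandas 2.1+ deprecates applymap in favor of DataFrame.map
df = df.applymap(lambda x: regex.sub('', x) if isinstance(x, str) else x)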

# now convert the cleaned-up columns to int
df['capacity'] = df['capacity'].astype(int)
df['built'] = df['built'].astype(int)
df.head()

Unnamed: 0,stadium,city,state,team,conference,capacity,built,expanded,div
0,Michigan Stadium,Ann Arbor,MI,Michigan,Big Ten,107601,1927,2015,fbs
1,Beaver Stadium,University Park,PA,Penn State,Big Ten,106572,1960,2001,fbs
2,Ohio Stadium,Columbus,OH,Ohio State,Big Ten,104944,1922,2014,fbs
3,Kyle Field,College Station,TX,Texas A&M,SEC,102733,1927,2015,fbs
4,Neyland Stadium,Knoxville,TN,Tennessee,SEC,102455,1921,2010,fbs
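The cleaning step can be checked in isolation on a few raw values from the table above. This is a minimal sketch using only the standard library; the helper name `clean_cell` and the non-greedy `.*?` are assumptions for illustration, not part of the original notebook.

```python
import re

# Plus signs, commas, and bracketed footnotes such as "[98]".
# The non-greedy ".*?" is a precaution for cells with more than one footnote.
footnote_re = re.compile(r'\+|,|\[.*?\]')

def clean_cell(value):
    """Strip pluses, commas, and [n] footnotes from strings; pass other types through."""
    if isinstance(value, str):
        return footnote_re.sub('', value)
    return value

# Raw values as they appear in the scraped table, plus a missing value.
samples = ['107,601[98]', '1922[108]', 'Big Ten', None]
cleaned = [clean_cell(s) for s in samples]
print(cleaned)  # ['107601', '1922', 'Big Ten', None]
```

Once the footnotes and thousands separators are gone, the numeric columns convert cleanly with `int`.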


## Geocode the data set to lat-long

In [5]:
# specify how many results to return and what api version to call
limit = 1
version = '20160105'
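These two settings travel to the API as query parameters, alongside the stadium name and its city and state. The sketch below shows how such a query string might be URL-encoded with the standard library, which matters for names containing spaces or `&`. The helper name `build_query` is hypothetical, and the credentials are left out on purpose.

```python
from urllib.parse import urlencode

def build_query(stadium, city, state, limit=1, version='20160105'):
    """Return a URL-encoded query string for a venue search (credentials omitted)."""
    params = {
        'query': stadium,
        'near': '{},{}'.format(city, state),
        'limit': limit,
        'v': version,
    }
    # urlencode escapes spaces, commas, ampersands, and other reserved characters
    return urlencode(params)

qs = build_query('Kyle Field', 'College Station', 'TX')
print(qs)  # query=Kyle+Field&near=College+Station%2CTX&limit=1&v=20160105
```

Without encoding, an ampersand inside a stadium name would be read as the start of a new parameter and truncate the query.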

In [6]:
# function to geocode stadiums to lat/long with foursquare api
def geocode(row):
    # print progress every 10th row
    if row.name % 10 == 0: print(row.name, end=' ')
    # skip rows that already have a result, so the cell can be re-run safely
    if 'latlng' in row and pd.notnull(row['latlng']):
        return row['latlng']
    # pause briefly between requests
    time.sleep(0.2)
    url = 'https://api.foursquare.com/v2/venues/search'
    # pass values as params so requests url-encodes spaces, ampersands, etc.
    params = {'query': row['stadium'],
              'near': '{},{}'.format(row['city'], row['state']),
              'limit': limit,
              'v': version,
              'client_id': client_id,
              'client_secret': client_secret}
    response = requests.get(url, params=params)
    data = response.json()
    # an error response may lack the 'response' or 'venues' keys
    venues = data.get('response', {}).get('venues', [])
    if len(venues) > 0:
        latitude = venues[0]['location']['lat']
        longitude = venues[0]['location']['lng']
        return '{},{}'.format(latitude, longitude)
    # no match found: leave this row empty
    return None
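The response handling can be exercised without network access or API keys. The sketch below pulls the first venue's coordinates out of an already-parsed response. The helper name `extract_latlng` is hypothetical, and the sample payloads are made up: they contain only the fields the notebook reads and are not real API output.

```python
def extract_latlng(data):
    """Return 'lat,lng' for the first venue in a parsed response, or None."""
    # .get() with defaults tolerates payloads missing 'response' or 'venues'
    venues = data.get('response', {}).get('venues', [])
    if not venues:
        return None
    location = venues[0]['location']
    return '{},{}'.format(location['lat'], location['lng'])

# Made-up payloads for illustration only.
hit = {'response': {'venues': [{'location': {'lat': 42.27, 'lng': -83.75}}]}}
miss = {'response': {'venues': []}}
error = {'meta': {'code': 400}}

print(extract_latlng(hit))    # 42.27,-83.75
print(extract_latlng(miss))   # None
print(extract_latlng(error))  # None
```

Returning `None` for both the empty and the malformed case keeps unmatched rows distinguishable from geocoded ones.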

In [7]:
df['latlng'] = df.apply(geocode, axis=1)

0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 

In [8]:
# parse out individual lats and longs from the result
df['latitude'] = df['latlng'].map(lambda x: x.split(',')[0] if isinstance(x, str) else None)
df['longitude'] = df['latlng'].map(lambda x: x.split(',')[1] if isinstance(x, str) else None)
df = df.drop(labels='latlng', axis=1)
df.head()

Unnamed: 0,stadium,city,state,team,conference,capacity,built,expanded,div,latitude,longitude
0,Michigan Stadium,Ann Arbor,MI,Michigan,Big Ten,107601,1927,2015,fbs,42.26586873251738,-83.7487256526947
1,Beaver Stadium,University Park,PA,Penn State,Big Ten,106572,1960,2001,fbs,40.81215273275043,-77.85620212554932
2,Ohio Stadium,Columbus,OH,Ohio State,Big Ten,104944,1922,2014,fbs,40.0016856893694,-83.01972806453705
3,Kyle Field,College Station,TX,Texas A&M,SEC,102733,1927,2015,fbs,30.61009757817476,-96.34072922859283
4,Neyland Stadium,Knoxville,TN,Tennessee,SEC,102455,1921,2010,fbs,35.95473437262258,-83.92533302307129
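The parsing step leaves latitude and longitude as strings. If numeric values are needed later, for example for plotting or distance calculations, one possible approach is to convert while splitting. This is a minimal sketch with a hypothetical helper named `split_latlng`; it is not part of the original notebook.

```python
def split_latlng(latlng):
    """Split a 'lat,lng' string into two floats; return (None, None) for anything else."""
    if not isinstance(latlng, str):
        return (None, None)
    lat, lng = latlng.split(',')
    return (float(lat), float(lng))

print(split_latlng('42.26586873251738,-83.7487256526947'))
print(split_latlng(None))  # (None, None)
```

Rows that could not be geocoded come back as `(None, None)` rather than raising an error.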


In [9]:
# save to csv
df.to_csv('data/stadiums-geocoded.csv', encoding='utf-8', index=False)