## Overview and Motivation  
<i>Provide an overview of the project goals and the motivation for it. Consider that this will be read by people who did not see your project proposal. </i>

We seek to collect and examine horse racing data in order to 

## Related Work 
<i>Anything that inspired you, such as a paper, a web site, or something we discussed in class.</i>

We first conducted a thorough study of horse racing and the betting around it, as follows, to understand what we were working with. 

## Initial Questions 
<i>What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis? - Data: Source, scraping method, cleanup, storage, etc. </i>

## Final Analysis 
<i> What did you learn about the data? How did you answer the questions? How can you justify your answers? </i>

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
from bs4 import BeautifulSoup
import requests
import json
import time

## Scraping 
We began with an extensive amount of scraping.  We drew on many sources to gather a large amount of data and hit several obstacles along the way.  For one, the data sources were extremely varied and poorly organized.  

### Weather Scraping 
We were able to access the Wunderground weather API thanks to special permission from the Weather Channel.  Using it, we obtained weather conditions at each track's zip code for every date on which we had race information.  

An example of how to use the API for a set date and zip code: 

In [6]:
date = '20141225'
zip_code = '11420'
wunderground_url = 'http://api.wunderground.com/api/4a26cfc369eb7841/history_{}/q/{}.json'.format(date, zip_code)
examp = json.loads(requests.get(wunderground_url).text)

We decided on a list of weather conditions most likely to affect the racetracks on any given day.  Because Wunderground exposes hundreds of weather parameters, we limited the scope of our scrape to avoid overfitting.  Many of the parameters were duplicates, since each was available in metric or imperial units as well as in max, min, and average form. For the parameters we chose, we used metric units and averages where available and applicable.   

In [3]:
weather_data = ['fog','hail','maxhumidity','meandewptm','meanpressurem','meantempm','meanvism',
                'meanwdird', 'meanwindspdm', 'precipm', 'rain', 'snow', 'snowdepthm','snowfallm', 'thunder',
                'minhumidity']

In [4]:
# function to format data returned from wunderground api to have only the metrics we want
def output_dict(in_dict):
    # keep only the keys listed in weather_data
    return {key: value for key, value in in_dict.items() if key in weather_data}

In [7]:
output_dict(examp['history']['dailysummary'][0])

{u'fog': u'1',
 u'hail': u'0',
 u'maxhumidity': u'96',
 u'meandewptm': u'3',
 u'meanpressurem': u'1008',
 u'meantempm': u'10',
 u'meanvism': u'13',
 u'meanwdird': u'261',
 u'meanwindspdm': u'27',
 u'minhumidity': u'35',
 u'precipm': u'1.52',
 u'rain': u'1',
 u'snow': u'0',
 u'snowdepthm': u'0.00',
 u'snowfallm': u'0.00',
 u'thunder': u'0'}

In [8]:
# read in the csv of dates we want to get weather data for from previous scraping 
datesdf = pd.read_csv("tempdata/usadays.csv")
datesdf.drop('track.1',axis=1, inplace=True)
datesdf.head()

Unnamed: 0,track,year,month,day
0,AQU,1998,10,28
1,AQU,1998,10,29
2,AQU,1998,10,30
3,AQU,1998,10,31
4,AQU,1998,11,1


In [12]:
# convert date components to strings and add a 0 before single digit month/days
# (assigning through datesdf.loc[line]['month'] writes to a temporary copy and leaves
# datesdf unchanged, so the whole column is padded at once instead)
datesdf[['year', 'month', 'day']] = datesdf[['year', 'month', 'day']].astype(str)
datesdf['month'] = datesdf['month'].str.zfill(2)
datesdf['day'] = datesdf['day'].str.zfill(2)

# dictionary keyed by track to store corresponding date strings
dates_dict = {}

# stores all dates on which races occurred for a given track identifier
for track in datesdf.track.unique():
    datestring = []
    for row in datesdf.index:
        if datesdf.iloc[row]['track'] == track:
            datestring.append(str(datesdf.iloc[row]['year']) + str(datesdf.iloc[row]['month']) + 
                              str(datesdf.iloc[row]['day']))
    dates_dict[track] = datestring 
    
# zipcodes for all U.S. tracks, looked up manually and arranged into alphabetical order (according to 
# track abbreviations)
zips1 = ['11420', '60005','94403', '11003', '43123', '33056', '55379','40208', '23124', '25438', '70668', 
         '92014', '19804', '42420', '98001', '70570', '70119', '14425', '91768', '62234', '49415', '94710', 
         '33009', '60804', '46013', '90305', '46176', '40510', '42134', '20725', '71111', '75050', '90720', 
         '07073', '07757', '26047', '71901', '85023', '17028', '19020', '21215', '50009', '73111', '78154', 
         '45230', '77064', '91007', '12866', '60804', '02128', '33626', '44128', '21093', '41042']

# track identifiers from datesdf, in order of first appearance (assumed to match the alphabetical order of zips1)
locs = datesdf.track.unique()

# dictionary mapping track identifiers to zipcodes
zips_dict1 = dict(zip(locs, zips1))
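
The date strings and the per-track dictionary built above can also be produced with vectorized pandas operations instead of row-by-row loops. The following is a minimal sketch, not the code we ran: `sample` is a small illustrative frame with the same columns as `datesdf`, and `date_str` and `dates_by_track` are names introduced here for the example.

```python
import pandas as pd

# small illustrative frame with the same columns as datesdf
sample = pd.DataFrame({
    "track": ["AQU", "AQU", "BEL"],
    "year": [1998, 1998, 2011],
    "month": [10, 11, 6],
    "day": [28, 1, 15],
})

# build YYYYMMDD strings, zero-padding month and day to two digits
date_str = (
    sample["year"].astype(str)
    + sample["month"].astype(str).str.zfill(2)
    + sample["day"].astype(str).str.zfill(2)
)

# collect the date strings for each track into a dictionary of lists
dates_by_track = date_str.groupby(sample["track"]).apply(list).to_dict()
```

On the full data this should yield the same mapping as `dates_dict`, without scanning the frame once per track.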

In [11]:
print locs

NameError: name 'locs' is not defined

This is the 

In [None]:
%%time
# queries wunderground API for every track-date combination
# stores results in dictionary keyed by (track id, date) tuple
# (the %%time cell magic has to be the first line of the cell)
weather_dict = {}
except_list = []
for key in dates_dict.keys():
    for fdate in dates_dict[key]:
        wunderground_url = 'http://api.wunderground.com/api/4a26cfc369eb7841/history_{}/q/{}.json'.format(fdate, zips_dict1[key])
        try:
            temp = json.loads(requests.get(wunderground_url).text)['history']['dailysummary'][0]
            weather_dict[(key, fdate)] = output_dict(temp)
        except Exception:
            # record the (track id, date) pair that failed so it can be retried later
            except_list.append((key, fdate))
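
The loop above leaves its results in a dictionary keyed by (track id, date) tuples, while the later cells work with a tabular `weather_df`. The conversion between the two is not shown in this notebook. The following is a minimal sketch of one way it could be done, assuming each value is the flat dictionary returned by `output_dict`; `flatten_weather` and `sample_weather` are hypothetical names used only for illustration.

```python
def flatten_weather(weather):
    """Turn {(track, date): {metric: value}} into a list of flat row dicts."""
    rows = []
    # sort by (track, date) so the output order is reproducible
    for (track, date), metrics in sorted(weather.items()):
        row = {"track": track, "date": date}
        row.update(metrics)
        rows.append(row)
    return rows

# hypothetical two-entry stand-in for the scraped weather_dict
sample_weather = {
    ("BEL", "20110615"): {"fog": "0", "meantempm": "21", "precipm": "T"},
    ("AQU", "20061231"): {"fog": "0", "meantempm": "4", "precipm": "0.00"},
}

rows = flatten_weather(sample_weather)
```

Passing `rows` to `pd.DataFrame` would then give one row per track-date pair, with `track` and `date` as ordinary columns.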

In [None]:
## WARNING: do NOT run this line again 
# weather_df.to_csv('tempdata/weather.csv')

We were able to save our data into a csv by running all the scraping and cleaning commands above, which gave us the following dataframe.  

In [10]:
weather_df = pd.read_csv("tempdata/weather.csv", index_col=0)
weather_df.head(15)

Unnamed: 0,track,date,fog,hail,maxhumidity,meandewptm,meanpressurem,meantempm,meanvism,meanwdird,meanwindspdm,minhumidity,precipm,rain,snow,snowdepthm,snowfallm,thunder
0,REM,20040925,0,0,94.0,12.0,1021.0,21.0,11.0,13,8.0,23.0,0.00,0,0,,0.0,0
1,DPK,20130612,0,0,87.0,17.0,1010.0,24.0,16.0,299,14.0,47.0,0.00,0,0,0.0,0.0,0
2,FMT,20130601,0,0,94.0,17.0,1009.0,22.0,15.0,182,10.0,61.0,12.95,1,0,,0.0,1
3,PAR,20050225,0,0,80.0,6.0,1012.0,13.0,16.0,30,8.0,38.0,0.00,0,0,,0.0,0
4,BEU,20130504,0,0,66.0,6.0,1017.0,17.0,16.0,101,16.0,40.0,0.00,0,0,,0.0,0
5,AQU,20061231,0,0,70.0,-3.0,1030.0,4.0,16.0,222,15.0,44.0,0.00,0,0,0.0,0.0,0
6,BEL,20110615,0,0,80.0,11.0,1011.0,21.0,16.0,331,13.0,25.0,T,1,0,0.0,0.0,0
7,TUR,20011214,0,0,100.0,8.0,1009.0,10.0,8.0,194,19.0,69.0,9.14,1,0,,0.0,0
8,ARL,20060830,0,0,93.0,16.0,1018.0,19.0,14.0,30,13.0,70.0,0.00,1,0,,0.0,0
9,ARL,20030518,0,0,96.0,11.0,1019.0,14.0,11.0,63,11.0,66.0,0.00,0,0,,0.0,0
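
The preview above contains a literal `T` in the `precipm` column and empty `snowdepthm` cells, so these columns cannot be treated as numeric as they stand. The following is a minimal sketch of one possible cleaning step, assuming `T` denotes trace precipitation and that treating it as 0.0 is acceptable; `parse_precip` and its `trace_value` parameter are hypothetical and not part of our pipeline.

```python
def parse_precip(value, trace_value=0.0):
    """Convert one raw precipitation reading to a float (NaN if missing)."""
    if value is None:
        return float("nan")
    text = str(value).strip()
    if text == "T":
        # assumption: "T" marks a trace amount, mapped here to trace_value
        return trace_value
    try:
        return float(text)
    except ValueError:
        # empty or otherwise unparseable readings become NaN
        return float("nan")

# illustrative raw readings, including a trace marker and missing values
raw = ["0.00", "T", "12.95", "", None]
cleaned = [parse_precip(v) for v in raw]
```

A function like this could be applied column-wise, for example with `weather_df['precipm'].map(parse_precip)`.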


### INSERT WEATHER CLEANING IN HERE****  

### Test Data Scraping
We obtained the full race results for the three test races of interest: the Kentucky Derby, the Preakness Stakes, and the Belmont Stakes.  All of this data came from Wikipedia pages, which proved frustrating because their tables were formatted inconsistently.  We ran into many corner cases, which took considerable investigation to account for.  

In [None]:
## we create an empty dictionary which will hold all the html from the wikipedia pages 
pages = {}

## we create a dictionary of track years linked to the track names for the races 
## Belmont Stakes only had data from 2006 onward, while for the other two, we got data from 
## 1998 onward as that is how far our training data goes 
years1 = [str(i) for i in range(1998,2016)]
years2 = [str(i) for i in range(2006,2016)]
track_year_dict = {"_Kentucky_Derby": years1, "_Preakness_Stakes": years1, "_Belmont_Stakes": years2}

## obtaining all the html pages and putting them into our dictionary pages 
for key in track_year_dict.keys():
    for year in track_year_dict[key]:
        pages[year+key] = requests.get("https://en.wikipedia.org/wiki/{}".format(year+key)).text
        time.sleep(0.1)
        
# function to parse scraping output
# returns 2 data frames, one for payouts and one for results for a given race in a given year
def parser(key, page_dict):
    soup = BeautifulSoup(page_dict[key], "html.parser")
    tables = soup.find_all("table", attrs={"class": "wikitable"})
    
    if len(tables[0].find_all("tr")) <= 5:
        table1 = tables[0].find_all("tr")
        table2 = tables[1].find_all("tr")
    else:
        table1 = tables[1].find_all("tr")
        table2 = tables[0].find_all("tr")
    
    t1headers = [elem.get_text() for elem in table1[0].find_all("th")]
    t2headers = [elem.get_text() for elem in table2[0].find_all("th")]
    if (key == "2005_Kentucky_Derby"):
        t2headers.append("Time")
        t2headers[t2headers.index("Jockey")] = "Horse"
    
    t1 = []
    t2 = []
    for row1 in table1[1:]:
        r1_data = [cell.get_text() for cell in row1.find_all("td")]
        t1.append(r1_data)
    for row2 in table2[1:]:
        # handles cases where cells in horse column all have header tags
        if row2.find("th"):
            r2_data = [cell.get_text() for cell in row2.find_all("td")]
            r2_data.insert(2, row2.find("th").get_text())
            t2.append(r2_data)
        else:
            r2_data = [cell.get_text() for cell in row2.find_all("td")]
            t2.append(r2_data)       
    try:
        payout = pd.DataFrame(t1, columns=t1headers)
        results = pd.DataFrame(t2, columns=t2headers)
    except Exception as e:
        # handles 2015 Kentucky Derby results table, which doesn't have a header row
        if key == "2015_Kentucky_Derby":
            t1headers = [elem.get_text() for elem in table1[0].find_all("td")]
            payout = pd.DataFrame(t1, columns=t1headers)
            results = pd.DataFrame(t2, columns=t2headers)
        else:
            # report the problem and re-raise, since payout/results are undefined here
            print(str(e))
            raise
  
    return (payout, results)

# dictionary of data frames keyed by track-year string
# values are tuples of data frames returned by parser
bigdict = {key:parser(key, pages) for key in pages.keys()}

# constructs single payouts data frame by concatenating all payout data frames contained in bigdict
payouts_df = pd.DataFrame(columns=["Post", "Horse", "Win", "Place", "Show", "Track", "Year"])
for track in track_year_dict.keys():
    for year in track_year_dict[track]:
        access = year+track
        bigdict[access][0].columns = ["Post", "Horse", "Win", "Place", "Show"]
        bigdict[access][0]["Track"] = track
        bigdict[access][0]["Year"] = year
        payouts_df = pd.concat([payouts_df, bigdict[access][0]], ignore_index = True)

In [None]:
payouts_df = pd.read_csv("tempdata/payouts_df.csv", index_col=0)
payouts_df.head(15)

## Exploratory Data Analysis 
<i>What visualizations did you use to look at your data in different ways? What are the different statistical methods you considered? Justify the decisions you made, and show any major changes to your ideas. How did you reach these conclusions? </i>

In [31]:
## this function takes a win payout and returns the odds as the x in "x-1" at the time of betting,
## assuming an original bet of $2 and an assumed take by the race track of 15% 
def payoff_to_odds(payoff, bet_amount=2.0, take = .15):
    return round(((payoff/(1-take) - bet_amount)/bet_amount),4)

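## converts an "x-y" odds string into the implied win probability y/(x+y)
## (splitting on "-" also handles multi-digit odds such as "10-1")
def odds_to_percent(odds): 
    parts = str(odds).split("-")
    return float(parts[1])/(float(parts[0])+float(parts[1]))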

def normalize_odds(odds): 
    x = odds.split("-")
    if len(x) > 1: 
        return float(x[0])/float(x[1])
    else: 
        return float(x[0])
    
def make_favorite(string): 
    # True if the string flags the horse as a favorite
    return "favorite" in string
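
To make the arithmetic in `payoff_to_odds` concrete, the following is a short worked example: a $10 payout on a $2 bet with a 15% take. The function is repeated so the snippet runs on its own. `implied_probability` is a hypothetical helper added for illustration; it uses the standard relation that x-1 odds imply a win probability of 1/(x+1).

```python
def payoff_to_odds(payoff, bet_amount=2.0, take=0.15):
    # same arithmetic as the function defined above:
    # undo the track take, subtract the stake, express profit per unit staked
    return round((payoff / (1 - take) - bet_amount) / bet_amount, 4)

def implied_probability(odds_to_one):
    # hypothetical helper: x-1 odds imply a win probability of 1 / (x + 1)
    return 1.0 / (odds_to_one + 1.0)

# $10 payout on a $2 bet: 10 / 0.85 = 11.7647, minus 2 = 9.7647, divided by 2 = 4.8824
example_odds = payoff_to_odds(10.0)
example_prob = implied_probability(example_odds)
```

Here `example_odds` comes out to roughly 4.88, so such a horse would have been priced at a little under 5-1 before the take.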