# Tutorial: Practical Uses for Random Forests (using SciKit Learn)

### Introduction

This tutorial is intended as an introduction to random forests, which are an essential part of machine learning algorithms and regression. The tutorial will introduce random forests through practical examples, such as predicting results of **binary variables**, such as whether or not an NFL team should go for two points or kick one, or predicting results of **quantitative variables**, such as with what probability will an NFL team win a game given win history.

After this, the tutorial will go into a brief discussion about the pros and cons of using a random forest.

Note: The tutorial makes heavy use of the scikit-learn library (http://scikit-learn.org) - no prior experience is required though. It also makes use of the pandas and numpy libraries as well.

In [1]:
import sklearn
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier 

### Data Processing: Filtering Data and Setting up Variables

Our first data set is a set of all plays called throughout the 2012 NFL season (Go 49ers!!) - every single play is accounted for in this data set. Load the data set as follows.

In [2]:
nfl_stats = pd.read_csv("2012_nfl_pbp_data.csv")

Verify that the dataframe has been loaded into nfl_stats by making the following calls:

In [3]:
print nfl_stats.columns
print nfl_stats.head()

Index([u'gameid', u'qtr', u'min', u'sec', u'off', u'def', u'down', u'togo',
       u'ydline', u'description', u'offscore', u'defscore', u'season'],
      dtype='object')
             gameid  qtr   min sec  off  def  down  togo  ydline  \
0  20120905_DAL@NYG    1   NaN   0  DAL  NYG   NaN   NaN     NaN   
1  20120905_DAL@NYG    1  59.0  56  NYG  DAL   1.0  10.0    84.0   
2  20120905_DAL@NYG    1  59.0  49  NYG  DAL   2.0  10.0    84.0   
3  20120905_DAL@NYG    1  59.0   5  NYG  DAL   3.0   5.0    79.0   
4  20120905_DAL@NYG    1  58.0  58  NYG  DAL   4.0   5.0    79.0   

                                         description  offscore  defscore  \
0  D.Bailey kicks 69 yards from DAL 35 to NYG -4....         0         0   
1  (14:56) E.Manning pass incomplete deep left to...         0         0   
2  (14:49) E.Manning pass short middle to V.Cruz ...         0         0   
3  (14:05) (Shotgun) E.Manning pass incomplete sh...         0         0   
4  (13:58) S.Weatherford punts 56 yards t

As you can see, there are various columns regarding different aspects of the game. In this case, though, we only need to concern ourselves with extra points and two point conversions. So, we need to filter our dataframe by the **description** column. In addition, we drop unnecessary columns like togo, since yards to go doesn't really make sense when attempting an extra point or going for a two point conversion. It would make sense to drop "season" as well, as we are working under the assumption that all data is from 2012.

Finally, for any classification algorithm, an important variable is to have is a boolean variable indicating success or failure. For each data frame, we will include such a variable which is marked zero if the try was successful, and one if the try was not successful.

In [4]:
'''This function gets all entries in the dataframe that pertain to extra points'''

def getExtraPoints(df):
    newDf = pd.DataFrame()
    success = []
    for tup in df.iterrows():
        row = tup[1]
        if("extra point" in row["description"]):
            newDf = newDf.append(row)
    newDf.drop("down", axis = 1, inplace = True)
    newDf.drop("togo", axis = 1, inplace = True)
    newDf.drop("season", axis = 1, inplace = True)
    
    for tup in newDf.iterrows():
        row = tup[1]
        if("extra point is GOOD" in row['description']):
            success.append(1)
        else:
            success.append(0)
    res = pd.Series(success)
    newDf = newDf.assign(success = res)
    return newDf

def getTwoPointConv(df):
    newDf = pd.DataFrame()
    success = []
    for tup in df.iterrows():
        row = tup[1]
        if("CONVERSION" in row['description']):
            newDf = newDf.append(row)
    newDf.drop("down", axis = 1, inplace = True)
    newDf.drop("togo", axis = 1, inplace = True)
    newDf.drop("season", axis = 1, inplace = True)

    
    for tup in newDf.iterrows():
        row = tup[1]
        if("ATTEMPT SUCCEEDS" in row['description']):
            success.append(1)
        else:
            success.append(0)
    res = pd.Series(success)
    newDf = newDf.assign(success = res)
    return newDf

one_point = getExtraPoints(nfl_stats)
two_point = getTwoPointConv(nfl_stats)
print one_point.head()
print two_point.head()
print len(one_point)
print len(two_point)

     def  defscore                                        description  \
65   NYG       3.0  D.Bailey extra point is GOOD Center-L.Ladouceu...   
80   NYG       3.0  D.Bailey extra point is GOOD Center-L.Ladouceu...   
91   DAL      14.0  L.Tynes extra point is GOOD Center-Z.DeOssie H...   
123  NYG      10.0  D.Bailey extra point is GOOD Center-L.Ladouceu...   
137  DAL      24.0  L.Tynes extra point is GOOD Center-Z.DeOssie H...   

               gameid   min  off  offscore  qtr sec  ydline  success  
65   20120905_DAL@NYG  31.0  DAL       0.0  2.0   7    10.0      1.0  
80   20120905_DAL@NYG  25.0  DAL       7.0  3.0  32    40.0      1.0  
91   20120905_DAL@NYG  20.0  NYG       3.0  3.0  19    10.0      1.0  
123  20120905_DAL@NYG   6.0  DAL      17.0  4.0  12    34.0      1.0  
137  20120905_DAL@NYG   2.0  NYG      10.0  4.0  42     9.0      1.0  
      def  defscore                                        description  \
1139  MIN      20.0  TWO-POINT CONVERSION ATTEMPT. B.Gabbert 

Our filtered sets now contains all the extra points - 1 point attempts, and 2 point attempts, and an indicator variable telling us whether or not that given try was a success. With this classification variable, we are now ready to move on the random forest part of the tutorial.

### Random Forests

For some background on random forests (and a lot of information about how they work), check out the page: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm 

Simply put, a random forest is a collection of multiple decision trees (hence, forest), and it is used to make predictions about a data set, by training a classifier with a training set (in our case, that is the 2012 data above). We are going to use random forests on the **success** variable we defined above - we will use 2012 data as a training set (since we have defined success metric already for that data set), and then the 2013 data as a test set, and then see whether or not # of successful two point conversions matches that of the actual 2013 data set.