# Analysis Summary

Our goal from this analysis will be to predict the outcome of March Madness games. To do this, we will be using a random forest classifer that takes in historical data from NCAA tourney matchups as input and outputs predictions for the winning team for each game in an input set of March Madness games. 

### Baseline Model
The baseline model we will use to compare our model's results against will be solely tied to a team's RPI. Our baseline model poses the following hypothesis:

    That in any NCAA tournament game, the team with the lower RPI will win the game.
  
Intuitively, this is a reasonable prediction. RPI (Ratings Percentage Index), ranks teams based on their wins, losses, and strength of schedule for the past season. If team A has a lower RPI than team B at the end of a season, it's generally considered that team A's performance throughout the season has been at a higher level than team B's. For this reason, if we knew nothing else about the two teams, predicting the outcome based on RPI is a good starting point. 
### Random Forest Approach
We would like to build on our baseline model and see if we can develop an approach that more accurately predicts the outcome of games. One of the most exciting parts of March Madness is the array of upsets that occur throughout the tournament. In general, upsets occur when a team with a higher RPI beats a team with a lower RPI. We would like to create a model that performs better than our baseline model by more accurately predicting the outcome of games, in particular predicting when upsets occur. For us to accomplish this, we can utilize RPI as well as some additional attributes that provide more information surrounding each team's level of performance in the past season. These factors can be used to build a random forest in order to: 
    1. Predict when an upset is going to occur in March Madness
    2. Identify which factors are correlated to predicting the outcome of a game

We will build a random forest by passing as input a data frame where each row corresponds to an NCAA tournament game. Each row will contain data regarding each team's yearly averages and totals in statistical categories, RPI, the game's outcome, and whether or not the team with the lowest RPI won. This last piece of information will be our dependent variable. The random forest will utilize features in our training data describing the winning team and losing team's performance during the season, in order to learn which factors are tied to predicting the outcome of a game. Once the random forest has been trained, tournament data where the outcome of the each game has been excluded can be used as input to the model to generate a set of predictions for each game in that year's tournament and test the accuracy of our random forest classifer.

### Data Sources Overview
The data the was used to perform this analysis came from the Google Cloud & NCAA® ML Competition 2018-Men's Kaggle competition (Insert Link). The following is a brief summary of all the datasets that were used:
- NCAATourneyCompactResults: Contains records from each NCAA tournament game from 1985-2017, including score and region.  
- Teams: Contains information for each Division 1 (D1) basketball team including an ID, name, first and most recent year playing D1 basketball. 
- MasseyOrdinals_Prelim2018: Contains data from 2003-2017 surrounding each D1 team's rank from various rankings sources throughout. 
- RegularSeasonDetailedResults: Contains similar information as the NCAATourneyCompactResults dataset, with the addition that each row will also contain the totals in a variety of statistical categories for the winning team and losing team. These are categories that are often found in a boxscore. 

A more in depth description of each of the datasets that were used and additional datasets provided by Kaggle can be found at https://www.kaggle.com/c/mens-machine-learning-competition-2018/data


### Data Cleansing & Preparation

In order to create our random forest, we need to complete the following:
    1. Settle on a set of statistics to evaluate the performance of each team. 
    2. Create a data frame that represents a matchup that occurred in an NCAA tournament and contains our set of statistics for each team in the matchup 

To accomplish the first task, the following statistics were calculated for each team:
    - Defensive Rebounds (DRB)
    - Efficient Field Goal Percentage (EFG)
    - Free Throws Attempted (FTA)
    - Free Throw Percentage (FTP)
    - Margin of Loss (MOL)
    - Margin of Victory (MOV)
    - Offensive Rebounds (ORB)
    - Possessions (POSS)
    - Turnovers (TO)
    - Turnovers Forced (TOF)
    - Defensive Efficiency (dEff)
    - Offensive Effiency (oEff)
    - Rankings Percentage Index (RPI)
    - Conference Tourney Wins (confTournWins)
    - Wins vs. a Tournament Team (winsVsTourney)
    - Number of Games Played (numGamesPlayed)


With the exception of the last four, the yearly pre-tournament averages (regular season and conference tourney play) were found. The last three are cumulative totals, and RPI was found by using the latest ranking given to a team before the tournament. 

By parsing through the 4 datasets listed above, we can calculate the desired information for each team in a March Madness tournament from 2003-2017 as seen below:

In [1]:
matchups = getMatchupData()
matchups.head()

NameError: name 'getMatchupData' is not defined

Each row in the output data frame represents a game played in an March Madness tournament. The 16 statistics that were targeted are used features as well as some additional information to help identify which teams played and when the game occurred. A more detailed look at how the Matchups data frame was created can be seen in the getMatchupData function in the Appendix. 

### Random Forest Creation

The matchups data frame can be used as input to a random forest classifier to output predictions for a subset of NCAA tournament games. When creating a random forest classifer, a few parameters should be considered:

- _Maximum Number of Features_: The maximum number of features that our algorithm uses to create a decision tree that will be a part of our random forest. A value of sqrt(n) where n is the number of features in our input dataset was chosen. This is generally considered to be a good starting point for random forest classifiers because it allows the decision trees that are created to have a strong chance of selecting a unique set of features to train a model with. The more unique our set of decision trees are, the more confident we can be in our random forest's performance on test data. 
- _Number of Trees_: The number of trees in a random forest impacts the classifier's effectiveness for similar reason. If not enough decision trees are chosen, some features may not be included in our model and the classifer's effectiveness is hindered. The more trees in a forest, the more likely we cover the full feature space. To have a strong chance of covering the full feature space, 1000 trees were selected to be in our random forest.

For this project, random forests classifiers were created for two use cases:
    1. Predicting the outcome of all NCAA tournament games from one season between 2003-2017
    2. Predicting the outcome of all NCAA tournament championship games from 2003-2017

In order to accomplish both tasks, two methods were created, getPredictionsChips and getPredictions. Both methods behave similarly, except getPredictionChips uses championship games as its test data set to output predictions and getPredictions uses a specified year as its test data set. In each method, a data set of historical NCAA tournament games is cleaned up by removing any qualitative columns or any column specifying a result from our baseline model. Once the training and test data sets are cleaned, a random forest classifier is created using the training data and predictions are made on the test data using the generated classifier. Each method returns a tuple with the following information: 
- _Output Predictions_ (outputPreds): A dataframe describing the prediction for each matchup, the actual result, and the probability for the predicted outcome to occur
- _Baseline Model Accuracy_ (baselineAcc): The accuracy from our baseline model that predicted a game's outcome based on which team had the lower RPI
- _Random Forest Classifier Accuracy_ (modelAcc): The accuracy from our random forest classifier that predicted a game's outcome on a variety of input features

### TODO: Random Forest Classifier Results vs. Baseline Model
- 15 Year comparison with baseline model
- Championship game predictions vs. baseline model

- Two sets of results to look at
1. Results of chip game analysis
2. Results of per season analysis 

For each set of reslts, we compared the classifier's prediction accuracy to the baseline model's. In all years of games analyzed,our model always outperformed the baseline model. This tells us that we can be smarter when picking games and not simply rely on RPI.

In [416]:
indPredicts = []
outCols = ["Predict", "Actual", "A Name", "A ID", "B Name", "B ID", "Prob For", "Prob Against"]
baseAccs = []
modelAccs = []

# Chip games testing
outputPreds, baselineAcc, modelAcc = getPredictionsChips()
print("Baseline Model Accuracy: {}".format(round(baselineAcc, 2)))
print("Our Model's Accuracy: {}".format(round(modelAcc, 2)))

for row in outputPreds:
    indPredicts.append(row.tolist())
predsDF = pd.DataFrame(indPredicts, columns = outCols)
# predsDF
# pd.DataFrame(indPredicts).to_csv("data/output/chipTestResults.csv", index=False, header=True)
# outputPreds


[1 1 1 1 0 1 1 1 1 1 1 1 1 1 1]
[0 1 0 0 0 0 1 1 1 1 1 0 0 1 1]
-21.0
15
Baseline Model Accuracy: 0.53
Our Model's Accuracy: 0.6


In [172]:
indPredicts = []
outCols = ["Predict", "Actual", "L Name", "L ID", "W Name", "W ID", "Prob For", "Prob Against"]
baseAccs = []
modelAccs = []
models = []
for i in range(2003, 2018):
#     model = getForestDeltas(str(i))
    model = getForest(str(i))
    model.setFeatureImportances()
    importances = model.featureImportances
    model.setPredictions()
    models.append(model)
    outputPreds = model.getModelPredictionsAndProbs()
    baselineAcc = model.baselineAcc
    modelAcc = model.modelAcc
    
    baseAccs.append(round(baselineAcc, 3))
    modelAccs.append(round(modelAcc, 3))

    for row in outputPreds:
        indPredicts.append(row.tolist())

# pd.DataFrame(indPredicts).to_csv("data/output/results_deltas_with_seeds_D.csv", index=False, header=True)
accDF = pd.DataFrame({"Season": range(2003,2018), "Baseline": baseAccs, "RF": modelAccs})
predsDF = pd.DataFrame(indPredicts, columns = outCols)

[['Vermont' '1436_2003' 'Arizona' '1112_2003']
 ['Memphis' '1272_2003' 'Arizona St' '1113_2003']
 ['Creighton' '1166_2003' 'C Michigan' '1141_2003']
 ['NC State' '1301_2003' 'California' '1143_2003']
 ['BYU' '1140_2003' 'Connecticut' '1163_2003']
 ['Colorado St' '1161_2003' 'Duke' '1181_2003']
 ['Cincinnati' '1153_2003' 'Gonzaga' '1211_2003']
 ['WKU' '1443_2003' 'Illinois' '1228_2003']
 ['Utah St' '1429_2003' 'Kansas' '1242_2003']
 ['Holy Cross' '1221_2003' 'Marquette' '1266_2003']
 ['S Illinois' '1356_2003' 'Missouri' '1281_2003']
 ['WI Milwaukee' '1454_2003' 'Notre Dame' '1323_2003']
 ['S Carolina St' '1354_2003' 'Oklahoma' '1328_2003']
 ['San Diego' '1360_2003' 'Stanford' '1390_2003']
 ['Dayton' '1173_2003' 'Tulsa' '1409_2003']
 ['Weber St' '1451_2003' 'Wisconsin' '1458_2003']
 ["St Joseph's PA" '1386_2003' 'Auburn' '1120_2003']
 ['Mississippi St' '1280_2003' 'Butler' '1139_2003']
 ['Sam Houston St' '1358_2003' 'Florida' '1196_2003']
 ['Alabama' '1104_2003' 'Indiana' '1231_2003']
 [

[['Marquette' '1266_2006' 'Alabama' '1104_2006']
 ['Pacific' '1334_2006' 'Boston College' '1130_2006']
 ['Southern Univ' '1380_2006' 'Duke' '1181_2006']
 ['South Alabama' '1375_2006' 'Florida' '1196_2006']
 ['UNC Wilmington' '1423_2006' 'G Washington' '1203_2006']
 ['Xavier' '1462_2006' 'Gonzaga' '1211_2006']
 ['Air Force' '1102_2006' 'Illinois' '1228_2006']
 ['San Diego St' '1361_2006' 'Indiana' '1231_2006']
 ['Iona' '1233_2006' 'LSU' '1261_2006']
 ['Nevada' '1305_2006' 'Montana' '1285_2006']
 ['Winthrop' '1457_2006' 'Tennessee' '1397_2006']
 ['Syracuse' '1393_2006' 'Texas A&M' '1401_2006']
 ['Belmont' '1125_2006' 'UCLA' '1417_2006']
 ['Utah St' '1429_2006' 'Washington' '1449_2006']
 ['Oklahoma' '1328_2006' 'WI Milwaukee' '1454_2006']
 ['Seton Hall' '1371_2006' 'Wichita St' '1455_2006']
 ['Wisconsin' '1458_2006' 'Arizona' '1112_2006']
 ['Kansas' '1242_2006' 'Bradley' '1133_2006']
 ['Arkansas' '1116_2006' 'Bucknell' '1137_2006']
 ['Albany NY' '1107_2006' 'Connecticut' '1163_2006']
 ['M

[['Chattanooga' '1151_2009' 'Connecticut' '1163_2009']
 ['Binghamton' '1127_2009' 'Duke' '1181_2009']
 ['Akron' '1103_2009' 'Gonzaga' '1211_2009']
 ['Butler' '1139_2009' 'LSU' '1261_2009']
 ['California' '1143_2009' 'Maryland' '1268_2009']
 ['CS Northridge' '1169_2009' 'Memphis' '1272_2009']
 ['Clemson' '1155_2009' 'Michigan' '1276_2009']
 ['Radford' '1347_2009' 'North Carolina' '1314_2009']
 ['Morgan St' '1288_2009' 'Oklahoma' '1328_2009']
 ['Northern Iowa' '1320_2009' 'Purdue' '1345_2009']
 ['Minnesota' '1278_2009' 'Texas' '1400_2009']
 ['BYU' '1140_2009' 'Texas A&M' '1401_2009']
 ['VA Commonwealth' '1433_2009' 'UCLA' '1417_2009']
 ['American Univ' '1110_2009' 'Villanova' '1437_2009']
 ['Illinois' '1228_2009' 'WKU' '1443_2009']
 ['Mississippi St' '1280_2009' 'Washington' '1449_2009']
 ['Utah' '1428_2009' 'Arizona' '1112_2009']
 ['Temple' '1396_2009' 'Arizona St' '1113_2009']
 ['Wake Forest' '1448_2009' 'Cleveland St' '1156_2009']
 ['West Virginia' '1452_2009' 'Dayton' '1173_2009']
 [

[['S Dakota St' '1355_2012' 'Baylor' '1124_2012']
 ['UNLV' '1424_2012' 'Colorado' '1160_2012']
 ['West Virginia' '1452_2012' 'Gonzaga' '1211_2012']
 ['New Mexico St' '1308_2012' 'Indiana' '1231_2012']
 ['Connecticut' '1163_2012' 'Iowa St' '1235_2012']
 ['Southern Miss' '1379_2012' 'Kansas St' '1243_2012']
 ['WKU' '1443_2012' 'Kentucky' '1246_2012']
 ['Davidson' '1172_2012' 'Louisville' '1257_2012']
 ['BYU' '1140_2012' 'Marquette' '1266_2012']
 ['Colorado St' '1161_2012' 'Murray St' '1293_2012']
 ['Long Beach St' '1253_2012' 'New Mexico' '1307_2012']
 ['Loyola MD' '1259_2012' 'Ohio St' '1326_2012']
 ['UNC Asheville' '1421_2012' 'Syracuse' '1393_2012']
 ['Wichita St' '1455_2012' 'VA Commonwealth' '1433_2012']
 ['Harvard' '1217_2012' 'Vanderbilt' '1435_2012']
 ['Montana' '1285_2012' 'Wisconsin' '1458_2012']
 ['Texas' '1400_2012' 'Cincinnati' '1153_2012']
 ['Alabama' '1104_2012' 'Creighton' '1166_2012']
 ['Virginia' '1438_2012' 'Florida' '1196_2012']
 ['St Bonaventure' '1382_2012' 'Florida

[['TX Southern' '1411_2015' 'Arizona' '1112_2015']
 ['Wofford' '1459_2015' 'Arkansas' '1116_2015']
 ['Texas' '1400_2015' 'Butler' '1139_2015']
 ['Purdue' '1345_2015' 'Cincinnati' '1153_2015']
 ['E Washington' '1186_2015' 'Georgetown' '1207_2015']
 ['Baylor' '1124_2015' 'Georgia St' '1209_2015']
 ['Hampton' '1214_2015' 'Kentucky' '1246_2015']
 ['LSU' '1261_2015' 'NC State' '1301_2015']
 ['Harvard' '1217_2015' 'North Carolina' '1314_2015']
 ['Northeastern' '1318_2015' 'Notre Dame' '1323_2015']
 ['VA Commonwealth' '1433_2015' 'Ohio St' '1326_2015']
 ['Iowa St' '1235_2015' 'UAB' '1412_2015']
 ['SMU' '1374_2015' 'UCLA' '1417_2015']
 ['SF Austin' '1372_2015' 'Utah' '1428_2015']
 ['Lafayette' '1248_2015' 'Villanova' '1437_2015']
 ['Mississippi' '1279_2015' 'Xavier' '1462_2015']
 ['Providence' '1344_2015' 'Dayton' '1173_2015']
 ['Robert Morris' '1352_2015' 'Duke' '1181_2015']
 ['N Dakota St' '1295_2015' 'Gonzaga' '1211_2015']
 ['Davidson' '1172_2015' 'Iowa' '1234_2015']
 ['New Mexico St' '1308

In [173]:
print(models[1].feature_importances)

        importance
bMOV      0.146763
aMOV      0.135270
boEff     0.134216
aRPI      0.133252
bMOL      0.124211
bRPI      0.119394
aMOL      0.116190
a_seed    0.090705


In [174]:
accDF

Unnamed: 0,Season,Baseline,RF
0,2003,0.667,0.683
1,2004,0.746,0.698
2,2005,0.694,0.726
3,2006,0.667,0.683
4,2007,0.806,0.694
5,2008,0.783,0.8
6,2009,0.746,0.794
7,2010,0.677,0.71
8,2011,0.683,0.635
9,2012,0.726,0.742


In [60]:
for model in models:
    df = pd.DataFrame(model.confusionMatrix)
    df.rename(columns={0:'Pred Upset', 1:'Pred No Upset'}, 
                 index={0:'Actual Upset',1:'Actual No Upset'}, 
                 inplace=True)
    print(df)

                 Pred Upset  Pred No Upset
Actual Upset              3             18
Actual No Upset           1             42
                 Pred Upset  Pred No Upset
Actual Upset              2             15
Actual No Upset           3             44
                 Pred Upset  Pred No Upset
Actual Upset              3             17
Actual No Upset           3             41
                 Pred Upset  Pred No Upset
Actual Upset              2             19
Actual No Upset           2             41
                 Pred Upset  Pred No Upset
Actual Upset              3             10
Actual No Upset           5             46
                 Pred Upset  Pred No Upset
Actual Upset              3             12
Actual No Upset           4             45
                 Pred Upset  Pred No Upset
Actual Upset              1             15
Actual No Upset           4             44
                 Pred Upset  Pred No Upset
Actual Upset              3             18
Actual No U

In [171]:

def findChampionshipMatches():
    """
    Read in NCAA tourney matchups and return data frame containing additional column denoting (True/False) if that matchup was a championship game. 
    """
    matchups = getMatchupData()
    matchups = sortMatchups(matchups)
    ## group by season and with resulting groupby obj, find whether each row equals the dayNum max for each group
    ## store result as column in matchups defining whether championship played that day
    ## able to pass in functions to transform to perform calculations for each group
    matchups["chipGame"] = matchups.groupby(['season'])['dayNum'].transform(max) == matchups['dayNum']
    return matchups

def getPredictionsChips():
    """
    Outputs predictions for all championship games from 2003-2017 using a Random Forest classifier. Baseline model takes team with lower RPI as winner. 
    Returns a tuple consisting of a data frame containing the model's prediction for every matchup in our test dataset, the baseline model's accuracy, our model's accuracy
    """
    matchups = findChampionshipMatches()
    cols = list(matchups.columns)
    
    # Create training/test data sets
    train = matchups[matchups["chipGame"] == False]
    test = matchups[matchups["chipGame"] == True]
    baselineAcc = 1.0*sum(test["baseline"]) / test.shape[0]
    
    trainLabels = np.array(train["baseline"])
    trainLabels = trainLabels.astype(int)
    testLabels = np.array(test["baseline"])
    testLabels = testLabels.astype(int)
    testNames = np.column_stack((test["aname"], test["a_id"], test["bname"], test["b_id"]))
    # Drop qualitative & output columns
    train = train.drop(["b_id", "a_id", "baseline", "bname", "aname", "season", "dayNum", "chipGame"], axis = 1)
    test = test.drop(["b_id", "a_id", "baseline", "bname", "aname", "season", "dayNum", "chipGame"], axis = 1)
    feature_names = train.columns
    trainFeatures = np.array(train)
    testFeatures = np.array(test)
    maxFeatures = int(len(feature_names)**0.5)

    rf = RandomForestClassifier(n_estimators = 1000, random_state=42, oob_score=True, max_features=maxFeatures)
    rf.fit(trainFeatures, trainLabels)
    ## Draw sample classification tree
    # drawTree(rf, "sampleTree")

    predictions = rf.predict(testFeatures)
    predictProbs = rf.predict_proba(testFeatures)
    modelAcc = 1.0*(predictions.shape[0] - sum(predictions ^ testLabels)) / predictions.shape[0]
    stack = np.column_stack((predictions.T, testLabels.T, testNames[:,0], testNames[:,1], testNames[:,2], testNames[:,3], predictProbs[:,0], predictProbs[:,1]))
    outputPreds = stack[stack[:,0].argsort()]
    return outputPreds, baselineAcc, modelAcc

### Utilize historical matchup data to build RF model. 
def getForest(year, train=None, test=None):
    """
    Outputs predictions for games from test data set using a Random Forest classifier. Baseline model takes team with lower RPI as winner. 
    Returns a tuple consisting of a data frame containing the model's prediction for every matchup in our test dataset, the baseline model's accuracy, our model's accuracy
    """
    model = mod.Model()
    matchups = getMatchupData()
    matchups = mergeRankings(matchups)
    matchups = sortMatchups(matchups)
    matchups = matchups[matchups["a_seed"] != matchups["b_seed"]]

    cols = list(matchups.columns)
    train = matchups[~matchups["b_id"].str.contains(year)]
    test = matchups[matchups["b_id"].str.contains(year)]
    baselineAcc = 1.0*sum(test["baseline"]) / test.shape[0]
    
    # Create training/test data sets
    trainLabels = np.array(train["baseline"])
    trainLabels = trainLabels.astype(int)
    testLabels = np.array(test["baseline"])
    testLabels = testLabels.astype(int)
    testNames = np.column_stack((test["aname"], test["a_id"], test["bname"], test["b_id"]))
    model.trainBaseline = trainLabels
    model.testBaseline = testLabels
    model.testNames = testNames
    
    # Drop qualitative & output columns
    train = train.drop(["b_id", "a_id", "baseline", "bname", "aname", "season", "dayNum"], axis = 1)
    test = test.drop(["b_id", "a_id", "baseline", "bname", "aname", "season", "dayNum"], axis = 1)
    # Drop insignificant quantitative columns
#     train = train.drop(["aMOV", "aMOL", "aFTA", "aFTP", "anumGamesPlayed", "aRPI", "bMOV", "bMOL", "bFTA", "bFTP", "bnumGamesPlayed", "bRPI"], axis = 1)
#     test = test.drop(["aMOV", "aMOL", "aFTA", "aFTP", "anumGamesPlayed", "aRPI", "bMOV", "bMOL", "bFTA", "bFTP", "bnumGamesPlayed", "bRPI"], axis = 1)

    train = train[['a_seed','bMOV', 'aMOV', 'boEff', 'aMOL', 'bMOL', "aRPI", "bRPI"]]
    test = test[['a_seed','bMOV', 'aMOV', 'boEff', 'aMOL', 'bMOL', 'aRPI', "bRPI"]]
# aRPI               0.050391
# bMOV               0.047005
# boEff              0.041432
# aMOV               0.040750
# a_seed             0.037222
# bMOL               0.034516
# bRPI               0.034289
# aMOL               0.032449
# adEff              0.032296
# bFTP               0.031561
# bPOSS              0.031423
# bDRB               0.031367
# bTOF               0.031037
# bORB               0.031020
# bFTA               0.030186
# aEFG               0.029658
# bTO                0.029633
# bdEff              0.029222
# bEFG               0.028613
# aFTA               0.028416
# aTOF               0.027987
# aFTP               0.027763
# awinsVsTourney     0.027578
# aTO                0.027538
# aORB               0.027136
# aPOSS              0.026246
# aoEff              0.026036
# aDRB               0.025817
# bwinsVsTourney     0.021221
# b_seed             0.020643
# anumGamesPlayed    0.018999
# bnumGamesPlayed    0.015599
# bconfTournWins     0.013323
# aconfTournWins     0.011627
    
    model.train = train
    model.test = test

    feature_names = train.columns
    trainFeatures = np.array(train)
    testFeatures = np.array(test)
    maxFeatures = int(len(feature_names)**0.5)

    rf = RandomForestClassifier(n_estimators = 1000, random_state=42, oob_score=True, max_features=maxFeatures)
    rf.fit(trainFeatures, trainLabels)
    model.forest = rf
    return model

### Utilize historical matchup data to build RF model. 
def getForestDeltas(year, train=None, test=None):
    """
    Outputs predictions for games from test data set using a Random Forest classifier. Baseline model takes team with lower RPI as winner. 
    Returns a tuple consisting of a data frame containing the model's prediction for every matchup in our test dataset, the baseline model's accuracy, our model's accuracy
    """
    model = mod.Model()
    matchups = getMatchupData()
    seededMatches = mergeRankings(matchups)
    sortedMatches = sortMatchups(seededMatches)
    deltas = pd.DataFrame(findAllDeltas(sortedMatches))
    
    cols = list(deltas.columns)
    trainOrig = deltas[~deltas["b_id"].str.contains(year)]
    testOrig = deltas[deltas["b_id"].str.contains(year)]
    baselineAcc = 1.0*sum(testOrig["baseline"]) / testOrig.shape[0]
    
    # Create training/test data sets
    trainLabels = np.array(trainOrig["baseline"])
    trainLabels = trainLabels.astype(int)
    testLabels = np.array(testOrig["baseline"])
    testLabels = testLabels.astype(int)
    testNames = np.column_stack((testOrig["aname"], testOrig["a_id"], testOrig["bname"], testOrig["b_id"]))
    model.trainBaseline = trainLabels
    model.testBaseline = testLabels
    model.testNames = testNames
    
    # Drop qualitative & output columns
    train = trainOrig.drop(["b_id", "a_id", "baseline", "bname", "aname", "season", "dayNum", "seed"], axis = 1)
    test = testOrig.drop(["b_id", "a_id", "baseline", "bname", "aname", "season", "dayNum", "seed"], axis = 1)
    # Drop insignificant quantitative columns
    train = train.drop(["MOV", "MOL", "FTP", "DRB", "ORB", "EFG", "numGamesPlayed", "winsVsTourney", "confTourneyWins", "POSS", "FTA"], axis = 1)
    test = test.drop(["MOV", "MOL", "FTP", "DRB", "ORB", "EFG", "numGamesPlayed", "winsVsTourney", "confTourneyWins", "POSS", "FTA"], axis = 1)
    
    model.train = train
    model.test = test

    feature_names = train.columns
    trainFeatures = np.array(train)
    testFeatures = np.array(test)
    maxFeatures = int(len(feature_names)**0.5)

    rf = RandomForestClassifier(n_estimators = 1000, random_state=42, oob_score=True, max_features=maxFeatures)
    rf.fit(trainFeatures, trainLabels)
    model.forest = rf
    return model

def drawTree(rf, treeName):
    dot_data = StringIO()
    export_graphviz(rf.estimators_[0], out_file=dot_data, filled=True, rounded=True, special_characters=True, feature_names=feature_names)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    graph.write_pdf("{}.pdf".format(treeName))
    
    
def sortRowByRPI(matchup):
    if matchup["lRPI"] > matchup["wRPI"]:
        numCols = matchup.shape[0]
        newOrder = [0, 1, 4, 5, 2, 3] + list(range(22, numCols - 3)) + list(range(6,22)) + [numCols - 2, numCols - 3, numCols - 1]
        matchup = matchup[matchup.index[newOrder]]
        return list(matchup.values)
    return list(matchup.values)

def sortRowBySeed(matchup):
    if matchup["w_seed"] > matchup["l_seed"]:
        numCols = matchup.shape[0]
        newOrder = list(range(6)) + list(range(22, numCols - 3)) + list(range(6,22)) + [numCols - 2, numCols - 3, numCols - 1]
        matchup = matchup[matchup.index[newOrder]]
        return list(matchup.values)
    elif matchup["w_seed"] == matchup["l_seed"] and matchup["wRPI"] > matchup["lRPI"]:
        numCols = matchup.shape[0]
        newOrder = list(range(6)) + list(range(22, numCols - 3)) + list(range(6,22)) + [numCols - 2, numCols - 3, numCols - 1]
        matchup = matchup[matchup.index[newOrder]]
        return list(matchup.values)
    return list(matchup.values)

def executeSeedTiebreaker(matchup):
    if matchup["w_seed"] == matchup["l_seed"]:
        matchup["baseline"] = matchup["wRPI"] < matchup["lRPI"]
    return matchup

def findDeltasForMatch(matchup):
    numCols = matchup.shape[0]
    return list(matchup.iloc[0:6].values) + list(matchup.iloc[6:22].values - matchup.iloc[22:numCols - 3].values) + [matchup.iloc[numCols - 3] - matchup.iloc[numCols - 2]] + [matchup.iloc[numCols - 1]]
        

In [151]:
def sortMatchups(matchups):
#     matchups["baseline"] = matchups["wRPI"] < matchups["lRPI"]
    matchups["baseline"] = matchups["w_seed"] < matchups["l_seed"]
    matchups = matchups.apply(executeSeedTiebreaker, axis = 1, result_type = "broadcast")
    matchups["baseline"].replace(False, 0, inplace=True)
    matchups["baseline"].replace(True, 1, inplace=True)
    sortedMatchups = matchups.apply(sortRowBySeed, axis = 1, result_type = "broadcast")

    newColNames = []
    columns = sortedMatchups.columns
    for name in columns:
        if name[0] == 'l':
            newColNames.append("a" + name[1:])
        elif name[0] == "w":
            newColNames.append("b" + name[1:])
        else:
            newColNames.append(name)
    sortedMatchups.columns = newColNames
    return sortedMatchups

def findAllDeltas(matchups):
    # expand results to extract each item from list returned each iteration of apply function
    deltas = matchups.apply(findDeltasForMatch, axis = 1, result_type = "expand")
    colNames = list(matchups.columns[0:6]) + ["DRB", "EFG", "FTA", "FTP", "MOL", "MOV", "ORB", "POSS", "RPI",
                                             "TO", "TOF", "confTourneyWins", "dEff", "numGamesPlayed", "oEff",
                                             "winsVsTourney", "seed", "baseline"]
    deltas.columns = colNames
    return deltas

def mergeRankings(matchups):
    seeds = pd.read_csv("data/NCAATourneySeeds.csv")
    seeds["id"] = seeds.TeamID.astype(str).str.cat(seeds.Season.astype(str), sep='_')
    seeds["Seed"] = seeds.Seed.str.extract('(\d+)', expand=False).astype(int)
    seeds = seeds.drop(["Season", "TeamID"], axis = 1)
    temp = matchups.merge(seeds, how='left', left_on = "w_id", right_on = "id")
    temp = temp.rename(index=str, columns={"Seed": "w_seed"})
    temp = temp.merge(seeds, how='left', left_on = "l_id", right_on = "id")
    temp = temp.rename(index=str, columns = {"Seed": "l_seed"})
    merged = temp.drop(["id_x", "id_y"], axis = 1)
    return merged

In [47]:
import pandas as pd 
import numpy as np
import team, game as g
import Model as mod
from sklearn.ensemble import RandomForestClassifier
# Used for developing visual of Random Forest if desired
from sklearn.tree import export_graphviz
from sklearn.externals.six import StringIO
from IPython.display import Image
import pydotplus

def getTeamNames():
    """
    Return dictionary where key is team ID and value is team name
    """
    names = {}
    teams = pd.read_csv("Data/Teams.csv")
    for index, row in teams.iterrows():
        teamId = row["TeamID"]
        name = row["TeamName"]
        names[teamId] = name
    return names

def getSeasonStats(ncaaTourneyTeams):
    """
    Use regular season results and RPI rankings to create a 
    dictionary where key is the team's ID and the value is a 
    Team object. Team objects contain yearly avg stats for each 
    team in various categories.
    """
    teams = {}
    names = getTeamNames()
    unfiltRanks = pd.read_csv("data/MasseyOrdinals_Prelim2018.csv")
    rankings = unfiltRanks[(unfiltRanks["SystemName"] == "RPI") & (unfiltRanks["RankingDayNum"] == 133)]
    regSeasonResults = pd.read_csv("data/RegularSeasonDetailedResults.csv")
    for index, row in regSeasonResults.iterrows():
        season = row["Season"]
        dayNum = row["DayNum"]
        wTeamId = row["WTeamID"]
        lTeamId = row["LTeamID"]
        customWId = str(wTeamId) + "_" + str(season)
        customLId = str(lTeamId) + "_" + str(season)
        wRPI = None
        lRPI = None
        try:
            wRPI = rankings[(rankings["Season"] == season) & (rankings["TeamID"] == wTeamId)].iloc[0]["OrdinalRank"]
            lRPI = rankings[(rankings["Season"] == season) & (rankings["TeamID"] == lTeamId)].iloc[0]["OrdinalRank"]
        except Exception as e:
            pass
            # print str(lTeamId) + " " + str(season) + " not found"
        
        if customWId not in teams:
            teams[customWId] = team.Team(customWId)
        if customLId not in teams:
            teams[customLId] = team.Team(customLId)
        wTeam = teams[customWId]
        wTeam.RPI = wRPI
        wTeam.name = names[wTeamId]
        wTeam.updateStats(row, True)
        if customLId in ncaaTourneyTeams:
            wTeam.winsVsTourney += 1
        lTeam = teams[customLId]
        lTeam.name = names[lTeamId]
        lTeam.RPI = lRPI
        lTeam.updateStats(row, False)
    return teams

def getTeamStats(teams):
    """
    Get season stats for each team in a Data Frame. Able to see yearly averages and totals for stored
    statistical categories in each team object in teams dictionary
    """
    allTeamData = []
    for team in teams:
        idAndSeason = team.split("_")
        season = idAndSeason[1]
        _id = idAndSeason[0]
        teamData = teams[team].objToDict()
        teamData['id'] = _id
        teamData['season'] = season
        allTeamData.append(teamData)
    return pd.DataFrame(allTeamData)

def getOrdinals(teamStats):
    """
    Get DataFrame where team stats are converted to ordinal rankings, grouped by season
    Certain categories are ranked where the min of the group is the highest ranking (Ex: RPI)
    """
    toReturn = teamStats.copy()
    copy = toReturn.copy()
    copy['season'] = copy['season'].astype(int)
    copy.drop(copy.select_dtypes(['object']), inplace=True, axis=1)
    new = copy.groupby('season').rank(axis = 1, ascending = False, method = 'min')
    new.drop(columns=['numGamesPlayed'])
    copy.update(new)
    toSwap = ['RPI', 'MOL', 'TO', 'confTournWins', 'dEff', 'winsVsTourney']
    for item in toSwap:
        copy[item] = copy.groupby('season')[item].rank(ascending = False, method = 'max')
    toReturn.update(copy)
    return toReturn

def populateNCAATourneyTeams():
    """
    Create an ID for each team using a combination of its id and the season the team played in. 
    Output a dictionary with an entry for each team whose key is its newly created id
    """
    ncaaTourneyTeams = {}
    ncaaTournResults = pd.read_csv("data/NCAATourneyCompactResults.csv")
    for index, row in ncaaTournResults.iterrows():
        season = row["Season"]
        dayNum = row["DayNum"]
        wTeamId = row["WTeamID"]
        lTeamId = row["LTeamID"]
        customWId = str(wTeamId) + "_" + str(season)
        customLId = str(lTeamId) + "_" + str(season)

        if customWId not in ncaaTourneyTeams:
            ncaaTourneyTeams[customWId] = 1
        if customLId not in ncaaTourneyTeams:
            ncaaTourneyTeams[customLId] = 1
    return ncaaTourneyTeams

def getMatchups(teams):
    """
    Use NCAA Tournament results to return data frame of matchups where each row contains data for one matchup between two teams, including their yearly avg totals in statistical categories, RPI, and game result.
    """
    matchups = []
    ncaaTournResults = pd.read_csv("data/NCAATourneyCompactResults.csv")
    for index, row in ncaaTournResults.iterrows():
        season = row["Season"]
        dayNum = row["DayNum"]
        wTeamId = row["WTeamID"]
        lTeamId = row["LTeamID"]
        customWId = str(wTeamId) + "_" + str(season)
        customLId = str(lTeamId) + "_" + str(season)

        if customWId in teams and customLId in teams:
            wTeamData = teams[customWId].objToDict().copy()
            for key in wTeamData.keys():
                if key[0] == "w":
                    wTeamData[key] = wTeamData[key]
                else:
                    wTeamData["w" + key] = wTeamData[key]
                del wTeamData[key]
            lTeamData = teams[customLId].objToDict().copy()
            for key in lTeamData.keys():
                if key[0] == "l":
                    lTeamData[key] = lTeamData[key]
                else:
                    lTeamData["l" + key] = lTeamData[key]
                del lTeamData[key]
            matchupData = wTeamData.copy()
            matchupData.update(lTeamData)
            matchupData["dayNum"] = dayNum
            matchupData["season"] = season
            matchups.append(matchupData)
    colOrder = ["dayNum", "season", "l_id", "lname", "w_id", "wname", "lDRB", "lEFG", "lFTA", "lFTP", "lMOL", "lMOV", "lORB", "lPOSS",
                "lRPI", "lTO", "lTOF", "lconfTournWins", "ldEff", "lnumGamesPlayed", "loEff", "lwinsVsTourney",
                "wDRB", "wEFG", "wFTA", "wFTP", "wMOL", "wMOV", "wORB", "wPOSS", "wRPI", "wTO", "wTOF", 
                "wconfTournWins", "wdEff", "wnumGamesPlayed", "woEff", "wwinsVsTourney"]
    df = pd.DataFrame.from_dict(matchups)
    df = df[colOrder]
    return df

def getMatchupData():
    """
    Returns data frame of historical matchups in NCAA tournament.
    Reads in existing CSV if available. Otherwise, produces data frame by creating Team objects, calculating yearly avg totals for each team, and joining with historical NCAA tourney matchup data
    """
    try:
        matchups = pd.read_csv("Data/output/matchups.csv")
        return matchups
    except Exception as e:
        ncaaTourneyTeams = populateNCAATourneyTeams()
        teamObjs = getSeasonStats(ncaaTourneyTeams)
        matchups = getMatchups(teamObjs)
        matchups.to_csv("Data/output/matchups.csv", index=False)
        return matchups

In [13]:
ncaaTourneyTeams = populateNCAATourneyTeams()
teamObjs = getSeasonStats(ncaaTourneyTeams)

In [32]:
    matchups = getMatchupData()
    seededMatches = mergeRankings(matchups)
    sortedMatches = sortMatchups(seededMatches)

In [46]:
teamStats = getTeamStats(teamObjs)
ordinals = getOrdinals(teamStats)
ordinals


         DRB       EFG        FTA       FTP       MOL        MOV        ORB  \
0  23.928571  0.549180  20.928571  0.705168  4.426604  12.983510  13.571429   
1  24.966667  0.552083  18.600000  0.714351  6.357793  15.399183  12.133333   
2  25.965517  0.539216  22.896552  0.628299  3.711207  12.428413  14.068966   
3  26.896552  0.318750  23.620690  0.687824  7.712401  14.241855  14.310345   
4  24.071429  0.620370  23.607143  0.778503  4.381287  15.303645  13.107143   

        POSS   RPI         TO        TOF  confTournWins        dEff  \
0  65.264286  38.0  13.285714  13.857143              0   97.424497   
1  63.640000   3.0  11.800000  13.700000              3   92.663844   
2  68.882759  26.0  13.793103  15.068966              2   93.960911   
3  70.965517   9.0  13.620690  14.448276              2   98.529129   
4  66.157143  10.0  13.571429  12.500000              1  103.161002   

   numGamesPlayed        oEff  season  winsVsTourney  
0              28  103.628238    2003      

Unnamed: 0,DRB,EFG,FTA,FTP,MOL,MOV,ORB,POSS,RPI,TO,TOF,_id,confTournWins,dEff,id,name,numGamesPlayed,oEff,season,winsVsTourney
0,110.0,104.0,143.0,113.0,27.0,83.0,47.0,252.0,38.0,52.0,235.0,1104_2003,146.0,63.0,1104,Alabama,146.0,138.0,2003,307.0
1,50.0,101.0,251.0,93.0,61.0,44.0,130.0,294.0,3.0,13.0,248.0,1328_2003,318.0,16.0,1328,Oklahoma,28.0,41.0,2003,323.0
2,19.0,116.0,66.0,307.0,17.0,99.0,29.0,114.0,26.0,94.0,144.0,1272_2003,296.0,23.0,1272,Memphis,73.0,80.0,2003,293.0
3,7.0,327.0,57.0,179.0,87.0,61.0,24.0,46.0,9.0,83.0,190.0,1393_2003,296.0,89.0,1393,Syracuse,73.0,15.0,2003,307.0
4,101.0,39.0,58.0,4.0,26.0,46.0,69.0,228.0,10.0,76.0,305.0,1266_2003,233.0,198.0,1266,Marquette,146.0,1.0,2003,307.0
5,120.0,248.0,91.0,97.0,125.0,125.0,16.0,97.0,70.0,240.0,43.0,1437_2003,146.0,186.0,1437,Villanova,28.0,84.0,2003,262.0
6,195.0,83.0,84.0,262.0,227.0,197.0,74.0,230.0,117.0,282.0,189.0,1296_2003,296.0,246.0,1296,N Illinois,9.0,89.0,2003,120.0
7,108.0,99.0,114.0,305.0,215.0,215.0,136.0,163.0,183.0,146.0,100.0,1457_2003,146.0,132.0,1457,Winthrop,146.0,107.0,2003,262.0
8,14.0,127.0,51.0,87.0,9.0,47.0,3.0,100.0,4.0,68.0,214.0,1400_2003,233.0,88.0,1400,Texas,146.0,11.0,2003,293.0
9,67.0,287.0,113.0,91.0,47.0,136.0,92.0,116.0,5.0,10.0,259.0,1208_2003,233.0,236.0,1208,Georgia,230.0,8.0,2003,326.0
