### Men's NCAA Tournament Lab

Welcome! This lab introduces you to building features and scoring models on game data from the NCAA tournament.

When you're done, you should be able to engineer features from raw game results and use validation scores to judge whether they improve a predictive model.

**Step 1:** Import the files for the seeds, NCAA tournament games, and regular season games. Also import the CSV you exported in class for the initial one-variable model you fit.

In [264]:
import pandas as pd
import numpy as np

seeds = pd.read_csv('/Users/aoifeduna/AoifeRepo/aoiferepo/Lectures/Unit4/data/MNCAATourneySeeds.csv')
tourneyresults = pd.read_csv('/Users/aoifeduna/AoifeRepo/aoiferepo/Lectures/Unit4/data/MNCAATourneyCompactResults.csv')
regularresults = pd.read_csv('/Users/aoifeduna/AoifeRepo/aoiferepo/Lectures/Unit4/data/MRegularSeasonCompactResults.csv')
game_data = pd.read_csv('/Users/aoifeduna/AoifeRepo/aoiferepo/Lectures/Unit4/Class18/game_data.csv')

**Step 2:** Create a training and test set, with the test set comprising all games from 2015 onward. Use the CSV exported in class for this, since it's already prepped.

In [265]:
game_data.head()

Unnamed: 0,Season,T1TeamID,T1Score,T2TeamID,T2Score,Result,T1Seed,T2Seed,SeedDiff
0,1985,1116,63,1234,54,1,9,8,1
1,1985,1116,65,1385,68,0,9,1,8
2,1985,1120,59,1345,58,1,11,6,5
3,1985,1120,66,1242,64,1,11,3,8
4,1985,1120,56,1314,62,0,11,2,9


In [266]:
pre2015 = game_data.Season<2015
pre2015

0        True
1        True
2        True
3        True
4        True
        ...  
4485    False
4486    False
4487    False
4488    False
4489    False
Name: Season, Length: 4490, dtype: bool

In [267]:
train = game_data[pre2015].copy()
test = game_data[~pre2015].copy()

In [268]:
train.head()

Unnamed: 0,Season,T1TeamID,T1Score,T2TeamID,T2Score,Result,T1Seed,T2Seed,SeedDiff
0,1985,1116,63,1234,54,1,9,8,1
1,1985,1116,65,1385,68,0,9,1,8
2,1985,1120,59,1345,58,1,11,6,5
3,1985,1120,66,1242,64,1,11,3,8
4,1985,1120,56,1314,62,0,11,2,9


**Step 3:** Find an initial validation score for the one-feature seed model, using a `RandomForestClassifier` right out of the box.

 - Run KFold, using 10 splits
 - Just use the seed difference for X
 - FYI: The score being returned is prediction accuracy

In [269]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

X = train[['SeedDiff']]
y = train['Result']

rf = RandomForestClassifier()

scores = cross_val_score(estimator=rf, X=X, y=y, cv=10)



What is your initial validation score?

In [270]:
np.mean(scores)

0.7009304628085429
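
A note on the splitter: when `cv=10` is passed as an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling rather than a plain `KFold`. If a shuffled `KFold` is what you want, you can pass a splitter object instead. The sketch below shows one way to do that on a small synthetic stand-in for `train`; the `toy` frame, its values, and the `random_state` choices are assumptions made for illustration, so its score says nothing about the lab data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data (made-up values, not the lab CSV).
rng = np.random.default_rng(0)
seed_diff = rng.integers(-15, 16, size=200)
# SeedDiff is T1Seed - T2Seed, so a negative value means team 1 is the better seed.
result = (seed_diff + rng.normal(0, 6, size=200) < 0).astype(int)
toy = pd.DataFrame({'SeedDiff': seed_diff, 'Result': result})

X_toy = toy[['SeedDiff']]   # double brackets keep X two-dimensional
y_toy = toy['Result']

# Explicit 10-fold splitter with shuffling; random_state makes the folds repeatable.
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
rf_toy = RandomForestClassifier(n_estimators=50, random_state=42)

toy_scores = cross_val_score(rf_toy, X_toy, y_toy, cv=kfold)
print(toy_scores.mean())    # mean accuracy across the 10 folds
```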

**Step 4:** Create new data that captures the win-loss record of each team.

We're going to break this down into smaller steps to make it easier to digest

**a).** Use `groupby()` to group teams based on `Season` and `WTeamID` in the dataset for regular season games.  Apply the `count()` aggregator to one of the columns to determine how many games each team won.

In [271]:
regularresults.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0


In [272]:
regularresults.groupby(['Season', 'WTeamID'])['WScore'].count()

Season  WTeamID
1985    1102        5
        1103        9
        1104       21
        1106       10
        1108       19
                   ..
2019    1462       18
        1463       21
        1464       10
        1465       12
        1466        7
Name: WScore, Length: 11227, dtype: int64

In [273]:
regularresults.groupby(['Season', 'LTeamID'])['LScore'].count()

Season  LTeamID
1985    1102       19
        1103       14
        1104        9
        1106       14
        1108        6
                   ..
2019    1462       15
        1463        7
        1464       20
        1465       14
        1466       22
Name: LScore, Length: 11238, dtype: int64

**b).** Save the grouping from the previous step as its own variable, but with the following additions:

 - tack on the `reset_index()` method at the end -- note what this does
 - as an argument for the `reset_index()` method, pass in `name='Wins'`

In [274]:
wins = regularresults.groupby(['Season', 'WTeamID'])['WScore'].count().reset_index(name='Wins')
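
To see what `reset_index(name=...)` changes, it can help to run the same chain on a tiny table. The sketch below uses a made-up `toy_results` frame that only borrows the lab's column names; the team IDs and scores are invented.

```python
import pandas as pd

# Tiny made-up results table with the same column names as the lab data.
toy_results = pd.DataFrame({
    'Season':  [1985, 1985, 1985, 1986],
    'WTeamID': [1101, 1101, 1102, 1101],
    'WScore':  [70, 65, 80, 75],
})

# Before reset_index: a Series indexed by a (Season, WTeamID) MultiIndex.
counts = toy_results.groupby(['Season', 'WTeamID'])['WScore'].count()
print(type(counts).__name__, counts.index.names)

# After reset_index: a flat DataFrame whose value column is named 'Wins',
# with Season and WTeamID available as ordinary columns for merging.
toy_wins = counts.reset_index(name='Wins')
print(toy_wins)
```

The flat form matters later because `merge` joins on columns, and the group keys are only regular columns after the reset.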


**c).** Repeat steps `a` and `b`, but this time group on `LTeamID` and name the new column `Losses` instead of `Wins`.

In [275]:
losses = regularresults.groupby(['Season', 'LTeamID'])['LScore'].count().reset_index(name='Losses')

At this point -- look at the two variables you created, and make sure you can make sense of what they're telling you. You should have two separate dataframes that tell you how many wins and losses each team had in each season, from 1985 through the most recent season in the data.

**Step 5:** Merge your data back into your original data set

This can be a little tedious and time-consuming, but take care to get each merge right.

**Part 1:** Building Features for Team 1

**a).** How many games did team 1 win?

Do the following merge:

 - **left dataset:**  the exported csv file from class
 - **right dataset:** the data with each team's wins
 - **merge type:** left
 - **left columns to join:** `'Season'`, `'T1TeamID'`
 - **right columns to join:** `'Season'`, `'WTeamID'`
 - **new column name:** `'T1Wins'`

In [276]:
game_data = game_data.merge(wins, left_on=['Season', 'T1TeamID'], right_on=['Season', 'WTeamID'], how='left')
game_data.rename({'Wins': 'T1Wins'}, axis=1, inplace=True)  # rename by label rather than by position
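
The merge settings listed above can be tried on a toy example first. In this sketch, `toy_games` and `toy_wins` are made-up frames that reuse the lab's column names; the third game deliberately has no match in `toy_wins` to show what a left merge does with a missing key.

```python
import pandas as pd

# Made-up tournament games (left side) and win counts (right side).
toy_games = pd.DataFrame({
    'Season':   [1985, 1985, 1986],
    'T1TeamID': [1101, 1102, 1103],
})
toy_wins = pd.DataFrame({
    'Season':  [1985, 1985],
    'WTeamID': [1101, 1102],
    'Wins':    [20, 15],
})

# Left merge: every row of toy_games is kept, matched or not.
merged = toy_games.merge(
    toy_wins,
    left_on=['Season', 'T1TeamID'],
    right_on=['Season', 'WTeamID'],
    how='left',
)
merged = merged.rename(columns={'Wins': 'T1Wins'})
print(merged)
```

Two things to notice in the printed result: the right-hand key `WTeamID` comes along as an extra column, and the unmatched 1986 row gets `NaN` in `T1Wins` instead of being dropped.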

**b).** How many games did team 1 lose?

Do the following merge:

 - **left dataset:**  the exported csv file from class
 - **right dataset:** the data with each team's losses
 - **merge type:** left
 - **left columns to join:** `'Season'`, `'T1TeamID'`
 - **right columns to join:** `'Season'`, `'LTeamID'`
 - **new column name:** `'T1Losses'`

In [277]:
game_data = game_data.merge(losses, left_on=['Season', 'T1TeamID'], right_on=['Season', 'LTeamID'], how='left')
game_data.rename({'Losses': 'T1Losses'}, axis=1, inplace=True)

In [278]:
game_data

Unnamed: 0,Season,T1TeamID,T1Score,T2TeamID,T2Score,Result,T1Seed,T2Seed,SeedDiff,WTeamID,T1Wins,LTeamID,T1Losses
0,1985,1116,63,1234,54,1,9,8,1,1116,21,1116,12
1,1985,1116,65,1385,68,0,9,1,8,1116,21,1116,12
2,1985,1120,59,1345,58,1,11,6,5,1120,18,1120,11
3,1985,1120,66,1242,64,1,11,3,8,1120,18,1120,11
4,1985,1120,56,1314,62,0,11,2,9,1120,18,1120,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4485,2019,1243,64,1414,70,0,4,13,-9,1243,25,1243,8
4486,2019,1433,58,1416,73,0,8,9,-1,1433,25,1433,7
4487,2019,1205,56,1438,71,0,16,1,15,1205,20,1205,11
4488,2019,1387,52,1439,66,0,13,4,9,1387,23,1387,12


**c).** Some teams have gone undefeated. Those teams have no row in the losses data, so the left merge leaves their `T1Losses` value missing. Fill in these values with 0 now.

In [279]:
game_data['T1Losses'] = game_data['T1Losses'].fillna(0)  # assign back instead of filling in place on a column selection

**d).** You probably have some unnecessary columns right now.  Remove unnecessary columns created from the merges if they exist.  These are most likely going to be the `WTeamID` and `LTeamID` columns.

In [280]:
game_data.drop(['WTeamID', 'LTeamID'], axis=1, inplace=True)

**e).** Now create a new column called `T1WinPCT` that's the winning percentage of team 1.

In [281]:
game_data['T1WinPCT'] = game_data['T1Wins']/(game_data['T1Wins']+game_data['T1Losses'])
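
The order of the fill and the division matters, because a missing loss count would otherwise propagate into the percentage. A minimal sketch on made-up numbers (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'T1Wins':   [21, 30],
    'T1Losses': [12, np.nan],   # NaN: the team had no row in the losses table
})

# Without the fill, the undefeated team's percentage is NaN rather than 1.0.
pct_unfilled = toy['T1Wins'] / (toy['T1Wins'] + toy['T1Losses'])

toy['T1Losses'] = toy['T1Losses'].fillna(0)
toy['T1WinPCT'] = toy['T1Wins'] / (toy['T1Wins'] + toy['T1Losses'])
print(toy)
```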

In [282]:
game_data

Unnamed: 0,Season,T1TeamID,T1Score,T2TeamID,T2Score,Result,T1Seed,T2Seed,SeedDiff,T1Wins,T1Losses,T1WinPCT
0,1985,1116,63,1234,54,1,9,8,1,21,12,0.636364
1,1985,1116,65,1385,68,0,9,1,8,21,12,0.636364
2,1985,1120,59,1345,58,1,11,6,5,18,11,0.620690
3,1985,1120,66,1242,64,1,11,3,8,18,11,0.620690
4,1985,1120,56,1314,62,0,11,2,9,18,11,0.620690
...,...,...,...,...,...,...,...,...,...,...,...,...
4485,2019,1243,64,1414,70,0,4,13,-9,25,8,0.757576
4486,2019,1433,58,1416,73,0,8,9,-1,25,7,0.781250
4487,2019,1205,56,1438,71,0,16,1,15,20,11,0.645161
4488,2019,1387,52,1439,66,0,13,4,9,23,12,0.657143


**Part 2:** Build the same features for Team 2

Your turn: try to recreate the same features you just created for the first team, but for the second team. Remember to fill in missing losses with 0 here too.

**Hint:** In your original dataset, swap out `T1TeamID` for `T2TeamID` for the merges.

In [283]:
game_data = game_data.merge(wins, left_on=['Season', 'T2TeamID'], right_on=['Season', 'WTeamID'], how='left')
game_data.rename({'Wins': 'T2Wins'}, axis=1, inplace=True)  # rename by label rather than by position

In [284]:
game_data = game_data.merge(losses, left_on=['Season', 'T2TeamID'], right_on=['Season', 'LTeamID'], how='left')
game_data.rename({'Losses': 'T2Losses'}, axis=1, inplace=True)
game_data['T2Losses'] = game_data['T2Losses'].fillna(0)  # undefeated teams have no row in losses

In [285]:
game_data.drop(['WTeamID', 'LTeamID'], axis=1, inplace=True)

In [287]:
game_data['T2WinPCT'] = game_data['T2Wins']/(game_data['T2Wins']+game_data['T2Losses'])

**Step 6:** Recreate your training and test sets from the updated `game_data`, using the same criteria as before

In [297]:
train = game_data[pre2015].copy()
test = game_data[~pre2015].copy()

In [298]:
game_data.head()

Unnamed: 0,Season,T1TeamID,T1Score,T2TeamID,T2Score,Result,T1Seed,T2Seed,SeedDiff,T1Wins,T1Losses,T1WinPCT,T2Wins,T2Losses,T2WinPCT
0,1985,1116,63,1234,54,1,9,8,1,21,12,0.636364,20,10.0,0.666667
1,1985,1116,65,1385,68,0,9,1,8,21,12,0.636364,27,3.0,0.9
2,1985,1120,59,1345,58,1,11,6,5,18,11,0.62069,17,8.0,0.68
3,1985,1120,66,1242,64,1,11,3,8,18,11,0.62069,23,7.0,0.766667
4,1985,1120,56,1314,62,0,11,2,9,18,11,0.62069,23,8.0,0.741935


**Step 7:** Recreate `X` and `y`, except this time include the new features that you added -- wins and losses for each team, as well as their winning percentages

In [299]:
# Select the feature columns from train; a bare nested list of names is not a DataFrame
X = train[['SeedDiff', 'T1Wins', 'T1Losses', 'T1WinPCT', 'T2Wins', 'T2Losses', 'T2WinPCT']]
y = train['Result']

**Step 8:** Re-check your validation scores with the new data, using the same conditions that we did in the previous step.  See if your validation scores improved at all.

In [300]:
scores = cross_val_score(estimator=rf, X=X, y=y, cv=10)
np.mean(scores)

Did your results improve?

In [292]:
# your answer here

**Step 9:** Of the two versions of our model that we just tested, take the better one, fit your random forest on its training data, and then score it on your test set to see how your final results come out.

In [293]:
# your answer here
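
As a rough guide to the shape of this step, the sketch below fits and scores a forest on synthetic data. Everything in it is an assumption for illustration: the `toy` frame, the single `SeedDiff` feature, and the hyperparameters. In the lab you would use your own `train` and `test` frames and whichever feature list validated better.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for game_data (made-up values).
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    'Season': rng.integers(2010, 2020, size=n),
    'SeedDiff': rng.integers(-15, 16, size=n),
})
toy['Result'] = (toy['SeedDiff'] + rng.normal(0, 6, size=n) < 0).astype(int)

# Same split rule as the lab: seasons before 2015 train, the rest test.
is_train = toy['Season'] < 2015
toy_train, toy_test = toy[is_train], toy[~is_train]

features = ['SeedDiff']   # replace with the better-scoring feature list
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(toy_train[features], toy_train['Result'])

# score() returns accuracy for classifiers, the same metric as cross_val_score's default.
test_accuracy = model.score(toy_test[features], toy_test['Result'])
print(test_accuracy)
```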

**Step 10:** How close were your validation and test results? That is, how reliable were our validation results?

**Bonus:** If time permits, you can try a few variations on what we just did to continue to improve your results, including:

 - Adding more features beyond each team's winning percentage (perhaps average point differential would be more informative)
 - Using a grid search to find the best parameters of your random forest and seeing how that improves your results
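
For the point-differential idea, one possible approach is to stack each game twice, once from the winner's side and once from the loser's, and then average per team and season. The sketch below does this on a made-up `toy_results` frame that reuses the regular season column names; `TeamID`, `Margin`, and `AvgMargin` are names introduced here for illustration.

```python
import pandas as pd

# Made-up regular season results using the lab's column names.
toy_results = pd.DataFrame({
    'Season':  [1985, 1985, 1985],
    'WTeamID': [1101, 1102, 1101],
    'WScore':  [70, 80, 60],
    'LTeamID': [1102, 1103, 1103],
    'LScore':  [60, 75, 58],
})

margin = toy_results['WScore'] - toy_results['LScore']

# One row per (team, game): winners get +margin, losers get -margin.
winners = pd.DataFrame({'Season': toy_results['Season'],
                        'TeamID': toy_results['WTeamID'],
                        'Margin': margin})
losers = pd.DataFrame({'Season': toy_results['Season'],
                       'TeamID': toy_results['LTeamID'],
                       'Margin': -margin})
per_game = pd.concat([winners, losers], ignore_index=True)

avg_margin = (per_game.groupby(['Season', 'TeamID'])['Margin']
              .mean()
              .reset_index(name='AvgMargin'))
print(avg_margin)
```

The result has one row per team and season, so it can be merged onto `game_data` the same way as the wins and losses tables.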

In [294]:
# your answer here
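
For the grid search, scikit-learn's `GridSearchCV` cross-validates every combination in a parameter grid and keeps the best one. The sketch below runs on synthetic data; the grid values, `n_estimators=25`, and `cv=5` are arbitrary choices to keep the example fast, not recommended settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X and y (made-up values).
rng = np.random.default_rng(2)
X_toy = pd.DataFrame({'SeedDiff': rng.integers(-15, 16, size=200)})
y_toy = (X_toy['SeedDiff'] + rng.normal(0, 6, size=200) < 0).astype(int)

# Small illustrative grid: 3 x 2 = 6 candidate settings.
param_grid = {
    'max_depth': [2, 4, None],
    'min_samples_leaf': [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid=param_grid,
    cv=5,
)
search.fit(X_toy, y_toy)

print(search.best_params_)   # the winning combination
print(search.best_score_)    # its mean cross-validated accuracy
```

Run the search on the training data only, so that the test set from Step 9 still gives an honest final check.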