# MARCH MADNESS MACHINE

### 2.0 Updates:
- Added playoff game dataset (MNCAATourneyDetailedResults.csv)
    - Using this as the train/test set instead of ALL games

This notebook is an attempt to predict March Madness brackets with machine learning. The tournament starts less than two weeks after the start of this project, so one goal is to keep it simple; there is limited time for in-depth analysis and exploration. Through this experiment I hope to use very limited information to build a simplified model that still produces results. Very often, March Madness models try to examine every excruciating detail and statistic to perfectly predict the outcomes. This is impossible: there are too many unknowns in the world of sports, especially March Madness, for any model to predict perfectly, and most of the time chasing that goal leads to overfitting. Instead of looking for every possible detail, I aim to test whether a quick and simple model can perform better than the average, educated NCAA basketball fan.

Once a prediction method is created, it will be tested against other "testing" brackets. I will create two brackets with little knowledge about NCAA basketball. These two brackets will mainly consist of picking based on seeds, while accounting for some upsets. These will be the "guess" brackets. Next, I will create two more brackets after carefully examining expert opinions and rankings. These will be the "educated guess" brackets. Finally, I will pick the two best models from the experiment and create two brackets from their results. I will then examine which brackets perform best and see whether the randomness of March Madness can create chaos within the experiment. The results remain to be seen.

In [2]:
#All necessary imports
import sklearn
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_regression
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression, SGDClassifier, ElasticNet
from sklearn import tree
from sklearn.neighbors import NearestNeighbors, KNeighborsRegressor, KNeighborsClassifier, NearestCentroid
from sklearn.naive_bayes import GaussianNB, MultinomialNB, ComplementNB
from sklearn.svm import SVC, LinearSVC, LinearSVR
from sklearn.ensemble import GradientBoostingRegressor

# STEP 1: EXPLORING THE DATA

For starters, I plan on using data from the Kaggle March Madness competition. Most of the data is presented in a very clean format and will eliminate the need for data scraping and excess data cleaning to save time. However, many other datasets exist that could provide useful information and I plan to be more prepared in future years with more time to explore this data. For now, the Kaggle data will be the main source.

The first dataset of interest is the regular season detailed results data, which contains regular season statistics all the way back to 2003. 

In [3]:
season_stats = pd.read_csv("MRegularSeasonDetailedResults.csv")
season_stats.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,...,LFGA3,LFTM,LFTA,LOR,LDR,LAst,LTO,LStl,LBlk,LPF
0,2003,10,1104,68,1328,62,N,0,27,58,...,10,16,22,10,22,8,18,9,2,20
1,2003,10,1272,70,1393,63,N,0,26,62,...,24,9,20,20,25,7,12,8,6,16
2,2003,11,1266,73,1437,61,N,0,24,58,...,26,14,23,31,22,9,12,2,5,23
3,2003,11,1296,56,1457,50,N,0,18,38,...,22,8,15,17,20,9,19,4,3,23
4,2003,11,1400,77,1208,71,N,0,30,61,...,16,17,27,21,15,12,10,7,1,14


Some of these columns could use some explanation. They are split into winning-team (W) and losing-team (L) stats. For the winning team, the stats are as follows:


WFGM - field goals made (by the winning team)

WFGA - field goals attempted (by the winning team)

WFGM3 - three pointers made (by the winning team)

WFGA3 - three pointers attempted (by the winning team)

WFTM - free throws made (by the winning team)

WFTA - free throws attempted (by the winning team)

WOR - offensive rebounds (pulled by the winning team)

WDR - defensive rebounds (pulled by the winning team)

WAst - assists (by the winning team)

WTO - turnovers committed (by the winning team)

WStl - steals (accomplished by the winning team)

WBlk - blocks (accomplished by the winning team)

WPF - personal fouls committed (by the winning team)


The stats for the losing team would be the same except the W at the start would be replaced by an L.

Using this dataframe alone, there is plenty of information to build a simple model. 

In [14]:
playoff_games = pd.read_csv("MNCAATourneyDetailedResults.csv")
playoff_games.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,...,LFGA3,LFTM,LFTA,LOR,LDR,LAst,LTO,LStl,LBlk,LPF
0,2003,134,1421,92,1411,84,N,1,32,69,...,31,14,31,17,28,16,15,5,0,22
1,2003,136,1112,80,1436,51,N,0,31,66,...,16,7,7,8,26,12,17,10,3,15
2,2003,136,1113,84,1272,71,N,0,31,59,...,28,14,21,20,22,11,12,2,5,18
3,2003,136,1141,79,1166,73,N,0,29,53,...,17,12,17,14,17,20,21,6,6,21
4,2003,136,1143,76,1301,74,N,1,27,64,...,21,15,20,10,26,16,14,5,8,19


In [20]:
playoff_stat = playoff_games[["Season", "WTeamID", "LTeamID", "WScore", "LScore", "WStl", "LStl", "WOR", "WDR", "LOR", "LDR", "WTO", "LTO"]]
print(playoff_stat.tail())

#The tournament data shown here ends in 2021, so dropping 2022 should leave it the same
no2022_playoff_stat = playoff_stat[playoff_stat['Season'] != 2022]
no2022_playoff_stat.tail()

      Season  WTeamID  LTeamID  WScore  LScore  WStl  LStl  WOR  WDR  LOR  \
1176    2021     1211     1425      85      66     6     7   11   27    7   
1177    2021     1417     1276      51      49     5     5    6   21    8   
1178    2021     1124     1222      78      59     6     4   11   17   13   
1179    2021     1211     1417      93      90     8     4    4   19    7   
1180    2021     1124     1211      86      70     8     4   14   20    1   

      LDR  WTO  LTO  
1176   20    9    9  
1177   24    8   14  
1178   12    8   10  
1179   24   10    9  
1180   16    7   14  


Unnamed: 0,Season,WTeamID,LTeamID,WScore,LScore,WStl,LStl,WOR,WDR,LOR,LDR,WTO,LTO
1176,2021,1211,1425,85,66,6,7,11,27,7,20,9,9
1177,2021,1417,1276,51,49,5,5,6,21,8,24,8,14
1178,2021,1124,1222,78,59,6,4,11,17,13,12,8,10
1179,2021,1211,1417,93,90,8,4,4,19,7,24,10,9
1180,2021,1124,1211,86,70,8,4,14,20,1,16,7,14


In [4]:
team_names = pd.read_csv("MTeams.csv")
team_names.head()

Unnamed: 0,TeamID,TeamName,FirstD1Season,LastD1Season
0,1101,Abilene Chr,2014,2022
1,1102,Air Force,1985,2022
2,1103,Akron,1985,2022
3,1104,Alabama,1985,2022
4,1105,Alabama A&M,2000,2022


The above dataframe contains all the team names and team IDs, which will be useful for identifying teams in later steps.

In [5]:
rankings = pd.read_csv("MMasseyOrdinals.csv")
rankings.tail()

Unnamed: 0,Season,RankingDayNum,SystemName,TeamID,OrdinalRank
4521715,2022,100,WOL,1468,183
4521716,2022,100,WOL,1469,259
4521717,2022,100,WOL,1470,209
4521718,2022,100,WOL,1471,270
4521719,2022,100,WOL,1472,296


The above dataframe contains team rankings from different rating systems on different days. I expect the rankings right before the start of the NCAA tournament to matter most, rather than earlier ones. These occur on `RankingDayNum` 133.

In [6]:
final_rankings = rankings[rankings['RankingDayNum'] == 133]
final_rankings.head()
print(final_rankings['SystemName'].value_counts())
print("All rating systems: ", final_rankings['SystemName'].unique())

MOR    6180
SAG    6180
WLK    6177
POM    6177
DOL    6176
       ... 
REI     326
RM      326
PH      326
D1A     200
TRX     120
Name: SystemName, Length: 172, dtype: int64
All rating systems:  ['AP' 'ARG' 'BIH' 'BOB' 'BRZ' 'COL' 'DOL' 'DUN' 'DWH' 'ECK' 'ENT' 'ERD'
 'GRN' 'GRS' 'HER' 'HOL' 'IMS' 'MAS' 'MKV' 'MOR' 'POM' 'RPI' 'RTH' 'SAG'
 'SAU' 'SE' 'SEL' 'STR' 'TSR' 'USA' 'WLK' 'WOB' 'WOL' 'WTE' 'BD' 'CNG'
 'DES' 'JON' 'LYN' 'MGY' 'NOR' 'REI' 'RM' 'SIM' 'ACU' 'BCM' 'CMV' 'DC'
 'KLK' 'REN' 'RIS' 'ROH' 'SAP' 'SCR' 'WIL' 'DOK' 'JCI' 'KPK' 'MB' 'PH'
 'PIG' 'PKL' 'TRX' 'CPR' 'ISR' 'KRA' 'LYD' 'RTR' 'UCS' 'BKM' 'CPA' 'JEN'
 'PGH' 'REW' 'RSE' 'SPW' 'STH' 'BPI' 'DC2' 'DCI' 'HKB' 'LMC' 'NOL' 'OMY'
 'RTB' 'KEL' 'KMV' 'RT' 'TW' 'AUS' 'KOS' 'PEQ' 'PTS' 'ROG' 'RTP' 'TMR'
 '7OT' 'ADE' 'BBT' 'BNM' 'BUR' 'CJB' 'CRO' 'EBP' 'HAT' 'MSX' 'SFX' 'TBD'
 'BLS' 'D1A' 'DII' 'KBM' 'TPR' 'MvG' 'PPR' 'SP' 'SPR' 'STF' 'STS' 'TRP'
 'UPS' 'WMR' 'BWE' 'LOG' 'TRK' 'DAV' 'FAS' 'FSH' 'HAS' 'HRN' 'KPI' 'MCL'
 'CRW' 'DD
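Filtering on exactly day 133 works for completed seasons, but a season still in progress has no day-133 rows yet. As a rough sketch of one alternative, the latest ranking on or before a cutoff day can be taken per season, system, and team. The frame `toy`, the constant `CUTOFF_DAY`, and the values below are made up for illustration; only the column names come from `MMasseyOrdinals.csv`.

```python
import pandas as pd

# Toy stand-in for the rankings frame (made-up values, real column names).
toy = pd.DataFrame({
    "Season":        [2021, 2021, 2021, 2022, 2022],
    "RankingDayNum": [128, 133, 140, 93, 100],
    "SystemName":    ["POM"] * 5,
    "TeamID":        [1211] * 5,
    "OrdinalRank":   [3, 1, 2, 2, 4],
})

CUTOFF_DAY = 133  # assumed cutoff: the last ranking day before the tournament

# Keep rankings published on or before the cutoff, then take the most
# recent one for each (season, system, team).
eligible = toy[toy["RankingDayNum"] <= CUTOFF_DAY]
latest = (
    eligible.sort_values("RankingDayNum")
    .groupby(["Season", "SystemName", "TeamID"], as_index=False)
    .last()
)
print(latest)
```

For a finished season this picks the day-133 row when it exists; for an unfinished one it falls back to the newest ranking available.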

Looks like there are a bunch of different rating systems out there. The Pomeroy (POM) rating is one of the most popular and considered one of the best. I plan to use this as an extra feature. 

In [7]:
pom = final_rankings[final_rankings['SystemName'] == 'POM']
print(pom.head())

#I decided to prepare some other ratings systems as well and, time-permitting, may add them to the model
mor = final_rankings[final_rankings['SystemName'] == 'MOR']
sag = final_rankings[final_rankings['SystemName'] == 'SAG']
wlk = final_rankings[final_rankings['SystemName'] == 'WLK']
dol = final_rankings[final_rankings['SystemName'] == 'DOL']

        Season  RankingDayNum SystemName  TeamID  OrdinalRank
127259    2003            133        POM    1102          160
127260    2003            133        POM    1103          163
127261    2003            133        POM    1104           33
127262    2003            133        POM    1105          307
127263    2003            133        POM    1106          263


# STEP 2: FEATURE SELECTION

Now that our preliminary data is selected, it is time to decide what features we want our model to contain.

The list of features includes:

- Number of wins
- Average points scored
- Average points against
- POM, MOR, SAG, WLK, and DOL rankings
- Average steals
- Average rebounds
- Average turnovers

I decided to use more than just the KenPom (POM) rankings and to include the MOR, SAG, WLK, and DOL rankings as well. This should provide more variety than relying only on the "most popular" POM rankings that are commonly referred to when making brackets.

Now to clean up that `season_stats` dataframe to only the stats we are going to use...

In [8]:
season_stat = season_stats[["Season", "WTeamID", "LTeamID", "WScore", "LScore", "WStl", "LStl", "WOR", "WDR", "LOR", "LDR", "WTO", "LTO"]]
print(season_stat.tail())

#need season stats without 2022 season, since rankings have not reached day 133 yet. This will be used until day 133 results arrive
no2022_season_stat = season_stat[season_stat['Season'] != 2022]
print(no2022_season_stat.head())

        Season  WTeamID  LTeamID  WScore  LScore  WStl  LStl  WOR  WDR  LOR  \
100418    2022     1400     1242      79      76     7     3   14   18    5   
100419    2022     1411     1126      66      63    10    13   12   27    5   
100420    2022     1422     1441      68      49    11     8   11   22   10   
100421    2022     1438     1181      69      68    10     3   10   20   11   
100422    2022     1439     1338      74      47     6     3    9   26    0   

        LDR  WTO  LTO  
100418   24    6   15  
100419   23   19   19  
100420   18   15   16  
100421   25    5   14  
100422   18    8   11  
   Season  WTeamID  LTeamID  WScore  LScore  WStl  LStl  WOR  WDR  LOR  LDR  \
0    2003     1104     1328      68      62     7     9   14   24   10   22   
1    2003     1272     1393      70      63     4     8   15   28   20   25   
2    2003     1266     1437      73      61     5     2   17   26   31   22   
3    2003     1296     1457      56      50    14     4    6   19

In [17]:
#function to get team name from ID and vice versa
def getTeamName(teamID):
    name = team_names[team_names['TeamID'] == teamID]
    return name.iloc[0]['TeamName']

def getTeamID(teamName):
    teamID = team_names[team_names['TeamName'] == teamName]
    return teamID.iloc[0]['TeamID']

#verifying they work
print(getTeamName(1234))
print(getTeamID('Iowa'))

Iowa
1234


In [10]:
#Get functions for all stat features and testing on Gonzaga 2019
def getNumWins(teamID, year):
    all_wins = season_stat[(season_stat["WTeamID"] == teamID) & (season_stat["Season"] == year)]
    wins = all_wins.shape[0]
    return wins

print("Gonzaga Wins 2019: ", getNumWins(1211, 2019))

def getPointsScored(teamID, year):
    win_games = season_stat[(season_stat["WTeamID"] == teamID) & (season_stat["Season"] == year)]
    lose_games = season_stat[(season_stat["LTeamID"] == teamID) & (season_stat["Season"] == year)]
    points = (win_games["WScore"].sum() + lose_games["LScore"].sum()) / (win_games.shape[0] + lose_games.shape[0])
    return points

print("Gonzaga Average Points Scored 2019: ", getPointsScored(1211, 2019))

def getPointsAgainst(teamID, year):
    win_games = season_stat[(season_stat["WTeamID"] == teamID) & (season_stat["Season"] == year)]
    lose_games = season_stat[(season_stat["LTeamID"] == teamID) & (season_stat["Season"] == year)]
    points = (win_games["LScore"].sum() + lose_games["WScore"].sum()) / (win_games.shape[0] + lose_games.shape[0])
    return points

print("Gonzaga Average Points Against 2019: ", getPointsAgainst(1211, 2019))

def getSteals(teamID, year):
    win_games = season_stat[(season_stat["WTeamID"] == teamID) & (season_stat["Season"] == year)]
    lose_games = season_stat[(season_stat["LTeamID"] == teamID) & (season_stat["Season"] == year)]
    steals = (win_games["WStl"].sum() + lose_games["LStl"].sum()) / (win_games.shape[0] + lose_games.shape[0])
    return steals

print("Gonzaga Average Steals 2019: ", getSteals(1211, 2019))

def getRebounds(teamID, year):
    win_games = season_stat[(season_stat["WTeamID"] == teamID) & (season_stat["Season"] == year)]
    lose_games = season_stat[(season_stat["LTeamID"] == teamID) & (season_stat["Season"] == year)]
    rebounds = (win_games["WOR"].sum() + win_games["WDR"].sum() + lose_games["LOR"].sum() + lose_games["LDR"].sum()) / (win_games.shape[0] + lose_games.shape[0])
    return rebounds

print("Gonzaga Average Rebounds 2019: ", getRebounds(1211, 2019))

def getTurnovers(teamID, year):
    win_games = season_stat[(season_stat["WTeamID"] == teamID) & (season_stat["Season"] == year)]
    lose_games = season_stat[(season_stat["LTeamID"] == teamID) & (season_stat["Season"] == year)]
    turnovers = (win_games["WTO"].sum() + lose_games["LTO"].sum()) / (win_games.shape[0] + lose_games.shape[0])
    return turnovers

print("Gonzaga Average Turnovers 2019: ", getTurnovers(1211, 2019))

Gonzaga Wins 2019:  30
Gonzaga Average Points Scored 2019:  88.84848484848484
Gonzaga Average Points Against 2019:  65.06060606060606
Gonzaga Average Steals 2019:  7.545454545454546
Gonzaga Average Rebounds 2019:  38.63636363636363
Gonzaga Average Turnovers 2019:  10.363636363636363
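Each function above rescans the whole dataframe for one team and season, which gets slow when it is called for every game. A minimal sketch of a vectorized alternative is shown here: stack the winner's and loser's view of each game into one long table, then aggregate once with `groupby`. The frame `games`, the helper names, and all values are made up for illustration; only the column names match `season_stat`.

```python
import pandas as pd

# Toy stand-in for season_stat (made-up values, real column names).
games = pd.DataFrame({
    "Season":  [2019, 2019, 2019],
    "WTeamID": [1, 1, 2],
    "LTeamID": [2, 3, 1],
    "WScore":  [80, 70, 75],
    "LScore":  [60, 65, 72],
    "WStl":    [8, 6, 5],
    "LStl":    [4, 7, 9],
})

# One row per (team, game): first from the winner's side, then the loser's.
winners = games.rename(columns={"WTeamID": "TeamID", "WScore": "Scored",
                                "LScore": "Allowed", "WStl": "Stl"})
winners["Win"] = 1
losers = games.rename(columns={"LTeamID": "TeamID", "LScore": "Scored",
                               "WScore": "Allowed", "LStl": "Stl"})
losers["Win"] = 0

cols = ["Season", "TeamID", "Scored", "Allowed", "Stl", "Win"]
per_team = pd.concat([winners[cols], losers[cols]], ignore_index=True)

# A single pass produces every team-season average at once.
team_stats = per_team.groupby(["Season", "TeamID"]).agg(
    wins=("Win", "sum"),
    points_scored=("Scored", "mean"),
    points_against=("Allowed", "mean"),
    steals=("Stl", "mean"),
)
print(team_stats)
```

Rebounds and turnovers would follow the same pattern with their own renamed columns.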


In [11]:
#Get functions for the POM and more ratings
def getPOM(teamID, year):
    rank_line = pom[(pom["TeamID"] == teamID) & (pom["Season"] == year)]
    rank = rank_line.iloc[0]["OrdinalRank"]
    return rank

print("Gonzaga POM ranking 2019: ", getPOM(1211, 2019))

def getMOR(teamID, year):
    rank_line = mor[(mor["TeamID"] == teamID) & (mor["Season"] == year)]
    rank = rank_line.iloc[0]["OrdinalRank"]
    return rank

print("Gonzaga MOR ranking 2019: ", getMOR(1211, 2019))

def getSAG(teamID, year):
    rank_line = sag[(sag["TeamID"] == teamID) & (sag["Season"] == year)]
    rank = rank_line.iloc[0]["OrdinalRank"]
    return rank

print("Gonzaga SAG ranking 2019: ", getSAG(1211, 2019))

def getWLK(teamID, year):
    rank_line = wlk[(wlk["TeamID"] == teamID) & (wlk["Season"] == year)]
    rank = rank_line.iloc[0]["OrdinalRank"]
    return rank

print("Gonzaga WLK ranking 2019: ", getWLK(1211, 2019))

def getDOL(teamID, year):
    rank_line = dol[(dol["TeamID"] == teamID) & (dol["Season"] == year)]
    rank = rank_line.iloc[0]["OrdinalRank"]
    return rank

print("Gonzaga DOL ranking 2019: ", getDOL(1211, 2019))


Gonzaga POM ranking 2019:  2
Gonzaga MOR ranking 2019:  2
Gonzaga SAG ranking 2019:  5
Gonzaga WLK ranking 2019:  5
Gonzaga DOL ranking 2019:  7
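The five ranking getters above differ only in the system name. One possible way to collapse them, sketched here on toy data, is to pivot the rankings into a table with one row per team-season and one column per system. `toy_rankings`, `SYSTEMS`, `rank_table`, and `get_ranks` are hypothetical names, and the ranks are made up.

```python
import pandas as pd

# Toy stand-in for final_rankings (made-up values, real column names).
toy_rankings = pd.DataFrame({
    "Season":        [2019, 2019, 2019, 2019],
    "RankingDayNum": [133, 133, 133, 133],
    "SystemName":    ["POM", "MOR", "POM", "MOR"],
    "TeamID":        [1211, 1211, 1234, 1234],
    "OrdinalRank":   [2, 3, 36, 34],
})

SYSTEMS = ["POM", "MOR"]  # extend with "SAG", "WLK", "DOL" on the real data

# One row per (Season, TeamID), one column per rating system.
rank_table = (
    toy_rankings[toy_rankings["SystemName"].isin(SYSTEMS)]
    .pivot_table(index=["Season", "TeamID"], columns="SystemName",
                 values="OrdinalRank")
)

def get_ranks(team_id, year):
    """Hypothetical helper: ranks for one team-season, in SYSTEMS order.

    Raises KeyError if the team has no ranking row at all for that season;
    a team missing from only some systems gets NaN for those systems.
    """
    return rank_table.loc[(year, team_id), SYSTEMS].tolist()

print(get_ranks(1211, 2019))
```

Note that `pivot_table` averages by default, so the ranks come back as floats.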


In [12]:
#create team vectors
def getTeamVector(teamID, year):
    wins = getNumWins(teamID, year)
    points_scored = getPointsScored(teamID, year)
    points_against = getPointsAgainst(teamID, year)
    steals = getSteals(teamID, year)
    rebounds = getRebounds(teamID, year)
    turnovers = getTurnovers(teamID, year)
    pom = getPOM(teamID, year)
    mor = getMOR(teamID, year)
    sag = getSAG(teamID, year)
    wlk = getWLK(teamID, year)
    dol = getDOL(teamID, year)
    return [wins, points_scored, points_against, steals, rebounds, turnovers, pom, mor, sag, wlk, dol]

#test for Gonzaga 2019 vector
print(getTeamVector(1211, 2019))

[30, 88.84848484848484, 65.06060606060606, 7.545454545454546, 38.63636363636363, 10.363636363636363, 2, 2, 5, 5, 7]


In [13]:
def getDifferenceVector(team1_ID, team2_ID, year):
    team1 = getTeamVector(team1_ID, year)
    team2 = getTeamVector(team2_ID, year)
    difference = np.subtract(team1, team2)
    return difference

print("Gonzaga Vector: ",  '\n', getTeamVector(1211, 2019))
print("Iowa Vector: ", '\n', getTeamVector(1234, 2019))
print("Difference Vector: ", '\n', (getDifferenceVector(1211, 1234, 2019)))

Gonzaga Vector:  
 [30, 88.84848484848484, 65.06060606060606, 7.545454545454546, 38.63636363636363, 10.363636363636363, 2, 2, 5, 5, 7]
Iowa Vector:  
 [22, 78.3030303030303, 73.63636363636364, 6.181818181818182, 35.72727272727273, 12.151515151515152, 36, 34, 42, 37, 24]
Difference Vector:  
 [  8.          10.54545455  -8.57575758   1.36363636   2.90909091
  -1.78787879 -34.         -32.         -37.         -32.
 -17.        ]
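One property of the difference vector is worth spelling out because the training set depends on it: swapping the two teams flips the sign of every feature. A small sketch with made-up team vectors (the four features and their values are illustrative, not real team data):

```python
import numpy as np

# Two made-up team vectors: [wins, avg points scored, avg points against, ranking]
team_a = np.array([30, 88.8, 65.1, 2])
team_b = np.array([22, 78.3, 73.6, 36])

a_minus_b = np.subtract(team_a, team_b)
b_minus_a = np.subtract(team_b, team_a)

# Swapping the two teams negates the whole vector.
print(np.allclose(a_minus_b, -b_minus_a))

# For ranking columns lower is better, so a negative difference favors team A.
print(a_minus_b[-1])
```

This means a game seen from the loser's side is just the negated winner-minus-loser vector, with no need to recompute anything.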


# Step 3: Create Training Set

I can now create a training set using the previous functions. I added exception handling in the code because rankings are missing for some teams. The teams with missing rankings are smaller teams that would not make the March Madness tournament anyway, so losing these rows of data most likely will not affect the results very much.

In [21]:
#create training set from the tournament games
# NOTE: CURRENTLY USING NO2022_PLAYOFF_STAT, ONCE DAY 133 RATINGS ARE RELEASED, CHANGE TO PLAYOFF_STAT
def createTrain():
    numGames = len(no2022_playoff_stat)
    numFeatures = len(getTeamVector(1211, 2019))
    xTrain = np.zeros((numGames, numFeatures))
    yTrain = np.zeros((numGames))
    counter = 0
    for index, row in no2022_playoff_stat.iterrows():
        year = row["Season"]
        team1ID = row["WTeamID"]
        team2ID = row["LTeamID"]
        try:
            gameVector = getDifferenceVector(team1ID, team2ID, year)
        except IndexError:
            #a ranking lookup came back empty; leave an all-zero row to filter out later
            gameVector = np.zeros((numFeatures))
        #alternate the perspective so wins (1) and losses (0) are balanced
        if(counter % 2 == 0):
            xTrain[counter] = gameVector
            yTrain[counter] = 1
        else:
            xTrain[counter] = np.negative(gameVector)
            yTrain[counter] = 0
        counter += 1
    print("Done Training!")
    return xTrain, yTrain
        

xTrain_, yTrain_ = createTrain()

Done Training!


Looks like I ran into a few issues while creating some of the difference vectors. When training on all regular-season games, ~5000 iterations ran into errors. I did some digging and I think it was because some of the teams in the `season_stat` dataframe never got a ranking from most rating systems and therefore do not exist in the `final_rankings` dataframe. There is nothing to do at this point other than going through and inputting data manually, and we ain't got time for that. So, I will just take any such rows out of the training set.

In [22]:
errorRows2 = np.all(xTrain_ == 0, axis=1)
xTrain = xTrain_[~errorRows2]
yTrain = yTrain_[~errorRows2]

print(xTrain.shape)
print(yTrain.shape)

(1181, 11)
(1181,)


We now have our training data. `xTrain` contains 1,181 rows of data and 11 feature columns. `yTrain` contains the corresponding win/loss for the difference vector.
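The two steps above (alternating the perspective of each game, then dropping placeholder rows) can be sketched on a tiny made-up array. The values are illustrative only, and the all-zero row stands in for a game whose ranking lookup failed.

```python
import numpy as np

# Made-up winner-minus-loser difference vectors.
winner_minus_loser = np.array([
    [ 8.0, -34.0],
    [ 3.0, -10.0],
    [ 0.0,   0.0],   # placeholder written when a lookup failed
    [-1.0,  -5.0],
])

X = np.zeros_like(winner_minus_loser)
y = np.zeros(len(winner_minus_loser))
for i, diff in enumerate(winner_minus_loser):
    if i % 2 == 0:
        X[i], y[i] = diff, 1     # winner's perspective, label 1
    else:
        X[i], y[i] = -diff, 0    # loser's perspective, label 0

# Drop rows that are entirely zero (the failed lookups).
keep = ~np.all(X == 0, axis=1)
X, y = X[keep], y[keep]
print(X.shape, y)
```

Without the alternation every label would be 1, since each raw row is written from the winner's side, and a classifier would have nothing to separate. One caveat of the zero-row filter is that a genuine all-zero difference would also be dropped, which is very unlikely with averaged stats.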

# Step 4: Testing Machine Learning Models

Now comes the fun part! Time to see which machine learning models best predict the data!

In [88]:
#Model Selection and Testing

#model = linear_model.LinearRegression()
#model = linear_model.Ridge(alpha=0.5)
model = linear_model.Lasso(alpha=0.1)
#model = LogisticRegression(max_iter=400)
#model = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000, tol=1e-3))
#model = make_pipeline(StandardScaler(), SGDClassifier(loss="log", max_iter=1000, tol=1e-3))
#model = linear_model.BayesianRidge()
#model = ElasticNet(alpha=30)
#model = make_pipeline(StandardScaler(), SVC(gamma='auto'))
#model = make_pipeline(StandardScaler(), LinearSVC(tol=1e-4, max_iter=5000, penalty="l1", loss="squared_hinge", dual=False, C=10))
#model = make_pipeline(StandardScaler(), LinearSVR(tol=1e-4, max_iter=7000, C=10))
#model = tree.DecisionTreeClassifier()
#model = tree.DecisionTreeRegressor(criterion="poisson")
#model = KNeighborsRegressor(n_neighbors=50, algorithm="kd_tree")
#model = KNeighborsClassifier(n_neighbors=65, algorithm="kd_tree")
#model = NearestCentroid()
#model = GaussianNB()
#model = ComplementNB()
#model = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000, tol=0.0001, C=0.1))
#model = LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, C=0.1)
#model = GradientBoostingRegressor()

In [89]:
accuracy = []

test_size = 0.30
for i in range(100):
    # Split into test and training sets
    X_train, X_test, Y_train, Y_test =  train_test_split(xTrain, yTrain, test_size=test_size)

    trained_model = model.fit(X_train, Y_train)
    predictions = model.predict(X_test)

    predictions[predictions < .5] = 0
    predictions[predictions >= .5] = 1

    accuracy.append(np.mean(predictions == Y_test))
    print("Finished iteration:", i + 1, " with accuracy: ", np.mean(predictions == Y_test))

print("Accuracy: ", sum(accuracy)/len(accuracy))

Finished iteration: 1  with accuracy:  0.7070422535211267
Finished iteration: 2  with accuracy:  0.6704225352112676
Finished iteration: 3  with accuracy:  0.7183098591549296
Finished iteration: 4  with accuracy:  0.7070422535211267
Finished iteration: 5  with accuracy:  0.6788732394366197
Finished iteration: 6  with accuracy:  0.6957746478873239
Finished iteration: 7  with accuracy:  0.6929577464788732
Finished iteration: 8  with accuracy:  0.7267605633802817
Finished iteration: 9  with accuracy:  0.6957746478873239
Finished iteration: 10  with accuracy:  0.7577464788732394
Finished iteration: 11  with accuracy:  0.6591549295774648
Finished iteration: 12  with accuracy:  0.7211267605633803
Finished iteration: 13  with accuracy:  0.7098591549295775
Finished iteration: 14  with accuracy:  0.752112676056338
Finished iteration: 15  with accuracy:  0.7380281690140845
Finished iteration: 16  with accuracy:  0.7464788732394366
Finished iteration: 17  with accuracy:  0.676056338028169
Finished
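The loop above is a hand-rolled version of repeated random train/test splitting. For classifiers, scikit-learn offers the same idea through `ShuffleSplit` and `cross_val_score`; the sketch below uses synthetic data in place of `xTrain` and `yTrain`, and fewer repeats to keep it quick.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in for xTrain / yTrain (random data, not tournament results).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# 20 independent random 70/30 splits, mirroring the loop with fewer repeats.
cv = ShuffleSplit(n_splits=20, test_size=0.30, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=400), X, y,
                         cv=cv, scoring="accuracy")

print("mean accuracy:", scores.mean(), "std:", scores.std())
```

Reporting the standard deviation alongside the mean helps judge whether differences of a few tenths of a percent between models are meaningful. Regressors such as Lasso would still need the 0.5 threshold from the loop above, since accuracy scoring expects class labels.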

### MODEL ACCURACY:

#### LINEAR MODELS:
- Linear Regression: 76.10%
- Ridge Regression (alpha=0.5): 76.08%
- Lasso (alpha=0.1): 76.25%
- Logistic Regression: 76.19%
- Stochastic Gradient Descent (hinge): 75.75%
- Stochastic Gradient Descent (log): 75.90%
- Bayesian Ridge: 76.10%
- Elastic Net (alpha=30): 76.31%

#### TREES:
- Decision Tree Classifier: 66.13%
- Decision Tree Regressor (Squared Error): 66.08%
- Decision Tree Regressor (Poisson): 68.46% (only 1 iteration)

#### NEAREST NEIGHBORS:
- KNeighbors Regressor (n=2): 67.04%
- KNeighbors Regressor (n=50, kd_tree): 75.93%
- KNeighbors Classifier (n=65, kd_tree): 75.96%
- Nearest Centroid: 75.92%

#### NAIVE BAYES:
- Gaussian Naive Bayes: 75.70%

#### SVC and SVR:
- Linear SVC (tol=1e-5): 76.12%
- Linear SVC (tol=0.0001, C=0.1): 76.14%
- SVM SVC: 76.28% (only 1 iteration)
- SVM LinearSVC (penalty=l1, C=10, dual=False): 76.15%
- Linear SVR: not run to completion (took too long, didn't want to wait)

#### Gradient Boosting Regressor:
- Gradient Boosting Regressor(): 76.27%

After testing out a variety of models, it seems there is not much more predictive gain to be had from model selection and algorithm tuning. To increase accuracy, other methods will most likely need to be applied. These include going back and adjusting feature selection, feature engineering, attempting ensemble methods, etc.

# Step 5: Prediction

In [90]:
def predictGame(team1ID, team2ID, year):
    difVector = getDifferenceVector(team1ID, team2ID, year)
    return model.predict([difVector])[0]

#Chances of Gonzaga beating Iowa in 2019
print(predictGame(1211, 1234, 2019))

0.6941928022128238


In [112]:
#Testing the 2017 championship game
year = 2017
team1 = getTeamID("North Carolina")
team2 = getTeamID("Gonzaga")

print("Probability that", getTeamName(team1), "beats", getTeamName(team2), ": ", predictGame(team1, team2, year))

Probability that North Carolina beats Gonzaga :  0.40767304024012574
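With the Lasso model selected above, `model.predict` returns a regression score rather than a calibrated probability, and nothing constrains it to the range 0 to 1. If an actual probability is wanted, one option is a classifier with `predict_proba`, such as logistic regression. The sketch below uses synthetic difference vectors; `clf` and `predict_win_probability` are hypothetical names, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic difference vectors and outcomes (random data for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = (X @ np.array([1.0, -0.5, 0.2]) + 0.3 * rng.normal(size=400) > 0).astype(int)

clf = LogisticRegression(max_iter=400).fit(X, y)

def predict_win_probability(diff_vector):
    """Hypothetical helper: P(team 1 wins) for one difference vector."""
    # predict_proba returns one column per class, ordered as in clf.classes_.
    win_col = list(clf.classes_).index(1)
    return clf.predict_proba([diff_vector])[0][win_col]

p = predict_win_probability([0.5, -0.2, 0.1])
p_swapped = predict_win_probability([-0.5, 0.2, -0.1])
print(p, p_swapped)
```

Swapping the two teams negates the difference vector, so the two printed values should land on opposite sides of 0.5 as long as the fitted intercept is small.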
