# 2018 NCAA March Madness Men's Basketball Predictions

### By Brice Walker

[View on GitHub](https://github.com/bricewalker/Personal_Ames_Regression)

[View on nbviewer](http://nbviewer.jupyter.org/github/bricewalker/Personal_Ames_Regression/blob/master/HousingRegression.ipynb)

## Outline

- [Introduction](#intro)
- [Importing libraries](#libraries)
- [Importing the dataset](#data)
- [Exploratory analysis and plotting](#analysis)
- [Feature extraction and engineering](#features)
- [Classification analysis](#classification)
    - [Logistic Regression](#log-reg)
    - [KNeighbors](#knn)
    - [Random Forest](#forest)
    - [Extra Trees](#extra)
    - [Support Vector Machine](#svm)
    - [Gradient Boosting](#gradboost)
    - [XGBoost](#xgboost)
    - [LightGBM](#lgbm)
    - [Keras/Tensorflow Neural Network](#nn)
- [Ensembling models](#ensemble-reg)
- [Exporting the final model](#exporting)

<a id='intro'></a>
## Introduction

This is a classification project completed for the 2018 March Madness Kaggle Competition. In this project, I have extracted 18 season-based, and 28 tournament-based team-level characteristics from several datasets. I used datasets provided by kaggle as well as data scraped from sports-reference.com. I then engineered several advanced measures and extracted Elo ratings. I used these characteristics to predict probabilities for each matchup in the 2018 March Madness Schedule. My final model was a soft voting classifier that used KNeighbors, Random Forest, Extra Trees, Logistic Regression, Gradient Boosting, and LightGBM classifiers to predict probabilities for each matchup. I also ran XGBoost and Keras/Tensorflow neural network models.

<a id='libraries'></a>
## Importing libraries

In [1]:
# Common imports
import pandas as pd
import numpy as np
import scipy as sp
import collections
import os
import sys
import math
from math import pi
from math import sqrt
import csv
import urllib
import pickle
import random
import statsmodels.api as sm
from patsy import dmatrices

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')

# Math and descriptive stats
from math import sqrt
from scipy import stats
from scipy.stats import norm, skew
from scipy.stats.stats import pearsonr
from scipy.special import boxcox1p, inv_boxcox1p

# Sci-kit Learn modules for machine learning
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, log_loss
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, make_scorer
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier, AdaBoostClassifier, AdaBoostRegressor
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor, VotingClassifier
from sklearn.ensemble import IsolationForest, RandomForestClassifier, RandomForestRegressor, RandomTreesEmbedding
from sklearn.svm import SVR, LinearSVC, SVC, LinearSVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.calibration import CalibratedClassifierCV

# Boosting libraries
import lightgbm as lgb
import xgboost

# Deep Learning modules
from keras.layers import Input, Dense, Dropout, Flatten, Embedding, merge, Activation
from keras.layers import Convolution2D, MaxPooling2D, Convolution1D
from keras.regularizers import l2
from keras.optimizers import Adam
from keras.models import Model, Sequential
from keras.optimizers import SGD
from keras.utils import np_utils, to_categorical

K = 20.
HOME_ADVANTAGE = 100.

  from pandas.core import datetools
Using TensorFlow backend.


<a id='data'></a>
## Importing The Data Sets

In [2]:
reg_season_compact_pd = pd.read_csv('Data/KaggleData/RegularSeasonCompactResults.csv')
seasons_pd = pd.read_csv('Data/KaggleData/Seasons.csv')
teams_pd = pd.read_csv('Data/KaggleData/Teams.csv')
tourney_compact_pd = pd.read_csv('Data/KaggleData/NCAATourneyCompactResults.csv')
tourney_detailed_pd = pd.read_csv('Data/KaggleData/NCAATourneyDetailedResults.csv')
conference_pd = pd.read_csv('Data/KaggleData/Conference.csv')
tourney_results_pd = pd.read_csv('Data/KaggleData/TourneyResults.csv')
sample_sub_pd = pd.read_csv('Data/KaggleData/sample_submission.csv')
tourney_seeds_pd = pd.read_csv('Data/KaggleData/NCAATourneySeeds.csv')
team_conferences_pd = pd.read_csv('Data/KaggleData/TeamConferences.csv')
sample_sub_pd = pd.read_csv('Data/KaggleData/SampleSubmissionStage1.csv')
seeds_pd = pd.read_csv('Data/KaggleData/NCAATourneySeeds.csv')
elos_ratings_pd = pd.read_csv('Data/ElosRatings/season_elos.csv')

<a id='analysis'></a>
## Exploratory analysis and plotting

<a id='features'></a>
## Feature Engineering and Extraction

This project attempts to predict outcomes of games based on the following team level characteristics for season games:
- A modified Elo rating (where new entrants are initialized at a score of 1500, and there is no reversion to the mean between seasons)
- Number of wins
- Avg points per game scored
- Avg points per game allowed
- Avg # of 3 pointers per game
- Avg turnovers per game
- Avg Assists per game
- Avg rebounds per game
- Avg steals per game
- Power 6 Conference
- Reg Season championships
- Strength of team's schedule
- Championship appearances
- Location of the game
- A simple rating system

And the following team level characteristics for tournament performance:<br>
> Note: If a team plays in more than one tourney in a year than these values are averaged over all tourneys they played that year.

- Tournament appearances
- Conference tournament championships
- Points scored for winning/losing team
- A measure of possession
- Offensive efficiency
- Defensive efficiency
- Net Rating (Offensive - Defensive efficiency)
- Assist Ratio
- Turnover Ratio
- Shooting Percentage
- Effective Field Goal Percentage adjusting for the fact that 3pt shots are more valuable
- FTA Rating : How good a team is at drawing fouls.
- Percentage of team offensive rebounds
- Percentage of team defensive rebounds
- Percentage of team total rebounds

In [258]:
# Advanced tourney data
df = pd.read_csv('Data/KaggleData/NCAATourneyDetailedResults.csv')
# Points Winning/Losing Team
df['WPts'] = df.apply(lambda row: 2*row.WFGM + row.WFGM3 + row.WFTM, axis=1)
df['LPts'] = df.apply(lambda row: 2*row.LFGM + row.LFGM3 + row.LFTM, axis=1)
# Calculate Winning/losing Team Possesion Feature
wPos = df.apply(lambda row: 0.96*(row.WFGA + row.WTO + 0.44*row.WFTA - row.WOR), axis=1)
df['WPos'] = df.apply(lambda row: 0.96*(row.WFGA + row.WTO + 0.44*row.WFTA - row.WOR), axis=1)
lPos = df.apply(lambda row: 0.96*(row.LFGA + row.LTO + 0.44*row.LFTA - row.LOR), axis=1)
df['LPos'] = lPos = df.apply(lambda row: 0.96*(row.LFGA + row.LTO + 0.44*row.LFTA - row.LOR), axis=1)
df['Pos'] = (wPos+lPos)/2
# Offensive efficiency (OffRtg) = 100 x (Points / Possessions)
df['WOffRtg'] = df.apply(lambda row: 100 * (row.WPts / row.Pos), axis=1)
df['LOffRtg'] = df.apply(lambda row: 100 * (row.LPts / row.Pos), axis=1)
# Defensive efficiency (DefRtg) = 100 x (Opponent points / Opponent possessions)
df['WDefRtg'] = df.LOffRtg
df['LDefRtg'] = df.WOffRtg
# Net Rating = Off.Rtg - Def.Rtg
df['WNetRtg'] = df.apply(lambda row:(row.WOffRtg - row.WDefRtg), axis=1)
df['LNetRtg'] = df.apply(lambda row:(row.LOffRtg - row.LDefRtg), axis=1)                       
# Assist Ratio : Percentage of team possessions that end in assists
df['WAstR'] = df.apply(lambda row: 100 * row.WAst / (row.WFGA + 0.44*row.WFTA + row.WAst + row.WTO), axis=1)
df['LAstR'] = df.apply(lambda row: 100 * row.LAst / (row.LFGA + 0.44*row.LFTA + row.LAst + row.LTO), axis=1)
# Turnover Ratio: Number of turnovers of a team per 100 possessions used.
# (TO * 100) / (FGA + (FTA * 0.44) + AST + TO
df['WTOR'] = df.apply(lambda row: 100 * row.LAst / (row.LFGA + 0.44*row.LFTA + row.LAst + row.LTO), axis=1)
df['LTOR'] = df.apply(lambda row: 100 * row.LAst / (row.LFGA + 0.44*row.LFTA + row.LAst + row.LTO), axis=1)                  
# The Shooting Percentage : Measure of Shooting Efficiency (FGA/FGA3, FTA)
df['WTSP'] = df.apply(lambda row: 100 * row.WPts / (2 * (row.WFGA + 0.44 * row.WFTA)), axis=1)
df['LTSP'] = df.apply(lambda row: 100 * row.LPts / (2 * (row.LFGA + 0.44 * row.LFTA)), axis=1)
# eFG% : Effective Field Goal Percentage adjusting for the fact that 3pt shots are more valuable 
df['WeFGP'] = df.apply(lambda row:(row.WFGM + 0.5 * row.WFGM3) / row.WFGA, axis=1)      
df['LeFGP'] = df.apply(lambda row:(row.LFGM + 0.5 * row.LFGM3) / row.LFGA, axis=1)   
# FTA Rate : How good a team is at drawing fouls.
df['WFTAR'] = df.apply(lambda row: row.WFTA / row.WFGA, axis=1)
df['LFTAR'] = df.apply(lambda row: row.LFTA / row.LFGA, axis=1)                       
# OREB% : Percentage of team offensive rebounds
df['WORP'] = df.apply(lambda row: row.WOR / (row.WOR + row.LDR), axis=1)
df['LORP'] = df.apply(lambda row: row.LOR / (row.LOR + row.WDR), axis=1)
# DREB% : Percentage of team defensive rebounds
df['WDRP'] = df.apply(lambda row: row.WDR / (row.WDR + row.LOR), axis=1)
df['LDRP'] = df.apply(lambda row: row.LDR / (row.LDR + row.WOR), axis=1)                                      
# REB% : Percentage of team total rebounds
df['WRP'] = df.apply(lambda row: (row.WDR + row.WOR) / (row.WDR + row.WOR + row.LDR + row.LOR), axis=1)
df['LRP'] = df.apply(lambda row: (row.LDR + row.WOR) / (row.WDR + row.WOR + row.LDR + row.LOR), axis=1) 
df['WPIE'] = df.apply(lambda row: (row.WDR + row.WOR) / (row.WDR + row.WOR + row.LDR + row.LOR), axis=1)
wtmp = df.apply(lambda row: row.WPts + row.WFGM + row.WFTM - row.WFGA - row.WFTA + row.WDR + 0.5*row.WOR + row.WAst +row.WStl + 0.5*row.WBlk - row.WPF - row.WTO, axis=1)
ltmp = df.apply(lambda row: row.LPts + row.LFGM + row.LFTM - row.LFGA - row.LFTA + row.LDR + 0.5*row.LOR + row.LAst +row.LStl + 0.5*row.LBlk - row.LPF - row.LTO, axis=1) 
df['WPIE'] = wtmp/(wtmp + ltmp)
df['LPIE'] = ltmp/(wtmp + ltmp)
    
df.to_csv('Data/KaggleData/NCAATourneyDetailedResultsEnriched.csv', index=False)
enriched_pd = pd.read_csv('Data/KaggleData/NCAATourneyDetailedResultsEnriched.csv')

In [259]:
# Creating custom Elo ratings. This takes a long time to run so beware!
team_ids = set(reg_season_compact_pd.WTeamID).union(set(reg_season_compact_pd.LTeamID))

elo_dict = dict(zip(list(team_ids), [1500] * len(team_ids)))

reg_season_compact_pd['margin'] = reg_season_compact_pd.WScore - reg_season_compact_pd.LScore
reg_season_compact_pd['w_elo'] = None
reg_season_compact_pd['l_elo'] = None

def elo_pred(elo1, elo2):
    return(1. / (10. ** (-(elo1 - elo2) / 400.) + 1.))

def expected_margin(elo_diff):
    return((7.5 + 0.006 * elo_diff))

def elo_update(w_elo, l_elo, margin):
    elo_diff = w_elo - l_elo
    pred = elo_pred(w_elo, l_elo)
    mult = ((margin + 3.) ** 0.8) / expected_margin(elo_diff)
    update = K * mult * (1 - pred)
    return(pred, update)

assert np.all(reg_season_compact_pd.index.values == np.array(range(reg_season_compact_pd.shape[0]))), "Index is out of order."

preds = []

# Loop over all rows
for i in range(reg_season_compact_pd.shape[0]):
    
    # Get key data from each row
    w = reg_season_compact_pd.at[i, 'WTeamID']
    l = reg_season_compact_pd.at[i, 'LTeamID']
    margin = reg_season_compact_pd.at[i, 'margin']
    wloc = reg_season_compact_pd.at[i, 'WLoc']
    
    # Home court advantage?
    w_ad, l_ad, = 0., 0.
    if wloc == "H":
        w_ad += HOME_ADVANTAGE
    elif wloc == "A":
        l_ad += HOME_ADVANTAGE
    
    # Get elo updates as a result of each game
    pred, update = elo_update(elo_dict[w] + w_ad,
                              elo_dict[l] + l_ad, 
                              margin)
    elo_dict[w] += update
    elo_dict[l] -= update
    preds.append(pred)

    # Store elos in new dataframe
    reg_season_compact_pd.loc[i, 'w_elo'] = elo_dict[w]
    reg_season_compact_pd.loc[i, 'l_elo'] = elo_dict[l]

In [260]:
reg_season_compact_pd.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,w_elo,l_elo
0,1985,20,1228,81,1328,64,N,0,17,1514.65,1485.35
1,1985,25,1106,77,1354,70,H,0,7,1505.61,1494.39
2,1985,25,1112,63,1223,56,H,0,7,1505.61,1494.39
3,1985,25,1165,70,1432,54,H,0,16,1509.37,1490.63
4,1985,25,1192,86,1447,74,H,0,12,1507.76,1492.24


In [261]:
def seed_to_int(seed):
# Convert seeds to integers
    s_int = int(seed[1:3])
    return s_int
seeds_pd['seed_int'] = seeds_pd.Seed.apply(seed_to_int)
seeds_pd.drop(labels=['Seed'], inplace=True, axis=1) # This is the string label
seeds_pd.head()

Unnamed: 0,Season,TeamID,seed_int
0,1985,1207,1
1,1985,1210,2
2,1985,1228,3
3,1985,1260,4
4,1985,1374,5


In [262]:
teamList = teams_pd['TeamName'].tolist()
NCAAChampionsList = tourney_results_pd['NCAA Champion'].tolist()

In [263]:
def Power6Conf(team_id):
    team_pd = team_conferences_pd[(team_conferences_pd['Season'] == 2018) & (team_conferences_pd['TeamID'] == team_id)]
    if (len(team_pd) == 0):
        return 0
    confName = team_pd.iloc[0]['ConfAbbrev']
    return int(confName == 'sec' or confName == 'acc'or confName == 'big_ten' or confName == 'big_twelve' or confName == 'big_east' or confName == 'pac_twelve')

In [264]:
def createTeamName(team_id):
    return teams_pd[teams_pd['TeamID'] == team_id].values[0][1]

def findNumChampionships(team_id):
    name = createTeamName(team_id)
    return NCAAChampionsList.count(name)

In [265]:
def handleCases(arr):
    indices = []
    listLen = len(arr)
    for i in range(listLen):
        if (arr[i] == 'St' or arr[i] == 'FL'):
            indices.append(i)
    for p in indices:
        arr[p-1] = arr[p-1] + ' ' + arr[p]
    for i in range(len(indices)): 
        arr.remove(arr[indices[i] - i])
    return arr

In [266]:
def checkConferenceChamp(team_id, year):
    year_conf_pd = conference_pd[conference_pd['Year'] == year]
    champs = year_conf_pd['Regular Season Champ'].tolist()
# In case of a tie
    champs_separated = [words for segments in champs for words in segments.split()]
    name = createTeamName(team_id)
    champs_separated = handleCases(champs_separated)
    if (name in champs_separated):
        return 1
    else:
        return 0

In [267]:
def checkConferenceTourneyChamp(team_id, year):
    year_conf_pd = conference_pd[conference_pd['Year'] == year]
    champs = year_conf_pd['Tournament Champ'].tolist()
    name = createTeamName(team_id)
    if (name in champs):
        return 1
    else:
        return 0

In [268]:
def getTourneyAppearances(team_id):
    return len(tourney_seeds_pd[tourney_seeds_pd['TeamID'] == team_id].index)

In [269]:
# Fixing names in csv's with differing formats
def handleDifferentCSV(df):
    df['School'] = df['School'].replace('(State)', 'St', regex=True) 
    df['School'] = df['School'].replace('Albany (NY)', 'Albany NY') 
    df['School'] = df['School'].replace('Boston University', 'Boston Univ')
    df['School'] = df['School'].replace('Central Michigan', 'C Michigan')
    df['School'] = df['School'].replace('(Eastern)', 'E', regex=True)
    df['School'] = df['School'].replace('Louisiana St', 'LSU')
    df['School'] = df['School'].replace('North Carolina St', 'NC State')
    df['School'] = df['School'].replace('Southern California', 'USC')
    df['School'] = df['School'].replace('University of California', 'California', regex=True) 
    df['School'] = df['School'].replace('American', 'American Univ')
    df['School'] = df['School'].replace('Arkansas-Little Rock', 'Ark Little Rock')
    df['School'] = df['School'].replace('Arkansas-Pine Bluff', 'Ark Pine Bluff')
    df['School'] = df['School'].replace('Bowling Green St', 'Bowling Green')
    df['School'] = df['School'].replace('Brigham Young', 'BYU')
    df['School'] = df['School'].replace('Cal Poly', 'Cal Poly SLO')
    df['School'] = df['School'].replace('Centenary (LA)', 'Centenary')
    df['School'] = df['School'].replace('Central Connecticut St', 'Central Conn')
    df['School'] = df['School'].replace('Charleston Southern', 'Charleston So')
    df['School'] = df['School'].replace('Coastal Carolina', 'Coastal Car')
    df['School'] = df['School'].replace('College of Charleston', 'Col Charleston')
    df['School'] = df['School'].replace('Cal St Fullerton', 'CS Fullerton')
    df['School'] = df['School'].replace('Cal St Sacramento', 'CS Sacramento')
    df['School'] = df['School'].replace('Cal St Bakersfield', 'CS Bakersfield')
    df['School'] = df['School'].replace('Cal St Northridge', 'CS Northridge')
    df['School'] = df['School'].replace('East Tennessee St', 'ETSU')
    df['School'] = df['School'].replace('Detroit Mercy', 'Detroit')
    df['School'] = df['School'].replace('Fairleigh Dickinson', 'F Dickinson')
    df['School'] = df['School'].replace('Florida Atlantic', 'FL Atlantic')
    df['School'] = df['School'].replace('Florida Gulf Coast', 'FL Gulf Coast')
    df['School'] = df['School'].replace('Florida International', 'Florida Intl')
    df['School'] = df['School'].replace('George Washington', 'G Washington')
    df['School'] = df['School'].replace('Georgia Southern', 'Ga Southern')
    df['School'] = df['School'].replace('Gardner-Webb', 'Gardner Webb')
    df['School'] = df['School'].replace('Illinois-Chicago', 'IL Chicago')
    df['School'] = df['School'].replace('Kent St', 'Kent')
    df['School'] = df['School'].replace('Long Island University', 'Long Island')
    df['School'] = df['School'].replace('Loyola Marymount', 'Loy Marymount')
    df['School'] = df['School'].replace('Loyola (MD)', 'Loyola MD')
    df['School'] = df['School'].replace('Loyola (IL)', 'Loyola-Chicago')
    df['School'] = df['School'].replace('Massachusetts', 'MA Lowell')
    df['School'] = df['School'].replace('Maryland-Eastern Shore', 'MD E Shore')
    df['School'] = df['School'].replace('Miami (FL)', 'Miami FL')
    df['School'] = df['School'].replace('Miami (OH)', 'Miami OH')
    df['School'] = df['School'].replace('Missouri-Kansas City', 'Missouri KC')
    df['School'] = df['School'].replace('Monmouth', 'Monmouth NJ')
    df['School'] = df['School'].replace('Mississippi Valley St', 'MS Valley St')
    df['School'] = df['School'].replace('Montana St', 'MTSU')
    df['School'] = df['School'].replace('Northern Colorado', 'N Colorado')
    df['School'] = df['School'].replace('North Dakota St', 'N Dakota St')
    df['School'] = df['School'].replace('Northern Illinois', 'N Illinois')
    df['School'] = df['School'].replace('Northern Kentucky', 'N Kentucky')
    df['School'] = df['School'].replace('North Carolina A&T', 'NC A&T')
    df['School'] = df['School'].replace('North Carolina Central', 'NC Central')
    df['School'] = df['School'].replace('Pennsylvania', 'Penn')
    df['School'] = df['School'].replace('South Carolina St', 'S Carolina St')
    df['School'] = df['School'].replace('Southern Illinois', 'S Illinois')
    df['School'] = df['School'].replace('UC-Santa Barbara', 'Santa Barbara')
    df['School'] = df['School'].replace('Southeastern Louisiana', 'SE Louisiana')
    df['School'] = df['School'].replace('Southeast Missouri St', 'SE Missouri St')
    df['School'] = df['School'].replace('Stephen F. Austin', 'SF Austin')
    df['School'] = df['School'].replace('Southern Methodist', 'SMU')
    df['School'] = df['School'].replace('Southern Mississippi', 'Southern Miss')
    df['School'] = df['School'].replace('Southern', 'Southern Univ')
    df['School'] = df['School'].replace('St. Bonaventure', 'St Bonaventure')
    df['School'] = df['School'].replace('St. Francis (NY)', 'St Francis NY')
    df['School'] = df['School'].replace('Saint Francis (PA)', 'St Francis PA')
    df['School'] = df['School'].replace('St. John\'s (NY)', 'St John\'s')
    df['School'] = df['School'].replace('Saint Joseph\'s', 'St Joseph\'s PA')
    df['School'] = df['School'].replace('Saint Louis', 'St Louis')
    df['School'] = df['School'].replace('Saint Mary\'s (CA)', 'St Mary\'s CA')
    df['School'] = df['School'].replace('Mount Saint Mary\'s', 'Mt St Mary\'s')
    df['School'] = df['School'].replace('Saint Peter\'s', 'St Peter\'s')
    df['School'] = df['School'].replace('Texas A&M-Corpus Christian', 'TAM C. Christian')
    df['School'] = df['School'].replace('Texas Christian', 'TCU')
    df['School'] = df['School'].replace('Tennessee-Martin', 'TN Martin')
    df['School'] = df['School'].replace('Texas-Rio Grande Valley', 'UTRGV')
    df['School'] = df['School'].replace('Texas Southern', 'TX Southern')
    df['School'] = df['School'].replace('Alabama-Birmingham', 'UAB')
    df['School'] = df['School'].replace('UC-Davis', 'UC Davis')
    df['School'] = df['School'].replace('UC-Irvine', 'UC Irvine')
    df['School'] = df['School'].replace('UC-Riverside', 'UC Riverside')
    df['School'] = df['School'].replace('Central Florida', 'UCF')
    df['School'] = df['School'].replace('Louisiana-Lafayette', 'ULL')
    df['School'] = df['School'].replace('Louisiana-Monroe', 'ULM')
    df['School'] = df['School'].replace('Maryland-Baltimore County', 'UMBC')
    df['School'] = df['School'].replace('North Carolina-Asheville', 'UNC Asheville')
    df['School'] = df['School'].replace('North Carolina-Greensboro', 'UNC Greensboro')
    df['School'] = df['School'].replace('North Carolina-Wilmington', 'UNC Wilmington')
    df['School'] = df['School'].replace('Nevada-Las Vegas', 'UNLV')
    df['School'] = df['School'].replace('Texas-Arlington', 'UT Arlington')
    df['School'] = df['School'].replace('Texas-San Antonio', 'UT San Antonio')
    df['School'] = df['School'].replace('Texas-El Paso', 'UTEP')
    df['School'] = df['School'].replace('Virginia Commonwealth', 'VA Commonwealth')
    df['School'] = df['School'].replace('Western Carolina', 'W Carolina')
    df['School'] = df['School'].replace('Western Illinois', 'W Illinois')
    df['School'] = df['School'].replace('Western Kentucky', 'WKU')
    df['School'] = df['School'].replace('Western Michigan', 'W Michigan')
    df['School'] = df['School'].replace('Abilene Christian', 'Abilene Chr')
    df['School'] = df['School'].replace('Montana State', 'Montana St')
    df['School'] = df['School'].replace('Central Arkansas', 'Cent Arkansas')
    df['School'] = df['School'].replace('Houston Baptist', 'Houston Bap')
    df['School'] = df['School'].replace('South Dakota St', 'S Dakota St')
    df['School'] = df['School'].replace('Maryland-Eastern Shore', 'MD E Shore')
    return df

In [270]:
def createHomeStat(row):
    if (row == 'H'):
        home = 1
    if (row == 'A'):
        home = -1
    if (row == 'N'):
        home = 0
    return home

In [271]:
def normalizeInput(arr):
    for i in range(arr.shape[1]):
        minVal = min(arr[:,i])
        maxVal = max(arr[:,i])
        arr[:,i] =  (arr[:,i] - minVal) / (maxVal - minVal)
    return arr

In [272]:
def normalizeInput2(X):
    return (X - np.mean(X, axis = 0)) / np.std(X, axis = 0)

In [486]:
def getSeasonTourneyData(team_id, year):
    year_data_pd = reg_season_compact_pd[reg_season_compact_pd['Season'] == year]
# Elo   
    year_pd = year_data_pd.copy()
    year_pd = year_pd.loc[(year_pd.WTeamID == team_id) | (year_pd.LTeamID == team_id), :]
    year_pd.sort_values(['Season', 'DayNum'], inplace=True)
    year_pd.drop_duplicates(['Season'], keep='last', inplace=True)
    w_mask = year_pd.WTeamID == team_id
    l_mask = year_pd.LTeamID == team_id
    year_pd['season_elo'] = None
    year_pd.loc[w_mask, 'season_elo'] = year_pd.loc[w_mask, 'w_elo']
    year_pd.loc[l_mask, 'season_elo'] = year_pd.loc[l_mask, 'l_elo']
    elo = year_pd.season_elo
    elo = elo.values.mean()
# Points per game
    gamesWon = year_data_pd[year_data_pd.WTeamID == team_id] 
    totalPointsScored = gamesWon['WScore'].sum()
    gamesLost = year_data_pd[year_data_pd.LTeamID == team_id] 
    totalGames = gamesWon.append(gamesLost)
    numGames = len(totalGames.index)
    totalPointsScored += gamesLost['LScore'].sum()
# Number of points allowed
    totalPointsAllowed = gamesWon['LScore'].sum()
    totalPointsAllowed += gamesLost['WScore'].sum()
# Scraped data    
    stats_SOS_pd = pd.read_csv('Data/RegSeasonStats/MMStats_'+str(year)+'.csv')
    stats_SOS_pd = handleDifferentCSV(stats_SOS_pd)
    ratings_pd = pd.read_csv('Data/RatingStats/RatingStats_'+str(year)+'.csv')
    ratings_pd = handleDifferentCSV(ratings_pd)
    
    name = createTeamName(team_id)
    team = stats_SOS_pd[stats_SOS_pd['School'] == name]
    team_rating = ratings_pd[ratings_pd['School'] == name]
    if (len(team.index) == 0 or len(team_rating.index) == 0):
        total3sMade = 0
        totalTurnovers = 0
        totalAssists = 0
        sos = 0
        totalRebounds = 0
        srs = 0
        totalSteals = 0
    else:
        total3sMade = team['X3P'].values[0]
        totalTurnovers = team['TOV'].values[0]
        if (math.isnan(totalTurnovers)):
            totalTurnovers = 0
        totalAssists = team['AST'].values[0]
        if (math.isnan(totalAssists)):
            totalAssists = 0
        sos = team['SOS'].values[0]
        srs = team['SRS'].values[0]
        totalRebounds = team['TRB'].values[0]
        if (math.isnan(totalRebounds)):
            totalRebounds = 0
        totalSteals = team['STL'].values[0]
        if (math.isnan(totalSteals)):
            totalSteals = 0
    
# Finding tourney seed
    tourneyYear = tourney_seeds_pd[tourney_seeds_pd['Season'] == year]
    seed = tourneyYear[tourneyYear['TeamID'] == team_id]
    if (len(seed.index) != 0):
        seed = seed.values[0][1]
        tournamentSeed = int(seed[1:3])
    else:
        tournamentSeed = 25

# Number of wins and losses
    numWins = len(gamesWon.index)
# Preventing division by 0
    if numGames == 0:
        avgPointsScored = 0
        avgPointsAllowed = 0
        avg3sMade = 0
        avgTurnovers = 0
        avgAssists = 0
        avgRebounds = 0
        avgSteals = 0
    else:
        avgPointsScored = totalPointsScored/numGames
        avgPointsAllowed = totalPointsAllowed/numGames
        avg3sMade = total3sMade/numGames
        avgTurnovers = totalTurnovers/numGames
        avgAssists = totalAssists/numGames
        avgRebounds = totalRebounds/numGames
        avgSteals = totalSteals/numGames
        
# Tourney data   
    enriched_df = enriched_pd[enriched_pd['Season'] == year]
    enriched_df = enriched_df.loc[(enriched_df.WTeamID == team_id) | (enriched_df.LTeamID == team_id), :]
    w_mask = enriched_df.WTeamID == team_id
    l_mask = enriched_df.LTeamID == team_id
    enriched_df['Score'] = 0
    enriched_df['FGM'] = 0
    enriched_df['FGA'] = 0
    enriched_df['FGM3'] = 0
    enriched_df['FGA3'] = 0
    enriched_df['FTM'] = 0
    enriched_df['FTA'] = 0
    enriched_df['OR'] = 0
    enriched_df['DR'] = 0
    enriched_df['Ast'] = 0
    enriched_df['TO'] = 0
    enriched_df['Stl'] = 0
    enriched_df['Blk'] = 0
    enriched_df['PF'] = 0
    enriched_df['Pts'] = 0
    enriched_df['Pos'] = 0
    enriched_df['OffRtg'] = 0
    enriched_df['DefRtg'] = 0
    enriched_df['NetRtg'] = 0
    enriched_df['AstR'] = 0
    enriched_df['TOR'] = 0
    enriched_df['TSP'] = 0
    enriched_df['eFGP'] = 0
    enriched_df['FTAR'] = 0
    enriched_df['ORP'] = 0
    enriched_df['DRP'] = 0
    enriched_df['RP'] = 0
    enriched_df['PIE'] = 0
    enriched_df.loc[w_mask, 'Score'] = enriched_df.loc[w_mask, 'WScore']
    enriched_df.loc[l_mask, 'Score'] = enriched_df.loc[l_mask, 'LScore']
    Score = enriched_df.Score.values.mean()
    enriched_df.loc[w_mask, 'FGM'] = enriched_df.loc[w_mask, 'WFGM']
    enriched_df.loc[l_mask, 'FGM'] = enriched_df.loc[l_mask, 'LFGM']
    FGM = enriched_df.FGM.values.mean()
    enriched_df.loc[w_mask, 'FGA'] = enriched_df.loc[w_mask, 'WFGA']
    enriched_df.loc[l_mask, 'FGA'] = enriched_df.loc[l_mask, 'LFGA']
    FGA = enriched_df.FGA.values.mean()
    enriched_df.loc[w_mask, 'FGM3'] = enriched_df.loc[w_mask, 'WFGM3']
    enriched_df.loc[l_mask, 'FGM3'] = enriched_df.loc[l_mask, 'LFGM3']
    FGM3 = enriched_df.FGM3.values.mean()
    enriched_df.loc[w_mask, 'FGA3'] = enriched_df.loc[w_mask, 'WFGA3']
    enriched_df.loc[l_mask, 'FGA3'] = enriched_df.loc[l_mask, 'LFGA3']
    FGA3 = enriched_df.FGA3.values.mean()
    enriched_df.loc[w_mask, 'FTM'] = enriched_df.loc[w_mask, 'WFTM']
    enriched_df.loc[l_mask, 'FTM'] = enriched_df.loc[l_mask, 'LFTM']
    FTM = enriched_df.FTM.values.mean()
    enriched_df.loc[w_mask, 'FTA'] = enriched_df.loc[w_mask, 'WFTA']
    enriched_df.loc[l_mask, 'FTA'] = enriched_df.loc[l_mask, 'LFTA']
    FTA = enriched_df.FTA.values.mean()
    enriched_df.loc[w_mask, 'OR'] = enriched_df.loc[w_mask, 'WOR']
    enriched_df.loc[l_mask, 'OR'] = enriched_df.loc[l_mask, 'LOR']
    OR = enriched_df.OR.values.mean()
    enriched_df.loc[w_mask, 'DR'] = enriched_df.loc[w_mask, 'WDR']
    enriched_df.loc[l_mask, 'DR'] = enriched_df.loc[l_mask, 'LDR']
    DR = enriched_df.DR.values.mean()
    enriched_df.loc[w_mask, 'Ast'] = enriched_df.loc[w_mask, 'WAst']
    enriched_df.loc[l_mask, 'Ast'] = enriched_df.loc[l_mask, 'LAst']
    Ast = enriched_df.Ast.values.mean()
    enriched_df.loc[w_mask, 'TO'] = enriched_df.loc[w_mask, 'WTO']
    enriched_df.loc[l_mask, 'TO'] = enriched_df.loc[l_mask, 'LTO']
    TO = enriched_df.TO.values.mean()
    enriched_df.loc[w_mask, 'Stl'] = enriched_df.loc[w_mask, 'WStl']
    enriched_df.loc[l_mask, 'Stl'] = enriched_df.loc[l_mask, 'LStl']
    Stl = enriched_df.Stl.values.mean()
    enriched_df.loc[w_mask, 'Blk'] = enriched_df.loc[w_mask, 'WBlk']
    enriched_df.loc[l_mask, 'Blk'] = enriched_df.loc[l_mask, 'LBlk']
    Blk = enriched_df.Blk.values.mean()
    enriched_df.loc[w_mask, 'PF'] = enriched_df.loc[w_mask, 'WPF']
    enriched_df.loc[l_mask, 'PF'] = enriched_df.loc[l_mask, 'LPF']
    PF = enriched_df.PF.values.mean()
    enriched_df.loc[w_mask, 'Pts'] = enriched_df.loc[w_mask, 'WPts']
    enriched_df.loc[l_mask, 'Pts'] = enriched_df.loc[l_mask, 'LPts']
    Pts = enriched_df.Pts.values.mean()
    enriched_df.loc[w_mask, 'Pos'] = enriched_df.loc[w_mask, 'WPos']
    enriched_df.loc[l_mask, 'Pos'] = enriched_df.loc[l_mask, 'LPos']
    Pos = enriched_df.Pos.values.mean()
    enriched_df.loc[w_mask, 'OffRtg'] = enriched_df.loc[w_mask, 'WOffRtg']
    enriched_df.loc[l_mask, 'OffRtg'] = enriched_df.loc[l_mask, 'LOffRtg']
    OffRtg = enriched_df.OffRtg.values.mean()
    enriched_df.loc[w_mask, 'DefRtg'] = enriched_df.loc[w_mask, 'WDefRtg']
    enriched_df.loc[l_mask, 'DefRtg'] = enriched_df.loc[l_mask, 'LDefRtg']
    DefRtg = enriched_df.DefRtg.values.mean()
    enriched_df.loc[w_mask, 'NetRtg'] = enriched_df.loc[w_mask, 'WNetRtg']
    enriched_df.loc[l_mask, 'NetRtg'] = enriched_df.loc[l_mask, 'LNetRtg']
    NetRtg = enriched_df.NetRtg.values.mean()
    enriched_df.loc[w_mask, 'AstR'] = enriched_df.loc[w_mask, 'WAstR']
    enriched_df.loc[l_mask, 'AstR'] = enriched_df.loc[l_mask, 'LAstR']
    AstR = enriched_df.AstR.values.mean()
    enriched_df.loc[w_mask, 'TOR'] = enriched_df.loc[w_mask, 'WTOR']
    enriched_df.loc[l_mask, 'TOR'] = enriched_df.loc[l_mask, 'LTOR']
    TOR = enriched_df.TOR.values.mean()
    enriched_df.loc[w_mask, 'TSP'] = enriched_df.loc[w_mask, 'WTSP']
    enriched_df.loc[l_mask, 'TSP'] = enriched_df.loc[l_mask, 'LTSP']
    TSP = enriched_df.TSP.values.mean()
    enriched_df.loc[w_mask, 'eFGP'] = enriched_df.loc[w_mask, 'WeFGP']
    enriched_df.loc[l_mask, 'eFGP'] = enriched_df.loc[l_mask, 'LeFGP']
    eFGP = enriched_df.eFGP.values.mean()
    enriched_df.loc[w_mask, 'FTAR'] = enriched_df.loc[w_mask, 'WFTAR']
    enriched_df.loc[l_mask, 'FTAR'] = enriched_df.loc[l_mask, 'LFTAR']
    FTAR = enriched_df.FTAR.values.mean()
    enriched_df.loc[w_mask, 'ORP'] = enriched_df.loc[w_mask, 'WORP']
    enriched_df.loc[l_mask, 'ORP'] = enriched_df.loc[l_mask, 'LORP']
    ORP = enriched_df.ORP.values.mean()
    enriched_df.loc[w_mask, 'DRP'] = enriched_df.loc[w_mask, 'WDRP']
    enriched_df.loc[l_mask, 'DRP'] = enriched_df.loc[l_mask, 'LDRP']
    DRP = enriched_df.DRP.values.mean()
    enriched_df.loc[w_mask, 'RP'] = enriched_df.loc[w_mask, 'WRP']
    enriched_df.loc[l_mask, 'RP'] = enriched_df.loc[l_mask, 'LRP']
    RP = enriched_df.RP.values.mean()
    enriched_df.loc[w_mask, 'PIE'] = enriched_df.loc[w_mask, 'WPIE']
    enriched_df.loc[l_mask, 'PIE'] = enriched_df.loc[l_mask, 'LPIE']
    PIE = enriched_df.PIE.values.mean()
    
    return [numWins, avgPointsScored, avgPointsAllowed, Power6Conf(team_id), avg3sMade, avgAssists, avgTurnovers,
            checkConferenceChamp(team_id, year), checkConferenceTourneyChamp(team_id, year), tournamentSeed,
            sos, srs, avgRebounds, avgSteals, getTourneyAppearances(team_id), findNumChampionships(team_id), elo,
            FGM, FGA, FGM3, FGA3, FTM, FTA, OR, DR, Ast, TO, Stl, Blk, PF, Pts, Pos, OffRtg, DefRtg, NetRtg, Score,
            AstR, TOR, TSP, eFGP, FTAR, ORP, DRP, RP, PIE]

In [487]:
def compareTwoTeams(id_1, id_2, year):
    team_1 = getSeasonTourneyData(id_1, year)
    team_2 = getSeasonTourneyData(id_2, year)
    diff = [a - b for a, b in zip(team_1, team_2)]
    return diff

In [488]:
def createSeasonDict(year):
    seasonDictionary = collections.defaultdict(list)
    for team in teamList:
        team_id = teams_pd[teams_pd['TeamName'] == team].values[0][0]
        team_vector = getSeasonTourneyData(team_id, year)
        seasonDictionary[team_id] = team_vector
    return seasonDictionary

In [489]:
def createTrainingSet(years, stage1Years):
    totalNumGames = 0
    for year in years:
        season = reg_season_compact_pd[reg_season_compact_pd['Season'] == year]
        totalNumGames += len(season.index)
        tourney = tourney_compact_pd[tourney_compact_pd['Season'] == year]
        totalNumGames += len(tourney.index)
    numFeatures = len(getSeasonTourneyData(1181,2012))
    X_train = np.zeros(( totalNumGames, numFeatures + 1))
    y_train = np.zeros(( totalNumGames ))
    indexCounter = 0
    for year in years:
        team_vectors = createSeasonDict(year)
        season = reg_season_compact_pd[reg_season_compact_pd['Season'] == year]
        numGamesInSeason = len(season.index)
        tourney = tourney_compact_pd[tourney_compact_pd['Season'] == year]
        numGamesInSeason += len(tourney.index)
        xTrainSeason = np.zeros(( numGamesInSeason, numFeatures + 1))
        yTrainSeason = np.zeros(( numGamesInSeason ))
        counter = 0
        for index, row in season.iterrows():
            w_team = row['WTeamID']
            w_vector = team_vectors[w_team]
            l_team = row['LTeamID']
            l_vector = team_vectors[l_team]
            diff = [a - b for a, b in zip(w_vector, l_vector)]
            home = createHomeStat(row['WLoc'])
            if (counter % 2 == 0):
                diff.append(home) 
                xTrainSeason[counter] = diff
                yTrainSeason[counter] = 1
            else:
                diff.append(-home)
                xTrainSeason[counter] = [ -p for p in diff]
                yTrainSeason[counter] = 0
            counter += 1
        for index, row in tourney.iterrows():
            w_team = row['WTeamID']
            w_vector = team_vectors[w_team]
            l_team = row['LTeamID']
            l_vector = team_vectors[l_team]
            diff = [a - b for a, b in zip(w_vector, l_vector)]
            home = 0
            if (counter % 2 == 0):
                diff.append(home) 
                xTrainSeason[counter] = diff
                yTrainSeason[counter] = 1
            else:
                diff.append(-home)
                xTrainSeason[counter] = [ -p for p in diff]
                yTrainSeason[counter] = 0
            counter += 1
        X_train[indexCounter:numGamesInSeason+indexCounter] = xTrainSeason
        y_train[indexCounter:numGamesInSeason+indexCounter] = yTrainSeason
        indexCounter += numGamesInSeason
        print ('Finished year:', year)
        if (year in stage1Years):
            np.save('Data/PrecomputedMatrices/TeamVectors/' + str(year) + 'TeamVectors', team_vectors)
    return X_train, y_train

In [490]:
def createAndSave(years, stage1Years):
    X_train, y_train = createTrainingSet(years, stage1Years)
    np.save('Data/PrecomputedMatrices/X_train', X_train)
    np.save('Data/PrecomputedMatrices/y_train', y_train)     

In [491]:
years = range(1994,2018)
# Saves the team vectors for the following years
stage1Years = range(2014,2018)
if os.path.exists("Data/PrecomputedMatrices/X_train.npy") and os.path.exists("Data/PrecomputedMatrices/y_train.npy"):
    print ('There are already precomputed X_train, and y_train matricies.')
#    os.remove("Data/PrecomputedMatrices/X_train.npy")
#    os.remove("Data/PrecomputedMatrices/y_train.npy")
#    createAndSave(years, stage1Years)
else:
    createAndSave(years, stage1Years)

  
  ret = ret.dtype.type(ret / rcount)
  ret = ret.dtype.type(ret / rcount)


Finished year: 1994
Finished year: 1995
Finished year: 1996
Finished year: 1997
Finished year: 1998
Finished year: 1999
Finished year: 2000
Finished year: 2001
Finished year: 2002
Finished year: 2003
Finished year: 2004
Finished year: 2005
Finished year: 2006
Finished year: 2007
Finished year: 2008
Finished year: 2009
Finished year: 2010
Finished year: 2011
Finished year: 2012
Finished year: 2013
Finished year: 2014
Finished year: 2015
Finished year: 2016
Finished year: 2017


In [3]:
Xtrain = np.load("Data/PrecomputedMatrices/X_train.npy")
ytrain = np.load("Data/PrecomputedMatrices/y_train.npy")
print ("Shape of X_train:", Xtrain.shape)
print ("Shape of y_train:", ytrain.shape)

Shape of X_train: (116530, 46)
Shape of y_train: (116530,)


In [4]:
Xtrain = np.nan_to_num(Xtrain)
ytrain = np.nan_to_num(ytrain)

In [5]:
X_train, X_test, Y_train, Y_test = train_test_split(Xtrain, ytrain)

<a id='classification'></a>
## Classification Analysis

<a id='log-reg'></a>
### Logistic Regression

In [6]:
log = LogisticRegression(random_state=95)
log.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=95, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

<a id='knn'></a>
### K Nearest Neighbors

In [7]:
knn = KNeighborsClassifier()
knn_params = {
              'n_neighbors': [60],
              'weights': ['uniform'],
              'p': [1],
              'algorithm': ['kd_tree'],
              'leaf_size': [18],
             }

knn_grid = GridSearchCV(knn, param_grid = knn_params, cv=5, verbose=1, n_jobs=-1)
knn_grid.fit(X_train, Y_train)
print(knn_grid.best_params_)
print(knn_grid.best_score_)
print(knn_grid.best_estimator_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  2.6min remaining:  3.8min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  2.6min finished


{'algorithm': 'kd_tree', 'leaf_size': 18, 'n_neighbors': 60, 'p': 1, 'weights': 'uniform'}
0.751181390666
KNeighborsClassifier(algorithm='kd_tree', leaf_size=18, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=60, p=1,
           weights='uniform')


<a id='forest'></a>
### Random Forest

In [8]:
rfc = RandomForestClassifier()
rfc_params = {
              'n_estimators': [1000],
              'max_features': ['log2'],
              'min_samples_split': [500], 
              'max_depth': [40],
              'min_samples_leaf': [4],
              'min_weight_fraction_leaf': [0], 
              'max_leaf_nodes': [200],
              'min_impurity_decrease': [0]
             }

rfr_grid = GridSearchCV(rfc, param_grid = rfc_params, cv=5, verbose=1, n_jobs=-1)
rfr_grid.fit(X_train, Y_train)
print(rfr_grid.best_params_)
print(rfr_grid.best_score_)
print(rfr_grid.best_estimator_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  2.9min remaining:  4.4min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  2.9min finished


{'max_depth': 40, 'max_features': 'log2', 'max_leaf_nodes': 200, 'min_impurity_decrease': 0, 'min_samples_leaf': 4, 'min_samples_split': 500, 'min_weight_fraction_leaf': 0, 'n_estimators': 1000}
0.759453985835
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=40, max_features='log2', max_leaf_nodes=200,
            min_impurity_decrease=0, min_impurity_split=None,
            min_samples_leaf=4, min_samples_split=500,
            min_weight_fraction_leaf=0, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)


<a id='extra'></a>
### Extra Trees

In [9]:
etrees = ExtraTreesClassifier()
etrees_params = {
              'n_estimators': [1000],
              'max_features': ['log2'],
              'min_samples_split': [500], 
              'max_depth': [40],
              'min_samples_leaf': [4],
              'min_weight_fraction_leaf': [0], 
              'max_leaf_nodes': [200],
              'min_impurity_decrease': [0]
             }

etrees_grid = GridSearchCV(etrees, param_grid = etrees_params, cv=5, verbose=1, n_jobs=-1)
etrees_grid.fit(X_train, Y_train)
print(etrees_grid.best_params_)
print(etrees_grid.best_score_)
print(etrees_grid.best_estimator_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  3.5min remaining:  5.3min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  3.6min finished


{'max_depth': 40, 'max_features': 'log2', 'max_leaf_nodes': 200, 'min_impurity_decrease': 0, 'min_samples_leaf': 4, 'min_samples_split': 500, 'min_weight_fraction_leaf': 0, 'n_estimators': 1000}
0.759259471149
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=40, max_features='log2', max_leaf_nodes=200,
           min_impurity_decrease=0, min_impurity_split=None,
           min_samples_leaf=4, min_samples_split=500,
           min_weight_fraction_leaf=0, n_estimators=1000, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)


<a id='svm'></a>
### Support Vector Machines

In [10]:
lsvc = LinearSVC()
lsvc_params = {
             'penalty': ['l2'], 
             'loss': ['hinge'],
             'C': [0.1],
             'multi_class': ['ovr'],
             'intercept_scaling': [0.5]
             }

lsvc_grid = GridSearchCV(lsvc, param_grid = lsvc_params, cv=5, verbose=1, n_jobs=-1)
lsvc_grid.fit(X_train, Y_train)
print(lsvc_grid.best_params_)
print(lsvc_grid.best_score_)
print(lsvc_grid.best_estimator_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   28.5s remaining:   42.8s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   29.0s finished


{'C': 0.1, 'intercept_scaling': 0.5, 'loss': 'hinge', 'multi_class': 'ovr', 'penalty': 'l2'}
0.617366728835
LinearSVC(C=0.1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=0.5, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0)


<a id='gradboost'></a>
### Gradient Boosting

In [11]:
gbc = GradientBoostingClassifier()
gbc_params = {
              'max_features': [None],
              'loss': ['deviance'],
              'n_estimators': [150],
              'max_depth': [3],
              'min_samples_leaf': [220],
              'min_samples_split': [2],
              'learning_rate': [0.1],
              'criterion': ['friedman_mse'],
              'min_weight_fraction_leaf': [0],
              'subsample': [1],
              'max_leaf_nodes': [16],
              'min_impurity_decrease': [0.2],
             }

gbc_grid = GridSearchCV(gbc, param_grid = gbc_params, cv=5, verbose=1, n_jobs =-1)
gbc_grid.fit(X_train, Y_train)
print(gbc_grid.best_params_)
print(gbc_grid.best_score_)
print(gbc_grid.best_estimator_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  1.6min remaining:  2.5min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.6min finished


{'criterion': 'friedman_mse', 'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 3, 'max_features': None, 'max_leaf_nodes': 16, 'min_impurity_decrease': 0.2, 'min_samples_leaf': 220, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0, 'n_estimators': 150, 'subsample': 1}
0.763023902422
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=16,
              min_impurity_decrease=0.2, min_impurity_split=None,
              min_samples_leaf=220, min_samples_split=2,
              min_weight_fraction_leaf=0, n_estimators=150, presort='auto',
              random_state=None, subsample=1, verbose=0, warm_start=False)


<a id='xgboost'></a>
### XGBoost

In [12]:
# This performs pretty well but the sklearn wrapper doesn't play nice with sparse matricies so won't be used.
xgb = xgboost.XGBClassifier()
xgb_params = {
              'max_depth': [5],
              'learning_rate': [.1],
              'n_estimators': [100],
              'gamma': [.65],
              'min_child_weight': [3],
              'max_delta_step': [1],
              'subsample': [0.7],
              'colsample_bytree': [0.7],
              'colsample_bylevel': [0.8],
              'reg_alpha': [0.1],
              'reg_lambda': [0.2],
              'scale_pos_weight' : [1],
              'base_score' : [0.5],
             }

grid = GridSearchCV(xgb, param_grid = xgb_params, cv=5, n_jobs=-1, verbose=1)
grid.fit(X_train, Y_train)
print(grid.best_params_)
print(grid.best_score_)
print(grid.best_estimator_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   19.6s remaining:   29.4s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   21.2s finished


{'base_score': 0.5, 'colsample_bylevel': 0.8, 'colsample_bytree': 0.7, 'gamma': 0.65, 'learning_rate': 0.1, 'max_delta_step': 1, 'max_depth': 5, 'min_child_weight': 3, 'n_estimators': 100, 'reg_alpha': 0.1, 'reg_lambda': 0.2, 'scale_pos_weight': 1, 'subsample': 0.7}
0.763390047713
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=0.8,
       colsample_bytree=0.7, gamma=0.65, learning_rate=0.1,
       max_delta_step=1, max_depth=5, min_child_weight=3, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0.1,
       reg_lambda=0.2, scale_pos_weight=1, seed=None, silent=True,
       subsample=0.7)


In [13]:
xgb = xgboost.XGBClassifier(base_score=0.5, colsample_bylevel=0.8, colsample_bytree=0.7,
                            gamma=0.65, learning_rate=0.1, max_delta_step=1, max_depth=5,
                            min_child_weight=3, missing=None, n_estimators=100, nthread=-1,
                            objective='binary:logistic', reg_alpha=0.1, reg_lambda=0.2,
                            scale_pos_weight=1, seed=0, silent=True, subsample=0.7)

xgb.fit(X_train, Y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=0.8,
       colsample_bytree=0.7, gamma=0.65, learning_rate=0.1,
       max_delta_step=1, max_depth=5, min_child_weight=3, missing=None,
       n_estimators=100, n_jobs=1, nthread=-1, objective='binary:logistic',
       random_state=0, reg_alpha=0.1, reg_lambda=0.2, scale_pos_weight=1,
       seed=0, silent=True, subsample=0.7)

<a id='lgbm'></a>
### LightGBM

In [14]:
lgbm = lgb.LGBMClassifier()
lgbm_params = {
               'num_boost_round': [40],
               'learning_rate': [0.1],
               'num_leaves': [35],
               'num_threads': [4],
               'max_depth': [6],
               'min_data_in_leaf': [30],
               'feature_fraction': [1.0],
               'feature_fraction_seed': [95],
               'bagging_freq': [0],
               'bagging_seed': [95],
               'lambda_l1': [0.0],
               'lambda_l2': [0.0],
               'min_split_gain': [0],
             }

lgbm_grid = GridSearchCV(lgbm, param_grid = lgbm_params, cv=5, n_jobs=-1, verbose=1)
lgbm_grid.fit(X_train, Y_train)
print(lgbm_grid.best_params_)
print(lgbm_grid.best_score_)
print(lgbm_grid.best_estimator_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    1.8s remaining:    2.7s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.3s finished


{'bagging_freq': 0, 'bagging_seed': 95, 'feature_fraction': 1.0, 'feature_fraction_seed': 95, 'lambda_l1': 0.0, 'lambda_l2': 0.0, 'learning_rate': 0.1, 'max_depth': 6, 'min_data_in_leaf': 30, 'min_split_gain': 0, 'num_boost_round': 40, 'num_leaves': 35, 'num_threads': 4}
0.763309953431
LGBMClassifier(bagging_freq=0, bagging_seed=95, boosting_type='gbdt',
        class_weight=None, colsample_bytree=1.0, feature_fraction=1.0,
        feature_fraction_seed=95, lambda_l1=0.0, lambda_l2=0.0,
        learning_rate=0.1, max_depth=6, min_child_samples=20,
        min_child_weight=0.001, min_data_in_leaf=30, min_split_gain=0,
        n_estimators=100, n_jobs=-1, num_boost_round=40, num_leaves=35,
        num_threads=4, objective=None, random_state=None, reg_alpha=0.0,
        reg_lambda=0.0, silent=True, subsample=1.0,
        subsample_for_bin=200000, subsample_freq=1)


<a id='nn'></a>
### Keras/Tensorflow Neural Network

In [504]:
model_k = Sequential()
model_k.add(Dense(34, input_dim=Xtrain.shape[1], activation='relu'))
model_k.add(Dense(34, activation='relu'))
model_k.add(Dense(34, activation='relu'))
model_k.add(Dense(34, activation='relu'))
model_k.add(Dense(34, activation='relu'))
model_k.add(Dense(17, activation='relu'))
model_k.add(Dense(17, activation='relu'))
model_k.add(Dense(17, activation='relu'))
model_k.add(Dense(17, activation='relu'))
model_k.add(Dense(1, activation='sigmoid'))

model_k.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model_k.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=20)

Instructions for updating:
keep_dims is deprecated, use keepdims instead
Train on 87397 samples, validate on 29133 samples
Epoch 1/20

KeyboardInterrupt: 

<a id='ensemble'></a>
## Ensembling Our models

In [15]:
model1 = CalibratedClassifierCV(base_estimator=gbc_grid.best_estimator_, cv=5)
model2 = CalibratedClassifierCV(base_estimator=rfr_grid.best_estimator_, cv=5)
model3 = CalibratedClassifierCV(base_estimator=etrees_grid.best_estimator_, cv=5)
model4 = CalibratedClassifierCV(base_estimator=knn_grid.best_estimator_, cv=5)
model5 = CalibratedClassifierCV(base_estimator=lsvc_grid.best_estimator_, cv=5)
model6 = CalibratedClassifierCV(base_estimator=lgbm_grid.best_estimator_, cv=5)

clf1 = gbc_grid.best_estimator_
clf2 = rfr_grid.best_estimator_
clf3 = etrees_grid.best_estimator_
clf4 = knn_grid.best_estimator_
clf5 = lsvc_grid.best_estimator_
clf6 = lgbm_grid.best_estimator_

In [16]:
calibrated_model = VotingClassifier(estimators=[('gbc', model1), ('rfr', model2), ('etrees', model3), ('knn', model4), 
                                                ('lsvc', model5), ('lgbm', clf6), ('log', log)])

In [17]:
model_hard = VotingClassifier(estimators=[('gbc', clf1), ('rfr', clf2), ('etrees', clf3), ('knn', clf4), 
                                          ('lsvc', clf5), ('lgbm', clf6), ('log', log)])

In [22]:
model_soft = VotingClassifier(estimators=[('gbc', clf1), ('rfr', clf2), ('etrees', clf3), ('knn', clf4), 
                                          ('lgbm', clf6), ('log', log)], voting='soft')

<a id='exporting'></a>
## Exporting Our Submission

In [23]:
categories=['Wins','PPG','PPGA','PowerConf','3PG', 'APG','TOP','Conference Champ','Tourney Conference Champ',
            'Seed','SOS','SRS', 'RPG', 'SPG', 'Tourney Appearances','National Championships','Location', 'Team Season Elo']

accuracy=[]
numTrials = 1

for i in range(numTrials):
    results = model_soft.fit(X_train, Y_train)
    preds = model_soft.predict(X_test)

#    preds[preds < .5] = 0
#    preds[preds >= .5] = 1
    localAccuracy = np.mean(preds == Y_test)
    accuracy.append(localAccuracy)
    print ("Finished run #" + str(i) + ". Accuracy = " + str(localAccuracy))
print ("The average accuracy is", sum(accuracy)/len(accuracy))



Finished run #0. Accuracy = 0.762262726118
The average accuracy is 0.762262726118


In [24]:
def predictGame(team_1_vector, team_2_vector, home):
    diff = [a - b for a, b in zip(team_1_vector, team_2_vector)]
    diff.append(home)

    #return model_hard.predict([diff])[0]
    return model_soft.predict_proba([diff])[0][1] # Depends on model(s) chosen

In [25]:
def loadTeamVectors(years):
    listDictionaries = []
    for year in years:
        curVectors = np.load("Data/PrecomputedMatrices/TeamVectors/" + str(year) + "TeamVectors.npy").item()
        listDictionaries.append(curVectors)
    return listDictionaries

In [26]:
def createPrediction():
    if os.path.exists("results.csv"):
        os.remove("results.csv")
    years = range(2014,2018)
    listDictionaries = loadTeamVectors(years)
    print ("Loaded the team vectors.")
    results = [[0 for x in range(2)] for x in range(len(sample_sub_pd.index))]
    for index, row in sample_sub_pd.iterrows():
        matchupId = row['ID']
        year = int(matchupId[0:4]) 
        teamVectors = listDictionaries[year - years[0]]
        team1Id = int(matchupId[5:9])
        team2Id = int(matchupId[10:14])
        team1Vector = teamVectors[team1Id] 
        team2Vector = teamVectors[team2Id]
        pred = predictGame(team1Vector, team2Vector, 0)
        results[index][0] = matchupId
        results[index][1] = pred
    results = pd.np.array(results)
    firstRow = [[0 for x in range(2)] for x in range(1)]
    firstRow[0][0] = 'ID'
    firstRow[0][1] = 'Pred'
    with open("result.csv", "w") as f:
        writer = csv.writer(f)
        writer.writerows(firstRow)
        writer.writerows(results)
        print("Saved Results.")

In [27]:
createPrediction()

Loaded the team vectors.
Saved Results.


In [28]:
results = pd.read_csv('result.csv')
results

Unnamed: 0,ID,Pred
0,2014_1107_1110,0.441945
1,2014_1107_1112,0.040672
2,2014_1107_1113,0.136355
3,2014_1107_1124,0.055960
4,2014_1107_1140,0.168766
5,2014_1107_1142,0.576289
6,2014_1107_1153,0.102997
7,2014_1107_1157,0.584205
8,2014_1107_1160,0.184427
9,2014_1107_1163,0.055720


In [None]:
from bracketeer import build_bracket
b = build_bracket(
        output_path='output.png', 
        teamsPath='Data/KaggleData/Teams.csv',
        seedsPath='Data/KaggleData/NCAATourneySeeds.csv',
        submissionPath='result.csv',
        slotsPath='Data/KaggleData/NCAATourneySlots.csv',
        year=2018
)