# March Madness Machine Learning

I try to predict the outcome of March Madnss matches based on the outcome of previous matches between two teams.

I used a MLP neural network classifier. I used pybrain for the classifier and pandas to sift through the data.

This is based on the [World Cup Classifer](https://github.com/fisadev/world_cup_learning/) built by [fisadev](https://github.com/fisadev/)

In [1]:
import pandas as pd
import numpy as np

# Data cleaning
Get season and seeds data

In [2]:
df = pd.read_csv('march-machine-learning-mania-2016-v2/TourneySeeds.csv')
df

Unnamed: 0,Season,Seed,Team
0,1985,W01,1207
1,1985,W02,1210
2,1985,W03,1228
3,1985,W04,1260
4,1985,W05,1374
5,1985,W06,1208
6,1985,W07,1393
7,1985,W08,1396
8,1985,W09,1439
9,1985,W10,1177


Out of curiosity, let's take a look at unique team codes in 2003

In [3]:
df[df['Season'] == 2003].Team.unique()

array([1328, 1448, 1393, 1257, 1280, 1329, 1386, 1143, 1301, 1120, 1335,
       1139, 1122, 1264, 1190, 1354, 1400, 1196, 1462, 1390, 1163, 1268,
       1277, 1261, 1345, 1160, 1423, 1140, 1360, 1407, 1358, 1411, 1421,
       1246, 1338, 1266, 1173, 1458, 1281, 1231, 1332, 1428, 1104, 1356,
       1451, 1409, 1221, 1447, 1237, 1112, 1242, 1181, 1228, 1323, 1166,
       1272, 1153, 1211, 1113, 1141, 1454, 1443, 1161, 1429, 1436])

In [4]:
df[df['Team'] == 1437]

Unnamed: 0,Season,Seed,Team
55,1985,Z08,1437
105,1986,Y10,1437
229,1988,Y06,1437
363,1990,Y12,1437
392,1991,W09,1437
642,1995,W03,1437
722,1996,X03,1437
771,1997,W04,1437
919,1999,X08,1437
1336,2005,Z05,1437


Get detailed season results (2003 onwards)

In [5]:
df2 = pd.read_csv('march-machine-learning-mania-2016-v2/RegularSeasonDetailedResults.csv')
df2

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,Wfgm,Wfga,...,Lfga3,Lftm,Lfta,Lor,Ldr,Last,Lto,Lstl,Lblk,Lpf
0,2003,10,1104,68,1328,62,N,0,27,58,...,10,16,22,10,22,8,18,9,2,20
1,2003,10,1272,70,1393,63,N,0,26,62,...,24,9,20,20,25,7,12,8,6,16
2,2003,11,1266,73,1437,61,N,0,24,58,...,26,14,23,31,22,9,12,2,5,23
3,2003,11,1296,56,1457,50,N,0,18,38,...,22,8,15,17,20,9,19,4,3,23
4,2003,11,1400,77,1208,71,N,0,30,61,...,16,17,27,21,15,12,10,7,1,14
5,2003,11,1458,81,1186,55,H,0,26,57,...,11,12,17,6,22,8,19,4,3,25
6,2003,12,1161,80,1236,62,H,0,23,55,...,15,20,28,9,21,11,30,10,4,28
7,2003,12,1186,75,1457,61,N,0,28,62,...,17,17,23,8,25,10,15,14,8,18
8,2003,12,1194,71,1156,66,N,0,28,58,...,18,12,27,13,26,13,25,8,2,18
9,2003,12,1458,84,1296,56,H,0,32,67,...,14,7,12,9,23,10,18,1,3,18


Merge season seed data for both winner and loser teams with the deatiled data dataframe

In [6]:
games = pd.merge(df2, df, how='left', left_on=['Wteam', 'Season'], right_on=['Team', 'Season'], suffixes=('', 'W'))
games['WSeed'] = games['Seed']
del games["Seed"]
del games["Team"]
games = pd.merge(games, df, how='left', left_on=['Lteam', 'Season'], right_on=['Team', 'Season'], suffixes=('', 'L'))
games['LSeed'] = games['Seed']
del games["Seed"]
del games["Team"]
games.head()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,Wfgm,Wfga,...,Lfta,Lor,Ldr,Last,Lto,Lstl,Lblk,Lpf,WSeed,LSeed
0,2003,10,1104,68,1328,62,N,0,27,58,...,22,10,22,8,18,9,2,20,Y10,W01
1,2003,10,1272,70,1393,63,N,0,26,62,...,20,20,25,7,12,8,6,16,Z07,W03
2,2003,11,1266,73,1437,61,N,0,24,58,...,23,31,22,9,12,2,5,23,Y03,
3,2003,11,1296,56,1457,50,N,0,18,38,...,15,17,20,9,19,4,3,23,,
4,2003,11,1400,77,1208,71,N,0,30,61,...,27,21,15,12,10,7,1,14,X01,


In [7]:
games.columns

Index([u'Season', u'Daynum', u'Wteam', u'Wscore', u'Lteam', u'Lscore', u'Wloc',
       u'Numot', u'Wfgm', u'Wfga', u'Wfgm3', u'Wfga3', u'Wftm', u'Wfta',
       u'Wor', u'Wdr', u'Wast', u'Wto', u'Wstl', u'Wblk', u'Wpf', u'Lfgm',
       u'Lfga', u'Lfgm3', u'Lfga3', u'Lftm', u'Lfta', u'Lor', u'Ldr', u'Last',
       u'Lto', u'Lstl', u'Lblk', u'Lpf', u'WSeed', u'LSeed'],
      dtype='object')

Begin parsing through seed data to seperate it out

In [8]:
games['Seed'].unique()

KeyError: 'Seed'

Create region data as seperate columns

In [9]:
games['Wregion'] = games.WSeed.str[0]

In [10]:
games['Lregion'] = games.LSeed.str[0]

In [11]:
games.head()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,Wfgm,Wfga,...,Ldr,Last,Lto,Lstl,Lblk,Lpf,WSeed,LSeed,Wregion,Lregion
0,2003,10,1104,68,1328,62,N,0,27,58,...,22,8,18,9,2,20,Y10,W01,Y,W
1,2003,10,1272,70,1393,63,N,0,26,62,...,25,7,12,8,6,16,Z07,W03,Z,W
2,2003,11,1266,73,1437,61,N,0,24,58,...,22,9,12,2,5,23,Y03,,Y,
3,2003,11,1296,56,1457,50,N,0,18,38,...,20,9,19,4,3,23,,,,
4,2003,11,1400,77,1208,71,N,0,30,61,...,15,12,10,7,1,14,X01,,X,


Import season data (2003 and above)

In [12]:
seasons = pd.read_csv('march-machine-learning-mania-2016-v2/Seasons.csv')
seasons = seasons[seasons['Season'] > 2002]

Fix seasons data for ease of importing into games dataframe

In [13]:
seasons['W'] = seasons.Regionw
seasons['X'] = seasons.Regionx
seasons['Y'] = seasons.Regiony
seasons['Z'] = seasons.Regionz
del seasons['Regionw']
del seasons['Regionx']
del seasons['Regiony']
del seasons['Regionz']
seasons.head()

Unnamed: 0,Season,Dayzero,W,X,Y,Z
18,2003,11/4/2002,East,South,Midwest,West
19,2004,11/3/2003,Atlanta,Phoenix,EastRutherford,StLouis
20,2005,11/1/2004,Albuquerque,Chicago,Austin,Syracuse
21,2006,10/31/2005,Atlanta,Oakland,Minneapolis,WashingtonDC
22,2007,10/30/2006,East,South,Midwest,West


In [14]:
seasons

Unnamed: 0,Season,Dayzero,W,X,Y,Z
18,2003,11/4/2002,East,South,Midwest,West
19,2004,11/3/2003,Atlanta,Phoenix,EastRutherford,StLouis
20,2005,11/1/2004,Albuquerque,Chicago,Austin,Syracuse
21,2006,10/31/2005,Atlanta,Oakland,Minneapolis,WashingtonDC
22,2007,10/30/2006,East,South,Midwest,West
23,2008,11/5/2007,East,Midwest,South,West
24,2009,11/3/2008,East,South,Midwest,West
25,2010,11/2/2009,East,South,Midwest,West
26,2011,11/1/2010,East,West,Southeast,Southwest
27,2012,10/31/2011,East,Midwest,South,West


Add Region data into games df for both winning and losing teams

In [83]:
for index, row in games.iterrows():
    try:
        games.loc[index, 'Wregion'] = seasons[seasons['Season'] == 2007][[games.loc[index, 'Wregion']]].values[0][0]
    except:
        games.loc[index, 'Wregion'] = np.nan

In [84]:
games.head()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,Wfgm,Wfga,...,Ldr,Last,Lto,Lstl,Lblk,Lpf,WSeed,LSeed,Wregion,Lregion
0,2003,10,1104,68,1328,62,N,0,27,58,...,22,8,18,9,2,20,Y10,W01,Midwest,W
1,2003,10,1272,70,1393,63,N,0,26,62,...,25,7,12,8,6,16,Z07,W03,West,W
2,2003,11,1266,73,1437,61,N,0,24,58,...,22,9,12,2,5,23,Y03,,Midwest,
3,2003,11,1296,56,1457,50,N,0,18,38,...,20,9,19,4,3,23,,,,
4,2003,11,1400,77,1208,71,N,0,30,61,...,15,12,10,7,1,14,X01,,South,


In [85]:
for index, row in games.iterrows():
    try:
        games.loc[index, 'Lregion'] = seasons[seasons['Season'] == 2007][[games.loc[index, 'Lregion']]].values[0][0]
    except:
        games.loc[index, 'Lregion'] = np.nan

In [86]:
games.head()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,Wfgm,Wfga,...,Ldr,Last,Lto,Lstl,Lblk,Lpf,WSeed,LSeed,Wregion,Lregion
0,2003,10,1104,68,1328,62,N,0,27,58,...,22,8,18,9,2,20,Y10,W01,Midwest,East
1,2003,10,1272,70,1393,63,N,0,26,62,...,25,7,12,8,6,16,Z07,W03,West,East
2,2003,11,1266,73,1437,61,N,0,24,58,...,22,9,12,2,5,23,Y03,,Midwest,
3,2003,11,1296,56,1457,50,N,0,18,38,...,20,9,19,4,3,23,,,,
4,2003,11,1400,77,1208,71,N,0,30,61,...,15,12,10,7,1,14,X01,,South,


Import Teams data and replace team code's with team names

In [87]:
teams = pd.read_csv('march-machine-learning-mania-2016-v2/Teams.csv')
teams.head()

Unnamed: 0,Team_Id,Team_Name
0,1101,Abilene Chr
1,1102,Air Force
2,1103,Akron
3,1104,Alabama
4,1105,Alabama A&M


In [110]:
for index, row in games.iterrows():
    games.loc[index, 'Wteam'] = teams[teams['Team_Id'] == games.loc[index, 'Wteam']].Team_Name.to_string().split("    ")[1]
    games.loc[index, 'Lteam'] = teams[teams['Team_Id'] == games.loc[index, 'Lteam']].Team_Name.to_string().split("    ")[1]

In [111]:
games.head()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,Wfgm,Wfga,...,Ldr,Last,Lto,Lstl,Lblk,Lpf,WSeed,LSeed,Wregion,Lregion
0,2003,10,Alabama,68,Oklahoma,62,N,0,27,58,...,22,8,18,9,2,20,Y10,W01,Midwest,East
1,2003,10,Memphis,70,Syracuse,63,N,0,26,62,...,25,7,12,8,6,16,Z07,W03,West,East
2,2003,11,Marquette,73,Villanova,61,N,0,24,58,...,22,9,12,2,5,23,Y03,,Midwest,
3,2003,11,N Illinois,56,Winthrop,50,N,0,18,38,...,20,9,19,4,3,23,,,,
4,2003,11,Texas,77,Georgia,71,N,0,30,61,...,15,12,10,7,1,14,X01,,South,


Save games df to csv incase we lose the kernel

In [112]:
games.to_csv('all_games.csv')

Count nans by column

In [113]:
for c in games.columns:
    print c, len(games[c]) - games[c].count()

Season 0
Daynum 0
Wteam 0
Wscore 0
Lteam 0
Lscore 0
Wloc 0
Numot 0
Wfgm 0
Wfga 0
Wfgm3 0
Wfga3 0
Wftm 0
Wfta 0
Wor 0
Wdr 0
Wast 0
Wto 0
Wstl 0
Wblk 0
Wpf 0
Lfgm 0
Lfga 0
Lfgm3 0
Lfga3 0
Lftm 0
Lfta 0
Lor 0
Ldr 0
Last 0
Lto 0
Lstl 0
Lblk 0
Lpf 0
WSeed 49914
LSeed 63332
Wregion 49914
Lregion 63332


In [114]:
len(games)

71241

We have very little seed data. This might pose a problem since seed data is usually highly correlated with teams winning or losing. Should come back to that.

In [128]:
games.describe()

Unnamed: 0,Season,Daynum,Wscore,Lscore,Numot,Wfgm,Wfga,Wfgm3,Wfga3,Wftm,...,Lfga3,Lftm,Lfta,Lor,Ldr,Last,Lto,Lstl,Lblk,Lpf
count,71241.0,71241.0,71241.0,71241.0,71241.0,71241.0,71241.0,71241.0,71241.0,71241.0,...,71241.0,71241.0,71241.0,71241.0,71241.0,71241.0,71241.0,71241.0,71241.0,71241.0
mean,2009.709423,71.443467,74.720568,62.752713,0.072304,25.830126,54.698109,6.857821,17.92145,16.202496,...,18.988265,12.178465,18.11339,11.317556,21.325543,11.394478,14.481029,6.08234,2.868587,19.867829
std,3.993369,35.203727,11.059601,10.873009,0.314147,4.676932,7.598108,2.981373,5.62757,6.269218,...,5.789449,5.368745,7.166855,4.224845,4.493498,3.726841,4.462374,2.786944,2.050225,4.526861
min,2003.0,0.0,34.0,20.0,0.0,10.0,27.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,5.0
25%,2006.0,40.0,67.0,55.0,0.0,23.0,49.0,5.0,14.0,12.0,...,15.0,8.0,13.0,8.0,18.0,9.0,11.0,4.0,1.0,17.0
50%,2010.0,75.0,74.0,62.0,0.0,26.0,54.0,7.0,18.0,16.0,...,19.0,12.0,18.0,11.0,21.0,11.0,14.0,6.0,3.0,20.0
75%,2013.0,102.0,82.0,70.0,0.0,29.0,59.0,9.0,21.0,20.0,...,23.0,16.0,23.0,14.0,24.0,14.0,17.0,8.0,4.0,23.0
max,2016.0,132.0,144.0,140.0,6.0,56.0,103.0,25.0,56.0,48.0,...,54.0,42.0,61.0,36.0,45.0,31.0,41.0,22.0,18.0,45.0


In [134]:
games_na = games.dropna()

In [138]:
import re

In [139]:
games_na['WSeed'] = games_na['WSeed'].map(lambda x: re.sub('[^0-9]','',x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [141]:
games_na['LSeed'] = games_na['LSeed'].map(lambda x: re.sub('[^0-9]','',x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


Begin neural net learning using pybrain

In [145]:
games_na[['WSeed','LSeed']] = games_na[['WSeed','LSeed']].apply(pd.to_numeric)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[k1] = value[k2]


In [146]:
games_na.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4018 entries, 0 to 71240
Data columns (total 38 columns):
Season     4018 non-null int64
Daynum     4018 non-null int64
Wteam      4018 non-null object
Wscore     4018 non-null int64
Lteam      4018 non-null object
Lscore     4018 non-null int64
Wloc       4018 non-null object
Numot      4018 non-null int64
Wfgm       4018 non-null int64
Wfga       4018 non-null int64
Wfgm3      4018 non-null int64
Wfga3      4018 non-null int64
Wftm       4018 non-null int64
Wfta       4018 non-null int64
Wor        4018 non-null int64
Wdr        4018 non-null int64
Wast       4018 non-null int64
Wto        4018 non-null int64
Wstl       4018 non-null int64
Wblk       4018 non-null int64
Wpf        4018 non-null int64
Lfgm       4018 non-null int64
Lfga       4018 non-null int64
Lfgm3      4018 non-null int64
Lfga3      4018 non-null int64
Lftm       4018 non-null int64
Lfta       4018 non-null int64
Lor        4018 non-null int64
Ldr        4018 non-n

In [169]:
del games_na['winner']

I needed to add a winner column that randomized through which team won. Since it was always team 1 I decided to use a randomizer between 1 and 2 and assign the winner column value to 1 or 2 depending on the randomizer.

In [174]:
g = games_na[['Season', 'Daynum', 'Numot', 'Wloc']]
w = games_na[['Wteam', 'Wscore', 'Wfgm', 'Wfga', 'Wfgm3', 'Wfga3', 'Wftm', 'Wfta', 'Wor', 'Wdr', 'Wast', 'Wto', 'Wstl', 'Wblk', 'Wpf', 'WSeed', 'Wregion']]
l = games_na[['Lteam', 'Lscore', 'Lfgm', 'Lfga', 'Lfgm3', 'Lfga3', 'Lftm', 'Lfta', 'Lor', 'Ldr', 'Last', 'Lto', 'Lstl', 'Lblk', 'Lpf', 'LSeed', 'Lregion']]

In [229]:
w.columns = w.columns.map(lambda x: x.strip('W'))
l.columns = l.columns.map(lambda x: x.strip('L'))

In [273]:
games_rnd = pd.DataFrame()

In [274]:
import random

In [275]:
for i in range(0, len(g)):
    who_first = random.randint(1, 3)
    if who_first == 1:
        s = pd.Series([1],index=['winner'])
        one = w.iloc[i].rename(lambda x: x + '1')
        two = l.iloc[i].rename(lambda x: x + '2')
    else:
        s = pd.Series([2],index=['winner'])
        one = l.iloc[i].rename(lambda x: x + '1')
        two = w.iloc[i].rename(lambda x: x + '2')
    temp = pd.concat([g.iloc[i], one, two, s])
    games_rnd = games_rnd.append(temp, ignore_index=True)

I created a team_stats dataframe to concatenate the average results of a team in a single season. I will be using this to gather data about teams to make a predcition.

In [None]:
team_stats_1 = games_rnd.groupby(['Season','team1'])['Seed1', 'ast1', 'blk1', 'dr1', 'fga1', 'fga31', 'fgm1', 'fgm31', 'fta1', 'ftm1', 'or1', 'pf1', 'score1', 'stl1', 'to1'].mean().reset_index()
team_stats_2 = games_rnd.groupby(['Season','team2'])['Seed2', 'ast2', 'blk2', 'dr2', 'fga2', 'fga32', 'fgm2', 'fgm32', 'fta2', 'ftm2', 'or2', 'pf2', 'score2', 'stl2', 'to2'].mean().reset_index()

In [22]:
team_stats_1.columns = team_stats_1.columns.map(lambda x: x.strip('1'))
team_stats_2.columns = team_stats_2.columns.map(lambda x: x.strip('2'))

In [24]:
team_stats = pd.concat([team_stats_1, team_stats_2])

In [26]:
team_stats = team_stats.groupby(['Season','team'])['Seed', 'ast', 'blk', 'dr', 'fga', 'fga3', 'fgm', 'fgm3', 'fta', 'ftm', 'or', 'pf', 'score', 'stl', 'to'].mean().reset_index()

In [31]:
team_stats.head()

Unnamed: 0,Season,team,Seed,ast,blk,dr,fga,fga3,fgm,fgm3,fta,ftm,or,pf,score,stl,to
0,2003.0,Alabama,10.0,10.714286,3.0,23.071429,54.821429,17.517857,22.732143,5.375,19.857143,14.142857,12.482143,18.553571,64.982143,6.285714,14.285714
1,2003.0,Arizona,1.0,16.0,3.333333,25.5,63.666667,18.5,29.333333,6.916667,24.833333,17.75,15.5,19.25,83.333333,8.25,14.0
2,2003.0,Arizona St,10.0,14.244444,3.377778,21.577778,57.033333,14.533333,24.977778,4.255556,26.588889,16.711111,13.577778,20.533333,70.922222,4.544444,13.455556
3,2003.0,Auburn,10.0,12.75,4.875,21.25,58.125,18.875,24.8125,5.75,17.75,11.125,13.4375,15.9375,66.5,8.0625,14.3125
4,2003.0,Austin Peay,13.0,14.5,5.5,24.5,62.5,23.0,22.5,8.5,14.0,10.0,11.5,23.0,63.5,7.0,17.5


# Learning
Ok, now we have everything we need. Lets feed the selected input features to a the neural network classifier, and let it learn.

We have to normalize the data, otherwise the features with smaller values will impose a greater weight on the prediction.

In [None]:
from random import random

from IPython.display import SVG
import pygal

from pybrain.structure import SigmoidLayer
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.datasets import ClassificationDataSet
from pybrain.utilities import percentError

from sklearn.preprocessing import StandardScaler

In [51]:
# the features I will feed to the classifier as input data.
input_features = ['Season','score1','score2','fgm1','fga1','fgm31','fga31','ftm1','fta1','or1','dr1','ast1','to1','stl1','blk1','pf1','fgm2','fga2','fgm32','fga32','ftm2','fta2','or2','dr2','ast2','to2','stl2','blk2','pf2','Seed1','Seed2']

# the feature giving the result the classifier must learn to predict
output_feature = 'winner'

I defined a normalizer to be able to normalize the data using a simple function rather than a complex one that I would constantly have to google.

In [28]:
def normalize(array):
    scaler = StandardScaler()
    array = scaler.fit_transform(array)

    return scaler, array

I also defined a sample extractor to be able to pull the data out easier when I need it.

In [29]:
def extract_samples(matches, origin_features, result_feature):
    inputs = [tuple(matches.loc[i, feature]
                    for feature in origin_features)
              for i in matches.index]

    outputs = tuple(matches[result_feature].values)

    assert len(inputs) == len(outputs)

    return inputs, outputs

And a basic splitter because I wanted randomized samples whenever I could get them.

In [30]:
def split_samples(inputs, outputs, percent=0.75):
    assert len(inputs) == len(outputs)

    inputs1 = []
    inputs2 = []
    outputs1 = []
    outputs2 = []

    for i, inputs_row in enumerate(inputs):
        if random() < percent:
            input_to = inputs1
            output_to = outputs1
        else:
            input_to = inputs2
            output_to = outputs2

        input_to.append(inputs_row)
        output_to.append(outputs[i])

    return inputs1, outputs1, inputs2, outputs2

In [219]:
inputs, outputs = extract_samples(games_rnd,
                                  input_features,
                                  output_feature)

normalizer, inputs = normalize(inputs)

train_inputs, train_outputs, test_inputs, test_outputs = split_samples(inputs, outputs)

n = buildNetwork(len(input_features),
                 10 * len(input_features),
                 10 * len(input_features),
                 1,
                 outclass=SigmoidLayer,
                 bias=True)

To be able to evaluate the results and show progress on the learning cycle, we need these two functions wich help us calculate how well the network can predict the results from the games used to learn, and the games it doesn't know.

In [220]:
def neural_result(input):
    """Call the neural network, and translates its output to a match result."""
    n_output = n.activate(input) 
    if n_output >= 0.5:
        return 2
    else:
        return 1
    
def test_network():
    """Calculate train and test sets errors."""
    print (100 - percentError(map(neural_result, train_inputs), train_outputs), 
           100 - percentError(map(neural_result, test_inputs), test_outputs))

In [221]:
train_set = ClassificationDataSet(len(input_features))

for i, input_line in enumerate(train_inputs):
    train_set.addSample(train_inputs[i], [train_outputs[i] - 1])

trainer = BackpropTrainer(n, dataset=train_set, momentum=0.5, weightdecay=0.0)

train_set.assignClasses()

test_network()

(56.590450571620714, 56.70498084291188)


Train the network, for a given number of iterations. You can re-run this step many times, and it will keep learning but if you train too much you can end up overfitting the training data (this is visible when the test set accuracy starts to decrease).

In [222]:
for i in range(10):
    trainer.train()
    test_network()

(67.11499663752522, 66.28352490421456)
(67.28312037659717, 66.18773946360153)
(67.28312037659717, 66.18773946360153)
(67.38399462004034, 66.28352490421456)
(67.51849361129791, 66.28352490421456)
(67.955615332885, 66.47509578544062)
(80.43039677202421, 77.58620689655172)
(90.34969737726968, 86.7816091954023)
(93.37592468056489, 90.13409961685824)
(94.68728984532616, 90.80459770114942)


In [223]:
for i in range(5):
    trainer.train()
    test_network()

(95.62878278412911, 91.47509578544062)
(96.33490248823134, 91.37931034482759)
(96.77202420981843, 91.37931034482759)
(97.04102219233356, 91.66666666666667)
(97.2763954270343, 92.04980842911877)


In [235]:
for i in range(5):
    trainer.train()
    test_network()

(97.57901815736382, 92.33716475095785)
(97.74714189643578, 92.43295019157088)
(97.88164088769334, 92.24137931034483)
(97.94889038332212, 92.43295019157088)
(98.0497646267653, 92.43295019157088)


In [260]:
for i in range(5):
    trainer.train()
    test_network()

(98.15063887020847, 92.52873563218391)
(98.25151311365165, 92.72030651340997)
(98.4196368527236, 92.52873563218391)
(98.4196368527236, 92.52873563218391)
(98.48688634835239, 92.62452107279694)


The classifier taps out at around 92% accuracy, not bad :)

I pickle all the things just in case I end up losing data

In [261]:
import pickle
pickle.dump(input_features, open('pkl/input_features.pkl', 'wb'))
pickle.dump(games_rnd, open('pkl/games_rnd.pkl', 'wb'))
pickle.dump(train_inputs, open('pkl/train_inputs.pkl', 'wb'))
pickle.dump(train_outputs, open('pkl/train_outputs.pkl', 'wb'))
pickle.dump(normalizer, open('pkl/normalizer.pkl', 'wb'))
pickle.dump(inputs, open('pkl/inputs.pkl', 'wb'))
pickle.dump(outputs, open('pkl/outputs.pkl', 'wb'))
pickle.dump(n, open('pkl/n.pkl', 'wb'))
pickle.dump(team_stats, open('pkl/team_stats.pkl', 'wb'))

In [44]:
import pickle
input_features = pickle.load(open('pkl/input_features.pkl', 'rb'))
n = pickle.load(open('pkl/n.pkl', 'rb'))
outputs = pickle.load(open('pkl/outputs.pkl', 'rb'))
train_outputs = pickle.load(open('pkl/train_outputs.pkl', 'rb'))
inputs = pickle.load(open('pkl/inputs.pkl', 'rb'))
normalizer = pickle.load(open('pkl/normalizer.pkl', 'rb'))
train_inputs = pickle.load(open('pkl/train_inputs.pkl', 'rb'))
games_rnd = pickle.load(open('pkl/games_rnd.pkl', 'rb'))

# Prediction
With the classifier already trained, we can start making predictions. But we need a little function able to translate inputs like this: (2014, 'California', 'Hawaii'), to the numeric inputs the classifier expects (based on the input features).

This function does the conversion, also normalizes the data with the same normalizer used before, and then just asks the classifier for the prediction.

In [262]:
def neural_result(input):
    """Call the neural network, and translates its output to a match result."""
    n_output = n.activate(input) 
    if n_output >= 0.5:
        return 2
    else:
        return 1

def predict(year, team1, team2):
    inputs = []
    diff_year_1 = ''
    diff_year_2 = ''
    for feature in input_features:
        from_team_2 = '2' in feature
        feature = feature.replace('2', '')
        feature = feature.replace('1', '')
        if feature in [x for x in team_stats.columns.values if x != 'Season']:
            team = team2 if from_team_2 else team1
            try:
                value = team_stats[(team_stats.team == team)&(team_stats.Season == year)].iloc[[0]][feature].values[0]
            except:
                if from_team_2:
                    diff_year_2 = team
                else:
                    diff_year_1 = team
                value = team_stats[team_stats['team'] == team].iloc[[-1]][feature].values[0]
        elif feature == 'Season':
            value = year
        else:
            raise ValueError("Don't know where to get feature: " + feature)
        inputs.append(value)

    inputs = normalizer.transform(np.array(inputs).reshape((1, -1)))
    result = neural_result(inputs[0])

    results = ''
    if diff_year_1 != '':
        year_used = team_stats[team_stats['team'] == team1].iloc[[-1]]['Season'].values[0]
        results += "Couldn't find data from "+str(year)+" for team1 = "+diff_year_1+", used "+str(int(year_used))+" instead.\n"
    if diff_year_2 != '':
        year_used = team_stats[team_stats['team'] == team2].iloc[[-1]]['Season'].values[0]
        results += "Couldn't find data from "+str(year)+" for team1 = "+diff_year_2+", used "+str(int(year_used))+" instead.\n"
    
    if results:
        print results
    
    if result == 1:
        return team1
    elif result == 2:
        return team2
    else:
        return 'Unknown result: ' + str(result)

Some predictions about the past, compared to real results:
Even while we know those results and some of them we're used to train the classifier, that doesn't guarantee the real result is what the classifier will predict.

In [263]:
predict(2016, 'Kansas', 'Austin Peay') #Correct

'Kansas'

In [264]:
predict(2016, 'Kansas', 'Connecticut') #Correct

'Kansas'

In [265]:
predict(2016, 'Kansas', 'Maryland') #Correct

'Kansas'

In [266]:
predict(2016, 'Kansas', 'Villanova') #Wrong

'Kansas'

In [267]:
predict(2016, 'Villanova', 'North Carolina') #Wrong

'North Carolina'

In [268]:
predict(2016, 'Villanova', 'Oklahoma') #Correct

'Villanova'

In [269]:
# What about a huge upset?
predict(2016, 'SF Austin', 'West Virginia') #Wrong

'West Virginia'

In [270]:
# Another upset
predict(2016, 'Purdue', 'Ark Little Rock') #Wrong

'Purdue'

## Some predictions about the future:
Future prediction will not work and the predict method will use data from the most recent year where there is data available instead.

In [271]:
predict(2017, 'Hawaii', 'California')

Couldn't find data from 2017 for team1 = Hawaii, used 2016 instead.
Couldn't find data from 2017 for team1 = California, used 2016 instead.



'Hawaii'

In [272]:
predict(2017, 'North Carolina', 'California')

Couldn't find data from 2017 for team1 = North Carolina, used 2016 instead.
Couldn't find data from 2017 for team1 = California, used 2016 instead.



'North Carolina'