# Input/Output Filenames

`training_data_file` is the file containing all the games and features used to train the model. `submission_data_file` is the file containing all the games we need to predict, along with their features. `submission_output_file` is the name of the file we'll write our predictions to for submission to Kaggle.

In [1]:
training_data_file = 'data/data.csv'
submission_data_file = 'data/sample.csv'
submission_output_file = 'data/submission.csv'

In [2]:
import pandas as pd
import numpy as np

def write_submission_file(_model, _featureList): # see submission.ipynb for details
    # Load the games Kaggle wants predicted, along with their features
    sample = pd.read_csv(submission_data_file, index_col=0)
    Xsample = sample[_featureList].values
    # Column 0 of predict_proba belongs to the first entry of _model.classes_,
    # which is label 0 here, i.e. team0 wins
    sample['Pred'] = _model.predict_proba(Xsample)[:,0]
    submission = sample[['Id', 'Pred']]
    submission.to_csv(submission_output_file, encoding='ascii')
    
tourney_df = pd.read_csv(training_data_file, index_col=0)
tourney_df.head()

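| | Season | Daynum | Wteam | Wscore | Lteam | Lscore | Wloc | Numot | Wfgm | Wfga | ... | std_stl0 | std_stl1 | avg_blk0 | avg_blk1 | std_blk0 | std_blk1 | avg_pf0 | avg_pf1 | std_pf0 | std_pf1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003 | 134 | 1421 | 92 | 1411 | 84 | N | 1 | 32 | 69 | ... | 3.14 | 2.31 | 3.0 | 2.23 | 1.65 | 1.72 | 19.1 | 18.3 | 3.69 | 4.56 |
| 1 | 2003 | 136 | 1112 | 80 | 1436 | 51 | N | 0 | 31 | 66 | ... | 3.5 | 3.65 | 2.97 | 4.21 | 1.72 | 2.01 | 15.9 | 17.75 | 4.05 | 2.99 |
| 2 | 2003 | 136 | 1113 | 84 | 1272 | 71 | N | 0 | 31 | 59 | ... | 3.16 | 2.14 | 5.07 | 4.24 | 3.16 | 2.94 | 18.76 | 19.41 | 4.34 | 3.25 |
| 3 | 2003 | 136 | 1141 | 79 | 1166 | 73 | N | 0 | 29 | 53 | ... | 2.94 | 2.82 | 4.0 | 4.45 | 2.35 | 2.22 | 20.97 | 17.27 | 4.81 | 3.12 |
| 4 | 2003 | 136 | 1143 | 76 | 1301 | 74 | N | 1 | 27 | 64 | ... | 3.63 | 2.64 | 3.07 | 2.79 | 2.49 | 1.63 | 18.67 | 17.1 | 4.21 | 3.74 |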


# Choosing Features to Use for Model

Edit `featureList` below to choose which columns of the data frame above are used as features (the column names are shown in the table above).

In [12]:
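y = tourney_df['Winner'].values # results
featureList = list(tourney_df.loc[:,'wins0':].columns) # every column from wins0 onward; edit to pick a subset
X = tourney_df[featureList].values # features (.ix was removed from pandas, so select by name instead)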
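To use a hand-picked subset of columns rather than everything from `wins0` onward, something like the following sketch might help. The toy data frame stands in for `tourney_df`: its numbers and `Winner` values are made up, and the choice of columns is only an illustration, not a recommendation.

```python
import pandas as pd

# Toy stand-in for tourney_df; values are invented, column names follow the table above.
toy_df = pd.DataFrame({
    'Winner':   [0, 1, 0],
    'avg_blk0': [3.0, 2.97, 5.07],
    'avg_blk1': [2.23, 4.21, 4.24],
    'avg_pf0':  [19.1, 15.9, 18.76],
    'avg_pf1':  [18.3, 17.75, 19.41],
})

# Hypothetical hand-picked feature subset (illustration only).
toy_features = ['avg_blk0', 'avg_blk1']

# Fail early if a name is misspelled or missing from the data frame.
missing = [name for name in toy_features if name not in toy_df.columns]
if missing:
    raise KeyError("Columns not found: {0}".format(missing))

X_toy = toy_df[toy_features].values  # shape (n_games, n_features)
y_toy = toy_df['Winner'].values      # 0 if team0 won, 1 if team1 won
```

The same pattern applies to the real data by replacing `toy_df` with `tourney_df` and listing the columns you want.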

# Train our model!

In [13]:
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

In [14]:
from sklearn import svm
model = svm.SVC(probability=True)
model.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [15]:
y_pred = model.predict_proba(X_test)[:,0] # probability that team0 wins (what Kaggle calls team 1, and wants for submission)
for i in range(5):
    print("Features = {0}, Pred. prob. team0 wins: {1}".format(X_test[i], y_pred[i]))

Features = [ 30.    26.    26.24  24.21   4.68   5.02  56.    52.88   6.02   5.88
   5.29   7.76   2.3    2.29  14.85  20.79   3.37   3.7   15.29  17.36
   4.56   6.71  23.35  23.33   6.98   8.07  12.59   8.79   3.66   2.93
  26.29  24.39   4.78   4.44  15.09  12.12   3.63   3.2   10.44   8.06
   3.23   2.98   5.88   5.     1.9    2.3    4.24   3.52   2.02   1.99
  16.44  14.94   3.14   3.32], Pred. prob. team0 wins: 0.543264188895785
Features = [ 25.    33.    26.33  28.76   5.68   5.3   54.88  61.76   7.95   7.72
   4.97   7.94   3.09   3.52  13.33  22.68   4.23   6.16  13.67  14.21
   6.38   6.35  18.7   23.85   8.16   9.63  12.39  14.12   4.72   4.66
  24.82  26.68   4.41   4.98  17.64  16.03   6.05   5.78  13.67  12.21
   4.56   4.15   6.09   8.71   3.18   3.42   4.36   6.09   2.52   2.93
  18.18  17.44   4.56   4.19], Pred. prob. team0 wins: 0.4169851685050697
Features = [ 32.    27.    28.15  29.12   6.21   5.49  57.61  61.85   7.68   8.15
   8.7    6.85   3.28   2.83  21.79  17
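Indexing `predict_proba` with `[:,0]` relies on label 0 being the first entry of `model.classes_`. A slightly more defensive sketch looks the column up by label instead. The helper name `team0_win_probability` is hypothetical, and it assumes label 0 means "team0 wins", as in this notebook.

```python
import numpy as np

def team0_win_probability(model, X):
    """Return P(label 0) for each row of X, finding the column via model.classes_."""
    classes = list(model.classes_)
    col = classes.index(0)  # assumes label 0 means "team0 wins"
    return np.asarray(model.predict_proba(X))[:, col]
```

With the model trained above, `team0_win_probability(model, X_test)` should give the same values as `model.predict_proba(X_test)[:,0]` as long as the labels are 0 and 1.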

# Could you double-check I'm calculating this correctly?

Just realized Kaggle's scoring convention is that the label is 1 if the first team wins and 0 if the second team wins. We use the opposite: 0 if team0 wins, 1 if team1 wins. I think the call below works: it passes `y_pred`, the probability that the first team wins (what we call team0, what they call team 1), and `1-y_test`, so that 1-y = 1 when team0 (their team 1) wins and 1-y = 0 when team1 (their team 2) wins. Please compare http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html and https://www.kaggle.com/c/march-machine-learning-mania-2016/details/evaluation to make sure we're doing the same thing. It's late and I'm not thinking straight.

In [16]:
from sklearn import metrics
print("Log loss is {0}".format(metrics.log_loss(1-y_test, y_pred)))

Log loss is 0.6263580061551864
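One way to double-check the label flip is to write out the binary log loss by hand, LogLoss = -(1/n) Σ [y_i log(p_i) + (1 - y_i) log(1 - p_i)], which is how I read the formula on the Kaggle evaluation page, with y_i = 1 when the first team wins. The sketch below does that in plain Python on made-up numbers. The function name `kaggle_log_loss` and the clipping value `eps` are assumptions for illustration, not something taken from Kaggle or scikit-learn.

```python
import math

def kaggle_log_loss(y_true, y_prob, eps=1e-15):
    """Binary log loss; y_true is 1 when the first team wins, y_prob is P(first team wins)."""
    total = 0.0
    n = 0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs (eps is an assumed value)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
        n += 1
    return -total / n

# Toy check with invented numbers in this notebook's convention (0 = team0 wins).
y_test_toy = [0, 1, 0, 0]
y_pred_toy = [0.9, 0.2, 0.6, 0.7]            # P(team0 wins)
kaggle_labels = [1 - y for y in y_test_toy]  # flip to Kaggle's convention (1 = first team wins)
print(kaggle_log_loss(kaggle_labels, y_pred_toy))
```

On the real data, `kaggle_log_loss(1 - y_test, y_pred)` should agree with `metrics.log_loss(1 - y_test, y_pred)` up to small clipping differences, since scikit-learn treats a one-dimensional `y_pred` as the probability of the positive class (label 1).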


# Making a submission on Kaggle

The following call generates the file `data/submission.csv`, using the model you pass it to produce probabilities for all the matches Kaggle wants us to predict. It will overwrite any previous file. See `submission.ipynb` for a walk-through of the function. You have to pass it the model you trained and the `featureList` you used.

In [8]:
write_submission_file(model, featureList)
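Before uploading, it may be worth sanity-checking the file that was written. The helper below is a hypothetical sketch, not part of `submission.ipynb`. It only assumes the file has `Id` and `Pred` columns, as `write_submission_file` produces, and that every prediction should be a probability between 0 and 1.

```python
import csv

def check_submission(path):
    """Return the number of rows after verifying 'Id' and 'Pred' columns (hypothetical helper)."""
    with open(path, newline='') as f:
        rows = list(csv.DictReader(f))
    if not rows:
        raise ValueError("submission file has no rows")
    for row in rows:
        if 'Id' not in row or 'Pred' not in row:
            raise ValueError("expected 'Id' and 'Pred' columns")
        pred = float(row['Pred'])
        if not 0.0 <= pred <= 1.0:
            raise ValueError("Pred out of range for Id {0}: {1}".format(row['Id'], pred))
    return len(rows)
```

Calling `check_submission(submission_output_file)` after the cell above should return the number of predicted games, or raise if something looks off.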