In [1]:
import pandas as pd
import numpy as np
   
tourney_df = pd.read_csv("data/data.csv", index_col=0)
tourney_df.head()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,Wfgm,Wfga,...,std_stl0,std_stl1,avg_blk0,avg_blk1,std_blk0,std_blk1,avg_pf0,avg_pf1,std_pf0,std_pf1
0,2003,134,1421,92,1411,84,N,1,32,69,...,2.31,3.14,2.23,3.0,1.72,1.65,18.3,19.1,4.56,3.69
1,2003,136,1112,80,1436,51,N,0,31,66,...,3.65,3.5,4.21,2.97,2.01,1.72,17.75,15.9,2.99,4.05
2,2003,136,1113,84,1272,71,N,0,31,59,...,2.14,3.16,4.24,5.07,2.94,3.16,19.41,18.76,3.25,4.34
3,2003,136,1141,79,1166,73,N,0,29,53,...,2.82,2.94,4.45,4.0,2.22,2.35,17.27,20.97,3.12,4.81
4,2003,136,1143,76,1301,74,N,1,27,64,...,2.64,3.63,2.79,3.07,1.63,2.49,17.1,18.67,3.74,4.21


# Choosing Features to Use for Model

Edit `featureList` below to use any of the feature columns in the data frame above (the column names are listed in the header row of the output).

In [2]:
featureList = ['wins0', 'wins1']

Making the target vector from the `team0Win` column, and the feature matrix from the selected features:

In [3]:
y = tourney_df['team0Win'].values # results
X = tourney_df[featureList].values # features

Splitting off 20% of the data into a test set, so that the final evaluation uses games the model never saw during training:

In [4]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

Making a scaler that computes mean and SD in training data, and can then be applied to any matrix of features (i.e. one of the X matrices) to do feature scaling:

In [5]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X_train)
print("Means = {0}, Stdevs = {1}".format(scaler.mean_, scaler.scale_))

Means = [ 23.83899557  24.01033973], Stdevs = [ 3.76162904  3.67130427]




Applying the same scaler to the training and testing data:

In [6]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
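
To make the scaling step concrete, here is a minimal NumPy sketch of what standardization does. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses. The `*_demo` arrays are made-up values for illustration, not the tournament data.

```python
import numpy as np

# Toy feature matrices (hypothetical values, not from data/data.csv)
X_train_demo = np.array([[20.0, 25.0], [24.0, 23.0], [28.0, 24.0]])
X_test_demo = np.array([[24.0, 26.0]])

# The statistics come from the training rows only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # ddof=0, i.e. population standard deviation

# The same mean and std are applied to both matrices
X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1. The test rows are shifted and scaled by the training statistics, so they generally do not have exactly those values, which is expected.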



# Train our model!

Here we use a grid search to find the best values of `C` and `gamma` for the SVM. Each candidate is scored by cross-validation over 5 stratified random splits (`StratifiedShuffleSplit`), which gives a less noisy estimate than a single validation split:

In [7]:
from sklearn import svm
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

C_range = np.logspace(0, 3, 4)       # 1, 10, 100, 1000
gamma_range = np.logspace(-3, 0, 4)  # 0.001, 0.01, 0.1, 1
param_grid = dict(gamma=gamma_range, C=C_range)
# 5 stratified random splits, each holding out 20% of the training data
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
# "neg_log_loss" is the negated log loss, so higher (closer to 0) is better
grid = GridSearchCV(svm.SVC(probability=True, kernel='rbf'), scoring="neg_log_loss", param_grid=param_grid, cv=cv, n_jobs=-1)
grid.fit(X_train, y_train)
print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))

The best parameters are {'C': 10.0, 'gamma': 0.001} with a score of -0.61
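
As a rough sketch of how much work this search does, the snippet below enumerates the same `C` and `gamma` grids and counts the model fits. The names `candidates`, `n_splits`, and `n_fits` are introduced here for illustration only. Note that the reported score is the negated log loss, so -0.61 corresponds to a cross-validated log loss of about 0.61.

```python
import itertools

import numpy as np

C_range = np.logspace(0, 3, 4)       # 1, 10, 100, 1000
gamma_range = np.logspace(-3, 0, 4)  # 0.001, 0.01, 0.1, 1

# Every (C, gamma) pair the grid search evaluates
candidates = list(itertools.product(C_range, gamma_range))

n_splits = 5  # number of cross-validation splits used above
n_fits = len(candidates) * n_splits  # fits during the search, before the final refit

print(len(candidates), n_fits)  # 16 80
```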


We select the classifier with the parameters that did best above, apply it to the test data, and calculate the log loss:

In [8]:
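from sklearn import metrics

model = grid.best_estimator_  # SVC refit on all of X_train with the best parameters
# predict_proba returns shape (n_games, 2): column 0 is P(team0 loses), column 1 is P(team0 wins).
# Column 1 is what Kaggle wants for submission (Kaggle calls team0 "team 1").
y_pred = model.predict_proba(X_test)
print("Log loss is {0}".format(metrics.log_loss(y_test, y_pred)))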

Log loss is 0.621202452216
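
For context on this number, here is a minimal sketch of the binary log loss computed by hand. The helper `binary_log_loss` and the `y_demo` outcomes are hypothetical and written for this example; they are not part of the notebook. It shows that always predicting 0.5 gives a log loss of ln(2), about 0.693, so the 0.62 above is a modest improvement over that uninformed baseline.

```python
import numpy as np

def binary_log_loss(y_true, p_win):
    """Mean negative log-likelihood of 0/1 outcomes under predicted win probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that a prediction of exactly 0 or 1 does not produce log(0)
    p_win = np.clip(np.asarray(p_win, dtype=float), 1e-15, 1 - 1e-15)
    return float(-np.mean(y_true * np.log(p_win) + (1 - y_true) * np.log(1 - p_win)))

y_demo = [1, 0, 1, 1]  # made-up outcomes, for illustration only

baseline = binary_log_loss(y_demo, [0.5] * len(y_demo))  # equals ln(2) for any outcomes
confident = binary_log_loss(y_demo, [0.9, 0.1, 0.9, 0.9])  # correct and confident predictions

print(baseline, confident)
```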


# Writing the Kaggle Submission File

`SampleSubmissionWithFeatures.csv` is generated by `GenerateFeatures.ipynb` and has all the games we need to predict, with all the features we have generated appended to it:

In [9]:
sample_df = pd.read_csv('data/SampleSubmissionWithFeatures.csv', index_col=0)
sample_df.head()

Unnamed: 0,Id,Pred,Season,team0,team1,wins0,wins1,avg_fgm0,avg_fgm1,std_fgm0,...,std_stl0,std_stl1,avg_blk0,avg_blk1,std_blk0,std_blk1,avg_pf0,avg_pf1,std_pf0,std_pf1
0,2012_1104_1124,0.5,2012,1104,1124,21,26,23.62,26.09,3.63,...,3.51,2.89,4.16,4.64,2.68,2.98,18.41,17.64,4.01,4.3
1,2012_1104_1125,0.5,2012,1104,1125,21,26,23.62,27.91,3.63,...,3.51,2.83,4.16,2.24,2.68,1.68,18.41,18.27,4.01,4.01
2,2012_1104_1140,0.5,2012,1104,1140,21,23,23.62,27.97,3.63,...,3.51,3.44,4.16,4.19,2.68,1.72,18.41,18.45,4.01,4.44
3,2012_1104_1143,0.5,2012,1104,1143,21,24,23.62,26.42,3.63,...,3.51,1.67,4.16,3.15,2.68,1.39,18.41,14.39,4.01,3.69
4,2012_1104_1153,0.5,2012,1104,1153,21,24,23.62,25.06,3.63,...,3.51,3.01,4.16,4.59,2.68,2.74,18.41,15.15,4.01,3.21


The function below reads `SampleSubmissionWithFeatures.csv` and builds a matrix `Xsample` containing only the features used to train the SVM (which you pass as `featureList`). It applies the passed scaler object (pass the same one you fitted above), then uses the passed model to estimate the probability that the first team (team0) wins, which is what Kaggle wants. Finally, it writes the file you need to submit to Kaggle to `submission_output_file`:

In [10]:
def write_submission_file(model, featureList, scaler, submission_output_file): # see submission.ipynb for details
    import pandas as pd
    sample_df = pd.read_csv('data/SampleSubmissionWithFeatures.csv', index_col=0)
    Xsample = sample_df[featureList].values # same feature columns, in the same order, as in training
    Xsample = scaler.transform(Xsample) # reuse the scaler fitted on the training data
    # predict_proba returns [prob label is 0, prob label is 1]; Kaggle wants the 2nd column (team0 wins)
    sample_df['Pred'] = model.predict_proba(Xsample)[:, 1]
    submission = sample_df[['Id', 'Pred']]
    submission.to_csv(submission_output_file, encoding='ascii', index=False)

Here we call the function above, passing it our best model, the same `featureList` and scaler we used in training, and the output filename `data/submission.csv`. You would then submit this file to Kaggle.

In [13]:
submission_output_file = "data/submission.csv"
write_submission_file(model, featureList, scaler, submission_output_file)



In [14]:
submission_df = pd.read_csv(submission_output_file, index_col=0)
submission_df.head()

Id,Pred
2012_1104_1124,0.296864
2012_1104_1125,0.296864
2012_1104_1140,0.427592
2012_1104_1143,0.382751
2012_1104_1153,0.382751
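
Before uploading, it can help to sanity-check the file. The sketch below is one possible check, not part of the original workflow: `check_submission` is a hypothetical helper, and the `season_team0_team1` Id layout is inferred from the sample rows shown above. It uses an in-memory CSV string so it runs without the real file.

```python
import csv
import io

def check_submission(rows):
    """Return a list of problems found in (Id, Pred) rows; an empty list means none were found."""
    problems = []
    for row in rows:
        # Ids are assumed to look like "2012_1104_1124" (season, team0, team1)
        parts = row["Id"].split("_")
        if len(parts) != 3 or not all(p.isdigit() for p in parts):
            problems.append("bad Id: {0}".format(row["Id"]))
        try:
            pred = float(row["Pred"])
        except ValueError:
            problems.append("non-numeric Pred for {0}".format(row["Id"]))
            continue
        # Predictions are probabilities, so they must lie in [0, 1]
        if not 0.0 <= pred <= 1.0:
            problems.append("Pred out of range for {0}".format(row["Id"]))
    return problems

# Two rows copied from the output above, used here as demo input
demo_csv = "Id,Pred\n2012_1104_1124,0.296864\n2012_1104_1125,0.296864\n"
rows = list(csv.DictReader(io.StringIO(demo_csv)))
print(check_submission(rows))  # []
```

To check the real file, the same function could be fed rows read from `data/submission.csv` with `csv.DictReader`.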
