# Support Vector Classifier

#### Predict whether a team will make the playoffs based on its statistics and its opponents' statistics

We're going to use the `validation_curve` function in `sklearn.model_selection` to determine training and test scores for a Support Vector Classifier (`SVC`) with varying parameter values. Recall that `validation_curve`, in addition to taking an initialized, unfitted classifier object, takes a dataset as input and does its own internal train-test splits to compute results.
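To make the internal splitting concrete, here is a minimal sketch on synthetic data (not the baseball dataset) that compares `validation_curve` with a hand-written loop. The names `X_demo`, `y_demo`, `gammas` and `test_manual` are illustrative assumptions. The sketch relies on scikit-learn's documented behavior that an integer `cv` with a classifier means unshuffled stratified k-fold splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, validation_curve
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
gammas = np.logspace(-2, 1, 4)

# Scores as reported by validation_curve
train_vc, test_vc = validation_curve(
    SVC(kernel='rbf', C=1), X_demo, y_demo,
    param_name='gamma', param_range=gammas, cv=3, scoring='accuracy')

# Roughly the same computation by hand: one fit per (gamma, fold) pair
folds = StratifiedKFold(n_splits=3)
test_manual = np.zeros((len(gammas), 3))
for i, gamma in enumerate(gammas):
    for j, (train_idx, test_idx) in enumerate(folds.split(X_demo, y_demo)):
        model = SVC(kernel='rbf', C=1, gamma=gamma)
        model.fit(X_demo[train_idx], y_demo[train_idx])
        # SVC.score returns accuracy on the held-out fold
        test_manual[i, j] = model.score(X_demo[test_idx], y_demo[test_idx])

print(test_vc.shape)  # one row per gamma value, one column per fold
print(np.allclose(test_vc, test_manual))
```

Each row of the returned arrays corresponds to one parameter value and each column to one fold, which is why the scores are later averaged along `axis=1`.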

The initialized, unfitted classifier object we'll be using is a Support Vector Classifier with a radial basis function (RBF) kernel. So the first step is to create an `SVC` object with default parameters (i.e. `kernel='rbf', C=1`) and `random_state=0`. The width of the RBF kernel is controlled by the `gamma` parameter: larger values of `gamma` give a narrower kernel.
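For reference, the RBF kernel scores the similarity of two points as exp(-gamma * ||a - b||^2), so similarity drops off faster with distance as `gamma` grows. The sketch below computes this by hand on two made-up points and compares it with scikit-learn's `rbf_kernel`. The helper name `rbf_by_hand` and the sample points are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_by_hand(a, b, gamma):
    """RBF similarity exp(-gamma * ||a - b||^2) between two 1-D points."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Two made-up points, one unit apart
a = np.array([0.0, 0.0])
b = np.array([1.0, 0.0])

for gamma in (0.1, 1.0, 10.0):
    by_hand = rbf_by_hand(a, b, gamma)
    # rbf_kernel expects 2-D arrays of shape (n_samples, n_features)
    from_sklearn = rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=gamma)[0, 0]
    print(f"gamma={gamma}: by hand={by_hand:.6f}, sklearn={from_sklearn:.6f}")
```

With a very large `gamma`, only near-identical points count as similar, which tends to let the classifier memorize the training set. With a very small `gamma`, distant points still look similar and the decision boundary becomes very smooth.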

With this classifier and the dataset in `X`, `y`, explore the effect of `gamma` on classifier accuracy by using the `validation_curve` function to find the training and test scores for six values of `gamma` (i.e. `np.logspace(0,5,6)`). We can choose the scoring metric that `validation_curve` uses by setting the `scoring` parameter. In this case, we use `'accuracy'`.

For each level of `gamma`, `validation_curve` will fit 5 models on different subsets of the data (`cv=5`), returning two 6x5 (6 levels of gamma x 5 fits per level) arrays of the scores for the training and test sets.

Find the mean score across the five models for each level of `gamma` for both arrays.
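The averaging step can be sketched on a small made-up array. The values and the shape are illustrative assumptions, with rows standing for parameter values and columns for folds.

```python
import numpy as np

# Made-up scores: 3 parameter values (rows) x 5 folds (columns)
scores = np.array([
    [0.70, 0.72, 0.74, 0.71, 0.73],
    [0.80, 0.82, 0.84, 0.81, 0.83],
    [0.75, 0.77, 0.79, 0.76, 0.78],
])

# axis=1 averages over the folds, leaving one mean score per parameter value
mean_per_param = scores.mean(axis=1)
print(mean_per_param.shape)  # (3,)
print(mean_per_param)
```

Averaging along `axis=0` instead would give one mean per fold across all parameter values, which is not the quantity wanted here.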

In [1]:
# Read data from file
import pandas as pd
df=pd.read_csv('baseball.csv').dropna(subset=['OpponentOnBasePercentage', 'OpponentSluggingPercentage'])
df.tail()

| | Team | League | Year | RunsScored | RunsAllowed | Wins | OnBasePercentage | SluggingPercentage | BattingAverage | Playoffs | RankSeason | RankPlayoffs | GamesPlayed | OpponentOnBasePercentage | OpponentSluggingPercentage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 415 | SFG | NL | 1999 | 872 | 831 | 86 | 0.356 | 0.434 | 0.271 | 0 | | | 162 | 0.345 | 0.423 |
| 416 | STL | NL | 1999 | 809 | 838 | 75 | 0.338 | 0.426 | 0.262 | 0 | | | 161 | 0.355 | 0.427 |
| 417 | TBD | AL | 1999 | 772 | 913 | 69 | 0.343 | 0.411 | 0.274 | 0 | | | 162 | 0.371 | 0.448 |
| 418 | TEX | AL | 1999 | 945 | 859 | 95 | 0.361 | 0.479 | 0.293 | 1 | 5.0 | 4.0 | 162 | 0.346 | 0.459 |
| 419 | TOR | AL | 1999 | 883 | 862 | 84 | 0.352 | 0.457 | 0.28 | 0 | | | 162 | 0.353 | 0.456 |


In [2]:
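# validation_curve does its own splitting, so train_test_split is not needed here
# Features: the team's batting statistics and its opponents' statistics
X = df[['OnBasePercentage', 'SluggingPercentage', 'BattingAverage',
        'OpponentOnBasePercentage', 'OpponentSluggingPercentage']]
# Target: the Playoffs column (1 = made the playoffs, 0 = did not)
y = df['Playoffs']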

In [3]:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

param_range=np.logspace(0,5,6)
svc = SVC(kernel='rbf', C=1, random_state=0)
train_scores, test_scores = validation_curve(svc, X, y,
                                            param_name='gamma',
                                            param_range=param_range, 
                                            cv=5,
                                            scoring='accuracy')

# this part is for printing out the results only
svc_scores = pd.DataFrame({'gamma': param_range,
                           'train_scores': np.mean(train_scores, axis=1),
                           'test_scores': np.mean(test_scores, axis=1)})

# Rough labels: lowest training score, highest training score, highest test score.
# Assigning through .loc avoids chained indexing and its SettingWithCopyWarning.
svc_scores['Conclusion'] = ''
svc_scores.loc[svc_scores.train_scores == svc_scores.train_scores.min(), 'Conclusion'] = 'Underfitting'
svc_scores.loc[svc_scores.train_scores == svc_scores.train_scores.max(), 'Conclusion'] = 'Overfitting'
svc_scores.loc[svc_scores.test_scores == svc_scores.test_scores.max(), 'Conclusion'] = 'GOOD MODEL'
svc_scores

| | gamma | train_scores | test_scores | Conclusion |
|---|---|---|---|---|
| 0 | 1.0 | 0.728572 | 0.728585 | Underfitting |
| 1 | 10.0 | 0.764308 | 0.752423 | |
| 2 | 100.0 | 0.845243 | 0.828563 | |
| 3 | 1000.0 | 0.883928 | 0.83341 | GOOD MODEL |
| 4 | 10000.0 | 0.979167 | 0.771361 | |
| 5 | 100000.0 | 0.998216 | 0.728585 | Overfitting |
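
One way to read the table is to pick the `gamma` with the highest mean test score and to look at the gap between training and test scores. The sketch below does this with the mean scores copied from the table above. The variable names `train_mean`, `test_mean`, `best` and `gap` are illustrative assumptions.

```python
import numpy as np

# Mean scores copied from the results table above
gammas = np.logspace(0, 5, 6)
train_mean = np.array([0.728572, 0.764308, 0.845243, 0.883928, 0.979167, 0.998216])
test_mean = np.array([0.728585, 0.752423, 0.828563, 0.833410, 0.771361, 0.728585])

# Index of the gamma with the best mean test accuracy
best = int(np.argmax(test_mean))
print(f"best gamma: {gammas[best]:g} (mean test accuracy {test_mean[best]:.4f})")

# A growing train-test gap is a common sign of overfitting
gap = train_mean - test_mean
for g, d in zip(gammas, gap):
    print(f"gamma={g:g}: train - test = {d:.4f}")
```

In this table the gap is close to zero at the smallest `gamma` and grows steadily beyond `gamma = 1000`, while the test score falls back to its starting value at the largest `gamma`.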
