### Classification - Decision Tree

In [1]:
# Read data from file
import pandas as pd
df=pd.read_csv('baseball.csv').dropna(subset=['OpponentOnBasePercentage', 'OpponentSluggingPercentage'])
df.tail()

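| Unnamed: 0 | Team | League | Year | RunsScored | RunsAllowed | Wins | OnBasePercentage | SluggingPercentage | BattingAverage | Playoffs | RankSeason | RankPlayoffs | GamesPlayed | OpponentOnBasePercentage | OpponentSluggingPercentage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 415 | SFG | NL | 1999 | 872 | 831 | 86 | 0.356 | 0.434 | 0.271 | 0 | NaN | NaN | 162 | 0.345 | 0.423 |
| 416 | STL | NL | 1999 | 809 | 838 | 75 | 0.338 | 0.426 | 0.262 | 0 | NaN | NaN | 161 | 0.355 | 0.427 |
| 417 | TBD | AL | 1999 | 772 | 913 | 69 | 0.343 | 0.411 | 0.274 | 0 | NaN | NaN | 162 | 0.371 | 0.448 |
| 418 | TEX | AL | 1999 | 945 | 859 | 95 | 0.361 | 0.479 | 0.293 | 1 | 5.0 | 4.0 | 162 | 0.346 | 0.459 |
| 419 | TOR | AL | 1999 | 883 | 862 | 84 | 0.352 | 0.457 | 0.28 | 0 | NaN | NaN | 162 | 0.353 | 0.456 |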


In [2]:
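from sklearn.model_selection import train_test_split

# Features: the team's batting stats plus the opponents' on-base and slugging percentages
X = df[['OnBasePercentage', 'SluggingPercentage', 'BattingAverage', 'OpponentOnBasePercentage', 'OpponentSluggingPercentage']]
y = df['Playoffs']  # target column
# No random_state is passed, so the split differs from run to run
X_train, X_test, y_train, y_test = train_test_split(X, y)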
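
A reproducible variant of the split can be sketched as follows. This is a minimal illustration on synthetic stand-in data, not the baseball data; the names `X_demo`, `y_demo`, and the `split_*` variables are hypothetical, and `stratify` is an optional extra that the cell above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 200 rows, 3 features, 20% positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([1] * 40 + [0] * 160)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
split_a = train_test_split(X_demo, y_demo, random_state=0, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, random_state=0, stratify=y_demo)

X_tr, X_te, y_tr, y_te = split_a
same_split = np.array_equal(X_te, split_b[1])  # identical test rows on both calls
print(same_split)
print(y_tr.mean(), y_te.mean())  # class ratio in the train and test parts
```

Passing the same `random_state` to both calls yields the same rows in each part, which is what makes downstream results comparable between runs.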

###### Using `X_train` and `y_train` from the preceding cell, train a `DecisionTreeClassifier` with default parameters. What are the 5 most important features found by the decision tree?

The feature names are available in the `X_train.columns` property, and the order of the features in `X_train.columns` matches the order of the feature importance values in the classifier's `feature_importances_` property. 

*Print feature names in descending order of importance.*

*Note: Set `random_state` in the `DecisionTreeClassifier` so that the fitted tree is reproducible.*

In [3]:
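from sklearn.tree import DecisionTreeClassifier

# random_state=0 makes the fitted tree reproducible for a given training split
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Pair each importance with its feature name, then sort from most to least important
fi = pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
list(fi.index[:5])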

['OpponentOnBasePercentage',
 'OnBasePercentage',
 'SluggingPercentage',
 'BattingAverage',
 'OpponentSluggingPercentage']
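
The ranking step, and the effect of fixing `random_state`, can be sketched on synthetic data. This is an illustration under stated assumptions: the data come from `make_classification`, and the names `X_syn`, `y_syn`, `ranked_features`, and the `f0`..`f4` column labels are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): 5 features, 3 of them informative
X_arr, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_syn = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])


def ranked_features(seed):
    """Fit a tree with the given seed and return importances, largest first."""
    tree = DecisionTreeClassifier(random_state=seed).fit(X_syn, y_syn)
    importances = pd.Series(tree.feature_importances_, index=X_syn.columns)
    return importances.sort_values(ascending=False)


first = ranked_features(0)
second = ranked_features(0)
print(list(first.index))     # feature names in descending order of importance
print(first.equals(second))  # same seed and same data give the same ranking
```

Indexing the importances by column name before sorting keeps each value attached to its feature, so the sorted index is the answer in the requested order.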

###### Support Vector Classifier
We're going to use the `validation_curve` function in `sklearn.model_selection` to determine training and test scores for a Support Vector Classifier (`SVC`) with varying parameter values. Recall that `validation_curve` takes an initialized, unfitted classifier object and a dataset as input, and does its own internal train-test splits to compute results.

The classifier we'll be using is a Support Vector Classifier with a radial basis function (RBF) kernel. So the first step is to create an `SVC` object with default parameters (i.e. `kernel='rbf', C=1`) and `random_state=0`. Recall that the width of the RBF kernel is controlled by the `gamma` parameter.

With this classifier and the dataset in `X`, `y`, explore the effect of `gamma` on classifier accuracy by using `validation_curve` to find the training and test scores for 6 values of `gamma` from `0.0001` to `10` (i.e. `np.logspace(-4,1,6)`). Recall that you can specify the scoring metric `validation_curve` uses by setting the `scoring` parameter. In this case, we want `'accuracy'`.

For each level of `gamma`, `validation_curve` will fit 3 models on different subsets of the data (one per fold when `cv=3`), returning two 6x3 (6 levels of gamma x 3 fits per level) arrays of the scores for the training and test sets.
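
The shape of the returned arrays can be checked with a small sketch. It assumes synthetic data from `make_classification` rather than the baseball data, and the names `X_syn`, `y_syn`, `gammas`, `demo_train`, and `demo_test` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical): 120 rows, 4 features
X_syn, y_syn = make_classification(n_samples=120, n_features=4, random_state=0)
gammas = np.logspace(-4, 1, 6)  # 0.0001, 0.001, ..., 10

# cv=3 means three train/validation folds are scored for every gamma value
demo_train, demo_test = validation_curve(
    SVC(kernel="rbf", C=1, random_state=0),
    X_syn,
    y_syn,
    param_name="gamma",
    param_range=gammas,
    cv=3,
    scoring="accuracy",
)
print(demo_train.shape, demo_test.shape)  # (6, 3) (6, 3)
```

Each row corresponds to one `gamma` value and each column to one fold, which is why the averaging in the next step runs along `axis=1`.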

Find the mean score across the three models for each level of `gamma` for both arrays, creating two arrays of length 6, and return a tuple with the two arrays.

e.g.

if one of your arrays of scores is

    array([[ 0.5,  0.4,  0.6],
           [ 0.7,  0.8,  0.7],
           [ 0.9,  0.8,  0.8],
           [ 0.8,  0.7,  0.8],
           [ 0.7,  0.6,  0.6],
           [ 0.4,  0.6,  0.5]])
       
it should then become

    array([ 0.5,  0.73333333,  0.83333333,  0.76666667,  0.63333333, 0.5])
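
The averaging in this example can be reproduced directly with NumPy. The sketch below uses the illustrative array shown above; the variable names `scores` and `mean_per_gamma` are hypothetical.

```python
import numpy as np

# The example scores from above: one row per gamma value, one column per fold
scores = np.array([[0.5, 0.4, 0.6],
                   [0.7, 0.8, 0.7],
                   [0.9, 0.8, 0.8],
                   [0.8, 0.7, 0.8],
                   [0.7, 0.6, 0.6],
                   [0.4, 0.6, 0.5]])

# axis=1 averages across the columns (folds), leaving one value per row (gamma)
mean_per_gamma = scores.mean(axis=1)
print(mean_per_gamma.shape)  # (6,)
print(mean_per_gamma.round(4))
```

Using `axis=0` instead would average over the `gamma` values and return 3 numbers, one per fold, which is not what the exercise asks for.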

*This function should return one tuple of numpy arrays `(training_scores, test_scores)`*

In [5]:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

param_range = np.logspace(-4,1,6)

svc = SVC(kernel='rbf', C=1, random_state=0)
train_scores, test_scores = validation_curve(svc, X, y,
                                            param_name='gamma',
                                            param_range=param_range, cv=3,
                                            scoring='accuracy')
np.mean(train_scores,axis=1), np.mean(test_scores,axis=1)

(array([0.72857143, 0.72857143, 0.72857143, 0.72857143, 0.72857143,
        0.75      ]),
 array([0.72857143, 0.72857143, 0.72857143, 0.72857143, 0.72857143,
        0.74285714]))

###### Based on the scores, what gamma value corresponds to a model that is underfitting (and has the worst test set accuracy)? What gamma value corresponds to a model that is overfitting (and has the worst test set accuracy)? What choice of gamma would be the best choice for a model with good generalization performance on this dataset (high accuracy on both training and test set)? 

Hint: Try plotting the scores from above to visualize the relationship between gamma and accuracy.
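
As a numeric alternative to plotting, the mean scores can be tabulated next to their `gamma` values. This is a sketch that copies the mean scores printed in the output above; the names `gammas`, `train_mean`, `test_mean`, and `best_gamma` are hypothetical.

```python
import numpy as np

gammas = np.logspace(-4, 1, 6)
# Mean scores copied from the validation_curve output above
train_mean = np.array([0.72857143] * 5 + [0.75])
test_mean = np.array([0.72857143] * 5 + [0.74285714])

# One row per gamma: a large train/test gap would point to overfitting,
# low scores on both sets would point to underfitting
for g, tr, te in zip(gammas, train_mean, test_mean):
    print(f"gamma={g:g}  train={tr:.4f}  test={te:.4f}  gap={tr - te:.4f}")

# The gamma with the highest mean test accuracy
best_gamma = gammas[np.argmax(test_mean)]
print(best_gamma)
```

Choosing by the highest mean test score is only one possible selection rule; with scores this flat, the differences between `gamma` values are small.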

*This function should return one tuple with the gamma values in this order: `(Underfitting, Overfitting, Good_Generalization)`*

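In [None]:
# From the means above: scores are flat (0.7286) for gamma <= 1 and peak at gamma = 10,
# so no gamma in this range clearly overfits.
0.0001, None, 10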