# Scikit-Learn - Multiclass Classification

#### by Alex Ahn

We can use the many classification models in the scikit-learn package not only for binary classification but also for multiclass classification (more than two possible label/target values).

The following exercise tries out different classification models on a dataset of pitches thrown by Clayton Kershaw from 2014 to 2017.

You can choose different parameters for each model to improve the fit, and evaluate the results using different metrics (e.g. precision and recall scores).

We will further explore model selection, using GridSearchCV to search for optimal parameters.
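
As a rough sketch of what a GridSearchCV parameter search can look like: the synthetic data from `make_classification`, the `param_grid` values, and the names `X_demo`, `y_demo`, and `search` are all illustrative assumptions, not part of the original exercise or the pitch dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 3-class data standing in for the pitch dataset (an assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=6,
    n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

# Hypothetical grid: each of the 2 x 2 combinations is fit and scored
# with 3-fold cross-validation on the training split.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy of best model: %.3f" % search.best_score_)
print("Held-out accuracy: %.3f" % search.score(X_te, y_te))
```

After fitting, `search` behaves like the best estimator refit on the whole training split, so it can be passed to the same evaluation code as any other classifier.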

In [None]:
# import useful tools
import csv
import pickle
from sklearn import datasets, metrics, svm
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multiclass import OneVsOneClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
# sklearn.externals.joblib has been removed from scikit-learn; import joblib directly
import joblib

In [45]:
# Reading pitches thrown for Clayton Kershaw.
csv_file = 'data/ClaytonKershaw.csv'

# store each instance (feature row) and its target value
instances = []
target = []
with open(csv_file, "r", newline="") as file:
    reader = csv.reader(file)
    header = next(reader)  # the first row holds the column names
    for row in reader:
        # the first column is the target; the remaining columns are features
        target.append(int(row[0]))
        instances.append([int(col) for col in row[1:]])
data = [instances, target]

In [32]:
# Column names are shown below (the first column, pitch_type, is the target).
print("Features:\n", header)

# An instance 'X' looks as below
print("\nInstance(one pitch data):\n", instances[0])

Features:
 ['pitch_type', 'batter_num', 'pitch_rl', 'bat_rl', 'inning', 'balls', 'strikes', 'out', 'on_1b', 'on_2b', 'on_3b', 'score_diff', 'prev_pitch_type']

Instance(one pitch data):
 [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, -1]


In [60]:
# Split X and y values.
X = data[0]
y = data[1]
n_samples = len(X)

# split into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

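# try different multiclass models by uncommenting one of the lines below
# and commenting out the RandomForestClassifier line
# (the svm.SVC and GridSearchCV options also need `from sklearn import svm`,
# `from sklearn.model_selection import GridSearchCV`, and a `param_grid` dict)

#clf = OneVsRestClassifier(LinearSVC(random_state=0))
#clf = OneVsOneClassifier(LinearSVC(random_state=0))
#clf = svm.SVC(decision_function_shape='ovr')
#clf = GridSearchCV(svm.SVC(kernel='rbf', decision_function_shape='ovr'), param_grid)
#clf = MLPClassifier()

clf = RandomForestClassifier()
clf.fit(X_train, y_train)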

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
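
To illustrate how the OneVsRestClassifier and OneVsOneClassifier options differ, here is a small sketch on synthetic data. The 4-class data from `make_classification` and the names `X_demo`, `y_demo`, `ovr`, and `ovo` are assumptions made for illustration; the pitch dataset itself has three classes, for which both strategies happen to fit three binary models.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic 4-class problem (an assumption, not the pitch data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=6,
    n_classes=4, random_state=0)

# One-vs-rest: one binary classifier per class (class k against all others).
ovr = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X_demo, y_demo)

# One-vs-one: one binary classifier per pair of classes, n * (n - 1) / 2 in total.
ovo = OneVsOneClassifier(LinearSVC(random_state=0)).fit(X_demo, y_demo)

n_ovr = len(ovr.estimators_)
n_ovo = len(ovo.estimators_)
print("one-vs-rest binary models:", n_ovr)
print("one-vs-one binary models:", n_ovo)
```

With many classes, one-vs-one fits more models, but each is trained only on the samples of its two classes.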

In [61]:
# Evaluation using metrics
actual = y_test
predicted = clf.predict(X_test)

# precision, recall, f1-score report on clf.
print("Classification report for classifier %s:\n%s\n"
  % (clf, metrics.classification_report(actual, predicted)))

# confusion matrix
print("Confusion matrix:\n%s" % metrics.confusion_matrix(actual, predicted))

Classification report for classifier RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False):
             precision    recall  f1-score   support

          1       0.64      0.74      0.69      1130
          7       0.36      0.29      0.32       564
         10       0.33      0.25      0.29       268

avg / total       0.52      0.54      0.53      1962


Confusion matrix:
[[836 213  81]
 [345 162  57]
 [125  75  68]]
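
The per-class scores in the report can be recomputed from the confusion matrix above, which may help in reading both outputs together. This is a sketch using NumPy; the variable names are illustrative. It relies on scikit-learn's convention that rows are actual labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: actual class, columns: predicted class; labels 1, 7, 10).
labels = [1, 7, 10]
cm = np.array([[836, 213, 81],
               [345, 162, 57],
               [125, 75, 68]])

true_pos = np.diag(cm)                 # correctly predicted pitches per class
precision = true_pos / cm.sum(axis=0)  # divide by column sums (all predicted as k)
recall = true_pos / cm.sum(axis=1)     # divide by row sums (all actually k)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

for label, p, r, f in zip(labels, precision, recall, f1):
    print("class %2d  precision %.2f  recall %.2f  f1 %.2f" % (label, p, r, f))
print("accuracy %.2f" % accuracy)
```

For example, precision for class 1 is 836 / (836 + 345 + 125), and recall for class 1 is 836 / 1130, the support listed in the report.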


In [62]:
# we can save the fitted classifier with joblib (to load it back later without retraining).
file_name = "data/ClaytonKershaw_model.pkl"
joblib.dump(clf, file_name)

['data/ClaytonKershaw_model.pkl']
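
Loading a saved model back is the mirror image of `joblib.dump`. The sketch below assumes a small stand-in model trained on synthetic data and a temporary file path (`clf_demo`, `model_path`, and `restored` are illustrative names), so it runs without the pitch dataset.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model on synthetic 3-class data (an assumption, not the pitch data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=6,
    n_classes=3, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=10, random_state=0)
clf_demo.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(clf_demo, model_path)   # write the fitted model to disk
    restored = joblib.load(model_path)  # read it back; no retraining needed

# The restored model should predict exactly like the original.
same_predictions = bool((restored.predict(X_demo) == clf_demo.predict(X_demo)).all())
print("Predictions match after reload:", same_predictions)
```

Model files are pickle-based, so only load files from sources you trust, and ideally with the same scikit-learn version that saved them.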