# Assignment 3 - SVM Classification - McCartney

In this assignment, we will focus on sports analytics. This data set is made available by http://www.baseball-reference.com. It contains data about professional baseball (MLB) games played in the 2016 season. There are 2,427 games in the data set. Each row represents a single game. The goal is to predict the attendance at a home team’s game. This is an important task because most franchises want to predict the number of attendees for a variety of reasons including profits.

## Description of Variables

The descriptions of the variables are provided in "Baseball - Data Dictionary.docx".

## Goal

Use the **baseball.csv** data set and build a model to predict **attendance_binary**. Build (at least) **two SVM** models.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable or undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


# Read and Prepare the Data:

In [1]:
# Common imports
import numpy as np
import pandas as pd

np.random.seed(31484443)

In [2]:
# Import the data set:
#We will predict the "attendance_binary" value in the data set:

baseball = pd.read_csv("baseball.csv")
baseball.head()

Unnamed: 0,attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,game_type,previous_game_type,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,game_day,previous_game_day,temperature,wind_speed,sky,previous_game_duration,previous_homewin
0,0,43683,2,6,2,Night Game,Day Game,0,6,6,Wednesday,Monday,55,24,Overcast,2.933333,1
1,0,45785,0,7,2,Night Game,Day Game,0,10,3,Wednesday,Monday,48,7,Unknown,2.8,1
2,0,48282,0,8,4,Night Game,Day Game,2,4,3,Wednesday,Monday,65,10,Cloudy,3.383333,0
3,0,21830,0,9,6,Day Game,Night Game,0,15,11,Wednesday,Tuesday,77,0,In Dome,3.233333,1
4,0,49289,2,4,2,Night Game,Day Game,1,1,3,Tuesday,Monday,81,12,Cloudy,2.633333,1


In [3]:
baseball.shape

(2427, 17)

In [4]:
baseball.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

# Split Data (train/test)

In [6]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(baseball, test_size=0.3)

In [7]:
train.shape

(1698, 17)

In [8]:
train.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

# Data Prep

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

In [12]:
# Separate the target variable and input variables

train_y = train[['attendance_binary']]
test_y = test[['attendance_binary']]

train_inputs = train.drop(['attendance_binary'], axis=1)
test_inputs = test.drop(['attendance_binary'], axis=1)

In [14]:
# Identify the columns

train_inputs.dtypes

previous_attendance            int64
previous_away_team_errors      int64
previous_away_team_hits        int64
previous_away_team_runs        int64
game_type                     object
previous_game_type            object
previous_home_team_errors      int64
previous_home_team_hits        int64
previous_home_team_runs        int64
game_day                      object
previous_game_day             object
temperature                    int64
wind_speed                     int64
sky                           object
previous_game_duration       float64
previous_homewin               int64
dtype: object

In [15]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [16]:
# Identify the binary columns so we can pass them through without transforming
binary_columns = ['previous_homewin']

In [17]:
# Be careful: numerical columns already includes the binary columns,
# So, we need to remove the binary columns from numerical columns.

for col in binary_columns:
    numeric_columns.remove(col)

In [18]:
binary_columns

['previous_homewin']

In [19]:
numeric_columns

['previous_attendance',
 'previous_away_team_errors',
 'previous_away_team_hits',
 'previous_away_team_runs',
 'previous_home_team_errors',
 'previous_home_team_hits',
 'previous_home_team_runs',
 'temperature',
 'wind_speed',
 'previous_game_duration']

In [21]:
categorical_columns

['game_type', 'previous_game_type', 'game_day', 'previous_game_day', 'sky']

# Pipeline

In [22]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [23]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [24]:
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [26]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns)],
        remainder='passthrough')

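# remainder='passthrough' is optional: it keeps any columns not listed above unchanged.
# Every input column is already listed here, so it has no effect in this notebook.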
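
To make the column bookkeeping concrete, here is a minimal sketch of the same kind of `ColumnTransformer` on a tiny, made-up DataFrame (the values are hypothetical and not taken from baseball.csv). It assumes `sparse_threshold=0` to force a dense result, and it simplifies the categorical and binary branches to single transformers. The output width is the number of numeric columns, plus one column per category level, plus the binary column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: 2 numeric columns, 1 categorical column, 1 binary column
toy = pd.DataFrame({
    "temperature": [55.0, 48.0, np.nan, 77.0],
    "wind_speed": [24.0, 7.0, 10.0, 0.0],
    "sky": ["Overcast", "Cloudy", "Cloudy", "In Dome"],
    "previous_homewin": [1, 1, 0, 1],
})

toy_numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())])

toy_preprocessor = ColumnTransformer(
    [("num", toy_numeric, ["temperature", "wind_speed"]),
     ("cat", OneHotEncoder(handle_unknown="ignore"), ["sky"]),
     ("binary", SimpleImputer(strategy="most_frequent"), ["previous_homewin"])],
    sparse_threshold=0)  # always return a dense array

toy_x = toy_preprocessor.fit_transform(toy)
print(toy_x.shape)  # 2 numeric + 3 one-hot + 1 binary = 6 columns

# A category that was not seen during fit is encoded as all zeros
unseen = toy.iloc[[0]].assign(sky="Sunny")
unseen_x = toy_preprocessor.transform(unseen)
print(unseen_x[0, 2:5])
```

The same counting explains the 37 columns seen later in the notebook: 10 scaled numeric columns, 1 binary column, and the remaining columns from one-hot encoding the 5 categorical variables.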

# Transform

In [27]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

array([[-0.43179529,  0.5460065 , -0.22391713, ...,  0.        ,
         1.        ,  1.        ],
       [ 1.13122021,  0.5460065 , -2.19684451, ...,  0.        ,
         0.        ,  1.        ],
       [ 1.91436121,  0.5460065 , -1.91499774, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 0.30959588, -0.74524564,  0.05792964, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.06685484, -0.74524564, -0.5057639 , ...,  1.        ,
         0.        ,  1.        ],
       [ 0.19526873, -0.74524564,  2.87639732, ...,  0.        ,
         0.        ,  0.        ]])

In [28]:
train_x.shape

(1698, 37)

In [29]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

array([[ 0.0256154 , -0.74524564,  2.59455055, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.692796  ,  3.12851078,  0.3397764 , ...,  0.        ,
         1.        ,  1.        ],
       [ 0.07757301,  0.5460065 ,  1.46716348, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-1.32007643, -0.74524564,  0.3397764 , ...,  0.        ,
         1.        ,  0.        ],
       [ 1.07824183, -0.74524564, -0.22391713, ...,  0.        ,
         1.        ,  1.        ],
       [-0.74864483,  0.5460065 , -0.5057639 , ...,  1.        ,
         0.        ,  1.        ]])

In [79]:
test_x.shape

(729, 37)

# Calculate the Baseline

In [80]:
# Find majority class
train_y.value_counts()

attendance_binary
1                    880
0                    818
dtype: int64

In [82]:
# Find percentage
train_y.value_counts()/len(train_y)

attendance_binary
1                    0.518257
0                    0.481743
dtype: float64
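
The majority-class baseline can also be expressed as a model. The sketch below rebuilds a label vector from the class counts printed above (880 ones and 818 zeros) and scores scikit-learn's `DummyClassifier`; the placeholder feature matrix is an assumption made only because `fit` requires one.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Class counts copied from the value_counts() output above
y_train_demo = np.array([1] * 880 + [0] * 818)

# DummyClassifier ignores the features, so a column of zeros is enough
X_placeholder = np.zeros((len(y_train_demo), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_train_demo)

baseline_acc = accuracy_score(y_train_demo, baseline.predict(X_placeholder))
print(round(baseline_acc, 6))  # 0.518257
```

Any model worth keeping should beat this number on the test set.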

# SVM Model 1: LinearSVC

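LinearSVC fits a linear support vector machine classifier. It does not support the kernel trick. <br>

It is similar to `SVC(kernel='linear')`, but it uses a different solver (liblinear instead of libsvm) and the squared hinge loss by default, so the two can give slightly different results. LinearSVC scales better when you have a large data set.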
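
As a rough illustration of how close the two estimators are in practice, the sketch below fits both on synthetic data (generated with `make_classification`, not baseball.csv) and measures how often their predictions agree. The data settings and `C=1.0` are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in data, standardized like the notebook's numeric columns
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

linear_svc = LinearSVC(C=1.0, dual=True, max_iter=10000).fit(X_demo, y_demo)
kernel_svc = SVC(kernel="linear", C=1.0).fit(X_demo, y_demo)

# Share of rows on which the two linear models predict the same class
agreement = np.mean(linear_svc.predict(X_demo) == kernel_svc.predict(X_demo))
print(f"Share of identical predictions: {agreement:.3f}")

# Only SVC exposes the support vectors
print(hasattr(kernel_svc, "support_vectors_"), hasattr(linear_svc, "support_vectors_"))
```

The predictions usually match closely but need not be identical, which is consistent with the small accuracy differences between Model 1 and Model 3 in this notebook.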

In [101]:
from sklearn.svm import LinearSVC 

# C is the regularization parameter: a smaller C means more regularization and a wider margin
# Use the penalty parameter to select 'l1' or 'l2'; 'l2' is the default

svm_clf = LinearSVC(C=10, max_iter=15000)

svm_clf.fit(train_x, train_y)

LinearSVC(C=10, max_iter=15000)
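
The role of `C` is easier to see with numbers. The sketch below is an illustration under stated assumptions: it uses two overlapping synthetic clusters and `SVC(kernel="linear")` in place of LinearSVC, because with the plain hinge loss the margin width is simply 2 / ||w||. The two `C` values are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian clusters (synthetic, not baseball.csv)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(loc=-1.0, size=(100, 2)),
                    rng.normal(loc=1.0, size=(100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

def fit_and_measure(c_value):
    """Fit a linear SVM and return (margin width, number of support vectors)."""
    model = SVC(kernel="linear", C=c_value).fit(X_demo, y_demo)
    width = 2.0 / np.linalg.norm(model.coef_)
    return width, int(model.n_support_.sum())

wide_margin, n_sv_small_c = fit_and_measure(0.001)
narrow_margin, n_sv_large_c = fit_and_measure(10.0)

print(f"C=0.001: margin width {wide_margin:.2f}, support vectors {n_sv_small_c}")
print(f"C=10:    margin width {narrow_margin:.2f}, support vectors {n_sv_large_c}")
```

A small `C` tolerates more margin violations in exchange for a wider margin; a large `C` penalizes violations heavily and narrows the margin.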

In [102]:
# Accuracy
from sklearn.metrics import accuracy_score

In [103]:
#Predict the train values
train_y_pred = svm_clf.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8356890459363958

In [104]:
#Predict the test values
test_y_pred = svm_clf.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8326474622770919

In [106]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

# We usually compute the confusion matrix on the test set
confusion_matrix(test_y, test_y_pred)

array([[289,  61],
       [ 61, 318]], dtype=int64)

In [107]:
# Classification Report
from sklearn.metrics import classification_report

# We usually create the classification report on the test set
print(classification_report(test_y, test_y_pred))

              precision    recall  f1-score   support

           0       0.83      0.83      0.83       350
           1       0.84      0.84      0.84       379

    accuracy                           0.83       729
   macro avg       0.83      0.83      0.83       729
weighted avg       0.83      0.83      0.83       729
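
The figures in the classification report follow directly from the confusion matrix. The short sketch below recomputes them from the matrix printed above, using scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = actual, columns = predicted)
cm = np.array([[289, 61],
               [61, 318]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # of the games predicted as 1, the share that were 1
recall_1 = tp / (tp + fn)      # of the games that were 1, the share predicted as 1
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(round(accuracy, 4))                           # 0.8326
print(round(precision_1, 2), round(recall_1, 2))    # 0.84 0.84
print(round(precision_0, 2), round(recall_0, 2))    # 0.83 0.83
```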



# SVM Model 2: LinearSVC with Polynomial Terms

In [111]:
from sklearn.preprocessing import PolynomialFeatures

# Create second degree terms
poly_features = PolynomialFeatures(degree=2, include_bias=False)

train_x_poly = poly_features.fit_transform(train_x)

#Don't forget to transform the test set
test_x_poly = poly_features.transform(test_x)

# With degree=2 and two inputs a and b, it creates every term up to total degree 2: a, b, a^2, a*b, b^2

In [115]:
# Fails to converge with the default max_iter=1000

pol_svm = LinearSVC(C=10)

pol_svm.fit(train_x_poly, train_y)

LinearSVC(C=10)
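
One way to detect a failed fit programmatically is to capture scikit-learn's `ConvergenceWarning`. The helper below, `hit_iteration_limit`, is a hypothetical name introduced for this sketch, and the data is synthetic. Setting `max_iter=1` forces the solver to stop early so the warning path is exercised.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import LinearSVC

def hit_iteration_limit(X, y, max_iter):
    """Fit LinearSVC and report whether a ConvergenceWarning was raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        LinearSVC(C=10, dual=True, max_iter=max_iter).fit(X, y)
    return any(issubclass(w.category, ConvergenceWarning) for w in caught)

# Synthetic data (not baseball.csv): the label depends on the first feature plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

stopped_early = hit_iteration_limit(X_demo, y_demo, max_iter=1)
print(stopped_early)  # a single pass over the data hits the iteration limit

# With a generous limit the warning normally disappears
print(hit_iteration_limit(X_demo, y_demo, max_iter=100000))
```

In the notebook, the usual remedies are to raise `max_iter` (as Model 1 does) or to lower `C`.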

In [116]:
#Predict the train values
train_y_poly_pred = pol_svm.predict(train_x_poly)

#Train accuracy
accuracy_score(train_y, train_y_poly_pred)

0.8904593639575972

In [118]:
#Predict the test values
test_y_poly_pred = pol_svm.predict(test_x_poly)

#Test accuracy
accuracy_score(test_y, test_y_poly_pred)

0.7860082304526749

# SVM Model 3: SVC(kernel='linear')

This fits a linear SVM like LinearSVC, but it is slower on large data sets. However, the SVC class also lets you use other kernels such as "poly" or "rbf".

In [120]:
from sklearn.svm import SVC
 
lin_svm2 = SVC(kernel="linear")

lin_svm2.fit(train_x, train_y)

SVC(kernel='linear')

In [121]:
#Predict the train values
train_y_pred = lin_svm2.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8380447585394581

In [122]:
#Predict the test values
test_y_pred = lin_svm2.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8340192043895748

# SVM Model 4: SVC(kernel='poly') 

This is similar to running LinearSVC with polynomial terms. However, the kernel trick avoids building the polynomial features explicitly, which is often faster and more memory-efficient.
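
The link between a polynomial kernel and explicit polynomial terms can be verified on a small case. The sketch below takes two 2-D vectors (arbitrary values) and shows that the degree-2 kernel with `gamma=1` and `coef0=1`, that is (x·z + 1)^2, equals an ordinary dot product after an explicit feature map. The function `phi` is written for this demo only.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector, matching (x.z + 1)^2."""
    a, b = v
    s = np.sqrt(2.0)
    return np.array([1.0, s * a, s * b, a * a, s * a * b, b * b])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Dot product in the expanded 6-dimensional space
explicit = phi(x) @ phi(z)

# Same quantity computed from the original 2-D vectors, with no expansion
kernel = polynomial_kernel(x.reshape(1, -1), z.reshape(1, -1),
                           degree=2, gamma=1.0, coef0=1.0)[0, 0]

print(np.isclose(explicit, kernel))  # True: both equal (1*3 + 2*(-1) + 1)^2 = 4
```

The kernel version never materializes the expanded features, which is where the savings come from when the input has many columns.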

In [139]:
from sklearn.svm import SVC

# gamma scales the kernel: for 'poly' it multiplies the dot product; for 'rbf' it controls the width of the bell curve
# gamma='scale' is the default in scikit-learn 0.22 and later, so passing it explicitly is optional

pol_svm2 = SVC(kernel="poly", degree=2, coef0=1, C=10, gamma='scale')

pol_svm2.fit(train_x, train_y)

SVC(C=10, coef0=1, degree=2, kernel='poly')
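
For dense input, `gamma='scale'` resolves to 1 / (n_features * X.var()). The sketch below checks this on synthetic data by fitting one model with `gamma='scale'` and another with the value computed by hand, then comparing their decision functions. The data and labels are made up for the demo.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic data (not baseball.csv) with a non-linear class boundary
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] * X_demo[:, 1] > 0).astype(int)

# Value that gamma='scale' stands for
gamma_value = 1.0 / (X_demo.shape[1] * X_demo.var())

svc_scale = SVC(kernel="poly", degree=2, coef0=1, C=10, gamma="scale").fit(X_demo, y_demo)
svc_manual = SVC(kernel="poly", degree=2, coef0=1, C=10, gamma=gamma_value).fit(X_demo, y_demo)

same = np.allclose(svc_scale.decision_function(X_demo),
                   svc_manual.decision_function(X_demo))
print(round(gamma_value, 4), same)
```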

In [140]:
#Predict the train values
train_y_pred = pol_svm2.predict(train_x)

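#Train accuracy
# Compare against the predictions, not train_y itself: accuracy_score(train_y, train_y)
# always returns 1.0, which is the value the output below recorded. Rerun to get the real score.
accuracy_score(train_y, train_y_pred)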

1.0

In [141]:
#Predict the test values
test_y_pred = pol_svm2.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8120713305898491

# SVM Model 5: SVC(kernel='rbf')

This uses the Gaussian radial basis function (RBF) kernel.
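
The RBF kernel between two points is exp(-gamma * ||x - z||^2). The sketch below computes it by hand for two arbitrary points and compares the result with scikit-learn's `rbf_kernel`; the value `gamma = 0.5` is chosen only for the demo.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[0.0, 0.0]])
z = np.array([[1.0, 1.0]])
gamma = 0.5

# Squared distance is 2, so the kernel value is exp(-0.5 * 2) = exp(-1)
manual = np.exp(-gamma * np.sum((x - z) ** 2))
library = rbf_kernel(x, z, gamma=gamma)[0, 0]

print(round(manual, 4), round(library, 4))  # 0.3679 0.3679
```

A larger gamma makes the kernel value fall off faster with distance, so each training point influences a smaller neighborhood and the model can fit the training data more tightly.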

In [145]:
rbf_svm = SVC(kernel="rbf", C=10, gamma='scale')

rbf_svm.fit(train_x, train_y)

SVC(C=10)

In [146]:
#Predict the train values
train_y_pred = rbf_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.9687868080094229

In [148]:
#Predict the test values
test_y_pred = rbf_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8024691358024691

# Discussion

Briefly answer the following questions: (2 points) 
1) Which model performs the best (and why)?<br>
2) Does the best model perform better than the baseline (and why)?<br>
3) Does the best model exhibit any overfitting; what did you do about it?

1. SVM Model 3, SVC(kernel='linear'), performed the best: it had the highest test accuracy, 0.834.
2. Model 3 significantly outperformed the baseline: 0.834 test accuracy versus the majority-class baseline of 0.518.
3. Model 3 did not exhibit overfitting: the test score (0.834) was only slightly lower than the train score (0.838).
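
The comparison behind these answers can be laid out as a small calculation. The sketch below uses the accuracies printed in the cells above, rounded to four decimals, and reports each model's lift over the baseline and its train/test gap. Model 4's train accuracy is left out because the recorded value (1.0) came from comparing `train_y` with itself.

```python
# (train accuracy, test accuracy) copied from the outputs above
results = {
    "Model 1: LinearSVC": (0.8357, 0.8326),
    "Model 2: LinearSVC + polynomial terms": (0.8905, 0.7860),
    "Model 3: SVC linear": (0.8380, 0.8340),
    "Model 4: SVC poly": (None, 0.8121),   # recorded train score is not usable
    "Model 5: SVC rbf": (0.9688, 0.8025),
}
baseline = 0.5183  # majority-class share in the training set

for name, (train_acc, test_acc) in results.items():
    lift = round(test_acc - baseline, 4)
    gap = None if train_acc is None else round(train_acc - test_acc, 4)
    print(f"{name}: test={test_acc:.4f}, lift over baseline={lift}, train-test gap={gap}")

# Pick the model with the highest test accuracy
best = max(results, key=lambda name: results[name][1])
best_gap = round(results[best][0] - results[best][1], 4)
print(best, best_gap)
```

The two linear models have gaps below 0.01, while the polynomial-feature and RBF models have gaps above 0.10, which is the pattern expected from overfitting.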