### Support Vector Machines

 - Linear classifiers trained with the hinge loss and L2 regularization
 - With the hinge loss, observations that fall in the "flat" (zero-loss) part of the loss curve do not contribute to the loss, so removing them would not change the fit
 - Support vectors are the training examples that influence the decision boundary, i.e. the examples that are not in the flat part of the loss diagram
 - All incorrectly classified points are support vectors
 - Correctly classified examples that lie close to the boundary are also support vectors
 - If an example is not a support vector, removing it does not change the model
 - A kernel SVM only needs the support vectors at prediction time (it does not "look" at all the training points), so a small number of support vectors keeps prediction fast
<img src="ml_assets/svm.png" style="width: 300px;"/>
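The claims above can be checked on a small example. The following is a minimal sketch, assuming a toy two-class dataset from `make_blobs` (not the notebook's data) and a linear-kernel `SVC` with a tight `tol` so the comparison is numerically clean. It computes the per-example hinge loss, confirms that the non-support vectors sit in the zero-loss region, and refits on the support vectors alone.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data (an assumption for illustration, not the notebook's dataset)
X_toy, y_toy = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)
y_toy = np.where(y_toy == 0, -1, 1)  # labels in {-1, +1}, as the hinge loss expects

# Fit a linear-kernel SVM; support_ holds the indices of the support vectors
full_model = SVC(kernel="linear", C=1.0, tol=1e-6).fit(X_toy, y_toy)
sv_idx = full_model.support_
non_sv = np.setdiff1d(np.arange(len(y_toy)), sv_idx)

# Per-example hinge loss: max(0, 1 - y * f(x))
hinge = np.maximum(0.0, 1.0 - y_toy * full_model.decision_function(X_toy))
print("support vectors:", len(sv_idx), "of", len(y_toy))
print("largest hinge loss among non-support vectors:", hinge[non_sv].max())

# Refit on the support vectors only and compare the learned hyperplane
sv_model = SVC(kernel="linear", C=1.0, tol=1e-6).fit(X_toy[sv_idx], y_toy[sv_idx])
print("coefficients (all data):      ", full_model.coef_.ravel())
print("coefficients (support vectors):", sv_model.coef_.ravel())
```

The two coefficient vectors should agree up to solver tolerance, which is the sense in which non-support vectors "do not influence the model".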

 - Fitting a linear model in a transformed space corresponds to fitting a non-linear model in the original space (e.g. replacing feature x with the transformed feature x^2 can make a dataset linearly separable; the image shows the original vs. the transformed space)

<img src="ml_assets/transformed_space.png" style="width: 500px;"/>
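As a rough illustration of the x -> x^2 idea, the sketch below builds a hypothetical one-dimensional dataset in which the positive class sits on both ends of the axis, so no single threshold on x separates the classes. The data, the cut-off of 1.5, and the `LinearSVC` settings are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical 1-D data: class +1 when |x| > 1.5, class -1 otherwise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200).reshape(-1, 1)
y_ends = np.where(np.abs(x.ravel()) > 1.5, 1, -1)

# A linear classifier on the raw feature can only place one threshold on x
raw_model = LinearSVC(C=10, max_iter=10000, random_state=0).fit(x, y_ends)
raw_acc = raw_model.score(x, y_ends)

# The same linear classifier on x^2 needs a single threshold (near 1.5^2)
x_squared = x ** 2
sq_model = LinearSVC(C=10, max_iter=10000, random_state=0).fit(x_squared, y_ends)
sq_acc = sq_model.score(x_squared, y_ends)

print(f"training accuracy on x:   {raw_acc:.3f}")
print(f"training accuracy on x^2: {sq_acc:.3f}")
```

The boundary is linear in x^2, but mapped back to the original axis it becomes two cut points, which is a non-linear decision rule in x.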

 - In scikit-learn, use LinearSVC for a linear SVM or SVC for a kernel SVM
 
 - Learn more: https://en.wikipedia.org/wiki/Support-vector_machine
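To make the choice between the two estimators concrete, here is a small sketch that fits both on the breast cancer data used later in this section. The pipeline with `StandardScaler`, the variable names, and the hyperparameter values are assumptions for illustration. Note that `LinearSVC` and `SVC(kernel="linear")` are not identical solvers, so even two linear SVMs can give slightly different fits.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, test_size=0.35, random_state=42)

# Linear SVM: no kernel, so there is no gamma to set
linear_svm = make_pipeline(StandardScaler(),
                           LinearSVC(C=1.0, max_iter=10000, random_state=0))

# Kernel SVM with the RBF kernel (the SVC default)
kernel_svm = make_pipeline(StandardScaler(),
                           SVC(kernel="rbf", C=1.0, gamma="scale"))

linear_acc = linear_svm.fit(X_tr, y_tr).score(X_te, y_te)
kernel_acc = kernel_svm.fit(X_tr, y_tr).score(X_te, y_te)

print(f"LinearSVC test accuracy: {linear_acc:.3f}")
print(f"SVC (rbf) test accuracy: {kernel_acc:.3f}")
```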

In [14]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import scale
from sklearn.metrics import confusion_matrix, classification_report

# Import the necessary modules
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

In [2]:
cure = load_breast_cancer()
features = pd.DataFrame(scale(cure['data']), columns = cure['feature_names'])
features.shape

(569, 30)

In [3]:
target = pd.DataFrame(cure['target'], columns=['Type']).replace(0,-1)
target.shape

(569, 1)

In [7]:
df = pd.concat([target,features],axis=1)
df.shape

(569, 31)

#### Data split

In [5]:
# Create arrays for the features and the response variable
y = target.values.reshape(-1)
X = features.values

display(y.shape, X.shape)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.35,
                                                    random_state=42)

(569,)

(569, 30)

#### Hyperparameter tuning gamma and C with GridSearchCV

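 - C controls regularization: smaller values of C mean stronger regularization
 - kernel defaults to "rbf" (radial basis function), but you can use "linear" instead (in that case there is no gamma parameter to set)
 - gamma controls the smoothness of the decision boundary: larger values give a less smooth, more complex boundary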
     - If gamma is too large, the radius of the area of influence of the support vectors only includes the support vector itself and no amount of regularization with C will be able to prevent overfitting
     - When gamma is very small, the model is too constrained and cannot capture the complexity or “shape” of the data. The region of influence of any selected support vector would include the whole training set. The resulting model will behave similarly to a linear model with a set of hyperplanes that separate the centers of high density of any pair of two classes

<img src="ml_assets/gamma.png" style="width: 500px;"/>
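The effect of gamma described above can be seen by comparing training and test accuracy at a very small, a moderate, and a very large value. This is a sketch on a hypothetical `make_moons` dataset rather than the notebook's data, and the three gamma values are picked only to exaggerate the contrast.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy non-linear dataset (an assumption for illustration)
X_moons, y_moons = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_moons, y_moons,
                                          test_size=0.35, random_state=42)

scores = {}
for gamma in [1e-5, 1.0, 1e3]:
    model = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for each gamma
    scores[gamma] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"gamma={gamma:g}: train={scores[gamma][0]:.3f}, "
          f"test={scores[gamma][1]:.3f}")
```

A large gap between training and test accuracy at the largest gamma points to overfitting, while low training accuracy at the smallest gamma points to a model that is too constrained.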



In [12]:
# Instantiate an RBF SVM
svm = SVC()

# Instantiate the GridSearchCV object and run the search
parameters = {'C':[0.1, 1, 10], 'gamma':[0.00001, 0.0001, 0.001, 0.01, 0.1, 1]}
searcher = GridSearchCV(svm, parameters)
searcher.fit(X_train, y_train)

# Report the best parameters and the corresponding score
print("Best CV params", searcher.best_params_)
print("Best CV accuracy", searcher.best_score_)

# Report the test accuracy using these best parameters
print("Test accuracy of best grid search hypers:", searcher.score(X_test, y_test))

Best CV params {'C': 10, 'gamma': 0.01}
Best CV accuracy 0.9701592002961867
Test accuracy of best grid search hypers: 0.985


#### Stochastic gradient descent

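 - With SGDClassifier you can set the loss to "hinge" (a linear SVM) or "log" (logistic regression; renamed "log_loss" in newer scikit-learn releases)
 - Scales well with large datasets
 - The regularization parameter is called alpha and it behaves as the inverse of C in SVC or LogisticRegression, so a larger alpha means stronger regularization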
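One way to see that alpha acts like the inverse of C is to watch the size of the learned coefficients shrink as alpha grows. The sketch below assumes the hinge loss with an L2 penalty and standardized breast cancer features. The variable names and the three alpha values are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_bc = StandardScaler().fit_transform(X_bc)

coef_norms = {}
for alpha in [1e-4, 1e-2, 1.0]:
    # loss="hinge" with an L2 penalty gives a linear SVM trained by SGD
    clf = SGDClassifier(loss="hinge", penalty="l2", alpha=alpha, random_state=0)
    clf.fit(X_bc, y_bc)
    coef_norms[alpha] = float(np.linalg.norm(clf.coef_))
    print(f"alpha={alpha:g}: ||coef|| = {coef_norms[alpha]:.3f}")
```

Stronger regularization (larger alpha) pulls the weights toward zero, just as a smaller C would in SVC.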

In [15]:
# We set random_state=0 for reproducibility 
linear_classifier = SGDClassifier(random_state=0)

# Instantiate the GridSearchCV object and run the search
parameters = {'alpha':[0.00001, 0.0001, 0.001, 0.01, 0.1, 1], 
             'loss':['hinge', 'log'], 'penalty':['l1','l2']}

searcher = GridSearchCV(linear_classifier, parameters, cv=10)

searcher.fit(X_train, y_train)

# Report the best parameters and the corresponding score
print("Best CV params", searcher.best_params_)
print("Best CV accuracy", searcher.best_score_)
print("Test accuracy of best grid search hypers:", searcher.score(X_test, y_test))

Best CV params {'alpha': 0.01, 'loss': 'log', 'penalty': 'l2'}
Best CV accuracy 0.9701201201201203
Test accuracy of best grid search hypers: 0.99
