# Week 3 - Classical ML Models - Part II

## 1. Support Vector Machine

During the previous week, we learned how to use logistic regression for binary classification problems.

This week, we are going to look at another type of supervised model that can be used for classification problems - **Support vector machine** (or **SVM**).

### Introduction

![SVM](https://miro.medium.com/max/625/1*ala8WX2z47WYpn932hUkhA.jpeg)

SVM is a supervised ML algorithm that is most commonly used for binary classification problems. As in the previous cases, training such a model involves passing it a set of examples $(x_i, y_i)$. For instance, suppose we want to build a model for spam detection. In that case, the features ($x_i$) could contain the word and link counts, while the label ($y_i$) could be either 1 (spam) or 0 (not spam). Note that the SVM formulas below encode the negative class as $-1$ instead of 0.

So far, we have a general understanding of how an SVM is trained. However, we have not yet covered the hypothesis function we are trying to fit during training.

### Hypothesis

The goal of the SVM algorithm is to find a hyperplane (a boundary line in two dimensions) with the following properties:
- It separates the two classes with the maximum possible margin
- Its equation $w^T x + b$ outputs a value $\geq 1$ for examples in the positive class and $\leq -1$ for examples in the negative class

In mathematical terms, the resulting prediction rule can be written as:
- $\hat{y} = -1$ if $w^T x + b < 0$
- $\hat{y} = 1$ if $w^T x + b \geq 0$
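
The prediction rule above can be sketched in a few lines of NumPy. This is only an illustration: the function name `predict` and the example values of `w`, `b` and `X` are made up for this sketch and are not part of the course code.

```python
import numpy as np

def predict(X, w, b):
    """Return +1 where w^T x + b >= 0 and -1 otherwise."""
    scores = X @ w + b
    return np.where(scores >= 0, 1, -1)

# Hypothetical 2-D hyperplane: x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0

X = np.array([
    [2.0, 2.0],  # score  3.0, well inside the positive side
    [0.0, 0.0],  # score -1.0, on the negative side
    [0.5, 0.5],  # score  0.0, exactly on the boundary
])

labels = predict(X, w, b)
print(labels)
```

A point that lies exactly on the hyperplane is assigned to the positive class, because the rule uses $\geq 0$ for $\hat{y} = 1$.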


### Cost function

As with the previous models, SVM has a cost function associated with it. We have two main goals: maximizing the margin width, which corresponds to keeping the norm of the weight vector $w$ small, and minimizing the number of examples that end up on the wrong side of the margin.

Therefore, our cost function has two parts:

$J(w) = \frac{\lVert w \rVert^2}{2} + C\left[\frac{1}{N}\sum_{i=1}^{N} \max(0, 1 - y_i(w^T x_i + b))\right]$

The first term keeps the weights small (which widens the margin), while the second term is called the hinge loss. The regularization constant $C$ controls how heavily misclassifications are weighted.

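In Python, the cost function can be expressed as follows. Here `reg_strength` plays the role of $C$, and the bias $b$ is assumed to be absorbed into `W` through a constant feature column in `X`: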

In [None]:
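import numpy as np

def compute_cost(W, X, Y, reg_strength=1.0):
    # reg_strength is the regularization constant C
    # (the default of 1.0 is an assumed placeholder, tune it for your data)
    N = X.shape[0]

    # hinge loss: max(0, 1 - y_i * (x_i . W)) for every example
    distances = np.maximum(0, 1 - Y * np.dot(X, W))
    hinge_loss = reg_strength * (np.sum(distances) / N)

    # total cost: margin term plus weighted hinge loss
    cost = 1 / 2 * np.dot(W, W) + hinge_loss
    return cost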
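
To get a feel for how the hinge term behaves, here is a small worked example. The scores are hypothetical values of $w^T x + b$ chosen for illustration, and the helper name `hinge` is introduced only for this sketch.

```python
import numpy as np

def hinge(y, score):
    """Hinge loss max(0, 1 - y * score) for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * score)

y = np.array([1, 1, 1, -1])
scores = np.array([2.0, 0.5, -1.0, -3.0])  # hypothetical values of w^T x + b

losses = hinge(y, scores)
for label, score, loss in zip(y, scores, losses):
    print(f"y={label:+d}  score={score:+.1f}  hinge loss={loss:.1f}")
```

An example that is classified correctly and lies outside the margin contributes no loss, one that is correct but inside the margin contributes a small loss, and a misclassified example contributes a loss greater than 1.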

As we now have our cost function, we need to find a way to optimize it.

### Optimization

Similar to the logistic regression model, we are going to apply the gradient descent algorithm to minimize our cost function. Taking the partial derivative with respect to $w$ (with the bias absorbed into $w$) gives $\nabla_w J = \frac{1}{N}\sum_{i=1}^{N} g_i$, where the contribution $g_i$ of each example is:
- $g_i = w$ if $\max(0, 1 - y_i(w^T x_i)) = 0$
- $g_i = w - C y_i x_i$ otherwise

This piecewise gradient has the following Python implementation:

In [None]:
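import numpy as np

def calculate_cost_gradient(W, X_batch, Y_batch, reg_strength=1.0):
    # reg_strength is the regularization constant C
    # (the default of 1.0 is an assumed placeholder, tune it for your data)
    # if only one example is passed (e.g. in case of SGD), wrap it into arrays
    if np.ndim(Y_batch) == 0:
        Y_batch = np.array([Y_batch])
        X_batch = np.array([X_batch])
    distance = 1 - (Y_batch * np.dot(X_batch, W))
    dw = np.zeros(len(W))
    for ind, d in enumerate(distance):
        if max(0, d) == 0:
            # margin satisfied: only the regularization term contributes
            di = W
        else:
            # margin violated: add the hinge-loss term
            di = W - (reg_strength * Y_batch[ind] * X_batch[ind])
        dw += di
    dw = dw / len(Y_batch)  # average over the batch
    return dw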
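
One way to gain confidence in a hand-derived gradient is to compare it with a finite-difference approximation of the cost. The sketch below does this on random data. It re-implements the cost and gradient in vectorized form so that it runs on its own; the value `C = 10.0`, the random data and the helper names are assumptions made for this check only.

```python
import numpy as np

C = 10.0  # assumed regularization constant for this check

def cost(w, X, y):
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * np.dot(w, w) + C * hinge.mean()

def gradient(w, X, y):
    # every example contributes w; margin violators also contribute -C * y_i * x_i
    violating = (1.0 - y * (X @ w)) > 0
    return w - C * (X[violating].T @ y[violating]) / len(y)

def numerical_gradient(w, X, y, eps=1e-6):
    # central differences, one coordinate at a time
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (cost(w + step, X, y) - cost(w - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.choice([-1.0, 1.0], size=20)
w = rng.normal(size=3)

analytic = gradient(w, X, y)
numeric = numerical_gradient(w, X, y)
max_diff = np.max(np.abs(analytic - numeric))
print("largest difference:", max_diff)
```

The hinge loss is not differentiable exactly at the margin, so the two gradients can disagree if an example happens to sit on that kink. With random data this is very unlikely.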

After finding the gradient of the cost function, we update the weights by taking a small step in the opposite direction of the gradient:

In [None]:
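import numpy as np
from sklearn.utils import shuffle

def update(features, outputs, learning_rate=1e-6, max_epochs=5000):
    # the learning_rate default is an assumed placeholder, tune it for your data
    weights = np.zeros(features.shape[1])
    # stochastic gradient descent
    for epoch in range(max_epochs):
        # shuffle to prevent repeating update cycles
        X, Y = shuffle(features, outputs)
        for ind, x in enumerate(X):
            gradient = calculate_cost_gradient(weights, x, Y[ind])
            # step against the gradient to decrease the cost
            weights = weights - (learning_rate * gradient)

    return weights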
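
To see the whole training loop in action, here is a compact, self-contained sketch that trains a linear SVM with stochastic gradient descent on a toy dataset. The function name `sgd_linear_svm`, the hyperparameter values and the two synthetic clusters are all assumptions chosen for illustration. The bias is handled by appending a constant column of ones to the features.

```python
import numpy as np

def sgd_linear_svm(X, y, C=10.0, learning_rate=0.01, epochs=50, seed=0):
    """Train a linear SVM with per-example gradient steps (labels in {-1, +1})."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # visit the examples in a new random order every epoch
        for i in rng.permutation(len(y)):
            if 1.0 - y[i] * np.dot(X[i], w) > 0:
                grad = w - C * y[i] * X[i]  # margin violated
            else:
                grad = w                    # margin satisfied
            w = w - learning_rate * grad
    return w

# Toy data: two well-separated clusters (hypothetical, for illustration only)
rng = np.random.default_rng(42)
pos = rng.normal(loc=2.0, scale=0.5, size=(50, 2))
neg = rng.normal(loc=-2.0, scale=0.5, size=(50, 2))
X = np.vstack([pos, neg])
X = np.hstack([X, np.ones((X.shape[0], 1))])  # constant column acts as the bias
y = np.concatenate([np.ones(50), -np.ones(50)])

w = sgd_linear_svm(X, y)
predictions = np.where(X @ w >= 0, 1, -1)
train_accuracy = np.mean(predictions == y)
print("weights:", w)
print("training accuracy:", train_accuracy)
```

Because the two clusters are far apart, a handful of epochs is enough here. Real datasets usually need feature scaling and a more careful choice of learning rate.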

### Sklearn implementation

As in the previous cases, instead of building our model from scratch, we can save some time and use the scikit-learn library.

In [None]:
from sklearn import svm

clf = svm.SVC(kernel = 'linear')
clf.fit(X_train, y_train)

As you may notice, we have set the **kernel** parameter. As you may remember, the relationship between the features and the classes is not always linear: the decision boundary might have to take a circular or polynomial form to separate the classes. The kernel parameter defines the form of this boundary.
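
The effect of the kernel is easiest to see on data that no straight line can separate. The sketch below uses scikit-learn's `make_circles` to generate two concentric rings and compares a linear kernel with an RBF kernel. The dataset parameters are assumptions picked for illustration, and the accuracy is measured on the training data only.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: the inner ring is one class, the outer ring the other
X, y = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print("linear kernel accuracy:", linear_acc)
print("rbf kernel accuracy:", rbf_acc)
```

A linear boundary cannot enclose the inner ring, so the linear kernel does poorly here, while the RBF kernel can fit a closed boundary around it.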

### Exercise

Now it's time to apply our skills. For this purpose, we are going to use the breast cancer dataset that ships with scikit-learn.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics

In [None]:
data = datasets.load_breast_cancer()

X = data.data
y = data.target

#Split data
X_train, X_test, y_train, y_test =

#Selecting SVM model with 'linear' kernel


#Fitting model into train dataset


#Save predictions to y_pred variable
y_pred = 

In [4]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9649122807017544


### Logistic regression vs SVM

We can see that SVM and logistic regression have many similarities: both use linear decision boundaries to separate classes in classification problems. As a result, both models can (in most cases) be used as substitutes for one another without a large drop in accuracy.

On the other hand, it is useful to know the cases in which one of the models provides better computational performance:
- When the number of features is large (1 - 10000) and the number of training examples is small (10 - 10000), use **logistic regression** or **SVM with a linear kernel**.
- When the number of features is small (1 - 1000) and the number of training examples is intermediate (10 - 10000), use **SVM**.
- When the number of features is small (1 - 1000) and the number of training examples is large (50000 - 1000000), use **logistic regression** or **SVM with a linear kernel**.
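
The claim that the two models can often substitute for one another can be checked with a quick side-by-side comparison. The sketch below trains both on the scikit-learn breast cancer dataset. The split (`test_size=0.2`, `random_state=0`), the `StandardScaler` step and `max_iter=1000` are assumptions made for this sketch, so the scores will not necessarily match the exercise output above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling the features helps both models converge quickly
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "linear SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

On a dataset like this one, the two test accuracies are typically close to each other, which is consistent with the point made above.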