## **1. Introduction**
---
### Comparison to Logistic Regression
- Remember logistic regression used a curve of the form:
$$y_{\beta}(x) = \frac{1}{1+e^{-(\beta_0 + \beta_{1}x)}}$$
- Where $\beta_{k}$ are model parameters to be optimised through fitting
- The idea is to transform the loss function from linear regression so that far-away points are weighted much less, and our separating hyperplane does not try too hard to get those right.
- This was a measure taken in order to not be skewed by the outliers in our dataset

### SVMs
- Involves moving decision boundary to find the best possible position
- One way of finding the best possible decision boundary is by moving it until we have no misclassifications
- However, often there are multiple locations where we get no misclassifications, so we have a "region", in which we have no misclassifications
- The overarching goal of SVMs is to maximise this "region" between classes (the margin)
- SVMs will **not return probabilities**, only a classification of 0 vs 1
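
As a minimal sketch of the last point, assuming scikit-learn is available and using a made-up 1D dataset: `LinearSVC` exposes hard labels through `predict` and raw signed scores through `decision_function`, but has no `predict_proba` method.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical toy data: class 0 on the left, class 1 on the right
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LinearSVC(C=1.0).fit(X, y)

labels = clf.predict(X)                    # hard 0/1 labels
scores = clf.decision_function(X)          # signed scores, not probabilities
has_proba = hasattr(clf, "predict_proba")  # LinearSVC has no predict_proba
```

The sign of each score gives the predicted side of the boundary; its size is not a probability.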

## **2. SVMs**
---
- The data points closest to the decision boundary, which define the edges of the region between classes, are called support vectors
- In 2D we can consider finding the line that best separates the two regions
- So we want to find the line that has no misclassifications and is as far away as possible from both regions (to reduce the sensitivity of the model)

### The cost function of SVMs
- With logistic regression, the cost function smooths out and never reaches zero unless we get an exact prediction of 1.00
- SVMs instead use hinge loss, which gives zero loss to correctly classified points that lie outside the margin, and a loss that grows linearly the further a point sits on the wrong side of its margin boundary
- The margin boundaries are the two lines that pass through the support vectors on either side of the decision boundary. Once a point is inside the margin it begins to incur loss: zero on its own class's margin boundary, rising to 1 on the decision boundary itself
- If the point falls on the margin boundary of the opposite class, the loss is 2
- This loss function is piecewise linear, with a "breakpoint" at the margin boundary of the desired class
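
The loss values described above can be checked with a few lines of NumPy. This is a sketch that assumes the usual hinge-loss convention of labels in {-1, +1} (rather than 0/1) and a raw score $\beta \cdot x + \beta_0$; the function name `hinge_loss` is just illustrative.

```python
import numpy as np

def hinge_loss(y, score):
    """Hinge loss for labels y in {-1, +1} and raw scores beta . x + beta_0."""
    return np.maximum(0.0, 1.0 - y * score)

# Raw scores for four points whose true class is +1:
# beyond own margin, on own margin, on the decision boundary, on the opposite margin
scores = np.array([2.0, 1.0, 0.0, -1.0])
losses = hinge_loss(1, scores)
```

The breakpoint sits at a score of 1, which is the margin boundary of the desired class.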

### Regularisation in SVMs
- Achieved by adding a term to the cost function that tries to minimise the coefficients, i.e. trying to get the decision boundary to be as simple as possible
$$J(\beta) = \text{SVMCost}(\beta) + \frac{1}{c}\sum_i\beta_i^2$$
- For overfit models, the first term (misclassification term) may be low, but inevitably a more complicated decision boundary will have been used, so the second term will be larger
- The regularisation effect can be tuned by modifying the value of c (lower c = more regularisation)
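
A quick way to see this effect, sketched on synthetic data from `make_classification` (the dataset and the two values of `C` are arbitrary choices for illustration): fit `LinearSVC` with a small and a large `C` and compare the size of the coefficient vector. scikit-learn multiplies the loss term by `C` rather than dividing the penalty by it, which amounts to the same trade-off.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic data, purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

norms = {}
for C in (0.001, 1.0):
    clf = LinearSVC(C=C, max_iter=10000).fit(X, y)
    # L2 norm of the fitted coefficients for this value of C
    norms[C] = float(np.linalg.norm(clf.coef_))
```

The smaller `C` should give the smaller coefficient norm, i.e. stronger regularisation.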

### Interpretation of SVM coefficients
- The vector $\beta$ represents the vector orthogonal to the hyperplane (or decision boundary)
- The sign of $\beta \cdot x + \beta_0$ (the coefficients dotted with a feature vector $x$, plus the intercept) tells us which side of the boundary that feature vector falls on, and therefore which label it receives
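
This can be sketched with a fitted `LinearSVC`, assuming a two-class problem generated with `make_blobs`: recomputing the score by hand from `coef_` and `intercept_` and taking its sign should reproduce `predict`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Two synthetic clusters labelled 0 and 1
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

beta = clf.coef_[0]          # vector orthogonal to the decision boundary
beta_0 = clf.intercept_[0]   # offset of the boundary

scores = X @ beta + beta_0                # beta . x + beta_0 for every row
manual_labels = (scores > 0).astype(int)  # positive side -> class 1
```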

### Syntax

In [None]:
# Import class containing our classification method
from sklearn.svm import LinearSVC
LinSVC = LinearSVC(penalty='l2', C=10.0)
LinSVC.fit(X_train, y_train)        # type: ignore
y_predict = LinSVC.predict(X_test)  # type: ignore

# Tune regularisation parameters with cross-validation (GridSearchCV)
# Can also use LinearSVR for regression (y_train would have to be continuous values)
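
A minimal sketch of that tuning step, assuming synthetic data and an arbitrary grid of `C` values chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for X_train / y_train
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate regularisation strengths (illustrative values)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(LinearSVC(max_iter=10000), param_grid, cv=5)
search.fit(X, y)

best_C = search.best_params_["C"]  # value of C with the best mean CV accuracy
```

After fitting, `search.best_estimator_` is a `LinearSVC` refitted on all of the data with the selected `C`.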

## **3. Support Vector Machines with Gaussian Kernels**
---
- What we have seen previously with linear SVMs was SVMs in their simplest form
- There are things we can do to get non-linear decision boundaries and make better predictions

### Kernel Trick
- The kernel trick can be used to achieve a non-linear decision boundary
- Data that is not linearly separable can become linearly separable in a higher-dimensional space
- So under the right transformation of our space, we can find a linear separating hyperplane which maps back to our non-linear decision boundary
    - Usually this works by adding a higher dimension, creating our linear decision boundary in this higher dimension, and then mapping it back to our original space
    - As we increase dimensions, we should eventually be able to find some linear decision boundary
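
A small sketch of this idea, using a made-up 1D dataset where class 1 sits between two groups of class 0. No single threshold on `x` separates the classes, but after adding `x**2` as a second dimension a linear boundary does. The explicit feature map here is an illustration of the lifting step, not of how a kernel SVM computes it internally.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical 1D data: class 1 in the middle, class 0 on both sides
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])

X_1d = x.reshape(-1, 1)            # original space
X_2d = np.column_stack([x, x**2])  # lifted space with an extra dimension

# Training accuracy of a linear boundary in each space
acc_1d = LinearSVC(C=10.0, max_iter=10000).fit(X_1d, y).score(X_1d, y)
acc_2d = LinearSVC(C=10.0, max_iter=10000).fit(X_2d, y).score(X_2d, y)
```

In the lifted space the boundary is roughly a horizontal line in `x**2`, which maps back to two thresholds on the original axis.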

### SVM Gaussian Kernel
- Approach 1: create higher order features to transform the data
- Approach 2: define similarity functions and then use those similarity functions to transform our space to higher dimensions
    - Look at the particular feature vectors (data points) in our dataset, and find the "similarity" between different feature vectors
    - We do this by centring a Gaussian function on each data point and evaluating it at every other data point, so the value falls off with distance from the chosen point (repeat this for all pairs of points)
    - For example, with 3 chosen points we get 3 radial Gaussian basis functions, and any data point can be described by 3 numbers: its similarities to those 3 points
    - This mapping places similar points closer to each other in the higher-dimensional space, making it easier to find a hyperplane that separates the classes
    - Essentially, we are transforming our feature space into a "similarity to each labelled point" space, using these Gaussian functions
    - These similarities are defined by the radial basis functions (RBFs)
    - Instead of choosing just 3 points as in the example above, we calculate the RBF for all points in our training set, which gives a similarity mapping from any new point to every training point
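
The similarity mapping can be sketched directly, assuming the common RBF form $\exp(-\gamma \lVert x - l \rVert^2)$ with an arbitrary $\gamma$ and three made-up points. The manual computation is compared against scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Three hypothetical data points: two close together, one far away
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
gamma = 0.5  # illustrative width parameter

# Squared Euclidean distance between every pair of points
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

# Row i holds the similarity of point i to every point in X
K_manual = np.exp(-gamma * sq_dists)
K_sklearn = rbf_kernel(X, X, gamma=gamma)
```

Each row of `K_manual` is the new representation of one point in "similarity to every training point" space; nearby points get values close to 1 and distant points values close to 0.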

### Syntax

In [None]:
# Import the class containing our classification method
from sklearn.svm import SVC
# Create a Gaussian SVM classifier
rbfSVC = SVC(kernel='rbf', gamma="auto", C=10.0)
rbfSVC.fit(X_train, y_train)        # type: ignore
y_predict = rbfSVC.predict(X_test) # type: ignore

# tune kernel and associated parameters with cross-validation
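
One possible shape for that tuning step, sketched on the synthetic `make_moons` dataset with an arbitrary grid over `C` and `gamma`:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic non-linear data standing in for X_train / y_train
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Illustrative grid over the regularisation strength and kernel width
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.1, 1.0, 10.0]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_  # best (C, gamma) pair by mean CV accuracy
```

Larger `gamma` makes each Gaussian narrower, so the boundary can bend more; tuning it together with `C` guards against overfitting.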

## **4. Workflow for SVMs**
---
- SVMs with RBF kernels are very slow to train with lots of features or data (applying kernel for every single datapoint many times over)
- We can speed this up with kernel approximation: construct an approximate kernel map using Nystroem or RBFSampler (mapping the original dataset to higher dimensions), and then fit a linear classifier (for example one trained with SGD) to this new dataset
- The number of features/ rows of data can help us choose the right model:
    - Many features (~10k) and few rows (1k), use simple models, ie. Logistic or LinearSVC
    - Few features (<100) and medium rows (~10k), use SVC with RBF
    - Few features (<100) and many rows (>100k), add features, then use logistic regression or LinearSVC, or use a kernel approximation

### Syntax for faster kernel transformations

In [None]:
# Import the class containing our classification method
from sklearn.kernel_approximation import Nystroem

# create an instance of the Nystroem kernel approximation
NystroemSVC = Nystroem(kernel="rbf", gamma=None, n_components=100) # gamma=None means 1 / n_features for RBF, matching gamma="auto" in our SVC model above

# fit the Nystroem model on the training data and transform both splits
X_train = NystroemSVC.fit_transform(X_train)  # type: ignore
X_test = NystroemSVC.transform(X_test)        # type: ignore

# tune kernel and associated parameters with cross-validation

# could also use RBFSampler
from sklearn.kernel_approximation import RBFSampler

# create an instance of the RBFSampler
rbf_sampler = RBFSampler(gamma=1.0, n_components=100)

# fit the RBFSampler on the (original, untransformed) training data and transform both splits
X_train = rbf_sampler.fit_transform(X_train)  # type: ignore
X_test = rbf_sampler.transform(X_test)        # type: ignore

# in both of these cases we would then use a linear classifier to fit the transformed data
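
Putting the two steps together, a sketch of the full workflow on synthetic data. The choice of `SGDClassifier` with hinge loss as the linear model, the `gamma` value, and the dataset are all assumptions for illustration; `LinearSVC` or logistic regression could be swapped in.

```python
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic non-linear data
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Approximate RBF feature map followed by a linear SVM trained with SGD
approx_svm = make_pipeline(
    Nystroem(kernel="rbf", gamma=1.0, n_components=100, random_state=0),
    SGDClassifier(loss="hinge", random_state=0),
)
approx_svm.fit(X_train, y_train)

accuracy = approx_svm.score(X_test, y_test)              # test-set accuracy
n_mapped_features = approx_svm[0].transform(X_test).shape[1]
```

Wrapping both steps in a pipeline keeps the kernel map fitted on the training split only, and lets `n_components` and `gamma` be tuned with cross-validation alongside the classifier.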