# Large Margin Objective

## Optimization Objective

### From logistic regression

Logistic regression uses the sigmoid function as its hypothesis:

$$h_{\theta}(x) = g(z) = \frac{1}{1+e^{-\theta^T x}}$$

where $z = \theta^T x$.

In logistic regression:
- If $y=1$, we want $h_\theta(x) \approx 1$, which requires $\theta^T x \gg 0$
- If $y=0$, we want $h_\theta(x) \approx 0$, which requires $\theta^T x \ll 0$

The cost of a single example $(x, y)$:
$$-\left(y \log h_{\theta}(x) + (1-y) \log(1-h_{\theta}(x))\right)$$
Substituting $h_{\theta}(x)$:
$$-y \log \frac{1}{1+e^{-\theta^T x}} - (1-y) \log\left(1-\frac{1}{1+e^{-\theta^T x}}\right)$$

1. When $y=1$, only the first term remains: $-\log \frac{1}{1+e^{-\theta^T x}}$
2. When $y=0$, only the second term remains: $-\log\left(1-\frac{1}{1+e^{-\theta^T x}}\right)$
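
The two cases can be checked numerically. The following is a minimal sketch in plain Python; the function names `sigmoid` and `logistic_cost` are chosen here for illustration and are not part of the original notes.

```python
import math

def sigmoid(z):
    """Sigmoid g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_cost(z, y):
    """Per-example logistic regression cost, with z = theta^T x and y in {0, 1}."""
    h = sigmoid(z)
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

# For y = 1 the cost shrinks as z grows; for y = 0 it shrinks as z decreases.
for z in (-3.0, 0.0, 3.0):
    print(z, logistic_cost(z, 1), logistic_cost(z, 0))
```

The cost for $y=1$ never reaches exactly zero, however large $z$ becomes, which is the behaviour the SVM cost changes.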

### Going into SVM:

- Replace the logistic regression cost with a piecewise-linear approximation, where $z = \theta^T x$:
    - $y = 1$ $\rightarrow$ $\text{cost}_1(z)$
        - Flat part (cost = 0) for $z \geq 1$
        - Straight line (decreasing in $z$) for $z < 1$
    - $y = 0$ $\rightarrow$ $\text{cost}_0(z)$
        - Flat part (cost = 0) for $z \leq -1$
        - Straight line (increasing in $z$) for $z > -1$

- The simpler cost makes the optimization problem computationally advantageous
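
A minimal sketch of the two cost functions follows. The notes only say "straight line", so the slope of $\pm 1$ used here (the usual hinge form) is an assumption, as are the Python names `cost_1` and `cost_0`.

```python
def cost_1(z):
    """Cost for y = 1: zero for z >= 1, straight line for z < 1.

    Assumption: the line has slope -1 (standard hinge loss).
    """
    return max(0.0, 1.0 - z)

def cost_0(z):
    """Cost for y = 0: zero for z <= -1, straight line for z > -1.

    Assumption: the line has slope +1 (mirror image of cost_1).
    """
    return max(0.0, 1.0 + z)

for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(z, cost_1(z), cost_0(z))
```

Unlike the logistic cost, both functions are exactly zero once $z$ is on the correct side of $\pm 1$.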

### Cost function:

__LR__:

$$ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n}\theta_j^2 $$

Moving the minus sign inside each term:

$$ J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \left(-\log h_\theta(x^{(i)})\right) + (1-y^{(i)}) \left(-\log(1-h_\theta(x^{(i)}))\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n}\theta_j^2 $$

__SVM__:

$$ J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)}) \text{cost}_0(\theta^T x^{(i)}) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n}\theta_j^2 $$

Re-organising the above: multiplying by the constant $m$ does not change the minimizing $\theta$, so the $\frac{1}{m}$ factors can be dropped:

$$ J(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)}) \text{cost}_0(\theta^T x^{(i)}) \right] + \frac{\lambda}{2} \sum_{j=1}^{n}\theta_j^2 $$

Writing $A$ for the summed cost term and $B = \frac{1}{2} \sum_{j=1}^{n}\theta_j^2$ for the regularization term:

$$ J(\theta) = A + \lambda B $$

Re-organising the regularization structure: dividing by $\lambda$ (again without changing the minimizer) and setting $\textbf{C} = \frac{1}{\lambda}$ gives

$$ J(\theta) = \textbf{C} A + B $$

$$ J(\theta) = \textbf{C} \sum_{i=1}^{m} \left[ y^{(i)} \text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)}) \text{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n}\theta_j^2 $$
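
The final objective can be evaluated directly. This is a sketch only: it assumes each row of `X` starts with a constant 1 so that `theta[0]` plays the role of $\theta_0$, and it assumes hinge-shaped costs with slope $\pm 1$. All function names are illustrative.

```python
def cost_1(z):
    # Assumed hinge form for y = 1: zero for z >= 1.
    return max(0.0, 1.0 - z)

def cost_0(z):
    # Assumed hinge form for y = 0: zero for z <= -1.
    return max(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    """J(theta) = C * A + B, where A is the summed cost and B the regularizer."""
    data_term = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t * v for t, v in zip(theta, x_i))  # theta^T x^(i)
        data_term += y_i * cost_1(z) + (1 - y_i) * cost_0(z)
    # The sum runs over j = 1..n, so theta_0 is not regularized.
    reg_term = 0.5 * sum(t * t for t in theta[1:])
    return C * data_term + reg_term

X = [[1.0, 2.0], [1.0, -2.0]]  # leading 1.0 is the intercept feature
y = [1, 0]
print(svm_objective([0.0, 0.25], X, y, C=1.0))
print(svm_objective([0.0, 0.25], X, y, C=10.0))
```

Raising `C` scales only the cost term, so a large `C` pushes the optimizer toward fitting every example, just as a small $\lambda$ would.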

### Hypothesis
$$
h_{\theta}(x) = \begin{cases} 
1 & \mbox{if } \theta^T x \geq 0 \\ 
0 & \mbox{otherwise }
\end{cases}
$$
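
As a small illustration, the hypothesis is a hard threshold on $\theta^T x$. The sketch below assumes `x` already contains the intercept feature $x_0 = 1$; the name `predict` is chosen here for illustration.

```python
def predict(theta, x):
    """Return 1 if theta^T x >= 0, otherwise 0."""
    z = sum(t * v for t, v in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [-1.0, 1.0]            # hypothetical parameters: boundary at x_1 = 1
print(predict(theta, [1.0, 2.0]))
print(predict(theta, [1.0, 0.5]))
```

Unlike logistic regression, the output is a class label rather than a probability.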

## Large Margin Intuition 

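SVMs are referred to as large margin classifiers. The thresholds sit at $1$ and $-1$ rather than at $0$: the cost is zero only when $\theta^T x \geq 1$ for positive examples ($y=1$) and $\theta^T x \leq -1$ for negative examples ($y=0$).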

![title](pictures/svm_largeMargin.png)



SVMs add a margin around the decision boundary, which makes the classifier more robust.

![title](pictures/svm_decisionBoundary.png)

An SVM can be robust to outliers when the regularization parameter $\textbf{C}$ is adequately tuned (not too large).

![title](pictures/svm_outliers.png)

## Math of Large Margin Classifiers

![title](pictures/svm_mathInnerProduct.png)

![title](pictures/svm_mathDecisionBoundary.png)

![title](pictures/svm_mathDecisionBoundary2.png)

# Kernels

## Landmarks & Kernels

![title](pictures/svm_kernel.png)

The similarity function is also called a kernel function. It can be written for $n$-dimensional vectors ($n$ features).

![title](pictures/svm_kernel2.png)

__Example__ with two features ($n = 2$)

Height of the surface equal to $f_1$ (similarity to the first landmark $l^{(1)}$):

- $f_1 = 1$ when $x = l^{(1)}$
- $f_1 \approx 0$ when $x$ is far from $l^{(1)}$

$\sigma$ is the parameter of the Gaussian kernel. It controls how quickly $f_1$ falls off as $x$ moves away from the landmark: a smaller $\sigma$ gives a narrower peak.

![title](pictures/svm_gaussianKernel.png)
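
The behaviour described above can be reproduced with a short sketch. It assumes the standard Gaussian kernel form $\exp\left(-\frac{\|x - l\|^2}{2\sigma^2}\right)$; the function name and the sample points are illustrative.

```python
import math

def gaussian_kernel(x, l, sigma):
    """Similarity between point x and landmark l: exp(-||x - l||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, l))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

landmark = [3.0, 5.0]  # hypothetical landmark l^(1)
for sigma in (0.5, 1.0, 3.0):
    at_landmark = gaussian_kernel([3.0, 5.0], landmark, sigma)
    nearby = gaussian_kernel([4.0, 5.0], landmark, sigma)
    far_away = gaussian_kernel([10.0, 12.0], landmark, sigma)
    print(sigma, at_landmark, nearby, far_away)
```

For a fixed point near the landmark, the similarity grows with $\sigma$, which is the widening of the peak shown in the figure.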

## Choosing landmarks

Options:

- Landmark per training example. $l^{(1)} = x^{(1)},\: ..., l^{(m)} = x^{(m)}$

### Landmark per training example

![title](pictures/svm_landmarkPerExample.png)
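
With one landmark per training example, each input $x$ is mapped to a vector of $m$ similarities. The sketch below assumes a Gaussian kernel and prepends an intercept feature $f_0 = 1$; both choices, and all names, are assumptions made for illustration.

```python
import math

def gaussian_kernel(x, l, sigma):
    """exp(-||x - l||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, l))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_features(x, landmarks, sigma):
    """Map x to [f_0, f_1, ..., f_m], where f_0 = 1 and f_i = similarity(x, l^(i))."""
    return [1.0] + [gaussian_kernel(x, l, sigma) for l in landmarks]

X_train = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]  # hypothetical training set, m = 3
landmarks = X_train                              # l^(i) = x^(i)

f = kernel_features(X_train[1], landmarks, sigma=1.0)
print(len(f), f)
```

The feature vector has $m + 1$ entries regardless of the original number of features $n$, and the entry for the example's own landmark is exactly 1.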

__Training__

Note again that the regularization term is not summed over $\theta_0$.

The regularization term can be rewritten as follows:
$$\sum_j \theta^2_j = \theta^T\theta$$
However, some SVM implementations replace it with the following variation:
$$\theta^T\textbf{M}\theta$$
where $\textbf{M}$ is a matrix, dependent on the kernel, that rescales the distance measure.
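
The relation between the two forms can be seen in a small sketch: with $\textbf{M}$ equal to the identity, $\theta^T\textbf{M}\theta$ reduces to $\sum_j \theta_j^2$. The matrix values below are hypothetical and only illustrate the rescaling; real implementations derive $\textbf{M}$ from the kernel.

```python
def quadratic_form(theta, M):
    """Compute theta^T M theta for a list theta and a square list-of-lists M."""
    n = len(theta)
    return sum(theta[i] * M[i][j] * theta[j] for i in range(n) for j in range(n))

theta = [2.0, -1.0]                    # theta_1..theta_n only; theta_0 is excluded
identity = [[1.0, 0.0], [0.0, 1.0]]
M = [[2.0, 0.0], [0.0, 0.5]]           # hypothetical rescaling matrix

print(quadratic_form(theta, identity))  # same as the plain sum of squares
print(quadratic_form(theta, M))
```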

![title](pictures/svm_trainingKernel.png)

### Variance & Bias in SVMs

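Choosing parameters $\textbf{C}$ and $\sigma$:

- Large $\textbf{C}$ (weak regularization): lower bias, higher variance. Small $\textbf{C}$: higher bias, lower variance.
- Large $\sigma^2$: features vary more smoothly, giving higher bias and lower variance. Small $\sigma^2$: lower bias, higher variance.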

![title](pictures/svm_parameters.png)


# Using an SVM

Use a standard library rather than writing your own solver. You still need to select:

- Parameter __C__
- Choice of kernel
    - No kernel ("linear kernel", predict $y=1$ if $\theta^Tx \geq 0$).
        - Useful when $n$ is large & $m$ is small.
    - Gaussian kernel.
        - Choose $\sigma^2$. Useful when $n$ is small & $m$ is large.

Depending on the library, you may need to provide the __kernel__ function yourself.

## Providing a Kernel

![title](pictures/svm_provideKernel.png)


## Choices of Kernel

Kernels __MUST__ comply with __Mercer's Theorem__, so that the optimizations in SVM packages run correctly.

![title](pictures/svm_otherKernels.png)


## Multiclass classification with SVMs

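__One VS all__: Train $K$ SVMs, one per class, with parameters $\theta^{(1)}, \dots, \theta^{(K)}$, and predict the class $i$ with the largest $(\theta^{(i)})^T x$.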

![title](pictures/svm_multiClassClassification.png)
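
The prediction step of one-vs-all can be sketched as follows, assuming the $K$ parameter vectors have already been trained. The parameter values and the function name are hypothetical, and classes are returned as 0-based indices.

```python
def predict_one_vs_all(thetas, x):
    """Return the index i of the classifier with the largest (theta^(i))^T x."""
    scores = [sum(t * v for t, v in zip(theta, x)) for theta in thetas]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical parameters for K = 3 classes; x includes the intercept feature 1.0.
thetas = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, -1.0, -1.0],
]
print(predict_one_vs_all(thetas, [1.0, 2.0, 0.5]))
print(predict_one_vs_all(thetas, [1.0, 0.0, 3.0]))
```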


## Logistic regression VS SVMs

- $n$ is large relative to $m$ ($n = 10000$, $m = 10 - 1000$):
    - Logistic regression or SVM with linear kernel
- $n$ is small, $m$ is intermediate ($n = 1 - 1000$, $m = 10 - 10000$):
    - SVM with Gaussian kernel
- $n$ is small, $m$ is large ($n = 1 - 1000$, $m = 50000+$):
    - Create/add more features, then logistic regression or SVM with linear kernel

![title](pictures/svm_logisticVsSvms.png)
