# Support Vector Machines (SVMs)

### Optimization Objective
- As with logistic regression, $z=\theta^Tx$
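- For $y=1$, the logistic cost term $-\log\frac{1}{1+e^{-z}}$ is approximated by a piecewise-linear function $cost_{1}(z)$ that is zero for $z \geq 1$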
![image.png](attachment:image.png)
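- For $y=0$, the logistic cost term $-\log(1-\frac{1}{1+e^{-z}})$ is approximated by a piecewise-linear function $cost_{0}(z)$ that is zero for $z \leq -1$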
![image-2.png](attachment:image-2.png)

- Cost Function for Logistic Regression:
    - $\frac{1}{m}[\sum \limits_{i=1}^{m}y^{(i)}(-\log h_\theta(x^{(i)}))+(1-y^{(i)})(-\log(1-h_\theta(x^{(i)})))]+\frac{\lambda}{2m}\sum \limits_{j=1}^{n}\theta_j^2$
    
- Cost Function for SVM:
    - $\frac{1}{m}[\sum \limits_{i=1}^{m}y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})]+\frac{\lambda}{2m}\sum \limits_{j=1}^{n}\theta_j^2$
    - Multiplying the objective by the constant $m$ does not change the minimizing $\theta$, so the $\frac{1}{m}$ factors can be dropped: $[\sum \limits_{i=1}^{m}y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})]+\frac{\lambda}{2}\sum \limits_{j=1}^{n}\theta_j^2$
    - Logistic regression minimizes an objective of the form $A+\lambda B$, whereas the SVM by convention minimizes $CA+B$. Here $C$ is a positive constant that weights the data-fit term $A$, which is a different way of controlling the trade-off (i.e. a different way of prioritizing how much we care about optimizing the first term)
        - $C$ plays a similar role to $\frac{1}{\lambda}$
    - $C[\sum \limits_{i=1}^{m}y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})]+\frac{1}{2}\sum \limits_{j=1}^{n}\theta_j^2$
    
- SVM Hypothesis:
    - $h_\theta(x)=\left\{ \begin{array}{ll}
      1 & \theta^Tx \geq 0 \\
      0 & \text{otherwise} \\
\end{array} 
\right.$
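
The objective and hypothesis above can be sketched in plain Python. This is a minimal illustration, not a solver: it assumes hinge-style costs $\max(0, 1-z)$ and $\max(0, 1+z)$ (the exact slopes are an assumption, chosen so the cost is zero where the notes ask for $\theta^Tx \geq 1$ or $\theta^Tx \leq -1$), it assumes prediction thresholds $\theta^Tx$ at 0 as in the Kernels II hypothesis, and the data and `theta` values are made up.

```python
def cost_1(z):
    """Assumed hinge-style cost for y = 1: zero once z >= 1."""
    return max(0.0, 1.0 - z)


def cost_0(z):
    """Assumed hinge-style cost for y = 0: zero once z <= -1."""
    return max(0.0, 1.0 + z)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def svm_objective(theta, X, y, C):
    """C * (sum of per-example costs) + 1/2 * sum_{j>=1} theta_j^2.

    Each row of X starts with the intercept feature x_0 = 1, and
    theta[0] is left out of the regularization term.
    """
    data_term = 0.0
    for x_i, y_i in zip(X, y):
        z = dot(theta, x_i)
        data_term += y_i * cost_1(z) + (1 - y_i) * cost_0(z)
    reg_term = 0.5 * sum(t * t for t in theta[1:])
    return C * data_term + reg_term


def predict(theta, x):
    """Predict 1 when theta^T x >= 0, otherwise 0."""
    return 1 if dot(theta, x) >= 0 else 0


# Hypothetical toy data: three examples, two features plus the intercept.
X = [[1.0, 2.0, 2.0], [1.0, -1.0, -2.0], [1.0, 0.5, 0.0]]
y = [1, 0, 1]
theta = [0.0, 1.0, 1.0]

objective = svm_objective(theta, X, y, C=1.0)
predictions = [predict(theta, x) for x in X]
```

Only the third example contributes to the data term here: it is classified correctly ($\theta^Tx = 0.5 \geq 0$) but falls short of the margin $\theta^Tx \geq 1$.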


### Large Margin Intuition

- Considering the cost function,
    - if $y=1$, we want $\theta^Tx \geq 1$ (not just $\geq0$)
    - if $y=0$, we want $\theta^Tx \leq -1$ (not just $\leq0$)
- If the objective is to minimize the cost function, we can constrain $\theta$ so that the cost term of every training example equals 0. The following constraints accomplish this:
    - $\theta^Tx \geq 1$ if $y=1$
    - $\theta^Tx \leq -1$ if $y=0$
    - If $C$ is very large, the minimization is pushed strongly toward parameters $\theta$ for which $\sum \limits_{i=1}^{m}y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})=0$
    - The cost function then reduces to: $J(\theta)= C \cdot 0+\frac{1}{2}\sum \limits_{j=1}^{n}\theta_j^2 = \frac{1}{2}\sum \limits_{j=1}^{n}\theta_j^2$
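
As a rough check of this reduction, the sketch below tests a few hand-picked candidate parameter vectors against the margin constraints and then compares the remaining term $\frac{1}{2}\sum_j \theta_j^2$. The data, the candidates, and the helper names are all hypothetical; a real SVM solver searches over all $\theta$ rather than a short list.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def satisfies_margin(theta, X, y):
    """True when theta^T x >= 1 for every y = 1 and <= -1 for every y = 0."""
    for x_i, y_i in zip(X, y):
        z = dot(theta, x_i)
        if y_i == 1 and z < 1:
            return False
        if y_i == 0 and z > -1:
            return False
    return True


def reduced_cost(theta):
    """1/2 * sum_{j>=1} theta_j^2 (theta[0] is the intercept)."""
    return 0.5 * sum(t * t for t in theta[1:])


# Hypothetical 1-D data with an intercept feature x_0 = 1.
X = [[1.0, 2.0], [1.0, 3.0], [1.0, -2.0], [1.0, -4.0]]
y = [1, 1, 0, 0]

# Hand-picked candidates, not the output of an optimizer.
candidates = [[0.0, 0.25], [0.0, 0.5], [0.0, 2.0]]

feasible = [t for t in candidates if satisfies_margin(t, X, y)]
best = min(feasible, key=reduced_cost)
```

Among the candidates that meet every constraint, the one with the smallest parameters wins, which is the behaviour the reduced cost function describes.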
    
### Mathematics Behind Large Margin Classification

- Vector Inner Product:
    - $u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$
    - $v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$
    - $u^Tv = u_1v_1+u_2v_2$
    - $\left\Vert u \right\Vert=$ length of $u$ = $\sqrt{u_1^2+u_2^2} \in \mathbb{R}$
    - $p$ = signed length of the projection of the vector $v$ onto $u$ (negative when the angle between $u$ and $v$ exceeds $90^\circ$)
        - $u^Tv = p \cdot \left\Vert u \right\Vert$
        - The component form $u_1v_1+u_2v_2$ gives the same value as this geometric interpretation
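
The agreement between the two views of the inner product can be checked numerically. In this small sketch the vectors are arbitrary examples, and the projection length is computed from angles alone so that it does not depend on the component formula.

```python
import math


def inner_product(u, v):
    """Component form: u_1 v_1 + u_2 v_2."""
    return sum(ui * vi for ui, vi in zip(u, v))


def signed_projection_length(v, u):
    """p = ||v|| * cos(angle between u and v), computed from angles only."""
    angle_u = math.atan2(u[1], u[0])
    angle_v = math.atan2(v[1], v[0])
    return math.hypot(v[0], v[1]) * math.cos(angle_v - angle_u)


u = [4.0, 2.0]
v = [1.0, 3.0]    # less than 90 degrees from u, so p is positive
w = [-1.0, -3.0]  # more than 90 degrees from u, so p is negative

norm_u = math.hypot(u[0], u[1])
p_v = signed_projection_length(v, u)
p_w = signed_projection_length(w, u)

# Both products should match the component form up to rounding error.
geometric_v = p_v * norm_u
geometric_w = p_w * norm_u
```

The second vector shows why the projection length has to carry a sign: the inner product is negative when the vectors point in roughly opposite directions.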

- SVM Decision Boundary:
    - $\min_\theta \frac{1}{2}\sum \limits_{j=1}^{n}\theta_j^2$
    - s.t.
        - $\theta^Tx^{(i)} \geq 1$ if $y^{(i)}=1$
        - $\theta^Tx^{(i)} \leq -1$ if $y^{(i)}=0$
        
    - For a simple example, set $\theta_0 = 0$ and $n=2$ (where $n$ is the number of features). The objective can then be rewritten as: $\frac{1}{2}(\theta_1^2+\theta_2^2)$ = $\frac{1}{2}(\sqrt{\theta_1^2+\theta_2^2})^2$
        - $\sqrt{\theta_1^2+\theta_2^2} = \left\Vert \theta \right\Vert$
        - Therefore, the objective can be written as $\frac{1}{2}\left\Vert \theta \right\Vert^2$
        - $\theta^Tx^{(i)} = p^{(i)} \cdot \left\Vert \theta \right\Vert$ = $\theta_1x_1^{(i)}+\theta_2x_2^{(i)}$
            - where $p^{(i)}$ is the signed length of the projection of $x^{(i)}$ onto the vector $\theta$
            - The constraints therefore become $p^{(i)} \cdot \left\Vert \theta \right\Vert \geq 1$ if $y^{(i)}=1$ and $p^{(i)} \cdot \left\Vert \theta \right\Vert \leq -1$ if $y^{(i)}=0$, so larger projections (a larger margin) allow a smaller $\left\Vert \theta \right\Vert$
        ![image-3.png](attachment:image-3.png)
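
To make the link between projections and $\left\Vert \theta \right\Vert$ concrete, the sketch below fixes a direction for $\theta$ (with $\theta_0 = 0$) and computes the smallest norm that still satisfies $p^{(i)} \cdot \left\Vert \theta \right\Vert \geq 1$ for $y=1$ and $\leq -1$ for $y=0$. The data points, the two directions, and the helper names are hypothetical and chosen only for illustration.

```python
import math


def signed_projection(x, direction):
    """Signed length of the projection of x onto `direction`."""
    norm = math.hypot(direction[0], direction[1])
    return (x[0] * direction[0] + x[1] * direction[1]) / norm


def min_norm_for_direction(direction, X, y):
    """Smallest ||theta|| along `direction` that meets every constraint.

    Returns None when some example projects onto the wrong side,
    because no scaling of theta can then satisfy its constraint.
    """
    smallest_margin = math.inf
    for x_i, y_i in zip(X, y):
        p = signed_projection(x_i, direction)
        margin = p if y_i == 1 else -p
        if margin <= 0:
            return None
        smallest_margin = min(smallest_margin, margin)
    # p * ||theta|| >= 1 for the closest example gives ||theta|| >= 1 / p.
    return 1.0 / smallest_margin


# Hypothetical 2-D data (theta_0 = 0, so no intercept feature).
X = [[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]]
y = [1, 1, 0, 0]

norm_diag = min_norm_for_direction([1.0, 1.0], X, y)  # large projections
norm_axis = min_norm_for_direction([1.0, 0.0], X, y)  # smaller projections
```

The diagonal direction gives every example a larger projection, so it needs a smaller $\left\Vert \theta \right\Vert$ and is preferred by the objective $\frac{1}{2}\left\Vert \theta \right\Vert^2$.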
    
    
    
### Kernels I

- Kernels can be used to develop complex nonlinear classifiers with SVMs
- Given $x$, compute new features depending on proximity to landmarks $l^{(1)}$, $l^{(2)}$, $l^{(3)}$:
    - $f_1$ = similarity$(x,l^{(1)})$ = $\exp\left(-\frac{\left\Vert x-l^{(1)} \right\Vert^2}{2\sigma^2}\right)$
    - $f_2$ = similarity$(x,l^{(2)})$ = $\exp\left(-\frac{\left\Vert x-l^{(2)} \right\Vert^2}{2\sigma^2}\right)$... and so on
    - The similarity function is called the kernel function; this particular one is the Gaussian kernel
        - also written as $k(x,l^{(i)})$
    - Writing out the norm, the kernel $\exp\left(-\frac{\left\Vert x-l^{(i)} \right\Vert^2}{2\sigma^2}\right)$ can also be expressed as $\exp\left(-\frac{\sum \limits_{j=1}^n(x_j - l_j^{(i)})^2}{2\sigma^2}\right)$
  
- If $x\approx l^{(1)}$:
    - $f_1 \approx \exp\left(-\frac{0^2}{2\sigma^2}\right) = 1$
- If $x$ is far from $l^{(1)}$:
    - $f_1 = \exp\left(-\frac{(\text{large number})^2}{2\sigma^2}\right)\approx 0$

![image-5.png](attachment:image-5.png)

- The figure above shows how the feature value depends on the distance between an input $x$ and a landmark $l^{(i)}$, and how the $\sigma^2$ parameter affects that relationship
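
The behaviour of the Gaussian kernel near and far from a landmark can be reproduced with a few lines of Python. The landmark, the test points, and the $\sigma^2$ values below are arbitrary illustrative choices, not values taken from the figure.

```python
import math


def gaussian_kernel(x, landmark, sigma_sq):
    """exp(-||x - l||^2 / (2 * sigma^2)) for equal-length sequences."""
    sq_dist = sum((xj - lj) ** 2 for xj, lj in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma_sq))


landmark = [3.0, 5.0]  # hypothetical landmark

f_at = gaussian_kernel([3.0, 5.0], landmark, sigma_sq=1.0)     # x equals the landmark
f_far = gaussian_kernel([10.0, 12.0], landmark, sigma_sq=1.0)  # x far from the landmark

# Same point at distance 1 from the landmark, two different widths.
f_narrow = gaussian_kernel([4.0, 5.0], landmark, sigma_sq=0.5)
f_wide = gaussian_kernel([4.0, 5.0], landmark, sigma_sq=3.0)
```

With a larger $\sigma^2$ the feature stays closer to 1 at the same distance, so it falls off more slowly as $x$ moves away from the landmark.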


### Kernels II

- Where do we get $l^{(1)}$, $l^{(2)}$, $l^{(3)}$, etc.?
    - Place one landmark at each training example, i.e. $l^{(1)}=x^{(1)}$, $l^{(2)}=x^{(2)}$, $l^{(3)}=x^{(3)}$,...,$l^{(m)}=x^{(m)}$
    - $f=\begin{bmatrix}f_0\\f_1 \\ f_2\\ \vdots \\f_m\end{bmatrix}$ where $f_0=1$
    - For training example $(x^{(i)},y^{(i)})$, map $x^{(i)}$ to the features:
        - $f_1^{(i)}=sim(x^{(i)},l^{(1)})$
        - $f_2^{(i)}=sim(x^{(i)},l^{(2)})$ and so on, up to $f_m^{(i)}=sim(x^{(i)},l^{(m)})$
        - Since $l^{(i)}=x^{(i)}$, one of these features is $f_i^{(i)}=sim(x^{(i)},l^{(i)})=1$ for the Gaussian kernel
- Hypothesis: Given x, compute features $f \in \mathbb{R}^{m+1}$
    - Predict "y=1" if $\theta^Tf \geq 0$
    - Get the parameters $\theta$ by minimizing the SVM cost function:
        - $\min_\theta C\sum \limits_{i=1}^m \left[y^{(i)}cost_1(\theta^Tf^{(i)})+(1-y^{(i)})cost_0(\theta^Tf^{(i)})\right]+\frac{1}{2}\sum \limits_{j=1}^n\theta_j^2$
        - where $n=m$, because there is one feature per landmark
        - $\sum \limits_{j=1}^n\theta_j^2$ can be written as $\theta^T\theta$ (ignoring $\theta_0$)
            - Most SVM implementations instead use $\theta^TM\theta$, where $M$ is a matrix determined by the kernel used
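
The landmark construction and the kernelized hypothesis can be sketched as follows, assuming the Gaussian kernel as the similarity function. The training points are made up and `theta` is a hand-picked vector, not the result of minimizing the cost function.

```python
import math


def gaussian_kernel(x, landmark, sigma_sq):
    sq_dist = sum((xj - lj) ** 2 for xj, lj in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma_sq))


def kernel_features(x, landmarks, sigma_sq):
    """Return f = [f_0, f_1, ..., f_m] with f_0 = 1."""
    return [1.0] + [gaussian_kernel(x, l, sigma_sq) for l in landmarks]


def predict(theta, x, landmarks, sigma_sq):
    """Predict 1 when theta^T f >= 0, otherwise 0."""
    f = kernel_features(x, landmarks, sigma_sq)
    score = sum(t * fi for t, fi in zip(theta, f))
    return 1 if score >= 0 else 0


# Hypothetical training inputs; each one doubles as a landmark.
X_train = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
landmarks = X_train

# Hand-picked parameters (m + 1 = 4 entries), not fitted values.
theta = [-0.5, 1.0, 1.0, 0.0]

f_first = kernel_features(X_train[0], landmarks, sigma_sq=1.0)
near = predict(theta, [0.5, 0.0], landmarks, sigma_sq=1.0)  # close to landmarks 1 and 2
far = predict(theta, [5.0, 5.0], landmarks, sigma_sq=1.0)   # close only to landmark 3
```

With these parameters, points near the first two landmarks get a positive score, while points near the third landmark fall back to the negative intercept.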
            
- SVM Parameters:
    - $C$ (plays a role similar to $\frac{1}{\lambda}$):
        - Large $C$: lower bias, higher variance
        - Small $C$: higher bias, lower variance
    - $\sigma^2$:
        - Large $\sigma^2$: features $f_i$ vary more smoothly. Higher bias, lower variance.
        - Small $\sigma^2$: features $f_i$ vary more abruptly. Lower bias, higher variance.

### Using an SVM

- It is advisable to use an SVM software package (e.g. liblinear, libsvm...) to solve for the parameters $\theta$
- Need to specify:
    - Choice of parameter C
    - Choice of kernel (similarity function):
        - No Kernel (linear kernel)
            - Predict "y=1" if $\theta^Tx \geq 0$
        - Gaussian Kernel:
            - $f_i = e^{(-\frac{\left\Vert x-l^{(i)} \right\Vert^2}{2\sigma^2})}$, where $l^{(i)}=x^{(i)}$
            - need to choose $\sigma^2$
            - **Note:** Perform feature scaling BEFORE using the Gaussian kernel; otherwise features with large ranges dominate $\left\Vert x-l \right\Vert^2$
        - Note: Not all similarity functions make valid kernels. A kernel needs to satisfy a technical condition called **Mercer's Theorem** so that SVM packages' optimizations run correctly and do not diverge.
        - Other off-the-shelf kernels include:
            - Polynomial Kernel: $k(x,l)=(x^Tl)^2$, $k(x,l)=(x^Tl)^3$, $k(x,l)=(x^Tl+1)^3$, etc. (in general $(x^Tl+\text{constant})^{\text{degree}}$)
            - String Kernel, Chi-square Kernel, Histogram Intersection Kernel,...
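
The feature-scaling note can be illustrated with a small sketch. It assumes standardization (zero mean, unit standard deviation per feature) as the scaling method, which is one common choice rather than the only one, and the two features (a size in the thousands and a small count) are made-up values.

```python
import math


def standardize(X):
    """Scale each column to zero mean and unit standard deviation.

    Assumes no column is constant (a constant column would divide by zero).
    """
    m, n = len(X), len(X[0])
    means = [sum(row[j] for row in X) / m for j in range(n)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / m)
        for j in range(n)
    ]
    return [[(row[j] - means[j]) / stds[j] for j in range(n)] for row in X]


def sq_dist_terms(a, b):
    """Per-feature contributions to ||a - b||^2."""
    return [(aj - bj) ** 2 for aj, bj in zip(a, b)]


# Hypothetical examples: [size, count] on very different scales.
X = [[1000.0, 1.0], [2000.0, 5.0], [1500.0, 3.0]]

raw_terms = sq_dist_terms(X[0], X[1])  # first feature dominates

X_scaled = standardize(X)
scaled_terms = sq_dist_terms(X_scaled[0], X_scaled[1])  # balanced contributions
```

Before scaling, the squared distance that feeds the Gaussian kernel is almost entirely determined by the first feature; after scaling, both features contribute equally for this data.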
            
- Multi-class Classification:
    - Many SVM packages already have built-in multi-class classification functionality.
    - Otherwise, use the one-vs-all method: train $K$ SVMs, one to distinguish $y=i$ from the rest for $i=1,2,...,K$, to get $\theta^{(1)}$,$\theta^{(2)}$,...,$\theta^{(K)}$
        - Pick the class $i$ with the largest $(\theta^{(i)})^Tx$
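
The one-vs-all prediction step can be sketched as below. The three parameter vectors are hypothetical stand-ins for $K=3$ separately trained SVMs; training itself is not shown.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def predict_one_vs_all(thetas, x):
    """Return the 1-based class index i whose (theta^(i))^T x is largest."""
    scores = [dot(theta, x) for theta in thetas]
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index + 1


# Hypothetical parameters for K = 3 classes; x includes x_0 = 1.
thetas = [
    [0.0, 1.0, 0.0],    # class 1 vs rest
    [0.0, 0.0, 1.0],    # class 2 vs rest
    [0.0, -1.0, -1.0],  # class 3 vs rest
]

class_a = predict_one_vs_all(thetas, [1.0, 0.2, 0.9])    # scores 0.2, 0.9, -1.1
class_b = predict_one_vs_all(thetas, [1.0, -2.0, -1.0])  # scores -2.0, -1.0, 3.0
```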
        
- Logistic Regression vs SVMs
    - n = number of features ($x \in \mathbb{R}^{n+1}$), m = number of training examples
    - If n is large (relative to m), use logistic regression or SVM without a kernel
    - If n is small and m is intermediate, use SVM with Gaussian Kernel
    - If n is small and m is large, create/add more features, then use logistic regression or SVM without a kernel