## Content

- **Visual representation of SVM with different hyper-parameters**

- **Radial Basis Kernel** 

- **PCA vs SVM**

- **Support Vector Regression**

### Visual representation of SVM with different hyper-parameters



We can import SVM using sklearn.svm.SVC
- Default C is 1.0
- Default kernel is RBF
- We can provide other kernels too, such as linear, polynomial (default degree 3), or custom/precomputed kernels

Note: 
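- Sklearn uses $ \gamma $ as the kernel coefficient for 'rbf'
- $ \gamma $ is inversely related to $ \sigma $: sklearn's RBF kernel is $ \exp(-\gamma \|x - x'\|^2) $, so for a Gaussian kernel written as $ \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) $ we get $ \gamma = \frac{1}{2\sigma^2} $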
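The relation between $ \gamma $ and $ \sigma $ can be checked numerically. The sketch below assumes the Gaussian kernel is written with $ 2\sigma^2 $ in the denominator; the points and the value of `sigma` are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Illustrative values (assumptions, not taken from the lecture).
sigma = 0.5
gamma = 1.0 / (2.0 * sigma ** 2)

x1 = np.array([[1.0, 2.0]])
x2 = np.array([[2.0, 0.0]])

# Gaussian kernel written with sigma.
sq_dist = np.sum((x1 - x2) ** 2)
k_sigma = np.exp(-sq_dist / (2.0 * sigma ** 2))

# sklearn's RBF kernel written with gamma: exp(-gamma * ||x1 - x2||^2).
k_gamma = rbf_kernel(x1, x2, gamma=gamma)[0, 0]

print(k_sigma, k_gamma)  # the two values should match
```

A larger `gamma` therefore corresponds to a smaller `sigma`, i.e. a narrower kernel.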

Other parameters are also available, such as coef0 (the independent term of the polynomial and sigmoid kernels, b in our equation), class_weight, and cache_size (the kernel cache size, which can speed up training). You can study them in detail [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

This sklearn implementation is based on libsvm.
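As a minimal sketch of the points above, the snippet below builds `SVC` with its defaults and with two other built-in kernels. The `make_moons` toy dataset and the chosen parameter values are assumptions for illustration, not the data used in the figures.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Toy two-class data (an assumption for illustration).
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Default constructor: C=1.0, kernel="rbf", degree=3 (degree only matters for "poly").
clf = SVC()
clf.fit(X, y)
print(clf.C, clf.kernel, clf.degree, clf.gamma)

# Other built-in kernels are selected through the `kernel` parameter.
linear_clf = SVC(kernel="linear", C=1.0).fit(X, y)
poly_clf = SVC(kernel="poly", degree=3, coef0=1.0).fit(X, y)
print(linear_clf.score(X, y), poly_clf.score(X, y))
```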


<img src='https://drive.google.com/uc?id=1VdBKs4Gq18emZKY63wvGWiJDnj_dBUj-'/> 



Another interesting visualization you can see is [here](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py).

If we look at the input, we have two classes, "red" and "blue".

 - We can observe the nearest neighbor classification of the same input under the Nearest Neighbors column.
 - If we observe Linear SVM, we see a linear classifier; the shades of color represent the probability that a point belongs to that class.


<img src='https://drive.google.com/uc?id=1KU6es-WY7aoqUFAjeRDD68TmXErSjx64'/>

#### Can we observe anything by looking at the SVM-RBF and KNN diagrams ?
- We can observe that SVM-RBF is very similar to KNN in regions that contain data points,

- but they differ in regions where there are no data points.




 <img width=80% src='https://drive.google.com/uc?id=1PlesIJpysszwBatoG63LFUD98T6k36Ri'/>

- In the region of no data points, for any test point $ x_q $ SVM-RBF will still predict but with a low probability.<br>
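A minimal sketch of this comparison, assuming a toy `make_moons` dataset and illustrative values `gamma=1.0` and `n_neighbors=5` (none of these come from the figures above). Far from all training points every RBF kernel value goes to zero, so the SVM decision score reduces to the intercept alone, while KNN still votes with its nearest neighbors no matter how far away they are.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy data and hyperparameters are assumptions for illustration.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Inside the data region the two models mostly agree.
agreement = np.mean(svm.predict(X) == knn.predict(X))
print("agreement on training points:", agreement)

# A query point far away from every training point.
far = np.array([[50.0, 50.0]])
far_score = svm.decision_function(far)[0]
print("SVM score far away:", far_score, "intercept:", svm.intercept_[0])
print("KNN prediction far away:", knn.predict(far)[0])
```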

Note: In the case of decision trees and random forests, you will see a bunch of axis-parallel hyperplanes.



  <img width=80% src='https://drive.google.com/uc?id=1ovNT5eFqpa_vY5BJux6RxgBm1h5SAjgL'/> 


#### ***Question***: Can we incorporate any domain specific kernel code in sklearn ?
---
- Yes, you can create your own kernel and pass it to sklearn through the "kernel" parameter, which accepts a callable
- We can provide either a callable function or a precomputed similarity matrix (with kernel='precomputed')

Refer to [this](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#:~:text=squared%20l2%20penalty.-,kernel,-%7B%E2%80%98linear%E2%80%99%2C%20%E2%80%98poly%E2%80%99%2C%20%E2%80%98rbf) for more details.
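Both options can be sketched as follows. The function `my_kernel` is a hypothetical stand-in for a domain-specific kernel (here it is just a dot product), and the synthetic dataset is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def my_kernel(A, B):
    """Hypothetical domain kernel: returns the (len(A), len(B)) similarity matrix."""
    return A @ B.T


# Option 1: pass the callable directly.
callable_clf = SVC(kernel=my_kernel).fit(X_train, y_train)
pred_callable = callable_clf.predict(X_test)

# Option 2: precompute the similarity (Gram) matrices ourselves.
gram_train = my_kernel(X_train, X_train)  # shape (n_train, n_train)
gram_test = my_kernel(X_test, X_train)    # shape (n_test, n_train)
precomputed_clf = SVC(kernel="precomputed").fit(gram_train, y_train)
pred_precomputed = precomputed_clf.predict(gram_test)

print("same predictions:", np.array_equal(pred_callable, pred_precomputed))
```

Note that with `kernel="precomputed"` the test-time matrix must hold similarities between test points (rows) and training points (columns).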



<img src='https://drive.google.com/uc?id=1qxrM9_dp0uOs-CqV46VrwVP2zhKeuLKa'/>

Let's visualize the effect of hyperparameters like <b>C and gamma</b>.

If we look at the data, we have two classes, "red" and "blue".

#### How will the decision surface change if $ \gamma $ changes with constant C ?

With fixed C, if we increase gamma ( $ \sigma $ decreases ), the model overfits.
- If we move horizontally in this image, we can observe the effect of gamma ( &gamma; ) or &sigma;.
- The model overfits with increasing gamma ( &gamma; ) or decreasing &sigma;.
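A small experiment can make this visible without a plot. The sketch below assumes a noisy `make_moons` dataset and an arbitrary grid of `gamma` values; it compares train and test accuracy at fixed C. A widening gap between the two as gamma grows is the usual sign of overfitting.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data and the gamma grid are assumptions for illustration.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

gamma_results = {}
for gamma in [0.01, 1.0, 100.0, 10000.0]:
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)
    test_acc = clf.score(X_test, y_test)
    gamma_results[gamma] = (train_acc, test_acc)
    print(f"gamma={gamma:>8}: train={train_acc:.3f}  test={test_acc:.3f}")
```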





<img src='https://drive.google.com/uc?id=1cFlJkTuGcCbJXXwKqmx6Qu9JDf7JTZC6'/>

#### What can we observe if we fix $\gamma $ and increase the value of C ?
 - Initially, at a low value of C, we almost have a linear classifier
 - With increasing C, the model overfits




 <img src='https://drive.google.com/uc?id=1YQo_yNR1EscGlStvsYyXRSf7Mzn7gg5y'/>

- If we move vertically, we can see the effect of increasing C: the model overfits and produces more complex non-linear boundaries.
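The same kind of check can be done for C at fixed gamma. This is a sketch under assumptions (toy `make_moons` data, `gamma=1.0`, an arbitrary grid of C values); it reports training accuracy and the number of support vectors, which typically shrinks as C grows because fewer margin violations are tolerated.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Toy data and the C grid are assumptions for illustration.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

c_results = {}
for C in [0.01, 1.0, 100.0, 1000.0]:
    clf = SVC(kernel="rbf", C=C, gamma=1.0).fit(X, y)
    train_acc = clf.score(X, y)
    n_sv = int(clf.n_support_.sum())
    c_results[C] = (train_acc, n_sv)
    print(f"C={C:>7}: train={train_acc:.3f}  support vectors={n_sv}")
```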

Note: We can observe that the extent of overfitting is greater when increasing $ \gamma $ than when increasing C.


  <img width=80% src='https://drive.google.com/uc?id=17XPQYsTNnJbLyLcO7lETWC5HT8Hur4Y8'/> 



### Radial Basis Kernel (RBF) 

#### Question: What do radial and basis mean in radial basis function ?

Radial means 
- Its value depends only on the distance from the origin, not the direction
- Given an origin, all points that are equidistant from it have the same kernel value

Basis means,
 - it forms the basis for some function space of interest,

Note: Recall basis definition : a set B of vectors in a vector space V is called a basis if every element of V may be written in a unique way as a finite linear combination of elements of B
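The "radial" property is easy to verify numerically. In this sketch one argument of the kernel is treated as the origin (called `center` here); the helper `rbf`, the center, the radius, and `gamma` are all made up for illustration.

```python
import numpy as np


def rbf(x, center, gamma=0.5):
    """RBF value exp(-gamma * ||x - center||^2); depends only on the distance."""
    return np.exp(-gamma * np.sum((x - center) ** 2))


center = np.array([1.0, -2.0])
radius = 1.5

# Eight points on a circle around the center: same distance, different directions.
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
points = center + radius * np.column_stack([np.cos(angles), np.sin(angles)])

values = np.array([rbf(p, center) for p in points])
print(values)  # all entries should be equal
```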

 <img src='https://drive.google.com/uc?id=1UztuxCWQ-wLopniAJMJnUdTmz44AEkgd'/> 


### PCA vs SVM

#### ***Question***:  In PCA we decrease the dimension, but here in SVM we are increasing it. Isn't it contradictory?
---
- In PCA, we project data from d dimensions to d' dimensions such that d' < d, while preserving as much variance as possible
- But in SVM, using the kernel trick we go from d to d' such that d' > d. 
- In higher dimensions it is easier to find a separating hyperplane.

#### Aren't these two contradictory ?

- No, because the purposes of the two methods are different. 

Question: What is our objective in PCA and SVM ?

- In PCA, our objective is visualization or understanding the data, because data is easy to visualize in low dimensions; we cannot easily interpret data in more than 2 or 3 dimensions.

- On the other hand, our task in SVM is classification, i.e. to find the best separating hyperplane, which is much easier in higher dimensions.

- Similarly, in logistic regression we introduced polynomial features so that we could find the separator
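The two directions can be sketched side by side. The example below uses assumed toy data: random 10-dimensional points for PCA, and `make_circles` with explicit degree-2 polynomial features for the "go to higher dimensions" case (an explicit feature map rather than the kernel trick itself).

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# PCA: reduce the dimension (d = 10 -> d' = 2), e.g. for visualization.
rng = np.random.default_rng(0)
X_high = rng.normal(size=(100, 10))
X_low = PCA(n_components=2).fit_transform(X_high)
print("PCA:", X_high.shape, "->", X_low.shape)

# Classification: increase the dimension (d = 2 -> d' = 5) to make classes separable.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print("Polynomial features:", X.shape, "->", X_poly.shape)

# A linear model struggles on the raw circles but not in the expanded space.
raw_acc = LogisticRegression().fit(X, y).score(X, y)
poly_acc = LogisticRegression().fit(X_poly, y).score(X_poly, y)
print(f"accuracy raw={raw_acc:.3f}  polynomial={poly_acc:.3f}")
```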



<img src='https://drive.google.com/uc?id=1PaesPo6FfYR-Zgy25Bm0efz2E-uzDeAw'/>


### Support Vector Regression (SVR)

#### Can we do regression with SVM ?

Ans: Yes.

Let's understand it with an example:

- Imagine we have two features f1, f2 and some data points 

<br>

The main intuition behind SVR is to find a best-fitting line,  

- such that the maximum error ( $ \epsilon $ ) between $ y_i $ and $ \hat{y}_i = w^Tx_i + b $ 
 - is as small as possible 

<br>

This gives the following objective:
- $ \min_{w,b} \ \frac{1}{2}\|w\|^2 + C \cdot \epsilon $

such that
-  $ y_i - \hat{y}_i \leq \epsilon $
- and also $ \hat{y}_i - y_i \leq \epsilon $ 
 - with $ \epsilon \geq 0 $


**Note:** There is also a kernelized version of SVR
- SVR is not very popular, so we tend not to use it much 
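For reference, here is a minimal sketch of linear and kernelized SVR in sklearn on an assumed noisy sine curve. Note that sklearn's `SVR` solves the epsilon-insensitive formulation, in which `epsilon` is a fixed hyperparameter and slack variables absorb larger errors, so it is related to, but not identical to, the simplified objective above.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine data (an assumption for illustration).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Linear SVR fits a straight line; the RBF kernel gives kernelized SVR.
linear_svr = SVR(kernel="linear", C=10.0, epsilon=0.1).fit(X, y)
rbf_svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)

linear_r2 = linear_svr.score(X, y)  # R^2 on the training data
rbf_r2 = rbf_svr.score(X, y)
print(f"R^2 linear={linear_r2:.3f}  rbf={rbf_r2:.3f}")
```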




<img src='https://drive.google.com/uc?id=1T781Qh03bWv8ttvDTPK4cn5RlLllgeYU'/> 

### Interpretability of SVM

#### What are the hyperparameters in SVM ?
Ans: C and σ. As we have seen 
in the primal form of SVM:
- $ \min_{w,b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \zeta_i $

- C becomes our hyperparameter, 
 - which makes SVM overfit as it increases

And we have also seen σ being a hyperparameter when using the RBF kernel 

<br>


#### Assuming that we found the best $ \sigma $ and best C, can we interpret SVM ?

Ans: We can only identify which data points are support vectors, through their non-zero $ \alpha_i $

- There is no native interpretation of SVM 
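In sklearn, the support vectors and their coefficients can be read off a fitted model. This sketch assumes a toy `make_moons` dataset and illustrative hyperparameters; `dual_coef_` stores $ y_i \alpha_i $ for the support vectors only, so every stored entry is non-zero and bounded by C in absolute value.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Toy data and hyperparameters are assumptions for illustration.
C = 1.0
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=C, gamma=1.0).fit(X, y)

sv_indices = clf.support_           # indices of the support vectors in X
signed_alphas = clf.dual_coef_[0]   # y_i * alpha_i, one entry per support vector
alphas = np.abs(signed_alphas)

print("support vectors:", len(sv_indices), "out of", len(X))
print("per class:", clf.n_support_)
print("alpha range:", alphas.min(), alphas.max())
```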

<br>

But if we think of RBF-SVM as similar to KNN, 

- for any query point $x_q$, we can think of it as finding nearest neighbors.

<br>

**Note:** Treating RBF-SVM as similar to KNN is a hacky way
- of defining the interpretability of SVM 
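One way to make this KNN-style reading concrete is to split the decision score for a query point into per-support-vector contributions $ y_i \alpha_i K(x_i, x_q) $ and look at the largest ones. The sketch below is an illustration under assumptions (toy data, `gamma=1.0`, an arbitrary query point `x_q`), not a standard interpretability method.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Toy data, gamma, and the query point are assumptions for illustration.
gamma = 1.0
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

x_q = np.array([[0.5, 0.25]])

# Similarity of the query point to each support vector.
similarities = rbf_kernel(clf.support_vectors_, x_q, gamma=gamma).ravel()

# Contribution of each support vector to the decision score.
contributions = clf.dual_coef_[0] * similarities
manual_score = contributions.sum() + clf.intercept_[0]
print("manual:", manual_score, "sklearn:", clf.decision_function(x_q)[0])

# The most influential support vectors tend to be the nearby ones.
top = np.argsort(-np.abs(contributions))[:5]
for i in top:
    print(clf.support_vectors_[i], round(float(contributions[i]), 4))
```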



<img src='https://drive.google.com/uc?id=18Uefkx5ai-BrN-lb2KJlvxeS16bcUmhz'/> 