---

### Problem Statement

Imagine yourself as a Data Scientist at Google. </br>
You've been asked to come up with a model to classify emails as either **Spam** or **Ham** (non-spam).





---

### SVMs
- Popular from the late 1990s through the 2000s
- Kernel SVM
- Theoretically, they are among the best.
  - (In practice, better-performing algorithms exist)
- Less frequently used nowadays
- Challenging maths

---

### Geometric intuition behind SVM

The main idea behind SVM is to
- find the line/plane that can
- best separate the given classes.

Suppose,

We have some datapoints that belong to two different classes.
- +ve class
- -ve class

\
#### How do you divide the +ve samples from the -ve ones?

Using a **line/hyperplane**.

```
Hyperplanes can be defined as decision boundaries that help classify the datapoints.
```

But there can be multiple lines/hyperplanes.

\
#### So which line/hyperplane should we choose?

Imagine, </br>
we have two different **hyperplanes** $\pi_1$ & $\pi_2$

\
<img src='https://drive.google.com/uc?id=1lFAoAj0_EimHorfMbhySFQLZlErMFq1W' height='500' width='400'>

\
#### Which hyperplane is better and why?

Hyperplane $\pi_2$

\
Intuitively,
- If we look at the hyperplane $\pi_1$
- and we draw two hyperplanes parallel to $\pi_1$ from the closest +ve & -ve datapoint to $\pi_1$
- we can see the gap between these two parallel drawn hyperplanes.

Let's call this gap **margin_1** with respect to hyperplane $\pi_1$.

\
Similarly,
- we can get **margin_2** wrt hyperplane $\pi_2$.

Now we can see that,
- margin_2 is much larger than margin_1.

Hence,
- we pick the hyperplane $\pi$ that results in the largest margin.
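
The comparison of margins can be checked numerically. The following is a minimal sketch, assuming four made-up 2-D points and two hand-picked hyperplanes (none of these values come from the figure); the helper `margin` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

# Hypothetical toy data (not from the notebook): two +ve and two -ve points.
X_pos = np.array([[2.0, 2.0], [3.0, 3.0]])
X_neg = np.array([[0.0, 0.0], [-1.0, 0.0]])

def margin(w, b, X_pos, X_neg):
    """Gap between the parallel hyperplanes through the closest +ve and -ve points."""
    w = np.asarray(w, dtype=float)
    norm = np.linalg.norm(w)
    d_pos = (X_pos @ w + b) / norm   # signed perpendicular distances of +ve points
    d_neg = (X_neg @ w + b) / norm   # signed perpendicular distances of -ve points
    return d_pos.min() - d_neg.max()

margin_1 = margin([1.0, 0.0], -1.0, X_pos, X_neg)   # pi_1: x = 1
margin_2 = margin([1.0, 1.0], -2.0, X_pos, X_neg)   # pi_2: x + y = 2
print(margin_1, margin_2)
```

With these assumed values the second hyperplane gives the wider gap, so it is the one a margin-maximizing classifier would prefer.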

---

#### Why do we need the margin to be large?

The hyperplane is drawn with a view of putting in
- the **widest street** that separates the +ve & -ve samples.

\
Therefore,
- the larger the margin,
- the better the separation.

\
Such classifiers where,
- we want the margin to be as large as possible
- are called **margin-maximizing classifiers**.

\
```
NOTE:
Distances from datapoints are measured perpendicular to any of the hyperplanes.
```

---

Assume,

We have a bunch of +ve & -ve datapoints.

- $\pi$ is a margin-maximizing hyperplane

- $\pi^+$ is the positive hyperplane parallel to $\pi$ and
    - touching the +ve points closest to $\pi$
- $\pi^-$ is the negative hyperplane parallel to $\pi$ and
    - touching the -ve points closest to $\pi$

- Margin is the distance between $\pi^+$ and $\pi^-$

#### How do we define these hyperplanes?

Recall from regression models, </br>
we define a hyperplane as:
- $\pi : w^Tx+b = 0$

\
Let's assume, </br>
the parallel hyperplanes are defined as:

<img src='https://drive.google.com/uc?id=1xkJBWZ_Hd6nVtr-zgfiF4of15OcHpo6i'
height='450' width='450'>

\
#### What will be the length of margin?

Recall from linear algebra,

\
If measured from the origin,
- the distance of hyperplane $\pi^+$ can be defined as
  - $ d(0, \pi^+) = \frac{b-k}{||w||}$

- the distance of hyperplane $\pi^-$ can be defined as
  - $ d(0, \pi^-) = \frac{b+k}{||w||}$

\
Hence,
- the distance between the two hyperplanes will be
- $\frac{b+k}{||w||} - \frac{b-k}{||w||}$

\
$\Rightarrow$ Margin i.e., </br> $d( \pi^+,\pi^-) = \frac{2k}{||w||}$
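
As a quick numeric check of $\frac{2k}{||w||}$, here is a small sketch with assumed values for $w$, $b$ and $k$ (chosen only for easy arithmetic): it picks a point on $\pi^+$ and measures its perpendicular distance to $\pi^-$.

```python
import numpy as np

# Assumed illustrative values; any non-zero w, any b and any k > 0 would do.
w = np.array([3.0, 4.0])
b = 2.0
k = 1.0

norm_w = np.linalg.norm(w)            # ||w|| = 5

# A point on pi^+ (w^T x + b = k), obtained by moving along w from the origin.
x_plus = (k - b) * w / norm_w**2

# Perpendicular distance from that point to pi^- (w^T x + b = -k).
dist = abs(w @ x_plus + b + k) / norm_w

print(dist, 2 * k / norm_w)
```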

\
#### What will be the parameters for the margin?

We will maximize this margin with respect to -
- weight ($w$) and
- constant ($b$)

\
<img src='https://drive.google.com/uc?id=1XKkBcW0hoTyRanCVMfTwmpoLpWKuHaQt' height='300' width='500'>

\
Since, </br>
- changing the value of $k$ only rescales $w$ and $b$ (and with them the margin),
- it does not change the position of the hyperplane $\pi$.

\
Hence, </br>
- for mathematical simplicity,
- we take $k = 1$, so the offsets of $\pi^+$ and $\pi^-$ are +1 and -1.



---

Therefore, $k=1$
- $\pi^+:\Rightarrow w^Tx+b-1 = 0$
- $\pi^-:\Rightarrow w^Tx+b+1 = 0$

\
And, we get our margin as:
- $d( \pi^+,\pi^-) = \frac{2}{||w||}$

\
<img src='https://drive.google.com/uc?id=1l1mOyJbXlB5vHnQISpy_EmZqJvTMuH7E' height='450' width='450'>

\
Now, our goal is
- to maximize this margin $\frac{2}{||w||}$
- with respect to:
  - weight ($w$) and
  - constant ($b$)
-  to obtain the best possible separation between the two classes.



---

---

### <b> SVM Demo </b>

https://jgreitemann.github.io/svm-demo

<img src='https://drive.google.com/uc?id=149Xs-dDaEhXH8m90fiSlT0SPlUchUo2T' height='400' width='650'>

---

### Hard Margin SVM

In a binary classification,
- we have some $+ve$ and $-ve$ datapoints
- with hyperplane $\pi$  which separates them
- and $\pi^+$ and $\pi^-$ as our parallel hyperplanes.

\
Also,
- we have $n$ samples,
- where each sample contains some features $x$
- and a class label $y$ which can take the value +1 or -1

\
Now, as we discussed
- our aim is to maximize the margin.

\
#### So how can we perform optimization here?

Let's look at an example to understand this.

\
What will be the value of $w^Tx+b$ for a +ve datapoint which lies on the hyperplane $\pi^+$?
- 1 ,
- since $\pi^+: w^Tx+b=1$

```
Case 1:
```
What will be the value of $w^Tx+b$ for the +ve datapoints which lie beyond the hyperplane $\pi^+$?
- greater than 1 ,
- hence $\pi^+: w^Tx+b > 1$

\
Similarly,

```
Case 2:
```
What will be the value of $w^Tx+b$ for the -ve datapoints which lie beyond the hyperplane $\pi^-$?
- less than -1 ,
- hence $\pi^-: w^Tx+b < -1$




---

Now, </br>
for mathematical convenience </br>
we introduce a term $y_i$ such that:

- $y_i$ = +1 for +ve samples
- $y_i$ = -1 for -ve samples

\
Therefore,
- for +ve samples: $w^Tx_i+b \geq 1$, so $y_i(w^Tx_i+b) \geq 1$
- for -ve samples: $w^Tx_i+b \leq -1$, so $y_i(w^Tx_i+b) \geq 1$

\
We can club both the cases and say,
- we maximize the margin such that
- for all n samples,
- our $y_i(w^Tx_i + b) \geq 1$

\
<img src='https://drive.google.com/uc?id=1TdMTmU8OeS-jZss9Io9VD3LKKTcfQG9W' height='500' width='400'>

\
#### But how does this work?

- For +ve samples,
  - $y_i$ = +1
  - $(w^Tx_i + b)$ will be +ve since it will be $\geq$ 1
  - hence, (+ve) multiplied by (+ve) makes positive.
- For -ve samples,
  - $y_i$ = -1
  - $(w^Tx_i + b)$ will be -ve since it will be $\leq$ -1
  - hence, (-ve) multiplied by (-ve) makes positive.

Example -

<img src='https://drive.google.com/uc?id=1VgUbWlosavzPc9ftYrdt0JntpwZP6c1_' height='400' width='750'>
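
The sign argument can also be verified in code. This is a minimal sketch, assuming a made-up separable dataset and a hand-picked $w$ and $b$ (not a trained model); it simply evaluates $y_i(w^Tx_i+b)$ for every sample and checks that all values are at least 1.

```python
import numpy as np

# Hypothetical separable toy data and a hand-picked hyperplane (assumptions).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
b = -2.0

f = X @ w + b                     # w^T x_i + b for every sample
z = y * f                         # y_i (w^T x_i + b)
satisfied = bool(np.all(z >= 1))  # hard-margin constraint for all samples

print(f, z, satisfied)
```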


---

#### But why the $(w^Tx_i + b)y_i \geq 1$ constraint?

Our goal is to
- separate the two classes completely
- with as large a margin as possible.

\
With this constraint:
- all +ve datapoints should lie beyond $\pi^+$
- all -ve datapoints should lie beyond $\pi^-$

\
Thus,
- it expects the hyperplane to make zero errors, since
- it does not allow any datapoint
- on the wrong side of the parallel hyperplanes.

To summarise, </br>

If we strictly impose that:
- all instances must be off the street (margin)
- and on the correct side,
- then this is called **Hard Margin classification**.


---

#### When would Linear SVM with Hard Margin fail?

<img src='https://drive.google.com/uc?id=1Sieivv5mv2kFdKdXdjcc51bLTybnq0iC' height='400' width='500'>



---

### Soft Margin SVM

#### What if the data is not perfectly linearly separable?

Imagine a dataset where,
- some data points are on the wrong side.

This is what we call **almost linearly separable** data.

\
<img src='https://drive.google.com/uc?id=15l_dsx_MUv7PB8YbQPJPWW0RQj2lueI3' height='420' width='475'>

\
#### How to account for these data points?

Imagine,
- A $+ve$ labelled data point $ \ x_1 $
- at 0.5 unit distance in between $ \pi \ and \ \pi^+ $

\
#### What will be the value of $ y_i(w^Tx_i+b) $ for $x_1$?

- $\ y_1(w^Tx_1+b)$ = 0.5 = 1 - 0.5,
- where 0.5 is error $ \zeta_1 $

\
#### Does the inequality $(w^Tx_i+b)y_i \geq 1 - \zeta_i$ hold true for -ve points too?

**Yes**

Imagine,

- A $-ve$ labelled data point $ \ x_2 $
- at 0.5 unit distance in between $ \pi \ and \ \pi^- $

\
#### What will be the value of $ y_i(w^Tx_i+b) $ for $x_2$?

- $\ y_2(w^Tx_2+b)$ = 0.5 = 1 - 0.5,
- where 0.5 is $ \zeta_2 $

\
<img src='https://drive.google.com/uc?id=19Jy5DJE9zZMqmxos8FWhH-RJdonqfIi9' height='450' width='400'>

\
So,
- $\zeta_i = 0$ for all correctly placed points (on or beyond their own margin hyperplane).
- $\zeta_i > 0$ for all incorrectly placed points (inside the margin or on the wrong side).
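
To make the slack values concrete, here is a small sketch under stated assumptions: the hyperplane is chosen with $||w|| = 1$ so that $w^Tx+b$ equals the perpendicular distance, and the four points are invented for illustration. Each $\zeta_i$ is taken as the smallest non-negative value satisfying $y_i(w^Tx_i+b) \geq 1 - \zeta_i$.

```python
import numpy as np

# Assumed hyperplane with ||w|| = 1, so w^T x + b is also the perpendicular distance.
w = np.array([1.0, 0.0])
b = 0.0

X = np.array([[0.5, 0.0],    # x_1: +ve, 0.5 units from pi, inside the margin
              [-0.5, 0.0],   # x_2: -ve, 0.5 units from pi, inside the margin
              [2.0, 0.0],    # +ve, beyond pi^+
              [-1.0, 0.0]])  # +ve, but lying on pi^- (wrong side)
y = np.array([1, -1, 1, 1])

z = y * (X @ w + b)
zeta = np.maximum(0.0, 1.0 - z)   # smallest slack with z_i >= 1 - zeta_i and zeta_i >= 0

print(zeta)
```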

\
Now, our optimization problem becomes:

- $ max \ \frac{2}{||w||} $ i.e., the margin
- along with minimizing the errors $\zeta_i$

because we're trying to get the best possible classification.

\
#### Can we think of another way to write this?

Taking the reciprocal, maximizing $\frac{2}{||w||}$ is the same as minimizing $\frac{||w||}{2}$:

- $ min \ \frac{||w||}{2} $ together with the errors $\zeta_i$



---

#### What do you think our goal here is?

- maximize the margin
- minimize datapoints having $\zeta_i > 0$,
  - minimize the errors $ \zeta_i's $.

\
Now the optimization function changes to:
- $ min_{w,b} \ \frac{||w||}{2} + C \sum_{i=1}^N \zeta_i$

- such that $(w^Tx_i+b)y_i \geq 1 - \zeta_i$

- and $\zeta_i \geq 0$

- for all $i : 1 \rightarrow N$.
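
The objective can be evaluated directly for a given $w$ and $b$. The sketch below is an illustration only: `soft_margin_objective` is a hypothetical helper, the data and the hyperplane are made up, and each slack is set to its smallest feasible value $max(0, 1 - y_i(w^Tx_i+b))$.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """||w||/2 + C * sum of slacks, each slack set to its smallest feasible value."""
    zeta = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return np.linalg.norm(w) / 2 + C * zeta.sum()

# Hypothetical data and hyperplane (assumptions for illustration).
X = np.array([[0.5, 0.0], [-0.5, 0.0], [2.0, 0.0], [-1.0, 0.0]])
y = np.array([1, -1, 1, 1])
w = np.array([1.0, 0.0])

for C in (0.1, 1.0, 10.0):
    print(C, soft_margin_objective(w, 0.0, X, y, C))
```

For a fixed hyperplane, a larger $C$ makes the same slacks weigh more heavily in the total.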

\
This is called **SVM with soft margin**, which we use when we have **almost linearly separable** data.

---

#### What's the use of $C$ here?

$C$ is a hyperparameter.

\
It controls whether we have to focus on
- maximizing the margin or
- minimizing the errors $\zeta_i$

\
#### What if the value of $C$ approaches zero?

As $C \downarrow$
- Model becomes more tolerant towards misclassifications (softer margin).
- more importance to maximizing the margin.
- So, the model may underfit.

\
#### What will happen if $C$ is very large?

As $C \uparrow$
- Model becomes less tolerant towards misclassifications (approaching hard margin).
- more importance to minimizing the errors.
- So, the model may overfit.

\
<img src='https://drive.google.com/uc?id=1UeCaQ7rITYFVlUYnb6ytnCOB5B6iLT2m' height='400' width='800'>

\
Therefore, we need to find a balance here.
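
The effect of $C$ on the margin can be observed with scikit-learn's `SVC`. This is a sketch on synthetic overlapping blobs (an assumption; it is not the Spam/Ham data), reading the margin width as $\frac{2}{||w||}$ from the fitted `coef_` of a linear kernel.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, overlapping 2-D blobs (an assumption; not the Spam/Ham data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=+1.0, scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

margins = {}
for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="linear", C=C).fit(X, y)
    w = model.coef_.ravel()                    # learned weight vector (linear kernel only)
    margins[C] = 2.0 / np.linalg.norm(w)       # margin width 2 / ||w||
    print(f"C={C}: margin={margins[C]:.3f}, support vectors={model.n_support_.sum()}")
```

A small $C$ should give the widest margin here, and a large $C$ the narrowest.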



---

### Algebraic intuition behind SVM

We saw that,

Soft Margin SVMs are defined as:
- $min_{w,b} \ \frac{||w||}{2} + C \frac{1}{N} \sum_{i=1}^N \zeta_i$
  - (the constant factor $\frac{1}{N}$ only rescales $C$, so this is the same problem as before)

\
The term $\frac{1}{N}\sum_{i=1}^N \zeta_i$
- is the error which we try to minimize.

We refer to this as **Hinge Loss**.


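Algebraically
- the term $\frac{||w||}{2}$ acts as the regularizer: in practice the equivalent squared form $\frac{||w||^2}{2}$ is minimized, which is $\frac{1}{2}$ of the L2-Regularization term.

Also, </br>
$C$ plays the role of the inverse of the regularization hyperparameter $\lambda$ (i.e., $C \sim \frac{1}{\lambda}$): a larger $C$ means weaker regularization.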

\
Therefore, </br>
we can interpret our Soft-Margin SVM as   
 - $C$ HingeLoss $+\frac{1}{2}$ L2Reg
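
To illustrate the "hinge loss plus L2 regularization" view, here is a rough sketch that minimizes $\frac{1}{2}||w||^2 + C\sum_i max(0, 1 - y_i(w^Tx_i+b))$ with plain subgradient descent. Everything in it is an assumption made for illustration: the toy data, the learning rate, the number of epochs, and the use of the squared norm. Library solvers such as the one behind `SVC` work differently.

```python
import numpy as np

# Hypothetical, linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

C, lr, epochs = 1.0, 0.01, 500          # assumed hyperparameters
w, b = np.zeros(X.shape[1]), 0.0

for _ in range(epochs):
    z = y * (X @ w + b)
    active = z < 1                       # samples with non-zero hinge loss
    # Subgradient of 0.5 * ||w||^2 + C * sum(max(0, 1 - z_i))
    grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
    grad_b = -C * y[active].sum()
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = np.mean(np.sign(X @ w + b) == y)
print(w, b, accuracy)
```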


---

### Intuition of Hinge Loss

Let's say
- we define the output of our hyperplane $w^Tx_i+b$ as $f(x_i)$.
- and assume our x-axis to be:
  - $z_i = y_if(x_i) = y_i(w^Tx_i+b)$

\
<img src='https://drive.google.com/uc?id=1SAEr8fu7okt9FKVsA_DLFM-BN9u7INOd' height='400' width='600'>

\
Consider 3 datapoints $x_1, x_2, x_3$
- that belong to the $+ve$ class.

\
<img src='https://drive.google.com/uc?id=1CsHd9n-8lTEvzyAn3lP8SrVx3EVgBnxE' height='450' width='500'>

\
```
Case 1:
```
Point $x_1$ lies on/beyond $\pi^+$

Then $z_i \geq 1$
- or we can say $y_i(w^Tx+b) \geq 1$

So, $\zeta_i = 0$
- i.e., $d(x_1, \pi^+) = 0$

\
```
Case 2:
```
Point $x_2$ lies on $\pi$

Then $z_i = 0$
- or we can say $y_i(w^Tx+b) = 0$

So, $\zeta_i = 1$
- i.e., $d(x_2, \pi^+) = 1$

\
```
Case 3:
```
Point $x_3$ lies on $\pi^-$

Then $z_i = -1$
- or we can say $y_i(w^Tx+b) = -1$

So, $\zeta_i = 2$
- i.e., $d(x_3, \pi^+) = 2$

---

#### What can we conclude from the graph above?

1. Error cannot be negative.
  - $\zeta_i \geq 0$
2. According to the constraint,
  - $y_i (w^Tx_i+b) \geq 1 - \zeta_i$

$~~~~~~~~~~~~ \Rightarrow z_i \geq 1 - \zeta_i$

$~~~~~~~~~~~~ \Rightarrow \zeta_i \geq 1- z_i$

\
Since we minimize $\zeta_i$, it takes the smallest value that satisfies both inequalities, so we can combine them into: </br>

**Hinge Loss** </br>
$ ~~~~~ \Rightarrow \zeta_i = max ~ (0, 1-z_i)$

\
**Note:**

As $z_i$ increases,
- $\zeta_i$ will reduce
- but it'll not go below zero.

As $z_i$ decreases,
- $\zeta_i$ will increase.
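
The three cases can be reproduced with a one-line hinge loss. This is a minimal sketch; `hinge_loss` is a helper name introduced here, and the $z$ values are chosen to match a point beyond $\pi^+$, on $\pi^+$, on $\pi$ and on $\pi^-$.

```python
import numpy as np

def hinge_loss(z):
    """Hinge loss max(0, 1 - z) for z = y * (w^T x + b)."""
    return np.maximum(0.0, 1.0 - np.asarray(z, dtype=float))

# z-values for: beyond pi^+, on pi^+, on pi, on pi^-
z = np.array([1.5, 1.0, 0.0, -1.0])
losses = hinge_loss(z)
print(losses)
```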

---

### Comparison with Log Loss

<img src='https://drive.google.com/uc?id=1zpPvhQLQtp-I4FU4YwOkLZQJIG3cqk3H' height='400' width='550'>

\
#### What happens if $y_i \in \{-1, +1\}$ for LogLoss?

- If $y_i \in \{-1, +1\}$

- Then, </br>
LogLoss
= $ \sum_{i=1}^{n} \log( 1 + e^{-z_i} )$

$~~~~~~~~~~~~~~~~~~~~~$ = $\sum_{i=1}^{n} \log( 1 + e^{-y_i (w^Tx_i + b)} )$

\
We will not be deriving how we get this equation.
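
For a side-by-side feel of the two losses, the sketch below evaluates the per-sample hinge loss $max(0, 1 - z_i)$ and the per-sample log loss $\log(1 + e^{-z_i})$ on a few assumed values of $z_i$.

```python
import numpy as np

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])    # assumed values of z_i = y_i (w^T x_i + b)

hinge = np.maximum(0.0, 1.0 - z)
log_loss = np.log1p(np.exp(-z))               # log(1 + e^{-z})

for zi, h, l in zip(z, hinge, log_loss):
    print(f"z={zi:+.1f}  hinge={h:.3f}  logloss={l:.3f}")
```

Hinge loss is exactly zero once $z_i \geq 1$, while log loss keeps decreasing but never reaches zero.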

---

### SVM Imbalance

#### Are SVMs affected by class imbalance?

\
Only a few datapoints, those lying on or inside the margin (including all points with hinge loss $\zeta_i > 0$), determine the hyperplane.
- These points are called **Support Vectors**.

\
Hence, SVM will only be affected
- if there is an imbalance in the number of support vectors from each class.

\
#### Should we use SVM as the baseline model if we have imbalanced data?

Not necessarily because
- the balance in the number of support vectors from each class can't be guaranteed.
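
One way to inspect this in practice is the `n_support_` attribute of a fitted `SVC`, which reports the number of support vectors per class. The sketch below uses synthetic imbalanced blobs (an assumption, not the Spam/Ham data), so the printed counts only illustrate how to read the attribute.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic imbalanced data (assumption): 200 points of class 0 vs 20 of class 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(200, 2)),
               rng.normal(loc=+1.0, scale=1.0, size=(20, 2))])
y = np.array([0] * 200 + [1] * 20)

model = SVC(kernel="linear", C=1.0).fit(X, y)

# n_support_ holds the number of support vectors per class, ordered like classes_.
for label, count in zip(model.classes_, model.n_support_):
    print(f"class {label}: {count} support vectors out of {(y == label).sum()} samples")
```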

---

### Code implementation of Linear SVM

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn import feature_extraction, model_selection, naive_bayes, metrics, svm
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline

In [15]:
!gdown 1QViUZJ5UIBCgxB_qbOXTLs_2V48w7MWo

df = pd.read_csv('Spam_processed.csv', encoding='latin-1')
df.dropna(inplace = True)

Downloading...
From: https://drive.google.com/uc?id=1QViUZJ5UIBCgxB_qbOXTLs_2V48w7MWo
To: /content/Spam_processed.csv
  0% 0.00/767k [00:00<?, ?B/s]100% 767k/767k [00:00<00:00, 102MB/s]


In [16]:
df

| Unnamed: 0 | type | message | cleaned_message |
|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | go jurong point crazy available bugis n great ... |
| 1 | 0 | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | free entry 2 wkly comp win fa cup final tkts 2... |
| 3 | 0 | U dun say so early hor... U c already then say... | u dun say early hor u c already say |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | nah nt think goes usf lives around though |
| ... | ... | ... | ... |
| 5567 | 1 | This is the 2nd time we have tried 2 contact u... | 2nd time tried 2 contact u u å750 pound prize ... |
| 5568 | 0 | Will Ì_ b going to esplanade fr home? | ì_ b going esplanade fr home |
| 5569 | 0 | Pity, * was in mood for that. So...any other s... | pity mood suggestions |
| 5570 | 0 | The guy did some bitching but I acted like i'd... | guy bitching acted like interested buying some... |


- Performing train-test split
- with [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- and StandardScaler.

In [17]:
from sklearn.model_selection import train_test_split

df_X_train, df_X_test, y_train, y_test = train_test_split(df['cleaned_message'], df['type'],
                                                          test_size=0.25, random_state=47)
print([np.shape(df_X_train), np.shape(df_X_test)])

# CountVectorizer
f = feature_extraction.text.CountVectorizer()
X_train = f.fit_transform(df_X_train)
X_test = f.transform(df_X_test)

# StandardScaler
scaler = StandardScaler(with_mean=False)  # with_mean=False: centering is not supported for sparse matrices
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print([np.shape(X_train), np.shape(X_test)])
print(type(X_train))

[(4173,), (1392,)]
[(4173, 7622), (1392, 7622)]
<class 'scipy.sparse._csr.csr_matrix'>


In [20]:
X_train.shape

(4173, 7622)

In [21]:
X_test.shape

(1392, 7622)

In [18]:
np.unique(y_train,return_counts=True)

(array([0, 1]), array([3613,  560]))

In [19]:
np.unique(y_test,return_counts=True)

(array([0, 1]), array([1205,  187]))
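
The counts above show that spam is the minority class. As a rough sketch, the snippet below computes the spam fraction in the training split and the weights given by the `n_samples / (n_classes * count)` heuristic that scikit-learn documents for `class_weight='balanced'`. These are shown only for comparison; they are not the hand-picked weights used in the grid search.

```python
import numpy as np

# Class counts copied from the training split: 3613 ham (0) and 560 spam (1).
counts = np.array([3613, 560])
n_samples, n_classes = counts.sum(), len(counts)

spam_fraction = counts[1] / n_samples

# "balanced" heuristic: n_samples / (n_classes * count_per_class)
balanced_weights = n_samples / (n_classes * counts)

print(f"spam fraction: {spam_fraction:.3f}")
print(dict(zip([0, 1], balanced_weights.round(3))))
```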

Let's train Linear SVM on the given Spam/Ham data.


In [33]:
# SVC

from sklearn.svm import SVC

from sklearn.model_selection import GridSearchCV

params = {
          'C': [1e-4,  0.001, 0.01, 0.1, 1,10], # which hyperparam value of C do you think will work well?
          'class_weight': [{ 0:0.1, 1:0.6 }, { 0:1.0, 1:1.0 }]
         }

#svc = SVC(class_weight={ 0:0.1, 1:0.5 }, kernel='linear')
svc = SVC(kernel='linear')
clf = GridSearchCV(svc, params, scoring = "f1", cv=3)

clf.fit(X_train, y_train)

In [34]:
res = clf.cv_results_

for i in range(len(res["params"])):
  print(f"Parameters:{res['params'][i]} \n Mean score: {res['mean_test_score'][i]} \n Rank: {res['rank_test_score'][i]}")

Parameters:{'C': 0.0001, 'class_weight': {0: 0.1, 1: 0.6}} 
 Mean score: 0.7061751410191853 
 Rank: 12
Parameters:{'C': 0.0001, 'class_weight': {0: 1.0, 1: 1.0}} 
 Mean score: 0.7507778900994672 
 Rank: 11
Parameters:{'C': 0.001, 'class_weight': {0: 0.1, 1: 0.6}} 
 Mean score: 0.7732665611050361 
 Rank: 1
Parameters:{'C': 0.001, 'class_weight': {0: 1.0, 1: 1.0}} 
 Mean score: 0.7705023107316228 
 Rank: 2
Parameters:{'C': 0.01, 'class_weight': {0: 0.1, 1: 0.6}} 
 Mean score: 0.767533370474547 
 Rank: 3
Parameters:{'C': 0.01, 'class_weight': {0: 1.0, 1: 1.0}} 
 Mean score: 0.7649416969151316 
 Rank: 4
Parameters:{'C': 0.1, 'class_weight': {0: 0.1, 1: 0.6}} 
 Mean score: 0.7649416969151316 
 Rank: 4
Parameters:{'C': 0.1, 'class_weight': {0: 1.0, 1: 1.0}} 
 Mean score: 0.7649416969151316 
 Rank: 4
Parameters:{'C': 1, 'class_weight': {0: 0.1, 1: 0.6}} 
 Mean score: 0.7649416969151316 
 Rank: 4
Parameters:{'C': 1, 'class_weight': {0: 1.0, 1: 1.0}} 
 Mean score: 0.7649416969151316 
 Rank: 4
... (output truncated)

As you can see,
- we get the best performance when $C=0.001$ (with class weights `{0: 0.1, 1: 0.6}`),
- with a mean cross-validated F1 Score of 0.77.

\
Now, let's evaluate this SVM on the test data.

In [35]:
#svc = SVC(C=1e-2,class_weight={ 0:0.1, 1:0.6 }, kernel='rbf')
svc = clf.best_estimator_

svc.fit(X_train, y_train)

y_pred_train = svc.predict(X_train)
y_pred_test = svc.predict(X_test)
print(metrics.f1_score(y_train,y_pred_train))
print(metrics.f1_score(y_test,y_pred_test))

0.9991063449508489
0.8973607038123168


Linear SVM performs quite well
- on the Spam/Ham data
- with a test F1 Score of about 0.90 (the train F1 Score is about 1.00, so there is some overfitting)
- when using class weights.

In [30]:
# SVC

from sklearn.svm import SVC

from sklearn.model_selection import GridSearchCV

params = {
          'C': [1e-4,  0.001, 0.01, 0.1, 1,10], # which hyperparam value of C do you think will work well?
          'class_weight': [{ 0:0.1, 1:0.6 }, { 0:1.0, 1:1.0 }]
         }

#svc = SVC(class_weight={ 0:0.1, 1:0.5 }, kernel='linear')
svc = SVC(kernel='rbf')
clf = GridSearchCV(svc, params, scoring = "f1", cv=3)

clf.fit(X_train, y_train)

In [31]:
res = clf.cv_results_

for i in range(len(res["params"])):
  print(f"Parameters:{res['params'][i]} \n Mean score: {res['mean_test_score'][i]} \n Rank: {res['rank_test_score'][i]}")

Parameters:{'C': 0.0001, 'class_weight': {0: 0.1, 1: 0.6}} 
 Mean score: 0.0 
 Rank: 6
Parameters:{'C': 0.0001, 'class_weight': {0: 1.0, 1: 1.0}} 
 Mean score: 0.0 
 Rank: 6
Parameters:{'C': 0.001, 'class_weight': {0: 0.1, 1: 0.6}} 
 Mean score: 0.0 
 Rank: 6
Parameters:{'C': 0.001, 'class_weight': {0: 1.0, 1: 1.0}} 
 Mean score: 0.0 
 Rank: 6
Parameters:{'C': 0.01, 'class_weight': {0: 0.1, 1: 0.6}} 
 Mean score: 0.0 
 Rank: 6
Parameters:{'C': 0.01, 'class_weight': {0: 1.0, 1: 1.0}} 
 Mean score: 0.0 
 Rank: 6
Parameters:{'C': 0.1, 'class_weight': {0: 0.1, 1: 0.6}} 
 Mean score: 0.43486026731927735 
 Rank: 4
Parameters:{'C': 0.1, 'class_weight': {0: 1.0, 1: 1.0}} 
 Mean score: 0.0 
 Rank: 6
Parameters:{'C': 1, 'class_weight': {0: 0.1, 1: 0.6}} 
 Mean score: 0.5337471035349756 
 Rank: 1
Parameters:{'C': 1, 'class_weight': {0: 1.0, 1: 1.0}} 
 Mean score: 0.42467437297591193 
 Rank: 5
Parameters:{'C': 10, 'class_weight': {0: 0.1, 1: 0.6}} 
 Mean score: 0.4922745877494413 
 Rank: 2
Paramet