# Support Vector Machines (SVMs)

## Introduction

Support Vector Machines (SVMs) are classifiers that separate the classes in a dataset by placing an optimal hyperplane between the multi-dimensional data points. A hyperplane generalizes a two-dimensional plane to any number of dimensions. If the dataset has two dimensions, the hyperplane is a line fitted to give the best classification of the dataset. Here, "best classification" does not necessarily mean that every point in the training dataset is classified perfectly. It means that the hyperplane satisfies a criterion: it lies as far as possible from the nearest points of each class (the maximum margin). The figure below shows a hyperplane classifying a dataset.
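As a minimal sketch of the maximum-margin idea, the snippet below fits a linear SVM on a small hand-made 2D dataset and reads back the separating line `w . x + b = 0` and the margin width `2 / ||w||`. The points and the `_toy` names are invented for illustration and are not part of the Titanic data.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D points (made up for illustration): two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [5.0, 5.0], [5.5, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin on separable data
toy_svm = SVC(kernel='linear', C=1000)
toy_svm.fit(X_toy, y_toy)

w = toy_svm.coef_[0]        # normal vector of the separating line
b = toy_svm.intercept_[0]   # offset of the line
margin_width = 2 / np.linalg.norm(w)

print('w =', w, 'b =', b)
print('margin width =', margin_width)
# Only the points closest to the line define it: the support vectors
print('support vectors:\n', toy_svm.support_vectors_)
```

Only the support vectors determine the line; moving any other point (without crossing the margin) leaves the fitted hyperplane unchanged.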

<img src="../../../images/SVM.PNG" style="width:45vw"> 

We shall use the plot_learning_curve function defined in the scikit-learn example gallery:
ref: http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
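The computation behind a learning curve can be sketched with `sklearn.model_selection.learning_curve`. This is a hedged illustration, not the code from the linked page: it assumes a synthetic dataset from `make_classification` in place of the Titanic data, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)

# Score the model on growing fractions of the training data with 5-fold CV
train_sizes, train_scores, valid_scores = learning_curve(
    SVC(), X_demo, y_demo, cv=5, scoring='accuracy',
    train_sizes=np.linspace(0.2, 1.0, 5))

# Average over the folds to get one point per training-set size
mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, mean_train, mean_valid):
    print(n, round(tr, 3), round(va, 3))
```

Plotting `mean_train` and `mean_valid` against `train_sizes` with matplotlib gives the learning curve itself.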

## Exercise

* Using the cleaned Titanic dataset, train an SVM classifier on the 'features' list provided below.
* Perform a train-test split.
* Perform 10-fold cross-validation.
* Find the mean accuracy of the trained model and store it in the accuracy_train variable.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.model_selection import KFold

train_data = pd.read_csv("https://raw.githubusercontent.com/colaberry/data/master/Titanic/train_data.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/colaberry/data/master/Titanic/test_data.csv")
features = ['Pclass', 'Survived','Age_Imputed', 'SibSp', 'Parch', 'Fare', 'C', 'Q', 'female']

#Keeping relevant data for processing 
train_data = train_data[features]

#Converting dataset into array for Cross validation
array = train_data.values

#Separating the target variable ('Survived', column 1) from the independent variables
X=np.delete(array, 1, axis=1)
Y=array[:,1]

#Setting the test size and train size
test_size = 0.20
seed = 7


## Solution

```python

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)

scoring = 'accuracy'
models = SVC()
# shuffle=True is required when random_state is set on KFold
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)

# Accuracy on each of the 10 folds, then the mean across folds
cv_results = model_selection.cross_val_score(models, X_train, Y_train, cv=kfold, scoring=scoring)
accuracy_train = cv_results.mean()
print(accuracy_train)
```

We should now use the model to make predictions on our test data.

## Exercise

* Predict on the test data and find the accuracy of the model. Put the accuracy of the model in a variable called accuracy_test.

In [None]:
# Make predictions on test dataset


## Solution

Use svm.fit(..) to fit the model and then accuracy_score(..) to find the accuracy.

```python
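# Create the classifier and fit it on the training split
svm = SVC()
svm.fit(X_train, Y_train)
# Predict on the held-out test split and compare with the true labels
predictions = svm.predict(X_test)
accuracy_test = accuracy_score(Y_test, predictions)
print(accuracy_test)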
```
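The imports at the top of the notebook include `confusion_matrix` and `classification_report`, which give a fuller picture than a single accuracy number. The following is a hedged sketch on synthetic data; the `make_classification` dataset and the `_demo` and `d_` names are assumptions for illustration, not the Titanic split.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=7)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=7)

demo_svm = SVC().fit(Xd_train, yd_train)
demo_pred = demo_svm.predict(Xd_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(yd_test, demo_pred)
print(cm)

# Per-class precision, recall and F1 score
print(classification_report(yd_test, demo_pred))

# Accuracy equals the diagonal of the confusion matrix over the total count
print(cm.trace() / cm.sum(), accuracy_score(yd_test, demo_pred))
```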

## Linear SVC

Scikit-learn offers two options for a linear SVM model:

* LinearSVC()
* SVC(kernel='linear')


The linear models LinearSVC() and SVC(kernel='linear') yield slightly different decision boundaries. This can be a consequence of the following differences:

* LinearSVC minimizes the squared hinge loss by default, while SVC minimizes the regular hinge loss.
* LinearSVC uses the One-vs-Rest multiclass reduction, while SVC (linear kernel) uses the One-vs-One multiclass reduction.
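The difference between the two multiclass reductions shows up in the number of hyperplanes each model learns. The sketch below assumes a synthetic four-class dataset from `make_blobs` (with three classes, as in iris, both reductions happen to give three hyperplanes); the `_blobs`, `ovr_model` and `ovo_model` names are hypothetical.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC, LinearSVC

# Four synthetic classes in 2D (assumption: illustration only, not the iris data)
X_blobs, y_blobs = make_blobs(n_samples=200, centers=4, random_state=7)

ovr_model = LinearSVC(max_iter=10000).fit(X_blobs, y_blobs)
ovo_model = SVC(kernel='linear').fit(X_blobs, y_blobs)

# One-vs-Rest: one hyperplane per class -> 4 rows of coefficients
print(ovr_model.coef_.shape)
# One-vs-One: one hyperplane per pair of classes -> 4 * 3 / 2 = 6 rows
print(ovo_model.coef_.shape)
```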


In this section, we compare the two linear SVM classifiers on the IRIS dataset. We consider only the first two features of this dataset:

<b>Sepal length</b> <br>
<b>Sepal width</b> <br>

## Exercise

* Use LinearSVC and SVC(kernel='linear') to fit IRIS dataset on 'Sepal length' and 'Sepal width'. Then print the accuracies for the models.

In [None]:
# Load the IRIS dataset into a DataFrame
from sklearn import datasets,metrics
from sklearn.svm import SVC
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
iris_data = iris.data
iris_data = pd.DataFrame(iris_data, columns=iris.feature_names)
iris_data['species'] = iris.target 
iris_data['species'].unique()

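# Keep only the first two features: sepal length and sepal width
features = iris.feature_names[:2]
target = 'species'

X = iris_data[features]
y = iris_data[target]

# Split the IRIS data so the solution does not reuse the Titanic split
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)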


## Solution

```python
lsvm=LinearSVC()
lsvm.fit(X_train, Y_train)
predictions = lsvm.predict(X_test)
accuracy_test= accuracy_score(Y_test, predictions)
print('Accuracy of LinearSVC',accuracy_test)

l_svc=SVC(kernel='linear')
l_svc.fit(X_train, Y_train)
predictions = l_svc.predict(X_test)
accuracy_test= accuracy_score(Y_test, predictions)
print('Accuracy of SVC with Linear kernel',accuracy_test)
```

## Non-linear Support Vector models

Both linear models LinearSVC and SVC(kernel='linear') have linear decision boundaries - a hyperplane.

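The non-linear kernel models (polynomial or Gaussian RBF) have more flexible, non-linear decision boundaries whose shapes depend on the kind of kernel and its parameters. A kernel implicitly maps the data into a higher-dimensional feature space, where a linear hyperplane corresponds to a non-linear boundary in the original space.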
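To illustrate why a non-linear kernel can help, the sketch below compares kernels on two concentric rings, a dataset that no straight line can separate. The `make_circles` data, the `degree=2` setting and the `_rings` and `r_` names are assumptions chosen for illustration and are not part of the IRIS exercise.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate them
X_rings, y_rings = make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=7)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_rings, y_rings, test_size=0.20, random_state=7)

# Same classifier, three different kernels
ring_models = {
    'linear': SVC(kernel='linear'),
    'poly (degree=2)': SVC(kernel='poly', degree=2),
    'rbf': SVC(kernel='rbf'),
}

scores = {}
for name, model in ring_models.items():
    model.fit(Xr_train, yr_train)
    scores[name] = model.score(Xr_test, yr_test)
    print('Accuracy of SVC with', name, 'kernel:', scores[name])
```

The kernel parameters matter as much as the kernel itself: for example, `degree` for the polynomial kernel and `gamma` for the RBF kernel control how flexible the boundary can become.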


## Exercise

* Use the non-linear kernels SVC(kernel='poly') and SVC(kernel='rbf') to fit the IRIS dataset on 'Sepal length' and 'Sepal width'. Then print the accuracies for the models.

In [None]:
# Make predictions on test dataset


## Solution


```python
p_svc=SVC(kernel='poly')
p_svc.fit(X_train, Y_train)
predictions = p_svc.predict(X_test)
accuracy_test= accuracy_score(Y_test, predictions)
print('Accuracy of SVC with Poly kernel',accuracy_test)

r_svc=SVC(kernel='rbf')
r_svc.fit(X_train, Y_train)
predictions = r_svc.predict(X_test)
accuracy_test= accuracy_score(Y_test, predictions)
print('Accuracy of SVC with RBF kernel',accuracy_test)
```