<h1 style='color: #C9C9C9'>Machine Learning with Python<img style="float: right; margin-top: 0;" width="240" src="../../Images/cf-logo.png" /></h1> 
<p style='color: #C9C9C9'>&copy; Coding Fury 2022 - all rights reserved</p>

<hr style='color: #C9C9C9' />

# Support Vector Machines


## Explanation by StatQuest 

StatQuest has 4 videos on Support Vector Machines, but for now you should just watch this introduction: 
[StatQuest: SVM Main Ideas](https://www.youtube.com/watch?v=efR1C6CvhmE&t=742s) (20min 30s)



Support Vector Machines can be used for both Classification and Regression, but in this example we're going to use one as a Binary Classifier.

Start by loading in the breast cancer dataset. 

# Part 1: Train an SVM Model

In [None]:
import pandas as pd
pd.options.display.max_columns = None

In [None]:
cancer_df = pd.read_csv('../../Data/breast_cancer.csv') # outcome: 0 = benign, 1 = malignant

In [None]:
cancer_df

# Train the Model

### Create arrays for the Features and Target Value

Set "outcome" as the target value. 1 means the tumour is malignant (the patient has Breast Cancer). 0 means it is benign.

In [None]:
X = cancer_df.drop('outcome', axis=1).values

In [None]:
y = cancer_df['outcome'].values

### Split into Training and Test

In [None]:
from sklearn.model_selection import train_test_split
# Note: train_size=0.3 keeps only 30% of the rows for training; the other 70% become the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3, random_state=42)
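
To see what `train_size` and `test_size` each control, here is a minimal sketch on a made-up array of 100 rows. The toy data and the `demo_`, `a_` and `b_` names are assumptions for illustration only; they are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data (hypothetical): 100 rows, 2 features, alternating 0/1 labels
demo_X = np.arange(200).reshape(100, 2)
demo_y = np.arange(100) % 2

# train_size=0.3 -> 30% of the rows are used for training
a_train, a_test, _, _ = train_test_split(demo_X, demo_y, train_size=0.3, random_state=42)

# test_size=0.3 -> 30% of the rows are held back for testing
b_train, b_test, _, _ = train_test_split(demo_X, demo_y, test_size=0.3, random_state=42)

print(len(a_train), len(a_test))  # 30 70
print(len(b_train), len(b_test))  # 70 30
```

Which one you want depends on how much data you can afford to hold back for testing.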


### Train the model

In [None]:
from sklearn.svm import SVC

model = SVC()
model.fit(X_train, y_train)
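
A fitted `SVC` keeps the training points that define its decision boundary: the support vectors. The sketch below shows where to find them. It uses synthetic data from `make_classification` (an assumption, so that the example runs without the CSV), and the `demo_` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data: 200 rows, 5 features, 2 classes
demo_X, demo_y = make_classification(n_samples=200, n_features=5, random_state=0)

demo_model = SVC()  # defaults: kernel='rbf', C=1.0, gamma='scale'
demo_model.fit(demo_X, demo_y)

# One row per support vector, one column per feature
print(demo_model.support_vectors_.shape)

# Number of support vectors belonging to each class
print(demo_model.n_support_)
```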

### Make predictions

In [None]:
y_pred = model.predict(X_test)
y_pred

### Evaluate the Predictions

Let's see how good our model is by creating a confusion matrix. 

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
conf_mat = confusion_matrix(y_test, y_pred)
conf_mat
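
For a binary problem, the four cells of the confusion matrix can be unpacked by name. Here is a small sketch with made-up labels (`demo_true` and `demo_pred` are hypothetical, not taken from the dataset). scikit-learn puts actual values in rows and predicted values in columns, so `ravel()` returns the cells in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 4 actual negatives, 6 actual positives
demo_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
demo_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are actual values, columns are predicted values
tn, fp, fn, tp = confusion_matrix(demo_true, demo_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 2 4
```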

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# plotting the confusion matrix
plt.figure(figsize=(12,6))
plt.title("Confusion Matrix")
sns.heatmap(conf_mat, annot=True, fmt='d', cmap='Blues')
plt.ylabel("Actual Values")
plt.xlabel("Predicted Values")
plt.show()


In [None]:
print(classification_report(y_test, y_pred))

In case you've forgotten how to read the classification report, here are some references:

* https://medium.com/@kohlishivam5522/understanding-a-classification-report-for-your-machine-learning-model-88815e2ce397
* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
* https://www.statology.org/sklearn-classification-report/ 
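
The numbers in the classification report can be reproduced by hand. The sketch below computes precision, recall and F1 for the positive class from made-up labels (`demo_true` and `demo_pred` are hypothetical) and checks them against scikit-learn's own functions.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels for illustration
demo_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
demo_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Count the outcomes for the positive class (label 1)
tp = sum(t == 1 and p == 1 for t, p in zip(demo_true, demo_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(demo_true, demo_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(demo_true, demo_pred))

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, precision_score(demo_true, demo_pred))
print(recall, recall_score(demo_true, demo_pred))
print(f1, f1_score(demo_true, demo_pred))
```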


# Part 2: Tuning the model with GridSearchCV

The SVM model accepts hyperparameters that can be used to tune it. 

C
* the cost of misclassification. 
* a large value of C will result in low bias, but high variance

Kernel
* You can choose which kernel to use: "Radial Basis Function" is the default

Gamma
* This is a parameter of the Radial Basis Function
* a large value of gamma will result in low bias, but high variance (each training point's influence becomes very local, so the model can overfit)
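
To get a feel for what gamma does, the sketch below fits three SVMs that differ only in gamma and compares training accuracy with test accuracy. A large gap between the two is a sign of high variance (overfitting). The data comes from `make_moons`, which is an assumption chosen so the example runs on its own; the `demo_` names are hypothetical.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data with some noise
demo_X, demo_y = make_moons(n_samples=400, noise=0.3, random_state=0)
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=0
)

scores = {}
for gamma in [0.01, 1, 1000]:
    clf = SVC(C=1, gamma=gamma).fit(demo_X_train, demo_y_train)
    train_acc = clf.score(demo_X_train, demo_y_train)
    test_acc = clf.score(demo_X_test, demo_y_test)
    scores[gamma] = (train_acc, test_acc)
    print(f"gamma={gamma}: train={train_acc:.2f} test={test_acc:.2f}")
```

The same experiment can be repeated with a fixed gamma and a range of C values.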


In order to find the best values for C and Gamma, we'll try lots of values. This is easy to do with GridSearch Cross-Validation. In fact, GridSearchCV works with any model that accepts hyperparameters e.g. K Nearest Neighbors.


In [None]:
from sklearn.model_selection import GridSearchCV
import numpy as np

In [None]:
param_grid = {
    'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 5, 10, 50, 100, 500, 1000],
    'gamma': [1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, 0.00001]
}
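
It's worth knowing how much work a grid implies before running it. The sketch below counts the combinations in the grid above using `ParameterGrid`, then builds a coarser, log-spaced grid with `np.logspace` as one possible alternative. The `log_grid` name and its ranges are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 5, 10, 50, 100, 500, 1000],
    'gamma': [1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, 0.00001]
}

# Every combination of C and gamma is one candidate model
n_candidates = len(ParameterGrid(param_grid))
# Each candidate is fitted once per fold (5 folds by default)
n_fits = n_candidates * 5
print(n_candidates, n_fits)  # 176 880

# A coarser alternative (hypothetical): powers of ten only
log_grid = {
    'C': np.logspace(-1, 3, 5),      # 0.1, 1, 10, 100, 1000
    'gamma': np.logspace(-5, 0, 6),  # 0.00001 up to 1
}
print(len(ParameterGrid(log_grid)))  # 30
```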


In [None]:
grid = GridSearchCV(SVC(), param_grid=param_grid, verbose=3)  # the higher the verbose number, the more output you see

In [None]:
grid.fit(X_train, y_train)

Get the best parameters!

In [None]:
grid.best_params_

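What's the best score that was achieved? This is the mean cross-validated accuracy of the best parameter combination on the training data, not a score on the test set.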

In [None]:
grid.best_score_
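
The best score is only one row of a larger table. A fitted `GridSearchCV` stores the result for every combination in `cv_results_`, which loads neatly into a DataFrame. The sketch below uses synthetic data and a deliberately tiny grid so that it runs quickly; both are assumptions, as are the `demo_` names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data and a small grid (both hypothetical)
demo_X, demo_y = make_classification(n_samples=200, n_features=5, random_state=0)
demo_grid = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]})
demo_grid.fit(demo_X, demo_y)

# One row per parameter combination
results = pd.DataFrame(demo_grid.cv_results_)
columns = ['param_C', 'param_gamma', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results.sort_values('rank_test_score')[columns].head())

# best_score_ is the highest mean cross-validated score in that table
best_mean = results['mean_test_score'].max()
print(demo_grid.best_score_, best_mean)
```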

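GridSearchCV has already plugged these values into a model for us: by default it refits the best combination on the whole training set and stores the result as `best_estimator_`.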

In [None]:
grid.best_estimator_
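
Because of that refit, calling `predict` on the grid object is the same as calling `predict` on `best_estimator_`. A minimal check on synthetic data (the data, the small grid and the `demo_` names are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

demo_X, demo_y = make_classification(n_samples=200, n_features=5, random_state=0)

# refit=True is the default, so the best combination is retrained on all of demo_X
demo_grid = GridSearchCV(SVC(), param_grid={'C': [1, 10], 'gamma': [0.01, 0.1]})
demo_grid.fit(demo_X, demo_y)

# The grid object delegates predict() to the refitted best estimator
same = np.array_equal(
    demo_grid.predict(demo_X),
    demo_grid.best_estimator_.predict(demo_X),
)
print(same)  # True
```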

Now that we've found the best hyperparameters, and trained our model with them, let's see if our model performs any better. 

In [None]:
grid_predictions = grid.predict(X_test)

In [None]:
conf_mat = confusion_matrix(y_test, grid_predictions)
conf_mat

In [None]:
# plotting the confusion matrix

plt.figure(figsize=(12,6))
plt.title("Confusion Matrix")
sns.heatmap(conf_mat, annot=True, fmt='d', cmap='Blues')
plt.ylabel("Actual Values")
plt.xlabel("Predicted Values")
plt.show()

In [None]:
print(classification_report(y_test, grid_predictions))

# In conclusion

You may have noticed that the model tuned with GridSearchCV performs worse than the plain old SVM model we used in Part 1.

Take a minute to verify this for yourself. 
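
One way to run that comparison as a single script is sketched below. It is not the notebook's exact setup: it assumes scikit-learn's built-in `load_breast_cancer` data in place of the CSV (note that the built-in copy encodes 0 = malignant, 1 = benign, the opposite of the CSV), and it uses a smaller, hypothetical grid to keep the run short. The numbers you get may differ from those in the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Assumption: built-in dataset used as a stand-in for ../../Data/breast_cancer.csv
bc_X, bc_y = load_breast_cancer(return_X_y=True)
bc_X_train, bc_X_test, bc_y_train, bc_y_test = train_test_split(
    bc_X, bc_y, train_size=0.3, random_state=42
)

# Plain SVM with default hyperparameters
default_model = SVC().fit(bc_X_train, bc_y_train)

# Tuned SVM; small_grid is a reduced, hypothetical grid
small_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1e-5, 1e-4, 1e-3, 1e-2]}
tuned_model = GridSearchCV(SVC(), param_grid=small_grid).fit(bc_X_train, bc_y_train)

# Score both on the same held-out test set
default_acc = accuracy_score(bc_y_test, default_model.predict(bc_X_test))
tuned_acc = accuracy_score(bc_y_test, tuned_model.predict(bc_X_test))
print(f"default: {default_acc:.3f}  tuned: {tuned_acc:.3f}")
```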

Why is this? 

If I had to guess, I'd imagine it's probably a combination of 2 factors: 

1. The dataset we have to work with is relatively small
2. GridSearchCV performs K-Fold cross-validation (5 folds by default), so each candidate is scored using a model trained on only part of the training dataset.
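
The second point can be made concrete. For a classifier, GridSearchCV's default splitter is a 5-fold `StratifiedKFold`, so each fit during the search sees four fifths of the training rows. The sketch below shows the fold sizes for a made-up training set of 100 rows (the `demo_` arrays are hypothetical placeholders).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder training set: 100 rows, balanced 0/1 labels
demo_X = np.zeros((100, 3))
demo_y = np.array([0, 1] * 50)

cv = StratifiedKFold(n_splits=5)

# (rows used for fitting, rows used for validation) in each fold
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in cv.split(demo_X, demo_y)]
print(fold_sizes)  # five (80, 20) pairs
```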

Nevertheless, this is a good example of how to tune hyperparameters using GridSearchCV. With larger, real-world datasets, tuning often yields an improvement on the default values in the SVM model.
