# Support Vector Machines (30 points)

An SVM can be used for classification or regression. The algorithm's goal is to find a hyperplane in N-dimensional space that distinctly separates the data points of different classes. Below are some examples of hyperplanes that separate classes.

![svm](images/svm.jpg)

This shows a hyperplane drawn on the graph. The points on either side of the line belong to two distinct classes.

![svm-margin](images/SVM_margin.png)

This shows a linear-kernel SVM.
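To make the picture concrete, here is a minimal sketch that fits a linear-kernel `SVC` on six made-up 2-D points and reads back the hyperplane. The toy points and variable names are assumptions for illustration and are not part of the assignment data.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable points (not the assignment data).
X = np.array([[0, 0], [1, 0], [0, 1],   # class 0
              [3, 3], [4, 3], [3, 4]],  # class 1
             dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X, y)

# For a linear kernel the separating hyperplane is w . x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]

# The margin is the gap between the two boundaries w . x + b = +1 and -1.
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:\n", clf.support_vectors_)
```

The support vectors are the training points that sit on the margin boundaries; moving any other point slightly would leave the hyperplane unchanged.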

## Prepare the data for the SVM

Similar to the decision tree, we need to read in the data and split it into train and test sets.
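The cells below ask for label encoding followed by a train/test split. As a rough illustration on a made-up frame (the rows and the `Target` column are assumptions, not the assignment data), `LabelEncoder` maps each distinct string to an integer in sorted order, and `train_test_split` holds out a fraction of the rows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Geography": ["Spain", "France", "Germany", "France", "Spain",
                  "Germany", "France", "Spain", "Germany", "France"],
    "Gender": ["Male", "Female", "Female", "Male", "Male",
               "Female", "Male", "Female", "Male", "Female"],
    "Age": [42, 35, 51, 29, 60, 47, 33, 38, 55, 26],
    "Target": [1, 0, 1, 0, 1, 0, 0, 0, 1, 0],  # hypothetical label column
})

# Use one encoder per column so each keeps its own string-to-integer mapping.
encoders = {}
for col in ["Geography", "Gender"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# classes_ lists the original strings; the position of each string is its code.
print(encoders["Geography"].classes_.tolist())  # ['France', 'Germany', 'Spain']

features = toy.drop(columns=["Target"])
target = toy["Target"]
toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    features, target, train_size=0.8, test_size=0.2, shuffle=True, random_state=42
)
print(toy_train_x.shape, toy_test_x.shape)  # (8, 3) (2, 3)
```

Label encoding imposes an arbitrary order on the categories (here France < Germany < Spain), which is one reason one-hot encoding is often preferred for distance-based models.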

In [None]:
# load libraries needed
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier,plot_tree
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

In [None]:
# read in the data and clean the data
df = None
path = 'loan_defualter.csv'
############################STUDENT CODE GOES HERE#########################
# read in csv

# there are three columns that are not needed
# remove the Surname, CustomerId, and RowNumber columns from the df

# Two columns, Geography and Gender, contain strings.
# Computers cannot process strings easily, and it is difficult to do math with strings.
# The two most common approaches to dealing with strings are one-hot encoding and label encoding.
# Use label encoding to convert these two columns to integer values
# use sklearn's LabelEncoder for this

#################################END CODE##################################

assert(type(df) == pd.DataFrame)
assert(df['Gender'].max() == 1)
assert(df['Geography'].max() == 2)
df

In [None]:
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y= None,None,None,None
############################STUDENT CODE GOES HERE#########################
# split the data using train_test_split (set random_state arg to 42, train_size=0.8, test_size=0.2, shuffle=True)

#################################END CODE##################################

## Normalize the data

Many optimization methods work better when all features are on the same scale.

Many kernel functions internally use a Euclidean distance to compare two samples (in the Gaussian kernel, the Euclidean distance appears in the exponential term). If every feature has a different scale, the Euclidean distance is dominated by the features with the largest scale.

You can read more on normalization techniques [here](https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02).

For most problems I have worked with, normalization is done via StandardScaler or MinMaxScaler, or the model's library has data scaling built in.
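The effect of scale on a distance-based kernel can be seen with a small NumPy sketch. The sample values, the column meanings, and `gamma=0.5` are assumptions chosen for illustration; the standardization is done by hand to mirror what a standard scaler computes.

```python
import numpy as np

# Hypothetical samples: column 0 is an age in years, column 1 a balance in dollars.
X = np.array([
    [25.0, 50_000.0],
    [60.0, 50_500.0],
    [26.0, 90_000.0],
    [45.0, 20_000.0],
])

def squared_distance_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

def rbf_kernel_value(a, b, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Raw features: the balance column dominates even though the ages differ a lot.
raw_share = squared_distance_share(X[0], X[1])

# Standardize each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_share = squared_distance_share(X_std[0], X_std[1])

print("raw share per feature:   ", raw_share)
print("scaled share per feature:", std_share)
print("RBF on raw features:   ", rbf_kernel_value(X[0], X[1]))
print("RBF on scaled features:", rbf_kernel_value(X_std[0], X_std[1]))
```

On the raw values the first two samples look far apart almost entirely because of a 500-dollar difference, and the kernel value collapses to zero. After standardization the 35-year age gap is what separates them, and the kernel returns a usable similarity.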

In [None]:
# normalize the train data (train_x)
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import _data

scaler = None
original_train = train_x
############################STUDENT CODE GOES HERE#########################
# assign the scaler to an instance of StandardScaler

# Fit the scaler and transform the data. The train_x variable should now be normalized

#################################END CODE##################################
assert(type(scaler) == _data.StandardScaler)
assert(np.array_equal(original_train,train_x) == False)

## Train the SVM

In [None]:
from sklearn.svm import SVC

support_vector_classifier = SVC(kernel='rbf')
############################STUDENT CODE GOES HERE#########################
# use the support_vector_classifier and train the model on the train data

#################################END CODE##################################

## Evaluate the SVM

An important step here is to normalize the test data with the same scaler that was fitted on the training data. The model was trained on features at that scale, so it expects the test data to be transformed the same way. If you use the raw test data, you won't get accurate results.
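As a small sketch of this fit-on-train, transform-on-test pattern, the example below uses made-up arrays (the `toy_` names and values are assumptions, not the assignment variables):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices for illustration only.
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
toy_test = np.array([[2.5, 250.0], [10.0, 1000.0]])

toy_scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data only.
toy_train_scaled = toy_scaler.fit_transform(toy_train)
# transform reuses those training statistics; it does not refit on the test data.
toy_test_scaled = toy_scaler.transform(toy_test)

print("training mean:", toy_scaler.mean_)
print("scaled test:\n", toy_test_scaled)
```

The first test row equals the training mean, so it maps to zeros. The second row lies far outside the training range and maps to a large value, which is the information the model needs; refitting a scaler on the test set would hide it.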

In [None]:
original_test = test_x
############################STUDENT CODE GOES HERE#########################
# normalize the test data using the same scaler used above
# use the transform method to transform the data

#################################END CODE##################################

assert(np.array_equal(original_test,test_x) == False)

In [None]:
# the SVM is trained so now we need to evaluate the model and see how it performs on the test set
from sklearn.metrics import accuracy_score, precision_score, recall_score

results_dict = {'Accuracy':0,'Precision':0,'Recall':0}
############################STUDENT CODE GOES HERE#########################
# use the SVM to predict on the test data

# compare the truth labels with the predicted labels for accuracy, precision, and recall
# store the results in results_dict

#################################END CODE##################################

assert(results_dict['Accuracy'] > 0.15)
assert(results_dict['Precision'] > 0.05)
assert(results_dict['Recall'] > 0.30)
results_dict

You have now trained and evaluated an SVM.
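The three metrics stored in `results_dict` can be checked by hand on a tiny example. The labels below are made up for illustration, with 1 treated as the positive class:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels for illustration: 1 is the positive class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

# Counts by hand: TP = 3, FP = 2, FN = 1, TN = 2.
toy_results = {
    "Accuracy": accuracy_score(y_true, y_pred),    # (TP + TN) / total = 5 / 8
    "Precision": precision_score(y_true, y_pred),  # TP / (TP + FP) = 3 / 5
    "Recall": recall_score(y_true, y_pred),        # TP / (TP + FN) = 3 / 4
}
print(toy_results)
```

Precision and recall can differ a lot from accuracy when the classes are imbalanced, which is why all three are reported.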

## Visualize training an SVM

Similar to the decision tree, use the plot helper function to graph the SVM's metrics as it trains. This will tell you whether it is overfitting, underfitting, or converging properly.
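The signature of `plot_learning_curve` in `plot_helpers` is not shown here, so the sketch below calls scikit-learn's `learning_curve` directly to show the kind of numbers such a plot is typically built from. The synthetic dataset and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic data for illustration only (not the assignment data).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    SVC(kernel="linear"),
    X,
    y,
    cv=5,                                  # 5-fold cross-validation
    scoring="accuracy",                    # or "precision", "recall"
    train_sizes=np.linspace(0.2, 1.0, 5),  # fractions of the training folds
)

# Each row is one training-set size; each column is one cross-validation fold.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{int(n):4d} samples  train={tr:.3f}  validation={va:.3f}")
```

A training score that stays well above the validation score suggests overfitting, two low scores suggest underfitting, and two scores that approach each other at a high value suggest the model is converging.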

In [None]:
import matplotlib.pyplot as plt  # needed for plt.show() at the end of this cell
from plot_helpers import plot_learning_curve
svm = SVC(kernel='linear')
############################STUDENT CODE GOES HERE#########################
# use the plot_learning_curve function to return a graph of your training
# this function will use cross validation for getting the validation metrics
# Hint: you can use the scoring argument and set it to a string to show a certain metric ("accuracy","precision","recall")
# plot at least one of the metrics

#################################END CODE##################################
plt.show()

You have now completed the SVM portion of this assignment. Now that you have the hang of training and evaluating these models, it is time to run your own experiments and see how well you can get them to perform.