## 17.5.1
### Overview of Support Vector Machines
A **support vector machine (SVM)**, like a logistic regression model, is a binary classifier: it can categorize samples into one of two categories (for example, yes or no).

To understand support vector machines, let's revisit logistic regression first. A logistic regression model estimates the probability of an outcome. For example, a model for loan applications would take features into account (for example, an applicant's income and credit score) and estimate the probability that the application should be approved.

The outcome is binary because the only possible options are to approve or to deny the loan application: If the probability is higher than 0.5, the application is classified as approved, or if the probability is less than that, the application is classified as denied. There is a strict cutoff line that divides one classification from the other:

In logistic regression, a probability that exceeds 50% is classified one way, and all other values are classified as the other class.
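The cutoff rule can be illustrated with a minimal sketch. The scores below are made-up values standing in for a trained model's output, and `sigmoid` is a helper defined here for illustration, not part of the course notebook:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model scores (log-odds) for four loan applications.
scores = np.array([-2.0, -0.1, 0.3, 1.5])
probabilities = sigmoid(scores)

# Apply the strict 0.5 cutoff: above it is "approve", otherwise "deny".
labels = np.where(probabilities > 0.5, "approve", "deny")

for p, label in zip(probabilities, labels):
    print(f"probability={p:.2f} -> {label}")
```

Every application falls on one side of the cutoff or the other; there is no notion of a margin around it.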

SVM also categorizes the target variable into one of two classes (for example, approved or denied). However, it differs from logistic regression in several ways. As a linear classifier, SVM aims to find a line that separates the data into two classes:

SVM is a binary classifier.

However, there may be many different ways to draw the boundary line, as shown in the diagram below. Which boundary to choose isn't always clear from visual inspection, and choosing the wrong boundary can affect the performance of the model:

A casual visual inspection of the data doesn't always make it clear how to optimally divide the classes.

In a two-dimensional grid, as shown below, SVM draws a line at the edge of each class, and attempts to maximize the distance between them. It does so by separating the data points with the largest possible margins:

SVM seeks to maximize the margins between the two classes.

In two dimensions, the hyperplane is the line exactly between the two margins (i.e., equidistant from both margins). Again, the SVM's goal is to find the hyperplane with the widest possible margins (i.e., the largest margin of separation between the two classes):

SVM seeks to find the widest equidistant margins to improve classification predictions.

Support vectors are defined as the data points closest to the hyperplane:

Support vectors are data points that define the class boundaries. Data points closest to the hyperplane are support vectors; they determine where the decision boundary lies.
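To make the hyperplane, margins, and support vectors concrete, here is a small sketch using scikit-learn's `SVC` on six made-up points. The data and the very large `C` value (used here to approximate a margin with no exceptions) are assumptions for illustration only:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, cleanly separable data: two features, two classes.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.0]])
y = np.array(["red", "red", "red", "blue", "blue", "blue"])

# A very large C approximates a hard margin (no exceptions allowed).
model = SVC(kernel="linear", C=1e6).fit(X, y)

w = model.coef_[0]        # normal vector of the separating hyperplane
b = model.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margins

print("support vectors:\n", model.support_vectors_)
print("margin width:", round(margin_width, 3))
```

Only the points nearest the boundary appear in `support_vectors_`; moving any of the other points slightly would not change the fitted hyperplane.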

Real-life data, however, can be messy and will often not yield such a clean line of separation. Imagine that a data point belonging to the blue class were found closer to the cluster of data points that belong to the red class. In this case, would the hyperplane have to be relocated? Would the support vectors have to be redefined?

A data point is an outlier if it is close to a cluster of data points from another class.

SVMs can accommodate such outliers by using soft margins. A soft margin lets the SVM tolerate outliers that cross into the margin or past the hyperplane, while keeping the support vectors and hyperplane that maximize the overall separation of the two classes:

SVMs can make exceptions for outliers with soft margins.
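In scikit-learn, the softness of the margin is controlled by the `C` parameter of `SVC`: a small `C` tolerates more exceptions, while a large `C` penalizes them heavily. The sketch below uses made-up points, including one blue outlier near the red cluster, and a hypothetical `margin_width` helper to compare the two settings:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: two clusters, plus one "blue" outlier near the red cluster.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # red cluster
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.0],   # blue cluster
              [2.0, 2.0]])                          # blue outlier
y = np.array(["red"] * 3 + ["blue"] * 4)

def margin_width(C):
    """Fit a linear SVM with penalty C and return the width of its margin."""
    model = SVC(kernel="linear", C=C).fit(X, y)
    return 2.0 / np.linalg.norm(model.coef_[0])

soft = margin_width(C=0.1)     # tolerant: the outlier may sit inside the margin
strict = margin_width(C=1000)  # strict: the boundary shifts to fit the outlier

print(f"soft margin width:   {soft:.3f}")
print(f"strict margin width: {strict:.3f}")
```

With the smaller `C`, the model keeps a wider margin between the two main clusters instead of squeezing the boundary around a single unusual point.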

Up to this point, we have visualized using SVM in datasets with two features. A dataset with three features (e.g., age, education, income) and a target with two classes (e.g., approval or denial of a loan application) would be visualized as a 3D space, with a hyperplane separating the two classes:

A 3D hyperplane separating two classes. Datasets with three features are modeled in 3D, with two target classes and a hyperplane.
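As a rough illustration of the three-feature case, the sketch below fits a linear `SVC` on randomly generated, made-up data and reads back the plane's coefficients. The feature names and value ranges are assumptions chosen for the example:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical normalized features: age, education, income (values in [0, 1]).
rng = np.random.default_rng(0)
denied = rng.uniform(0.0, 0.4, size=(20, 3))
approved = rng.uniform(0.6, 1.0, size=(20, 3))
X = np.vstack([denied, approved])
y = np.array(["deny"] * 20 + ["approve"] * 20)

model = SVC(kernel="linear").fit(X, y)

# With three features the hyperplane is a plane: w1*x1 + w2*x2 + w3*x3 + b = 0.
w = model.coef_[0]
b = model.intercept_[0]
print("weights:", w, "intercept:", b)

# The sign of w.x + b tells us which side of the plane a point falls on.
manual_scores = X @ w + b
print(np.allclose(manual_scores, model.decision_function(X)))
```

The same idea extends to any number of features: one weight per feature plus an intercept defines the hyperplane.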

To summarize, SVM works by separating the two classes in a dataset with the widest possible margins. The margins, however, are soft and can make exceptions for outliers. This stands in contrast to the logistic regression model. In logistic regression, any data point whose probability of belonging to one class exceeds the cutoff point belongs to that class; all other data points belong to the other class.

 # SVM Loan Approver

There are a number of classification algorithms that can be used to determine loan eligibility. Some algorithms perform better than others. Build a loan approver using the SVM algorithm and compare the accuracy and performance of the SVM model with the Logistic Regression model.

In [1]:
from pathlib import Path
import numpy as np
import pandas as pd

In [2]:
# Read in the data
# Note: The following data has been normalized between 0 and 1
data = Path('../Resources/loans.csv')
df = pd.read_csv(data)
df.head()

| | assets | liabilities | income | credit_score | mortgage | status |
|---|---|---|---|---|---|---|
| 0 | 0.210859 | 0.452865 | 0.281367 | 0.628039 | 0.302682 | deny |
| 1 | 0.395018 | 0.661153 | 0.330622 | 0.638439 | 0.502831 | approve |
| 2 | 0.291186 | 0.593432 | 0.438436 | 0.434863 | 0.315574 | approve |
| 3 | 0.45864 | 0.576156 | 0.744167 | 0.291324 | 0.394891 | approve |
| 4 | 0.46347 | 0.292414 | 0.489887 | 0.811384 | 0.566605 | approve |


 ## Separate the Features (X) from the Target (y)

In [3]:
# Segment the features from the target
y = df["status"]
X = df.drop(columns="status")

 ## Split our data into training and testing

In [4]:
# Use the train_test_split function to create training and testing subsets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    random_state=1, 
                                                    stratify=y)
X_train.shape

(75, 5)

## Create an SVM Model

In [5]:
# Instantiate a linear SVM model
from sklearn.svm import SVC
model = SVC(kernel='linear')

## Fit (train) our model using the training data

In [6]:
# Fit the data
model.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

 ## Score the model using the test data

 ## Make predictions

In [8]:
# Make predictions using the test data
y_pred = model.predict(X_test)
results = pd.DataFrame({
    "Prediction": y_pred, 
    "Actual": y_test
}).reset_index(drop=True)
results.head()

| | Prediction | Actual |
|---|---|---|
| 0 | approve | deny |
| 1 | deny | approve |
| 2 | deny | deny |
| 3 | approve | deny |
| 4 | deny | deny |


In [9]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.6

## Generate Confusion Matrix

In [10]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

array([[7, 5],
       [5, 8]])

## Generate Classification Report

In [11]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     approve       0.58      0.58      0.58        12
        deny       0.62      0.62      0.62        13

    accuracy                           0.60        25
   macro avg       0.60      0.60      0.60        25
weighted avg       0.60      0.60      0.60        25



## 17.5.2
### SVM in Practice
Although the ideas behind support vector machines are different from those behind logistic regression, implementing an SVM model is very similar to what you have already done. As before, you will split your dataset, create and train a model, create predictions, then validate the model.

Now that we have looked at how an SVM model works, let's look at using SVM in practice. To get started, download the following files.

Download 17-5-2-svm.zip (the notebook it contains is reproduced above).

Open the notebook and load the dataset:

    from pathlib import Path
    import numpy as np
    import pandas as pd

    data = Path('../Resources/loans.csv')
    df = pd.read_csv(data)
    df.head()

Each row in the dataset represents an application for a loan, and information is available on the applicant's assets, liabilities, income, credit score, and mortgage size. We also have information on whether the application was approved or denied. Here, the target variable is status, and all other columns are features used to predict the loan application status.

It's worth noting that the data in this dataset have been normalized. In this case, the data in the numerical features, such as assets and liabilities, have been scaled to be between 0 and 1.

We will discuss scaling in greater detail later, but note for now that some models require scaling the data, and that in this dataset, the scaling has been done for you:

Preview of the dataset. The dataset shows the information for each loan application as well as status.
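The lesson does not say which method was used to scale the loans data. One common way to map values into the range 0 to 1 is min-max scaling, sketched here with scikit-learn's `MinMaxScaler` on made-up raw values (the column values are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw (unscaled) values; the course dataset is already scaled.
raw = pd.DataFrame({
    "income": [32_000, 58_000, 120_000, 45_000],
    "credit_score": [580, 640, 790, 705],
})

# MinMaxScaler maps each column so its minimum becomes 0 and its maximum 1.
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(raw), columns=raw.columns)
print(scaled)
```

After scaling, income and credit score sit on the same 0 to 1 range, so neither feature dominates the distance calculations an SVM relies on simply because of its units.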

The next two steps should be familiar. We separate the dataset into features (X) and target (y):

    y = df["status"]
    X = df.drop(columns="status")

We then further split the dataset into training and testing sets. Note that the shape of the training set is (75, 5), meaning 75 rows and five columns. It is generally good practice to stratify the data when splitting into training and testing sets, especially when the dataset is small, as is the case here:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X,
        y, random_state=1, stratify=y)
    X_train.shape
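To see what stratifying does, the sketch below builds a made-up stand-in for the loans data (the 48/52 class split is an assumption, not taken from the real file) and checks the class counts in each subset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the loans data: 100 rows, 5 features,
# with 48 approvals and 52 denials.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.random((100, 5)),
                 columns=["assets", "liabilities", "income",
                          "credit_score", "mortgage"])
y = pd.Series(["approve"] * 48 + ["deny"] * 52, name="status")

# stratify=y keeps the approve/deny ratio the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y)

print(X_train.shape)
print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```

By default, `train_test_split` holds out 25% of the rows for testing, which is why 100 rows yield a training set of 75. Without `stratify`, a small test set could by chance end up with very few samples of one class.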

Next, we import the SVC class from scikit-learn, then instantiate it. The kernel argument specifies the mathematical function used to separate the classes. In this example, kernel='linear' makes the decision boundary a linear hyperplane. However, a number of other kernels exist that define nonlinear boundaries:

    from sklearn.svm import SVC
    model = SVC(kernel='linear')
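The effect of the kernel choice is easiest to see on data that no straight line can separate. The following sketch is not part of the loans exercise; it uses scikit-learn's synthetic `make_circles` data to compare a linear kernel with the nonlinear `rbf` kernel:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other, so no straight
# line can separate them.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y)

# Fit the same classifier with a linear and a nonlinear (RBF) kernel.
linear_score = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_score = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"linear kernel accuracy: {linear_score:.2f}")
print(f"rbf kernel accuracy:    {rbf_score:.2f}")
```

On ring-shaped data like this, the nonlinear kernel should score noticeably higher. On data that is roughly linearly separable, a linear kernel is often a reasonable first choice.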

We then train the model with fit():

    model.fit(X_train, y_train)

Next, we create predictions with the model:

    y_pred = model.predict(X_test)
    results = pd.DataFrame({
        "Prediction": y_pred,
        "Actual": y_test
    }).reset_index(drop=True)
    results.head()

We assess the accuracy of the model with accuracy_score, which returns 0.6:

    from sklearn.metrics import accuracy_score
    accuracy_score(y_test, y_pred)

We then generate a confusion_matrix and print the classification report:

    from sklearn.metrics import confusion_matrix
    confusion_matrix(y_test, y_pred)

    from sklearn.metrics import classification_report
    print(classification_report(y_test, y_pred))

The classification report calculates precision, recall (sensitivity), and the F1 score.
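These metrics can be recomputed by hand from the confusion matrix in the notebook output. The sketch below assumes scikit-learn's convention that rows are actual classes and columns are predicted classes, with labels in sorted order (approve, then deny):

```python
import numpy as np

# Confusion matrix from the notebook output: rows are actual classes,
# columns are predicted classes, both ordered as [approve, deny].
cm = np.array([[7, 5],
               [5, 8]])

true_pos = cm[0, 0]    # actual approve, predicted approve
false_neg = cm[0, 1]   # actual approve, predicted deny
false_pos = cm[1, 0]   # actual deny, predicted approve

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()

print(f"approve precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The results round to 0.58 for precision, recall, and F1 on the approve class and 0.60 for accuracy, matching the classification report.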

In summary, much of using an SVM model in practice follows the pattern we saw with logistic regression: split the dataset, create a model, train the model, create predictions, then validate the model.
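For readers without the loans file at hand, the whole pattern can be run end to end on synthetic data. This sketch swaps in scikit-learn's `make_classification` as a stand-in dataset, so its scores will not match the loans results:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the loans data: 100 samples, 5 features, 2 classes.
X, y = make_classification(n_samples=100, n_features=5, n_informative=3,
                           n_redundant=0, random_state=1)

# 1. Split the dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y)

# 2. Create and 3. train the model.
model = SVC(kernel="linear")
model.fit(X_train, y_train)

# 4. Create predictions.
y_pred = model.predict(X_test)

# 5. Validate the model.
accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
print(accuracy)
print(matrix)
print(classification_report(y_test, y_pred))
```

Because scikit-learn estimators share the same `fit` and `predict` interface, trying a different classifier only requires changing the line that creates the model.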

Skill Drill

Assess the performance of a logistic regression model, namely the precision, recall, and F1 scores for the approve category. Compare it with the performance of the SVM model. Which model performs better?
