<a href="https://colab.research.google.com/github/WhiteHum/Application-security/blob/main/3_04_Support_Vector_Classifiers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Support Vector Classifiers

## Overview

In this lab, we will begin our supervised learning journey with a consideration of Support Vector Classifiers.  Our end-goal in this series of labs is to examine supervised learning tools that can be used as predictors or classifiers.

## Goals

This lab will focus only on Support Vector Classifiers.  The end-state of this lab leads directly into the next lab, which expands on this topic to Support Vector Machines.

* Understand how to build a Support Vector Classifier

 
## Estimated Time: 30 minutes

This lab and the three that follow it will all make use of the same BackBlaze dataset that we examined yesterday.  After getting set up to read that data, we will decide which fields are of interest and load that entire dataset into memory for our analysis.

# <img src="../images/task.png" width=20 height=20> Task 4.1

Please revisit lab 2.1.  Extract the portions of that notebook that are necessary to read the BackBlaze dataset into a dataframe.  Using this as a basis, load all of the data ***from the first twenty files***, preserving only the failure status, capacity, and normalized SMART values.

*Note:* You may wish to research the `usecols` keyword argument for `pandas.read_csv()`.  You may also wish to examine the `ignore_index` keyword argument of `pandas.concat()`.  The older `DataFrame.append()` convenience method was removed in pandas 2.0, so `pandas.concat()` is the safer choice.

import os
import pandas as pd
import numpy as np

def get_file_list(starting_directory="../data/data_Q4_2020/"):
    final_list = list()
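    # os.listdir() returns entries in arbitrary order; sort so "the first twenty files" is reproducible
    files = sorted(os.listdir(starting_directory))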
    for file in files:
        file_name = os.path.join(starting_directory, file)
        if os.path.isdir(file_name):
            final_list = final_list + get_file_list(file_name)
        else:
            final_list.append(file_name)
    return final_list

all_files = get_file_list()
columns = [
    'failure', 
    'capacity_bytes',
    'smart_1_normalized',
    'smart_2_normalized',
    'smart_3_normalized',
    'smart_4_normalized',
    'smart_5_normalized',
    'smart_7_normalized',
    'smart_8_normalized',
    'smart_9_normalized',
    'smart_10_normalized',
    'smart_11_normalized',
    'smart_12_normalized',
    'smart_13_normalized',
    'smart_15_normalized',
    'smart_16_normalized',
    'smart_17_normalized',
    'smart_18_normalized',
    'smart_22_normalized',
    'smart_23_normalized',
    'smart_24_normalized',
    'smart_168_normalized',
    'smart_170_normalized',
    'smart_173_normalized',
    'smart_174_normalized',
    'smart_175_normalized',
    'smart_177_normalized',
    'smart_179_normalized',
    'smart_180_normalized',
    'smart_181_normalized',
    'smart_182_normalized',
    'smart_183_normalized',
    'smart_184_normalized',
    'smart_187_normalized',
    'smart_188_normalized',
    'smart_189_normalized',
    'smart_190_normalized',
    'smart_191_normalized',
    'smart_192_normalized',
    'smart_193_normalized',
    'smart_194_normalized',
    'smart_195_normalized',
    'smart_196_normalized',
    'smart_197_normalized',
    'smart_198_normalized',
    'smart_199_normalized',
    'smart_200_normalized',
    'smart_201_normalized',
    'smart_202_normalized',
    'smart_206_normalized',
    'smart_210_normalized',
    'smart_218_normalized',
    'smart_220_normalized',
    'smart_222_normalized',
    'smart_223_normalized',
    'smart_224_normalized',
    'smart_225_normalized',
    'smart_226_normalized',
    'smart_231_normalized',
    'smart_232_normalized',
    'smart_233_normalized',
    'smart_234_normalized',
    'smart_235_normalized',
    'smart_240_normalized',
    'smart_241_normalized',
    'smart_242_normalized',
    'smart_245_normalized',
    'smart_247_normalized',
    'smart_248_normalized',
    'smart_250_normalized',
    'smart_251_normalized',
    'smart_252_normalized',
    'smart_254_normalized',
    'smart_255_normalized'
]
# write the code below 
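
One possible way to approach this task is sketched below.  It is only a sketch: the helper name `load_drive_data` and its `max_files` parameter are our own inventions, and it assumes that the file list contains `.csv` files whose sorted names correspond to the order we want.

```python
import pandas as pd


def load_drive_data(file_paths, columns, max_files=20):
    """Read the first `max_files` CSV files, keeping only the named columns.

    `load_drive_data` is a hypothetical helper, not part of lab 2.1.
    """
    csv_paths = sorted(p for p in file_paths if p.endswith(".csv"))
    frames = []
    for path in csv_paths[:max_files]:
        # A callable `usecols` keeps only the listed columns and
        # tolerates files that are missing some of them.
        frames.append(pd.read_csv(path, usecols=lambda name: name in columns))
    # ignore_index=True renumbers the rows 0..n-1 rather than
    # repeating each file's own row numbers.
    return pd.concat(frames, ignore_index=True)
```

With the cell above already run, a call such as `df = load_drive_data(all_files, columns)` would produce the master dataframe.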



Now that we have the data loaded, we'd like to work out a way to predict drive failures.  Our idea is that there may be some boundary in the feature space that separates the failed drives from the drives that have not failed.  In other words, we are hoping that the two classes are linearly separable.

To accomplish this, we need training data, which we now have in abundance.  Our training data isn't ideal, though: the classes are heavily imbalanced, with only just over 400 rows in total indicating that a drive has failed.

In order to be able to evaluate how well our classification works, we'd like to have some ground truth data to test it out with.  To do this, let's isolate several rows that contain failed drives and an equal number of rows with drives that have not failed.  We will set these aside, deleting them from the dataframe.

# <img src="../images/task.png" width=20 height=20> Task 4.2

Select 10 rows containing failed drives and 10 rows containing drives that have not failed.  Combine this data into a new testing dataframe and delete those rows from the master dataframe.
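
A minimal sketch of one way to do this follows.  The function name `split_off_test_rows` and its parameters are hypothetical, and it assumes that the dataframe has a unique index (as it would if it were built with `ignore_index=True`) and at least 10 rows of each class.

```python
import pandas as pd


def split_off_test_rows(df, label="failure", n_per_class=10, seed=0):
    """Return (train_df, test_df) with n_per_class rows of each class held out."""
    # Sample rows at random from each class; a fixed seed keeps the split repeatable.
    failed = df[df[label] == 1].sample(n=n_per_class, random_state=seed)
    healthy = df[df[label] == 0].sample(n=n_per_class, random_state=seed)
    test_df = pd.concat([failed, healthy])
    # Dropping by index label removes exactly the held-out rows,
    # which is why the index must be unique.
    train_df = df.drop(index=test_df.index)
    return train_df, test_df
```

Returning a new training dataframe rather than deleting rows in place makes it easy to rerun the cell without losing more data each time.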

We're now ready to try to build a Support Vector Classifier.  To do so, we need to load support for SVCs from scikit-learn.  We also need to convert our dataframe into a NumPy array and reorganize our data slightly.

It is traditional in statistical and machine learning to use X to represent the training data and Y to represent the training labels or expected outputs.  Currently, our labels (whether or not the drive has failed) are embedded in our training data.  We need to pull those apart.

# <img src="../images/task.png" width=20 height=20> Task 4.3

Use the cell that follows to import `SVC` from `sklearn.svm`.  Additionally, convert the training data in our dataframe to the expected X and Y arrays, isolating the labels from the data.  Along the way, we need to eliminate any `NaN` values.

Take this opportunity to process the testing dataframe in the same way, creating `x_test` and `y_test`.
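
The conversion might look something like the sketch below.  The helper name `to_arrays` is hypothetical, and replacing every `NaN` with 0 is an assumption on our part; it is only one of several ways to eliminate missing values, and it treats "not reported" the same as a reading of zero.

```python
import numpy as np
import pandas as pd


def to_arrays(df, label="failure"):
    """Split a dataframe into a feature matrix x and a label vector y."""
    # Assumption: missing SMART readings are replaced with 0.
    cleaned = df.fillna(0)
    y = cleaned[label].to_numpy()
    x = cleaned.drop(columns=[label]).to_numpy(dtype=float)
    return x, y
```

Using the same function for both dataframes, for example `x, y = to_arrays(train_df)` and `x_test, y_test = to_arrays(test_df)`, ensures that the training and testing data are processed identically.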

It's time to train our classifier.  To do so, we instantiate an `SVC` object and call its `fit()` method on the training data:

```
svc = SVC(kernel='linear')
svc.fit(x, y)
```

# <img src="../images/task.png" width=20 height=20> Task 4.4

Use the following cell to create and train a linear Support Vector Classifier.  Since these can take a *very* long time to train, limit the model to train on the rows in the range 60000 to 70000.  If you are impatient or pressed for time, consider reducing this to a much smaller range.  Just ensure that your training sample includes both failed and non-failed drives!
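
The sketch below shows one way to guard against a slice that contains only one class before training.  The helper `train_on_slice` is hypothetical, and the demonstration data is synthetic rather than drawn from the BackBlaze dataset.

```python
import numpy as np
from sklearn.svm import SVC


def train_on_slice(x, y, start, stop):
    """Fit a linear SVC on rows start..stop-1, refusing single-class slices."""
    x_part, y_part = x[start:stop], y[start:stop]
    if np.unique(y_part).size < 2:
        raise ValueError("training slice must contain both classes")
    svc = SVC(kernel="linear")
    svc.fit(x_part, y_part)
    return svc


# Synthetic stand-in data: two well-separated clusters of 50 points each.
rng = np.random.default_rng(0)
x_demo = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
y_demo = np.array([0] * 50 + [1] * 50)
model = train_on_slice(x_demo, y_demo, 25, 75)
```

For the lab data, the equivalent call would be something like `train_on_slice(x, y, 60000, 70000)`.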

We'd like to evaluate the classifier to see how well it does.  Honestly, we aren't expecting a linear classifier to do very well at all, so we won't get our hopes up...

# <img src="../images/task.png" width=20 height=20> Task 4.5

Use the `predict()` method on your SVC model, passing it the `x_test` data.  The returned array is the classification predictions for that data.  Compare the predicted labels to the known labels in `y_test` to determine the accuracy.
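
Accuracy here is simply the fraction of predictions that match the known labels.  A minimal sketch, using a hypothetical helper named `accuracy` and made-up labels, might be:

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the known labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Made-up labels for illustration: three of the four predictions are correct.
demo_score = accuracy([1, 1, 0, 0], [1, 0, 0, 0])
```

With a trained model, this would be called as `accuracy(y_test, svc.predict(x_test))`.  scikit-learn's `sklearn.metrics.accuracy_score` computes the same quantity.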

# Conclusion

There is more to say about Support Vector Classifiers.  We will continue this discussion in the next lab, picking up where we are leaving off here.  Still, there are some important takeaways.

We again find that preprocessing the data is more time consuming and attention intensive than creating the model.  We also discover that, as simple as these classifiers are, the training time grows with the number of samples that we are processing.

While this is intuitive and may seem obvious, you will find that other types of models have a very different set of training characteristics.  While we always expect more data to take more time to train on, is the increase in time exponential?  Is it linear?  Is it logarithmic?  Anything linear or worse can have a very big impact on our ability to repeatedly train a model.  In the context of Support Vectors this is important because the general practice is to perform a grid search to select the parameters for the classifier.  This requires that we retrain the model once for every combination of parameter values that we include in our search, which can be very time consuming.
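
To get a feel for how quickly a grid search multiplies the work, the sketch below counts the training runs for a small grid.  The parameter values and the fold count are made up for illustration; `C` and `gamma` are real `SVC` parameters.

```python
from itertools import product

# Hypothetical grid: 4 values of C and 3 values of gamma.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1],
}

# Every combination of one value per parameter needs its own training run.
combinations = list(product(*param_grid.values()))

# Assumption: each combination is also cross-validated with 5 folds.
n_folds = 5
total_fits = len(combinations) * n_folds
print(f"{len(combinations)} combinations, {total_fits} training runs")
```

Even this tiny grid requires dozens of training runs, so a model that takes minutes to fit once can easily take hours to tune.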

Why is this particular example so slow?  Think about what is required.  In the simplest view, the classifier must compare every point with every other point to work out a decision boundary that separates the data.  The scikit-learn documentation notes that the fit time for `SVC` scales at least quadratically with the number of samples.  If we have 100,000 samples with 73 dimensions each, that adds up to a *lot* of processing!
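
A back-of-the-envelope count makes the point.  This is only a rough upper bound under the "every point against every other point" simplification; real solvers use caching and other shortcuts, so they do not necessarily perform every one of these comparisons.

```python
n_samples = 100_000
n_features = 73

# Number of distinct pairs of points: n * (n - 1) / 2.
pairs = n_samples * (n_samples - 1) // 2

# Comparing one pair touches every feature once.
coordinate_ops = pairs * n_features

print(f"{pairs:,} pairs of points")
print(f"{coordinate_ops:,} per-feature operations")
```

Doubling the number of samples roughly quadruples the number of pairs, which is why trimming the training range has such a large effect on training time.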