# Welcome to the Dark Art of Coding:
## Introduction to Machine Learning
Support Vector Machines

<img src='../universal_images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to:

* Cover an overview of Support Vector Machines
* Examine code samples that walk us through **The Process**:
   * Prep the data
   * Choose the model
   * Choose appropriate hyperparameters
   * Fit the model
   * Apply the model
   * Examine the results
* Explore a deep dive into this model
* Review some gotchas that might complicate things
* Review tips related to learning more

# Overview: Support Vector Machines
---

Support Vector Machines (SVM) are popular machine learning models because they:
* can be used in both classification and regression
* can be used for 2-dimensional (2D) data as well as multi-dimensional data (3D, 4D, and more)
    * 2D -> uses a line/curve to separate the classes
    * 3D-plus -> uses a manifold/surface to separate the classes
* can be fairly easy to interpret

If we are given data points separated into two very distinct classes (or categories), we would probably find it pretty easy to draw a line between the two groups of points.

<img src='two_classes.png' width='400'>

But how do we know what is the best line to draw between the data points? The slope of the line could vary widely.

SVM takes the idea of drawing a simple line between two classes of data and adds a parallel margin on either side of the line, where each margin extends to the nearest point in its class. By maximizing the margin between the closest points in each class, we can select an optimal separating line. Thus SVMs qualify as `maximum margin` estimators.
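
As a rough sketch of the maximum margin idea, the cell below fits a linear `SVC` on a tiny made-up dataset (the four points are hypothetical, not the lesson data) and computes the margin width as `2 / ||w||`, where `w` is the normal vector of the separating line. A very large `C` is assumed here so that the fit behaves approximately like a hard-margin classifier.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two classes separated by a gap along the x-axis.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A linear kernel with a very large C approximates a hard-margin classifier.
model = SVC(kernel='linear', C=1e6)
model.fit(X, y)

w = model.coef_[0]        # normal vector of the separating line
b = model.intercept_[0]   # offset of the separating line

# The distance between the two margin lines is 2 / ||w||.
margin_width = 2 / np.linalg.norm(w)

print("w:", w, "b:", b)
print("margin width:", margin_width)
```

Because the two classes sit 2 units apart along the x-axis, the margin width should come out close to 2.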

From there, classifying new data is simply a matter of identifying which side of the line the point is on. Thus SVM is a form of discriminative classification.
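
To illustrate "which side of the line", here is a minimal sketch on hypothetical toy points (not the lesson data). For a two-class `SVC`, the sign of `decision_function` indicates the side of the separating line: positive values map to `classes_[1]` and negative values to `classes_[0]`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data with a clear gap between the classes.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])
model = SVC(kernel='linear', C=100).fit(X, y)

# Two new points, one on each side of the gap.
X_new = np.array([[0.5, 0.5], [1.8, 0.2]])

# Signed score relative to the separating line.
scores = model.decision_function(X_new)

# Translate the sign of the score into a class label by hand.
side = np.where(scores > 0, model.classes_[1], model.classes_[0])

print("scores: ", scores)
print("by sign:", side)
print("predict:", model.predict(X_new))
```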

For this example, we will use the Support Vector Classifier (`SVC`). The `sklearn.svm` module provides a number of classification and regression models:

* SVC
* SVR
* LinearSVC
* LinearSVR
* NuSVC
* NuSVR
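
The regression models follow the same fit/predict pattern as the classifiers. As a small sketch, assuming hypothetical noise-free data on the line `y = 2x`, `SVR` with a linear kernel could be used like this:

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical data: five points that lie exactly on y = 2x.
X = np.arange(5, dtype=float).reshape(-1, 1)
y = 2 * X.ravel()

# epsilon is the width of the tube within which errors are not penalized.
reg = SVR(kernel='linear', C=100, epsilon=0.1)
reg.fit(X, y)

y_fit = reg.predict(X)
print(y_fit)
```

The fitted values should stay within roughly `epsilon` of the true values, since errors inside that tube cost nothing.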

With this background, let's apply **Our Process** on a Support Vector Machine model.

## Prep the data

We start with a set of standard imports...

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn

# NOTE: during the Choose the Model step, we will import the 
#     model we want, but there is no reason you can't import it here.
# from sklearn.svm import SVC

### Prep the training Data

This data set is simply a sequence of (`x`, `y`) points to plot on a chart, with a `category` assigned to each point.

Each group is identified with a category of either `1` or `0`.

In [None]:
df = pd.read_csv('../universal_datasets/svm_train.csv', 
                 names=['x', 'y', 'category'])
df.head()

In [None]:
length = len(df)
X_train = df[['x', 'y']]
y_train = df['category']

It can be really useful to take a look at the features matrix and target array of the training data. 

* In the raw form
* In a visualization tool

For this dataset, let's use a scatter plot.

In [None]:
X_train[:5]

In [None]:
y_train[:5]

In [None]:
plt.title("A Gulf Between Red and Blue Dots")

plt.scatter(df['x'], df['y'], c=df['category'],
            cmap='seismic');


### Prep the test data

The test data is a similar set of vectors (x, y points) but there are no category labels/classifications.

In the following plot, we chose to set the alpha channel for the dots at 0.15, which makes the dots largely transparent and visually distinct from the opaque training dots plotted earlier.

In [None]:
df_test = pd.read_csv('../universal_datasets/svm_test.csv',
                     names=['x', 'y'])

X_test = df_test[['x', 'y']]

plt.scatter(X_test['x'], X_test['y'], alpha=0.15);

## Choose the Model

In this case, we have already decided to use the Support Vector Classifier (`SVC`) model, so importing it is straightforward. But if we aren't sure which model we want, we can always refer back to the [API Reference](https://scikit-learn.org/stable/modules/classes.html).

In [None]:
from sklearn.svm import SVC

## Choose Appropriate Hyperparameters

Here we choose to assign two hyperparameters: `gamma` and `C`. We will discuss both later.

In [None]:
model = SVC(gamma='scale', C=100)

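There are a number of hyperparameters, which can make this model more complicated to use well. Later, we will talk about `gamma`, `kernel` and `C` and leave the rest of the parameters for the student to explore. The signature below comes from an older scikit-learn release; from version 0.22 onward, the default for `gamma` is `'scale'`.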

```python
SVC(
    C=1.0,
    kernel='rbf',
    degree=3,
    gamma='auto_deprecated',
    coef0=0.0,
    shrinking=True,
    probability=False,
    tol=0.001,
    cache_size=200,
    class_weight=None,
    verbose=False,
    max_iter=-1,
    decision_function_shape='ovr',
    random_state=None,
)
```
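
To get a feel for what a value such as `C=100` does, the sketch below compares a small and a large `C` on hypothetical overlapping blobs (randomly generated, not the lesson data). A linear kernel is assumed here only because it exposes `coef_`, which lets us compute the margin width as `2 / ||w||`. A smaller `C` tolerates more misclassified or in-margin points and so usually yields a wider margin and more support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: two overlapping Gaussian blobs of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=1.0, scale=1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

margins = {}
for C in (0.01, 100):
    model = SVC(kernel='linear', C=C).fit(X, y)
    # Margin width for a linear kernel is 2 / ||w||.
    margins[C] = 2 / np.linalg.norm(model.coef_[0])
    print(f"C={C}: margin width={margins[C]:.2f}, "
          f"support vectors={model.n_support_.sum()}")
```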

## Fit the Model

In [None]:
model.fit(X_train, y_train)

## Apply the Model

In [None]:
y_pred = model.predict(X_test)

In [None]:
y_pred.shape

In [None]:
y_pred[:5]

## Examine the results

In [None]:
# look at the support vectors
model.support_vectors_

In [None]:
# look at the indices of the support vectors
model.support_ 

In [None]:
# If we wanted to confirm these indices match up to the 
#     input data, we can look up those rows by position...

X_train.iloc[model.support_]

In [None]:
# look at the number of support vectors for each class
model.n_support_
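
The three attributes above are related to each other. As a self-contained sketch on hypothetical toy points (not the lesson data), we can check that `support_vectors_` holds exactly the training rows listed in `support_`, and that `n_support_` adds up to the total number of support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: three points per class.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.5],
              [3.0, 0.5], [4.0, 0.0], [4.0, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = SVC(kernel='linear', C=100).fit(X, y)

print("indices:  ", model.support_)
print("vectors:\n", model.support_vectors_)
print("per class:", model.n_support_)

# The support vectors are the training rows at the support indices.
rows_match = np.array_equal(X[model.support_], model.support_vectors_)

# The per-class counts sum to the total number of support vectors.
counts_match = model.n_support_.sum() == len(model.support_)

print(rows_match, counts_match)
```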


In [None]:
plt.scatter(df['x'], df['y'], c=df['category'], 
            cmap='seismic')

plt.scatter(df_test['x'], df_test['y'], c=y_pred,
            cmap='seismic', alpha=0.15);

# Gotchas
---

# Deep Dive
---

In [None]:
gamma : float, optional (default='auto')
    Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.

    Current default is 'auto' which uses 1 / n_features,
    if ``gamma='scale'`` is passed then it uses 1 / (n_features * X.var())
    as value of gamma. The current default of gamma, 'auto', will change
    to 'scale' in version 0.22. 'auto_deprecated', a deprecated version of
    'auto' is used as a default indicating that no explicit value of gamma
    was passed.
    
kernel : string, optional (default='rbf')
    Specifies the kernel type to be used in the algorithm.
    It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or
    a callable.
    If none is given, 'rbf' will be used. If a callable is given it is
    used to pre-compute the kernel matrix from data matrices; that matrix
    should be an array of shape ``(n_samples, n_samples)``.
    
C : float, optional (default=1.0)
    Penalty parameter C of the error term.
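
The docstring above describes `gamma='scale'` as `1 / (n_features * X.var())`. As a hedged check, assuming a recent scikit-learn release and hypothetical random data, the sketch below computes that value by hand and confirms that passing it explicitly gives the same decision values as `gamma='scale'`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: 40 random points, labeled by the sign of x + y.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Manual version of gamma='scale': variance is taken over all entries of X.
gamma_manual = 1 / (X.shape[1] * X.var())

model_scale = SVC(kernel='rbf', gamma='scale').fit(X, y)
model_manual = SVC(kernel='rbf', gamma=gamma_manual).fit(X, y)

# If the formula matches, both models produce the same decision values.
same = np.allclose(model_scale.decision_function(X),
                   model_manual.decision_function(X))

print("manual gamma:", gamma_manual)
print("same decision values:", same)
```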

# How to learn more: tips and hints
---

In [None]:
model.get_params()
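
`get_params()` returns a plain dictionary, so individual hyperparameters can be looked up by name, and `set_params()` updates them on an existing estimator. A minimal sketch:

```python
from sklearn.svm import SVC

model = SVC(gamma='scale', C=100)

# Look up individual hyperparameters by name.
params = model.get_params()
print(params['C'], params['gamma'], params['kernel'])

# Change hyperparameters in place; the model must be refit afterwards.
model.set_params(C=1.0, kernel='linear')
updated = model.get_params()
print(updated['C'], updated['kernel'])
```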

# Experience Points!
---


In **`jupyter`** create a simple script to complete the following tasks:


**REPLACE THE FOLLOWING**

Create a function called `me()` that prints out 3 things:

* Your name
* Your favorite food
* Your favorite color

Lastly, call the function so that it executes when the script is run.

---
When you complete this exercise, please put your **green** post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

# Experience Points!
---


In **`jupyter`** create a simple script to complete the following tasks:

**REPLACE THE FOLLOWING**

Task | Sample Object(s)
:---|:---
Compare two items using `and` | 'Bruce', 0
Compare two items using `or` | '', 42
Use the `not` operator to make an object False | 'Selina' 
Compare two numbers using comparison operators | `>, <, >=, !=, ==`
Create a more complex/nested comparison using parentheses and Boolean operators| `('kara' _ 'clark') _ (0 _ 0.0)`

---
When you complete this exercise, please put your **green** post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

# Experience Points!
---


In your **text editor** create a simple script called:

```bash
my_lessonname_03.py
```

Execute your script on the command line using **`ipython`** via this command:

```bash
ipython -i my_lessonname_03.py
```

**REPLACE THE FOLLOWING**

I suggest that as you add each feature to your script that you run it right away to test it incrementally. 

1. Create a variable with your first name as a string AND save it with the label: `myfname`.
1. Create a variable with your age as an integer AND save it with the label: `myage`.

1. Use `input()` to prompt for your first name AND save it with the label: `fname`.
1. Create an `if` statement to test whether `fname` is equivalent to `myfname`. 
1. In the `if` code block: 
   1. Use `input()` to prompt for your age AND save it with the label: `age`.
   1. NOTE: don't forget to convert the value to an integer.
   1. Create a nested `if` statement to test whether `myage` and `age` are equivalent.
1. If both tests pass, have the script print: `Your identity has been verified`

When you complete this exercise, please put your **green** post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

# References
---

Below are references that may assist you in learning more:
    
|Title (link)|Comments|
|---|---|
|[General API Reference](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)||
|[SVM API Reference](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)||
|[User Guide](https://scikit-learn.org/stable/modules/svm.html#svm-classification)||