# Welcome to the Dark Art of Coding:
## Introduction to Machine Learning
Special Topics

<img src='../universal_images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to:

* Understand the use of the `PolynomialFeatures` class
* Explore the use of Pipelines to create a workflow of transforms with a final estimator
* Use PolynomialFeatures in a Pipeline to explore under- and overfitting

# Overview: PolynomialFeatures
---

## PolynomialFeatures

The PolynomialFeatures class has a `.fit_transform()` method that transforms input values into a series of output values. These values are often used as inputs in other models.

PolynomialFeatures generates a new feature matrix that has all polynomial combinations of the original features with a degree less than or equal to the specified degree. 

As an example: 

If an input sample has two dimensions (i.e. $[a, b]$), the resulting degree-2 polynomial features will be $[1, a, b, a^2, ab, b^2]$.
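To make the symbols concrete, here is a minimal sketch that plugs in numbers. The values `a = 2` and `b = 3` are arbitrary choices for illustration only.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features: a = 2, b = 3 (illustrative values)
sample = np.array([[2, 3]])

poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(sample)

# Columns follow the order 1, a, b, a^2, ab, b^2
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]
```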

We start with some standard imports:
    

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
from sklearn.preprocessing import PolynomialFeatures

Notice the leading `1` in that result: it is the bias (degree-0) term. If you don't want the leading `1`, it can be *turned off* by using the `include_bias=False` argument.

Let's start with a three element matrix:

In [None]:
X = np.arange(3).reshape(3, 1)
X

The simplest case, `degree=1`, essentially returns the original array. Notice, however, that the transformer also returns a column of `1`s alongside the original matrix.

In [None]:
poly = PolynomialFeatures(1)
poly.fit_transform(X)

Yields $1, a$ for each element in the X matrix

If you want to have a features matrix that doesn't include the column of `1`s, you can avoid it by using the `include_bias=False` argument.

Including a bias column acts as an intercept term in a linear model.

In [None]:
poly = PolynomialFeatures(1, include_bias=False)
poly.fit_transform(X)
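The link between the bias column and the intercept can be checked directly. The sketch below uses a made-up dataset (`y = 3x + 2`) and hypothetical variable names. It fits the same line two ways: once with the bias column and no model intercept, and once without the bias column but with the model's own intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Small made-up dataset: y = 3x + 2
X_demo = np.arange(5).reshape(-1, 1)
y_demo = 3 * X_demo.ravel() + 2

# Option A: keep the bias column and turn off the model's own intercept
X_with_bias = PolynomialFeatures(1, include_bias=True).fit_transform(X_demo)
model_a = LinearRegression(fit_intercept=False).fit(X_with_bias, y_demo)

# Option B: drop the bias column and let the model fit an intercept
X_no_bias = PolynomialFeatures(1, include_bias=False).fit_transform(X_demo)
model_b = LinearRegression().fit(X_no_bias, y_demo)

# In option A, the coefficient on the column of 1s plays the intercept's role
print(model_a.coef_)                      # approximately [2, 3]
print(model_b.intercept_, model_b.coef_)  # approximately 2 and [3]
```

Because `LinearRegression` fits an intercept by default, the pipelines later in this lesson pass `include_bias=False`.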

In [None]:
poly = PolynomialFeatures(2)
poly.fit_transform(X)

Yields $1, a, a^2$ for each element in the X matrix

In [None]:
poly = PolynomialFeatures(4)
poly.fit_transform(X)

Yields $1, a, a^2, a^3, a^4$ for each element in the X matrix

In [None]:
X2 = np.arange(6).reshape(3, 2)
X2

In [None]:
poly = PolynomialFeatures(1)
poly.fit_transform(X2)

Yields $1, a, b$ for each row in the `X2` matrix

In [None]:
poly = PolynomialFeatures(2)
poly.fit_transform(X2)

Yields $1, a, b, a^2, ab, b^2$ for each row in the `X2` matrix

In [None]:
poly = PolynomialFeatures(3)
poly.fit_transform(X2)

#         1     a     b     a^2   ab   b^2   a^3  a^2*b a*b^2 b^3

Yields $1, a, b, a^2, ab, b^2, a^3, a^2b, ab^2, b^3$ for each row in the `X2` matrix

Thus for any degree that we feed into PolynomialFeatures, we can transform an input matrix into a higher-order feature matrix, which may allow a downstream model to predict `y` values from `x` values more precisely.

Why does this matter? If you recall from your math days, it is possible to create very sophisticated curves using formulas such as these:

$$
y = mx + b   \\
y = ax^2 + bx + c \\
y = ax^3 + bx^2 + cx + d \\
y = ax^4 + bx^3 + cx^2 + dx + e \\
$$

With every additional term, and with appropriate coefficients, you can match a wider array of datasets.

PolynomialFeatures helps you to generate matrices with multiple degrees so that you can run them through models like the LinearRegression model to identify the coefficients and intercept values.
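As a rough illustration of that idea, the sketch below builds noise-free data from a known quadratic ($y = 2x^2 - 3x + 1$, chosen arbitrarily) and checks that LinearRegression recovers its coefficients from the polynomial features. The variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Noise-free samples of y = 2x^2 - 3x + 1 (an arbitrary example curve)
x_demo = np.linspace(-2, 2, 20).reshape(-1, 1)
y_demo = 2 * x_demo.ravel() ** 2 - 3 * x_demo.ravel() + 1

# Expand x into the columns [x, x^2]; the model supplies the intercept itself
features = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x_demo)
model = LinearRegression().fit(features, y_demo)

print(model.coef_)       # approximately [-3, 2], the weights on x and x^2
print(model.intercept_)  # approximately 1
```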

With that, we will turn our attention to a new topic, **Pipelines**, but will come back to PolynomialFeatures momentarily.

# Overview: Pipelines
---

In some cases, it might be necessary to transform the data in some way before feeding it into a particular machine learning model.

The data may need to be scaled, changed into another format, etc.

In the example we just looked at, we used the PolynomialFeatures class to generate a higher-degree feature matrix.

Pipelines allow you to feed inputs into one "end" of a series of components and get transformations or predictions out the other end, without having to take the output of one model and manually drop it into the inputs of the next model.

The following example uses PolynomialFeatures to expand the inputs into polynomial features of a chosen degree. It then feeds the results of that transformation into the LinearRegression model.

The Pipeline simplifies things so that we only have to call `.fit()` once on the pipeline.
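To see what the Pipeline saves us, here is a small sketch comparing the manual two-step workflow with the equivalent Pipeline. The data is synthetic (a sine curve, chosen only for illustration), and the step names `"poly"` and `"linreg"` are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(20, 1))
y_demo = np.sin(2 * np.pi * X_demo.ravel())
X_new = np.linspace(0, 1, 5).reshape(-1, 1)

# Manual workflow: transform, fit, then transform again before predicting
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X_demo)
model = LinearRegression().fit(X_poly, y_demo)
manual_pred = model.predict(poly.transform(X_new))

# Pipeline workflow: one .fit() and one .predict()
pipe = Pipeline([("poly", PolynomialFeatures(degree=3, include_bias=False)),
                 ("linreg", LinearRegression())])
pipe.fit(X_demo, y_demo)
pipe_pred = pipe.predict(X_new)

print(np.allclose(manual_pred, pipe_pred))  # True
```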

## A first trivial example...

## Prep the data

Start with some standard imports

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn

### Prep the training and test data

In [None]:
df = pd.read_csv('../universal_datasets/skincancer.txt',
                 sep=r'\s+',  # whitespace-delimited; delim_whitespace is deprecated in newer pandas
                 header=0,
                 names=['state', 'lat', 'mort', 'ocean', 'long'])
df.head()

In [None]:
X = df['lat'].to_frame()
y = df['mort']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
plt.scatter(X_train, y_train)
plt.title("Mortality vs Latitude")
plt.xlabel("Latitude")
plt.ylabel("Number of deaths");

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

**NOTE**: for this example, we are simply gonna pass the input data through unchanged rather than raise the degree, so we will use `degree=1`. In a moment, we will look at tweaking the degree to explore underfitting and overfitting. In this first example, I merely want to focus on putting the Pipeline together.

In [None]:
polynomial_features = PolynomialFeatures(degree=1,
                                         include_bias=False)

In [None]:
linear_regression = LinearRegression()

This is where the magic comes into play. By providing as an argument to the Pipeline a list containing a series of tuples, we can establish which models to call and in what order.

Each tuple is a step in the pipeline.
Each tuple consists of a name for that step and the transformer or model to call during that step.

Every step except the last must be a transformer: it needs a `.transform()` method, along with `.fit()` or `.fit_transform()`. As we have seen, PolynomialFeatures does indeed have a `.fit_transform()` method.
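To illustrate the rule, this sketch deliberately puts the steps in the wrong order, with a LinearRegression (which has no `.transform()` method) in a non-final position. The toy data is made up. The construction and the fit are both wrapped in the `try` block because the check may run at either point depending on the scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up toy data
X_demo = np.arange(6, dtype=float).reshape(-1, 1)
y_demo = 2 * X_demo.ravel()

error_raised = False
try:
    # Wrong order on purpose: LinearRegression cannot act as an intermediate step
    bad_pipeline = Pipeline([("linear_r", LinearRegression()),
                             ("poly_f", PolynomialFeatures(2))])
    bad_pipeline.fit(X_demo, y_demo)
except TypeError as exc:
    error_raised = True
    print("Rejected:", exc)
```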

In [None]:
pipeline = Pipeline([("poly_f", polynomial_features),
                     ("linear_r", linear_regression)])

NOTE: in the next cell, we simply call `.fit()` on the Pipeline. We don't have to call the `.fit_transform()` method on the PolynomialFeatures at all; the Pipeline does it automagically.

In [None]:
pipeline.fit(X_train, y_train)

Now that our model has been fit, we simply call `.predict()`, like normal.

In [None]:
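# Store predictions under a new name so the true y_test labels are not overwritten
y_pred = pipeline.predict(X_test)

Of course, let's take a quick look via a chart.

In [None]:
plt.plot(X_test, y_pred, label="Model")
plt.scatter(X_train, y_train, label="Training data")
plt.legend();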

## An example of under/overfitting

Now that we have a sense for how we can use a Pipeline, we are gonna create one and use it to explore the phenomena of **Underfitting** and **Overfitting**.

A risk in machine learning is using a model that doesn't match the data well enough (**underfitting**) OR one that matches the training data so closely that it fails to generalize to test data (**overfitting**).

For this example, we will look at three graphs. This example comes from the Scikit Learn [Underfitting/Overfitting documentation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html), with various degrees of modification by me.

We will do this process three times using `degree=` of `1`, `4`, and `15` to demonstrate underfitting, a good fit, and overfitting.

Two of these cases will generate linear regressions that are not straight lines.

In the example, they create a function (`true_fun`) that generates a series of points on a graph in the shape of a cosine curve.

## Prep the training and test data

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

In [None]:
def true_fun(X):
    return np.cos(1.5 * np.pi * X)

Using 30 random values as `X` inputs, they use the function to generate 30 related `y` values.

In [None]:
np.random.seed(0)

n_samples = 30

x = np.sort(np.random.rand(n_samples))
y = true_fun(x) + np.random.randn(n_samples) * 0.1

Let's look at X and y.

In [None]:
X = x[:, np.newaxis]

X[:5]

In [None]:
y[:5]

In [None]:
plt.scatter(X, y)
plt.title("Cosine Dots");

In [None]:
X_test = np.linspace(0.05, 1, 100)[:, np.newaxis]

## Choose Appropriate Hyperparameters

Let's:
* start with PolynomialFeatures **degree of 1**
* use the default values for LinearRegression
* feed each into our Pipeline

In [None]:
polynomial_features = PolynomialFeatures(degree=1,
                                         include_bias=False)
linear_regression = LinearRegression()
pipeline = Pipeline([("polynomial_features", polynomial_features),
                     ("linear_regression", linear_regression)])

## Fit the Model

We only have to call `.fit()` on the pipeline, not on each of the components in the pipeline.

In [None]:
pipeline.fit(X, y)

## Apply the Model

In [None]:
y_test = pipeline.predict(X_test)

## Examine the results

In [None]:
plt.plot(X_test, y_test, label="Model")
plt.plot(X_test, true_fun(X_test), label="True function")

plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
plt.legend()
plt.title("Underfit");    

## Choose Appropriate Hyperparameters

Repeating the process to generate polynomial features of **degree 4**:

In [None]:
polynomial_features = PolynomialFeatures(degree=4,
                                         include_bias=False)
linear_regression = LinearRegression()
pipeline = Pipeline([("polynomial_features", polynomial_features),
                     ("linear_regression", linear_regression)])

## Fit the Model

In [None]:
pipeline.fit(X, y)

## Apply the Model

In [None]:
y_test = pipeline.predict(X_test)

## Examine the results

In [None]:
plt.plot(X_test, y_test, label="Model")
plt.plot(X_test, true_fun(X_test), label="True function")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
plt.legend()
plt.title("Good match");    

## Choose Appropriate Hyperparameters

Lastly, let's generate polynomial features of **degree 15**:

In [None]:
polynomial_features = PolynomialFeatures(degree=15,
                                         include_bias=False)
linear_regression = LinearRegression()
pipeline = Pipeline([("polynomial_features", polynomial_features),
                     ("linear_regression", linear_regression)])

## Fit the Model

In [None]:
pipeline.fit(X, y)

## Apply the Model

In [None]:
y_test = pipeline.predict(X_test)

## Examine the results

In [None]:
plt.plot(X_test, y_test, label="Model")
plt.plot(X_test, true_fun(X_test), label="True function")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
plt.legend()
plt.title("Overfit");    
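The three charts give a visual impression of the fit. As a rough numerical companion, the sketch below scores each degree with 10-fold cross-validation, an approach borrowed from the scikit-learn example linked earlier. It regenerates the same cosine data so that it can run on its own. The dictionary name `mse_by_degree` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Recreate the noisy cosine samples used above
np.random.seed(0)
n_samples = 30
x = np.sort(np.random.rand(n_samples))
y = np.cos(1.5 * np.pi * x) + np.random.randn(n_samples) * 0.1
X = x[:, np.newaxis]

mse_by_degree = {}
for degree in (1, 4, 15):
    pipeline = Pipeline([("polynomial_features",
                          PolynomialFeatures(degree=degree, include_bias=False)),
                         ("linear_regression", LinearRegression())])
    # scikit-learn reports negated MSE, so flip the sign back
    scores = cross_val_score(pipeline, X, y,
                             scoring="neg_mean_squared_error", cv=10)
    mse_by_degree[degree] = -scores.mean()
    print(f"degree={degree:2d}  mean CV MSE={mse_by_degree[degree]:.3g}")
```

A lower mean squared error on held-out folds suggests a model that generalizes better. On this data, degree 4 should score best of the three.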

# Gotchas
---

# Deep Dive
---

## Pipeline

The Pipeline class accepts any number of models as input and creates a sequence of steps.

All models except the last must be transformers, i.e. have a `.transform()` method (plus `.fit()` or `.fit_transform()`) whose output is an appropriate matrix to feed into the next model in the pipeline.

Once a pipeline is created, the user only needs to call the `.fit()` and `.predict()` methods once on the pipeline.



To create a Pipeline, we first instantiate any of the models we want to use, just as if we were creating standalone models.

```python
polynomial_features = PolynomialFeatures(degree=15,
                                         include_bias=False)
linear_regression = LinearRegression()
```

Next we provide a `list` of `tuples` to the Pipeline class. Each tuple is a (name, model) pair: the name is what we want to call that step of the pipeline, and the model is the object we want to use at that step:

```python
pipeline = Pipeline([("polynomial_features", polynomial_features),
                     ("linear_regression", linear_regression)])
```
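The step names are more than labels. As a short sketch, they can be used to look up a step through `named_steps` and to change a hyperparameter through `set_params()`, using the `<step name>__<parameter>` convention:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

pipeline = Pipeline([("polynomial_features",
                      PolynomialFeatures(degree=15, include_bias=False)),
                     ("linear_regression", LinearRegression())])

# Look up a step by the name given in its tuple
print(pipeline.named_steps["polynomial_features"].degree)  # 15

# Change a step's hyperparameter using <step name>__<parameter>
pipeline.set_params(polynomial_features__degree=4)
print(pipeline.named_steps["polynomial_features"].degree)  # 4
```

This can be handy when repeating an experiment at several degrees, since the Pipeline does not need to be rebuilt each time.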

With a Pipeline in hand, we simply call `.fit()` just as we would for any model.

```python
# X is already a 2D column (X = x[:, np.newaxis]), so it is passed as-is
pipeline.fit(X, y)
```

Jupyter will output the Pipeline parameters for us. We can see each of the steps we defined, in the correct order, along with the hyperparameters that we provided. The exact text of this output varies between scikit-learn versions.

```python
Pipeline(memory=None,
     steps=[('polynomial_features', PolynomialFeatures(degree=15,
             include_bias=False, interaction_only=False)), 
            ('linear_regression', LinearRegression(copy_X=True,
             fit_intercept=True, n_jobs=None,
             normalize=False))])
```

# How to learn more: tips and hints
---


# References
---

Below are references that may assist you in learning more:
    
|Title (link)|Comments|
|---|---|
|[General API Reference](https://scikit-learn.org/stable/modules/classes.html)||