# Advanced Pipelines

We now have a fairly strong repertoire of regression models. Depending on the data set, a number of preprocessing steps may be needed before fitting a model. We've learned basic pipelines and out-of-the-box transformer objects, but you may need to perform preprocessing tasks that are too complicated for these simple tools.

## What We'll Accomplish in This Notebook

- Review the differences between, and the need for, `fit`, `transform`, and `fit_transform`
- Introduce the popular California Housing Data Set
- Demonstrate how to construct custom transformer objects for more advanced pipelines

In [None]:
## Import packages

## For data handling
import pandas as pd
import numpy as np

## For plotting
import matplotlib.pyplot as plt
import seaborn as sns

## This sets the plot style
## to have a grid on a white background
sns.set_style("whitegrid")

## `fit`, `transform`, and `fit_transform`

Hopefully you remember the terms `fit`, `transform`, and `fit_transform` from the `Basic Pipelines` notebook. Let's return to the `StandardScaler` object as a reminder.

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html</a>

In [None]:
from sklearn.preprocessing import StandardScaler

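From the documentation listed above we know that the standard scaler will take in the features, `X`, and scale each feature column $X_i$ by subtracting its mean, $\overline{X_i}$, and dividing by its standard deviation, $s_{X_i}$: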
$$
\frac{X_i - \overline{X_i}}{s_{X_i}}.
$$

Let's generate some data.

In [None]:
X = 10*np.random.randn(100,1)-5

In [None]:
print("The mean of X is",np.mean(X))
print("The variance of X is",np.var(X))

Now we'll scale $X$.

In [None]:
## first we make a scaler object
scaler = StandardScaler()

In [None]:
## Then we fit it
scaler.fit(X)

print("The scaler was fit to have mean",scaler.mean_)
print("and variance",scaler.var_)

In [None]:
## Then we transform the data, aka scale it
X_scaled = scaler.transform(X)

In [None]:
print("The mean of X_scaled is",np.mean(X_scaled))
print("The standard deviation of X_scaled is",np.std(X_scaled))
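As a sanity check, the scaling formula above can be reproduced by hand. The sketch below is illustrative only: it uses its own small array (`X_demo` and `demo_scaler` are names made up for this example) and relies on the fitted attributes `mean_` and `scale_`. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also the default for `np.std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A small illustrative feature column (made-up data)
rng = np.random.default_rng(440)
X_demo = 10 * rng.standard_normal((100, 1)) - 5

# Fit the scaler, then scale with sklearn
demo_scaler = StandardScaler()
demo_scaler.fit(X_demo)
X_demo_scaled = demo_scaler.transform(X_demo)

# Scale by hand using the fitted mean and standard deviation
X_demo_manual = (X_demo - demo_scaler.mean_) / demo_scaler.scale_

# The two results agree up to floating point error
print(np.allclose(X_demo_scaled, X_demo_manual))
```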

Now let's imagine we're ready to check the test error on our model. So we have to scale the test features.

In [None]:
X_test = 10*np.random.randn(100,1)-5.1

In [None]:
np.shape(X_test)

In [None]:
print("The mean of X_test is",np.mean(X_test))
print("The variance of X_test is",np.var(X_test))

Now what code should we write to scale the test data?

In [None]:
X_test_scaled = scaler.transform(X_test)

print(np.mean(X_test_scaled))

print(np.var(X_test_scaled))

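The order in which these steps are performed is important: fit the scaler on the training data, then use that fitted scaler to transform both the training data and the test data.

This is because you only fit the model on the training data, and the scaler (like any other preprocessing step) is thought of as part of the model.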
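To make this concrete, here is a small sketch with made-up data (the names `X_tr`, `X_te`, `scaler_a`, and `scaler_b` are placeholders, not objects from this notebook). It shows that `fit_transform` is shorthand for `fit` followed by `transform` on the same data, and that the test set is only ever passed to `transform`, so it is scaled with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test features
rng = np.random.default_rng(0)
X_tr = 10 * rng.standard_normal((100, 1)) - 5
X_te = 10 * rng.standard_normal((100, 1)) - 5.1

# Option 1: fit, then transform
scaler_a = StandardScaler()
scaler_a.fit(X_tr)
X_tr_a = scaler_a.transform(X_tr)

# Option 2: fit_transform does both steps in one call
scaler_b = StandardScaler()
X_tr_b = scaler_b.fit_transform(X_tr)

# Both options give the same scaled training data
print(np.allclose(X_tr_a, X_tr_b))

# The test set is only transformed, never fit
X_te_scaled = scaler_b.transform(X_te)

# This matches scaling by hand with the training statistics
X_te_manual = (X_te - X_tr.mean()) / X_tr.std()
print(np.allclose(X_te_scaled, X_te_manual))
```

Because the test set is scaled with the training statistics, its scaled mean and variance will generally be close to, but not exactly, 0 and 1.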

Let's do a short practice.

### You Code

#### A New Scaler

Go to the documentation and read about the `MinMaxScaler` object, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html</a>.

Use the `MinMaxScaler` to scale the following training and test data.

In [None]:
## Your train and test data
X_train = np.random.randint(1,1000,1000)
X_test = np.random.randint(1,1000,1000)

In [None]:
## Import MinMaxScaler here



In [None]:
## Fit and transform the training and test data 
## using a MinMaxScaler here



#### Imputing Values

Sometimes your data may have missing values. It is often bad practice to throw away observations with missing values; one option is to instead <i>impute</i> them.

Imputation is when you use the non-missing values to fill in the missing values. Three simple ways would be to replace the missing values with the mean, median, or mode of the training data.

Here is the documentation on the `SimpleImputer`, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html">https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html</a>.
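Before working with the larger arrays, here is a minimal sketch of the three strategies on a tiny made-up column (`x_demo`, `imp_demo`, and `fill_values` are names invented for this example). After fitting, the `statistics_` attribute holds the value that will be used to fill the gaps.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A tiny column with two missing entries (made-up data)
x_demo = np.array([1.0, 2.0, 2.0, 7.0, np.nan, np.nan]).reshape(-1, 1)

fill_values = {}
for strategy in ["mean", "median", "most_frequent"]:
    imp_demo = SimpleImputer(strategy=strategy)
    filled = imp_demo.fit_transform(x_demo)
    # statistics_ stores the fill value learned during fit
    fill_values[strategy] = float(imp_demo.statistics_[0])
    print(strategy, "->", filled.ravel())

print(fill_values)
```

The observed values 1, 2, 2, 7 have mean 3 and both median and mode 2, so those are the values used to fill the two missing entries.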

We'll now impute the missing values on the following data using the median of the data.

In [None]:
## Here is some data
X_train = np.random.randn(1000)
X_test = np.random.randn(1000)

## With 20 values missing in each
## (replace=False ensures 20 distinct positions are chosen)
X_train[np.random.choice(range(1000), 20, replace=False)] = np.nan
X_test[np.random.choice(range(1000), 20, replace=False)] = np.nan

In [None]:
## Import the SimpleImputer
from sklearn.impute import SimpleImputer

In [None]:
## Make the imputer object with the desired "strategy"
imp = SimpleImputer(strategy = 'median')

print("X_train has", np.isnan(X_train).sum(), "missing values.")

## impute the missing values

# first fit the imputer
imp.fit(X_train.reshape(-1,1))

# then transform
X_train_imp = imp.transform(X_train.reshape(-1,1))

print("After imputing X_train has", np.isnan(X_train_imp).sum(), "missing values.")



In [None]:
## Now impute on the test data
## note that we don't use the "fit" step here
print("X_test has", np.isnan(X_test).sum(), "missing values.")

X_test_imp = imp.transform(X_test.reshape(-1,1))

print("After imputing X_test has", np.isnan(X_test_imp).sum(), "missing values.")

## The California Housing Data Set

We'll now introduce a popular machine learning data set, the California Housing data set. The data is used in the book <a href="https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/">Hands-On Machine Learning with Scikit-Learn and TensorFlow</a> as an example of a machine learning workflow. This is an excellent book and a useful reference if you're looking to purchase a book about machine learning with Python.

We won't be using this data to build a predictive model, but rather to demonstrate the need for advanced pipelines.

In [None]:
## Read the data
df = pd.read_csv("https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv")

In [None]:
df_train = df.copy().sample(frac=.75, random_state = 440)
df_test = df.copy().drop(df_train.index)

In [None]:
## Let's look at the dataframe info
df_train.info()

In [None]:
## What kind of categories are possible for ocean proximity?
df_train.ocean_proximity.value_counts()

In [None]:
## Each dot is at its longitude and latitude
## the size of the dot is proportional to its population
## the color of the dot represents its median_house_value
df_train.plot(kind="scatter", x = "longitude", y = "latitude",
             alpha = .9, s = df_train["population"]/50, label="population",
             figsize=(12,14), c="median_house_value",cmap = plt.get_cmap("viridis"), 
             colorbar=True)

plt.xlabel("Longitude", fontsize=16)
plt.ylabel("Latitude", fontsize=16)

plt.show()

In [None]:
plt.figure(figsize=(12,14))
sns.scatterplot(data=df_train,x="longitude",y="latitude",hue="ocean_proximity")

plt.xlabel("Longitude", fontsize=16)
plt.ylabel("Latitude", fontsize=16)

plt.show()

From our exploration of the data we can see that this data set calls for a number of preprocessing steps:
1. `total_bedrooms` has a number of missing values that could be imputed
2. `ocean_proximity` needs to be one-hot-encoded
3. Many columns have vastly differing scales, so we should scale them
4. We may want to create additional features from our other features.

We'll now review how to do 1. and 2.; then it will be your job to incorporate 3. and 4.

As we go through let's remember two main points:
- Fitting should only be performed on the training set
- A good pipeline takes in the features and target without any preprocessing and outputs the fit or prediction

#### Imputing `total_bedrooms`

Recall that we only want to impute the column for `total_bedrooms`. If we were to put `SimpleImputer` as is into the pipeline we'd be imputing the entire dataframe. While this isn't an issue for this dataset (because only `total_bedrooms` is missing data), it's an excellent time to introduce how you can create a custom imputer object.

`sklearn` is quite nice because it gives us the functionality to make custom transformers relatively easily. To do this we make our own transformer object. 

To fully grasp everything going on, check out the bonus content notebook in the `python prep` folder, where I review objects and classes in Python. If you're happy just copying and pasting the code for your own transformers (no shame in that for now, we're learning a lot of data science), there's no need to read through those notes.

In [None]:
## We'll need these
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin

A Python object is an instance of a Python class.

Below we define our `BedroomImputer` class.

In [None]:
## Define our custom imputer
class BedroomImputer(BaseEstimator, TransformerMixin):
    # Class constructor
    # This runs when you create an object by calling
    # BedroomImputer()
    def __init__(self):
        # Each BedroomImputer object stores its own
        # SimpleImputer object that fills with the median
        self.SimpleImputer = SimpleImputer(strategy = "median")
    
    # For my fit method I'm just going to "steal"
    # SimpleImputer's fit method using only the
    # 'total_bedrooms' column
    def fit(self, X, y = None ):
        self.SimpleImputer.fit(X['total_bedrooms'].values.reshape(-1,1))
        return self
    
    # Now I want to transform the total_bedrooms column
    # and return a copy of X with the imputed values
    def transform(self, X, y = None):
        copy_X = X.copy()
        copy_X['total_bedrooms'] = self.SimpleImputer.transform(copy_X['total_bedrooms'].values.reshape(-1,1))
        return copy_X

We now have a custom imputer, so let's put it to work.

In [None]:
imputer = BedroomImputer()

In [None]:
df_train.total_bedrooms.describe()

In [None]:
imputer.fit(df_train)

imputer.transform(df_train).total_bedrooms.describe()

imputer.fit_transform(df_train)
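Notice that we never wrote a `fit_transform` method for `BedroomImputer`: `TransformerMixin` supplies it by calling `fit` and then `transform`. The sketch below illustrates this with a generalized variant, `ColumnMedianImputer`, a hypothetical class (not part of this notebook or of `sklearn`) that takes the column name as a parameter. The toy dataframes are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer


class ColumnMedianImputer(BaseEstimator, TransformerMixin):
    """Hypothetical imputer: fills one named column with its training median."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Learn the median from the chosen column only
        self.imputer_ = SimpleImputer(strategy="median")
        self.imputer_.fit(X[[self.column]])
        return self

    def transform(self, X):
        # Work on a copy so the input dataframe is left untouched
        X_copy = X.copy()
        filled = self.imputer_.transform(X_copy[[self.column]])
        X_copy[self.column] = np.asarray(filled).ravel()
        return X_copy


# Made-up training and test data
toy_train = pd.DataFrame({"total_bedrooms": [1.0, 3.0, np.nan, 5.0],
                          "households": [2, 4, 6, 8]})
toy_test = pd.DataFrame({"total_bedrooms": [np.nan, 10.0],
                         "households": [1, 3]})

col_imputer = ColumnMedianImputer(column="total_bedrooms")

# fit_transform is inherited from TransformerMixin
toy_train_imp = col_imputer.fit_transform(toy_train)

# The test set is filled with the median learned from the training set
toy_test_imp = col_imputer.transform(toy_test)

print(toy_train_imp)
print(toy_test_imp)
```

Passing the column name to the constructor is a design choice for this sketch; it lets the same class be reused for any column instead of hard-coding `total_bedrooms`.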

#### One-Hot-Encoding `ocean_proximity`

Now let's see how we can one-hot-encode `ocean_proximity`.

Here we can use the `FunctionTransformer` object.

In [None]:
from sklearn.preprocessing import FunctionTransformer

In [None]:
# define our preprocessing function
# This one-hot-encodes ocean_proximity,
# leaving out the column for the last category
def one_hot_encode(df):
    df_copy = df.copy()
    
    hot_encoding = pd.get_dummies(df_copy['ocean_proximity'])
    df_copy[hot_encoding.columns[:-1]] = hot_encoding[hot_encoding.columns[:-1]]
    
    return df_copy

In [None]:
one_hot = FunctionTransformer(one_hot_encode)

In [None]:
one_hot.transform(df_train)

Great!
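To see what `one_hot_encode` is doing, it can help to look at `pd.get_dummies` on a tiny example. The sketch below uses a made-up dataframe (`toy`) and attaches the indicator columns with `pd.concat`, which should give the same result as the column assignment used above. The last indicator column is dropped because that category is implied whenever all of the remaining indicators are 0.

```python
import pandas as pd

# Made-up data with three ocean_proximity categories
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
                    "median_income": [3.1, 5.2, 2.4, 4.0]})

# One indicator column per category, in sorted order
dummies = pd.get_dummies(toy["ocean_proximity"])
print(dummies.columns.tolist())

# Keep all but the last indicator column
kept = dummies.columns[:-1]
encoded = pd.concat([toy, dummies[kept]], axis=1)
print(encoded)
```

One caveat to keep in mind: `pd.get_dummies` builds its columns from whatever categories appear in the data it is given, so a test set that is missing a category would come out with different columns than the training set.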

Now it's your turn.

### You Code

Your boss has told you that her end goal is to regress `median_house_value` on `median_income`, `ocean_proximity`, and a new feature, `bedrooms_per_room`.

Write a function called `get_feats` that takes in a feature dataframe and returns the columns for `median_income`, the one-hot-encoded `ocean_proximity`, and `bedrooms_per_room`. Feel free to use the function `one_hot_encode`, or not. Then create a `FunctionTransformer` object using `get_feats`, and check that running `df_train` through your transformer object returns a dataframe with the desired columns, i.e. `median_income`, the one-hot-encoded `ocean_proximity`, and `bedrooms_per_room`.

In [None]:
## Code here



In [None]:
## Code here



Now you remember that it's important to scale the data prior to fitting your model. However, you only want to scale the columns for `median_income` and `bedrooms_per_room`, not the one-hot-encoded columns. Following the approach we took for `BedroomImputer`, define a custom scaler called `NumericScale` that takes in the dataframe produced by `get_feats` and scales the `median_income` and `bedrooms_per_room` columns. Hint: use `StandardScaler` in a manner similar to how `SimpleImputer` was used above.

In [None]:
## Below is a SAMPLE SOLUTION

In [None]:
## Code here


In [None]:
## Code here



Now we can put it all together!
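Here is one possible sketch of a full pipeline, under several assumptions: it runs on made-up data rather than `df_train`, the names `MedianFiller`, `build_features`, and `NumericColumnScaler` are hypothetical stand-ins for your own solutions, and `bedrooms_per_room` is assumed to mean `total_bedrooms` divided by `total_rooms`. It is not the official solution to the exercises above, just an illustration of how the pieces can be chained.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Made-up stand-in for the housing features and target
rng = np.random.default_rng(440)
n = 200
toy_X = pd.DataFrame({
    "median_income": rng.uniform(1, 10, n),
    "total_rooms": rng.integers(500, 3000, n).astype(float),
    "total_bedrooms": rng.integers(100, 600, n).astype(float),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n),
})
toy_X.loc[toy_X.index[:10], "total_bedrooms"] = np.nan
toy_y = 50000 * toy_X["median_income"] + rng.normal(0, 10000, n)


class MedianFiller(BaseEstimator, TransformerMixin):
    """Hypothetical: fill missing total_bedrooms with the training median."""
    def fit(self, X, y=None):
        self.median_ = X["total_bedrooms"].median()
        return self

    def transform(self, X):
        X_copy = X.copy()
        X_copy["total_bedrooms"] = X_copy["total_bedrooms"].fillna(self.median_)
        return X_copy


def build_features(df):
    """Hypothetical: median_income, bedrooms_per_room, one-hot ocean_proximity."""
    out = pd.DataFrame({
        "median_income": df["median_income"],
        # Assumed definition of bedrooms_per_room
        "bedrooms_per_room": df["total_bedrooms"] / df["total_rooms"],
    }, index=df.index)
    dummies = pd.get_dummies(df["ocean_proximity"]).astype(float)
    return pd.concat([out, dummies[dummies.columns[:-1]]], axis=1)


class NumericColumnScaler(BaseEstimator, TransformerMixin):
    """Hypothetical: standardize only the listed numeric columns."""
    def __init__(self, columns=("median_income", "bedrooms_per_room")):
        self.columns = columns

    def fit(self, X, y=None):
        self.scaler_ = StandardScaler().fit(X[list(self.columns)])
        return self

    def transform(self, X):
        X_copy = X.copy()
        cols = list(self.columns)
        X_copy[cols] = self.scaler_.transform(X_copy[cols])
        return X_copy


pipe = Pipeline([
    ("impute", MedianFiller()),
    ("features", FunctionTransformer(build_features)),
    ("scale", NumericColumnScaler()),
    ("reg", LinearRegression()),
])

# Fit on the training rows only, then predict on the held-out rows
pipe.fit(toy_X.iloc[:150], toy_y.iloc[:150])
preds = pipe.predict(toy_X.iloc[150:])
print(preds[:5])
```

The pipeline takes in the raw features and target and handles every preprocessing step itself, so fitting only ever touches the training rows.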

## That's it!

That's it for this notebook! You're now able to create more advanced pipelines which will make your code cleaner and more understandable.

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2021.

Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute, subject to the license (see License.md).