# More Advanced Pipelines

Sometimes you will need more customized functionality in your pipeline.

## What we will accomplish

In this notebook we will:
- Return to our penguins data set example,
- See how to build custom transformer objects,
- Introduce `FunctionTransformer`,
- Again review the `fit`, `transform` and `fit_transform` paradigm and
- Build a pipeline to predict penguin type.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from seaborn import set_style, pairplot
set_style("whitegrid")

## Penguins Data

A motivating problem for this notebook will be our desire to build a model that classifies different penguins into three unique species. We saw a version of this `seaborn` data set in the `Imputation` notebook. 

Let's load the version we will use in this notebook.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
penguins = pd.read_csv("../../Data/penguins_for_pipes.csv")

In [None]:
penguins.info()

In [None]:
peng_train, peng_test = train_test_split(penguins,
                                            random_state = 440,
                                            shuffle = True,
                                            test_size=.25)

In [None]:
pairplot(peng_train, hue='species')

plt.show()

In [None]:
print(peng_train.island.value_counts())

print()

print(peng_train.sex.value_counts())

## Our desired pipeline

Our desired pipeline for this model is:

1. Impute the missing values of `body_mass_g` with the `median` value,
2. Impute the missing values of `sex` with the most common value,
3. One hot encode `island` and `sex` and
4. Fit a random forest model to the data.

Our basic pipeline approach may run into trouble here because the data mix categorical and numeric features. For example, placing a single `SimpleImputer` in a pipeline applies the same imputation strategy to every column with missing values, so we could not impute `body_mass_g` with the median and `sex` with the most common value. A similar issue arises if you want to scale some columns but not others.
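To make this concrete, here is a minimal sketch on a made-up two-column data frame (the values in `toy` are invented for illustration and are not taken from the penguins data). A single `SimpleImputer` has only one `strategy`: `"median"` cannot handle the text column, while `"most_frequent"` runs but also fills the numeric column with its mode rather than its median.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented toy data: one numeric and one categorical column, each with a gap
toy = pd.DataFrame({
    "body_mass_g": [3000.0, 3000.0, 4000.0, 5000.0, np.nan],
    "sex": ["Male", "Female", "Female", np.nan, "Female"],
})

# A median imputer cannot be fit on data containing strings
try:
    SimpleImputer(strategy="median").fit(toy)
    median_failed = False
except (ValueError, TypeError):
    median_failed = True

# A most_frequent imputer runs, but applies the mode to BOTH columns
imputed = SimpleImputer(strategy="most_frequent").fit_transform(toy)

mode_fill = imputed[4, 0]               # value used for the missing body mass
median_fill = toy["body_mass_g"].median()  # value we actually wanted
```

Here `mode_fill` and `median_fill` differ, which is exactly the problem the custom transformers below are meant to solve.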

## Transformer objects

`sklearn` has the functionality for us to write our own custom <i>transformer objects</i> which we can use to make custom imputers. A transformer object in this context is an `sklearn` object that performs some transformation on the data. We have seen a number of these in prior notebooks including:
- `StandardScaler`,
- `PolynomialFeatures` and
- `SimpleImputer`.
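As a quick reminder of how these objects behave, the sketch below walks through `fit`, `transform` and `fit_transform` with `StandardScaler` on a tiny made-up array (the arrays `train` and `test` are invented for illustration). `fit` learns the statistics from the training data, and `transform` applies them to whatever data it is given.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented toy data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()

# fit learns the mean and standard deviation of the training data
scaler.fit(train)

# transform applies the learned statistics, it does not relearn them
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# fit_transform is fit followed by transform on the same data
same_result = np.allclose(StandardScaler().fit_transform(train), train_scaled)
```

Note that the test point is scaled with the training mean and standard deviation, which is why we only call `fit` or `fit_transform` on training data.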

### Making custom transformer objects

We can now learn how to make our own transformer objects.

We will start by making a custom imputer for `body_mass_g`.

<i>Note that this material may be more easily digested if you have consumed the optional `Classes and Objects in Python` notebook in the `Python Prep` material in this repository.</i>

In [None]:
## We'll need these
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin

In [None]:
## Define our custom imputer
class BodyMassImputer(BaseEstimator, TransformerMixin):
    # Class constructor
    # This runs when you create a BodyMassImputer object
    def __init__(self):
        # Each object stores its own SimpleImputer,
        # set up to impute with the median
        self.SimpleImputer = SimpleImputer(strategy = "median")
    
    # For my fit method I'm just going to "steal"
    # SimpleImputer's fit method using only the
    # 'body_mass_g' column
    def fit(self, X, y = None ):
        self.SimpleImputer.fit(X['body_mass_g'].values.reshape(-1,1))
        return self
    
    # Now I want to transform the 'body_mass_g' column
    # and return the data with imputed values
    def transform(self, X, y = None):
        copy_X = X.copy()
        copy_X['body_mass_g'] = self.SimpleImputer.transform(copy_X['body_mass_g'].values.reshape(-1,1))
        return copy_X

We now have a custom imputer, so let's test it.

In [None]:
imputer = BodyMassImputer()

In [None]:
## Look at the data where body_mass_g is missing
peng_train.loc[peng_train.body_mass_g.isna()]

In [None]:
## Here is what results at the missing values of body_mass_g in the train set
## note we can use fit_transform because it is the training set
imputer.fit_transform(peng_train).loc[peng_train.body_mass_g.isna()]
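On data other than the training set we would call only `transform`, so that the fill value still comes from the training data. The sketch below illustrates this with a hypothetical `ColumnImputer`, a generalization of the class above that takes the column name and strategy as arguments. Both the class and the toy data frames are assumptions made for illustration and are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

class ColumnImputer(BaseEstimator, TransformerMixin):
    """Hypothetical imputer for a single named column of a data frame."""
    def __init__(self, column, strategy="median"):
        self.column = column
        self.strategy = strategy

    def fit(self, X, y=None):
        # Learn the fill value from the chosen column only
        self.imputer_ = SimpleImputer(strategy=self.strategy)
        self.imputer_.fit(X[[self.column]].to_numpy())
        return self

    def transform(self, X, y=None):
        # Work on a copy so the input data frame is left untouched
        X_copy = X.copy()
        filled = self.imputer_.transform(X_copy[[self.column]].to_numpy())
        X_copy[self.column] = filled.ravel()
        return X_copy

# Invented toy data
toy_train = pd.DataFrame({"body_mass_g": [3000.0, 4000.0, np.nan, 5000.0]})
toy_test = pd.DataFrame({"body_mass_g": [np.nan, 6000.0]})

toy_imputer = ColumnImputer(column="body_mass_g")
train_filled = toy_imputer.fit_transform(toy_train)  # fit and transform
test_filled = toy_imputer.transform(toy_test)        # transform only
```

The missing test value is filled with the training median, not with anything computed from the test data.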

We can make a custom imputer for `sex` in a similar fashion.

In [None]:
## Define our custom imputer
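class SexImputer(BaseEstimator, TransformerMixin):
    # Class constructor
    # This runs when you create a SexImputer object
    def __init__(self):
        # Each object stores its own SimpleImputer,
        # set up to impute with the most common value
        self.SimpleImputer = SimpleImputer(strategy = "most_frequent")
    
    # For my fit method I'm just going to "steal"
    # SimpleImputer's fit method using only the
    # 'sex' column
    def fit(self, X, y = None):
        self.SimpleImputer.fit(X['sex'].values.reshape(-1,1))
        return self
    
    # Now I want to transform the 'sex' column
    # and return the data with imputed values
    def transform(self, X, y = None):
        copy_X = X.copy()
        copy_X['sex'] = self.SimpleImputer.transform(copy_X['sex'].values.reshape(-1,1)).ravel()
        return copy_X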

In [None]:
imputer = SexImputer()

In [None]:
peng_train.loc[peng_train.sex.isna()]

In [None]:
imputer.fit_transform(peng_train).loc[peng_train.sex.isna()]

Finally we need a one hot encoder for these data. Luckily we can use a `FunctionTransformer` object for this, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html</a>.

`FunctionTransformer` allows you to write a Python function and have it applied to your data set as a transformer object.

First we define a function that will take in the dataframe and return one with one hot encoded data for `island` and `sex`.

In [None]:
def one_hot_encoder(df):
    df_copy = df.copy()
    
    ## first replace Male/Female with 0-1s (1 = Female)
    df_copy['sex'] = pd.get_dummies(df['sex'], dtype=int)['Female']
    
    ## Now get island columns
    df_copy[['Biscoe', 'Dream', 'Torgersen']] = pd.get_dummies(df['island'], dtype=int)[['Biscoe', 'Dream', 'Torgersen']]
    
    return df_copy[['bill_length_mm', 'bill_depth_mm',
               'flipper_length_mm', 'body_mass_g', 
               'sex', 'Biscoe', 'Dream', 'Torgersen']]
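To see what `pd.get_dummies` does inside `one_hot_encoder`, and why we want imputation to happen before the encoding, here is a small sketch on an invented series (`toy_sex` is made up for illustration). By default a missing value is given a 0 in every indicator column.

```python
import numpy as np
import pandas as pd

# Invented toy column with one missing entry
toy_sex = pd.Series(["Male", "Female", np.nan, "Female"], name="sex")

# One indicator column per observed category
dummies = pd.get_dummies(toy_sex, dtype=int)

# A row that sums to 0 belongs to no category, i.e. it was missing
row_sums = dummies.sum(axis=1)
```

Without imputation, a penguin with a missing `sex` would therefore be silently encoded as not `Female`.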

In [None]:
peng_train.head()

In [None]:
## Look at the one hot encoded data
one_hot_encoder(peng_train).head()

We can now wrap the function `one_hot_encoder` in the `FunctionTransformer` object to turn it into a transformer object that does the one hot encoding we would like.

In [None]:
from sklearn.preprocessing import FunctionTransformer
one_hot_transformer = FunctionTransformer(one_hot_encoder)

In [None]:
one_hot_transformer.transform(peng_train).head()

Note that with `one_hot_transformer` we did not have to call `fit` prior to `transform`. This is because `FunctionTransformer` does not learn anything from the data; it just applies the function we wrote.
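The same idea works for any function that takes in a data frame and returns one. As a minimal sketch, the hypothetical function `add_bill_ratio` below (not part of this notebook's pipeline) is wrapped in a `FunctionTransformer` and applied to an invented two-row data frame without ever calling `fit`.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_bill_ratio(df):
    # Hypothetical feature: bill length divided by bill depth
    df_copy = df.copy()
    df_copy["bill_ratio"] = df_copy["bill_length_mm"] / df_copy["bill_depth_mm"]
    return df_copy

# Invented toy data
toy = pd.DataFrame({"bill_length_mm": [40.0, 50.0],
                    "bill_depth_mm": [20.0, 10.0]})

ratio_transformer = FunctionTransformer(add_bill_ratio)

# No call to fit is needed, transform just applies the function
toy_with_ratio = ratio_transformer.transform(toy)
```

Because the function works on a copy, the original `toy` data frame is left unchanged.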

### Making the pipeline

We can now put all of this together in a pipeline.
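Before building it, it may help to recall what a pipeline does with the `fit` and `transform` methods we wrote. The sketch below uses a hypothetical `LoggingTransformer` that records each call in a list, together with invented toy data and arbitrary random forest settings. It suggests that `pipe.fit` both fits and transforms with each transformer step, while `pipe.predict` only transforms.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

calls = []  # records which methods the pipeline calls

class LoggingTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that leaves the data unchanged."""
    def fit(self, X, y=None):
        calls.append("fit")
        return self

    def transform(self, X):
        calls.append("transform")
        return X

# Invented toy data
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

toy_pipe = Pipeline([("log", LoggingTransformer()),
                     ("rf", RandomForestClassifier(n_estimators=10,
                                                   random_state=0))])

toy_pipe.fit(X, y)
calls_after_fit = list(calls)       # calls made while fitting

toy_pipe.predict(X)
calls_after_predict = list(calls)   # predict adds a transform, but no fit
```

This is why our custom imputers fill test set values using what they learned from the training set.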

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

In [None]:
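## Step names are arbitrary labels; the random forest uses default hyperparameters here
pipe = Pipeline([('mass_imputer', BodyMassImputer()), ('sex_imputer', SexImputer()),
                 ('one_hot', one_hot_transformer), ('rf', RandomForestClassifier())])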

In [None]:
pipe.fit(peng_train[['bill_length_mm', 'bill_depth_mm',
                       'flipper_length_mm', 'body_mass_g',
                       'island', 'sex']],
         peng_train['species'])

In [None]:
train_pred = pipe.predict(peng_train[['bill_length_mm', 'bill_depth_mm',
                       'flipper_length_mm', 'body_mass_g',
                       'island', 'sex']])

## Display the training predictions
train_pred

In [None]:
test_pred = pipe.predict(peng_test[['bill_length_mm', 'bill_depth_mm',
                       'flipper_length_mm', 'body_mass_g',
                       'island', 'sex']])

## Display the test predictions
test_pred

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
print("Training accuracy", accuracy_score(peng_train.species, train_pred))
print("Test accuracy", accuracy_score(peng_test.species, test_pred))

You should now be able to create more complicated pipelines!

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).