# Data Science Quick Tip #004: Using Custom Transformers in Scikit-Learn Pipelines!
In our last post, we covered how to use Scikit-Learn pipelines to chain all the appropriate transformers together into a single object. In this post, we'll take things a step further by adding custom transformers to the pipeline. Because this builds directly on the last post, much of this code should already look familiar.

## Project Setup
Let's go ahead and import the libraries we'll be using as well as the training dataset.

In [23]:
# Importing the libraries we'll be using for this project
import pandas as pd
import joblib

from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix

In [2]:
# Importing the training dataset
raw_train = pd.read_csv('../data/titanic/train.csv')

In [3]:
# Splitting the training data into appropriate training and validation sets
X = raw_train.drop(columns = ['Survived'])
y = raw_train[['Survived']]

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state = 42)

In [4]:
# Viewing first few rows of X_train dataset
X_train.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 298 | 299 | 1 | Saalfeld, Mr. Adolphe | male | NaN | 0 | 0 | 19988 | 30.5 | C106 | S |
| 884 | 885 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.05 | NaN | S |
| 247 | 248 | 2 | Hamalainen, Mrs. William (Anna) | female | 24.0 | 0 | 2 | 250649 | 14.5 | NaN | S |
| 478 | 479 | 3 | Karlsson, Mr. Nils August | male | 22.0 | 0 | 0 | 350060 | 7.5208 | NaN | S |
| 305 | 306 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |


In [11]:
# Viewing first few rows of y_train dataset
y_train.head()

| | Survived |
|---|---|
| 298 | 1 |
| 884 | 0 |
| 247 | 1 |
| 478 | 0 |
| 305 | 1 |


## Creating Our Pipeline (Now With Custom Transformers!)
With our data imported, we're ready to go ahead and start creating our pipeline. This time, alongside the default transformers, we'll write two custom functions to engineer the 'Age' and 'Embarked' columns. We're still only using three features, so we definitely won't be getting great results out of our model predictions. But that's okay! The purpose here is learning how to use custom transformers in a pipeline.

Note: The column transformer below builds on the single-column one from the last post. Each custom function gets wrapped in a `FunctionTransformer` so that it can sit right alongside the default `OneHotEncoder` we used for the 'Sex' column.

In [27]:
# Creating a function to appropriately engineer the 'Age' column
def create_age_bins(col):
    '''Engineers age bin variables for pipeline'''
    
    # Defining / instantiating the necessary variables
    age_bins = [-1, 12, 18, 25, 50, 100]
    age_labels = ['child', 'teen', 'young_adult', 'adult', 'elder']
    age_imputer = SimpleImputer(strategy = 'median')
    age_ohe = OneHotEncoder()
    
    # Performing basic imputation for nulls
    imputed = age_imputer.fit_transform(col)
    ages_filled = pd.DataFrame(data = imputed, columns = ['Age'])
    
    # Segregating ages into age bins
    age_cat_cols = pd.cut(ages_filled['Age'], bins = age_bins, labels = age_labels)
    age_cats = pd.DataFrame(data = age_cat_cols, columns = ['Age'])
    
    # One hot encoding new age bins
    ages_encoded = age_ohe.fit_transform(age_cats[['Age']])
    ages_encoded = pd.DataFrame(data = ages_encoded.toarray())
    
    return ages_encoded
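
To make the binning step a little more concrete, here is a minimal sketch of how `pd.cut` assigns ages to the bins defined in `create_age_bins`. The sample ages are hypothetical values picked to sit on or near the bin edges. Because `pd.cut` closes each bin on the right by default, an age of exactly 12 counts as a 'child' and an age of exactly 18 counts as a 'teen'.

```python
import pandas as pd

# Same bin edges and labels used in create_age_bins
age_bins = [-1, 12, 18, 25, 50, 100]
age_labels = ['child', 'teen', 'young_adult', 'adult', 'elder']

# Hypothetical sample ages chosen to sit on or near the bin edges
sample_ages = pd.Series([0.92, 12, 13, 18, 25, 26, 50, 51])

# Each bin includes its right edge, e.g. (-1, 12] and (12, 18]
binned = pd.cut(sample_ages, bins = age_bins, labels = age_labels)

print(binned.tolist())
# ['child', 'child', 'teen', 'teen', 'young_adult', 'adult', 'adult', 'elder']
```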

In [6]:
# Creating function to appropriately engineer the 'Embarked' column
def create_embarked_columns(col):
    '''Engineers the embarked variables for pipeline'''
    
    # Instantiating the transformer objects
    embarked_imputer = SimpleImputer(strategy = 'most_frequent')
    embarked_ohe = OneHotEncoder()
    
    # Performing basic imputation for nulls
    imputed = embarked_imputer.fit_transform(col)
    embarked_filled = pd.DataFrame(data = imputed, columns = ['Embarked'])
    
    # Performing OHE on the col data
    embarked_columns = embarked_ohe.fit_transform(embarked_filled[['Embarked']])
    embarked_columns_df = pd.DataFrame(data = embarked_columns.toarray())
    
    return embarked_columns_df
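
One thing worth knowing about the two functions above: they call `fit_transform` on a fresh imputer and encoder every time they run, so at prediction time they are refit on whatever data comes in. If that data is missing a category, the number of output columns can change. As a rough sketch of one alternative (not the approach used in this post), a small class built on `BaseEstimator` and `TransformerMixin` can learn its values once in `fit` and reuse them in `transform`. The `AgeBinner` name and its internals are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class AgeBinner(BaseEstimator, TransformerMixin):
    '''Hypothetical transformer: learns the median age in fit, reuses it in transform'''

    def __init__(self, bins = (-1, 12, 18, 25, 50, 100),
                 labels = ('child', 'teen', 'young_adult', 'adult', 'elder')):
        self.bins = bins
        self.labels = labels

    def fit(self, X, y = None):
        # Learning the median from the data passed to fit only
        self.median_ = float(pd.DataFrame(X).iloc[:, 0].median())
        return self

    def transform(self, X):
        # Filling nulls with the median learned during fit
        ages = pd.DataFrame(X).iloc[:, 0].fillna(self.median_)
        binned = pd.cut(ages, bins = list(self.bins), labels = list(self.labels))
        # The categories are fixed by the labels, so every label gets a column
        return pd.get_dummies(binned).to_numpy(dtype = float)


# Hypothetical toy data: fit on a few ages, then transform a single null age
train_ages = pd.DataFrame({'Age': [5.0, 30.0, np.nan, 70.0]})
new_ages = pd.DataFrame({'Age': [np.nan]})

binner = AgeBinner().fit(train_ages)
print(binner.transform(new_ages))
```

Here the null age is filled with the median of 30 learned during `fit` and lands in the 'adult' column, and the output keeps all five columns even though only one row was passed in.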

In [31]:
# Creating a preprocessor to transform the 'Sex', 'Age', and 'Embarked' columns
# Note: Each entry must be a single (name, transformer, columns) tuple
data_preprocessor = ColumnTransformer(transformers = [
    ('sex_transformer', OneHotEncoder(), ['Sex']),
    ('age_transformer', FunctionTransformer(create_age_bins, validate = False), ['Age']),
    ('embarked_transformer', FunctionTransformer(create_embarked_columns, validate = False), ['Embarked'])
])

In [32]:
# Creating our pipeline that first preprocesses the data, then scales the data, then fits the data to a RandomForestClassifier
rfc_pipeline = Pipeline(steps = [
    ('data_preprocessing', data_preprocessor),
    ('data_scaling', StandardScaler()),
    ('model', RandomForestClassifier(max_depth = 10,
                                     min_samples_leaf = 3,
                                     min_samples_split = 4,
                                     n_estimators = 200))
])

In [33]:
# Fitting the training data to our pipeline
rfc_pipeline.fit(X_train, y_train)
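
Since the shape of each `ColumnTransformer` entry is easy to get wrong, here is a small sketch on toy data. Every entry in `transformers` is one `(name, transformer, columns)` tuple, and a misplaced parenthesis that splits an entry apart will make fitting fail. The toy DataFrame and the `halve_fare` function are hypothetical stand-ins, not part of the Titanic code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer

# Hypothetical toy data standing in for the Titanic columns
toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                    'Fare': [7.25, 71.28, 8.05]})

def halve_fare(col):
    '''Hypothetical custom function; receives the selected columns as a DataFrame'''
    return col / 2

# Each entry is exactly one (name, transformer, columns) tuple
toy_preprocessor = ColumnTransformer(transformers = [
    ('sex_transformer', OneHotEncoder(), ['Sex']),
    ('fare_transformer', FunctionTransformer(halve_fare), ['Fare'])
])

# Two one hot encoded 'Sex' columns plus one halved 'Fare' column
toy_out = toy_preprocessor.fit_transform(toy)
print(toy_out.shape)
```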

In [15]:
# Saving our pipeline to a binary pickle file
joblib.dump(rfc_pipeline, 'model/rfc_pipeline.pkl')

['model/rfc_pipeline.pkl']

In [16]:
# Loading back in our serialized model
loaded_model = joblib.load('model/rfc_pipeline.pkl')
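
Two practical notes on serializing a pipeline that contains a `FunctionTransformer`, sketched below under a few assumptions. First, `joblib.dump` writes to the path it is given but does not create missing folders, so the target directory has to exist beforehand. Second, pickle stores functions by reference (module and name) rather than by value, so a function defined in a notebook cell needs to be defined or imported again before calling `joblib.load` in a new session. The sketch uses a temporary directory and `np.log1p` as a stand-in function because it is importable from anywhere; the file name is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Using a temporary directory here; in this post the target folder is 'model/'
out_dir = os.path.join(tempfile.mkdtemp(), 'model')
os.makedirs(out_dir, exist_ok = True)  # Creating the folder before dumping

# np.log1p stands in for a custom function that lives in an importable module
path = os.path.join(out_dir, 'toy_transformer.pkl')
joblib.dump(FunctionTransformer(np.log1p), path)

# Loading resolves the wrapped function by its module and name
restored = joblib.load(path)
print(restored.func is np.log1p)
```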

In [17]:
# Checking out our predicted results using the validation dataset
pipeline_preds = loaded_model.predict(X_val)

val_accuracy = accuracy_score(y_val, pipeline_preds)
val_roc_auc = roc_auc_score(y_val, pipeline_preds)
val_confusion_matrix = confusion_matrix(y_val, pipeline_preds)

print(f'Accuracy Score: {val_accuracy}')
print(f'ROC AUC Score: {val_roc_auc}')
print(f'Confusion Matrix: \n{val_confusion_matrix}')

Accuracy Score: 0.7847533632286996
ROC AUC Score: 0.7718430320308569
Confusion Matrix: 
[[112  22]
 [ 26  63]]
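
A quick aside on the ROC AUC score above: it was computed from the hard 0/1 predictions. ROC AUC is generally more informative when it is given predicted probabilities, since the metric can then evaluate every threshold rather than just one. Here is a minimal sketch of the difference, using synthetic data from `make_classification` rather than the Titanic dataset, so the numbers it prints are not comparable to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the Titanic dataset
X_demo, y_demo = make_classification(n_samples = 500, random_state = 42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state = 42)

demo_model = RandomForestClassifier(n_estimators = 50, random_state = 42)
demo_model.fit(X_tr, y_tr)

# Hard 0/1 labels reduce the ROC curve to a single threshold
auc_from_labels = roc_auc_score(y_te, demo_model.predict(X_te))

# Positive class probabilities let the metric sweep every threshold
auc_from_probs = roc_auc_score(y_te, demo_model.predict_proba(X_te)[:, 1])

print(f'ROC AUC from labels: {auc_from_labels:.3f}')
print(f'ROC AUC from probabilities: {auc_from_probs:.3f}')
```

With the pipeline from this post, the equivalent call would pass `loaded_model.predict_proba(X_val)[:, 1]` to `roc_auc_score`.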
