## FunctionTransformer

The objective of this notebook is to show how to use `FunctionTransformer` from `sklearn.preprocessing`. We have already covered [`ColumnTransformer`](https://github.com/amriteshkt/feature-engineering-practice/tree/main/04_column_transformer) and [`Pipeline`](https://github.com/amriteshkt/feature-engineering-practice/tree/main/05_pipeline).

This notebook may seem repetitive at first, because I write out every step, column by column, using no class other than `FunctionTransformer`.  
At the end of the notebook, I show how classes like `ColumnTransformer` and `Pipeline` let us write less and do more.

[scikit-learn documentation for FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import FunctionTransformer

We use the Titanic dataset, downloaded from [Kaggle](https://www.kaggle.com/competitions/titanic/data), in this notebook.

In [2]:
df = pd.read_csv('titanic_data.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We drop the columns `PassengerId`, `Name`, `Ticket`, and `Cabin` from the DataFrame `df` to simplify the analysis.

In [3]:
df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)

In [4]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


We separate the features `X` from the target `y`.

In [5]:
X = df.drop(columns=['Survived'])
y = df['Survived']
X.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,22.0,1,0,7.25,S
1,1,female,38.0,1,0,71.2833,C
2,3,female,26.0,0,0,7.925,S
3,1,female,35.0,1,0,53.1,S
4,3,male,35.0,0,0,8.05,S


Split the dataset into train and test sets.

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
X_train.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
331,1,male,45.5,0,0,28.5,S
733,2,male,23.0,0,0,13.0,S
382,3,male,32.0,0,0,7.925,S
704,3,male,26.0,1,0,7.8542,S
813,3,female,6.0,4,2,31.275,S


`FunctionTransformer` from `sklearn.preprocessing` takes a built-in or custom function and applies it to whatever data is passed to it. Here we pass one column at a time.
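As a minimal sketch of the idea, assuming a toy NumPy array rather than the Titanic data, the transformer below wraps the built-in `np.log1p` and uses `np.expm1` as an optional inverse. The variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Toy data (hypothetical values): one feature stored in a 2D array.
toy = np.array([[0.0], [1.0], [9.0]])

# Wrap a built-in NumPy function; inverse_func is optional.
log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

# fit_transform applies np.log1p element-wise.
toy_logged = log_transformer.fit_transform(toy)

# inverse_transform applies np.expm1 and recovers the original values.
toy_restored = log_transformer.inverse_transform(toy_logged)
```

The same pattern works with a custom function, which is what the rest of this notebook does.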

We start with the first column, `Pclass`. Our aim is to impute null values with the *mode*. This could be done with `SimpleImputer(strategy='most_frequent')`, but in this notebook we use `FunctionTransformer`.

In [8]:
# for 'Pclass' column, we create a function to impute null values.
def pclass(column):
    """Fill null values in Pclass column with mode.
    """
    return column.fillna(column.mode().iloc[0])

In [9]:
# pass the function as argument in FunctionTransformer
tranformer_pclass = FunctionTransformer(func=pclass)

# fit and transform the training and testing data.
X_train_pclass = tranformer_pclass.fit_transform(X_train['Pclass'])
X_test_pclass = tranformer_pclass.transform(X_test['Pclass'])
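One caveat is worth a small sketch. `FunctionTransformer` is stateless: `fit` does not store the mode, so the function recomputes it from whatever data reaches `transform`. In the cell above, that means the mode used for `X_test` comes from `X_test` itself. The toy frames and the name `fill_with_mode` below are assumptions made for illustration. `SimpleImputer` is shown only as one possible stateful alternative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer

def fill_with_mode(frame):
    """Fill nulls with the mode of the data passed in (hypothetical helper)."""
    return frame.fillna(frame.mode().iloc[0])

# Toy data (hypothetical values): the train mode is 3.0, the test mode is 1.0.
train = pd.DataFrame({"Pclass": [3.0, 3.0, 1.0, np.nan]})
test = pd.DataFrame({"Pclass": [1.0, 1.0, 2.0, np.nan]})

# Stateless: fit learns nothing, so transform uses the test mode.
stateless = FunctionTransformer(func=fill_with_mode)
stateless.fit(train)
filled_stateless = stateless.transform(test)

# Stateful: the mode learned from train is reused on test.
stateful = SimpleImputer(strategy="most_frequent")
stateful.fit(train)
filled_stateful = stateful.transform(test)

print(filled_stateless.iloc[-1, 0], filled_stateful[-1, 0])
```

Whether this difference matters depends on the data. With few missing values and similar train and test distributions, the two approaches often agree.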

For the `Sex` column, we impute missing values with the *mode*, then one-hot encode the result.

In [10]:
# for 'Sex' column, custom function to impute null values and one-hot encoding

def sex(column):
    """Impute missing values with mode and one-hot encode.
    """
    temp = column.fillna(column.mode().iloc[0])
    return pd.get_dummies(temp)

In [11]:
transformer_sex = FunctionTransformer(func=sex)
X_train_sex = transformer_sex.fit_transform(X_train['Sex'])
X_test_sex = transformer_sex.transform(X_test['Sex'])

For the `Age` column, we impute missing values with the *mean*.

In [12]:
# for 'Age' column, we create a custom function to fill missing values with mean.

def age(column):
    """Impute null values with mean.
    """
    return column.fillna(column.mean())

In [13]:
transformer_age = FunctionTransformer(func=age)
X_train_age = transformer_age.fit_transform(X_train['Age'])
X_test_age = transformer_age.transform(X_test['Age'])

For the `SibSp` column, we impute missing values with the *mode*. We could have reused the `pclass` function defined above, since it performs the same imputation, but to avoid confusion we create a separate function named `sibsp`.

In [14]:
# for 'SibSp' column, custom function to fill missing values with mode.

def sibsp(column):
    """Fill null values in SibSp column with mode.
    """
    return column.fillna(column.mode().iloc[0])

In [15]:
transformer_sibsp = FunctionTransformer(func=sibsp)
X_train_sibsp = transformer_sibsp.fit_transform(X_train['SibSp'])
X_test_sibsp = transformer_sibsp.transform(X_test['SibSp'])
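If the duplication is unwanted, one option is a single helper whose behaviour is chosen through the `kw_args` parameter of `FunctionTransformer`. The sketch below assumes a hypothetical helper `fill_missing` with a `strategy` argument and toy data. It is not the approach used in the rest of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def fill_missing(frame, strategy="mode"):
    """Fill nulls with the mean or the mode (hypothetical helper)."""
    if strategy == "mean":
        return frame.fillna(frame.mean())
    return frame.fillna(frame.mode().iloc[0])

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "SibSp": [0.0, 0.0, 1.0, np.nan],
    "Fare": [10.0, 20.0, np.nan, 30.0],
})

# kw_args forwards extra keyword arguments to the wrapped function.
mode_filler = FunctionTransformer(func=fill_missing, kw_args={"strategy": "mode"})
mean_filler = FunctionTransformer(func=fill_missing, kw_args={"strategy": "mean"})

sibsp_filled = mode_filler.fit_transform(toy[["SibSp"]])  # null -> 0.0
fare_filled = mean_filler.fit_transform(toy[["Fare"]])    # null -> 20.0
```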

As with the `SibSp` column, we fill missing values in the `Parch` column with the *mode*.

In [16]:
# for 'Parch' column, custom function to fill missing values with mode.

def parch(column):
    """Fill null values in Parch column with mode.
    """
    return column.fillna(column.mode().iloc[0])

In [17]:
transformer_parch = FunctionTransformer(func=parch)
X_train_parch = transformer_parch.fit_transform(X_train['Parch'])
X_test_parch = transformer_parch.transform(X_test['Parch'])

For the `Fare` column, we impute missing values with the *mean*. Again, we could have reused the `age` function defined above, since it also imputes with the *mean*, but for clarity we define a new function `fare`.

In [18]:
# for Fare column, custom function to fill missing values with mean.

def fare(column):
    """Impute null values with mean.
    """
    return column.fillna(column.mean())

In [19]:
transformer_fare = FunctionTransformer(func=fare)
X_train_fare = transformer_fare.fit_transform(X_train['Fare'])
X_test_fare = transformer_fare.transform(X_test['Fare'])

For the `Embarked` column, we impute missing values with the *mode*, then one-hot encode the result, as we did for the `Sex` column.

In [20]:
# for Embarked column

def embarked(column):
    """Impute missing values and one-hot encode.
    """
    temp = column.fillna(column.mode().iloc[0])
    return pd.get_dummies(temp)

In [21]:
transformer_embarked = FunctionTransformer(func=embarked)
X_train_embarked = transformer_embarked.fit_transform(X_train['Embarked'])
X_test_embarked = transformer_embarked.transform(X_test['Embarked'])
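A point to keep in mind with `pd.get_dummies` inside a stateless function: the output columns depend on the categories present in the data passed in. If a category is absent from the test set, the test output has fewer columns. The sketch below uses toy data (hypothetical values) to show this, with `OneHotEncoder` as one possible alternative that remembers the categories seen during `fit`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values): the test set contains only 'S'.
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["S", "S"]})

# get_dummies builds columns from the categories it sees in each call.
dummies_train = pd.get_dummies(train["Embarked"])  # columns: C, Q, S
dummies_test = pd.get_dummies(test["Embarked"])    # columns: S only

# OneHotEncoder stores the training categories and reuses them.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)
encoded_test = encoder.transform(test).toarray()   # shape (2, 3)

print(list(dummies_train.columns), list(dummies_test.columns), encoded_test.shape)
```

In this notebook the train and test outputs both show the columns `C`, `Q`, and `S`, so the issue does not surface here.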

We have transformed all the columns. We now concatenate them into train and test DataFrames, stored in the variables `X_train_transformed` and `X_test_transformed`.

In [22]:
X_train_transformed = pd.concat([X_train_pclass, X_train_sex, X_train_age, X_train_sibsp, X_train_parch, X_train_fare, X_train_embarked], axis=1)
X_train_transformed.head()

Unnamed: 0,Pclass,female,male,Age,SibSp,Parch,Fare,C,Q,S
331,1,False,True,45.5,0,0,28.5,False,False,True
733,2,False,True,23.0,0,0,13.0,False,False,True
382,3,False,True,32.0,0,0,7.925,False,False,True
704,3,False,True,26.0,1,0,7.8542,False,False,True
813,3,True,False,6.0,4,2,31.275,False,False,True


In [23]:
X_test_transformed = pd.concat([X_test_pclass, X_test_sex, X_test_age, X_test_sibsp, X_test_parch, X_test_fare, X_test_embarked], axis=1)
X_test_transformed.head()

Unnamed: 0,Pclass,female,male,Age,SibSp,Parch,Fare,C,Q,S
709,3,False,True,30.505845,1,1,15.2458,True,False,False
439,2,False,True,31.0,0,0,10.5,False,False,True
840,3,False,True,20.0,0,0,7.925,False,False,True
720,2,True,False,6.0,0,1,33.0,False,False,True
39,3,True,False,14.0,1,0,11.2417,True,False,False


## ColumnTransformer

The steps above are repetitive. This was deliberate, so that a reader who does not know `ColumnTransformer` and `Pipeline` can still follow `FunctionTransformer` without much difficulty.

For readers who do know `ColumnTransformer` and `Pipeline`, the following code does the same work more cleanly.

It is recommended to read these notebooks first: [ColumnTransformer](https://github.com/amriteshkt/feature-engineering-practice/blob/main/04_column_transformer/ColumnTransformer.ipynb) and [Pipeline](https://github.com/amriteshkt/feature-engineering-practice/blob/main/05_pipeline/Pipeline.ipynb).

In [24]:
from sklearn.compose import ColumnTransformer

In [25]:
transformer = ColumnTransformer(transformers=[
    ('pclass_transformer', tranformer_pclass, ['Pclass']),
    ('sex_transformer', transformer_sex, ['Sex']),
    ('age_transformer', transformer_age, ['Age']),
    ('sibsp_transformer', transformer_sibsp, ['SibSp']),
    ('parch_transformer', transformer_parch, ['Parch']),
    ('fare_transformer', transformer_fare, ['Fare']),
    ('embarked_transformer', transformer_embarked, ['Embarked'])
], remainder='passthrough')

In [26]:
# fit and transform the train and test sets.

X_train_transformed_col_trans = transformer.fit_transform(X_train)
X_test_transformed_col_trans = transformer.transform(X_test)

We check whether `ColumnTransformer` gives the same result.

`X_train_transformed` is a pandas DataFrame, while `X_train_transformed_col_trans` is a `numpy.ndarray`.

In [27]:
(X_train_transformed == X_train_transformed_col_trans).sum()

Pclass    712
female    712
male      712
Age       712
SibSp     712
Parch     712
Fare      712
C         712
Q         712
S         712
dtype: int64

We get the same elements after performing the transformations.
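The per-column counts above can also be replaced by a single whole-array check. The sketch below runs on toy data (hypothetical values) with illustrative helper names. It builds the result by hand and with `ColumnTransformer`, then compares the two with `np.array_equal` after casting both to `float`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

def fill_with_mean(frame):
    """Fill nulls with the column mean (hypothetical helper)."""
    return frame.fillna(frame.mean())

def fill_mode_and_encode(frame):
    """Fill nulls with the mode, then one-hot encode (hypothetical helper)."""
    filled = frame.fillna(frame.mode().iloc[0])
    return pd.get_dummies(filled)

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Sex": ["male", "female", np.nan, "male"],
})

# Manual route: transform each column, then concatenate.
manual = pd.concat(
    [fill_with_mean(toy[["Age"]]), fill_mode_and_encode(toy[["Sex"]])],
    axis=1,
)

# ColumnTransformer route: a list selector passes a one-column DataFrame.
toy_transformer = ColumnTransformer(transformers=[
    ("age", FunctionTransformer(func=fill_with_mean), ["Age"]),
    ("sex", FunctionTransformer(func=fill_mode_and_encode), ["Sex"]),
])
combined = toy_transformer.fit_transform(toy)

# Cast both to float so the DataFrame and the ndarray compare element-wise.
same_values = np.array_equal(
    manual.to_numpy(dtype=float), np.asarray(combined, dtype=float)
)
print(combined.shape, same_values)
```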

In [28]:
X_train_transformed.isna().sum()

Pclass    0
female    0
male      0
Age       0
SibSp     0
Parch     0
Fare      0
C         0
Q         0
S         0
dtype: int64

## Pipeline

The final step is to use `Pipeline` to transform the columns, train a model, and check its accuracy.

In [29]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline(steps=[
    ('columns_transformer', transformer),
    ('rf', RandomForestClassifier(random_state=42))
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
y_pred

array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 1, 1])

In [30]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.8100558659217877

TODO
Use visualization to show the transformations.