<a href="https://www.kaggle.com/code/eleonoraricci/learning-pipelines-productivity-reusability?scriptVersionId=128571613" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **Leverage Pipelines to enhance reusability and increase productivity**

Recently I've been learning about **sklearn pipelines**, and I wanted to apply this workflow to the Titanic competition to streamline data handling and model training. The resulting code can be **easily reused** with other datasets, and it simplifies testing different models and feature engineering ideas, so I thought it would be useful to share a walkthrough applied to this dataset.

Here I don't dig deep into the feature engineering aspects, but focus on showcasing a general and reusable preprocessing workflow. 

Let's dig in! 💪

In [1]:
import pandas as pd

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer

seed = 17

In [2]:
training_set = pd.read_csv("/kaggle/input/titanic/train.csv")
submission_set = pd.read_csv("/kaggle/input/titanic/test.csv")

# **Check missing values**

In [3]:
#Check which columns have missing values
print("Missing values in training set : ")
print(training_set.isna().sum())
print("\n")
print("Missing values in submission set : ")
print(submission_set.isna().sum())
print("\n")

Missing values in training set : 
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


Missing values in submission set : 
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64




# **Preprocessing steps**

For starters, we can divide the preprocessing into several steps:
- Features to drop
- Features containing missing values
- Features to encode
- Features to scale/normalize

In [4]:
# List the columns to pre-process with each strategy. Of the columns we keep, only "Fare" has a missing value (one, in the submission set).
fill_with_mean = []
fill_with_median = ["Fare"]
fill_with_most_frequent = []
fill_with_value, fill_value = [], None  #Neglecting "Cabin" and "Embarked" because we'll drop them in this example

one_hot_encode = ["Sex"]
ordinal_encode = []

columns_to_drop = ["Name", "Ticket", "Cabin", "PassengerId", "Embarked", "Age", "Parch", "SibSp"]

min_max_scale = []
standard_scale = []
to_normalize = []

# **Dropping DataFrame columns**
We can accomplish this by [creating a custom column transformer](https://www.andrewvillazon.com/custom-scikit-learn-transformers/) and adding it to the pipeline, to remove the features we don't want to use.
By inheriting from *BaseEstimator* and *TransformerMixin* and implementing the *fit* and *transform* methods, we achieve the required functionality and we can then use it inside the pipelines.

In [5]:
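class ColumnsDropper(TransformerMixin, BaseEstimator):
    # scikit-learn recommends listing mixins to the left of BaseEstimator
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn: the columns to drop are fixed at construction
        return self

    def transform(self, X, y=None):
        # X is expected to be a pandas DataFrame
        return X.drop(self.columns, axis=1)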
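
To see what inheriting from *BaseEstimator* and *TransformerMixin* gives us, here is a minimal sketch on a made-up three-column DataFrame (the `toy` data is an assumption for illustration, not part of the Titanic set). The class is repeated so that the snippet runs on its own.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnsDropper(TransformerMixin, BaseEstimator):
    """Same idea as the transformer above, repeated to keep the snippet standalone."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X, y=None):
        return X.drop(self.columns, axis=1)


# Toy data, made up for illustration only
toy = pd.DataFrame({
    "Name": ["A", "B"],
    "Fare": [7.25, 71.28],
    "Sex": ["male", "female"],
})

dropper = ColumnsDropper(columns=["Name"])

# fit_transform comes for free from TransformerMixin
toy_dropped = dropper.fit_transform(toy)
print(list(toy_dropped.columns))

# get_params / set_params come for free from BaseEstimator
print(dropper.get_params())
```

The original DataFrame is left untouched, because `DataFrame.drop` returns a new object by default.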

# **Imputing missing values**

We can define various imputers to **handle missing values** according to a specified strategy, using the sklearn SimpleImputer. Different imputer strategies can be applied to different columns. "*Mean*" and "*median*" strategies can only be applied to numerical columns, "*most_frequent*" and "*constant*" can be applied to both strings and numbers.

We can pass **empty lists** of columns to the transformers, so we can **keep the implementation general** and have the code ready for reuse in more complex cases by defining a **ColumnTransformer** that applies each imputing strategy to its own list of columns. It is important to specify the *remainder='passthrough'* argument; otherwise the default behaviour is to drop the columns that were not assigned to any transformer.

**[SimpleImputer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)**

**[ColumnTransformer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)**

In [6]:
col_transformers=[
    ('mean', SimpleImputer(strategy='mean'), fill_with_mean),
    ('median', SimpleImputer(strategy='median'), fill_with_median),
    ('most_frequent', SimpleImputer(strategy='most_frequent'), fill_with_most_frequent),
    ('value', SimpleImputer(strategy='constant', fill_value = fill_value), fill_with_value),        
]
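
The effect of *remainder* and of empty column lists is easy to check on a tiny example. The sketch below uses a made-up two-column DataFrame (the values are assumptions for illustration) and compares `remainder='passthrough'` with the default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "Fare": [7.0, np.nan, 9.0],
    "Pclass": [3, 1, 2],
})

imputers = [
    ("mean", SimpleImputer(strategy="mean"), []),            # empty list: nothing happens
    ("median", SimpleImputer(strategy="median"), ["Fare"]),  # NaN becomes the median, 8.0
]

with_passthrough = ColumnTransformer(transformers=imputers, remainder="passthrough")
with_default = ColumnTransformer(transformers=imputers)  # remainder defaults to "drop"

out_passthrough = with_passthrough.fit_transform(toy)
out_default = with_default.fit_transform(toy)

print(out_passthrough.shape)  # "Pclass" is kept
print(out_default.shape)      # "Pclass" is silently dropped
```

Note that the transformed columns come first in the output and the passed-through columns are appended after them, so the column order can differ from the input.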

# **Encoding categorical variables**
We may need to use an encoder to transform labeled categories into a numerical input, either with **Ordinal Encoding** or **One-Hot encoding**.

|City | Ordinal Encoding | One-Hot Encoding|
|----|------|------|
|Tokyo| 0 | [1,0,0] |
|Madrid| 1 |[0,1,0] |
|Tunisi| 2 |[0,0,1] |

Or maybe we have a categorical variable expressed numerically and we want to one-hot encode it. For example, we could have a variable "Season" taking the values 0, 1, 2, 3. These are numbers, but there is no real numerical relationship between them (it is not meaningful to say that "autumn" > "winter" or vice versa), so we might prefer to one-hot encode the variable instead.

Documentation for **[OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder)**, **[OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder)**, and **[LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)**. 
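
As a quick check, the table above can be reproduced with the two encoders. This is a minimal sketch: the `cities` DataFrame and the explicit `order` list are assumptions made for illustration. Without `categories=...`, both encoders sort the categories alphabetically, which would put Madrid first.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# The three cities from the table above
cities = pd.DataFrame({"City": ["Tokyo", "Madrid", "Tunisi"]})

# Explicit category order, so that the codes match the table
order = [["Tokyo", "Madrid", "Tunisi"]]

ordinal = OrdinalEncoder(categories=order).fit_transform(cities)

# OneHotEncoder returns a sparse matrix by default; convert it for display
one_hot = OneHotEncoder(categories=order).fit_transform(cities).toarray()

print(ordinal.ravel())  # one integer code per city
print(one_hot)          # one column per city
```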

In [7]:
# We append the Encoders to the column transformers list
col_transformers.append(('one_hot', OneHotEncoder(), one_hot_encode))
col_transformers.append(('ordinal', OrdinalEncoder(), ordinal_encode))

Pipelines handle the transformation of the input data, not of the target, so the target needs to be handled separately. When the labels need an Ordinal Encoding, we can apply LabelEncoder (in the Titanic data, "Survived" is already encoded as 0/1).

In [8]:
# This is an Ordinal transformer and it should be used to encode target values, i.e. y, and not the input X.
label_encoder = LabelEncoder()
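
For a target that comes as text, the encoder would be used roughly as follows. The labels below are hypothetical and only serve to illustrate the round trip.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical text labels, for illustration only
y_text = ["survived", "died", "died", "survived"]

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_text)

# Classes are sorted alphabetically, so "died" -> 0 and "survived" -> 1
print(encoder.classes_)
print(y_encoded)

# Predictions can be mapped back to the original labels
decoded = encoder.inverse_transform(y_encoded)
print(decoded)
```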

# **Scaling/Standardizing/Normalizing**

Many models benefit in terms of accuracy and robustness when features are on a relatively similar scale and close to normally distributed. **MinMaxScaler**, **StandardScaler**, and **Normalizer** are scikit-learn transformers that preprocess data with this goal. Their different use cases are discussed, for example, in this [blog post](https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02).

In [9]:
col_transformers.append(('min_max', MinMaxScaler(), min_max_scale))
col_transformers.append(('std', StandardScaler(), standard_scale))
col_transformers.append(('norm', Normalizer(), to_normalize))
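
The three transformers behave quite differently, which a small made-up array makes visible (the values are assumptions for illustration). In particular, MinMaxScaler and StandardScaler work column by column, while Normalizer rescales each row.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Toy data: two features on different scales
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])

min_max = MinMaxScaler().fit_transform(X_toy)     # each column mapped to [0, 1]
standard = StandardScaler().fit_transform(X_toy)  # each column: mean 0, std 1
normalized = Normalizer().fit_transform(X_toy)    # each ROW: unit L2 norm

print(min_max)
print(standard)
print(normalized)
```

Because Normalizer works on rows, listing columns under `to_normalize` rescales each sample using only those columns.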

# **Collecting all the pieces**
With all the preprocessing steps listed, we can create a ColumnTransformer that will take care of the preprocessing.

Documentation for **[ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)**.

In [10]:
preprocessor = ColumnTransformer(
    transformers=col_transformers,      
    remainder='passthrough')

# **Pipeline**
The **Pipeline** allows stacking various operations to perform on the data *in the specified sequence*, concluding with an estimator. A difference between a **Pipeline** and a **ColumnTransformer**, is that in the Pipeline all the steps specified are **applied to the same data**, whereas in the ColumnTransformer we can **specify different columns for each action to perform**. 

Intermediate steps of the pipeline must be *transforms*, that is, they must implement *fit* and *transform* methods. The final estimator only needs to implement *fit*. 

The commented-out example below (1) defines a single transformation for the numerical features (the SimpleImputer), (2) creates a Pipeline of transformations to be performed in sequence on the categorical variables (filling in missing values, then one-hot encoding), and (3) groups all the defined column transformations into a ColumnTransformer.

Documentation for **[Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline)**.

In [11]:
######   EXAMPLE   ######
## Preprocessing for numerical data
#numerical_transformer = SimpleImputer(strategy='constant')

## Preprocessing for categorical data
#categorical_transformer = Pipeline(steps=[
#    ('imputer', SimpleImputer(strategy='most_frequent')),
#    ('onehot', OneHotEncoder(handle_unknown='ignore'))
#])

# Bundle preprocessing for numerical and categorical data
#preprocessor = ColumnTransformer(
#    transformers=[
#        ('num', numerical_transformer, numerical_cols),
#        ('cat', categorical_transformer, categorical_cols)
#    ])
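
A runnable variant of the commented example could look like the sketch below. The `toy` DataFrame and the contents of `numerical_cols` and `categorical_cols` are assumptions made for illustration, and the result is stored in `example_preprocessor` so that it does not overwrite the `preprocessor` defined earlier.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data with one missing value per column, made up for illustration
toy = pd.DataFrame({
    "Fare": [7.25, np.nan, 8.05, 53.1],
    "Embarked": ["S", "C", np.nan, "S"],
})
numerical_cols = ["Fare"]          # hypothetical column lists
categorical_cols = ["Embarked"]

# (1) A single transformation for numerical data (missing values become 0)
numerical_transformer = SimpleImputer(strategy="constant", fill_value=0)

# (2) Two transformations applied in sequence to categorical data
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# (3) Bundle both, each acting on its own columns
example_preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_transformer, numerical_cols),
    ("cat", categorical_transformer, categorical_cols),
])

transformed = example_preprocessor.fit_transform(toy)
print(transformed.shape)  # one "Fare" column plus one column per port
```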

### **>>Insert Model Here<<**
At this point we need to **initialize the model**, so that we can chain data processing and model fitting in the subsequent cell.

In [12]:
# Import from the public module rather than the private sklearn.ensemble._forest
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(criterion='gini',
                               n_estimators=1750,
                               max_depth=7,
                               min_samples_split=6,
                               min_samples_leaf=6,
                               max_features='sqrt',  # 'auto' meant 'sqrt' for classifiers; removed in scikit-learn 1.3
                               oob_score=True,
                               random_state=seed,
                               n_jobs=-1,
                               verbose=1) 

Finally we can **concatenate the preprocessing and modelling in the final Pipeline**. 

Yes, **pipelines can be nested**! Here the *dropper* is a custom transformer, and the *preprocessor* is a ColumnTransformer that in general could contain Pipelines as well. This nested structure creates a hierarchy which controls the sequence of the execution.

In [13]:
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('dropper', ColumnsDropper(columns = columns_to_drop)),
                              ('preprocessor', preprocessor),
                              ('model', model) ]) 
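
The nesting also gives every component an address. As a minimal sketch (using a smaller, hypothetical `sketch_pipeline` rather than the full one above), steps can be looked up by name, and nested parameters can be changed with the `<step>__<parameter>` convention, adding one `__` per level.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# A reduced pipeline, for illustration only
sketch_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(
        transformers=[("median", SimpleImputer(strategy="median"), ["Fare"])],
        remainder="passthrough",
    )),
    ("model", RandomForestClassifier(n_estimators=100, random_state=17)),
])

# Steps are reachable by the names given in `steps`
print(type(sketch_pipeline.named_steps["model"]).__name__)

# Nested parameters: pipeline step -> transformer name -> parameter
sketch_pipeline.set_params(
    model__n_estimators=50,
    preprocessor__median__strategy="mean",
)

params = sketch_pipeline.get_params()
print(params["model__n_estimators"])
print(params["preprocessor__median__strategy"])
```

The same parameter names are what tools such as `GridSearchCV` expect in their parameter grids.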

# **Splitting the data and training the model**

In [14]:
X = training_set.copy()
# We remove the target column from the training data... 
y = X.pop("Survived")
# ...and we split it in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=seed)

The next line applies the pipeline to the train set: preprocessing and training the model.

In [15]:
my_pipeline.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.4s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done 1750 out of 1750 | elapsed:    3.3s finished


Pipeline(steps=[('dropper',
                 ColumnsDropper(columns=['Name', 'Ticket', 'Cabin',
                                         'PassengerId', 'Embarked', 'Age',
                                         'Parch', 'SibSp'])),
                ('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('mean', SimpleImputer(), []),
                                                 ('median',
                                                  SimpleImputer(strategy='median'),
                                                  ['Fare']),
                                                 ('most_frequent',
                                                  SimpleImputer(strategy='most_frequent'),
                                                  []),
                                                 ('value',
                                                  SimpleImputer(strategy='constant'),
                               

# **Evaluate the model predictions**
Now with one line we can apply the same pipeline to the test data and get predictions from the trained model.

In [16]:
# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_test)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:    0.2s
[Parallel(n_jobs=4)]: Done 1242 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done 1750 out of 1750 | elapsed:    0.4s finished


In [17]:
# This competition uses the accuracy_score as a metric
from sklearn.metrics import accuracy_score

# Evaluate the model
score = accuracy_score(y_test, preds)
print('Accuracy:', score)

Accuracy: 0.8156424581005587


This concludes the demonstration of the Pipeline implementation. Going forward, one could leverage the pipeline to create a workflow for checking different models, different hyperparameters, or different preprocessing strategies... all while keeping the code modular, readable, and reusable!
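
One possible shape for such a workflow is sketched below, using `cross_val_score` and `StratifiedKFold`. Everything here is an assumption for illustration: the data is randomly generated (so the scores themselves are meaningless), and `build_pipeline` and the two candidate models are hypothetical names, not part of the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Random toy data with Titanic-like columns, for illustration only
rng = np.random.default_rng(17)
n = 200
toy_X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.choice(["male", "female"], size=n),
    "Fare": rng.uniform(5, 100, size=n),
})
toy_y = pd.Series(rng.integers(0, 2, size=n), name="Survived")


def build_pipeline(model):
    """Wrap any estimator in the same preprocessing (hypothetical helper)."""
    preprocessor = ColumnTransformer(
        transformers=[
            ("median", SimpleImputer(strategy="median"), ["Fare"]),
            ("one_hot", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
        ],
        remainder="passthrough",
    )
    return Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

candidates = {
    "shallow_forest": RandomForestClassifier(n_estimators=50, max_depth=3, random_state=17),
    "deeper_forest": RandomForestClassifier(n_estimators=50, max_depth=7, random_state=17),
}

cv_results = {}
for name, candidate in candidates.items():
    # The whole pipeline is refit on each fold, so preprocessing never sees the held-out fold
    scores = cross_val_score(build_pipeline(candidate), toy_X, toy_y, cv=cv, scoring="accuracy")
    cv_results[name] = scores
    print(name, scores.mean())
```

Passing the full pipeline to `cross_val_score`, rather than preprocessing the data once up front, keeps statistics such as the imputation median from leaking across folds.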

# **Submission**
After we have completed all preliminary analysis and chosen the best modelling approach, we can retrain the model using the whole training data to further increase its accuracy (in this final stage the test set is actually the one used to score the submission on the leaderboard!).

In [18]:
my_pipeline.fit(X, y)
X_sub = submission_set.copy()
results = my_pipeline.predict(X_sub)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.4s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done 1750 out of 1750 | elapsed:    3.3s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:    0.2s
[Parallel(n_jobs=4)]: Done 1242 tasks      | elapsed:    0.4s
[Parallel(n_jobs=4)]: Done 1750 out of 1750 | elapsed:    0.5s finished


In [19]:
submission_df = pd.DataFrame({'PassengerId': submission_set['PassengerId'], 'Survived': results})
submission_df.to_csv('submission.csv', index=False)

Since I started using Pipelines I have greatly reduced the time spent writing boilerplate code and can concentrate on the more creative aspects of ML modelling. 🎨 I hope this walkthrough was useful and that others can benefit from implementing Pipelines as well!! 🔥

Happy modelling! :D