**Author:** Cainã Max Couto da Silva  
**LinkedIn:** [@cmcouto-silva](https://www.linkedin.com/in/cmcouto-silva/)

&nbsp;

---

This notebook continues the [previous notebook](https://colab.research.google.com/drive/1TrtQDyd9evEIl1GbvxZeIbENiYQVsN7T?usp=sharing), where we studied how to build manual transformers, sklearn transformers, and sklearn column transformers. Please check it out!

This notebook aims to familiarize you with scikit-learn pipelines.

It's divided into two main topics:
- Pipeline demonstration using small, unrealistic fake data
- Scikit-learn pipeline using the IBM churn dataset

# Setup

Please make sure to use scikit-learn>=1.3.2:

In [None]:
# %pip install scikit-learn==1.3.2

## Libraries

Like before, the libraries will be imported as needed for didactic reasons.

In [None]:
import numpy as np
import pandas as pd

# For displaying pipelines
from sklearn import set_config
set_config(display='diagram')
set_config(transform_output="pandas")

## Dataset

Let's start by reproducing the fake data from the previous notebook:

In [None]:
# Create simulated data set

df_train = pd.DataFrame({
    'tool_id': [1,2,3,4,5],
    'temperature': [180,100,120,np.nan,90],
    'pressure': [13000,5000,11000,4500,np.nan],
    'due_maintenance': ['Yes', 'No', 'Yes', 'Yes', 'No'],
    'age_status': ['old','new','old','old','new'],
    'failed':[True,False,True,False,False]
}).set_index('tool_id')

df_test = pd.DataFrame({
    'tool_id': [6,7,8],
    'temperature': [85,110,np.nan],
    'pressure': [6000,10500,3300],
    'due_maintenance': ['Yes', 'Yes', 'No'],
    'age_status': ['new', 'old','ancient'],
    'failed':[False,True,False]
}).set_index('tool_id')

df_future_unique = pd.DataFrame({
    'tool_id': [10],
    'temperature': [12],
    'pressure': [7500],
    'due_maintenance': ['No'],
    'age_status': ['new'],
}).set_index('tool_id')

print('Train data')
display(df_train)
print()

print('Test data')
display(df_test)
print()

print('Future data')
display(df_future_unique)

Train data


| tool_id | temperature | pressure | due_maintenance | age_status | failed |
|---|---|---|---|---|---|
| 1 | 180.0 | 13000.0 | Yes | old | True |
| 2 | 100.0 | 5000.0 | No | new | False |
| 3 | 120.0 | 11000.0 | Yes | old | True |
| 4 | NaN | 4500.0 | Yes | old | False |
| 5 | 90.0 | NaN | No | new | False |



Test data


| tool_id | temperature | pressure | due_maintenance | age_status | failed |
|---|---|---|---|---|---|
| 6 | 85.0 | 6000 | Yes | new | False |
| 7 | 110.0 | 10500 | Yes | old | True |
| 8 | NaN | 3300 | No | ancient | False |



Future data


| tool_id | temperature | pressure | due_maintenance | age_status |
|---|---|---|---|---|
| 10 | 12 | 7500 | No | new |


In [None]:
# List features and target
NUMERICAL_FEATURES = [
    'temperature',
    'pressure'
]

CATEGORICAL_FEATURES = [
    'due_maintenance',
    'age_status'
]

FEATURES = NUMERICAL_FEATURES + CATEGORICAL_FEATURES
TARGET = 'failed'

In [None]:
# Train features and target
X_train = df_train[FEATURES]
y_train = df_train[TARGET]

# Test features and target
X_test = df_test[FEATURES]
y_test = df_test[TARGET]

# Instance with unknown target
X_new = df_future_unique[FEATURES]

# **ML Pipelines**

## **Fake data**

In our first example, we apply distinct transformations to numeric and categorical features before training our model. This is where `Pipeline`, `make_pipeline`, `ColumnTransformer`, and `make_column_transformer` from scikit-learn play crucial roles.

In summary:

- **`Pipeline`:** This requires specifying a list of tuples, each representing a step in the pipeline. Each tuple consists of a name (a string) and an object (transformer or estimator). The Pipeline then sequentially applies these transformations and a final estimator.

- **`make_pipeline`:** This is a simpler way to create a pipeline. Instead of naming each step, you just list the transformations and estimators. It automatically assigns names to each step based on their types.

- **`ColumnTransformer`:** This tool is essential for applying different transformations to different columns. You define a list of tuples, where each tuple contains a name, a transformer, and column indices or names. The ColumnTransformer then applies these transformers to the respective columns.

- **`make_column_transformer`:** Similar to make_pipeline, this function simplifies the creation of a ColumnTransformer. You pass the transformers along with the columns they should be applied to, without manually naming each transformer.
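To make the naming behavior described above concrete, here is a minimal sketch comparing explicit and automatic step names. The step names `'scaler'` and `'model'` and the variable names are chosen only for illustration; nothing is fitted here.

```python
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Explicit names: we choose them ourselves
explicit = Pipeline(steps=[('scaler', StandardScaler()), ('model', LogisticRegression())])

# Automatic names: lowercase class names
implicit = make_pipeline(StandardScaler(), LogisticRegression())

explicit_names = [name for name, _ in explicit.steps]
implicit_names = [name for name, _ in implicit.steps]
print(explicit_names)  # ['scaler', 'model']
print(implicit_names)  # ['standardscaler', 'logisticregression']

# The same automatic naming applies to make_column_transformer
ct = make_column_transformer(
    (SimpleImputer(), ['temperature']),
    (StandardScaler(), ['pressure']),
)
ct_names = [name for name, _, _ in ct.transformers]
print(ct_names)  # ['simpleimputer', 'standardscaler']
```

Automatic names are convenient for quick experiments, while explicit names tend to be easier to read when referencing steps later.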

In practice, you can use ColumnTransformer or make_column_transformer to handle different types of data (like numeric and categorical) within your dataset, and then encapsulate the entire preprocessing and model training process within a Pipeline or make_pipeline. This not only streamlines the workflow but also helps prevent data leakage: every transformer is fitted only on the data passed to `fit` (the training data) and is then reused as-is to transform validation or test data.
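As a rough illustration of the leakage point, the sketch below fits a pipeline on a training split only and checks which mean the scaler actually learned. The synthetic data and all variable names (`X_demo`, `leak_free`, and so on) are assumptions made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 1))
y_demo = (X_demo[:, 0] > 5.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# The scaler inside the pipeline only ever sees the training split
leak_free = make_pipeline(StandardScaler(), LogisticRegression())
leak_free.fit(X_tr, y_tr)

scaler_mean = leak_free.named_steps['standardscaler'].mean_[0]
print('Mean learned by the pipeline: ', scaler_mean)
print('Mean of the training split:   ', X_tr[:, 0].mean())
print('Mean of the full data (leaky):', X_demo[:, 0].mean())
```

Scaling the full dataset before splitting would instead bake test-set information into the scaler, which is the leakage the pipeline avoids.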

### Simple pipeline 1

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.linear_model import LogisticRegression

In [None]:
# Preprocessors (transformers)
numeric_preprocessor = SimpleImputer(strategy='mean')
categorical_preprocessor = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Create Column transformer with make_column_transformer (tuples: transformer, list of columns)
preprocessor = make_column_transformer(
    (numeric_preprocessor, NUMERICAL_FEATURES),
    (categorical_preprocessor, CATEGORICAL_FEATURES),
)

# Create pipeline (list of tuples - step name, transformer/estimator)
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LogisticRegression())
])

# Display pipeline
model_pipeline

Likewise, we can build the same pipeline with the simpler `make_pipeline` (here also raising `max_iter` so the solver has more iterations to converge):

In [None]:
model_pipeline = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
model_pipeline.fit(X_train, y_train)

In [None]:
# Predict train, test, and new data with pipeline
try:
  print('Train predictions:', model_pipeline.predict(X_train))
  print('Test predictions:', model_pipeline.predict(X_test))
  print('New predictions:', model_pipeline.predict(X_new))
except Exception as e:
  print(e)

Train predictions: [ True False  True False False]
Test predictions: [False  True False]
New predictions: [False]


We can access the pipeline steps through the attribute `.named_steps`. This feature allows us to retrieve and use specific trained transformers, which might be essential for debugging purposes.

In [None]:
# List pipeline steps
model_pipeline.named_steps

{'columntransformer': ColumnTransformer(transformers=[('simpleimputer', SimpleImputer(),
                                  ['temperature', 'pressure']),
                                 ('onehotencoder',
                                  OneHotEncoder(handle_unknown='ignore',
                                                sparse_output=False),
                                  ['due_maintenance', 'age_status'])]),
 'logisticregression': LogisticRegression(max_iter=1000)}

In [None]:
# Access preprocessing
model_pipeline.named_steps['columntransformer'] # or model_pipeline['columntransformer']

In [None]:
# Access preprocessing - imputer
model_pipeline.named_steps['columntransformer'].named_transformers_['simpleimputer']

In [None]:
# Use trained imputer
trained_imputer = model_pipeline.named_steps['columntransformer'].named_transformers_['simpleimputer']
trained_imputer.transform(X_test[NUMERICAL_FEATURES])

| tool_id | temperature | pressure |
|---|---|---|
| 6 | 85.0 | 6000.0 |
| 7 | 110.0 | 10500.0 |
| 8 | 122.5 | 3300.0 |
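The value 122.5 imputed for tool 8 is the mean of the non-missing training temperatures, (180 + 100 + 120 + 90) / 4. A small standalone sketch reproducing that arithmetic with the same temperature values (the array names are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Temperature columns copied from the train and test data above
train_temperature = np.array([[180.0], [100.0], [120.0], [np.nan], [90.0]])
test_temperature = np.array([[85.0], [110.0], [np.nan]])

# The mean is learned from the training data only (missing values are ignored)
imputer = SimpleImputer(strategy='mean').fit(train_temperature)

# np.asarray keeps this working whether the output is an array or a DataFrame
imputed_test = np.asarray(imputer.transform(test_temperature))

print(imputer.statistics_)   # [122.5]
print(imputed_test.ravel())  # the missing test value is replaced by 122.5
```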


### Simple pipeline 2

What if we want to apply multiple consecutive transformations to the same features (*e.g.*, imputation and scaling)?

In [None]:
# Preprocessors (transformers)
numeric_preprocessor = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler()
    )

categorical_preprocessor = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Create Column transformer (list of tuples: step name, transformer, list of columns)
preprocessor = ColumnTransformer([
    ('numeric', numeric_preprocessor, NUMERICAL_FEATURES),
    ('categorical', categorical_preprocessor, CATEGORICAL_FEATURES),
])

# Create pipeline (list of tuples - step name, transformer/estimator)
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LogisticRegression())
])

# Train pipeline (transformers and model)
model_pipeline.fit(X_train, y_train)

In [None]:
model_pipeline.named_steps

{'preprocessor': ColumnTransformer(transformers=[('numeric',
                                  Pipeline(steps=[('simpleimputer',
                                                   SimpleImputer()),
                                                  ('standardscaler',
                                                   StandardScaler())]),
                                  ['temperature', 'pressure']),
                                 ('categorical',
                                  OneHotEncoder(handle_unknown='ignore',
                                                sparse_output=False),
                                  ['due_maintenance', 'age_status'])]),
 'model': LogisticRegression()}

In [None]:
# Access preprocessor step
model_pipeline.named_steps['preprocessor']

In [None]:
# Use trained preprocessor to transform the test data
model_pipeline.named_steps['preprocessor'].transform(X_test)

| tool_id | numeric__temperature | numeric__pressure | categorical__due_maintenance_No | categorical__due_maintenance_Yes | categorical__age_status_new | categorical__age_status_old |
|---|---|---|---|---|---|---|
| 6 | -1.200961 | -0.718132 | 0.0 | 1.0 | 1.0 | 0.0 |
| 7 | -0.40032 | 0.64254 | 0.0 | 1.0 | 0.0 | 1.0 |
| 8 | 0.0 | -1.534536 | 1.0 | 0.0 | 0.0 | 0.0 |


In [None]:
# Access model and list parameters
model_pipeline.named_steps['model'].get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

### Intermediate pipeline

In this example, let's add an extra step to the pipeline to apply PCA before training the model.

In [None]:
from sklearn.decomposition import PCA

In [None]:
# Preprocessors (transformers)
# numeric preprocessor
numeric_preprocessor = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler()
    )
# categorical preprocessor
categorical_preprocessor = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Create the column transformer (list of tuples: step name, transformer, list of columns)
preprocessor = ColumnTransformer([
    ('numeric', numeric_preprocessor, NUMERICAL_FEATURES),
    ('categorical', categorical_preprocessor, CATEGORICAL_FEATURES),
])

# Create pipeline (list of tuples - step name, transformer/estimator)
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('pca', PCA(n_components=.9)), # we will retain the top N components responsible for 90% of the data variance
    ('model', LogisticRegression())
])

# Fit pipeline
model_pipeline.fit(X_train, y_train)

Let's retrieve the trained preprocessor and PCA to output the components:

In [None]:
# Retrieve the trained preprocessor and PCA model
trained_preprocessor = model_pipeline.named_steps['preprocessor']
trained_pca = model_pipeline.named_steps['pca']

# Apply transformations to the test data
trained_pca.transform( trained_preprocessor.transform(X_test) )

| tool_id | pca0 | pca1 |
|---|---|---|
| 6 | -1.260717 | 0.309649 |
| 7 | 0.536802 | -0.005577 |
| 8 | -1.219695 | 0.711296 |
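Passing a float such as `n_components=.9` asks PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. A hedged sketch on synthetic data (the `X_demo` matrix and its column scales are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: five columns with very different spreads
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

# Keep as many components as needed to explain more than 90% of the variance
pca = PCA(n_components=0.9).fit(X_demo)
cumulative = np.cumsum(pca.explained_variance_ratio_)

print('Components kept:', pca.n_components_)
print('Cumulative explained variance:', cumulative)
```

The number of retained components therefore depends on the data, which is why the output above has two PCA columns rather than a fixed, pre-specified number.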


In [None]:
# Predict train, test, and new data with pipeline
try:
  print('Train predictions:', model_pipeline.predict(X_train))
  print('Test predictions:', model_pipeline.predict(X_test))
  print('New predictions:', model_pipeline.predict(X_new))
except Exception as e:
  print(e)

Train predictions: [ True False  True False False]
Test predictions: [False  True False]
New predictions: [False]


### Advanced pipeline

Finally, let's try a complex pipeline with multiple distinct steps.

**Target pipeline:**

1. Numeric features
  - Preprocess temperature with mean imputation and standard scaler
  - Preprocess pressure with median imputation and min-max scaler
  - Apply PCA to both outputs

2. Categorical features
  - Impute all categorical variables with the respective most frequent category
  - Apply one-hot encoder

3. Cluster records
  - Use both processed numeric and categorical variables to cluster observations using KMeans, so we can leverage the trained centroids to create a new column with the cluster labels

4. Feature selection
  - Select top 2 processed features

5. Model
  - Train a predictive model for the final processed/selected features



_**Note:** such a complex pipeline doesn't make sense at all for this fake data. I'm just highlighting the possibilities for building a custom and complex pipeline._

In [None]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.base import BaseEstimator, TransformerMixin

By default, the `.transform` method of scikit-learn's `KMeans` returns the distance from each sample to every cluster center rather than the cluster labels.
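To make the difference between `KMeans.transform` and `KMeans.predict` concrete, here is a small sketch on made-up 2D points (the `X_demo` array is an assumption for illustration only):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of hypothetical points
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

# transform: one column per cluster, holding the distance to that cluster center
distances = np.asarray(kmeans.transform(X_demo))

# predict: one label per row (the index of the closest center)
labels = kmeans.predict(X_demo)

print(distances.shape)  # (6, 2)
print(labels)
```

Since a plain `KMeans` step would hand the distance matrix to the next pipeline step, a small wrapper is needed to pass the original features plus a label column instead.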

Let's create a custom KMeans model to output the input data with an extra column (cluster labels) so we can use it as part of the model pipeline.

In [None]:
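class KMeansTransformer(BaseEstimator, TransformerMixin):
    """Append the KMeans cluster label of each row as a new `cluster` column."""

    def __init__(self, n_clusters=3):
        # Only store hyperparameters here so that set_params() and clone() behave as expected
        self.n_clusters = n_clusters

    def fit(self, X, y=None):
        # Build the KMeans model at fit time so it always uses the current n_clusters
        self.kmeans_ = KMeans(n_clusters=self.n_clusters, n_init='auto')
        self.kmeans_.fit(X)
        return self

    def transform(self, X):
        # Expects a pandas DataFrame (we set transform_output="pandas" above)
        clusters = self.kmeans_.predict(X)
        return X.assign(cluster=clusters)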

In [None]:
## Preprocessing for numerical data

# Temperature preprocessor
temperature_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

# Pressure preprocessor
pressure_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler()),
])

# Numeric transformer
numeric_transformer = ColumnTransformer([
    ('temp', temperature_transformer, ['temperature']),
    ('press', pressure_transformer, ['pressure']),

])

# Add PCA as an additional step to the numeric transformer
numeric_preprocessor = make_pipeline(numeric_transformer, PCA(.9))

# Specify categorical preprocessing steps
categorical_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Create a major preprocessor step with ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', numeric_preprocessor, NUMERICAL_FEATURES),
    ('cat', categorical_preprocessor, CATEGORICAL_FEATURES)
])

# Build our advanced pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('cluster', KMeansTransformer(n_clusters=4)),
    ('feat_selection', SelectKBest(k=2)),
    ('classifier', LogisticRegression())
])

# Train pipeline
model_pipeline.fit(X_train, y_train)

In [None]:
# Predict train, test, and new data with pipeline
try:
  print('Train predictions:', model_pipeline.predict(X_train))
  print('Test predictions:', model_pipeline.predict(X_test))
  print('New predictions:', model_pipeline.predict(X_new))
except Exception as e:
  print(e)

Train predictions: [ True False False False False]
Test predictions: [False False False]
New predictions: [False]


Let's show the outputs of our transformers:

In [None]:
# Preprocess numerical & categorical features
X_train_transformed = model_pipeline.named_steps['preprocessor'].transform(X_train)
display(X_train_transformed)

# Add clusters using trained centroids
X_train_transformed_clst = model_pipeline.named_steps['cluster'].transform(X_train_transformed)
display(X_train_transformed_clst)

| tool_id | num__pca0 | cat__due_maintenance_No | cat__due_maintenance_Yes | cat__age_status_new | cat__age_status_old |
|---|---|---|---|---|---|
| 1 | 1.922478 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | -0.799199 | 1.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.009706 | 0.0 | 1.0 | 0.0 | 1.0 |
| 4 | -0.122062 | 0.0 | 1.0 | 0.0 | 1.0 |
| 5 | -1.010923 | 1.0 | 0.0 | 1.0 | 0.0 |


| tool_id | num__pca0 | cat__due_maintenance_No | cat__due_maintenance_Yes | cat__age_status_new | cat__age_status_old | cluster |
|---|---|---|---|---|---|---|
| 1 | 1.922478 | 0.0 | 1.0 | 0.0 | 1.0 | 2 |
| 2 | -0.799199 | 1.0 | 0.0 | 1.0 | 0.0 | 0 |
| 3 | 0.009706 | 0.0 | 1.0 | 0.0 | 1.0 | 1 |
| 4 | -0.122062 | 0.0 | 1.0 | 0.0 | 1.0 | 1 |
| 5 | -1.010923 | 1.0 | 0.0 | 1.0 | 0.0 | 3 |


In [None]:
# Selecting the two best features as suggested by SelectKBest
model_pipeline.named_steps['feat_selection'].transform(X_train_transformed_clst)

| tool_id | num__pca0 | cat__age_status_old |
|---|---|---|
| 1 | 1.922478 | 1.0 |
| 2 | -0.799199 | 0.0 |
| 3 | 0.009706 | 1.0 |
| 4 | -0.122062 | 1.0 |
| 5 | -1.010923 | 0.0 |
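`SelectKBest` ranks features with a univariate score (by default the ANOVA F-value, `f_classif`) and keeps the `k` highest-scoring ones. The following sketch, on synthetic data invented for this example, shows how to inspect the scores and the selected columns:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical data: columns 0 and 1 depend on the target, columns 2 and 3 are pure noise
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 50)
X_demo = np.column_stack([
    y_demo + rng.normal(scale=0.1, size=100),
    -2 * y_demo + rng.normal(scale=0.1, size=100),
    rng.normal(size=100),
    rng.normal(size=100),
])

selector = SelectKBest(score_func=f_classif, k=2).fit(X_demo, y_demo)

# Indices of the columns that were kept
selected_columns = np.flatnonzero(selector.get_support())

print('F-scores:', selector.scores_)
print('Selected columns:', selected_columns)
```

On a fitted pipeline, the same `scores_` attribute and `get_support()` method are available on the `feat_selection` step.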


## **Telco Churn**

Now, let's use a more extensive dataset from IBM about churn in a telecommunication company. This dataset is simulated, but it's realistic enough for practical purposes.

## Setup

### Libraries

In [None]:
import joblib
import pandas as pd

from sklearn import metrics
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.preprocessing import StandardScaler, PowerTransformer, OneHotEncoder
from sklearn.utils.validation import check_is_fitted

### Functions

In [None]:
def get_metrics(y_true, y_pred, y_proba=None):
    """
    Calculate various performance metrics for a classification model.

    Args:
    y_true (array-like): True labels.
    y_pred (array-like): Predicted labels.
    y_proba (array-like, optional): Predicted probabilities for the positive class.

    Returns:
    dict: A dictionary containing calculated metrics such as Accuracy, Balanced Accuracy, Recall, Precision, F1, and optionally ROC_AUC.
    """
    dict_metrics = {
        'Accuracy': metrics.accuracy_score(y_true, y_pred),
        'Balanced Accuracy': metrics.balanced_accuracy_score(y_true, y_pred),
        'Recall': metrics.recall_score(y_true, y_pred),
        'Precison': metrics.precision_score(y_true, y_pred),
        'F1': metrics.f1_score(y_true, y_pred),
    }

    if y_proba is not None:
        dict_metrics['ROC_AUC'] = metrics.roc_auc_score(y_true, y_proba)

    return dict_metrics


def get_metrics_from_estimator(model, X, y):
    """
    Compute performance metrics for an estimator given features and true labels.

    Args:
    model (estimator): The fitted model/estimator to evaluate.
    X (array-like): Feature data used for prediction.
    y (array-like): True labels.

    Returns:
    dict: A dictionary of performance metrics calculated by the `get_metrics` function.
    """
    check_is_fitted(model)
    y_pred = model.predict(X)
    try:
        y_proba = model.predict_proba(X)[:, 1]
    except AttributeError:
        # Estimators without predict_proba (e.g., LinearSVC) skip ROC AUC
        y_proba = None
    return get_metrics(y_true=y, y_pred=y_pred, y_proba=y_proba)

### Dataset

In [None]:
data_url = 'https://raw.githubusercontent.com/cmcouto-silva/datasets/main/datasets/telco_churn.csv'
df = pd.read_csv(data_url, index_col='CustomerID')
display(df)

| CustomerID | Count | Country | State | City | Zip Code | Lat Long | Latitude | Longitude | Gender | Senior Citizen | ... | Contract | Paperless Billing | Payment Method | Monthly Charges | Total Charges | Churn Label | Churn Value | Churn Score | CLTV | Churn Reason |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3668-QPYBK | 1 | United States | California | Los Angeles | 90003 | 33.964131, -118.272783 | 33.964131 | -118.272783 | Male | No | ... | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | 1 | 86 | 3239 | Competitor made better offer |
| 9237-HQITU | 1 | United States | California | Los Angeles | 90005 | 34.059281, -118.30742 | 34.059281 | -118.307420 | Female | No | ... | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes | 1 | 67 | 2701 | Moved |
| 9305-CDSKC | 1 | United States | California | Los Angeles | 90006 | 34.048013, -118.293953 | 34.048013 | -118.293953 | Female | No | ... | Month-to-month | Yes | Electronic check | 99.65 | 820.50 | Yes | 1 | 86 | 5372 | Moved |
| 7892-POOKP | 1 | United States | California | Los Angeles | 90010 | 34.062125, -118.315709 | 34.062125 | -118.315709 | Female | No | ... | Month-to-month | Yes | Electronic check | 104.80 | 3046.05 | Yes | 1 | 84 | 5003 | Moved |
| 0280-XJGEX | 1 | United States | California | Los Angeles | 90015 | 34.039224, -118.266293 | 34.039224 | -118.266293 | Male | No | ... | Month-to-month | Yes | Bank transfer (automatic) | 103.70 | 5036.30 | Yes | 1 | 89 | 5340 | Competitor had better devices |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2569-WGERO | 1 | United States | California | Landers | 92285 | 34.341737, -116.539416 | 34.341737 | -116.539416 | Female | No | ... | Two year | Yes | Bank transfer (automatic) | 21.15 | 1419.40 | No | 0 | 45 | 5306 | NaN |
| 6840-RESVB | 1 | United States | California | Adelanto | 92301 | 34.667815, -117.536183 | 34.667815 | -117.536183 | Male | No | ... | One year | Yes | Mailed check | 84.80 | 1990.50 | No | 0 | 59 | 2140 | NaN |
| 2234-XADUH | 1 | United States | California | Amboy | 92304 | 34.559882, -115.637164 | 34.559882 | -115.637164 | Female | No | ... | One year | Yes | Credit card (automatic) | 103.20 | 7362.90 | No | 0 | 71 | 5560 | NaN |
| 4801-JZAZL | 1 | United States | California | Angelus Oaks | 92305 | 34.1678, -116.86433 | 34.167800 | -116.864330 | Female | No | ... | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No | 0 | 59 | 2793 | NaN |


## EDA

Since the goal of this workshop is not EDA, I'll skip this step to avoid a longer notebook (I have analyzed the features separately).

Let's explore only duplicate and missing values:

In [None]:
# Show duplicates
df.index.duplicated().any()

False

In [None]:
# Show missing values
missings = df.isna().sum()
missings[missings>0]

Churn Reason    5163
dtype: int64

So, as we can see, there are no duplicated customers, and there are missing values only for the churn reason, which we should expect since non-churn customers won't have a churn reason.

## Modeling

### Split data

Here, I'm selecting only potential features based on a separate EDA. Features were excluded because of constant values, redundant information, target leakage, or high granularity with no correlation with the target.

Let's first split our data into train and test sets.

In [None]:
NUMERIC_FEATURES = [
    'Tenure Months',
    'Monthly Charges',
    'Total Charges',
    'CLTV'
]

CATEGORICAL_FEATURES = [
    'Senior Citizen',
    'Partner',
    'Dependents',
    'Multiple Lines',
    'Internet Service',
    'Online Security',
    'Online Backup',
    'Device Protection',
    'Tech Support',
    'Streaming TV',
    'Streaming Movies',
    'Contract',
    'Paperless Billing',
    'Payment Method'
]

FEATURES = NUMERIC_FEATURES + CATEGORICAL_FEATURES
TARGET = 'Churn Value'

In [None]:
# Split features and target
X,y = df[FEATURES], df[TARGET]

# Split train & test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=2023)

### Simple pipeline

Then, we create a simple pipeline with classical transformers.

In [None]:
# Numeric transformer with Z-score scaler and simple mean imputer
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
])

# Categorical transformer with constant imputer and one-hot encoder
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
    ('encoder', OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse_output=False))
])

# Wrap main preprocessor (numeric + categorical)
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, NUMERIC_FEATURES),
    ('cat', categorical_transformer, CATEGORICAL_FEATURES),
])

# Model pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LogisticRegressionCV(max_iter=1_000))
])

# Fit pipeline
model_pipeline.fit(X_train, y_train)

In [None]:
# Assess metrics
get_metrics_from_estimator(model_pipeline, X_train, y_train)

{'Accuracy': 0.809833401056481,
 'Balanced Accuracy': 0.7385556039318787,
 'Recall': 0.5834586466165413,
 'Precison': 0.6701208981001727,
 'F1': 0.6237942122186495,
 'ROC_AUC': 0.8577057203141484}

In this scenario, optimizing the model for recall is crucial: a false negative is a churning customer we fail to flag and therefore cannot try to retain. Next, we will focus on tuning key parameters to improve recall.
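As a reminder of what recall measures, here is a tiny worked example with made-up labels (1 = churn). Recall is TP / (TP + FN), while precision is TP / (TP + FP):

```python
from sklearn import metrics

# Hypothetical labels: 4 churners, 6 non-churners
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# TP = 2 (churners caught), FN = 2 (churners missed), FP = 1 (false alarm)
recall = metrics.recall_score(y_true, y_pred)        # 2 / (2 + 2) = 0.5
precision = metrics.precision_score(y_true, y_pred)  # 2 / (2 + 1) = 0.667

print('Recall:', recall)
print('Precision:', round(precision, 3))
```

Half of the churners go undetected in this toy case, which is exactly the kind of error a recall-oriented model tries to reduce.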

**Note:** I'm using `LogisticRegressionCV` instead of `LogisticRegression` because it offers built-in cross-validation to automatically find the optimal regularization parameter, making it more efficient and convenient than using `LogisticRegression` with separate cross-validation and tuning steps. I won't test other models, but feel free to do it :)
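For intuition, the sketch below fits `LogisticRegressionCV` on a synthetic dataset (generated with `make_classification`, not the churn data) and shows the grid of candidate `C` values alongside the one selected by cross-validation. `Cs=10` and `cv=5` are written out explicitly for clarity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Hypothetical binary classification data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Evaluate 10 candidate C values with 5-fold cross-validation, then refit on all data
clf = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000).fit(X_demo, y_demo)

# np.ravel handles C_ whether it is stored as an array or a scalar
best_c = float(np.ravel(clf.C_)[0])

print('Candidate C values:', clf.Cs_)
print('Selected C:', best_c)
```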


### Hyperparameter tuning

`GridSearchCV`, combined with our pipeline, lets us search for optimal parameters efficiently.

In this setup, each step within the pipeline is referenced by its name, with nested names joined by a double underscore (`__`). This allows us to either replace an entire step, such as the scaler (`preprocessor__num__scaler`), or adjust a specific parameter within a step, such as the class weight (`model__class_weight`), as demonstrated in the example.
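The same double-underscore names work with `get_params` and `set_params`, which is a handy way to discover what can be tuned. A minimal sketch on an unfitted pipeline with the same step names as ours (the column indices `[0, 1]` and the name `demo_pipeline` are placeholders):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

demo_pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer([
        ('num', Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler())]), [0, 1]),
    ])),
    ('model', LogisticRegression()),
])

# Every key returned by get_params() is a valid name for set_params() or a parameter grid
tunable = sorted(demo_pipeline.get_params().keys())
print([name for name in tunable if name.startswith('model__')][:3])

# Replace a whole step and tweak a single parameter in one call
demo_pipeline.set_params(
    preprocessor__num__scaler=PowerTransformer(),
    model__class_weight='balanced',
)

swapped_scaler = demo_pipeline.named_steps['preprocessor'].transformers[0][1].named_steps['scaler']
print(type(swapped_scaler).__name__)  # PowerTransformer
```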

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Specify parameters
params = {
    'preprocessor__num__scaler': [StandardScaler(), PowerTransformer()],
    'model__class_weight': [None, 'balanced']
}

# Search for optimal parameters using cross-validation in the train data
grid = GridSearchCV(model_pipeline, param_grid=params, scoring='recall', n_jobs=-1)
grid.fit(X_train, y_train)

In [None]:
# Show best estimator
display(grid.best_estimator_)

# Show best params
print('Best params:', grid.best_params_)

Best params: {'model__class_weight': 'balanced', 'preprocessor__num__scaler': StandardScaler()}


In [None]:
# Show best regularization value
grid.best_estimator_['model'].C_

array([2.7825594])

Now, let's compute the metrics for the tuned model:

In [None]:
# Compute metrics
get_metrics_from_estimator(grid, X_test, y_test)

{'Accuracy': 0.7635071090047393,
 'Balanced Accuracy': 0.7771995668240099,
 'Recall': 0.8051948051948052,
 'Precison': 0.5241545893719807,
 'F1': 0.6349670811997074,
 'ROC_AUC': 0.8636723829049009}

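As we can see, recall improved substantially (from about 0.58 for the baseline, which was measured on the training set, to about 0.81 for the tuned model on the test set), with slight increases in F1 and ROC AUC and a drop in precision and accuracy. This highlights how much class weighting can shift model behavior on imbalanced datasets like this one.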
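To see what `class_weight='balanced'` does under the hood, the sketch below computes the weights for a made-up target with 8 negatives and 2 positives. Each class gets `n_samples / (n_classes * class_count)`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 8 non-churners, 2 churners
y_demo = np.array([0] * 8 + [1] * 2)

weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_demo)

# Class 0: 10 / (2 * 8) = 0.625, class 1: 10 / (2 * 2) = 2.5
print(dict(zip([0, 1], weights)))
```

Errors on the minority class are penalized more heavily during training, which pushes the model toward higher recall at the expense of precision.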

### Final pipeline

Finally, let's save our model to use it later.

In [None]:
# Select final trained model
final_model = grid.best_estimator_

# Save the final model as joblib
joblib.dump(final_model, 'model.joblib')

['model.joblib']

Now we can load the trained model to predict churn in production:

In [None]:
# Load the trained model
trained_model = joblib.load('model.joblib')

# Simulate customer data to predict churn
customer_data = {
  'Tenure Months': [13],
  'Monthly Charges': [96.85],
  'Total Charges': [1235.55],
  'CLTV': [3098],
  'Senior Citizen': ['Yes'],
  'Partner': ['No'],
  'Dependents': ['No'],
  'Multiple Lines': ['No'],
  'Internet Service': ['Fiber optic'],
  'Online Security': ['No'],
  'Online Backup': ['No'],
  'Device Protection': ['Yes'],
  'Tech Support': ['No'],
  'Streaming TV': ['Yes'],
  'Streaming Movies': ['Yes'],
  'Contract': ['Month-to-month'],
  'Paperless Billing': ['Yes'],
  'Payment Method': ['Electronic check']
}

df_customer = pd.DataFrame(customer_data)
display(df_customer)

|  | Tenure Months | Monthly Charges | Total Charges | CLTV | Senior Citizen | Partner | Dependents | Multiple Lines | Internet Service | Online Security | Online Backup | Device Protection | Tech Support | Streaming TV | Streaming Movies | Contract | Paperless Billing | Payment Method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 96.85 | 1235.55 | 3098 | Yes | No | No | No | Fiber optic | No | No | Yes | No | Yes | Yes | Month-to-month | Yes | Electronic check |


In [None]:
# Predict churn
trained_model.predict(df_customer)

array([1])
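Before relying on a saved pipeline, it can be worth checking that the reloaded object reproduces the original predictions. The following is only a sketch of such a round-trip check on synthetic data, writing to a temporary directory rather than to the `model.joblib` file above (the file name `demo_model.joblib` is arbitrary):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data and a small fitted pipeline
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
pipeline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_demo, y_demo)

# Save and reload inside a temporary directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.joblib')
    joblib.dump(pipeline, model_path)
    reloaded = joblib.load(model_path)

same_predictions = np.array_equal(pipeline.predict(X_demo), reloaded.predict(X_demo))
print('Reloaded pipeline matches the original:', same_predictions)
```

Keep in mind that a joblib file should be loaded with the same scikit-learn version that produced it, and only from sources you trust.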

Do you want to explore how to use transformers and pipelines using other excellent open-source libraries?  
In [this notebook](#), we explore `feature-engine` transformers and `imbalanced-learn` strategies.