**Author:** Cainã Max Couto da Silva  
**LinkedIn:** [@cmcouto-silva](https://www.linkedin.com/in/cmcouto-silva/)

&nbsp;

---

In the [previous notebook](https://drive.google.com/file/d/13q0UmHCZshnyJv0T3fIwvi8qDBwjeT_x/view?usp=sharing), we explored building pipelines with scikit-learn (beginner to advanced examples).

In this notebook, we'll learn how to use feature-engine transformers and sampling strategies with imbalanced-learn pipelines while avoiding data leakage.

# **Settings**

## **Libraries**

In [None]:
%pip install imbalanced-learn
%pip install feature-engine

In [22]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score

# For displaying pipelines
from sklearn import set_config
set_config(display='diagram')
set_config(transform_output="pandas")

## **Load dataset**

In [3]:
data_url = 'https://raw.githubusercontent.com/cmcouto-silva/datasets/main/datasets/telco_churn.csv'
df = pd.read_csv(data_url, index_col='CustomerID')
display(df)

CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,Senior Citizen,...,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV,Churn Reason
3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,No,...,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,86,3239,Competitor made better offer
9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.307420,Female,No,...,Month-to-month,Yes,Electronic check,70.70,151.65,Yes,1,67,2701,Moved
9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,No,...,Month-to-month,Yes,Electronic check,99.65,820.50,Yes,1,86,5372,Moved
7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,No,...,Month-to-month,Yes,Electronic check,104.80,3046.05,Yes,1,84,5003,Moved
0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,No,...,Month-to-month,Yes,Bank transfer (automatic),103.70,5036.30,Yes,1,89,5340,Competitor had better devices
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2569-WGERO,1,United States,California,Landers,92285,"34.341737, -116.539416",34.341737,-116.539416,Female,No,...,Two year,Yes,Bank transfer (automatic),21.15,1419.40,No,0,45,5306,
6840-RESVB,1,United States,California,Adelanto,92301,"34.667815, -117.536183",34.667815,-117.536183,Male,No,...,One year,Yes,Mailed check,84.80,1990.50,No,0,59,2140,
2234-XADUH,1,United States,California,Amboy,92304,"34.559882, -115.637164",34.559882,-115.637164,Female,No,...,One year,Yes,Credit card (automatic),103.20,7362.90,No,0,71,5560,
4801-JZAZL,1,United States,California,Angelus Oaks,92305,"34.1678, -116.86433",34.167800,-116.864330,Female,No,...,Month-to-month,Yes,Electronic check,29.60,346.45,No,0,59,2793,


In [4]:
NUMERIC_FEATURES = [
    'Tenure Months',
    'Monthly Charges',
    'Total Charges',
    'CLTV'
]

CATEGORICAL_FEATURES = [
    'Senior Citizen',
    'Partner',
    'Dependents',
    'Multiple Lines',
    'Internet Service',
    'Online Security',
    'Online Backup',
    'Device Protection',
    'Tech Support',
    'Streaming TV',
    'Streaming Movies',
    'Contract',
    'Paperless Billing',
    'Payment Method'
]

FEATURES = NUMERIC_FEATURES + CATEGORICAL_FEATURES
TARGET = 'Churn Value'

In [5]:
# Split features and target
X,y = df[FEATURES], df[TARGET]

# Split train & test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=2023)

# **Last model recap**

Let's have a quick recap of the final scikit-learn pipeline we used in the previous notebook:

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

In [7]:
# Numeric transformer with Z-score scaler and simple mean imputer
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
])

# Categorical transformer with constant imputer and one-hot encoder
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
    ('encoder', OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse_output=False))
])

# Wrap main preprocessor (numeric + categorical)
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, NUMERIC_FEATURES),
    ('cat', categorical_transformer, CATEGORICAL_FEATURES),
])

# Classifier
clf = LogisticRegression(C=2.7825594, class_weight='balanced', max_iter=1_000)

# Model pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', clf)
])

# Fit pipeline
model_pipeline.fit(X_train, y_train)

# **Feature-engine**

Feature-engine has several options for data transformations, feature selection, etc. Please check the complete list [here](https://feature-engine.trainindata.com/en/latest/).

Let's consider replacing some scikit-learn transformers with their feature-engine counterparts. One option is to swap them in while keeping the original scikit-learn `Pipeline` and `ColumnTransformer`:

In [8]:
# from feature_engine.wrappers import SklearnTransformerWrapper
from feature_engine.encoding import OneHotEncoder as OneHotEncoderFe
from feature_engine.imputation import MeanMedianImputer, CategoricalImputer

In [9]:
# Numeric transformer with mean imputer (feature-engine) and Z-score scaler
numeric_transformer = Pipeline(steps=[
    ('imputer', MeanMedianImputer(imputation_method='mean')), # replaced SimpleImputer by MeanMedianImputer
    ('scaler', StandardScaler()),
])

# Categorical transformer with constant imputer and one-hot encoder
categorical_transformer = Pipeline(steps=[
    ('imputer', CategoricalImputer(imputation_method='missing', fill_value='Missing')), # replaced SimpleImputer by CategoricalImputer
    ('encoder', OneHotEncoderFe(drop_last_binary=True, variables=CATEGORICAL_FEATURES)) # replaced OneHotEncoder by OneHotEncoderFe
])

# Wrap main preprocessor (numeric + categorical)
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, NUMERIC_FEATURES),
    ('cat', categorical_transformer, CATEGORICAL_FEATURES),
])

# Classifier
clf = LogisticRegression(C=2.7825594, class_weight='balanced', max_iter=1_000)

# Model pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', clf)
])

# Fit pipeline
model_pipeline.fit(X_train, y_train)

Another option, which avoids scikit-learn's `ColumnTransformer` altogether, leverages the design of feature-engine: each transformer includes a `variables` parameter, allowing us to specify the columns on which to apply the transformation. For example:

In [10]:
from feature_engine.wrappers import SklearnTransformerWrapper

In [11]:
# Preprocessor pipeline
preprocessor = Pipeline([
    ('numeric_imputation', MeanMedianImputer(variables=NUMERIC_FEATURES)),
    ('numeric_scaler', SklearnTransformerWrapper(transformer=StandardScaler(), variables=NUMERIC_FEATURES)),
    ('categorical_imputer', CategoricalImputer(imputation_method='missing', fill_value='Missing', variables=CATEGORICAL_FEATURES)),
    ('categorical_encoder', OneHotEncoderFe(drop_last_binary=True, variables=CATEGORICAL_FEATURES))
])

# Classifier
clf = LogisticRegression(C=2.7825594, class_weight='balanced', max_iter=1_000)

# Complete model pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', clf)
])

# Fit pipeline (train transformers & model)
model_pipeline.fit(X_train, y_train)

Note that `SklearnTransformerWrapper` wraps a scikit-learn transformer so that it accepts a `variables` parameter and is applied only to the selected columns, leaving the others unchanged.

In [12]:
# Example of transformation
preprocessor.fit_transform(X_train)

CustomerID,Tenure Months,Monthly Charges,Total Charges,CLTV,Senior Citizen_No,Partner_No,Dependents_No,Multiple Lines_No,Multiple Lines_Yes,Multiple Lines_No phone service,...,Streaming Movies_No,Streaming Movies_No internet service,Contract_Month-to-month,Contract_One year,Contract_Two year,Paperless Billing_Yes,Payment Method_Electronic check,Payment Method_Credit card (automatic),Payment Method_Bank transfer (automatic),Payment Method_Mailed check
4931-TRZWN,-0.770527,0.243826,-0.588062,-0.655908,1,1,1,1,0,0,...,0,0,1,0,0,1,1,0,0,0
9351-LZYGF,0.415509,0.057716,0.217274,-0.514026,1,0,1,0,1,0,...,1,0,1,0,0,1,0,1,0,0
1575-KRZZE,-1.138607,-0.317858,-0.901991,-0.286846,1,1,1,1,0,0,...,1,0,1,0,0,0,1,0,0,0
4808-YNLEU,0.129224,-0.084801,-0.011454,0.737573,1,0,1,1,0,0,...,1,0,0,1,0,1,0,0,1,0
1000-AJSLD,-1.261301,-1.494879,-0.991522,0.006207,1,1,1,1,0,0,...,0,1,1,0,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1813-JYWTO,1.642443,0.528859,1.560936,-0.151721,1,0,1,0,1,0,...,1,0,0,0,1,0,0,0,1,0
8089-UZWLX,1.601545,1.320247,2.309129,-0.226884,0,1,1,0,1,0,...,0,0,0,0,1,0,0,0,1,0
2995-YWTCD,-0.525140,-1.333919,-0.791701,0.856652,1,0,0,0,1,0,...,0,1,0,0,1,1,0,0,1,0
6196-HBOBZ,1.356159,1.162640,1.833235,1.620111,1,0,1,0,1,0,...,1,0,0,0,1,1,1,0,0,0


In [13]:
# Get predictions on test set
model_pipeline.predict(X_test)

array([0, 0, 0, ..., 0, 0, 1])

# **Imblearn**

`Imbalanced-learn`, commonly known as `imblearn`, is a Python library offering various methods to address imbalanced datasets in machine learning. It provides techniques for under-sampling the majority class, over-sampling the minority class, and generating synthetic samples. Rebalancing the training data this way can improve predictions for the minority class, especially in classification problems where one class significantly outweighs the others.

When adopting any of these strategies, it's essential to use pipelines so that resampling is applied only to the training data and the validation/test set remains intact, including within each fold when using cross-validation.
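A small sketch of why the order matters, using only the standard library. It uses plain row duplication as a stand-in for a real sampler such as SMOTE, and the row ids and the `split` helper are hypothetical. Resampling before splitting lets copies of the same original row land on both sides of the split, while splitting first keeps the test rows unseen.

```python
import random

rng = random.Random(2023)

# Hypothetical toy data: ids of 20 minority-class rows
minority_ids = list(range(20))

def split(rows, test_size, rng):
    """Shuffle a copy of the rows and return (train, test)."""
    rows = rows[:]
    rng.shuffle(rows)
    n_test = int(len(rows) * test_size)
    return rows[n_test:], rows[:n_test]

# Leaky order: oversample (duplicate each row 5x) first, then split
oversampled = minority_ids * 5
train, test = split(oversampled, 0.3, rng)
leaked_before = set(train) & set(test)

# Safe order: split first, then oversample the training rows only
train, test = split(minority_ids, 0.3, rng)
train_oversampled = train * 5
leaked_after = set(train_oversampled) & set(test)

print(f"Rows shared by train and test (resample, then split): {len(leaked_before)}")
print(f"Rows shared by train and test (split, then resample): {len(leaked_after)}")
```

Placing the sampler inside a pipeline enforces the safe order automatically, including inside each cross-validation fold.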

Let's first add SMOTE to our pipeline. The pipeline must be imblearn's `Pipeline`, since scikit-learn's `Pipeline` does not accept samplers as steps:

In [14]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline

In [21]:
# Numeric transformer with Z-score scaler and simple mean imputer
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
])

# Categorical transformer with constant imputer and one-hot encoder
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
    ('encoder', OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse_output=False))
])

# Wrap main preprocessor (numeric + categorical)
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, NUMERIC_FEATURES),
    ('cat', categorical_transformer, CATEGORICAL_FEATURES),
])

# Classifier
clf = LogisticRegression(C=2.7825594, class_weight='balanced', max_iter=1_000)

# Model pipeline
model_pipeline = ImbPipeline(steps=[
    ('preprocessor', preprocessor),
    ('over_sampling', SMOTE(random_state=2023)),  # oversampling to equalize class proportions
    ('model', clf)
])

# Fit pipeline
model_pipeline.fit(X_train, y_train)

In [24]:
# Apply cross validation on train data with sampling strategy
cross_val_score(model_pipeline, X_train, y_train, cv=5, scoring='recall', n_jobs=-1)

array([0.81578947, 0.84210526, 0.82330827, 0.78195489, 0.80827068])

When the sampling step lives inside the pipeline, only the training data (or the training folds during cross-validation) gets resampled, while the validation and test data remain untouched.

Now, let's try two distinct resampling strategies in the same pipeline:

In [26]:
# Numeric transformer with Z-score scaler and simple mean imputer
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
])

# Categorical transformer with constant imputer and one-hot encoder
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
    ('encoder', OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse_output=False))
])

# Wrap main preprocessor (numeric + categorical)
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, NUMERIC_FEATURES),
    ('cat', categorical_transformer, CATEGORICAL_FEATURES),
])

# Classifier
clf = LogisticRegression(C=2.7825594, class_weight='balanced', max_iter=1_000)

# Model pipeline
model_pipeline = ImbPipeline(steps=[
    ('preprocessor', preprocessor),
    ('undersampling', RandomUnderSampler(sampling_strategy=0.5, random_state=2023)), # undersample majority until minority/majority = 0.5
    ('over_sampling', SMOTE(random_state=2023)),                                     # oversample minority to match the majority
    ('model', clf)
])

# Fit pipeline
model_pipeline.fit(X_train, y_train)

In [27]:
# Apply cross validation on train data with sampling strategy
cross_val_score(model_pipeline, X_train, y_train, cv=5, scoring='recall', n_jobs=-1)

array([0.82330827, 0.84210526, 0.83082707, 0.77443609, 0.80827068])
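To see what the two resampling steps do to the class counts, here is a back-of-the-envelope sketch in plain Python. The class counts (1000 majority, 300 minority) and the two helper functions are hypothetical. The arithmetic follows the documented meaning of a float `sampling_strategy` (the desired minority/majority ratio after resampling) and SMOTE's default of growing the minority class to the size of the majority class.

```python
def undersample_majority(n_majority, n_minority, ratio):
    """Class counts after under-sampling the majority until minority/majority == ratio."""
    target_majority = int(n_minority / ratio)
    if target_majority > n_majority:
        # An under-sampler can only remove rows, never add them
        raise ValueError("ratio is lower than the current minority/majority ratio")
    return target_majority, n_minority

def oversample_minority(n_majority, n_minority):
    """Class counts after over-sampling the minority up to the majority size."""
    return n_majority, n_majority

# Hypothetical training-set class counts
n_majority, n_minority = 1000, 300

step1 = undersample_majority(n_majority, n_minority, ratio=0.5)
step2 = oversample_minority(*step1)

print(step1)  # (600, 300)
print(step2)  # (600, 600)
```

Combining both steps ends with balanced classes while generating fewer synthetic rows than SMOTE alone would (300 instead of 700 in this toy case), at the cost of discarding some majority rows.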

That's all for now =)

If you want to see how SparkML uses a pipeline, please check the [last notebook](https://drive.google.com/file/d/13l5w4wGtNWtnXNaTRbDnZ5ib1CBYhy8E/view?usp=sharing) from our workshop!