Finding the best data transformation with Randomized Search
In a previous notebook, I ran a grid search to optimize the hyperparameters of various feature engineering transformers and a gradient boosting classifier.

What if I am not sure which transformer to use in the first place? Can I also run a search to find the best transformation?

Yes, we can!

In this notebook, using randomized search, I will:

set up a series of feature engineering steps using Feature-engine and assemble them into a pipeline
automatically find the best data transformation
train a Logistic Regression
train the pipeline with cross-validation, searching over different feature engineering transformations and model hyperparameters
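
The idea of searching over transformations can be sketched with scikit-learn alone: in the parameter space, a pipeline step name on its own can map to a list of candidate transformers, and the search then treats the transformer as just another hyperparameter. The sketch below is a minimal illustration and not this notebook's final pipeline. The synthetic data from `make_classification` and all names prefixed with `demo_` are assumptions made for the example.

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic data, used only to keep the sketch self-contained.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),  # placeholder, swapped out during the search
    ("clf", LogisticRegression(max_iter=1000)),
])

# A bare step name maps to candidate transformers for that step;
# "step__param" maps to a hyperparameter of that step.
demo_params = {
    "scaler": [StandardScaler(), MinMaxScaler(), RobustScaler()],
    "clf__C": stats.loguniform(1e-3, 1e2),
}

demo_search = RandomizedSearchCV(
    demo_pipe,
    demo_params,
    n_iter=5,           # number of sampled configurations
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(demo_search.best_score_)
```

Each sampled configuration picks one scaler from the list and one value of `C` from the log-uniform distribution, so the transformer and the model hyperparameter are tuned jointly.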

In [1]:
!pip install feature-engine



In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# for the model
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    RobustScaler,
)

# for feature engineering
from feature_engine import imputation as mdi
from feature_engine import encoding as ce
from feature_engine import discretisation as disc
from feature_engine import transformation as t

In [3]:
data = pd.read_csv(r'C:\Users\vish8\OneDrive\Desktop\Cursos\HyperparemetersCourse\Datasets\Titanic\train.csv')
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
cols = [
    'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin',
    'Embarked', 'Survived'
]

data = data[cols].copy()  # explicit copy, so columns can be added later without a SettingWithCopyWarning

data.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Survived
0,3,male,22.0,1,0,7.25,,S,0
1,1,female,38.0,1,0,71.2833,C85,C,1
2,3,female,26.0,0,0,7.925,,S,1
3,1,female,35.0,1,0,53.1,C123,S,1
4,3,male,35.0,0,0,8.05,,S,0


In [5]:
# Cabin: extract numerical and categorical part and delete original variable

data['cabin_num'] = data['Cabin'].str.extract(r'(\d+)') # captures numerical part (raw string avoids an invalid escape warning)
data['cabin_num'] = data['cabin_num'].astype('float')
data['cabin_cat'] = data['Cabin'].str[0] # captures the first letter

data.drop(['Cabin'], axis=1, inplace=True)

data.head()


Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Survived,cabin_num,cabin_cat
0,3,male,22.0,1,0,7.25,S,0,,
1,1,female,38.0,1,0,71.2833,C,1,85.0,C
2,3,female,26.0,0,0,7.925,S,1,,
3,1,female,35.0,1,0,53.1,S,1,123.0,C
4,3,male,35.0,0,0,8.05,S,0,,
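
To make the Cabin split easier to follow, here is a small sketch on a made-up Series (the values and the `_demo` names are assumptions for illustration, not rows taken from the dataset). It shows that `str.extract` keeps only the first run of digits, so an entry listing several cabins contributes just its first number, and that missing cabins stay missing in both new variables.

```python
import pandas as pd

# Made-up cabin values, including a missing one and a multi-cabin entry.
cabin_demo = pd.Series(["C85", None, "C123", "B57 B59"])

# expand=False returns a Series; the pattern captures the first run of digits only.
cabin_num_demo = cabin_demo.str.extract(r"(\d+)", expand=False).astype("float")

# The first character is the deck letter.
cabin_cat_demo = cabin_demo.str[0]

print(pd.DataFrame({
    "Cabin": cabin_demo,
    "cabin_num": cabin_num_demo,
    "cabin_cat": cabin_cat_demo,
}))
```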


In [6]:
# make lists of variable types
# we need these lists to tell Feature-engine which variables it should modify

# numerical: discrete
discrete = [
    var for var in data.columns if data[var].dtype != 'O' and var != 'Survived'
    and data[var].nunique() < 10
]

# numerical: continuous
continuous = [
    var for var in data.columns
    if data[var].dtype != 'O' and var != 'Survived' and var not in discrete
]

# categorical
categorical = [var for var in data.columns if data[var].dtype == 'O']

print('There are {} discrete variables'.format(len(discrete)))
print('There are {} continuous variables'.format(len(continuous)))
print('There are {} categorical variables'.format(len(categorical)))

There are 3 discrete variables
There are 3 continuous variables
There are 3 categorical variables


In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('Survived', axis=1),  # predictors
    data['Survived'],  # target
    test_size=0.1,  # proportion of observations in the test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((801, 9), (90, 9))

In [8]:
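mean_imputer = mdi.MeanMedianImputer(
    imputation_method='mean', variables=['Age', 'Fare', 'cabin_num'])

median_imputer = mdi.MeanMedianImputer(
    imputation_method='median', variables=['Age', 'Fare', 'cabin_num'])

# note: despite the variable name, EndTailImputer fills missing data with a value
# at the tail of each variable's distribution, not with a user-chosen arbitrary number
arbitrary_imputer = mdi.EndTailImputer(variables=['Age', 'Fare', 'cabin_num'])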

num_imputer = [mean_imputer, median_imputer, arbitrary_imputer]
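
The three candidate imputers differ only in the value they use to fill missing data. The pandas sketch below computes those values by hand on a made-up Series (`age_demo` and the other names are assumptions for illustration). For the end-tail case I am assuming the Gaussian right-tail rule, mean + 3 * standard deviation, which I believe is the default of `EndTailImputer`; check the Feature-engine documentation if the exact rule matters.

```python
import pandas as pd

# Made-up ages with two missing values.
age_demo = pd.Series([22.0, 38.0, None, 35.0, None, 54.0])

mean_value = age_demo.mean()      # missing values are skipped by default
median_value = age_demo.median()

# Assumption: Gaussian right tail, mean + 3 * std (believed to be the EndTailImputer default).
end_tail_value = age_demo.mean() + 3 * age_demo.std()

for name, value in [
    ("mean", mean_value),
    ("median", median_value),
    ("end tail", end_tail_value),
]:
    print(name, round(value, 2), age_demo.fillna(value).round(2).tolist())
```

Mean and median imputation place the filled values in the centre of the distribution, while end-tail imputation pushes them beyond the observed range, which flags them as different from the rest of the observations.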

In [9]:
onehot_enc = ce.OneHotEncoder(variables=categorical)
ordinal_enc = ce.OrdinalEncoder(encoding_method='ordered', variables=categorical)
mean_enc = ce.MeanEncoder(variables=categorical)

cat_encoder = [onehot_enc, ordinal_enc, mean_enc]
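
As a rough picture of what the three encoders produce, the sketch below reproduces the same ideas with plain pandas on a made-up frame (`toy` and its values are assumptions for illustration). It is not how Feature-engine implements them: in particular, the Feature-engine encoders learn their mappings in `fit` on the training data only, whereas this sketch computes everything on a single toy frame.

```python
import pandas as pd

# Made-up data: one categorical variable and a binary target.
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q", "C", "S", "Q"],
    "Survived": [0, 1, 0, 1, 1, 1, 0],
})

# Mean of the target per category.
target_mean = toy.groupby("Embarked")["Survived"].mean()

# Mean encoding: replace each category with its target mean.
mean_encoded = toy["Embarked"].map(target_mean)

# Ordered ordinal encoding: integers assigned by increasing target mean.
ordered_labels = {
    category: rank
    for rank, category in enumerate(target_mean.sort_values().index)
}
ordinal_encoded = toy["Embarked"].map(ordered_labels)

# One-hot encoding: one binary column per category.
onehot_encoded = pd.get_dummies(toy["Embarked"], prefix="Embarked")

print(target_mean)
print(ordered_labels)
print(onehot_encoded.head())
```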

In [10]:
efd = disc.EqualFrequencyDiscretiser(q=5, variables=continuous)
dtd = disc.DecisionTreeDiscretiser(variables=continuous)

yj = t.YeoJohnsonTransformer(variables=continuous)

transformers = [efd, dtd, yj]
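
Two of these candidates can be illustrated outside the pipeline with pandas and SciPy. The sketch below uses a synthetic right-skewed variable (`fare_demo` is an assumption for illustration, not the Fare column): `pd.qcut` stands in for equal-frequency discretisation into 5 bins, and `scipy.stats.yeojohnson` applies the Yeo-Johnson power transformation with the lambda parameter estimated from the data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic right-skewed variable, used only for illustration.
rng = np.random.default_rng(0)
fare_demo = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=500))

# Equal-frequency discretisation: 5 quantile bins holding about the same number of rows.
fare_bins = pd.qcut(fare_demo, q=5, labels=False)
print(fare_bins.value_counts().sort_index().tolist())

# Yeo-Johnson: with no lambda given, SciPy estimates it by maximum likelihood
# and returns both the transformed values and the fitted lambda.
fare_yj, fitted_lambda = stats.yeojohnson(fare_demo.to_numpy())

skew_before = stats.skew(fare_demo)
skew_after = stats.skew(fare_yj)
print(round(float(skew_before), 2), round(float(skew_after), 2))
```

Discretisation turns the variable into a small number of ordered bins, while Yeo-Johnson keeps it continuous and reduces its skew. Which one helps a linear model more is exactly the kind of question the search is meant to answer.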