**Author:** Cainã Max Couto da Silva  
**LinkedIn:** [@cmcouto-silva](https://www.linkedin.com/in/cmcouto-silva/)

# **Settings**

Wrap output text on Colab for a nicer output:

In [None]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

## **Libraries**

Although it's good practice to load all libraries at the beginning of the notebook, I'll load them as necessary to facilitate understanding of the notebook.

In [None]:
# %pip install scikit-learn==1.3.2

In [None]:
import pickle
import numpy as np
import pandas as pd

In [None]:
# For displaying pipelines
from sklearn import set_config
set_config(display='diagram')
set_config(transform_output="pandas")

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

## **Data**

In this first notebook, we will use a small sample of fake data to highlight the preprocessing techniques without data leakage.

In [None]:
# Create simulated data set

df_train = pd.DataFrame({
    'tool_id': [1,2,3,4,5],
    'temperature': [180,100,120,np.nan,90],
    'pressure': [13000,5000,11000,4500,np.nan],
    'due_maintenance': ['Yes', 'No', 'Yes', 'Yes', 'No'],
    'age_status': ['old','new','old','old','new'],
    'failed':[True,False,True,False,False]
}).set_index('tool_id')

df_test = pd.DataFrame({
    'tool_id': [6,7,8],
    'temperature': [85,110,np.nan],
    'pressure': [6000,10500,3300],
    'due_maintenance': ['Yes', 'Yes', 'No'],
    'age_status': ['new', 'old','ancient'],
    'failed':[False,True,False]
}).set_index('tool_id')

df_future_unique = pd.DataFrame({
    'tool_id': [10],
    'temperature': [12],
    'pressure': [7500],
    'due_maintenance': ['No'],
    'age_status': ['new'],
}).set_index('tool_id')

print('Train data')
display(df_train)
print()

print('Test data')
display(df_test)
print()

print('Future data')
display(df_future_unique)

Train data


tool_id,temperature,pressure,due_maintenance,age_status,failed
1,180.0,13000.0,Yes,old,True
2,100.0,5000.0,No,new,False
3,120.0,11000.0,Yes,old,True
4,,4500.0,Yes,old,False
5,90.0,,No,new,False



Test data


tool_id,temperature,pressure,due_maintenance,age_status,failed
6,85.0,6000,Yes,new,False
7,110.0,10500,Yes,old,True
8,,3300,No,ancient,False



Future data


tool_id,temperature,pressure,due_maintenance,age_status
10,12,7500,No,new


In [None]:
# List features and target
NUMERICAL_FEATURES = [
    'temperature',
    'pressure'
]

CATEGORICAL_FEATURES = [
    'due_maintenance',
    'age_status'
]

FEATURES = NUMERICAL_FEATURES + CATEGORICAL_FEATURES

TARGET = 'failed'

# Manual preprocessing

**Note:** I do not recommend a manual approach like this to anyone. It's just for didactic purposes.

We should handle:
- Missing values
- Numerical features
- Categorical features

## Numerical features

In [None]:
# Train features and target
X_train = df_train[NUMERICAL_FEATURES]
y_train = df_train[TARGET]

# Test features and target
X_test = df_test[NUMERICAL_FEATURES]
y_test = df_test[TARGET]

# Instance with unknown target
X_new = df_future_unique[NUMERICAL_FEATURES]

In [None]:
# Instantiate model
model = LogisticRegression()

### Missing values

In [None]:
# If missing values are present, training fails
try:
  model.fit(X_train, y_train)
except Exception as e:
  print(e)

Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values


In [None]:
# Compute mean for every column
training_numerical_means = X_train.mean()
training_numerical_means

temperature     122.5
pressure       8375.0
dtype: float64

In [None]:
# Apply our imputation method to train, test, and new data
X_train_imputed = X_train.fillna(training_numerical_means)
X_test_imputed = X_test.fillna(training_numerical_means)
X_new_imputed = X_new.fillna(training_numerical_means)

In [None]:
# Without missing values, the training runs ok
try:
  model.fit(X_train_imputed, y_train)
except Exception as e:
  print(e)

In [None]:
# Predict train, test, and new data successfully
try:
  print('Train predictions:', model.predict(X_train_imputed))
  print('Test predictions:', model.predict(X_test_imputed))
  print('New predictions:', model.predict(X_new_imputed))
except Exception as e:
  print(e)

Train predictions: [ True False  True False  True]
Test predictions: [ True  True False]
New predictions: [ True]


### Data scaling

What if we need to scale the numerical data?

Let's consider the common standard scaler:

$$
\frac{x - \bar{x}}{\sigma}
$$

In [None]:
# Compute standard deviation for every column
training_numerical_std = X_train.std()
display(training_numerical_std)

temperature      40.311289
pressure       4269.562819
dtype: float64

In [None]:
# Applying scaling preprocessing to train, test, and new data
X_train_imputed_scaled = (X_train_imputed - training_numerical_means) / training_numerical_std
X_test_imputed_scaled = (X_test_imputed - training_numerical_means) / training_numerical_std
X_new_imputed_scaled = (X_new_imputed - training_numerical_means) / training_numerical_std
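As a worked instance of the formula, using the training statistics printed above (mean 122.5 and standard deviation 40.311289 for `temperature`), the scaled temperature of tool 1 (180) would be approximately:

$$
z = \frac{180 - 122.5}{40.311289} \approx 1.426
$$

The same two training statistics are reused for the test and new data, so no information from those sets enters the computation.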

In [None]:
# Train model
model.fit(X_train_imputed_scaled, y_train)

# Predict train, test, and new data successfully
try:
  print('Train predictions:', model.predict(X_train_imputed_scaled))
  print('Test predictions:', model.predict(X_test_imputed_scaled))
  print('New predictions:', model.predict(X_new_imputed_scaled))
except Exception as e:
  print(e)

Train predictions: [ True False  True False False]
Test predictions: [False False False]
New predictions: [False]


Let's save the preprocessing parameters so we can use them in the future/pipeline.

In [None]:
trained_preprocessor = {
    'imputer_parameters': {
      'mean': training_numerical_means
    },
    'scaler_parameters': {
      'mean': training_numerical_means,
      'std': training_numerical_std
    }
}

trained_model = model
trained_model

In [None]:
# Saving preprocessing parameters & trained model
with open('num_preprocessor.pkl', 'wb') as preprocessor_file:
  pickle.dump(trained_preprocessor, preprocessor_file)

with open('num_model.pkl', 'wb') as model_file:
  pickle.dump(trained_model, model_file)

What we needed so far:
- Compute and store means (for imputation & scaling)
- Compute and store standard deviations (scaling)
- **Store** the computed preprocessing parameters (mean & std)
- Create a temporary table with imputed values
- Create another temporary table with imputed and scaled values
- Manually applying the preprocessing with the trained parameters

Wouldn't it be better to have a class that computes and stores the preprocessing parameters and can transform new/test data using the learned parameters?

### Custom classes

Here, let's create two simple classes to mimic `SimpleImputer` (from `sklearn.impute`) and `StandardScaler` (from `sklearn.preprocessing`). Note that these classes have limited capabilities compared to the sklearn classes.

In [None]:
# Creating a class to learn and store parameters, able to transform new data with imputed values
class MySimpleImputer():

  def fit(self, X, y=None):
    self.feature_names_in_ = X.columns.tolist()
    self.mean_ = X.mean()
    return self

  def transform(self, X, y=None):
    return X.fillna(self.mean_)

  def fit_transform(self, X, y=None):
    self.fit(X)
    return X.fillna(self.mean_)

In [None]:
# Creating a class to learn and store parameters, able to transform new data through imputation
class MySimpleImputer:
    """
    A simple imputer for filling missing values with the mean of columns.

    This class provides a basic implementation of an imputer that fills missing
    values in a dataset with the mean of each column. It offers methods to fit
    the imputer to the data, transform the data, and a combined fit and transform method.

    Methods
    -------
    fit(X, y=None)
        Calculates the mean of each column in the dataset X.

    transform(X, y=None)
        Fills missing values in X with the mean calculated during the fitting process.

    fit_transform(X, y=None)
        Fits the imputer and transforms the dataset X in a single step.

    Attributes
    ----------
    feature_names_in_ : list
        The names of the features (columns) in the dataset.

    mean_ : pandas.Series
        The mean values of the features (columns) in the dataset.

    Parameters
    ----------
    X : pandas.DataFrame
        The input dataset with or without missing values.

    y : Ignored
        This parameter exists only for compatibility with sklearn's imputer interface.

    Returns
    -------
    self : object
        Returns the instance itself.
    """
    def fit(self, X, y=None):
      self.feature_names_in_ = X.columns.tolist()
      self.mean_ = X.mean()
      return self

    def transform(self, X, y=None):
      return X.fillna(self.mean_)

    def fit_transform(self, X, y=None):
      self.fit(X)
      return self.transform(X)

Note that there is no error checking: the input X must be a `pandas.DataFrame`, and the only imputation strategy is the mean (no median or constant value, for example). Still, it's a straightforward class for didactic purposes: it fits the data, stores the learned parameter for each feature (the mean), and then uses it to transform new data.
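To illustrate what lifting those limitations could look like, here is a minimal sketch of a variant with a `strategy` option and a basic type check. The class name `MyFlexibleImputer`, the `statistics_` attribute name, and the `demo_*` data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

class MyFlexibleImputer:
    """Hypothetical variant of MySimpleImputer supporting mean or median imputation."""

    def __init__(self, strategy='mean'):
        if strategy not in ('mean', 'median'):
            raise ValueError("strategy must be 'mean' or 'median'")
        self.strategy = strategy

    def fit(self, X, y=None):
        # Basic input check that the didactic class above does not perform
        if not isinstance(X, pd.DataFrame):
            raise TypeError('X must be a pandas.DataFrame')
        self.feature_names_in_ = X.columns.tolist()
        # Learn one statistic per column from the data passed to fit
        self.statistics_ = X.mean() if self.strategy == 'mean' else X.median()
        return self

    def transform(self, X, y=None):
        # Fill missing values with the statistics learned during fit
        return X.fillna(self.statistics_)

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

# Toy data (assumed for illustration only)
demo_train = pd.DataFrame({'temperature': [180, 100, 120, np.nan, 90]})
demo_test = pd.DataFrame({'temperature': [np.nan, 85]})

median_imputer = MyFlexibleImputer(strategy='median').fit(demo_train)
demo_test_imputed = median_imputer.transform(demo_test)
```

The missing test value is filled with the median learned from the training column, not with anything computed from the test data.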

In [None]:
# Instantiate our imputer
my_imputer = MySimpleImputer()

In [None]:
# Trying to access attributes without .fit() fails:
try:
  print('Trained feature names:', my_imputer.feature_names_in_)
  print('Trained averages', my_imputer.mean_)
except Exception as e:
  print(e)

'MySimpleImputer' object has no attribute 'feature_names_in_'


In [None]:
# Fit our imputer preprocessor
my_imputer.fit(X_train)

<__main__.MySimpleImputer at 0x7b9b2702a890>

In [None]:
# Trying to access attributes after .fit() runs successfully
try:
  print('Trained feature names:', my_imputer.feature_names_in_)
  print('Trained averages', my_imputer.mean_.to_dict())
except Exception as e:
  print(e)

Trained feature names: ['temperature', 'pressure']
Trained averages {'temperature': 122.5, 'pressure': 8375.0}


In [None]:
# Transform train, test, and new data successfully (no exception raised)
try:
  my_imputer.transform(X_train)
  my_imputer.transform(X_test)
  my_imputer.transform(X_new)
except Exception as e:
  print(e)

In [None]:
print('Test features without imputation')
display(X_test)
print('\n')

print('Test features after imputation')
my_imputer.transform(X_test)

Test features without imputation


tool_id,temperature,pressure
6,85.0,6000
7,110.0,10500
8,,3300




Test features after imputation


tool_id,temperature,pressure
6,85.0,6000
7,110.0,10500
8,122.5,3300


Let's create a similar class for scaling the data using the z-score formula:

In [None]:
class MyStandardScaler():
    """
    A simple standard scaler for standardizing features by removing the mean and scaling to unit variance.

    This class provides a basic implementation of a standard scaler that standardizes features
    by subtracting the mean and scaling to unit variance. It includes methods to fit the scaler to the data,
    transform the data, and a combined fit and transform method.

    Methods
    -------
    fit(X, y=None)
        Calculates the mean and standard deviation of each column in the dataset X.

    transform(X, y=None)
        Standardizes the dataset X based on the mean and standard deviation calculated during the fitting process.

    fit_transform(X, y=None)
        Fits the scaler and transforms the dataset X in a single step.

    Attributes
    ----------
    feature_names_in_ : list
        The names of the features (columns) in the dataset.

    mean_ : pandas.Series
        The mean values of the features (columns) in the dataset.

    std_ : pandas.Series
        The standard deviation of the features (columns) in the dataset.

    Parameters
    ----------
    X : pandas.DataFrame
        The input dataset to be standardized.

    y : Ignored
        This parameter exists only for compatibility with sklearn's scaler interface.

    Returns
    -------
    self : object
        Returns the instance itself.
    """
    def fit(self, X, y=None):
      self.feature_names_in_ = X.columns.tolist()
      self.mean_ = X.mean()
      self.std_ = X.std()
      return self

    def transform(self, X, y=None):
        return (X - self.mean_) / self.std_

    def fit_transform(self, X, y=None):
      self.fit(X)
      return self.transform(X)

Since we want to scale the imputed features, we first need to impute the data using the stored training means for every feature. We use the transformed data to fit the scaler so it can be used to transform new/test data.

In [None]:
# Learning parameters
my_imputer = MySimpleImputer()
X_train_imputed = my_imputer.fit_transform(X_train)

my_scaler = MyStandardScaler().fit(X_train_imputed)

In [None]:
# Applying transformation
X_test_transformed = my_scaler.transform( my_imputer.transform(X_test) )
X_new_transformed = my_scaler.transform( my_imputer.transform(X_new) )
X_new_transformed

tool_id,temperature,pressure
10,-3.165228,-0.236643


Since our classes follow the same interface as sklearn transformers (`fit`, `transform`, and `fit_transform`), we can use them in a sklearn pipeline:

In [None]:
# Let's use the make_pipeline (to be discussed in the next notebook) to wrap our classes and model into a pipeline
model_pipeline = make_pipeline(MySimpleImputer(), MyStandardScaler(), LogisticRegression())
model_pipeline.fit(X_train, y_train)

In [None]:
# Predict train, test, and new data using a pipeline, successfully
try:
  print('Train predictions:', model_pipeline.predict(X_train))
  print('Test predictions:', model_pipeline.predict(X_test))
  print('New predictions:', model_pipeline.predict(X_new))
except Exception as e:
  print(e)

Train predictions: [ True False  True False False]
Test predictions: [False False False]
New predictions: [False]


In [None]:
# Saving the trained imputer & scaler
with open('num_imputer.pkl', 'wb') as preprocessor_file:
  pickle.dump(my_imputer, preprocessor_file)

with open('num_scaler.pkl', 'wb') as preprocessor_file:
  pickle.dump(my_scaler, preprocessor_file)
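As an alternative sketch, a whole fitted pipeline can be serialized as a single object instead of one file per transformer. The example below swaps in scikit-learn's own `SimpleImputer` and `StandardScaler` in place of the custom classes, pickles in memory with `pickle.dumps` rather than writing to disk, and uses hypothetical `demo_*` names and toy data.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy numerical data with missing values (assumed for illustration only)
X_demo = pd.DataFrame({
    'temperature': [180, 100, 120, np.nan, 90],
    'pressure': [13000, 5000, 11000, 4500, np.nan],
})
y_demo = pd.Series([True, False, True, False, False])

# Imputation, scaling, and model fitted together as one object
demo_pipeline = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler(), LogisticRegression())
demo_pipeline.fit(X_demo, y_demo)

# Serialize and restore the fitted pipeline in memory
payload = pickle.dumps(demo_pipeline)
restored_pipeline = pickle.loads(payload)

# The restored pipeline should reproduce the original predictions
same_predictions = np.array_equal(
    demo_pipeline.predict(X_demo),
    restored_pipeline.predict(X_demo),
)
```

Keep in mind that unpickling objects of custom classes such as `MySimpleImputer` requires the class definition to be available in the loading session.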

## Categorical features

Extract categorical features

In [None]:
# Train features and target
X_train = df_train[CATEGORICAL_FEATURES]
y_train = df_train[TARGET]

# Test features and target
X_test = df_test[CATEGORICAL_FEATURES]
y_test = df_test[TARGET]

# Instance with unknown target
X_new = df_future_unique[CATEGORICAL_FEATURES]

Avoid using functions like `pd.get_dummies`. They are not suitable for model reproducibility: when `pd.get_dummies` is applied separately to the train, test, and future data, the output columns may differ, which breaks model prediction and prevents its use in a pipeline.

In [None]:
# Train data
print('Train data')
display(X_train)
print()

# Test data
print('Test data')
display(X_test)
print()

# New data
print('New data')
display(X_new)

Train data


tool_id,due_maintenance,age_status
1,Yes,old
2,No,new
3,Yes,old
4,Yes,old
5,No,new



Test data


tool_id,due_maintenance,age_status
6,Yes,new
7,Yes,old
8,No,ancient



New data


tool_id,due_maintenance,age_status
10,No,new


In [None]:
# Transform data using pd.get_dummies (not recommended!)
X_train_transformed = pd.get_dummies(X_train)
X_test_transformed = pd.get_dummies(X_test)
X_new_transformed = pd.get_dummies(X_new)

In [None]:
# Train data (transformed)
print('Train data')
display(X_train_transformed)
print()

# Test data (transformed) - new column
print('Test data')
display(X_test_transformed)
print()

# New data - missing columns
print('New data')
display(X_new_transformed)

Train data


tool_id,due_maintenance_No,due_maintenance_Yes,age_status_new,age_status_old
1,0,1,0,1
2,1,0,1,0
3,0,1,0,1
4,0,1,0,1
5,1,0,1,0



Test data


tool_id,due_maintenance_No,due_maintenance_Yes,age_status_ancient,age_status_new,age_status_old
6,0,1,0,1,0
7,0,1,0,0,1
8,1,0,1,0,0



New data


tool_id,due_maintenance_No,age_status_new
10,1,1


In [None]:
# Train model with categorical data
model.fit(X_train_transformed, y_train)

In [None]:
# Predict train data (same structure learnt)
try:
  model.predict(X_train_transformed)
except Exception as e:
  print(e)

In [None]:
# Predict test data fails due to an extra column
try:
  model.predict(X_test_transformed)
except Exception as e:
  print(e)

The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- age_status_ancient



In [None]:
# Predict new data fails due to missing columns
try:
  model.predict(X_new_transformed)
except Exception as e:
  print(e)

The feature names should match those that were passed during fit.
Feature names seen at fit time, yet now missing:
- age_status_old
- due_maintenance_Yes



Let's create a minimal class for computing dummies properly (transformation applied to train data should be applied to test and upcoming data):

In [None]:
class MyOneHotEncoder:
  """
  A simple one-hot encoder for categorical features.

  This class provides a basic implementation of a one-hot encoder that converts
  categorical features into a one-hot numeric array. It includes methods to fit
  the encoder to the data, transform the data, and a combined fit and transform method.

  Methods
  -------
  fit(X, y=None)
      Learns the categories and output columns from the dataset X.

  transform(X, y=None)
      Transforms the dataset X to a one-hot encoded format based on learned categories.

  fit_transform(X, y=None)
      Fits the encoder and transforms the dataset X in a single step.

  Attributes
  ----------
  categories_ : dict
      The categories of each feature.

  output_columns_ : list
      The columns of the transformed dataset.

  Parameters
  ----------
  X : pandas.DataFrame
      The input dataset with categorical features.

  y : Ignored
      This parameter exists only for compatibility with sklearn's transformer interface.

  Returns
  -------
  self : object
      Returns the instance itself when fitting.

  X_transformed : pandas.DataFrame
      Returns the transformed dataset when transforming.
  """
  def fit(self, X, y=None):
    self.categories_ = {feature: set(X[feature]) for feature in X.columns}
    self.output_columns_ = pd.get_dummies(X).columns.tolist()
    return self

  def transform(self, X, y=None):
    X_transformed = (
        pd.get_dummies(X)
        .reindex(self.output_columns_, axis=1)
        .fillna(0)
        .astype(int)
    )
    return X_transformed

  def fit_transform(self, X, y=None):
    self.fit(X)
    return self.transform(X)
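For comparison, a different way to obtain stable dummy columns (not the approach used by the class above) is to fix the categories learned from the training data with a pandas `CategoricalDtype` before calling `pd.get_dummies`. This is a minimal sketch on hypothetical toy frames; the names `train`, `test`, and `age_dtype` are assumptions for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy data (assumed): 'ancient' appears only in the test set
train = pd.DataFrame({'age_status': ['old', 'new', 'old']})
test = pd.DataFrame({'age_status': ['new', 'ancient']})

# "Fit": store the categories observed in the training data
age_dtype = CategoricalDtype(categories=sorted(train['age_status'].unique()))

# "Transform": cast to the stored categories, then build the dummies.
# Values outside the stored categories become missing and yield all-zero rows.
train_dummies = pd.get_dummies(train.astype({'age_status': age_dtype}), dtype=int)
test_dummies = pd.get_dummies(test.astype({'age_status': age_dtype}), dtype=int)
```

Both frames end up with the same columns, and the unseen `'ancient'` row is encoded as zeros, mirroring what `MyOneHotEncoder.transform` does through `reindex` and `fillna(0)`.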

In [None]:
# Transform the categorical data
my_encoder = MyOneHotEncoder()

X_train_transformed = my_encoder.fit_transform(X_train)
X_test_transformed = my_encoder.transform(X_test)
X_new_transformed = my_encoder.transform(X_new)

When using the `MyOneHotEncoder` class, we can see that the output columns for the test and new data are equivalent to the ones from the training set:

In [None]:
# Train data (transformed)
print('Train data')
display(X_train_transformed)
print()

# Test data (transformed)
print('Test data')
display(X_test_transformed)
print()

# New data
print('New data')
display(X_new_transformed)

Train data


tool_id,due_maintenance_No,due_maintenance_Yes,age_status_new,age_status_old
1,0,1,0,1
2,1,0,1,0
3,0,1,0,1
4,0,1,0,1
5,1,0,1,0



Test data


tool_id,due_maintenance_No,due_maintenance_Yes,age_status_new,age_status_old
6,0,1,1,0
7,0,1,0,1
8,1,0,0,0



New data


tool_id,due_maintenance_No,due_maintenance_Yes,age_status_new,age_status_old
10,1,0,1,0


In [None]:
# Train model with processed categorical variables
model.fit(X_train_transformed, y_train)

In [None]:
# Predict train, test, and new data successfully
try:
  print('Train predictions:', model.predict(X_train_transformed))
  print('Test predictions:', model.predict(X_test_transformed))
  print('New predictions:', model.predict(X_new_transformed))
except Exception as e:
  print(e)

Train predictions: [ True False  True  True False]
Test predictions: [False  True False]
New predictions: [False]


In [None]:
# Save categorical preprocessor and trained model
with open('cat_encoder.pkl', 'wb') as preprocessor_file:
  pickle.dump(my_encoder, preprocessor_file)

with open('cat_model.pkl', 'wb') as model_file:
  pickle.dump(model, model_file)

After saving the trained transformer and model, we can load them to transform our data and get the model prediction directly.

In [None]:
# Load preprocessor
with open('cat_encoder.pkl', 'rb') as preprocessor_file:
  categorical_preprocessor = pickle.load(preprocessor_file)

# Load logistic regression (lr) model
with open('cat_model.pkl', 'rb') as model_file:
  lr_model_cat = pickle.load(model_file)

In [None]:
# Apply trained model to transformed data
lr_model_cat.predict( categorical_preprocessor.transform(X_test) )

array([False,  True, False])

## All features

So far, we have split the features into numerical and categorical to apply the respective transformations.

How could we use a simple function to process numerical and categorical data at once, returning the concatenated transformed features?

Let's create a simple function for doing that.

In [None]:
# Train features and target
X_train = df_train[FEATURES]
y_train = df_train[TARGET]

# Test features and target
X_test = df_test[FEATURES]
y_test = df_test[TARGET]

# Instance with unknown target
X_new = df_future_unique[FEATURES]

In [None]:
# Load trained preprocessor classes
with open('num_imputer.pkl', 'rb') as imputer_file:
  imputer = pickle.load(imputer_file)

with open('num_scaler.pkl', 'rb') as scaler_file:
  scaler = pickle.load(scaler_file)

with open('cat_encoder.pkl', 'rb') as encoder_file:
  encoder = pickle.load(encoder_file)

Function for transforming numerical and categorical features:

In [None]:
from functools import reduce

def preprocess_data(X: pd.DataFrame, numeric_features: list, categoric_features: list, numeric_preprocessors: list, categoric_preprocessor: list) -> pd.DataFrame:
  """
  Preprocess a pandas DataFrame using separate preprocessors for numeric and categorical features.

  This function applies a series of preprocessing steps to the numeric and categorical features of a dataset.
  It uses the `reduce` function to sequentially apply each preprocessor to the relevant subset of features.
  The transformed numeric and categorical features are then concatenated and returned as a single DataFrame.

  Parameters
  ----------
  X : pd.DataFrame
      The input dataset to be preprocessed.

  numeric_features : list
      A list of column names in X corresponding to numeric features.

  categoric_features : list
      A list of column names in X corresponding to categorical features.

  numeric_preprocessors : list
      A list of preprocessing objects (like those from scikit-learn) for numeric features.

  categoric_preprocessor : list
      A list of preprocessing objects for categorical features.

  Returns
  -------
  pd.DataFrame
      The preprocessed dataset with transformed numeric and categorical features.
  """
  X_num = reduce(lambda X, preprocessor: preprocessor.transform(X), numeric_preprocessors, X[numeric_features])
  X_cat = reduce(lambda X, preprocessor: preprocessor.transform(X), categoric_preprocessor, X[categoric_features])
  return pd.concat([X_num, X_cat], axis=1)

The above function applies the trained transformers to the numerical and categorical features in the order they were provided.
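The chaining done by `reduce` may be easier to see on a toy case. In this sketch, the hypothetical functions `add_one` and `double` stand in for the `transform` methods of two fitted preprocessors.

```python
from functools import reduce

def add_one(values):
    """Stand-in for a first preprocessing step."""
    return [v + 1 for v in values]

def double(values):
    """Stand-in for a second preprocessing step."""
    return [v * 2 for v in values]

steps = [add_one, double]

# reduce feeds the output of each step into the next one, left to right
chained = reduce(lambda data, step: step(data), steps, [1, 2, 3])

# Equivalent explicit nesting
nested = double(add_one([1, 2, 3]))
```

Both `chained` and `nested` hold the same result, which is why the order of the preprocessors in the list matters (imputation before scaling, for instance).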

Let's see it in action:

In [None]:
# Preprocess train, test, and new data using trained parameters
X_train_transformed = preprocess_data(X_train, NUMERICAL_FEATURES, CATEGORICAL_FEATURES, [imputer, scaler], [encoder])
X_test_transformed = preprocess_data(X_test, NUMERICAL_FEATURES, CATEGORICAL_FEATURES, [imputer, scaler], [encoder])
X_new_transformed = preprocess_data(X_new, NUMERICAL_FEATURES, CATEGORICAL_FEATURES, [imputer, scaler], [encoder])

# Display transformed data
print('Train data')
display(X_train_transformed)
print()

print('Test data')
display(X_test_transformed)
print()

print('New data')
display(X_new_transformed)

Train data


tool_id,temperature,pressure,due_maintenance_No,due_maintenance_Yes,age_status_new,age_status_old
1,1.647064,1.250828,0,1,0,1
2,-0.644503,-0.912767,1,0,1,0
3,-0.071611,0.70993,0,1,0,1
4,0.0,-1.047991,0,1,0,1
5,-0.930949,0.0,1,0,1,0



Test data


tool_id,temperature,pressure,due_maintenance_No,due_maintenance_Yes,age_status_new,age_status_old
6,-1.074172,-0.642317,0,1,1,0
7,-0.358057,0.574705,0,1,0,1
8,0.0,-1.372531,1,0,0,0



New data


tool_id,temperature,pressure,due_maintenance_No,due_maintenance_Yes,age_status_new,age_status_old
10,-3.165228,-0.236643,1,0,1,0


# Scikit-learn Transformers

We have manually built some working transformers with limited capabilities. But we don't need to use them, since scikit-learn provides equivalent transformers with many more options.

Let's see those same transformers from scikit-learn.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In scikit-learn, a common convention is to designate attributes that are estimated from the data as having a trailing underscore (_).

Therefore, we will use the following function to list only those attributes.

In [None]:
def list_trained_attributes(obj):
  """List trained attributes from sklearn classes"""
  return [attr for attr in dir(obj) if not attr.startswith('_') and attr.endswith('_')]

## Imputer

In [None]:
# Instantiate simple imputer
imputer = SimpleImputer(strategy='mean', fill_value=None, add_indicator=False)

In [None]:
# No trained attributes (ending with "_") are shown when .fit() is not applied
list_trained_attributes(imputer)

[]

In [None]:
# Train imputer
imputer.fit(X_train[NUMERICAL_FEATURES])

In [None]:
# Trained imputer attributes
list_trained_attributes(imputer)

['feature_names_in_', 'indicator_', 'n_features_in_', 'statistics_']

In [None]:
# Show stored statistics to be used to impute missing data (defined by the `strategy` parameter)
imputer.statistics_

array([ 122.5, 8375. ])

In [None]:
# Instantiate simple imputer with indicator
imputer = SimpleImputer(strategy='mean', fill_value=None, add_indicator=True)

# Train imputer
imputer.fit(X_train[NUMERICAL_FEATURES])

# Show trained imputer attributes
list_trained_attributes(imputer)

['feature_names_in_', 'indicator_', 'n_features_in_', 'statistics_']

In [None]:
# Show transformed data with missing indicator
imputer.transform(X_train[NUMERICAL_FEATURES])

tool_id,temperature,pressure,missingindicator_temperature,missingindicator_pressure
1,180.0,13000.0,0.0,0.0
2,100.0,5000.0,0.0,0.0
3,120.0,11000.0,0.0,0.0
4,122.5,4500.0,1.0,0.0
5,90.0,8375.0,0.0,1.0


In [None]:
# Show transformed test data with missing indicator
imputer.transform(X_test[NUMERICAL_FEATURES])

tool_id,temperature,pressure,missingindicator_temperature,missingindicator_pressure
6,85.0,6000.0,0.0,0.0
7,110.0,10500.0,0.0,0.0
8,122.5,3300.0,1.0,0.0


## Scaler

In [None]:
from sklearn.preprocessing import StandardScaler, scale

In [None]:
# Instantiate & fit standard scaler
scaler = StandardScaler().fit(X_train[NUMERICAL_FEATURES])

# List learned attributes
list_trained_attributes(scaler)

['feature_names_in_',
 'mean_',
 'n_features_in_',
 'n_samples_seen_',
 'scale_',
 'var_']
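One detail worth keeping in mind when comparing against a manual computation: `StandardScaler` uses the population standard deviation (`ddof=0`), whereas the pandas `.std()` method defaults to the sample standard deviation (`ddof=1`). The sketch below checks this on a small hypothetical column; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy column without missing values (assumed for illustration only)
temperature = pd.DataFrame({'temperature': [180.0, 100.0, 120.0, 122.5, 90.0]})

pandas_std = temperature['temperature'].std()            # sample std (ddof=1)
population_std = temperature['temperature'].std(ddof=0)  # population std (ddof=0)
sklearn_std = StandardScaler().fit(temperature).scale_[0]

matches_population = np.isclose(sklearn_std, population_std)
matches_sample = np.isclose(sklearn_std, pandas_std)
```

Here `matches_population` is true and `matches_sample` is false, so values scaled with pandas defaults and with `StandardScaler` differ slightly on the same data.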

In [None]:
# Show scaled train, test, and new data
print('Scaled train data')
display (scaler.transform(X_train_imputed[NUMERICAL_FEATURES]) )
print()

print('Scaled test data')
display (scaler.transform(X_test_imputed[NUMERICAL_FEATURES]) )
print()

print('Scaled new data')
display (scaler.transform(X_new[NUMERICAL_FEATURES]) )

Scaled train data


tool_id,temperature,pressure
1,1.647064,1.250828
2,-0.644503,-0.912767
3,-0.071611,0.70993
4,0.0,-1.047991
5,-0.930949,0.0



Scaled test data


tool_id,temperature,pressure
6,-1.074172,-0.642317
7,-0.358057,0.574705
8,0.0,-1.372531



Scaled new data


tool_id,temperature,pressure
10,-3.165228,-0.236643


If this preprocessing will not be part of a model/production pipeline, you can use the `scale` function directly:

In [None]:
# Applying standard scale with function
scale(X_train[NUMERICAL_FEATURES])

array([[ 1.64706421,  1.2508283 ],
       [-0.64450339, -0.9127666 ],
       [-0.07161149,  0.70992957],
       [        nan, -1.04799128],
       [-0.93094934,         nan]])

In [None]:
# Trick to keep dataframe index and column names :)
X_train[NUMERICAL_FEATURES].apply(scale)

tool_id,temperature,pressure
1,1.647064,1.250828
2,-0.644503,-0.912767
3,-0.071611,0.70993
4,,-1.047991
5,-0.930949,
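The reason `scale` is not suited to a pipeline is that it computes the mean and standard deviation from whatever data it receives, so test data would be scaled with its own statistics rather than the training ones. A minimal sketch on hypothetical arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

# Toy train/test arrays (assumed for illustration only)
train = np.array([[180.0], [100.0], [120.0], [90.0]])
test = np.array([[85.0], [110.0]])

# Statistics learned from the training data, then applied to the test data
scaled_with_train_stats = StandardScaler().fit(train).transform(test)

# Statistics recomputed from the test data itself
scaled_with_own_stats = scale(test)

results_differ = not np.allclose(scaled_with_train_stats, scaled_with_own_stats)
```

The two results differ: `scale(test)` is centered on the test mean, while the fitted `StandardScaler` keeps the test values expressed relative to the training distribution.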


## Categorical Encoders

In [None]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder

**One-hot encoder**

Note that, unlike our `MyOneHotEncoder` class, this sklearn class offers multiple options. For instance, when an unseen category appears, it can either raise an error or ignore it (`handle_unknown`). We can decide to drop some categories with `drop` (for binary variables, for example, it might be worth dropping one column since it is redundant). We can also set a minimum frequency (`min_frequency`) required for a category to get its own column. Categories that do not reach this frequency are grouped into a single column whose name ends in "infrequent_sklearn".

In [None]:
# Instantiate and fit one-hot encoder
ohe_encoder = OneHotEncoder(handle_unknown='ignore', drop=None, sparse_output=False, min_frequency=None, max_categories=None)
ohe_encoder.fit(X_train[CATEGORICAL_FEATURES])

In [None]:
# Transform categorical data
X_train_transformed = ohe_encoder.transform(X_train[CATEGORICAL_FEATURES])
X_test_transformed = ohe_encoder.transform(X_test[CATEGORICAL_FEATURES])
X_new_transformed = ohe_encoder.transform(X_new[CATEGORICAL_FEATURES])

# Show transformed categorical data
print('Train data')
display(X_train_transformed)
print()

print('Test data')
display(X_test_transformed)
print()

print('New data')
display(X_new_transformed)

Train data


tool_id,due_maintenance_No,due_maintenance_Yes,age_status_new,age_status_old
1,0.0,1.0,0.0,1.0
2,1.0,0.0,1.0,0.0
3,0.0,1.0,0.0,1.0
4,0.0,1.0,0.0,1.0
5,1.0,0.0,1.0,0.0



Test data


tool_id,due_maintenance_No,due_maintenance_Yes,age_status_new,age_status_old
6,0.0,1.0,1.0,0.0
7,0.0,1.0,0.0,1.0
8,1.0,0.0,0.0,0.0



New data


tool_id,due_maintenance_No,due_maintenance_Yes,age_status_new,age_status_old
10,1.0,0.0,1.0,0.0


In [None]:
# List learned attributes
list_trained_attributes(ohe_encoder)

['categories_',
 'drop_idx_',
 'feature_names_in_',
 'infrequent_categories_',
 'n_features_in_']

**Ordinal encoder**

OrdinalEncoder transforms categorical variables into numerical variables without creating new columns. It replaces the categories by numbers.

Despite the name "ordinal", `OrdinalEncoder` does not infer any meaningful order by default: it numbers the categories of each feature in sorted order (alphabetical for strings), as `categories_` shows further down. A category encoded as "3" is therefore not necessarily "greater" than a category encoded as "1", unless you pass the desired order explicitly through the `categories` parameter.

This matters because, although such an encoding works for tree-based models, it is not appropriate for models such as GLMs and distance-based models, which treat numerical features as ordered quantities and would read an arbitrary numbering as a real order.

In [None]:
# Instantiate and fit "ordinal" encoder
ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
ordinal_encoder.fit(X_train[CATEGORICAL_FEATURES])

In [None]:
# Transform categorical data
X_train_transformed = ordinal_encoder.transform(X_train[CATEGORICAL_FEATURES])
X_test_transformed = ordinal_encoder.transform(X_test[CATEGORICAL_FEATURES])
X_new_transformed = ordinal_encoder.transform(X_new[CATEGORICAL_FEATURES])

# Show transformed categorical data
print('Train data')
display(X_train_transformed)
print()

print('Test data')
display(X_test_transformed)
print()

print('New data')
display(X_new_transformed)

Train data


tool_id,due_maintenance,age_status
1,1.0,1.0
2,0.0,0.0
3,1.0,1.0
4,1.0,1.0
5,0.0,0.0



Test data


tool_id,due_maintenance,age_status
6,1.0,0.0
7,1.0,1.0
8,0.0,-1.0



New data


tool_id,due_maintenance,age_status
10,0.0,0.0


In [None]:
# List learned attributes
list_trained_attributes(ordinal_encoder)

['categories_', 'feature_names_in_', 'n_features_in_']

In [None]:
print(ordinal_encoder.categories_)
print(ordinal_encoder.feature_names_in_)

[array(['No', 'Yes'], dtype=object), array(['new', 'old'], dtype=object)]
['due_maintenance' 'age_status']
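
When the categories do have a natural order, it can be passed through the `categories` parameter so that the codes respect it. The sketch below compares the default (sorted) numbering with an explicit one. The toy data and the assumed order `new < old < ancient` are illustrative, not part of the notebook's dataset:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data with three age categories
ages = pd.DataFrame({'age_status': ['old', 'new', 'ancient', 'new']})

# Default: categories are numbered in sorted order (ancient=0, new=1, old=2)
default_encoder = OrdinalEncoder().fit(ages)
default_codes = default_encoder.transform(ages)

# Explicit order (assumed here): new=0, old=1, ancient=2
ordered_encoder = OrdinalEncoder(categories=[['new', 'old', 'ancient']]).fit(ages)
ordered_codes = ordered_encoder.transform(ages)

print(default_encoder.categories_)
print(ordered_encoder.categories_)
```

With the explicit order, a larger code now means an older tool, which is the property that linear and distance-based models rely on.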


**LabelEncoder**

Like `OrdinalEncoder`, `LabelEncoder` replaces categories with numbers. Unlike `OrdinalEncoder`, `LabelEncoder` was designed to be applied to the target variable (a 1D array), not to features. For this reason, it doesn't work in a feature preprocessing pipeline.

In [None]:
# Instantiate and train label encoder with target
le_encoder = LabelEncoder()
le_encoder.fit(y_train)

In [None]:
# Show train y values (boolean)
y_train

tool_id
1     True
2    False
3     True
4    False
5    False
Name: failed, dtype: bool

In [None]:
# Transform target data
y_train_transformed = le_encoder.transform(y_train)
y_test_transformed = le_encoder.transform(y_test)

# Show transformed targets
print('Train data')
display(y_train_transformed)
print()

print('Test data')
display(y_test_transformed)

Train data


array([1, 0, 1, 0, 0])


Test data


array([0, 1, 0])

In [None]:
# List learned attributes
list_trained_attributes(le_encoder)

['classes_']

In [None]:
# Show y classes
le_encoder.classes_

array([False,  True])

In [None]:
# Inverse transform
le_encoder.inverse_transform(y_train_transformed)

array([ True, False,  True, False, False])
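
One concrete consequence of `LabelEncoder` being designed for the target is that it only accepts 1D input. The following sketch, on a hypothetical two-column feature frame, shows that a single column is encoded fine while passing several feature columns at once is rejected:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical feature frame with two categorical columns
features = pd.DataFrame({
    'due_maintenance': ['Yes', 'No', 'Yes'],
    'age_status': ['old', 'new', 'old'],
})

label_encoder = LabelEncoder()

# A single column (1D) works, just like a target vector
single_column_codes = label_encoder.fit_transform(features['age_status'])
print(single_column_codes)

# Several columns at once (2D) are rejected with a ValueError
try:
    label_encoder.fit(features)
    rejected = False
except ValueError as error:
    rejected = True
    print(error)
```

For encoding several feature columns, `OrdinalEncoder` or `OneHotEncoder` are the appropriate choices.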

## Column transformers

So far, we have been subsetting the numerical and categorical data to apply our preprocessing.

Wouldn't it be easier if we could directly specify the columns we want to apply the respective preprocessing?

We can use `ColumnTransformer` for this:

In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
# Preprocessors (transformers)
numeric_preprocessor = SimpleImputer(strategy='mean')
categorical_preprocessor = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Create Column transformer (list of tuples: step name, transformer, list of columns)
preprocessor = ColumnTransformer([
    ('numeric', numeric_preprocessor, NUMERICAL_FEATURES),
    ('categorical', categorical_preprocessor, CATEGORICAL_FEATURES),
])

# Fit preprocessor
preprocessor.fit(X_train)

In [None]:
# Transform data
X_train_transformed = preprocessor.transform(X_train)
X_test_transformed = preprocessor.transform(X_test)
X_new_transformed = preprocessor.transform(X_new)

# Show transformed data (numerical and categorical)
print('Train data')
display(X_train_transformed)
print()

print('Test data')
display(X_test_transformed)
print()

print('New data')
display(X_new_transformed)

Train data


tool_id,numeric__temperature,numeric__pressure,categorical__due_maintenance_No,categorical__due_maintenance_Yes,categorical__age_status_new,categorical__age_status_old
1,180.0,13000.0,0.0,1.0,0.0,1.0
2,100.0,5000.0,1.0,0.0,1.0,0.0
3,120.0,11000.0,0.0,1.0,0.0,1.0
4,122.5,4500.0,0.0,1.0,0.0,1.0
5,90.0,8375.0,1.0,0.0,1.0,0.0



Test data


tool_id,numeric__temperature,numeric__pressure,categorical__due_maintenance_No,categorical__due_maintenance_Yes,categorical__age_status_new,categorical__age_status_old
6,85.0,6000.0,0.0,1.0,1.0,0.0
7,110.0,10500.0,0.0,1.0,0.0,1.0
8,122.5,3300.0,1.0,0.0,0.0,0.0



New data


tool_id,numeric__temperature,numeric__pressure,categorical__due_maintenance_No,categorical__due_maintenance_Yes,categorical__age_status_new,categorical__age_status_old
10,12.0,7500.0,1.0,0.0,1.0,0.0


Note that when using `ColumnTransformer` (including when it is part of a scikit-learn pipeline), the step name is prepended to each output column name, separated by a double underscore, so we can identify which transformation has been applied.

In case you don't want to see the step name, you can rename the columns as follows:

In [None]:
# Rename columns by removing the transformer step name
X_new_transformed.rename(columns={col: col.split('__', maxsplit=1)[1] for col in X_new_transformed.columns})

tool_id,temperature,pressure,due_maintenance_No,due_maintenance_Yes,age_status_new,age_status_old
10,12.0,7500.0,1.0,0.0,1.0,0.0


In [None]:
# List learned attributes
list_trained_attributes(preprocessor)

['feature_names_in_',
 'n_features_in_',
 'named_transformers_',
 'output_indices_',
 'sparse_output_',
 'transformers_']

In [None]:
# We can access the transformers separately as intended
preprocessor.named_transformers_

{'numeric': SimpleImputer(),
 'categorical': OneHotEncoder(handle_unknown='ignore', sparse_output=False)}

In [None]:
# Accessing and using one transformer from the Column transformer
trained_imputer = preprocessor.named_transformers_['numeric']
trained_imputer.transform(X_train[NUMERICAL_FEATURES])

tool_id,temperature,pressure
1,180.0,13000.0
2,100.0,5000.0
3,120.0,11000.0
4,122.5,4500.0
5,90.0,8375.0


An alternative (simpler) way to create a `ColumnTransformer` is `make_column_transformer`, which names the steps automatically after the transformer classes:

In [None]:
from sklearn.compose import make_column_transformer

In [None]:
# Preprocessors (transformers)
numeric_preprocessor = SimpleImputer(strategy='mean')
categorical_preprocessor = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Create Column transformer with make_column_transformer (tuples: transformer, list of columns)
preprocessor = make_column_transformer(
    (numeric_preprocessor, NUMERICAL_FEATURES),
    (categorical_preprocessor, CATEGORICAL_FEATURES),
)

# Fit preprocessor
preprocessor.fit(X_train)

In [None]:
# Transformer names
preprocessor.named_transformers_

{'simpleimputer': SimpleImputer(),
 'onehotencoder': OneHotEncoder(handle_unknown='ignore', sparse_output=False)}

In the [next notebook](https://drive.google.com/file/d/13q0UmHCZshnyJv0T3fIwvi8qDBwjeT_x/view?usp=sharing), let's learn how to build scikit-learn pipelines, from simple to complex. Check it out!