# Encoding, Transforming and Scaling Features

We will cover the following topics in this chapter:
- creating training datasets and avoiding data leakage
- identifying irrelevant or redundant features to be removed
- encoding categorical features
- encoding features with medium or high cardinality
- transforming features
- binning features
- scaling features

## Creating training datasets and avoiding data leakage

**Data leakage** occurs whenever our models are informed by data that is not in the training dataset. 


For example, if a feature has missing values, we might impute them with the feature's mean computed across the whole dataset. If we then split the data into training and testing sets to validate our model, we have accidentally introduced data leakage: the training data was imputed with a statistic (the global mean) that was partly computed from the testing rows.
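
To make this concrete, here is a minimal sketch of leakage-free mean imputation. The toy `score` column is made up for illustration and is not one of the chapter's datasets; the only point is that we split first and learn the mean from the training rows alone.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one feature with two missing values.
df = pd.DataFrame(
    {"score": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0, 9.0, 100.0]}
)

# Split first, so the test rows never influence any statistic we learn.
train, test = train_test_split(df, test_size=0.3, random_state=0)

# Fit the imputer on the training rows only, then apply it to both sets.
imputer = SimpleImputer(strategy="mean")
train_imputed = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)
test_imputed = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)

train_mean = train["score"].mean()  # the value actually used to fill gaps
global_mean = df["score"].mean()  # what a leaky pipeline would have used
print(train_mean, global_mean)
```

The same pattern (fit on the training data, transform both sets) applies to every transformer used in the rest of this chapter.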

## Removing redundant or unhelpful features

In [1]:
!pip install -qq feature-engine

In [7]:
import feature_engine.selection as fesel
from sklearn.model_selection import train_test_split

from data.load import load_ltpoland, load_nls97b

In [8]:
nls97 = load_nls97b()
ltpoland = load_ltpoland()

In [9]:
feature_cols = [
    "satverbal",
    "satmath",
    "gpascience",
    "gpaenglish",
    "gpamath",
    "gpaoverall",
]
X_train, X_test, y_train, y_test = train_test_split(
    nls97[feature_cols], nls97[["wageincome"]], test_size=0.3, random_state=0
)

In [10]:
X_train.corr()

,satverbal,satmath,gpascience,gpaenglish,gpamath,gpaoverall
satverbal,1.0,0.72889,0.438588,0.443692,0.375226,0.420707
satmath,0.72889,1.0,0.479757,0.430359,0.51777,0.484701
gpascience,0.438588,0.479757,1.0,0.671744,0.60634,0.792695
gpaenglish,0.443692,0.430359,0.671744,1.0,0.599713,0.843816
gpamath,0.375226,0.51777,0.60634,0.599713,1.0,0.750494
gpaoverall,0.420707,0.484701,0.792695,0.843816,0.750494,1.0


^ Here, `gpaoverall` is highly correlated with `gpascience`, `gpaenglish` and `gpamath`. 

The `corr` method returns Pearson coefficients by default. This is fine when we can assume a linear relationship between the features. When that assumption does not hold, we should consider requesting Spearman coefficients instead (`method="spearman"`).
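
The difference between the two coefficients is easiest to see on a relationship that is monotonic but not linear. The sketch below uses made-up data (`x` and `y = exp(x)`), not the NLS data: Spearman works on ranks and reports a perfect association, while Pearson is noticeably lower.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y always increases with x, but not along a straight line.
x = np.arange(1, 11, dtype=float)
df = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```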

Let's drop features that have a correlation higher than 0.75 with another feature.

In [11]:
tr = fesel.DropCorrelatedFeatures(variables=None, method="pearson", threshold=0.75)
tr.fit(X_train)
X_train_tr = tr.transform(X_train)
X_test_tr = tr.transform(X_test)
X_train_tr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6288 entries, 574974 to 370933
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   satverbal   1001 non-null   float64
 1   satmath     1001 non-null   float64
 2   gpascience  3998 non-null   float64
 3   gpaenglish  4078 non-null   float64
 4   gpamath     4056 non-null   float64
dtypes: float64(5)
memory usage: 294.8 KB


^ The column `gpaoverall` is dropped.
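
To see roughly what a correlation-based filter does, here is a pandas-only sketch. It assumes a simple greedy rule (keep a column only if it is not too correlated with any column already kept); this is an illustration and not necessarily how `feature_engine` implements `DropCorrelatedFeatures`. The helper `drop_correlated` and the toy columns are hypothetical.

```python
import pandas as pd


def drop_correlated(df, threshold=0.75, method="pearson"):
    """Greedy sketch: keep a column only if its absolute correlation with
    every previously kept column is at or below the threshold."""
    corr = df.corr(method=method).abs()
    keep = []
    for col in corr.columns:
        if all(corr.loc[col, kept] <= threshold for kept in keep):
            keep.append(col)
    return df[keep]


# Hypothetical toy data: b is almost a multiple of a, c is only weakly related.
toy = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6, 7, 8],
        "b": [2.1, 3.9, 6.1, 7.9, 10.1, 11.9, 14.1, 15.9],
        "c": [3, 1, 4, 1, 5, 9, 2, 6],
    }
)

reduced = drop_correlated(toy, threshold=0.75)
print(list(reduced.columns))
```

With this rule, which member of a correlated group survives depends on column order, so it is worth checking that the retained feature is the one we actually want.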

In [12]:
feature_cols = [
    "year",
    "month",
    "latabs",
    "latitude",
    "elevation",
    "longitude",
    "country",
]
X_train, X_test, y_train, y_test = train_test_split(
    ltpoland[feature_cols], ltpoland[["temperature"]], test_size=0.3, random_state=0
)

In [13]:
X_train.sample(5, random_state=99)

station,year,month,latabs,latitude,elevation,longitude,country
SIEDLCE,2019,11,52.25,52.25,152.0,22.25,Poland
OKECIE,2019,6,52.166,52.166,110.3,20.967,Poland
BALICE,2019,1,50.078,50.078,241.1,19.785,Poland
BALICE,2019,7,50.078,50.078,241.1,19.785,Poland
BIALYSTOK,2019,11,53.1,53.1,151.0,23.167,Poland


In [14]:
X_train.year.value_counts()

2019    84
Name: year, dtype: int64

In [15]:
X_train.country.value_counts()

Poland    84
Name: country, dtype: int64

In [16]:
(X_train.latitude != X_train.latabs).sum()

0

In [17]:
tr = fesel.DropConstantFeatures()
tr.fit(X_train)
X_train_tr = tr.transform(X_train)
X_test_tr = tr.transform(X_test)
X_train_tr.head()

station,month,latabs,latitude,elevation,longitude
OKECIE,1,52.166,52.166,110.3,20.967
LAWICA,8,52.421,52.421,93.9,16.826
LEBA,11,54.75,54.75,2.0,17.5331
SIEDLCE,10,52.25,52.25,152.0,22.25
BIALYSTOK,11,53.1,53.1,151.0,23.167


^ The features `country` and `year` have been dropped.

In [18]:
tr = fesel.DropDuplicateFeatures()
tr.fit(X_train_tr)
X_train_tr = tr.transform(X_train_tr)
X_train_tr.head()

station,month,latabs,elevation,longitude
OKECIE,1,52.166,110.3,20.967
LAWICA,8,52.421,93.9,16.826
LEBA,11,54.75,2.0,17.5331
SIEDLCE,10,52.25,152.0,22.25
BIALYSTOK,11,53.1,151.0,23.167


^ Features that have the same values as other features are dropped. In this case, the transformer drops `latitude`, which has the same values as `latabs`.
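
The two checks above can also be expressed directly in pandas, which helps clarify what the transformers look for. This is a rough sketch on a small hypothetical frame that mimics the Poland data; it is not the `feature_engine` implementation.

```python
import pandas as pd

# Hypothetical toy frame with two constant columns and one duplicated column.
df = pd.DataFrame(
    {
        "year": [2019, 2019, 2019],
        "month": [11, 6, 1],
        "latabs": [52.25, 50.078, 53.1],
        "latitude": [52.25, 50.078, 53.1],
        "country": ["Poland", "Poland", "Poland"],
    }
)

# A constant feature has a single unique value.
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) == 1]
no_constants = df.drop(columns=constant_cols)

# A duplicate feature repeats an earlier column value for value.
# Transposing lets us reuse DataFrame.duplicated, which flags later repeats.
duplicate_mask = no_constants.T.duplicated().to_numpy()
duplicate_cols = no_constants.columns[duplicate_mask].tolist()
cleaned = no_constants.drop(columns=duplicate_cols)

print(constant_cols, duplicate_cols, list(cleaned.columns))
```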

## Encoding categorical features

### One-hot encoding

One-hot encoding a feature creates a binary vector for each value of that feature.

In [3]:
import pandas as pd
from feature_engine.encoding import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

from data.load import load_nls97b

In [4]:
nls97 = load_nls97b()

In [5]:
feature_cols = ["gender", "maritalstatus", "colenroct99"]
nls97_demo = nls97[["wageincome"] + feature_cols].dropna()

X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    nls97_demo[feature_cols], nls97_demo[["wageincome"]], test_size=0.3, random_state=0
)

In [68]:
pd.get_dummies(X_demo_train, columns=["gender", "maritalstatus"]).head(2).T

personid,736081,832734
colenroct99,1. Not enrolled,1. Not enrolled
gender_Female,1,0
gender_Male,0,1
maritalstatus_Divorced,0,0
maritalstatus_Married,1,0
maritalstatus_Never-married,0,1
maritalstatus_Separated,0,0
maritalstatus_Widowed,0,0


Typically, we create `k-1` dummy variables for a feature with `k` unique values, since the omitted category is implied whenever all of the remaining dummies are 0.

In [7]:
pd.get_dummies(X_demo_train, columns=["gender", "maritalstatus"], drop_first=True).head(
    2
).T

personid,736081,832734
colenroct99,1. Not enrolled,1. Not enrolled
gender_Male,0,1
maritalstatus_Married,1,0
maritalstatus_Never-married,0,1
maritalstatus_Separated,0,0
maritalstatus_Widowed,0,0
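
One reason for dropping a level is that the full set of `k` dummies always sums to 1 in every row, which makes it perfectly collinear with an intercept term. The sketch below checks this numerically on a small made-up series (the values are illustrative, not drawn from the NLS data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy feature with k = 3 unique values.
status = pd.Series(
    ["Married", "Divorced", "Never-married", "Married", "Divorced", "Never-married"],
    name="maritalstatus",
)

full = pd.get_dummies(status, dtype=float)  # k columns
reduced = pd.get_dummies(status, drop_first=True, dtype=float)  # k-1 columns

# Prepend an intercept column of ones, as a linear model would.
intercept = np.ones((len(status), 1))
rank_full = np.linalg.matrix_rank(np.hstack([intercept, full.to_numpy()]))
rank_reduced = np.linalg.matrix_rank(np.hstack([intercept, reduced.to_numpy()]))

# With k dummies there are 4 columns but only rank 3 (redundant information);
# with k-1 dummies there are 3 columns and rank 3 (nothing redundant).
print(full.shape[1] + 1, rank_full)
print(reduced.shape[1] + 1, rank_reduced)
```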


In [8]:
ohe = OneHotEncoder(drop_last=True, variables=["gender", "maritalstatus"])
ohe.fit(X_demo_train)

X_demo_train_ohe = ohe.transform(X_demo_train)
X_demo_test_ohe = ohe.transform(X_demo_test)
X_demo_train_ohe.filter(regex="gen|mar", axis="columns").head(2).T

personid,736081,832734
gender_Female,1,0
maritalstatus_Married,1,0
maritalstatus_Never-married,0,1
maritalstatus_Divorced,0,0
maritalstatus_Separated,0,0


In [36]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

In [61]:
columns = ["gender", "maritalstatus"]
ohe = OneHotEncoder(drop="first")
ohe.fit(X_demo_train[columns])

X_demo_train_ohe = ohe.transform(X_demo_train[columns])
X_demo_train_ohe = pd.DataFrame(
    X_demo_train_ohe.toarray(),
    columns=ohe.get_feature_names_out(),
    index=X_demo_train.index,
)
X_demo_train_ohe.filter(regex="gen|mar", axis="columns").head(2).T

personid,736081,832734
gender_Male,0.0,1.0
maritalstatus_Married,1.0,0.0
maritalstatus_Never-married,0.0,1.0
maritalstatus_Separated,0.0,0.0
maritalstatus_Widowed,0.0,0.0


A better option is to use `ColumnTransformer` to perform one-hot encoding on selected columns.

In [66]:
from sklearn.compose import make_column_transformer

transformer = make_column_transformer(
    (OneHotEncoder(), ["gender", "maritalstatus"]),
    remainder="passthrough",
    verbose_feature_names_out=False,  # Prevent prefixing columns with `onehotencoder_` and `remainder_`.
)
transformed = transformer.fit_transform(X_demo_train)
pd.DataFrame(
    transformed, columns=transformer.get_feature_names_out(), index=X_demo_train.index
)

personid,gender_Female,gender_Male,maritalstatus_Divorced,maritalstatus_Married,maritalstatus_Never-married,maritalstatus_Separated,maritalstatus_Widowed,colenroct99
736081,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1. Not enrolled
832734,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1. Not enrolled
453537,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1. Not enrolled
322059,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1. Not enrolled
324323,1.0,0.0,0.0,1.0,0.0,0.0,0.0,2. 2-year college
...,...,...,...,...,...,...,...,...
975681,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1. Not enrolled
686050,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1. Not enrolled
393058,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1. Not enrolled
565534,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1. Not enrolled


### Summary

One-hot encoding is a fairly straightforward way to prepare nominal data for a machine learning algorithm.

### Ordinal encoding

Categorical features can be either nominal or ordinal. Gender and marital status are nominal: their values do not imply any order.

However, when a categorical feature is ordinal, we want the encoding to capture the ranking of the values. One-hot encoding will lose this ordering.

In [74]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

from data.load import load_nls97b

nls97 = load_nls97b()
feature_cols = ["gender", "maritalstatus", "colenroct99"]
nls97_demo = nls97[["wageincome"] + feature_cols].dropna()

X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    nls97_demo[feature_cols], nls97_demo[["wageincome"]], test_size=0.3, random_state=0
)

The college enrollment for October 1999 can be considered an ordinal feature. The values are strings, but there is an implied order.

In [71]:
X_demo_train.colenroct99.unique()

array(['1. Not enrolled', '2. 2-year college ', '3. 4-year college'],
      dtype=object)

In [73]:
X_demo_train.head()

personid,gender,maritalstatus,colenroct99
736081,Female,Married,1. Not enrolled
832734,Male,Never-married,1. Not enrolled
453537,Male,Married,1. Not enrolled
322059,Female,Divorced,1. Not enrolled
324323,Female,Married,2. 2-year college


In [75]:
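# Sort the labels so the encoding follows the "1.", "2.", "3." prefixes instead of order of first appearance.
oe = OrdinalEncoder(categories=[sorted(X_demo_train.colenroct99.unique())])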
colenr_enc = pd.DataFrame(
    oe.fit_transform(X_demo_train[["colenroct99"]]),
    columns=["colenroct99"],
    index=X_demo_train.index,
)
colenr_enc

personid,colenroct99
736081,0.0
832734,0.0
453537,0.0
322059,0.0
324323,1.0
...,...
975681,0.0
686050,0.0
393058,0.0
565534,0.0


In [77]:
X_demo_train_enc = X_demo_train[["gender", "maritalstatus"]].join(colenr_enc)
X_demo_train_enc

personid,gender,maritalstatus,colenroct99
736081,Female,Married,0.0
832734,Male,Never-married,0.0
453537,Male,Married,0.0
322059,Female,Divorced,0.0
324323,Female,Married,1.0
...,...,...,...
975681,Male,Married,0.0
686050,Female,Never-married,0.0
393058,Male,Married,0.0
565534,Male,Married,0.0


In [78]:
X_demo_train.colenroct99.value_counts().sort_index()

1. Not enrolled       3050
2. 2-year college      142
3. 4-year college      350
Name: colenroct99, dtype: int64

In [79]:
X_demo_train_enc.colenroct99.value_counts().sort_index()

0.0    3050
1.0     142
2.0     350
Name: colenroct99, dtype: int64
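
The `colenroct99` labels happen to sort correctly because they carry numeric prefixes. When labels do not sort into their natural rank, the order has to be passed explicitly. The sketch below uses a hypothetical `size` feature to show the difference between the default (alphabetical) ordering and an explicit one.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature whose alphabetical order is not its rank order.
sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Default: categories are sorted alphabetically (large < medium < small).
default_codes = OrdinalEncoder().fit_transform(sizes).ravel()

# Explicit: we state the ranking ourselves (small < medium < large).
ordered = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordered_codes = ordered.fit_transform(sizes).ravel()

print(default_codes)  # "small" gets the highest code
print(ordered_codes)  # "small" gets the lowest code
```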

### Summary

The ordinal encoding replaces the initial values for `colenroct99` with numbers from 0 to 2. 

> Note: Ordinal encoding is appropriate for non-linear models such as decision trees. It might not make sense in a linear regression model, because that would assume that the distance between adjacent values is equally meaningful across the whole distribution.

## Encoding categorical features with medium or high cardinality


It can be impractical to create a dummy variable for each value when working with a categorical feature that has many unique values. 

There are a couple of ways to handle medium or high cardinality:
- create dummies for the top k categories and group the remaining into an _other_ category
- use feature hashing

In [83]:
!pip install -qq category_encoders
import pandas as pd
from category_encoders.hashing import HashingEncoder
from feature_engine.encoding import OneHotEncoder
from sklearn.model_selection import train_test_split

from data.load import load_covid

In [85]:
covid = load_covid()
feature_cols = [
    "location",
    "population",
    "aged_65_older",
    "diabetes_prevalence",
    "region",
]
covid = covid[["total_cases"] + feature_cols].dropna()

In [88]:
X_train, X_test, y_train, y_test = train_test_split(
    covid[feature_cols], covid[["total_cases"]], test_size=0.3, random_state=0
)

The feature `region` has 16 unique values.

In [89]:
X_train.region.value_counts()

Eastern Europe     16
East Asia          12
Western Europe     12
West Africa        11
West Asia          10
East Africa        10
South Asia          7
Southern Africa     7
South America       7
Central Africa      7
Oceania / Aus       6
Caribbean           6
Central Asia        5
North Africa        4
Central America     3
North America       3
Name: region, dtype: int64

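We use the `OneHotEncoder` class from `feature_engine` to encode the `region` feature. This time, we use the `top_categories` parameter to indicate that we only want to create dummies for the six most frequent category values. Any values that do not fall into the top six will have a 0 for all of the dummies. Note that the cell below fits the encoder on the full `covid` DataFrame; when building a model, we would fit it on `X_train` and only transform `X_test` to avoid leakage.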

In [90]:
ohe = OneHotEncoder(top_categories=6, variables=["region"])
covid_ohe = ohe.fit_transform(covid)
covid_ohe.filter(regex="location|region", axis="columns").sample(5, random_state=99).T

iso_code,ISR,SEN,IDN,LKA,KEN
location,Israel,Senegal,Indonesia,Sri Lanka,Kenya
region_Eastern Europe,0,0,0,0,0
region_Western Europe,0,0,0,0,0
region_West Africa,0,1,0,0,0
region_East Asia,0,0,1,0,0
region_West Asia,1,0,0,0,0
region_East Africa,0,0,0,0,1


An alternative to one-hot encoding, when a categorical feature has many unique values, is to use **feature hashing**.
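
The idea behind feature hashing is to pass each category value through a hash function and use the result, modulo a fixed number of columns, as the column index. The sketch below illustrates that mechanism with the standard library's `hashlib`; it is a simplified stand-in and not the implementation used by `category_encoders`. The helpers `hash_bucket` and `hash_encode` are hypothetical names.

```python
import hashlib

import numpy as np
import pandas as pd


def hash_bucket(value, n_components):
    """Map a category value to a column index in [0, n_components)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components


def hash_encode(series, n_components=8):
    """Encode a categorical series into a fixed number of indicator columns."""
    buckets = series.map(lambda value: hash_bucket(value, n_components)).to_numpy()
    encoded = np.zeros((len(series), n_components), dtype=int)
    encoded[np.arange(len(series)), buckets] = 1
    columns = [f"col_{i}" for i in range(n_components)]
    return pd.DataFrame(encoded, columns=columns, index=series.index)


regions = pd.Series(
    ["Eastern Europe", "East Asia", "West Africa", "Eastern Europe"], name="region"
)
hashed = hash_encode(regions, n_components=8)
print(hashed)
```

The number of columns is fixed in advance regardless of how many unique values the feature has. The trade-off is that two different categories can collide in the same column.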