# Encoding, Transforming and Scaling Features

We will cover the following topics in this chapter:
- creating training datasets and avoiding data leakage
- identifying irrelevant or redundant features to be removed
- encoding categorical features
- encoding features with medium or high cardinality
- transforming features
- binning features
- scaling features

## Creating training datasets and avoiding data leakage

**Data leakage** occurs whenever our models are informed by data that is not in the training dataset. 


For example, suppose a feature has missing values and we impute them with the mean computed across the whole dataset. If we then split the data into training and testing sets to validate our model, we have accidentally introduced data leakage: the imputed values in the training set were derived from the full dataset (that is, the global mean), which includes the testing rows.
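
One way to avoid the leak described above is to split first and compute the imputation value from the training rows only. The following is a minimal sketch on synthetic data; the `score` column and the random seeds are hypothetical and are not part of the datasets used in this chapter.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): one feature with roughly 20% missing values.
rng = np.random.default_rng(0)
df = pd.DataFrame({"score": rng.normal(50, 10, size=200)})
df.loc[df.sample(frac=0.2, random_state=0).index, "score"] = np.nan

# Split first, so the test rows play no part in any statistic we compute.
test = df.sample(frac=0.3, random_state=1)
train = df.drop(test.index)

# Learn the imputation value from the training rows only...
train_mean = train["score"].mean()

# ...and apply that same value to both splits.
train_imputed = train.fillna({"score": train_mean})
test_imputed = test.fillna({"score": train_mean})

# The leaky alternative would use the global mean, which also "sees" test rows.
global_mean = df["score"].mean()
print(f"train mean: {train_mean:.3f}, global mean: {global_mean:.3f}")
```

The two means are usually close, but only the first one is computed without looking at the testing data.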

## Removing redundant or unhelpful features

In [1]:
!pip install -qq feature-engine

In [7]:
import feature_engine.selection as fesel
from sklearn.model_selection import train_test_split

from data.load import load_ltpoland, load_nls97b

In [8]:
nls97 = load_nls97b()
ltpoland = load_ltpoland()

In [9]:
feature_cols = [
    "satverbal",
    "satmath",
    "gpascience",
    "gpaenglish",
    "gpamath",
    "gpaoverall",
]
X_train, X_test, y_train, y_test = train_test_split(
    nls97[feature_cols], nls97[["wageincome"]], test_size=0.3, random_state=0
)

In [10]:
X_train.corr()

| | satverbal | satmath | gpascience | gpaenglish | gpamath | gpaoverall |
|---|---|---|---|---|---|---|
| satverbal | 1.0 | 0.72889 | 0.438588 | 0.443692 | 0.375226 | 0.420707 |
| satmath | 0.72889 | 1.0 | 0.479757 | 0.430359 | 0.51777 | 0.484701 |
| gpascience | 0.438588 | 0.479757 | 1.0 | 0.671744 | 0.60634 | 0.792695 |
| gpaenglish | 0.443692 | 0.430359 | 0.671744 | 1.0 | 0.599713 | 0.843816 |
| gpamath | 0.375226 | 0.51777 | 0.60634 | 0.599713 | 1.0 | 0.750494 |
| gpaoverall | 0.420707 | 0.484701 | 0.792695 | 0.843816 | 0.750494 | 1.0 |


^ Here, `gpaoverall` is highly correlated with `gpascience`, `gpaenglish` and `gpamath`. 

The `corr` method returns Pearson coefficients by default. This is fine when we can assume a linear relationship between the features. When that assumption does not hold, we should consider requesting Spearman coefficients instead (`method="spearman"`).
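
To see why the choice of coefficient matters, here is a small sketch on made-up data where `y` grows exponentially with `x`. The relationship is perfectly monotonic but not linear, so we would expect the Spearman coefficient to be 1 while the Pearson coefficient is noticeably lower. The column names are hypothetical.

```python
import numpy as np
import pandas as pd

# A monotonic but non-linear relationship: y = exp(x).
x = np.linspace(0, 5, 50)
df = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson measures linear association; Spearman works on ranks.
pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"pearson:  {pearson:.3f}")
print(f"spearman: {spearman:.3f}")
```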

Let's drop features that have a correlation higher than 0.75 with another feature.

In [11]:
tr = fesel.DropCorrelatedFeatures(variables=None, method="pearson", threshold=0.75)
tr.fit(X_train)
X_train_tr = tr.transform(X_train)
X_test_tr = tr.transform(X_test)
X_train_tr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6288 entries, 574974 to 370933
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   satverbal   1001 non-null   float64
 1   satmath     1001 non-null   float64
 2   gpascience  3998 non-null   float64
 3   gpaenglish  4078 non-null   float64
 4   gpamath     4056 non-null   float64
dtypes: float64(5)
memory usage: 294.8 KB


^ The column `gpaoverall` is dropped.
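
To make the idea concrete, here is a rough pandas-only sketch of correlation-based dropping. It is an assumption about the general approach, not feature-engine's actual implementation: we walk through the columns in order and drop any later column whose absolute correlation with a kept column exceeds the threshold. The function name `drop_correlated` and the synthetic columns are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.75, method="pearson"):
    """Drop later columns that correlate too strongly with an earlier kept column."""
    corr = df.corr(method=method).abs()
    cols = list(corr.columns)
    to_drop = []
    for i, col in enumerate(cols):
        if col in to_drop:
            continue  # a dropped column should not cause further drops
        for other in cols[i + 1:]:
            if other not in to_drop and corr.loc[col, other] > threshold:
                to_drop.append(other)
    return df.drop(columns=to_drop), to_drop


# Synthetic data: b is almost a copy of a, c is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X = pd.DataFrame(
    {"a": a, "b": a + 0.1 * rng.normal(size=500), "c": rng.normal(size=500)}
)

X_reduced, dropped = drop_correlated(X, threshold=0.75)
print(dropped)
print(list(X_reduced.columns))
```

With this ordering rule, the first feature of a correlated pair is the one that survives, which is consistent with `gpaoverall` (the last column) being the one removed above.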

In [12]:
feature_cols = [
    "year",
    "month",
    "latabs",
    "latitude",
    "elevation",
    "longitude",
    "country",
]
X_train, X_test, y_train, y_test = train_test_split(
    ltpoland[feature_cols], ltpoland[["temperature"]], test_size=0.3, random_state=0
)

In [13]:
X_train.sample(5, random_state=99)

| station | year | month | latabs | latitude | elevation | longitude | country |
|---|---|---|---|---|---|---|---|
| SIEDLCE | 2019 | 11 | 52.25 | 52.25 | 152.0 | 22.25 | Poland |
| OKECIE | 2019 | 6 | 52.166 | 52.166 | 110.3 | 20.967 | Poland |
| BALICE | 2019 | 1 | 50.078 | 50.078 | 241.1 | 19.785 | Poland |
| BALICE | 2019 | 7 | 50.078 | 50.078 | 241.1 | 19.785 | Poland |
| BIALYSTOK | 2019 | 11 | 53.1 | 53.1 | 151.0 | 23.167 | Poland |


In [14]:
X_train.year.value_counts()

2019    84
Name: year, dtype: int64

In [15]:
X_train.country.value_counts()

Poland    84
Name: country, dtype: int64

In [16]:
(X_train.latitude != X_train.latabs).sum()

0

In [17]:
tr = fesel.DropConstantFeatures()
tr.fit(X_train)
X_train_tr = tr.transform(X_train)
X_test_tr = tr.transform(X_test)
X_train_tr.head()

| station | month | latabs | latitude | elevation | longitude |
|---|---|---|---|---|---|
| OKECIE | 1 | 52.166 | 52.166 | 110.3 | 20.967 |
| LAWICA | 8 | 52.421 | 52.421 | 93.9 | 16.826 |
| LEBA | 11 | 54.75 | 54.75 | 2.0 | 17.5331 |
| SIEDLCE | 10 | 52.25 | 52.25 | 152.0 | 22.25 |
| BIALYSTOK | 11 | 53.1 | 53.1 | 151.0 | 23.167 |


^ The features `country` and `year` have been dropped, since each has only a single value in the training data.
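
The same check can be approximated in plain pandas by counting unique values per column. This is a minimal sketch on a toy frame that mimics the columns above, not a description of how feature-engine implements it.

```python
import pandas as pd

# Toy frame mimicking the structure above: year and country never vary.
X = pd.DataFrame(
    {
        "year": [2019, 2019, 2019, 2019],
        "month": [1, 6, 7, 11],
        "country": ["Poland", "Poland", "Poland", "Poland"],
    }
)

# A column is constant when it has exactly one distinct value.
constant_cols = [col for col in X.columns if X[col].nunique(dropna=False) == 1]
X_reduced = X.drop(columns=constant_cols)

print(constant_cols)
print(list(X_reduced.columns))
```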

In [18]:
tr = fesel.DropDuplicateFeatures()
tr.fit(X_train_tr)
X_train_tr = tr.transform(X_train_tr)
X_test_tr = tr.transform(X_test_tr)
X_train_tr.head()

| station | month | latabs | elevation | longitude |
|---|---|---|---|---|
| OKECIE | 1 | 52.166 | 110.3 | 20.967 |
| LAWICA | 8 | 52.421 | 93.9 | 16.826 |
| LEBA | 11 | 54.75 | 2.0 | 17.5331 |
| SIEDLCE | 10 | 52.25 | 152.0 | 22.25 |
| BIALYSTOK | 11 | 53.1 | 151.0 | 23.167 |


^ Features that have the same values as other features are dropped. In this case, the transformer drops `latitude`, which has the same values as `latabs`.
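
For intuition, duplicated columns can also be found in plain pandas by transposing the frame and looking for duplicated rows. The sketch below uses a few values from the table above in a toy frame; it illustrates the idea and is not feature-engine's implementation.

```python
import pandas as pd

# Toy frame: latitude repeats latabs exactly.
X = pd.DataFrame(
    {
        "latabs": [52.166, 52.421, 54.75],
        "latitude": [52.166, 52.421, 54.75],
        "elevation": [110.3, 93.9, 2.0],
    }
)

# After transposing, each original column is a row, so duplicated()
# flags every column that repeats an earlier one (the first copy is kept).
duplicate_cols = X.columns[X.T.duplicated()].tolist()
X_reduced = X.drop(columns=duplicate_cols)

print(duplicate_cols)
print(list(X_reduced.columns))
```

Transposing can be slow on wide or very long frames, so this is best treated as a quick check rather than a production approach.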

## Encoding categorical features

### One-hot encoding

One-hot encoding a feature creates a binary column for each unique value of that feature.

In [20]:
import pandas as pd
from feature_engine.encoding import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

from data.load import load_nls97b

In [21]:
nls97 = load_nls97b()

In [22]:
feature_cols = ["gender", "maritalstatus", "colenroct99"]
nls97_demo = nls97[["wageincome"] + feature_cols].dropna()

X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    nls97_demo[feature_cols], nls97_demo[["wageincome"]], test_size=0.3, random_state=0
)

In [23]:
pd.get_dummies(X_demo_train, columns=["gender", "maritalstatus"]).head(2).T

| personid | 736081 | 832734 |
|---|---|---|
| colenroct99 | 1. Not enrolled | 1. Not enrolled |
| gender_Female | 1 | 0 |
| gender_Male | 0 | 1 |
| maritalstatus_Divorced | 0 | 0 |
| maritalstatus_Married | 1 | 0 |
| maritalstatus_Never-married | 0 | 1 |
| maritalstatus_Separated | 0 | 0 |
| maritalstatus_Widowed | 0 | 0 |


Typically, we create `k-1` dummy variables for a feature with `k` unique values, since the remaining value is fully determined by the others.

In [25]:
pd.get_dummies(X_demo_train, columns=["gender", "maritalstatus"], drop_first=True).head(
    2
).T

| personid | 736081 | 832734 |
|---|---|---|
| colenroct99 | 1. Not enrolled | 1. Not enrolled |
| gender_Male | 0 | 1 |
| maritalstatus_Married | 1 | 0 |
| maritalstatus_Never-married | 0 | 1 |
| maritalstatus_Separated | 0 | 0 |
| maritalstatus_Widowed | 0 | 0 |
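
The reason `k-1` columns are enough can be checked directly: with all `k` dummies, every row sums to 1, so any one column can be reconstructed from the rest. Here is a small sketch on a made-up `maritalstatus` column (the values are illustrative, not taken from the NLS data).

```python
import pandas as pd

# Toy data with three categories.
s = pd.DataFrame(
    {"maritalstatus": ["Married", "Never-married", "Divorced", "Married"]}
)

full = pd.get_dummies(s, columns=["maritalstatus"], dtype=int)
reduced = pd.get_dummies(s, columns=["maritalstatus"], drop_first=True, dtype=int)

# With all k dummies, each row has exactly one 1.
row_sums = full.sum(axis=1)

# The dropped column (the first category, Divorced) is implied by the others.
recovered = 1 - reduced.sum(axis=1)

print(row_sums.tolist())
print(recovered.tolist())
```

Because the full set of dummies always sums to 1, it is perfectly collinear with an intercept term, which is why linear models in particular tend to use the reduced encoding.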


In [26]:
ohe = OneHotEncoder(drop_last=True, variables=["gender", "maritalstatus"])
ohe.fit(X_demo_train)

X_demo_train_ohe = ohe.transform(X_demo_train)
X_demo_test_ohe = ohe.transform(X_demo_test)
X_demo_train_ohe.filter(regex="gen|mar", axis="columns").head(2).T

| personid | 736081 | 832734 |
|---|---|---|
| gender_Female | 1 | 0 |
| maritalstatus_Married | 1 | 0 |
| maritalstatus_Never-married | 0 | 1 |
| maritalstatus_Divorced | 0 | 0 |
| maritalstatus_Separated | 0 | 0 |
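
One practical reason to prefer a fitted encoder over calling `pd.get_dummies` on each split is that the training and testing data may not contain the same categories. The sketch below shows the mismatch on made-up data and one possible pandas-only fix using `reindex`. It illustrates the motivation for the fit/transform pattern and is not how feature-engine works internally.

```python
import pandas as pd

# Toy splits: the test data happens to contain only one category.
train = pd.DataFrame({"maritalstatus": ["Married", "Divorced", "Widowed"]})
test = pd.DataFrame({"maritalstatus": ["Married", "Married"]})

train_dummies = pd.get_dummies(train, columns=["maritalstatus"], dtype=int)
test_dummies = pd.get_dummies(test, columns=["maritalstatus"], dtype=int)

# Encoding each split separately gives a different number of columns.
print(train_dummies.shape[1], test_dummies.shape[1])

# Align the test columns to those learned from the training data,
# filling categories that never appear in the test split with 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```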
