## One Hot Encoding - Feature-engine

[Feature Engineering for Machine Learning Course](https://www.trainindata.com/p/feature-engineering-for-machine-learning)

We will see how to perform one hot encoding with Feature-engine using the Titanic dataset.

For guidelines to obtain the data, check **section 2** of the course.

In [1]:
!pip show scikit-learn

Name: scikit-learn
Version: 1.2.2
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: 
Author-email: 
License: new BSD
Location: /usr/local/lib/python3.10/dist-packages
Requires: joblib, numpy, scipy, threadpoolctl
Required-by: bayesian-optimization, bigframes, Boruta, category-encoders, cesium, eli5, fastai, hep_ml, imbalanced-learn, librosa, lime, mlxtend, nilearn, pyLDAvis, rgf-python, scikit-learn-intelex, scikit-optimize, scikit-plot, shap, sklearn-pandas, TPOT, tsfresh, woodwork, yellowbrick


Recent versions of scikit-learn are incompatible with the Feature-engine version I have:
https://stackoverflow.com/questions/79290968/super-object-has-no-attribute-sklearn-tags

In [2]:
!pip uninstall -y scikit-learn
!pip install scikit-learn==1.4

Found existing installation: scikit-learn 1.2.2
Uninstalling scikit-learn-1.2.2:
  Successfully uninstalled scikit-learn-1.2.2
Collecting scikit-learn==1.4
  Downloading scikit_learn-1.4.0-1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading scikit_learn-1.4.0-1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
Installing collected packages: scikit-learn
Successfully installed scikit-learn-1.4.0


In [3]:
!pip install feature-engine

Collecting feature-engine
  Downloading feature_engine-1.8.2-py2.py3-none-any.whl.metadata (9.9 kB)
Collecting pandas>=2.2.0 (from feature-engine)
  Downloading pandas-2.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Downloading feature_engine-1.8.2-py2.py3-none-any.whl (374 kB)
Downloading pandas-2.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
Installing collected packages: pandas, feature-engine
  Attempting uninstall: pandas
    Found existing installation: pandas 2.1.4
    Uninstalling pandas-2.1.4:
      Successfully uninstalled p

In [4]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Feature-engine
from feature_engine.encoding import OneHotEncoder
from feature_engine.imputation import CategoricalImputer

In [5]:
# load titanic dataset

usecols = ["pclass", "sibsp", "parch", "sex", "embarked", "cabin", "survived"]

data = pd.read_csv("/kaggle/input/feml-titanic/titanic.csv", usecols=usecols)
data["cabin"] = data["cabin"].str[0]

data.head()

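| | pclass | survived | sex | sibsp | parch | cabin | embarked |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | female | 0 | 0 | B | S |
| 1 | 1 | 1 | male | 1 | 2 | C | S |
| 2 | 1 | 0 | female | 1 | 2 | C | S |
| 3 | 1 | 0 | male | 1 | 2 | C | S |
| 4 | 1 | 0 | female | 1 | 2 | C | S |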


### Important note on encoding

Just like imputation, all methods of categorical encoding should be performed over the training set, and then propagated to the test set. 

Why? 

Because these methods "learn" patterns from the training data, and we want to avoid leaking information and overfitting. More importantly, we don't know whether future or live data will contain all the categories present in the training data, or whether there will be more or fewer categories. We therefore anticipate this uncertainty by setting up the right process from the start: we create transformers that learn the categories from the train set and use those learned categories to create the dummy variables in both the train and test sets.
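To make the idea concrete, here is a minimal sketch in plain pandas (not Feature-engine) of learning the categories on the train set and reusing them on the test set. The toy series and the `one_hot` helper are hypothetical and exist only for illustration; the label `"X"` stands in for a category that appears only in new data.

```python
import pandas as pd

# Toy data (made up for this sketch). "X" never appears in the train set.
toy_train = pd.Series(["S", "C", "S", "Q"], name="embarked")
toy_test = pd.Series(["C", "S", "X"], name="embarked")

# "Fit": learn the categories from the train set only.
learned_categories = sorted(toy_train.unique())


def one_hot(series, categories):
    """Create one binary column per learned category.

    Labels that are not in `categories` get zeros in every column.
    """
    return pd.DataFrame(
        {f"{series.name}_{cat}": (series == cat).astype(int) for cat in categories},
        index=series.index,
    )


# "Transform": apply the same learned categories to both sets.
train_dummies = one_hot(toy_train, learned_categories)
test_dummies = one_hot(toy_test, learned_categories)

print(train_dummies)
print(test_dummies)
```

Both frames end up with the same columns, and the unseen label in the test set is encoded as a row of zeros instead of creating a new column that the model never saw during training.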

In [6]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("survived", axis=1),  # predictors
    data["survived"],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0,  # seed to ensure reproducibility
)

X_train.shape, X_test.shape

((916, 6), (393, 6))

## One hot encoding with Feature-engine

### Advantages

- quick
- returns dataframe
- returns feature names
- lets you select the features to encode
- appends dummies to original dataset

### Limitations

- Not sure yet.

In [7]:
# To start, we fill missing values manually. Later on,
# we add this step to a pipeline.
# Assigning the result avoids modifying a slice in place.

X_train = X_train.fillna("Missing")
X_test = X_test.fillna("Missing")

In [8]:
# set up encoder

encoder = OneHotEncoder(
    variables=None,  # alternatively pass a list of variables
    drop_last=True,  # return k-1 dummies; use drop_last=False to return k dummies
)
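The difference between k and k-1 dummies can be illustrated on a toy column. This is a minimal sketch using `pd.get_dummies` rather than Feature-engine, with made-up data; which column counts as "last" here follows pandas' alphabetical ordering and may differ from the encoder configured above.

```python
import pandas as pd

# Toy binary variable (made up for this sketch).
sex = pd.Series(["male", "female", "female", "male"], name="sex")

# k dummies: one column per category.
k_dummies = pd.get_dummies(sex, prefix="sex", dtype=int)

# k-1 dummies: drop the last column.
k_minus_1 = k_dummies.iloc[:, :-1]

# The dropped column is redundant: it equals 1 minus the sum of the others.
recovered = 1 - k_minus_1.sum(axis=1)

print(k_dummies)
print(k_minus_1)
print((recovered == k_dummies.iloc[:, -1]).all())
```

Because the dropped column can always be reconstructed from the remaining ones, k-1 dummies carry the same information as k dummies.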

In [9]:
# fit the encoder (finds categories)

encoder.fit(X_train)

In [None]:
# automatically found categorical variables

encoder.variables_

In [None]:
# we observe the learned categories

encoder.encoder_dict_

In [None]:
# transform the data sets

X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)

X_train_enc.head()

In [None]:
# we can retrieve the feature names as follows:

encoder.get_feature_names_out()

## Imputation and encoding

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("survived", axis=1),  # predictors
    data["survived"],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0,  # seed to ensure reproducibility
)

X_train.shape, X_test.shape

In [None]:
# check for missing data

X_train.isnull().sum()

In [None]:
# set up encoder and imputation in pipeline
# we only want to impute categorical variables

pipe = Pipeline(
    [
        ("imputer", CategoricalImputer()),
        ("ohe", OneHotEncoder(drop_last=True)),
    ]
)
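For comparison, the same impute-then-encode idea can be sketched with scikit-learn's own `SimpleImputer` and `OneHotEncoder` on toy data. This is not the Feature-engine pipeline above, only an illustration of the two steps; the toy frames and the `"Missing"` fill value are assumptions, and the encoder is imported under an alias to avoid clashing with Feature-engine's class of the same name.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder as SklearnOneHotEncoder

# Toy data (made up for this sketch). "Q" appears only in the test set.
toy_train = pd.DataFrame({"embarked": ["S", "C", np.nan, "S"]}, dtype=object)
toy_test = pd.DataFrame({"embarked": ["C", np.nan, "Q"]}, dtype=object)

toy_pipe = Pipeline(
    [
        # Step 1: replace missing values with the string "Missing".
        ("imputer", SimpleImputer(strategy="constant", fill_value="Missing")),
        # Step 2: one hot encode; unseen categories become all zeros.
        ("ohe", SklearnOneHotEncoder(handle_unknown="ignore")),
    ]
)

# Learn the fill value and the categories from the train set only.
toy_pipe.fit(toy_train)

learned = list(toy_pipe.named_steps["ohe"].categories_[0])
test_dummies = toy_pipe.transform(toy_test).toarray()

print(learned)
print(test_dummies)
```

In this sketch the imputed `"Missing"` label becomes a category of its own, so missingness is preserved as a separate dummy column instead of being discarded.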

In [None]:
# fit pipeline

pipe.fit(X_train)

In [None]:
# transform data

X_train_enc = pipe.transform(X_train)
X_test_enc = pipe.transform(X_test)

X_train_enc.head()