## One Hot Encoding - Feature-engine

[Feature Engineering for Machine Learning Course](https://www.trainindata.com/p/feature-engineering-for-machine-learning)

We will see how to perform one hot encoding with Feature-engine using the Titanic dataset.

For guidelines on obtaining the data, see **section 2** of the course.

In [1]:
!pip install feature-engine


In [2]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Feature-engine
from feature_engine.encoding import OneHotEncoder
from feature_engine.imputation import CategoricalImputer

In [4]:
# load titanic dataset

usecols = ["pclass", "sibsp", "parch", "sex", "embarked", "cabin", "survived"]

data = pd.read_csv("/kaggle/input/feml-titanic/titanic.csv", usecols=usecols)
data["cabin"] = data["cabin"].str[0]

data.head()

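| | pclass | survived | sex | sibsp | parch | cabin | embarked |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | female | 0 | 0 | B | S |
| 1 | 1 | 1 | male | 1 | 2 | C | S |
| 2 | 1 | 0 | female | 1 | 2 | C | S |
| 3 | 1 | 0 | male | 1 | 2 | C | S |
| 4 | 1 | 0 | female | 1 | 2 | C | S |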


### Important: encoding

Just like imputation, every categorical encoding method should be fitted on the training set and then applied to the test set.

Why?

Because these methods "learn" patterns from the training data, fitting them on the full dataset would leak information and encourage overfitting. More importantly, we don't know whether future or live data will contain all the categories present in the training data, or whether there will be more or fewer. We anticipate this uncertainty by setting up the right process from the start: we create transformers that learn the categories from the train set and use those learned categories to create the dummy variables in both the train and test sets.
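To make the idea concrete, here is a minimal pandas-only sketch (not Feature-engine itself) of learning categories from a toy train set and reusing them on a toy test set. The data frames, the `encode` helper and the category `"X"` are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): "X" appears only in the test set
train = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"embarked": ["C", "X", "S"]})

# "fit": learn the categories from the train set only
learned = list(train["embarked"].unique())


def encode(df, column, categories):
    """Create one dummy per learned category; unknown values become all zeros."""
    as_cat = df[column].astype(pd.CategoricalDtype(categories=categories))
    return pd.get_dummies(as_cat, prefix=column, dtype=int)


# "transform": apply the same learned categories to both sets
train_enc = encode(train, "embarked", learned)
test_enc = encode(test, "embarked", learned)

print(test_enc)
```

Both encoded frames end up with the same columns, and the test row holding the unseen category `"X"` gets zeros in every dummy instead of creating a new column.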

In [5]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("survived", axis=1),  # predictors
    data["survived"],  # target
    test_size=0.3,  # proportion of observations in the test set
    random_state=0,  # seed to ensure reproducibility
)

X_train.shape, X_test.shape

((916, 6), (393, 6))

## One hot encoding with Feature-Engine

### Advantages

- quick
- returns dataframe
- returns feature names
- lets you select which features to encode
- appends dummies to original dataset

### Limitations

- Not sure yet.

In [6]:
# To start, we fill missing values manually. Later on
# we add this step to a pipeline

X_train = X_train.fillna("Missing")
X_test = X_test.fillna("Missing")

In [9]:
!pip show scikit-learn

Name: scikit-learn
Version: 1.6.0
Summary: A set of python modules for machine learning and data mining
Home-page: https://scikit-learn.org

In [10]:
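# feature-engine 1.8.2 raises AttributeError ('__sklearn_tags__') with
# scikit-learn 1.6.0, so pin 1.5.2 and restart the kernel afterwards
!pip uninstall -y scikit-learn
!pip install scikit-learn==1.5.2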

Found existing installation: scikit-learn 1.6.0
Uninstalling scikit-learn-1.6.0:
  Successfully uninstalled scikit-learn-1.6.0
Collecting scikit-learn==1.5.2
  Downloading scikit_learn-1.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading scikit_learn-1.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.3 MB)
Installing collected packages: scikit-learn
Successfully installed scikit-learn-1.5.2


In [12]:
# set up encoder

encoder = OneHotEncoder(
    variables=None,  # alternatively pass a list of variables
    drop_last=True,  # returns k-1 dummies; set drop_last=False to return k dummies
)
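The difference between k and k-1 dummies can be illustrated with a small pandas-only sketch. It uses `pd.get_dummies` on made-up data rather than Feature-engine, and pandas orders the columns alphabetically, so which category ends up "last" here is an artifact of this example and may differ from what the encoder drops.

```python
import pandas as pd

# Toy variable with k = 3 categories (hypothetical values)
embarked = pd.Series(["S", "C", "Q", "S"], name="embarked")

# k dummies: one column per category
k_dummies = pd.get_dummies(embarked, prefix="embarked", dtype=int)

# k-1 dummies: drop the last column
k_minus_1 = k_dummies.iloc[:, :-1]

# The dropped category is still recoverable: it is the row where
# all remaining dummies are 0
recovered = 1 - k_minus_1.sum(axis=1)

print(k_minus_1)
print(recovered.tolist())
```

Because the dropped column can be rebuilt from the others, k-1 dummies carry the same information as k dummies while avoiding a perfectly collinear column.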

In [13]:
# fit the encoder (finds categories)

encoder.fit(X_train)



AttributeError: 'super' object has no attribute '__sklearn_tags__'


OneHotEncoder(drop_last=True)

In [7]:
# categorical variables found automatically

encoder.variables_

['sex', 'cabin', 'embarked']

In [8]:
# we observe the learned categories

encoder.encoder_dict_

{'sex': ['female'],
 'cabin': ['Missing', 'E', 'C', 'D', 'B', 'A', 'F', 'T'],
 'embarked': ['S', 'C', 'Q']}
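As a rough illustration of how a mapping like this turns into dummy columns, the sketch below builds them by hand with pandas. It is not Feature-engine's internal code; the `encoder_dict` is a shortened copy of the output above and the small data frame is made up.

```python
import pandas as pd

# Shortened copy of the learned mapping shown above
encoder_dict = {
    "sex": ["female"],
    "embarked": ["S", "C", "Q"],
}

# Toy observations (hypothetical)
df = pd.DataFrame(
    {
        "pclass": [1, 3, 2],
        "sex": ["female", "male", "female"],
        "embarked": ["S", "Q", "Missing"],
    }
)

encoded = df.copy()
for variable, categories in encoder_dict.items():
    for category in categories:
        # 1 if the observation shows this category, else 0
        encoded[f"{variable}_{category}"] = (df[variable] == category).astype(int)

# Remove the original categorical columns, keeping only the dummies
encoded = encoded.drop(columns=list(encoder_dict))

print(encoded)
```

A value that is not listed in the mapping, such as `"male"` or `"Missing"` here, simply gets 0 in every dummy of that variable.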

In [9]:
# transform the data sets

X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)

X_train_enc.head()

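| | pclass | sibsp | parch | sex_female | cabin_Missing | cabin_E | cabin_C | cabin_D | cabin_B | cabin_A | cabin_F | cabin_T | embarked_S | embarked_C | embarked_Q |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 501 | 2 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 588 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 402 | 2 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1193 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 686 | 3 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |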


In [10]:
# we can retrieve the feature names as follows:

encoder.get_feature_names_out()

['pclass',
 'sibsp',
 'parch',
 'sex_female',
 'cabin_Missing',
 'cabin_E',
 'cabin_C',
 'cabin_D',
 'cabin_B',
 'cabin_A',
 'cabin_F',
 'cabin_T',
 'embarked_S',
 'embarked_C',
 'embarked_Q']

## Imputation and encoding

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("survived", axis=1),  # predictors
    data["survived"],  # target
    test_size=0.3,  # proportion of observations in the test set
    random_state=0,  # seed to ensure reproducibility
)

X_train.shape, X_test.shape

((916, 6), (393, 6))

In [12]:
# check for missing data

X_train.isnull().sum()

pclass        0
sex           0
sibsp         0
parch         0
cabin       702
embarked      2
dtype: int64
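Counts are easier to judge as fractions: 702 missing values out of 916 rows means roughly 77% of `cabin` is missing in the train set. A small sketch of computing that fraction directly, on a made-up data frame rather than the Titanic data:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with missing values in two columns
X = pd.DataFrame(
    {
        "pclass": [1, 3, 3, 2],
        "cabin": ["B", np.nan, np.nan, "C"],
        "embarked": ["S", "S", np.nan, "Q"],
    }
)

# The mean of the boolean mask is the fraction of missing values per column
missing_fraction = X.isnull().mean()

print(missing_fraction)
```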

In [13]:
# set up encoder and imputation in pipeline
# we only want to impute categorical variables

pipe = Pipeline(
    [
        ("imputer", CategoricalImputer()),
        ("ohe", OneHotEncoder(drop_last=True)),
    ]
)

In [14]:
# fit pipeline

pipe.fit(X_train)

In [15]:
# transform data

X_train_enc = pipe.transform(X_train)
X_test_enc = pipe.transform(X_test)

X_train_enc.head()

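| | pclass | sibsp | parch | sex_female | cabin_Missing | cabin_E | cabin_C | cabin_D | cabin_B | cabin_A | cabin_F | cabin_T | embarked_S | embarked_C | embarked_Q |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 501 | 2 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 588 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 402 | 2 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1193 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 686 | 3 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |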
