# Feature Engineering – Feature Encoding
 
**Feature Encoding** is the process of converting categorical data (non-numeric) into a numerical format that machine learning algorithms can understand. Since many algorithms work only with numerical data, encoding categorical variables is a crucial preprocessing step.

## Why is Feature Encoding Required?
1. **Machine Learning Algorithms:** Most ML models cannot handle non-numeric data.
2. **Interpretability:** Encoded data provides a structured, numerical format that makes computations feasible.
3. **Improved Model Performance:** Proper encoding ensures that categorical information is captured effectively, improving prediction accuracy.

Because most machine learning models accept only numerical inputs, categorical variables must be converted to numbers before a model can extract useful information from them.

## Types of Categorical Variables
We can generally divide categorical variables (features) into three types:

 - Binary: exactly two possible values.

        (Yes, No), (True, False)

 - Ordinal: groups with a specific order.

         economic status (“low income”, “middle income”, “high income”),

         education level (“high school”, “BS”, “MS”, “PhD”),

         income level (“less than 50K”, “50K-100K”, “over 100K”),

         satisfaction rating (“extremely dislike”, “dislike”, “neutral”, “like”, “extremely like”)

 - Nominal: unordered groups.

        (cat, dog, tiger), (pizza, burger, coke)
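
The difference between ordinal and nominal variables can be made explicit in pandas. The following is a minimal sketch with made-up values (the series and variable names are illustrative, not part of the customer dataset): an ordered `Categorical` carries its order, while an unordered one only labels groups.

```python
import pandas as pd

# Hypothetical toy values, not taken from the customer dataset
education = pd.Series(["BS", "PhD", "high school", "MS"])

# Ordinal: declare the order explicitly so the integer codes respect it
education_levels = ["high school", "BS", "MS", "PhD"]
education_ordinal = pd.Categorical(education, categories=education_levels, ordered=True)
print(education_ordinal.codes)  # position of each value in education_levels

# Nominal: no order is declared, so the codes are only arbitrary labels
pets = pd.Categorical(["cat", "dog", "tiger", "dog"])
print(pets.ordered)
```

Declaring the order up front is what later allows an encoder to assign integers that mean something.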

## 1: Import Required Libraries

In [4]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

## 2: Load the Dataset

In [7]:
# Load dataset
data = pd.read_csv("C:\\Users\\BINPAT\\Documents\\Python Self\\Feature Engineering\\Datasets\\customer.csv")
df = pd.read_excel("C:\\Users\\BINPAT\\Documents\\Python Self\\Feature Engineering\\Datasets\\feature_encoding.xlsx")

data.head()

Unnamed: 0,age,gender,review,education,purchased
0,30,Female,Average,School,No
1,68,Female,Poor,UG,No
2,70,Female,Good,PG,No
3,72,Female,Good,PG,No
4,16,Female,Average,UG,No


## 3: Explore and Understand the Data

We should perform a basic exploratory data analysis (EDA) to identify categorical features.

In [10]:
# View dataset info
print(data.shape)
print('\n')
print(data.info())
print('\n')
# Check for categorical features
categorical_features = data.select_dtypes(include=['object', 'category']).columns
print("Categorical Features:", categorical_features)

(50, 5)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        50 non-null     int64 
 1   gender     50 non-null     object
 2   review     50 non-null     object
 3   education  50 non-null     object
 4   purchased  50 non-null     object
dtypes: int64(1), object(4)
memory usage: 2.1+ KB
None


Categorical Features: Index(['gender', 'review', 'education', 'purchased'], dtype='object')


## 4: Handle Missing Values

If there are missing values in categorical columns, we must handle them before encoding.

In [13]:
# Count missing values in each column
data.isna().sum()

age          0
gender       0
review       0
education    0
purchased    0
dtype: int64

In [15]:
# Fill missing categorical values with "Unknown"
data[categorical_features] = data[categorical_features].fillna("Unknown")

## 5: Split Dataset

Divide the dataset into training and testing subsets.

In [18]:
# Split the data into train and test sets (80% train, 20% test)
df_train, df_test = train_test_split(data, test_size=0.2, random_state=42)

df_train_original = df_train.copy()
df_test_original = df_test.copy()

In [20]:
# Check df train 
df_train.head(5)

Unnamed: 0,age,gender,review,education,purchased
12,51,Male,Poor,School,No
4,16,Female,Average,UG,No
37,94,Male,Average,PG,Yes
8,65,Female,Average,UG,No
3,72,Female,Good,PG,No


In [22]:
# Check df test 
df_test.head(5)

Unnamed: 0,age,gender,review,education,purchased
13,57,Female,Average,School,No
39,76,Male,Poor,PG,No
30,73,Male,Average,UG,No
45,61,Male,Poor,PG,Yes
17,22,Female,Poor,UG,Yes


## 6: Choose and Apply Encoding Method

## 6.1. Label Encoding: 
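* Assigns a unique integer to each category. scikit-learn's `LabelEncoder` numbers the categories in sorted (alphabetical) order, so ["low", "medium", "high"] becomes [1, 2, 0].
* Pros: Simple and fast.
* Cons: The integers imply an order. On nominal data this can mislead the model, and on ordinal data the alphabetical order rarely matches the real one (below, `review` becomes Average 0, Good 1, Poor 2).
* When to Use: For binary features, target labels, or ordinal features whose sorted order happens to match their real order. Otherwise prefer Ordinal Encoding (6.7).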

In [26]:
# Show which integer each category is assigned
label_encoder = LabelEncoder()

for col in categorical_features:
    label_encoder.fit(df_train[col])
    print(f"Column: {col}")
    print(dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))))

Column: gender
{'Female': 0, 'Male': 1}
Column: review
{'Average': 0, 'Good': 1, 'Poor': 2}
Column: education
{'PG': 0, 'School': 1, 'UG': 2}
Column: purchased
{'No': 0, 'Yes': 1}


In [28]:
# Apply Label Encoding to all categorical columns

label_encoder = LabelEncoder()
for col in categorical_features:
    df_train[col] = label_encoder.fit_transform(df_train[col])
    df_test[col] = label_encoder.transform(df_test[col])

print("Encoded Data:")
df_train.head(5)

Encoded Data:


Unnamed: 0,age,gender,review,education,purchased
12,51,1,2,1,0
4,16,0,0,2,0
37,94,1,0,0,1
8,65,0,0,2,0
3,72,0,1,0,0


## 6.2. One-Hot Encoding: 
* Converts categories into binary vectors (e.g., ["red", "blue", "green"] → [1, 0, 0], [0, 1, 0], [0, 0, 1]).
* Pros: Avoids implying order among categories. Suitable for nominal data (no inherent order).
* Cons: Increases dimensionality, especially for features with many categories.
* When to Use: For nominal categorical features with a small to moderate number of categories.

**One-Hot Encoding creates binary columns for each category and is useful when there’s no ordinal relationship between categories.**

In [60]:
df_train = df_train_original.copy()
df_test = df_test_original.copy()

In [62]:
# Apply One-Hot Encoding to categorical features
df_train = pd.get_dummies(df_train, columns=categorical_features, drop_first=True)
df_test = pd.get_dummies(df_test, columns=categorical_features, drop_first=True)

# Check the results
print("Encoded Data: ")
df_train.head()

Encoded Data: 


Unnamed: 0,age,gender_Male,review_Good,review_Poor,education_School,education_UG,purchased_Yes
12,51,True,False,True,True,False,False
4,16,False,False,False,False,True,False
37,94,True,False,False,False,False,True
8,65,False,False,False,False,True,False
3,72,False,True,False,False,False,False


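**The dummy columns are boolean (True/False). To get 1 and 0 instead, pass `dtype=int` to `pd.get_dummies` or call `.astype(int)` on the new columns.**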

## 6.3. Binary Encoding: 
* Converts categories into binary numbers and encodes them as columns (e.g., ["A", "B", "C"] → [01, 10, 11]).
* Pros: Reduces dimensionality compared to One-Hot Encoding. Efficient for high-cardinality features.
* Cons: Harder to interpret compared to One-Hot Encoding.
* When to Use: For features with many unique categories (high cardinality).

**Binary Encoding works by first applying Label Encoding, then converting the labels into binary form. It’s good for high-cardinality features.**

In [66]:
df_train = df_train_original.copy()
df_test = df_test_original.copy()

In [72]:
from category_encoders import BinaryEncoder

# Initialize BinaryEncoder
binary_encoder = BinaryEncoder(cols=categorical_features)

# Apply Binary Encoding to categorical features
df_train = binary_encoder.fit_transform(df_train)
df_test = binary_encoder.transform(df_test)

# Check the results
print("Encoded Data:")
df_train.head(5)

Encoded Data:


Unnamed: 0,age,gender_0,gender_1,review_0,review_1,education_0,education_1,purchased_0,purchased_1
12,51,0,1,0,1,0,1,0,1
4,16,1,0,1,0,1,0,0,1
37,94,0,1,1,0,1,1,1,0
8,65,1,0,1,0,1,0,0,1
3,72,1,0,1,1,1,1,0,1
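
To see what happens under the hood, the two steps (integer codes, then binary digits) can be reproduced by hand. This is a simplified sketch on a made-up column, not `category_encoders`' exact implementation; the `city_*` column names are illustrative.

```python
import pandas as pd

# Hypothetical column with five distinct categories
cities = pd.Series(["Pune", "Delhi", "Mumbai", "Pune", "Chennai", "Kochi"])

# Step 1: integer code per category, starting at 1 (order of first appearance)
codes = pd.Series(pd.factorize(cities)[0] + 1, index=cities.index)

# Step 2: number of bits needed to represent the largest code
n_bits = int(codes.max()).bit_length()

# Step 3: one column per bit, most significant bit first
binary_cols = pd.DataFrame(
    {f"city_{i}": (codes // 2 ** (n_bits - 1 - i)) % 2 for i in range(n_bits)}
)
print(binary_cols)
```

Five categories fit into three columns here, whereas One-Hot Encoding would need five.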


## 6.4. Frequency/Count Encoding: 
* Replaces each category with its frequency or count in the dataset (e.g., ["A", "A", "B", "C"] → [2, 2, 1, 1]).
* Pros: Simple and fast. Retains information about category prevalence.
* Cons: May not capture relationships between categories effectively.
* When to Use: When you want a quick encoding method and category frequencies are important.

**Frequency Encoding replaces categories with their frequency of occurrence.**

In [74]:
df_train = df_train_original.copy()
df_test = df_test_original.copy()

In [76]:
# Apply Frequency Encoding to categorical features
for col in categorical_features:
    freq_encoding = df_train[col].value_counts() / len(df_train)
    df_train[col] = df_train[col].map(freq_encoding)
    df_test[col] = df_test[col].map(freq_encoding)

# Check the results
print("Encoded Data:")
df_train.head(5)

Encoded Data:


Unnamed: 0,age,gender,review,education,purchased
12,51,0.4,0.325,0.35,0.525
4,16,0.6,0.275,0.3,0.525
37,94,0.4,0.275,0.35,0.475
8,65,0.6,0.275,0.3,0.525
3,72,0.6,0.4,0.35,0.525
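
In the loop above, a category that appears only in the test split has no entry in `freq_encoding`, so `map` returns NaN for it. A minimal sketch of one way to handle this, on made-up data (treating unseen categories as frequency 0 is an assumption, not the only reasonable choice):

```python
import pandas as pd

# Hypothetical toy splits; "PG" never occurs in the training data
train = pd.DataFrame({"education": ["UG", "UG", "School", "UG"]})
test = pd.DataFrame({"education": ["School", "PG"]})

# Relative frequency of each category, learned from the training split only
freq = train["education"].value_counts(normalize=True)

train["education_freq"] = train["education"].map(freq)
# Unseen categories map to NaN; here they are treated as frequency 0
test["education_freq"] = test["education"].map(freq).fillna(0.0)

print(test)
```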


## 6.5. Target Encoding (Mean Encoding): 
* Replaces each category with the mean of the target variable for that category (e.g., ["A", "B", "C"] → [0.8, 0.3, 0.5] based on target values).
* Pros: Captures the relationship between the category and the target. Useful for high-cardinality features.
* Cons: Risk of data leakage. Can overfit if not done with caution (e.g., via K-Fold encoding).
* When to Use: For regression problems or categorical features with many levels.
  
**Target Encoding replaces categories with the mean of the target variable for that category. This is especially useful when there’s a strong relationship between the categorical feature and the target variable.**

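The dataset has no column named `target`, so the cell below is left unexecuted and shown for reference. To run it on this data, the `purchased` column could serve as the target once it is mapped to 1/0 and removed from `categorical_features`.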

In [79]:
df_train = df_train_original.copy()
df_test = df_test_original.copy()

In [None]:
from category_encoders import TargetEncoder

# Assuming 'target' is your target column
target_encoder = TargetEncoder(cols=categorical_features)

# Apply Target Encoding
df_train = target_encoder.fit_transform(df_train, df_train['target'])
df_test = target_encoder.transform(df_test)

# Check the results
print("Encoded Data:")
df_train.head(5)
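
The idea can also be shown without an extra library. The following is a bare-bones sketch on made-up data that uses a `purchased`-style Yes/No column as the target; it applies no smoothing and no K-Fold scheme, so it illustrates the mechanism rather than a leakage-safe implementation. The `education_te` column name is illustrative.

```python
import pandas as pd

# Hypothetical toy splits; the target must be numeric, so Yes/No becomes 1/0
train = pd.DataFrame({
    "education": ["UG", "UG", "PG", "PG", "School", "UG"],
    "purchased": ["Yes", "No", "Yes", "Yes", "No", "No"],
})
test = pd.DataFrame({"education": ["PG", "School", "PhD"]})

y_train = train["purchased"].map({"No": 0, "Yes": 1})

# Mean of the target per category, computed on the training split only
category_means = y_train.groupby(train["education"]).mean()
global_mean = y_train.mean()

train["education_te"] = train["education"].map(category_means)
# Categories absent from training fall back to the overall mean
test["education_te"] = test["education"].map(category_means).fillna(global_mean)

print(test)
```

Computing the means on the training split only is what keeps test-set target values from leaking into the features.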

## 6.6. Hashing Encoding:
* Applies a hash function to convert categories into numerical values. The number of columns is fixed.
* Pros: Efficient for high-cardinality features. Fixed dimensionality regardless of the number of categories.
* Cons: May lead to collisions (two categories hashed to the same value). Hard to interpret.
* When to Use: For datasets with extremely high cardinality and when interpretability is not a priority.

**Hashing Encoding applies a hash function to categorical values and maps them to a fixed number of buckets (useful when there are too many categories).**

In [83]:
df_train = df_train_original.copy()
df_test = df_test_original.copy()

In [85]:
from category_encoders import HashingEncoder

# Initialize HashingEncoder
hashing_encoder = HashingEncoder(cols=categorical_features, n_components=8)

# Apply Hashing Encoding to categorical features
df_train = hashing_encoder.fit_transform(df_train)
df_test = hashing_encoder.transform(df_test)

# Check the results
print("Encoded Data:")
df_train.head(5)

Encoded Data:


Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,age
12,0,0,0,1,1,2,0,0,51
4,0,1,0,1,0,0,1,1,16
37,0,2,0,0,0,2,0,0,94
8,0,1,0,1,0,0,1,1,65
3,0,1,0,1,1,0,0,1,72
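
The bucket assignment itself is simple to reproduce. This is a rough sketch for a single made-up column; the helper `bucket`, the use of MD5, and the choice of four buckets are all assumptions made for illustration, not a description of the library's internals.

```python
import hashlib
import pandas as pd

def bucket(value: str, n_buckets: int) -> int:
    """Map a string to a stable bucket index using MD5 (an illustrative choice)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Hypothetical toy column
reviews = pd.Series(["Poor", "Average", "Good", "Poor"])
n_buckets = 4

# One count column per bucket; each row gets a 1 in the bucket its value hashes to
hashed = pd.DataFrame(0, index=reviews.index,
                      columns=[f"col_{i}" for i in range(n_buckets)])
for row, value in reviews.items():
    hashed.loc[row, f"col_{bucket(value, n_buckets)}"] += 1

print(hashed)
```

The number of columns depends only on `n_buckets`, which is why the width stays fixed however many categories appear. With fewer buckets than categories, collisions become unavoidable.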


## 6.7. Ordinal Encoding: 
* Assigns numerical values based on a predefined order (e.g., ["low", "medium", "high"] → [1, 2, 3]).
* Pros: Simple to implement. Retains the ordinal relationship between categories.
* Cons: Not suitable for nominal data.
* When to Use: For ordinal features where the order matters.

**Ordinal Encoding is used when categories have a meaningful order but are not numerical. You need to define the custom order.**

In [99]:
df_train = df_train_original.copy()
df_test = df_test_original.copy()

In [101]:
print(df_train_original.gender.unique())
print(df_train_original.review.unique())
print(df_train_original.education.unique())
print(df_train_original.purchased.unique())

['Male' 'Female']
['Poor' 'Average' 'Good']
['School' 'UG' 'PG']
['No' 'Yes']


In [103]:
# Define the custom order for each categorical feature
# (keys must match the category spelling exactly; unmatched values become NaN)
ordinal_mapping = {
    'gender': {'Male': 0, 'Female': 1},
    'review': {'Poor': 0, 'Average': 1, 'Good': 2},
    'education': {'School': 0, 'UG': 1, 'PG': 2},
    'purchased': {'No': 0, 'Yes': 1}
}

# Apply Ordinal Encoding based on predefined mapping
for col in categorical_features:
    df_train[col] = df_train[col].map(ordinal_mapping[col])
    df_test[col] = df_test[col].map(ordinal_mapping[col])

# Check the results
print("Encoded Data:")
df_train.head(5)

Encoded Data:


Unnamed: 0,age,gender,review,education,purchased
12,51,0,0,0,0
4,16,1,1,1,0
37,94,0,1,2,1
8,65,1,1,1,0
3,72,1,2,2,0
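
The same kind of mapping can be expressed with scikit-learn's `OrdinalEncoder`, which takes the order as a list per column. A minimal sketch on a made-up frame (`df_demo` is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame using the same category names as the dataset
df_demo = pd.DataFrame({
    "review": ["Poor", "Good", "Average"],
    "education": ["UG", "School", "PG"],
})

# One list per column, ordered from lowest to highest
encoder = OrdinalEncoder(categories=[
    ["Poor", "Average", "Good"],
    ["School", "UG", "PG"],
])
encoded = encoder.fit_transform(df_demo[["review", "education"]])
print(encoded)
```

With the default settings, a value that is missing from the list (for example a misspelled category) raises an error at fit time rather than silently becoming NaN, which makes mapping mistakes easier to catch.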


## General Guidelines

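**For Ordinal Data:** Use Ordinal Encoding with an explicit order; use Label Encoding only when its sorted order matches the real one.  
**For Nominal Data:** Use One-Hot Encoding or Binary Encoding.  
**For High-Cardinality Features:** Use Target Encoding, Binary Encoding, or Hashing Encoding.  
**For Regression Problems:** Target Encoding can be effective, but handle it with caution to avoid overfitting.  
**Avoid the Curse of Dimensionality:** One-Hot Encoding adds a column per category, so prefer a more compact encoding when a feature has many categories.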