# Author Info
Name: **Ejaz-ur-Rehman**\
Business Unit Head | Data Analyst\
MBA (Accounting & Finance), MS (Finance)\
Crystal Tech (Project of MUZHAB Group)\
Karachi, Pakistan

![Date](https://img.shields.io/badge/Date-25--Aug--2025-green?logo=google-calendar)
[![Email](https://img.shields.io/badge/Email-ijazfinance%40gmail.com-blue?logo=gmail)](mailto:ijazfinance@gmail.com)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Ejaz--ur--Rehman-blue?logo=linkedin)](https://www.linkedin.com/in/ejaz-ur-rehman/)
[![GitHub](https://img.shields.io/badge/GitHub-ejazurrehman-black?logo=github)](https://github.com/ejazurrehman)

# Feature Encoding

In machine learning, most algorithms can't work directly with categorical data (data stored as labels or strings such as "Red", "Male", "Yes", or "Dog") because their computations require numeric values.

Feature Encoding is the process of converting **categorical variables** into numerical representations so that ML models can understand and process them.
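
To make the problem concrete, here is a minimal sketch of what happens when a string column is passed straight to a model. The choice of `LinearRegression` and the target values in `y` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# A single categorical feature stored as strings
X = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})
y = [10.0, 12.0, 9.0, 11.0]  # made-up target values, for illustration only

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn cannot convert the strings to floats
    fit_failed = True
    print("Fit failed:", err)
```

Encoding the `Color` column first, with any of the techniques below, removes this error.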

## Types of Feature Encoding in Python

There are three commonly used encoding techniques, plus several advanced ones for special cases:
### 1. Label Encoding:
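   - Converts each category into an integer.
   - Example: {"Blue", "Green", "Red"} → {0, 1, 2}. scikit-learn's `LabelEncoder` assigns the integers in sorted (alphabetical) order, not in order of appearance.
   - Suitable for ordinal categories (Low < Medium < High) only when the sorted order happens to match the natural order. In scikit-learn, `LabelEncoder` is intended for target labels (`y`) rather than input features.
   - ⚠️ Can be misleading for nominal (unordered) data, since a model may treat the integers as a ranking.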

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']
})

# Apply Label Encoding
le = LabelEncoder()
df['Color_Encoded'] = le.fit_transform(df['Color'])

print(df)


   Color  Color_Encoded
0    Red              2
1  Green              1
2   Blue              0
3  Green              1
4    Red              2
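
The output above shows Red mapped to 2 and Blue to 0 because the codes follow sorted order. A short sketch of how to inspect that mapping and reverse it, using the encoder's `classes_` attribute and `inverse_transform` method (the variable names `colors`, `mapping`, and `decoded` are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(['Red', 'Green', 'Blue', 'Green', 'Red'])

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform converts integer codes back to the original labels
decoded = le.inverse_transform(codes)
print(list(decoded))
```

Checking `mapping` before training is a cheap way to confirm that the integers mean what you expect.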


### 2. One-Hot Encoding:
   - Creates dummy variables (binary columns) for each category.
   - Example: {"Red", "Green", "Blue"} → three columns:
     - Red → [1,0,0]
     - Green → [0,1,0]
     - Blue → [0,0,1]
   - Avoids the problem of implying an order, but increases dimensionality (especially for many categories).

In [2]:
# One Hot Encoding using pandas
df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']
})

df_encoded = pd.get_dummies(df, columns=['Color'])

print(df_encoded)


   Color_Blue  Color_Green  Color_Red
0       False        False       True
1       False         True      False
2        True        False      False
3       False         True      False
4       False        False       True


### 3. Ordinal Encoding:
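 - Similar to Label Encoding, but you define the order of the categories yourself.
 - Example: with the order ['Small', 'Medium', 'Large'], scikit-learn's `OrdinalEncoder` produces {'Small': 0.0, 'Medium': 1.0, 'Large': 2.0}. Codes start at 0 and are returned as floats.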

In [3]:
from sklearn.preprocessing import OrdinalEncoder

# Sample data
df = pd.DataFrame({
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']
})

# Define order manually
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_Encoded'] = encoder.fit_transform(df[['Size']])

print(df)


     Size  Size_Encoded
0   Small           0.0
1  Medium           1.0
2   Large           2.0
3  Medium           1.0
4   Small           0.0
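
As an alternative sketch that stays within pandas, an ordered `Categorical` stores the same explicit order and yields integer codes. The names `size_order`, `Size_Code`, and `at_least_medium` are illustrative and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']
})

# Declare the order explicitly, from smallest to largest
size_order = ['Small', 'Medium', 'Large']
df['Size'] = pd.Categorical(df['Size'], categories=size_order, ordered=True)

# cat.codes gives each value's position in size_order as an integer
df['Size_Code'] = df['Size'].cat.codes

# An ordered categorical also supports comparisons against a category
at_least_medium = df['Size'] >= 'Medium'

print(df)
print(at_least_medium.tolist())
```

This keeps the readable labels in the column while still exposing numeric codes for modelling.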


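### Other Advanced Encodings
 - Target Encoding → Replaces each category with the mean of the target variable for that category. Compute the means on training data only to avoid leaking the target.
 - Frequency Encoding → Replaces each category with how often it occurs (as a count or a proportion).
 - Binary Encoding → Assigns each category an integer, then spreads that integer's binary digits across columns, so k categories need only about log2(k) columns.

These methods are useful when a column has too many categories for One-Hot Encoding (e.g., city names).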
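
The three ideas can be sketched with plain pandas. This is a simplified illustration rather than a production implementation: the `City` and `Purchased` data are made up, the target means are computed on the full sample with no smoothing or train/test split, and the binary encoding is a hand-rolled version, not the output of a dedicated library.

```python
import pandas as pd

# Made-up data: a categorical feature and a binary target
df = pd.DataFrame({
    'City': ['Karachi', 'Lahore', 'Karachi', 'Quetta', 'Lahore', 'Karachi'],
    'Purchased': [1, 0, 1, 0, 1, 0],
})

# Frequency encoding: proportion of rows belonging to each city
freq = df['City'].value_counts(normalize=True)
df['City_Freq'] = df['City'].map(freq)

# Target encoding: mean of the target for each city
# (in practice, compute this on the training split only)
target_mean = df.groupby('City')['Purchased'].mean()
df['City_Target'] = df['City'].map(target_mean)

# Binary encoding: integer code per city, one column per binary digit
codes = df['City'].astype('category').cat.codes
n_bits = max(int(codes.max()).bit_length(), 1)
for bit in range(n_bits):
    df[f'City_Bin{bit}'] = (codes // 2**bit) % 2

print(df)
```

With three cities, binary encoding needs two columns where One-Hot Encoding would need three; the saving grows quickly as the number of categories increases.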

### Summary:
  - Label Encoding → Simple, but may imply order incorrectly.
  - One-Hot Encoding → Best for unordered categories; expands features.
  - Ordinal Encoding → Works when categories have a meaningful order.
  - Advanced Methods → Used for high-cardinality categorical data.