# Encoding Categorical Data

Encoding categorical data is the process of converting non-numerical (categorical) values into numbers so that machine learning algorithms can use them for training and prediction. Most machine learning models can’t work directly with categorical variables, so this step is usually required before training.


## Why Encode Categorical Data?

Machine learning models understand numbers, not text. If your dataset has columns like "Color": ["Red", "Blue", "Green"] or "City": ["New York", "London", "Tokyo"], you need to convert those values into numbers before feeding them into a model.

## Types of Categorical Data

   1. Nominal: No order (e.g., color, city, gender)

   2. Ordinal: Has a meaningful order (e.g., size: Small < Medium < Large)
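The difference between the two types can be made explicit in code. The following is a minimal sketch, assuming pandas is available; the series names and values are made up for illustration.

```python
import pandas as pd

# Nominal: categories with no order (illustrative data).
colors = pd.Series(["Red", "Blue", "Green"], dtype="category")

# Ordinal: the order Small < Medium < Large is declared explicitly.
size_dtype = pd.CategoricalDtype(
    categories=["Small", "Medium", "Large"], ordered=True
)
sizes = pd.Series(["Medium", "Small", "Large"], dtype=size_dtype)

print(colors.cat.ordered)           # False
print(sizes.cat.ordered)            # True
print((sizes < "Large").tolist())   # [True, True, False]
print(sizes.cat.codes.tolist())     # [1, 0, 2]
```

Comparisons such as `<` are only meaningful for the ordered series, which is why the type of a column should guide the choice of encoding.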

## Common Encoding Techniques
1. Label Encoding

    Converts each category into a unique integer.

    Example: ["Red", "Green", "Blue"] → [0, 1, 2]

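    Use when the categories have an order (ordinal data) and the assigned integers follow that order. Many implementations, including scikit-learn's LabelEncoder, number categories in sorted (alphabetical) order rather than in their meaningful order, and LabelEncoder is intended for target labels rather than input features.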

    ⚠️ Can mislead models into thinking there's a ranking if used on nominal data.

2. One-Hot Encoding

    Creates a new column for each category, filled with 0s and 1s.

    Example:

    | Color | Encoded   |
    |-------|-----------|
    | Red   | [1, 0, 0] |
    | Blue  | [0, 1, 0] |
    | Green | [0, 0, 1] |

    Best for nominal data (no order).

3. Ordinal Encoding

    Assigns integers that follow the categories' meaningful order, which you specify explicitly.

    Example: ["Low", "Medium", "High"] → [0, 1, 2]

4. Target/Mean Encoding

    Replace category with the mean of the target variable for that category.

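    Useful for high-cardinality features, where one-hot encoding would create too many columns. It is common with tree-based models and in competitions.

    ⚠️ Risk of data leakage: the encoding is computed from the target, so the means must be learned from training data only and never from validation or test rows.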
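Target encoding and its leakage risk are easier to see with a small example. The following is a minimal sketch, assuming pandas; the `city` and `bought` columns and all values are hypothetical. The per-category means are learned from the training rows only and then applied to new rows, with the global mean as a fallback for unseen categories.

```python
import pandas as pd

# Hypothetical training data: a categorical feature and a binary target.
train = pd.DataFrame({
    "city": ["London", "London", "Tokyo", "Tokyo", "Tokyo"],
    "bought": [1, 0, 1, 1, 0],
})
# Hypothetical new rows; "Paris" never appears in the training data.
test = pd.DataFrame({"city": ["Tokyo", "London", "Paris"]})

# Learn the mean of the target per category from the training rows only.
city_means = train.groupby("city")["bought"].mean()
global_mean = train["bought"].mean()

# Apply the learned mapping; unseen categories fall back to the global mean.
test["city_encoded"] = test["city"].map(city_means).fillna(global_mean)

print(test)
```

Encoding the training rows themselves with means that include each row's own target value still leaks information. A common mitigation is to compute the training encodings out-of-fold, so that each row is encoded using means from the other folds.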

## Tools in Python

   1. `pandas.get_dummies()` – for one-hot encoding

   2. `sklearn.preprocessing.LabelEncoder` – for label encoding (intended for target labels rather than input features)

   3. `sklearn.preprocessing.OneHotEncoder` – for one-hot encoding

   4. `sklearn.preprocessing.OrdinalEncoder` – for ordinal encoding
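The tools listed above can be sketched together as follows. This is an illustrative example rather than a recommended pipeline, assuming pandas and scikit-learn are installed; the DataFrame, its column names, and its values are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Hypothetical data: a nominal feature, an ordinal feature, and a target.
df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],
    "size": ["Small", "Large", "Medium", "Small"],
    "label": ["cat", "dog", "cat", "bird"],
})

# One-hot encoding with pandas: one indicator column per color.
dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)

# One-hot encoding with scikit-learn; categories unseen during fit
# are encoded as all zeros instead of raising an error.
onehot = OneHotEncoder(handle_unknown="ignore")
onehot_matrix = onehot.fit_transform(df[["color"]]).toarray()

# Ordinal encoding with the order Small < Medium < Large given explicitly.
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
size_codes = ordinal.fit_transform(df[["size"]]).ravel()

# Label encoding of the target; classes are numbered in sorted order.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df["label"])

print(dummies.columns.tolist())  # ['color_Blue', 'color_Green', 'color_Red']
print(onehot_matrix[0])          # Red  -> [0. 0. 1.]
print(size_codes)                # [0. 2. 1. 0.]
print(y)                         # [1 2 1 0]
```

The scikit-learn encoders store the learned categories after `fit`, so the same mapping can be reused on new data with `transform`, whereas `pandas.get_dummies()` derives its columns from whatever values are present in the data it is given.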