<a href="https://colab.research.google.com/github/awsdevguru/PearsonMLFoundations/blob/dev/2_3_04_Encoding_Categorical_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Encoding Categorical Data

## 1. Objectives

* Identify different types of categorical variables
* Apply multiple encoding techniques
* Compare how encoding affects model performance
* Learn to handle unseen categories safely in pipelines

## 2. Setup


In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = None

## 3. Create Sample Dataset

Use a small dataset combining nominal, ordinal, and binary columns.

In [None]:
def reset_data():
  global df
  data = {
    'Color': ['Red','Blue','Green','Blue','Red','Green','Blue','Red'],
    'Size': ['Small','Medium','Large','Medium','Small','Large','Large','Medium'],
    'Purchased': ['Yes','No','Yes','No','Yes','Yes','No','No']
  }
  df = pd.DataFrame(data)

reset_data()
df

## Why encoding is needed

Most machine learning models, including those in scikit-learn, operate only on numerical data because they rely on mathematical operations such as distances, sums, and dot products.

Words like "Red" or "Medium" can't be directly multiplied or compared numerically.

Therefore, we encode categorical features into numbers so that algorithms can use them.

**The following code block will fail.**



In [None]:
# This will fail: fit() cannot convert strings such as 'Red' to floats

X = df[['Color','Size']]
y = df['Purchased']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
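To observe the failure without halting the notebook, the call can be wrapped in a `try`/`except`. The following is a minimal sketch; the names `X_raw`, `y_raw`, and `fit_failed` are illustrative and not part of the original notebook, and the exact error message may vary between scikit-learn versions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Small illustrative frame with raw string features (hypothetical names).
X_raw = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue'],
    'Size': ['Small', 'Medium', 'Large', 'Medium'],
})
y_raw = ['Yes', 'No', 'Yes', 'No']

fit_failed = False
try:
    # Input validation tries to cast the features to float and fails on strings.
    LogisticRegression().fit(X_raw, y_raw)
except ValueError as err:
    fit_failed = True
    print("fit() raised ValueError:", err)
```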

## 4. Identify Variable Types

* **Nominal:** Color
* **Ordinal:** Size (Small < Medium < Large)
* **Binary:** Purchased

In [None]:
df.info()
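`df.info()` reports dtypes but not the categories themselves. A quick way to inspect them is to count and list the distinct values per column, as sketched below. `df_demo` and `cardinality` are illustrative names; the data repeats the sample dataset so the cell runs on its own. Note that this only reveals how many levels a column has. Whether those levels are ordered (as with Size) is domain knowledge that the data alone cannot tell you.

```python
import pandas as pd

# Same sample data as the reset_data() cell, repeated so this runs standalone.
df_demo = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green', 'Blue', 'Red'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small', 'Large', 'Large', 'Medium'],
    'Purchased': ['Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No'],
})

# Distinct values per column: exactly 2 suggests a binary variable.
cardinality = df_demo.nunique()
print(cardinality)

# List the levels of each column, sorted for readability.
for col in df_demo.columns:
    print(col, sorted(df_demo[col].unique()))
```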

## 5. One-Hot Encoding (Nominal)

Each color becomes its own binary column. This works well for linear models but expands the feature space.

https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

In [None]:
pd.get_dummies(df, columns=['Color'], drop_first=False)
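The Setup cell also imports scikit-learn's `OneHotEncoder`, which does the same job but remembers the categories it saw during `fit`. The sketch below is one possible way to use it, assuming `handle_unknown='ignore'` is the desired policy for categories that were not seen during fitting; `colors`, `ohe`, `encoded`, and `unseen` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green', 'Blue', 'Red']})

# handle_unknown='ignore' encodes categories not seen during fit as all zeros.
ohe = OneHotEncoder(handle_unknown='ignore')
encoded = ohe.fit_transform(colors).toarray()  # the default output is a sparse matrix

print(ohe.categories_[0])  # learned categories, in sorted order
print(encoded[:3])         # first three rows: Red, Blue, Green

# 'Purple' was never seen during fit, so every indicator column is 0.
unseen = ohe.transform(pd.DataFrame({'Color': ['Purple']})).toarray()
print(unseen)
```

Unlike `pd.get_dummies`, the fitted encoder always produces the same columns in the same order, which matters when the same transformation has to be applied to new data later.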


## 6. Ordinal Encoding (Ordered)
Maps each category to an integer that preserves the stated order (Small=0, Medium=1, Large=2). Well suited to ordinal data and tree models.

In [None]:
size_order = [['Small','Medium','Large']]
ord_enc = OrdinalEncoder(categories=size_order)
# fit_transform returns a 2-D array of shape (n, 1); flatten it into a single column
df['Size_encoded'] = ord_enc.fit_transform(df[['Size']]).ravel()
df
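Passing `categories` explicitly matters because, without it, `OrdinalEncoder` sorts the categories alphabetically. The sketch below contrasts the two behaviours on a few values; `sizes`, `enc`, and `default_enc` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium']})

# Explicit order: Small=0, Medium=1, Large=2.
enc = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
codes = enc.fit_transform(sizes).ravel()
print(codes)

# Default order is alphabetical: Large=0, Medium=1, Small=2,
# which reverses the real ordering of the sizes.
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(sizes).ravel()
print(default_enc.categories_[0])
print(default_codes)
```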

## 7. Label Encoding (Binary/Target)
Converts Yes/No to 1/0, a compact representation. `LabelEncoder` assigns codes in sorted order (No=0, Yes=1) and is intended for target columns.

In [None]:
le = LabelEncoder()
df['Purchased_encoded'] = le.fit_transform(df['Purchased'])
df
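To check which integer stands for which label, inspect the fitted encoder's `classes_` attribute; `inverse_transform` maps codes back to the original strings. A minimal sketch, with `labels`, `le_demo`, `codes`, and `decoded` as illustrative names:

```python
from sklearn.preprocessing import LabelEncoder

labels = ['Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No']

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# The position in classes_ is the integer code: 'No' -> 0, 'Yes' -> 1.
print(le_demo.classes_)
print(codes)

# Recover the original strings from the integer codes.
decoded = le_demo.inverse_transform(codes)
print(decoded)
```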

## 8. Compare Impact on Model

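A quick logistic regression demo shows that the model now trains on the encoded features. With only eight rows, the accuracy figure is illustrative rather than meaningful: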

In [None]:
# One-hot encode Color (dropping the first level); Size_encoded is already numeric
X = pd.get_dummies(df[['Color','Size_encoded']], drop_first=True)
y = df['Purchased_encoded']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

## 9. Key Takeaways

* Choose encoding based on variable type and model type.
* One-hot: best for nominal variables, especially with linear models.
* Ordinal: best for ordered categories.
* Label: for binary variables and target columns; tree models can also tolerate integer codes.
* Always plan for unseen categories in production.
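One possible way to plan for unseen categories is to bundle the encoders and the model in a scikit-learn `Pipeline`, so the encoders are fitted on training data only and apply a defined policy to anything new. The sketch below is an assumption about how this could look, not the notebook's own method: unseen colors become all-zero one-hot rows, and unseen sizes receive the sentinel code `-1` (an arbitrary choice). The names `train`, `preprocess`, `clf`, and `new_rows` are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

train = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green', 'Blue', 'Red'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small', 'Large', 'Large', 'Medium'],
})
y_train = [1, 0, 1, 0, 1, 1, 0, 0]  # Purchased: Yes=1, No=0

preprocess = ColumnTransformer(
    [
        # Unseen colors are encoded as all zeros.
        ('color', OneHotEncoder(handle_unknown='ignore'), ['Color']),
        # Unseen sizes get the sentinel code -1 (assumed policy).
        ('size', OrdinalEncoder(categories=[['Small', 'Medium', 'Large']],
                                handle_unknown='use_encoded_value',
                                unknown_value=-1), ['Size']),
    ],
    sparse_threshold=0,  # always return a dense array
)

clf = Pipeline([('prep', preprocess), ('model', LogisticRegression())])
clf.fit(train, y_train)

# 'Purple' and 'XL' never appeared in the training data.
new_rows = pd.DataFrame({'Color': ['Purple', 'Red'], 'Size': ['Medium', 'XL']})
features = clf.named_steps['prep'].transform(new_rows)
print(features)  # columns: Blue, Green, Red, Size code
preds = clf.predict(new_rows)
print(preds)
```

Prediction does not raise on the unseen values, but whether `-1` or an all-zero row is a sensible stand-in depends on the model and the data, so treat both settings as choices to validate rather than defaults.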