# Lab Activity 6: Write a program to implement Categorical Encoding, One-hot Encoding

```mermaid
graph TD
    A[Categorical Encoding] --> B[Label Encoding]
    A --> C[Ordinal Encoding]
    A --> D[One-Hot Encoding]
```

# Categorical Encoding Methods: Label Encoding and One-Hot Encoding

## Objective
To learn and apply categorical encoding methods **Label Encoding** and **One-Hot Encoding** to convert categorical data into numeric formats suitable for machine learning models.

1. Understand how to utilize **Label Encoding** to assign integer values to categorical variables.
2. Perform **One-Hot Encoding** using `pandas.get_dummies` and scikit-learn's `OneHotEncoder` to create binary columns for categorical variables.
3. Compare the results of both encoding methods and their use cases (e.g., ordinal vs. nominal data).
4. Explain how encoding techniques affect the structure of the data, model interpretability, dimensionality, and multicollinearity.

---

## Label Encoding

### Algorithm 6 a – Label Encoding

1. Identify all unique categories in the dataset column or list.
2. Assign a unique integer to each category, typically starting from 0.
3. Replace the original categorical values with the corresponding integer values.

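   - **For Ordinal Data:**
     - The integers are meaningful only when the mapping follows the category order (e.g., Small → 0, Medium → 1, Large → 2). scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order, so it does not preserve this order by itself; use an explicit mapping or `OrdinalEncoder` for ordinal data.

   - **For Nominal Data:**
     - The encoding is arbitrary (e.g., Blue → 0, Green → 1, Red → 2 with `LabelEncoder`), and a model may misread the integers as a ranking.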
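The three steps above can be sketched without any library. This is a minimal illustration, not scikit-learn's implementation; the helper name `label_encode` is hypothetical, and sorting the categories is an assumption made here so the mapping is reproducible.

```python
def label_encode(values):
    """Map each distinct value to an integer code, starting from 0."""
    # Step 1: identify the unique categories (sorted for a stable mapping)
    categories = sorted(set(values))
    # Step 2: assign one integer per category
    mapping = {category: code for code, category in enumerate(categories)}
    # Step 3: replace each original value with its integer code
    return [mapping[value] for value in values], mapping


colors = ['Red', 'Blue', 'Green', 'Red', 'Blue']
codes, mapping = label_encode(colors)
print(mapping)  # {'Blue': 0, 'Green': 1, 'Red': 2}
print(codes)    # [2, 0, 1, 2, 0]
```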

```python
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample data
colors = ['Red', 'Blue', 'Green', 'Red', 'Blue']

# Create a LabelEncoder instance
label_encoder = LabelEncoder()

# Apply label encoding
encoded_colors = label_encoder.fit_transform(colors)

# Display the results
df = pd.DataFrame({
    'Color': colors,
    'Encoded': encoded_colors
})
print(df)
```

Output:

```text
   Color  Encoded
0    Red        2
1   Blue        0
2  Green        1
3    Red        2
4   Blue        0
```
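To see which integer belongs to which category, the fitted encoder can be inspected. The sketch below uses the `classes_` attribute and `inverse_transform` method of `LabelEncoder`; the second part, which encodes sizes, is added here only to illustrate that the codes follow sorted order rather than any semantic order.

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(['Red', 'Blue', 'Green', 'Red', 'Blue'])

# classes_ holds the learned categories; the position of each entry is its code
classes = label_encoder.classes_.tolist()
print(classes)  # ['Blue', 'Green', 'Red']

# inverse_transform maps the integer codes back to the original labels
decoded = label_encoder.inverse_transform(encoded).tolist()
print(decoded)  # ['Red', 'Blue', 'Green', 'Red', 'Blue']

# Codes follow alphabetical order, so ordinal meaning is not preserved
size_codes = LabelEncoder().fit_transform(['Small', 'Medium', 'Large']).tolist()
print(size_codes)  # [2, 1, 0]
```

Because `Large` sorts before `Small`, the size codes come out reversed, which is why an explicit order is needed for ordinal data.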


---

## Ordinal Encoding

### Algorithm 6 b – Ordinal Encoding

1. **Identify the categories that have an inherent order.**
   - Analyze the dataset to determine which categorical variables exhibit a natural sequence or hierarchy (e.g., Small, Medium, Large).

2. **Assign each category a specific integer based on the desired order (e.g., Small → 1, Medium → 2, Large → 3).**
   - Create a mapping where each category is assigned a unique integer reflecting its position in the order.

3. **Replace the original categorical values with the corresponding integer values.**
   - Update the dataset by substituting each categorical value with its assigned integer based on the mapping.

```python
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Sample data (ordinal data)
sizes = ['Small', 'Medium', 'Large', 'Small', 'Medium']

# Create an OrdinalEncoder instance with a custom order
ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])

# Apply ordinal encoding (the encoder expects 2D input: one row per sample)
encoded_sizes = ordinal_encoder.fit_transform([[size] for size in sizes])

# Shift the codes from 0..2 to 1..3 (Small → 1, Medium → 2, Large → 3)
encoded_sizes_custom = encoded_sizes + 1

# OrdinalEncoder returns floats, so cast to int for a clean display
encoded_sizes_int = encoded_sizes_custom.astype(int)

# Display the results
df = pd.DataFrame({
    'Size': sizes,
    'Encoded': encoded_sizes_int.flatten()  # Flatten the array to 1D
})
print(df)
```

Output:

```text
     Size  Encoded
0   Small        1
1  Medium        2
2   Large        3
3   Small        1
4  Medium        2
```
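The same mapping can also be written directly with pandas, which avoids the 2D reshaping and the `+ 1` shift. This is a minimal sketch of an alternative, not part of the original program; the dictionary name `size_order` is chosen here for illustration.

```python
import pandas as pd

sizes = pd.Series(['Small', 'Medium', 'Large', 'Small', 'Medium'])

# Explicit category-to-integer mapping that encodes the desired order
size_order = {'Small': 1, 'Medium': 2, 'Large': 3}

# Series.map replaces each value with its mapped integer;
# any value missing from the dictionary would become NaN
encoded = sizes.map(size_order)
print(encoded.tolist())  # [1, 2, 3, 1, 2]
```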


---

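## One-Hot Encoding

### Algorithm 6 c – One-Hot Encoding

The program for this activity encodes the same data with both ordinal and one-hot encoding so the two results can be compared: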

1. **Ordinal Encoding (Custom Order)**:

   * **Goal:** Convert categorical data into integers while preserving a predefined order.
   * **Steps:**

     1. Define the custom order of categories.
     2. Use `OrdinalEncoder` to encode data based on the predefined order.
     3. Adjust the encoded integers to fit the desired starting point (e.g., starting from `1` instead of `0`).
     4. Store the results in a DataFrame for further analysis.
2. **One-Hot Encoding**:

   * **Goal:** Convert categorical data into binary columns, where each category is represented by a separate column.
   * **Steps:**

     1. Create a DataFrame with the original categorical column.
     2. Use `pd.get_dummies()` to create a binary matrix where each category gets its own column.
     3. Display the results as a DataFrame where each category is represented as a binary vector.
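One-hot encoding adds one column per category, and those columns always sum to 1 in every row, which is the source of the multicollinearity mentioned in the objectives. As a hedged sketch of one common remedy, `pd.get_dummies` accepts `drop_first=True` to drop one redundant column; the variable names `full` and `reduced` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Small', 'Medium']})

# Full encoding: one 0/1 column per category
full = pd.get_dummies(df, columns=['Size'], dtype=int)
print(full.shape)                  # (5, 3)
print(full.sum(axis=1).tolist())   # [1, 1, 1, 1, 1]

# Reduced encoding: the first category (Large) becomes the implicit baseline
reduced = pd.get_dummies(df, columns=['Size'], drop_first=True, dtype=int)
print(reduced.columns.tolist())    # ['Size_Medium', 'Size_Small']
```

A row with zeros in both remaining columns then represents the dropped category, so no information is lost.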



```python
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Sample data (ordinal data)
sizes = ['Small', 'Medium', 'Large', 'Small', 'Medium']

# 1. Ordinal Encoding (Custom Order)
ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
encoded_sizes = ordinal_encoder.fit_transform([[size] for size in sizes])

# Shift the codes from 0..2 to 1..3 (Small → 1, Medium → 2, Large → 3)
encoded_sizes_custom = encoded_sizes + 1

# Convert the float codes to integers
encoded_sizes_int = encoded_sizes_custom.astype(int)

# Create DataFrame for Ordinal Encoding
df_ordinal = pd.DataFrame({
    'Size': sizes,
    'Ordinal_Encoded': encoded_sizes_int.flatten()  # Flatten to make it a 1D array
})

# 2. One-Hot Encoding using pandas `get_dummies`
df_one_hot = pd.DataFrame({
    'Size': sizes
})

# Create one column per category
df_one_hot_encoded = pd.get_dummies(df_one_hot, columns=['Size'], prefix='Size')

# Convert the boolean columns to integers (0 and 1)
df_one_hot_encoded = df_one_hot_encoded.astype(int)

# Display results for both
print("DataFrame after Ordinal Encoding:")
print(df_ordinal)

print("\nDataFrame after One-Hot Encoding:")
print(df_one_hot_encoded)
```

Output:

```text
DataFrame after Ordinal Encoding:
     Size  Ordinal_Encoded
0   Small                1
1  Medium                2
2   Large                3
3   Small                1
4  Medium                2

DataFrame after One-Hot Encoding:
   Size_Large  Size_Medium  Size_Small
0           0            0           1
1           0            1           0
2           1            0           0
3           0            0           1
4           0            1           0
```
