# scikit-learn Tutorial (Handling Categorical Variables)

<img src="https://raw.githubusercontent.com/arad1367/UniLi_sources/main/IMG/logo.jpg"
     alt="University of Liechtenstein"
     width="350"
     height="auto">

### About This Tutorial

This tutorial was prepared by **Dr. Pejman Ebrahimi** for the "Deep Learning and Advanced AI Techniques" course at the University of Liechtenstein.

For more resources and notebooks related to this course, please visit Moodle or the GitHub repository: [Course Materials](https://github.com/arad1367/University-of-Liechtenstein/tree/main/Deep%20Learning%20and%20Advanced%20AI).

You can reach out to Dr. Pejman Ebrahimi via email: [pejman.ebrahimi@uni.li](mailto:pejman.ebrahimi@uni.li).

### Handling Categorical Variables in Machine Learning
> Categorical variables are non-numeric features that describe categories or groups (e.g., "Male/Female", "Low/Medium/High"). These variables need to be encoded into numeric formats for machine learning models to process them effectively. This notebook demonstrates two common techniques for handling categorical variables: `Label Encoding` and `One-Hot Encoding`.

### 1. Dataset Overview

In [2]:
# This dataset contains employee-related information, including both numerical and categorical features.

employee_data_url = "https://raw.githubusercontent.com/arad1367/WAC_November-2023/refs/heads/main/Employee_Data_missing.csv"

### 3. Two Main Techniques for Handling Categorical Variables

### 3.1. Label Encoding
#### What it does:
Converts each category into a unique integer (e.g., `0`, `1`, `2`, `3`, ...).

![LabelEncoder](https://raw.githubusercontent.com/arad1367/UniLi_sources/main/IMG/LabelEncoder.png)

#### When to Use:
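- When the categories have an inherent order (e.g., `"Low"`, `"Medium"`, `"High"`). Note that scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order of the labels, so check that the resulting codes actually match the intended order.
- When working with tree-based algorithms like Decision Trees or Random Forests, which split on thresholds and are generally less sensitive to the arbitrary order of the integer codes.
- When a variable has many categories and One-Hot Encoding would create too many columns, even if the categories have no strict ordering.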
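
The point about ordering can be checked on a toy list. The sketch below is not part of the course notebook: the `levels` list is made up for illustration, and `OrdinalEncoder` with an explicit `categories` argument is shown only as one possible way to keep the intended order.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordered feature, invented for this illustration
levels = ["Low", "Medium", "High", "Medium"]

# LabelEncoder sorts the labels alphabetically: High=0, Low=1, Medium=2
le = LabelEncoder()
label_codes = le.fit_transform(levels)
print(list(le.classes_))     # ['High', 'Low', 'Medium']
print(label_codes.tolist())  # [1, 2, 0, 2]

# OrdinalEncoder accepts an explicit category order: Low=0, Medium=1, High=2
# It expects 2D input, so each value is wrapped in its own row.
oe = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordinal_codes = oe.fit_transform([[v] for v in levels]).ravel()
print(ordinal_codes.tolist())  # [0.0, 1.0, 2.0, 1.0]
```

With `LabelEncoder`, `"High"` ends up with the smallest code, which is why the alphabetical mapping should be inspected before treating the integers as ranks.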

---

### 3.2. One-Hot Encoding
#### What it does:
Creates a new binary column for each category. If a row belongs to a category, it gets `1`; otherwise, `0`.

![One-Hot-Encoding](https://raw.githubusercontent.com/arad1367/UniLi_sources/main/IMG/One-Hot-Encoding.png)


#### When to Use:
- When the categories are independent and unordered (most of the time!).
- Commonly used for linear models or neural networks, which would otherwise interpret integer codes as magnitudes and infer an order or distance between categories that does not exist.
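
As a quick illustration of the idea, here is a minimal sketch on a made-up `Department` column. It uses `pandas.get_dummies` rather than the `OneHotEncoder` class used later in this notebook, purely because it is compact. The `toy` frame is an assumption for demonstration.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame(
    {"Department": ["Engineering", "Marketing", "Finance", "Engineering"]}
)

# One binary column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(toy, columns=["Department"], dtype=int)
print(one_hot)

# Each row has exactly one 1 across the new columns
print(one_hot.sum(axis=1).tolist())  # [1, 1, 1, 1]
```

No category receives a larger number than another, so a model cannot read an order into the encoding.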

### 4. Code Implementation

* Label Encoding converts categorical variables into numeric labels.


In [4]:
# Step 1: Loading the Dataset

import pandas as pd

# Load dataset
df = pd.read_csv(employee_data_url)

# Display first few rows
print("Dataset Preview:")
df.head()

Dataset Preview:


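|   | EmployeeID | Name | Age | Department | Salary | JoiningDate | PerformanceRating |
|---|------------|------|-----|------------|--------|-------------|-------------------|
| 0 | 1 | John Doe | 30 | Engineering | 75000.0 | 5/15/2020 | 4.5 |
| 1 | 2 | Jane Smith | 28 | Marketing | 60000.0 | 10/20/2019 | 3.8 |
| 2 | 3 | Bob Johnson | 35 | Finance | 80000.0 | 2/10/2021 | 4.2 |
| 3 | 4 | Alice Williams | 32 | Engineering | 82000.0 | 9/5/2018 | 4.7 |
| 4 | 5 | Chris Brown | 28 | Finance | 75000.0 | 1/12/2022 | 3.5 |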


In [5]:
# Step 2: Identify Categorical Columns
# We identify columns with object data types, which typically represent categorical variables.

# Select categorical columns
objList = df.select_dtypes(include="object").columns
print("\nCategorical Columns:")
print(objList)


Categorical Columns:
Index(['Name', 'Department', 'JoiningDate'], dtype='object')
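
Columns with an object data type are not always true categories: `Name` and `JoiningDate` are text and date fields with almost one distinct value per row. A rough way to spot such columns is to count distinct values per column. The sketch below runs on a made-up frame, and the "every value is unique" rule is only an illustrative heuristic, not something the notebook prescribes.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({
    "Name": ["John Doe", "Jane Smith", "Bob Johnson", "Alice Williams"],
    "Department": ["Engineering", "Marketing", "Finance", "Engineering"],
    "Age": [30, 28, 35, 32],
})

# Count distinct values in every non-numeric column
non_numeric = toy.select_dtypes(exclude="number")
cardinality = non_numeric.nunique()
print(cardinality)

# Columns where every row is unique behave like identifiers, not categories
unique_ratio = cardinality / len(toy)
id_like = unique_ratio[unique_ratio == 1.0].index.tolist()
print(id_like)  # ['Name']
```

Identifier-like columns are often dropped or handled separately (for example, parsing dates) instead of being encoded as categories.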


### Part A: Label Encoding

* Import the `LabelEncoder` from `sklearn.preprocessing`.
* Create a `dictionary` to store encoders for each categorical column.
* Loop through each categorical column, apply LabelEncoder, and save the encoder for future reference.
* Print the mapping of integers to original categories for transparency.

In [7]:
# 1. Import library
from sklearn.preprocessing import LabelEncoder

# 2. Create an empty dictionary to save encoders
le_dict = {}

# 3. Loop through all object columns and encode
for feat in objList:
    le = LabelEncoder()
    df[feat] = le.fit_transform(df[feat].astype(str))  # Ensure values are string
    le_dict[feat] = le  # Save the encoder for this column

# 4. Check dataframe info
print("\nDataframe Info After Label Encoding:")
print(df.info())

# 5. Print mapping for each feature
for feat in objList:
    print(f"\nMapping for '{feat}':")
    le = le_dict[feat]  # Get the LabelEncoder for this feature

    for i, class_label in enumerate(le.classes_):
        print(f"{i} -> {class_label}")


Dataframe Info After Label Encoding:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   EmployeeID         50 non-null     int64  
 1   Name               50 non-null     int64  
 2   Age                50 non-null     int64  
 3   Department         50 non-null     int64  
 4   Salary             48 non-null     float64
 5   JoiningDate        50 non-null     int64  
 6   PerformanceRating  49 non-null     float64
dtypes: float64(2), int64(5)
memory usage: 2.9 KB
None

Mapping for 'Name':
0 -> Aiden Anderson
1 -> Alice Williams
2 -> Amelia Wilson
3 -> Ava Brown
4 -> Ava Clark
5 -> Ava Smith
6 -> Ava Taylor
7 -> Ava Wilson
8 -> Bob Johnson
9 -> Chris Brown
10 -> David Lee
11 -> Ella Davis
12 -> Ella White
13 -> Emily Davis
14 -> Emma Brown
15 -> Emma Taylor
16 -> Ethan Davis
17 -> Harper Moore
18 -> Isabella Anderson
19 -> James Anders

In [8]:
df.head(5)

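|   | EmployeeID | Name | Age | Department | Salary | JoiningDate | PerformanceRating |
|---|------------|------|-----|------------|--------|-------------|-------------------|
| 0 | 1 | 22 | 30 | 0 | 75000.0 | 31 | 4.5 |
| 1 | 2 | 21 | 28 | 3 | 60000.0 | 7 | 3.8 |
| 2 | 3 | 8 | 35 | 1 | 80000.0 | 15 | 4.2 |
| 3 | 4 | 1 | 32 | 0 | 82000.0 | 47 | 4.7 |
| 4 | 5 | 9 | 28 | 1 | 75000.0 | 1 | 3.5 |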
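
Saving each fitted encoder is what makes the integer codes reversible. The following minimal sketch, on a made-up `Department` column, repeats the dictionary pattern used above and then calls `inverse_transform` to recover the original labels. The `toy` frame is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame(
    {"Department": ["Engineering", "Marketing", "Finance", "Engineering"]}
)

# Fit one encoder per column and keep it in a dictionary
le_dict = {}
for col in ["Department"]:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col].astype(str))
    le_dict[col] = le

print(toy["Department"].tolist())  # [0, 2, 1, 0]

# Build a code -> label lookup from the fitted encoder
mapping = {i: str(label) for i, label in enumerate(le_dict["Department"].classes_)}
print(mapping)  # {0: 'Engineering', 1: 'Finance', 2: 'Marketing'}

# Recover the original strings from the integer codes
decoded = le_dict["Department"].inverse_transform(toy["Department"])
print(list(decoded))
```

The same saved encoder can be reused with `transform` on new data, as long as the new data contains only categories seen during fitting.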


### Part B: One-Hot Encoding
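* One-Hot Encoding creates binary columns for each category. Because Part A already replaced the categorical columns in `df` with integer codes, the encoder in this part operates on those codes, which is why the new columns are named `Name_0`, `JoiningDate_49`, and so on rather than after the original labels.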


- Import the `OneHotEncoder` from `sklearn.preprocessing`.
- Initialize the encoder with `sparse_output=False` (to return a dense array) and `handle_unknown='ignore'` (to handle unseen categories during inference).
- Fit and transform the categorical columns.
- Create a DataFrame from the encoded array and concatenate it with the original dataset.
- Drop the original categorical columns.

In [9]:
# 1. Create OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# 2. Fit and transform the object columns
encoded_array = ohe.fit_transform(df[objList])

# 3. Create a DataFrame from the encoded array
encoded_df = pd.DataFrame(encoded_array, columns=ohe.get_feature_names_out(objList))

# 4. Give encoded_df the same index as df so that concat aligns the rows
encoded_df.index = df.index

# 5. Drop original object columns and concatenate the new one-hot columns
df = pd.concat([df.drop(objList, axis=1), encoded_df], axis=1)

# 6. Check the new df info
print("\nDataframe Info After One-Hot Encoding:")
print(df.info())

# Display last few rows
print("\nFinal Dataset Preview:")
df.tail()


Dataframe Info After One-Hot Encoding:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Columns: 109 entries, EmployeeID to JoiningDate_49
dtypes: float64(107), int64(2)
memory usage: 42.7 KB
None

Final Dataset Preview:


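|    | EmployeeID | Age | Salary | PerformanceRating | Name_0 | Name_1 | Name_2 | Name_3 | Name_4 | Name_5 | ... | JoiningDate_40 | JoiningDate_41 | JoiningDate_42 | JoiningDate_43 | JoiningDate_44 | JoiningDate_45 | JoiningDate_46 | JoiningDate_47 | JoiningDate_48 | JoiningDate_49 |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 45 | 46 | 37 | 90000.0 | 4.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 46 | 47 | 28 | 84000.0 | 4.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 47 | 48 | 29 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 48 | 49 | 36 | 71000.0 | 4.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 49 | 50 | 30 | 92000.0 | 4.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |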
