# Label Encoding
Label Encoding is a simple and widely used data preprocessing technique for converting categorical data (text labels) into numerical format. It is a preprocessing step rather than a learning method: it assigns a unique integer to each category in a categorical feature so that machine learning algorithms that require numeric input can process the data.

👉 In Label Encoding, each category in a categorical feature is mapped to an integer value. For example, "Red" might become 0, "Blue" might become 1, and "Green" might become 2.

# When to Use Label Encoding?
✔ When the categorical feature has an inherent order or hierarchy (e.g., "Low," "Medium," "High").

✔ For tree-based models like Decision Trees, Random Forests, and Gradient Boosting, which can handle label-encoded data effectively.

✔ When the dataset has a small number of unique categories.

# Example:
Suppose you have a dataset with a "Size" column containing values like "Small," "Medium," and "Large." Label Encoding will convert these values into numerical format, such as:

"Small" → 0

"Medium" → 1

"Large" → 2

# How Does Label Encoding Work?
**Identify Categorical Features:**
Determine which columns in your dataset contain categorical data.

**Map Categories to Integers:**
Assign a unique integer to each category in the feature.

**Transform the Data:**
Replace the original categorical values with the corresponding integers.
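
The three steps above can be sketched in plain Python. This is a minimal illustration on a made-up dataset (the `dataset` dictionary and its column names are assumptions for the example, not part of the notebook's data):

```python
# Hypothetical toy dataset: column name -> list of values
dataset = {
    "Size": ["Small", "Large", "Medium", "Small"],
    "Price": [10.0, 25.0, 18.0, 11.0],
}

# Step 1: identify categorical features (here: columns that hold only strings)
categorical_columns = [
    name for name, values in dataset.items()
    if all(isinstance(v, str) for v in values)
]

# Step 2: map each category to a unique integer
# (categories are sorted alphabetically so the mapping is reproducible)
mappings = {
    name: {category: code for code, category in enumerate(sorted(set(dataset[name])))}
    for name in categorical_columns
}

# Step 3: replace the original values with their integer codes
encoded = dict(dataset)
for name in categorical_columns:
    encoded[name] = [mappings[name][v] for v in dataset[name]]

print(mappings["Size"])  # {'Large': 0, 'Medium': 1, 'Small': 2}
print(encoded["Size"])   # [2, 0, 1, 2]
```

Note that the alphabetical mapping gives "Large" = 0 and "Small" = 2, which does not follow the natural Small < Medium < Large order. For ordinal features, an explicit mapping in the intended order is usually the safer choice.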

# Key Concepts in Label Encoding:
**Ordinal Data:**
Label Encoding is ideal for categorical features with an inherent order (e.g., "Low," "Medium," "High").

**Nominal Data:**
For categorical features without an inherent order (e.g., "Apple," "Banana," "Orange"), Label Encoding may introduce false ordering, and other techniques like One-Hot Encoding are preferred.

**Inverse Transform:**
You can convert the encoded integers back to their original labels using the inverse_transform method.
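
As a small sketch of the inverse transform, the example below uses scikit-learn's `LabelEncoder` on a made-up list of colors (the `colors` list is an assumption for illustration). `LabelEncoder` sorts the classes, so the integer codes follow alphabetical order rather than any meaningful ranking:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical nominal feature
colors = ["Red", "Blue", "Green", "Blue"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)       # classes are sorted: Blue=0, Green=1, Red=2
decoded = encoder.inverse_transform(encoded)  # back to the original labels

print(list(encoder.classes_))  # ['Blue', 'Green', 'Red']
print(encoded.tolist())        # [2, 0, 1, 0]
print(decoded.tolist())        # ['Red', 'Blue', 'Green', 'Blue']
```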

# Advantages of Label Encoding ✅
**✔ Simple and Easy to Implement:** Requires minimal code and is straightforward to use.

**✔ Memory Efficient:** Converts categorical data into integers, which are more memory-efficient than strings.

**✔ Works Well with Tree-Based Models:** Algorithms like Decision Trees and Random Forests can handle label-encoded data effectively.

# Disadvantages of Label Encoding ❌
**🚨 Introduces False Ordering:** Assigning integers to categories can create a false sense of order (e.g., "Green" = 0, "Blue" = 1, "Red" = 2 might imply "Green" < "Blue" < "Red").

**🚨 Not Suitable for Nominal Data:** For categorical features without an inherent order, Label Encoding can mislead the model.

**🚨 Limited Use in Linear Models:** Linear models (e.g., Linear Regression, Logistic Regression) may misinterpret the encoded integers as having numerical significance.

# How to Choose the Right Encoding Technique?
**Label Encoding:** Use for ordinal data or tree-based models.

**One-Hot Encoding:** Use for nominal data or linear models.

**Target Encoding:** Use for high-cardinality features in supervised tasks, where a target variable is available to compute the per-category means.

# Example Use Cases:
**Classification Tasks:**
Encoding target labels (e.g., "Yes" = 1, "No" = 0) in binary or multi-class classification.

**Ordinal Features:**
Encoding features like "Size" (Small, Medium, Large) or "Education Level" (High School, Bachelor's, Master's).

**Tree-Based Models:**
Preparing categorical data for algorithms like Decision Trees, Random Forests, and Gradient Boosting.

# Summary:
Label Encoding is a simple and effective technique for converting categorical data into numerical format, especially for ordinal data and tree-based models. However, it should be used with caution for nominal data, as it can introduce false ordering. For nominal data, consider using One-Hot Encoding or Target Encoding instead. By understanding when and how to use Label Encoding, you can preprocess your data effectively and improve the performance of your machine learning models! 🌟🔢



In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 



In [None]:
df = pd.read_csv('data.csv')
df

In [None]:
df.head() #first 5 rows

In [None]:
df.tail() #last 5 rows

In [5]:
df.index #checking the index

RangeIndex(start=0, stop=1470, step=1)

## General description

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

## Encoding

##### 1. One-Hot / dummy Encoding

One-hot encoding is a technique used to convert categorical variables into a binary matrix. Each category is represented by a binary vector where only one bit is 'hot' (set to 1) while the rest are 'cold' (set to 0). Dummy encoding is closely related: it drops one of the categories, producing k-1 columns for k categories, and is more often used in the context of regression analysis.


In [None]:
# one_hot_encoder = OneHotEncoder()
# one_hot_encoded = one_hot_encoder.fit_transform(data[Columns])
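
A minimal sketch of the difference between one-hot and dummy encoding, using `pd.get_dummies` on a made-up `Color` column (the data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical nominal feature
data = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

# One-hot encoding: one column per category
one_hot = pd.get_dummies(data, columns=["Color"], dtype=int)

# Dummy encoding: drop the first category, leaving k-1 columns
dummy = pd.get_dummies(data, columns=["Color"], drop_first=True, dtype=int)

print(list(one_hot.columns))  # ['Color_Blue', 'Color_Green', 'Color_Red']
print(list(dummy.columns))    # ['Color_Green', 'Color_Red']
```

In the dummy-encoded frame, a row with zeros in every column stands for the dropped category ("Blue" here).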


##### 2. Label / Ordinal Encoding
Label encoding is a technique used to convert categorical variables into numerical format by assigning a unique integer label to each category. It's typically used when the categories have an inherent order or rank.

In [None]:
# # Label Encoding / Ordinal Encoding
# label_encoder = LabelEncoder()
# ordinal_encoded = data[['Columns']].apply(label_encoder.fit_transform)


##### 3. Binary Encoding
Binary encoding is a technique used to convert categorical variables into binary format. Each category is first converted into its corresponding integer label, and then each label is converted into binary representation.

In [None]:

# # Binary Encoding
# binary_encoder = BinaryEncoder()
# binary_encoded = binary_encoder.fit_transform(data[['Columns']])
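
The two stages described above (integer label first, then binary digits) can be sketched without any library. This is an illustrative implementation under simple assumptions (alphabetical labels starting at 0); library encoders may number the categories differently:

```python
# Hypothetical nominal feature
fruits = ["Apple", "Orange", "Banana", "Orange"]

# Stage 1: assign an integer label to each category (alphabetical order)
categories = sorted(set(fruits))                     # ['Apple', 'Banana', 'Orange']
labels = {category: i for i, category in enumerate(categories)}

# Number of bits needed to represent the largest label (at least 1)
n_bits = max(1, (len(categories) - 1).bit_length())  # 2 bits for 3 categories

# Stage 2: write each label as a fixed-width list of binary digits
binary_codes = {
    category: [int(bit) for bit in format(label, f"0{n_bits}b")]
    for category, label in labels.items()
}

encoded = [binary_codes[f] for f in fruits]
print(binary_codes)  # {'Apple': [0, 0], 'Banana': [0, 1], 'Orange': [1, 0]}
print(encoded)       # [[0, 0], [1, 0], [0, 1], [1, 0]]
```

Three categories need only two columns here, whereas one-hot encoding would need three. The saving grows with the number of categories.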


In [None]:
# Apple, Orange, Banana

# fruit  = col1 col2
# Apple  =  0    0
# Orange =  1    1
# Banana =  0    1


0
1
0
0
1
1
1
0

##### 4. Target Encoding
Target encoding, also known as mean encoding, involves replacing categorical variables with the mean of the target variable (the variable you're trying to predict) for each category. It's particularly useful for binary classification problems and can help capture relationships between categorical variables and the target variable.

In [None]:

# # Target Encoding
# target_encoder = TargetEncoder()
# target_encoded = target_encoder.fit_transform(data['Columns1'], data['Columns2'])  # 'Columns1' is the categorical feature, 'Columns2' is the target variable
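
As a rough sketch of what mean encoding computes, the example below uses plain pandas on a made-up dataset (the `City` and `Churn` columns are assumptions for illustration). It shows only the basic idea, without the smoothing that dedicated encoders typically apply:

```python
import pandas as pd

# Hypothetical data: a categorical feature and a binary target
data = pd.DataFrame({
    "City":  ["A", "A", "B", "B", "B", "C"],
    "Churn": [1,   0,   1,   1,   0,   0],
})

# Mean of the target for each category
city_means = data.groupby("City")["Churn"].mean()

# Replace each category with its target mean
data["City_Target_Encoded"] = data["City"].map(city_means)

print(city_means.round(3).to_dict())  # {'A': 0.5, 'B': 0.667, 'C': 0.0}
```

Because the encoding uses the target, the means should be computed on the training split only and then applied to validation or test data. Otherwise target information leaks into the features.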


##### 5. Frequency / Count Encoding
Frequency encoding, also known as count encoding, involves replacing categorical variables with the count of each category in the dataset. It's a simple yet effective technique that can help capture the importance or prevalence of each category.

In [None]:

# # Frequency / Count Encoding
# count_encoder = CountEncoder()
# count_encoded = count_encoder.fit_transform(data['Columns'])
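
A minimal sketch of both variants with plain pandas, on a made-up department column (the values are assumptions for illustration). Count encoding uses raw counts, while frequency encoding divides them by the number of rows:

```python
import pandas as pd

# Hypothetical categorical feature
dept = pd.Series(["Sales", "R&D", "R&D", "HR", "R&D"])

counts = dept.value_counts()                # occurrences of each category

count_encoded = dept.map(counts)            # raw counts
freq_encoded = dept.map(counts / len(dept)) # relative frequencies

print(count_encoded.tolist())  # [1, 3, 3, 1, 3]
print(freq_encoded.tolist())   # [0.2, 0.6, 0.6, 0.2, 0.6]
```

In this sketch "Sales" and "HR" receive the same value, so categories that occur equally often become indistinguishable after encoding.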


##### 6. Feature Hashing
Feature hashing is a dimensionality reduction technique used to encode categorical variables into a fixed-size feature space. It involves applying a hash function to the categorical variables, which maps them to a predefined number of features. Feature hashing is particularly useful when dealing with high-cardinality categorical variables or in situations where memory efficiency is important. However, it may lead to collisions (different categories mapping to the same hash value), which can affect the performance of machine learning models.

In [None]:

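# # Feature Hashing
# feature_hasher = FeatureHasher(n_features=5, input_type='string')
# # With input_type='string', each sample must be an iterable of strings, so wrap each value in a list
# feature_hashed = feature_hasher.fit_transform(data['Columns'].apply(lambda v: [v]))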
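
The hashing idea itself can be sketched with the standard library. This is a simplified illustration, not scikit-learn's implementation (`FeatureHasher` uses a signed MurmurHash3); the helper names `hash_bucket` and `hash_encode` are hypothetical:

```python
import hashlib

def hash_bucket(value: str, n_features: int = 5) -> int:
    """Map a category to one of n_features buckets with a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

def hash_encode(value: str, n_features: int = 5) -> list:
    """Return a fixed-size vector with a 1 in the category's bucket."""
    vector = [0] * n_features
    vector[hash_bucket(value, n_features)] = 1
    return vector

# Hypothetical high-cardinality feature
cities = ["Lahore", "Karachi", "Multan", "Quetta", "Peshawar", "Sialkot"]
encoded = [hash_encode(city) for city in cities]

# Six categories share five buckets, so at least one collision is guaranteed
buckets = {hash_bucket(city) for city in cities}
print(len(buckets) < len(cities))  # True
```

The output width stays at `n_features` no matter how many categories appear, which is the memory advantage described above. The cost is that colliding categories can no longer be told apart.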

## Ordinal Encoding

<img src="ordinal.png" alt="Image">

In [7]:
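from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()

# OrdinalEncoder expects 2D input, so reshape the column to (n_samples, 1)
education_reshape = df['Education'].values.reshape(-1,1)
# Education is already integer-coded; the encoder re-maps its sorted unique values to 0.0, 1.0, 2.0, ...
df['Education_Ordinal_Encoded'] = ordinal_encoder.fit_transform(education_reshape)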



In [8]:
df[['Education','Education_Ordinal_Encoded']].head(7)

| | Education | Education_Ordinal_Encoded |
|---|---|---|
| 0 | 2 | 1.0 |
| 1 | 1 | 0.0 |
| 2 | 2 | 1.0 |
| 3 | 4 | 3.0 |
| 4 | 1 | 0.0 |
| 5 | 2 | 1.0 |
| 6 | 3 | 2.0 |
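
When an ordinal feature is stored as text, `OrdinalEncoder` sorts the categories alphabetically by default, which may not match their real order. A hedged sketch of passing the order explicitly through the `categories` parameter, on a made-up `Satisfaction` column (an assumption for illustration, not a column of this dataset):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical text-valued ordinal feature
levels = pd.DataFrame({"Satisfaction": ["Low", "High", "Medium", "Low"]})

# State the intended order explicitly: Low < Medium < High
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
levels["Satisfaction_Encoded"] = encoder.fit_transform(levels[["Satisfaction"]]).ravel()

print(levels["Satisfaction_Encoded"].tolist())  # [0.0, 2.0, 1.0, 0.0]
```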


## Binary Encoding

<img src="binary.png" alt="Image">

In [9]:
attrition_mapping = {'Yes':1,'No':0}

df['Attrition_Binary_Encoded'] = df['Attrition'].map(attrition_mapping)



In [10]:
df[['Attrition','Attrition_Binary_Encoded']].head()

| | Attrition | Attrition_Binary_Encoded |
|---|---|---|
| 0 | Yes | 1 |
| 1 | No | 0 |
| 2 | Yes | 1 |
| 3 | No | 0 |
| 4 | No | 0 |


In [11]:
df

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Education_Ordinal_Encoded,Attrition_Binary_Encoded
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,0,8,0,1,6,4,0,5,1.0,1
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,1,10,3,3,10,7,1,7,0.0,0
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,0,7,3,3,0,0,0,0,1.0,1
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,0,8,3,3,8,7,3,0,3.0,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,1,6,3,3,2,2,2,2,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,No,Travel_Frequently,884,Research & Development,23,2,Medical,1,2061,...,1,17,3,3,5,2,0,3,1.0,0
1466,39,No,Travel_Rarely,613,Research & Development,6,1,Medical,1,2062,...,1,9,5,3,7,7,1,7,0.0,0
1467,27,No,Travel_Rarely,155,Research & Development,4,3,Life Sciences,1,2064,...,1,6,0,3,6,2,0,3,2.0,0
1468,49,No,Travel_Frequently,1023,Sales,2,3,Medical,1,2065,...,0,17,3,2,9,6,0,8,2.0,0


## One-Hot / Dummy Encoding

<img src="image.png" alt="Image">

In [31]:
one_hot_encoding = pd.get_dummies(df, columns=['BusinessTravel'])


In [32]:
one_hot_encoding.head()

Unnamed: 0,Age,Attrition,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,...,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Education_Ordinal_Encoded,Attrition_Binary_Encoded,Department_Frequence_Encoded,BusinessTravel_Non-Travel,BusinessTravel_Travel_Frequently,BusinessTravel_Travel_Rarely
0,41,Yes,1102,Sales,1,2,Life Sciences,1,1,2,...,6,4,0,5,1.0,1,0.303401,0,0,1
1,49,No,279,Research & Development,8,1,Life Sciences,1,2,3,...,10,7,1,7,0.0,0,0.653741,0,1,0
2,37,Yes,1373,Research & Development,2,2,Other,1,4,4,...,0,0,0,0,1.0,1,0.653741,0,0,1
3,33,No,1392,Research & Development,3,4,Life Sciences,1,5,4,...,8,7,3,0,3.0,0,0.653741,0,1,0
4,27,No,591,Research & Development,2,1,Medical,1,7,1,...,2,2,2,2,0.0,0,0.653741,0,0,1


## Frequency / count Encoding

In [12]:
frequency = df['Department'].value_counts() / len(df)
print(frequency)
df['Department_Frequence_Encoded'] = df['Department'].map(frequency)

df[['Department','Department_Frequence_Encoded']]

Research & Development    0.653741
Sales                     0.303401
Human Resources           0.042857
Name: Department, dtype: float64


| | Department | Department_Frequence_Encoded |
|---|---|---|
| 0 | Sales | 0.303401 |
| 1 | Research & Development | 0.653741 |
| 2 | Research & Development | 0.653741 |
| 3 | Research & Development | 0.653741 |
| 4 | Research & Development | 0.653741 |
| ... | ... | ... |
| 1465 | Research & Development | 0.653741 |
| 1466 | Research & Development | 0.653741 |
| 1467 | Research & Development | 0.653741 |
| 1468 | Sales | 0.303401 |
