# **Encoding Categorical Features**

Categorical encoding is the process of converting categorical values into numbers so that machine learning models can work with them.

> [**Kaggle Dataset**](https://www.kaggle.com/datasets/farhanmd29/50-startups)

In [None]:
# Install Kaggle.
!pip install --upgrade --force-reinstall --no-deps kaggle

In [None]:
# Upload kaggle.json (the Kaggle API token file).
from google.colab import files

files.upload()

In [None]:
# Create the Kaggle folder (-p avoids an error if it already exists).
!mkdir -p ~/.kaggle

# Copy the kaggle.json to the folder created.
!cp kaggle.json ~/.kaggle/

# Restrict the file's permissions so that only the owner can read and write it.
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
# Dataset Download.
!kaggle datasets download -d farhanmd29/50-startups

In [None]:
# Unzip Dataset.
!unzip 50-startups.zip

In [None]:
# Import Library.
import pandas as pd
import numpy as np

# Import Dataset.
data = pd.read_csv("50_Startups.csv")
data.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [None]:
# Data Summary.
data.info()

In [None]:
# Count frequency of each category.
data["State"].value_counts()

New York      17
California    17
Florida       16
Name: State, dtype: int64

In [None]:
# Split the dataset into training and test set.
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(data, test_size=0.2, random_state=42)

In [None]:
X_train.head()

In [None]:
X_test.head()

# **Label Encoding.**

[**sklearn.preprocessing.LabelEncoder**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder)

*Label Encoding is a popular technique for encoding categorical data. Each distinct category in the feature column is assigned a unique integer; scikit-learn's `LabelEncoder` assigns the integers in sorted (alphabetical) order.*

### **Challenges with Label Encoding.**

The `State` feature is nominal: its categories have no inherent order or rank. Label Encoding nevertheless maps them to integers (here California = 0, Florida = 1, New York = 2), so there is a high probability that the model interprets the codes as an ordering between the states, such as **California $<$ Florida $<$ New York**. The **One-Hot Encoding** technique overcomes this obstacle.

*   Label Encoding does not handle new categories automatically: `LabelEncoder` raises an error if the test set contains a category that was not seen during fitting.
*   Label Encoding implies an artificial order relationship between the categories.
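
The two challenges above can be reproduced with a minimal sketch. The state names below are toy values chosen for illustration (they are not rows from the dataset), and `"Texas"` is a hypothetical unseen category.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values for illustration only (not rows from the dataset).
train_states = ["New York", "California", "Florida", "New York"]

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_states)

# classes_ is sorted, so the integer codes follow alphabetical order.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'California': 0, 'Florida': 1, 'New York': 2}

# A category that was absent during fit cannot be transformed.
try:
    encoder.transform(["Texas"])  # hypothetical unseen category
    unseen_ok = True
except ValueError as err:
    unseen_ok = False
    print("Unseen category rejected:", err)
```

The integer codes carry no meaning beyond identity, yet a model that treats them as numbers will see New York (2) as "greater than" California (0).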



In [None]:
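# Label Encoding.
from sklearn import preprocessing

# Work on a copy to avoid pandas' SettingWithCopyWarning.
X_train = X_train.copy()

label_encoder = preprocessing.LabelEncoder()
# Fit on the training set only; the same mapping is reused for the test set.
X_train["State"] = label_encoder.fit_transform(X_train["State"])
X_train.head()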

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
12,93863.75,127320.38,249839.44,1,141585.52
4,142107.34,91391.77,366168.42,1,166187.94
37,44069.95,51283.14,197029.42,0,89949.14
8,120542.52,148718.95,311613.29,2,152211.77
3,144372.41,118671.85,383199.62,2,182901.99


In [None]:
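# Reuse the encoder fitted on the training set; do not refit on the test set.
X_test = X_test.copy()
X_test["State"] = label_encoder.transform(X_test["State"])
X_test.head()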

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
13,91992.39,135495.07,252664.93,0,134307.35
39,38558.51,82982.09,174999.3,0,81005.76
30,61994.48,115641.28,91131.24,1,99937.59
45,1000.23,124153.04,1903.93,2,64926.08
17,94657.16,145077.58,282574.31,2,125370.37


# **One-Hot Encoding.**

[**sklearn.preprocessing.OneHotEncoder**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder)

**One-Hot Encoding is the process of creating dummy variables.**

One-Hot Encoding is another popular technique for treating categorical variables. It creates one additional feature for each unique category in the feature column. Each category is mapped to a binary variable containing either 0 or 1, where 0 represents the absence and 1 the presence of that category. These newly created binary features are known as dummy variables. The number of dummy variables equals the number of levels (distinct categories) present in the categorical feature.

![onehot-encoding](https://miro.medium.com/max/1879/1*O_pTwOZZLYZabRjw3Ga21A.png)
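
As a quick illustration of dummy variables, the sketch below uses `pandas.get_dummies` on a toy column. This is an alternative to the scikit-learn encoder used in this notebook, shown only because it makes the resulting columns easy to inspect; the values are made up for illustration.

```python
import pandas as pd

# Toy column for illustration only (not rows from the dataset).
toy = pd.DataFrame({"State": ["New York", "California", "Florida", "California"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(toy["State"], dtype=int)
print(dummies)

# Exactly one dummy column is set to 1 in every row.
row_sums = dummies.sum(axis=1)
```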

### **Challenges of One-Hot Encoding.**

*   One-Hot Encoding can lead to the Dummy Variable Trap: because the dummy columns of a feature always sum to 1, any one of them can be derived from the others. This dependency between the independent features is known as **multicollinearity**, and it is a problem for models such as linear regression.
*   Although One-Hot Encoding avoids imposing an artificial order on the categories, it scales poorly when a column has many categories. Each category adds a new column, so the feature space grows, computation becomes more expensive, and the model may suffer from the curse of dimensionality.
*   One-Hot Encoding makes the dataset sparse: in every row, all dummy columns of a feature are 0 except one.
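
One common way to sidestep the Dummy Variable Trap is to drop one dummy column per feature. The sketch below does this with the `drop="first"` option of `OneHotEncoder`; the input values are toy data for illustration, and whether dropping a column is appropriate depends on the model being used.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy single-column input for illustration only.
states = np.array([["New York"], ["California"], ["Florida"], ["California"]])

# Full encoding: one column per category, and every row sums to 1.
full = OneHotEncoder().fit_transform(states).toarray()

# Reduced encoding: the first category (California) is dropped,
# so it is represented by a row of zeros.
reduced_encoder = OneHotEncoder(drop="first")
reduced = reduced_encoder.fit_transform(states).toarray()

print(full.shape)     # (4, 3)
print(reduced.shape)  # (4, 2)
```

With one column removed, the remaining dummy columns no longer sum to a constant, so none of them can be derived from the others.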



In [None]:
# Split the original dataset again, because the previous X_train and X_test now hold label-encoded State values.
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(data, test_size=0.2, random_state=74)

In [None]:
X_train.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
3,144372.41,118671.85,383199.62,New York,182901.99
29,65605.48,153032.06,107138.38,New York,101004.64
34,46426.07,157693.92,210797.67,California,96712.8
5,131876.9,99814.71,362861.36,New York,156991.12
13,91992.39,135495.07,252664.93,California,134307.35


In [None]:
X_test.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
22,73994.56,122782.75,303319.26,Florida,110352.25
16,78013.11,121597.55,264346.06,California,126992.93
26,75328.87,144135.98,134050.07,Florida,105733.54
8,120542.52,148718.95,311613.29,New York,152211.77
32,63408.86,129219.61,46085.25,California,97427.84


In [None]:
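# One-Hot Encoding.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Encode column 3 ("State") and pass the remaining columns through unchanged.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])], remainder="passthrough"
)
# Fit on the training set only.
X_train_encoded = np.array(ct.fit_transform(X_train.values))

# The encoded columns come first (categories in sorted order), followed by the
# passthrough columns. A new name is used so the original `data` is not overwritten.
train = pd.DataFrame(
    X_train_encoded,
    columns=[
        "California",
        "Florida",
        "New York",
        "R&D Spend",
        "Administration",
        "Marketing Spend",
        "Profit",
    ],
)

train.head()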

Unnamed: 0,California,Florida,New York,R&D Spend,Administration,Marketing Spend,Profit
0,0.0,0.0,1.0,144372.41,118671.85,383199.62,182901.99
1,0.0,0.0,1.0,65605.48,153032.06,107138.38,101004.64
2,1.0,0.0,0.0,46426.07,157693.92,210797.67,96712.8
3,0.0,0.0,1.0,131876.9,99814.71,362861.36,156991.12
4,1.0,0.0,0.0,91992.39,135495.07,252664.93,134307.35


In [None]:
# Apply the encoder fitted on the training set to the test set.
X_test_encoded = np.array(ct.transform(X_test.values))

test = pd.DataFrame(
    X_test_encoded,
    columns=[
        "California",
        "Florida",
        "New York",
        "R&D Spend",
        "Administration",
        "Marketing Spend",
        "Profit",
    ],
)

test.head()

Unnamed: 0,California,Florida,New York,R&D Spend,Administration,Marketing Spend,Profit
0,0.0,1.0,0.0,73994.56,122782.75,303319.26,110352.25
1,1.0,0.0,0.0,78013.11,121597.55,264346.06,126992.93
2,0.0,1.0,0.0,75328.87,144135.98,134050.07,105733.54
3,0.0,0.0,1.0,120542.52,148718.95,311613.29,152211.77
4,1.0,0.0,0.0,63408.86,129219.61,46085.25,97427.84


## **When to use Label Encoding vs. One-Hot Encoding?**

The answer depends on the dataset and the model we wish to apply. A few points to consider before choosing an encoding technique:

**Apply One-Hot Encoding when:**
*  The categorical feature is not ordinal (e.g., countries such as India, USA, UK, France).

*  The number of categories is small, so one-hot encoding can be applied efficiently.

**Apply Label Encoding when:**
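*  The categorical feature is ordinal (e.g., Jr. kg, Sr. kg, Primary School, High School). The integer codes must then follow the real order of the categories, which alphabetical assignment does not guarantee.

*  The number of categories is quite large, since one-hot encoding would then lead to high memory consumption.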
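
For an ordinal feature, the order of the integer codes matters. The sketch below uses scikit-learn's `OrdinalEncoder` with an explicit category order instead of `LabelEncoder`, which would sort the categories alphabetically. The education levels are taken from the example above; the sample rows are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Explicit order from lowest to highest level.
levels = ["Jr. kg", "Sr. kg", "Primary School", "High School"]

# Toy samples for illustration only.
samples = np.array([["High School"], ["Jr. kg"], ["Primary School"]])

# Passing `categories` fixes the mapping: Jr. kg -> 0, ..., High School -> 3.
encoder = OrdinalEncoder(categories=[levels])
codes = encoder.fit_transform(samples)

print(codes.ravel())  # [3. 0. 2.]
```

Alphabetical ordering would instead place "High School" first, which would invert part of the intended ranking.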