# Data Encoding

Data encoding is the process of converting categorical data into a numerical format that can be easily understood and processed by machine learning algorithms. 

Categorical data refers to variables that represent distinct categories or groups, such as colors, types, or labels. Since most machine learning algorithms require numerical input, encoding categorical data is essential for effective model training and prediction.

There are several techniques for encoding categorical data, including:

1. **Label Encoding**: This technique assigns a unique integer to each category. For example, a color feature with categories "red", "green", and "blue" could be mapped to 0, 1, and 2, respectively. The exact integers depend on the tool: scikit-learn's `LabelEncoder` sorts the categories first, so it would assign blue=0, green=1, and red=2.

2. **One-Hot Encoding**: This method creates binary columns for each category. Using the same color example, one-hot encoding would create three new columns: "is_red", "is_green", and "is_blue", with 1s and 0s indicating the presence of each color.

3. **Target Encoding**: In this method, we replace each category with the mean of the target variable for that category. This can be useful for high-cardinality features but may lead to overfitting if not done carefully.
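The three techniques listed above can be sketched with pandas alone. This is a minimal illustration, not a recommended pipeline: the `toy` frame, its column names, and its values are assumptions made up for the example. Note that pandas orders categories alphabetically, so the integer codes differ from the listing order used in the text.

```python
import pandas as pd

# Toy data; the column names and values are illustrative assumptions.
toy = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red"],
    "target": [1, 0, 1, 0, 0],
})

# 1. Label encoding: one integer per category
#    (pandas sorts the categories, so blue=0, green=1, red=2).
label_codes = toy["color"].astype("category").cat.codes

# 2. One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(toy["color"], prefix="is", dtype=int)

# 3. Target encoding: replace each category with the mean target of that category.
target_means = toy.groupby("color")["target"].mean()
target_encoded = toy["color"].map(target_means)

print(label_codes.tolist())
print(one_hot)
print(target_encoded.tolist())
```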

Choosing the right encoding technique depends on the specific dataset and the machine learning algorithm being used. It's important to experiment with different methods and evaluate their impact on model performance.

## Nominal/One-Hot Encoding

Nominal data is a type of categorical data where the categories do not have a specific order or ranking. 

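Examples of nominal data include colors (red, blue, green), types of animals (dog, cat, bird), or brands of cars (Toyota, Ford, BMW). One-hot encoding suits such data because each category gets its own binary column, so no artificial order is imposed.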

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green']
})

df

|   | Color |
|---|-------|
| 0 | Red   |
| 1 | Blue  |
| 2 | Green |
| 3 | Blue  |
| 4 | Red   |
| 5 | Green |


In [7]:
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(df[['Color']]).toarray()
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Color']))
encoded_df

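|   | Color_Blue | Color_Green | Color_Red |
|---|------------|-------------|-----------|
| 0 | 0.0        | 0.0         | 1.0       |
| 1 | 1.0        | 0.0         | 0.0       |
| 2 | 0.0        | 1.0         | 0.0       |
| 3 | 1.0        | 0.0         | 0.0       |
| 4 | 0.0        | 0.0         | 1.0       |
| 5 | 0.0        | 1.0         | 0.0       |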


In [9]:
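# Transform new data with the fitted encoder.
# Passing a DataFrame with the same column name avoids scikit-learn's feature-name warning.
encoder.transform(pd.DataFrame({'Color': ['Red', 'Green']})).toarray()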

array([[0., 0., 1.],
       [0., 1., 0.]])
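New data may also contain categories the encoder never saw during fitting. By default `OneHotEncoder` raises an error in that case. The sketch below assumes that an all-zero row is an acceptable representation for unknown values and uses `handle_unknown="ignore"`; the category "Purple" and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Color": ["Red", "Blue", "Green"]})
# "Purple" is a hypothetical category that was not present during fit.
new_data = pd.DataFrame({"Color": ["Red", "Purple"]})

# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising.
safe_encoder = OneHotEncoder(handle_unknown="ignore")
safe_encoder.fit(train)

encoded_new = safe_encoder.transform(new_data).toarray()
print(encoded_new)
```

Whether silently zeroing unknown categories is appropriate depends on the model; failing loudly (the default) can be the safer choice when unseen values indicate a data problem.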

In [10]:
pd.concat([df, encoded_df], axis=1)

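|   | Color | Color_Blue | Color_Green | Color_Red |
|---|-------|------------|-------------|-----------|
| 0 | Red   | 0.0        | 0.0         | 1.0       |
| 1 | Blue  | 1.0        | 0.0         | 0.0       |
| 2 | Green | 0.0        | 1.0         | 0.0       |
| 3 | Blue  | 1.0        | 0.0         | 0.0       |
| 4 | Red   | 0.0        | 0.0         | 1.0       |
| 5 | Green | 0.0        | 1.0         | 0.0       |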


## Label Encoding

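Label encoding converts categorical data into numerical format by assigning a unique integer to each category.

scikit-learn's `LabelEncoder` assigns the integers in sorted (alphabetical) order and is designed for encoding target labels. The integers imply an order that nominal features such as color do not have, so use it on input features with care. For ordinal data, where the categories have a specific order or ranking, `OrdinalEncoder` with an explicit category order is the better fit.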

In [11]:
df.head()

|   | Color |
|---|-------|
| 0 | Red   |
| 1 | Blue  |
| 2 | Green |
| 3 | Blue  |
| 4 | Red   |


In [12]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['Color'] = label_encoder.fit_transform(df['Color'])
df.head()

|   | Color |
|---|-------|
| 0 | 2     |
| 1 | 0     |
| 2 | 1     |
| 3 | 0     |
| 4 | 2     |
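To see which integer stands for which category, and to convert the integers back, the fitted encoder can be inspected. This is a small sketch on a standalone list; the variable names are assumptions for illustration.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["Red", "Blue", "Green", "Blue", "Red"]

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the categories in sorted order; the position is the assigned integer.
mapping = {str(label): code for code, label in enumerate(le.classes_)}

# inverse_transform recovers the original labels from the integer codes.
decoded = le.inverse_transform(codes)

print(mapping)
print(list(decoded))
```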


## Ordinal Encoding

Ordinal encoding converts ordinal categorical data into numerical format by assigning integers that follow the order of the categories.

In [13]:
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small', 'Large']
})

df

|   | Size   |
|---|--------|
| 0 | Small  |
| 1 | Medium |
| 2 | Large  |
| 3 | Medium |
| 4 | Small  |
| 5 | Large  |


In [14]:
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_Encoded'] = encoder.fit_transform(df[['Size']])
df

|   | Size   | Size_Encoded |
|---|--------|--------------|
| 0 | Small  | 0.0          |
| 1 | Medium | 1.0          |
| 2 | Large  | 2.0          |
| 3 | Medium | 1.0          |
| 4 | Small  | 0.0          |
| 5 | Large  | 2.0          |
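Two details of `OrdinalEncoder` are worth checking in practice: what happens without an explicit `categories` list, and how unseen values are handled. The sketch below assumes that mapping unknown sizes to -1 is acceptable; "Extra Large" and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"Size": ["Small", "Medium", "Large"]})
# "Extra Large" is a hypothetical value not seen during fit.
new_sizes = pd.DataFrame({"Size": ["Large", "Extra Large"]})

# Without an explicit order, categories are sorted alphabetically,
# which does not match the natural Small < Medium < Large ranking.
default_encoder = OrdinalEncoder().fit(train)
print(default_encoder.categories_[0])

# Explicit order, with unseen categories mapped to -1 instead of raising an error.
ordered_encoder = OrdinalEncoder(
    categories=[["Small", "Medium", "Large"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
ordered_encoder.fit(train)
encoded_sizes = ordered_encoder.transform(new_sizes)
print(encoded_sizes)
```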


## Target Guided Ordinal Encoding

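Target Guided Ordinal Encoding encodes a categorical variable, typically a nominal one with many categories, by using the target variable's distribution to give the categories an order.

Each category is summarized by a statistic of the target, such as its mean, and the categories are then ranked by that statistic and assigned integers. The example below stops at the first step and maps each city directly to its mean price.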

In [16]:
df = pd.DataFrame({
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York', 'Philadelphia'],
    'price': [700000, 800000, 600000, 500000, 400000, 300000]
})

df

|   | city         | price  |
|---|--------------|--------|
| 0 | New York     | 700000 |
| 1 | Los Angeles  | 800000 |
| 2 | Chicago      | 600000 |
| 3 | Houston      | 500000 |
| 4 | New York     | 400000 |
| 5 | Philadelphia | 300000 |


In [17]:
df.groupby('city')['price'].mean().sort_values()

city
Philadelphia    300000.0
Houston         500000.0
New York        550000.0
Chicago         600000.0
Los Angeles     800000.0
Name: price, dtype: float64
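The sorted means above already define an order over the cities, so they can be turned into integer ranks. The following is a minimal sketch of that ranking step on a copy of the same data; `df_rank`, `rank_map`, and `city_rank` are names introduced here for illustration.

```python
import pandas as pd

df_rank = pd.DataFrame({
    "city": ["New York", "Los Angeles", "Chicago", "Houston", "New York", "Philadelphia"],
    "price": [700000, 800000, 600000, 500000, 400000, 300000],
})

# Order the cities by mean price, lowest first.
ordered_cities = df_rank.groupby("city")["price"].mean().sort_values().index

# Assign 0 to the city with the lowest mean price, 1 to the next, and so on.
rank_map = {city: rank for rank, city in enumerate(ordered_cities)}
df_rank["city_rank"] = df_rank["city"].map(rank_map)

print(rank_map)
print(df_rank)
```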

In [18]:
mean_prices = df.groupby('city')['price'].mean().to_dict()
mean_prices

{'Chicago': 600000.0,
 'Houston': 500000.0,
 'Los Angeles': 800000.0,
 'New York': 550000.0,
 'Philadelphia': 300000.0}

In [19]:
df['city_encoded'] = df['city'].map(mean_prices)
df

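|   | city         | price  | city_encoded |
|---|--------------|--------|--------------|
| 0 | New York     | 700000 | 550000.0     |
| 1 | Los Angeles  | 800000 | 800000.0     |
| 2 | Chicago      | 600000 | 600000.0     |
| 3 | Houston      | 500000 | 500000.0     |
| 4 | New York     | 400000 | 550000.0     |
| 5 | Philadelphia | 300000 | 300000.0     |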
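Most cities in this table appear only once, so their encoded value is simply their own price, which is the overfitting risk mentioned for target encoding. One common mitigation is to shrink each category mean toward the global mean. The sketch below assumes a simple additive smoothing formula with a hypothetical weight `m = 2`; both the formula choice and the value of `m` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

df_smooth = pd.DataFrame({
    "city": ["New York", "Los Angeles", "Chicago", "Houston", "New York", "Philadelphia"],
    "price": [700000, 800000, 600000, 500000, 400000, 300000],
})

m = 2  # hypothetical smoothing weight: larger values pull harder toward the global mean
global_mean = df_smooth["price"].mean()

# Per-city mean and number of rows.
stats = df_smooth.groupby("city")["price"].agg(["mean", "count"])

# Smoothed mean = (count * category_mean + m * global_mean) / (count + m)
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

df_smooth["city_smoothed"] = df_smooth["city"].map(smoothed)
print(df_smooth)
```

In a real workflow the statistics would be computed on the training split only (or out of fold), so that a row's own target value does not leak into its encoded feature.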
