### Data Encoding
1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

### Nominal/OHE Encoding
Consider a categorical feature called "color" with values such as red, green, and blue. It is a nominal categorical feature because its values are distinct labels with no inherent order.

To enable the model to understand this feature, one hot encoding converts it into numerical features by creating new binary features for each category. For example, the categories red, green, and blue become three new features: red, green, and blue.

For each data point, the feature corresponding to its category is set to 1, and the others are set to 0. For instance, if the color is red, the red feature is 1, and green and blue are 0.

#### Disadvantages of One Hot Encoding
- If a categorical feature has many categories, say 100, one hot encoding creates 100 new features, which increases dimensionality.
- The result is a sparse matrix: each row contains a single 1 and zeros everywhere else.
- With many more columns and the same number of rows, the model has more parameters to fit per sample, which can lead to overfitting: it fits the training data closely but performs poorly on new data.
- For this reason, one hot encoding is usually not recommended for features with many categories.
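The growth in dimensionality described above can be checked directly. The following is a minimal sketch, assuming a made-up feature with 100 category names (`cat_0` to `cat_99`) and 1,000 rows; none of this data comes from the notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature: 100 distinct category names, 1,000 rows in total
rng = np.random.default_rng(0)
categories = np.array([f"cat_{i}" for i in range(100)])

# Include every category once, then fill the rest with random picks
values = np.concatenate([categories, rng.choice(categories, size=900)])
X = values.reshape(-1, 1)

# By default OneHotEncoder returns a SciPy sparse matrix
encoder = OneHotEncoder()
encoded = encoder.fit_transform(X)

n_rows, n_cols = encoded.shape
density = encoded.nnz / (n_rows * n_cols)  # share of non-zero cells

print(encoded.shape)  # (1000, 100)
print(density)        # 0.01
```

One input column becomes 100 output columns, and only 1% of the cells are non-zero, since each row holds exactly one 1.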

#### Implementing One Hot Encoding in Python
One hot encoding, also known as nominal encoding, represents categorical data as numerical data suitable for machine learning algorithms. Each category is represented as a binary vector where each bit corresponds to a unique category.

For example, a color variable with categories red, green, and blue can be represented as binary vectors such as 100, 010, and 001 respectively.
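To make the binary-vector idea concrete, here is a small sketch in plain Python, without any library. The helper `one_hot` is a hypothetical name introduced only for illustration, and the category order (red, green, blue) is chosen to match the example above.

```python
def one_hot(value, categories):
    """Return a list of 0/1 flags with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# Category order decides which position each category occupies
categories = ['red', 'green', 'blue']
vectors = {category: one_hot(category, categories) for category in categories}

print(vectors['red'])    # [1, 0, 0]
print(vectors['green'])  # [0, 1, 0]
print(vectors['blue'])   # [0, 0, 1]
```

scikit-learn's `OneHotEncoder` sorts the categories it finds, so with the same data its columns come out in the order blue, green, red rather than the order used here.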

In [3]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [4]:
## Create a simple dataframe
df = pd.DataFrame({
    'color':['red','green','blue','green','red','blue']
})

In [5]:
df.head()

|   | color |
|---|-------|
| 0 | red   |
| 1 | green |
| 2 | blue  |
| 3 | green |
| 4 | red   |


In [6]:
df['color'].unique()

array(['red', 'green', 'blue'], dtype=object)

In [7]:
#create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [9]:
#perform fit and transform
encoded = encoder.fit_transform(df[['color']]).toarray()

In [10]:
#wrap the encoded array in a DataFrame with readable column names
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

In [11]:
encoder_df

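|   | color_blue | color_green | color_red |
|---|------------|-------------|-----------|
| 0 | 0.0        | 0.0         | 1.0       |
| 1 | 0.0        | 1.0         | 0.0       |
| 2 | 1.0        | 0.0         | 0.0       |
| 3 | 0.0        | 1.0         | 0.0       |
| 4 | 0.0        | 0.0         | 1.0       |
| 5 | 1.0        | 0.0         | 0.0       |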


In [13]:
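#for new data: pass a DataFrame with the same column name used during fit,
#which avoids scikit-learn's feature-name warning
encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()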



array([[1., 0., 0.]])
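The transform above succeeds because blue was seen during fitting. With the default settings, a category that was not in the training data raises an error. A minimal sketch of one way to handle this, using the `handle_unknown='ignore'` option of `OneHotEncoder`; the name `safe_encoder` and the unseen value purple are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'color': ['red', 'green', 'blue', 'green', 'red', 'blue']})

# handle_unknown='ignore' encodes unseen categories as an all-zero row
# instead of raising an error at transform time
safe_encoder = OneHotEncoder(handle_unknown='ignore')
safe_encoder.fit(train[['color']])

# 'purple' never appeared during fit (hypothetical new value)
new_data = pd.DataFrame({'color': ['blue', 'purple']})
result = safe_encoder.transform(new_data).toarray()

print(result)
# [[1. 0. 0.]
#  [0. 0. 0.]]
```

Whether an all-zero row is acceptable depends on the model; failing loudly with the default `handle_unknown='error'` can be the safer choice when unseen categories signal a data problem.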

In [14]:
pd.concat([df, encoder_df], axis=1)

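|   | color | color_blue | color_green | color_red |
|---|-------|------------|-------------|-----------|
| 0 | red   | 0.0        | 0.0         | 1.0       |
| 1 | green | 0.0        | 1.0         | 0.0       |
| 2 | blue  | 1.0        | 0.0         | 0.0       |
| 3 | green | 0.0        | 1.0         | 0.0       |
| 4 | red   | 0.0        | 0.0         | 1.0       |
| 5 | blue  | 1.0        | 0.0         | 0.0       |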


In [15]:
import seaborn as sns
sns.load_dataset('tips')

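|     | total_bill | tip  | sex    | smoker | day | time   | size |
|-----|------------|------|--------|--------|-----|--------|------|
| 0   | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1   | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2   | 21.01      | 3.50 | Male   | No     | Sun | Dinner | 3    |
| 3   | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4   | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |
| ... | ...        | ...  | ...    | ...    | ... | ...    | ...  |
| 239 | 29.03      | 5.92 | Male   | No     | Sat | Dinner | 3    |
| 240 | 27.18      | 2.00 | Female | Yes    | Sat | Dinner | 2    |
| 241 | 22.67      | 2.00 | Male   | Yes    | Sat | Dinner | 2    |
| 242 | 17.82      | 1.75 | Male   | No     | Sat | Dinner | 2    |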


### Label Encoding
- Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.
- Label encoding assigns a unique integer label to each category in a variable. The labels are usually assigned in alphabetical order or based on the frequency of the categories. For example, a categorical variable "color" with three possible values (red, green, blue) can be label encoded as follows (the order here is arbitrary; an alphabetical scheme would number blue first):
  - red → 1
  - green → 2
  - blue → 3
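As a minimal sketch of label encoding in code, scikit-learn's `LabelEncoder` can be applied to the same color values used earlier. Note that it sorts the categories and numbers them from 0, so its labels differ from the 1 to 3 numbering in the example above. The variable names here are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'green', 'blue', 'green', 'red', 'blue']

# LabelEncoder sorts the unique values and assigns integers starting at 0
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(colors)

# Build a readable category -> label mapping from the fitted classes
mapping = {str(category): index
           for index, category in enumerate(label_encoder.classes_)}

print(mapping)          # {'blue': 0, 'green': 1, 'red': 2}
print(labels.tolist())  # [2, 1, 0, 1, 2, 0]
```

The scikit-learn documentation describes `LabelEncoder` as intended for target values; for input features, `OrdinalEncoder` is the equivalent tool.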