
## Data Encoding

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

### Nominal/OHE Encoding
One-hot encoding (OHE), also called nominal encoding, represents categorical data as numbers so that machine learning algorithms can use it. Each category becomes a binary vector with one position per unique category: the position for the observed category is 1 and all others are 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
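
The mapping above can be reproduced by hand before reaching for a library. This is a minimal sketch in plain Python; the helper name `one_hot` and the fixed category order are assumptions made for illustration and are not part of scikit-learn.

```python
# Hand-rolled one-hot encoding for the three colors listed above.
# The category order here is chosen to match the list: red, green, blue.
categories = ["red", "green", "blue"]


def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


for color in categories:
    print(color, one_hot(color, categories))
```

Note that the column order is a free choice: any fixed order works as long as it is used consistently.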

In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [3]:
# Create a DataFrame with a single categorical column
df = pd.DataFrame({
    'color':['red','blue','green','green','red','blue']
})

In [4]:
df.head()

,color
0,red
1,blue
2,green
3,green
4,red


In [5]:
# Create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [6]:
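# Fit the encoder and transform the column; toarray() converts the sparse result to a dense array
encoded_value = encoder.fit_transform(df[['color']]).toarray()
# Output columns follow the alphabetically sorted categories: blue, green, red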

In [7]:
encoded_value

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
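
To check which output column belongs to which category, the fitted encoder can be inspected directly. The sketch below rebuilds the same small DataFrame so that it runs on its own; the variable names `column_order` and `recovered` are illustrative. It uses the `categories_` attribute and `inverse_transform`, both part of scikit-learn's `OneHotEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "blue", "green", "green", "red", "blue"]})
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["color"]]).toarray()

# categories_ holds one array per input column; its order is the output column order
column_order = list(encoder.categories_[0])
print(column_order)

# inverse_transform maps the one-hot rows back to the original labels
recovered = encoder.inverse_transform(encoded).ravel().tolist()
print(recovered)
```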

In [9]:
encoder_df=pd.DataFrame(encoded_value,columns=encoder.get_feature_names_out())

In [10]:
encoder_df

,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


In [11]:
# Encode a new value; it must be one of the categories seen during fit (red, green, blue)
encoder.transform([['red']]).toarray()



array([[0., 0., 1.]])
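
The call above works because `red` was seen during fitting. The following sketch, which assumes an unseen value `yellow` purely for illustration, shows what happens otherwise: with default settings the encoder raises a `ValueError`, while `handle_unknown="ignore"` encodes the unseen value as a row of zeros.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "blue", "green", "green", "red", "blue"]})
new_data = pd.DataFrame({"color": ["yellow"]})  # hypothetical unseen category

# Default behaviour: an unseen category raises ValueError at transform time
strict = OneHotEncoder().fit(df[["color"]])
try:
    strict.transform(new_data)
    raised = False
except ValueError:
    raised = True

# handle_unknown="ignore": an unseen category becomes an all-zero row
lenient = OneHotEncoder(handle_unknown="ignore").fit(df[["color"]])
unknown_row = lenient.transform(new_data).toarray()
print(raised, unknown_row)
```

Whether silently ignoring unknown categories is acceptable depends on the application; raising an error is the safer default when unexpected values indicate a data problem.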

In [12]:
pd.concat([df,encoder_df],axis=1)

,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,red,0.0,0.0,1.0
5,blue,1.0,0.0,0.0


In [13]:
import seaborn as sns
sns.load_dataset('tips')

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [14]:
df = sns.load_dataset('tips')

In [15]:
df.head()

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [18]:
df['time'].unique()

['Dinner', 'Lunch']
Categories (2, object): ['Lunch', 'Dinner']

In [19]:
df['day'].unique()

['Sun', 'Sat', 'Thur', 'Fri']
Categories (4, object): ['Thur', 'Fri', 'Sat', 'Sun']

In [20]:
df['time'].value_counts()

time,count
Dinner,176
Lunch,68


In [21]:
# Create an instance of OneHotEncoder
new_encoder = OneHotEncoder()

In [22]:
new_value = new_encoder.fit_transform(df[['time','day']]).toarray()

In [23]:
new_value


array([[1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0.],
       ...,
       [1., 0., 0., 1., 0., 0.],
       [1., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 1.]])

In [25]:
encoder_df1=pd.DataFrame(new_value,columns=new_encoder.get_feature_names_out())

In [26]:
encoder_df1

,time_Dinner,time_Lunch,day_Fri,day_Sat,day_Sun,day_Thur
0,1.0,0.0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...
239,1.0,0.0,0.0,1.0,0.0,0.0
240,1.0,0.0,0.0,1.0,0.0,0.0
241,1.0,0.0,0.0,1.0,0.0,0.0
242,1.0,0.0,0.0,1.0,0.0,0.0
