## **<ins style="color:green">Encoding Categorical Data | Ordinal Encoding | Label Encoding</ins>**
- **Data :**
  - **Numerical Data**
    - Numbers such as _Age_, _Salary_, _Height_, _Weight_, etc.
  - **Categorical Data**
    - Categories such as _Gender_, _Country_, etc.
    - __Types of Categorical Data :-__
        1. **Nominal** : `OneHotEncoding`
          - Name of state : UP, Maharashtra, Bihar, Gujarat
            - We cannot say that ___UP > Maharashtra___ or ___Bihar < Gujarat___
          - Branch of Engineering : CSE, ME, ECE, CE, EE
            - We cannot say that CSE < ECE or ME > CE
          - One-hot encoding creates a separate column for each category. In every row, the column of that row's category is set to 1 and all the other columns are set to 0.
        2. **Ordinal** : `OrdinalEncoding`
          - Division : First, Second, Third
            - We can say that ___First > Second___ and ___Second > Third___
          - Name of degree : B.Tech., M.Tech.
            - We can say that ___B.Tech. < M.Tech.___
          - The order of the categories must be specified before using `OrdinalEncoding`, e.g. First=1, Second=2, Third=3 or B.Tech.=0 and M.Tech.=1.
- **LabelEncoding**
  - If the __attributes__ (input features) contain __categorical data__, use `OneHotEncoding` or `OrdinalEncoding`. If the __label__ (__class__) is __categorical__, use `LabelEncoding`.
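
The split between ordinal features and a categorical label can be sketched with scikit-learn's `OrdinalEncoder` and `LabelEncoder`. This is a minimal illustration on made-up data: the column names `division`, `degree` and `placed` are assumptions and are not part of the cars dataset used below. The category lists are written from lowest to highest rank, so the codes here follow ___First > Second > Third___ (Third=0, Second=1, First=2).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy data; column names and values are hypothetical.
toy = pd.DataFrame({
    "division": ["First", "Third", "Second", "First"],
    "degree": ["B.Tech", "M.Tech", "B.Tech", "M.Tech"],
    "placed": ["yes", "no", "yes", "yes"],  # the label / class
})

# One category list per column, ordered from lowest to highest rank.
ordinal = OrdinalEncoder(categories=[
    ["Third", "Second", "First"],
    ["B.Tech", "M.Tech"],
])
X_ord = ordinal.fit_transform(toy[["division", "degree"]])
print(X_ord)

# LabelEncoder is meant for the target column only.
# It assigns codes in sorted order of the class names.
label = LabelEncoder()
y = label.fit_transform(toy["placed"])
print(label.classes_, y)
```

Without the explicit `categories` argument, `OrdinalEncoder` would order the categories alphabetically, which rarely matches the intended ranking.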

## <ins style="color:red">**OneHotEncoder**</ins>
- #### Use with ___Nominal___ data.
- Red | Blue | Green | Yellow\
  R | B | G | Y  \
  1 | 0 | 0 | 0  \
  0 | 0 | 1 | 0  \
  0 | 1 | 0 | 0  \
  0 | 0 | 0 | 1
- **Dummy Variable Trap** : The one-hot columns of a feature are linearly dependent, so we drop one of them.
- **Multicollinearity** : A relation between one column and the other columns. Input columns should not be related mathematically, so we drop one column. \
  Here the values of R, B, G and Y always add up to 1 in every row. Dropping one column breaks this relation.

- ### **OHE Using Most Frequent Variables**
  - Keep only the most frequent categories and merge all the remaining ones into a single _other_ category.
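
The relation among the one-hot columns mentioned above can be checked numerically. The sketch below uses a made-up colour column (not the cars data) and compares the rank of the design matrix, including an intercept column of ones, with and without dropping one dummy column.

```python
import numpy as np
import pandas as pd

# Toy colour column; the values are made up for illustration.
colours = pd.Series(["Red", "Green", "Blue", "Yellow", "Red", "Blue"])

# Full one-hot encoding: every row sums to exactly 1.
full = pd.get_dummies(colours).astype(int)
print(full.sum(axis=1).unique())

# Add an intercept column of ones, as a linear model would.
with_intercept = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank_full = np.linalg.matrix_rank(with_intercept)
print(with_intercept.shape[1], rank_full)  # more columns than rank

# Dropping one dummy column removes the linear dependence.
reduced = pd.get_dummies(colours, drop_first=True).astype(int)
reduced_with_intercept = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(reduced_with_intercept)
print(reduced_with_intercept.shape[1], rank_reduced)  # columns equal rank
```

When the rank is smaller than the number of columns, the columns are linearly dependent, which is the situation the dummy variable trap refers to.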
  

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport

In [2]:
df = pd.read_csv("../data/cars.csv")
df

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000
...,...,...,...,...,...
8123,Hyundai,110000,Petrol,First Owner,320000
8124,Hyundai,119000,Diesel,Fourth & Above Owner,135000
8125,Maruti,120000,Diesel,First Owner,382000
8126,Tata,25000,Diesel,First Owner,290000


In [3]:
df['brand'].value_counts()

Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Peugeot             1
Name: brand, dtype: int64

In [4]:
df.brand.nunique()


32

In [5]:
df['fuel'].value_counts()

Diesel    4402
Petrol    3631
CNG         57
LPG         38
Name: fuel, dtype: int64

In [6]:
df['owner'].value_counts()

First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: owner, dtype: int64

In [7]:
df['brand'].nunique()

32

In [8]:
df['owner'].nunique()

5

### **OneHotEncoder**

### **1. OneHotEncoder Using Pandas**
- By default `pd.get_dummies` does not avoid the dummy variable trap: it keeps every dummy column and does not drop the first one.
- In a machine learning project we usually do not use `pd.get_dummies(df, columns=[col1, col2], drop_first=True)`, because pandas does not remember which categories it has seen. If you run it again on different data (for example the test set or new data), the set of dummy columns can differ.
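
A small sketch of how `pd.get_dummies` can yield different columns for different subsets of the data. The two tiny frames are made up; they only borrow the fuel names from the dataset. The `reindex` call at the end is one possible workaround, shown as an assumption rather than as the approach this notebook takes.

```python
import pandas as pd

# Hypothetical train/test subsets with different fuel categories.
train = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG", "Diesel"]})
test = pd.DataFrame({"fuel": ["Petrol", "LPG"]})

train_d = pd.get_dummies(train, columns=["fuel"])
test_d = pd.get_dummies(test, columns=["fuel"])
print(list(train_d.columns))  # ['fuel_CNG', 'fuel_Diesel', 'fuel_Petrol']
print(list(test_d.columns))   # ['fuel_LPG', 'fuel_Petrol']

# Workaround: force the test frame onto the training columns.
# Missing columns are filled with 0; columns unseen in training are dropped.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_aligned.columns))
```

A model trained on `train_d` expects exactly those three columns, so the unaligned `test_d` could not be passed to it.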

In [9]:
df1 = pd.get_dummies(df, columns=['fuel', 'owner'])
df1

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,0,1,0,0,1,0,0,0,0
1,Skoda,120000,370000,0,1,0,0,0,0,1,0,0
2,Honda,140000,158000,0,0,0,1,0,0,0,0,1
3,Hyundai,127000,225000,0,1,0,0,1,0,0,0,0
4,Maruti,120000,130000,0,0,0,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,0,0,0,1,1,0,0,0,0
8124,Hyundai,119000,135000,0,1,0,0,0,1,0,0,0
8125,Maruti,120000,382000,0,1,0,0,1,0,0,0,0
8126,Tata,25000,290000,0,1,0,0,1,0,0,0,0


In [10]:
df1.shape

(8128, 12)

### **2. K-1 OneHotEncoding**
- With ___drop_first=True___, the first dummy column of each encoded variable is removed.

In [11]:
df2 = pd.get_dummies(df, columns=['fuel', 'owner'], drop_first=True)
df2

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,1,0,0,0,0,0,0
1,Skoda,120000,370000,1,0,0,0,1,0,0
2,Honda,140000,158000,0,0,1,0,0,0,1
3,Hyundai,127000,225000,1,0,0,0,0,0,0
4,Maruti,120000,130000,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,0,0,1,0,0,0,0
8124,Hyundai,119000,135000,1,0,0,1,0,0,0
8125,Maruti,120000,382000,1,0,0,0,0,0,0
8126,Tata,25000,290000,1,0,0,0,0,0,0


In [12]:
df2.shape

(8128, 10)

### **3. OneHotEncoding Using Sklearn**
- In a machine learning project we prefer sklearn's `OneHotEncoder` over `pd.get_dummies(df, columns=[col1, col2], drop_first=True)`. The encoder learns the categories from the training data in `fit` and then produces the same columns for any data passed to `transform`.

In [13]:
X = df.iloc[:, :-1]
Y = df.iloc[:, -1]

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=8)
X_train

Unnamed: 0,brand,km_driven,fuel,owner
5220,Maruti,10000,Petrol,First Owner
2231,Skoda,25000,Petrol,First Owner
3485,Tata,79000,Diesel,First Owner
7655,Mahindra,60000,Diesel,First Owner
7652,Maruti,70000,Petrol,First Owner
...,...,...,...,...
2181,Mahindra,100000,Diesel,First Owner
2409,Hyundai,60000,Petrol,Second Owner
2033,Maruti,25000,Petrol,First Owner
1364,Mahindra,117000,Diesel,First Owner


In [15]:
y_train

5220     579000
2231     725000
3485     325000
7655    1000000
7652     250000
         ...   
2181     500000
2409     200000
2033     420000
1364     800000
4547     750000
Name: selling_price, Length: 6502, dtype: int64

In [16]:
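from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop='first', dtype=np.int32)
# drop='first' : remove the first category column of each feature
#   (drop='if_binary' removes a column only for features with two categories)
# sparse_output=False (sparse=False before scikit-learn 1.2) : return a dense array, so .toarray() is not needed
# dtype=np.int32 : output int32 values instead of the default float64
# Fit on the training data only, then reuse the fitted encoder on the test data
X_train_ohe = ohe.fit_transform(X_train[['fuel', 'owner']]).toarray()
X_test_ohe = ohe.transform(X_test[['fuel', 'owner']]).toarray()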

In [17]:
X_train_ohe.shape

(6502, 7)

In [18]:
X_test_ohe.shape

(1626, 7)

#### Dropping the encoded nominal columns and appending the one-hot encoded columns

In [34]:
X_train[['brand', 'km_driven']] # Pandas DataFrame
X_train[['brand', 'km_driven']].values  # Numpy array
X_train_ohe # Numpy array

array([[0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int32)

In [19]:
# train data set
X_train_final = np.hstack((X_train[['brand', 'km_driven']].values, X_train_ohe))
X_train_final.shape

(6502, 9)

In [20]:
X_train_final

array([['Maruti', 10000, 0, ..., 0, 0, 0],
       ['Skoda', 25000, 0, ..., 0, 0, 0],
       ['Tata', 79000, 1, ..., 0, 0, 0],
       ...,
       ['Maruti', 25000, 0, ..., 0, 0, 0],
       ['Mahindra', 117000, 1, ..., 0, 0, 0],
       ['Maruti', 90000, 1, ..., 0, 0, 0]], dtype=object)

In [35]:
X_test[['brand', 'km_driven']] # Pandas DataFrame
X_test[['brand', 'km_driven']].values  # Numpy array
X_test_ohe # Numpy array

array([[0, 0, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]], dtype=int32)

In [21]:
# test data set
X_test_final = np.hstack((X_test[['brand', 'km_driven']].values, X_test_ohe))
X_test_final.shape

(1626, 9)

In [22]:
X_test_final

array([['Hyundai', 39414, 0, ..., 0, 0, 0],
       ['Hyundai', 60000, 1, ..., 1, 0, 0],
       ['Maruti', 50000, 1, ..., 0, 0, 0],
       ...,
       ['Honda', 7032, 0, ..., 0, 0, 0],
       ['Renault', 18000, 1, ..., 0, 0, 0],
       ['Renault', 22000, 0, ..., 0, 0, 0]], dtype=object)
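
The manual `np.hstack` step above can also be expressed with scikit-learn's `ColumnTransformer`, which encodes the chosen columns and passes the rest through in one call. The sketch below is an alternative, not what this notebook does; it runs on four rows copied from the head of the dataset. Note that the column order differs from the `np.hstack` result: the encoded columns come first and the passed-through columns (`brand`, `km_driven`) come last.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Four rows taken from the head of the cars data, for illustration only.
toy = pd.DataFrame({
    "brand": ["Maruti", "Skoda", "Honda", "Hyundai"],
    "km_driven": [145500, 120000, 140000, 127000],
    "fuel": ["Diesel", "Diesel", "Petrol", "Diesel"],
    "owner": ["First Owner", "Second Owner", "Third Owner", "First Owner"],
})

ct = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(drop="first"), ["fuel", "owner"])],
    remainder="passthrough",  # keep brand and km_driven unchanged
    sparse_threshold=0,       # always return a dense array
)

out = ct.fit_transform(toy)
print(out.shape)
print(out)
```

On a real split, `ct.fit_transform(X_train)` followed by `ct.transform(X_test)` would keep the train and test columns consistent.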

### **4. OneHotEncoder with Top Categories**

In [23]:
# count how often each brand occurs; a threshold is applied to these counts below
counts = df['brand'].value_counts()

In [24]:
counts

Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Peugeot             1
Name: brand, dtype: int64

In [25]:
df['brand'].nunique()
threshold = 100

In [26]:
repl = counts[counts <= threshold].index

In [27]:
repl

Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Force', 'Land', 'Isuzu', 'Kia',
       'Ambassador', 'Daewoo', 'MG', 'Ashok', 'Opel', 'Peugeot'],
      dtype='object')

In [28]:
dfn = pd.get_dummies(df['brand'].replace(repl, 'uncommon'))
dfn.sample(10)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
7578,0,0,0,0,0,0,1,0,0,0,0,0,0
3112,0,0,0,0,1,0,0,0,0,0,0,0,0
986,0,1,0,0,0,0,0,0,0,0,0,0,0
6491,0,0,0,0,0,0,1,0,0,0,0,0,0
2749,0,0,0,0,0,0,1,0,0,0,0,0,0
2892,0,0,0,0,0,0,0,0,0,0,0,0,1
5768,0,0,0,0,1,0,0,0,0,0,0,0,0
3949,0,0,0,0,0,0,1,0,0,0,0,0,0
7358,0,0,0,0,0,0,1,0,0,0,0,0,0
7775,0,0,0,0,0,0,0,0,0,0,0,0,1


In [29]:
df

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000
...,...,...,...,...,...
8123,Hyundai,110000,Petrol,First Owner,320000
8124,Hyundai,119000,Diesel,Fourth & Above Owner,135000
8125,Maruti,120000,Diesel,First Owner,382000
8126,Tata,25000,Diesel,First Owner,290000
