## **<ins style="color:green">Encoding Categorical Data | Ordinal Encoding | Label Encoding</ins>**
- **Data :**
  - **Numerical Data**
    - Numbers like _Age_, _Salary_, _Height_, _Weight_, etc.
  - **Categorical Data**
    - Categories like _Gender_, _Country_, etc.
    - __Types of Categorical Data :-__
        1. **Nominal** : `OneHotEncoding`
          - Name of state : UP, Maharashtra, Bihar, Gujarat
            - We cannot say that ___UP > Maharashtra___ or ___Bihar < Gujarat___
          - Branch of Engineering : CSE, ME, ECE, CE, EE
            - We cannot say that CSE < ECE or ME > CE
          - It creates a separate column for each category. Each row gets 1 in the column of its own category and 0 in all the other columns.
        2. **Ordinal** : `OrdinalEncoding`
          - Division : First, Second, Third
            - We can say that ___First > Second___ and ___Second > Third___
          - Name of degree : B.Tech., M.Tech.
            - We can say that ___B.Tech. < M.Tech.___
          - We have to specify the mapping before using `OrdinalEncoding`, for example First=1, Second=2, Third=3 or B.Tech.=0 and M.Tech.=1.
- **LabelEncoding**
  - If the __Attributes__ (input features) contain __Categorical Data__, use `OneHotEncoding` or `OrdinalEncoding`. But if the __Label__ (or __Class__) contains __Categorical Data__, use `LabelEncoding`.
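
The one-hot idea described above for nominal data can be sketched as follows. This is a minimal illustration on made-up data (the `toy` frame is hypothetical, not part of `customer.csv`), and it uses `pandas.get_dummies` only for brevity; scikit-learn's `OneHotEncoder` is the more usual choice inside a train/test workflow.

```python
import pandas as pd

# Hypothetical toy data, not taken from customer.csv
toy = pd.DataFrame({"branch": ["CSE", "ME", "ECE", "CSE"]})

# One column per category; dtype=int gives 0/1 instead of True/False
one_hot = pd.get_dummies(toy["branch"], dtype=int)
print(one_hot)
```

Each row has exactly one 1, in the column of its own branch, so no artificial order between the branches is introduced.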

### **<ins style="color:red">OrdinalEncoder</ins>**
- Education : HS, UG, PG
  - We know that PG > UG > HS
  - So we specify the values as HS=0, UG=1, PG=2
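
As a rough sketch of what this mapping means, the same encoding can be done by hand with a dictionary. The `education` series and the `order` dictionary below are hypothetical names used only for illustration; the notebook itself uses `OrdinalEncoder` for this step.

```python
import pandas as pd

# Hypothetical values, using the HS / UG / PG example from the notes
education = pd.Series(["HS", "PG", "UG", "HS"])

# Explicit order: a larger number means a higher level of education
order = {"HS": 0, "UG": 1, "PG": 2}

encoded = education.map(order)
print(encoded.tolist())  # [0, 2, 1, 0]
```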

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport as pr

In [2]:
df = pd.read_csv("../data/customer.csv")
df.sample(5)

Unnamed: 0,age,gender,review,education,purchased
37,94,Male,Average,PG,Yes
29,83,Female,Average,UG,Yes
13,57,Female,Average,School,No
14,15,Male,Poor,PG,Yes
10,98,Female,Good,UG,Yes


In [3]:
df.shape

(50, 5)

In [4]:
prof = pr(df)
prof.to_file(output_file="customer.html")

In [5]:
df

Unnamed: 0,age,gender,review,education,purchased
0,30,Female,Average,School,No
1,68,Female,Poor,UG,No
2,70,Female,Good,PG,No
3,72,Female,Good,PG,No
4,16,Female,Average,UG,No
5,31,Female,Average,School,Yes
6,18,Male,Good,School,No
7,60,Female,Poor,School,Yes
8,65,Female,Average,UG,No
9,74,Male,Good,UG,Yes


In [6]:
df.isnull().sum()

age          0
gender       0
review       0
education    0
purchased    0
dtype: int64

In [7]:
df.nunique()

age          41
gender        2
review        3
education     3
purchased     2
dtype: int64

In [8]:
print('gender : ', df.gender.unique())
print('review : ', df.review.unique())
print('education : ', df.education.unique())
print('purchased : ', df.purchased.unique())

gender :  ['Female' 'Male']
review :  ['Average' 'Poor' 'Good']
education :  ['School' 'UG' 'PG']
purchased :  ['No' 'Yes']


In [9]:
df.tail(4)

Unnamed: 0,age,gender,review,education,purchased
46,64,Female,Poor,PG,No
47,38,Female,Good,PG,Yes
48,39,Female,Good,UG,Yes
49,25,Female,Good,UG,No


In [10]:
df = df.iloc[:, 2:]

In [11]:
df.tail(4)

Unnamed: 0,review,education,purchased
46,Poor,PG,No
47,Good,PG,Yes
48,Good,UG,Yes
49,Good,UG,No


In [12]:
from sklearn.model_selection import train_test_split
X = df.iloc[:, [0, 1]]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

### **OrdinalEncoder** : For the Attributes

In [13]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]]) # The order of each list defines the codes, one list per column
# [Poor, Average, Good] = [0, 1, 2]
# [School, UG, PG] = [0, 1, 2]
oe.fit(X_train)
X_train = pd.DataFrame(oe.transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(oe.transform(X_test), columns=X_test.columns)
X_train.shape, X_test.shape

((40, 2), (10, 2))

In [14]:
type(X_train), type(X_test)

(pandas.core.frame.DataFrame, pandas.core.frame.DataFrame)

In [15]:
oe.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
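
Because the fitted encoder remembers the category order, the integer codes can be turned back into the original strings with `inverse_transform`. The following is a small self-contained sketch on a hypothetical three-row frame (`toy` and `enc` are illustrative names), using the same category lists as above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows with the same two columns as X_train
toy = pd.DataFrame({
    "review": ["Poor", "Good", "Average"],
    "education": ["UG", "School", "PG"],
})

enc = OrdinalEncoder(categories=[["Poor", "Average", "Good"],
                                 ["School", "UG", "PG"]])

codes = enc.fit_transform(toy)           # float array of category positions
restored = enc.inverse_transform(codes)  # back to the original strings

print(codes)
print(restored)
```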

### <ins style="color:red">**LabelEncoder**</ins> : For the Labels / Class
- Encode target labels with values between 0 and n_classes-1

In [16]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder() # The order cannot be chosen: classes are sorted, so here No=0 and Yes=1
le.fit(y_train)
y_train = pd.DataFrame(le.transform(y_train), columns=['purchased'])
y_test = pd.DataFrame(le.transform(y_test), columns=['purchased'])
y_train.shape, y_test.shape

((40, 1), (10, 1))
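
To check which integer was assigned to which class, the fitted encoder exposes `classes_`, where the position of a class is its code. A minimal sketch on hypothetical labels (the `labels` list and `enc` are illustrative, not the notebook's `y_train`):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Yes", "No", "No", "Yes"]  # hypothetical target values

enc = LabelEncoder()
codes = enc.fit_transform(labels)

print(enc.classes_)                  # sorted classes; index in this array = code
print(codes)                         # integer code for each label
print(enc.inverse_transform(codes))  # codes mapped back to the labels
```

Since the classes are sorted, `No` is encoded as 0 and `Yes` as 1 regardless of the order in which they appear in the data.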

In [18]:
y_train

Unnamed: 0,purchased
0,1
1,0
2,0
3,0
4,1
5,1
6,1
7,1
8,1
9,0
