# Tutorial 09: 
# Categorical Encoding: Label & One Hot Encoding

## What is Categorical Encoding
Categorical Encoding is **the process of converting categorical values into numerical values.** There are many types of Categorical Encoding, including:
- **Label Encoding**
- **One Hot Encoding**

### 1. Label Encoding
In Label Encoding, the **categories** of a feature **are sorted alphabetically and each one is represented by an integer value**, starting from 0.
#### - Dataset

In [18]:
import pandas as pd 

df = pd.DataFrame({
    'country': ['India', 'US', 'Japan', 'US', 'Japan'], 
    'age'    : [44, 34, 36, 35, 23],
    'salary' : [72000, 65000, 98000, 45000, 34000]
})

df

| | country | age | salary |
|---|---|---|---|
| 0 | India | 44 | 72000 |
| 1 | US | 34 | 65000 |
| 2 | Japan | 36 | 98000 |
| 3 | US | 35 | 45000 |
| 4 | Japan | 23 | 34000 |


#### - Label Encoding with scikit-learn

In [19]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
# Convert the strings in the 'country' column to integer codes.
# The codes follow the alphabetical (sorted) order of the categories.
df['country'] = label_encoder.fit_transform(df['country'])
df

| | country | age | salary |
|---|---|---|---|
| 0 | 0 | 44 | 72000 |
| 1 | 2 | 34 | 65000 |
| 2 | 1 | 36 | 98000 |
| 3 | 2 | 35 | 45000 |
| 4 | 1 | 23 | 34000 |
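
To check which integer was assigned to which category, the fitted encoder can be inspected. The following is a minimal sketch, assuming the same `country` values as above; the names `countries`, `mapping`, and `decoded` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same country values as the tutorial's dataset.
countries = pd.Series(['India', 'US', 'Japan', 'US', 'Japan'])

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(countries)

# classes_ holds the categories in sorted order; the position of each
# category in this array is its integer code.
mapping = {category: code for code, category in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original strings.
decoded = label_encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted `label_encoder` around is what makes it possible to decode the integers back into country names later.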


### 2. One Hot Encoding
In One Hot Encoding, the **categories of a feature are sorted alphabetically and each one is represented as a set of bits**: one bit per category, with only the bit for the row's own category set to 1.
#### - Dataset

In [20]:
import pandas as pd 

df = pd.DataFrame({
    'country': ['India', 'US', 'Japan', 'US', 'Japan'], 
    'age'    : [44, 34, 36, 35, 23],
    'salary' : [72000, 65000, 98000, 45000, 34000]
})

df

| | country | age | salary |
|---|---|---|---|
| 0 | India | 44 | 72000 |
| 1 | US | 34 | 65000 |
| 2 | Japan | 36 | 98000 |
| 3 | US | 35 | 45000 |
| 4 | Japan | 23 | 34000 |


#### - One Hot Encoding with scikit-learn

In [21]:
x = df['country'].values.reshape(-1,1)
x

array([['India'],
       ['US'],
       ['Japan'],
       ['US'],
       ['Japan']], dtype=object)

In [22]:
from sklearn.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder()
# Convert the string data into a dense numeric array with one column per category.
x = onehot_encoder.fit_transform(x).toarray()
x

array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

In [23]:
onehot_encoder.categories_

[array(['India', 'Japan', 'US'], dtype=object)]

In [24]:
df_onehot = pd.DataFrame(x, columns=[str(i) for i in range(x.shape[1])])
df_onehot

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 1.0 | 0.0 |


In [25]:
df = pd.concat([df_onehot, df], axis=1)
df

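| | 0 | 1 | 2 | country | age | salary |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | India | 44 | 72000 |
| 1 | 0.0 | 0.0 | 1.0 | US | 34 | 65000 |
| 2 | 0.0 | 1.0 | 0.0 | Japan | 36 | 98000 |
| 3 | 0.0 | 0.0 | 1.0 | US | 35 | 45000 |
| 4 | 0.0 | 1.0 | 0.0 | Japan | 23 | 34000 |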


From the result above, we can see that **India is represented by 1 0 0, US by 0 0 1, and Japan by 0 1 0**.

In [26]:
df = df.drop(['country'], axis=1)
df

| | 0 | 1 | 2 | age | salary |
|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 44 | 72000 |
| 1 | 0.0 | 0.0 | 1.0 | 34 | 65000 |
| 2 | 0.0 | 1.0 | 0.0 | 36 | 98000 |
| 3 | 0.0 | 0.0 | 1.0 | 35 | 45000 |
| 4 | 0.0 | 1.0 | 0.0 | 23 | 34000 |
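
The column names `0`, `1`, `2` give no hint of which country each column stands for. As a minimal sketch, the names can instead be derived from `onehot_encoder.categories_`. The `country_` prefix and the names `df_example`, `df_named`, and `df_result` are assumptions made for this illustration, not part of the tutorial's code.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same data as the tutorial's dataset, under a separate name.
df_example = pd.DataFrame({
    'country': ['India', 'US', 'Japan', 'US', 'Japan'],
    'age'    : [44, 34, 36, 35, 23],
    'salary' : [72000, 65000, 98000, 45000, 34000]
})

onehot_encoder = OneHotEncoder()
# Double brackets select a one-column DataFrame, which is already 2D.
encoded = onehot_encoder.fit_transform(df_example[['country']]).toarray()

# categories_[0] lists the categories of the single encoded column,
# in the same order as the columns of the encoded array.
column_names = ['country_' + str(c) for c in onehot_encoder.categories_[0]]
df_named = pd.DataFrame(encoded, columns=column_names, index=df_example.index)

# Replace the original country column with the named indicator columns.
df_result = pd.concat([df_named, df_example.drop(columns=['country'])], axis=1)
print(df_result)
```

Selecting `df_example[['country']]` with double brackets also avoids the separate `.values.reshape(-1, 1)` step.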


### 3. Label Encoding vs One Hot Encoding
We apply **One Hot Encoding** when:
- **The categorical values are nominal** (they have no inherent order).
- **The number of categories is not too large.**

We apply **Label Encoding** when:
- **The categorical values are ordinal** (they have an inherent order).
- **The number of categories is relatively large.**
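
One caveat for ordinal values: `LabelEncoder` assigns codes alphabetically, which may not match the real order of the categories. The sketch below uses a hypothetical `size` feature (not part of the tutorial's dataset) to contrast it with scikit-learn's `OrdinalEncoder`, which accepts an explicit category order. This is one possible approach, not the tutorial's own method.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature, introduced only for this illustration.
sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# LabelEncoder sorts alphabetically: large=0, medium=1, small=2.
alphabetical_codes = LabelEncoder().fit_transform(sizes['size'])

# OrdinalEncoder with an explicit order: small=0, medium=1, large=2.
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = ordinal_encoder.fit_transform(sizes[['size']]).ravel()

print(list(alphabetical_codes))
print(list(ordered_codes))
```

With the explicit order, larger integers correspond to larger sizes, which is the property an ordinal encoding is meant to preserve.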