**Categorical Encoding: Label Encoding & One Hot Encoding**

**What is Categorical Encoding?**
- Categorical encoding is the process of converting categorical values into numerical values.
- [Reference](https://en.wikipedia.org/wiki/One-hot)


**Label Encoding**
- In Label Encoding, the categories of a feature are sorted alphabetically and each one is represented by an integer value.
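
The idea can be sketched without any library. This is only a minimal illustration of the sort-then-number step described above, not how Scikit-Learn implements it; the variable names `categories`, `mapping`, and `encoded` are chosen here for illustration.

```python
# Minimal sketch of label encoding in plain Python (illustrative only).
countries = ['India', 'US', 'Japan', 'US', 'Japan']

# Sort the unique categories alphabetically, then number them from 0.
categories = sorted(set(countries))
mapping = {category: index for index, category in enumerate(categories)}

# Replace every value with the integer assigned to its category.
encoded = [mapping[country] for country in countries]

print(mapping)  # {'India': 0, 'Japan': 1, 'US': 2}
print(encoded)  # [0, 2, 1, 2, 1]
```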

Dataset preparation

In [1]:
import pandas as pd

df = pd.DataFrame({
    'country': ['India', 'US', 'Japan', 'US', 'Japan'],
    'age': [44, 34, 46, 35, 23],
    'salary': [72000, 65000, 98000, 45000, 34000]
})

df

| | country | age | salary |
|---|---|---|---|
| 0 | India | 44 | 72000 |
| 1 | US | 34 | 65000 |
| 2 | Japan | 46 | 98000 |
| 3 | US | 35 | 45000 |
| 4 | Japan | 23 | 34000 |


**Label Encoding in Scikit-Learn**

In [2]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['country'] = label_encoder.fit_transform(df['country'])
df

| | country | age | salary |
|---|---|---|---|
| 0 | 0 | 44 | 72000 |
| 1 | 2 | 34 | 65000 |
| 2 | 1 | 46 | 98000 |
| 3 | 2 | 35 | 45000 |
| 4 | 1 | 23 | 34000 |


In [3]:
label_encoder.classes_

array(['India', 'Japan', 'US'], dtype=object)
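
The position of each entry in `classes_` is the integer assigned to that category. As a small sketch, the mapping can be made explicit and the integers turned back into labels with `inverse_transform()`; the names `codes`, `mapping`, and `decoded` are introduced here only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(['India', 'US', 'Japan', 'US', 'Japan'])

# Index i in classes_ is the category that was encoded as integer i.
mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}

# inverse_transform() converts the integers back into the original labels.
decoded = label_encoder.inverse_transform(codes)

print(mapping)
print(list(decoded))
```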

**One Hot Encoding**

Dataset preparation

In [4]:
df = pd.DataFrame({
    'country': ['India', 'US', 'Japan', 'US', 'Japan'],
    'age': [44, 34, 46, 35, 23],
    'salary': [72000, 65000, 98000, 45000, 34000]
})

df

| | country | age | salary |
|---|---|---|---|
| 0 | India | 44 | 72000 |
| 1 | US | 34 | 65000 |
| 2 | Japan | 46 | 98000 |
| 3 | US | 35 | 45000 |
| 4 | Japan | 23 | 34000 |


**One Hot Encoding in Scikit-Learn**
- The reshape() function is used to convert the array into a 2-dimensional array with a single column, since OneHotEncoder expects 2-dimensional input.
- fit_transform() fits onehot_encoder on X and transforms it; toarray() converts the resulting sparse matrix into a dense array.
- [1., 0., 0.] marks the first category and represents India.
- [0., 1., 0.] marks the second category and represents Japan.
- [0., 0., 1.] marks the third category and represents US.
- Use the concat() function to combine DataFrames. The parameter axis=1 means they are joined side by side (column-wise).
- Use the drop() function to remove a specific column.
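
To see where the vectors listed above come from, the encoding can be reproduced by hand with NumPy. This is only an illustrative sketch of the idea, not what OneHotEncoder does internally; `categories`, `codes`, and `one_hot` are names chosen for this example.

```python
import numpy as np

countries = np.array(['India', 'US', 'Japan', 'US', 'Japan'])

# np.unique returns the sorted unique categories; return_inverse gives,
# for every row, the index of its category in that sorted list.
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector of category i,
# so indexing it with the codes yields one vector per row.
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot)
```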

In [5]:
X = df['country'].values.reshape(-1,1)
X

array([['India'],
       ['US'],
       ['Japan'],
       ['US'],
       ['Japan']], dtype=object)

In [6]:
from sklearn.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder()
X = onehot_encoder.fit_transform(X).toarray()
X

array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

In [7]:
onehot_encoder.categories_

[array(['India', 'Japan', 'US'], dtype=object)]

In [8]:
df_onehot = pd.DataFrame(X, columns=[str(i) for i in range(X.shape[1])])
df_onehot

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 1.0 | 0.0 |


In [9]:
df = pd.concat([df_onehot, df], axis=1)
df

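| | 0 | 1 | 2 | country | age | salary |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | India | 44 | 72000 |
| 1 | 0.0 | 0.0 | 1.0 | US | 34 | 65000 |
| 2 | 0.0 | 1.0 | 0.0 | Japan | 46 | 98000 |
| 3 | 0.0 | 0.0 | 1.0 | US | 35 | 45000 |
| 4 | 0.0 | 1.0 | 0.0 | Japan | 23 | 34000 |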


In [10]:
df = df.drop(['country'], axis=1)
df

| | 0 | 1 | 2 | age | salary |
|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 44 | 72000 |
| 1 | 0.0 | 0.0 | 1.0 | 34 | 65000 |
| 2 | 0.0 | 1.0 | 0.0 | 46 | 98000 |
| 3 | 0.0 | 0.0 | 1.0 | 35 | 45000 |
| 4 | 0.0 | 1.0 | 0.0 | 23 | 34000 |
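
The column names 0, 1, and 2 do not say which country they stand for. One possible variation, sketched below, builds descriptive names from `categories_`; the `country_` prefix and the names `columns` and `df_encoded` are assumptions made for this example, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'country': ['India', 'US', 'Japan', 'US', 'Japan'],
    'age': [44, 34, 46, 35, 23],
    'salary': [72000, 65000, 98000, 45000, 34000]
})

onehot_encoder = OneHotEncoder()
# df[['country']] is already 2-dimensional, so no reshape() is needed here.
encoded = onehot_encoder.fit_transform(df[['country']]).toarray()

# categories_[0] holds the categories of the first (and only) encoded column.
columns = [f'country_{category}' for category in onehot_encoder.categories_[0]]
df_onehot = pd.DataFrame(encoded, columns=columns, index=df.index)

# Join the one-hot columns with the remaining features, without the original column.
df_encoded = pd.concat([df_onehot, df.drop(columns=['country'])], axis=1)
print(df_encoded)
```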


**Label Encoding vs One Hot Encoding**
Use One Hot Encoding when:

- The categorical values are nominal (the categories have no inherent order)
- The number of categories is not too large

Use Label Encoding when:

- The categorical values are ordinal (the categories have a natural order)
- The number of categories is relatively large
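
For ordinal values, note that alphabetical numbering does not necessarily follow the natural order of the categories. As a hedged sketch, Scikit-Learn's OrdinalEncoder accepts an explicit category order; the `size` column and its values below are hypothetical and do not come from the dataset above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature, used only for illustration.
df_sizes = pd.DataFrame({'size': ['medium', 'low', 'high', 'low']})

# Alphabetical order would give high=0, low=1, medium=2, which breaks the
# natural order, so the intended order is passed explicitly.
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
df_sizes['size_encoded'] = ordinal_encoder.fit_transform(df_sizes[['size']]).ravel()

print(df_sizes)
```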