# Encoding categorical features (or labels)

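Label encoding and one-hot encoding both convert categorical or text data into numbers so that a program can process and compute on it more easily.
> Label encoding: maps each category to an integer; no new columns are added

> One-hot encoding: adds one column per category and uses 0/1 to indicate whether a row belongs to that category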

![](images/Encoder.PNG)
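
As a minimal illustration of the two schemes, the following pure-Python sketch encodes a small list of blood types by hand. The variable names and the alphabetical ordering of the categories are assumptions chosen for illustration; it is a sketch of the idea, not how any particular library implements it.

```python
# Sample data (same values as the 'blood' column used in this notebook)
blood = ['A', 'B', 'AB', 'O', 'B']

# Sorted list of distinct categories (assumed ordering: alphabetical)
categories = sorted(set(blood))

# Label encoding: each category becomes its index in the sorted list
label_encoded = [categories.index(b) for b in blood]

# One-hot encoding: one 0/1 column per category
onehot_encoded = [[1 if b == c else 0 for c in categories] for b in blood]

print(categories)
print(label_encoded)
print(onehot_encoded)
```

Label encoding keeps a single column of integers, while one-hot encoding produces as many columns as there are distinct categories.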

# Load Categorical data

![](images/Encoding.PNG)

In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]});
df

Unnamed: 0,blood,Y,Z
0,A,high,
1,B,low,
2,AB,high,-1196.0
3,O,mid,72.0
4,B,mid,83.0


# Method 1: sklearn - label encoder + onehot encoder

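**One-hot encoding splits a categorical feature into multiple columns whose values are 1 or 0: each row shows 1 in the column that matches its category and 0 in all the others.**
- **Numeric input**: older scikit-learn versions required numbers, so text had to be converted with LabelEncoder first; recent versions of OneHotEncoder also accept string categories directly
- **2D array input**: a one-dimensional array must be reshaped with reshape(-1,1)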
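
The 2D-input requirement can be sketched as follows. This example assumes a reasonably recent scikit-learn release in which `OneHotEncoder` accepts string categories; the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A one-dimensional array of categories, shape (5,)
blood = np.array(['A', 'B', 'AB', 'O', 'B'])

# reshape(-1, 1) turns it into a single-column 2D array, shape (5, 1)
blood_2d = blood.reshape(-1, 1)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
onehot = encoder.fit_transform(blood_2d).toarray()

print(blood.shape, blood_2d.shape, onehot.shape)
print(encoder.categories_[0])
```

The `-1` in `reshape(-1, 1)` lets NumPy infer the number of rows, so the same call works for any array length.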

## Handling each step separately

In [2]:
from sklearn.preprocessing import LabelEncoder

# Label-encode the 'blood' feature with LabelEncoder
encoder = LabelEncoder()
encoder_blood = encoder.fit_transform(df['blood'])

df['blood'] = encoder_blood
print("Label Encoded DataFrame:")
df

Label Encoded DataFrame:


Unnamed: 0,blood,Y,Z
0,0,high,
1,2,low,
2,1,high,-1196.0
3,3,mid,72.0
4,2,mid,83.0


In [3]:
from sklearn.preprocessing import OneHotEncoder

# Encode with OneHotEncoder
onehot_encoder = OneHotEncoder()
blood_arr = np.array(df['blood'])
print("Blood Array:")
print(blood_arr)
print("Blood Array Shape:")
print(blood_arr.shape)

onehot_encoded = onehot_encoder.fit_transform(blood_arr.reshape(-1, 1)).toarray()
print("One-Hot Encoded Array:")
print(onehot_encoded)

Blood Array:
[0 2 1 3 2]
Blood Array Shape:
(5,)
One-Hot Encoded Array:
[[1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]]


In [4]:
# Convert the OneHotEncoder result to a DataFrame
onehot_encoded_df = pd.DataFrame(onehot_encoded, columns=onehot_encoder.get_feature_names_out(['Blood']))

# Merge the remaining columns of the original DataFrame with the OneHotEncoder result
final_df = pd.concat([onehot_encoded_df, df[['Y', 'Z']]], axis=1)

# Show the final data
final_df

Unnamed: 0,Blood_0,Blood_1,Blood_2,Blood_3,Y,Z
0,1.0,0.0,0.0,0.0,high,
1,0.0,0.0,1.0,0.0,low,
2,0.0,1.0,0.0,0.0,high,-1196.0
3,0.0,0.0,0.0,1.0,mid,72.0
4,0.0,0.0,1.0,0.0,mid,83.0


## Handling everything together

In [5]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]});
df

Unnamed: 0,blood,Y,Z
0,A,high,
1,B,low,
2,AB,high,-1196.0
3,O,mid,72.0
4,B,mid,83.0


In [6]:
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer 

# Define a ColumnTransformer that one-hot encodes column 0
column_transformer = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), [0])], # [0] selects column 0 for one-hot encoding
    remainder='passthrough'  # leave the other columns unchanged
)

# Apply the ColumnTransformer
# fit_transform fits and transforms df; the result is converted to a NumPy array of strings
data = np.array(column_transformer.fit_transform(df), dtype=str)
print(data)

# Convert the transformed NumPy array to a pandas DataFrame
data_df = pd.DataFrame(data)
data_df

[['1.0' '0.0' '0.0' '0.0' 'high' 'nan']
 ['0.0' '0.0' '1.0' '0.0' 'low' 'nan']
 ['0.0' '1.0' '0.0' '0.0' 'high' '-1196.0']
 ['0.0' '0.0' '0.0' '1.0' 'mid' '72.0']
 ['0.0' '0.0' '1.0' '0.0' 'mid' '83.0']]


Unnamed: 0,0,1,2,3,4,5
0,1.0,0.0,0.0,0.0,high,
1,0.0,0.0,1.0,0.0,low,
2,0.0,1.0,0.0,0.0,high,-1196.0
3,0.0,0.0,0.0,1.0,mid,72.0
4,0.0,0.0,1.0,0.0,mid,83.0


When you encode with ColumnTransformer and OneHotEncoder, the resulting DataFrame no longer carries the original column names.

In [7]:
# Get the column names produced by OneHotEncoder
onehot_feature_names = column_transformer.named_transformers_['onehot'].get_feature_names_out(['blood'])
# Concatenate the OneHotEncoder column names with the original column names (except column 0)
new_feature_names = np.concatenate([onehot_feature_names, df.columns[1:]])
print(new_feature_names)

# Convert the transformed data to a DataFrame
data_df = pd.DataFrame(data, columns=new_feature_names)
data_df

['blood_A' 'blood_AB' 'blood_B' 'blood_O' 'Y' 'Z']


Unnamed: 0,blood_A,blood_AB,blood_B,blood_O,Y,Z
0,1.0,0.0,0.0,0.0,high,
1,0.0,0.0,1.0,0.0,low,
2,0.0,1.0,0.0,0.0,high,-1196.0
3,0.0,0.0,0.0,1.0,mid,72.0
4,0.0,0.0,1.0,0.0,mid,83.0


# Method 2: Keras - label encoder + to_categorical

In [8]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]});
df

Unnamed: 0,blood,Y,Z
0,A,high,
1,B,low,
2,AB,high,-1196.0
3,O,mid,72.0
4,B,mid,83.0


In [9]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

df = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]});

# Label-encode the 'blood' feature
encoder = LabelEncoder()
encoder_blood = encoder.fit_transform(df['blood'])
df['blood'] = encoder_blood

print("Label Encoded DataFrame:")
df

Label Encoded DataFrame:


Unnamed: 0,blood,Y,Z
0,0,high,
1,2,low,
2,1,high,-1196.0
3,3,mid,72.0
4,2,mid,83.0


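The `to_categorical` function is designed specifically to convert class labels into one-hot encoding. Even when the input is a one-dimensional array of integer labels (such as [0, 1, 2, 1, 0]), it is automatically converted into a two-dimensional array in one-hot form.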
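
The 1D-to-2D conversion can be sketched with NumPy alone. This is an illustration of the idea under the assumption that labels are non-negative integers starting at 0; it is not Keras's actual implementation.

```python
import numpy as np

# One-dimensional integer labels, shape (5,)
labels = np.array([0, 1, 2, 1, 0])

# Number of classes inferred from the largest label
num_classes = labels.max() + 1

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the labels yields a (5, 3) one-hot array
onehot = np.eye(num_classes)[labels]

print(onehot.shape)
print(onehot)
```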

In [10]:
# Convert the label encoding to one-hot encoding with to_categorical
keras_onehot_encoded = to_categorical(df['blood'])

print("OneHot Encoded Array:")
keras_onehot_encoded

OneHot Encoded Array:


array([[1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.]])

In [11]:
# Convert the one-hot result to a DataFrame
keras_onehot_encoded_df = pd.DataFrame(keras_onehot_encoded, columns=encoder.classes_)
print(keras_onehot_encoded_df)

# Merge the remaining columns of the original DataFrame with the one-hot result
final_df = pd.concat([keras_onehot_encoded_df, df[['Y', 'Z']]], axis=1)
final_df

     A   AB    B    O
0  1.0  0.0  0.0  0.0
1  0.0  0.0  1.0  0.0
2  0.0  1.0  0.0  0.0
3  0.0  0.0  0.0  1.0
4  0.0  0.0  1.0  0.0


Unnamed: 0,A,AB,B,O,Y,Z
0,1.0,0.0,0.0,0.0,high,
1,0.0,0.0,1.0,0.0,low,
2,0.0,1.0,0.0,0.0,high,-1196.0
3,0.0,0.0,0.0,1.0,mid,72.0
4,0.0,0.0,1.0,0.0,mid,83.0


# Method 3: pd.get_dummies

![](images/Encoding_pd.PNG)

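pd.get_dummies(df)
- pandas provides the `get_dummies` function, which one-hot encodes a categorical column directly, with no reshape needed
- By default only string (object) or category columns are converted to one-hot form
- Strings are converted directly, whereas numeric columns are left unchanged unless they are listed in `columns`
- If `columns` is not specified, every string or categorical column is converted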
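
The difference between the default behaviour and an explicit `columns` list can be sketched with a small made-up DataFrame. The column names `grade` and `city` are hypothetical, and `dtype=int` is used only to get 0/1 instead of True/False.

```python
import pandas as pd

# Hypothetical data: one numeric column and one string column
df = pd.DataFrame({'grade': [1, 2, 1], 'city': ['X', 'Y', 'X']})

# Default: only the string column 'city' is encoded, 'grade' stays numeric
default_encoded = pd.get_dummies(df)

# Listing 'grade' in columns forces the numeric column to be encoded too
explicit_encoded = pd.get_dummies(df, columns=['grade', 'city'], dtype=int)

print(default_encoded.columns.tolist())
print(explicit_encoded.columns.tolist())
```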

In [12]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]});
df

Unnamed: 0,blood,Y,Z
0,A,high,
1,B,low,
2,AB,high,-1196.0
3,O,mid,72.0
4,B,mid,83.0


In [13]:
df_all = pd.get_dummies(df)
df_all

Unnamed: 0,Z,blood_A,blood_AB,blood_B,blood_O,Y_high,Y_low,Y_mid
0,,True,False,False,False,True,False,False
1,,False,False,True,False,False,True,False
2,-1196.0,False,True,False,False,True,False,False
3,72.0,False,False,False,True,False,False,True
4,83.0,False,False,True,False,False,False,True


In [14]:
df_blood = pd.get_dummies(df.blood)
df_blood

Unnamed: 0,A,AB,B,O
0,True,False,False,False
1,False,False,True,False
2,False,True,False,False
3,False,False,False,True
4,False,False,True,False


In [15]:
# Merge the remaining columns of the original DataFrame with the get_dummies result
final_df = pd.concat([df_blood, df[['Y', 'Z']]], axis=1)
final_df

Unnamed: 0,A,AB,B,O,Y,Z
0,True,False,False,False,high,
1,False,False,True,False,low,
2,False,True,False,False,high,-1196.0
3,False,False,False,True,mid,72.0
4,False,False,True,False,mid,83.0


# Exercise 1: sklearn - label encoder + onehot encoder

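Most models are built on mathematical operations, and strings cannot be fed into a mathematical model.

The Country column contains only strings:
- Apply label encoding
- Import LabelEncoder from the sklearn library
- Fit and transform the first column, then replace it with the result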

In [16]:
import numpy as np
import pandas as pd

country=['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan']
age=[25,30,45,35,22,36]
salary=[20000,32000,59000,60000,43000,52000]

dic={'Country':country,'Age':age,'Salary':salary}
data=pd.DataFrame(dic)
data

Unnamed: 0,Country,Age,Salary
0,Taiwan,25,20000
1,Australia,30,32000
2,Ireland,45,59000
3,Australia,35,60000
4,Ireland,22,43000
5,Taiwan,36,52000


In [17]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Label-encode the 'Country' feature with LabelEncoder
label_encoder = LabelEncoder()
data['Country'] = label_encoder.fit_transform(data['Country'])
print(data['Country'].shape)
print(data[['Country']].shape)

# One-hot encode the label-encoded 'Country' feature with OneHotEncoder
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(data[['Country']]).toarray()

# Convert the OneHotEncoder result to a DataFrame
onehot_encoded_df = pd.DataFrame(onehot_encoded, columns=onehot_encoder.get_feature_names_out(['Country']))

# Merge the remaining columns of the original DataFrame with the OneHotEncoder result
final_df = pd.concat([onehot_encoded_df, data[['Age', 'Salary']]], axis=1)

# Show the final DataFrame
final_df

(6,)
(6, 1)


Unnamed: 0,Country_0,Country_1,Country_2,Age,Salary
0,0.0,0.0,1.0,25,20000
1,1.0,0.0,0.0,30,32000
2,0.0,1.0,0.0,45,59000
3,1.0,0.0,0.0,35,60000
4,0.0,1.0,0.0,22,43000
5,0.0,0.0,1.0,36,52000
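
Column names such as Country_0 hide which country each code stands for. A minimal sketch of recovering the mapping from a fitted `LabelEncoder` follows; `le`, `mapping`, and `restored` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

country = ['Taiwan', 'Australia', 'Ireland', 'Australia', 'Ireland', 'Taiwan']

le = LabelEncoder()
codes = le.fit_transform(country)

# classes_ holds the original categories in sorted order; position = integer code
mapping = {code: str(name) for code, name in enumerate(le.classes_)}

# inverse_transform turns the integer codes back into the original strings
restored = le.inverse_transform(codes)

print(mapping)
print(list(restored))
```

The same `classes_` array can be passed as column names for a one-hot DataFrame, which makes the result easier to read than numeric suffixes.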


# Exercise 2: Keras - label encoder + to_categorical

In [18]:
import numpy as np
import pandas as pd

country=['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan']
age=[25,30,45,35,22,36]
salary=[20000,32000,59000,60000,43000,52000]

dic={'Country':country,'Age':age,'Salary':salary}
data=pd.DataFrame(dic)
data

Unnamed: 0,Country,Age,Salary
0,Taiwan,25,20000
1,Australia,30,32000
2,Ireland,45,59000
3,Australia,35,60000
4,Ireland,22,43000
5,Taiwan,36,52000


In [19]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# Label-encode the 'Country' feature with LabelEncoder
label_encoder = LabelEncoder()
data['Country'] = label_encoder.fit_transform(data['Country'])

# Convert the label encoding to one-hot encoding with to_categorical
onehot_encoded = to_categorical(data['Country'])

# Convert the one-hot result to a DataFrame
onehot_encoded_df = pd.DataFrame(onehot_encoded, columns=label_encoder.classes_)

# Merge the remaining columns of the original DataFrame with the one-hot result
final_df = pd.concat([onehot_encoded_df, data[['Age', 'Salary']]], axis=1)

# Show the final DataFrame
final_df

Unnamed: 0,Australia,Ireland,Taiwan,Age,Salary
0,0.0,0.0,1.0,25,20000
1,1.0,0.0,0.0,30,32000
2,0.0,1.0,0.0,45,59000
3,1.0,0.0,0.0,35,60000
4,0.0,1.0,0.0,22,43000
5,0.0,0.0,1.0,36,52000


# Exercise 3: Pandas.get_dummies

In [20]:
country=['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan']
age=[25,30,45,35,22,36]
salary=[20000,32000,59000,60000,43000,52000]

dic={'Country':country,'Age':age,'Salary':salary}
data=pd.DataFrame(dic)
data

Unnamed: 0,Country,Age,Salary
0,Taiwan,25,20000
1,Australia,30,32000
2,Ireland,45,59000
3,Australia,35,60000
4,Ireland,22,43000
5,Taiwan,36,52000


In [21]:
# One-hot encode the 'Country' feature with pandas.get_dummies
country_encoded_df = pd.get_dummies(data, columns=['Country'])

# Show the final DataFrame
country_encoded_df

Unnamed: 0,Age,Salary,Country_Australia,Country_Ireland,Country_Taiwan
0,25,20000,False,False,True
1,30,32000,True,False,False
2,45,59000,False,True,False
3,35,60000,True,False,False
4,22,43000,False,True,False
5,36,52000,False,False,True


Because Country is the only categorical feature in data, encoding only Country and encoding every column give the same result.

In [22]:
import numpy as np
import pandas as pd

country=['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan']
age=[25,30,45,35,22,36]
salary=[20000,32000,59000,60000,43000,52000]

dic={'Country':country,'Age':age,'Salary':salary}
data=pd.DataFrame(dic)
data

# One-hot encode all features with pandas.get_dummies
data_encoded_df = pd.get_dummies(data)
data_encoded_df

Unnamed: 0,Age,Salary,Country_Australia,Country_Ireland,Country_Taiwan
0,25,20000,False,False,True
1,30,32000,True,False,False
2,45,59000,False,True,False
3,35,60000,True,False,False
4,22,43000,False,True,False
5,36,52000,False,False,True


Add a new categorical feature, 'Gender'

In [23]:
# Add a new categorical feature, 'Gender'
country = ['Taiwan', 'Australia', 'Ireland', 'Australia', 'Ireland', 'Taiwan']
age = [25, 30, 45, 35, 22, 36]
salary = [20000, 32000, 59000, 60000, 43000, 52000]
gender = ['Male', 'Female', 'Female', 'Male', 'Female', 'Male']

dic = {'Country': country, 'Age': age, 'Salary': salary, 'Gender': gender}
data = pd.DataFrame(dic)
data

Unnamed: 0,Country,Age,Salary,Gender
0,Taiwan,25,20000,Male
1,Australia,30,32000,Female
2,Ireland,45,59000,Female
3,Australia,35,60000,Male
4,Ireland,22,43000,Female
5,Taiwan,36,52000,Male


In [24]:
# One-hot encode the 'Country' feature with pandas.get_dummies
country_encoded_df = pd.get_dummies(data, columns=['Country'])
print("One-hot encoding of the Country feature:")
country_encoded_df

One-hot encoding of the Country feature:


Unnamed: 0,Age,Salary,Gender,Country_Australia,Country_Ireland,Country_Taiwan
0,25,20000,Male,False,False,True
1,30,32000,Female,True,False,False
2,45,59000,Female,False,True,False
3,35,60000,Male,True,False,False
4,22,43000,Female,False,True,False
5,36,52000,Male,False,False,True


In [25]:
# One-hot encode all categorical features with pandas.get_dummies
data_encoded_df = pd.get_dummies(data)
print("One-hot encoding of all categorical features:")
data_encoded_df

One-hot encoding of all categorical features:


Unnamed: 0,Age,Salary,Country_Australia,Country_Ireland,Country_Taiwan,Gender_Female,Gender_Male
0,25,20000,False,False,True,False,True
1,30,32000,True,False,False,True,False
2,45,59000,False,True,False,True,False
3,35,60000,True,False,False,False,True
4,22,43000,False,True,False,True,False
5,36,52000,False,False,True,False,True
