# Part 1: Feature Engineering

<div align="center">
<img width="70%" src="https://i.ibb.co/nbtTXy3/topic2-1-Feature-Engineering.png">
</div>

<div align="right">

[Image source](https://www.wallstreetmojo.com/feature-engineering/)  

</div>

### Why do we need feature engineering?
- improve a model's predictive performance
- reduce computational or data needs
- improve interpretability of the results

### Methods
- Basic data processing (see the EDA notebook)
    - Outlier Detection
    - Missing Value Imputation

- Numerical transformation
    - Feature Scaling
    - Feature Transformation

- Feature merging
    - Feature Interaction
    - Feature Combination

- General
    - Feature Construction
    - Feature Extraction & Feature Selection

- Other topics
    - Data Leakage
    - Target Engineering



### Generate example data

**example.csv**

In [None]:
%%bash
cat >  example.csv << EOL
Gender,People,Age,Height
male,kid,5,100
male,elder,67,158
female,youth,25,160
male,youth,29,175
male,kid,7,120
female,elder,76,168
EOL

In [None]:
import pandas as pd

df = pd.read_csv("example.csv")
df

Unnamed: 0,Gender,People,Age,Height
0,male,kid,5,100
1,male,elder,67,158
2,female,youth,25,160
3,male,youth,29,175
4,male,kid,7,120
5,female,elder,76,168


**example2.csv**

In [None]:
%%bash
cat >  example2.csv << EOL
Region,Gender,People,Age,Height
A,male,kid,5,100
A,male,elder,67,158
E,female,youth,25,160
B,male,youth,29,175
B,male,kid,7,120
A,female,elder,76,168
B,female,elder,67,180
D,male,youth,25,165
B,male,kid,4,110
C,female,elder,66,158
EOL

In [None]:
import pandas as pd

df = pd.read_csv("example2.csv")
df

Unnamed: 0,Region,Gender,People,Age,Height
0,A,male,kid,5,100
1,A,male,elder,67,158
2,E,female,youth,25,160
3,B,male,youth,29,175
4,B,male,kid,7,120
5,A,female,elder,76,168
6,B,female,elder,67,180
7,D,male,youth,25,165
8,B,male,kid,4,110
9,C,female,elder,66,158


## Numerical transformation

### Feature Scaling

#### Normalization

Rescale the data to a fixed numeric range, commonly [0, 1] or [-1, 1].


In [None]:
df = pd.read_csv("example.csv", usecols=['Age', 'Height'])
df

Unnamed: 0,Age,Height
0,5,100
1,67,158
2,25,160
3,29,175
4,7,120
5,76,168


**min-max**

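When calling `min()` and `max()` directly on a whole DataFrame, pay attention to the `axis` parameter. The default, 0, computes the statistic for each column; setting it to 1 computes it for each row.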

In [None]:
min_max_df = (df-df.min())/(df.max()-df.min())
min_max_df

Unnamed: 0,Age,Height
0,0.0,0.0
1,0.873239,0.773333
2,0.28169,0.8
3,0.338028,1.0
4,0.028169,0.266667
5,1.0,0.906667
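
The same column-wise rescaling can also be done with scikit-learn's `MinMaxScaler`, whose `feature_range` argument covers the [-1, 1] case mentioned above. The following is a minimal sketch; the example table is rebuilt inline so the cell runs on its own, and the variable names are only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same Age/Height values as example.csv, rebuilt inline
df = pd.DataFrame({"Age": [5, 67, 25, 29, 7, 76],
                   "Height": [100, 158, 160, 175, 120, 168]})

# Default feature_range is (0, 1); equivalent to (df - df.min()) / (df.max() - df.min())
scaled_01 = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Rescale each column to [-1, 1] instead
scaled_pm1 = pd.DataFrame(
    MinMaxScaler(feature_range=(-1, 1)).fit_transform(df), columns=df.columns
)
```

A fitted scaler object is convenient later on because its `transform` method can reapply the same minimum and maximum to new data.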


#### Standardization  [<img width="1.5%" src="https://i.ibb.co/GJHbVbG/external-link.png">](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.zscore.html)

In [None]:
import scipy.stats as stats

zscore_df = df.copy()
for col in df.columns:
    zscore_df[col] = stats.zscore(zscore_df[col])

In [None]:
zscore_df

Unnamed: 0,Age,Height
0,-1.086366,-1.719145
1,1.171333,0.409903
2,-0.358076,0.483318
3,-0.212418,1.033934
4,-1.013537,-0.98499
5,1.499064,0.77698
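
One detail worth knowing: `scipy.stats.zscore` divides by the population standard deviation (`ddof=0`), whereas the pandas `std()` method defaults to the sample standard deviation (`ddof=1`). The sketch below, on an inline copy of the example data, writes out the manual formula that matches the z-scores shown above.

```python
import pandas as pd
import scipy.stats as stats

# Same Age/Height values as example.csv, rebuilt inline
df = pd.DataFrame({"Age": [5, 67, 25, 29, 7, 76],
                   "Height": [100, 158, 160, 175, 120, 168]})

# z = (x - mean) / std, with the population standard deviation (ddof=0)
manual_z = (df - df.mean()) / df.std(ddof=0)

# Column-wise z-scores from scipy for comparison
scipy_z = pd.DataFrame(stats.zscore(df.to_numpy(), axis=0), columns=df.columns)
```

Using `df.std()` without `ddof=0` would give slightly smaller absolute z-scores on a table this small.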


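#### Discussion

- **Standardization**: suited to algorithms that are sensitive to distances and feature magnitudes, such as SVM and logistic regression.
- **Normalization**: used in the Vector Space Model and in Batch Normalization (which can mitigate vanishing gradients in deep learning).
- Tree-based algorithms generally do not need standardization or normalization, because they are insensitive to feature scale.

### Feature Transformation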


In [None]:
df = pd.read_csv("example.csv", usecols=['Age', 'Height'])
df

Unnamed: 0,Age,Height
0,5,100
1,67,158
2,25,160
3,29,175
4,7,120
5,76,168


#### Rounding

In [None]:
(df/100).round(2)

Unnamed: 0,Age,Height
0,0.05,1.0
1,0.67,1.58
2,0.25,1.6
3,0.29,1.75
4,0.07,1.2
5,0.76,1.68


#### log

In [None]:
import numpy as np

In [None]:
# natural logarithm (base e)
np.log(df)

Unnamed: 0,Age,Height
0,1.609438,4.60517
1,4.204693,5.062595
2,3.218876,5.075174
3,3.367296,5.164786
4,1.94591,4.787492
5,4.330733,5.123964


In [None]:
# base 10
np.log10(df)

Unnamed: 0,Age,Height
0,0.69897,2.0
1,1.826075,2.198657
2,1.39794,2.20412
3,1.462398,2.243038
4,0.845098,2.079181
5,1.880814,2.225309


#### Binarization

In [None]:
df['Age(Binarization)'] = df['Age'].apply(lambda x: 'adult' if x >= 20 else 'kid')
df

Unnamed: 0,Age,Height,Age(Binarization)
0,5,100,kid
1,67,158,adult
2,25,160,adult
3,29,175,adult
4,7,120,kid
5,76,168,adult


#### Binning

In [None]:
df['Age(Binning)'] = df['Age'].apply(lambda x: 'older' if x >= 60 else 'adult' if x >= 20 else 'kid' )
df

Unnamed: 0,Age,Height,Age(Binarization),Age(Binning)
0,5,100,kid,kid
1,67,158,adult,older
2,25,160,adult,adult
3,29,175,adult,adult
4,7,120,kid,kid
5,76,168,adult,older
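
The nested `lambda` works, but it gets harder to read as the number of bins grows. As an alternative sketch, `pd.cut` expresses the same thresholds as a list of bin edges; the data is rebuilt inline, and the labels mirror those used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [5, 67, 25, 29, 7, 76]})

# Left-closed bins [0, 20), [20, 60), [60, inf) reproduce the x >= 20 / x >= 60 rules
df["Age(Binning)"] = pd.cut(
    df["Age"],
    bins=[0, 20, 60, np.inf],
    right=False,
    labels=["kid", "adult", "older"],
)
```

The result is an ordered categorical column, so the bins keep their natural order when sorted or encoded later.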


#### Data Encoding (categorical features)
Covered in Part 2.

## Feature merging

### Feature Interaction
Arithmetic operations between numerical features.

In [None]:
df = pd.read_csv("example.csv", usecols=['Age', 'Height'])
df

Unnamed: 0,Age,Height
0,5,100
1,67,158
2,25,160
3,29,175
4,7,120
5,76,168


In [None]:
# This is only an example; check that the operation is meaningful before using it on real data
df['divide'] = df['Height'] / df['Age']
df['add'] = df['Height'] + df['Age']
df

Unnamed: 0,Age,Height,divide,add
0,5,100,20.0,105
1,67,158,2.358209,225
2,25,160,6.4,185
3,29,175,6.034483,204
4,7,120,17.142857,127
5,76,168,2.210526,244


### Feature Combination

In [None]:
df = pd.read_csv("example.csv", usecols=['Gender', 'People'])
df

Unnamed: 0,Gender,People
0,male,kid
1,male,elder
2,female,youth
3,male,youth
4,male,kid
5,female,elder


In [None]:
df['Gender_People'] = df['Gender'] + '-' +  df['People']
df

Unnamed: 0,Gender,People,Gender_People
0,male,kid,male-kid
1,male,elder,male-elder
2,female,youth,female-youth
3,male,youth,male-youth
4,male,kid,male-kid
5,female,elder,female-elder


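## General

### Feature Construction
Feature construction means manually creating new features from existing ones. It is often used to help ordinary linear models, which cannot learn nonlinear relationships on their own.
- If you have features such as city or address, you can derive two new features, latitude and longitude (you will need an external API or data source to do so), and then build a feature such as median_income_within_2_miles from them.
- For date / time data, besides converting to a timestamp and extracting day, month, and year into new columns, you can bin the hour (morning, noon, evening, and so on) or bin the day (weekday versus weekend). You can also look up the weather, holidays, or events for that date and add features such as is_national_holiday or has_sport_events.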
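
The date / time ideas above can be sketched with the pandas `.dt` accessor. The timestamps below are hypothetical, and the bin edges for the parts of the day are an arbitrary choice made for illustration.

```python
import pandas as pd

# Hypothetical timestamps, used only to illustrate the idea
events = pd.DataFrame({"timestamp": pd.to_datetime([
    "2023-10-02 08:30:00",  # a Monday morning
    "2023-10-07 13:15:00",  # a Saturday afternoon
    "2023-10-08 21:45:00",  # a Sunday evening
])})

ts = events["timestamp"].dt
events["year"] = ts.year
events["month"] = ts.month
events["day"] = ts.day

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
events["is_weekend"] = ts.dayofweek >= 5

# Bin the hour into coarse parts of the day (left-closed bins)
events["part_of_day"] = pd.cut(
    ts.hour, bins=[0, 12, 18, 24], right=False,
    labels=["morning", "afternoon", "evening"],
)
```

Features such as is_national_holiday would need an external calendar or data source, so they are left out of this sketch.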

### Feature Extraction vs. Feature Selection

<div align="center">
<img width="70%" src="https://i.ibb.co/wMqYJgH/Feature-Extraction-vs-Feature-Selection.png">
</div>

<div align="right">

[Image source](https://quantdare.com/what-is-the-difference-between-feature-extraction-and-feature-selection/)  

</div>

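- Feature Extraction -> reduce dimensionality while retaining most of the information
    - PCA, LDA, encoders (deep learning)
- Feature Selection -> reduce dimensionality by discarding redundant information
    - Algorithms such as random forest, linear regression (LASSO, Ridge), and logistic regression expose feature importances or coefficients that can be inspected directly.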
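
To make the difference concrete, here is a small sketch on the iris dataset. PCA stands in for feature extraction. For feature selection it uses `SelectKBest` with a univariate F-test, which is a swapped-in technique chosen for brevity rather than the model-based importances listed above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Feature extraction: build 2 new features as linear combinations of all 4
X_pca = PCA(n_components=2).fit_transform(X)

# Feature selection: keep 2 of the original 4 features, unchanged
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
X_selected = selector.transform(X)
kept = selector.get_support(indices=True)  # indices of the retained columns
```

Both results have two columns, but only the selected features remain interpretable as original measurements.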

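## Other topics

### Data Leakage
Put simply, data leakage happens when a feature reveals the answer or when a processing step peeks at it, which makes the results look better than they really are. For example, computing z-scores over the train and test sets together leaks information about the test set into the statistics used for training. In time-series forecasting, the train set is past data and the test set is future data, so statistics computed over both would let the model infer future values to some degree.

See: [Data Leakage | Kaggle](https://www.kaggle.com/code/alexisbcook/data-leakage)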
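
For the z-score example, a common way to avoid the leak is to fit the scaler on the train set only and reuse its statistics on the test set. The following is a minimal sketch with scikit-learn's `StandardScaler`; the data is an inline copy of the Age and Height columns of example2.csv, and the split parameters are arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"Age": [5, 67, 25, 29, 7, 76, 67, 25, 4, 66],
                   "Height": [100, 158, 160, 175, 120, 168, 180, 165, 110, 158]})

X_train, X_test = train_test_split(df, test_size=0.3, random_state=0)

# Fit on the train set only: the mean and std come from training rows alone
scaler = StandardScaler().fit(X_train)

# Reuse the train statistics for both sets; do not call fit on the test set
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

The scaled test set will generally not have mean 0 and standard deviation 1, which is expected because its own statistics were never used.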

### Target Engineering
Feature engineering can also be applied to the prediction target. The most common method is a target log transform, which applies a logarithm to the target values. If you are interested, see: [Target LogTransform Effect | R²: 0.64 to 0.97 | Kaggle](https://www.kaggle.com/code/heitornunes/target-logtransform-effect-r-0-64-to-0-97)
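
One way to apply a target log transform in scikit-learn is `TransformedTargetRegressor`, which transforms the target before fitting and maps predictions back afterwards. The sketch below uses synthetic data (an assumption made purely for illustration) in which the target grows exponentially with the feature.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative assumption): y grows exponentially with x
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.exp(3.0 + 0.8 * X[:, 0] + rng.normal(0, 0.1, size=200))

# Fit on log1p(y); predictions are mapped back to the original scale with expm1
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
model.fit(X, y)
pred = model.predict(X)

# Plain linear regression on the raw target, for comparison
plain = LinearRegression().fit(X, y)
print("R^2 with log target:", model.score(X, y))
print("R^2 with raw target:", plain.score(X, y))
```

Whether the transform helps on real data depends on how the target is distributed, so it is worth comparing both versions with the metric you chose.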



# Part 2: Data Encoding

## Label Encoding &nbsp; [<img width="1.8%" src="https://i.ibb.co/GJHbVbG/external-link.png">](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)


In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()

df = pd.read_csv('example.csv')
df['People(en)'] = labelencoder.fit_transform(df['People'])
df[['People', 'People(en)']]

Unnamed: 0,People,People(en)
0,kid,1
1,elder,0
2,youth,2
3,youth,2
4,kid,1
5,elder,0


## Ordinal Encoding

### pandas [<img width="1.5%" src="https://i.ibb.co/GJHbVbG/external-link.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html)

If you want an ordered encoding (kid -> youth -> elder), you can use the pandas `apply` method.

In [None]:
df = pd.read_csv('example.csv')

df['People(en)'] = df['People'].apply(lambda x: ['kid', 'youth', 'elder'].index(x))
df[['People', 'People(en)']]

Unnamed: 0,People,People(en)
0,kid,0
1,elder,2
2,youth,1
3,youth,1
4,kid,0
5,elder,2


### sklearn  [<img width="1.5%" src="https://i.ibb.co/GJHbVbG/external-link.png">](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)


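By default, sklearn's `OrdinalEncoder` does not produce a true ordinal encoding. It behaves more like Label Encoding (categories are numbered in sorted order), but it can process several features at once.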


In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()

df = pd.read_csv('example.csv')
df[['People(en)', 'Gender(en)']] = ordinal_encoder.fit_transform(df[['People', 'Gender']])
df[['People', 'Gender',  'People(en)', 'Gender(en)']]

Unnamed: 0,People,Gender,People(en),Gender(en)
0,kid,male,1.0,1.0
1,elder,male,0.0,1.0
2,youth,female,2.0,0.0
3,youth,male,2.0,1.0
4,kid,male,1.0,1.0
5,elder,female,0.0,0.0


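Numerical columns can be encoded together with the categorical ones, but doing so directly is not recommended. For example, convert age into age ranges (categories) first and then encode.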

In [None]:
df[['People(en)', 'Gender(en)', 'Age(en)']] = ordinal_encoder.fit_transform(df[['People', 'Gender', 'Age']])
df[['People', 'Gender', 'Age', 'People(en)', 'Gender(en)',  'Age(en)']]

Unnamed: 0,People,Gender,Age,People(en),Gender(en),Age(en)
0,kid,male,5,1.0,1.0,0.0
1,elder,male,67,0.0,1.0,4.0
2,youth,female,25,2.0,0.0,2.0
3,youth,male,29,2.0,1.0,3.0
4,kid,male,7,1.0,1.0,1.0
5,elder,female,76,0.0,0.0,5.0


By specifying the `categories` parameter, you can set the encoding order of each feature's categories and obtain a true Ordinal Encoding.

In [None]:
ordinal_encoder = OrdinalEncoder(
    categories=[
         ['kid', 'youth', 'elder'],
         ['male', 'female']
        ]
    )
df[['People(en)', 'Gender(en)']] = ordinal_encoder.fit_transform(df[['People', 'Gender']])
df[['People', 'Gender',  'People(en)', 'Gender(en)']]

Unnamed: 0,People,Gender,People(en),Gender(en)
0,kid,male,0.0,0.0
1,elder,male,2.0,0.0
2,youth,female,1.0,1.0
3,youth,male,1.0,0.0
4,kid,male,0.0,0.0
5,elder,female,2.0,1.0


## OneHot Encoding

### pandas [<img width="1.5%" src="https://i.ibb.co/GJHbVbG/external-link.png">](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)

In [None]:
df

Unnamed: 0,Gender,People,Age,Height
0,male,kid,5,100
1,male,elder,67,158
2,female,youth,25,160
3,male,youth,29,175
4,male,kid,7,120
5,female,elder,76,168


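When the input contains both numerical and categorical columns, `get_dummies` automatically one-hot encodes only the categorical columns. Note that in pandas 2.0 and later the dummy columns are boolean unless `dtype=int` is passed.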

In [None]:
data_dum = pd.get_dummies(df)
data_dum

Unnamed: 0,Age,Height,Gender_female,Gender_male,People_elder,People_kid,People_youth
0,5,100,0,1,0,1,0
1,67,158,0,1,1,0,0
2,25,160,1,0,0,0,1
3,29,175,0,1,0,0,1
4,7,120,0,1,0,1,0
5,76,168,1,0,1,0,0


### sklearn  [<img width="1.5%" src="https://i.ibb.co/GJHbVbG/external-link.png">](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)


In [None]:
from sklearn.preprocessing import OneHotEncoder

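Passing every feature in directly causes trouble: sklearn does not distinguish numerical from string (categorical) variables and treats them all as categorical, so use it with care. The result is shown below: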

In [None]:
onehot_encoder = OneHotEncoder()
onehot_array = onehot_encoder.fit_transform(df).toarray()
pd.DataFrame(onehot_array)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
5,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


The encoder's `categories_` attribute shows that sklearn one-hot encoded every column, which makes the encoded output very high-dimensional.

In [None]:
onehot_encoder.categories_

[array(['female', 'male'], dtype=object),
 array(['elder', 'kid', 'youth'], dtype=object),
 array([ 5,  7, 25, 29, 67, 76]),
 array([100, 120, 158, 160, 168, 175])]

Separate out the categorical columns, one-hot encode them, and then concatenate the result back with the numerical columns.

In [None]:
cate_cols = ['Gender', 'People']
num_cols = [i for i in df.columns if i not in cate_cols]

In [None]:
onehot_encoder = OneHotEncoder()
onehot_array = onehot_encoder.fit_transform(df[cate_cols]).toarray()

# after fitting, the encoder's get_feature_names_out() returns the column names
column_name = onehot_encoder.get_feature_names_out()
onehot_df = pd.DataFrame(onehot_array, columns=column_name)
onehot_df

Unnamed: 0,Gender_female,Gender_male,People_elder,People_kid,People_youth
0,0.0,1.0,0.0,1.0,0.0
1,0.0,1.0,1.0,0.0,0.0
2,1.0,0.0,0.0,0.0,1.0
3,0.0,1.0,0.0,0.0,1.0
4,0.0,1.0,0.0,1.0,0.0
5,1.0,0.0,1.0,0.0,0.0


In [None]:
pd.concat([onehot_df, df[num_cols]], axis=1)

Unnamed: 0,Gender_female,Gender_male,People_elder,People_kid,People_youth,Age,Height
0,0.0,1.0,0.0,1.0,0.0,5,100
1,0.0,1.0,1.0,0.0,0.0,67,158
2,1.0,0.0,0.0,0.0,1.0,25,160
3,0.0,1.0,0.0,0.0,1.0,29,175
4,0.0,1.0,0.0,1.0,0.0,7,120
5,1.0,0.0,1.0,0.0,0.0,76,168


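As we can see, one-hot encoding has an obvious drawback: it can cause the curse of dimensionality and increase the cost of later computation. It is therefore a poor fit for categorical features with many distinct categories.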

### multiple categories  📎
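The approach is straightforward: cap the number of categories that get one-hot encoded. The simplest way is to keep only the n most frequent categories and group the rest into one, which limits the dimensionality to n+1.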

In [None]:
df = pd.read_csv('example2.csv')
df

Unnamed: 0,Region,Gender,People,Age,Height
0,A,male,kid,5,100
1,A,male,elder,67,158
2,E,female,youth,25,160
3,B,male,youth,29,175
4,B,male,kid,7,120
5,A,female,elder,76,168
6,B,female,elder,67,180
7,D,male,youth,25,165
8,B,male,kid,4,110
9,C,female,elder,66,158


In [None]:
df['Region'].value_counts()

B    4
A    3
E    1
D    1
C    1
Name: Region, dtype: int64

In [None]:
# keep the top 2; everything else is labeled "others"
top2_item = df['Region'].value_counts().head(2).index
df['Region(top2)'] = df['Region'].apply(lambda x: x if x in top2_item else 'others')
df

Unnamed: 0,Region,Gender,People,Age,Height,Region(top2)
0,A,male,kid,5,100,A
1,A,male,elder,67,158,A
2,E,female,youth,25,160,others
3,B,male,youth,29,175,B
4,B,male,kid,7,120,B
5,A,female,elder,76,168,A
6,B,female,elder,67,180,B
7,D,male,youth,25,165,others
8,B,male,kid,4,110,B
9,C,female,elder,66,158,others


Drop the original column and encode using the newly generated column as the feature.

In [None]:
df_new = df.drop(['Region'], axis=1)
data_dum = pd.get_dummies(df_new)
data_dum

Unnamed: 0,Age,Height,Gender_female,Gender_male,People_elder,People_kid,People_youth,Region(top2)_A,Region(top2)_B,Region(top2)_others
0,5,100,0,1,0,1,0,1,0,0
1,67,158,0,1,1,0,0,1,0,0
2,25,160,1,0,0,0,1,0,0,1
3,29,175,0,1,0,0,1,0,1,0
4,7,120,0,1,0,1,0,0,1,0
5,76,168,1,0,1,0,0,1,0,0
6,67,180,1,0,1,0,0,0,1,0
7,25,165,0,1,0,0,1,0,0,1
8,4,110,0,1,0,1,0,0,1,0
9,66,158,1,0,1,0,0,0,0,1


## Frequency Encoding
Use the frequency with which each category appears as the feature.

### pandas

In [None]:
df = pd.read_csv('example.csv')
df

Unnamed: 0,Gender,People,Age,Height
0,male,kid,5,100
1,male,elder,67,158
2,female,youth,25,160
3,male,youth,29,175
4,male,kid,7,120
5,female,elder,76,168


In [None]:
cate_cols = ['Gender', 'People']
sample_nums = df.shape[0]

In [None]:
# Use the raw counts directly
for col in cate_cols:
    freq_encode_dt = df[col].value_counts().to_dict()
    df[f"{col}(en)"] = df[col].apply(lambda x: freq_encode_dt[x])

df

Unnamed: 0,Gender,People,Age,Height,Gender(en),People(en)
0,male,kid,5,100,4,2
1,male,elder,67,158,4,2
2,female,youth,25,160,2,2
3,male,youth,29,175,4,2
4,male,kid,7,120,4,2
5,female,elder,76,168,2,2


In [None]:
# Convert the counts to proportions
for col in cate_cols:
    freq_encode_dt = df[col].value_counts().to_dict()
    df[f"{col}(en)"] = df[col].apply(lambda x: freq_encode_dt[x]/sample_nums)

df

Unnamed: 0,Gender,People,Age,Height,Gender(en),People(en)
0,male,kid,5,100,0.666667,0.333333
1,male,elder,67,158,0.666667,0.333333
2,female,youth,25,160,0.333333,0.333333
3,male,youth,29,175,0.666667,0.333333
4,male,kid,7,120,0.666667,0.333333
5,female,elder,76,168,0.333333,0.333333
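
A more compact variant of the proportion encoding, offered as a sketch: `value_counts(normalize=True)` returns the proportions directly, and `map` looks them up per row. The table is rebuilt inline so the cell runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["male", "male", "female", "male", "male", "female"],
    "People": ["kid", "elder", "youth", "youth", "kid", "elder"],
})

for col in ["Gender", "People"]:
    # value_counts(normalize=True) gives each category's share of the rows
    df[f"{col}(en)"] = df[col].map(df[col].value_counts(normalize=True))
```

Note that the three `People` categories all receive the same value here, so frequency encoding cannot tell apart categories that occur equally often.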


## Target Encoding (Mean Encoding) &nbsp; [<img width="1.8%" src="https://i.ibb.co/GJHbVbG/external-link.png">](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html)


<div align="center">
<img width="70%" src="https://i.ibb.co/dgbhG9D/topic2-2-Target-encoding-example.png">
</div>

<div align="right">

[Image source](https://www.kaggle.com/code/caesarlupum/catcomp-simple-target-encoding/notebook/)  

</div>

- Suitable when there are many categories, because it can reduce complexity.
- This method is particularly sensitive to outliers, so use it with care.

In [None]:
df = pd.read_csv('example.csv')
df

Unnamed: 0,Gender,People,Age,Height
0,male,kid,5,100
1,male,elder,67,158
2,female,youth,25,160
3,male,youth,29,175
4,male,kid,7,120
5,female,elder,76,168


scikit-learn supports `TargetEncoder` from version 1.3 onward (Colab had 1.2.2 installed when this notebook was written), so this section does not use sklearn for the example.

### pandas

This part follows [this article](https://www.kaggle.com/code/ryanholbrook/target-encoding).

In [None]:
df = pd.read_csv('example2.csv')
df

Unnamed: 0,Region,Gender,People,Age,Height
0,A,male,kid,5,100
1,A,male,elder,67,158
2,E,female,youth,25,160
3,B,male,youth,29,175
4,B,male,kid,7,120
5,A,female,elder,76,168
6,B,female,elder,67,180
7,D,male,youth,25,165
8,B,male,kid,4,110
9,C,female,elder,66,158


In [None]:
# assume Height is the prediction target
target = "Height"
cate_cols = ['Region', 'Gender', 'People']

In [None]:
for col in cate_cols:
     df[f"{col}(en)"] = df.groupby(col)[target].transform("mean")

In [None]:
df

Unnamed: 0,Region,Gender,People,Age,Height,Region(en),Gender(en),People(en)
0,A,male,kid,5,100,142.0,138.0,110.0
1,A,male,elder,67,158,142.0,138.0,166.0
2,E,female,youth,25,160,160.0,166.5,166.666667
3,B,male,youth,29,175,146.25,138.0,166.666667
4,B,male,kid,7,120,146.25,138.0,110.0
5,A,female,elder,76,168,142.0,166.5,166.0
6,B,female,elder,67,180,146.25,166.5,166.0
7,D,male,youth,25,165,165.0,138.0,166.666667
8,B,male,kid,4,110,146.25,138.0,110.0
9,C,female,elder,66,158,158.0,166.5,166.0


The result above is plain mean encoding; smoothing still needs to be added.

In [None]:
# Smoothing parameter (alpha)
alpha = 1

# Global mean of the target
global_mean = df[target].mean()

for col in cate_cols:
    # Mean target value and row count for each category
    category_stats = df.groupby(col)[target].agg(['mean', 'count'])

    # Smoothed target encoding
    Smooth_result = (category_stats['mean'] * category_stats['count'] + global_mean * alpha) / (category_stats['count'] + alpha)

    # Map each row's category to its encoded value
    encode_dt = pd.DataFrame(Smooth_result).to_dict()[0]
    df[f"{col}(en)"] = df[col].apply(lambda x: encode_dt[x])

In [None]:
df

Unnamed: 0,Region,Gender,People,Age,Height,Region(en),Gender(en),People(en)
0,A,male,kid,5,100,143.85,139.628571,119.85
1,A,male,elder,67,158,143.85,139.628571,162.68
2,E,female,youth,25,160,154.7,163.08,162.35
3,B,male,youth,29,175,146.88,139.628571,162.35
4,B,male,kid,7,120,146.88,139.628571,119.85
5,A,female,elder,76,168,143.85,163.08,162.68
6,B,female,elder,67,180,146.88,163.08,162.68
7,D,male,youth,25,165,157.2,139.628571,162.35
8,B,male,kid,4,110,146.88,139.628571,119.85
9,C,female,elder,66,158,153.7,163.08,162.68


The alpha value can be tuned as needed; a larger alpha gives more weight to the global mean. Below we try 10.

In [None]:
# Smoothing parameter (alpha)
alpha = 10

# Global mean of the target
global_mean = df[target].mean()

for col in cate_cols:
    # Mean target value and row count for each category
    category_stats = df.groupby(col)[target].agg(['mean', 'count'])

    # Smoothed target encoding
    Smooth_result = (category_stats['mean'] * category_stats['count'] + global_mean * alpha) / (category_stats['count'] + alpha)

    # Map each row's category to its encoded value
    encode_dt = pd.DataFrame(Smooth_result).to_dict()[0]
    df[f"{col}(en)"] = df[col].apply(lambda x: encode_dt[x])

In [None]:
df

Unnamed: 0,Region,Gender,People,Age,Height,Region(en),Gender(en),People(en)
0,A,male,kid,5,100,147.692308,145.125,140.307692
1,A,male,elder,67,158,147.692308,145.125,154.142857
2,E,female,youth,25,160,150.363636,154.285714,153.384615
3,B,male,youth,29,175,148.5,145.125,153.384615
4,B,male,kid,7,120,148.5,145.125,140.307692
5,A,female,elder,76,168,147.692308,154.285714,154.142857
6,B,female,elder,67,180,148.5,154.285714,154.142857
7,D,male,youth,25,165,150.818182,145.125,153.384615
8,B,male,kid,4,110,148.5,145.125,140.307692
9,C,female,elder,66,158,150.181818,154.285714,154.142857


Likewise, alpha can be made smaller (close to 0) to give more weight to the local, per-category mean.

In [None]:
# Smoothing parameter (alpha)
alpha = 0.01

# Global mean of the target
global_mean = df[target].mean()

for col in cate_cols:
    # Mean target value and row count for each category
    category_stats = df.groupby(col)[target].agg(['mean', 'count'])

    # Smoothed target encoding
    Smooth_result = (category_stats['mean'] * category_stats['count'] + global_mean * alpha) / (category_stats['count'] + alpha)

    # Map each row's category to its encoded value
    encode_dt = pd.DataFrame(Smooth_result).to_dict()[0]
    df[f"{col}(en)"] = df[col].apply(lambda x: encode_dt[x])

In [None]:
df

Unnamed: 0,Region,Gender,People,Age,Height,Region(en),Gender(en),People(en)
0,A,male,kid,5,100,142.024585,138.018968,110.130897
1,A,male,elder,67,158,142.024585,138.018968,165.958603
2,E,female,youth,25,160,159.89505,166.457357,166.609302
3,B,male,youth,29,175,146.257855,138.018968,166.609302
4,B,male,kid,7,120,146.257855,138.018968,110.130897
5,A,female,elder,76,168,142.024585,166.457357,165.958603
6,B,female,elder,67,180,146.257855,166.457357,165.958603
7,D,male,youth,25,165,164.845545,138.018968,166.609302
8,B,male,kid,4,110,146.257855,138.018968,110.130897
9,C,female,elder,66,158,157.914851,166.457357,165.958603
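
The three cells above differ only in `alpha`. Each computes, for a category c with n_c rows, the value (n_c * mean_c + alpha * global_mean) / (n_c + alpha). As a sketch, the same computation can be wrapped in a helper; the function name `smoothed_target_encode` is hypothetical, and only the Region and Height columns are rebuilt inline.

```python
import pandas as pd

def smoothed_target_encode(df, col, target, alpha):
    """Return the smoothed target encoding of df[col] as a Series."""
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(["mean", "count"])
    # Weighted average of the category mean and the global mean
    smooth = (stats["mean"] * stats["count"] + global_mean * alpha) / (stats["count"] + alpha)
    return df[col].map(smooth)

df = pd.DataFrame({
    "Region": ["A", "A", "E", "B", "B", "A", "B", "D", "B", "C"],
    "Height": [100, 158, 160, 175, 120, 168, 180, 165, 110, 158],
})

# One encoded column per alpha, for side-by-side comparison
for alpha in (0.01, 1, 10):
    df[f"Region(alpha={alpha})"] = smoothed_target_encode(df, "Region", "Height", alpha)
```

Placing the columns side by side makes it easy to see the rare regions (C, D, E) being pulled toward the global mean of 149.4 as alpha grows.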


### category_encoders   [<img width="1.5%" src="https://i.ibb.co/GJHbVbG/external-link.png">](https://contrib.scikit-learn.org/category_encoders/targetencoder.html)
First install the category_encoders package.

In [None]:
!pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.6.2-py2.py3-none-any.whl (81 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.2


In [None]:
import category_encoders as ce

In [None]:
df = pd.read_csv('example2.csv')
df

Unnamed: 0,Region,Gender,People,Age,Height
0,A,male,kid,5,100
1,A,male,elder,67,158
2,E,female,youth,25,160
3,B,male,youth,29,175
4,B,male,kid,7,120
5,A,female,elder,76,168
6,B,female,elder,67,180
7,D,male,youth,25,165
8,B,male,kid,4,110
9,C,female,elder,66,158


In [None]:
# assume Height is the prediction target
target = "Height"
cate_cols = ['Region', 'Gender', 'People']

In [None]:
enc = ce.TargetEncoder(cols=cate_cols)

target_encode_df = enc.fit_transform(df[cate_cols], df[target])
target_encode_df

Unnamed: 0,Region,Gender,People
0,148.256957,147.144896,143.314069
1,148.256957,147.144896,152.188495
2,150.77915,152.272486,152.0671
3,148.870858,147.144896,152.0671
4,148.870858,147.144896,143.314069
5,148.256957,152.272486,152.188495
6,148.870858,152.272486,152.188495
7,151.429692,147.144896,152.0671
8,148.870858,147.144896,143.314069
9,150.518933,152.272486,152.188495


### Leave-One-Out Encoding (LOO) 📎

<div align="center">

<font>Target encoding</font>

<img width="75%" src="https://i.ibb.co/2F3JrFd/topic2-2-Target-encoding.webp">

<font>Leave-One-Out Encoding</font>

<img width="75%" src="https://i.ibb.co/swGg0NM/topic2-2-Leave-One-Out-Encoding.webp">
</div>

<div align="right">

[Image source](https://axk51013.medium.com/kaggle-categorical-encoding-3%E5%A4%A7%E7%B5%95%E6%8B%9B-589780119470)  

</div>



In [None]:
df = pd.read_csv('example2.csv')
df

Unnamed: 0,Region,Gender,People,Age,Height
0,A,male,kid,5,100
1,A,male,elder,67,158
2,E,female,youth,25,160
3,B,male,youth,29,175
4,B,male,kid,7,120
5,A,female,elder,76,168
6,B,female,elder,67,180
7,D,male,youth,25,165
8,B,male,kid,4,110
9,C,female,elder,66,158


In [None]:
# assume Height is the prediction target
target = "Height"
cate_cols = ['Region', 'Gender', 'People']

In [None]:
import category_encoders as ce
encoder = ce.LeaveOneOutEncoder(cols=cate_cols, return_df=True)
LOO_encode_df = encoder.fit_transform(df[cate_cols], df[target])

In [None]:
LOO_encode_df

Unnamed: 0,Region,Gender,People
0,163.0,145.6,115.0
1,134.0,134.0,168.666667
2,149.4,168.666667,170.0
3,136.666667,130.6,162.5
4,155.0,141.6,105.0
5,129.0,166.0,165.333333
6,135.0,162.0,161.333333
7,149.4,132.6,167.5
8,158.333333,143.6,110.0
9,149.4,169.333333,168.666667
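
To see where these numbers come from, the sketch below recomputes the Region column by hand with pandas: each row gets the mean target of the other rows in its category. Falling back to the global mean for categories with a single row is an assumption that appears consistent with the output above (149.4 for regions C, D, and E); other options of the library encoder are not covered.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["A", "A", "E", "B", "B", "A", "B", "D", "B", "C"],
    "Height": [100, 158, 160, 175, 120, 168, 180, 165, 110, 158],
})
target = "Height"
global_mean = df[target].mean()

grp = df.groupby("Region")[target]
group_sum = grp.transform("sum")
group_count = grp.transform("count")

# Mean of the other rows in the same category: (sum - own value) / (count - 1)
loo = (group_sum - df[target]) / (group_count - 1)

# A category with a single row has no other rows; fall back to the global mean
df["Region(loo)"] = loo.where(group_count > 1, global_mean)
```

Because a row's own target never enters its encoding, this reduces the direct leak that plain mean encoding has on the training set.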


### Beta Target Encoding (BTE) 📎
BTE holds that statistics other than the mean are also informative, so it encodes several statistics at once. Interested students can refer to:
[Beta Target Encoding | kaggle](https://www.kaggle.com/code/mmotoki/beta-target-encoding)

# Part 3: Numerical Prediction

## Linear regression  &nbsp; [<img width="1.8%" src="https://i.ibb.co/GJHbVbG/external-link.png">](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
This part mainly follows [this article](https://medium.com/analytics-vidhya/linear-regression-using-iris-dataset-hello-world-of-machine-learning-b0feecac9cc1).

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()

In [None]:
iris_df = pd.DataFrame(data=iris.data,
                  columns=iris.feature_names)
iris_df['species'] = [iris.target_names[i] for i in iris.target]
iris_df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_absolute_error, mean_squared_error

# Prepare the data: replace the string species column with its integer label
iris_df.drop('species', axis= 1, inplace= True)
target_df = pd.DataFrame(columns= ['species'], data= iris.target)
iris_df = pd.concat([iris_df, target_df], axis= 1)

X= iris_df.drop(labels= 'sepal length (cm)', axis= 1)
y= iris_df['sepal length (cm)']

# Splitting the Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.33, random_state= 101)

# Instantiating LinearRegression() Model
lr = LinearRegression()

# Training/Fitting the Model
lr.fit(X_train, y_train)

# Making Predictions
pred = lr.predict(X_test)

# Evaluating Model's Performance (shown with 3 significant digits)
print('Mean Absolute Error: {:.3}'.format(mean_absolute_error(y_test, pred)))
print('Mean Squared Error: {:.3}'.format(mean_squared_error(y_test, pred)))
# RMSE is the square root of MSE (the squared=False argument is deprecated in recent scikit-learn releases)
print('Root Mean Squared Error: {:.3}'.format(mean_squared_error(y_test, pred) ** 0.5))

Mean Absolute Error: 0.26
Mean Squared Error: 0.102
Root Mean Squared Error: 0.319


In [None]:
lr.score(X_train, y_train)

0.8666129758784316
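
The score above is R² on the training set, which tends to be optimistic. As a sketch, the same model can also be scored on the held-out test set; the cell below rebuilds the data with `load_iris(as_frame=True)` so that it runs on its own, which is a slightly different loading path from the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
# Four measurements plus the integer species label
data = iris.frame.rename(columns={"target": "species"})

X = data.drop(columns="sepal length (cm)")
y = data["sepal length (cm)"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

lr = LinearRegression().fit(X_train, y_train)

train_r2 = lr.score(X_train, y_train)            # R^2 on data the model has seen
test_r2 = r2_score(y_test, lr.predict(X_test))   # R^2 on held-out data
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

A test score far below the training score is a sign of overfitting.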

Regression workflow for your own data (condensed version)

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

# step1: load & prepare the data (define the features X and the target y here)


# step2: split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.33, random_state= 101)

# step3: create the regression model
lr = LinearRegression()

# step4: train the regression model
lr.fit(X_train, y_train)

# step5: predict on the test set
pred = lr.predict(X_test)

# step6: compute the evaluation metric
mean_absolute_error(y_test, pred)
```

## Evaluation metrics  &nbsp; [<img width="1.8%" src="https://i.ibb.co/GJHbVbG/external-link.png">](https://scikit-learn.org/stable/modules/classes.html#regression-metrics)


<div align="center">

<img width="60%" src="https://i.ibb.co/sPQRGHy/topic2-3-metrics-type.webp">

</div>

<div align="right">

[Image source](https://medium.com/jackys-blog/%E4%BB%8B%E7%B4%B9%E5%90%84%E7%A8%AE%E5%B8%B8%E7%94%A8%E7%9A%84metrics-8ef5f8b3ca90)  

</div>

<div align="center">

<a href="https://scikit-learn.org/stable/modules/classes.html#regression-metrics">

<img width="75%" src="https://i.ibb.co/qnNFMK5/topic2-3-Regression-metrics-sklearn-support.png">

</a>

</div>
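
To connect the metric names to their formulas, the sketch below computes MAE, MSE, RMSE, and R² by hand with NumPy on a few made-up values; the results can be compared against the corresponding scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Small made-up example values
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

err = y_true - y_pred
mae = np.mean(np.abs(err))   # mean absolute error
mse = np.mean(err ** 2)      # mean squared error
rmse = np.sqrt(mse)          # root mean squared error, same unit as the target
# R^2: 1 minus the ratio of residual to total sum of squares
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mean_absolute_error(y_true, y_pred))
print(mse, mean_squared_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))
```

MAE and RMSE are in the same unit as the target; RMSE penalizes large errors more heavily than MAE.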

# Supplement


## Using cat to write multi-line text to a file

In [None]:
%%bash
cat >  example.txt << EOL
i
am
groot
EOL

In [None]:
!ls

example.txt  sample_data


In [None]:
!cat example.txt

i
am
groot


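- The cells above use the magic command `%%bash` together with multi-line shell syntax to write several lines of text straight into a file. This is an advanced use of `cat` combined with an EOL (end of line) marker, known in shell terms as a here-document.

- Note that in a Colab code cell, `%%bash` tells the IPython kernel that the whole cell is shell syntax, so commands do not need an extra `!` prefix.

- EOF (end of file) can be used for the same purpose, as follows: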
    ```shell
    cat << EOF > example2.txt
    i
    am
    groot
    EOF
    ```

- EOF and EOL are just conventional markers; the syntax allows any custom marker text. Using our school's abbreviation, NCCU, gives the same result, as follows:
    ```shell
    cat > example.txt << NCCU
    i
    am
    groot
    NCCU
    ```

## Colab feature: suggest chart

In [None]:
%%bash
cat >  example.csv << EOL
Gender,People,Age,Height
male,kid,5,100
male,elder,67,158
female,youth,25,160
male,youth,29,175
male,kid,7,120
female,elder,76,168
EOL

In [None]:
import pandas as pd

df = pd.read_csv("example.csv")
df

Unnamed: 0,Gender,People,Age,Height
0,male,kid,5,100
1,male,elder,67,158
2,female,youth,25,160
3,male,youth,29,175
4,male,kid,7,120
5,female,elder,76,168


# Recommended learning resources

- [Data ScienceTutorial for Beginners | Kaggle](https://www.kaggle.com/code/kanncaa1/data-sciencetutorial-for-beginners)
    - Suitable for beginners; it covers basic Python syntax relevant to data science (first two chapters) and pandas data manipulation (chapters 3-5).
    - pandas has many features, but you should at least be able to use the pandas methods introduced above.
- [How to use the Kaggle API from Colab - Colaboratory (google.com)](https://colab.research.google.com/github/corrieann/kaggle/blob/master/kaggle_api_in_colab.ipynb#scrollTo=SHVqmMXfilWG)
    - The Kaggle API makes it easier to get data from Kaggle.
- [A Comprehensive Overview of Regression Evaluation Metrics | NVIDIA Technical Blog](https://developer.nvidia.com/blog/a-comprehensive-overview-of-regression-evaluation-metrics/)
- [Evaluation Metrics for Your Regression Model - Analytics Vidhya](https://www.analyticsvidhya.com/blog/2021/05/know-the-best-evaluation-metrics-for-your-regression-model/)

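# Today's exercise 💪
### Assignment
Each student should create a new Colab notebook, and each group should pick a dataset on Kaggle. Implement the following:

1. Build a basic regression model on the dataset, choose one evaluation metric, and explain why you chose it.
2. Implement at least 3 feature engineering methods and compare their results.
3. Implement at least 3 data encoding methods and compare their results.

<font color="#48C9B0">Dataset</font>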

[Find Open Datasets and Machine Learning Projects | Kaggle](https://www.kaggle.com/datasets?search=Linear+regression&sort=votes&sizeEnd=100%2CMB)
- search: Linear regression
- sort: Most Votes
- data size(options): <100MB
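- Contains numeric and categorical fields

<font color="#48C9B0">Notes</font>
- You may copy other people's code directly, but you must cite the source.
- You must understand the whole process yourself and add your own explanations.

### How to submit
> *When the assignment is finished, go to File → Print → Save as PDF in the top-left corner of Colab and upload the PDF of the Colab notebook to Moodle. If a visualization does not render in the PDF, you may attach a Colab sharing link.*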
