![](https://cdn-images-1.medium.com/max/1600/1*jX6Gwn1rt4da7e-yUj84IQ.png)

### Likelihood Encoding
- Also known as Impact Encoding, Mean Encoding, or Target Encoding.


In [0]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

### Load the data

In [0]:
PATH = 'data/'
train_data = pd.read_table(PATH + 'train.tsv', engine='c')
train_data.rename(index=str, columns={'price':'y'},inplace=True)

### Find the categorical (object dtype) columns

In [0]:
categorical_features = []

for dtype, feature in zip(train_data.dtypes, train_data.columns):
    if dtype == object:
        categorical_features.append(feature)

categorical_features

['name', 'category_name', 'brand_name', 'item_description']

In [0]:
for f_ in categorical_features:   
    print('{} has {} unique items'.format(f_, train_data[f_].nunique()))

name has 1225273 unique items
category_name has 1287 unique items
brand_name has 4809 unique items
item_description has 1281426 unique items


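## `name` and `item_description` are free-text features, so they are excluded
- Note that **brand_name** is missing in as many as 42% of the rows, so strictly speaking it should not be target-encoded
- To keep the exercise simple, the rows with these NaN values are dropped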

In [0]:
train_data = train_data[train_data['brand_name'].notnull()]
train_data = train_data[train_data['category_name'].notnull()]

categorical_features.remove('name')
categorical_features.remove('item_description')
categorical_features

['category_name', 'brand_name']

# Mean encodings without Regularization
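### Two examples follow. They differ only in that one uses `map` and the other uses `transform`

   - Example 1


The correlation with the target is measured with `np.corrcoef`.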



In [0]:
for f_ in categorical_features:
    # First compute the overall target mean (the "global mean")
    global_mean = train_data['y'].mean()
    
    # Next, compute the target mean of each category (the group mean)
    item_id_target_mean = train_data.groupby(f_).y.mean()

    # Add a column holding each row's group mean, looked up with map
    train_data['item_target_enc'] = train_data[f_].map(item_id_target_mean)

    # Fill missing values with the global mean. This is only good practice here: there are no NaNs in this example because the encoding is computed on the full train set
    train_data['item_target_enc'] = train_data['item_target_enc'].fillna(global_mean)

    # Check the correlation with the target
    encoded_feature = train_data['item_target_enc'].values
    print('Corr between {} and target is: {}'.format(f_ ,np.corrcoef(train_data['y'].values, encoded_feature)[0][1]))

Corr between category_name and target is: 0.4465404391861943
Corr between brand_name and target is: 0.48512436987737967


   - Example 2

In [0]:
for f_ in categorical_features:   
    
    # First compute the overall target mean (the "global mean")
    global_mean = train_data['y'].mean()
    
    # Next, compute the target mean of each category (the group mean)
    item_id_target_mean = train_data.groupby(f_).y.mean()
    
    # Add a column holding each row's group mean, computed with transform
    train_data['item_target_enc'] = train_data.groupby(f_)['y'].transform('mean')

    # Fill missing values with the global mean, for the same reason as in Example 1
    train_data['item_target_enc'] = train_data['item_target_enc'].fillna(global_mean)

    # Check the correlation with the target
    encoded_feature = train_data['item_target_enc'].values
    print('Corr between {} and target is: {}'.format(f_ ,np.corrcoef(train_data['y'].values, encoded_feature)[0][1]))

Corr between category_name and target is: 0.4465404391861943
Corr between brand_name and target is: 0.48512436987737967
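
Both examples print identical correlations because `map` and `transform('mean')` assign the same group mean to every row. A minimal check on a hypothetical toy frame (the `toy` data and the `brand` column are made up for illustration and are not part of `train.tsv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'brand': ['a', 'a', 'b', 'b', 'b', 'c'],
    'y':     [10., 20., 30., 40., 50., 60.],
})

# Example 1 style: build a category -> mean lookup, then map it onto the rows
group_mean = toy.groupby('brand')['y'].mean()
enc_map = toy['brand'].map(group_mean)

# Example 2 style: transform broadcasts each group's mean back to its rows
enc_transform = toy.groupby('brand')['y'].transform('mean')

print(enc_map.tolist())
print(np.allclose(enc_map.values, enc_transform.values))
```

The `map` version keeps the lookup table around, which is handy when the same encoding has to be applied to another data set later; `transform` is shorter when only the current frame is encoded.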


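# Exercise time:
## Mean encodings with Regularization
###   1. Introduce regularization to avoid overfitting
- This regularization is not an L1 or L2 penalty
    
###   2. Reference metric: check the correlation with the target
   - Remember: the correlation your answers produce should not exceed the correlation obtained from the full-train-set encoding, i.e. the values from Examples 1 and 2
   - Remember: a correlation below that value does not guarantee the absence of overfitting
   
### 3. Building on Example 1 or Example 2, complete the following
1. KFold scheme
2. Smoothing
3. Smoothing and noising

### 4. The exercises are designed so that each one can be solved in just two lines
- Of course, you are not limited to two lines; feel free to solve them however you like!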

## 1. KFold scheme
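#### This exercise checks whether you understand the KFold scheme. **Many later exercises build on KFold**, so please make sure you really understand it.
#### You may also use a hold-out scheme.
- Hint: 
- Approach: split the data into N folds; in each iteration of the fold loop, compute the statistics on (N-1) folds and apply them to the remaining fold, rotating until every fold has been encoded.
- Only two lines are needed
    - Line 1: select the rows for the split
    - Line 2: use a conditional replace, applying the method from Example 1 or 2

You may find **pandas conditional replace** useful (google it).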
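
The fold rotation described above can be sketched on a hypothetical toy frame. This is an illustrative sketch, not the official answer: the `toy` data, the `brand` and `enc` columns, and the choice of 3 folds are assumptions made only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'brand': ['a', 'a', 'a', 'b', 'b', 'b'],
    'y':     [1., 2., 3., 10., 20., 30.],
})
global_mean = toy['y'].mean()
toy['enc'] = np.nan

kf = KFold(n_splits=3, shuffle=False)
enc_col = toy.columns.get_loc('enc')
for tr_ind, val_ind in kf.split(toy):
    x_tr, x_val = toy.iloc[tr_ind], toy.iloc[val_ind]
    # Group means are computed only from the (N-1) training folds ...
    fold_means = x_tr.groupby('brand')['y'].mean()
    # ... and written only into the held-out fold, by position
    toy.iloc[val_ind, enc_col] = x_val['brand'].map(fold_means).values

# Categories unseen in the training folds stay NaN and fall back to the global mean
toy['enc'] = toy['enc'].fillna(global_mean)
print(toy)
```

Because each row is encoded with means that exclude its own target value, the encoded column leaks less target information than the full-train-set version in Examples 1 and 2.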


```
# Expected output for this exercise
Corr between category_name and target is: 0.4448439052519606
Corr between brand_name and target is: 0.47786979474324226

```
#### The exercise cell is below. Try to write it yourself and compare against the numbers above; look at the Answer only if you get stuck



In [0]:
from sklearn.model_selection import KFold
kf = KFold(n_splits = 5, shuffle = False) 

global_mean = train_data['y'].mean()

for f_ in categorical_features:    
    
    train_data['item_target_enc'] = np.nan # initialize with NaN
    
    for tr_ind, val_ind in kf.split(train_data): 
        # tr_ind and val_ind are the row positions of each split: the former covers the (N-1) training folds, the latter the remaining fold
        
        # YOUR CODE GOES HERE:
        # Line 1:
        
        # Line 2:
     
        
    train_data['item_target_enc'] = train_data['item_target_enc'].fillna(global_mean)
    encoded_feature = train_data['item_target_enc'].values
    # You will need to compute the correlation like this
    corr = np.corrcoef(train_data['y'].values, encoded_feature)[0][1]
    print('Corr between {} and target is: {}'.format(f_, corr))

## 2. Smoothing
#### Hint:
- Line 1: see the **numerator** of the formula on the slide
    - You may find `np.multiply` useful
- Line 2: see the **denominator** of the formula on the slide
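
The slide itself is not reproduced here. Assuming its formula is the usual additive smoothing, `(category_mean * n + global_mean * alpha) / (n + alpha)`, which matches the code prefilled in exercise 3, the effect can be sketched on hypothetical toy data (`toy`, `brand`, and `alpha = 1` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'a' is frequent, 'b' appears only once
toy = pd.DataFrame({
    'brand': ['a', 'a', 'a', 'a', 'b'],
    'y':     [10., 10., 10., 10., 60.],
})
alpha = 1  # assumed smoothing strength, chosen small to suit the toy data
global_mean = toy['y'].mean()

grp = toy.groupby('brand')['y']
cat_mean = grp.transform('mean')
cat_count = grp.transform('count')

# Numerator: category sum plus alpha "pseudo-rows" carrying the global mean
numerator = np.multiply(cat_mean, cat_count) + global_mean * alpha
# Denominator: real row count plus the alpha pseudo-rows
smoothed = numerator / (cat_count + alpha)
print(smoothed.tolist())
```

The rare category `b` is pulled strongly toward the global mean, while the frequent category `a` barely moves. That is the regularizing effect: the fewer rows a category has, the less its own mean is trusted.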


```
# Expected output for this exercise
Corr between category_name and target is: 0.4439601079786612
Corr between brand_name and target is: 0.4569492306386542
```
#### The exercise cell is below. Try to write it yourself and compare against the numbers above; look at the Answer only if you get stuck




In [0]:
alpha = 100
global_mean = train_data['y'].mean()

for f_ in categorical_features:    

    train_data['item_target_mean'] = train_data.groupby(f_)['y'].transform('mean')
    train_data['target_count'] = train_data.groupby(f_)['y'].transform('count')
    
    # YOUR CODE GOES HERE
    train_data['item_target_enc_smg'] = # Line 1

    train_data['item_target_enc_smg'] = # Line 2

    encoded_feature = train_data['item_target_enc_smg'].values
    corr = np.corrcoef(train_data['y'].values, encoded_feature)[0][1]
    print('Corr between {} and target is: {}'.format(f_, corr))


## 2-1. Smoothing paper [Daniele Micci-Barreca](https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf)


- Hint:
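1. Equation 4 as used in this exercise: 
    - n is the number of samples in the category
    - k is min_samples_leaf (preset; feel free to tune it)
    - f is smoothing (preset; feel free to tune it)
2. Equation 5 as used in this exercise: 
    - B is the smoothing factor
    - What do "y head" (the y with a mark on top) and y stand for? Think about it
    
#### The answer for this exercise is based on
  1. smoothing = 5
  2. min_samples_leaf = 100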
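
As a rough sketch of how the two equations fit together, assuming Equation 4 is the sigmoid weight `1 / (1 + exp(-(n - k) / f))` and Equation 5 blends the category mean with the global mean using that weight (this is one reading of the paper, and the toy data and the values of `k` and `f` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'brand': ['a'] * 4 + ['b'] * 2,
    'y':     [10., 10., 10., 10., 40., 40.],
})
k = 2    # plays the role of min_samples_leaf (assumed value for the toy data)
f = 1.0  # plays the role of smoothing (assumed value for the toy data)
global_mean = toy['y'].mean()

grp = toy.groupby('brand')['y']
n = grp.transform('count')
cat_mean = grp.transform('mean')

# Equation 4 as read here: the weight grows from 0 to 1 as n passes k
weight = 1 / (1 + np.exp(-(n - k) / f))
# Equation 5 as read here: blend the category mean with the global mean
enc = weight * cat_mean + (1 - weight) * global_mean
print(enc.round(3).tolist())
```

When `n` equals `k` the weight is exactly 0.5, so the encoding sits halfway between the category mean and the global mean; `f` controls how quickly the weight moves away from 0.5 as `n` changes.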
  

```
# Expected output for this exercise
Corr between category_name and target is: 0.4423907460273596
Corr between brand_name and target is: 0.46616098005914425
```
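#### The exercise cell is below. Try to write it yourself and compare against the numbers above; look at the Answer only if you get stuck
###  **TA IS NOT ALWAYS RIGHT. Corrections are welcome if you spot a mistake.**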



In [0]:

global_mean = train_data['y'].mean()
smoothing = 5 # try other values
min_samples_leaf = 100 # try other values

for f_ in categorical_features:    

    train_data['item_target_mean'] = train_data.groupby(f_)['y'].transform('mean')
    train_data['target_count'] = train_data.groupby(f_)['y'].transform('count')
    
    # Please refer to Equation 4 of the paper
    # YOUR CODE GOES HERE 

    
    # Please refer to Equation 5 of the paper
    # YOUR CODE GOES HERE 


    encoded_feature = train_data['item_target_enc_smg'].values
    corr = np.corrcoef(train_data['y'].values, encoded_feature)[0][1]
    print('Corr between {} and target is: {}'.format(f_ ,np.corrcoef(train_data['y'].values, encoded_feature)[0][1]))

## 3. Smoothing and Noising

A method often used by [Owen Zhang](https://www.linkedin.com/in/owen-zhang-363aa051)
- Current: Hedge Fund, DataRobot
- Previously: MeForo (USA), Inc, DataRobot, AIG

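#### Adding noise is not a great method: there is no rule for how much noise to add
- Too much noise, and the noise dominates everything
- Too little noise, and you might as well not add any


- Hint:
    - This exercise builds on Exercise 2; it simply adds noise to that result
    - The key point of this exercise is the dimensions (array shape) :-)
    
Function to use:
`np.random.randn`

#### The answer for this exercise is based on
1. noise_level = 0.05
2. alpha = 100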
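
A minimal sketch of the shape issue the hint points at. It assumes multiplicative Gaussian noise, which is one common choice and not necessarily the intended answer; the `smoothed` values are hypothetical.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical smoothed encodings, one value per row
smoothed = np.array([12., 12., 12., 12., 40.])
noise_level = 0.05  # standard deviation of the relative noise

# randn(n) returns shape (n,), so there is exactly one draw per row.
# randn(n, 1) would broadcast against shape (n,) into an (n, n) matrix.
noise = np.random.randn(len(smoothed))

# Assumption: multiplicative noise, so the perturbation scales with the value
noisy = smoothed * (1 + noise_level * noise)
print(noisy.shape)
```

Since the noise is random, the resulting correlation changes from run to run unless a seed is fixed.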


```
# Expected output for this exercise
Corr between category_name and target is: 0.4423907460273596
Corr between brand_name and target is: 0.46616098005914425
```
#### The exercise cell is below. Try to write it yourself and compare against the numbers above; look at the Answer only if you get stuck




In [0]:

alpha = 100 # adjustable
global_mean = train_data['y'].mean()
noise_level = 0.05 # adjustable (standard deviation of the noise)

for f_ in categorical_features:    

    train_data['item_target_mean'] = train_data.groupby(f_)['y'].transform('mean')
    train_data['target_count'] = train_data.groupby(f_)['y'].transform('count')
    
    
    train_data['item_target_enc_smg'] = np.multiply(train_data['item_target_mean'] ,train_data['target_count'] ) + global_mean * alpha
    train_data['item_target_enc_smg'] = train_data['item_target_enc_smg'] / (train_data['target_count'] + alpha)
    
    # YOUR CODE GOES HERE
    encoded_feature = # just one line; work it out yourself
    
    corr = np.corrcoef(train_data['y'].values, encoded_feature)[0][1]
    print('Corr between {} and target is: {}'.format(f_ ,np.corrcoef(train_data['y'].values, encoded_feature)[0][1]))
