![](https://cdn-images-1.medium.com/max/1600/1*jX6Gwn1rt4da7e-yUj84IQ.png)

### 這只是對分類特徵做 Likelihood Encoding 

### 也稱為Impact Encoding 或 Mean Encoding 或 Target Encoding。



In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import KFold

In [2]:
PATH = 'data/'
train_data = pd.read_table(PATH + 'train.tsv', engine='c')
train_data.rename(index=str, columns={'price':'y'},inplace=True)

In [3]:
test_data = pd.read_table(PATH + 'test.tsv',  engine='c')

# 抓出Category dtypes

In [6]:
categorical_features = []

for dtype, feature in zip(train_data.dtypes, train_data.columns):
    if dtype == object:
        categorical_features.append(feature)

categorical_features

['name', 'category_name', 'brand_name', 'item_description']

In [7]:
for f_ in categorical_features:   
    print('{} has {} unique items'.format(f_, train_data[f_].nunique()))

name has 1225273 unique items
category_name has 1287 unique items
brand_name has 4809 unique items
item_description has 1281426 unique items


## name, item_description 是文字特徵，所以不能算在內
- 請記住 **band_name** 缺失值多達42%，本不應該拿去做target encoding
- 為了教學，我們先移除這些NaN

In [8]:
train_data = train_data[train_data['brand_name'].notnull()]
train_data = train_data[train_data['category_name'].notnull()]


categorical_features.remove('name')
categorical_features.remove('item_description')
categorical_features

['category_name', 'brand_name']

# Mean encodings without regularization
   - 範例1

In [7]:
for f_ in categorical_features:    
    global_mean = train_data['y'].mean()
    # Calculate a mapping: {item_id: target_mean}
    item_id_target_mean = train_data.groupby(f_).y.mean()

    # In our non-regularized case we just *map* the computed means to the `item_id`'s
    train_data['item_target_enc'] = train_data[f_].map(item_id_target_mean)

    # Fill NaNs
    train_data['item_target_enc'].fillna(global_mean, inplace=True) 

    # Print correlation
    encoded_feature = train_data['item_target_enc'].values
    print('Corr between {} and target is: {}'.format(f_ ,np.corrcoef(train_data['y'].values, encoded_feature)[0][1]))

Corr between category_name and target is: 0.4465404391861943
Corr between brand_name and target is: 0.48512436987737967


   - 範例2

In [8]:
for f_ in categorical_features:    
    global_mean = train_data['y'].mean()
    # Calculate a mapping: {item_id: target_mean}
    item_id_target_mean = train_data.groupby(f_).y.mean()
    train_data['item_target_enc'] = train_data.groupby(f_)['y'].transform('mean')

    # Fill NaNs
    train_data['item_target_enc'].fillna(global_mean, inplace=True) 

    # Print correlation
    encoded_feature = train_data['item_target_enc'].values
    print('Corr between {} and target is: {}'.format(f_ ,np.corrcoef(train_data['y'].values, encoded_feature)[0][1]))

Corr between category_name and target is: 0.4465404391861943
Corr between brand_name and target is: 0.48512436987737967


# 練習時間：
## Mean encodings with Regularization
###   1. 引入regularization 避免 Overfitting
    - 此Regularization並不是L1, L2 Penalty
    
###   2. 參考指標，檢視跟Target的相關性。 
   - 謹記，您的作業的相關性，不應該高過全域的相關性，就是範例1, 2的相關性
   - 謹記，低於全域的相關性，不等於一定不會Overfitting
   
### 3. 請基於 範例1 or 範例2 完成以下
1. KFold scheme
2. Smoothing
3. Smoothing and noising

### 4. 練習題採取雙刀流，即簡單的「兩行內」就可以搞定。

## 1. KFold scheme

- Hint: 本例，在測試是否了解kold機制，因為之後很多練習都是基於Kfold，去做處理，請學員務必務必弄懂。
- 作法：假設切成N個fold，每次fold loop 取(N-1)份fold 資訊，去套用在剩下的那一份，vice versa.
- 只有兩行
    - 第一行：指派切割位置
    - 第二行：使用條件取代，套用在範例1 or 2 方法

您可能會用到，**pandas conditional replace** (google it)

In [9]:
# YOUR CODE GOES HERE
from sklearn.model_selection import KFold
kf = KFold(n_splits = 5, shuffle = False) 
global_mean = train_data['y'].mean()

for f_ in categorical_features:    
    
    train_data['item_target_enc'] = np.nan
    for tr_ind, val_ind in kf.split(train_data):
        X_tr, X_val = train_data.iloc[tr_ind], train_data.iloc[val_ind]
        train_data.loc[train_data.index[val_ind], 'item_target_enc'] = X_val[f_ ].map(X_tr.groupby(f_ ).y.mean())

    train_data['item_target_enc'].fillna(global_mean, inplace = True)
    encoded_feature = train_data['item_target_enc'].values
    # You will need to compute correlation like that
    corr = np.corrcoef(train_data['y'].values, encoded_feature)[0][1]
    print('Corr between {} and target is: {}'.format(f_ ,np.corrcoef(train_data['y'].values, encoded_feature)[0][1]))



Corr between category_name and target is: 0.4448439052519598
Corr between brand_name and target is: 0.4778697947432403


## 2. Smoothing
#### Hint:
- 第一行：請參考Slide數學公式的分子
    - 您可能會用到 `np.multiply` 
- 第二行：請參考Slide數學公式的分母

In [10]:
# YOUR CODE GOES HERE
alpha = 10
global_mean = train_data['y'].mean()

for f_ in categorical_features:    

    train_data['item_target_mean'] = train_data.groupby(f_)['y'].transform('mean')
    train_data['target_count'] = train_data.groupby(f_)['y'].transform('count')
    train_data['item_target_enc_smg'] = np.multiply(train_data['item_target_mean'] ,train_data['target_count'] ) + global_mean * alpha
    train_data['item_target_enc_smg'] = train_data['item_target_enc_smg'] / (train_data['target_count'] + alpha)

    encoded_feature = train_data['item_target_enc_smg'].values
    corr = np.corrcoef(train_data['y'].values, encoded_feature)[0][1]
    print('Corr between {} and target is: {}'.format(f_ ,np.corrcoef(train_data['y'].values, encoded_feature)[0][1]))


Corr between category_name and target is: 0.4462495699282723
Corr between brand_name and target is: 0.4819742303440905


## 2-1. Smoothing paper [Daniele Micci-Barreca](https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf)
- Hint:
1. 練習題的 Equation 4 解釋: 
    - n 為 個數，
    - k 為 min_samples_leaf（設好了，可自行調整）
    - f 為 smoothing（設好了，可自行調整）
2. 練習題的 Equation 5 解釋: 
    - B 為 smoothing
    - y head 以及 y 為 ？您應該要想想

In [11]:

global_mean = train_data['y'].mean()
smoothing= 5
min_samples_leaf=100

for f_ in categorical_features:    

    train_data['item_target_mean'] = train_data.groupby(f_)['y'].transform('mean')
    train_data['target_count'] = train_data.groupby(f_)['y'].transform('count')
    # YOUR CODE GOES HERE 
    
    # Please refer Paper equation 4
    smoothing = 1 / (1 + np.exp(-(train_data['target_count'] - min_samples_leaf) / smoothing))
    
    # Please refer Paper equation 5
    train_data['item_target_enc_smg'] = global_mean * (1 - smoothing) + train_data['item_target_mean'] * smoothing
    
    encoded_feature = train_data['item_target_enc_smg'].values
    corr = np.corrcoef(train_data['y'].values, encoded_feature)[0][1]
    print('Corr between {} and target is: {}'.format(f_ ,np.corrcoef(train_data['y'].values, encoded_feature)[0][1]))

Corr between category_name and target is: 0.4439601079786596
Corr between brand_name and target is: 0.4569492306386533


  del sys.path[0]


## 3. Smoothing and Noising

In [12]:
# YOUR CODE GOES HERE
Factor = 100
global_mean = train_data['y'].mean()
noise_level = 0.05 # 可以調整這裡 (standard dev)

for f_ in categorical_features:    

    train_data['item_target_mean'] = train_data.groupby(f_)['y'].transform('mean')
    train_data['target_count'] = train_data.groupby(f_)['y'].transform('count')
    train_data['item_target_enc_smg'] = np.multiply(train_data['item_target_mean'] ,train_data['target_count'] ) + global_mean * Factor
    train_data['item_target_enc_smg'] = train_data['item_target_enc_smg'] / (train_data['target_count'] + Factor)

    encoded_feature = train_data['item_target_enc_smg'].values* (1 + noise_level * np.random.randn(len(train_data)))
    
    corr = np.corrcoef(train_data['y'].values, encoded_feature)[0][1]
    print('Corr between {} and target is: {}'.format(f_ ,np.corrcoef(train_data['y'].values, encoded_feature)[0][1]))


Corr between category_name and target is: 0.44204948101833325
Corr between brand_name and target is: 0.4665217775920104
