## Summary of D24 - Feature Hash and Count Encoder

#### [feature hash](https://blog.csdn.net/laolu1573/article/details/79410187)

#### [count vectorizer](https://www.jianshu.com/p/063840752151)


# Assignment: (Kaggle) Titanic Survival Prediction
https://www.kaggle.com/c/titanic

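# Assignment 1
* Following the example, transform the Titanic cabin code column ('Cabin') with feature hashing, label encoding, and target mean encoding,  
then use the results together with the other numeric columns to predict survival probability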
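
Target mean encoding is named in the task but has no cell of its own, so here is a minimal, dependency-free sketch of the idea. The helper name `target_mean_encode`, the `smoothing` parameter, and the toy values are assumptions made for illustration; they do not come from the course example or the Titanic files.

```python
from collections import defaultdict

def target_mean_encode(categories, targets, smoothing=0.0):
    """Return ({category: mean target}, global mean).

    smoothing > 0 shrinks each category mean toward the global mean,
    which mostly affects categories that occur only a few times.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean

# Toy data, made up for illustration only
cabins = ['C85', 'None', 'C85', 'None', 'None', 'E46']
survived = [1, 0, 1, 0, 1, 0]

mapping, global_mean = target_mean_encode(cabins, survived)
# Categories not seen when the mapping was built fall back to the global mean
encoded = [mapping.get(c, global_mean) for c in cabins + ['Z99']]
print(encoded)
```

Because the encoding is computed from the target, the mapping should be built from training rows only (ideally inside each cross-validation fold); otherwise the cross-validation score is likely to be optimistic.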

In [62]:
# All preparation before feature engineering (same as the previous example)
import pandas as pd
import numpy as np
import copy, time
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

data_path = '../data/'
df_train = pd.read_csv(data_path + 'titanic_train.csv')
df_test = pd.read_csv(data_path + 'titanic_test.csv')

train_Y = df_train['Survived']
ids = df_test['PassengerId']
df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)
df_test = df_test.drop(['PassengerId'] , axis=1)
df = pd.concat([df_train,df_test])
df.head()

|   | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


In [63]:
# Collect the categorical (object) columns in object_features and the numeric columns in num_features
object_features = []
num_features = []
for dtype, feature in zip(df.dtypes, df.columns):
    if dtype == 'object':
        object_features.append(feature)
    elif dtype == 'float64' or dtype == 'int64':
        num_features.append(feature)
print(f'{len(object_features)} Object Features : {object_features}\n')
print(f'{len(num_features)} Numeric Features : {num_features}\n')

# Keep only the categorical columns
df_obj = df[object_features]
df_obj = df_obj.fillna('None')
train_num = train_Y.shape[0]
df_obj.head()

5 Object Features : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

5 Numeric Features : ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']



|   | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| 0 | Braund, Mr. Owen Harris | male | A/5 21171 |  | S |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | PC 17599 | C85 | C |
| 2 | Heikkinen, Miss. Laina | female | STON/O2. 3101282 |  | S |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 113803 | C123 | S |
| 4 | Allen, Mr. William Henry | male | 373450 |  | S |


In [64]:
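## Count Encoder ##
# Count how often each Cabin value occurs
# (size() + reset_index(name=...) replaces the deprecated dict-renaming form of agg)
count_df = df_obj.groupby('Cabin').size().reset_index(name='Cabin_Count')
df = pd.merge(df, count_df, on=['Cabin'], how='left')
# Missing cabins are NaN in df but 'None' in count_df, so they find no match and end up with Cabin_Count = 0
df = df.fillna(0)
count_df.sort_values(by=['Cabin_Count'], ascending=False).head(10)

## Numeric Value + Count Encoder + Label Encoder + Feature Hash
df_temp = df[num_features].copy()  # copy so the column assignments below do not touch a slice of df
df_temp = df_temp.fillna(-1)  # no effect here: df.fillna(0) above already filled the missing values
for c in object_features:
    df_temp[c] = LabelEncoder().fit_transform(df_obj[c])
# Note: hash() on str is salted per Python process, so these buckets change between runs unless PYTHONHASHSEED is fixed
df_temp['Cabin_Hash'] = df['Cabin'].map(lambda x: hash(x) % 10)
df_temp['Cabin_Count'] = df['Cabin_Count']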
train_X = df_temp[:train_num]
estimator = LogisticRegression()
print(f'scores: {cross_val_score(estimator, train_X, train_Y, cv=5).mean()}')
print(f'features: {df_temp.head()}')
estimator.fit(train_X, train_Y)
test_X = df_temp[train_num:]
pred = estimator.predict(test_X)
sub = pd.DataFrame({'PassengerId': ids, 'Survived': pred})
print(f'predict: {sub.head()}')

scores: 0.8058669325023485
features:    Pclass   Age  SibSp  Parch     Fare  Name  Sex  Ticket  Cabin  Embarked  \
0       3  22.0      1      0   7.2500   155    1     720    185         3   
1       1  38.0      1      0  71.2833   286    0     816    106         0   
2       3  26.0      0      0   7.9250   523    0     914    185         3   
3       1  35.0      1      0  53.1000   422    0      65     70         3   
4       3  35.0      0      0   8.0500    22    1     649    185         3   

   Cabin_Hash  Cabin_Count  
0           0          0.0  
1           8          2.0  
2           0          0.0  
3           0          2.0  
4           0          0.0  
predict:    PassengerId  Survived
0          892         0
1          893         0
2          894         0
3          895         0
4          896         1
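
The count encoding in the cell above can also be written without a merge. The sketch below uses only the standard library and made-up toy values, so it is an illustration of the idea rather than the notebook's exact pipeline. In the cell above, rows with a missing 'Cabin' appear to receive a count of 0 because the merge key is NaN on one side and 'None' on the other; in this sketch the missing value is filled first, so the "missing" group gets its real frequency.

```python
from collections import Counter

# Toy values, made up for illustration only; None stands for a missing cabin
cabins = ['C85', None, 'C85', None, None, 'E46']

# Fill missing values first so they form their own category
filled = ['None' if c is None else c for c in cabins]

# Count encoding: replace each category by how often it occurs
counts = Counter(filled)
cabin_count = [counts[c] for c in filled]
print(cabin_count)
```

With pandas, the same mapping can be expressed as `s.map(s.value_counts())` on a filled Series `s`.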




# Assignment 2
* Continuing from the previous question, which of the three gives the best result?

In [65]:
# Control group: label encoding + logistic regression
estimator = LogisticRegression()

df_temp = pd.DataFrame()
for c in df_obj.columns:
    df_temp[c] = LabelEncoder().fit_transform(df_obj[c])
train_X = df_temp[:train_num]
print(f'scores: {cross_val_score(estimator, train_X, train_Y, cv=5).mean()}')

scores: 0.780004837244799
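
For reference, label encoding as used in the control group assigns each distinct value an integer according to its sorted order. The following is a small pure-Python sketch of that behaviour on toy values (an assumption-labelled illustration, not scikit-learn's implementation):

```python
# Toy values, made up for illustration only
sex = ['male', 'female', 'female', 'male']

# Sort the distinct values, then number them 0, 1, 2, ...
classes = sorted(set(sex))
to_code = {value: code for code, value in enumerate(classes)}
codes = [to_code[v] for v in sex]

print(classes)  # ['female', 'male']
print(codes)    # [1, 0, 0, 1]
```

This matches the 'Sex' column in the feature printout earlier (male = 1, female = 0). Note that a linear model such as logistic regression treats these integers as magnitudes, so for high-cardinality columns like 'Name' or 'Ticket' the implied ordering is arbitrary.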


In [68]:
# Add count encoding of the 'Cabin' column
df_temp['Cabin_Count'] = df['Cabin_Count']
train_X = df_temp[:train_num]
print(f'scores: {cross_val_score(estimator, train_X, train_Y, cv=5).mean()}')

scores: 0.785622885700232


In [70]:
# 'Cabin' count encoding + logistic regression
df_temp = pd.DataFrame()
df_temp['Cabin_Count'] = df['Cabin_Count']
train_X = df_temp[:train_num]
print(f'scores: {cross_val_score(estimator, train_X, train_Y, cv=5).mean()}')

scores: 0.6891683662631256


In [71]:
# 'Cabin' feature hashing + logistic regression
df_temp = pd.DataFrame()
df_temp['Cabin_Hash'] = df['Cabin'].map(lambda x:hash(x) % 10)
train_X = df_temp[:train_num]
print(f'scores: {cross_val_score(estimator, train_X, train_Y, cv=5).mean()}')

scores: 0.6600300661007374
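
One caveat on the hashing cell: Python's built-in `hash()` is randomized per process for strings unless `PYTHONHASHSEED` is set, so the bucket assignments (and possibly the score) can differ from run to run. A minimal sketch of a reproducible alternative, assuming a CRC32 checksum is an acceptable hash for bucketing; the helper name `stable_bucket` is hypothetical:

```python
import zlib

def stable_bucket(value, n_buckets=10):
    """Deterministic bucket index in [0, n_buckets) for any value.

    zlib.crc32 depends only on the input bytes, so the result is the
    same in every Python process, unlike built-in hash() on str.
    """
    return zlib.crc32(str(value).encode('utf-8')) % n_buckets

# Toy values, made up for illustration only
cabins = ['C85', 'None', 'C123', 'E46', 'C85']
buckets = [stable_bucket(c) for c in cabins]
print(buckets)
```

In the notebook this could replace the lambda, for example `df['Cabin'].map(stable_bucket)`. As with any hash into 10 buckets, unrelated cabins can collide in the same bucket.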


In [72]:
# 'Cabin' count encoding + 'Cabin' feature hashing + logistic regression
df_temp = pd.DataFrame()
df_temp['Cabin_Count'] = df['Cabin_Count']
df_temp['Cabin_Hash'] = df['Cabin'].map(lambda x:hash(x) % 10)
train_X = df_temp[:train_num]
print(f'scores: {cross_val_score(estimator, train_X, train_Y, cv=5).mean()}')

scores: 0.6858227589530698


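## Ans:
Label encoding scores best in this comparison: about 0.780 on its own and 0.786 with the 'Cabin' count added, versus 0.689 for 'Cabin' count encoding alone, 0.660 for 'Cabin' feature hashing alone, and 0.686 for the two combined. The comparison is not like-for-like, though: the label-encoded run uses all five categorical columns, while the count and hash runs use 'Cabin' only.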