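# Assignment: (Kaggle) Titanic Survival Prediction
https://www.kaggle.com/c/titanic

# [Assignment Goal]
- Following the example, observe the effect of mean encoding on Titanic survival prediction

# [Assignment Focus]
- Following the example, complete the prediction with label encoding and with mean encoding, each paired with logistic regression
- Observe how label encoding and mean encoding each affect the number of features, the logistic regression score, and the logistic regression run time (In[3], Out[3], In[4], Out[4])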

# Assignment 1
* Following the example, re-implement the categorical features in the Titanic example using mean encoding

In [1]:
# All preparation needed before feature engineering (same as the previous example)
import pandas as pd
import numpy as np
import copy, time
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

data_path = 'data/part2/'
df_train = pd.read_csv(data_path + 'titanic_train.csv')
df_test = pd.read_csv(data_path + 'titanic_test.csv')

train_Y = df_train['Survived']
ids = df_test['PassengerId']
df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)
df_test = df_test.drop(['PassengerId'] , axis=1)
df = pd.concat([df_train,df_test])
df.head()

| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [8]:
# Keep only the categorical (object) columns, stored in object_features
object_features = []
for dtype, feature in zip(df.dtypes, df.columns):
    if dtype == 'object':
        object_features.append(feature)
print(f'{len(object_features)} Object Features : {object_features}\n')

# Keep only the categorical columns
df = df[object_features]
df = df.fillna('None')
train_num = train_Y.shape[0]
df.head()

5 Object Features : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']



| | Name | Sex | Ticket | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- |
| 0 | Braund, Mr. Owen Harris | male | A/5 21171 | | S |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | PC 17599 | C85 | C |
| 2 | Heikkinen, Miss. Laina | female | STON/O2. 3101282 | | S |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 113803 | C123 | S |
| 4 | Allen, Mr. William Henry | male | 373450 | | S |


In [9]:
# Re-implement the categorical features in the Titanic example using mean encoding
mean_en_data = pd.concat([df[:train_num], train_Y], axis=1)
for c in df.columns:
    mean_df = mean_en_data.groupby([c])['Survived'].mean().reset_index()
    mean_df.columns = [c, f'{c}_mean']
    mean_en_data = pd.merge(mean_en_data, mean_df, on=c, how='left')
    mean_en_data = mean_en_data.drop([c] , axis=1)

mean_en_data.head()

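| | Survived | Name_mean | Sex_mean | Ticket_mean | Cabin_mean | Embarked_mean |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0.188908 | 0.0 | 0.299854 | 0.336957 |
| 1 | 1 | 1 | 0.742038 | 1.0 | 1.0 | 0.553571 |
| 2 | 1 | 1 | 0.742038 | 1.0 | 0.299854 | 0.336957 |
| 3 | 1 | 1 | 0.742038 | 0.5 | 0.5 | 0.336957 |
| 4 | 0 | 0 | 0.188908 | 0.0 | 0.299854 | 0.336957 |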
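
The groupby / merge / drop loop in In [9] can probably be written more compactly with `groupby(...).transform('mean')`, which returns the per-category mean aligned to the original rows. The following is only a minimal sketch on a made-up toy frame, not the Titanic data; the helper name `mean_encode` and the toy values are assumptions introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the training frame (values are made up for illustration).
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female'],
    'Embarked': ['S', 'C', 'S', 'S', 'C'],
    'Survived': [0, 1, 1, 0, 0],
})

def mean_encode(frame, columns, target):
    """Hypothetical helper: replace each column in `columns` with the
    per-category mean of `target`, named '<column>_mean'."""
    encoded = pd.DataFrame(index=frame.index)
    for c in columns:
        # transform('mean') broadcasts each group's mean back onto its rows,
        # so no merge or drop step is needed.
        encoded[f'{c}_mean'] = frame.groupby(c)[target].transform('mean')
    return encoded

toy_encoded = mean_encode(toy, ['Sex', 'Embarked'], 'Survived')
print(toy_encoded)
```

Because `transform` keeps the original index, the encoded columns stay row-aligned with the input and the target column never has to be merged back in.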


# Assignment 2
* In Titanic survival prediction, compare mean encoding with label encoding: which one performs better, and what might be the reason?

In [3]:
# Control group: label encoding + logistic regression
df_temp = pd.DataFrame()
for c in df.columns:
    df_temp[c] = LabelEncoder().fit_transform(df[c])
train_X = df_temp[:train_num]
estimator = LogisticRegression()
start = time.time()
print(f'shape : {train_X.shape}')
print(f'score : {cross_val_score(estimator, train_X, train_Y, cv=5).mean()}')
print(f'time : {time.time() - start} sec')

shape : (891, 5)
score : 0.780004837244799
time : 0.03490638732910156 sec
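
To make concrete what the control group feeds the model, here is a small sketch of `LabelEncoder` on a made-up list of Embarked-style values (the list `ports` is an assumption for illustration, not taken from the dataset). The classes are sorted and each one is replaced by its position in that sorted order, so the resulting integers reflect alphabetical order rather than anything about survival.

```python
from sklearn.preprocessing import LabelEncoder

# Toy Embarked-style values, made up for illustration.
ports = ['S', 'C', 'Q', 'S']

encoder = LabelEncoder()
codes = encoder.fit_transform(ports)

# classes_ holds the sorted distinct values; each code is an index into it.
print(list(encoder.classes_))
print(list(codes))
```

A linear model such as logistic regression treats these codes as magnitudes, which is one plausible reason the label-encoded score stays modest.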




In [5]:
# Mean encoding + logistic regression
data = pd.concat([df[:train_num], train_Y], axis=1)
for c in df.columns:
    mean_df = data.groupby([c])['Survived'].mean().reset_index()
    mean_df.columns = [c, f'{c}_mean']
    data = pd.merge(data, mean_df, on=c, how='left')
    data = data.drop([c] , axis=1)
data = data.drop(['Survived'] , axis=1)
estimator = LogisticRegression()
start = time.time()
print(f'shape : {data.shape}')
print(f'score : {cross_val_score(estimator, data, train_Y, cv=5).mean()}')
print(f'time : {time.time() - start} sec')

shape : (891, 5)
score : 1.0
time : 0.017952442169189453 sec
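
A cross-validation score of exactly 1.0 deserves a second look. One likely contributor is that a column whose values are all distinct, as passenger names are, gets a group mean equal to each row's own label. The sketch below shows this on a made-up four-row frame; the names and labels are assumptions for illustration only.

```python
import pandas as pd

# Toy frame: every Name is unique, as with passenger names (made-up values).
toy = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cid', 'Dee'],
    'Survived': [1, 0, 0, 1],
})

# Each group contains exactly one row, so its mean is that row's own label.
name_mean = toy.groupby('Name')['Survived'].transform('mean')
leaks_label = bool((name_mean == toy['Survived']).all())
print(leaks_label)
```

In the notebook itself, a check along the lines of `(data['Name_mean'] == train_Y).all()` could be used to see whether the same thing happens on the real data.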




In [10]:
data.head()

| | Name_mean | Sex_mean | Ticket_mean | Cabin_mean | Embarked_mean |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0.188908 | 0.0 | 0.299854 | 0.336957 |
| 1 | 1 | 0.742038 | 1.0 | 1.0 | 0.553571 |
| 2 | 1 | 0.742038 | 1.0 | 0.299854 | 0.336957 |
| 3 | 1 | 0.742038 | 0.5 | 0.5 | 0.336957 |
| 4 | 0 | 0.188908 | 0.0 | 0.299854 | 0.336957 |


In [11]:
print(data['Sex_mean'].unique())
print(data['Cabin_mean'].unique())
print(data['Embarked_mean'].unique())

[ 0.18890815  0.74203822]
[ 0.29985444  1.          0.5         0.          0.66666667  0.33333333]
[ 0.33695652  0.55357143  0.38961039  1.        ]


In [12]:
print(df['Sex'].unique())
print(df['Cabin'].unique())
print(df['Embarked'].unique())

['male' 'female']
['None' 'C85' 'C123' 'E46' 'G6' 'C103' 'D56' 'A6' 'C23 C25 C27' 'B78' 'D33'
 'B30' 'C52' 'B28' 'C83' 'F33' 'F G73' 'E31' 'A5' 'D10 D12' 'D26' 'C110'
 'B58 B60' 'E101' 'F E69' 'D47' 'B86' 'F2' 'C2' 'E33' 'B19' 'A7' 'C49' 'F4'
 'A32' 'B4' 'B80' 'A31' 'D36' 'D15' 'C93' 'C78' 'D35' 'C87' 'B77' 'E67'
 'B94' 'C125' 'C99' 'C118' 'D7' 'A19' 'B49' 'D' 'C22 C26' 'C106' 'C65'
 'E36' 'C54' 'B57 B59 B63 B66' 'C7' 'E34' 'C32' 'B18' 'C124' 'C91' 'E40'
 'T' 'C128' 'D37' 'B35' 'E50' 'C82' 'B96 B98' 'E10' 'E44' 'A34' 'C104'
 'C111' 'C92' 'E38' 'D21' 'E12' 'E63' 'A14' 'B37' 'C30' 'D20' 'B79' 'E25'
 'D46' 'B73' 'C95' 'B38' 'B39' 'B22' 'C86' 'C70' 'A16' 'C101' 'C68' 'A10'
 'E68' 'B41' 'A20' 'D19' 'D50' 'D9' 'A23' 'B50' 'A26' 'D48' 'E58' 'C126'
 'B71' 'B51 B53 B55' 'D49' 'B5' 'B20' 'F G63' 'C62 C64' 'E24' 'C90' 'C45'
 'E8' 'B101' 'D45' 'C46' 'D30' 'E121' 'D11' 'E77' 'F38' 'B3' 'D6' 'B82 B84'
 'D17' 'A36' 'B102' 'B69' 'E49' 'C47' 'D28' 'E17' 'A24' 'C50' 'B42' 'C148'
 'B45' 'B36' 'A21' 'D34'

In [13]:
print(len(df['Cabin'].unique()))

187
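
Note that this count is taken over the combined train and test frame and includes the 'None' placeholder introduced by `fillna('None')`. To compare the cardinality of every categorical column at once, `DataFrame.nunique()` may be handier than calling `unique()` column by column. A minimal sketch on a made-up toy frame (the values are assumptions, not the real data):

```python
import pandas as pd

# Toy categorical frame with a 'None' placeholder, made up for illustration.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Cabin': ['None', 'C85', 'None', 'C123'],
    'Embarked': ['S', 'C', 'S', 'S'],
})

# Number of distinct values per column, highest cardinality first.
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)
```

Columns near the top of such a listing are the ones where a per-category mean rests on very few rows.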


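Label encoding + LogisticRegression
 score : 0.780004837244799
 time : 0.03490638732910156 sec

Mean encoding against Survived + LogisticRegression
 score : 1.0
 time : 0.017952442169189453 sec

The latter scores much higher. Label-encoded values carry no meaning of their own; they are just integers. Even if the underlying categories do mean something, the model has no way of knowing it: Cabin, for example, has 187 distinct values, and turning cabin codes directly into integers tells the model nothing.

Re-encoding categories such as Sex, Cabin, and Embarked with the mean of Survived ties each feature directly to the survival rate, so the score improves considerably. A perfect 1.0 should be read with caution, though. The means were computed from the same labels that cross-validation then scores against, and the In [9] output shows Name_mean matching Survived row for row, so much of the gain is likely target leakage rather than real predictive power.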
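
One way to check how much of a mean-encoding score holds up is to compute the encoding out-of-fold, so that no row's own label contributes to its encoded value. The following is a minimal sketch under stated assumptions: the helper `oof_mean_encode`, its parameters, and the toy frame are all hypothetical and not part of the original notebook, and the fold assignment is a simple random split rather than anything tuned.

```python
import numpy as np
import pandas as pd

def oof_mean_encode(frame, column, target, n_splits=5, seed=0):
    """Hypothetical helper: encode `column` with target means computed
    from the other folds only, never from the row's own fold."""
    rng = np.random.default_rng(seed)
    # Randomly assign each row to one of n_splits folds of near-equal size.
    fold_of_row = rng.permutation(len(frame)) % n_splits
    encoded = pd.Series(np.nan, index=frame.index, dtype=float)
    for k in range(n_splits):
        held_out = fold_of_row == k
        rest = frame.loc[~held_out]
        # Category means estimated without the held-out rows.
        means = rest.groupby(column)[target].mean()
        # Categories unseen in the other folds fall back to their overall mean.
        values = frame.loc[held_out, column].map(means).fillna(rest[target].mean())
        encoded.loc[held_out] = values.to_numpy()
    return encoded

# Toy frame (made-up values): Name is unique per row, Sex is low-cardinality.
toy = pd.DataFrame({
    'Name': [f'passenger_{i}' for i in range(10)],
    'Sex': ['male', 'female'] * 5,
    'Survived': [0, 1, 0, 1, 0, 1, 1, 1, 0, 0],
})
name_oof = oof_mean_encode(toy, 'Name', 'Survived')
sex_oof = oof_mean_encode(toy, 'Sex', 'Survived')
print(pd.DataFrame({'Name_oof': name_oof, 'Sex_oof': sex_oof}))
```

With this scheme a unique-valued column such as Name can only receive the fallback mean, so it no longer copies the label, while a low-cardinality column such as Sex keeps a usable signal. Smoothing the per-category mean toward the overall mean is another commonly used option for high-cardinality columns.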