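# Assignment: (Kaggle) Titanic Survival Prediction
***
- Scores are determined by the site's grading, so please upload your submission file (*.csv) and try it yourself  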
https://www.kaggle.com/c/titanic/submit

# [Assignment Goal]
- Following the example, observe how stacked generalization (Stacking) is written and how well it works on Titanic survival prediction

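# [Assignment Focus]
- Complete the stacking code, check the submission result, and consider: does stacking differ between classification and regression, just as blending does? (In[14])  
If it may differ, how should the code be rewritten to get better results?  
- Hint: see the "Using Probabilities as Meta-Features" section of the StackingClassifier page on the official mlxtend website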
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/

# References

[How to Rank in the Top 10% in Your First Kaggle Competition](https://dnc1994.com/2016/04/rank-10-percent-in-first-kaggle-competition/)

In [1]:
# All preparation before feature engineering (same as the previous example)
import pandas as pd
import numpy as np
import copy, time, warnings
warnings.filterwarnings('ignore')

from IPython.display import display
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data_path = './data/part02/'
df_train = pd.read_csv(data_path + 'titanic_train.csv')
df_test = pd.read_csv(data_path + 'titanic_test.csv')

train_Y = df_train['Survived']
ids = df_test['PassengerId']
df_train = df_train.drop(['PassengerId', 'Survived'], axis=1)
df_test = df_test.drop(['PassengerId'], axis=1)
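# ignore_index=True gives every row a unique label, so columns assigned later align row by row
df = pd.concat([df_train, df_test], ignore_index=True)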
df.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [2]:
# Check the missing-value status of the DataFrame
def na_check(df_data):
    data_na = (df_data.isnull().sum() / len(df_data)) * 100
    data_na = data_na.drop(data_na[data_na == 0].index).sort_values(ascending=False)
    missing_data = pd.DataFrame({'Missing Ratio': data_na})
    display(missing_data.head(10))

na_check(df)

Unnamed: 0,Missing Ratio
Cabin,77.463713
Age,20.091673
Embarked,0.152788
Fare,0.076394


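### The following is only one set of feature engineering for Titanic prediction, and the parameters were tuned on this set. If you switch to different feature engineering, the parameters must be tuned again.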

In [3]:
# Sex: map directly to male 0, female 1
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
# Fare: take the log to reduce skew; a value of 0 stays 0
df['Fare'] = df['Fare'].map(lambda i: np.log(i) if i > 0 else 0)
# Age: fill missing values with the median
df['Age'] = df['Age'].fillna(df['Age'].median())
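
One detail of the `Fare` transform is easy to miss: the `else` branch also catches missing values, because `NaN > 0` evaluates to `False`. A minimal sketch on a hypothetical three-row Series (the values are illustrative, not the dataset's actual missing row):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: a positive fare, a zero fare, and a missing fare
fare = pd.Series([7.25, 0.0, np.nan])

# Same lambda as in the cell above
log_fare = fare.map(lambda i: np.log(i) if i > 0 else 0)

print(log_fare.tolist())
print(log_fare.isnull().sum())  # no missing values remain
```

This would explain why `Fare` no longer shows up in the later missing-value check even though it is never filled explicitly.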

In [4]:
# Ticket: if not purely digits, take the string before the first space (with '.' and '/' removed); if purely digits, set to 'X'; finally one-hot encode
Ticket = []
for i in list(df.Ticket):
    if not i.isdigit() :
        Ticket.append(i.replace('.', '').replace('/', '').strip().split(' ')[0])
    else:
        Ticket.append('X')        
df['Ticket'] = Ticket
df = pd.get_dummies(df, columns=['Ticket'], prefix='T')
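
To see what the loop above produces, here is a small sketch that wraps the same rule in a helper and applies it to the four ticket strings shown in the `df.head()` output. The function name `ticket_prefix` is hypothetical and not part of the notebook.

```python
def ticket_prefix(ticket: str) -> str:
    """Return the cleaned prefix of a ticket string, or 'X' if it is all digits."""
    if ticket.isdigit():
        return 'X'
    # Remove '.' and '/', then keep everything before the first space
    return ticket.replace('.', '').replace('/', '').strip().split(' ')[0]

samples = ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803']
prefixes = [ticket_prefix(t) for t in samples]
print(prefixes)  # ['A5', 'PC', 'STONO2', 'X']
```

After `pd.get_dummies(..., prefix='T')`, these prefixes become columns such as `T_A5`.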

In [5]:
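# Cabin: categorize by first character ('X' when missing), then one-hot encode
# Assign a plain list so values are matched by position, not by index label
df['Cabin'] = [i[0] if not pd.isnull(i) else 'X' for i in df['Cabin']]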
df = pd.get_dummies(df, columns=['Cabin'], prefix='Cabin')

In [6]:
# Embarked, Pclass: one-hot encode
df = pd.get_dummies(df, columns=['Embarked'], prefix='Em')
df['Pclass'] = df['Pclass'].astype('category')
df = pd.get_dummies(df, columns=['Pclass'], prefix='Pc')

In [7]:
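# Title feature engineering: group the various titles by type, then one-hot encode
df_title = [i.split(",")[1].split(".")[0].strip() for i in df["Name"]]
# Assign the list directly so values are matched by position, not by index label
df["Title"] = df_title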
df["Title"] = df["Title"].replace(['Lady', 'the Countess','Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
df["Title"] = df["Title"].map({"Master":0, "Miss":1, "Ms" : 1 , "Mme":1, "Mlle":1, "Mrs":1, "Mr":2, "Rare":3})
df["Title"] = df["Title"].astype(int)
df = pd.get_dummies(df, columns = ["Title"])
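
The title extraction relies on names having the form `Surname, Title. Given names`. A short sketch on three names from the `df.head()` output follows; the helper `extract_title` and the fallback to `'Rare'` are assumptions added for illustration. In the cell above, a title missing from the mapping would become `NaN`, and `astype(int)` would then raise.

```python
def extract_title(name: str) -> str:
    """Take the text between the first comma and the following period."""
    return name.split(',')[1].split('.')[0].strip()

names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
]
titles = [extract_title(n) for n in names]
print(titles)  # ['Mr', 'Miss', 'Mrs']

title_map = {'Master': 0, 'Miss': 1, 'Ms': 1, 'Mme': 1, 'Mlle': 1, 'Mrs': 1, 'Mr': 2, 'Rare': 3}

# Hypothetical safeguard: any title missing from the map gets the 'Rare' code
codes = [title_map.get(t, title_map['Rare']) for t in titles]
print(codes)  # [2, 1, 1]
```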

In [9]:
# New feature Fsize: family size, with a separate column for each size bucket
df['Fsize'] = df['SibSp'] + df['Parch'] + 1
df['Single'] = df['Fsize'].map(lambda s: 1 if s == 1 else 0)
df['SmallF'] = df['Fsize'].map(lambda s: 1 if s == 2 else 0)
df['MedF'] = df['Fsize'].map(lambda s: 1 if 3 <= s <= 4 else 0)
df['LargeF'] = df['Fsize'].map(lambda s: 1 if s >= 5 else 0)

In [10]:
# Drop the Name column
df.drop(labels=['Name'], axis=1, inplace=True)

In [11]:
# Confirm missing values and the current table contents
na_check(df)
df.head()

Unnamed: 0,Missing Ratio


Unnamed: 0,Sex,Age,SibSp,Parch,Fare,T_A,T_A4,T_A5,T_AQ3,T_AQ4,...,Pc_3,Title_0,Title_1,Title_2,Title_3,Fsize,Single,SmallF,MedF,LargeF
0,0,22.0,1,0,1.981001,0,0,1,0,0,...,1,0,0,1,0,2,0,1,0,0
1,1,38.0,1,0,4.266662,0,0,0,0,0,...,0,0,1,0,0,2,0,1,0,0
2,1,26.0,0,0,2.070022,0,0,0,0,0,...,1,0,1,0,0,1,1,0,0,0
3,1,35.0,1,0,3.972177,0,0,0,0,0,...,0,0,1,0,0,2,0,1,0,0
4,0,35.0,0,0,2.085672,0,0,0,0,0,...,1,0,0,1,0,1,1,0,0,0


In [12]:
# Min-max scale the data
df = MinMaxScaler().fit_transform(df)

# Split the transformed data df back into train_X and test_X
train_num = train_Y.shape[0]
train_X = df[:train_num]
test_X = df[train_num:]

# Use three models: logistic regression / gradient boosting machine / random forest, with parameters found via Random Search
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
lr = LogisticRegression(fit_intercept=True, penalty='l2', C=1.0, tol=0.001)
gdbt = GradientBoostingClassifier(learning_rate=0.75, n_estimators=250, subsample=0.75,
                                  max_features=20, max_depth=6, tol=100)
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', max_depth=6,
                            min_samples_split=2, min_samples_leaf=1, bootstrap=True)
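
The comment above says the parameters came from a random search, but the search itself is not shown in this notebook. A minimal sketch of how such a search might look with scikit-learn's `RandomizedSearchCV` follows. The synthetic data, the search space, and the iteration count are all assumptions for illustration, not the settings used to obtain the values above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for train_X / train_Y (assumption: not the Titanic data)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical search space; the ranges are illustrative only
param_distributions = {
    'n_estimators': [20, 50, 100],
    'max_depth': [3, 4, 6, 8],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of random parameter combinations to try
    cv=3,                # 3-fold cross-validation for each combination
    scoring='accuracy',
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

Because the best combination depends on the features, a search like this would need to be rerun whenever the feature engineering changes.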

In [13]:
# Logistic regression submission file (results are partly random; rely on the score computed by Kaggle; the same applies to the models below)
lr.fit(train_X, train_Y)
lr_pred = lr.predict_proba(test_X)[:, 1]
sub = pd.DataFrame({'PassengerId': ids, 'Survived': lr_pred})
sub['Survived'] = sub['Survived'].map(lambda x: 1 if x > 0.5 else 0)
sub.to_csv('./output/Day_050_Stacking_HW_lr.csv', index=False) 

In [14]:
# Gradient boosting machine submission file
gdbt.fit(train_X, train_Y)
gdbt_pred = gdbt.predict_proba(test_X)[:, 1]
sub = pd.DataFrame({'PassengerId': ids, 'Survived': gdbt_pred})
sub['Survived'] = sub['Survived'].map(lambda x: 1 if x > 0.5 else 0)
sub.to_csv('./output/Day_050_Stacking_HW_gdbt.csv', index=False)

In [15]:
# Random forest submission file
rf.fit(train_X, train_Y)
rf_pred = rf.predict_proba(test_X)[:, 1]
sub = pd.DataFrame({'PassengerId': ids, 'Survived': rf_pred})
sub['Survived'] = sub['Survived'].map(lambda x: 1 if x > 0.5 else 0)
sub.to_csv('./output/Day_050_Stacking_HW_rf.csv', index=False)

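# Assignment
* Ensembling for classification is also quite different from ensembling for regression  
Since Blending for classification needs probabilities before the models can be combined easily,
how should Stacking for classification be written so that the first-level models output probabilities as features?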
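
The blending half of this statement can be sketched as follows: average the positive-class probabilities of several models, then threshold once at the end. The probability values and the weights below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical positive-class probabilities from three models for four passengers
p_lr = np.array([0.20, 0.60, 0.90, 0.45])
p_gdbt = np.array([0.10, 0.70, 0.80, 0.55])
p_rf = np.array([0.30, 0.40, 0.95, 0.60])

# Illustrative weights (assumption); they sum to 1
weights = np.array([0.5, 0.25, 0.25])

# Weighted average of probabilities, then a single 0.5 threshold
blended = weights[0] * p_lr + weights[1] * p_gdbt + weights[2] * p_rf
survived = (blended > 0.5).astype(int)

print(blended)   # [0.2    0.575  0.8875 0.5125]
print(survived)  # [0 1 1 1]
```

Averaging hard 0/1 labels instead would discard how confident each model is, which is why probabilities are the more natural input for blending classifiers.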

In [16]:
from mlxtend.classifier import StackingClassifier

meta_estimator = GradientBoostingClassifier(learning_rate=0.3, n_estimators=50, subsample=0.70,
                                            max_features='sqrt', max_depth=4, tol=100)
"""
Your Code Here
"""
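# use_probas=True feeds first-level class probabilities (not predicted labels) to the meta-classifier;
# average_probas=False keeps each model's probabilities as separate columns instead of averaging them
stacking = StackingClassifier(classifiers=[lr, gdbt, rf], use_probas=True, average_probas=False,
                              meta_classifier=meta_estimator)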
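
To make the effect of these two options concrete, the sketch below builds the meta-features by hand with scikit-learn only. It follows my reading of the mlxtend documentation (probabilities concatenated when `average_probas=False`, averaged when `True`) and is not mlxtend's actual implementation. The synthetic data and the smaller models are assumptions chosen to keep it quick to run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for train_X / train_Y (assumption: not the Titanic data)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

base_models = [
    LogisticRegression(max_iter=1000),
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    RandomForestClassifier(n_estimators=20, random_state=0),
]

# First level: fit each model and collect its (n_samples, 2) probability matrix
probas = [m.fit(X, y).predict_proba(X) for m in base_models]

# Counterpart of average_probas=False: probabilities placed side by side
meta_concat = np.hstack(probas)
# Counterpart of average_probas=True: element-wise mean across models
meta_avg = np.mean(probas, axis=0)

print(meta_concat.shape)  # (200, 6)
print(meta_avg.shape)     # (200, 2)

# Second level: the meta-model is trained on the probability features
meta_model = LogisticRegression().fit(meta_concat, y)
```

Here the meta-model sees predictions made on the same rows the base models were trained on, which can overfit. mlxtend also provides `StackingCVClassifier`, which builds the meta-features from cross-validated predictions instead.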

In [17]:
stacking.fit(train_X, train_Y)
stacking_pred = stacking.predict(test_X)
sub = pd.DataFrame({'PassengerId': ids, 'Survived': stacking_pred})
sub.to_csv('./output/Day_050_Stacking_HW.csv', index=False)
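
If mlxtend is not available, scikit-learn ships its own `StackingClassifier` (a different class with the same name), which can also pass probabilities to the meta-model. The sketch below uses synthetic data and small models as stand-ins. It is an alternative under those assumptions, not a drop-in replacement for the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the Titanic features (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

stack = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('gdbt', GradientBoostingClassifier(n_estimators=20, random_state=0)),
        ('rf', RandomForestClassifier(n_estimators=20, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    stack_method='predict_proba',  # feed probabilities, not labels, to the meta-model
    cv=5,                          # meta-features come from out-of-fold predictions
)
stack.fit(X_train, y_train)
pred = stack.predict(X_test)
print(pred[:10])
```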