# This notebook uses the Titanic dataset to practice modeling.

We will build some basic models with Scikit-learn and Keras.

* [Load the file into a Pandas DataFrame](#Load-the-file-into-a-Pandas-DataFrame)
* [Filling missing values](#Filling-missing-values)
* [Building a model with scikit-learn: Decision Tree](#Building-a-model-with-scikit-learn:-Decision-Tree)
* [Building a model with scikit-learn: Random Forest](#Building-a-model-with-scikit-learn:-Random-Forest)
* [scikit-learn exercise: classify the data with logistic regression and print a classification report](#scikit-learn-exercise:-classify-the-data-with-logistic-regression-and-print-a-classification-report)
* [Building a model with Keras: Logistic Regression](#Building-a-model-with-Keras:-Logistic-Regression)
* [Building a model with Keras: Multilayer Perceptron + BatchNorm](#Building-a-model-with-Keras:-Multilayer-Perceptron-+-BatchNorm)
* [Building a model with Keras: Multilayer Perceptron + Dropout](#Building-a-model-with-Keras:-Multilayer-Perceptron-+-Dropout)

---

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
import numpy as np
import re

In [None]:
# Temporary workaround: several DataFrames are modified in place below, which triggers warning messages.
import warnings
warnings.filterwarnings("ignore")

### Load the file into a Pandas DataFrame

In [None]:
data=pd.read_csv("../datasets/titanic/titanic_train.csv") # load the data

In [None]:
data.info()

[Back to top](#This-notebook-uses-the-Titanic-dataset-to-practice-modeling.)

---

### Filling missing values

In [None]:
cmap=sns.light_palette("navy", reverse=False)
sns.heatmap(data.isnull().astype(np.int8),yticklabels=False,cmap=cmap)

In [None]:
# catplot is the current seaborn name for what used to be factorplot
g=sns.catplot(x="Pclass",y="Age",data=data,kind="box")

In [None]:
g=sns.catplot(x="Pclass",y="Age",hue="Sex",data=data,kind="box")

In [None]:
g=sns.catplot(x="Sex",y="Age",data=data,kind="box")

In [None]:
g=sns.catplot(x="Embarked",y="Age",data=data,kind="box")

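We fill the missing Age values with the median Age of each Pclass.

In [None]:
# select the Age column before aggregating so non-numeric columns are not touched
data.groupby("Pclass")["Age"].median()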
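
The per-class median fill described above can also be done without hard-coding the medians. The following is only a minimal sketch on made-up toy data (the `toy` frame and the `AgeFilled` column are illustrative names, not part of this notebook); it relies on `groupby(...).transform("median")`, which returns one value per original row.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic columns (values are made up).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, np.nan, 34.0, 29.0, np.nan, 20.0, 28.0, np.nan],
})

# Median age within each class, broadcast back to the original rows.
# NaN values are skipped when the median is computed.
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill only the missing ages; existing values are left untouched.
toy["AgeFilled"] = toy["Age"].fillna(class_median)
print(toy)
```

Computing the medians from the data keeps the fill values in sync if the input file changes, at the cost of being less explicit than fixed numbers.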

In [None]:
def dataClean(df):
    # Clean the data and fill in missing values.

    # Fill missing Age with the per-class medians computed above.
    medianAge={1:37,2:29,3:24}
    for pclass,age in medianAge.items():
        df.loc[(df["Pclass"]==pclass) & (df["Age"].isnull()),"Age"]=age

    # Drop the rows with a missing Embarked value.
    idxEmbarked=df[df["Embarked"].isnull()].index
    df=df.drop(index=idxEmbarked)

    # Drop columns that are not used as features.
    df=df.drop(
        columns=["Name","PassengerId","Cabin","Ticket"])

    # Engineered features: family size and a flag for children under 12.
    df["famSize"]=df["SibSp"]+df["Parch"]
    df["Kid"]=df["Age"].apply(lambda x: 1 if x<12 else 0)
    df["Pclass"]=df["Pclass"].astype("object")
    df["Sex"]=df["Sex"].apply(lambda x:0 if "female" in x else 1)

    # Standardize the continuous columns to zero mean and unit variance.
    for col in ["Age","Fare"]:
        df[col]=( df[col]-df[col].mean() ) / df[col].std()

    # One-hot encode the categorical columns (Pclass, Embarked) as floats.
    df=pd.get_dummies(df,dtype=float)

    print("Missing values after cleaning:\n",df.isnull().sum(),"\n") # check for remaining missing values
    return df

def trainTestValSplit(df):
    # Split the data into training (70%), test (15%) and validation (15%) sets.
    train=df.sample(frac=0.7)
    test=df.drop( train.index )
    val=test.sample(frac=0.5)
    test=test.drop( val.index)
    return train,test,val

def dfXYSplit(df,targetName):
    # Separate the features from the target variable.

    dfX=df.drop(columns=targetName)
    dfY=df[targetName]

    return dfX,dfY
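
As an alternative to the `sample`/`drop` split above, a reproducible 70/15/15 split that preserves the class ratio can be sketched with scikit-learn's `train_test_split`. This is a sketch under assumptions: `stratified_split`, the `seed` argument and the toy frame are hypothetical names introduced here, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_split(df, target, seed=0):
    # First cut: about 70% train, 30% held out, keeping the class ratio of `target`.
    train, rest = train_test_split(
        df, test_size=0.3, stratify=df[target], random_state=seed)
    # Second cut: split the held-out part evenly into test and validation.
    test, val = train_test_split(
        rest, test_size=0.5, stratify=rest[target], random_state=seed)
    return train, test, val

# Made-up data: 100 rows with a balanced binary target.
toy = pd.DataFrame({"x": range(100), "Survived": [0, 1] * 50})
train, test, val = stratified_split(toy, "Survived")
print(len(train), len(test), len(val))
```

Fixing `random_state` makes the reported scores repeatable between runs, which the unseeded `df.sample` call does not guarantee.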

In [None]:
data=pd.read_csv("../datasets/titanic/titanic_train.csv") # load the data
data=dataClean(data)
data.info()
data.head(5)

train,test,val=trainTestValSplit(data)

trainX,trainY=dfXYSplit(train,"Survived")
testX,testY=dfXYSplit(test,"Survived")
valX,valY=dfXYSplit(val,"Survived")

The data is now cleaned, so we can start building models.

[Back to top](#This-notebook-uses-the-Titanic-dataset-to-practice-modeling.)

---

### Building a model with scikit-learn: Decision Tree

* Decision Tree: http://scikit-learn.org/stable/modules/tree.html#mathematical-formulation

In [None]:
from sklearn import tree
from sklearn.metrics import classification_report

In [None]:
clf=tree.DecisionTreeClassifier()
model=clf.fit(trainX,trainY)

testPredY=model.predict(testX)
valPredY=model.predict(valX)

print( classification_report(testY,testPredY) )
print( classification_report(valY,valPredY) )

In [None]:
tree.DecisionTreeClassifier()

In [None]:
from sklearn.metrics import classification_report
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

[Back to top](#This-notebook-uses-the-Titanic-dataset-to-practice-modeling.)

### Building a model with scikit-learn: Random Forest

* Random Forest: https://www.youtube.com/watch?v=3kYujfDgmNk

* ```bootstrap=True``` -> draw bootstrap samples
* ```n_jobs=1``` -> remember to change this to ```n_jobs=-1``` to use all CPU cores
* ```oob_score``` -> out-of-bag score

In [None]:
clf = RandomForestClassifier()
clf

In [None]:
model=clf.fit(trainX,trainY)

testPredY=model.predict(testX)
valPredY=model.predict(valX)

print( classification_report(testY,testPredY) )
print( classification_report(valY,valPredY) )

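The approach above is wrong: there is no need to split the data three ways for a random forest.

Q: What is the OOB score?

A random forest does not need separate validation data, because the out-of-bag samples of each tree can play that role, so we merge the validation data back into the training data.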

In [None]:
trainXNew = np.concatenate([trainX,valX],axis=0)
trainYNew = np.concatenate([trainY,valY],axis=0)

Check that the merge succeeded:

In [None]:
assert valX.shape[0]+trainX.shape[0] == trainXNew.shape[0]
assert valY.shape[0]+trainY.shape[0] == trainYNew.shape[0]

In [None]:
clf = RandomForestClassifier(oob_score=True)
model=clf.fit(trainXNew,trainYNew) # fit on the merged training + validation data

In [None]:
model.oob_score_
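
To make the question about the OOB score concrete: each tree is fitted on a bootstrap sample, so some rows are left out of that tree, and the OOB score is the accuracy obtained when every row is predicted only by the trees that never saw it. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the Titanic data) to compare the OOB score with the accuracy on a held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for the real features).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_fit, X_test, y_fit, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# oob_score=True requires bootstrap=True, which is the default.
forest = RandomForestClassifier(
    n_estimators=200, bootstrap=True, oob_score=True,
    n_jobs=-1, random_state=0)
forest.fit(X_fit, y_fit)

# The two numbers are usually close, which is why a separate
# validation set is not strictly needed for a random forest.
print("OOB accuracy: ", forest.oob_score_)
print("Test accuracy:", forest.score(X_test, y_test))
```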

### scikit-learn exercise: classify the data with logistic regression and print a classification report

In [None]:
# Complete the code below
from sklearn.linear_model.... import LogisticRegression
clf=LogisticRegression()
...
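
One possible way to approach the exercise is sketched below on synthetic data, so it does not depend on the cells above. The data, the variable names and `max_iter=1000` are assumptions made for this sketch; with the notebook's own data, `trainX`, `trainY`, `testX` and `testY` would take the place of the synthetic arrays.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned Titanic features and labels.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# max_iter is raised to give the solver room to converge.
clf = LogisticRegression(max_iter=1000)
model = clf.fit(X_train, y_train)

# Precision, recall and F1 per class on the held-out data.
report = classification_report(y_test, model.predict(X_test))
print(report)
```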

[Back to top](#This-notebook-uses-the-Titanic-dataset-to-practice-modeling.)

---

### Building a model with Keras: Logistic Regression

In [None]:
from keras.models import Sequential
from keras.layers import Activation,Dense,Dropout,BatchNormalization
from keras.optimizers import RMSprop,SGD

In [None]:
model=Sequential()

model.add(Dense(1, input_shape=(trainX.shape[1],),activation='sigmoid') ) 
# use plain SGD here; RMSprop is left commented out as an alternative
#opt = RMSprop(lr=0.05, decay=1e-6)
opt = SGD(lr=0.05)
model.compile(loss='binary_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])
model.summary()

history=model.fit(trainX.values,trainY.values,epochs=20, validation_data=( valX.values, valY.values ) )

# plot the training history
plt.plot(history.history['acc'],ms=5,marker='o',label='accuracy')
plt.plot(history.history['val_acc'],ms=5,marker='o',label='val accuracy')
plt.legend()
plt.show()

[Back to top](#This-notebook-uses-the-Titanic-dataset-to-practice-modeling.)

### Building a model with Keras: Multilayer Perceptron + BatchNorm

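* Reference: 

  Rmsprop: http://ruder.io/optimizing-gradient-descent/

  In short, it is a refinement of gradient descent. Plain gradient descent applies the same learning rate ($lr$) to every network parameter ($w_1,w_2,...,w_n$), whereas RMSprop lets each parameter ($w_1,w_2,...,w_n$) have its own learning rate ($lr_1, lr_2,...,lr_n$).

* Reference: 

  BatchNormalization: https://zh-tw.coursera.org/learn/deep-neural-network/lecture/81oTm/why-does-batch-norm-work

  In short, if the parameters of the earlier layers keep changing during training, the distribution of the values reaching the later layers keeps shifting as well, which makes the network harder to train. Shifting and rescaling that distribution back to a fixed one speeds up training. (If these sentences are unclear, see the link above, which has a more detailed explanation.)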
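
The two ideas referenced above can be sketched in a few lines of NumPy. This is an illustration under assumptions, not what Keras does internally: `rmsprop_step` and `batch_norm` are hypothetical helper names, the hyperparameters are arbitrary, and the batch norm shown is the training-time forward pass only (no running statistics, no learned update of `gamma` and `beta`).

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=0.01, rho=0.9, eps=1e-8):
    # Running average of squared gradients, kept separately per parameter.
    cache = rho * cache + (1.0 - rho) * grad ** 2
    # Dividing by its square root gives each parameter its own step size.
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature over the batch, then rescale and shift.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Two parameters whose gradients differ by a factor of 1000.
w = np.array([1.0, 1.0])
cache = np.zeros(2)
grad = np.array([10.0, 0.01])
w_new, cache = rmsprop_step(w, grad, cache)
# Both parameters move by a similar amount despite the gradient gap.
print("step sizes:", w - w_new)

# A batch with mean far from 0 and spread far from 1.
x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(64, 4))
out = batch_norm(x)
print("mean per feature:", out.mean(axis=0))
print("std per feature: ", out.std(axis=0))
```

With plain gradient descent the first parameter would move 1000 times further than the second; the per-parameter scaling is what removes that imbalance.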

In [None]:
batch_size=32

model=Sequential()

# no activation on this Dense layer: BatchNormalization comes first, then ReLU
model.add(Dense(50, input_shape=(trainX.shape[1],)) )
model.add(BatchNormalization())
model.add(Activation("relu"))
# model.add(Dropout(0.5))

model.add(Dense(50) )
model.add(BatchNormalization())
model.add(Activation("relu"))
# model.add(Dropout(0.5))

model.add(Dense(1, activation='sigmoid') ) # input_shape is only needed on the first layer
# initiate RMSprop optimizer
opt = RMSprop(lr=0.01, decay=1e-5)
model.compile(loss='binary_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])
model.summary()

history=model.fit(trainX.values,trainY.values,epochs=30,
                  validation_data=( valX.values, valY.values ), batch_size=batch_size,verbose=1)

# plot the training history
plt.plot(history.history['acc'],ms=5,marker='o',label='accuracy')
plt.plot(history.history['val_acc'],ms=5,marker='o',label='val accuracy')
plt.legend()
plt.show()

[Back to top](#This-notebook-uses-the-Titanic-dataset-to-practice-modeling.)

### Building a model with Keras: Multilayer Perceptron + Dropout

Adding Dropout layers can reduce overfitting.

* Dropout layer: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/dropout_layer.html
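
As a rough illustration of what a Dropout layer does during training, here is a minimal NumPy sketch of "inverted" dropout. The function name `dropout` and its arguments are hypothetical and chosen for this example; the rate of 0.6 mirrors the value used in the model below.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    # At inference time the layer passes its input through unchanged.
    if not training:
        return x
    # Keep each unit with probability (1 - rate), zero it otherwise.
    keep = rng.random(x.shape) >= rate
    # Rescale the kept units so the expected activation is unchanged.
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((1000, 50))
y = dropout(x, rate=0.6, rng=rng)
print("fraction zeroed:", (y == 0).mean())
print("mean activation:", y.mean())
```

Because a different random subset of units is silenced on every batch, the network cannot rely on any single unit, which is the intuition behind the reduced overfitting.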

In [None]:
batch_size=32

model=Sequential()

# no activation on this Dense layer: BatchNormalization comes first, then ReLU
model.add(Dense(50, input_shape=(trainX.shape[1],)) )
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Dropout(0.6))

model.add(Dense(50) )
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Dropout(0.6))

model.add(Dense(1, activation='sigmoid') ) # input_shape is only needed on the first layer
# initiate RMSprop optimizer
opt = RMSprop(lr=0.01, decay=1e-5)
model.compile(loss='binary_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])
model.summary()

history=model.fit(trainX.values,trainY.values,epochs=30,
                  validation_data=( valX.values, valY.values ), batch_size=batch_size,verbose=1)

# plot the training history
plt.plot(history.history['acc'],ms=5,marker='o',label='accuracy')
plt.plot(history.history['val_acc'],ms=5,marker='o',label='val accuracy')
plt.legend()
plt.show()

[Back to top](#This-notebook-uses-the-Titanic-dataset-to-practice-modeling.)

---

In [None]:
## Export to CSV and submit to Kaggle:
# subm=np.vstack( (dfTest["PassengerId"].values,testY) ).T
# subm=pd.DataFrame(subm,columns=["PassengerId","Survived"])
# subm.to_csv("~/Dropbox/Learning/submissionLogistic.csv",index=None)