<p align="center" ><img src="https://www.ai4kids.ai/wp-content/uploads/2019/07/ai4kids_website_logo_120x40.png"></img></p>

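# Learning AI Is Easy <1>: A First Look at Machine Learning
## Chapter 3, 3-4: Who Is More Likely to Survive, Jack or Rose? Introducing and Applying the Decision Tree Classifier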



<p align="right">© Copyright AI4kids.ai</p>

## First, run the cell below to download train.csv

In [3]:
!wget --no-check-certificate "https://drive.google.com/uc?export=download&id=13bGRvk1Vq9tFRzMWsXZwOg7TzZLQj83O" -O 'train.csv'

--2020-06-30 13:43:19--  https://drive.google.com/uc?export=download&id=13bGRvk1Vq9tFRzMWsXZwOg7TzZLQj83O
Resolving drive.google.com (drive.google.com)... 74.125.31.113, 74.125.31.101, 74.125.31.102, ...
Connecting to drive.google.com (drive.google.com)|74.125.31.113|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-14-80-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/aq12s5t0bbjjucure8avua90piui08g5/1593524550000/13008311106298629004/*/13bGRvk1Vq9tFRzMWsXZwOg7TzZLQj83O?e=download [following]
--2020-06-30 13:43:19--  https://doc-14-80-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/aq12s5t0bbjjucure8avua90piui08g5/1593524550000/13008311106298629004/*/13bGRvk1Vq9tFRzMWsXZwOg7TzZLQj83O?e=download
Resolving doc-14-80-docs.googleusercontent.com (doc-14-80-docs.googleusercontent.com)... 173.194.217.132, 2607:f8b0:400c:c13::84
Connecting to doc-14-80-docs.googleusercontent.com (doc-14-80

### Data cleaning

In [4]:
from sklearn import tree
import numpy as np
import pandas as pd

data = pd.read_csv('train.csv')
data = data.drop(columns=['Name', 'Ticket', 'Cabin'])
data


Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,male,22.0,1,0,7.2500,S
1,2,1,1,female,38.0,1,0,71.2833,C
2,3,1,3,female,26.0,0,0,7.9250,S
3,4,1,1,female,35.0,1,0,53.1000,S
4,5,0,3,male,35.0,0,0,8.0500,S
...,...,...,...,...,...,...,...,...,...
886,887,0,2,male,27.0,0,0,13.0000,S
887,888,1,1,female,19.0,0,0,30.0000,S
888,889,0,3,female,,1,2,23.4500,S
889,890,1,1,male,26.0,0,0,30.0000,C


### Data transformation

In [5]:
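# Convert the Embarked column
typeEmbarked = list(set(data['Embarked']))

# Replace each distinct Embarked value with an integer code starting from 0.
# Note: a set has no fixed order, so the assigned codes can differ between runs.
# A NaN entry never matches the == comparison, so missing values stay missing here.
for i in range(len(typeEmbarked)):
    print(typeEmbarked[i])
    rows = data['Embarked'] == typeEmbarked[i]
    data.loc[rows, 'Embarked'] = i

# Convert the Sex column
typeSex = list(set(data['Sex']))

# Replace each distinct Sex value with an integer code starting from 0
for i in range(len(typeSex)):
    print(typeSex[i])
    rows = data['Sex'] == typeSex[i]
    data.loc[rows, 'Sex'] = i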
    
data

C
nan
Q
S
male
female


Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,0,22.0,1,0,7.2500,3
1,2,1,1,1,38.0,1,0,71.2833,0
2,3,1,3,1,26.0,0,0,7.9250,3
3,4,1,1,1,35.0,1,0,53.1000,3
4,5,0,3,0,35.0,0,0,8.0500,3
...,...,...,...,...,...,...,...,...,...
886,887,0,2,0,27.0,0,0,13.0000,3
887,888,1,1,1,19.0,0,0,30.0000,3
888,889,0,3,1,,1,2,23.4500,3
889,890,1,1,0,26.0,0,0,30.0000,0
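
Because the loop above builds its codes from a set, the numbering is not guaranteed to be the same on every run. The following is a minimal sketch of one alternative, assuming hand-written mappings. The names `toy`, `sex_codes`, and `embarked_codes` are hypothetical, and the toy values are made up for illustration rather than taken from `train.csv`.

```python
import pandas as pd

# Toy data standing in for the Sex and Embarked columns (illustrative values only).
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', None, 'Q'],
})

# Hypothetical fixed mappings: the codes are identical on every run.
sex_codes = {'male': 0, 'female': 1}
embarked_codes = {'C': 0, 'Q': 1, 'S': 2}

toy['Sex'] = toy['Sex'].map(sex_codes)
# Values that are not keys of the mapping (here the missing entry) become NaN.
toy['Embarked'] = toy['Embarked'].map(embarked_codes)

print(toy)
```

With an explicit dictionary, the meaning of each code is visible in the notebook itself, and missing entries remain missing until they are filled in a later step.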


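### Handling missing values
Earlier chapters should already have covered how to handle missing values. Missing values can prevent the model from being trained, and because this dataset is small, we fill them in instead of discarding the affected rows.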

In [6]:
# fillna(999) replaces every missing value with 999. We chose 999 because no column in the dataset is expected to contain that value.
data = data.fillna(999)
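
Before filling, it can help to see how many values are actually missing in each column. The following is a small sketch on made-up data; the frame `toy_missing` and its values are assumptions for illustration, not rows from `train.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column (illustrative values only).
toy_missing = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Fare': [7.25, 71.2833, np.nan],
})

# Count the missing values per column before filling.
missing_before = toy_missing.isna().sum()
print(missing_before)

# Replace every missing value with the sentinel 999, as in the cell above.
filled = toy_missing.fillna(999)
print(filled)
```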

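### Splitting the data and training the model
Next, we use the first 750 rows as the training data `X_train` and the remaining rows as the test data `X_test`. We then separate `Survived` from the training data to use as `y_train`.

We train a decision tree with `DecisionTreeClassifier()`.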

In [7]:
# Split the data
X_train = data[:750]
X_test = data[750:]

# Separate `Survived` from the training data to use as `y_train`.
y_train = X_train.pop('Survived')

# Create and train (fit) the decision tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
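
To get a feel for what the fitted tree has learned, one option is to print its rules as text. The sketch below trains on a tiny made-up dataset; `X_toy`, `y_toy`, and `clf_toy` are hypothetical names, and `random_state=0` is an added assumption to keep the result reproducible.

```python
import pandas as pd
from sklearn import tree

# Tiny illustrative dataset: in this toy example the label simply equals Sex.
X_toy = pd.DataFrame({
    'Sex':    [0, 1, 1, 0, 1, 0],
    'Pclass': [3, 1, 3, 1, 2, 2],
})
y_toy = [0, 1, 1, 0, 1, 0]

# Create and fit the decision tree, then predict on the training rows.
clf_toy = tree.DecisionTreeClassifier(random_state=0)
clf_toy.fit(X_toy, y_toy)
train_pred = list(clf_toy.predict(X_toy))

# Render the learned splits as indented text rules.
rules = tree.export_text(clf_toy, feature_names=list(X_toy.columns))
print(rules)
```

The same `export_text` call could be applied to `clf`, although a tree trained on 750 rows will print far more rules than this toy one.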

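### Evaluating the model's accuracy
Finally, we want to know how well the model performs. We evaluate it with two metrics, accuracy and recall.
- Accuracy is the fraction of all predictions that are correct.
- Recall is the fraction of actual positive cases that the model identifies. `recall_score` treats label 1 as the positive class by default and `Survived` is 1 for survivors, so here it measures how many of the passengers who survived the model managed to find.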

In [8]:
from sklearn.metrics import accuracy_score, recall_score

# Take the true labels (y) out of the test set
y_test = X_test.pop('Survived')

# Run the prediction to get the predicted labels
y_pred = clf.predict(X_test)

print('accuracy_score=',accuracy_score(y_test, y_pred))
print('recall_score=',recall_score(y_test, y_pred))


accuracy_score= 0.7730496453900709
recall_score= 0.6078431372549019
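
To make the two metrics concrete, the following sketch computes them by hand on a few made-up labels. The lists `y_true` and `y_hat` are illustrative and unrelated to the results above; label 1 is assumed to be the positive class, which matches the default of `recall_score`.

```python
# Illustrative labels: 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 0, 0, 1, 1, 0]

pairs = list(zip(y_true, y_hat))

# Accuracy: correct predictions divided by all predictions.
correct = sum(t == p for t, p in pairs)
accuracy = correct / len(pairs)

# Recall: true positives divided by all actual positives.
true_pos = sum(t == 1 and p == 1 for t, p in pairs)
false_neg = sum(t == 1 and p == 0 for t, p in pairs)
recall = true_pos / (true_pos + false_neg)

print('accuracy =', accuracy)  # 0.625
print('recall =', recall)      # 0.5
```

Here 5 of 8 predictions are correct, while only 2 of the 4 actual survivors are found, which shows how recall can be noticeably lower than accuracy.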
