# Julia Machine Learning: Decision Trees with DecisionTree

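## Assignment 030: Breast cancer prediction dataset

Use a random forest model to build a classifier that predicts whether each tumor in the breast cancer dataset is benign or malignant.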

In [1]:
using DecisionTree, RDatasets, DataFrames, MLDataUtils, Statistics

## Loading the data

In [2]:
biopsy = dataset("MASS", "biopsy")
size(biopsy)

(699, 11)

In [3]:
names(biopsy)

11-element Array{Symbol,1}:
 :ID   
 :V1   
 :V2   
 :V3   
 :V4   
 :V5   
 :V6   
 :V7   
 :V8   
 :V9   
 :Class

In [4]:
first(biopsy, 10)

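| Row | ID | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | Class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | String | Int32 | Int32 | Int32 | Int32 | Int32 | Int32⍰ | Int32 | Int32 | Int32 | Categorical… |
| 1 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | benign |
| 2 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | benign |
| 3 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | benign |
| 4 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | benign |
| 5 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | benign |
| 6 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | malignant |
| 7 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | benign |
| 8 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | benign |
| 9 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | benign |
| 10 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | benign |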


In [5]:
biopsy = dropmissing(biopsy)
size(biopsy)

(683, 11)
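
The row count drops from 699 to 683, so 16 rows contained at least one missing value. As a minimal sketch of how missing entries can be counted before they are dropped, the toy vector below stands in for a column that allows `missing`; its name and values are hypothetical and not taken from the dataset.

```julia
# Toy stand-in for a column that allows missing values (values are made up).
v6 = [1, 10, missing, 4, missing]

n_missing = count(ismissing, v6)    # number of missing entries
clean = collect(skipmissing(v6))    # remaining values, in their original order

println(n_missing)
println(clean)
```

Counting first makes it easy to confirm that dropping rows removes only a small share of the data.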

## Features (X) and labels (y)

In [6]:
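# Columns 2 to 10 (V1 to V9) are the features; Class is the label.
features = Matrix(biopsy[!, 2:10])
labels = Vector{String}(biopsy[!, :Class])
# Note: features[2:10] is linear (column-major) indexing, so this prints
# V1 for rows 2 to 10, not the nine features of a single sample.
println(features[2:10])
println(labels[1:5])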

Int32[5, 3, 6, 4, 8, 1, 2, 2, 4]
["benign", "benign", "benign", "benign", "benign"]
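
The first printed line is easy to misread: indexing a matrix with a single range walks down the first column, because Julia stores matrices in column-major order. The sketch below illustrates the difference on a small made-up matrix (the name `X` and its values are hypothetical), where rows play the role of samples and columns the role of features.

```julia
# Toy 4×3 matrix: rows are samples, columns are features (values are made up).
X = [11 12 13;
     21 22 23;
     31 32 33;
     41 42 43]

first_column_part = X[2:4]   # linear indexing: rows 2 to 4 of column 1
second_sample = X[2, :]      # row indexing: all features of sample 2

println(first_column_part)
println(second_sample)
```

To inspect one sample of the real data, row indexing such as `features[1, :]` would be the analogous call.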


## Decision tree model (random forest)

In [7]:
model = DecisionTree.RandomForestClassifier(n_trees=50, max_depth=2)

RandomForestClassifier
n_trees:             50
n_subfeatures:       -1
partial_sampling:    0.7
max_depth:           2
min_samples_leaf:    1
min_samples_split:   2
min_purity_increase: 0.0
classes:             nothing
ensemble:            nothing
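
The printed settings can be sanity-checked with a little arithmetic. The sketch below assumes, following the DecisionTree.jl documentation, that `n_subfeatures = -1` means the square root of the number of features is considered at each split, and that `partial_sampling` is the fraction of rows drawn for each tree. The rounding used here is an assumption, not the library's confirmed behavior, and the variable names are illustrative.

```julia
n_features = 9          # V1 to V9
n_samples = 683         # rows left after dropmissing
partial_sampling = 0.7
max_depth = 2

# Features considered per split when n_subfeatures = -1 (assumed sqrt rule).
subfeatures = round(Int, sqrt(n_features))

# Approximate rows sampled per tree (floor is an assumption about rounding).
rows_per_tree = floor(Int, partial_sampling * n_samples)

# A binary tree of depth d has at most 2^d leaves.
max_leaves = 2^max_depth

println((subfeatures, rows_per_tree, max_leaves))
```

With `max_depth=2` each tree stays very shallow, which keeps the forest fast to train at the cost of limited model capacity.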

## Training

In [8]:
DecisionTree.fit!(model, features, labels)

RandomForestClassifier
n_trees:             50
n_subfeatures:       -1
partial_sampling:    0.7
max_depth:           2
min_samples_leaf:    1
min_samples_split:   2
min_purity_increase: 0.0
classes:             ["benign", "malignant"]
ensemble:            Ensemble of Decision Trees
Trees:      50
Avg Leaves: 4.0
Avg Depth:  2.0
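
The fitted model is an ensemble of 50 trees, and a random forest classifier typically turns the individual tree outputs into one prediction by majority vote. The following is a toy illustration of that idea with made-up votes from five trees; it is not DecisionTree.jl's internal code, and the names `votes` and `counts` are hypothetical.

```julia
# Hypothetical votes from five trees for one sample (not real model output).
votes = ["malignant", "benign", "malignant", "malignant", "benign"]

# Tally the votes per class.
counts = Dict{String,Int}()
for v in votes
    counts[v] = get(counts, v, 0) + 1
end

# The forest's prediction is the class with the most votes.
best_count, prediction = findmax(counts)
share = best_count / length(votes)   # fraction of trees that agree

println(prediction, " (", share, ")")
```

The vote share gives a rough sense of how confident the ensemble is in a given prediction.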

## Cross-validation

In [9]:
using ScikitLearn.CrossValidation: cross_val_score

In [10]:
accuracy = cross_val_score(model, features, labels, cv=5)

5-element Array{Float64,1}:
 0.9635036496350365
 0.948905109489051 
 0.9562043795620438
 0.9854014598540146
 0.9777777777777777
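
Five separate fold accuracies are usually summarized as a mean and a spread. A minimal sketch using the standard library's `Statistics` module, with the scores copied from the output above (the variable names are illustrative):

```julia
using Statistics

# Fold accuracies copied from the cross_val_score output above.
scores = [0.9635036496350365, 0.948905109489051, 0.9562043795620438,
          0.9854014598540146, 0.9777777777777777]

mean_acc = mean(scores)   # average accuracy across the 5 folds
std_acc = std(scores)     # sample standard deviation between folds

println("accuracy: ", round(mean_acc, digits=3), " ± ", round(std_acc, digits=3))
```

A small standard deviation relative to the mean suggests the model performs consistently across folds.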

## Prediction

In [11]:
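# A single sample is a vector of 9 feature values (V1 to V9).
# These values reuse the V1 entries of rows 2 to 10 printed in In [6], so they do not correspond to a real patient row.
test_data = Int32[5, 3, 6, 4, 8, 1, 2, 2, 4]
DecisionTree.predict(model, test_data)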

"malignant"