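## [Assignment Focus]
By now you should have a clear picture of what the data in a dataset looks like, including its features and labels. Keep in mind that in any future project, the data must be cleaned into this same format before it can be fed to a model for training.
Today's assignment introduces the decision tree, a very important model. Make sure you understand what each of its hyperparameters means, and try adjusting them to see how they affect the final predictions.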

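## Assignment

1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with a regression model.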

In [1]:
from sklearn import datasets, metrics, linear_model

# Use DecisionTreeClassifier for classification problems and DecisionTreeRegressor for regression problems
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [2]:
# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

In [3]:
# Build the model
clf = DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=2, min_samples_split=0.17, min_samples_leaf=0.10)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)
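
The float values passed to `min_samples_split` and `min_samples_leaf` above are fractions of the training set rather than sample counts; scikit-learn documents the conversion as `ceil(fraction * n_samples)`. The sketch below redoes that arithmetic for the iris split and checks the leaf sizes of a fitted tree. The names `min_split_count`, `min_leaf_count`, and `leaf_sizes` are illustrative, and `random_state=0` is an added assumption that only makes the fit repeatable.

```python
import math

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

n_train = x_train.shape[0]
# Float hyperparameters are fractions of the training set, rounded up.
min_split_count = math.ceil(0.17 * n_train)
min_leaf_count = math.ceil(0.10 * n_train)
print(n_train, min_split_count, min_leaf_count)

# random_state is an assumption added here for repeatability.
clf = DecisionTreeClassifier(
    criterion='gini', splitter='best', max_depth=2,
    min_samples_split=0.17, min_samples_leaf=0.10, random_state=0,
)
clf.fit(x_train, y_train)

# Leaves are the nodes without a left child (marked -1 in the fitted tree).
is_leaf = clf.tree_.children_left == -1
leaf_sizes = clf.tree_.n_node_samples[is_leaf]
print(leaf_sizes)
```

Every entry in `leaf_sizes` should be at least `min_leaf_count`, which is one reason such a constrained tree stays very small.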

In [4]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.9736842105263158


In [5]:
for k, w in zip(iris.feature_names, clf.feature_importances_):
    if w != 0:
        print(k, w)

petal width (cm) 1.0


Accuracy did not change significantly, but feature_importances_ shows that the tree actually uses only one feature, petal width (cm).
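
One way to see which features a tree splits on is to print its rules. The sketch below uses `sklearn.tree.export_text` on a tree with the same hyperparameters; `random_state=0` is an added assumption, and because ties between equally good splits are broken at random, the features chosen may differ from the run recorded above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# Same hyperparameters as the notebook; random_state is an assumption.
clf = DecisionTreeClassifier(max_depth=2, min_samples_split=0.17,
                             min_samples_leaf=0.10, random_state=0)
clf.fit(x_train, y_train)

# Text rendering of the fitted split rules.
tree_rules = export_text(clf, feature_names=list(iris.feature_names))
print(tree_rules)

# tree_.feature holds the feature index per node; leaves carry a negative value.
used = sorted({iris.feature_names[i] for i in clf.tree_.feature if i >= 0})
print(used)
```

With `max_depth=2` the tree has at most three internal nodes, so it can never use more than three features, and a feature that appears in no split gets an importance of exactly 0.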

Testing other datasets
1. breast_cancer

In [6]:
# Load the dataset
breast_cancer = datasets.load_breast_cancer()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(breast_cancer.data, breast_cancer.target, test_size=0.1, random_state=4)

In [7]:
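# Heuristic depth: square root of the number of features, truncated (30 features -> 5)
depth = int(x_train.shape[1] ** 0.5)
# Build the model
clf = DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=depth, min_samples_split=0.17, min_samples_leaf=0.10)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)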

In [8]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.8596491228070176


In [9]:
for k, w in zip(breast_cancer.feature_names, clf.feature_importances_):
    if w != 0:
        print(k, w)

perimeter error 0.006749575242330312
symmetry error 0.0013052514507821682
worst radius 0.8963207545870827
worst concave points 0.0956244187198048
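
With `test_size=0.1` the test set holds only 57 samples, so a single accuracy figure is fairly noisy. The sketch below shows one way to compare hyperparameter settings more stably, using 5-fold cross-validation. The three configurations, the fold count, and `random_state=0` are assumptions chosen for illustration, not part of the original notebook.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

breast_cancer = datasets.load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target

# Hypothetical configurations to compare; the first mirrors the cell above.
configs = {
    "notebook settings": dict(max_depth=5, min_samples_split=0.17, min_samples_leaf=0.10),
    "depth limit only": dict(max_depth=5),
    "defaults": dict(),
}

cv_accuracy = {}
for name, params in configs.items():
    clf = DecisionTreeClassifier(random_state=0, **params)
    # Mean accuracy over 5 folds, each fold serving once as the held-out set.
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    cv_accuracy[name] = scores.mean()
    print(name, round(scores.mean(), 3), round(scores.std(), 3))
```

Looking at the standard deviation next to the mean helps judge whether a difference between two settings is larger than the fold-to-fold variation.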


2. boston

In [10]:
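# Load the dataset
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on older versions.
boston = datasets.load_boston()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.1, random_state=4)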

In [11]:
# Build the model
clf = DecisionTreeRegressor()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

In [12]:
# Conventional argument order is (y_true, y_pred); MSE is symmetric, so the value is unchanged
mean_squared_error(y_test, y_pred)

23.59372549019608

In [13]:
for k, w in zip(boston.feature_names, clf.feature_importances_):
    if w != 0:
        print(k, w)

CRIM 0.051546675525287036
ZN 0.0012225748490342083
INDUS 0.00616306063049175
CHAS 0.0014967837138257002
NOX 0.014347870324282629
RM 0.5780188137201769
AGE 0.008228138390866991
DIS 0.0856313933484794
RAD 0.0017776421720541594
TAX 0.010343617302936422
PTRATIO 0.013940668422616122
B 0.0178010723265184
LSTAT 0.20948168927343025


In [14]:
regr = linear_model.LinearRegression()

# Fit the model on the training data
regr.fit(x_train, y_train)

# Run the test data through the model to get predictions
y_pred = regr.predict(x_test)

In [15]:
# Conventional argument order is (y_true, y_pred); MSE is symmetric, so the value is unchanged
mean_squared_error(y_test, y_pred)

17.038701324921963

Linear regression does noticeably better here: its test MSE is about 17.04, versus about 23.59 for the decision tree.
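
The tree in In [11] is grown with no depth limit and no `random_state`, so it can fit noise in the training data and its score can vary between runs. The sketch below shows one way to make the comparison more stable: fix the seed, add a depth-limited tree, and score all models with 5-fold cross-validation. Because `load_boston` is not available in recent scikit-learn releases, the sketch substitutes `load_diabetes`; that substitution, the `max_depth=3` setting, and the fold count are assumptions, so the numbers it prints are not comparable with those above.

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Assumption: load_diabetes stands in for the Boston data used above.
data = datasets.load_diabetes()
X, y = data.data, data.target

models = {
    "linear regression": linear_model.LinearRegression(),
    "tree (no depth limit)": DecisionTreeRegressor(random_state=0),
    "tree (max_depth=3)": DecisionTreeRegressor(max_depth=3, random_state=0),
}

cv_mse = {}
for name, model in models.items():
    # scikit-learn reports negated MSE so that higher is always better.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()
    print(name, round(cv_mse[name], 2))
```

The same loop could be pointed at any regression dataset by replacing `X` and `y`.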

In [16]:
for k, w in zip(boston.feature_names, regr.coef_):
    if w != 0:
        print(k, w)

CRIM -0.12585665878406954
ZN 0.0484257396100201
INDUS 0.01840852809252633
CHAS 3.085095691516899
NOX -17.327701820564606
RM 3.6167471330861467
AGE 0.0021918185271774765
DIS -1.4936113225001264
RAD 0.3199792000272681
TAX -0.01272946486141267
PTRATIO -0.927469085924641
B 0.009509124683760478
LSTAT -0.5335924706228666
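
The assignment also lists the wine dataset. Since wine is a classification problem, the sketch below compares a decision tree against logistic regression rather than linear regression. The choice of `LogisticRegression`, the `StandardScaler` step, `max_depth=3`, and the seeds are all assumptions made for illustration.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)

# Decision tree; it needs no feature scaling.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(x_train, y_train)
tree_acc = metrics.accuracy_score(y_test, tree.predict(x_test))

# Logistic regression; features are standardized first to help the solver converge.
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(x_train, y_train)
logreg_acc = metrics.accuracy_score(y_test, logreg.predict(x_test))

print("Decision tree accuracy:", tree_acc)
print("Logistic regression accuracy:", logreg_acc)
```

As with the other datasets, `tree.feature_importances_` can then be zipped with `wine.feature_names` to see which features the tree relies on.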
