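## [Assignment Focus]
By now you should have a clear picture of what the data in a dataset looks like, including the features and the labels. Keep in mind that in any future project, the data must be cleaned into this same format before it can be fed to a model for training.
Today's assignment introduces the decision tree, a very important model. Make sure you understand what each of its hyperparameters means, then try adjusting them and observe how they affect the final predictions.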
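As a quick check of the "features and labels" format described above, the following minimal sketch inspects the iris dataset used later in this notebook. The variable names `X` and `y` are chosen here for illustration and do not appear in the original cells.

```python
from sklearn import datasets

# Load the same iris dataset the notebook uses below
iris = datasets.load_iris()

# X and y are illustrative names: a 2-D feature matrix and a 1-D label vector
X, y = iris.data, iris.target

print("features:", X.shape, X.dtype)   # one row per sample, one column per feature
print("labels:  ", y.shape, y.dtype)   # one integer class label per sample
print("classes: ", list(iris.target_names))
```

Any dataset prepared for scikit-learn estimators should end up in this shape: one row of numeric features per sample and one matching label per row.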

## Assignment

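1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of a regression model.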
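For the first task, one possible approach is to loop over a few hyperparameter settings and record the test accuracy of each. The settings in `param_grid` below are hypothetical values picked only for illustration, and `random_state=0` is added to the classifier (the original cells do not set it) so that repeated runs are comparable.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# Hypothetical settings, chosen only to illustrate the effect of each parameter
param_grid = [
    {"max_depth": 1},                           # a single split (decision stump)
    {"max_depth": 3},                           # a shallow tree
    {"max_depth": None, "min_samples_leaf": 5}, # unlimited depth, larger leaves
    {"criterion": "entropy"},                   # information gain instead of Gini
]

results = []
for params in param_grid:
    # random_state is an assumption added here to make runs repeatable
    clf = DecisionTreeClassifier(random_state=0, **params)
    clf.fit(x_train, y_train)
    acc = metrics.accuracy_score(y_test, clf.predict(x_test))
    results.append((params, acc))
    print(params, "->", round(acc, 4))
```

A depth-1 tree has only two leaves, so it can predict at most two of the three iris classes. This makes it a useful lower bound when judging the other settings.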

In [1]:
from sklearn import datasets, metrics

# Use DecisionTreeClassifier for classification problems and DecisionTreeRegressor for regression problems
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split


In [2]:
# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

# Build the model
clf = DecisionTreeClassifier()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

In [3]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.9777777777777777


In [4]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [5]:
print("Feature importance: ", clf.feature_importances_)

Feature importance:  [0.         0.02150464 0.89367339 0.08482197]
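The raw importance array is easier to read when each value is paired with its feature name. The sketch below refits a tree on the same split and sorts the features by importance. The `ranked` list is a name introduced here, and `random_state=0` on the classifier is an assumption, so the exact values may differ from the output printed above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# random_state on the tree is an assumption added for repeatability
clf = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)

# Pair each feature name with its importance and sort from most to least important
ranked = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, importance in ranked:
    print(f"{name:<20} {importance:.4f}")
```

The importances are normalized to sum to 1, so each value can be read as that feature's share of the total impurity reduction in the fitted tree.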


2. Switch to other datasets (boston, wine) and compare the results with those of a regression model.

In [6]:
# Load the wine dataset
wine = datasets.load_wine()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3, random_state=0)

# Build the model
clf = DecisionTreeClassifier()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

print(y_pred)

[0 2 1 0 1 1 0 2 1 1 2 2 0 1 2 1 0 0 2 0 1 0 1 1 1 1 0 1 1 2 0 0 1 0 0 0 2
 1 1 2 1 0 1 1 1 0 2 1 2 0 2 2 0 2]


In [7]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.9259259259259259


In [8]:
print(wine.feature_names)

['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']


In [9]:
print("Feature importance: ", clf.feature_importances_)

Feature importance:  [0.         0.         0.02389053 0.         0.         0.
 0.4175378  0.         0.         0.40621672 0.02036125 0.
 0.1319937 ]
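The assignment asks for a comparison with a regression model. For a classification dataset such as wine, one reasonable counterpart is logistic regression. The sketch below is an assumption about how that comparison could be set up, not part of the original notebook: the `StandardScaler` pipeline, `max_iter=1000`, and `random_state=0` are choices made here.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=0
)

models = {
    # Trees do not need feature scaling
    "decision tree": DecisionTreeClassifier(random_state=0),
    # Logistic regression converges more reliably on standardized features
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = metrics.accuracy_score(y_test, model.predict(x_test))
    print(f"{name}: {scores[name]:.4f}")
```

Both models are evaluated on the same split, so the accuracies are directly comparable.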


In [10]:
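# Load the boston dataset
# Note: load_boston was removed in scikit-learn 1.2, so this cell requires an older version
boston = datasets.load_boston()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.3, random_state=0)

# Build the model (a regressor, because the target is continuous)
clf = DecisionTreeRegressor()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)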

print(y_pred)

[23.7 50.  20.3 11.7 22.2 20.4 21.4 20.4 22.3 19.6 10.5 17.9 12.6  8.8
 50.  37.  21.4 32.7 28.4 22.2 23.1 22.7 18.5 24.7 19.7 16.1 20.1 15.6
 37.6 18.4 12.5 13.5 17.6 22.2 24.6 19.4  8.7 50.  13.4 17.9 23.1 19.7
 22.  13.5 24.4 22.9 19.8 19.1 15.6 25.  19.  23.7 21.1 35.2 15.6 19.8
 18.5 17.5 50.  20.  23.1 23.1 35.4 32.4 13.5 32.4 16.7 16.  14.1 23.3
 15.3 24.5 22.5 35.4 26.5  8.8 37.6 23.3 22.  21.7 26.6 18.  50.  37.6
 41.7 25.  22.  14.2 21.6 18.9 19.8 11.8 20.5 31.6 22.6 20.5 11.9 22.
 14.3 18.5 24.2 19.3 27.5 18.5 31.6 21.7  8.3 19.5 22.6 24.5 23.6 17.9
 14.4 19.1 14.4 19.8 10.2 18.4  9.5 48.5 26.6  7.2 22.4 21.7 19.4 20.5
 36.  22.5 20.5 36.1 14.1  9.5 18.8 15.6 10.4 36.1 21.1 15.6 26.6  8.7
 10.2 21.8 23.6 24.6 23.4 19.4 36.4 37.9 12.8  7.2 32.4 23.9]


In [11]:
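# accuracy_score does not work on regression targets;
# it raises "ValueError: continuous is not supported"

#acc = metrics.accuracy_score(y_test, y_pred)
#print("Accuracy: ", acc)


# Measure the gap between predicted and actual values with MSE instead
print("Mean squared error: %.2f"
      % metrics.mean_squared_error(y_test, y_pred))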

Mean squared error: 27.71
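To compare the tree regressor with a regression model, both can be scored with MSE on the same split. Because `load_boston` is unavailable in scikit-learn 1.2 and later, this sketch assumes `load_diabetes` as a stand-in dataset, so its numbers are not comparable to the 27.71 reported above. `LinearRegression` and the tree's `random_state=0` are also choices made here rather than taken from the original cells.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: load_diabetes stands in for the removed load_boston dataset
diabetes = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.3, random_state=0
)

regressors = {
    "decision tree": DecisionTreeRegressor(random_state=0),
    "linear regression": LinearRegression(),
}

mse_scores = {}
for name, reg in regressors.items():
    reg.fit(x_train, y_train)
    # Lower MSE means predictions are closer to the true values
    mse_scores[name] = metrics.mean_squared_error(y_test, reg.predict(x_test))
    print(f"{name}: MSE = {mse_scores[name]:.2f}")
```

An unpruned tree can fit the training data very closely, so it is worth repeating the comparison with `max_depth` or `min_samples_leaf` set to see how much of the gap comes from overfitting.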


In [12]:
print(boston.feature_names)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


In [13]:
print("Feature importance: ", clf.feature_importances_)

Feature importance:  [8.85329736e-02 2.06998644e-04 8.06071211e-03 1.75672242e-04
 6.61802737e-03 5.97457009e-01 1.40846103e-02 6.02130282e-02
 2.03163236e-03 1.00672949e-02 2.58905001e-02 1.13783398e-02
 1.75283201e-01]
