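## [Assignment Focus]
By now you should have a clear picture of what the data in a dataset looks like, including its features and labels. Keep in mind that in any future project the data must be cleaned into this same format before it can be fed to a model for training.
Today's assignment introduces the decision tree, a very important model. Make sure you understand what each of its hyperparameters means, and try adjusting them to see how they affect the final predictions.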

## Four Steps to Build a Model

Building a machine learning model in Scikit-learn is quite simple. The workflow is roughly the following four steps:

1. Load the data and check its shape (how many samples (rows), how many features (columns), and what type is the label?)
    - Ways to load data:
        - **Read a .csv file with pandas:** pd.read_csv
        - **Read a .txt file with numpy:** np.loadtxt
        - **Use a dataset bundled with Scikit-learn:** sklearn.datasets.load_xxx
    - **Check the amount of data:** data.shape (data should be an np.array or a DataFrame)

2. Split the data into training (train) and test (test) sets
    - train_test_split(data)

3. Create the model and fit it to the training data
    - clf = DecisionTreeClassifier()
    - clf.fit(x_train, y_train)

4. Feed the test features into the trained model to get predictions, then evaluate them against the test labels (y_test)
    - clf.predict(x_test)
    - accuracy_score(y_test, y_pred)
    - f1_score(y_test, y_pred)
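
The four steps above can be sketched end to end as follows. This is a minimal illustration, assuming the built-in wine dataset and the same `test_size=0.2, random_state=4` split used later in this notebook; `random_state=0` on the tree and `average="macro"` for `f1_score` are choices made here (the latter because wine has three classes, and the default binary averaging does not apply to multiclass targets).

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 1: load the data and check its shape
wine = datasets.load_wine()
print("features:", wine.data.shape, "labels:", wine.target.shape)

# Step 2: split into train / test sets
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=4
)

# Step 3: create the model and fit it on the training data
clf = DecisionTreeClassifier(random_state=0)  # random_state is an assumption for repeatability
clf.fit(x_train, y_train)

# Step 4: predict on the test features and evaluate against y_test
y_pred = clf.predict(x_test)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="macro")  # multiclass target needs an averaging mode
print("accuracy:", acc)
print("macro F1:", f1)
```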

## Import the Required Modules

In [57]:
from sklearn import datasets, metrics
import warnings
warnings.filterwarnings('ignore')

# For a classification problem use DecisionTreeClassifier; for a regression problem use DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier  # classification
from sklearn.tree import DecisionTreeRegressor   # regression
from sklearn.model_selection import train_test_split

## Assignment

## 1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change


In [58]:
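# Load the diabetes dataset
# Note: diabetes.target is a quantitative measure of disease progression, so this is
# really a regression problem; DecisionTreeClassifier treats every distinct target
# value as its own class, which is why the accuracy below is so low
diabetes = datasets.load_diabetes()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2, random_state=4)

# Create the model
clf = DecisionTreeClassifier()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)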

In [59]:
print(y_pred[:20])

[ 84. 145. 206.  74. 214. 219.  97. 252.  59.  69.  97. 126. 142. 131.
 221. 131. 259.  53. 191. 296.]


In [60]:
acc = metrics.accuracy_score(y_test, y_pred)
print('Accuracy : ', acc)

Accuracy :  0.011235955056179775
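
An accuracy of about 0.011 corresponds to a single correct prediction among the 89 test samples (1/89). A likely reason is that the diabetes target is a numeric disease-progression score rather than a class label, so the classifier has to treat every distinct value as a separate class. The sketch below is one way to check this; the variable names are illustrative, and `type_of_target` is the scikit-learn utility that reports how a target array is interpreted.

```python
import numpy as np
from sklearn import datasets
from sklearn.utils.multiclass import type_of_target

diabetes = datasets.load_diabetes()

n_samples = diabetes.target.shape[0]
n_classes = np.unique(diabetes.target).size  # each distinct value becomes its own class
target_type = type_of_target(diabetes.target)

print("samples:", n_samples)
print("distinct target values:", n_classes)
# Integer-valued floats are accepted as class labels, so the classifier fits without error
print("target type as seen by scikit-learn:", target_type)
```

With that many classes and only a few samples per class, exact-match accuracy is not a meaningful score here; DecisionTreeRegressor with a regression metric is the better fit for this dataset.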


In [61]:
print(diabetes.feature_names)

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']


In [62]:
print("Feature importance: ", clf.feature_importances_)

Feature importance:  [0.11529632 0.05067004 0.11212819 0.10170237 0.11158884 0.10176858
 0.10307558 0.07181085 0.11850685 0.11345239]


## 2. Switch to other datasets (boston, wine) and compare against the results of a regression model

In [63]:
dir(datasets)

['__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_svmlight_format',
 'base',
 'california_housing',
 'clear_data_home',
 'covtype',
 'dump_svmlight_file',
 'fetch_20newsgroups',
 'fetch_20newsgroups_vectorized',
 'fetch_california_housing',
 'fetch_covtype',
 'fetch_kddcup99',
 'fetch_lfw_pairs',
 'fetch_lfw_people',
 'fetch_mldata',
 'fetch_olivetti_faces',
 'fetch_openml',
 'fetch_rcv1',
 'fetch_species_distributions',
 'get_data_home',
 'kddcup99',
 'lfw',
 'load_boston',
 'load_breast_cancer',
 'load_diabetes',
 'load_digits',
 'load_files',
 'load_iris',
 'load_linnerud',
 'load_mlcomp',
 'load_sample_image',
 'load_sample_images',
 'load_svmlight_file',
 'load_svmlight_files',
 'load_wine',
 'make_biclusters',
 'make_blobs',
 'make_checkerboard',
 'make_circles',
 'make_classification',
 'make_friedman1',
 'make_friedman2',
 'make_friedman3',
 'make_gaussian_quantiles',
 'make_hastie_10_2'

## wine

In [64]:
# Load the wine dataset
wine = datasets.load_wine()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.2, random_state=4)

# Create the model
clf = DecisionTreeClassifier()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

In [65]:
y_pred

array([2, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 2, 2, 0, 1, 0, 1, 1, 2, 1, 2,
       1, 2, 0, 2, 1, 1, 0, 2, 0, 1, 0, 1, 2, 1])

In [66]:
acc = metrics.accuracy_score(y_test, y_pred)
print('Accuracy : ', acc)

Accuracy :  0.8888888888888888
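
The score above comes from the default hyperparameters. As a rough sketch of the "adjust the parameters and observe" exercise, the loop below varies `max_depth` on the same wine split while holding `criterion` and `min_samples_split` at their defaults. The set of depths tried and `random_state=0` are arbitrary choices made for this illustration, not values from the assignment.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=4
)

results = {}
for depth in (1, 2, 3, None):  # None lets the tree grow until its leaves are pure
    clf = DecisionTreeClassifier(
        criterion="gini",        # split quality measure (default)
        max_depth=depth,         # limits how deep the tree may grow
        min_samples_split=2,     # minimum samples needed to split a node (default)
        random_state=0,          # assumption: fixed so repeated runs agree
    )
    clf.fit(x_train, y_train)
    results[depth] = accuracy_score(y_test, clf.predict(x_test))

for depth, acc in results.items():
    print("max_depth =", depth, "accuracy =", acc)
```

The same loop can be repeated for `min_samples_split`, `min_samples_leaf`, or `criterion` to see which settings actually move the test score.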


In [67]:
print(wine.feature_names)

['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']


In [68]:
print("Feature importance: ", clf.feature_importances_)

Feature importance:  [0.01277488 0.         0.         0.         0.         0.
 0.07753082 0.         0.         0.38969228 0.04244155 0.0400926
 0.43746787]
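
The raw importance array is easier to read when each value is paired with its feature name and sorted. The snippet below is a small sketch of that, assuming a tree fitted on the full wine dataset with `random_state=0`, so its numbers will not necessarily match the output above.

```python
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
clf = DecisionTreeClassifier(random_state=0).fit(wine.data, wine.target)

importances = clf.feature_importances_
order = np.argsort(importances)[::-1]  # indices from most to least important

for i in order:
    print(f"{wine.feature_names[i]:<30} {importances[i]:.4f}")
```

Features with an importance of 0 were never used for a split in this particular tree; that does not by itself mean they carry no information.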


## boston

In [74]:
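# Load the boston dataset
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
boston = datasets.load_boston()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2, random_state=4)

# Create the model
#clf = DecisionTreeClassifier() # original
clf = DecisionTreeRegressor()   # regression problem

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)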

In [75]:
y_pred

array([15. , 22. , 16.2, 14.4, 44.8, 23.8, 37.3, 18.4, 17.2, 15.6, 23.9,
       16.5, 20. , 21.6, 19.6, 13.8, 19.4,  8.8,  7. , 15.6,  7. , 15.6,
       20.5, 19.2, 18.6, 21.4, 14.4, 13.1, 21.4, 19.5,  8.8, 20.5, 36.5,
       22.6, 13.8, 13.9, 33.2, 50. , 24.5, 23.1, 44. , 37. , 14.2, 30.1,
       25.1, 20.9, 50. , 21.7, 21.4, 23. , 30.8, 23.8,  8.5, 27.1, 15.2,
       20.3, 22. , 33.1, 19.9, 33.1, 18.1, 21.4, 30.5, 20. , 43.1, 30.1,
       22. , 10.5, 17.5, 23. , 21.1, 19.4, 22. , 30.1, 13.3, 33.4, 14.1,
       21. , 17.7, 23. , 21.1, 15.2, 27.9, 24. , 28.1, 20.6, 32.2, 16.8,
       22.5, 50. , 29. , 50. , 17.5, 44.8, 20.4, 24.5, 19.7, 27.9, 14.6,
       19. , 10.5, 22.6])

In [79]:
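# accuracy_score only supports classification targets; y_test is continuous here,
# so this call raises the error shown below
acc = metrics.accuracy_score(y_test, y_pred)
print('Accuracy : ', acc)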

ValueError: continuous is not supported
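
The error arises because `accuracy_score` is a classification metric, while a regressor's predictions are continuous values. A regression metric such as mean squared error or R² is the usual substitute. The sketch below shows the pattern; it uses the diabetes dataset as a stand-in because `load_boston` is no longer available in recent scikit-learn releases, and `reg` and `random_state=0` are names and settings chosen for this example.

```python
from sklearn import datasets
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression dataset (assumption: replaces boston for this illustration)
data = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=4
)

reg = DecisionTreeRegressor(random_state=0)
reg.fit(x_train, y_train)
y_pred = reg.predict(x_test)

mse = mean_squared_error(y_test, y_pred)  # average squared error, lower is better
r2 = r2_score(y_test, y_pred)             # 1.0 is a perfect fit; can be negative
print("MSE:", mse)
print("R^2:", r2)
```

On an environment where `boston` is still loadable, the same two metric calls can be applied directly to the `y_test` and `y_pred` from the cell above.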

In [77]:
print(boston.feature_names)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


In [78]:
print("Feature importance: ", clf.feature_importances_)

Feature importance:  [4.51157288e-02 9.92566375e-04 7.61135592e-03 2.45236017e-04
 3.16255667e-02 5.98815796e-01 1.14105069e-02 5.24263382e-02
 3.39431799e-04 1.58948602e-02 1.62940699e-02 7.46236112e-03
 2.11766182e-01]
