## [作業重點]
目前你應該已經要很清楚資料集中，資料的型態是什麼樣子囉！包含特徵 (features) 與標籤 (labels)。因此要記得未來不管什麼專案，必須要把資料清理成相同的格式，才能送進模型訓練。
今天的作業開始踏入決策樹這個非常重要的模型，請務必確保你理解模型中每個超參數的意思，並試著調整看看，對最終預測結果的影響為何

## 作業

1. 試著調整 DecisionTreeClassifier(...) 中的參數，並觀察是否會改變結果？
2. 改用其他資料集 (boston, wine)，並與回歸模型的結果進行比較

In [1]:
from sklearn import datasets, metrics

# 如果是分類問題，請使用 DecisionTreeClassifier，若為回歸問題，請使用 DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split

In [2]:
def train_regr_from_data(regr, skdataset):
    data = skdataset.data
    target = skdataset.target
    
    x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=4)
    
    regr.fit(x_train, y_train)
    
    y_pred = regr.predict(x_test)
    
    print(f'train score: {regr.score(x_train, y_train)}')
    print(f'test score: {regr.score(x_train, y_train)}')
    
    return y_pred, y_test, regr

In [3]:
boston = datasets.load_boston()

In [4]:
dtr_1 = DecisionTreeRegressor(min_samples_leaf=1)

dtr_1_pred, dtr_1_test, dtr_1_train_rr = train_regr_from_data(dtr_1, boston)

train score: 1.0
test score: 1.0


In [5]:
dtr_2 = DecisionTreeRegressor(min_samples_leaf=2)

dtr_2_pred, dtr_2_test, dtr_2_train_rr = train_regr_from_data(dtr_2, boston)

train score: 0.9827660200715456
test score: 0.9827660200715456


In [6]:
dtr_4 = DecisionTreeRegressor(min_samples_leaf=4)

dtr_4_pred, dtr_4_test, dtr_4_train_rr = train_regr_from_data(dtr_4, boston)

train score: 0.9173888037360505
test score: 0.9173888037360505


In [7]:
print(f'mean squared error from DTR (min_samples_leaf = 1): {metrics.mean_squared_error(dtr_1_pred, dtr_1_test)}')
print(f'mean squared error from DTR (min_samples_leaf = 2): {metrics.mean_squared_error(dtr_2_pred, dtr_2_test)}')
print(f'mean squared error from DTR (min_samples_leaf = 4): {metrics.mean_squared_error(dtr_4_pred, dtr_4_test)}')

mean squared error from DTR (min_samples_leaf = 1): 25.939607843137253
mean squared error from DTR (min_samples_leaf = 2): 30.07754357298474
mean squared error from DTR (min_samples_leaf = 4): 24.02143445822773


In [8]:
wine = datasets.load_wine()

In [9]:
dtc_1 = DecisionTreeClassifier(min_samples_leaf=1)

dtc_1_pred, dtc_1_test, dtc_1_train_rr = train_regr_from_data(dtc_1, wine)

train score: 1.0
test score: 1.0


In [10]:
dtc_2 = DecisionTreeClassifier(min_samples_leaf=2)

dtc_2_pred, dtc_2_test, dtc_2_train_rr = train_regr_from_data(dtc_2, wine)

train score: 0.9859154929577465
test score: 0.9859154929577465


In [11]:
dtc_4 = DecisionTreeClassifier(min_samples_leaf=4)

dtc_4_pred, dtc_4_test, dtc_4_train_rr = train_regr_from_data(dtc_4, wine)

train score: 0.9507042253521126
test score: 0.9507042253521126


In [12]:
print(f'accuracy from DTC (min_samples_leaf = 1): {metrics.accuracy_score(dtc_1_pred, dtc_1_test)}')
print(f'accuracy from DTC (min_samples_leaf = 2): {metrics.accuracy_score(dtc_2_pred, dtc_2_test)}')
print(f'accuracy from DTc (min_samples_leaf = 4): {metrics.accuracy_score(dtc_4_pred, dtc_4_test)}')

accuracy from DTC (min_samples_leaf = 1): 0.9166666666666666
accuracy from DTC (min_samples_leaf = 2): 0.8611111111111112
accuracy from DTc (min_samples_leaf = 4): 0.8611111111111112


In [13]:
diabetes = datasets.load_diabetes()

In [14]:
dtr = DecisionTreeRegressor(min_samples_leaf=1)

dtr_pred, dtr_test, dtr_train_rr = train_regr_from_data(dtr, diabetes)

print(f'mean squared error: {metrics.mean_squared_error(dtr_pred, dtr_test)}')

train score: 1.0
test score: 1.0
mean squared error: 8529.033707865168
