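## [Assignment Focus]
By now you should have a clear picture of what the data in a dataset looks like, including the features and the labels. Remember that in any future project, the data must be cleaned into this same format before it can be fed to a model for training.
Today's assignment introduces the decision tree, a very important model. Make sure you understand what each of its hyperparameters means, and try adjusting them to see how they affect the final predictions.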

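## Assignment

1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of a regression model.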
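
For the second task, a minimal sketch of the comparison on the wine dataset might look like the following. Wine is a classification problem, so the linear baseline here is assumed to be `LogisticRegression` with feature scaling, which the notebook itself does not use. The split ratio, `stratify` setting, and `random_state` values are arbitrary choices for illustration.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load the wine dataset and hold out 25% of it for testing
wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4, stratify=wine.target)

# Assumed baseline: logistic regression on standardized features
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
}

# Fit each model on the same split and record its test accuracy
wine_scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    wine_scores[name] = metrics.accuracy_score(y_test, model.predict(x_test))
    print(f"{name}: accuracy = {wine_scores[name]:.3f}")
```

Fixing `random_state` on the tree makes the comparison repeatable from run to run.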

In [1]:
import pandas as pd
import numpy as np
from sklearn import datasets, metrics, linear_model
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split

In [2]:
# Load the iris dataset
iris = datasets.load_iris()
print(iris.feature_names)
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Build the model
clf = DecisionTreeClassifier()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print("Feature importance: ", clf.feature_importances_)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Accuracy:  0.9736842105263158
Feature importance:  [0.01796599 0.         0.52229134 0.45974266]


In [3]:
# Switch the split criterion from gini (the default) to entropy
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158
Feature importance:  [0.         0.0156062  0.62264163 0.36175217]


#### Summary: switching to entropy leaves the accuracy unchanged, but the feature importances shift noticeably
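
To see what the `criterion` switch changes, the two impurity measures can be computed by hand. This is a sketch using the standard definitions (Gini $1 - \sum_k p_k^2$, entropy $-\sum_k p_k \log_2 p_k$). The helper names `class_proportions`, `gini`, and `entropy` are made up for illustration and are not scikit-learn functions.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Fraction of samples belonging to each class present in a node."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Entropy in bits: sum(-p_k * log2(p_k)) over the classes present."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))

pure_node = [0, 0, 0, 0]      # a single class
mixed_node = [0, 0, 1, 1]     # two classes, evenly mixed

print(gini(pure_node), entropy(pure_node))    # 0.0 0.0
print(gini(mixed_node), entropy(mixed_node))  # 0.5 1.0
```

Both measures are zero for a pure node and largest for an even mix, but they weigh intermediate mixtures differently. The tree can therefore pick different splits, and report different feature importances, while reaching the same accuracy.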

In [4]:
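# max_depth: maximum depth the tree is allowed to grow to
# min_samples_split: minimum number of samples a node must have before it can be split
# min_samples_leaf: minimum number of samples required at each final leaf node
clf = DecisionTreeClassifier(criterion='entropy',
                             max_depth=5,
                             min_samples_split=5,
                             min_samples_leaf=1)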
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158
Feature importance:  [0.         0.         0.64454953 0.35545047]


#### Summary: setting max_depth=5 and min_samples_split=5 (min_samples_leaf=1 is the default) changes the feature importances, but the accuracy stays the same
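
One reason the accuracy does not move is that the test set holds only 38 samples, so accuracy changes in steps of about 0.026 (0.9737 is 37 of 38 correct). A rough way to make the effect of the hyperparameters more visible is to sweep them under cross-validation. The sketch below assumes 5-fold cross-validation, a fixed `random_state`, and a hand-picked grid of values; none of these come from the notebook.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()

# Mean cross-validated accuracy for each (max_depth, min_samples_split) pair
sweep_results = {}
for max_depth in (1, 2, 3, 5, None):
    for min_samples_split in (2, 5, 20):
        clf = DecisionTreeClassifier(
            criterion="entropy",
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            random_state=0,  # fixed so repeated runs give the same tree
        )
        scores = cross_val_score(clf, iris.data, iris.target, cv=5)
        sweep_results[(max_depth, min_samples_split)] = scores.mean()

for (max_depth, min_samples_split), acc in sweep_results.items():
    print(f"max_depth={max_depth}, min_samples_split={min_samples_split}: {acc:.3f}")
```

A tree with `max_depth=1` has only two leaves, so it can predict at most two of the three iris classes. That row of the sweep shows clearly how a tight depth limit caps accuracy.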

## Decision Tree Regressor: using Boston Housing Price Dataset
[Boston house prices dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston)
- [Data Set Characteristics](https://scikit-learn.org/stable/datasets/index.html#boston-dataset)
    - CRIM: per capita crime rate by town
    - ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
    - INDUS: proportion of non-retail business acres per town
    - CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - NOX: nitric oxides concentration (parts per 10 million)
    - RM: average number of rooms per dwelling
    - AGE: proportion of owner-occupied units built prior to 1940
    - DIS: weighted distances to five Boston employment centres
    - RAD: index of accessibility to radial highways
    - TAX: full-value property-tax rate per $\$10,000$
    - PTRATIO: pupil-teacher ratio by town
    - B: $1000(Bk - 0.63)^2$, where Bk is the proportion of blacks by town
    - LSTAT: % lower status of the population
    - MEDV: median value of owner-occupied homes in \$1000s (the value to predict)

In [5]:
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release to run.
boston = datasets.load_boston()
x = boston.data
y = boston.target
print('Boston Housing Price Feature Array Shape:', x.shape)
print('Boston Housing Price Target Array Shape:', y.shape)
print(type(x))
# Column names: the 13 features followed by the target (MEDV)
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO',
        'B', 'LSTAT', 'MEDV']
df = pd.DataFrame(np.concatenate((x, y.reshape(len(y), 1)), axis=1), columns=cols)
df.head()

Boston Housing Price Feature Array Shape: (506, 13)
Boston Housing Price Target Array Shape: (506,)
<class 'numpy.ndarray'>


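|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |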


### Linear Regression Result

In [6]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=4)
regr = linear_model.LinearRegression()
regr.fit(x_train, y_train)
y_pred = regr.predict(x_test)
print(regr.coef_)
# Gap between the predicted and actual values, measured with MSE
print("Linear Regression Mean squared error: %.2f"
      % metrics.mean_squared_error(y_test, y_pred))

[-1.25856659e-01  4.84257396e-02  1.84085281e-02  3.08509569e+00
 -1.73277018e+01  3.61674713e+00  2.19181853e-03 -1.49361132e+00
  3.19979200e-01 -1.27294649e-02 -9.27469086e-01  9.50912468e-03
 -5.33592471e-01]
Linear Regression Mean squared error: 17.04
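
As a reminder of what this metric measures, MSE is the average of the squared differences between actual and predicted values. The sketch below computes it by hand on a few made-up prices (not taken from the Boston data). The helper name `mse_by_hand` is hypothetical. It then converts the reported 17.04 into an RMSE, which is in the same units as MEDV (\$1000s).

```python
import math

def mse_by_hand(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

# Made-up prices in $1000s, chosen so the arithmetic is easy to follow
actual = [24.0, 21.5, 34.5]
predicted = [25.0, 20.5, 32.5]
toy_mse = mse_by_hand(actual, predicted)  # errors -1, 1, 2 -> (1 + 1 + 4) / 3
print(toy_mse)  # 2.0

# RMSE puts the reported MSE of 17.04 back into units of $1000s
rmse = math.sqrt(17.04)
print(round(rmse, 2))  # 4.13
```

Read this way, the linear model's typical error on this split is roughly \$4,100 per home.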


### Decision Tree Regressor Result

[Parameters](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor.decision_path)  
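- criterion: mse, friedman_mse, mae (scikit-learn 1.0 renamed mse and mae to squared_error and absolute_error; the old names were removed in 1.2)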

In [7]:
regr = DecisionTreeRegressor(
        criterion='mse'
        )
regr.fit(x_train, y_train)
y_pred = regr.predict(x_test)

# Gap between the predicted and actual values, measured with MSE
print("Decision Tree Regressor Mean squared error: %.2f"
      % metrics.mean_squared_error(y_test, y_pred))

Decision Tree Regressor Mean squared error: 24.46


In [8]:
print("Feature importance: ", regr.feature_importances_)
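# Note: get_params is not called here, so this prints the bound method's repr
# (which still lists the parameters); regr.get_params() would return them as a dict.
print("DecisionTreeRegressor parameters:", regr.get_params)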

Feature importance:  [0.07009911 0.00145927 0.00788165 0.00137664 0.01573421 0.56211908
 0.00812047 0.07838572 0.00083243 0.02640975 0.0099301  0.00689106
 0.21076052]
DecisionTreeRegressor parameters: <bound method BaseEstimator.get_params of DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')>


### Summary: trying each criterion, the ranking from best to worst (lowest to highest test MSE) is mae > linear regression > friedman_mse > mse
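
A ranking like this comes from a single split with about 51 test samples and trees built without a fixed `random_state`, so it may shift from run to run. A sketch of a steadier comparison uses cross-validation and a fixed seed. Because `load_boston` is unavailable in recent scikit-learn releases, this example substitutes the diabetes dataset as a stand-in. The 5-fold setting and the `max_depth=3` variant are also assumptions, not part of the notebook.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression dataset (assumption: used here in place of Boston)
x, y = datasets.load_diabetes(return_X_y=True)

candidates = {
    "linear regression": LinearRegression(),
    "tree (unrestricted)": DecisionTreeRegressor(random_state=0),
    "tree (max_depth=3)": DecisionTreeRegressor(max_depth=3, random_state=0),
}

# scikit-learn reports negated MSE so that larger is better; flip the sign back
cv_mse = {}
for name, model in candidates.items():
    neg_mse = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -neg_mse.mean()
    print(f"{name}: cross-validated MSE = {cv_mse[name]:.1f}")
```

The same loop can take additional `DecisionTreeRegressor` entries with different `criterion` values, using whichever names the installed scikit-learn version accepts.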