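## [Assignment Focus]
Make sure you understand what each hyperparameter of the random forest model means, and observe how changing the hyperparameters affects the results.

## Assignment

1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.
2. Switch to another dataset (boston, wine) and compare the results with those of a regression model and a decision tree.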
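
For the wine part of item 2, a comparison could look like the sketch below. It is only an illustration: the choice of `LogisticRegression` as the "regression model", the scaling step, the hyperparameter values, and the names `x_tr`, `x_te`, `models`, and `wine_scores` are all assumptions, not part of the original notebook.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load the wine classification dataset and hold out 20% for testing
wine = load_wine()
x_tr, x_te, y_tr, y_te = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=10, stratify=wine.target
)

# Hypothetical settings, chosen for illustration only
models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0),
}

# Fit each model on the same split and record its test accuracy
wine_scores = {}
for name, model in models.items():
    model.fit(x_tr, y_tr)
    wine_scores[name] = accuracy_score(y_te, model.predict(x_te))
    print(f"{name}: accuracy = {wine_scores[name]:.3f}")
```

Fixing `random_state` on the tree-based models keeps the comparison repeatable, so a change in accuracy can be attributed to the hyperparameters rather than to random seeding.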

In [21]:
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import pandas as pd

In [6]:
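# Load the Boston housing dataset
# (load_boston was removed in scikit-learn 1.2, so this cell needs an older version)
boston = datasets.load_boston()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2, random_state=10)

# Build the model (20 trees, each with a maximum depth of 4)
clf = RandomForestRegressor(n_estimators=20, max_depth=4)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)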

In [11]:
mse = metrics.mean_squared_error(y_test, y_pred)
r_square = metrics.r2_score(y_test, y_pred)
print("R^2: ", r_square)

R^2:  0.8101986545335425
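
The assignment also asks for a comparison with a regression model and a decision tree. A minimal sketch of that comparison follows. It assumes `load_diabetes` as a stand-in regression dataset (`load_boston` is unavailable in scikit-learn 1.2 and later), and the names `reg_models` and `reg_scores` are hypothetical, so the scores are not comparable to the Boston result above.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression dataset (assumption: not the dataset used in this notebook)
data = load_diabetes()
xd_train, xd_test, yd_train, yd_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=10
)

# Same tree settings as the notebook's first forest, with seeds fixed for repeatability
reg_models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0),
}

# Fit each model on the same split and record its test R^2
reg_scores = {}
for name, model in reg_models.items():
    model.fit(xd_train, yd_train)
    reg_scores[name] = r2_score(yd_test, model.predict(xd_test))
    print(f"{name}: R^2 = {reg_scores[name]:.3f}")
```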


#### Tune the parameters to see whether the R^2 score improves

In [18]:
# Build three variants (maximum depth 4 throughout):
#   clf1: 20 trees with min_samples_split=10 and min_samples_leaf=5
#   clf2: 100 trees
#   clf3: 100 trees with min_samples_split=10 and min_samples_leaf=5
clf1 = RandomForestRegressor(n_estimators=20, max_depth=4, min_samples_split=10, min_samples_leaf=5)
clf2 = RandomForestRegressor(n_estimators=100, max_depth=4)
clf3 = RandomForestRegressor(n_estimators=100, max_depth=4, min_samples_split=10, min_samples_leaf=5)
# Train the models
clf1.fit(x_train, y_train)
clf2.fit(x_train, y_train)
clf3.fit(x_train, y_train)

# Predict on the test set
y_pred1 = clf1.predict(x_test)
y_pred2 = clf2.predict(x_test)
y_pred3 = clf3.predict(x_test)

In [19]:
# MSE of the original model (computed but not printed)
mse = metrics.mean_squared_error(y_test, y_pred)
r_square1 = metrics.r2_score(y_test, y_pred1)
r_square2 = metrics.r2_score(y_test, y_pred2)
r_square3 = metrics.r2_score(y_test, y_pred3)
print("First adjustment R^2: ", r_square1)
print("Second adjustment R^2: ", r_square2)
print("Third adjustment R^2: ", r_square3)

First adjustment R^2:  0.7805905367305189
Second adjustment R^2:  0.8274575937101761
Third adjustment R^2:  0.782500023068792


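- Raising the number of trees from 20 to 100 increased the test R^2 from 0.810 to 0.827, while setting min_samples_split=10 and min_samples_leaf=5 without further tuning lowered it to about 0.78. The forests are built without a fixed random_state, so the exact values change from run to run.
- The table below lists the feature importances of each trained model; RM and LSTAT are the most important features for these models.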
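
Because a single train/test split with unseeded forests gives noisy scores, one way to check such a comparison is cross-validation with a fixed seed. The sketch below assumes `load_diabetes` as a stand-in dataset (since `load_boston` is unavailable in scikit-learn 1.2 and later) and uses hypothetical names `settings` and `cv_results`. It mirrors the four configurations tried above but does not reproduce their numbers.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Stand-in regression dataset (assumption: not the dataset used in this notebook)
X, y = load_diabetes(return_X_y=True)

# Hypothetical grid mirroring the four configurations in the notebook
settings = {
    "20 trees": dict(n_estimators=20, max_depth=4),
    "20 trees, larger leaves": dict(n_estimators=20, max_depth=4,
                                    min_samples_split=10, min_samples_leaf=5),
    "100 trees": dict(n_estimators=100, max_depth=4),
    "100 trees, larger leaves": dict(n_estimators=100, max_depth=4,
                                     min_samples_split=10, min_samples_leaf=5),
}

# 5-fold cross-validated R^2 for each configuration, with a fixed seed
cv_results = {}
for label, params in settings.items():
    model = RandomForestRegressor(random_state=0, **params)
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    cv_results[label] = (scores.mean(), scores.std())
    print(f"{label}: mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

If the gap between two configurations is smaller than the spread across folds, the difference is probably not meaningful.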

In [25]:
# Feature importances of each model, one column per model
adj = pd.DataFrame(clf.feature_importances_, index=boston.feature_names, columns=['original'])
adj1 = pd.DataFrame(clf1.feature_importances_, index=boston.feature_names, columns=['adj1'])
adj2 = pd.DataFrame(clf2.feature_importances_, index=boston.feature_names, columns=['adj2'])
adj3 = pd.DataFrame(clf3.feature_importances_, index=boston.feature_names, columns=['adj3'])
# Join the four columns side by side on the shared feature-name index
a = pd.concat([adj, adj1, adj2, adj3], axis=1)
a

Feature,original,adj1,adj2,adj3
CRIM,0.027334,0.024224,0.0255,0.02112
ZN,0.0,0.0,3.2e-05,0.00037
INDUS,0.000865,0.003042,0.002566,0.000946
CHAS,0.0,0.0,0.001406,0.0
NOX,0.023241,0.015413,0.019337,0.016476
RM,0.317364,0.392418,0.349752,0.374287
AGE,0.003243,0.004266,0.004507,0.007154
DIS,0.049071,0.013361,0.05468,0.021781
RAD,0.001881,0.000679,0.002387,0.00221
TAX,0.008552,0.007067,0.006425,0.005036
