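## [Assignment focus]
Make sure you understand what each hyperparameter of the random forest model means, and observe how tuning the hyperparameters affects the results.

## Assignment

1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of the regression model and the decision tree.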
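
For task 2, the comparison on the wine dataset could look roughly like the sketch below. It follows the same split used elsewhere in this notebook (`test_size=0.1, random_state=4`). The `StandardScaler` pipeline, `max_iter=1000`, and the fixed `random_state=0` on the tree models are assumptions added here so the run is repeatable and the solver converges; they are not part of the original cells.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load the wine dataset (3-class classification)
wine = load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.1, random_state=4
)

# Three models to compare; scaling and random_state are assumptions of this sketch
models = {
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=20, random_state=0),
}

# Fit each model on the same split and record its test accuracy
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(x_test))
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

The wine test set here holds only 18 samples, so a single misclassification moves the accuracy by more than five points; small differences between the models should be read with that in mind.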

In [6]:
from sklearn import datasets, metrics, linear_model
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error, r2_score, accuracy_score


## Running the boston dataset with ordinary Linear Regression

In [3]:
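# Load the boston dataset
# (load_boston was removed in scikit-learn 1.2, so this cell needs an older release)
boston = datasets.load_boston()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.1, random_state=4)

# Build a linear regression model
regr = linear_model.LinearRegression()

# Fit the model on the training data
regr.fit(x_train, y_train)

# Predict on the test data
y_pred = regr.predict(x_test)

# Measure the gap between predicted and actual values with MSE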
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred))

Mean squared error: 17.03
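
As a reminder of what the metric above measures, here is a minimal sketch that computes MSE by hand and checks it against `mean_squared_error`. The four target/prediction values are made up for illustration and have nothing to do with the boston data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

# MSE is the mean of the squared residuals
manual_mse = np.mean((y_true - y_hat) ** 2)

print("manual MSE:  %.4f" % manual_mse)
print("sklearn MSE: %.4f" % mean_squared_error(y_true, y_hat))
```

Both lines print 0.8750. Because the residuals are squared, MSE is expressed in squared target units and penalizes large errors more heavily than small ones.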


## Running the Boston dataset with RandomForestRegressor

In [63]:
# Load the boston dataset
boston = datasets.load_boston()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.1, random_state=4)
"""
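Defaults as listed by the scikit-learn release this notebook was written against.
Newer releases differ: n_estimators defaults to 100, criterion='mse' was renamed
'squared_error', and min_impurity_split and max_features='auto' were removed.

RandomForestRegressor(n_estimators=10, # number of decision trees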
                      criterion='mse', 
                      max_depth=None, 
                      min_samples_split=2, 
                      min_samples_leaf=1, 
                      min_weight_fraction_leaf=0.0, 
                      max_features='auto', 
                      max_leaf_nodes=None, 
                      min_impurity_decrease=0.0, 
                      min_impurity_split=None, 
                      bootstrap=True, 
                      oob_score=False, 
                      n_jobs=1, 
                      random_state=None, 
                      verbose=0, 
                      warm_start=False)
"""
# Build the RF regressor model
clf = RandomForestRegressor(n_estimators=20, min_samples_leaf=2)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

# Measure the gap between predicted and actual values with MSE
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred))


# DT: best MSE is 13.99
# RF with no parameters set at all: MSE = 12.28
# RF n_estimators=20 MSE = 10.88, n_estimators=30 MSE = 10.39, n_estimators=40 MSE = 9.36


Mean squared error: 9.09
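
The cell above leaves `random_state=None`, so the forest is built differently on every run and its MSE will generally not repeat exactly. A hedged sketch of a more controlled sweep over `n_estimators` is shown below. It uses `load_diabetes` as a stand-in dataset, because `load_boston` is no longer shipped with recent scikit-learn releases, and it fixes `random_state=0`; both choices are assumptions of this sketch, and the MSE values it prints are not comparable to the boston numbers recorded above.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Stand-in regression dataset (assumption: used here only because load_boston was removed)
data = load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.1, random_state=4
)

# Vary only n_estimators; everything else, including the seed, stays fixed
results = {}
for n in (10, 20, 40):
    model = RandomForestRegressor(n_estimators=n, min_samples_leaf=2, random_state=0)
    model.fit(x_train, y_train)
    results[n] = mean_squared_error(y_test, model.predict(x_test))
    print("n_estimators=%d  MSE=%.2f" % (n, results[n]))
```

With the seed fixed, any change in MSE between rows comes from the hyperparameter rather than from random variation. Repeating the loop over several seeds would show how much of the remaining difference is noise.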


In [64]:
print(boston.feature_names)
print(clf.feature_importances_)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
[0.03889351 0.00055294 0.00484353 0.00049795 0.01653931 0.44638178
 0.01363648 0.06836038 0.0037264  0.01160363 0.01749434 0.00884277
 0.36862697]
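
The two printed arrays are easier to read when paired and sorted. The sketch below copies the feature names and importance values from the output above, so it runs without refitting the model.

```python
# Values copied from the printed output above
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
importances = [0.03889351, 0.00055294, 0.00484353, 0.00049795, 0.01653931,
               0.44638178, 0.01363648, 0.06836038, 0.0037264, 0.01160363,
               0.01749434, 0.00884277, 0.36862697]

# Pair each name with its importance and sort from most to least important
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)

for name, value in ranked[:5]:
    print("%-8s %.4f" % (name, value))
```

In this run, RM and LSTAT together account for about 0.815 of the total importance, with DIS a distant third. Impurity-based importances always sum to 1, so they describe relative contribution within this one fitted forest rather than an absolute effect size.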


## Running the breast_cancer dataset with LogisticRegression

In [39]:
# breast_cancer is a classification dataset: 0 is malignant, 1 is benign

# Load the breast_cancer dataset
breast_cancer = datasets.load_breast_cancer()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(breast_cancer.data, breast_cancer.target, test_size=0.1, random_state=4)

# Build the model
logreg = linear_model.LogisticRegression()

# Train the model
logreg.fit(x_train, y_train)

# Predict on the test set
y_pred = logreg.predict(x_test)

acc = accuracy_score(y_test, y_pred)

print("Accuracy: ", acc)

Accuracy:  0.8771929824561403
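
Accuracy alone does not show which class is being misclassified, and for this dataset a malignant case predicted as benign is the more costly error. A minimal sketch using `confusion_matrix` on the same split follows. The `StandardScaler` step and `max_iter=1000` are assumptions added here so the solver converges cleanly; they are not in the original cell, so the accuracy printed by this sketch will not match the number above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.1, random_state=4
)

# Scaling is an assumption of this sketch, not part of the original cell
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# Rows are true classes, columns are predicted classes, ordered [0 malignant, 1 benign]
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
print("Accuracy: ", accuracy_score(y_test, y_pred))
```

The entry `cm[0, 1]` counts malignant samples predicted as benign, and the diagonal divided by the total reproduces the accuracy.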


## Running the breast_cancer dataset with RandomForestClassifier

In [65]:
# breast_cancer is a classification dataset: 0 is malignant, 1 is benign

# Load the breast_cancer dataset
breast_cancer = datasets.load_breast_cancer()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(breast_cancer.data, breast_cancer.target, test_size=0.1, random_state=4)
"""
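Defaults as listed by the scikit-learn release this notebook was written against.
Newer releases differ: n_estimators defaults to 100, and min_impurity_split and
max_features='auto' were removed.

RandomForestClassifier(n_estimators=10, 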
                       criterion='gini', 
                       max_depth=None, 
                       min_samples_split=2, 
                       min_samples_leaf=1, 
                       min_weight_fraction_leaf=0.0, 
                       max_features='auto', 
                       max_leaf_nodes=None, 
                       min_impurity_decrease=0.0, 
                       min_impurity_split=None, 
                       bootstrap=True, 
                       oob_score=False, 
                       n_jobs=1, 
                       random_state=None, 
                       verbose=0, 
                       warm_start=False, 
                       class_weight=None)
"""
# Build the model
clf = RandomForestClassifier(n_estimators=20, min_samples_leaf=5)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

acc = accuracy_score(y_test, y_pred)

print("Accuracy: ", acc)

# DT accuracy: 0.8947368421052632
# RF accuracy: 0.9298245614035088




Accuracy:  0.9298245614035088
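
Two of the hyperparameters listed in the docstring above, `bootstrap` and `oob_score`, work together: with bootstrapping, each tree is fit on a resample that leaves some training rows out, and those left-out rows can be used to score the forest without a separate test set. A hedged sketch follows; `n_estimators=100` and `random_state=0` are choices made for this illustration, not values from the original cell.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()

# oob_score=True requires bootstrap=True (the default)
clf = RandomForestClassifier(
    n_estimators=100,
    min_samples_leaf=5,
    bootstrap=True,
    oob_score=True,
    random_state=0,
)

# Fit on all rows; each row is scored only by the trees that did not see it
clf.fit(data.data, data.target)

print("OOB accuracy: %.4f" % clf.oob_score_)
```

The out-of-bag estimate uses every sample, so it is usually less sensitive to the luck of a single 10% split than the test accuracy printed above.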


In [66]:
print(breast_cancer.feature_names)
print(clf.feature_importances_)

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
[3.62089433e-02 7.86411697e-03 4.60178024e-02 7.32495336e-02
 2.23165418e-03 0.00000000e+00 3.78382058e-02 5.40488933e-02
 1.31339675e-03 1.45970797e-04 5.52101562e-03 1.09072351e-03
 9.10572038e-03 9.44549825e-03 7.58282281e-04 3.10557448e-03
 1.66580182e-02 1.52093661e-03 1.72515701e-03 2.05860043e-03
 1.98303436e-01 1.03776513e-02 1.19870659e-01 9.99977559e-02
 6.75180378e-03 2.95077835e-02 4.15308429e-02 1.78068927e-01
 5.57283121e-03 1.