# Day044
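## Tree-based model - Writing random forest code
### Using random forests in scikit-learn
- As with decision trees, import a different model depending on the type of problem (classification or regression).
- The models are imported from sklearn.ensemble, which signals that a random forest is an ensemble model: many complex decision trees vote to produce the result. This mitigates the tendency of a single decision tree to overfit, and in practice the results are usually better than those of a single decision tree.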

> from sklearn.ensemble import RandomForestClassifier <br>
from sklearn.ensemble import RandomForestRegressor <br> 
clf = RandomForestRegressor()
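
To make the "many trees vote" idea concrete, here is a minimal sketch on the iris data. It assumes a small forest of 5 trees and a fixed `random_state` (both chosen only for illustration, not taken from the notes). In scikit-learn the vote is weighted: the forest averages each tree's predicted class probabilities and picks the class with the highest mean.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Small forest with a fixed seed so the run is repeatable (illustrative values).
forest = RandomForestClassifier(n_estimators=5, random_state=0)
forest.fit(X, y)

# Each fitted tree is available in forest.estimators_.
per_tree_proba = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Averaging the per-tree class probabilities reproduces the forest's output.
mean_proba = per_tree_proba.mean(axis=0)
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

print("probabilities match:", np.allclose(mean_proba, forest.predict_proba(X)))
print("predictions match:  ", np.array_equal(manual_pred, forest.predict(X)))
```

Both checks should print `True`, which shows that the forest's prediction is an aggregate of its individual trees.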

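### Random forest hyperparameters
- Because these are also tree models, hyperparameters such as max_depth and min_samples_split work the same way as in a decision tree.
- You can choose the number of trees to build (n_estimators). More trees make the model less prone to overfitting, but computation time grows.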
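> from sklearn.ensemble import RandomForestClassifier <br>
clf = RandomForestClassifier(<br>
        n_estimators=10, # number of decision trees <br>
        criterion="gini", # split quality measure <br>
        max_features="sqrt", # how many features to consider at each split ("auto" was removed in scikit-learn 1.3) <br>
        max_depth=10, <br>
        min_samples_split=2, <br>
        min_samples_leaf=1 <br>
)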
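
One way to see the effect of the number of trees is to compare a few values of `n_estimators` with cross-validation. The sketch below is only an illustration: the candidate values (1, 10, 100), the 5-fold split, and the fixed `random_state` are assumptions, not settings from the notes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

results = {}
for n_trees in (1, 10, 100):
    # Same tree settings each time; only the number of trees changes.
    model = RandomForestClassifier(n_estimators=n_trees, max_depth=10, random_state=0)
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    results[n_trees] = scores.mean()
    print(f"n_estimators={n_trees:>3}: mean CV accuracy = {scores.mean():.3f}")
```

On a dataset as small as iris the differences may be minor. The training cost grows roughly in proportion to the number of trees.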

## Example

In [1]:
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


In [2]:
# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Build the model
clf = RandomForestClassifier()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

In [3]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.9736842105263158
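
For reference, accuracy is just the fraction of test samples predicted correctly. With `test_size=0.25` the iris test set holds 38 samples, so the value above corresponds to 37 of 38 correct (37/38 ≈ 0.9737). The sketch below recomputes accuracy by hand and compares it with `metrics.accuracy_score`. The `random_state=0` on the classifier is an assumption added for repeatability, so the exact score may differ from the run above.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# random_state=0 is an assumption, added only to make the run repeatable.
clf = RandomForestClassifier(random_state=0)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

# Accuracy = number of correct predictions / number of test samples.
manual_acc = np.mean(y_pred == y_test)
print("test samples:  ", len(y_test))
print("manual:        ", manual_acc)
print("accuracy_score:", metrics.accuracy_score(y_test, y_pred))
```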


In [4]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [5]:
print("Feature importance: ", clf.feature_importances_)

Feature importance:  [0.11609989 0.03807772 0.40727996 0.43854243]
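
The importances are easier to read when paired with the feature names. In the output above, the two petal features together account for about 0.85 of the total. The sketch below is a minimal illustration that fits on the full iris data with an assumed `random_state=0`, so the exact numbers will differ from the run above.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()

# random_state=0 is an assumption, added only to make the run repeatable.
clf = RandomForestClassifier(random_state=0)
clf.fit(iris.data, iris.target)

# Pair each feature name with its importance and sort from most to least important.
ranked = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name:<20} {score:.3f}")

# The importances are normalized, so they sum to 1.
total = clf.feature_importances_.sum()
print("sum of importances:", round(total, 6))
```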


## Homework

1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.

In [6]:
clf = RandomForestClassifier(n_estimators=100, oob_score=True)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.9736842105263158


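> Increasing the number of trees and enabling OOB did not change the test accuracy. Note that oob_score=True only computes an additional out-of-bag estimate; it does not change how the trees are trained.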
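
With `oob_score=True`, the fitted model exposes the out-of-bag estimate as `oob_score_`: each sample is scored only by trees whose bootstrap sample did not include it. The sketch below is a minimal illustration. The fixed `random_state=0` is an assumption, added so the two forests are built from the same seeds.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# Same seed for both forests; only the oob_score flag differs.
with_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
without_oob = RandomForestClassifier(n_estimators=100, oob_score=False, random_state=0)
with_oob.fit(x_train, y_train)
without_oob.fit(x_train, y_train)

oob_acc = with_oob.oob_score_            # accuracy estimated from out-of-bag samples
test_acc = with_oob.score(x_test, y_test)  # accuracy on the held-out test set
same_predictions = np.array_equal(with_oob.predict(x_test), without_oob.predict(x_test))

print(f"OOB accuracy:  {oob_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
print("Predictions identical with and without oob_score:", same_predictions)
```

The OOB estimate gives a validation-style score without setting aside extra data, which is why it is a useful addition even though it leaves the predictions unchanged.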

2. Switch to another dataset (boston, wine) and compare the random forest with a regression model and a decision tree.

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
boston = datasets.load_boston()
train_X, val_X, train_y, val_y = train_test_split(boston.data, boston.target, random_state=1, test_size=0.25)

regr = LinearRegression()
regr.fit(train_X, train_y)
y_pred = regr.predict(val_X)
print("Mean squared error: %.2f"
      % mean_squared_error(val_y, y_pred))

Mean squared error: 21.89


In [8]:
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor(random_state=1, n_estimators=100, oob_score=True)
clf.fit(train_X, train_y)
y_pred = clf.predict(val_X)
print("Mean squared error: %.2f"
      % mean_squared_error(val_y, y_pred))

Mean squared error: 9.26


> The random forest performs better: the MSE drops from 21.89 (linear regression) to 9.26.
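
The homework also mentions the wine dataset and a decision tree, which the cells above do not cover. The following is one possible sketch of that comparison, not part of the original solution. The split parameters and `random_state` values are assumptions chosen for repeatability.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)

# A single decision tree versus a forest of 100 trees, trained on the same split.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    accuracies[name] = metrics.accuracy_score(y_test, model.predict(x_test))
    print(f"{name}: accuracy = {accuracies[name]:.3f}")
```

A single train/test split on a small dataset is noisy, so repeating the comparison with several seeds or with cross-validation would give a more reliable picture.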