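## [Assignment focus]
Make sure you understand what each hyperparameter of the random forest model means, and observe how changing the hyperparameters affects the results.

## Assignment

1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of a regression model and a decision tree.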
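
The comparison asked for in item 2 could be sketched as follows. This is a minimal illustration, not part of the original notebook: the choice of `LogisticRegression` as the "regression model", the scaling pipeline, and the fixed `random_state=0` values are all assumptions made for the example.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Same split settings as the notebook: 25% test data, random_state=4.
X, target = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, target, test_size=0.25, random_state=4)

# Three candidate models (assumed baselines, chosen for illustration only).
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=20, random_state=0),
}

# Fit each model on the same training data and score it on the same test data.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

Evaluating all models on one shared split keeps the comparison fair, although with a test set this small the differences between models may not be meaningful.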

In [1]:
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd

### Boston

#### Load the data

In [2]:
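# Load the Boston housing dataset
# (load_boston was removed in scikit-learn 1.2, so this cell needs an older version)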
boston = datasets.load_boston()
len(boston.data)

506

#### Set the parameters

In [3]:
n_estimator = [10, 20, 30]
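# criterion = ['gini', 'entropy']  # classification criteria; not valid for a regressor
criterion = ['mse']  # renamed to 'squared_error' in scikit-learn 1.0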
max_depth = [10, 20]
min_samples_split = [2, 5, 7]
min_samples_leaf = [1, 5, 10]
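
The five lists above define a grid of 3 × 1 × 2 × 3 × 3 = 54 combinations. As a sketch of an alternative to five nested loops, the same grid can be enumerated with `itertools.product`. The dictionary name `param_grid` and its keys are assumptions introduced for this example.

```python
from itertools import product

# The same candidate values as in the cell above, keyed by estimator argument name.
param_grid = {
    "n_estimators": [10, 20, 30],
    "criterion": ["mse"],
    "max_depth": [10, 20],
    "min_samples_split": [2, 5, 7],
    "min_samples_leaf": [1, 5, 10],
}

# product() varies the last list fastest, which matches the nested-loop order.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

print(len(combinations))   # 54
print(combinations[12])
```

Each dictionary can be unpacked straight into the estimator, for example `RandomForestRegressor(**combinations[12])`, which removes the need for the separate `p1` to `p5` bookkeeping lists.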

#### Model

In [4]:
p1 = []
p2 = []
p3 = []
p4 = []
p5 = []
mse = []
for a in n_estimator:
    for b in criterion:
        for c in max_depth:
            for d in min_samples_split:
                for e in min_samples_leaf:
                    print("n_estimator:", a, '\t')
                    print("criterion:", b, '\t')
                    print("max_depth:", c, '\t')
                    print("min_samples_split:", d, '\t')
                    print("min_samples_leaf:", e, '\t')
                    p1.append(a)
                    p2.append(b)
                    p3.append(c)
                    p4.append(d)
                    p5.append(e)
                    
                    # Split into training and test sets
                    # (the fixed random_state gives the same split on every iteration)
                    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=4)

                    # Build the model with the current hyperparameter combination.
                    # No random_state is set on the forest, so results vary from run to run.
                    rg = RandomForestRegressor(n_estimators = a, criterion = b, max_depth = c,
                                                min_samples_split = d, min_samples_leaf = e)

                    # Train the model
                    rg.fit(x_train, y_train)

                    # Predict on the test set
                    y_pred = rg.predict(x_test)
                    
                    print("Mean squared error: %.2f"
                          % mean_squared_error(y_test, y_pred), "\n")
                    
                    mse.append(mean_squared_error(y_test, y_pred))
                    

n_estimator: 10 	
criterion: mse 	
max_depth: 10 	
min_samples_split: 2 	
min_samples_leaf: 1 	
Mean squared error: 16.07 

n_estimator: 10 	
criterion: mse 	
max_depth: 10 	
min_samples_split: 2 	
min_samples_leaf: 5 	
Mean squared error: 16.86 

n_estimator: 10 	
criterion: mse 	
max_depth: 10 	
min_samples_split: 2 	
min_samples_leaf: 10 	
Mean squared error: 19.78 

n_estimator: 10 	
criterion: mse 	
max_depth: 10 	
min_samples_split: 5 	
min_samples_leaf: 1 	
Mean squared error: 16.47 

n_estimator: 10 	
criterion: mse 	
max_depth: 10 	
min_samples_split: 5 	
min_samples_leaf: 5 	
Mean squared error: 15.47 

n_estimator: 10 	
criterion: mse 	
max_depth: 10 	
min_samples_split: 5 	
min_samples_leaf: 10 	
Mean squared error: 20.55 

n_estimator: 10 	
criterion: mse 	
max_depth: 10 	
min_samples_split: 7 	
min_samples_leaf: 1 	
Mean squared error: 16.22 

n_estimator: 10 	
criterion: mse 	
max_depth: 10 	
min_samples_split: 7 	
min_samples_leaf: 5 	
Mean squared error: 16.71 

n_esti

In [5]:
result_boston = pd.DataFrame({'n_estimator':p1,
'criterion':p2,
'max_depth':p3,
'min_samples_split':p4,
'min_samples_leaf':p5,
'MSE':mse})
result_boston['MSE'].min()

12.6019022500111

In [6]:
x = result_boston.loc[result_boston['MSE'] == result_boston['MSE'].min(), ['n_estimator', 'criterion', 'max_depth', 'min_samples_split',
'min_samples_leaf']].index
result_boston.loc[result_boston['MSE'] == result_boston['MSE'].min(), ['n_estimator', 'criterion', 'max_depth', 'min_samples_split',
'min_samples_leaf']]

| | n_estimator | criterion | max_depth | min_samples_split | min_samples_leaf |
|---|---|---|---|---|---|
| 12 | 10 | mse | 20 | 5 | 1 |


In [7]:
y = x[0]

In [8]:
print("Parameter combination with the lowest MSE and its result:\n")
result_boston.iloc[y]

Parameter combination with the lowest MSE and its result:



n_estimator               10
criterion                mse
max_depth                 20
min_samples_split          5
min_samples_leaf           1
MSE                  12.6019
Name: 12, dtype: object
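
The boolean-mask lookup used above can be written more directly with `idxmin`. The sketch below uses a miniature table whose values are copied from the first eight log entries printed earlier. The name `results` and the reduced set of columns are assumptions for illustration.

```python
import pandas as pd

# Miniature results table built from the first eight printed runs
# (n_estimator=10, criterion='mse', max_depth=10 for all of them).
results = pd.DataFrame({
    "min_samples_split": [2, 2, 2, 5, 5, 5, 7, 7],
    "min_samples_leaf": [1, 5, 10, 1, 5, 10, 1, 5],
    "MSE": [16.07, 16.86, 19.78, 16.47, 15.47, 20.55, 16.22, 16.71],
})

# idxmin returns the index label of the first row with the smallest MSE.
best_idx = results["MSE"].idxmin()
best_row = results.loc[best_idx]

print(best_idx)
print(best_row)

# Sorting shows the runners-up as well, which helps judge how close they are.
print(results.sort_values("MSE").head(3))
```

For an accuracy column the counterpart is `idxmax`. Both return only the first match, so ties still need a separate look.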

In [9]:
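# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=4)

# Rebuild the model with the best combination found above, reading values by column name
# (positional access such as result_boston.iloc[y][0] is deprecated in recent pandas).
# The forest has no fixed random_state, so this MSE need not match the grid-search value.
best = result_boston.iloc[y]
rg = RandomForestRegressor(n_estimators = int(best['n_estimator']), criterion = best['criterion'],
                           max_depth = int(best['max_depth']), min_samples_split = int(best['min_samples_split']),
                           min_samples_leaf = int(best['min_samples_leaf']))

# Train the model
rg.fit(x_train, y_train)

# Predict on the test set
y_pred = rg.predict(x_test)

print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred), "\n")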



Mean squared error: 21.30 
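
The refit prints an MSE of 21.30 even though the same parameters gave 12.60 during the search. A likely explanation is that the forest is built without a fixed `random_state`, so every fit draws different bootstrap samples and feature subsets. The sketch below illustrates the effect on synthetic data. `make_regression` is a stand-in (an assumption, not the Boston dataset), and `fit_and_score` is a hypothetical helper.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the split settings mirror the notebook.
X, target = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, target, test_size=0.25, random_state=4)

def fit_and_score(seed):
    """Fit a small forest with the given seed and return its test MSE."""
    model = RandomForestRegressor(n_estimators=10, max_depth=20,
                                  min_samples_split=5, random_state=seed)
    model.fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))

# The same seed reproduces the same score; different seeds generally do not.
same_seed = [fit_and_score(0), fit_and_score(0)]
other_seeds = [fit_and_score(seed) for seed in range(1, 6)]

print("same seed:  ", same_seed)
print("other seeds:", other_seeds)
```

With only 10 trees the spread across seeds can be comparable to the differences between hyperparameter settings, so a single unseeded run is weak evidence that one combination is better than another.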



In [10]:
print(boston.feature_names)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


In [11]:
print("Feature importance: ", rg.feature_importances_)

Feature importance:  [0.06520177 0.00109234 0.00717901 0.00142548 0.01227719 0.53035128
 0.01216126 0.0475746  0.00259088 0.01917145 0.01331447 0.01256833
 0.27509194]
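
The importance array is easier to read when each value is paired with its feature name. The sketch below does this with the two printed outputs above, copied in as literals so that it runs without the dataset. The variable names are assumptions.

```python
# Feature names and importances copied from the two outputs above.
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
importances = [0.06520177, 0.00109234, 0.00717901, 0.00142548, 0.01227719,
               0.53035128, 0.01216126, 0.0475746, 0.00259088, 0.01917145,
               0.01331447, 0.01256833, 0.27509194]

# Pair names with values and sort from most to least important.
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)

for name, value in ranked[:5]:
    print(f"{name:8s}{value:.3f}")
```

In this run RM and LSTAT together account for roughly 80% of the total importance. Like the MSE, these values shift between fits when the forest is not seeded.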


### Wine

#### Load the data

In [12]:
# Load the wine dataset
wine = datasets.load_wine()
len(wine.data)

178

#### Set the parameters

In [13]:
n_estimator = [10, 20, 30]
criterion = ['gini', 'entropy']
max_depth = [10, 20]
min_samples_split = [2, 5, 7]
min_samples_leaf = [1, 4, 7]

#### Model

In [14]:
p1 = []
p2 = []
p3 = []
p4 = []
p5 = []
acc = []
for a in n_estimator:
    for b in criterion:
        for c in max_depth:
            for d in min_samples_split:
                for e in min_samples_leaf:
                    print("n_estimator:", a, '\t')
                    print("criterion:", b, '\t')
                    print("max_depth:", c, '\t')
                    print("min_samples_split:", d, '\t')
                    print("min_samples_leaf:", e, '\t')
                    p1.append(a)
                    p2.append(b)
                    p3.append(c)
                    p4.append(d)
                    p5.append(e)
                    
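                    # Split into training and test sets
                    # (the fixed random_state gives the same split on every iteration)
                    x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)

                    # Build the model with the current hyperparameter combination.
                    # No random_state is set on the forest, so results vary from run to run.
                    clf = RandomForestClassifier(n_estimators = a, criterion = b, max_depth = c,
                                                min_samples_split = d, min_samples_leaf = e)

                    # Train the model
                    clf.fit(x_train, y_train)

                    # Predict on the test set
                    y_pred = clf.predict(x_test)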
                    
                    accuracy = metrics.accuracy_score(y_test, y_pred)
                    print("Accuracy: ", accuracy)
                    
                    acc.append(accuracy)
                    

n_estimator: 10 	
criterion: gini 	
max_depth: 10 	
min_samples_split: 2 	
min_samples_leaf: 1 	
Accuracy:  0.9555555555555556
n_estimator: 10 	
criterion: gini 	
max_depth: 10 	
min_samples_split: 2 	
min_samples_leaf: 4 	
Accuracy:  1.0
n_estimator: 10 	
criterion: gini 	
max_depth: 10 	
min_samples_split: 2 	
min_samples_leaf: 7 	
Accuracy:  0.9555555555555556
n_estimator: 10 	
criterion: gini 	
max_depth: 10 	
min_samples_split: 5 	
min_samples_leaf: 1 	
Accuracy:  0.9777777777777777
n_estimator: 10 	
criterion: gini 	
max_depth: 10 	
min_samples_split: 5 	
min_samples_leaf: 4 	
Accuracy:  0.9333333333333333
n_estimator: 10 	
criterion: gini 	
max_depth: 10 	
min_samples_split: 5 	
min_samples_leaf: 7 	
Accuracy:  0.9777777777777777
n_estimator: 10 	
criterion: gini 	
max_depth: 10 	
min_samples_split: 7 	
min_samples_leaf: 1 	
Accuracy:  0.9777777777777777
n_estimator: 10 	
criterion: gini 	
max_depth: 10 	
min_samples_split: 7 	
min_samples_leaf: 4 	
Accuracy:  0.9555555555555556

Accuracy:  1.0
n_estimator: 20 	
criterion: entropy 	
max_depth: 20 	
min_samples_split: 5 	
min_samples_leaf: 7 	
Accuracy:  0.9777777777777777
n_estimator: 20 	
criterion: entropy 	
max_depth: 20 	
min_samples_split: 7 	
min_samples_leaf: 1 	
Accuracy:  1.0
n_estimator: 20 	
criterion: entropy 	
max_depth: 20 	
min_samples_split: 7 	
min_samples_leaf: 4 	
Accuracy:  0.9777777777777777
n_estimator: 20 	
criterion: entropy 	
max_depth: 20 	
min_samples_split: 7 	
min_samples_leaf: 7 	
Accuracy:  0.9777777777777777
n_estimator: 30 	
criterion: gini 	
max_depth: 10 	
min_samples_split: 2 	
min_samples_leaf: 1 	
Accuracy:  0.9777777777777777
n_estimator: 30 	
criterion: gini 	
max_depth: 10 	
min_samples_split: 2 	
min_samples_leaf: 4 	
Accuracy:  1.0
n_estimator: 30 	
criterion: gini 	
max_depth: 10 	
min_samples_split: 2 	
min_samples_leaf: 7 	
Accuracy:  0.9777777777777777
n_estimator: 30 	
criterion: gini 	
max_depth: 10 	
min_samples_split: 5 	
min_samples_leaf: 1 	
Accuracy:  0.9777

In [15]:
result_wine = pd.DataFrame({'n_estimator':p1,
'criterion':p2,
'max_depth':p3,
'min_samples_split':p4,
'min_samples_leaf':p5,
'ACCURACY':acc})
result_wine['ACCURACY'].max()

1.0

In [16]:
x = result_wine.loc[result_wine['ACCURACY'] == result_wine['ACCURACY'].max(), ['n_estimator', 'criterion', 'max_depth', 'min_samples_split',
'min_samples_leaf']].index
result_wine.loc[result_wine['ACCURACY'] == result_wine['ACCURACY'].max(), ['n_estimator', 'criterion', 'max_depth', 'min_samples_split',
'min_samples_leaf']]

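| | n_estimator | criterion | max_depth | min_samples_split | min_samples_leaf |
|---|---|---|---|---|---|
| 1 | 10 | gini | 10 | 2 | 4 |
| 11 | 10 | gini | 20 | 2 | 7 |
| 20 | 10 | entropy | 10 | 2 | 7 |
| 23 | 10 | entropy | 10 | 5 | 7 |
| 25 | 10 | entropy | 10 | 7 | 4 |
| 27 | 10 | entropy | 20 | 2 | 1 |
| 34 | 10 | entropy | 20 | 7 | 4 |
| 36 | 20 | gini | 10 | 2 | 1 |
| 38 | 20 | gini | 10 | 2 | 7 |
| 39 | 20 | gini | 10 | 5 | 1 |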

In [17]:
y = x[0]

In [18]:
print("Parameter combination with the highest accuracy and its result:\n")
result_wine.iloc[y]

Parameter combination with the highest accuracy and its result:



n_estimator            10
criterion            gini
max_depth              10
min_samples_split       2
min_samples_leaf        4
ACCURACY                1
Name: 1, dtype: object

In [19]:
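# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)

# Rebuild the model with the first of the tied best combinations, reading values by column name
# (positional access such as result_wine.iloc[y][0] is deprecated in recent pandas).
best = result_wine.iloc[y]
clf = RandomForestClassifier(n_estimators = int(best['n_estimator']), criterion = best['criterion'],
                             max_depth = int(best['max_depth']), min_samples_split = int(best['min_samples_split']),
                             min_samples_leaf = int(best['min_samples_leaf']))

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)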

accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)

Accuracy:  1.0


In [20]:
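print(wine.feature_names)

['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']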


In [21]:
print("Feature importance: ", clf.feature_importances_)

Feature importance:  [0.1509034  0.03082306 0.         0.05622994 0.00037571 0.08303012
 0.09328991 0.0234989  0.01396183 0.20481957 0.10198217 0.05838765
 0.18269776]
