## [作業重點]
確保你了解隨機森林模型中每個超參數的意義，並觀察調整超參數對結果的影響

## 作業

1. 試著調整 RandomForestClassifier(...) 中的參數，並觀察是否會改變結果？
2. 改用其他資料集 (boston, wine)，並與回歸模型與決策樹的結果進行比較

In [2]:
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 讀取鳶尾花資料集
iris = datasets.load_iris()

# 切分訓練集/測試集
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# 建立模型 (使用 20 顆樹，每棵樹的最大深度為 4)
clf = RandomForestClassifier(n_estimators=20, max_depth=4)

# 訓練模型
clf.fit(x_train, y_train)

# 預測測試集
y_pred_test = clf.predict(x_test)
y_pred_train = clf.predict(x_train)
acc_train = metrics.accuracy_score(y_train, y_pred_train)
acc_test = metrics.accuracy_score(y_test, y_pred_test)
print("Accuracy_train: ", acc_train)
print("Accuracy_test: ", acc_test)
print(iris.feature_names)
print("Feature importance: ", clf.feature_importances_)

Accuracy_train:  1.0
Accuracy_test:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.10315917 0.06393169 0.39033209 0.44257706]


In [28]:
############## 作業 1 #################################
### 基本上 default 參數, 表現結果不錯, 但 train acc 比 test 大一些
### 嘗試增加樹的顆樹, Max_Feature, min_samples_split, min_samples_leaf 皆無法在提升 test acc 
### Max_Feature 如果調太低
#########################################################
# 讀取鳶尾花資料集
iris = datasets.load_iris()

# 切分訓練集/測試集
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# 建立模型 (使用 20 顆樹，每棵樹的最大深度為 4)
#############參數 ###########################
n_estimators =300
max_depth =5
min_samples_split =3
min_samples_leaf =2
min_weight_fraction_leaf =None 
min_impurity_decrease = 0
min_impurity_split =0
random_state =0
class_weight ='balanced_subsample' #balanced_subsample
max_leaf_nodes =None
max_features =0.9


############################################
clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf, class_weight=class_weight, max_leaf_nodes=max_leaf_nodes, max_features=max_features,min_impurity_decrease=min_impurity_decrease,min_impurity_split=min_impurity_split,random_state=random_state)
##################################################################
# 訓練模型
clf.fit(x_train, y_train)

# 預測測試集
y_pred_test = clf.predict(x_test)
y_pred_train = clf.predict(x_train)
acc_train = metrics.accuracy_score(y_train, y_pred_train)
acc_test = metrics.accuracy_score(y_test, y_pred_test)
print("Accuracy_train: ", acc_train)
print("Accuracy_test: ", acc_test)
print(iris.feature_names)
print("Feature importance: ", clf.feature_importances_)



Accuracy_train:  0.9821428571428571
Accuracy_test:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.008151   0.00689204 0.46977077 0.51518619]




In [37]:
############## 作業 2 ###############################
### 在 Wine 這比資料中RF, Tree 的 結果明顯比 Logistic Regression 好很多, logistic Regression 仍為 Under Fitting
### 其中 RF 更是表現傑出 test acc達 0.97以上
### Logistic Regression 仍需做 Feature selection 或其他參數調整才能再進一步提升準確率

###########################################################
##########   wine     ###################
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
wine = datasets.load_wine()
X = wine.data
y = wine.target
print(X.shape, y.shape)
### split data
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=0)

############## Linear Model ######################
#########################################
### find significance features

model1 = LogisticRegression(solver = 'saga', multi_class='multinomial')

model1.fit(X_train, y_train)
# 預測測試集
y_pred_test = model1.predict(X_test)
y_pred_train = model1.predict(X_train)
acc_train = metrics.accuracy_score(y_train, y_pred_train)
print("Acuuracy_train_LR: ", acc_train)

acc_test = metrics.accuracy_score(y_test, y_pred_test)
print("Acuuracy_test_LR: ", acc_test)

############  Tree #####################################################################
################################### 參數設定 ###################################
##### 改變 Criterion #####
criterion = 'gini'

### 關於切割設定, 可加快速度/減少overfitting ###
splitter = 'best'
max_features = 0.75 
min_samples_split = 5
min_samples_leaf = 3
min_impurity_decrease = 0.001
min_impurity_split = 1e-7
min_weight_fraction_leaf = 0
### 關於模型架構 ###
max_depth = 6
max_leaf_nodes = None

### Others ###
class_weight = 'balanced'
presort = False
random_state = 0

tree1 = DecisionTreeClassifier(criterion = criterion, min_samples_leaf=min_samples_leaf, max_depth = max_depth, max_features = max_features, min_samples_split = min_samples_split, min_impurity_decrease = min_impurity_decrease, min_impurity_split = min_impurity_split, class_weight =class_weight, presort = presort, random_state = random_state)
tree1.fit(X_train, y_train)

# 預測測試集
y_pred_test = tree1.predict(X_test)
y_pred_train = tree1.predict(X_train)
acc_train = metrics.accuracy_score(y_train, y_pred_train)
print("Acuuracy_train_Tree: ", acc_train)

acc_test = metrics.accuracy_score(y_test, y_pred_test)
print("Acuuracy_test_Tree: ", acc_test)

#print(iris.feature_names)
#print("Feature importance: ", tree1.feature_importances_)

################################################################################
######################Rndom Forest#####################
#############參數 ###########################
n_estimators =300
max_depth =5
min_samples_split =3
min_samples_leaf =2
min_weight_fraction_leaf =None 
#min_impurity_decrease = 0
#min_impurity_split =0
random_state =0
class_weight ='balanced_subsample' #balanced_subsample
max_leaf_nodes =None
max_features =0.9
####################################################################

RF1 = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, min_samples_leaf=min_samples_leaf, class_weight=class_weight, max_leaf_nodes=max_leaf_nodes, max_features=max_features,min_impurity_decrease=min_impurity_decrease,random_state=random_state)
RF1.fit(X_train, y_train)

predict_train = RF1.predict(X_train)
predict_test = RF1.predict(X_test)
acc_train = metrics.accuracy_score(predict_train, y_train)
print("Acuuracy_train_RF: ", acc_train)
acc_test = metrics.accuracy_score(predict_test, y_test)
print("Acuuracy_test_RF: ", acc_test)

(178, 13) (178,)
Acuuracy_train_LR:  0.6901408450704225
Acuuracy_test_LR:  0.75
Acuuracy_train_Tree:  0.9788732394366197
Acuuracy_test_Tree:  0.9166666666666666




Acuuracy_train_RF:  1.0
Acuuracy_test_RF:  0.9722222222222222


In [120]:
#################### 作業 2 ##########################################
### 在 Boston 這比資料中 3 方法基本上都有嚴重 overting, 但從 test R2 來看 RF 的表現優於其他2方法

##########   boston     ###################
#########################################
#####################################
boston = datasets.load_boston()
X = boston.data
y = boston.target
print(X.shape, y.shape)

### split data
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2 , random_state = 0)
################# Linear Model ##########################################
model2 = LinearRegression()
model2.fit(X_train, y_train)
y_pred_test = model2.predict(X_test)
y_pred_train = model2.predict(X_train)
print('training R2_LM:', model2.score(X_train, y_train))
print('testing R2_LM:', model2.score(X_test, y_test))

############### Tree ########################
##### 改變 Criterion #####
criterion = 'mse'

### 關於切割設定, 可加快速度/減少overfitting ###
splitter = 'best'
max_features = None 
min_samples_split = 25
min_samples_leaf = 10
min_impurity_decrease = 0
min_impurity_split = 0
min_weight_fraction_leaf = 0
### 關於模型架構 ###
max_depth = 10
max_leaf_nodes = 15

### Others ###
presort = False
random_state = 0
tree2 = DecisionTreeRegressor(criterion = criterion, min_samples_leaf=min_samples_leaf, max_depth = max_depth, max_features = max_features, min_samples_split = min_samples_split, min_impurity_decrease = min_impurity_decrease, min_impurity_split = min_impurity_split, presort = presort, random_state = random_state, max_leaf_nodes = max_leaf_nodes)
tree2.fit(X_train, y_train)

y_pred_test = tree2.predict(X_test)
y_pred_train = tree2.predict(X_train)
print('training R2_Tree:', tree2.score(X_train, y_train))
print('testing R2_Tree:', tree2.score(X_test, y_test))


###### Rndom Forest ##############
from sklearn.ensemble import RandomForestRegressor
#############參數 ###########################
n_estimators =100
max_depth =None
min_samples_split = 2
min_samples_leaf =1
min_weight_fraction_leaf =None 
#min_impurity_decrease = 0
#min_impurity_split =0
random_state =10
max_leaf_nodes =None
max_features =0.8
############################################################
RF2 = RandomForestRegressor(random_state=random_state)
#RF2 = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, min_samples_leaf=min_samples_leaf, max_leaf_nodes=max_leaf_nodes, max_features=max_features,min_impurity_decrease=min_impurity_decrease,random_state=random_state)

RF2.fit(X_train, y_train)

predict_train = RF2.predict(X_train)
predict_test = RF2.predict(X_test)

r2_trin = metrics.r2_score(predict_train, y_train)
r2_test = metrics.r2_score(predict_test, y_test)
print('training R2_RF:', r2_trin)
print('testing R2_RF:', r2_test)

(506, 13) (506,)
training R2_LM: 0.7730135569264233
testing R2_LM: 0.5892223849182525
training R2_Tree: 0.8552006275447659
testing R2_Tree: 0.6013070252047278
training R2_RF: 0.9714511910679134
testing R2_RF: 0.7584588707768365


