# Ensemble Learning
---
Machine learning models that combine multiple supervised learners into an ensemble to achieve better performance than a single model.

(Reference: 혼공, pp. 263-283)
- 1. RandomForestClassifier
- 2. ExtraTreesClassifier
- 3. Gradient Boosting
- 4. Histogram-based Gradient Boosting



### Read CSV File

In [17]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://bit.ly/wine_csv_data')
df.head()

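|   | alcohol | sugar | pH | class |
|---|---------|-------|------|-------|
| 0 | 9.4 | 1.9 | 3.51 | 0.0 |
| 1 | 9.8 | 2.6 | 3.2 | 0.0 |
| 2 | 9.8 | 2.3 | 3.26 | 0.0 |
| 3 | 9.8 | 1.9 | 3.16 | 0.0 |
| 4 | 9.4 | 1.9 | 3.51 | 0.0 |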


In [18]:
# Features: alcohol, sugar, pH; target: class
X = df.drop('class', axis=1)
y = df['class']

print(X.shape, y.shape)

(6497, 3) (6497,)


In [19]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

## 1. RandomForestClassifier

An ensemble algorithm that trains each tree on a **bootstrap sample**, i.e. a sample drawn with replacement from the training data.
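To make the term concrete, here is a minimal sketch of drawing one bootstrap sample with the standard library. The helper name `bootstrap_sample` and the ten stand-in row indices are assumptions for illustration only; `RandomForestClassifier` does this internally.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

rng = random.Random(42)          # fixed seed so the sketch is repeatable
rows = list(range(10))           # stand-in for 10 training-row indices
sample = bootstrap_sample(rows, rng)

print(sample)                           # same length as rows; some indices repeat
print(sorted(set(rows) - set(sample)))  # rows never drawn for this tree
```

Because sampling is with replacement, each tree sees a slightly different version of the training set, which is what makes the trees in the forest differ from one another.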

- (1-1) Performance evaluation with cross-validation

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

rf = RandomForestClassifier(n_jobs=-1, random_state=42)

# Validate the model with 5-fold cross-validation (the default)
score = cross_validate(rf, X_train, y_train, n_jobs=-1, return_train_score=True)

# Train and validation scores for each fold
print(score['train_score'], score['test_score'])

# Mean train and validation scores across folds
print(np.mean(score['train_score']), np.mean(score['test_score']))

[0.9971133  0.99663219 0.9978355  0.9973545  0.9978355 ] [0.88461538 0.88942308 0.90279115 0.88931665 0.88642926]
0.9973541965122431 0.8905151032797809


In [21]:
# train
rf.fit(X_train, y_train)

# feature importance
print(rf.feature_importances_)

[0.23167441 0.50039841 0.26792718]


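The feature importances are more evenly distributed than those obtained from a single DecisionTree.

-> Because each split considers only a random subset of the features, no single feature dominates, which helps generalization.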
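The comparison can be reproduced along the following lines. This is only a sketch: it uses synthetic data from `make_classification` rather than the wine dataset, and the names `X_demo`, `y_demo`, `dt_demo` and `rf_demo` are placeholders, so the printed numbers will not match the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the wine dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=3, n_informative=3, n_redundant=0,
    random_state=42,
)

dt_demo = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
rf_demo = RandomForestClassifier(n_jobs=-1, random_state=42).fit(X_demo, y_demo)

# Both arrays sum to 1, so they can be compared feature by feature.
print(dt_demo.feature_importances_)
print(rf_demo.feature_importances_)
```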

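- (1-2) Performance evaluation with the data that was left out of each tree's bootstrap sample (**OOB samples**, out-of-bag)

  Because the OOB score can stand in for cross-validation, no separate validation split is needed and more data remains available for training.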
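How much data ends up out-of-bag for a given tree? A rough simulation, assuming a training set of 5197 rows (80% of the 6497 rows loaded above), suggests about 37%, in line with the probability (1 - 1/n)^n that a given row is never drawn. The helper `oob_fraction` is a hypothetical name used only for this sketch.

```python
import math
import random

def oob_fraction(n, rng):
    """Fraction of n rows that one bootstrap sample of size n never draws."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n

n = 5197                                  # size of the 80% training split
rng = random.Random(42)
empirical = sum(oob_fraction(n, rng) for _ in range(20)) / 20
theoretical = (1 - 1 / n) ** n            # P(a given row is never drawn)

print(round(empirical, 3), round(theoretical, 3), round(math.exp(-1), 3))
```

Each tree can therefore be scored on the roughly one third of rows it never saw, and `oob_score_` aggregates those predictions.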

In [22]:
# Built-in evaluation using the OOB samples
rf = RandomForestClassifier(n_jobs=-1, oob_score=True, random_state=42)
rf.fit(X_train, y_train)
rf.oob_score_

0.8934000384837406

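## 2. ExtraTreesClassifier

An ensemble algorithm whose trees are trained on the **entire training set** rather than on bootstrap samples, and whose nodes are split **at random**.

> Advantage: because split thresholds are drawn at random instead of being searched for, training is fast.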
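The speed difference comes from skipping the threshold search. The sketch below contrasts an exhaustive best-split search with a single random draw on one feature; the data and all function names are made up for illustration. As far as the scikit-learn documentation describes it, `ExtraTreesClassifier` draws one random threshold per candidate feature and keeps the best of those, which this single-feature sketch simplifies.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on values <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(values, labels):
    """Exhaustive search: try the midpoint between every pair of sorted values."""
    uniq = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))

def random_split(values, rng):
    """Extra-trees style: one uniform draw between min and max, no search."""
    return rng.uniform(min(values), max(values))

rng = random.Random(42)
sugar = [1.2, 1.9, 2.3, 2.6, 5.0, 7.4, 9.1, 12.0]   # made-up feature values
label = [0, 0, 0, 0, 1, 1, 1, 1]                     # made-up class labels

t_best = best_split(sugar, label)
t_rand = random_split(sugar, rng)
print(t_best, split_impurity(sugar, label, t_best))
print(t_rand, split_impurity(sugar, label, t_rand))
```

A random threshold is usually worse than the searched one for a single tree, which is why extra-trees rely on averaging many trees to recover accuracy.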


In [23]:
from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier(n_jobs=-1, random_state=42)
score = cross_validate(et, X_train, y_train, n_jobs=-1, return_train_score=True)

print(np.mean(score['train_score']), np.mean(score['test_score']))

0.9974503966084433 0.8887848893166506


In [24]:
et.fit(X_train, y_train)

# feature importance
print(et.feature_importances_)

[0.20183568 0.52242907 0.27573525]


## 3. Gradient Boosting

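An ensemble algorithm that applies gradient descent to the **loss function**: trees are added one at a time, and each new tree moves the ensemble's predictions in the direction that reduces the loss.

It uses shallow trees as weak learners (scikit-learn's `GradientBoostingClassifier` defaults to `max_depth=3`).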
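The idea can be sketched from scratch in a few lines. To keep it short, this sketch makes several simplifying assumptions that differ from `GradientBoostingClassifier`: it does regression with squared loss (where the negative gradient is just the residual), uses depth-1 trees (stumps) on a single made-up feature, and all names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Depth-1 regression tree: best threshold and the mean residual per side."""
    best = None
    uniq = sorted(set(xs))
    for a, b in zip(uniq, uniq[1:]):
        t = (a + b) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [1, 2, 3, 4, 5, 6, 7, 8]                      # made-up feature values
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.1, 4.9]      # made-up targets

learning_rate = 0.2
n_estimators = 50
preds = [sum(ys) / len(ys)] * len(ys)   # start from a constant prediction
history = [mse(ys, preds)]

for _ in range(n_estimators):
    # For squared loss, the negative gradient is simply the residual y - pred.
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)
    # Take a small step in the direction that reduces the loss.
    preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    history.append(mse(ys, preds))

print(history[0], history[-1])
```

The two knobs in the loop, `n_estimators` and `learning_rate`, play the same role as the parameters of the same name in scikit-learn: more trees and larger steps fit the training data more closely.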

In [25]:
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(random_state=42)

score = cross_validate(gb, X_train, y_train, n_jobs=-1, return_train_score=True)
print(np.mean(score['train_score']), np.mean(score['test_score']))

0.8881086892152563 0.8720430147331015


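Let's increase the number of trees (`n_estimators`, default 100) and the learning rate (`learning_rate`, default 0.1) to try to improve performance further.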

In [26]:
gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.2, random_state=42)

score = cross_validate(gb, X_train, y_train, n_jobs=-1, return_train_score=True)
print(np.mean(score['train_score']), np.mean(score['test_score']))

0.9464595437171814 0.8780082549788999


In [27]:
gb.fit(X_train, y_train)

# feature importance
gb.feature_importances_

array([0.15872278, 0.68010884, 0.16116839])

## 4. Histogram-based Gradient Boosting

An ensemble algorithm that first bins each feature into 256 intervals, so the best split at each node can be found quickly.
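The binning step can be illustrated with the standard library alone. This is a simplified sketch on a made-up feature column: it spends all 256 bins on observed values, whereas scikit-learn's `HistGradientBoostingClassifier` uses at most 255 bins for non-missing values and reserves one extra bin for missing values.

```python
import bisect
import random
import statistics

rng = random.Random(42)
values = [rng.gauss(10.5, 1.2) for _ in range(5000)]   # made-up feature column

# 255 quantile cut points define 256 bins with roughly equal counts.
edges = statistics.quantiles(values, n=256)

# Replace each float by the integer id (0..255) of the bin it falls into.
binned = [bisect.bisect_right(edges, v) for v in values]

print(len(edges), min(binned), max(binned))
print(len(set(values)), "distinct values ->", len(set(binned)), "bins")
```

After binning, a node only has to evaluate the 255 bin boundaries as split candidates instead of every distinct value in the column, which is where the speed-up comes from.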

(4-1) scikit-learn

In [28]:
# Only needed on scikit-learn < 1.0, where this estimator was still experimental:
# from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

hgb = HistGradientBoostingClassifier(random_state=42)

score = cross_validate(hgb, X_train, y_train, n_jobs=-1, return_train_score=True)
print(np.mean(score['train_score']), np.mean(score['test_score']))

0.9321723946453317 0.8801241948619236



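Feature importance - train set (`HistGradientBoostingClassifier` has no `feature_importances_` attribute, so `permutation_importance` is used instead)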

In [29]:
from sklearn.inspection import permutation_importance

hgb.fit(X_train, y_train)

result = permutation_importance(hgb, X_train, y_train, n_repeats=10,
                                random_state=42, n_jobs=-1)
print(result.importances_mean)

[0.08876275 0.23438522 0.08027708]


Feature importance - test set

In [31]:
result = permutation_importance(hgb, X_test, y_test, n_repeats=10,
                                n_jobs=-1, random_state=42)
print(result.importances_mean)

[0.05969231 0.20238462 0.049     ]
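What `permutation_importance` measures can be sketched by hand: shuffle one column at a time and record how much the score drops. The code below is a simplified illustration with a hypothetical toy model and made-up data, not scikit-learn's implementation.

```python
import random

def accuracy(model, X, y):
    """Share of rows for which the model predicts the correct label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance_sketch(model, X, y, n_repeats, rng):
    """Mean drop in accuracy when one column at a time is shuffled."""
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical fitted model: it looks only at column 1 and ignores column 0.
def toy_model(row):
    return 1 if row[1] > 0.5 else 0

rng = random.Random(42)
X_demo = [[rng.random(), rng.random()] for _ in range(500)]
y_demo = [1 if row[1] > 0.5 else 0 for row in X_demo]

imp = permutation_importance_sketch(toy_model, X_demo, y_demo, n_repeats=10, rng=rng)
print(imp)
```

Shuffling a column the model ignores leaves the score unchanged, so its importance is zero, while shuffling a column the model depends on lowers the score. The same reading applies to the arrays above, where the second feature (sugar) causes the largest drop on both the train and test sets.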


Evaluation on the test set

In [32]:
hgb.score(X_test, y_test)

0.8723076923076923

(4-2) XGBoost

In [34]:
from xgboost import XGBClassifier

xgb = XGBClassifier(tree_method='hist', random_state=42)
score = cross_validate(xgb, X_train, y_train, n_jobs=-1, return_train_score=True)
print(np.mean(score['train_score']), np.mean(score['test_score']))

0.9558403027491312 0.8782000074035686


(4-3) LightGBM

In [36]:
from lightgbm import LGBMClassifier
lgb = LGBMClassifier(random_state=42)
score = cross_validate(lgb, X_train, y_train, return_train_score=True, n_jobs=-1)
print(np.mean(score['train_score']), np.mean(score['test_score']))

0.935828414851749 0.8801251203079884
