From 이수안 컴퓨터: http://suanlab.com/youtube/ml.html

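# Ensemble

* A method that combines the predictions of several models to improve generalization and robustness
* There are two main families of ensemble methods
  * Averaging methods
    * Build several estimators independently, then average their predictions
    * The combined estimator is usually better than any single estimator because its variance is reduced
  * Boosting methods
    * Build the models sequentially
    * Each new model tries to reduce the bias of the combined model
    * The goal is to combine several weak models into one strong ensemble model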
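
The variance argument for averaging methods can be checked numerically. The sketch below is an illustration only: it assumes the individual estimates are independent and equally noisy, which real models trained on overlapping data are not, so the reduction seen in practice is smaller. The names `n_estimators` and `n_trials` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
n_estimators = 10      # hypothetical ensemble size
n_trials = 10_000      # number of repeated experiments

# Each row holds n_estimators independent noisy estimates of true_value
estimates = true_value + rng.normal(scale=1.0, size=(n_trials, n_estimators))

single_var = estimates[:, 0].var()            # variance of one estimator
averaged_var = estimates.mean(axis=1).var()   # variance of the averaged estimator

print('single estimator variance:  ', single_var)
print('averaged estimator variance:', averaged_var)
```

With independent estimates the averaged variance should come out close to the single variance divided by `n_estimators`.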

## Bagging meta-estimator

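* Bagging is short for bootstrap aggregating
* Trains several models, each on a random subset of the original training set
* Combines the individual results into a final prediction
* Reduces variance and helps curb overfitting
* Works best with strong, complex base models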
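
To make the two steps (bootstrap, then aggregate) concrete, here is a hand-written sketch rather than the scikit-learn implementation. It assumes full-size bootstrap samples, unpruned decision trees as base models, and a plain majority vote; `BaggingClassifier` offers more options (for example `max_samples` and `max_features`) and its internals may differ.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

rng = np.random.default_rng(0)
n_estimators = 10
models = []
for _ in range(n_estimators):
    # Bootstrap: sample training rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    models.append(tree)

# Aggregate: majority vote over the individual predictions
all_preds = np.array([m.predict(X_test) for m in models])   # (n_estimators, n_test)
votes = np.array([np.bincount(col, minlength=3) for col in all_preds.T])
y_pred = votes.argmax(axis=1)

accuracy = (y_pred == y_test).mean()
print('manual bagging accuracy:', accuracy)
```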

In [1]:
import numpy as np
import pandas as pd
import multiprocessing

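# load_boston was removed in scikit-learn 1.2; load_diabetes is imported for the regression sections
from sklearn.datasets import load_iris, load_wine, load_breast_cancer, load_diabetes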
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate

from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

In [2]:
seed = 2022
np.random.seed(seed)

### Classification with Bagging

#### Loading the datasets

In [3]:
iris = load_iris()
wine = load_wine()
cancer = load_breast_cancer()

#### KNN

##### Iris data

In [4]:
base_model = make_pipeline(StandardScaler(),
                          KNeighborsClassifier())

bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
                                  max_samples=0.5,  # use 50% of the samples
                                  max_features=0.5,  # use 50% of the features
                                 )

In [5]:
cross_val = cross_validate(estimator=base_model,
               X=iris.data, y=iris.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.002598142623901367 (+/- 0.0007992267822011628)
avg score time: 0.004198265075683594 (+/- 0.0007478315063778037)
avg test score: 0.96 (+/- 0.024944382578492935)


In [6]:
cross_val = cross_validate(estimator=bagging_model,
               X=iris.data, y=iris.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.06436243057250976 (+/- 0.013829491799255234)
avg score time: 0.020389080047607422 (+/- 0.0038226230211636336)
avg test score: 0.9533333333333334 (+/- 0.02666666666666666)


##### Wine data

In [7]:
base_model = make_pipeline(StandardScaler(),
                          KNeighborsClassifier())

bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
                                  max_samples=0.5,  # use 50% of the samples
                                  max_features=0.5,  # use 50% of the features
                                 )

In [8]:
cross_val = cross_validate(estimator=base_model,
               X=wine.data, y=wine.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.004598093032836914 (+/- 0.001018740034177349)
avg score time: 0.009793806076049804 (+/- 0.0031860029908464373)
avg test score: 0.9493650793650794 (+/- 0.037910929811115976)


In [9]:
cross_val = cross_validate(estimator=bagging_model,
               X=wine.data, y=wine.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.06935510635375977 (+/- 0.0102401677823819)
avg score time: 0.029389142990112305 (+/- 0.007809394030072566)
avg test score: 0.9607936507936508 (+/- 0.022468028291073656)


##### Breast cancer data

In [10]:
base_model = make_pipeline(StandardScaler(),
                          KNeighborsClassifier())

bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
                                  max_samples=0.5,  # use 50% of the samples
                                  max_features=0.5,  # use 50% of the features
                                 )

In [11]:
cross_val = cross_validate(estimator=base_model,
               X=cancer.data, y=cancer.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.003598308563232422 (+/- 0.0004889871618716363)
avg score time: 0.015390396118164062 (+/- 0.0054224933165927846)
avg test score: 0.9648501785437045 (+/- 0.009609970350036127)


In [12]:
cross_val = cross_validate(estimator=bagging_model,
               X=cancer.data, y=cancer.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.09934301376342773 (+/- 0.043432684599444465)
avg score time: 0.041975879669189455 (+/- 0.011003481596416882)
avg test score: 0.9630957925787922 (+/- 0.010219838955441114)


#### SVC

##### Iris data

In [13]:
base_model = make_pipeline(StandardScaler(),
                          SVC())

bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
                                  max_samples=0.5,  # use 50% of the samples
                                  max_features=0.5,  # use 50% of the features
                                 )

In [14]:
cross_val = cross_validate(estimator=base_model,
               X=iris.data, y=iris.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.0033935070037841796 (+/- 0.0004855644184291338)
avg score time: 0.0009989261627197266 (+/- 1.4415947397070861e-06)
avg test score: 0.9666666666666666 (+/- 0.02108185106778919)


In [15]:
cross_val = cross_validate(estimator=bagging_model,
               X=iris.data, y=iris.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.0549659252166748 (+/- 0.002532194921610228)
avg score time: 0.008402585983276367 (+/- 0.000490940362666761)
avg test score: 0.9399999999999998 (+/- 0.024944382578492935)


In [16]:
cross_val.keys()

dict_keys(['fit_time', 'score_time', 'test_score'])

##### Wine data

In [17]:
base_model = make_pipeline(StandardScaler(),
                          SVC())

bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
                                  max_samples=0.5,  # use 50% of the samples
                                  max_features=0.5,  # use 50% of the features
                                 )

In [18]:
cross_val = cross_validate(estimator=base_model,
               X=wine.data, y=wine.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.009194469451904297 (+/- 0.0031849167341325094)
avg score time: 0.0023985862731933593 (+/- 0.0004898041858020194)
avg test score: 0.9833333333333334 (+/- 0.022222222222222233)


In [19]:
cross_val = cross_validate(estimator=bagging_model,
               X=wine.data, y=wine.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.07455801963806152 (+/- 0.025741224536144098)
avg score time: 0.013795566558837891 (+/- 0.008605604462818356)
avg test score: 0.9665079365079364 (+/- 0.027185986341561497)


In [20]:
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.2, random_state=seed)

In [21]:
base_model = make_pipeline(StandardScaler(),
                          SVC())
base_model.fit(X_train, y_train)
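# Note: the model is fit on the training split, but the score is computed on the full dataset
print('base model accuracy on full data (fit on train split): {}'.format(base_model.score(wine.data, wine.target)))

base model accuracy on full data (fit on train split): 1.0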


In [22]:
base_model = make_pipeline(StandardScaler(),
                          SVC())
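# Note: this cell fits on the test split only and again scores on the full dataset
base_model.fit(X_test, y_test)
print('base model accuracy on full data (fit on test split): {}'.format(base_model.score(wine.data, wine.target)))

base model accuracy on full data (fit on test split): 0.9831460674157303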


In [23]:
base_model = make_pipeline(StandardScaler(),
                          SVC())

bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
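                                  max_samples=0.5,  # use 50% of the samples
                                  max_features=0.5,  # use 50% of the features
                                 )

# Note: the model is fit on the training split, but the score is computed on the full dataset
bagging_model.fit(X_train, y_train)
print('bagging model accuracy on full data (fit on train split): {}'.format(bagging_model.score(wine.data, wine.target)))

bagging model accuracy on full data (fit on train split): 0.9943820224719101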


In [24]:
base_model = make_pipeline(StandardScaler(),
                          SVC())

bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
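                                  max_samples=0.5,  # use 50% of the samples
                                  max_features=0.5,  # use 50% of the features
                                 )

# Note: this cell fits on the test split only and again scores on the full dataset
bagging_model.fit(X_test, y_test)
print('bagging model accuracy on full data (fit on test split): {}'.format(bagging_model.score(wine.data, wine.target)))

bagging model accuracy on full data (fit on test split): 0.9269662921348315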
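
The four cells above score on `wine.data`, which mixes rows the model was fit on with rows it has not seen, so none of the numbers is a clean training or test accuracy. A more conventional comparison, sketched here under the same split settings, fits each model once on the training split and scores it separately on `X_train` and `X_test`. The `random_state` passed to `BaggingClassifier` and the `scores` dictionary are additions for this example, and no results are quoted because they depend on the scikit-learn version.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=2022)

base_model = make_pipeline(StandardScaler(), SVC())
bagging_model = BaggingClassifier(base_model, n_estimators=10,
                                  max_samples=0.5, max_features=0.5,
                                  random_state=2022)

scores = {}
for name, model in [('base', base_model), ('bagging', bagging_model)]:
    model.fit(X_train, y_train)                      # fit on the training split only
    scores[name] = (model.score(X_train, y_train),   # training accuracy
                    model.score(X_test, y_test))     # held-out test accuracy
    print('{} model: train={:.3f}, test={:.3f}'.format(name, *scores[name]))
```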


##### Breast cancer data

In [25]:
cancer = load_breast_cancer()

In [26]:
base_model = make_pipeline(StandardScaler(),
                          SVC())

In [27]:
bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
                                  max_samples=0.5,  # use 50% of the samples
                                  max_features=0.5,  # use 50% of the features
                                 )

In [28]:
cross_val = cross_validate(estimator=base_model,
              X=cancer.data, y=cancer.target,
              cv=5, verbose=1)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished


In [29]:
print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.01039419174194336 (+/- 0.0004897069698645339)
avg score time: 0.004597091674804687 (+/- 0.000489512201374942)
avg test score: 0.9736376339077782 (+/- 0.014678541667933545)


In [30]:
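# Fit and scored on the same full dataset, so this is a training accuracy of the base model
base_model.fit(cancer.data, cancer.target)
print('base model training accuracy: {}'.format(base_model.score(cancer.data, cancer.target)))

base model training accuracy: 0.9876977152899824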


In [31]:
cross_val = cross_validate(estimator=bagging_model,
              X=cancer.data, y=cancer.target,
              cv=5, verbose=1)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.5s finished


In [32]:
print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.0783599853515625 (+/- 0.00726944897503338)
avg score time: 0.035175418853759764 (+/- 0.012537581603470401)
avg test score: 0.9613258810743673 (+/- 0.016295338833238505)


In [33]:
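# Fit and scored on the same full dataset, so this is a training accuracy, not a test accuracy
bagging_model.fit(cancer.data, cancer.target)
print('bagging model training accuracy: {}'.format(bagging_model.score(cancer.data, cancer.target)))

bagging model training accuracy: 0.9771528998242531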


#### Decision Tree

##### Iris data

In [34]:
base_model = make_pipeline(StandardScaler(),
                          DecisionTreeClassifier())

In [35]:
bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
                                  max_features=0.5,
                                  max_samples=0.5)

In [36]:
cross_val = cross_validate(estimator=base_model,
              X=iris.data, y=iris.target,
              cv=5, n_jobs=multiprocessing.cpu_count(),
              verbose=1)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:   14.2s finished


In [37]:
print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.008595943450927734 (+/- 0.004585256279991955)
avg score time: 0.0015984535217285155 (+/- 0.0004895710952037857)
avg test score: 0.9666666666666668 (+/- 0.036514837167011066)


In [38]:
cross_val = cross_validate(estimator=bagging_model,
              X=iris.data, y=iris.target,
              cv=5, n_jobs=multiprocessing.cpu_count(),
              verbose=1)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.6s finished


In [39]:
print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.31282196044921873 (+/- 0.037009103774813513)
avg score time: 0.046772432327270505 (+/- 0.02434415294493298)
avg test score: 0.9600000000000002 (+/- 0.038873012632301994)


##### Wine data

##### Breast cancer data
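
The wine and breast cancer cells are missing from this export. A possible reconstruction, following the same pattern as the iris cells, is sketched below; it is not the original notebook code. The `random_state` values and the `results` dictionary are assumptions added for reproducibility, and no scores are listed because they were not in the source.

```python
from sklearn.datasets import load_wine, load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

results = {}
for name, loader in [('wine', load_wine), ('breast cancer', load_breast_cancer)]:
    X, y = loader(return_X_y=True)
    base_model = make_pipeline(StandardScaler(),
                               DecisionTreeClassifier(random_state=0))
    bagging_model = BaggingClassifier(base_model, n_estimators=10,
                                      max_samples=0.5, max_features=0.5,
                                      random_state=0)
    # Compare the single tree against the bagged trees with 5-fold cross-validation
    for label, model in [('base', base_model), ('bagging', bagging_model)]:
        cv = cross_validate(estimator=model, X=X, y=y, cv=5)
        results[(name, label)] = cv['test_score'].mean()
        print('{} / {}: avg test score {:.3f} (+/- {:.3f})'.format(
            name, label, cv['test_score'].mean(), cv['test_score'].std()))
```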

### Regression with Bagging

#### Loading the datasets

#### KNN

##### Boston housing price data

##### Diabetes data
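
The regression cells are also missing from this export. As a hedged sketch of what a KNN comparison on the diabetes data could look like, the block below mirrors the classification cells with `BaggingRegressor`. It covers only the diabetes data, since `load_boston` was removed in scikit-learn 1.2. The `random_state` and the `r2` dictionary are assumptions for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

base_model = make_pipeline(StandardScaler(), KNeighborsRegressor())
bagging_model = BaggingRegressor(base_model, n_estimators=10,
                                 max_samples=0.5, max_features=0.5,
                                 random_state=0)

r2 = {}
for label, model in [('base', base_model), ('bagging', bagging_model)]:
    cv = cross_validate(estimator=model, X=X, y=y, cv=5)
    r2[label] = cv['test_score'].mean()   # the default regressor score is R^2
    print('{}: avg test R^2 {:.3f} (+/- {:.3f})'.format(
        label, cv['test_score'].mean(), cv['test_score'].std()))
```

Swapping `KNeighborsRegressor` for `SVR` or `DecisionTreeRegressor` gives the corresponding comparisons for the sections that follow.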

#### SVR

##### Boston housing price data

##### Diabetes data

#### Decision Tree

##### Boston housing price data

##### Diabetes data

## Forests of randomized trees

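* The `sklearn.ensemble` module has two averaging algorithms based on randomized decision trees
  * Random Forest
  * Extra-Trees
* Randomness injected into model construction produces a diverse set of models
* The ensemble prediction is the average of the individual models' predictions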
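
Both algorithms share the same estimator interface, so they can be compared side by side. The sketch below is an illustration on the iris data. `n_estimators=100` and `random_state=0` are choices made for this example, not settings taken from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

models = {
    # Bootstrap samples + best split among a random subset of features
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
    # Random feature subset + randomly drawn split thresholds
    'extra trees': ExtraTreesClassifier(n_estimators=100, random_state=0),
}

forest_scores = {}
for name, model in models.items():
    forest_scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print('{}: avg test score {:.3f}'.format(name, forest_scores[name]))
```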

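### Random Forests classification

### Random Forests regression

### Extremely Randomized Trees classification

### Extremely Randomized Trees regression

### Visualizing Random Forest and Extra-Trees

* Visualize the decision boundaries and regression curves of a decision tree, Random Forest, and Extra-Trees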

## AdaBoost

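* The best-known boosting algorithm
* Fits a sequence of weak models
* Each model is trained on a reweighted version of the data
* Predictions are combined through a weighted vote (or weighted sum)
* The first step trains on the original data; at each later iteration the per-sample weights are updated and a model is fit again
  * Weights increase for misclassified samples and decrease for correctly predicted ones
  * Each successive weak model therefore concentrates on the samples that are hard to predict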
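
One round of the reweighting step can be worked through by hand. The sketch below follows the classic discrete AdaBoost update for binary labels in {-1, +1}; scikit-learn's `AdaBoostClassifier` uses a multi-class variant (SAMME), so its exact formulas differ. The labels and predictions are made-up values for illustration.

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])   # hypothetical weak learner output; index 1 is wrong

weights = np.full(len(y_true), 1 / len(y_true))   # start with uniform sample weights
miss = y_pred != y_true

err = weights[miss].sum()                  # weighted error rate of the weak learner
alpha = 0.5 * np.log((1 - err) / err)      # vote weight of this learner

# Misclassified samples are scaled by exp(alpha), correct ones by exp(-alpha)
new_weights = weights * np.exp(np.where(miss, alpha, -alpha))
new_weights /= new_weights.sum()           # renormalize so the weights sum to 1

print('error:', err, 'alpha:', alpha)
print('new weights:', new_weights)
```

Here the single misclassified sample ends up with a weight of about 0.5 and each correctly classified sample with about 0.125, so the next weak model is pushed towards the hard sample.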

![AdaBoost](https://scikit-learn.org/stable/_images/sphx_glr_plot_adaboost_hastie_10_2_0011.png)

### AdaBoost classification

### AdaBoost regression

## Gradient Tree Boosting

* A generalization of boosting to arbitrary differentiable loss functions
* Usable for both classification and regression in many areas, including web search ranking
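
For the squared-error loss, the negative gradient is simply the residual, so gradient boosting reduces to repeatedly fitting small trees to the current residuals. The sketch below is a bare-bones version of that idea, not the `GradientBoostingRegressor` implementation. `learning_rate`, `n_rounds`, and `max_depth=3` are arbitrary choices for this example, and the error is measured on the training data only.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

learning_rate = 0.1
n_rounds = 50

prediction = np.full(len(y), y.mean())            # initial model: constant mean
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    residual = y - prediction                     # negative gradient of 0.5 * (y - F)^2
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residual)
    prediction += learning_rate * tree.predict(X) # small step towards the residuals
    mse_history.append(np.mean((y - prediction) ** 2))

print('training MSE before boosting:', mse_history[0])
print('training MSE after boosting: ', mse_history[-1])
```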

### Gradient Tree Boosting classification

### Gradient Tree Boosting regression

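## Voting-based classification (Voting Classifier)

* Combines the results of different models by voting
* Two voting schemes are available
  * Choose the most frequently predicted class (hard voting)
  * Take the weighted average of the predicted probabilities (soft voting)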
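
The two schemes can disagree on the same inputs. The following sketch uses made-up probabilities from three hypothetical models for a single sample, with equal weights for the soft vote, to show how.

```python
import numpy as np

# Hypothetical class probabilities from three models for one sample (classes 0 and 1)
probas = np.array([
    [0.90, 0.10],   # model A is confident in class 0
    [0.40, 0.60],   # model B leans towards class 1
    [0.45, 0.55],   # model C leans towards class 1
])

# Hard voting: each model votes for its most likely class, and the majority wins
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes).argmax()

# Soft voting: average the probabilities (equal weights here), then take the argmax
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print('hard voting picks class', hard_prediction)
print('soft voting picks class', soft_prediction, 'with mean probabilities', mean_proba)
```

Hard voting picks class 1 (two votes to one), while soft voting picks class 0 because model A's confidence outweighs the two narrow preferences. In scikit-learn the same choice is exposed through the `voting` parameter of `VotingClassifier`.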

### Visualizing decision boundaries

## Voting-based regression (Voting Regressor)

* Uses the average of the predictions of different models
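
The averaging can be verified directly against the fitted members of a `VotingRegressor`. The sketch below assumes two arbitrary member models (a linear regression and a shallow tree) on the diabetes data and no custom weights.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

voter = VotingRegressor(estimators=[
    ('linear', LinearRegression()),
    ('tree', DecisionTreeRegressor(max_depth=3, random_state=0)),
])
voter.fit(X, y)

# The ensemble prediction should equal the unweighted mean of the members' predictions
member_preds = np.array([est.predict(X) for est in voter.estimators_])
manual_mean = member_preds.mean(axis=0)
matches = np.allclose(voter.predict(X), manual_mean)
print('ensemble prediction equals mean of members:', matches)
```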

### Visualizing regression curves

## Stacked Generalization

* The predictions of each model are used as inputs to a final model
* Effective at reducing the bias of the combined model
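
As a minimal sketch of this idea, the block below stacks two scaled base classifiers under a logistic regression final model on the iris data. The choice of base models, `final_estimator`, and `cv=5` are assumptions for this example, not settings from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

stack = StackingClassifier(
    estimators=[
        ('knn', make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ('svc', make_pipeline(StandardScaler(), SVC())),
    ],
    # The final model is trained on cross-validated predictions of the base models
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

stack_score = cross_val_score(stack, X, y, cv=5).mean()
print('stacking avg test score: {:.3f}'.format(stack_score))
```

`StackingRegressor` follows the same pattern for regression tasks.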

### Stacking regression

#### Visualizing regression curves

### Stacking classification

#### Visualizing decision boundaries