From 이수안 컴퓨터: http://suanlab.com/youtube/ml.html

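# Ensemble

* A method that combines the predictions of several models to improve generalization and robustness
* There are two broad families of ensemble methods
  * Averaging methods
    * Build several estimators independently, then average their predictions
    * The combined estimator usually outperforms any single estimator because its variance is reduced
  * Boosting methods
    * Build the models sequentially
    * Each new model tries to reduce the bias of the combined model
    * The goal is to combine several weak models into one strong ensemble model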
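
The variance claim for averaging methods can be checked numerically. The sketch below is not part of the original notebook. It assumes every estimate is an independent noisy measurement of the same quantity (`true_value`, `noise_std`, and `n_estimators` are illustrative names), which real ensemble members only approximate because they are trained on overlapping data.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 3.0      # quantity every estimator tries to predict (illustrative)
noise_std = 1.0       # standard deviation of a single estimator's error
n_estimators = 10     # number of independent estimates that are averaged
n_trials = 20_000     # repetitions used to measure the variance empirically

# Each row holds n_estimators independent noisy estimates of true_value.
estimates = true_value + noise_std * rng.standard_normal((n_trials, n_estimators))

single_var = estimates[:, 0].var()           # variance of one estimator
averaged_var = estimates.mean(axis=1).var()  # variance of the averaged estimator

print(f"single estimator variance:   {single_var:.3f}")
print(f"averaged estimator variance: {averaged_var:.3f}")
print(f"theoretical sigma^2 / n:     {noise_std**2 / n_estimators:.3f}")
```

Under the independence assumption the averaged variance should be close to `noise_std**2 / n_estimators`. Correlated estimators reduce variance by less than this.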

## Bagging meta-estimator

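* Bagging is short for bootstrap aggregating
* Trains several models, each on a random subset of the original training set
* Combines the individual results to produce the final prediction
* Reduces variance and helps prevent overfitting
* Works well with strong, complex models (for example, fully grown decision trees)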
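
The two steps in the name, bootstrap and aggregate, can be written out by hand. This is a minimal sketch for illustration only, not how `BaggingClassifier` is implemented: it resamples rows with replacement, uses plain decision trees, and aggregates by majority vote, whereas scikit-learn averages predicted probabilities when the base estimator provides them.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

rng = np.random.default_rng(0)
n_estimators = 10
n_classes = len(np.unique(y))

models = []
for _ in range(n_estimators):
    # Bootstrap: draw len(X_train) row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    models.append(tree)

# Aggregate: stack every model's predictions, then take a majority vote per sample.
all_preds = np.stack([m.predict(X_test) for m in models])  # (n_estimators, n_test)
votes = np.apply_along_axis(np.bincount, 0, all_preds, minlength=n_classes)
bagged_pred = votes.argmax(axis=0)                         # (n_test,)

accuracy = (bagged_pred == y_test).mean()
print(f"hand-rolled bagging accuracy: {accuracy:.3f}")
```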

In [92]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import multiprocessing
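# Note: Matplotlib 3.6 renamed this style to 'seaborn-v0_8-whitegrid'.
plt.style.use(['seaborn-whitegrid'])
from matplotlib.colors import ListedColormap

# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_iris, load_wine, load_breast_cancer, load_boston, load_diabetes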
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate

from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

In [2]:
seed = 2022
np.random.seed(seed)

### Classification with Bagging

#### Loading the datasets

In [3]:
iris = load_iris()
wine = load_wine()
cancer = load_breast_cancer()

#### KNN

##### Iris data

In [4]:
base_model = make_pipeline(StandardScaler(),
                          KNeighborsClassifier())

bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
                                  max_samples=0.5, # use 50% of the samples
                                  max_features=0.5, # use 50% of the features
                                 )

In [5]:
cross_val = cross_validate(estimator=base_model,
               X=iris.data, y=iris.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.0027982711791992186 (+/- 0.000399685503331846)
avg score time: 0.004197359085083008 (+/- 0.0007480108495443122)
avg test score: 0.96 (+/- 0.024944382578492935)


In [6]:
cross_val = cross_validate(estimator=bagging_model,
               X=iris.data, y=iris.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.045973348617553714 (+/- 0.006288979576414651)
avg score time: 0.018190670013427734 (+/- 0.0039673361935450464)
avg test score: 0.9533333333333334 (+/- 0.02666666666666666)


##### Wine data

In [7]:
base_model = make_pipeline(StandardScaler(),
                          KNeighborsClassifier())

bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
                                  max_samples=0.5, # use 50% of the samples
                                  max_features=0.5, # use 50% of the features
                                 )

In [8]:
cross_val = cross_validate(estimator=base_model,
               X=wine.data, y=wine.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.002997875213623047 (+/- 0.0008938294024281691)
avg score time: 0.005996799468994141 (+/- 0.002189817485468174)
avg test score: 0.9493650793650794 (+/- 0.037910929811115976)


In [9]:
cross_val = cross_validate(estimator=bagging_model,
               X=wine.data, y=wine.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.04037766456604004 (+/- 0.0013692416933828466)
avg score time: 0.01559147834777832 (+/- 0.0004980311220686654)
avg test score: 0.9607936507936508 (+/- 0.022468028291073656)


##### Breast cancer data

In [10]:
base_model = make_pipeline(StandardScaler(),
                          KNeighborsClassifier())

bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
                                  max_samples=0.5, # use 50% of the samples
                                  max_features=0.5, # use 50% of the features
                                 )

In [11]:
cross_val = cross_validate(estimator=base_model,
               X=cancer.data, y=cancer.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.0029982566833496095 (+/- 0.0006321835506688758)
avg score time: 0.010938167572021484 (+/- 0.0019922017821457113)
avg test score: 0.9648501785437045 (+/- 0.009609970350036127)


In [12]:
cross_val = cross_validate(estimator=bagging_model,
               X=cancer.data, y=cancer.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.04857292175292969 (+/- 0.0004883559503262607)
avg score time: 0.031181144714355468 (+/- 0.0009818022942266356)
avg test score: 0.9630957925787922 (+/- 0.010219838955441114)


#### SVC

##### Iris data

In [13]:
base_model = make_pipeline(StandardScaler(),
                          SVC())

bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
                                  max_samples=0.5, # use 50% of the samples
                                  max_features=0.5, # use 50% of the features
                                 )

In [14]:
cross_val = cross_validate(estimator=base_model,
               X=iris.data, y=iris.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.0035935401916503905 (+/- 0.0007915467756286922)
avg score time: 0.0011992931365966796 (+/- 0.0003994467293459781)
avg test score: 0.9666666666666666 (+/- 0.02108185106778919)


In [15]:
cross_val = cross_validate(estimator=bagging_model,
               X=iris.data, y=iris.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.053978776931762694 (+/- 0.0014239682885157704)
avg score time: 0.008185100555419923 (+/- 0.0003928060927904803)
avg test score: 0.9399999999999998 (+/- 0.024944382578492935)


In [16]:
cross_val.keys()

dict_keys(['fit_time', 'score_time', 'test_score'])
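
By default `cross_validate` returns only the timing arrays and `test_score`. The sketch below, an addition to the notebook, shows that passing `return_train_score=True` also returns a `train_score` array, which makes the gap between training and validation accuracy visible. The variable names `cv_result` and `gap` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
model = make_pipeline(StandardScaler(), KNeighborsClassifier())

# return_train_score=True adds a 'train_score' entry next to 'test_score'.
cv_result = cross_validate(model, X, y, cv=5, return_train_score=True)
print(sorted(cv_result.keys()))

# A large positive gap between the two means suggests overfitting.
gap = cv_result['train_score'].mean() - cv_result['test_score'].mean()
print(f"mean train score - mean test score: {gap:.3f}")
```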

##### Wine data

In [17]:
base_model = make_pipeline(StandardScaler(),
                          SVC())

bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
                                  max_samples=0.5, # use 50% of the samples
                                  max_features=0.5, # use 50% of the features
                                 )

In [18]:
cross_val = cross_validate(estimator=base_model,
               X=wine.data, y=wine.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.003997421264648438 (+/- 0.0006326364931372153)
avg score time: 0.0015995025634765625 (+/- 0.0004885777256966608)
avg test score: 0.9833333333333334 (+/- 0.022222222222222233)


In [19]:
cross_val = cross_validate(estimator=bagging_model,
               X=wine.data, y=wine.target,
               cv=5,)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.06116561889648438 (+/- 0.0077262511163138925)
avg score time: 0.009793853759765625 (+/- 0.0003995182513194003)
avg test score: 0.9665079365079364 (+/- 0.027185986341561497)


In [20]:
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.2, random_state=seed)

In [21]:
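base_model = make_pipeline(StandardScaler(),
                          SVC())
base_model.fit(X_train, y_train)
# Scored on the full dataset (train and test rows together), not on a held-out split.
print('base model (fit on train split) accuracy on full data: {}'.format(base_model.score(wine.data, wine.target)))

base model (fit on train split) accuracy on full data: 1.0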


In [22]:
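base_model = make_pipeline(StandardScaler(),
                          SVC())
base_model.fit(X_test, y_test)
# Fit on the small test split only, then scored on the full dataset.
print('base model (fit on test split) accuracy on full data: {}'.format(base_model.score(wine.data, wine.target)))

base model (fit on test split) accuracy on full data: 0.9831460674157303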


In [23]:
base_model = make_pipeline(StandardScaler(),
                          SVC())

bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
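                                  max_samples=0.5, # use 50% of the samples
                                  max_features=0.5, # use 50% of the features
                                 )

bagging_model.fit(X_train, y_train)
# Scored on the full dataset (train and test rows together), not on a held-out split.
print('bagging model (fit on train split) accuracy on full data: {}'.format(bagging_model.score(wine.data, wine.target)))

bagging model (fit on train split) accuracy on full data: 0.9943820224719101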


In [24]:
base_model = make_pipeline(StandardScaler(),
                          SVC())

bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
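                                  max_samples=0.5, # use 50% of the samples
                                  max_features=0.5, # use 50% of the features
                                 )

bagging_model.fit(X_test, y_test)
# Fit on the small test split only, then scored on the full dataset.
print('bagging model (fit on test split) accuracy on full data: {}'.format(bagging_model.score(wine.data, wine.target)))

bagging model (fit on test split) accuracy on full data: 0.9269662921348315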
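
The four cells above fit on one split and score on the full wine dataset, so none of the numbers is a held-out estimate. A minimal sketch of the usual pattern follows: fit once on the training split, then score the training and test splits separately. It is an addition to the notebook. The stratified split, the fixed `random_state`, and the `scores` dictionary are assumptions of this sketch, and its numbers will differ from the outputs recorded above.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2022, stratify=y)

base_model = make_pipeline(StandardScaler(), SVC())
bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
                                  max_samples=0.5,
                                  max_features=0.5,
                                  random_state=2022)

scores = {}
for name, model in [('base', base_model), ('bagging', bagging_model)]:
    model.fit(X_train, y_train)                     # fit on the training split only
    scores[name] = (model.score(X_train, y_train),  # training accuracy
                    model.score(X_test, y_test))    # held-out accuracy
    print('{} model: train accuracy {:.3f}, test accuracy {:.3f}'.format(name, *scores[name]))
```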


##### Breast cancer data

In [25]:
cancer = load_breast_cancer()

In [26]:
base_model = make_pipeline(StandardScaler(),
                          SVC())

In [27]:
bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
                                  max_samples=0.5, # use 50% of the samples
                                  max_features=0.5, # use 50% of the features
                                 )

In [28]:
cross_val = cross_validate(estimator=base_model,
              X=cancer.data, y=cancer.target,
              cv=5, verbose=1)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished


In [29]:
print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.010592794418334961 (+/- 0.0004896695918393925)
avg score time: 0.0049969196319580075 (+/- 8.064048063892252e-07)
avg test score: 0.9736376339077782 (+/- 0.014678541667933545)


In [30]:
base_model.fit(cancer.data, cancer.target)
# Fit and scored on the same data, so this is training accuracy of the base model.
print('base model training accuracy: {}'.format(base_model.score(cancer.data, cancer.target)))

base model training accuracy: 0.9876977152899824


In [31]:
cross_val = cross_validate(estimator=bagging_model,
              X=cancer.data, y=cancer.target,
              cv=5, verbose=1)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.4s finished


In [32]:
print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.07655658721923828 (+/- 0.0037585224309997054)
avg score time: 0.02698373794555664 (+/- 0.0006260761846895803)
avg test score: 0.9613258810743673 (+/- 0.016295338833238505)


In [33]:
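bagging_model.fit(cancer.data, cancer.target)
# Fit and scored on the same data, so this is training accuracy of the bagging model.
print('bagging model training accuracy: {}'.format(bagging_model.score(cancer.data, cancer.target)))

bagging model training accuracy: 0.9771528998242531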
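
Scoring on the data a model was fit on, as in the cell above, tends to be optimistic. Bagging offers a cheap alternative: each base model sees only a bootstrap sample, so the rows it never saw (the out-of-bag rows) can act as a validation set. The sketch below is an addition to the notebook and uses the `oob_score=True` option of `BaggingClassifier`. The choice of `n_estimators=50` is an assumption made so that every row is very likely to be out-of-bag for at least one estimator.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

bagging_model = BaggingClassifier(make_pipeline(StandardScaler(), SVC()),
                                  n_estimators=50,
                                  max_samples=0.5,
                                  max_features=0.5,
                                  oob_score=True,   # requires bootstrap=True (the default)
                                  random_state=0)
bagging_model.fit(X, y)

train_accuracy = bagging_model.score(X, y)  # scored on rows used for fitting
oob_accuracy = bagging_model.oob_score_     # scored only on out-of-bag rows
print('training accuracy: {:.3f}'.format(train_accuracy))
print('out-of-bag accuracy: {:.3f}'.format(oob_accuracy))
```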


#### Decision Tree

##### Iris data

In [34]:
base_model = make_pipeline(StandardScaler(),
                          DecisionTreeClassifier())

In [35]:
bagging_model = BaggingClassifier(base_model,
                                  n_estimators=10,
                                  max_features=0.5,
                                  max_samples=0.5)

In [36]:
cross_val = cross_validate(estimator=base_model,
              X=iris.data, y=iris.target,
              cv=5, n_jobs=multiprocessing.cpu_count(),
              verbose=1)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    9.0s finished


In [37]:
print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.0043985843658447266 (+/- 0.0010173557398091178)
avg score time: 0.0011986732482910157 (+/- 0.00040035409663519714)
avg test score: 0.9666666666666668 (+/- 0.036514837167011066)


In [38]:
cross_val = cross_validate(estimator=bagging_model,
              X=iris.data, y=iris.target,
              cv=5, n_jobs=multiprocessing.cpu_count(),
              verbose=1)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.1s finished


In [39]:
print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.09194622039794922 (+/- 0.01699625372328629)
avg score time: 0.008595895767211915 (+/- 0.00224455139972937)
avg test score: 0.9400000000000001 (+/- 0.04898979485566354)


##### Wine data

##### Breast cancer data

### Regression with Bagging

#### Loading the datasets

In [42]:
boston = load_boston()
diabetes = load_diabetes()

#### KNN

##### Boston housing price data

In [44]:
base_model = make_pipeline(StandardScaler(),
                         KNeighborsRegressor())

In [46]:
bagging_model = BaggingRegressor(base_model,
                                 n_estimators=10,
                                 max_features=0.5,
                                 max_samples=0.5)                            

In [50]:
cross_val = cross_validate(base_model,
              X=boston.data, y=boston.target,
              cv=5, n_jobs=multiprocessing.cpu_count(),
              verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score : {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.006395912170410157 (+/- 0.0004903889081666397)
avg score time: 0.005402231216430664 (+/- 0.0010130205902106283)
avg test score : 0.47357748833823543 (+/- 0.13243123464477455)


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.0s finished


In [51]:
cross_val = cross_validate(bagging_model,
              X=boston.data, y=boston.target,
              cv=5, n_jobs=multiprocessing.cpu_count(),
              verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score : {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


avg fit time: 0.0997426986694336 (+/- 0.015372230781433571)
avg score time: 0.03857951164245606 (+/- 0.012885002887535203)
avg test score : 0.44273199452435785 (+/- 0.16707228466927612)


[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.3s finished
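
For regressors, `cross_validate` reports the estimator's default `score`, which is the coefficient of determination R². That is why the "avg test score" values in the regression cells are not accuracies and can be low or even negative. The sketch below is an addition to the notebook. It requests R² and mean absolute error explicitly through the `scoring` parameter, using the diabetes data because `load_boston` is unavailable in recent scikit-learn releases.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
model = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Default scoring for a regressor is R^2, returned under 'test_score'.
default_cv = cross_validate(model, X, y, cv=5)

# Several metrics at once: results are returned under 'test_<metric name>'.
multi_cv = cross_validate(model, X, y, cv=5,
                          scoring=('r2', 'neg_mean_absolute_error'))

same_as_default = np.allclose(default_cv['test_score'], multi_cv['test_r2'])
mae = -multi_cv['test_neg_mean_absolute_error']  # flip the sign back to an error

print('default score equals R^2: {}'.format(same_as_default))
print('avg R^2: {:.3f}'.format(multi_cv['test_r2'].mean()))
print('avg MAE: {:.3f}'.format(mae.mean()))
```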


##### Diabetes data

In [52]:
base_model = make_pipeline(StandardScaler(),
                           KNeighborsRegressor())

In [53]:
bagging_model = BaggingRegressor(base_model,
                                n_estimators=10,
                                max_features=0.5,
                                max_samples=0.5)

In [57]:
cross_val = cross_validate(base_model,
                          X=diabetes.data, y=diabetes.target,
                          cv=5, n_jobs=multiprocessing.cpu_count(),
                           verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score : {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.005595207214355469 (+/- 0.0004888113504555947)
avg score time: 0.005196952819824218 (+/- 0.00040011432937742076)
avg test score : 0.3689720650295623 (+/- 0.044659049060165365)


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.0s finished


In [58]:
cross_val = cross_validate(bagging_model,
                          X=diabetes.data, y=diabetes.target,
                          cv=5, n_jobs=multiprocessing.cpu_count(),
                           verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score : {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


avg fit time: 0.12133097648620605 (+/- 0.027051117144110626)
avg score time: 0.03797907829284668 (+/- 0.005174451648753201)
avg test score : 0.4033112416625221 (+/- 0.06070489146332469)


[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.3s finished


#### SVR

##### Boston housing price data

In [60]:
base_model = make_pipeline(StandardScaler(),
                          SVR())

In [61]:
bagging_model = BaggingRegressor(base_model,
                                n_estimators=10,
                                max_features=0.5,
                                max_samples=0.5)

In [62]:
cross_val = cross_validate(base_model,
                          X=boston.data, y=boston.target,
                          cv=5, n_jobs=multiprocessing.cpu_count(),
                          verbose=1)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.0s finished


In [63]:
print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.042574739456176756 (+/- 0.010125572107599444)
avg score time: 0.020589208602905272 (+/- 0.005000043284539062)
avg test score: 0.17631266230186618 (+/- 0.5224914915128981)


In [64]:
cross_val = cross_validate(bagging_model,
                          X=boston.data, y=boston.target,
                          cv=5, n_jobs=multiprocessing.cpu_count(),
                          verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


avg fit time: 0.19108986854553223 (+/- 0.02887909098230622)
avg score time: 0.1679044246673584 (+/- 0.058873772756669426)
avg test score: 0.1968656402067442 (+/- 0.2863374747052458)


[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.5s finished


##### Diabetes data

In [65]:
cross_val = cross_validate(base_model,
                          X=diabetes.data, y=diabetes.target,
                          cv=5, n_jobs=multiprocessing.cpu_count(),
                          verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.026778221130371094 (+/- 0.0007406398157138773)
avg score time: 0.019591045379638673 (+/- 0.006824739640910243)
avg test score: 0.14659936199629434 (+/- 0.02190798003342928)


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.0s finished


In [66]:
cross_val = cross_validate(bagging_model,
                          X=diabetes.data, y=diabetes.target,
                          cv=5, n_jobs=multiprocessing.cpu_count(),
                          verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


avg fit time: 0.13891806602478027 (+/- 0.020886961674807223)
avg score time: 0.11353573799133301 (+/- 0.03704024538078205)
avg test score: 0.055531430479438204 (+/- 0.021846354857192423)


[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.3s finished


#### Decision Tree

##### Boston housing price data

In [67]:
base_model = make_pipeline(StandardScaler(),
                          DecisionTreeRegressor())

In [70]:
bagging_model = BaggingRegressor(base_model,
                                 n_estimators=10,
                                 max_features=0.5,
                                 max_samples=0.5)

In [72]:
cross_val = cross_validate(base_model,
                          X=boston.data, y=boston.target,
                          cv=5, n_jobs=multiprocessing.cpu_count(),
                          verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.013191509246826171 (+/- 0.000979220502532613)
avg score time: 0.001804018020629883 (+/- 0.00040248461665117293)
avg test score: 0.033393367347999005 (+/- 0.9468177624310712)


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.0s finished


In [73]:
cross_val = cross_validate(bagging_model,
                          X=boston.data, y=boston.target,
                          cv=5, n_jobs=multiprocessing.cpu_count(),
                          verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


avg fit time: 0.12033085823059082 (+/- 0.020623181035130087)
avg score time: 0.015392875671386719 (+/- 0.009172096182802132)
avg test score: 0.44110137210924194 (+/- 0.19586019473958183)


[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.2s finished


##### Diabetes data

In [74]:
cross_val = cross_validate(base_model,
                          X=diabetes.data, y=diabetes.target,
                          cv=5, n_jobs=multiprocessing.cpu_count(),
                          verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


avg fit time: 0.011788082122802735 (+/- 0.0019424560085108493)
avg score time: 0.003199338912963867 (+/- 0.0009782558875544402)
avg test score: -0.14987456754914263 (+/- 0.12043883836668544)


[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:   10.2s finished


In [75]:
cross_val = cross_validate(bagging_model,
                          X=diabetes.data, y=diabetes.target,
                          cv=5, n_jobs=multiprocessing.cpu_count(),
                          verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


avg fit time: 0.14171829223632812 (+/- 0.02008284196861171)
avg score time: 0.02198786735534668 (+/- 0.00909435774652489)
avg test score: 0.34472048313664244 (+/- 0.08232687964248388)


[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.3s finished


## Forests of randomized trees

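* The `sklearn.ensemble` module includes two averaging algorithms based on randomized decision trees
  * Random Forest
  * Extra-Trees
* Randomness is injected into model construction, which yields a diverse set of models
* The ensemble prediction is the average of the individual models' predictions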
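
Both points can be inspected directly on fitted models. The sketch below is an addition to the notebook. It prints the default `bootstrap` setting of each estimator (Random Forest resamples rows, while Extra-Trees by default uses the whole training set and instead draws split thresholds at random) and checks that a forest's `predict_proba` equals the mean of its trees' probabilities. The small `n_estimators=25` is chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
et = ExtraTreesClassifier(n_estimators=25, random_state=0).fit(X, y)

# Default source of row-level randomness differs between the two algorithms.
print('RandomForest bootstrap: {}'.format(rf.bootstrap))
print('ExtraTrees bootstrap:   {}'.format(et.bootstrap))

# The ensemble probability is the average of the per-tree probabilities.
tree_mean = np.mean([tree.predict_proba(X) for tree in rf.estimators_], axis=0)
matches = np.allclose(tree_mean, rf.predict_proba(X))
print('forest proba equals mean of tree probas: {}'.format(matches))
```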

In [None]:
# from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor

### Random Forests classification

In [77]:
model = make_pipeline(StandardScaler(),
                     RandomForestClassifier())

In [78]:
cross_val = cross_validate(estimator=model,
                          X=iris.data, y=iris.target,
                          cv=5, n_jobs=multiprocessing.cpu_count(),
                          verbose=1)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:   11.3s finished


In [79]:
print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

avg fit time: 0.6808154106140136 (+/- 0.12270093803047905)
avg score time: 0.03937296867370606 (+/- 0.011875957534835642)
avg test score: 0.9666666666666668 (+/- 0.02108185106778919)


### Random Forests regression

In [81]:
model = make_pipeline(StandardScaler(),
                     RandomForestRegressor())

In [82]:
cross_val = cross_validate(estimator=model,
                          X=boston.data, y=boston.target,
                          cv=5, n_jobs=multiprocessing.cpu_count(),
                          verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


avg fit time: 1.2428889274597168 (+/- 0.203637781702091)
avg score time: 0.044179105758666994 (+/- 0.012209001413442092)
avg test score: 0.6341312087831044 (+/- 0.19866643783095084)


[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    2.2s finished


In [83]:
cross_val = cross_validate(estimator=model,
                          X=diabetes.data, y=diabetes.target,
                          cv=5, n_jobs=multiprocessing.cpu_count(),
                          verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


avg fit time: 0.8373208522796631 (+/- 0.13092289199045334)
avg score time: 0.031782007217407225 (+/- 0.0020380621288949787)
avg test score: 0.41861188460973453 (+/- 0.040616880277457336)


[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    1.5s finished


### Extremely Randomized Trees classification

In [84]:
model = make_pipeline(StandardScaler(),
                     ExtraTreesClassifier())

In [85]:
cross_val = cross_validate(estimator=model,
                          X=iris.data, y=iris.target,
                          cv=5, n_jobs=multiprocessing.cpu_count(),
                          verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


avg fit time: 0.4537407398223877 (+/- 0.08089111029167216)
avg score time: 0.03597936630249023 (+/- 0.008915929525219949)
avg test score: 0.9533333333333334 (+/- 0.03399346342395189)


[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.8s finished


In [86]:
cross_val = cross_validate(estimator=model,
                          X=wine.data, y=wine.target,
                          cv=5, n_jobs=multiprocessing.cpu_count(),
                          verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


avg fit time: 0.6278254508972168 (+/- 0.18176466734141458)
avg score time: 0.048172521591186526 (+/- 0.010181391709644569)
avg test score: 0.9833333333333332 (+/- 0.022222222222222233)


[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    1.0s finished


In [87]:
cross_val = cross_validate(estimator=model,
                          X=cancer.data, y=cancer.target,
                          cv=5, n_jobs=multiprocessing.cpu_count(),
                          verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


avg fit time: 0.4847268581390381 (+/- 0.11358975057153671)
avg score time: 0.035575151443481445 (+/- 0.006023613113197506)
avg test score: 0.9648501785437043 (+/- 0.014678771565898205)


[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.8s finished


### Extremely Randomized Trees Regression

In [88]:
model = make_pipeline(StandardScaler(),
                     ExtraTreesRegressor())

In [89]:
cross_val = cross_validate(estimator=model,
                           X=boston.data, y=boston.target,
                           cv=5, n_jobs=multiprocessing.cpu_count(),
                           verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


avg fit time: 0.7015979290008545 (+/- 0.14855760354165107)
avg score time: 0.028183555603027342 (+/- 0.00461905642367362)
avg test score: 0.6410406504199446 (+/- 0.24486836240443524)


[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    1.1s finished


In [90]:
cross_val = cross_validate(estimator=model,
                           X=diabetes.data, y=diabetes.target,
                           cv=5, n_jobs=multiprocessing.cpu_count(),
                           verbose=1)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg score time: {} (+/- {})'.format(cross_val['score_time'].mean(), cross_val['score_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


avg fit time: 0.9212769031524658 (+/- 0.20622518015052746)
avg score time: 0.05516486167907715 (+/- 0.01626413243806954)
avg test score: 0.4436155011145192 (+/- 0.040213395733187704)


[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    1.5s finished


### Random Forest and Extra Trees Visualization

* Visualizes the decision boundaries and regression curves of a decision tree, Random Forest, and Extra Trees

In [93]:
n_classes = 3
n_estimator = 30
cmap = plt.cm.RdYlBu
plot_step = 0.02
plot_step_coarser = 0.5
RANDOM_SEED = 13

In [94]:
iris = load_iris()
plot_idx = 1
models = [DecisionTreeClassifier(max_depth=None),
         RandomForestClassifier(n_estimators=n_estimator),
         ExtraTreesClassifier(n_estimators=n_estimator)]

In [None]:
plt.figure(figsize=(16,8))

for pair in ([0,1], [0,2], [2,3]):
    for model in models:        
        X = iris.data[:,pair]
        y = iris.target
        
        idx = np.arange(X.shape[0])
        np.random.seed(RANDOM_SEED)
        np.random.shuffle(idx)
        
        X = X[idx]
        y = y[idx]
        
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        X = (X-mean) / std
        
        model.fit(X, y)
        
        # Strip the "Classifier" suffix, e.g. "RandomForestClassifier" -> "RandomForest"
        model_title = str(type(model)).split(".")[-1][:-2][:-len("Classifier")]
        
        plt.subplot(3,3, plot_idx)
        if plot_idx <= len(models):
            plt.title(model_title, fontsize=9)
            
        x_min, x_max = X[:, 0].min()-1, X[:, 0].max()+1
        y_min, y_max = X[:, 1].min()-1, X[:, 1].max()+1
        
        xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                            np.arange(y_min, y_max, plot_step))
        
        if isinstance(model, DecisionTreeClassifier):
            Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            cs = plt.contourf(xx, yy, Z, cmap=cmap)
        else:
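            # Overlay each tree's boundary with partial transparency so the
            # blended colors reflect the whole ensemble
            estimator_alpha = 1.0 / len(model.estimators_)
            for tree in model.estimators_:
                Z = tree.predict(np.c_[xx.ravel(), yy.ravel()])
                Z = Z.reshape(xx.shape)
                cs = plt.contourf(xx, yy, Z, alpha=estimator_alpha, cmap=cmap)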
                
        xx_coarser, yy_coarser = np.meshgrid(np.arange(x_min, x_max, plot_step_coarser),
                                            np.arange(y_min, y_max, plot_step_coarser))
        
        Z_points_coarser = model.predict(np.c_[xx_coarser.ravel(), 
                                               yy_coarser.ravel()]).reshape(xx_coarser.shape)
        
        cs_points = plt.scatter(xx_coarser, yy_coarser, s=15,
                               c=Z_points_coarser, cmap=cmap,
                               edgecolors='none')
        
        plt.scatter(X[:, 0], X[:,1], c=y,
                   cmap=ListedColormap(['r', 'y', 'b']),
                   edgecolor='k', s=20,)
        plot_idx += 1
        
plt.suptitle('Classifiers', fontsize=12)
plt.axis('tight')
plt.tight_layout(h_pad=0.2, w_pad=0.2, pad=2.5)
plt.show()

## AdaBoost

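* A representative boosting algorithm
* Trains a sequence of weak models
* Repeatedly fits on modified (reweighted) versions of the data
* Combines the predictions of the individual models through a weighted vote (or sum)
* The first step trains on the original data; at each subsequent iteration the weights of the individual samples are updated and the model is trained again
  * Misclassified samples have their weights increased, and correctly predicted samples have their weights decreased
  * As a result, each subsequent weak model focuses on the samples that are hard to predict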
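
The reweighting loop described above can be sketched by hand. This is a minimal illustration of discrete AdaBoost for a binary problem, not scikit-learn's implementation; the choice of depth-1 decision trees (stumps) as weak models, the breast cancer dataset, the `{-1, +1}` label encoding, and `n_rounds = 20` are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y01 = load_breast_cancer(return_X_y=True)
y = np.where(y01 == 1, 1, -1)  # assumed encoding: labels in {-1, +1}

n_rounds = 20  # assumed number of weak models
weights = np.full(len(y), 1.0 / len(y))  # first round: uniform weights (original data)
stumps, alphas = [], []

for _ in range(n_rounds):
    # Fit a weak model on the currently weighted data
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error of this weak model, clipped to keep the log finite
    err = np.clip(np.sum(weights[pred != y]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # vote strength of this model

    # Misclassified samples (y * pred == -1) gain weight, correct ones lose weight
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Weighted vote over all weak models
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
train_acc = np.mean(np.sign(scores) == y)
first_acc = np.mean(stumps[0].predict(X) == y)
print('first stump train accuracy: {}'.format(first_acc))
print('ensemble train accuracy: {}'.format(train_acc))
```

Both numbers are training accuracies, so they only show how the weighted vote combines the stumps, not how well the ensemble generalizes.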

![AdaBoost](https://scikit-learn.org/stable/_images/sphx_glr_plot_adaboost_hastie_10_2_0011.png)

### AdaBoost Classification
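
A minimal sketch of AdaBoost classification, assuming the same pipeline and `cross_validate` pattern used for the other models in this notebook. The iris dataset, the default `AdaBoostClassifier` settings, and `random_state=2022` are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Scaling is kept only for consistency with the earlier cells
model = make_pipeline(StandardScaler(),
                      AdaBoostClassifier(random_state=2022))

cross_val = cross_validate(estimator=model,
                           X=iris.data, y=iris.target,
                           cv=5)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))
```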

### AdaBoost Regression
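
A minimal sketch of AdaBoost regression under the same assumptions: the diabetes dataset, default `AdaBoostRegressor` settings, and `random_state=2022` are chosen for illustration only. For regressors, `cross_validate` reports the R² score by default.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

diabetes = load_diabetes()

model = make_pipeline(StandardScaler(),
                      AdaBoostRegressor(random_state=2022))

cross_val = cross_validate(estimator=model,
                           X=diabetes.data, y=diabetes.target,
                           cv=5)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))
```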

## Gradient Tree Boosting

* A generalization of boosting to arbitrary differentiable loss functions
* Applicable in many areas, including web search ranking, classification, and regression
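
To make the role of the loss function concrete, here is a hand-written sketch for the squared-error loss, where the negative gradient with respect to the current prediction is simply the residual. It is an illustration under assumptions (diabetes data, depth-3 trees, `learning_rate = 0.1`, `n_rounds = 50`), not scikit-learn's `GradientBoostingRegressor`.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

learning_rate = 0.1  # assumed shrinkage factor
n_rounds = 50        # assumed number of trees

# Start from a constant model: the mean of the targets
prediction = np.full(len(y), y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # Negative gradient of 0.5 * (y - F)^2 with respect to F is the residual
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residual)
    # Take a small step in the direction the tree suggests
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print('train MSE before boosting: {}'.format(initial_mse))
print('train MSE after boosting: {}'.format(final_mse))
```

Swapping in another differentiable loss only changes how `residual` is computed, which is the sense in which the method generalizes boosting.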

### Gradient Tree Boosting Classification
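
A minimal sketch of gradient tree boosting for classification, again assuming the notebook's pipeline and `cross_validate` pattern. The wine dataset and `random_state=2022` are assumptions for the example.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()

model = make_pipeline(StandardScaler(),
                      GradientBoostingClassifier(random_state=2022))

cross_val = cross_validate(estimator=model,
                           X=wine.data, y=wine.target,
                           cv=5)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))
```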

### Gradient Tree Boosting Regression
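
A minimal sketch of gradient tree boosting for regression. The diabetes dataset and `random_state=2022` are assumptions for the example; the reported score is R².

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

diabetes = load_diabetes()

model = make_pipeline(StandardScaler(),
                      GradientBoostingRegressor(random_state=2022))

cross_val = cross_validate(estimator=model,
                           X=diabetes.data, y=diabetes.target,
                           cv=5)

print('avg fit time: {} (+/- {})'.format(cross_val['fit_time'].mean(), cross_val['fit_time'].std()))
print('avg test score: {} (+/- {})'.format(cross_val['test_score'].mean(), cross_val['test_score'].std()))
```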

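## Voting-Based Classification (Voting Classifier)

* Combines the results of different models through voting
* Voting can be done in two ways
  * Choose the class predicted by the most models (hard voting)
  * Choose the class with the highest weighted average of the predicted probabilities (soft voting)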
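
The two voting modes can be compared with a short sketch. The three base models (logistic regression, a decision tree, Gaussian naive Bayes), the iris dataset, and the soft-voting weights `[2, 1, 1]` are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

estimators = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('tree', DecisionTreeClassifier(random_state=2022)),
    ('nb', GaussianNB()),
]

# Hard voting: majority class label; soft voting: weighted average of probabilities
voters = {
    'hard': VotingClassifier(estimators=estimators, voting='hard'),
    'soft': VotingClassifier(estimators=estimators, voting='soft', weights=[2, 1, 1]),
}

results = {}
for name, clf in voters.items():
    scores = cross_val_score(clf, X, y, cv=5)
    results[name] = scores.mean()
    print('{} voting accuracy: {} (+/- {})'.format(name, scores.mean(), scores.std()))
```

Soft voting requires every base model to implement `predict_proba`, which is why all three assumed models provide it.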

### Decision Boundary Visualization

## Voting-Based Regression (Voting Regressor)

* Uses the average of the predictions of different models
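
As a small check of the averaging behavior, the sketch below fits a `VotingRegressor` and compares its output with the plain mean of its fitted members. The base models, the diabetes dataset, and the use of the first five rows are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)

voter = VotingRegressor(estimators=[
    ('linear', LinearRegression()),
    ('forest', RandomForestRegressor(n_estimators=30, random_state=2022)),
    ('knn', KNeighborsRegressor()),
])
voter.fit(X, y)

# Predictions of each fitted member, one column per model
individual = np.column_stack([est.predict(X[:5]) for est in voter.estimators_])
averaged = individual.mean(axis=1)
ensemble = voter.predict(X[:5])

print('member predictions:\n{}'.format(individual))
print('ensemble predictions: {}'.format(ensemble))
```

With no `weights` given, the ensemble prediction is the unweighted mean of the member predictions.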

### Regression Curve Visualization

## Stacked Generalization

* The predictions of each model are used as the input to a final model
* Effective at reducing model bias
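
A minimal stacking sketch: the base models' predictions become the features of a final model. The base models (ridge, a shallow decision tree, KNN), the `LinearRegression` final estimator, and the diabetes dataset are assumptions for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

stack = StackingRegressor(
    estimators=[
        ('ridge', Ridge()),
        ('tree', DecisionTreeRegressor(max_depth=4, random_state=2022)),
        ('knn', KNeighborsRegressor()),
    ],
    final_estimator=LinearRegression(),  # trained on the base models' predictions
    cv=5,
)

scores = cross_val_score(stack, X, y, cv=5)
print('avg test score: {} (+/- {})'.format(scores.mean(), scores.std()))

# After fitting, transform() exposes what the final model sees:
# one column of predictions per base model
stack.fit(X, y)
meta_features = stack.transform(X[:3])
print('meta-feature shape: {}'.format(meta_features.shape))
```

The inner `cv=5` makes the final model train on out-of-fold predictions, which keeps it from simply trusting base models that overfit the training data.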

### Stacking Regression

#### Regression Curve Visualization

### Stacking Classification

#### Decision Boundary Visualization