# Ensemble Learning

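A technique that builds multiple classifiers and combines their predictions to produce a more accurate final prediction.

- Voting: several classifiers decide the prediction by vote (the classifiers use different algorithms)
- Bagging: the same algorithm is trained on differently sampled subsets of the data, and the resulting models are combined by voting (e.g., random forest)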
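
The bagging idea above can be sketched with the standard library alone. This is a toy illustration, not how random forest is implemented: the 1-D data, the threshold "stump" base learner, and the helper names (`fit_stump`, `bootstrap_sample`, `bagging_fit`, `bagging_predict`) are all hypothetical, and redrawing a sample until both classes appear is a simplification for this small dataset.

```python
import random
from collections import Counter


def fit_stump(rows):
    """Toy base learner: threshold halfway between the two class means."""
    xs0 = [x for x, y in rows if y == 0]
    xs1 = [x for x, y in rows if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, redrawing until both classes appear."""
    while True:
        sample = [rng.choice(rows) for _ in rows]
        if {y for _, y in sample} == {0, 1}:
            return sample


def bagging_fit(rows, n_estimators=25, seed=0):
    """Fit the same base learner on n_estimators different bootstrap samples."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_estimators)]


def bagging_predict(thresholds, x):
    """Each stump votes 1 if x exceeds its threshold; the majority label wins."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical, well-separated 1-D data as (feature, label) pairs
data = [(x, 0) for x in (1, 2, 3, 4, 5)] + [(x, 1) for x in (11, 12, 13, 14, 15)]
models = bagging_fit(data)
print(bagging_predict(models, 2), bagging_predict(models, 14))
```

Every model uses the same algorithm; only the bootstrap sample differs, which is what distinguishes bagging from voting across different algorithms.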

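## Voting

- Hard voting: the label predicted by the majority of the classifiers becomes the final prediction
- Soft voting: the class probabilities predicted by the classifiers are averaged per label, and the label with the highest average probability is selected (usually the preferred method; note that scikit-learn's `VotingClassifier` defaults to `voting='hard'`, so soft voting must be requested explicitly)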

![voting](./voting.png)
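
The difference between the two schemes is easiest to see on a case where they disagree. Below is a minimal sketch in plain Python; the probability values and the helper names `hard_vote` and `soft_vote` are made up for illustration and are not part of scikit-learn.

```python
from collections import Counter

# Hypothetical class probabilities [P(class 0), P(class 1)] predicted by
# three classifiers for a single sample
probas = [
    [0.45, 0.55],
    [0.40, 0.60],
    [0.90, 0.10],
]


def hard_vote(probas):
    """Each classifier votes for its most probable class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probas):
    """Average the probabilities per class, then pick the highest average."""
    n_classes = len(probas[0])
    mean = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


print(hard_vote(probas))  # 1: two of the three classifiers prefer class 1
print(soft_vote(probas))  # 0: the mean probabilities are about [0.583, 0.417]
```

Soft voting lets one very confident classifier outweigh two weakly confident ones, which is why it needs classifiers that can output probabilities.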

In [9]:
import pandas as pd

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

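# Load the breast cancer dataset bundled with scikit-learn
cancer = load_breast_cancer()

# Wrap the features in a DataFrame and append the target (0 = malignant, 1 = benign)
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
cancer_df['cancer'] = cancer.target

cancer_df.head(3)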

|  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | cancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |


In [6]:
# Base classifiers for the ensemble: two different algorithms
lr_clf = LogisticRegression()
knn_clf = KNeighborsClassifier(n_neighbors=8)

In [7]:
# Soft voting averages the predict_proba outputs of both classifiers
vo_clf = VotingClassifier(estimators=[('LR', lr_clf), ('KNN', knn_clf)], voting='soft')

In [16]:
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=300
)

In [17]:
# Train the ensemble and evaluate it on the held-out test set
vo_clf.fit(x_train, y_train)
y_pred = vo_clf.predict(x_test)

print('Voting classifier accuracy: {0:.4f}'.format(accuracy_score(y_test, y_pred)))

Voting classifier accuracy: 0.9211


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
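
The warning above appears to come from `LogisticRegression`, whose solver hit its iteration limit before converging on the unscaled features. A minimal sketch of the two remedies the message suggests (scaling the data and raising `max_iter`) might look like the following. The pipeline, the name `scaled_lr`, and the value `max_iter=1000` are assumptions for illustration and are not part of the original notebook, so the accuracies reported in this document do not apply to it.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=300
)

# Standardize the features before fitting; max_iter=1000 is an arbitrary,
# generous cap chosen for this sketch
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(x_train, y_train)

scaled_acc = accuracy_score(y_test, scaled_lr.predict(x_test))
print('Scaled LogisticRegression accuracy: {0:.4f}'.format(scaled_acc))
```

Because the scaler sits inside the pipeline, it is fitted on the training split only, and the same pipeline object could be passed to `VotingClassifier` in place of `lr_clf`.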


In [18]:
# Train, predict with, and evaluate each individual model

classifiers = [lr_clf, knn_clf]
for classifier in classifiers:
    classifier.fit(x_train, y_train)
    pred = classifier.predict(x_test)
    class_name = classifier.__class__.__name__
    print('{0} accuracy: {1:.4f}'.format(class_name, accuracy_score(y_test, pred)))

LogisticRegression accuracy: 0.9123
KNeighborsClassifier accuracy: 0.9298


