---
title:  "What is Stacking"
excerpt: "Stacking, Ensemble"

categories:
  - Machine-Learning
tags:
  - Stacking
  - Ensemble
  - Medium
last_modified_at: 2020-05-09T21:15:00-05:00
---

## Reference  
- 이수진's Machine Learning: Stacking Ensemble (https://lsjsj92.tistory.com/559?category=853217)
- 이수진's Machine Learning: Stacking Ensemble Basics (https://lsjsj92.tistory.com/558?category=853217)

## Stacking-Ensemble Basic

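- Stacking builds a meta-learner whose training set is the output of the individual learners, then trains it to make the final prediction. It is a kind of ensemble.
- The name is short for stacked generalization.  
- It therefore needs two kinds of classifiers.
 > The individual (base) classifiers  
 > A meta classifier that uses the individual classifiers' predictions as its training set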
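
The two levels described above can be sketched with scikit-learn's `StackingClassifier`. This is a minimal illustration and not the method used in the cells of this post: it assumes scikit-learn 0.22 or later (where `StackingClassifier` was added), and the choice of `LogisticRegression` as the meta classifier and the hyperparameter values are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Level 0: the individual (base) classifiers.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("nb", GaussianNB()),
    ("svc", SVC(gamma="auto")),
]

# Level 1: the meta classifier, trained on the base learners' outputs.
# LogisticRegression is an assumption chosen for illustration.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # meta features come from 5-fold out-of-fold predictions
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(stack_accuracy)
```

Because the meta classifier is trained on out-of-fold predictions from the training set only, the test set stays untouched until the final score.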

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set() # setting seaborn default for plots
import numpy as np

In [52]:
# Importing Classifier Modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer,load_iris
from sklearn import model_selection
from sklearn.model_selection import cross_val_score,KFold

In [3]:
data = load_iris()

In [4]:
x = data.data
y = data.target
x_train,x_test,y_train,y_test = model_selection.train_test_split(x,y,test_size=0.2, random_state=42)

In [5]:
print(x_train.shape,x_test.shape)

(120, 4) (30, 4)


In [6]:
x_train[0:4]

array([[4.6, 3.6, 1. , 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [6.7, 3.1, 4.4, 1.4],
       [4.8, 3.4, 1.6, 0.2]])

In [27]:
from sklearn.preprocessing import StandardScaler,MinMaxScaler

scaler = MinMaxScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
print(x_train_scaled.shape,x_test_scaled.shape)

(120, 4) (30, 4)


In [28]:
clf01 = RandomForestClassifier(n_estimators=13)
scoring = 'accuracy'
score = cross_val_score(clf01, x_train_scaled, y_train, cv=5, n_jobs=1, scoring=scoring)
print(np.mean(score))

0.9249710144927537


In [29]:
# Random Forest Score
round(np.mean(score)*100, 2)

92.5

In [30]:
clf02 = GaussianNB()
scoring = 'accuracy'
score = cross_val_score(clf02, x_train_scaled, y_train, cv=5, n_jobs=1, scoring=scoring)
print(score)

[0.96       1.         0.83333333 1.         0.91304348]


In [31]:
# Naive Bayes Score
round(np.mean(score)*100, 2)

94.13

In [32]:
clf03 = SVC(gamma='auto')
scoring = 'accuracy'
score = cross_val_score(clf03, x_train_scaled, y_train, cv=5, n_jobs=1, scoring=scoring)
print(np.mean(score))

0.9329420289855073


In [33]:
## Final model: meta clf
from lightgbm import LGBMClassifier,plot_importance
meta_clf = LGBMClassifier()

In [34]:
clf01.fit(x_train_scaled, y_train)
clf02.fit(x_train_scaled, y_train)
clf03.fit(x_train_scaled, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [35]:
clf01_pred = clf01.predict(x_test_scaled)
clf02_pred = clf02.predict(x_test_scaled)
clf03_pred = clf03.predict(x_test_scaled)

In [36]:
## y_test accuracy of each base model as-is
from sklearn.metrics import accuracy_score

cla01_y_test_rslt = accuracy_score(y_test, clf01_pred, normalize=True, sample_weight=None)
cla02_y_test_rslt = accuracy_score(y_test, clf02_pred, normalize=True, sample_weight=None)
cla03_y_test_rslt = accuracy_score(y_test, clf03_pred, normalize=True, sample_weight=None)
print(cla01_y_test_rslt,cla02_y_test_rslt,cla03_y_test_rslt)

1.0 1.0 1.0


In [37]:
new_data = np.array([clf01_pred,clf02_pred,clf03_pred])
new_data.shape

(3, 30)

In [38]:
new_data = np.transpose(new_data)

In [39]:
new_data.shape

(30, 3)
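
The reshaping step above turns one row per model into one row per sample, which is the layout the meta classifier expects. A small sketch with made-up predictions (the arrays `pred_a`, `pred_b`, `pred_c` are hypothetical) shows that `np.column_stack` gives the same matrix as the `np.array` plus `np.transpose` pair:

```python
import numpy as np

# Toy predictions from three hypothetical base models on five samples.
pred_a = np.array([0, 1, 2, 1, 0])
pred_b = np.array([0, 1, 1, 1, 0])
pred_c = np.array([0, 2, 2, 1, 0])

# Stacking the 1-D arrays gives one row per model: shape (3, 5).
by_model = np.array([pred_a, pred_b, pred_c])

# The meta classifier needs one row per sample, so transpose: shape (5, 3).
by_sample = by_model.T

# np.column_stack builds the same (n_samples, n_models) matrix in one call.
meta_features = np.column_stack([pred_a, pred_b, pred_c])

print(by_model.shape, by_sample.shape)             # (3, 5) (5, 3)
print(np.array_equal(by_sample, meta_features))    # True
```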

In [40]:
x_test_scaled.shape

(30, 4)

In [41]:
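# Caution: the meta classifier is fit on the test-set predictions and scored
# on those same rows, so this is not a held-out accuracy.
meta_clf.fit(new_data,y_test)
meta_predict = meta_clf.predict(new_data)
meta_y_test_rslt = accuracy_score(y_test, meta_predict, normalize=True, sample_weight=None)
print(np.mean(meta_y_test_rslt))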

0.36666666666666664


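The result is quite poor, but let's move on. Note that 0.3667 is 11/30, which is what predicting a single class for every row would give. A likely cause is that the meta classifier sees only 30 rows: with LightGBM's default `min_child_samples=20`, no split can leave 20 rows on both sides, so the trees cannot split at all.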

## Following the Original Reference Blog: Practice with the breast_cancer Data

In [44]:
data = load_breast_cancer()
x_data = data.data
y_data = data.target
X_train,X_test,y_train,y_test = model_selection.train_test_split(x_data,y_data,test_size = 0.2, random_state=0)

In [53]:
svm = SVC(gamma= 'auto')
rf = RandomForestClassifier(n_estimators=100,random_state=0)
lr = LogisticRegression()

In [54]:
svm.fit(X_train,y_train)
rf.fit(X_train,y_train)
lr.fit(X_train,y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [55]:
svm_pred = svm.predict(X_test)
rf_pred = rf.predict(X_test)
lr_pred = lr.predict(X_test)

In [56]:
new_data = np.array([svm_pred,rf_pred,lr_pred])
new_data.shape

(3, 114)

In [57]:
new_data = np.transpose(new_data)
new_data.shape

(114, 3)

In [59]:
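# Caution: the meta classifier is fit on the test-set predictions and scored
# on those same rows, so this is not a held-out accuracy.
lgbm = LGBMClassifier()
lgbm.fit(new_data,y_test)
lgbm_pred = lgbm.predict(new_data)
meta_y_test_rslt = accuracy_score(y_test, lgbm_pred, normalize=True, sample_weight=None)
print(np.mean(meta_y_test_rslt))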

0.9824561403508771
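
The score above is measured on the same rows the meta classifier was fit on. One way to get a held-out number, sketched here as a hold-out (blending) variant and not the approach of the cells above, is to reserve part of the training set for fitting the meta classifier. The extra split, the 0.3 ratio, the helper `stack_predictions`, and the use of `LogisticRegression` as a stand-in meta classifier (instead of `LGBMClassifier`) are all assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Hypothetical extra split: rows reserved for fitting the meta classifier only.
X_base, X_hold, y_base, y_hold = train_test_split(
    X_train, y_train, test_size=0.3, random_state=0
)

base_models = [
    SVC(gamma="auto"),
    RandomForestClassifier(n_estimators=100, random_state=0),
    LogisticRegression(max_iter=10000),
]
for model in base_models:
    model.fit(X_base, y_base)


def stack_predictions(models, X):
    """Return an (n_samples, n_models) matrix of base-model predictions."""
    return np.column_stack([m.predict(X) for m in models])


# Fit the meta classifier on hold-out predictions, never on the test set.
meta_model = LogisticRegression()
meta_model.fit(stack_predictions(base_models, X_hold), y_hold)

# Score on test rows that neither level saw during fitting.
test_features = stack_predictions(base_models, X_test)
meta_pred = meta_model.predict(test_features)
holdout_accuracy = accuracy_score(y_test, meta_pred)
print(holdout_accuracy)
```

The trade-off is that the base models train on fewer rows; the K-fold variant avoids that by reusing every training row for out-of-fold predictions.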


## Stacking-Ensemble K-Fold