**Documentation:** 
- [scikit-learn](http://scikit-learn.org/stable/user_guide.html)
- [pandas](http://pandas.pydata.org/pandas-docs/stable/)
- [numpy](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.html)
- [matplotlib](https://matplotlib.org/2.0.2/users/pyplot_tutorial.html)
- [vecstack](https://github.com/vecxoz/vecstack)

<br><font color = "#CC3D3D">
# Ensemble Learning #

#### Data Preparation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
data.data.shape, data.target.mean()

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

<font color = "#CC3D3D">
### Ensemble with different models

### Voting ensemble

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

logreg = LogisticRegression(random_state=0)
tree = DecisionTreeClassifier(max_depth=7, random_state=0)
knn = KNeighborsClassifier()
voting = VotingClassifier(
    estimators = [('logreg', logreg), ('tree', tree), ('knn', knn)],
    voting = 'hard')

In [None]:
from sklearn.metrics import accuracy_score

for clf in (logreg, tree, knn, voting) :
    clf.fit(X_train, y_train)
    print(clf.__class__.__name__, accuracy_score(
        y_test, clf.predict(X_test)))
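
To make the `voting = 'hard'` setting concrete, here is a minimal sketch of majority voting on made-up predictions. The helper name `majority_vote` and the toy arrays are illustrative assumptions and are not part of the notebook's data.

```python
import numpy as np

def majority_vote(predictions):
    """Return the most frequent class label per sample.

    predictions: array-like of shape (n_models, n_samples) with integer labels.
    """
    predictions = np.asarray(predictions)
    # Count the votes sample by sample; argmax breaks ties toward the smaller label
    return np.array([np.bincount(col).argmax() for col in predictions.T])

# Hypothetical predictions of three classifiers on five samples
preds = [[1, 0, 1, 1, 0],
         [1, 1, 0, 1, 0],
         [0, 0, 1, 1, 1]]
votes = majority_vote(preds)
print(votes)
```

Each sample receives the label that at least two of the three models agree on, which is what a hard-voting ensemble of three binary classifiers does.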

*Plotting Decision Regions*

In [None]:
import matplotlib.gridspec as gridspec
import itertools
from mlxtend.plotting import plot_decision_regions

X, y = X_train, y_train
X = X[:,[0, 10]]

gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10, 8))

labels = ['Logistic Regression',
          'Decision Tree',
          'k-NN',
          'Voting Ensemble']

for clf, lab, grd in zip([logreg, tree, knn, voting],
                         labels,
                         itertools.product([0, 1],
                         repeat=2)):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    # Draw on the current subplot without overwriting the `fig` handle
    plot_decision_regions(X=X, y=y, clf=clf, legend=2, ax=ax)
    plt.title(lab)

plt.show()

### Averaging predictions

*Arithmetic mean*

In [None]:
averaging = VotingClassifier(
    estimators = [('logreg', logreg), ('tree', tree), ('knn', knn)],
    voting = 'soft')
averaging.fit(X_train, y_train)

In [None]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, averaging.predict_proba(X_test)[:,1])
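
As a sanity check on what `voting = 'soft'` computes, the sketch below averages the class probabilities of the fitted base models by hand and compares the result with the ensemble's own `predict_proba`. It is self-contained and, to keep it short, uses only two base models; the variable names (`soft`, `manual`, `X_tr`, `X_te`) are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

soft = VotingClassifier(
    estimators=[('tree', DecisionTreeClassifier(max_depth=7, random_state=0)),
                ('knn', KNeighborsClassifier())],
    voting='soft').fit(X_tr, y_tr)

# Arithmetic mean of the per-model class probabilities, computed by hand
manual = np.mean([est.predict_proba(X_te) for est in soft.estimators_], axis=0)

# Compare with the probabilities returned by the ensemble itself
print(np.allclose(manual, soft.predict_proba(X_te)))
```

With no `weights` given, every model contributes equally to the mean.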

<img align='left' src="https://t1.daumcdn.net/cfile/tistory/2454233C57FA242D11">

*Geometric mean*

In [None]:
from scipy.stats.mstats import gmean

pred_logreg = logreg.fit(X_train, y_train).predict_proba(X_test)[:,1]
pred_tree = tree.fit(X_train, y_train).predict_proba(X_test)[:,1]
pred_knn = knn.fit(X_train, y_train).predict_proba(X_test)[:,1]

roc_auc_score(y_test, gmean([pred_logreg, pred_tree, pred_knn], axis=0))
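
The difference between the two means is easiest to see on a toy example. The probabilities below are made up for illustration. The geometric mean is computed in log space, which should match `gmean` for strictly positive inputs.

```python
import numpy as np

# Hypothetical positive-class probabilities: rows are models, columns are samples
probs = np.array([[0.9, 0.5],
                  [0.9, 0.5],
                  [0.1, 0.5]])

arithmetic = probs.mean(axis=0)
# Geometric mean via logs: exp(mean(log p))
geometric = np.exp(np.log(probs).mean(axis=0))

print(arithmetic)
print(geometric)
```

When the models disagree (first sample), the geometric mean is pulled toward the lowest probability. When they agree (second sample), both means coincide. A single probability of exactly 0, which a decision tree can produce, drives the geometric mean to 0.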

### Stacking 
<br>
<img align='left' src="https://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier_files/stackingclassification_overview.png" width=500 height=400>

<font color = "blue">
Install the **vecstack** package with the following command:
```
!pip install vecstack
```

In [None]:
#!pip install vecstack

In [None]:
from vecstack import stacking

models = [logreg, tree, knn]
S_train, S_test = stacking(models,                     # list of models
                           X_train, y_train, X_test,   # data
                           regression=False,           # classification task (if you need 
                                                       #     regression - set to True)
                           needs_proba=False,          # predict class labels (if you need 
                                                       #     probabilities - set to True) 
                           metric=accuracy_score,      # metric: callable
                           n_folds=4,                  # number of folds
                           stratified=True,            # stratified split for folds
                           shuffle=True,               # shuffle the data
                           random_state=0,             # ensure reproducibility
                           verbose=2)                  # print all info
# Train a separate meta-model so the base `logreg` is not overwritten
meta_model = LogisticRegression(random_state=0).fit(S_train, y_train)
accuracy_score(y_test, meta_model.predict(S_test))
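
The idea behind `S_train` can be sketched with scikit-learn alone: each training row is predicted by a model that did not see that row during fitting (out-of-fold prediction). The sketch below is an approximation under stated assumptions, not vecstack's implementation. In particular, it refits each base model on the full training set to build the test features, which may differ from how vecstack constructs `S_test`. The names `S_tr`, `S_te`, and `meta` are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base_models = [DecisionTreeClassifier(max_depth=7, random_state=0),
               KNeighborsClassifier()]
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Level 0 train features: out-of-fold class predictions, one column per model
S_tr = np.column_stack([cross_val_predict(m, X_tr, y_tr, cv=cv) for m in base_models])
# Level 0 test features: base models refit on the full training set (assumption)
S_te = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in base_models])

# Level 1 (meta) model trained on the stacked predictions
meta = LogisticRegression().fit(S_tr, y_tr)
stack_acc = accuracy_score(y_te, meta.predict(S_te))
print(S_tr.shape, S_te.shape, stack_acc)
```

Using out-of-fold predictions keeps the meta-model from learning on predictions the base models made for rows they were trained on.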

<br>
#### Improvements in stacked model performance can be accomplished by:
- Adding models to Level 0 and Level 1 using different algorithms
- Tuning hyperparameters
- Adding feature sets by feature engineering
- Adding levels in the model structure
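
As one possible illustration of the first point, the sketch below adds a third Level 0 model and feeds class probabilities rather than labels to the Level 1 model. It uses scikit-learn's `StackingClassifier` instead of vecstack, so it is a swapped-in technique. The choice of `GaussianNB` as the extra model is an assumption made only for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

level0 = [('tree', DecisionTreeClassifier(max_depth=7, random_state=0)),
          ('knn', KNeighborsClassifier()),
          ('nb', GaussianNB())]            # extra Level 0 model (illustrative choice)

stack = StackingClassifier(
    estimators=level0,
    final_estimator=LogisticRegression(),  # Level 1 model
    stack_method='predict_proba',          # pass probabilities instead of labels
    cv=4)
stack_score = stack.fit(X_tr, y_tr).score(X_te, y_te)
print(stack_score)
```

Whether the extra model helps depends on the data, so compare the score with the stacked model above before keeping it.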

#### Model Stacking in Kaggle: 
- [1st Place Solution of "Home Depot Product Search Relevance"](http://blog.kaggle.com/2016/05/18/home-depot-product-search-relevance-winners-interview-1st-place-alex-andreas-nurlan/)
<img src="http://s5047.pcdn.co/wp-content/uploads/2016/05/model-1024x768.png" width=700 height=500>
- [1st Place Solution of "Otto Group Product Classification Challenge"](https://www.kaggle.com/c/otto-group-product-classification-challenge/discussion/14335)
<br>
<img src="https://kaggle2.blob.core.windows.net/forum-message-attachments/79598/2514/FINAL_ARCHITECTURE.png" width=600 height=400>
<br><br>
- [1st Place Solution of "Homesite Quote Conversion"](http://blog.kaggle.com/2016/04/08/homesite-quote-conversion-winners-write-up-1st-place-kazanova-faron-clobber/)
<br>
<img src="http://blog.kaggle.com/wp-content/uploads/2016/04/ensemble.jpg" width=600 height=400>
<br><br>
- [1st Place Solution of "Avito Demand Prediction Challenge"](https://www.kaggle.com/c/avito-demand-prediction/discussion/59880)
<br>
<img src="https://pbs.twimg.com/media/DgvX3pWUYAAlCAK.jpg:large" width=650 height=500>

<font color='#CC3D3D'>
### Ensemble with the same model

### Bagging
<img src="http://drive.google.com/uc?export=view&id=1px4nXiYkoRZrPpnHlkYn0hWfGih9SHpB" width=650 height=500>

In [None]:
from sklearn.ensemble import BaggingClassifier

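# Pass the base model positionally: the keyword was renamed from
# `base_estimator` to `estimator` in newer scikit-learn releases
bagging = BaggingClassifier(KNeighborsClassifier(), n_estimators=200, random_state=0)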
bagging.fit(X_train, y_train).score(X_test, y_test)
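
By default, `BaggingClassifier` trains each base model on a bootstrap sample, that is, rows drawn with replacement from the training set. The sketch below draws one such sample from hypothetical row indices and measures how many distinct rows it contains. The sample size and seed are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# One bootstrap sample: draw n_samples row indices with replacement
boot_idx = rng.integers(0, n_samples, size=n_samples)
unique_fraction = np.unique(boot_idx).size / n_samples

print(round(unique_fraction, 3))
```

The fraction should come out close to 1 - 1/e, about 0.632, so each base model sees roughly two thirds of the distinct training rows. That variation between samples is what makes the averaged ensemble more stable than a single model.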

### Boosting
<img src="https://api.ning.com/files/p15FqbgoJADVRIs8f5gZ0AA9rStpTf5RhEjk042yzPGQ4MkxRz1vgGkmb0laAusVQ6eiJ5FTD8Y80zOFTXhDz3kDeBa4jBix/Capture.PNG" width=600 height=400>
- AdaBoost (Adaptive Boosting)
- Gradient Boosting

In [None]:
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=200, random_state=0)
ada.fit(X_train, y_train).score(X_test, y_test)
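
Because boosting adds one weak learner at a time, it can be useful to look at the score after each round rather than only at the end. The sketch below uses `staged_score` for that. It is self-contained and uses a smaller `n_estimators` than the cell above purely to keep it fast. The names `ada_small` and `staged` are assumptions for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ada_small = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# staged_score yields the test accuracy after each boosting iteration
staged = list(ada_small.staged_score(X_te, y_te))
print(len(staged), staged[0], staged[-1])
```

Plotting `staged` against the iteration number is one way to judge whether more estimators are still helping.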

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(n_estimators=200, random_state=0)
gbm.fit(X_train, y_train).score(X_test, y_test)
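
The core loop of gradient boosting can be sketched by hand: each new tree is fit to the residuals of the current ensemble. The example below is a simplified regression version with squared loss on synthetic data, not the classifier used above or scikit-learn's implementation. The data, the learning rate, and the tree depth are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))            # synthetic 1-D inputs
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
pred = np.full_like(y_toy, y_toy.mean())             # start from a constant model
mse_start = np.mean((y_toy - pred) ** 2)

for _ in range(100):
    residual = y_toy - pred                          # negative gradient of squared loss
    small_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residual)
    pred += learning_rate * small_tree.predict(X_toy)  # shrunken additive update

mse_end = np.mean((y_toy - pred) ** 2)
print(mse_start, mse_end)
```

The training error shrinks as trees are added. `GradientBoostingClassifier` follows the same additive pattern with a classification loss in place of squared error.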

<font color='#CC3D3D'>
### Performance evaluation of ensemble methods

In [None]:
from sklearn.model_selection import KFold #for K-fold cross validation
from sklearn.model_selection import cross_val_score #score evaluation
from sklearn.model_selection import cross_val_predict #prediction

# k=10: split the data into 10 parts; shuffle=True is required when random_state is set
kfold = KFold(n_splits=10, shuffle=True, random_state=22)
cv_mean = []    # mean CV accuracy per model
accuracy = []   # per-fold accuracies per model
std = []        # standard deviation of CV accuracy per model
classifiers = ['Voting','Averaging','Bagging','AdaBoost','Gradient Boosting']
models = [voting, averaging, bagging, ada, gbm]

for model in models:
    cv_result = cross_val_score(model, X_train, y_train, cv = kfold, scoring = "accuracy")
    cv_mean.append(cv_result.mean())
    std.append(cv_result.std())
    accuracy.append(cv_result)

models_dataframe = pd.DataFrame({'CV Mean':cv_mean,'Std':std},index=classifiers)
plt.subplots(figsize=(12,6))
box = pd.DataFrame(accuracy,index=classifiers)
box.T.boxplot()

<font color = "#CC3D3D">
## End