# Ensemble Learning Algorithms
> Many heads are better than one
* Bagging
* Boosting
* Stacking

## Voting Classifier
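* Combine several different models trained on the same dataset and predict the class that receives the majority of votes
* The ensemble often performs slightly better than its individual models
* The gain is small here because all the models are trained on the same dataset, so they tend to make correlated errors

**The classifiers should be as independent of each other as possible**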
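Why independence matters can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: a binary problem, 101 hypothetical classifiers that are each right 60% of the time, and two extreme cases (fully independent versus perfectly correlated). The function name `majority_vote_accuracy` and all of the numbers are illustrative and do not come from the cells in this notebook.

```python
import random

def majority_vote_accuracy(n_classifiers, p_correct, n_trials, independent, seed=0):
    """Estimate how often a majority vote is right on a binary problem."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        if independent:
            # Each classifier is right or wrong on its own.
            votes = [rng.random() < p_correct for _ in range(n_classifiers)]
        else:
            # Perfectly correlated: every classifier makes the same call.
            votes = [rng.random() < p_correct] * n_classifiers
        # The ensemble is right when more than half of the votes are right.
        if sum(votes) > n_classifiers / 2:
            hits += 1
    return hits / n_trials

independent_acc = majority_vote_accuracy(101, 0.6, 5000, independent=True)
correlated_acc = majority_vote_accuracy(101, 0.6, 5000, independent=False)
print("independent classifiers:", independent_acc)
print("correlated classifiers: ", correlated_acc)
```

With independent errors the vote is right far more often than any single classifier, while perfectly correlated classifiers gain nothing from voting. Real models trained on the same data fall somewhere in between.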


In [2]:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Toy binary dataset: two interleaving half-moons with noise
moon_data = make_moons(n_samples=10000, noise=0.3)
inputs = moon_data[0]  # named `inputs` to avoid shadowing the built-in input()
labels = moon_data[1]

# A single stratified 70/30 train/test split
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3)
lg_reg = LogisticRegression()
rn_clf = RandomForestClassifier()
svm_clf = SVC()

# Hard voting: predict the class chosen by most estimators
voting_clf = VotingClassifier(
    estimators=[('lr', lg_reg), ('rf', rn_clf), ('svm', svm_clf)],
    voting='hard')

# Soft voting: average the predicted class probabilities (SVC needs probability=True)
better_voting_clf = VotingClassifier(
    estimators=[('lr', lg_reg), ('rf', rn_clf), ('svm', SVC(probability=True))],
    voting='soft')

for train_index, test_index in sss.split(inputs, labels):
  train_X, test_X = inputs[train_index], inputs[test_index]
  train_Y, test_Y = labels[train_index], labels[test_index]

# Fit and score every model; the voting classifiers fit clones of their estimators
for clf in (lg_reg, rn_clf, svm_clf, voting_clf, better_voting_clf):
  clf.fit(train_X, train_Y)
  pred = clf.predict(test_X)
  print(clf.__class__.__name__, accuracy_score(test_Y, pred))







LogisticRegression 0.8593333333333333
RandomForestClassifier 0.9043333333333333
SVC 0.9186666666666666
VotingClassifier 0.915
VotingClassifier 0.9166666666666666


## Bagging & Pasting
> Train the same learning algorithm several times, each time on a different random subset of the training set
* Bagging : sampling with replacement (the same instance can appear more than once in a predictor's training subset)
* Pasting : sampling without replacement (every instance appears at most once in a predictor's training subset)
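The difference between the two sampling schemes can be sketched with the standard library alone. The toy sizes (`n_train`, `n_draw`) and the seed are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)
n_train = 10  # toy training set, instances indexed 0..9
n_draw = 6    # instances drawn for one predictor

# Bagging: sample WITH replacement, so an index may appear several times.
bagging_idx = [rng.randrange(n_train) for _ in range(n_draw)]

# Pasting: sample WITHOUT replacement, so every index is unique.
pasting_idx = rng.sample(range(n_train), n_draw)

print("bagging:", sorted(bagging_idx))
print("pasting:", sorted(pasting_idx))
```

In scikit-learn, `BaggingClassifier(bootstrap=True)` performs bagging and `bootstrap=False` switches to pasting.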



In [3]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

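# 500 trees, each trained on 100 instances sampled with replacement (bootstrap=True)
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(train_X, train_Y)
bag_pred = bag_clf.predict(test_X)
print(accuracy_score(test_Y, bag_pred))  # accuracy_score expects (y_true, y_pred)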

0.9166666666666666


### Out-of-bag evaluation
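> With bagging, some training instances are never sampled for a given predictor. These out-of-bag (OOB) instances were not seen by that predictor during training, so they can be used to evaluate the ensemble without a separate validation set.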
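How many instances end up out-of-bag can be estimated with a quick simulation. The sketch assumes the bootstrap sample is as large as the training set (the `max_samples=1.0` case); the set size `n` and the seed are arbitrary.

```python
import math
import random

rng = random.Random(0)
n = 2000  # training-set size; the bootstrap sample has the same size

# Draw n indices with replacement and keep the distinct ones.
sampled = {rng.randrange(n) for _ in range(n)}

# Instances that were never drawn are out-of-bag for this predictor.
oob_fraction = 1 - len(sampled) / n

print("simulated OOB fraction:", round(oob_fraction, 3))
print("limit (1 - 1/n)^n -> 1/e:", round(math.exp(-1), 3))
```

Each predictor therefore leaves roughly 37% of the training set unseen, and a different 37% for each predictor.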




In [4]:
# Validation using the OOB instances in bagging
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=1.0, bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(train_X, train_Y)
bag_pred = bag_clf.predict(test_X)
print(accuracy_score(test_Y, bag_pred))  # accuracy on the held-out test set
print(bag_clf.oob_score_)                # accuracy estimated from the OOB instances


0.901
0.896


## Random Forests
* An ensemble of Decision Trees, trained via the bagging or pasting method

### Extra-Trees (Extremely Randomized Trees)
* Faster to train, because each split uses a random threshold instead of searching for the best one

### Feature Importance
> Measured by how much, on average, the tree nodes that use a feature reduce impurity

In [5]:
# Fit a random forest and inspect the impurity-based importance of each feature
rnd_clf = RandomForestClassifier(n_estimators=100, bootstrap=True, n_jobs=-1)
rnd_clf.fit(train_X, train_Y)
print(rnd_clf.feature_importances_)

[0.45617774 0.54382226]
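Extra-Trees are not exercised in the cell above, so here is a minimal side-by-side sketch. It assumes a fresh moons dataset with fixed seeds; the names `X_tr`, `X_te`, `models` and `scores` are illustrative, and the numbers it prints are not results from this notebook.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=2000, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "extra-trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # mean accuracy on the test split
    # feature_importances_ is normalized so that the values sum to 1
    print(name, scores[name], model.feature_importances_)
```

Both classes expose the same interface, so swapping one for the other is a one-line change. Which one generalizes better depends on the data and is usually settled by cross-validation.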


## Boosting (Hypothesis boosting)
> Train predictors sequentially, each one trying to correct its predecessor.
* Most popular methods
  * AdaBoost (Adaptive Boosting)
    * Trains models sequentially, increasing the weight of the instances that the previous model misclassified.
  * Gradient Boosting
    * Trains models sequentially, fitting each new model to the residual errors of the previous stage
    * Early stopping is useful for finding the optimal number of estimators
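The AdaBoost reweighting step can be made concrete with one round worked by hand. This is a sketch of one common formulation (discrete AdaBoost for a binary problem), not necessarily the exact variant scikit-learn runs; the labels, the weak learner's predictions and the learning rate are made-up values.

```python
import math

y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, -1]  # hypothetical weak-learner output; indices 1 and 4 are wrong
weights = [1 / len(y_true)] * len(y_true)  # instance weights start out uniform
learning_rate = 1.0

# Weighted error rate of this weak learner.
missed = [t != p for t, p in zip(y_true, y_pred)]
error = sum(w for w, m in zip(weights, missed) if m) / sum(weights)

# Predictor weight: the more accurate the learner, the larger its say.
alpha = learning_rate * math.log((1 - error) / error)

# Boost the weights of the misclassified instances, then renormalize.
new_weights = [w * math.exp(alpha) if m else w for w, m in zip(weights, missed)]
total = sum(new_weights)
new_weights = [w / total for w in new_weights]

print("error:", error, "alpha:", round(alpha, 3))
print("new weights:", [round(w, 3) for w in new_weights])
```

The two misclassified instances go from a weight of 0.2 to 0.25, while the correctly classified ones drop to about 0.167, so the next learner pays more attention to the hard cases.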


In [7]:
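# AdaBoost in scikit-learn
from sklearn.ensemble import AdaBoostClassifier

# Decision stumps (max_depth=1) as weak learners.
# Note: algorithm="SAMME.R" is deprecated in recent scikit-learn releases; drop the argument if it is rejected.
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200,
                             algorithm="SAMME.R", learning_rate=0.5)

ada_clf.fit(train_X, train_Y)
ada_pred = ada_clf.predict(test_X)
print(accuracy_score(test_Y, ada_pred))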


0.916


In [17]:
# Gradient Boosting in scikit-learn

from sklearn.ensemble import GradientBoostingClassifier
import numpy as np

gb_clf = GradientBoostingClassifier(max_depth=2, n_estimators=100, learning_rate=0.1)
gb_clf.fit(train_X, train_Y)

# Accuracy after each boosting stage; stage i uses the first i + 1 trees
accus = [accuracy_score(test_Y, pred) for pred in gb_clf.staged_predict(test_X)]
best_esti = np.argmax(accus)  # 0-based stage index, so the best tree count is best_esti + 1

# Note: the stage is picked on the test set here, so the reported score is optimistic
print('best estimators : ', best_esti, accus[best_esti])



best estimators :  76 0.9173333333333333
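The cell above picks the best stage by looking at the test set. A less optimistic variant holds out a validation split for that decision, as in the sketch below. It assumes a fresh moons dataset with fixed seeds; the names `X_fit`, `X_val`, `best_n` and `gb_best` are illustrative, and its output is not a result from this notebook.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=2000, noise=0.3, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

gb = GradientBoostingClassifier(max_depth=2, n_estimators=100,
                                learning_rate=0.1, random_state=0)
gb.fit(X_fit, y_fit)

# Validation accuracy after 1, 2, ..., 100 trees.
val_scores = [accuracy_score(y_val, pred) for pred in gb.staged_predict(X_val)]
best_n = int(np.argmax(val_scores)) + 1  # +1 turns the 0-based index into a tree count

# Refit with only as many trees as the validation set asked for.
gb_best = GradientBoostingClassifier(max_depth=2, n_estimators=best_n,
                                     learning_rate=0.1, random_state=0)
gb_best.fit(X_fit, y_fit)
print("best number of trees:", best_n, "validation accuracy:", val_scores[best_n - 1])
```

`GradientBoostingClassifier` also accepts `n_iter_no_change` and `validation_fraction`, which stop training automatically once the validation score stops improving.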


## Stacking
* An enhanced version of the voting classifier
* Aggregates the base models' predictions with a trained ML model (a blender) instead of a trivial function such as majority voting





In [61]:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, ExtraTreesClassifier, StackingClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# A single stratified 75/25 train/test split of the digits dataset
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25)

digits = load_digits()

data = digits['data']
labels = digits['target']
for train_index, test_index in sss.split(data, labels):
  train_X, test_X = data[train_index], data[test_index]
  train_Y, test_Y = labels[train_index], labels[test_index]

rn_clf = RandomForestClassifier()
xtr_clf = ExtraTreesClassifier()
svc_clf = SVC(probability=True)
# Ready-made ensembles over the same base models (defined for reference, not fitted in this cell)
vt_clf = VotingClassifier(estimators=[('rn',rn_clf),('xtr', xtr_clf), ('sv', svc_clf)], voting='soft',n_jobs=-1, flatten_transform=True)
stk_clf = StackingClassifier(estimators=[('rn',rn_clf),('xtr', xtr_clf), ('sv', svc_clf)], n_jobs=-1)

# Cross-validate and test each base model, collecting its test-set predictions
pred_map = []
for clf in (rn_clf, xtr_clf, svc_clf):
  print(clf.__class__.__name__, cross_val_score(clf,train_X, train_Y))
  clf.fit(train_X, train_Y)
  pred = clf.predict(test_X)
  print(clf.__class__.__name__, accuracy_score(test_Y, pred))
  pred_map.append(np.array(pred))

# One row per test instance, one column per base model
pred_map = np.column_stack(pred_map)
print(pred_map)
  








RandomForestClassifier [0.97777778 0.98148148 0.95910781 0.97026022 0.98141264]
RandomForestClassifier 0.9777777777777777
ExtraTreesClassifier [0.98148148 0.98148148 0.9739777  0.97769517 0.98884758]
ExtraTreesClassifier 0.9844444444444445
SVC [0.98148148 0.97407407 0.98884758 0.98884758 0.99256506]
SVC 0.9911111111111112
[[1 1 1]
 [3 3 3]
 [3 3 3]
 ...
 [2 2 2]
 [3 3 3]
 [9 9 9]]


In [74]:
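from sklearn.tree import DecisionTreeClassifier

# Manual blender: a decision tree that takes the three base predictions as features.
# Caution: the blender is fitted on test-set predictions and scored on that same test set,
# so the accuracy below is optimistic; an unbiased estimate needs a separate hold-out set
# or out-of-fold predictions.
blender = DecisionTreeClassifier(max_depth=10)
blender.fit(pred_map, test_Y)
pred_val = []
for clf in (rn_clf, xtr_clf, svc_clf):
  clf.fit(train_X, train_Y)
  pred = clf.predict(test_X)
  pred_val.append(np.array(pred))

pred_val = np.column_stack(pred_val)
bl_pred = blender.predict(pred_val)
accuracy_score(test_Y, bl_pred)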

0.9911111111111112
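The manual blender above learns from predictions on the very test set it is scored on. One way to avoid that is scikit-learn's `StackingClassifier`, which trains the final estimator on out-of-fold predictions from the training set only. The sketch below assumes a fresh split with fixed seeds, 50 trees per forest and a logistic-regression blender; these are illustrative choices, not the settings used in the cells above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("xt", ExtraTreesClassifier(n_estimators=50, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ],
    # The blender is trained on cross-validated predictions of the base models.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

# The test set is used only once, for the final score.
stack_acc = stack.score(X_test, y_test)
print("stacking accuracy:", stack_acc)
```

Because the test split never reaches the blender during training, this score is directly comparable with the base-model test accuracies.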