## Ensemble ##
* A group of predictors is called an ensemble and thus the technique is called ensemble learning and an ensemble learning algorithm is called an Ensemble method.

** Popular Ensemble Methods **
* Bagging
* Boosting
* Stacking

** Voting Classifiers **
* Aggregating classifiers and predicting with the most votes i.e. a majority vote classifier is called a hard voting classifier.

* Even if some of the classifiers are weak learners(meaning slightly better than random guessing) the ensemble can be a strong learner,given there are sufficient number of weak learners and they are sufficiently diverse.

 
** Use diverse predictors for ensemble as this may improve the ensembles accuracy as this increases the chances that they will make very different types of errors**

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [28]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

sv_clf=SVC(random_state=42)
rnd_clf=RandomForestClassifier(random_state=42)
log_clf=LogisticRegression(random_state=42)

voting_clf=VotingClassifier(
                             estimators=[("sv_clf",sv_clf),
                                        ("rnd_clf",rnd_clf),
                                        ("log_clf",log_clf)],
                             voting="hard"
                            )

for clf in [sv_clf,rnd_clf,sgd_clf,voting_clf]:
    clf.fit(X_train,y_train)
    y_pred=clf.predict(X_test)
    print(clf.__class__.__name__,accuracy_score(y_test,y_pred))

('SVC', 0.88800000000000001)
('RandomForestClassifier', 0.872)
('SGDClassifier', 0.85599999999999998)
('VotingClassifier', 0.89600000000000002)


** Soft Voting **
* Instead of going for majority vote if the probabilities are considered and average is taken and classification is done based on same.
* For soft voting the estimators must have predict_proba as a function and that is they have to estimate the probabilites.
* For above changing the voting to "soft" will transform it to soft classifier

** Bagging And Pasting**

* Using same training algorithm for every predictor but training on different random subsets of the training set.
* Sampling done with replacement is called bagging(bootstrap aggregating)
* Sampling done without replacement is called Pasting
* For aggregation of results typically statistical mode(most frequent) is used for classification and average is used for aggregation.

** Scaling Aspect **
* The above methods are highly scalable as training and predictions can be done in parallel.

In [40]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf=BaggingClassifier(DecisionTreeClassifier(random_state=42),n_estimators=500,max_samples=100,n_jobs=-1,bootstrap=True,random_state=42)
bag_clf.fit(X_train,y_train)

accuracy_score(y_test,bag_clf.predict(X_test))

0.90400000000000003

**Note **
* BaggingClassifier automatically performs soft voting if the base classifier has predict_proba
* For pasting bootstrap needs to be set to False

** Out of Bag Evaluation **
* With bagging(bootstrap= True) during sampling only some of the samples(63%) are used for training of the data and the remaining 37% are called out of bag(oob) instances.
* They are not the same 37% for all predictors.
* Now a cross validation can be made possible on evaluating the predictors on the oob instances.

In [46]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                             n_estimators=500,n_jobs=-1,bootstrap=True,
                             oob_score=True)
bag_clf.fit(X_train,y_train)
print(bag_clf.oob_score_)
y_pred=bag_clf.predict(X_test)
print(accuracy_score(y_test,y_pred))

0.888
0.904


In [49]:
bag_clf.oob_decision_function_

array([[ 0.42613636,  0.57386364],
       [ 0.37096774,  0.62903226],
       [ 1.        ,  0.        ],
       [ 0.        ,  1.        ],
       [ 0.        ,  1.        ],
       [ 0.09142857,  0.90857143],
       [ 0.38118812,  0.61881188],
       [ 0.01604278,  0.98395722],
       [ 0.9902439 ,  0.0097561 ],
       [ 0.96276596,  0.03723404],
       [ 0.8       ,  0.2       ],
       [ 0.00578035,  0.99421965],
       [ 0.78172589,  0.21827411],
       [ 0.82065217,  0.17934783],
       [ 0.96236559,  0.03763441],
       [ 0.07978723,  0.92021277],
       [ 0.00571429,  0.99428571],
       [ 0.975     ,  0.025     ],
       [ 0.92899408,  0.07100592],
       [ 1.        ,  0.        ],
       [ 0.02688172,  0.97311828],
       [ 0.35078534,  0.64921466],
       [ 0.89204545,  0.10795455],
       [ 1.        ,  0.        ],
       [ 0.99029126,  0.00970874],
       [ 0.        ,  1.        ],
       [ 1.        ,  0.        ],
       [ 1.        ,  0.        ],
       [ 0.        ,

** Random Patches N Random Subspaces**
* As max_samples and bootstrap is for training set sampling max_features and bootstrap_features is for featire sampling
* Sampling both training set and feature set is called Random Patches
* Sampling feature set but keeping all training set(bootstrap=false and max_sample=1.0) is called Random subspaces method.
* Sampling features results in bit more diversity trading bit more bias for lower variance

** Random Forests **
* Random forests are ensemble of decisiontreeclassifiers generally trained via Bagging(sometimes pasting) typically with max_samples set to size of the training set.
* RF is optimized Bagging of dcf
* with few exceptions RF has all parameters of DCF and Bagging
* It also brings randomness in selection of best feature when splitting node by selecting the same from random subset of features.Again trading more bias for lower variance.

In [69]:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
rf_clf=RandomForestClassifier(n_estimators=500,max_leaf_nodes=16,n_jobs=-1,random_state=42)
rf_clf.fit(X_train,y_train)
y_pred_rf=rnd_clf.predict(X_test)

#Bagging rough equivalent of RF
bag_clf=BaggingClassifier(DecisionTreeClassifier(splitter="random",max_leaf_nodes=16)
                          ,n_estimators=500,max_samples=1.0,bootstrap=True,n_jobs=-1,random_state=42)
bag_clf.fit(X_train,y_train)
y_pred_bag=bag_clf.predict(X_test)
print(np.float(np.sum(y_pred_rf == y_pred_bag))/len(y_pred_rf))

0.92


** Extra Trees **
* Random forest can be made more random by selecting feature more randomly by using random thresholds.
* These are called Extremely Randomized Trees
* sklearn: ExtraTreesClassifier/ExtraTreesRegressor

** Feature Importances **
* Feature importances in Random Forest can be calculated by averaging the depth at which the feature appeared across all trees .
* Can be accessed through feature_importances_ variable.
* Can be used for feature selection.

** Boosting **
* Refers to any ensemble method that can combine several weak learners into a strong learner
* Main idea is to train sequentially,each trying to correct it's predecessor.
* Most popular one's:
* Adaptive Boosting(Adaboost)
* Gradient Boosting

** AdaBoost **

* Adaboost involves assigning weights to different training instances and construction of successor predictor such that more weight is assigned to underfitted instances in the predecessor.

* Each predictor is given a weight based on their performance and the same weight is used in assigning weight to the misclassified instances in successor predictor

* The prediction is made based on the a value calculated by weighted averaging of the probabilities or taking the weight of the predictors which classified the instance as one particular class.

* Sklearn uses multiclass version of AdaBoost called Stagewise Additive Modelling using a MultiClass Exponential loss function.

* Sklearn also uses a variant of SAMME called SAMME.R(R stands for Real) which relies on class probabilities rather than predictions.

** Drawback **
* The main drawback is it cannot be parallelized(only partially) since each predcitor is only trained after the predecessor training as it is dependent on the end result.

In [75]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),n_estimators=200,algorithm="SAMME.R",learning_rate=0.5)
ada_clf.fit(X_train,y_train)
print(accuracy_score(y_test,ada_clf.predict(X_test)))

0.896
