# Ensemble Models

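-- Ensembles work best when the models are substantially different (for example, trained with different algorithms), because they then tend to make different kinds of errors and the weaknesses of one are offset by the others.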

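-- If all models can estimate class probabilities (i.e., they have a *predict_proba* method), you can set *voting="soft"*. This often performs better than hard voting because it gives more weight to highly confident votes. Note that SVC only estimates probabilities when it is created with *probability=True*.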
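
To make the soft voting note concrete, here is a minimal sketch. The moons toy dataset, the split, and the hyperparameters are assumptions chosen only for illustration; they are not part of the original notes.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data, assumed here only so the example is self-contained
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

soft_voting = VotingClassifier(
    estimators=[
        ("linear", LogisticRegression(random_state=42)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
        # SVC only exposes predict_proba when probability=True
        ("svm", SVC(probability=True, random_state=42)),
    ],
    voting="soft",  # average the predicted class probabilities
)
soft_voting.fit(X_train, y_train)

# Averaged class probabilities and accuracy on the held-out split
proba = soft_voting.predict_proba(X_test)
soft_accuracy = soft_voting.score(X_test, y_test)
print(proba[:3])
print(soft_accuracy)
```

With *voting="soft"* the ensemble predicts the class with the highest average probability, so a single very confident model can outvote two hesitant ones.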

-- Another **way to ensemble models is to train the same model on different random samples of your training data**. Here we have *bagging (bootstrap aggregating, sampling with replacement)* and *pasting (sampling without replacement)*. Bagging often performs better, so it's usually preferred. Both scale well because the predictors are trained independently and can run in parallel.
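
The difference between the two sampling schemes can be sketched with NumPy alone. The sizes and the seed below are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_instances, sample_size = 10, 6
indices = np.arange(n_instances)

# Bagging: sample WITH replacement, so the same index can be drawn more than once
bagging_sample = rng.choice(indices, size=sample_size, replace=True)

# Pasting: sample WITHOUT replacement, so every index appears at most once
pasting_sample = rng.choice(indices, size=sample_size, replace=False)

# Instances never drawn for a bagging predictor are its "out-of-bag" instances
out_of_bag = np.setdiff1d(indices, bagging_sample)

print("bagging:", bagging_sample)
print("pasting:", pasting_sample)
print("out-of-bag:", out_of_bag)
```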

-- **Random Patches Method**: Sample both training instances and features.

-- **Random Subspaces Method**: Sample features but keep all training instances (i.e., bootstrap=False and max_samples=1.0, combined with bootstrap_features=True and/or max_features below 1.0).
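
Both methods can be sketched with `BaggingClassifier`. The synthetic dataset and the specific values (20 estimators, half of the features, 70% of the instances) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10 features, assumed only for this example
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)

# Random Subspaces: keep every training instance, sample only the features
subspaces = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5,
    random_state=0,
).fit(X, y)

# Random Patches: sample both the training instances and the features
patches = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5,
    random_state=0,
).fit(X, y)

# Feature indices drawn for the first predictor of each ensemble
print(subspaces.estimators_features_[0])
print(patches.estimators_features_[0])
```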

-- The aggregation function is typically the mode (most frequent value) for classification, or the average for regression.
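
A small NumPy sketch of both aggregation functions. The prediction arrays are made-up values, with one row per model and one column per instance.

```python
import numpy as np

# Hypothetical class predictions: 3 models (rows) x 4 instances (columns)
class_votes = np.array([
    [0, 1, 1, 2],
    [0, 1, 2, 2],
    [1, 1, 2, 0],
])
n_classes = 3

# Classification: count the votes per class for each instance, keep the most frequent
vote_counts = np.apply_along_axis(np.bincount, 0, class_votes, minlength=n_classes)
majority = vote_counts.argmax(axis=0)

# Hypothetical regression predictions: 3 models (rows) x 2 instances (columns)
reg_predictions = np.array([
    [2.0, 3.0],
    [4.0, 5.0],
    [6.0, 7.0],
])

# Regression: average the predictions of all models for each instance
average = reg_predictions.mean(axis=0)

print(majority)  # [0 1 2 2]
print(average)   # [4. 5.]
```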

-- Random Forests are great for examining feature importance, especially when you need to perform feature selection.
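
As a quick illustration, the sketch below reads `feature_importances_` from a fitted forest. The iris dataset and the number of trees are assumptions picked for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# The iris dataset ships with scikit-learn; it is used here only as an example
iris = load_iris()
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(iris.data, iris.target)

# feature_importances_ holds one score per feature, and the scores sum to 1
importances = dict(zip(iris.feature_names, forest.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Features with scores close to zero are candidates for removal during feature selection.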

-- **Boosting** trains several weak learners in sequence, with each new model trying to correct the errors of the one before it.
It does not scale as well as bagging, because each model depends on the previous one and training can't be fully parallelised. Gradient Boosting is a boosting ensemble that fits each new tree to the residual errors of the ensemble built so far.
A learning rate of 0.1 is fairly low, so more trees are needed to fit the training set, but the resulting model usually generalises better.
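
The residual-fitting idea can be sketched by hand with plain regression trees. This is a simplified illustration for squared error, not scikit-learn's `GradientBoostingRegressor` implementation; the noisy quadratic data, the tree depth, and the number of trees are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic toy data, assumed only for illustration
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant prediction
errors = [np.mean((y - prediction) ** 2)]
trees = []

for _ in range(50):
    residuals = y - prediction  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Each tree contributes only a small, shrunken correction
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    errors.append(np.mean((y - prediction) ** 2))

print(f"training MSE before boosting: {errors[0]:.5f}")
print(f"training MSE after 50 trees:  {errors[-1]:.5f}")
```

Because every tree is shrunk by the learning rate, the training error falls slowly, which is why a low learning rate needs more trees.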

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

logreg = LogisticRegression()
forest = RandomForestClassifier()
svm = SVC()

voting_classifier = VotingClassifier(
    estimators=[("linear", logreg), ("forest", forest), ("svm", svm)],
    voting="hard"
)

# fit the classifier
voting_classifier.fit(X_train,y_train)

# check each classifier accuracy on the test set
from sklearn.metrics import accuracy_score

for clf in (logreg, forest, svm, voting_classifier):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # print the class name of each classifier with its test accuracy
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

In [None]:
# Bagging and Pasting in Scikit-Learn

# the code below trains each predictor on 100 instances sampled from X_train
# the ensemble will probably generalise better than a single decision tree

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_classifier = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,  # bootstrap=False gives pasting instead of bagging (oob_score must then be False)
    n_jobs=-1,       # use all available CPU cores
    oob_score=True,  # evaluate on the "out-of-bag" instances, which each predictor never saw during training
)

bag_classifier.fit(X_train, y_train)

y_pred = bag_classifier.predict(X_test)

# because oob_score=True, the out-of-bag accuracy estimate is available as
bag_classifier.oob_score_
# you can also check the out-of-bag decision function
bag_classifier.oob_decision_function_
# for example, it shows each training instance's estimated probability of belonging to each class