# Ensemble Learning

When you want to purchase a new car, will you walk up to the first car shop and purchase one based on the advice of the dealer? It’s highly unlikely.

You would likely browse a few web portals where people have posted reviews, and compare different car models on features and price. You would probably also ask friends and colleagues for their opinions. In short, you would not jump straight to a conclusion; you would make a decision after weighing other people's opinions as well.

Ensemble models in machine learning operate on a similar idea. They combine the decisions from multiple models to improve the overall performance. 

# Table of Contents
1. [Voting Classifiers](#1-voting-classifiers)
2. [Bagging](#2-bagging)
3. [Random Forest](#third-example)[TODO]
4. [Boosting](#fourth-examplehttpwwwfourthexamplecom)  
4.1 [Ada Boost](#) [TODO]  
4.2 [Gradient Boost](#) [TODO]  
4.3 [XGBoost](#) [TODO]  
4.4 [LightGBM](#) [TODO]

## 1. Voting Classifiers

Suppose you have trained a few classifiers, each achieving about 80% accuracy: a Logistic Regression classifier, an SVM classifier, a Random Forest classifier, and a few more. How can we build a better classifier from them?


A simple way is to aggregate the predictions of all the classifiers and predict the class that gets the most votes. This majority-vote classifier is called a **hard voting** classifier.
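To make the voting rule concrete, here is a minimal sketch of a majority vote written directly in NumPy. The prediction matrix is made up for illustration, and `hard_vote` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

# Hypothetical predictions from three classifiers on five samples
# (rows = classifiers, columns = samples); the values are made up.
predictions = np.array([
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
])

def hard_vote(preds):
    """Return the most frequent class label in each column."""
    return np.array([np.bincount(column).argmax() for column in preds.T])

majority = hard_vote(predictions)
print(majority)  # [1 0 1 1 0]
```

Each sample receives the label that at least two of the three classifiers agree on, even though no single classifier produced exactly this row of predictions.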

<img src="https://images.theconversation.com/files/193473/original/file-20171106-1041-b3hljk.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=926&fit=clip" style="width:50%;margin-left: 200px;padding:auto;">

This voting classifier often achieves higher accuracy than the best classifier in the ensemble. In fact, even if each classifier is a **weak learner** (only slightly better than random guessing), the ensemble can still be a **strong learner**, provided there are enough weak learners and they are **sufficiently diverse**.

This may seem surprising. **How is this possible?**


It can be explained by the **law of large numbers**.

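Suppose you have a slightly biased coin that has a 51% chance of coming up heads and a 49% chance of coming up tails.
If you toss it 1,000 times, you will generally get roughly 510 heads and 490 tails, and hence a majority of heads. As you keep tossing the coin,
the ratio of heads gets closer and closer to the probability of heads (51%). In the same way, an ensemble of many classifiers that are each right only 51% of the time can be right far more often when it takes a majority vote, as long as their errors are independent.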

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Lawoflargenumbers.svg/400px-Lawoflargenumbers.svg.png"
     style="margin-left:200px">
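The coin argument can be checked numerically. The sketch below computes the exact binomial probability that more than half of `n` independent tosses come up heads. The function name `majority_probability` is made up for this illustration, and odd values of `n` are used so that ties cannot occur.

```python
from math import comb

def majority_probability(n, p):
    """Probability that more than half of n independent trials succeed,
    when each trial succeeds with probability p (exact binomial sum)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# The chance of a heads majority grows with the number of tosses.
for n in (1, 11, 101, 1001):
    print(n, round(majority_probability(n, 0.51), 3))
```

With about a thousand tosses the probability of a heads majority is already well above 70%, even though each toss is barely better than fair. Real classifiers trained on the same data are rarely independent, so this is an optimistic upper picture rather than a guarantee.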

**NOTE 1**: Ensemble methods work best when the predictors are as independent from one another as possible.
One way to get diverse classifiers is to train them using very different algorithms, which makes them more likely to make different kinds of errors.

In [34]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import numpy as np
from sklearn.model_selection import train_test_split 


# Fake some data. Note: X is drawn independently of y,
# so the features carry no information about the label.
y = np.array([1]*5000 + [0]*5000)
X = np.random.normal(loc=1, scale=0.5, size=10000).reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# train model
log_model = LogisticRegression(solver='lbfgs')
rnd_model = RandomForestClassifier(n_estimators=100)
svc_model = SVC(gamma='scale')
voting_model = VotingClassifier(
    estimators=[('lr', log_model), ('rf', rnd_model), ('svc', svc_model)],
    voting='hard'
)
# voting_model.fit(X_train, y_train)

In [35]:
from sklearn.metrics import accuracy_score
for model in (log_model, rnd_model, svc_model, voting_model):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(model.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.4893939393939394
RandomForestClassifier 0.5051515151515151
SVC 0.4990909090909091
VotingClassifier 0.5006060606060606


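These results are unimpressive because of the fake data: `X` is drawn from the same distribution for both classes, so no classifier can do much better than chance (about 50%).
Still, the example shows how to build a voting classifier with scikit-learn.
**Try it yourself!**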

**NOTE 2**: If all classifiers can estimate class probabilities (i.e. they have a `predict_proba()` method), you can tell sklearn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called **soft voting**. `SVC` does not estimate class probabilities by default, so it must be created with `probability=True`.
Now let us repeat the example with *soft voting*.

In [37]:
# define model 
log_model = LogisticRegression(solver='lbfgs')
rnd_model = RandomForestClassifier(n_estimators=100)
svc_model = SVC(gamma='scale', probability=True)
voting_model = VotingClassifier(
    estimators=[('lr', log_model), ('rf', rnd_model), ('svc', svc_model)],
    voting='soft'
)

# train and calculate accuracy 
for model in (log_model, rnd_model, svc_model, voting_model):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(model.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.4893939393939394
RandomForestClassifier 0.5060606060606061
SVC 0.4990909090909091
VotingClassifier 0.5054545454545455
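To see how soft voting can differ from hard voting, here is a small sketch that averages class probabilities by hand. The probability values are invented for illustration and do not come from the models above.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for two samples.
# Shape: (n_classifiers, n_samples, n_classes); the numbers are made up.
probas = np.array([
    [[0.55, 0.45], [0.40, 0.60]],
    [[0.10, 0.90], [0.45, 0.55]],
    [[0.60, 0.40], [0.90, 0.10]],
])

# Soft voting: average the probabilities, then pick the most likely class.
avg_proba = probas.mean(axis=0)
soft_pred = avg_proba.argmax(axis=1)

# Hard voting: each classifier votes for its own most likely class.
votes = probas.argmax(axis=2)  # shape (n_classifiers, n_samples)
hard_pred = np.array([np.bincount(col, minlength=2).argmax() for col in votes.T])

print("soft:", soft_pred)  # soft: [1 0]
print("hard:", hard_pred)  # hard: [0 1]
```

The two rules disagree here because soft voting gives more weight to highly confident classifiers, while hard voting counts every vote equally.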


## 2. Bagging

### 2.1 Introduction

The voting classifier above gets its diversity from using very different training algorithms. Another approach is to use the same algorithm for every predictor but train each one on a different random subset of the training set. When the subsets are **sampled** with **replacement**, this method is called **bagging**.
<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/05/image20-850x320.png">

The **bagging method** works as follows:
* Draw multiple subsets from the original dataset by sampling with replacement
* Train one model on each subset. The models are independent of each other, so they can be trained in parallel
* Determine the final prediction by combining the predictions of all models
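The three steps above can be sketched by hand as follows. This is only an illustration under stated assumptions: the data comes from `make_classification` (not the data used elsewhere in this notebook), and the names `X_demo`, `y_demo`, and `bagged_pred` are made up. In practice, scikit-learn's `BaggingClassifier` does this for you.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data only (an assumption, not this notebook's dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a binary majority vote cannot tie

# Steps 1 and 2: draw a bootstrap sample for each model and train on it.
models = []
for _ in range(n_models):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))  # sampling with replacement
    models.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Step 3: combine the predictions by majority vote (labels are 0 or 1).
all_preds = np.array([m.predict(X_te) for m in models])  # shape (n_models, n_test)
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

print("bagged accuracy:", (bagged_pred == y_te).mean())
```

The loop is written sequentially for clarity; because the models never depend on one another, the training step could equally run in parallel.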



<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/05/Screenshot-from-2018-05-08-13-11-49-850x642.png">

**Note**:
* The combining function is typically the statistical mode for classification (the most frequent prediction, as in hard voting) or the average for regression. Each individual predictor has a higher bias than if it were trained on the original dataset, but combining reduces both bias and variance.
* Bagging is short for **bootstrap aggregating**: **bootstrapping** is a sampling technique in which we create subsets of observations from the original dataset, with replacement.

*Generally, the bagging model has a similar bias but a lower variance than a single predictor trained on the original training set*

### 2.2 Example 

Let us work through an example of bagging with the scikit-learn library.

In [6]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

import numpy as np
from sklearn.model_selection import train_test_split 

bag_model = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, max_samples=100, bootstrap=True, n_jobs=-1
)

# Fake some data. Note: X is drawn independently of y,
# so the features carry no information about the label.
y = np.array([1]*5000 + [0]*5000)
X = np.random.normal(loc=1, scale=0.5, size=10000).reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

bag_model.fit(X_train, y_train)
y_pred = bag_model.predict(X_test)
print(accuracy_score(y_test, y_pred))  # argument order is (y_true, y_pred)

0.5087878787878788


### 2.3 Out of Bag Evaluation

Each subset is sampled from the original dataset with **replacement**, so for a given predictor some instances may be drawn several times while others are never drawn at all. The instances that a predictor never sees are called its **out-of-bag** (oob) instances.
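As a rough check of how many instances end up out of bag, the sketch below draws one bootstrap sample of the same size as the dataset and measures the fraction of instances that were never picked. The dataset size `n` and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # arbitrary dataset size for this illustration

# Draw n indices with replacement, as a bootstrap sample would.
idx = rng.integers(0, n, size=n)

# Mark every instance that was picked at least once.
oob_mask = np.ones(n, dtype=bool)
oob_mask[idx] = False

oob_fraction = oob_mask.mean()
print("oob fraction:", oob_fraction)
print("theory (1 - 1/n)**n:", (1 - 1 / n) ** n)
```

When the sample is as large as the dataset, the fraction comes out close to exp(-1), roughly 37%. With a much smaller sample, such as `max_samples=100` in the example above, nearly all instances are out of bag for any single predictor.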

Because a predictor never sees its oob instances during training, it can be evaluated on them without a separate validation set or cross-validation. Set **oob_score=True** when creating a **BaggingClassifier** to request this evaluation automatically after training; the result is available in the `oob_score_` attribute.

In [10]:
bag_model = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, max_samples=100, bootstrap=True, n_jobs=-1, oob_score=True
)
bag_model.fit(X_train, y_train)
print('Out of bag score: ', bag_model.oob_score_)
y_pred = bag_model.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))  # argument order is (y_true, y_pred)

Out of bag score:  0.5125373134328358
Accuracy:  0.5175757575757576
