# What are Ensemble Methods?

Ensemble methods are machine learning techniques that combine several base models to produce a better model. By "better", we mean a model that predicts the output variable of the test dataset more accurately. To explain the concept, we will work through an example.

**The problem statement and data for this example have been taken from an assignment in my 10-601 Introduction to Machine Learning course (Spring 2018).**

## Problem Statement:

The goal is to predict the final grade (A or not A) of high school students based on the following features/attributes: <br>
The students' grades on 5 multiple-choice assignments (M1 through M5), 4 programming assignments (P1 through P4), and the final exam (F).

I will load the CSV file with pandas to show what the data looks like:

In [1]:
import pandas as pd
Path = 'C:\\Users\\deeprob\\10-601\\HW2\\'
train_data = pd.read_csv(Path+'education_train.csv')
train_data.head()

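| Unnamed: 0 | M1 | M2 | M3 | M4 | M5 | P1 | P2 | P3 | P4 | F | grade |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | notA | notA | A | notA | A | A | A | notA | notA | A | A |
| 1 | notA | A | A | notA | A | notA | notA | notA | A | notA | notA |
| 2 | notA | A | A | A | A | notA | A | notA | notA | A | A |
| 3 | notA | notA | notA | notA | A | A | A | notA | notA | A | notA |
| 4 | A | notA | A | A | A | A | A | notA | notA | A | A |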


In the above dataset, M1 through M5 are the 5 multiple-choice assignments, and the grade the student received on each one is given in the corresponding column. Similarly, P1 through P4 are the programming assignments, F is the final exam, and grade is the final grade of the student, which we want to predict. The `Unnamed: 0` column appears to be a row index that was saved with the CSV file.

## Data Preprocessing:

First, we need to convert the columns to numerical values, because scikit-learn's tree-based estimators expect numeric feature values.

In [2]:
train_data.replace('notA',0,inplace=True)
train_data.replace('A',1,inplace=True)

**We will solve this problem using two methods.**

## Problem Solution 1:
In the first method, we will fit a simple Decision Tree classifier to the training data. The specifications of the Decision Tree are as follows:<br>
1. **criterion**: entropy (splits are chosen to maximize information gain)
2. **max_depth**: 3

In [3]:
Xtrain = train_data.iloc[:,:-1]
ytrain = train_data.iloc[:,-1]

from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy',max_depth = 3)
clf = clf.fit(Xtrain.values,ytrain.values)
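
To make the `entropy` criterion used above more concrete, here is a minimal sketch of how entropy and information gain could be computed for a single binary feature. The helper names `entropy` and `information_gain` and the toy lists are made up for illustration; they are not part of scikit-learn or of the course dataset, and the library's own implementation is more involved.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in entropy obtained by splitting `labels` on `feature`."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Toy data, invented for this sketch (1 = A, 0 = notA)
final_exam = [1, 1, 1, 0, 0, 0, 1, 0]
grade      = [1, 1, 1, 0, 0, 0, 0, 1]

print(entropy(grade))                       # entropy of the labels before splitting
print(information_gain(final_exam, grade))  # gain from splitting on the toy feature
```

A tree that uses this criterion would, at each node, prefer the feature with the largest information gain.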

After fitting, we test the accuracy of this model on the test data.
To do that, we import and process the test data in the same way as the training data.

In [4]:
test_data = pd.read_csv(Path + 'education_test.csv')
test_data.replace('notA',0,inplace=True)
test_data.replace('A',1,inplace=True)

Xtest = test_data.iloc[:,:-1]
ytest = test_data.iloc[:,-1]

score_simple = clf.score(Xtest.values,ytest.values)
print(f'The accuracy for the simple Decision Tree classifier is: {score_simple}')

The accuracy for the simple Decision Tree classifier is: 0.795


## Problem Solution 2:
In the second method, we will try to improve the model by using an ensemble method called the random forest. 

In [5]:
from sklearn.ensemble import RandomForestClassifier
clf2 = RandomForestClassifier(n_estimators=10,criterion='entropy',
                              max_depth=3,bootstrap=True,
                              random_state=8)
clf2.fit(Xtrain.values,ytrain.values)
print(f'The accuracy of the random forest classifier is: {clf2.score(Xtest.values,ytest.values)}')

The accuracy of the random forest classifier is: 0.9



### What exactly happened here?
As we can see, the accuracy on the test dataset increased from 0.795 to 0.9. Therefore, we can say that this model is 'better' than the previous model. 

### But how did this happen and what exactly is a Random Forest? 
Just as a forest consists of many trees, a Random Forest is a collection of Decision Trees. Each individual tree predicts an outcome for each test case, as in solution 1, and the model's prediction is the outcome with the most votes. For example, for the first test case, the simple Decision Tree predicts 'A'. Similarly, we have N-1 other Decision Trees that each give a prediction for that test case. A Random Forest takes the predictions of all N Decision Trees into account and predicts the outcome that occurs most often among them. It works much like a jury: one juror may reach the wrong verdict, but the jury as a whole is more likely to reach the correct one, because the other jurors can overrule that one person.
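
The voting step described above can be sketched in a few lines. This is only an illustration of plain majority voting: the function name `majority_vote` and the list of tree predictions are hypothetical and not taken from the notebook.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label that the largest number of trees predicted.

    If two labels are tied, the one that appears first in the list wins.
    """
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions of N = 5 trees for a single test case
tree_predictions = ['A', 'notA', 'A', 'A', 'notA']
print(majority_vote(tree_predictions))  # A
```

Note that scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than counting hard votes, so its result can differ slightly from this sketch.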

### How do we create N Decision Trees?
1. We draw N random samples from our training data, each one drawn with replacement.
2. For each sample, we create a Decision Tree.

Thus, we will have N Decision Trees.
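
As a rough sketch of how the N training samples could be produced, the snippet below draws bootstrap samples, meaning rows are picked at random with replacement, which is how bagging builds a different training set for each tree. The function name `bootstrap_samples` and the integer row stand-ins are assumptions made for this illustration.

```python
import random

def bootstrap_samples(rows, n_samples, seed=0):
    """Draw `n_samples` bootstrap samples, each the same size as `rows`."""
    rng = random.Random(seed)
    return [[rng.choice(rows) for _ in rows] for _ in range(n_samples)]

rows = list(range(10))  # stand-ins for 10 training rows
samples = bootstrap_samples(rows, n_samples=3)

for sample in samples:
    # Some rows typically appear more than once and others not at all
    print(sorted(sample))
```

Each of these samples would then be used to fit one Decision Tree.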

### How can N Decision Trees be better than 1 Decision Tree?
A single Decision Tree may suffer from overfitting. N Decision Trees whose predictions are averaged, however, tend to cancel out each other's errors, *reducing the variance*, as long as they don't all err in the same direction. For that, the trees need to be as uncorrelated as possible.

*Variance error is the variability of a target function's form with respect to different training sets. Models with small variance error will not change much if you replace a couple of samples in the training set. Models with high variance might be affected even by small changes in the training set.*
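
The variance argument above can be checked with a small simulation. The sketch below does not train any trees. It assumes that each model's error is a Gaussian number, mixes in a `shared` component to make the models correlated, and measures how much the averaged error still varies. The function name and all parameters are invented for this illustration.

```python
import random
import statistics

def variance_of_average(n_models, shared, trials=2000, seed=0):
    """Empirical variance of the averaged error of `n_models` models.

    Each model's error is `shared` times an error common to all models
    plus `1 - shared` times an error unique to that model, so a larger
    `shared` means more strongly correlated models.
    """
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        common = rng.gauss(0, 1)
        errors = [shared * common + (1 - shared) * rng.gauss(0, 1)
                  for _ in range(n_models)]
        averages.append(sum(errors) / n_models)
    return statistics.pvariance(averages)

single = variance_of_average(n_models=1, shared=0.0)
independent = variance_of_average(n_models=10, shared=0.0)
correlated = variance_of_average(n_models=10, shared=0.9)

print(f'1 model:               {single:.3f}')
print(f'10 uncorrelated models: {independent:.3f}')
print(f'10 correlated models:   {correlated:.3f}')
```

Under these assumptions, averaging ten uncorrelated models should give a much smaller variance than a single model, while averaging ten strongly correlated models helps very little.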

### How to make the Decision Trees highly uncorrelated?
To do that, we use a technique called **bagging** (bootstrap aggregating), which draws a random sample with replacement from the original dataset and builds a Decision Tree from that sample. Another technique is **feature bagging or feature randomness**: instead of considering every possible feature when splitting a node, we consider only a random subset of the features at each split. This reduces correlation because, if every feature were considered and one feature were highly important, all the trees would split on that feature and end up behaving similarly. Feature bagging avoids this.
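
Feature randomness can be sketched as follows: before each split, only a random subset of the features is offered as candidates. The function name `candidate_features` is hypothetical, and the subset size (the square root of the number of features) is a common heuristic assumed here for illustration.

```python
import math
import random

features = ['M1', 'M2', 'M3', 'M4', 'M5', 'P1', 'P2', 'P3', 'P4', 'F']

def candidate_features(features, rng):
    """Pick a random subset of features to consider for one split."""
    k = max(1, int(math.sqrt(len(features))))  # 10 features -> 3 candidates
    return rng.sample(features, k)

rng = random.Random(0)
for split in range(3):
    # Each split sees a different random subset of the features
    print(f'split {split}: {candidate_features(features, rng)}')
```

Because a dominant feature is missing from many of these subsets, different trees are pushed to split on different features.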

## Problem Solution 3:
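In the third method, we go a step further and use extremely randomized trees (scikit-learn's `ExtraTreesClassifier`) to try to increase the performance further. In addition to considering a random subset of features at each split, this method also picks the split thresholds at random, which further reduces the correlation between the trees.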

In [6]:
from sklearn.ensemble import ExtraTreesClassifier
clf3 = ExtraTreesClassifier(n_estimators=10,criterion='entropy',
                            max_depth=3,bootstrap=True,random_state=47)
clf3.fit(Xtrain.values,ytrain.values)

print(f'The accuracy of the extra-trees classifier is: {clf3.score(Xtest.values,ytest.values)}')

The accuracy of the extra-trees classifier is: 0.915


# Parallel and Sequential Ensemble Methods

What we have seen so far is just one family of ensemble methods, called ***parallel ensemble methods***. Here, the base models are generated in parallel, and we take advantage of their independence (lack of correlation) to come up with a better model. Bagging is the typical strategy in this family, and Random Forest is an example. There is another family of methods called ***sequential ensemble methods***.

## Sequential Ensemble Methods:
Unlike the parallel ensemble methods, which focus on bagging, sequential ensemble methods focus on a technique called boosting, in which the base models are built one after another and each one depends on those before it. AdaBoost is one of the most widely used boosting algorithms, and we are going to cover it in detail.

### What is AdaBoost?
It is a boosting algorithm that fits a sequence of weak learners (models that are only slightly better than random guessing) on repeatedly modified versions of the data.

### How does it work?
1. Initially, we assign a weight to every training sample and set each weight to 1/N, where N is the number of samples.
2. At each iteration, we modify the weights. Examples that were incorrectly classified have their weights increased, while those that were correctly classified have their weights decreased. We are therefore increasing the influence of the examples that are difficult to predict at each successive iteration.
3. We fit the learning algorithm again on the reweighted data.
4. The final prediction combines all the fitted models through a weighted majority vote.
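
The reweighting step can be sketched as below. This follows the classic binary AdaBoost update and is meant only as an illustration: the function name `adaboost_round` and the toy inputs are made up, and scikit-learn's `AdaBoostClassifier` implements a multi-class variant whose update differs in detail.

```python
import math

def adaboost_round(weights, correct):
    """One reweighting round of binary AdaBoost.

    weights: current sample weights, summing to 1
    correct: booleans, True where the weak learner classified the sample correctly
    Returns (alpha, new_weights), where alpha is the weak learner's vote weight.
    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink the weights of correct samples, grow those of misclassified ones
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

n = 5
weights = [1 / n] * n                       # step 1: every sample starts at 1/N
correct = [True, True, True, True, False]   # toy result of one weak learner
alpha, weights = adaboost_round(weights, correct)

print(alpha)
print(weights)
```

In this toy round, the single misclassified sample ends up carrying as much weight as the four correct ones combined, so the next weak learner is pushed to get it right.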

### Why is it better than a simple Decision Tree?
As opposed to bagging, which reduces the variance, boosting mainly *reduces the bias*.

*Bias error is due to our assumptions about the target function. The more assumptions (restrictions) we make about the target function, the more bias we introduce. Models with high bias are less flexible because we have imposed more rules on the target function.*

### Solution 4:
Here, we will use AdaBoost, with the Decision Tree from solution 1 as the base model, to increase the performance of our model.

In [7]:
from sklearn.ensemble import AdaBoostClassifier
clf4 = AdaBoostClassifier(clf,n_estimators=5,random_state=0)
clf4.fit(Xtrain.values,ytrain.values)

print(f'The accuracy of the AdaBoost classifier is: {clf4.score(Xtest.values,ytest.values)}')

The accuracy of the AdaBoost classifier is: 0.98


