# Ensembling and Random Forests

Description: 
Learn what ensembling is, its different methods, and how it helps improve solutions to machine learning problems.


## Overview
- Introduction to Ensembling
- Soft vs Hard Voting
- Aggregation
- Stacking
- Random Forest
- Hyperparameter Tuning


## Pre-requisites

-  


## Learning outcomes

- Understanding the intuition behind ensemble methods
- Working with different types of ensemble methods
 

## Chapter 1: Introduction to Ensembling 

###    Description: 
Understand the intuition behind ensembling and the different types associated with it.

### 1.1 Problem Statement

***
The problem that we will solve is the same one we encountered in the decision tree course (link), related to the banking sector.
Just to refresh, the problem statement is as follows: 
A bank has run a marketing campaign and wants to know how well it is working. Given the features of the client and the marketing campaign, we have to predict whether the customer will subscribe to a term deposit or not. To get a deeper understanding of the problem, you can read more about it [here](https://pdfs.semanticscholar.org/a175/aeb08734fd669beaffd3d185a424a6f03b84.pdf). 
 

We will be using the dataset with all the 17 features for the purpose of understanding ensemble methods. 
- `age (numeric)` - age of the bank customer
- `job(categorical)`- job of the bank customer
- `marital(categorical)`- marital status of the bank customer
- `education(categorical)`- Education status of the customer
- `default(categorical)` - Whether the customer has credit in default?
- `balance (numeric)` - average yearly balance in euros
- `housing (categorical)` - Whether the customer has a housing loan?
- `loan(categorical)`- Whether the customer has a personal loan?
- `contact(categorical)`- contact communication type 
- `day(numeric)`- last contact day of the month
- `month(categorical)`- last contact month of the year
- `day(categorical)`- last contact day of the week ('mon', 'tue', 'wed', 'thu', 'fri')
- `duration (numeric)` - last contact duration, in seconds
- `campaign (numeric)` - number of contacts performed during this campaign and for this client
- `pdays (numeric)`- number of days that passed by after the client was last contacted from a previous campaign 
- `previous (numeric)`- number of contacts performed before this campaign and for this client

*** 
- `Target`: deposit - has the client subscribed to a term deposit? (binary- 0: no, 1: yes)


We will again fit the decision tree model we learned earlier on this data.
***

   
**Accuracy score using Decision Tree (note the overfitting on the training data)**

```python
#Training the model
Decision_Tree.fit(data_train, label_train)

#Accuracy of the train data
Decision_Tree_Score=Decision_Tree.score(data_train,label_train)
print("Training Score: %.2f "%Decision_Tree_Score)

#Accuracy of the test data
Decision_Tree_Score=Decision_Tree.score(data_test,label_test)
print("Test Score: %.2f "%Decision_Tree_Score)

```
**Output:**
```python
Training Score: 1.00

Test Score: 0.76   
```
***

Can we do better than that? 

We know that a decision tree is powerful. What if we combine multiple decision tree models together?

Let's call this combined model an ensemble model.

***

**Accuracy score using the Ensemble model** 

```python
Ensemble_Model.fit(data_train, label_train)

#Accuracy of the train data
Ensemble_Model_Score=Ensemble_Model.score(data_train,label_train)
print("Training Score: %.2f "%Ensemble_Model_Score)

#Accuracy of the test data
Ensemble_Model_Score=Ensemble_Model.score(data_test,label_test)
print("Test Score: %.2f "%Ensemble_Model_Score)


```

**Output:**

```
Training Score: 0.83

Test Score: 0.82   

```
***  

### 1.2 Wisdom of Crowds


It is the idea that when it comes to problem solving and decision making, the collective intelligence of many often surpasses the intelligence of a single expert.

 ![Wisdom of Crowds](..\images\wisdom_of_crowds.jpg)
 
 
 
For example, suppose you want to go to Rome for your vacation, but you are not sure whether it is a good place to visit during summer. So you ask a few people:

1. A travel guide, whose opinions about travel destinations match yours 70% of the time.
2. A YouTube trip vlogger, whose opinions about a destination match yours 80% of the time.
3. A close friend of yours, whose opinions match yours 60% of the time.

Individually, each of them may have some bias (for example, your friend said no because of his aversion to the forts in Rome), but taken together, the probability of all of them being wrong simultaneously is:
>* P = (1 - 0.7) x (1 - 0.8) x (1 - 0.6)
>* P = 0.024

This means there is a 97.6% chance that at least one of their opinions will be good (provided their opinions are independent of each other).
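The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation; the variable names are made up for illustration, and the independence of the three opinions is assumed, as in the text.

```python
# Agreement rates from the example above (how often each adviser is right)
agreement_rates = [0.7, 0.8, 0.6]

# Probability that every adviser is wrong at once, assuming independence
p_all_wrong = 1.0
for rate in agreement_rates:
    p_all_wrong *= (1 - rate)

# Complement: at least one adviser is right
p_at_least_one_right = 1 - p_all_wrong

print(f"P(all wrong) = {p_all_wrong:.3f}")
print(f"P(at least one right) = {p_at_least_one_right:.3f}")
```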

Ensemble works on similar principles.

### Definition

Ensemble modeling is a machine learning technique that combines multiple machine learning models to produce one optimal model.
Though many different ensemble techniques exist, at their core they all follow the same two steps:

- Produce a cohort of predictions using simple ML algorithms.
- Combine the predictions into one "aggregated" model.

### Why ensemble modeling?
In real-world scenarios, generalizing on a dataset with a single model can be challenging. Some models will capture one aspect of the data well, while others will do well at capturing something else. Ensemble modeling provides us with a family of techniques that help reduce errors and make predictions where the final evaluation metrics (for example, accuracy) are often better than they are for each of the individual models.

![ensemble_method](..\images\ensemble_cat.jpg)


Let's further explore this technique of Ensembling using a mathematical thought process.


###  Strong vs Weak Learner



**Condorcet’s Jury Theorem**


 ![Jury](..\images\jury.jpg)
 
 
Let's say a jury of voters is needed to make a decision regarding a binary outcome (for example, whether or not to convict a defendant).

If each voter has a probability p of being correct and the probability of a majority of voters being correct is L, then **L > p if p > 0.5**, provided the voters are independent of each other. Interestingly, **L approaches 1 as the number of voters approaches infinity**.


In plain language, p > 0.5 means that the individual judgments (votes) are at least a little better than random chance.
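To see the theorem at work numerically, the sketch below computes L with the binomial distribution for an odd number of independent voters. The function name `majority_correct_probability` and the value p = 0.55 are chosen here purely for illustration.

```python
from math import comb

def majority_correct_probability(n_voters, p):
    """Probability that more than half of n independent voters are correct."""
    majority = n_voters // 2 + 1
    return sum(comb(n_voters, k) * p**k * (1 - p)**(n_voters - k)
               for k in range(majority, n_voters + 1))

# Each voter is only slightly better than a coin toss (p = 0.55)
for n in (1, 11, 101):
    print(n, round(majority_correct_probability(n, 0.55), 4))
```

With p above 0.5 the printed probability grows with the number of voters; with p below 0.5 the same function shows it shrinking instead.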

Now, let's take this analogy to the world of ML:

* Verdict ~> classification prediction
* Jury members ~> ML models
* votes ~> individual predictions

This means that, according to Condorcet's theorem, employing multiple ML models should improve performance (and in practice it often does!), as long as the models' errors are reasonably independent.


We only need a large number of learners whose predictive power is just slightly better than random chance (tossing a coin, in the case of a binary classification problem) for ensembling to work. Such learners have a special name: "weak learners".

Formally, they are defined as:

* **Weak Learner:**
Given a labeled dataset, a Weak Learner produces a classifier which is at least a little more accurate than random classification.



* **Strong Learner:**
Given a labeled dataset, a Strong Learner produces a classifier whose results are arbitrarily well-correlated with the true classification.

### 1.3 Different techniques of Ensembling

Ensemble modeling is broadly divided into the following techniques:


#### 1. Voting/Aggregating 

`This technique involves building multiple models (usually of differing types); the final prediction is obtained by averaging (regression) or voting on (classification) the results of the models.`


![Voting](..\images\VA.jpg)
 

#### 2. Stacking

`This technique involves combining multiple classification models via a meta-classifier, i.e. instead of using trivial functions to aggregate the predictions, a model is trained to perform this aggregation.`

![Stacking](..\images\stacking.jpg)


#### 3. Boosting

`This technique involves a sequential process, where each subsequent model attempts to correct the errors of the previous model. More weight is given to examples that were misclassified by earlier rounds and then the final prediction is produced by combining the results using a weighted average approach.`

![Boosting](..\images\Boostingg.jpg)


Image reference: http://manish2020.blogspot.com/


We will go through each of them (except boosting) in the subsequent chapters.

## Chapter 2: Aggregation

###    Description: 
Understand the aggregation method in depth.

### 2.1 Naive Aggregation

Suppose you want to watch a movie starring your favourite actor this weekend. His last two movies were disappointing, so you decide to base your decision on the ratings given by three of your friends. 
The most obvious and intuitive way to do that is to average the three ratings.

Similarly, the most intuitive way to combine models is to average their individual predictions. 

**Naive aggregation** works by aggregating the final output through averaging (regression) or voting (classification).
It works best with algorithms that learn very differently from each other, thereby complementing each other's decisions.

##### Soft Voting vs Hard Voting
***
Since many classification algorithms first calculate the probability of each outcome and then produce a prediction from it, the aggregation can be done either on the calculated probabilities or on the final predictions.

* In **hard voting**, the voting classifier predicts the class chosen by the majority of its base learners.
* In **soft voting**, the voting classifier averages the class probabilities predicted by its base learners and predicts the class with the highest average probability. This requires every base learner to provide probability estimates.

In general, soft voting is often observed to perform better than hard voting, since it gives more weight to confident predictions.
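The difference between the two schemes is easiest to see on a single sample. The sketch below uses made-up probabilities from three hypothetical base learners; it is not the scikit-learn implementation, only the underlying idea.

```python
# Hypothetical predicted probabilities of class 1 from three base learners
# for a single sample (values chosen for illustration only)
probabilities = [0.45, 0.40, 0.95]

# Hard voting: each learner casts one vote for its predicted class
votes = [1 if p >= 0.5 else 0 for p in probabilities]
hard_prediction = 1 if sum(votes) > len(votes) / 2 else 0

# Soft voting: average the probabilities, then threshold the average
mean_probability = sum(probabilities) / len(probabilities)
soft_prediction = 1 if mean_probability >= 0.5 else 0

print("votes:", votes, "-> hard voting predicts", hard_prediction)
print("mean probability:", round(mean_probability, 2), "-> soft voting predicts", soft_prediction)
```

Two learners lean weakly towards class 0 while one is very confident about class 1, so the two schemes disagree on this sample.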


##### Python Implementation of Voting

For this we will use a subset of our original dataset containing only 3000 datapoints.

```python
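from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

#Four models initialised (two logistic regressions and two decision trees)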
log_clf_1 = LogisticRegression(random_state=0)
log_clf_2 = LogisticRegression(random_state=42)
decision_clf1 = DecisionTreeClassifier(criterion = 'entropy',random_state=0)
decision_clf2 = DecisionTreeClassifier(criterion = 'entropy', random_state=42)


#Creating a list of models
Model_List=[('Logistic Regression 1', log_clf_1),
            ('Logistic Regression 2', log_clf_2),
            ('Decision Tree 1', decision_clf1),
            ('Decision Tree 2', decision_clf2)]


#Features (bank_sample is the 3000-row subset of the bank data)
X= bank_sample.drop(['deposit'], axis=1)

#Target variable
y=bank_sample['deposit'].copy()


#Splitting into train and test dataset
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0)


#Initialising hard voting model
voting_clf_hard = VotingClassifier(estimators = Model_List,
                                   voting = 'hard')

#Fitting the data
voting_clf_hard.fit(X_train, y_train)

#Scoring the model for train
hard_voting_score=voting_clf_hard.score(X_train,y_train)
print("Hard Voting Train Accuracy:%.2f"%hard_voting_score)


#Scoring the model for test
hard_voting_score=voting_clf_hard.score(X_test,y_test)
print("Hard Voting Test Accuracy:%.2f"%hard_voting_score)

#Initialising soft voting model
voting_clf_soft = VotingClassifier(estimators = Model_List,voting = 'soft')


#Fitting the data
voting_clf_soft.fit(X_train, y_train)

#Scoring the model for train
soft_voting_score= voting_clf_soft.score(X_train,y_train)
print("Soft Voting Train Accuracy: %.2f"%soft_voting_score)

#Scoring the model for test
soft_voting_score= voting_clf_soft.score(X_test,y_test)
print("Soft Voting Test Accuracy: %.2f"%soft_voting_score)

```

**Output:**

```python

Hard Voting Train Accuracy:0.88

Hard Voting Test Accuracy:0.75
    
Soft Voting Train Accuracy: 1.00

Soft Voting Test Accuracy: 0.75
```

Let's now apply Soft Voting and Hard Voting on the complete bank dataset. 


#### Task 1 - Using voting method for prediction

In this task, you will apply the voting method on different ML models to predict the target of our `bank problem`.

***
- Load the dataset from the path using the `read_csv()` method from pandas and store it in a variable called data 

- Look at the first 10 rows of the data using the `head()` method. [For you to see the dataset features] 

- Store all the features of `'data'` in  a variable called `X`

- Store the target variable (`deposit`) of `'data'` in a variable called `y`

- Split `'X'` and `'y'` into `X_train,X_test,y_train,y_test` using `train_test_split()` function. Use `test_size = 0.3` and `random_state = 0`

- Four different ML models for ensembling have already been defined in the notebook for you.

- Use the `VotingClassifier()` from sklearn to initialize a voting classifier object. Pass the `'Model_List'` as input to the `estimators` parameter and `'hard'` to the `voting` parameter while initializing the object. Save the object in a variable `'voting_clf_hard'`.

- Use the `fit()` method of the `'voting_clf_hard'` to train the model on the `'X_train'` and `'y_train'`. 

- Use the `score()` method of the `voting_clf_hard` on `'X_test'` and `'y_test'` to find out the accuracy of the test data and store it in a variable called `'hard_voting_score'`. 

- Repeat the same steps for soft voting.

- Use the `VotingClassifier()` from sklearn to initialize a voting classifier object. Pass the `'Model_List'` as input to the `estimators` parameter and `'soft'` to the `voting` parameter while initializing the object. Save the object in a variable `'voting_clf_soft'`.

- Use the `fit()` method of the `'voting_clf_soft'` to train the model on the `'X_train'` and `'y_train'`. 

- Use the `score()` method of the `voting_clf_soft` on `'X_test'` and `'y_test'` to find out the accuracy of the test data and store it in a variable called `'soft_voting_score'`. 


***

After the task, compare both the accuracy scores.
(Additionally, you could try different combinations of the given machine learning models.)


```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

#Different models initialised
log_clf_1 = LogisticRegression(random_state=0)
log_clf_2 = LogisticRegression(random_state=42)
decision_clf1 = DecisionTreeClassifier(criterion = 'entropy',random_state=0)
decision_clf2 = DecisionTreeClassifier(criterion = 'entropy', random_state=42)


#Creation of list of models
Model_List=[('Logistic Regression 1', log_clf_1),
            ('Logistic Regression 2', log_clf_2),
            ('Decision Tree 1', decision_clf1),
            ('Decision Tree 2', decision_clf2)]

#Solution begins

path='../data/bank_data_new.csv'
data=pd.read_csv(path)

#Features
X= data.drop(['deposit'], axis=1)

#Target variable
y=data['deposit'].copy()


#Splitting into train and test dataset
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0)


#Initialising hard voting model
voting_clf_hard = VotingClassifier(estimators = Model_List,
                                   voting = 'hard')

#Fitting the data
voting_clf_hard.fit(X_train, y_train)

#Scoring the model for test
hard_voting_score=voting_clf_hard.score(X_test,y_test)
print("Hard Voting Test Accuracy:%.2f"%hard_voting_score)

#Initialising soft voting model
voting_clf_soft = VotingClassifier(estimators = Model_List,voting = 'soft')


#Fitting the data
voting_clf_soft.fit(X_train, y_train)

#Scoring the model for test
soft_voting_score= voting_clf_soft.score(X_test,y_test)
print("Soft Voting Test Accuracy: %.2f"%soft_voting_score)

#Solution ends
```

**Output:**

```
Hard Voting Test Accuracy:0.77
Soft Voting Test Accuracy: 0.79
```


### 2.2 Bootstrap Aggregation (Bagging)

Let's continue with the dilemma about your favourite actor's movie. After getting your friends' opinions, you are still not satisfied and think to yourself, 

**What could be better than the wisdom of crowds?**

**Ans. Wisdom of diverse experts!**

A classic example is the **cabinet of ministers** that kings kept in older times, where each minister was an expert in a particular area and the king would ask for their opinions before taking any major decision.

So instead of going with the opinion of your friends, you decide to read the reviews of some respected critics. Now, instead of getting a general opinion, you have the reviews of movie critics who have unrelated and independent views about the movie.  

**A similar approach is used in bagging. Each base learner is trained on a different sample of the data, making each learner a `specialist base learner`.**

#### Definition

Bagging, short for **B**ootstrap **AGG**regat**ING**, is an approach to ensemble learning where, given a training set, multiple different training sets (called bootstrap samples) are created by sampling with replacement from the original dataset. Then, for each bootstrap sample, a model is built. The individual predictions are then aggregated to form a final prediction.

Unlike the naive aggregator, bagging uses a single type of base learner.
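The bootstrap step on its own can be sketched with the standard library. The ten-row "training set" and the number of samples below are stand-ins chosen for illustration; a real bagging ensemble would then fit one model per sample.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

training_set = list(range(10))  # stand-in for 10 training rows
n_estimators = 3                # number of bootstrap samples to draw

bootstrap_samples = []
for _ in range(n_estimators):
    # Sample with replacement: the same row may be drawn more than once
    sample = random.choices(training_set, k=len(training_set))
    bootstrap_samples.append(sample)

for i, sample in enumerate(bootstrap_samples):
    # Rows never drawn for this sample are "out of bag" for that learner
    out_of_bag = sorted(set(training_set) - set(sample))
    print(f"sample {i}: {sorted(sample)}  left out: {out_of_bag}")
```

Each sample has the same size as the original set but usually contains repeated rows and leaves some rows out, which is what makes the base learners differ from one another.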

![Bagging](..\images\Bagging.jpg)


#### Bias-Variance trade off 
***


To better understand model predictions, it’s important to understand the prediction errors:

1. Bias 

2. Variance 

Consider the following:

Assume a dataset with features 'X' and target variable 'y' and you create a model 'F(X)' for predicting 'y'.


The expected squared error at a point x is then defined as:

$Error(x)=E[(y- F(x))^2]$

where $E[\cdot]$ denotes the expected (average) value.

$Error(x)$ can be further broken down into the following:

$Error(x)=(E[F(x)]- y)^2  + E[(F(x)- E[F(x)])^2] + \alpha$

where $\alpha$ is the irreducible noise. This can also be written as:

$Error(x)=Bias^2  + Variance + Noise$

 
**Error due to Bias:**

The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict. 

**Error due to Variance:**

The error due to variance is taken as the variability of a model prediction for a given data point.
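Both error terms can be estimated by simulation: train the same kind of model on many freshly drawn datasets and look at its predictions at one point. The sketch below is a toy setup, not the bank data. The true function x², the noise level, and the two deliberately extreme models are all assumptions made for illustration, and the bias is measured against the noise-free true value.

```python
import random

random.seed(0)

def true_function(x):
    return x ** 2

def make_dataset(n=30, noise_sd=0.2):
    xs = [random.random() for _ in range(n)]
    ys = [true_function(x) + random.gauss(0, noise_sd) for x in xs]
    return xs, ys

def mean_model(xs, ys, x0):
    # Ignores x0 entirely: very simple, so high bias but low variance
    return sum(ys) / len(ys)

def nearest_neighbour_model(xs, ys, x0):
    # Copies the label of the closest training point: low bias, high variance
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]

def bias_and_variance(model, x0, n_trials=500):
    # Refit the model on n_trials independent datasets and predict at x0
    predictions = []
    for _ in range(n_trials):
        xs, ys = make_dataset()
        predictions.append(model(xs, ys, x0))
    mean_prediction = sum(predictions) / n_trials
    bias_squared = (mean_prediction - true_function(x0)) ** 2
    variance = sum((p - mean_prediction) ** 2 for p in predictions) / n_trials
    return bias_squared, variance

x0 = 0.9
results = {}
for name, model in [("mean", mean_model), ("1-NN", nearest_neighbour_model)]:
    results[name] = bias_and_variance(model, x0)
    print(f"{name}: bias^2 = {results[name][0]:.3f}, variance = {results[name][1]:.3f}")
```

The constant "mean" model gives nearly the same answer on every dataset but is far from the true value, while the nearest-neighbour model is right on average but its individual predictions scatter with the noise.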


Since both are errors, you would ideally want a model with both low bias and low variance.

However, that is difficult to achieve, and there is usually a tradeoff between a model's ability to minimize bias and its ability to minimize variance. 

Let's understand that with an example:

Consider the following data distribution:



![](../images/bv_1.jpg)

Following will be the model if we try to `reduce bias error`:

![](../images/bv_2.jpg)

Two things you can observe:

* The model has overfit the data (it has become very complex and has fitted even the noise)

* While reducing bias, the variance has increased



Similarly, following will be the model if we try to `reduce variance error`:

![](../images/bv_3.jpg)

Two things you can observe:

* The model has underfit the data (it is too simple and fits the data poorly)

* While reducing variance, the bias has increased


Your ideal model therefore becomes the one with the right bias-variance tradeoff:

![](../images/bv_4.jpg)


As the example makes clear, since a model can't be both less complex and more complex at the same time, bias and variance generally have to be traded off against each other in machine learning.


* Increasing bias typically decreases variance. 

* Increasing variance typically decreases bias.

The following graph shows the error contributed by bias and variance:

![](../images/bv_5.jpg)


Another way to look at it is using the following diagram:

![](../images/bias_variance_tradeoff_2.png)

Imagine that the center of the target is a model that perfectly predicts the correct values. As we move away from the target, our predictions get worse and worse. 

The optimum model is, of course, the one with low bias and low variance.

High-bias (low-variance) algorithms train models that are consistent but inaccurate on average.

High-variance (low-bias) algorithms train models that are accurate on average but inconsistent.

Having both high variance and high bias results in the least optimal model, which is both inconsistent and inaccurate.




**Bias-Variance Tradeoff in Bagging**

In bagging, because of bootstrapping, each individual predictor has **a higher bias** than if it were trained on the original training set. However, these biases are partly offset when the predictions are aggregated, so the bias of the resulting bagging ensemble is only slightly higher than that of a comparable single predictor. 



At the same time, because bagging provides a way to **reduce overfitting**, owing to less dependence on any one particular subset of the training data, the **variance of the resulting strong learner is reduced** significantly.


Generally, the net result is that the ensemble has a **similar bias** but a **lower variance** than a single predictor trained on the original training set.

##### Python Implementation of Bagging

For this we will use the same subset of our original dataset containing only 3000 datapoints.

```python

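from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

#Features
X= bank_sample.drop(['deposit'], axis=1)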

#Target variable
y=bank_sample['deposit'].copy()


#Splitting into train and test dataset
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0)




#Initialising bagging with appropriate parameters
bagging_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, max_samples=100, random_state=0)

#Fitting the data
bagging_clf.fit(X_train, y_train)

#Scoring the model for train data
score_bc_dt = bagging_clf.score(X_train, y_train)
print("Training score: %.2f " % score_bc_dt)


#Scoring the model for test data
score_bc_dt = bagging_clf.score(X_test, y_test)
print("Testing score: %.2f " % score_bc_dt)

```

**Output:**

```python
Training score: 0.82
    
Testing score: 0.80
```

Let us now apply Bagging to solve the problem statement.

#### Task 2 - Using bagging for prediction

In this task, you will apply Bagging of decision trees to predict the target.


***
- Use the `BaggingClassifier()` from sklearn to initialize a bagging classifier object. Pass a `DecisionTreeClassifier()` as the base estimator (the `base_estimator` parameter, named `estimator` in scikit-learn 1.2 and later), along with `n_estimators`=100, `max_samples`=100 and `random_state`=0, while initializing the object. Store the object in the variable `'bagging_clf'`.


- Use the `fit()` method of the bagging classifier object `'bagging_clf'` on `'X_train'` and `'y_train'` to train the models on the training data. 


- Use the `score()` method of the bagging classifier object `'bagging_clf'` on `'X_test'` and `'y_test'` to find out the accuracy of the test data and store the score in a variable called `'score_bagging'`
***

After the task, compare the accuracy score with the previous voting method.
Has it improved? Why?


```python
# Fitting bagging classifier with Decision Tree
from sklearn.ensemble import BaggingClassifier

# path='../data/bank_data_new.csv'
# data=pd.read_csv(path)

# bank_sample=data.sample(n=3000,random_state=0)

# #Features
# X= bank_sample.drop(['deposit'], axis=1)

# #Target variable
# y=bank_sample['deposit'].copy()


# #Splitting into train and test dataset
# X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0)


#Initialising bagging with appropriate parameters
bagging_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, max_samples=100, random_state=0)

#Fitting the data
bagging_clf.fit(X_train, y_train)


#Scoring the model for test data
score_bagging = bagging_clf.score(X_test, y_test)
print("Score: %.2f " % score_bagging)
```

**Output:**

```
Score: 0.81
```



### 2.3 Pasting

In bagging, we created samples by resampling with replacement. In the same way, we can create the sample for each base learner by resampling **without replacement**. An ensemble built on such samples is known as **Pasting**.

Sampling with replacement introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting. 

In pasting, however, the predictors end up being more correlated, so the ensemble's variance is higher. 
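The only difference between the two sampling schemes is whether a row can be drawn twice. A minimal sketch with the standard library, using a made-up 1000-row index as a stand-in for the training data:

```python
import random

random.seed(0)

training_set = list(range(1000))  # stand-in for 1000 training rows
max_samples = 100                 # rows drawn for each base learner

# Bagging: sample with replacement, so duplicates are possible
bagging_sample = random.choices(training_set, k=max_samples)

# Pasting: sample without replacement, so every drawn row is distinct
pasting_sample = random.sample(training_set, k=max_samples)

print("unique rows (bagging):", len(set(bagging_sample)))
print("unique rows (pasting):", len(set(pasting_sample)))
```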


**Overall, bagging often results in better models than pasting** 

However, given spare time and CPU power, it is worth using cross-validation to evaluate both bagging and pasting and select the one that works best.

The Python implementation of `pasting` is the same as that of `bagging`, with the additional parameter `bootstrap=False`:

```python

BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, max_samples=100,bootstrap=False, random_state=0)
```
Let us try and implement Pasting on our dataset


#### Task 3 - Using pasting for prediction

In this task, you will apply Pasting to predict the target.

***
- Use the `BaggingClassifier()` from sklearn to initialize a bagging classifier object. Pass a `DecisionTreeClassifier()` as the base estimator (the `base_estimator` parameter, named `estimator` in scikit-learn 1.2 and later), along with `n_estimators`=100, `max_samples`=100, `bootstrap`=False and `random_state`=0, while initializing the object. Store the object in the variable `'pasting_clf'`.

- Use the `fit()` method of the bagging classifier object `'pasting_clf'` on `'X_train'` and `'y_train'` to train the models on the training data. 

- Use the `score()` method of the bagging classifier object `'pasting_clf'` on `'X_test'` and `'y_test'` to find out the accuracy of the test data and store the score in a variable called `'score_pasting'`
***

After the task, compare the accuracy score with the bagging method.

```python
# Fitting pasting with Decision Tree
pasting_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, max_samples=100,bootstrap=False, random_state=0)

pasting_clf.fit(X_train, y_train)

score_pasting = pasting_clf.score(X_test, y_test)
print("Pasting score: %.2f " % score_pasting)
```

**Output:**

```
Pasting score: 0.81
```


### 2.4 Random Forest

Just when you were about to decide whether you want to watch the movie, one of your friends asks you to go to a party with him. 
Now you realise that, in order to make your decision, you need to be absolutely sure the movie is good, 
otherwise you will regret not going to the party. You start thinking, 

**How can you improve upon the wisdom of diverse experts?**

**Ans. Experts whose area of expertise is more inclined towards and suitable for solving that particular problem**

So keeping that in mind, from the many movie critics, you select the following three:  
* Critic 1: A respected critic whose opinion about movies usually resonates with your taste in movies. 

* Critic 2: A YouTube critic who is a huge fan of the actor's movies.

* Critic 3: A filmmaker turned critic who specialises in reviewing the particular genre the film is of.


Random forests work similarly: each tree looks at only a random subset of the features and, among those, uses the ones that work best for finding the optimum solution.

#### Definition

Random forest is an ensemble method that bags multiple decision trees. The fundamental difference from plain bagging is that in Random Forests, along with bootstrap sampling, only a random subset of the total features is considered when growing the trees (in scikit-learn, a fresh subset is drawn at each split).
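The feature-subsetting step can be sketched as follows. The feature names come from the dataset description earlier in this course; the helper `candidate_features_for_split` and the square-root rule for the subset size are assumptions used for illustration, not the library's internal code.

```python
import math
import random

random.seed(0)

# Feature names taken from the dataset description above
features = ["age", "job", "marital", "education", "default", "balance",
            "housing", "loan", "contact", "day", "month", "duration",
            "campaign", "pdays", "previous"]

# A common heuristic for classification: consider about sqrt(n_features)
# candidate features at each split (an assumption here, not a fixed rule)
n_candidates = max(1, int(math.sqrt(len(features))))

def candidate_features_for_split():
    """Draw the random subset of features considered at one split."""
    return random.sample(features, k=n_candidates)

for split in range(3):
    print(f"split {split}: {candidate_features_for_split()}")
```

Because every split sees a different handful of features, the trees in the forest end up less alike than the trees produced by plain bagging.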

    
           
![RandomVBagging](..\images\BvR.jpg)


Random forest is one of the most popular ensemble algorithms (or for that matter, one of the most popular ML algorithms) owing to its 
- inherent feature selection 
- simplicity to train 
- and versatile problem solving (it can be used for classification, regression, cluster analysis, and more).

Although there exist models that can beat Random Forests for a given dataset (usually boosting or neural networks), it is often not by a big margin. Additionally, those models usually take much longer to build than a Random Forest, which makes Random Forests excellent benchmark models.


#### Bias variance trade off in Random Forests

Random forests result in greater tree diversity, which trades a **higher bias** (owing to the features being subsetted) for a **lower variance**, generally yielding an **overall better model**.


##### Python Implementation of Random Forests

For this we will use the same subset of our original dataset containing only 3000 datapoints.

```python

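from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

#Features
X= bank_sample.drop(['deposit'], axis=1)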

#Target variable
y=bank_sample['deposit'].copy()


#Splitting into train and test dataset
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0)


#Initialising Random Forest model
rf_clf=RandomForestClassifier(n_estimators=100,n_jobs=100,random_state=0, min_samples_leaf=100)

#Fitting on data
rf_clf.fit(X_train, y_train)

#Scoring the model on train data
score_rf=rf_clf.score(X_train, y_train)
print("Training score: %.2f " % score_rf)

#Scoring the model on test_data
score_rf=rf_clf.score(X_test, y_test)
print("Testing score: %.2f " % score_rf)

```
**Output:**
```python
Training score: 0.80 

Testing score: 0.79 
```

Let us now apply Random Forests to the complete bank dataset.


#### Task 4 - Using Random Forest for prediction

In this task, you will apply Random Forest to predict the target.

***
- Use the `RandomForestClassifier()` from sklearn to initialize a random forest classifier object. Pass the parameter `n_estimators`=100,`n_jobs`=100, `min_samples_leaf`=100 and `random_state`=0, while initializing the object. Store the object in the variable `'rf_clf'`.

- Use the `fit()` method of the random forest classifier object `'rf_clf'` on `'X_train'` and `'y_train'` to train the models on the training data. 

- Use the `score()` method of the random forest classifier object `'rf_clf'` on `'X_test'` and `'y_test'` to find out the accuracy of the test data and store the score in a variable called `'score_rf'`


***

After the task, compare the accuracy score with the previous voting method.
Has it improved? Why?


```python
from sklearn.ensemble import RandomForestClassifier

#Initialising Random Forest model
rf_clf=RandomForestClassifier(n_estimators=100,n_jobs=100,random_state=0, min_samples_leaf=100)

#Fitting on data
rf_clf.fit(X_train, y_train)

#Scoring the model on test_data
score_rf=rf_clf.score(X_test, y_test)
print("Testing score: %.2f " % score_rf)
```

**Output:**

```
Testing score: 0.82
```


## Chapter 3: Hyperparameter Tuning

###    Description: 
Use hyperparameter tuning to improve the models.


### 3.1 Definition


Random Forest is indeed impressive. We were able to increase the test score of the model by about 6 percentage points (from 0.76 for our decision tree to 0.82).

**Note**: Random forest was the ensemble method used as the example when the ensemble concept was introduced in the first chapter.

The Random Forest object that we used in the previous task used default values for most of its hyperparameters (we only set `n_estimators`, `n_jobs`, `min_samples_leaf` and `random_state`). Its full configuration is shown below:

`RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=100, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=100,
            oob_score=False, random_state=0, verbose=0, warm_start=False)`
            
You can learn what each parameter in the model means from the beautifully maintained documentation of [Random Forest by sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).            

Now the question that we are interested in is this,

*If the parameters given above are the reason the model behaves in a particular way, couldn't we change them to a suitable combination to achieve the optimum model?*

Enter **Hyperparameter Tuning**


You have already learned about hyperparameter tuning.

Just to refresh, hyperparameters are the settings that are chosen before training and that define the architecture and learning behaviour of a model; the process of searching for the ideal hyperparameters is referred to as hyperparameter tuning.


For this task we will be using two hyperparameter tuning methods:
1. Grid Search
2. Randomised Search 


### 3.2 Grid Search

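In Grid Search, an exhaustive search is performed over a specified set of parameter values for an estimator: every combination of the listed values is trained and evaluated (typically with cross-validation), and the best-scoring combination is kept.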
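To get a feel for what "exhaustive" means, the sketch below only enumerates the candidate combinations of a small grid; it does not train any model. The grid itself is hypothetical, with names that mirror `RandomForestClassifier` parameters.

```python
from itertools import product

# A small hypothetical grid (names mirror RandomForestClassifier parameters)
param_grid = {
    "max_depth": [3, None],
    "min_samples_leaf": [1, 3, 10],
    "criterion": ["gini", "entropy"],
}

# Every combination of one value per parameter is a candidate
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

print(len(candidates), "combinations to evaluate")
print(candidates[0])
```

The number of candidates is the product of the list lengths (here 2 x 3 x 2 = 12), so each extra parameter or value multiplies the training cost.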

Let us try to do Grid Search on Random Forest and see if the performance of the model is improved

#### Task 5 - Using Grid Search for Random Forest

In this task, you will apply Grid Search on Random Forest to tune its hyperparameters.

***
- The parameter grid for tuning is already given.

- Create a `RandomForestClassifier()` object with `random_state=0` and store it in a variable called `'clf'`

- Use `GridSearchCV()` from sklearn to initialize a grid search object. Pass the parameters `estimator=clf`, `param_grid=parameter_grid` while initializing the object. Store the object in a variable called `'grid_search'`

- Use the `fit()` method of the grid search object `'grid_search'` on `'X_train'` and `'y_train'` to train the models on the training data.

- Use the `score()` method of the grid search object `'grid_search'` on `'X_test'` and `'y_test'` to find out the accuracy on the test data and store the score in a variable called `'score_gs'`
***

After the task, compare the accuracy score with the Random Forest method.
Has it improved?


In [43]:
#Parameter grid
parameter_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

"""Solution begin """

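from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Base estimator whose hyperparameters will be tuned
clf = RandomForestClassifier(random_state=0)

# Exhaustive search over every combination in parameter_grid
grid_search = GridSearchCV(estimator=clf, param_grid=parameter_grid)
grid_search.fit(X_train, y_train)

# Optional: accuracy on the training data
# train_score_gs = grid_search.score(X_train, y_train)
# print("Training score: %.2f " % train_score_gs)

# Accuracy of the best model found by the search on the test data
score_gs = grid_search.score(X_test, y_test)
print("Testing score: %.2f " % score_gs)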


Testing score: 0.84 
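Besides the test score, a fitted search object also records which combination won. The sketch below shows this on a small synthetic dataset; the data, the reduced grid, `n_estimators=20` and `cv=3` are all assumptions chosen to keep the example quick, not the course setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the bank data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# A deliberately small grid: 2 * 2 = 4 candidates.
small_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 10]}

demo_search = GridSearchCV(RandomForestClassifier(n_estimators=20, random_state=0),
                           param_grid=small_grid, cv=3)
demo_search.fit(X_tr, y_tr)

# Winning combination and its mean cross-validated accuracy.
print("Best parameters:", demo_search.best_params_)
print("Best CV score: %.2f" % demo_search.best_score_)

# score() uses the best estimator, refitted on the whole training set.
print("Test score: %.2f" % demo_search.score(X_te, y_te))
```

The same attributes, `best_params_` and `best_score_`, are available on the `grid_search` object from the task.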


### 3.3 Random Search

In this method of hyperparameter tuning, a fixed number of parameter combinations is sampled at random from the specified values or distributions, instead of trying every combination.
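The sampling step can be sketched with `ParameterSampler`, which draws candidate settings the way a randomized search does. The distributions below are assumptions for illustration: lists are sampled uniformly, and `scipy.stats.randint(low, high)` draws integers from `low` up to `high - 1`.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Illustrative search space mixing lists and distributions (assumption).
param_distributions = {
    "max_depth": [3, None],
    "max_features": randint(1, 11),       # integers 1..10
    "min_samples_split": randint(2, 11),  # integers 2..10
    "criterion": ["gini", "entropy"],
}

# Draw 5 random candidate settings; random_state makes the draw repeatable.
sampled = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))

for candidate in sampled:
    print(candidate)
```

Only the sampled candidates are evaluated, so the cost is controlled by `n_iter` rather than by the size of the search space.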

Let us try Randomized Search on the Random Forest hyperparameters, see whether the performance of the model improves, and compare it with Grid Search.

#### Task 6 - Using Randomized Search for Random Forest

In this task, you will apply Randomized Search on Random Forest to tune its hyperparameters.

***
- The parameter grid for tuning is already given (Note: It is the same as the one given for Grid Search).
- Create a `RandomForestClassifier()` object with `random_state=0` and store it in a variable called `'clf'`

- Use `RandomizedSearchCV()` from sklearn to initialize a randomized search object. Pass the parameters `estimator=clf`, `param_distributions=parameter_grid`, `n_iter=20` while initializing the object. Store the object in a variable called `'random_search'`

- Use the `fit()` method of the randomized search object `'random_search'` on `'X_train'` and `'y_train'` to train the models on the training data.

- Use the `score()` method of the randomized search object `'random_search'` on `'X_test'` and `'y_test'` to find out the accuracy on the test data and store the score in a variable called `'score_rs'`
***




In [44]:
#Parameter grid
parameter_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

"""Solution begin"""

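from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Base estimator whose hyperparameters will be tuned
clf = RandomForestClassifier(random_state=0)

# Evaluate only n_iter randomly sampled combinations from parameter_grid
random_search = RandomizedSearchCV(estimator=clf,
                                   param_distributions=parameter_grid,
                                   n_iter=20)

random_search.fit(X_train, y_train)

# Optional: accuracy on the training data
# train_score_rs = random_search.score(X_train, y_train)
# print("Training score: %.2f " % train_score_rs)

# Accuracy of the best model found by the search on the test data
score_rs = random_search.score(X_test, y_test)
print("Testing score: %.2f " % score_rs)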


Testing score: 0.84 


In theory, grid search is the more thorough option because it is guaranteed to find the best combination among the values specified in the grid. In practice, however, randomized search often produces near-optimal results in much less time.
That is why random search, combined with clever heuristics, is often used.
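The trade-off can be sketched on a toy problem by counting how many candidates each search evaluates. The synthetic data, the small grid, `n_iter=4` and `cv=3` are assumptions for illustration; results on the bank dataset will differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data (assumption for illustration).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# 3 * 3 = 9 combinations in total.
toy_grid = {"max_depth": [2, 4, None], "min_samples_leaf": [1, 5, 10]}
forest = RandomForestClassifier(n_estimators=10, random_state=0)

# Grid search evaluates all 9 candidates.
gs = GridSearchCV(forest, param_grid=toy_grid, cv=3).fit(X_toy, y_toy)

# Randomized search evaluates only 4 candidates sampled from the same grid.
rs = RandomizedSearchCV(forest, param_distributions=toy_grid, n_iter=4,
                        cv=3, random_state=0).fit(X_toy, y_toy)

n_grid = len(gs.cv_results_["params"])
n_random = len(rs.cv_results_["params"])

print("Grid search:   %d candidates, best CV score %.3f" % (n_grid, gs.best_score_))
print("Random search: %d candidates, best CV score %.3f" % (n_random, rs.best_score_))
```

Because the random candidates are a subset of the grid and both searches use the same folds, the grid search score here cannot be lower, but the randomized search does less than half the work.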

## Chapter 4: Stacking

###   Description: 
Learn about stacking and its implementation.

So far, we have been using naive methods of averaging and voting to combine the predictions, thereby implicitly assigning the same weight to the predictions made by all the base learners.

However, some base learners may be better at predicting than others. So, a better aggregation scheme could be to assign weights to the predictions made by the base learners.
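The difference between equal and unequal weights can be sketched with a few made-up numbers. The probabilities and weights below are hypothetical, chosen only to show that the aggregated prediction can change.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class:
# one row per base learner, one column per sample.
probas = np.array([
    [0.9, 0.4, 0.2, 0.7],
    [0.8, 0.6, 0.1, 0.5],
    [0.3, 0.7, 0.6, 0.1],
])

# Naive aggregation: every learner gets the same weight.
equal_avg = probas.mean(axis=0)

# Weighted aggregation: hand-picked weights that trust learner 1 the most (assumption).
weights = np.array([0.5, 0.4, 0.1])
weighted_avg = np.average(probas, axis=0, weights=weights)

# Threshold the averaged probabilities at 0.5 to get class predictions.
equal_pred = (equal_avg >= 0.5).astype(int)
weighted_pred = (weighted_avg >= 0.5).astype(int)

print("Equal weights:   ", equal_pred)
print("Weighted average:", weighted_pred)
```

With these numbers the two schemes disagree on the last sample, which shows that the choice of weights matters.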

We could choose the weights manually, but since we have been learning machine learning to predict things anyway, why not use a machine learning model to do it?

That's exactly what stacking helps us achieve.


### Definition

Stacking is an ensemble learning technique that combines multiple classification models via a meta-classifier (a fancy name for a 'classifier of classifiers').
It is based on a simple idea: instead of using trivial functions to aggregate the predictions of all predictors in an ensemble, we train a model to perform this aggregation.

###  Steps of Stacking
***
* First, the training set is split into two subsets.

![](..\images\Stacking_1.jpg)
* The first subset is used to train the 'n' models in the first layer.

![](..\images\Stacking_2.jpg)
* Next, the first layer models are used to make predictions on the second (held-out) subset.

![](..\images\Stacking_3.jpg)
* The predictions of the models are then stored along with the actual target values as a new training set.

![](..\images\Stacking_4.jpg)

* The meta-classifier is then trained on this new training set, so it learns to predict the target value given the predictions of the first layer's models.

![](..\images\Stacking_5.jpg)
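The steps above can be sketched by hand with scikit-learn. This is a minimal illustration under assumptions: synthetic data instead of the bank dataset, three shallow decision trees as the first layer, and predicted probabilities as the inputs to the meta-classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data (assumption for illustration).
X_all, y_all = make_classification(n_samples=600, n_features=10, random_state=0)
X_fit, X_eval, y_fit, y_eval = train_test_split(X_all, y_all, random_state=0)

# Step 1: split the training set into two subsets.
X_first, X_held, y_first, y_held = train_test_split(
    X_fit, y_fit, test_size=0.5, random_state=0)

# Step 2: train the first-layer models on the first subset.
first_layer = [DecisionTreeClassifier(max_depth=d, random_state=0) for d in (2, 4, 6)]
for model in first_layer:
    model.fit(X_first, y_first)

def first_layer_predictions(models, X):
    """Stack each model's positive-class probability as one column."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])

# Steps 3 and 4: predict on the held-out subset and pair the
# predictions with the actual targets to form a new training set.
meta_features = first_layer_predictions(first_layer, X_held)

# Step 5: train the meta-classifier on the new training set.
meta_classifier = LogisticRegression()
meta_classifier.fit(meta_features, y_held)

# To predict on new data, pass it through the first layer, then the meta-classifier.
stack_score = meta_classifier.score(first_layer_predictions(first_layer, X_eval), y_eval)
print("Stacked test score: %.2f" % stack_score)
```

Keeping the second subset away from the first-layer models means the meta-classifier sees predictions on data those models were not trained on.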


Let's now apply stacking to our bank problem dataset.


#### Task 7 - Using Stacking

In this task, you will apply Stacking to predict the target.

***
- The first layer machine learning models and the meta-classifier are already defined for you.
- Use `StackingClassifier()` from mlxtend to initialize a stacking classifier object. Pass `'classifier_list'` to the parameter `classifiers` and `'m_classifier'` to the parameter `meta_classifier` while initializing the object.
- Use the `fit()` method of the stacking classifier object to train the models on the training data.
- Use the `score()` method of the stacking classifier object to find out the accuracy on the test data.
***


In [45]:
classifier1 = DecisionTreeClassifier(random_state=0)
classifier2= DecisionTreeClassifier(random_state=1)
classifier3 = DecisionTreeClassifier(random_state=2)
classifier4= DecisionTreeClassifier(random_state=3)
classifier_list=[classifier1,classifier2,classifier3,classifier4]

m_classifier=LogisticRegression(random_state=0)

"""Solution begin"""

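from mlxtend.classifier import StackingClassifier

# Combine the first-layer classifiers through the meta-classifier
sclf = StackingClassifier(classifiers=classifier_list,
                          meta_classifier=m_classifier)

sclf.fit(X_train, y_train)

# Optional: accuracy on the training data
# train_s_score = sclf.score(X_train, y_train)
# print("Training score: %.2f " % train_s_score)

# Accuracy of the stacked model on the test data
s_score = sclf.score(X_test, y_test)
print("Test score: %.2f " % s_score)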


Test score: 0.78 


Though stacking was not as effective as bagging or random forest on this particular dataset, it remains a powerful ensemble technique that is worth trying when combining weak learners into a stronger one.
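As an alternative to try, scikit-learn ships its own `StackingClassifier`, which trains the meta-classifier on cross-validated (out-of-fold) predictions, in the spirit of the held-out subset described in the steps above. The sketch below uses that class rather than the mlxtend one from the task, on synthetic data; the dataset and `cv=5` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data (assumption for illustration).
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, random_state=0)

# scikit-learn expects (name, estimator) pairs for the first layer.
estimators = [("tree_%d" % i, DecisionTreeClassifier(random_state=i)) for i in range(4)]

# The final estimator is fitted on out-of-fold predictions of the first layer.
stack = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_a, y_a)

sk_stack_score = stack.score(X_b, y_b)
print("Test score: %.2f" % sk_stack_score)
```

Whether this improves on the score above for the bank dataset would need to be checked by running it on that data.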

## Summary: 

Throughout this course, we have tried to understand the different ensemble techniques and their practical implementations.
While you take time to absorb all that was taught, find below a quick summary of the different techniques learnt.




![Summary](..\images\Summary.png)



## Quiz

***

1. Ensemble methods are loosely based on which concept?

    a. 'Wisdom of crowds'
    
    b. 'Survival of the Fittest'

    c. 'Strength in numbers'
    
**ANS:** a. Wisdom of crowds

**Explanation:** Refer to the concept if you got this incorrect.


2. Random forest is an example of a stacking method

    a. True
    
    b. False
    
    
**ANS:** b

**Explanation:** Random forest is a type of bagging method with the additional step of subsetting the features along with bootstrapping.

3. Bagging results in

    a. Reduction of bias, increase of variance 
    
    b. Increase of bias, reduction of variance
    
    c. Reduction of bias, reduction of variance
    
    d. Increase of bias, increase of variance
    
**ANS:** b

**Explanation:** Because of its inherent bootstrapping, bagging almost always reduces variance, but at the cost of slightly higher bias.


4. Random Forests can be used only for classification

    a. True
    
    b. False
    
**ANS:** b

**Explanation:** Random forest is a versatile machine learning model. It can be used for classification, regression, feature selection and clustering.
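As a small illustration of two of these uses, the sketch below fits a random forest to a regression problem and reads off feature importances, which can guide feature selection. The synthetic data and `n_estimators=50` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data where only 2 of the 5 features are informative (assumption).
X_reg, y_reg = make_regression(n_samples=200, n_features=5, n_informative=2,
                               noise=0.1, random_state=0)

# Regression: the forest averages the numeric predictions of its trees.
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_reg, y_reg)
preds = reg.predict(X_reg[:3])
print("First predictions:", preds)

# Feature selection: importances sum to 1, larger means more useful.
importances = reg.feature_importances_
print("Feature importances:", importances)
```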


5. Ensembling methods can always beat a single machine learning model 
    
    a. True
    
    b. False
    
    
**ANS:** b

**Explanation:** Ensembling methods combine the power of weak learners to form a strong learner. A single strong machine learning model can therefore have performance comparable to, or better than, an ensemble of weak learners.