In this section we will learn how to take a bunch of models and combine them together to get a better model. 

![ensemble_1.png](pics/ensemble_1.png)

## 1. Bagging 

![bagging_1.png](pics/bagging_1.png)

Each of our friends answers the test separately, and at the end we combine their answers. 

How do we combine them? 

There are many ways. For example, if the answers on the test are numeric values, we can **average** them. Since these are yes-no questions, we can use **voting** instead: for each question, we pick the option that most of our friends chose. 
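The two ways of combining answers described above can be sketched in a few lines. This is only an illustration: the helper names `combine_by_average` and `combine_by_vote` and the example answers are made up for this sketch.

```python
from collections import Counter
from statistics import mean

def combine_by_average(answers):
    """Combine numeric answers by taking their mean."""
    return mean(answers)

def combine_by_vote(answers):
    """Combine categorical answers by returning the most common one."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical answers from three friends to one numeric question...
numeric_answers = [10.0, 12.0, 14.0]
# ...and to one yes-no question.
yes_no_answers = ["yes", "no", "yes"]

average_answer = combine_by_average(numeric_answers)
voted_answer = combine_by_vote(yes_no_answers)
print(average_answer)  # 12.0
print(voted_answer)    # yes
```

With an even number of friends a vote can tie; in this sketch the option that was seen first wins the tie, which is an arbitrary choice.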

## 2. Boosting

Boosting is similar, but it tries harder to exploit our friends' strengths. Say our first friend is a philosopher: he answers all the philosophy questions well, but does poorly on the science ones. So we pick another friend to answer the science questions. But there is also a sports question, and neither of them knows about sports, so we bring in a friend who does and he answers those questions. 
Together, they form a "smart friend"!


Some terminology: each friend is called a "weak learner", and our "smart friend" is called a "strong learner". 

## Ensembles

This whole lesson (on ensembles) is about how we can combine (or ensemble) the models you have already seen in a way that makes the combination of these models better at predicting than the individual models.

Commonly the "weak" learners you use are decision trees. In fact, a decision tree is the default weak learner for most ensemble methods in sklearn. However, you can change it to any of the models you have seen so far.

## Why Would We Want to Ensemble Learners Together?

There are two competing variables in finding a well-fitting machine learning model: **Bias** and **Variance**. It is common to be asked in interviews about this topic and how it pertains to different modeling techniques. As a first pass, the Wikipedia article on the topic is quite useful. However, I will give you my perspective and examples:

**Bias**: When a model has high bias, this means that it **doesn't do a good job of bending to the data**. An example of an algorithm that usually has high bias is linear regression. Even with completely different datasets, we can end up with the same line fit to the data. High bias is bad because the model tends to underfit.

![anscombes-quartet-3.svg](pics/anscombes-quartet-3.svg)

**Variance**: When a model has high variance, this means that **it changes drastically to meet the needs of every point in our dataset**. Linear models like the one above have low variance but high bias. An example of an algorithm that tends to have high variance and low bias is a decision tree (especially a decision tree with no early stopping parameters). A decision tree, as a high variance algorithm, will attempt to split every point into its own branch if possible. This is a trait of high variance, low bias algorithms: they are flexible enough to fit exactly whatever data they see.

![decision-tree-sketch.png](pics/decision-tree-sketch.png)

By combining algorithms, we can often build models that perform better by meeting **in the middle in terms of bias and variance**. There are some other tactics that are used to combine algorithms in ways that help them perform better as well. These ideas aim to **minimize** bias and variance, and they rest on mathematical results such as the central limit theorem.
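To make the statistical intuition concrete, here is a small simulation. It assumes each learner's prediction is the true value plus independent Gaussian noise, which is a strong simplification (real learners trained on overlapping data are correlated, so the gain is smaller). All constants are hypothetical.

```python
import random
from statistics import mean, pvariance

random.seed(0)

TRUE_VALUE = 5.0   # hypothetical quantity every learner tries to predict
NOISE_STD = 2.0    # hypothetical spread of a single learner's error
N_LEARNERS = 25    # learners averaged in each ensemble
N_TRIALS = 2000    # repetitions used to estimate the variance

def noisy_prediction():
    """One learner's prediction: the true value plus independent noise."""
    return random.gauss(TRUE_VALUE, NOISE_STD)

# Predictions of a single learner, repeated many times.
single = [noisy_prediction() for _ in range(N_TRIALS)]

# Predictions of an ensemble that averages N_LEARNERS learners.
averaged = [
    mean(noisy_prediction() for _ in range(N_LEARNERS))
    for _ in range(N_TRIALS)
]

single_variance = pvariance(single)
ensemble_variance = pvariance(averaged)
print(single_variance)    # close to NOISE_STD ** 2
print(ensemble_variance)  # close to NOISE_STD ** 2 / N_LEARNERS
```

Under the independence assumption, averaging leaves the expected prediction unchanged while shrinking its variance roughly by the number of learners.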

### Introducing Randomness Into Ensembles

Another method that is used to improve ensemble methods is to introduce randomness into high variance algorithms before they are ensembled together. The introduction of randomness combats the tendency of these algorithms to overfit (or fit directly to the data available). There are two main ways that randomness is introduced:

1. **Bootstrap the data** - that is, sampling the data with replacement and fitting your algorithm to the sampled data.

2. **Subset the features** - in each split of a decision tree or with each algorithm used in an ensemble, only a subset of the total possible features are used.

In fact, these are the two random components used in the next algorithm you are going to see called **random forests**.
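Both sources of randomness can be sketched with the standard library alone. The toy rows, the feature names, and the helper names `bootstrap_sample` and `random_feature_subset` are hypothetical and only illustrate the two ideas.

```python
import random

rng = random.Random(0)

# Hypothetical toy data: 6 rows, 4 named features.
feature_names = ["f0", "f1", "f2", "f3"]
rows = [[i * 10 + j for j in range(len(feature_names))] for i in range(6)]

def bootstrap_sample(data, rng):
    """Draw len(data) rows uniformly at random WITH replacement."""
    return [rng.choice(data) for _ in data]

def random_feature_subset(names, k, rng):
    """Pick k distinct feature names WITHOUT replacement."""
    return rng.sample(names, k)

sample = bootstrap_sample(rows, rng)
subset = random_feature_subset(feature_names, 2, rng)
print(len(sample), subset)
```

Because the rows are drawn with replacement, a bootstrap sample usually contains some rows more than once and misses others entirely, which is what makes each learner see slightly different data.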

![random_forrest_1.png](pics/random_forrest_1.png)

![random_forest_2.png](pics/random_forest_2.png)

Since we have two votes with "Whatsapp" we will go for it. 

## Bagging

We give each weak learner a subset of the data to learn from. Individually their predictions are usually terrible, but with enough data, and with their votes combined, they can perform pretty well.

![bagging_2.png](pics/bagging_2.png)

![bagging_3.png](pics/bagging_3.png)
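As a rough illustration of bagging, the sketch below trains very simple weak learners (one-threshold "stumps") on small random subsets of a toy 1-D dataset and lets them vote. The data, the stump rule, and all function names are assumptions made for this sketch, not the method used by any particular library.

```python
import random
from collections import Counter

rng = random.Random(1)

# Hypothetical 1-D toy data: the label is 1 when x > 5, otherwise 0.
xs = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
data = list(zip(xs, ys))

def fit_stump(sample):
    """Weak learner: pick the threshold with the fewest errors on the sample."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def predict_stump(threshold, x):
    """Predict 1 when x lies above the threshold, otherwise 0."""
    return int(x > threshold)

def fit_bagging(data, n_learners, sample_size, rng):
    """Fit each stump on its own random subset drawn with replacement."""
    return [
        fit_stump([rng.choice(data) for _ in range(sample_size)])
        for _ in range(n_learners)
    ]

def predict_bagging(thresholds, x):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(t, x) for t in thresholds)
    return votes.most_common(1)[0][0]

stumps = fit_bagging(data, n_learners=25, sample_size=4, rng=rng)
bagged_predictions = [predict_bagging(stumps, x) for x in xs]
bagging_accuracy = sum(p == y for p, y in zip(bagged_predictions, ys)) / len(ys)
print(bagging_accuracy)
```

Each stump only sees four points, so many of them place the threshold badly; the vote is what smooths out their individual mistakes.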

## Adaboost 

Algorithm:

1. First weak learner tries to minimize the error 

![adaboost_1](pics/adaboost_1.png)

2. We take the misclassified points and make them bigger (that is, we increase how much they count toward the error), so the next weak learner will focus on them more 

![adaboost_2](pics/adaboost_2.png)

![adaboost_3](pics/adaboost_3.png)

3. Let's assume that they vote as before and we get the following 
![adaboost_4](pics/adaboost_4.png)

### Weighting the data 

Let's assign each data point an initial weight of one. Before, we wanted to minimize the number of errors; now we want to minimize the sum of the weights of the incorrectly classified points (which for now is the same thing). 

![adaboost_5.png](pics/adaboost_5.png)

Let's increase the weights of the misclassified points, by just enough that the correctly and incorrectly classified points carry the same total weight (making the current model a 50-50 guess).

![adaboost_6.png](pics/adaboost_6.png)
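One simple rule that produces this 50-50 balance is to multiply the weight of every misclassified point by the ratio of correct weight to incorrect weight. The sketch below assumes that rule and a hypothetical first round with 10 points, 3 of them misclassified; AdaBoost implementations usually express the update differently (and renormalize), so treat this as an illustration only.

```python
def reweight_misclassified(weights, is_correct):
    """Scale up misclassified points so both groups have equal total weight."""
    correct_total = sum(w for w, ok in zip(weights, is_correct) if ok)
    incorrect_total = sum(w for w, ok in zip(weights, is_correct) if not ok)
    if incorrect_total == 0:
        return list(weights)  # nothing was misclassified, nothing to boost
    factor = correct_total / incorrect_total
    return [w if ok else w * factor for w, ok in zip(weights, is_correct)]

# Hypothetical first round: 10 points of weight 1, of which 3 are misclassified.
weights = [1.0] * 10
is_correct = [True] * 7 + [False] * 3

new_weights = reweight_misclassified(weights, is_correct)
correct_sum = sum(w for w, ok in zip(new_weights, is_correct) if ok)
incorrect_sum = sum(w for w, ok in zip(new_weights, is_correct) if not ok)
print(correct_sum, incorrect_sum)  # both are (approximately) 7
```

After the update each misclassified point weighs 7/3, so the next weak learner pays a much higher price for getting those points wrong again.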

Let's learn a second weak learner and increase the weights of the misclassified points again. 

![adaboost_7.png](pics/adaboost_7.png)

Let's do it once again. We could keep going, but we will stop here. Here are the three weak learners we have created: 

![adaboost_8.png](pics/adaboost_8.png)

### Weighting the Models

In order to weight the models we need to know how well they are doing. 

Which is the worst model in terms of giving us information? 

![adaboost_9.png](pics/adaboost_9.png)

**Solution:** The one that tells the truth half of the time, because its answers are no better than a coin flip.

How are we going to weight those models? Here is how:

![adaboost_10.png](pics/adaboost_10.png)

So what would the distribution of the weights look like? 

![adaboost_12.png](pics/adaboost_12.png)

Here is the formula for the weights:

![adaboost_13.png](pics/adaboost_13.png)

Can you calculate the weights of the models based on the formula above?

```
from math import log

weight_1 = log(7/1)
print(weight_1)

weight_2 = log(4/4)
print(weight_2)

weight_3 = log(2/6)
print(weight_3)
```

```
1.9459101490553132
0.0
-1.0986122886681098
```


What happens when the model is perfect?

![adaboost_15](pics/adaboost_15.png)
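In code, a perfect model makes the denominator zero, so the calculation needs a special case. The helper below is a sketch that assumes the weight is ln(correct / incorrect), as in the calculation above; the name `model_weight` and the choice to return infinity are assumptions of this sketch.

```python
from math import inf, log

def model_weight(n_correct, n_incorrect):
    """Weight of a weak learner as ln(correct / incorrect), with edge cases."""
    if n_incorrect == 0:
        return inf    # never wrong: the ratio grows without bound
    if n_correct == 0:
        return -inf   # always wrong: the ratio shrinks to zero
    return log(n_correct / n_incorrect)

print(model_weight(7, 1))  # 1.9459101490553132
print(model_weight(4, 4))  # 0.0
print(model_weight(2, 6))  # -1.0986122886681098
print(model_weight(8, 0))  # inf
```

An infinite weight means the perfect model overrides every other vote, and a model that is always wrong is just as informative once its answers are flipped.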

So let's see the weights of the first example 

![adaboost_16.png](pics/adaboost_16.png)

![adaboost_17.png](pics/adaboost_17.png)
![adaboost_18.png](pics/adaboost_18.png)
![adaboost_19.png](pics/adaboost_19.png)
![adaboost_20.png](pics/adaboost_20.png)
![adaboost_21.png](pics/adaboost_21.png)
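The combination step can be sketched as a weighted vote: each weak learner votes +1 or -1 for a point, its vote is multiplied by its model weight, and the sign of the sum is the final prediction. The weights and votes below are hypothetical.

```python
def weighted_vote(predictions, model_weights):
    """Combine +1/-1 predictions by the sign of their weighted sum."""
    score = sum(w * p for p, w in zip(predictions, model_weights))
    return 1 if score > 0 else -1

# Hypothetical weights for three weak learners and their votes on one point.
model_weights = [0.84, 1.3, 1.84]
votes = [+1, +1, -1]  # two learners say positive, the strongest says negative

score = sum(w * p for p, w in zip(votes, model_weights))
label = weighted_vote(votes, model_weights)
print(round(score, 2), label)  # 0.3 1
```

Here the two weaker learners together outweigh the strongest one. A score of exactly zero is treated as negative in this sketch, which is an arbitrary tie-breaking choice.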



### Adaboost in sklearn

Building an AdaBoost model in sklearn is no different from building any other model. You can use scikit-learn's `AdaBoostClassifier` class, which provides the functions to define the model and fit it to your data.

```
>>> from sklearn.ensemble import AdaBoostClassifier
>>> model = AdaBoostClassifier()
>>> model.fit(x_train, y_train)
>>> model.predict(x_test)
```

In the example above, the `model` variable is an AdaBoost model that has been fitted to the data `x_train` and `y_train`; by default, its weak learners are decision trees of depth 1. The functions `fit` and `predict` work exactly as before.

### Hyperparameters

When we define the model, we can specify the hyperparameters. In practice, the most common ones are:

* `base_estimator`: The model used for the weak learners (Warning: don't forget to import the model that you decide to use for the weak learner). In scikit-learn 1.2 and later, this parameter is named `estimator`.
* `n_estimators`: The maximum number of weak learners used.

For example, here we define a model which uses decision trees of `max_depth` 2 as the weak learners, and allows a maximum of 4 of them.

```
>>> from sklearn.tree import DecisionTreeClassifier
>>> model = AdaBoostClassifier(base_estimator = DecisionTreeClassifier(max_depth=2), n_estimators = 4)
```
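Because the keyword for the weak learner differs between scikit-learn releases, one way to write code that does not depend on it is to pass the weak learner as the first positional argument. The sketch below does this on a tiny made-up dataset; the data and the variable `n_fitted` are assumptions for illustration.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical dataset: two features, binary label.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1], [3, 0], [3, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# The weak learner is the first positional parameter, so passing it
# positionally avoids depending on the keyword name, which differs
# between releases (`base_estimator` vs. `estimator`).
model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=4)
model.fit(X, y)

# `estimators_` holds the weak learners that were actually fitted.
n_fitted = len(model.estimators_)
print(n_fitted)  # at most 4; boosting can stop early
print(model.predict([[0, 0], [3, 1]]))
```

Note that `n_estimators` is an upper bound: if a weak learner already fits the training data perfectly, boosting stops before reaching it.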

### Our Mission

You recently used Naive Bayes to classify spam in this [dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). In this notebook, we will expand on the previous analysis by using a few of the new techniques you've learned throughout this lesson.


> Let's quickly re-create what we did in the previous Naive Bayes Spam Classifier notebook. We're providing the essential code from that previous workspace here, so please run this cell below.

```
# Import our libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# Read in our dataset
df = pd.read_table('data/SMSSpamCollection',
                   sep='\t', 
                   header=None, 
                   names=['label', 'sms_message'])

# Fix our response value
df['label'] = df.label.map({'ham':0, 'spam':1})

# Split our dataset into training and testing data
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

# Instantiate the CountVectorizer method
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)

# Instantiate our model
naive_bayes = MultinomialNB()

# Fit our model to the training data
naive_bayes.fit(training_data, y_train)

# Predict on the test data
predictions = naive_bayes.predict(testing_data)

# Score our model
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))
```

```
Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562
```


### Turns Out...

We can see from the scores above that our Naive Bayes model actually does a pretty good job of classifying spam and "ham."  However, let's take a look at a few additional models to see if we can't improve anyway.

Specifically in this notebook, we will take a look at the following techniques:

* [BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier)
* [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
* [AdaBoostClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier)

Another really useful guide for ensemble methods can be found [in the documentation here](http://scikit-learn.org/stable/modules/ensemble.html).

These ensemble methods use a combination of techniques you have seen throughout this lesson:

* **Bootstrap the data** passed through a learner (bagging).
* **Subset the features** used for a learner (combined with bagging signifies the two random components of random forests).
* **Ensemble learners** together in a way that allows those that perform best in certain areas to create the largest impact (boosting).


In this notebook, let's get some practice with these methods, which will also help you get comfortable with the process used for performing supervised machine learning in Python in general.

Since you cleaned and vectorized the text in the previous notebook, this notebook can be focused on the fun part - the machine learning part.

### This Process Looks Familiar...

In general, there is a five step process that can be used each time you want to use a supervised learning method (which you actually used above):

1. **Import** the model.
2. **Instantiate** the model with the hyperparameters of interest.
3. **Fit** the model to the training data.
4. **Predict** on the test data.
5. **Score** the model by comparing the predictions to the actual values.

Follow the steps through this notebook to perform these steps using each of the ensemble methods: **BaggingClassifier**, **RandomForestClassifier**, and **AdaBoostClassifier**.

> **Step 1**: First use the documentation to `import` all three of the models.

```
# Import the Bagging, RandomForest, and AdaBoost Classifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
```

> **Step 2:** Now that you have imported each of the classifiers, `instantiate` each with the hyperparameters specified in each comment. In the upcoming lessons, you will see how we can automate the process of finding the best hyperparameters. For now, let's get comfortable with the process and our new algorithms.

```
# Instantiate a BaggingClassifier with:
# 200 weak learners (n_estimators) and everything else as default values
bagging_classifier = BaggingClassifier(n_estimators=200)

# Instantiate a RandomForestClassifier with:
# 200 weak learners (n_estimators) and everything else as default values
random_forest_classifier = RandomForestClassifier(n_estimators=200)

# Instantiate an AdaBoostClassifier with:
# 300 weak learners (n_estimators) and a learning_rate of 0.2
adaboost_classifier = AdaBoostClassifier(n_estimators=300, learning_rate=0.2)
```


> **Step 3:** Now that you have instantiated each of your models, `fit` them using the **training_data** and **y_train**. This may take a bit of time; you are fitting 700 weak learners, after all!

```
# Fit your BaggingClassifier to the training data
bagging_classifier.fit(training_data, y_train)

# Fit your RandomForestClassifier to the training data
random_forest_classifier.fit(training_data, y_train)

# Fit your AdaBoostClassifier to the training data
adaboost_classifier.fit(training_data, y_train)
```

```
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=0.2,
                   n_estimators=300, random_state=None)
```

> **Step 4:** Now that you have fit each of your models, you will use each to `predict` on the **testing_data**.

```
# Predict using BaggingClassifier on the test data
bg_y_test = bagging_classifier.predict(testing_data)

# Predict using RandomForestClassifier on the test data
rf_y_test = random_forest_classifier.predict(testing_data)

# Predict using AdaBoostClassifier on the test data
ad_y_test = adaboost_classifier.predict(testing_data)
```



> **Step 5:** Now that you have made your predictions, compare your predictions to the actual values using the function below for each of your models - this will give you the `score` for how well each of your models is performing. It might also be useful to show the Naive Bayes model again here, so we can compare them all side by side.

```
def print_metrics(y_true, preds, model_name=None):
    '''
    INPUT:
    y_true - the y values that are actually true in the dataset (NumPy array or pandas series)
    preds - the predictions for those values from some model (NumPy array or pandas series)
    model_name - (str - optional) a name associated with the model if you would like to add it to the print statements 
    
    OUTPUT:
    None - prints the accuracy, precision, recall, and F1 score
    '''
    if model_name is None:
        print('Accuracy score: ', format(accuracy_score(y_true, preds)))
        print('Precision score: ', format(precision_score(y_true, preds)))
        print('Recall score: ', format(recall_score(y_true, preds)))
        print('F1 score: ', format(f1_score(y_true, preds)))
        print('\n\n')
    
    else:
        print('Accuracy score for ' + model_name + ' :' , format(accuracy_score(y_true, preds)))
        print('Precision score ' + model_name + ' :', format(precision_score(y_true, preds)))
        print('Recall score ' + model_name + ' :', format(recall_score(y_true, preds)))
        print('F1 score ' + model_name + ' :', format(f1_score(y_true, preds)))
        print('\n\n')
```

```
# Print Bagging scores
print_metrics(y_test, bg_y_test, model_name="Bagging")

# Print Random Forest scores
print_metrics(y_test, rf_y_test, model_name="Random Forest")

# Print AdaBoost scores
print_metrics(y_test, ad_y_test, model_name="AdaBoost")

# Naive Bayes Classifier scores
print_metrics(y_test, predictions, model_name="Naive Bayes")
```

```
Accuracy score for Bagging : 0.9755922469490309
Precision score Bagging : 0.9217877094972067
Recall score Bagging : 0.8918918918918919
F1 score Bagging : 0.9065934065934066

Accuracy score for Random Forest : 0.9798994974874372
Precision score Random Forest : 1.0
Recall score Random Forest : 0.8486486486486486
F1 score Random Forest : 0.9181286549707602

Accuracy score for AdaBoost : 0.9770279971284996
Precision score AdaBoost : 0.9693251533742331
Recall score AdaBoost : 0.8540540540540541
F1 score AdaBoost : 0.9080459770114943

Accuracy score for Naive Bayes : 0.9885139985642498
Precision score Naive Bayes : 0.9720670391061452
Recall score Naive Bayes : 0.9405405405405406
F1 score Naive Bayes : 0.9560439560439562
```
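As a quick sanity check on these numbers, the F1 score is the harmonic mean of precision and recall, so it can be recomputed by hand. The sketch below does this for the Random Forest results above; the helper name `f1_from` is made up for this example.

```python
def f1_from(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the Random Forest results above.
rf_precision = 1.0
rf_recall = 0.8486486486486486

rf_f1 = f1_from(rf_precision, rf_recall)
print(rf_f1)  # agrees with the reported F1 of about 0.918
```

A precision of 1.0 means every message the Random Forest flagged as spam really was spam, while the lower recall means it let roughly 15% of the spam through. The F1 score sits between the two and is pulled toward the lower one.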





### Recap

Now you have seen the whole process for a few ensemble models! 

1. **Import** the model.
2. **Instantiate** the model with the hyperparameters of interest.
3. **Fit** the model to the training data.
4. **Predict** on the test data.
5. **Score** the model by comparing the predictions to the actual values.

And that's it.  This is a very common process for performing machine learning.
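Because every model follows the same five steps, they can also be run in a loop. The sketch below is a compact illustration on synthetic data from `make_classification`, which stands in for the vectorized SMS messages; the variable names and the small `n_estimators` values are assumptions chosen so that it runs quickly.

```python
# 1. Import
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the vectorized SMS messages.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_fit, X_eval, y_fit, y_eval = train_test_split(X_demo, y_demo, random_state=0)

# 2. Instantiate (small ensembles so the sketch runs quickly)
demo_models = {
    "Bagging": BaggingClassifier(n_estimators=20, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=20, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=20, random_state=0),
}

demo_scores = {}
for name, demo_model in demo_models.items():
    demo_model.fit(X_fit, y_fit)                           # 3. Fit
    demo_preds = demo_model.predict(X_eval)                # 4. Predict
    demo_scores[name] = accuracy_score(y_eval, demo_preds) # 5. Score

for name, score in demo_scores.items():
    print(name, score)
```

The scores on this synthetic data say nothing about the spam dataset; the point is only that swapping one model for another changes a single line.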


### But, Wait...

You might be asking - 

* What do these metrics mean? 

* How do I optimize to get the best model?  

* There are so many hyperparameters to each of these models, how do I figure out what the best values are for each?

**This is exactly what the last two lessons of this course on supervised learning are all about.**


## Recap

In this lesson, you learned about a number of techniques used in ensemble methods. Before looking at the techniques, you saw that there are two competing variables that trade off against each other: **Bias** and **Variance**.

**High Bias, Low Variance** models tend to **underfit** data, as they are not flexible. **Linear models** fall into this category of models.

**High Variance, Low Bias** models tend to **overfit** data, as they are **too flexible**. **Decision trees** fall into this category of models.

### Ensemble Models
In order to find a way to **optimize for both variance and bias**, we have ensemble methods. Ensemble methods have become some of the most popular methods in Kaggle competitions and in industry applications.

There were two **randomization techniques** you saw to **combat overfitting**:

* **Bootstrap the data** - that is, sampling the data with replacement and fitting your algorithm to the sampled data.

* **Subset the features** - in each split of a decision tree or with each algorithm used in an ensemble, only a subset of the total possible features is used.

### Techniques
You saw a number of ensemble methods in this lesson including:

* [BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier)
* [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
* [AdaBoostClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier)

Another really useful guide for ensemble methods can be [found in the documentation here](http://scikit-learn.org/stable/modules/ensemble.html). These methods can also all be extended to regression problems, not just classification.

### Additional Resources
Additionally, here are some great resources on AdaBoost if you'd like to learn some more!

* The original paper from Freund and Schapire.
* A follow-up paper from the same authors regarding several experiments with AdaBoost.
* A great tutorial by Schapire.