## Where do decision trees tend to fall on the Bias/Variance spectrum?
    
- Decision trees very easily overfit.
- They tend to suffer from **high error due to variance**.
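One way to see the overfitting is to compare training and test accuracy for a fully grown tree. The following is a minimal sketch on synthetic data; the dataset, the noise level (`flip_y`), and the seeds are assumptions chosen for illustration, not part of the lesson's data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# With no depth limit, the tree keeps splitting until its leaves are pure.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
```

A large gap between the two scores is the usual symptom of a model that has memorized its training set.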

## Bootstrapping is **random resampling with replacement**.

The idea is this:
- Take your original sample of data, with sample size $n$.
- Take many sub-samples (say $B$) of size $n$ from your sample **with replacement**. These are called **bootstrapped samples**.
- You have now generated $B$ bootstrapped samples, where each sample is of size $n$!
<br>

- Instead of building one model on our original sample, we will now build one model on each bootstrapped sample, giving us $B$ models in total!
- Experience tells us that combining the models built on our bootstrapped samples tends to get us closer to what we'd see in the population than a single model built on our original sample.
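The resampling step can be sketched with NumPy. The toy sample (the integers 0 through 9), $B = 5$, and the seed are all hypothetical values picked to keep the output readable.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical original sample of size n = 10.
sample = np.arange(10)
n = len(sample)
B = 5  # number of bootstrapped samples

# Each row is one bootstrapped sample of size n, drawn with replacement.
bootstrapped_samples = rng.choice(sample, size=(B, n), replace=True)
print(bootstrapped_samples)

# Because we draw with replacement, some values repeat and others are left out.
for b, boot in enumerate(bootstrapped_samples, start=1):
    print(f"Sample {b}: {len(np.unique(boot))} unique values out of {n}")
```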

This sets up the idea of an **ensemble model**.

- Bootstrapping is random resampling with replacement.
- We bootstrap when fitting bagged decision trees so that we can fit multiple decision trees on slightly different sets of data. **Bagged decision trees tend to outperform single decision trees.**
- Bootstrapping can also be used to conduct hypothesis tests and generate confidence intervals directly from resampled data.
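As an illustration of the last point, here is a rough sketch of a percentile bootstrap confidence interval for a mean. The simulated data, $B = 2000$, and the 95% level are assumptions; the percentile method is only one of several ways to build a bootstrap interval.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample of n = 200 observations (simulated for illustration).
sample = rng.normal(loc=50, scale=10, size=200)
n = len(sample)
B = 2000

# Compute the mean of each bootstrapped sample.
boot_means = np.array([
    rng.choice(sample, size=n, replace=True).mean()
    for _ in range(B)
])

# Percentile method: take the middle 95% of the bootstrapped means.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"Sample mean: {sample.mean():.2f}")
print(f"95% bootstrap CI for the mean: ({lower:.2f}, {upper:.2f})")
```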

## Introduction to Ensemble Methods

Different types of models we've built thus far:
- Linear Regression
- Logistic Regression
- $k$-Nearest Neighbors
- Naive Bayes Classification

Same type of process:
1. Based on our problem, we identify which model to use. (Is our problem classification or regression? Do we want an interpretable model?)
2. Fit the model using the training data.
3. Use the fit model to generate predictions.
4. Evaluate our model's performance and, if necessary, return to step 2 and make changes.

So far, we've always had **exactly one model**. Today, however, we're going to talk about **ensemble methods**. Mentally, you should think about this as if we build multiple models and then aggregate their results in some way.

## Why would we build an "ensemble model?"

Our goal is to estimate $f$, the true function. (Think about $f$ as the **true process** that dictates Ames housing prices.)

We can come up with different models $m_1$, $m_2$, and so on to get as close to $f$ as possible. (Think about $m_1$ as the model you built to predict $f$, think of $m_2$ as the model your neighbor built to predict $f$, and so on.)

## Three Benefits: Statistical, Computational, Representational
- The **statistical** benefit to ensemble methods: By building one model, our predictions are almost certainly going to be wrong. Predictions from one model might overestimate housing prices; predictions from another model might underestimate housing prices. By "averaging" predictions from multiple models, we'll see that we can often cancel our errors out and get closer to the true function $f$.
<br>

- The **computational** benefit to ensemble methods: It might be impossible to develop one model that globally optimizes our objective function. (Remember that CART reaches locally optimal solutions that aren't guaranteed to be globally optimal.) In these cases, it may be **impossible** for one CART to arrive at the true function $f$. However, generating many different models and averaging their predictions may allow us to get results that are closer to the global optimum than any individual model.
<br>

- The **representational** benefit to ensemble methods: Even if we had all the data and all the computing power in the world, it might be impossible for one model to **exactly** equal $f$. For example, a linear regression model with no polynomial or interaction terms can never model a relationship where a one-unit change in $X$ is associated with some *different* change in $Y$ based on the value of $X$. All models have some shortcomings. (See [the no free lunch theorems](https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization).) While individual models have shortcomings, by creating multiple models and aggregating their predictions, we can actually create predictions that represent something that one model cannot ever represent.
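The statistical benefit can be illustrated with a small simulation. This sketch assumes each model's error is independent noise around the truth, which is an idealization: real models trained on related data have correlated errors, so the gain in practice is usually smaller. The price, noise level, and model count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

true_price = 200_000  # hypothetical value of f for one house
n_models = 25         # hypothetical number of models in the ensemble
n_trials = 10_000     # repeat the experiment many times

# Each prediction = truth + independent noise (idealized assumption).
predictions = true_price + rng.normal(0, 20_000, size=(n_trials, n_models))

# Average absolute error of one model vs. the average of all models.
single_model_error = np.abs(predictions[:, 0] - true_price).mean()
ensemble_error = np.abs(predictions.mean(axis=1) - true_price).mean()

print(f"Single model mean absolute error: {single_model_error:,.0f}")
print(f"Ensemble mean absolute error:     {ensemble_error:,.0f}")
```

Overestimates and underestimates partly cancel when averaged, which is why the ensemble's error comes out smaller here.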

## Bagging: Bootstrap Aggregating

Decision trees are powerful machine learning models. However, decision trees have some limitations. In particular, trees that are grown very deep tend to learn highly irregular patterns (i.e., they overfit their training sets).

Bagging (bootstrap aggregating) mitigates this problem by exposing different trees to different sub-samples of the training set.

The process for creating bagged decision trees is as follows:
1. From the original data of size $n$, bootstrap $B$ samples each of size $n$ (with replacement!).
2. Build a decision tree on each bootstrapped sample.
3. Make predictions by passing a test observation through all $B$ trees and developing one aggregate prediction for that observation.

## "Aggregate prediction?"
As with all of our modeling techniques, we want to make sure that we can come up with one final prediction for each observation.

Suppose we want to predict whether or not a Reddit post is going to go viral, where `1` indicates viral and `0` indicates non-viral. We build 100 decision trees. Given a new Reddit post labeled `X_test`, we pass these features into all 100 decision trees.
- 70 of the trees predict that the post in `X_test` will go viral.
- 30 of the trees predict that the post in `X_test` will not go viral.

What might you expect `.predict(X_test)` to output?

- `.predict(X_test)` should output a 1, predicting that the post will go viral.

What might you expect `.predict_proba(X_test)` to output?

- `.predict_proba(X_test)` should output `[0.3 0.7]`, indicating that the probability of the post not going viral is 30% and the probability of the post going viral is 70%.
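The arithmetic behind those two answers can be sketched directly from the votes. The `votes` array below is a hypothetical stand-in for the 100 trees' predictions on one post. Note that scikit-learn's `BaggingClassifier` averages each tree's predicted probabilities when the trees provide them, which can differ slightly from counting hard votes.

```python
import numpy as np

# Hypothetical votes from 100 trees for one post: 70 say viral (1), 30 say not (0).
votes = np.array([1] * 70 + [0] * 30)

# Share of trees voting for each class, ordered [class 0, class 1].
proba = np.bincount(votes, minlength=2) / len(votes)

# Majority vote: the class with the largest share of votes.
prediction = proba.argmax()

print(proba)
print(prediction)
```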


## Hard-code a bagging classifier using a for loop

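    import pandas as pd
    from sklearn.metrics import accuracy_score
    from sklearn.tree import DecisionTreeClassifier

    # Assumes X_train, X_test (DataFrames) and y_train, y_test (Series of 0/1 labels)
    # already exist from an earlier train/test split.

    # Instantiate dataframe to hold one column of predictions per tree.
    predictions = pd.DataFrame(index=X_test.index)

    # Generate ten decision trees.
    for i in range(1, 11):

        # Bootstrap X data: n rows drawn with replacement.
        # Seeding with i keeps every sample different but reproducible.
        X_sample = X_train.sample(n=X_train.shape[0],
                                  replace=True,
                                  random_state=i)

        # Get y data that matches the X data (label-based lookup).
        y_sample = y_train.loc[X_sample.index]

        # Instantiate decision tree.
        t = DecisionTreeClassifier(random_state=i)

        # Fit to our sample data.
        t.fit(X_sample, y_sample)

        # Put predictions in dataframe.
        predictions[f'Tree {i}'] = t.predict(X_test)

    # Generate aggregated predicted probabilities:
    # the share of trees that voted for class 1.
    probs = predictions.mean(axis='columns')

    # Predict class 1 when more than half of the trees voted for it.
    accuracy_score(y_test, (probs > .5).astype(int))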

## Using sklearn for a bagging classifier

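    from sklearn.ensemble import BaggingClassifier

    # Assumes X_train, X_test, y_train, y_test already exist from an earlier split.

    # Instantiate BaggingClassifier (bags decision trees by default).
    bag = BaggingClassifier(random_state=42)

    # Fit BaggingClassifier.
    bag.fit(X_train, y_train)

    # Score BaggingClassifier (accuracy on the test set).
    bag.score(X_test, y_test)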
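For a self-contained comparison, the sketch below fits a single decision tree and a bagged ensemble on the same synthetic data. The dataset, `n_estimators=100`, and the seeds are assumptions for illustration; results on real data will vary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real problem (an assumption).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: one fully grown decision tree.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
tree_acc = tree.score(X_test, y_test)

# Bagged trees: 100 trees, each fit on its own bootstrapped sample.
bag = BaggingClassifier(n_estimators=100, random_state=42)
bag.fit(X_train, y_train)
bag_acc = bag.score(X_test, y_test)

print(f"Single tree test accuracy:  {tree_acc:.3f}")
print(f"Bagged trees test accuracy: {bag_acc:.3f}")
```

`BaggingClassifier` uses a decision tree as its base model unless told otherwise, so no estimator is passed explicitly here.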