# Ensemble Methods

# Bagging

## Bagging (Bootstrap Aggregation)

Train multiple instances of the same model, each on a different random sample of the training data.

Each sample is drawn *with* replacement (bootstrapping), so it can contain some rows more than once and omit others.

The final prediction is the average (for regression) or the majority vote (for classification) of the individual models' predictions.
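
The three steps above can be sketched by hand. This is a minimal illustration, assuming scikit-learn decision trees as the base model and synthetic data from `make_classification`; the dataset, the number of models (25), and all variable names are assumptions made for the example. In practice, scikit-learn's `BaggingClassifier` wraps the same idea.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (labels are 0/1); illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a binary majority vote cannot tie
models = []
for _ in range(n_models):
    # Bootstrap: draw len(X_train) row indices *with* replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    models.append(tree.fit(X_train[idx], y_train[idx]))

# Majority vote: stack predictions into shape (n_models, n_test_samples)
votes = np.stack([m.predict(X_test) for m in models])
y_pred = (votes.mean(axis=0) > 0.5).astype(int)

bagged_accuracy = (y_pred == y_test).mean()
print(f"bagged accuracy: {bagged_accuracy:.3f}")
```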


### Bagging with Decision Trees - Random Forests

Random Forest builds multiple decision trees and combines their predictions.

Each tree is trained on a different bootstrap sample of the data.

At each node, only a random subset of the features is considered for the split.

The final prediction is an average (regression) or majority vote (classification) of individual tree predictions.
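
A short sketch of the same idea with scikit-learn's `RandomForestClassifier`, which bootstraps rows by default and restricts each split to a random subset of features via `max_features`. The synthetic dataset and the specific values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data; illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees
    max_features="sqrt",  # features considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)

n_trees = len(forest.estimators_)          # the fitted individual trees
test_accuracy = forest.score(X_test, y_test)
print(n_trees, round(test_accuracy, 3))
```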

#### Tuning Parameters (i.e., hyperparameters)

The main ones include (default values for scikit-learn's `RandomForestClassifier` in parentheses):
- **n_estimators:** Number of trees in the forest. (100)
- **max_depth:** Maximum depth of each tree. (None)
- **min_samples_split:** Minimum number of samples required to split an internal node. (2)
- **min_samples_leaf:** Minimum number of samples required to be at a leaf node. (1)
- **max_features:** Number of features to consider when looking for the best split. (sqrt(n_features))
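
The defaults can also be checked against the installed library rather than taken on trust. A small sketch, assuming scikit-learn is installed; note that the reported `max_features` default may differ between scikit-learn versions and between the classifier and the regressor.

```python
from sklearn.ensemble import RandomForestClassifier

names = [
    "n_estimators",
    "max_depth",
    "min_samples_split",
    "min_samples_leaf",
    "max_features",
]

# get_params() returns the constructor arguments of an (unfitted) estimator
params = RandomForestClassifier().get_params()
defaults = {name: params[name] for name in names}

for name, value in defaults.items():
    print(f"{name}: {value!r}")
```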


### Grid Search

Selecting the right hyperparameters is crucial, but tuning them manually can be challenging.

**Grid search** is a systematic method for hyperparameter tuning. It involves evaluating a predefined set of hyperparameter combinations to find the best-performing configuration.

Imagine a grid where each axis represents a hyperparameter, and the points on the grid are combinations of hyperparameter values. Grid search exhaustively explores this grid, testing each combination.


Grid search involves three main components:

- **Hyperparameter Space:** The range or values to be explored for each hyperparameter.
- **Scoring Metric:** The performance metric used to evaluate each combination.
- **Cross-Validation:** The technique used to assess performance robustly.

For effective grid search:
- Start with a broad search space.
- Refine based on initial results.
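
The three components map directly onto the arguments of scikit-learn's `GridSearchCV`. The sketch below is illustrative only: the synthetic data, the tiny 2 x 2 grid, and the choice of accuracy with 3-fold cross-validation are assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data; illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hyperparameter space: 2 x 2 = 4 combinations
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="accuracy",  # scoring metric
    cv=3,                # 3-fold cross-validation
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

A broad first pass like this can be followed by a second, finer grid centered on `search.best_params_`.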

### Benefits of Bagging

Why use bagging?
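- Reduces overfitting: averaging models trained on different bootstrap samples lowers the variance of the prediction, so the ensemble is less likely to memorize the training data than any single model.
- Increased stability: the ensemble is less sensitive to noise and outliers in the data than its individual models.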
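
One way to make the variance argument concrete is a standard textbook identity. It holds under the simplifying assumption that the $B$ individual predictions $\hat{f}_b(x)$ are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$; these symbols are introduced here for illustration only.

$$
\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding more models shrinks the second term toward zero while the first term remains, so averaging helps most when the individual models are only weakly correlated.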

## Bagging - a Random Forest

### A decision tree is prone to overfitting, so why settle for a single tree? 

|   |   |
|:--|:--|
| <img src="https://github.com/david-biron/DATA221imgs/blob/main/icon_definition.png?raw=true" width="50" height=""> | **Ensemble methods** aggregate the results from a set of classifiers (or regression <br> models). Individual predictions of 'weak learners' are combined by <br> majority rule (classification) or averaging (regression). |

|   |   |
|:--|:--|
| <img src="https://github.com/david-biron/DATA221imgs/blob/main/icon_definition.png?raw=true" width="50" height=""> | **Bagging (or bootstrap aggregation)** is an ensemble learning method: <br> several random samples of the training set are drawn **with replacement**, <br> a model is trained independently on each sample, and the results <br> are aggregated by majority rule (classification) or averaging (regression). <br> Typically, bagging reduces variance. |

|   |   |
|:--|:--|
| <img src="https://github.com/david-biron/DATA221imgs/blob/main/icon_definition.png?raw=true" width="50" height=""> | **Feature randomness (or 'feature bagging')** generates random subsets of features. <br>  This reduces the correlations between the resulting classifiers (or regression models). |

|   |   |
|:--|:--|
| <img src="https://github.com/david-biron/DATA221imgs/blob/main/icon_definition.png?raw=true" width="50" height=""> | A **random forest** combines bagging and feature randomness to create multiple weakly <br> correlated decision trees. Each individual tree is a weak learner. The forest aggregates <br> their results by majority rule (classification) or averaging (regression). <br> Typically, this reduces the risk of overfitting and increases the accuracy of predictions. |



### Hyperparameters of a random forest and how to search for them (`RandomizedSearchCV`)

* The **number of trees** (`n_estimators`): more trees may well improve performance but will require more time/memory for training. 

* The **maximum depth** of each decision tree (`max_depth`): values that are too high tend to cause overfitting, and values that are too low tend to cause underfitting.

* The **number of features** to consider when looking for the best split (`max_features`). 

[There are others...](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
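
A minimal sketch of searching these three hyperparameters with `RandomizedSearchCV`, which samples a fixed number of combinations (`n_iter`) instead of trying every one. The synthetic data, the sampling ranges, and the values of `n_iter` and `cv` are assumptions chosen to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data; illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_distributions = {
    "n_estimators": randint(20, 100),        # integers in [20, 100)
    "max_depth": randint(2, 10),             # integers in [2, 10)
    "max_features": ["sqrt", "log2", None],  # sampled uniformly from the list
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # number of sampled combinations
    cv=3,            # 3-fold cross-validation
    random_state=0,  # makes the sampled combinations reproducible
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```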


### Pros and cons of random forests

**Pros** 
* Reduced risk of overfitting (lower variance).
* Can be used for classification or regression.
* Feature importance is simple to calculate.  

**Cons** 
* Time- and memory-consuming: the cost grows with the number of trees.
* Less interpretable than, e.g., a single decision tree.
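
As an illustration of the feature-importance point, a fitted scikit-learn forest exposes impurity-based importances through its `feature_importances_` attribute. The sketch below assumes synthetic data with six features; the dataset and variable names are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 6 features, 3 of them informative; illustration only
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One non-negative score per feature; the scores are normalized to sum to 1
importances = forest.feature_importances_
ranking = importances.argsort()[::-1]  # feature indices, most important first

for i in ranking:
    print(f"feature {i}: {importances[i]:.3f}")
```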
