<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Examples.png"  style="display: block; margin-left: auto; margin-right: auto;"/>
</div>

# Examples: Understanding the basics of ensemble methods
© ExploreAI Academy

In this train, we'll explore ensemble methods, which combine multiple models to improve predictions, using the Python `scikit-learn` library.

## Learning objectives

By the end of this train, you should be able to:
- Describe the main ensemble methods and apply one using the `scikit-learn` library.
- Explain how combining multiple models can improve predictions.

## Overview

### Understanding ensemble learning

Ensemble learning is a machine learning paradigm in which **multiple models** (often called "weak learners") **are trained to solve the same problem and combined to get better results**. The main principle is that a group of weak learners can come together to form a strong learner, thereby increasing the accuracy of predictions. Depending on the method, an ensemble can reduce variance, reduce bias, or both, and so improve on the predictions of a single model.
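
To make the variance-reduction idea concrete, here is a minimal simulation. It is a sketch under an idealised assumption that is not part of the lesson's dataset: every model predicts the true value plus its own independent noise. Real models are usually correlated, so the gain in practice tends to be smaller. The names `true_value`, `n_models`, and `n_trials` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 5.0   # quantity every model is trying to predict (illustrative)
n_models = 25      # hypothetical ensemble size
n_trials = 10_000  # repeated experiments used to estimate the variance

# Assumption: each prediction = truth + independent noise with standard deviation 1
predictions = true_value + rng.normal(0.0, 1.0, size=(n_trials, n_models))

# Variance of one model's predictions versus the variance of the ensemble average
single_model_var = predictions[:, 0].var()
ensemble_var = predictions.mean(axis=1).var()

print(f"Variance of a single model:    {single_model_var:.3f}")
print(f"Variance of the 25-model mean: {ensemble_var:.3f}")
```

Under this independence assumption, averaging `n_models` predictions divides the variance by roughly `n_models`, which is the effect that bagging tries to approximate.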

### Types of ensemble methods
There are several ensemble methods, each with its unique way of combining models. The most common types include:

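- **Bagging (bootstrap aggregating)**: It involves training multiple models in parallel, each on a bootstrap sample of the data (a random sample drawn with replacement), and then combining their predictions, by averaging for regression or by majority vote for classification. **Random forest** is a popular example of bagging that also randomises the features considered at each split.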

- **Boosting**: It trains models sequentially, with each new model trying to correct the errors of those before it. The final prediction is a weighted combination of the individual models' predictions; in AdaBoost, for example, more accurate models receive larger weights. Examples include **AdaBoost**, **gradient boosting**, and **XGBoost**.

- **Stacking (stacked generalisation)**: It involves training multiple models on the same data and then training a meta-model to make a final prediction based on the predictions of the previous models.

- **Voting**: In voting, multiple models are trained independently, and their predictions are combined through a majority vote (in classification problems) or an average (in regression problems) to make the final prediction. This method leverages the diversity among the models to improve the overall performance.

These are not the only ensemble methods available. Other, less commonly used techniques combine models in different ways, but they fall outside the scope of this lesson.
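
Each of the four approaches listed above has a counterpart in `sklearn.ensemble`. The sketch below fits one of each on synthetic data from `make_regression`, which is an assumption made for illustration and not the lesson's dataset. The choice of base models and hyperparameters is arbitrary, so the printed errors should not be read as a ranking of the methods.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    AdaBoostRegressor,
    BaggingRegressor,
    StackingRegressor,
    VotingRegressor,
)
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative assumption, not the lesson's dataset)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Base models shared by the stacking and voting ensembles
base_models = [
    ("tree", DecisionTreeRegressor(max_depth=4, random_state=42)),
    ("linear", LinearRegression()),
]

ensembles = {
    # Bagging: many trees, each fitted on a bootstrap sample, predictions averaged
    "Bagging": BaggingRegressor(DecisionTreeRegressor(), n_estimators=20, random_state=42),
    # Boosting: shallow trees fitted sequentially, each focusing on earlier errors
    "Boosting": AdaBoostRegressor(n_estimators=50, random_state=42),
    # Stacking: a Ridge meta-model learns from the base models' predictions
    "Stacking": StackingRegressor(estimators=base_models, final_estimator=Ridge()),
    # Voting: for regression, the base models' predictions are averaged
    "Voting": VotingRegressor(estimators=base_models),
}

results = {}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    results[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name} MSE: {results[name]:.2f}")
```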

## Examples: Implementing an ensemble method using `scikit-learn`
Let's implement a basic ensemble method using the `scikit-learn` library. We'll use the dataset provided to predict the `BiodiversityHealthIndex` using both a single model and an ensemble method for comparison. For the ensemble, we'll use a `RandomForestRegressor`, a popular bagging method.

### Step 1: Preparing the data
We'll start by preparing our data for modelling. This involves splitting the data into features (`X`) and the target (`y`), and then splitting these into training and testing sets.

```python
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the dataset
data = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/SDG_15_Life_on_Land_Dataset.csv')

# Define features and target
X = data.drop('BiodiversityHealthIndex', axis=1)
y = data['BiodiversityHealthIndex']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
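
As a quick sanity check on what the split does, the sketch below repeats it on a small made-up DataFrame. The columns `feature_a` and `feature_b` are hypothetical and stand in for the real features, and the values are random. It shows that `test_size=0.2` holds out 20% of the rows and that fixing `random_state` makes the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Toy data: hypothetical feature columns plus the target column name from the lesson
toy = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "BiodiversityHealthIndex": rng.uniform(size=100),
})

X_toy = toy.drop(columns="BiodiversityHealthIndex")
y_toy = toy["BiodiversityHealthIndex"]

# Same settings as the lesson: 20% test set, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Splitting again with the same seed selects exactly the same rows
X_tr_again, _, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
print(X_tr.index.equals(X_tr_again.index))
```

With 100 rows, this leaves 80 rows for training and 20 for testing, and the second split matches the first.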

### Step 2: Building individual models

Let's first build a simple decision tree regressor to serve as our single-model baseline for comparison.

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Initialise and train the decision tree
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train, y_train)

# Predict and evaluate
tree_predictions = tree_model.predict(X_test)
tree_mse = mean_squared_error(y_test, tree_predictions)
print(f"Decision Tree MSE: {tree_mse}")
```

Decision Tree MSE: 0.15674493632579156
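
A decision tree with default settings keeps splitting until its leaves are pure, so it can memorise the training set. The sketch below illustrates this on synthetic data from `make_friedman1`, which is an assumption made for illustration and not the lesson's dataset, by comparing the error on the training set with the error on the test set.

```python
from sklearn.datasets import make_friedman1
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy regression data (illustrative assumption)
X_syn, y_syn = make_friedman1(n_samples=500, noise=1.0, random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Fully grown tree: no depth limit, so it can fit every training point
deep_tree = DecisionTreeRegressor(random_state=42).fit(Xs_train, ys_train)

train_mse = mean_squared_error(ys_train, deep_tree.predict(Xs_train))
test_mse = mean_squared_error(ys_test, deep_tree.predict(Xs_test))

print(f"Train MSE: {train_mse:.4f}")
print(f"Test MSE:  {test_mse:.4f}")
```

A large gap between the two errors is a sign of overfitting, which is the weakness that the ensemble in the next step is meant to address.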


### Step 3: Building an ensemble model

Now, let's use the `RandomForestRegressor` as our ensemble method.

```python
from sklearn.ensemble import RandomForestRegressor

# Initialise and train the random forest
forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
forest_model.fit(X_train, y_train)

# Predict and evaluate
forest_predictions = forest_model.predict(X_test)
forest_mse = mean_squared_error(y_test, forest_predictions)
print(f"Random Forest MSE: {forest_mse}")
```

Random Forest MSE: 0.08985732245806438
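
To see the "aggregating" part of bagging at work, the sketch below checks that a `RandomForestRegressor` prediction is the mean of the predictions of its individual trees, which are exposed through the fitted `estimators_` attribute. It uses synthetic data from `make_friedman1` and a smaller forest of 25 trees; both are assumptions made to keep the example quick and are not the lesson's setup.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative assumption)
X_syn, y_syn = make_friedman1(n_samples=300, noise=1.0, random_state=42)

small_forest = RandomForestRegressor(n_estimators=25, random_state=42)
small_forest.fit(X_syn, y_syn)

# Predict a few rows with every individual tree, then average across trees
X_new = X_syn[:5]
per_tree = np.stack([tree.predict(X_new) for tree in small_forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction should match the manual average
forest_prediction = small_forest.predict(X_new)
print(np.allclose(manual_average, forest_prediction))
```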


## Conclusion
Comparing the mean squared error (MSE) of the two models shows the impact of the ensemble: the random forest scores about 0.090 on the test set against about 0.157 for the single decision tree, roughly 43% lower. This is the typical pattern. Averaging many trees, each trained on a different bootstrap sample, reduces the variance of the predictions and makes the model less prone to overfitting than a single tree, although the size of the gain depends on the dataset.
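
A single train/test split can favour either model by chance. One way to make the comparison more robust is k-fold cross-validation with `cross_val_score`. The sketch below runs it on synthetic data from `make_friedman1`, which is an assumption made for illustration; the same pattern could be applied to the lesson's `X` and `y`.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy regression data (illustrative assumption)
X_syn, y_syn = make_friedman1(n_samples=300, noise=1.0, random_state=42)

models = {
    "Decision tree": DecisionTreeRegressor(random_state=42),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

cv_mse = {}
for name, model in models.items():
    # scikit-learn reports negated MSE so that higher is always better; flip the sign back
    scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()
    print(f"{name} mean CV MSE: {cv_mse[name]:.3f}")
```

Averaging the error over five folds gives a steadier estimate than one split, at the cost of fitting each model five times.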


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px"/>
</div>