<a href="https://colab.research.google.com/github/ck1972/Hands-On-GeoAI1/blob/main/GeoAI_Lab_5b_Explainable_Machine_Learning_for_RF_Regression_GitHub.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **Explainable Machine Learning for Random Forest Regression**

## Introduction
In Lab 4b, we modeled aboveground biomass density (AGBD) using GEDI Level 4A (L4A) data, Sentinel-2 (S2) spectral bands, vegetation indices (NDVI, CCCI, and SLAVI), and a random forest regression model. The model showed high predictive power, with an R² score of 0.88. In this lab, we use explainable machine learning techniques to examine how the model arrives at its predictions.

## Setting up Colab
### Mount your Google Drive
First, make sure that your data is loaded in Google Drive. Then mount your Google Drive using the code below.

In [None]:
# Import the Google Drive helper
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Next, import the required libraries.

In [None]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import shap
from sklearn.inspection import permutation_importance # PFI - Permutation Feature Importance
from sklearn.tree import DecisionTreeRegressor, plot_tree
import seaborn as sns
import joblib # Import joblib library

### Define variables and paths
Access the necessary datasets and then define the target (aboveground biomass density) and predictor variables (Sentinel-2 bands and vegetation indices).

In [None]:
# Define the target and predictor variables
Target = ['agbd'] # target variable
Predictors = ['B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B11', 'B12','NDVI','CCCI','SLAVI'] # predictor variables
SAMPLE_PATH = '/content/drive/MyDrive/Maf_Datasets/Cleaned_GEDI_L4A_2022_Dataset1.csv' # With filtered agbd
MODEL_PATH = '/content/drive/MyDrive/Maf_Datasets/rf_model1.pkl' # Define model path
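
As a reminder of what the index predictors encode, here is a minimal sketch of the NDVI calculation. It assumes the usual Sentinel-2 convention of B8 as near-infrared and B4 as red; the band choice used to build this lab's dataset is not stated here, and the reflectance values are invented for illustration.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    denom = nir + red
    if denom == 0:
        return float("nan")  # undefined when both reflectances are zero
    return (nir - red) / denom

# Hypothetical surface reflectance values, assuming B8 = NIR and B4 = red
b8, b4 = 0.40, 0.10
print(round(ndvi(b8, b4), 3))
```

Values close to 1 indicate dense green vegetation, while values near 0 or below indicate bare soil, water, or built-up surfaces.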

## Import training datasets and split training data
Next, we import the sample training dataset and split it into training and test datasets.

In [None]:
# Read sample
samples = pd.read_csv(SAMPLE_PATH)[Predictors + Target]
samples

# Split into train and test
train, test = train_test_split(samples, test_size=0.2, shuffle=True, random_state=42)

# Get variables input and output
X_train = train[Predictors]
X_test = test[Predictors]
y_train = train[Target].astype(float)
y_test = test[Target].astype(float)

# Show the data shape
print(f'Train Predictors: {X_train.shape}\nTest Predictors: {X_test.shape}\nTrain Target: {y_train.shape}\nTest Target: {y_test.shape}')
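
To see what an 80/20 shuffled split does, the sketch below reimplements the idea with the standard library only. It is not scikit-learn's implementation and will not pick the same rows; `shuffle_split` is a hypothetical helper, and it assumes the test share is rounded up, which matches how `train_test_split` sizes the test set for a fractional `test_size`.

```python
import math
import random

def shuffle_split(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows, then hold out a fraction for testing."""
    rng = random.Random(seed)       # fixed seed makes the split repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(103))  # stand-in for 103 sample rows
train_rows, test_rows = shuffle_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state=42` above: rerunning the cell yields the same partition, so results stay comparable between runs.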

## Perform Explainable Machine Learning (xML)
### Introduction
We will use xML techniques such as SHAP (SHapley Additive exPlanations) to gain insights into the random forest model.

Let's start by loading and extracting the saved random forest model from the dictionary.

In [None]:
# Load the dictionary
model_package = joblib.load(MODEL_PATH)

# Extract the actual Random Forest model
loaded_rf_model = model_package["model"]
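
If the saved file does not have the expected layout, the lookup above fails with a bare `KeyError`. The sketch below shows one defensive way to unpack it. `extract_model` is a hypothetical helper, and the dictionary contents are placeholders standing in for whatever `joblib.load` returns.

```python
def extract_model(package, key="model"):
    """Return the estimator stored under `key`, with a clear error if absent."""
    if not isinstance(package, dict):
        # The file may hold a bare estimator rather than a dictionary
        return package
    if key not in package:
        raise KeyError(f"'{key}' not found; available keys: {sorted(package)}")
    return package[key]

# Stand-in for the object returned by joblib.load (contents are hypothetical)
fake_package = {"model": "rf-estimator-placeholder", "features": ["B2", "B3"]}
print(extract_model(fake_package))
```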

## Feature importance
Let's start by examining variable importance. Feature importance methods highlight the contribution of each input variable to the model’s prediction.

### Mean Decrease Impurity (MDI)
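Mean Decrease Impurity (MDI) is a method used in decision tree-based models, such as random forests, to assess feature importance. It scores each feature by how much the splits on that feature reduce node impurity, averaged over all trees in the forest.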
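
To make the impurity decrease concrete, here is a small worked example for a single split, assuming variance (mean squared error) as the impurity measure, which is the usual default for regression trees. The target values are invented, and `weighted_impurity_decrease` is a hypothetical helper rather than a scikit-learn function.

```python
from statistics import pvariance

def weighted_impurity_decrease(parent, left, right, n_total):
    """Variance reduction of one split, weighted by the share of samples at the node."""
    n = len(parent)
    child_impurity = (len(left) / n) * pvariance(left) + (len(right) / n) * pvariance(right)
    return (n / n_total) * (pvariance(parent) - child_impurity)

# Hypothetical AGBD targets (Mg/ha) reaching the root node, split on some feature
parent = [10.0, 12.0, 30.0, 32.0]
left, right = [10.0, 12.0], [30.0, 32.0]
decrease = weighted_impurity_decrease(parent, left, right, n_total=len(parent))
print(decrease)  # parent variance 101 minus child variance 1 gives 100.0
```

Summing this quantity over every node that splits on a given feature, then averaging across trees and normalizing so the scores sum to 1, gives the values exposed as `feature_importances_`.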

In [None]:
# MDI - Mean Decrease Impurity from the loaded model
mdi_importances = loaded_rf_model.feature_importances_

# Create DataFrame for plotting
mdi_df = pd.DataFrame({'Feature': Predictors, 'Importance': mdi_importances})
mdi_df.sort_values(by='Importance', ascending=False, inplace=True)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', hue='Feature', data=mdi_df, palette='viridis', legend=False)  # hue avoids the palette deprecation warning (seaborn >= 0.13)
plt.title('MDI Feature Importance (Loaded Model)')
plt.tight_layout()
plt.show()
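
MDI is computed from the training data and can favor features with many distinct values, so permutation feature importance (PFI) is a useful cross-check. The sketch below illustrates the idea with the standard library: shuffle one column at a time and measure how much the error grows. It uses the increase in mean squared error and a toy stand-in model, both assumptions for illustration; scikit-learn's `permutation_importance` instead reports the drop in the estimator's score.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy stand-in model: depends strongly on column 0 and ignores column 1
def toy_predict(row):
    return 3.0 * row[0]

X = [[float(i), float((7 * i) % 10)] for i in range(10)]
y = [toy_predict(row) for row in X]
scores = permutation_importance_sketch(toy_predict, X, y)
print(scores)
```

Shuffling the ignored column leaves the error unchanged, so its score is zero, while shuffling the influential column raises the error.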

## SHAP (SHapley Additive exPlanations) method
Next, we will use the SHAP method. SHAP (SHapley Additive exPlanations) is a method that helps us understand how each feature contributes to a prediction made by a machine learning model.
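
The idea behind SHAP values can be shown with a brute-force calculation on a tiny model. The sketch below enumerates every coalition of features and averages each feature's marginal contribution. It is an illustration only: the toy model, input, and all-zeros baseline are assumptions, and the shap library uses much faster algorithms for tree models together with a background dataset rather than a single baseline.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition of features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                gain = value_fn(set(subset) | {i}) - value_fn(set(subset))
                phi[i] += weight * gain
    return phi

# Toy model f(x) = 2*x0 + x1*x2, explained at x against a baseline of zeros
x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]

def model(v):
    return 2.0 * v[0] + v[1] * v[2]

def value_fn(present):
    # Features outside the coalition are replaced by their baseline value
    mixed = [x[j] if j in present else baseline[j] for j in range(len(x))]
    return model(mixed)

phi = shapley_values(value_fn, n_features=3)
print(phi)
```

The contributions add up to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term `x1*x2` is shared equally between the two features involved.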

In [None]:
# Build a SHAP explainer for the loaded model, using X_train as background data,
# then compute SHAP values for every training sample
explainer = shap.Explainer(loaded_rf_model, X_train)
shap_values = explainer(X_train, check_additivity=False)

### Create a beeswarm plot
Next, we use the SHAP beeswarm plot function (shap.plots.beeswarm) to create a beeswarm plot that displays SHAP values.

In [None]:
# Visualize
shap.plots.beeswarm(shap_values)
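
By default, the beeswarm plot orders features by their mean absolute SHAP value, so the most influential features appear at the top. The sketch below reproduces that ranking on a small invented matrix; `rank_by_mean_abs` is a hypothetical helper, and the numbers are not taken from the model in this lab.

```python
def rank_by_mean_abs(shap_matrix, feature_names):
    """Sort features by mean absolute SHAP value, largest first."""
    n_rows = len(shap_matrix)
    mean_abs = {
        name: sum(abs(row[j]) for row in shap_matrix) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(mean_abs.items(), key=lambda item: item[1], reverse=True)

# Hypothetical SHAP values: rows are samples, columns are features
names = ["B5", "B8", "NDVI"]
shap_matrix = [
    [0.5, -2.0, 1.0],
    [-0.5, 2.0, 1.0],
    [0.5, -1.0, -1.0],
]
ranking = rank_by_mean_abs(shap_matrix, names)
print(ranking)
```

Taking the absolute value matters: a feature that pushes predictions up for some samples and down for others would otherwise average out to zero and look unimportant.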

### Create a layered plot
We can also use the SHAP violin plot function (shap.plots.violin) to create a layered violin plot that displays SHAP values. The ***plot_type='layered_violin'*** argument selects the layered style, and the ***max_display=12*** argument limits the plot to the 12 most important features, which here covers all 12 predictors.

In [None]:
# Layered violin plot
shap.plots.violin(shap_values, max_display=12, plot_type='layered_violin')

### Scatter plots
Finally, we examine how the 'B5', 'B6', 'B8', and 'B2' values relate to the target variable 'agbd' (Mg/ha) by plotting reference and predicted values for the RF model.


In [None]:
# Extract the B5, B6, B8, and B2 columns from X_train by their position in Predictors
B5_values = X_train.iloc[:, Predictors.index('B5')]
B6_values = X_train.iloc[:, Predictors.index('B6')]
B8_values = X_train.iloc[:, Predictors.index('B8')]
B2_values = X_train.iloc[:, Predictors.index('B2')]

# Predict the 'agbd' values using the trained RF model
predicted_values = loaded_rf_model.predict(X_train)

# Create a 2 x 2 grid of subplots
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 8))

# Scatter plot for B5 values against 'agbd' values (Mg/ha)
axes[0, 0].scatter(B5_values, y_train, color='blue', label='Reference')
axes[0, 0].scatter(B5_values, predicted_values, color='red', label='Predicted')
axes[0, 0].set_xlabel('Spectral reflectance')
axes[0, 0].set_ylabel('AGBD (Mg/ha)')
axes[0, 0].set_title('B5 Reference vs Predicted (RF Model)')
axes[0, 0].legend()

# Scatter plot for B6 values against 'agbd' values (Mg/ha)
axes[0, 1].scatter(B6_values, y_train, color='blue', label='Reference')
axes[0, 1].scatter(B6_values, predicted_values, color='red', label='Predicted')
axes[0, 1].set_xlabel('Spectral reflectance')
axes[0, 1].set_ylabel('AGBD (Mg/ha)')
axes[0, 1].set_title('B6 Reference vs Predicted (RF Model)')
axes[0, 1].legend()

# Scatter plot for B8 values against 'agbd' values (Mg/ha)
axes[1, 0].scatter(B8_values, y_train, color='green', label='Reference')
axes[1, 0].scatter(B8_values, predicted_values, color='orange', label='Predicted')
axes[1, 0].set_xlabel('Spectral reflectance')
axes[1, 0].set_ylabel('AGBD (Mg/ha)')
axes[1, 0].set_title('B8 Reference vs Predicted (RF Model)')
axes[1, 0].legend()

# Scatter plot for B2 values against 'agbd' values (Mg/ha)
axes[1, 1].scatter(B2_values, y_train, color='green', label='Reference')
axes[1, 1].scatter(B2_values, predicted_values, color='orange', label='Predicted')
axes[1, 1].set_xlabel('Spectral reflectance')
axes[1, 1].set_ylabel('AGBD (Mg/ha)')
axes[1, 1].set_title('B2 Reference vs Predicted (RF Model)')
axes[1, 1].legend()

# Adjust layout and display the plots
plt.tight_layout()
plt.show()