<a href="https://colab.research.google.com/github/ck1972/University-GeoAI/blob/main/Lab_7a_XML_Land_Cover_Classification_RF_Classifier1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab 7a. Explainable Machine Learning in Geospatial Analysis: Land Cover Classification**

## Introduction
Explainable ML methods are essential for interpreting the decision-making processes of complex or "black-box" models. They help identify the most influential input features driving model predictions, thereby enhancing the transparency, interpretability, and trustworthiness of machine learning outcomes.

## Imports and Setup
### Import libraries
Import the necessary libraries (pandas, scikit-learn, matplotlib, seaborn, joblib, and shap).

In [None]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from matplotlib.colors import from_levels_and_colors
import seaborn as sns
import joblib
import shap

### Mount Google Drive
Next, mount your Google Drive. You will be prompted to authorize access to your Google Drive. Once mounted, you can read/write files in /content/drive/MyDrive.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Define paths and variables
Define the paths for your own directory structure in Google Drive. In this tutorial, we use a CSV training dataset (Bul_TA_S2_Pal_2024.csv) containing pixel values and their corresponding classes, and a saved random forest model (best_rf_model.pkl).


In [None]:
# Define path that contains the datasets
Sample_Path = '/content/drive/MyDrive/Bulawayo_Dataset_2024/Bul_TA_S2_Pal_2024.csv'
MODEL_PATH = '/content/drive/MyDrive/Bulawayo_Dataset_2024/best_rf_model.pkl'

### Define target and predictor variables
Next, specify the overall structure of the land cover classification task. `features` lists the predictor columns used as model inputs: the Sentinel-2 spectral bands (e.g., B2, B3, B4) plus an HV column. `label` names the target column, "class". The remaining variables define the class codes, class names, and the color palette for the six land cover classes.

In [None]:
# Define target and predictor variables
features = ['B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B11', 'B12', 'HV']  # Feature columns
label = ['class']

Classes = [0, 1, 2, 3, 4, 5]
N_Classes = 6
Names   = ["Bare area", "Built-up", "Cropland", "Grassland", "Woodland", "Water"]
Palette = [
    '#D3D3D3',  # grey for class 0 (Bare area)
    '#FF0000',  # red for class 1 (Built-up)
    '#FFD700',  # gold for class 2 (Cropland)
    '#ADFF2F',  # greenyellow for class 3 (Grassland)
    '#006400',  # darkgreen for class 4 (Woodland)
    '#0000FF'   # blue for class 5 (Water)
]

## Load and Prepare Training Data
Next, load and prepare the training data. The training data are stored in CSV format, with one column per predictor (B2, B3, etc.) and a class column (land cover type). We split the samples 70/30 into training and testing subsets, stratified by class.

In [None]:
# Load training data as a DataFrame
df = pd.read_csv(Sample_Path)

# Inspect first few rows
print(df.head())

# Separate features (X) and label (y)
X = df[features]
y = df['class']

# Ensure no missing values
print(f"Missing values in features: {X.isnull().sum().sum()}")
print(f"Missing values in label: {y.isnull().sum()}")

# Split into training and testing subsets
# (you can also do cross-validation if you prefer)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")
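The `stratify=y` argument asks `train_test_split` to keep class proportions similar in both subsets, which matters when some land cover classes are rare. The following is a minimal sketch of how that can be checked. It uses synthetic labels and made-up predictors (`X_demo`, `y_demo` are hypothetical and not part of the lab dataset).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the 'class' column (hypothetical data)
rng = np.random.default_rng(42)
y_demo = np.repeat([0, 1, 2], [600, 300, 100])   # 60% / 30% / 10%
X_demo = rng.normal(size=(y_demo.size, 4))       # four made-up predictor columns

# Same split settings as in the lab cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

def class_shares(labels):
    """Return the fraction of samples in each class, ordered by class id."""
    counts = np.bincount(labels)
    return counts / counts.sum()

train_shares = class_shares(y_tr)
test_shares = class_shares(y_te)
print("Train shares:", train_shares.round(3))
print("Test shares: ", test_shares.round(3))
```

On the real data, the same check can be done with `y_train.value_counts(normalize=True)` and `y_test.value_counts(normalize=True)`.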

## Perform Explainable Machine Learning (xML)
### Introduction
We will use an xML method, SHAP (SHapley Additive exPlanations), to gain insight into the random forest model.

Let's start by loading the saved package, a dictionary, and extracting the random forest model from it.

In [None]:
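# Load the saved package (a dictionary) from Google Drive
package = joblib.load(MODEL_PATH)

# Extract the trained model and its metadata
# (note: this overwrites the features and label variables defined earlier)
rf_model = package["model"]
features = package["features"]
label = package["label"]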

print("Model loaded successfully.")

Model loaded successfully.
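For reference, a package like this can be produced by bundling a fitted model and its metadata in a dictionary and writing it with `joblib.dump`. The sketch below is an assumption about how such a file might be created, not the code that produced `best_rf_model.pkl`. The data, the band subset, and the temporary file path are all hypothetical; only the dictionary keys (`model`, `features`, `label`) follow the loading cell above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data: 200 samples, 4 predictors, 3 classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42
)
demo_features = ["B2", "B3", "B4", "B8"]   # illustrative subset of band names

# Fit a small random forest on the stand-in data
rf_demo = RandomForestClassifier(n_estimators=20, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Bundle the model with the metadata needed to reuse it later
package_demo = {"model": rf_demo, "features": demo_features, "label": "class"}

# Write to a temporary location (hypothetical path, not the lab's Drive folder)
demo_path = os.path.join(tempfile.mkdtemp(), "demo_rf_model.pkl")
joblib.dump(package_demo, demo_path)

# Reload and unpack, mirroring the loading cell above
restored = joblib.load(demo_path)
restored_model = restored["model"]
print("Restored features:", restored["features"])
print("Predictions match:",
      (restored_model.predict(X_demo) == rf_demo.predict(X_demo)).all())
```

Storing the feature list alongside the model makes it easier to pass columns in the order the model was trained on.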


## SHAP (SHapley Additive exPlanations) method
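Next, we will use the SHAP method. SHAP (SHapley Additive exPlanations) is based on Shapley values, a concept from cooperative game theory proposed by Lloyd Shapley (1953). A Shapley value assigns each feature its average marginal contribution to a prediction, averaged over all possible combinations of the other features.

Let's create an explainer for the RF model. `shap.TreeExplainer` is designed for tree-based models such as random forests, and any messages it prints are useful for resolving potential errors.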
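To make the game-theory idea concrete, here is a minimal sketch that computes exact Shapley values for a toy cooperative game with three "players". It is not how the SHAP library computes values for tree models. The player names and the payoff function `toy_value` are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: weighted average marginal contribution of each player."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of p when joining coalition s
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

def toy_value(coalition):
    """Hypothetical payoff: B4 and B8 contribute alone and interact; B11 adds nothing."""
    payoff = 0.0
    if "B4" in coalition:
        payoff += 10.0
    if "B8" in coalition:
        payoff += 20.0
    if "B4" in coalition and "B8" in coalition:
        payoff += 6.0   # interaction bonus, shared equally by the two players
    return payoff

players = ["B4", "B8", "B11"]
phi = shapley_values(players, toy_value)
print(phi)
print("Sum of Shapley values:", sum(phi.values()))
```

The values sum to the payoff of the full coalition minus the payoff of the empty one. This is the "additive" property that SHAP relies on: per-feature contributions add up to the difference between a prediction and the baseline.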

In [None]:
# Sample up to 1,000 rows from the training data for explainability
# (min() avoids an error if fewer than 1,000 training rows are available)
n_sample = min(1000, len(X_train))
X_train_sample = X_train.sample(n=n_sample, random_state=42)

# Create a SHAP explainer for the tree-based RF model
explainer = shap.TreeExplainer(rf_model)

# Compute SHAP values for the sample only
shap_values = explainer(X_train_sample)

### SHAP summary plot
Finally, display the SHAP summary plot to see which features most influence the random forest model's predictions. The plot is based only on the sampled pixels (1,000 here), not the full training set.

In [None]:
# Import the ListedColormap class to create custom colormaps
from matplotlib.colors import ListedColormap

# Define a custom color palette for the land cover classes in the summary plot
# (note: this reassigns Palette, replacing the order defined earlier)
Palette = [
    '#FFD700',  # gold - e.g., Cropland
    '#ADFF2F',  # greenyellow - e.g., Grassland
    '#FF0000',  # red - e.g., Built-up
    '#D3D3D3',  # grey - e.g., Bare area
    '#006400',  # darkgreen - e.g., Woodland
    '#0000FF'   # blue - e.g., Water
]

# Convert the list of hex colors into a matplotlib colormap
my_cmap = ListedColormap(Palette)

# Generate a SHAP summary plot to interpret feature importance
shap.summary_plot(
    shap_values,        # SHAP values for the model predictions
    X_train_sample,     # Subsample of training features
    feature_names=features,   # Names of the input features
    class_names=Names,        # Descriptive labels for each class
    color=my_cmap             # Use the custom colormap for class coloring
)
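For a multi-class model, the summary plot ranks features by their mean absolute SHAP value. The sketch below shows how such a ranking can be computed numerically. It uses a synthetic array in place of real SHAP output and assumes the values are laid out as (samples, features, classes), which is how recent SHAP releases typically return multi-class tree explanations. Check `shap_values.values.shape` in your own session before relying on that layout. All names prefixed with `demo_` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
demo_feature_names = ["B2", "B3", "B4", "B8"]
demo_class_names = ["Bare area", "Built-up", "Water"]

# Synthetic stand-in for SHAP values, shape (n_samples, n_features, n_classes)
demo_values = rng.normal(size=(100, 4, 3))
demo_values[:, 3, :] *= 5.0   # make the last feature deliberately dominant

# Mean absolute SHAP value per feature and class, shape (n_features, n_classes)
mean_abs = np.abs(demo_values).mean(axis=0)

# Overall importance per feature: sum across classes, then sort descending
overall = mean_abs.sum(axis=1)
order = np.argsort(overall)[::-1]

for idx in order:
    per_class = ", ".join(
        f"{name}: {mean_abs[idx, k]:.2f}"
        for k, name in enumerate(demo_class_names)
    )
    print(f"{demo_feature_names[idx]:>3}  total={overall[idx]:.2f}  ({per_class})")
```

With real output, `demo_values` would be replaced by `shap_values.values`, and the name lists by `features` and `Names`, provided the array layout matches the assumption above.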