# Land Use / Land Cover Segmentation Using Sentinel-2 and Random Forest

This workflow demonstrates how to use a [Sentinel-2](https://www.esa.int/Applications/Observing_the_Earth/Copernicus/Sentinel-2) [GeoMedian annual satellite imagery composite](https://github.com/digitalearthpacific/dep-geomad) to segment land use / land cover (LULC) with a [GPU-accelerated Random Forest classifier](https://developer.nvidia.com/blog/accelerating-random-forests-up-to-45x-using-cuml/). The classifier is trained on ground truth LULC data from the VBoS for 2022. To make this scalable to all of Vanuatu, we use an [administrative boundaries dataset from the Pacific Data Hub](https://pacificdata.org/data/dataset/2016_vut_phc_admin_boundaries/resource/66ae054b-9b67-4876-b59c-0b078c31e800).

In this notebook, we will demonstrate the following:

1. **Data Acquisition**:
   - We use **Sentinel-2 L2A** data accessed via the [Digital Earth Pacific STAC catalog](http://stac.digitalearthpacific.org/). The search is filtered by parameters such as an area of interest (AOI) and a time range to obtain suitable imagery.
   
2. **Preprocessing**:
   - The Sentinel-2 imagery contains several spectral bands (e.g., Red, Green, Blue, Near-Infrared, Short-wave Infrared). These are extracted and combined into a single dataset for analysis. Remote sensing indices useful for land use / land cover mapping are calculated from these bands. Additionally, the imagery is masked to remove areas outside the regions of interest so as to focus on the relevant pixels. We use 5 out of 6 provinces making up the nation of Vanuatu for training, and one for testing.
  
3. **Feature Extraction**:
   - Features for the classifier are extracted from the Sentinel-2 spectral bands. The reflectance values from the Red, Green, Blue, Near-Infrared (NIR), and Short-wave Infrared (SWIR) bands are used to compute remote sensing indices (NDVI, MNDWI, SAVI, BSI), and these indices form the final feature set.

4. **Ground Truth Data Integration**:
   - A shapefile containing polygons attributed by land cover/land use is loaded into a [GeoDataFrame](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.html). This allows us to create multi-class labels for the pixels in the Sentinel-2 imagery.
  
5. **Data Splitting**:
   - To ensure correct model training, we split the features and labels into training (80%) and testing (20%) sets. A 'seed' value is used for the random number generator to ensure this random split is reproducible.

6. **Random Forest Classification**:
   - We train a **Random Forest** classifier to predict land use/land cover on a pixel-wise basis. The `n_estimators` parameter is a key hyperparameter, determining the number of decision trees in the forest. Random Forest leverages the collective wisdom of multiple decision trees to make accurate predictions.

7. **Prediction**:
   - We will use the trained classifier to predict the LULC class of each pixel in the test image/province.

8. **Evaluation**:
   - After making predictions on the test partition, we evaluate the model's performance using metrics such as accuracy and F1-score. This allows us to assess the performance of the Random Forest model and the effectiveness of the selected features.

9. **Visualization**:
   - We visualize the predictions by plotting the classified map, where LULC types are indicated by specific color codes.

At the end, you will have trained a model to predict land use + land cover in Vanuatu.

In [None]:
!mamba install --channel rapidsai --quiet --yes cuml

In [None]:
!pip install gdown

In [None]:
import glob

import dask.dataframe as dd
import pandas as pd
import geopandas as gpd
import hvplot.xarray
import matplotlib.pyplot as plt
import numpy as np
import odc.stac
import rasterio.features
import rioxarray as rxr
import xarray as xr
from cuml import RandomForestClassifier
from dask import compute, delayed
from dask_ml.model_selection import train_test_split
from geocube.api.core import make_geocube
from pystac_client import Client
from shapely.geometry import Polygon, box, mapping, shape
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report,
)
from tqdm import tqdm  # for progress bar

## Data Acquisition

Let's read the LULC data into a GeoDataFrame. 

A [GeoDataFrame](https://geopandas.org/en/stable/docs/reference/geodataframe.html) is a type of data structure used to store geographic data in Python, provided by the [GeoPandas](https://geopandas.org/en/stable/) library. It extends the functionality of a pandas DataFrame to handle spatial data, enabling geospatial analysis and visualization. Like a pandas DataFrame, a GeoDataFrame is a tabular data structure with labeled axes (rows and columns), but it adds special features to work with geometric objects, such as:
- a geometry column
- a CRS
- access to spatial operations (e.g. intersection, union, buffering, and spatial joins)

In [None]:
# Download the LULC ROI data (ROIs.zip)
!gdown "https://drive.google.com/uc?id=1i0T3RqEgqcNXEUnPuDXod94EZE4IgQ7J"

In [None]:
!unzip ROIs.zip

In [None]:
# Download the administrative boundaries (2016_phc_vut_pid_4326.geojson)
!wget https://pacificdata.org/data/dataset/9dba1377-740c-429e-92ce-6a484657b4d9/resource/3d490d87-99c0-47fd-98bd-211adaf44f71/download/2016_phc_vut_pid_4326.geojson

Read and inspect the datasets.

In [None]:
lulc_gdf = gpd.read_file("./ROIs/ROIs_v5.shp")

In [None]:
admin_boundaries_gdf = gpd.read_file("./2016_phc_vut_pid_4326.geojson")

In [None]:
admin_boundaries_gdf.head(2)

In [None]:
len(lulc_gdf), len(admin_boundaries_gdf)

We can check out the attributes associated with this dataset:

In [None]:
lulc_gdf.columns

Let's see which classes are available to us in the most recent LULC column.

In [None]:
lulc_gdf.ROI.unique()

And view a subset of the data (shuffled for more variety in the 10 samples):

In [None]:
lulc_gdf.sample(frac=1).head(10)

We can also plot the vector dataset, and color code the polygons by the relevant LULC column.

In [None]:
lulc_gdf.plot(column='ROI')

Create raster image and label xarray DataArrays for each province.

In [None]:
admin_boundaries_gdf

In [None]:
YEAR = 2022 # year matching label data
PROVINCES_TRAIN = ["TORBA", "SANMA", "PENAMA", "MALAMPA", "SHEFA"]
PROVINCE_TEST = "TAFEA"

In [None]:
admin_boundaries_gdf = admin_boundaries_gdf.set_index(keys="pname")  # set province name as the index

In [None]:
admin_boundaries_gdf

Get geometries of each province.

In [None]:
GEOMS_TRAIN = admin_boundaries_gdf.loc[PROVINCES_TRAIN].geometry.tolist()
GEOMS_TRAIN

In [None]:
GEOM_TEST = admin_boundaries_gdf.loc[PROVINCE_TEST].geometry
GEOM_TEST

Get Sentinel-2 GeoMedian composite data for 2022 for each province.

In [None]:
STAC_URL = "http://stac.digitalearthpacific.org/"
stac_client = Client.open(STAC_URL)

In [None]:
# Collect s2_data per train province in a list
s2_data_train_list = []

for pname, geom in tqdm(zip(PROVINCES_TRAIN, GEOMS_TRAIN), total=len(GEOMS_TRAIN), desc="Loading GeoMAD per province"):
    try:
        # Query STAC for this province
        s2_search = stac_client.search(
            collections=["dep_s2_geomad"], # Sentinel-2 Geometric Median and Absolute Deviations (GeoMAD) over the Pacific.
            intersects=mapping(geom),  # GeoJSON dict
            datetime=str(YEAR),
        )
        s2_items = s2_search.item_collection()

        if len(s2_items) == 0:
            print(f"No items found for {pname}")
            continue

        # Load data from items
        s2_data = odc.stac.load(
            items=s2_items,
            bands=["blue", "green", "red", "nir08", "swir16"],
            chunks={"x": 1024, "y": 1024, "bands": -1, "time": -1},
            resolution=20,
        )

        s2_data_train_list.append(s2_data)

    except Exception as e:
        print(f"Error loading {pname}: {e}")



In [None]:
gdf_test = lulc_gdf.query(expr=f"Pname == '{PROVINCE_TEST}'")

s2_search = stac_client.search(
    collections=["dep_s2_geomad"],
    intersects=GEOM_TEST, 
    datetime=str(YEAR),
)
# Retrieve all items from search results
s2_items = s2_search.item_collection()
print("len(s2_items): ", len(s2_items))

s2_data_test = odc.stac.load(
    items=s2_items,
    bands=["blue", "green", "red", "nir08", "swir16"],
    chunks={'x': 1024, 'y': 1024, 'bands': -1, 'time': -1},
    resolution=20,
)
s2_data_test

Buffer each province geometry by 5,000 m so that nearshore areas, and any ROIs that extend past the coastline, are included.

In [None]:
# Keep projection aligned with raster
raster_crs = s2_data_train_list[0].rio.crs

# Reproject the full subset once
gdf_reprojected_train = admin_boundaries_gdf.loc[PROVINCES_TRAIN].to_crs(crs=raster_crs)

# Create a dictionary of buffered geometries per province
geom_buffered_train = {
    pname: gdf_reprojected_train.loc[pname].geometry.buffer(5000)
    for pname in PROVINCES_TRAIN
}

In [None]:
geom_train_buffered_list = list(geom_buffered_train.values())
geom_train_buffered_list

In [None]:
geom_train_buffered_list[0]

In [None]:
# Keep projection aligned with raster
raster_crs = s2_data_test.rio.crs

# Get only the select province and reproject
gdf_reprojected_test = admin_boundaries_gdf.loc[[PROVINCE_TEST]].to_crs(crs=raster_crs)

# Buffer in raster units (meters if UTM)
geom_buffered_test = gdf_reprojected_test.buffer(distance=5000)[PROVINCE_TEST]
geom_buffered_test

Clip the Sentinel-2 data to be within the buffered geometries only.

In [None]:
# Make sure the keys match — we'll zip province names, geometries, and s2 datasets
s2_train_clipped_list = []

for pname, geom, s2_data in zip(PROVINCES_TRAIN, geom_buffered_train.values(), s2_data_train_list):
    try:
        # Clip the dataset to the buffered province geometry
        s2_clipped = s2_data.rio.clip(
            geometries=[mapping(geom)],
            crs=s2_data.rio.crs,
            drop=True
        )
        s2_train_clipped_list.append(s2_clipped)
    except Exception as e:
        print(f"Error clipping data for {pname}: {e}")

In [None]:
s2_train_clipped_list[0]

In [None]:
# Clip test province
s2_clipped_test = s2_data_test.rio.clip(geometries=[geom_buffered_test])

In [None]:
# Plot sample train province
s2_rgb = s2_train_clipped_list[0][["red", "green", "blue"]] 
s2_rgb_array = s2_rgb.to_array("band")  # now dims: band, y, x
s2_rgb_array_squeezed = s2_rgb_array.squeeze(dim="time", drop=True)

In [None]:
s2_rgb_array_squeezed.plot.imshow(size=4, vmin=0, vmax=4000)

Calculate remote sensing indices.

In [None]:
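# Calculate remote sensing indices useful for mapping LULC
def compute_indices(ds):
    # Cast to float so band differences cannot wrap around
    # if the bands are stored as unsigned integers
    red = ds["red"].astype("float32")
    green = ds["green"].astype("float32")
    blue = ds["blue"].astype("float32")
    nir = ds["nir08"].astype("float32")
    swir = ds["swir16"].astype("float32")
    eps = 1e-6
    # SAVI soil-adjustment factor. L = 0.5 is the usual default for reflectance in [0, 1];
    # it is scaled by 10000 here on the ASSUMPTION that band values are reflectance x 10000.
    L = 0.5 * 10000
    return xr.Dataset({
        "NDVI": (nir - red) / (nir + red + eps),
        "MNDWI": (green - swir) / (green + swir + eps),
        "SAVI": ((nir - red) / (nir + red + L)) * 1.5,  # (1 + 0.5) gain term
        "BSI": ((swir + red) - (nir + blue)) / ((swir + red) + (nir + blue) + eps),
    })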

index_data_train_list = []

for s2_clipped in s2_train_clipped_list:
    index_data = compute_indices(s2_clipped).squeeze("time", drop=True)
    index_data_train_list.append(index_data)

index_data_test = compute_indices(s2_clipped_test).squeeze("time", drop=True)
print(index_data_train_list[0])
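
To make the index arithmetic concrete, here is a minimal NumPy sketch of the NDVI formula on two made-up pixels. The band values and the `uint16` storage type are assumptions for illustration, not values read from the composite. The sketch also shows why casting to float before subtracting is a sensible precaution: unsigned integer arrays wrap around instead of going negative.

```python
import numpy as np

# Made-up band values for two pixels (vegetation-like, then non-vegetated).
# Assumption: bands are stored as unsigned 16-bit integers.
nir = np.array([3000, 500], dtype="uint16")
red = np.array([500, 3000], dtype="uint16")

# Unsigned subtraction wraps around instead of producing a negative number.
wrapped_diff = nir - red

# Casting to float first gives the expected signed difference.
nir_f = nir.astype("float32")
red_f = red.astype("float32")
ndvi = (nir_f - red_f) / (nir_f + red_f + 1e-6)

print("uint16 difference:", wrapped_diff)
print("NDVI:", ndvi)
```

The second pixel has more red than NIR reflectance, so its NDVI should be negative; the wrapped integer difference would instead produce a large positive value.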

Rasterize labels from the ROIs for training and test.

In [None]:
# Rasterize labels
width_test, height_test = s2_clipped_test.x.size, s2_clipped_test.y.size
bands = ['red', 'green', 'blue', 'nir08']

#print(gdf_.ROI.unique(), gdf_.ROI_numeric.unique())
gdf_test = gdf_test.to_crs(epsg=s2_clipped_test.rio.crs.to_epsg())

# Define the resolution and bounds based on Sentinel-2 features
resolution = s2_clipped_test.rio.resolution()
bounds_test = s2_clipped_test.rio.bounds()

gdf_rpg = lulc_gdf.to_crs(s2_clipped_test.rio.crs)

unique_classes = gdf_rpg['ROI'].unique()
#class_mapping = {cls: i+1 for i, cls in enumerate(unique_classes)}
class_mapping = {cls: i for i, cls in enumerate(unique_classes)} # zero-based, assumes existence of no data

# Add numerical column
gdf_rpg['ROI_numeric'] =  gdf_rpg['ROI'].map(class_mapping)

raster_bounds_test = box(*s2_clipped_test.rio.bounds())
gdf_test_clipped = gdf_rpg[gdf_rpg.intersects(raster_bounds_test)]

print(f"Before: {len(gdf_rpg)} | After: {len(gdf_test_clipped)}")

# Rasterize the vector dataset to match Sentinel-2
rasterized_labels_test = make_geocube(
    vector_data=gdf_test_clipped,
    measurements=["ROI_numeric"], 
    like=s2_clipped_test,  # Align with the features dataset
)

print("rasterized_labels_test: ", rasterized_labels_test)

In [None]:
# Rasterize labels
rasterized_labels_train_dict = {}
gdf_train_clipped_dict = {}
metadata_dict = {}

for s2_clipped_train, pname in zip(s2_train_clipped_list, PROVINCES_TRAIN):
    print(f"\n Processing province: {pname}")

    # Sentinel-2 metadata
    width = s2_clipped_train.sizes['x']
    height = s2_clipped_train.sizes['y']
    resolution = s2_clipped_train.rio.resolution()
    bounds = s2_clipped_train.rio.bounds()
    epsg = s2_clipped_train.rio.crs.to_epsg()
    raster_bounds = box(*bounds)

    # Reproject LULC GeoDataFrame to match S2 CRS
    gdf_rpg = lulc_gdf.to_crs(s2_clipped_train.rio.crs.to_epsg())

    # Class mapping (safe per-province if needed)
    unique_classes = gdf_rpg['ROI'].unique()
    #class_mapping = {cls: i+1 for i, cls in enumerate(unique_classes)}
    class_mapping = {cls: i for i, cls in enumerate(unique_classes)} # zero-based, assumes existence of no data    
    gdf_rpg['ROI_numeric'] = gdf_rpg['ROI'].map(class_mapping)

    # Clip vector LULC to S2 bounds
    gdf_train_clipped = gdf_rpg[gdf_rpg.intersects(raster_bounds)]
    print(f"Vector features: {len(gdf_rpg)} → Clipped: {len(gdf_train_clipped)}")

    if len(gdf_train_clipped) == 0:
        print(f"No vector data found for province: {pname}, skipping rasterization.")
        continue

    # Rasterize clipped vector labels
    rasterized_labels_train = make_geocube(
        vector_data=gdf_train_clipped,
        measurements=["ROI_numeric"],
        like=s2_clipped_train
    )

    # Store outputs
    rasterized_labels_train_dict[pname] = rasterized_labels_train
    gdf_train_clipped_dict[pname] = gdf_train_clipped
    metadata_dict[pname] = {
        "width": width,
        "height": height,
        "epsg": epsg,
        "resolution": resolution,
        "bounds": bounds,
        "class_mapping": class_mapping
    }


In [None]:
rasterized_labels_train_dict["SHEFA"]

In [None]:
fig, ax = plt.subplots()
gdf_train_clipped_dict["TORBA"].plot(ax=ax, facecolor="none", edgecolor="red")
bbox = box(*s2_train_clipped_list[0].rio.bounds())
gpd.GeoSeries([bbox], crs=s2_train_clipped_list[0].rio.crs).plot(ax=ax, facecolor="none", edgecolor="blue")
plt.title("gdf_train (red) vs. Raster Bounds (blue)")
plt.show()


In [None]:
gdf_train_clipped_dict["TORBA"].ROI.unique()

In [None]:
np.unique(rasterized_labels_train_dict["TORBA"]['ROI_numeric'].values)

Flatten the pixels and retain only those that overlap with an ROI. The labels (ROIs) are sparse, so we discard the unlabeled pixels in the regions between ROIs.

In [None]:
features_list = []
labels_list = []

for i, prov_name in enumerate(PROVINCES_TRAIN):
    # Stack spatial dimensions first
    features_train = index_data_train_list[i].to_array().stack(flattened_pixel=("y", "x"))
    labels_train = rasterized_labels_train_dict[prov_name].to_array().stack(flattened_pixel=("y", "x"))
    
    # Compute mask for valid pixels (no NaNs across all features or labels)
    mask = (
        np.isfinite(features_train).all(dim="variable") &
        np.isfinite(labels_train).all(dim="variable")
    ).compute()
    
    # Apply the mask to drop invalid pixels
    features_train = features_train[:, mask].transpose("flattened_pixel", "variable").compute()
    labels_train = labels_train[:, mask].transpose("flattened_pixel", "variable").squeeze().compute()
    
    labels_train = labels_train.astype(int)

    features_list.append(features_train)
    labels_list.append(labels_train)

# Concatenate all provinces along the flattened_pixel dimension
features_train = xr.concat(features_list, dim="flattened_pixel")
labels_train = xr.concat(labels_list, dim="flattened_pixel")

print("Combined features_train shape:", features_train.shape)
print("Combined zero-based labels_train shape:", labels_train.shape)


In [None]:
np.unique(labels_train.values)
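
The flatten-and-mask step can be sketched on a tiny array with plain NumPy. The grid size, feature values, and labels below are invented for illustration; the notebook itself performs the same idea with `xarray.stack` and a boolean mask.

```python
import numpy as np

# Toy feature cube: 2 features on a 2 x 3 grid, laid out as (feature, y, x).
features = np.arange(12, dtype="float32").reshape(2, 2, 3)

# Toy label raster: NaN marks pixels that fall outside every ROI polygon.
labels = np.array([[0.0, np.nan, 1.0],
                   [np.nan, 2.0, np.nan]])

# Flatten the spatial dimensions: (feature, y, x) -> (pixel, feature).
X = features.reshape(features.shape[0], -1).T
y = labels.reshape(-1)

# Keep only pixels with finite features and a finite label.
valid = np.isfinite(X).all(axis=1) & np.isfinite(y)
X_labeled = X[valid]
y_labeled = y[valid].astype(int)

print(X_labeled.shape, y_labeled)
```

Only three of the six pixels carry a label, so three rows remain, each holding the two feature values of one labeled pixel.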

In [None]:
features_test = index_data_test.to_array().stack(flattened_pixel=("y", "x"))
labels_test = rasterized_labels_test.to_array().stack(flattened_pixel=("y", "x"))

test_mask = (
    np.isfinite(features_test).all(dim="variable") &
    np.isfinite(labels_test).all(dim="variable")
).compute()

features_test = features_test[:].transpose("flattened_pixel", "variable").compute()
labels_test = labels_test[:].transpose("flattened_pixel", "variable").squeeze().compute()
print("labels_test values:", np.unique(labels_test.values))

# Convert NaN (unlabeled pixels) to the no-data value 255
labels_test = labels_test.fillna(255).astype(int)
print("labels_test values:", np.unique(labels_test.values))

print("features_test shape:", features_test.shape)
print("labels_test shape:", labels_test.shape)
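
The test labels keep every pixel, with 255 standing in for "no label", so that the arrays can later be reshaped back into a map. A minimal NumPy sketch of that sentinel pattern follows; the label values are made up, and filtering on the sentinel before scoring is a suggestion rather than a step taken in this notebook.

```python
import numpy as np

# Toy flattened label vector; NaN means "no ROI covers this pixel".
labels = np.array([0.0, np.nan, 2.0, np.nan, 1.0])

NODATA = 255  # same sentinel value as used above

# Replace NaN with the sentinel so the array can be stored as integers.
labels_int = np.where(np.isnan(labels), NODATA, labels).astype(int)

# Boolean mask selecting only the labeled pixels, e.g. for scoring later.
labeled = labels_int != NODATA

print(labels_int)
print(labels_int[labeled])
```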

In [None]:
len(features_train), len(labels_train), len(features_test), len(labels_test)

In [None]:
features_train

## Data Splitting

Now that we have the arrays flattened, we can split the datasets into training and testing partitions. We will reserve 80 percent of the data for training, and 20 percent for testing.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    features_train, labels_train, test_size=0.2, random_state=42, shuffle=True
)

Check that every class appears in both partitions.

In [None]:
np.unique(y_train), np.unique(y_test) 

In [None]:
len(X_train), len(X_test), len(y_train), len(y_test)
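
To illustrate what the seed buys us, here is a small NumPy sketch of a seeded 80/20 shuffle split. `split_indices` is a hypothetical helper written for this example; it is not how `dask_ml` implements `train_test_split`, only the same idea in a few lines.

```python
import numpy as np

def split_indices(n_samples, test_size=0.2, seed=42):
    """Return shuffled (train_idx, test_idx) for a reproducible random split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_size))
    return order[n_test:], order[:n_test]

train_idx, test_idx = split_indices(10)
train_idx_again, test_idx_again = split_indices(10)

print(len(train_idx), len(test_idx))
```

Calling the helper twice with the same seed yields the same partition, which is what makes results comparable between runs.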

## Random Forest Classification

Now we will set up a [random forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with 100 trees. We use the GPU-accelerated cuML implementation imported above, whose interface closely follows the scikit-learn one linked here. We use a [seed](https://towardsdatascience.com/why-do-we-set-a-random-state-in-machine-learning-models-bb2dc68d8431) (`random_state`) to ensure reproducibility. Calling the `.fit()` method on the classifier will initiate training.

In [None]:
%%time
# Train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train.data, y_train.data)
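
As an intuition for how a forest combines its trees, the sketch below takes a majority vote over three hand-written rules. The rules, thresholds, and class names are invented for this example and are not learned from data; real implementations train each tree on resampled data and may combine them differently (for example by averaging class probabilities).

```python
from collections import Counter

# Three hand-written "trees" (hypothetical rules, not learned from data).
# Each takes a dict of index values for one pixel and returns a class name.
def tree_a(px):
    return "water" if px["MNDWI"] > 0.0 else "vegetation"

def tree_b(px):
    return "vegetation" if px["NDVI"] > 0.4 else "bare"

def tree_c(px):
    return "vegetation" if px["NDVI"] > 0.3 else "water"

def forest_predict(px, trees):
    """Return the class that receives the most votes."""
    votes = Counter(tree(px) for tree in trees)
    return votes.most_common(1)[0][0]

pixel = {"NDVI": 0.65, "MNDWI": -0.3}
prediction = forest_predict(pixel, [tree_a, tree_b, tree_c])
print(prediction)
```

A single rule can be wrong for a given pixel, but the vote is only wrong when most of the rules are wrong at once.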

## Prediction

Once the classifier is finished training, we can use it to make predictions on our test dataset.

In [None]:
# Predict labels for the held-out split (pass the underlying array, as in .fit())
y_pred = clf.predict(X_test.data)

## Evaluation

It's important to know how well our classifier performs relative to the true labels (`y_test`). For this, we can calculate the [accuracy metric](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) to measure agreement between the true and predicted labels.

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

We can also produce a [classification report](https://scikit-learn.org/1.7/modules/generated/sklearn.metrics.classification_report.html)
to check the precision, recall and F1 scores for each class.

In [None]:
print(classification_report(y_true=y_test, y_pred=y_pred))
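
For readers who want to see where the numbers in the report come from, here is a minimal NumPy sketch that computes accuracy and the per-class precision, recall, and F1 from raw counts. The label vectors are made up and `per_class_scores` is a hypothetical helper for this example.

```python
import numpy as np

# Toy labels for a 3-class problem (made-up values).
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])

def per_class_scores(y_true, y_pred, cls):
    """Precision, recall and F1 for one class, computed from raw counts."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return float(precision), float(recall), float(f1)

accuracy = float(np.mean(y_true == y_pred))
scores = {cls: per_class_scores(y_true, y_pred, cls) for cls in (0, 1, 2)}
print(accuracy, scores)
```

Precision asks how many pixels predicted as a class really belong to it, while recall asks how many pixels of a class were found; F1 is their harmonic mean.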

As a reminder, this is what each class number represents.

In [None]:
print("Class mapping:")
for key, val in class_mapping.items():
    print(val, key)
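
The same mapping can be inverted to turn numeric predictions back into class names. The class names below are hypothetical placeholders; the real ones come from the `ROI` column.

```python
# Hypothetical class names; the real ones come from the ROI column.
example_mapping = {"forest": 0, "water": 1, "built_up": 2}

# Invert the mapping so numeric predictions can be turned back into names.
id_to_name = {idx: name for name, idx in example_mapping.items()}

example_predictions = [2, 0, 0, 1]
decoded = [id_to_name[idx] for idx in example_predictions]
print(decoded)
```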

We can also plot a confusion matrix to explore per-class performance.

In [None]:
ConfusionMatrixDisplay.from_predictions(
    y_true=y_test, y_pred=y_pred, normalize="true", values_format=".2f"
)
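
With `normalize="true"`, each row of the confusion matrix is divided by the number of true samples in that class. A small NumPy sketch on a made-up matrix shows the effect.

```python
import numpy as np

# Toy confusion matrix: rows are true classes, columns are predicted classes.
cm = np.array([[8, 2, 0],
               [1, 3, 1],
               [0, 5, 5]])

# Divide each row by its total so every row sums to 1.
row_totals = cm.sum(axis=1, keepdims=True)
cm_normalized = cm / row_totals

# The diagonal then holds the per-class recall.
per_class_recall = np.diag(cm_normalized)
print(np.round(cm_normalized, 2))
print(per_class_recall)
```

Row normalization makes classes with few samples comparable to classes with many, which is useful when the labels are imbalanced.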

Notice that performance varies widely across classes. This is likely due to class imbalance in our training dataset, or to classes that are hard to distinguish from one another. Data augmentation or a revision of the class definitions may help to address this.

## Visualization

To plot a map of predicted LULC for the entire area of interest, we generate predictions for every pixel of the test province dataset, not only for the labeled pixels.

In [None]:
y_pred = clf.predict(features_test)

In [None]:
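predicted_map = y_pred.reshape((height_test, width_test))
# Name the dimensions explicitly so the y/x coordinates line up with the array axes
predicted_map_xr = xr.DataArray(
    data=predicted_map, coords=rasterized_labels_test.coords, dims=("y", "x")
)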
print(np.unique(y_pred))
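
The reshape only restores the map correctly because the pixels were flattened in row-major `("y", "x")` order and are reshaped with `(height, width)` in that same order. A toy NumPy sketch with invented values makes the point.

```python
import numpy as np

height, width = 2, 3

# Toy class raster and its flattened form (row-major: all of row 0, then row 1).
class_raster = np.array([[0, 1, 1],
                         [2, 2, 0]])
flat = class_raster.reshape(-1)

# Reshaping with (height, width) restores the original layout.
restored = flat.reshape((height, width))

# Swapping the two sizes yields a valid array with a scrambled layout.
scrambled = flat.reshape((width, height))

print(restored)
print(scrambled)
```

Swapping height and width does not raise an error, so a quick visual check of the plotted map is worthwhile.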

In [None]:
predicted_map_xr.hvplot.image(height=600, rasterize=True, cmap="set1_r")

In [None]:
rasterized_labels_test.ROI_numeric.hvplot.image(rasterize=True, cmap="set1_r")

In [None]:
"""
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

rasterized_labels_test["ROI_numeric"].plot(ax=axes[0], cmap="viridis")
axes[0].set_title("Ground truth")
axes[0].set_aspect('equal')

predicted_map_xr.plot(ax=axes[1], cmap="viridis")
axes[1].set_title("Predictions")
axes[1].set_aspect('equal')

plt.tight_layout()
plt.show()
"""

In [None]:
compatible_array = predicted_map_xr.astype("int32")

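# Vectorize the predicted raster into polygons
polygons = list(
    rasterio.features.shapes(compatible_array.values, transform=compatible_array.rio.transform())
)

# Convert polygons to a GeoDataFrame. The polygon coordinates come from the
# raster transform, so they are in the raster's CRS rather than EPSG:4326.
prediction_gdf = gpd.GeoDataFrame(
    [{"geometry": shape(geom), "value": value} for geom, value in polygons],
    crs=s2_clipped_test.rio.crs,
)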
#print(prediction_gdf)
print(prediction_gdf.value.unique())

prediction_gdf.to_file(f"./predicted_lulc_utm_{PROVINCE_TEST}_{YEAR}.geojson", driver="GeoJSON")

In [None]:
prediction_gdf.head(10)

You can run these predictions on every province, collect the GeoDataFrames in a list, and combine them into a final, unified nationwide LULC vector dataset like so (placeholder code; you need to generate the per-province predictions first):

In [None]:
prediction_gdf_merged_nationwide = pd.concat([prediction_gdf_TORBA, prediction_gdf_SANMA, prediction_gdf_PENAMA,
                        prediction_gdf_MALAMPA, prediction_gdf_SHEFA, prediction_gdf_TAFEA], ignore_index=True)

prediction_gdf_merged_nationwide.to_file(f"./predicted_lulc_utm_nationwide_{YEAR}.geojson", driver="GeoJSON")