# Land Use / Land Cover Segmentation Using Sentinel-2 and Random Forest

This workflow demonstrates how to use [Sentinel-2](https://www.esa.int/Applications/Observing_the_Earth/Copernicus/Sentinel-2) satellite imagery to segment land use / land cover (LULC) with a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) classifier. Ground truth areas come from the **National Forest Classification Dataset (LULC)** from 2018. To make this scalable to all of Vanuatu, we use an [administrative boundaries dataset from the Pacific Data Hub](https://pacificdata.org/data/dataset/2016_vut_phc_admin_boundaries/resource/66ae054b-9b67-4876-b59c-0b078c31e800).

In this notebook, we will demonstrate the following:

1. **Data Acquisition**:
   - We use a **Sentinel-2** annual composite (the `dep_s2_geomad` collection: Sentinel-2 Geometric Median and Absolute Deviations over the Pacific) accessed via the [Digital Earth Pacific STAC catalog](http://stac.digitalearthpacific.org/). The search is filtered by a region of interest (AOI) and a year to obtain suitable imagery.
   
2. **Preprocessing**:
   - The Sentinel-2 imagery contains several spectral bands (e.g., Red, Green, Blue, and Near-Infrared). These are extracted and combined into a single dataset for analysis. Additionally, the imagery is masked to remove areas outside the AOI and focus on the relevant pixels.
  
3. **Feature Extraction**:
   - Features for the classifier are spectral indices computed from the Sentinel-2 bands. Here, we use NDVI, MNDWI, SAVI, and BSI, which are derived from the Red, Green, Blue, Near-Infrared (NIR), and Shortwave-Infrared (SWIR) bands.

4. **Ground Truth Data Integration**:
   - A shapefile containing polygons attributed by land cover/land use is loaded into a [GeoDataFrame](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.html). This allows us to create multi-class labels for the pixels in the Sentinel-2 imagery.
  
5. **Data Splitting**:
   - To evaluate the model fairly, training and testing data are kept separate. In this notebook the split is spatial: one province (SHEFA) is used for training and another (TORBA) for testing. A random 80% / 20% split with a fixed 'seed' for reproducibility is kept as a commented-out alternative.

6. **Random Forest Classification**:
   - We train a **Random Forest** classifier to predict LULC classes. The `n_estimators` parameter is a key hyperparameter that sets the number of decision trees in the forest. Random Forest combines the predictions of many decision trees, which is usually more robust than relying on a single tree.

7. **Prediction**:
   - We will use the trained classifier to predict the LULC class of each pixel in the image.

8. **Evaluation**:
   - After making predictions on the test set, we evaluate the model's performance using metrics such as accuracy and F1-score. This allows us to assess the performance of the Random Forest model and the effectiveness of the selected features.

9. **Visualization**:
   - We visualize the predictions by plotting the classified map, where LULC types are indicated by specific color codes.

At the end, you will have trained a model to predict land use / land cover in Vanuatu.

![result](https://github.com/user-attachments/assets/6794df2b-45b4-4c6a-923b-98b33e305a39)

In [None]:
!mamba install --channel rapidsai --quiet --yes cuml

In [2]:
import dask.dataframe as dd
import geopandas as gpd
import glob
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import odc.stac
import rasterio.features
from shapely.geometry import box
import xarray as xr
# import xgboost as xgb
from cuml import RandomForestClassifier
from dask import delayed, compute
from geocube.api.core import make_geocube
from pystac_client import Client
from shapely.geometry import Polygon, shape
# from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay
#from sklearn.model_selection import train_test_split
from dask_ml.model_selection import train_test_split

## Data Acquisition

Let's read the LULC data into a GeoDataFrame. 

A [GeoDataFrame](https://geopandas.org/en/stable/docs/reference/geodataframe.html) is a type of data structure used to store geographic data in Python, provided by the [GeoPandas](https://geopandas.org/en/stable/) library. It extends the functionality of a pandas DataFrame to handle spatial data, enabling geospatial analysis and visualization. Like a pandas DataFrame, a GeoDataFrame is a tabular data structure with labeled axes (rows and columns), but it adds special features to work with geometric objects, such as:
- a geometry column
- a CRS
- access to spatial operations (e.g., intersection, union, buffering, and spatial joins)

In [None]:
lulc_gdf = gpd.read_file("./ROIs_v5.shp")

In [None]:
admin_boundaries_gdf = gpd.read_file("./2016_phc_vut_pid_4326.geojson")

In [None]:
admin_boundaries_gdf.head(2)

In [None]:
len(lulc_gdf), len(admin_boundaries_gdf)

We can check out the attributes associated with this dataset:

In [None]:
lulc_gdf.columns

Let's see which classes are available to us in the LULC column (`ROI`).

In [None]:
lulc_gdf.ROI.unique()

And view a subset of the data (shuffled for more variety in the 10 samples):

In [None]:
lulc_gdf.sample(frac=1).head(10)

We can also plot the vector dataset, and color code the polygons by the relevant LULC column.

In [None]:
lulc_gdf.plot(column='ROI')

Next, we break the LULC data down into subsets per administrative boundary and create raster image and label xarray DataArrays for each one.
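To make the labelling idea concrete, here is a minimal NumPy sketch (not the notebook's actual pipeline) of how class names can be encoded as integers and how pixels without a label can be given a background value of 0. The class names and the tiny 3×3 grid are made up for illustration.

```python
import numpy as np

# Hypothetical class names; the real ones come from the ROI column
class_names = ["Forest", "Grassland", "Water"]

# Reserve 0 for "no label" by starting the class codes at 1
class_mapping = {name: code for code, name in enumerate(class_names, start=1)}

# Toy label grid: NaN marks pixels not covered by any polygon
label_grid = np.array([
    [1.0, 1.0, np.nan],
    [np.nan, 2.0, 2.0],
    [3.0, np.nan, np.nan],
])

# Replace NaN with the background code and store the result as integers
labels = np.where(np.isnan(label_grid), 0, label_grid).astype(int)

print(class_mapping)
print(labels)
```

Keeping 0 free for the background makes it easy to tell labeled pixels apart from unlabeled ones later on.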

In [None]:
admin_boundaries_gdf

In [None]:
YEAR = 2018
PROVINCE_TRAIN = "SHEFA"
PROVINCE_TEST = "TORBA"

In [None]:
admin_boundaries_gdf = admin_boundaries_gdf.set_index(keys="pname")  # set province name as the index

In [None]:
GEOM_TRAIN = admin_boundaries_gdf.loc[PROVINCE_TRAIN].geometry
GEOM_TRAIN

GEOM_TEST = admin_boundaries_gdf.loc[PROVINCE_TEST].geometry
GEOM_TEST

In [None]:
STAC_URL = "http://stac.digitalearthpacific.org/"
stac_client = Client.open(STAC_URL)

In [None]:
# geos = glob.glob("clipped_lulc/*.geojson")
# gdf_train = gpd.read_file(geos[4])
gdf_train = lulc_gdf.query(expr=f"Pname == '{PROVINCE_TRAIN}'")

s2_search = stac_client.search(
    collections=["dep_s2_geomad"], # Sentinel-2 Geometric Median and Absolute Deviations (GeoMAD) over the Pacific.
    intersects=GEOM_TRAIN, 
    datetime=str(YEAR),
)
# Retrieve all items from search results
s2_items = s2_search.item_collection()
print("len(s2_items): ", len(s2_items))

s2_data_train = odc.stac.load(
    items=s2_items,
    bands=["blue", "green", "red", "nir08", "swir16"],
    chunks={'x': 1024, 'y': 1024, 'bands': -1, 'time': -1},
    resolution=20,
)
print("s2_data: ", s2_data_train)

In [None]:
# gdf_test = gpd.read_file(geos[0])
gdf_test = lulc_gdf.query(expr=f"Pname == '{PROVINCE_TEST}'")

s2_search = stac_client.search(
    collections=["dep_s2_geomad"], # Sentinel-2 Geometric Median and Absolute Deviations (GeoMAD) over the Pacific.
    intersects=GEOM_TEST, 
    datetime=str(YEAR),
)
# Retrieve all items from search results
s2_items = s2_search.item_collection()
print("len(s2_items): ", len(s2_items))

s2_data_test = odc.stac.load(
    items=s2_items,
    bands=["blue", "green", "red", "nir08", "swir16"],
    chunks={'x': 1024, 'y': 1024, 'bands': -1, 'time': -1},
    resolution=20,
)
print("s2_data: ", s2_data_test)

In [None]:
# Keep projection aligned with raster
raster_crs = s2_data_train.rio.crs

# Get only the select province and reproject
gdf_reprojected_train = admin_boundaries_gdf.loc[[PROVINCE_TRAIN]].to_crs(crs=raster_crs)
gdf_reprojected_test = admin_boundaries_gdf.loc[[PROVINCE_TEST]].to_crs(crs=raster_crs)

# Buffer in raster units (meters if UTM)
geom_buffered_train = gdf_reprojected_train.buffer(distance=5000)[PROVINCE_TRAIN]
geom_buffered_test = gdf_reprojected_test.buffer(distance=5000)[PROVINCE_TEST]
geom_buffered_test

In [None]:
# Clip
s2_clipped_train = s2_data_train.rio.clip(geometries=[geom_buffered_train])
s2_clipped_test = s2_data_test.rio.clip(geometries=[geom_buffered_test])

In [None]:
s2_clipped_train

In [None]:
s2_rgb = s2_clipped_train[["red", "green", "blue"]] 
s2_rgb_array = s2_rgb.to_array("band")  # now dims: band, y, x
s2_rgb_array_squeezed = s2_rgb_array.squeeze(dim="time", drop=True)

In [None]:
s2_rgb_array_squeezed.plot.imshow(size=4, vmin=0, vmax=4000)

In [None]:
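def compute_indices(ds):
    """Compute spectral indices from the Sentinel-2 bands of an xarray Dataset."""
    # Cast to float so differences cannot wrap around if bands are stored as unsigned integers
    red = ds["red"].astype("float32")
    green = ds["green"].astype("float32")
    blue = ds["blue"].astype("float32")
    nir = ds["nir08"].astype("float32")
    swir = ds["swir16"].astype("float32")
    eps = 1e-6  # avoids division by zero
    # SAVI soil factor L is 0.5 for reflectance in [0, 1]. ASSUMPTION: the bands are
    # scaled to 0-10000 (the RGB plot above uses vmax=4000), so L is scaled to match.
    soil_factor = 0.5 * 10_000
    return xr.Dataset({
        "NDVI": (nir - red) / (nir + red + eps),
        "MNDWI": (green - swir) / (green + swir + eps),
        # SAVI = (1 + L) * (NIR - Red) / (NIR + Red + L); without L in the
        # denominator it is just a multiple of NDVI and adds no information
        "SAVI": 1.5 * (nir - red) / (nir + red + soil_factor),
        "BSI": ((swir + red) - (nir + blue)) / ((swir + red) + (nir + blue) + eps),
    })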

index_data_train = compute_indices(s2_clipped_train).squeeze("time", drop=True)
index_data_test = compute_indices(s2_clipped_test).squeeze("time", drop=True)
print(index_data_train)

In [None]:
width_train, height_train = s2_clipped_train.x.size, s2_clipped_train.y.size
width_test, height_test = s2_clipped_test.x.size, s2_clipped_test.y.size
epsg = s2_clipped_train.rio.crs.to_epsg()
# Build one class-name -> integer mapping from the full LULC dataset so that the
# training and test provinces share the same codes (0 is reserved for "no label")
gdf_rpg = lulc_gdf.to_crs(s2_clipped_train.rio.crs)
unique_classes = gdf_rpg['ROI'].unique()
class_mapping = {cls: i+1 for i, cls in enumerate(unique_classes)}
gdf_rpg['ROI_numeric'] = gdf_rpg['ROI'].map(class_mapping)

# Reproject the per-province subsets to the raster CRS and give them the same codes
gdf_train = gdf_train.to_crs(epsg=epsg)
gdf_test = gdf_test.to_crs(epsg=epsg)
gdf_train['ROI_numeric'] = gdf_train['ROI'].map(class_mapping)
gdf_test['ROI_numeric'] = gdf_test['ROI'].map(class_mapping)

# Resolution and bounds of the Sentinel-2 rasters
resolution = s2_clipped_train.rio.resolution()
bounds_train = s2_clipped_train.rio.bounds()
bounds_test = s2_clipped_test.rio.bounds()

raster_bounds_train = box(*s2_clipped_train.rio.bounds())
gdf_train_clipped = gdf_rpg[gdf_rpg.intersects(raster_bounds_train)]

raster_bounds_test = box(*s2_clipped_test.rio.bounds())
gdf_test_clipped = gdf_rpg[gdf_rpg.intersects(raster_bounds_test)]

print(f"Before: {len(gdf_rpg)} | After: {len(gdf_train_clipped)}")

# Rasterize the vector dataset to match Sentinel-2
rasterized_labels_train = make_geocube(
    vector_data=gdf_train_clipped,
    measurements=["ROI_numeric"], 
    like=s2_clipped_train,  # Align with the features dataset
)

# Rasterize the vector dataset to match Sentinel-2
rasterized_labels_test = make_geocube(
    vector_data=gdf_test_clipped,
    measurements=["ROI_numeric"], 
    like=s2_clipped_test,  # Align with the features dataset
)

print("rasterized_labels_train: ", rasterized_labels_train)

In [None]:
gdf_test_clipped.head(10)

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
gdf_train_clipped.plot(ax=ax, facecolor="none", edgecolor="red")
bbox = box(*s2_clipped_train.rio.bounds())
gpd.GeoSeries([bbox], crs=s2_clipped_train.rio.crs).plot(ax=ax, facecolor="none", edgecolor="blue")
plt.title("gdf_train (red) vs. Raster Bounds (blue)")
plt.show()


In [None]:
gdf_test.ROI.unique()

In [None]:
np.unique(rasterized_labels_train['ROI_numeric'].values)

In [None]:
# The rasterized output is an xarray.Dataset
rasterized_labels_train = rasterized_labels_train.where(~np.isnan(rasterized_labels_train), other=0).astype(int)  # Replace NaNs with 0
rasterized_labels_test = rasterized_labels_test.where(~np.isnan(rasterized_labels_test), other=0).astype(int)  # Replace NaNs with 0
#rasterized_labels = rasterized_labels.astype(int)
features_train = index_data_train.to_array().stack(flattened_pixel=("y", "x")).transpose("flattened_pixel", "variable")
features_test = index_data_test.to_array().stack(flattened_pixel=("y", "x")).transpose("flattened_pixel", "variable")
print("features: ", features_train)
labels_train = rasterized_labels_train.to_array().stack(flattened_pixel=("y", "x")).transpose("flattened_pixel", "variable").squeeze()
labels_test = rasterized_labels_test.to_array().stack(flattened_pixel=("y", "x")).transpose("flattened_pixel", "variable").squeeze()
print("labels: ", labels_train)

In [None]:
len(features_train), len(labels_train), len(features_test), len(labels_test)

In [None]:
features_train

In [None]:
#features_flattened = xr.concat(features_, dim="flattened_pixel")
#labels_flattened = xr.concat(labels_, dim="flattened_pixel")

In [None]:
#features_flattened.shape, labels_flattened.shape

In [None]:
features_train.shape, labels_train.shape

## Data Splitting

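Now that we have the arrays flattened, we can separate the data into training and testing partitions. Instead of a random split that reserves 80 percent of the pixels for training and 20 percent for testing (kept below as commented-out code), this notebook uses a spatial split: all pixels from the training province are used for training, and all pixels from the test province are used for testing.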
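As a rough illustration of the two splitting strategies that appear in this notebook (a random 80/20 split and holding out a whole province), the sketch below uses NumPy on toy data. The `region` tags "A" and "B" are hypothetical stand-ins for provinces, and the arrays are random numbers rather than real features.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the split is reproducible

# Toy data: 10 pixels with 4 features each, plus a hypothetical region tag per pixel
features = rng.normal(size=(10, 4))
labels = rng.integers(0, 3, size=10)
region = np.array(["A"] * 6 + ["B"] * 4)

# Option 1: random 80/20 split over shuffled pixel indices
order = rng.permutation(len(features))
n_train = int(0.8 * len(features))
train_idx, test_idx = order[:n_train], order[n_train:]
X_train_rand, X_test_rand = features[train_idx], features[test_idx]

# Option 2: spatial hold-out, train on region A and test on region B
X_train_sp, y_train_sp = features[region == "A"], labels[region == "A"]
X_test_sp, y_test_sp = features[region == "B"], labels[region == "B"]

print(X_train_rand.shape, X_test_rand.shape)
print(X_train_sp.shape, X_test_sp.shape)
```

A spatial hold-out tends to give a more conservative estimate, because neighbouring pixels, which often look alike, cannot end up on both sides of the split.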

In [None]:
#features_flattened.shape, labels_flattened.shape

In [None]:
#X_train, X_test, y_train, y_test = train_test_split(
#    features_flattened, labels_flattened, test_size=0.2, random_state=42, shuffle=True
#)

In [None]:
X_train, X_test, y_train, y_test = features_train, features_test, labels_train, labels_test

Ensure all labels are in each partition.

In [None]:
np.unique(y_train), np.unique(y_test) 

In [None]:
len(X_train), len(X_test), len(y_train), len(y_test)

## Random Forest Classification

Now we will set up a [random forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with 100 trees, using the GPU-accelerated `RandomForestClassifier` from cuML that we imported above, whose interface follows scikit-learn's. We use a [seed](https://towardsdatascience.com/why-do-we-set-a-random-state-in-machine-learning-models-bb2dc68d8431) (`random_state`) to ensure reproducibility. Calling the `.fit()` method on the classifier will initiate training.
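The ensemble idea behind a random forest can be sketched in a few lines of NumPy. The per-tree predictions below are invented, and the hard majority vote is a simplification: scikit-learn, for example, weights each tree's vote by its class probability estimates. Treat this as an illustration of the concept, not of what the library does internally.

```python
import numpy as np

# Hypothetical predictions from 5 trees (rows) for 4 pixels (columns)
tree_predictions = np.array([
    [1, 2, 0, 3],
    [1, 2, 0, 1],
    [1, 0, 0, 3],
    [2, 2, 0, 3],
    [1, 2, 1, 1],
])

n_classes = 4

# Count the votes each class receives per pixel: shape (n_classes, n_pixels)
votes = np.stack([(tree_predictions == c).sum(axis=0) for c in range(n_classes)])

# The forest's prediction is the class with the most votes
# (argmax resolves ties in favour of the lowest class code)
forest_prediction = votes.argmax(axis=0)
print(forest_prediction)
```

Individual trees disagree on some pixels, but the combined vote is more stable, which is why adding trees (`n_estimators`) usually helps up to a point.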

In [53]:
%%time
# Train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42) #n_estimators=10
clf.fit(X_train.data, y_train.data)

CPU times: user 2min 5s, sys: 2min 41s, total: 4min 47s
Wall time: 1min 17s


## Prediction

Once the classifier is finished training, we can use it to make predictions on our test dataset.

In [58]:
# Test the classifier
y_pred = clf.predict(X_test.data.compute())

## Evaluation

It's important to know how well our classifier performs relative to the true labels (`y_test`). For this, we can calculate the [accuracy metric](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) to measure agreement between the true and predicted labels.
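One caveat when reading the accuracy value: in this notebook, pixels without a ground-truth polygon are assigned the label 0 and stay in the evaluation. The NumPy sketch below, on made-up labels, shows how a dominant background class can pull the overall accuracy well above the accuracy on labeled pixels alone.

```python
import numpy as np

# Toy labels: 0 is the background (unlabeled) code, 1 and 2 are LULC classes
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 2, 1, 2])

# Overall accuracy: fraction of pixels where prediction equals truth
overall = (y_true == y_pred).mean()

# Accuracy restricted to pixels that actually carry a ground-truth class
labeled = y_true != 0
labeled_only = (y_true[labeled] == y_pred[labeled]).mean()

print(f"overall: {overall:.2f}, labeled only: {labeled_only:.2f}")
# prints "overall: 0.80, labeled only: 0.50"
```

Reporting both numbers, or masking the background before scoring, gives a clearer picture of how well the actual LULC classes are separated.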

In [59]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.9993


In [None]:
# Evaluate the performance (you can use metrics like accuracy, F1-score, etc.)
#print("Accuracy:", accuracy_score(y_test, y_pred)) #0.60 # 0.6130508977954351 for 50

We can also plot a confusion matrix to explore per-class performance.

In [None]:
ConfusionMatrixDisplay.from_predictions(y_true=y_test, y_pred=y_pred)

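Notice the high variability in performance across classes. This is likely due to class imbalance or to classes that are hard to tell apart in our training dataset. Keep in mind that pixels without a ground-truth polygon were assigned the label 0 and are part of the evaluation, so the overall accuracy is dominated by this background class. Data augmentation or revising the class definitions may help to address this.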
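To see where per-class numbers come from, here is a small NumPy sketch that builds a confusion matrix by hand and derives recall, precision, and F1 for each class. The labels are invented, and the formulas assume every class appears at least once in both the true and predicted labels (otherwise a division by zero would need handling).

```python
import numpy as np

# Toy true and predicted class codes
y_true = np.array([1, 1, 1, 1, 2, 2, 3, 3, 3, 3])
y_pred = np.array([1, 1, 1, 2, 2, 1, 3, 3, 3, 3])

classes = np.unique(np.concatenate([y_true, y_pred]))
index = {int(c): i for i, c in enumerate(classes)}

# Rows are true classes, columns are predicted classes
cm = np.zeros((len(classes), len(classes)), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[index[int(t)], index[int(p)]] += 1

# Recall: correct predictions divided by the number of true samples per class
recall = cm.diagonal() / cm.sum(axis=1)
# Precision: correct predictions divided by the number of predictions per class
precision = cm.diagonal() / cm.sum(axis=0)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print("recall:", recall, "precision:", precision, "f1:", f1)
```

Classes with few samples (class 2 here) tend to show the weakest scores, which is the pattern class imbalance produces.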

## Visualization

To plot a map of predicted LULC for the entire area of interest, we need a prediction for every pixel. Because the test set here is the full raster of the test province, `y_pred` already covers every pixel and only needs to be reshaped back to the raster's height and width.
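Turning flat per-pixel predictions back into a map only works if the reshape uses the same pixel order as the flattening. In this notebook the pixels are stacked as `("y", "x")`, so `y` varies slowest, which matches NumPy's default row-major reshape to `(height, width)`. A toy sketch of that round trip (the grid values are made up):

```python
import numpy as np

height, width = 2, 3

# Toy label grid with rows along y and columns along x
grid = np.array([
    [1, 1, 2],
    [0, 3, 3],
])

# Flatten in row-major order: one entry per pixel, as a classifier would see it
flat = grid.reshape(-1)

# Treat the flat values as per-pixel predictions and restore the 2-D layout
predicted_map = flat.reshape((height, width))

print(flat)
print(predicted_map)
```

Swapping the two sizes, `reshape((width, height))`, would not raise an error but would scramble the map, so it is worth checking the shape against the label raster.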

In [None]:
#y_pred_full = clf.predict(features_test)
#predicted_map = y_pred_full.reshape((height_test, width_test))
predicted_map = y_pred.reshape((height_test, width_test))
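# Name the dimensions explicitly so the y/x coordinates line up with the array axes
predicted_map_xr = xr.DataArray(data=predicted_map, coords=rasterized_labels_test.coords, dims=("y", "x"))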
print(np.unique(y_pred))
    

In [None]:
predicted_map_xr, rasterized_labels_test

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

rasterized_labels_test["ROI_numeric"].plot(ax=axes[0], cmap="viridis")
axes[0].set_title("Ground truth")
axes[0].set_aspect('equal')

predicted_map_xr.plot(ax=axes[1], cmap="viridis")
axes[1].set_title("Predictions")
axes[1].set_aspect('equal')

plt.tight_layout()
plt.show()

In [None]:
compatible_array = predicted_map_xr.astype("int32")

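# Vectorize the raster into polygons (one polygon per connected region of equal value)
polygons = list(
    rasterio.features.shapes(compatible_array.values, transform=compatible_array.rio.transform())
)

# Convert polygons to a GeoDataFrame; the coordinates come from the raster
# transform, so the CRS must be the raster's CRS rather than EPSG:4326
prediction_gdf = gpd.GeoDataFrame(
    [{"geometry": shape(geom), "value": value} for geom, value in polygons],
    crs=s2_clipped_test.rio.crs,
)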
#print(prediction_gdf)
print(prediction_gdf.value.unique())

prediction_gdf.to_file(f"./predicted_lulc_utm_{PROVINCE_TEST}_{YEAR}.geojson", driver="GeoJSON")