# Modelling dam volumes using DE Africa waterbodies
# Section 02: *Model Training*
**Products used:** 
[DE Africa Waterbodies](https://docs.digitalearthafrica.org/en/latest/data_specs/Waterbodies_specs.html), 
[Department of Water and Sanitation, South Africa Dam Level and Volume Data](https://www.dws.gov.za/Hydrology/Verified/hymain.aspx)

## Background

### Digital Twin (DT)
The CGIAR Digital Twin initiative creates dynamic virtual models that combine real-time data, AI, and simulations to improve decision-making. Its prototype for the Limpopo River Basin focuses on enhancing water resource management and conservation.


## Description
This notebook presents a workflow for predicting dam levels and volumes using water surface area data from DE Africa's Waterbodies product, integrating data preprocessing, feature extraction, and Gradient Boosting modeling [(Retief et al., 2025)](https://arxiv.org/abs/2502.19989). 

As part of the CGIAR Initiative on Digital Innovation, this work contributes to a prototype [Digital Twin](https://digitaltwins.demos-only.iwmi.org/) for the Limpopo River Basin, designed to support real-time decision-making in water management. The Digital Twin leverages AI-driven tools to visualize and simulate the impact of decisions on the basin's ecosystem. To enhance prediction reliability, the model includes a correction mechanism to address unrealistically large drops in dam volume estimates.


## Getting started

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell. 

### Load packages
Import Python packages that are used for the analysis.

In [None]:
%matplotlib inline

import os
import pandas as pd
import numpy as np
import xarray as xr
import seaborn as sns
import datacube
import joblib

from scipy import interpolate
from scipy.optimize import curve_fit
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingRegressor

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

from deafrica_tools.waterbodies import get_waterbody, get_time_series, display_time_series
from IPython.display import Image

## Load Model Training Datasets


### Sample Data Overview

The training dataset contains dam water level records from South Africa's Department of Water and Sanitation (see **Products used** above). The cells below read this preprocessed ancillary data, which is needed for the volume prediction.

### Data Preparation and Feature Selection
Select the predictor columns (`calculated_level` and `water_area_ha`) and the target (`Dam_Level`), then fill missing feature values with the column mean.

In [None]:
merged_data = pd.read_csv("data/preprocess_data.csv")

In [None]:
features = merged_data[['calculated_level', 'water_area_ha']]
target = merged_data['Dam_Level']
print(f"Initial data shape: {features.shape}")
print(f"Missing values in features:\n{features.isnull().sum()}")

imputer = SimpleImputer(strategy='mean')
features_imputed = imputer.fit_transform(features)

features_imputed_df = pd.DataFrame(features_imputed, columns=features.columns)
print(f"Data shape after imputing missing values: {features_imputed_df.shape}")
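
The imputation above is fitted on the full dataset, so rows that later end up in the test set contribute to the imputed means. A possible alternative, sketched here on synthetic data, is to wrap the imputer and the regressor in a scikit-learn `Pipeline` so the means are learned from the training rows only. The synthetic values and the names `demo`, `demo_target`, and `model` are illustrative assumptions and are not part of this notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the real features (assumption, for illustration only)
rng = np.random.default_rng(0)
area = rng.uniform(500, 2000, size=200)
demo = pd.DataFrame({
    "calculated_level": 0.01 * area + rng.normal(0, 0.5, size=200),
    "water_area_ha": area,
})
demo_target = 0.012 * area + rng.normal(0, 0.3, size=200)

# Blank out 10% of the surface-area values to mimic missing observations
missing_rows = demo.sample(frac=0.1, random_state=0).index
demo.loc[missing_rows, "water_area_ha"] = np.nan

# Split first, then let the pipeline fit the imputer on the training rows only
Xtr, Xte, ytr, yte = train_test_split(demo, demo_target, test_size=0.2, random_state=42)
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    GradientBoostingRegressor(random_state=42),
)
model.fit(Xtr, ytr)

# The stored means come from the training split, not from the full dataset
train_means = model.named_steps["simpleimputer"].statistics_
print(dict(zip(demo.columns, train_means)))
```

Passing such a pipeline to `cross_val_score` or `GridSearchCV` would also re-fit the imputer inside every fold.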

### Splitting Data for Training and Testing
Split the data into training (80%) and test (20%) sets, using a fixed random seed so the split is reproducible.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features_imputed_df, target, test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape[0]} rows")
print(f"Test set size: {X_test.shape[0]} rows")


### Performing Cross-Validation
Estimate the baseline performance of a Gradient Boosting Regressor with default hyperparameters using 5-fold cross-validation on the training set.

In [None]:
print("Performing cross-validation...")
gradient_boosting = GradientBoostingRegressor(random_state=42)
cv_scores = cross_val_score(gradient_boosting, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
cv_rmse = np.sqrt(-cv_scores).mean()
print(f"Cross-validated RMSE: {cv_rmse:.4f}")
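
With `scoring='neg_mean_squared_error'`, `cross_val_score` returns one negated MSE per fold, so the value reported above is the mean of the per-fold RMSEs. The sketch below, which uses synthetic data (`X_demo` and `y_demo` are assumptions), shows the conversion and the spread across folds.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic area/level relationship (assumption, for illustration only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(500, 2000, size=(150, 1))
y_demo = 0.01 * X_demo[:, 0] + rng.normal(0, 0.5, size=150)

scores = cross_val_score(
    GradientBoostingRegressor(random_state=42),
    X_demo,
    y_demo,
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE values: flip the sign, then take the square root
fold_rmse = np.sqrt(-scores)
print(f"Per-fold RMSE: {np.round(fold_rmse, 4)}")
print(f"Mean RMSE: {fold_rmse.mean():.4f} (std {fold_rmse.std():.4f})")
```

Reporting the standard deviation alongside the mean gives a rough sense of how stable the estimate is across folds.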

### Hyperparameter Tuning
We use `GridSearchCV` to search 27 combinations of `n_estimators`, `learning_rate`, and `max_depth` with 3-fold cross-validation, and keep the combination with the lowest mean squared error for the Gradient Boosting Regressor.

In [None]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5]
}
print("Performing hyperparameter tuning using GridSearchCV...")
grid_search = GridSearchCV(GradientBoostingRegressor(random_state=42), param_grid, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
print(f"Best parameters: {grid_search.best_params_}")
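
The grid above has 3 × 3 × 3 = 27 candidates, each fitted on 3 folds, followed by one final refit. The sketch below counts the candidates with `ParameterGrid` and then shows how `cv_results_` could be inspected after a search. To keep it quick it runs a reduced grid on synthetic data; `small_grid`, `X_demo`, and `y_demo` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Same grid as in the notebook cell above
full_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 4, 5],
}
n_candidates = len(ParameterGrid(full_grid))
n_fits = n_candidates * 3 + 1  # 3 folds per candidate plus the final refit
print(f"{n_candidates} candidates, {n_fits} model fits in total")

# Reduced grid on synthetic data (assumption, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 2000, size=(120, 1))
y_demo = 0.01 * X_demo[:, 0] + rng.normal(0, 0.5, size=120)
small_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    small_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# One row per candidate, best first
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(results.to_string(index=False))

# best_score_ is a negated mean MSE; convert it to an RMSE-like value
best_cv_rmse = np.sqrt(-search.best_score_)
print(f"Best CV RMSE: {best_cv_rmse:.4f}")
```

Looking at the full ranking, and not only at `best_params_`, shows whether the best candidate is clearly ahead or within the noise of its neighbours.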


### Training the Best Model
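`GridSearchCV` refits the best estimator on the full training set by default (`refit=True`), so `best_model` is already trained with the optimal hyperparameters. The cell below fits it again explicitly on the same data, which yields the same model because the random seed is fixed.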

In [None]:
print("\nTraining the best model with optimal hyperparameters...")
best_model.fit(X_train, y_train)
print("Model training completed.")

### Model Evaluation
We evaluate the trained model on the test dataset using RMSE, MAPE, and R² Score as performance metrics.

In [None]:
y_pred = best_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
print(f"\nModel evaluation results:\n - RMSE: {rmse:.4f}\n - MAPE: {mape:.4f}\n - R² Score: {r2:.4f}")
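
As a reference for how these three metrics are defined, the sketch below recomputes them by hand with NumPy on four made-up values and compares the results with scikit-learn. The arrays `y_true` and `y_hat` are illustrative assumptions, not notebook outputs.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

# Made-up observed and predicted dam levels (assumption, for illustration only)
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_hat = np.array([11.0, 12.0, 14.0, 18.0])
errors = y_true - y_hat

# RMSE: square root of the mean squared error, in the units of the target
rmse_manual = np.sqrt(np.mean(errors ** 2))

# MAPE: mean of |error| / |observed|, returned as a fraction (0.05 means 5%)
mape_manual = np.mean(np.abs(errors) / np.abs(y_true))

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.4f}  MAPE: {mape_manual:.4f}  R²: {r2_manual:.4f}")

# The manual values agree with scikit-learn's implementations
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
mape_sklearn = mean_absolute_percentage_error(y_true, y_hat)
r2_sklearn = r2_score(y_true, y_hat)
```

Because `mean_absolute_percentage_error` returns a fraction, multiply the printed MAPE by 100 to read it as a percentage.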


### Feature Importances
We analyze the importance of each feature to understand how much each one contributed to the model's predictions.

In [None]:
print("\nFeature Importances:")
feature_importances = best_model.feature_importances_
for feature, importance in zip(features.columns, feature_importances):
    print(f"Feature: {feature}, Importance: {importance:.4f}")
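
`feature_importances_` holds impurity-based importances, normalised to sum to 1. These can be hard to interpret when features are strongly correlated, so a cross-check with a different technique, permutation importance, may be useful. The sketch below compares the two on synthetic data with one informative and one pure-noise feature; all names and values are assumptions, and the permutation scores are computed on the training data only for brevity (a held-out set would normally be preferable).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

# One informative feature and one pure-noise feature (assumption, illustration only)
rng = np.random.default_rng(1)
informative = rng.uniform(500, 2000, size=300)
noise = rng.normal(size=300)
X_demo = np.column_stack([informative, noise])
y_demo = 0.01 * informative + rng.normal(0, 0.2, size=300)

gbr = GradientBoostingRegressor(random_state=42).fit(X_demo, y_demo)

# Impurity-based importances: normalised so that they sum to 1
impurity = gbr.feature_importances_

# Permutation importance: drop in score when one column is shuffled
perm = permutation_importance(gbr, X_demo, y_demo, n_repeats=5, random_state=42)

for name, imp, pmean in zip(["informative", "noise"], impurity, perm.importances_mean):
    print(f"{name}: impurity={imp:.4f}, permutation={pmean:.4f}")
```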

### Saving the Trained Model
We save the trained model for future use.

In [None]:
print("\nSaving the trained model...")
os.makedirs("trained_models", exist_ok=True)
model_path = "trained_models/gradient_boosting_model.pkl"
joblib.dump(best_model, model_path)
print(f"Trained model saved successfully at: {model_path}")
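# Predict over the full dataset (training and test rows together)
y_pred_full = best_model.predict(features_imputed_df)
prediction = pd.Series(y_pred_full)

# Save the full-dataset predictions and the observed dam levels for later comparison
prediction.to_csv("data/prediction_data.csv")
target.to_csv("data/test_data.csv")
print("Predictions and observed dam levels saved.")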
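
A saved model could be reused in a later notebook roughly as follows. So that the sketch runs on its own, it trains a small model on synthetic data and writes it to a temporary directory; in this workflow the path would instead be `trained_models/gradient_boosting_model.pkl`. The names `demo_X`, `demo_y`, `trained`, and `reloaded` are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Small synthetic model standing in for the trained one (assumption)
rng = np.random.default_rng(0)
area = rng.uniform(500, 2000, size=100)
demo_X = pd.DataFrame({
    "calculated_level": 0.01 * area + rng.normal(0, 0.5, size=100),
    "water_area_ha": area,
})
demo_y = 0.012 * area + rng.normal(0, 0.3, size=100)
trained = GradientBoostingRegressor(random_state=42).fit(demo_X, demo_y)

# Round-trip the model through a file, as joblib.dump does in the cell above
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "gradient_boosting_model.pkl")
    joblib.dump(trained, path)
    reloaded = joblib.load(path)

# New rows must provide the same feature columns, in the same order, as training
new_rows = demo_X.head(3)
same = np.allclose(trained.predict(new_rows), reloaded.predict(new_rows))
print(f"Reloaded model reproduces the predictions: {same}")
```

Files written by `joblib.dump` are best loaded with the same scikit-learn version that created them, and only from sources you trust.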


------------
## Additional information

<b> License </b> The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

Digital Earth Africa data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

<b> Contact </b> If you need assistance, please post a question on the [DE Africa Slack channel](https://digitalearthafrica.slack.com/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).

If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks).

<b> Compatible datacube version </b>

**References:**
- Retief, H., Kayathri, V., Ghosh, S., Garcia Andarcia, M., & Dickens, C. (2025) ‘Satellite-Surface-Area Machine-Learning Models for Reservoir Storage Estimation: Regime-Sensitive Evaluation and Operational Deployment at Loskop Dam, South Africa’, arXiv, submitted 28 July 2025. https://doi.org/10.48550/arXiv.2502.19989

- Garcia Andarcia, M., Dickens, C., Silva, P., Matheswaran, K., & Koo, J. (2024). Digital Twin for management of water resources in the Limpopo River Basin: a concept. Colombo, Sri Lanka: International Water Management Institute (IWMI). CGIAR Initiative on Digital Innovation. 4p. https://hdl.handle.net/10568/151898

- Chambel-Leitão, P.; Santos, F.; Barreiros, D.; Santos, H.; Silva, Paulo; Madushanka, Thilina; Matheswaran, Karthikeyan; Muthuwatta, Lal; Vickneswaran, Keerththanan; Retief, H.; Dickens, Chris; Garcia Andarcia, Mariangel. 2024. Operational SWAT+ model: advancing seasonal forecasting in the Limpopo River Basin. Colombo, Sri Lanka: International Water Management Institute (IWMI). CGIAR Initiative on Digital Innovation. 97p. https://hdl.handle.net/10568/155533

- Maity, R., Srivastava, A., Sarkar, S. and Khan, M.I., 2024. Revolutionizing the future of hydrological science: Impact of machine learning and deep learning amidst emerging explainable AI and transfer learning. Applied Computing and Geosciences, 24, p.100206. https://doi.org/10.1016/j.acags.2024.100206

- Pimenta, J., Fernandes, J.N. and Azevedo, A., 2025. Remote Sensing Tool for Reservoir Volume Estimation. Remote Sensing, 17(4), p.619. https://doi.org/10.3390/rs17040619

## Project background: 
The CGIAR Digital Innovation Initiative accelerates the transformation towards sustainable and inclusive agrifood systems by generating research-based evidence and innovative digital solutions. It is one of 32 initiatives of CGIAR, a global research partnership for a food-secure future, dedicated to transforming food, land, and water systems in a climate crisis.

### Contributors

**Hugo Retief**  
*Researcher*  
Email: [hugo@award.org.za](mailto:hugo@award.org.za)  

**Surajith Ghosh**  
*Researcher*  
Email: [S.Ghosh@cgiar.org](mailto:S.Ghosh@cgiar.org)  

**Victoria Neema**  
*Earth Observation Scientist*  
Email: [victoria.neema@digitalearthafrica.org](mailto:victoria.neema@digitalearthafrica.org)  

**Kayathri Vigneswaran**  
*Junior Data Scientist*  
Email: [v.kayathri@cgiar.org](mailto:v.kayathri@cgiar.org)  


In [None]:
print(datacube.__version__)

**Last Tested:**

In [None]:
from datetime import datetime
datetime.today().strftime('%Y-%m-%d')