# Creating and Leveraging a Synthetic Dataset of Cloud Optical Thickness Measures for Cloud Detection in MSI
Official code and dataset repository for the Remote Sensing 2024 journal paper *Creating and Leveraging a Synthetic Dataset of Cloud Optical Thickness Measures for Cloud Detection in MSI*. Also presented orally at the 2nd ML-for-RS Workshop at ICLR 2024, and as a poster at EUMETSAT 2023.
Journal paper | arXiv | ML4RS workshop paper
In June 2024, the cloud optical thickness models developed in this work were sent into orbit, in collaboration with Unibap and D-orbit.
In this work, two novel datasets are introduced (see our paper for details):
- A synthetic dataset for cloud optical thickness estimation, which can be downloaded here.
- A dataset of real satellite images, each of which is labeled 'clear' or 'cloudy'. This dataset can be downloaded here.
On an Ubuntu workstation, the following should be sufficient for running the code in this repository:

```bash
conda create -n cot_env python=3.8
conda activate cot_env
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install scipy
pip install matplotlib
pip install xarray
pip install scikit-image
pip install netCDF4
```

This creates and activates the `cot_env` environment (in new shells, activate it again with `conda activate cot_env`).
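As a quick sanity check (not part of the repository), you can verify that PyTorch and the data-handling packages were installed correctly:

```python
# Quick sanity check that the cot_env environment was set up correctly.
import torch
import xarray
import netCDF4

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```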
## Training and evaluating ML models for synthetic cloud optical thickness (COT) data provided by SMHI

The main files of importance are `cot_synth_train.py` and `cot_synth_eval.py`. The workflow is to first train (and, at the end of training, save) models using `cot_synth_train.py`, and to then evaluate said models using `cot_synth_eval.py`.
Begin by creating a folder `../data`, and place the data folder `synthetic-cot-data` inside it (the data can be downloaded here; unzip the file after downloading). Then also create a folder `../log` (this folder should be side-by-side with `../data`; not one inside the other).
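The expected layout relative to the repository's code directory can be checked with a minimal sketch like the following (not part of the repository):

```python
from pathlib import Path

# Expected layout, relative to the repository's code directory:
#   ../data/synthetic-cot-data/   <- unzipped synthetic COT data
#   ../log/                       <- training logs and model checkpoints
for folder in [Path("../data/synthetic-cot-data"), Path("../log")]:
    print(folder, "exists" if folder.is_dir() else "MISSING")
```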
After the above, to train a model, simply run

```bash
python cot_synth_train.py
```

Model weights will then be automatically saved in a log folder, e.g. `../log/data-stamp-of-folder/model_it_xxx`. Note that, by default, `cot_synth_train.py` trains models on the training set, where each input data point is assumed to be a 12-dimensional vector corresponding to the 12 spectral bands in the synthetic dataset (all of the 13 standard bands, except for B1), and where 3% noise is added to the inputs during training (see the flag `INPUT_NOISE_TRAIN`). To train models which also omit band B10 (e.g. to do evaluations on Swedish Forest Agency data; see "Evaluating trained COT estimation model on use-case by the Swedish Forest Agency" below), set `SKIP_BAND_10` to `True` instead of `False`.
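For intuition, here is a minimal sketch of how such input noise can be applied to a batch of 12-band spectra. The exact scheme used during training is controlled by `INPUT_NOISE_TRAIN` in `cot_synth_train.py`; the function below is only an illustrative assumption:

```python
import torch

def add_input_noise(x: torch.Tensor, noise_level: float = 0.03) -> torch.Tensor:
    """Perturb each band of a (batch, 12) input with zero-mean Gaussian noise
    scaled by the band value. Illustrative only; the exact noise scheme is
    defined in cot_synth_train.py via INPUT_NOISE_TRAIN."""
    return x * (1.0 + noise_level * torch.randn_like(x))

x = torch.rand(4, 12)          # dummy batch of 12-band inputs
x_noisy = add_input_noise(x)   # inputs with ~3% noise, as during training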
To evaluate model(s), update the flag `MODEL_LOAD_PATH` in `cot_synth_eval.py` so that it points to the model checkpoint file produced by running the above training command. After that, simply run

```bash
python cot_synth_eval.py
```

in order to evaluate the model that you trained using `cot_synth_train.py`. Note: by default, the evaluation runs on the training split (this can be changed with the flag `SPLIT_TO_USE`). Also, by default, the flag `INPUT_NOISE` is set to the list `[0.00, 0.01, 0.02, 0.03, 0.04, 0.05]`; in this case, the evaluation script shows average results across these input noise levels (results are typically better at lower noise levels than at higher ones).
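Conceptually, the averaging over noise levels looks like the sketch below; the real logic lives in `cot_synth_eval.py`, and `model`, `x` and `y` are stand-ins here:

```python
import torch

INPUT_NOISE = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05]

def eval_across_noise(model, x, y, noise_levels=INPUT_NOISE):
    """Evaluate mean absolute error at each noise level and average the
    results. A conceptual sketch of what cot_synth_eval.py reports."""
    model.eval()
    maes = []
    with torch.no_grad():
        for level in noise_levels:
            x_noisy = x * (1.0 + level * torch.randn_like(x))
            maes.append((model(x_noisy) - y).abs().mean().item())
    return sum(maes) / len(maes)
```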
It is also possible to evaluate ensemble models. To do this for an ensemble of N models, first train N models by running the `cot_synth_train.py` script N times (using a different `SEED` each time, so that the models do not become identical). Then, when running `cot_synth_eval.py`, ensure that `MODEL_LOAD_PATH` is a list in which each element is a model log path. Examples of this type of path specification are already available within `cot_synth_eval.py`.
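The ensemble prediction itself amounts to averaging the member outputs, roughly as in this sketch (the actual loading via the `MODEL_LOAD_PATH` list is handled inside `cot_synth_eval.py`):

```python
import torch

def ensemble_predict(models, x):
    """Average the COT predictions of N independently trained models.
    Sketch only; model loading is handled by cot_synth_eval.py."""
    with torch.no_grad():
        preds = torch.stack([model(x) for model in models], dim=0)
    return preds.mean(dim=0)
```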
Pretrained model weights are already available here. Download the contents of this folder (10 folders in total), unzip them, and ensure the resulting 10 folders land inside `../log/`. The available weights are 10 five-layer MLPs trained with 3% additive noise, where each input data point is assumed to be a 12-dimensional vector corresponding to the 12 spectral bands in the synthetic dataset (all of the 13 standard bands, except for B1). They can be run in ensemble mode as described in the previous paragraph.
## Evaluating trained COT estimation model on use-case by the Swedish Forest Agency (SFA, Skogsstyrelsen)
The main file of importance is `swe_forest_agency_cls.py`.

If you haven't already, begin by creating a folder `../data`, and create a folder `skogsstyrelsen` inside it. Within `../data/skogsstyrelsen/`, put the data that you can download from here. Then, if you haven't already, also create a folder `../log` (i.e. the `data` and `log` folders should be next to each other; not one of them within the other).
To evaluate model(s) on the SFA cloudy/clear binary image classification setup, first ensure that `MODEL_LOAD_PATH` points to model(s) that have been trained on the synthetic data by SMHI (see "Training and evaluating ML models for synthetic cloud optical thickness (COT) data provided by SMHI" above), and/or first download pretrained models as described below. Then run

```bash
python swe_forest_agency_cls.py
```

By default, this runs the model on the whole train-val set of the provided train-val-test split. You can change which split the model is run on using the flag `SPLIT_TO_USE` in `swe_forest_agency_cls.py`. Various results, such as F1-scores, are shown after running the code.
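For reference, the F1-score for the binary cloudy/clear task is computed in the standard way; the sketch below assumes boolean prediction and label arrays and is not the repository's own implementation:

```python
import numpy as np

def f1_score_binary(pred_cloudy: np.ndarray, gt_cloudy: np.ndarray) -> float:
    """Standard F1 for binary cloudy/clear predictions (boolean arrays).
    Illustrative; swe_forest_agency_cls.py computes its own metrics."""
    tp = np.sum(pred_cloudy & gt_cloudy)
    fp = np.sum(pred_cloudy & ~gt_cloudy)
    fn = np.sum(~pred_cloudy & gt_cloudy)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0
```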
Pretrained model weights are already available here. Download the folder, unzip it, and ensure the resulting 10 folders land inside `../log/`. The available weights are 10 five-layer MLPs trained with 3% additive noise, where each input data point is assumed to be an 11-dimensional vector corresponding to 11 of the 12 spectral bands in the synthetic dataset (all of the 13 standard bands, except for B1 and B10). They can be run in ensemble mode, as explained previously.
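For clarity, the band bookkeeping across the two model variants is summarized below (Sentinel-2 MSI band names; the lists themselves are our own illustration, not code from the repository):

```python
# Sentinel-2 MSI has 13 bands. The synthetic dataset uses 12 of them (B1 is
# dropped), and the SFA models additionally drop B10, leaving 11 bands.
ALL_BANDS = ["B1", "B2", "B3", "B4", "B5", "B6", "B7",
             "B8", "B8A", "B9", "B10", "B11", "B12"]
SYNTH_BANDS = [b for b in ALL_BANDS if b != "B1"]      # 12 bands
SFA_BANDS   = [b for b in SYNTH_BANDS if b != "B10"]   # 11 bands
```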
As described in our paper, we also compare our COT estimation-based approach with an image classification-based approach on the SFA dataset. To train such a ResNet-18-based classifier, simply run

```bash
python binary_cls_skogs.py
```

Once the model has finished training, ensure that `MODEL_LOAD_PATH` points to the saved model weights and set `EVAL_ONLY = True` within `binary_cls_skogs.py` in order to evaluate the trained model. The split on which the model is evaluated can be changed using the flag `SPLIT_TO_USE`.
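As a rough sketch of such a classifier (assuming torchvision is available; `binary_cls_skogs.py` defines the actual architecture and input handling):

```python
import torch.nn as nn
from torchvision.models import resnet18

def build_binary_classifier(num_input_bands: int = 3) -> nn.Module:
    """ResNet-18 with a 2-way (cloudy/clear) head. The number of input bands
    here is an assumption; see binary_cls_skogs.py for the actual setup."""
    model = resnet18(weights=None)
    if num_input_bands != 3:
        # Replace the stem so the network accepts non-RGB band counts.
        model.conv1 = nn.Conv2d(num_input_bands, 64, kernel_size=7,
                                stride=2, padding=3, bias=False)
    model.fc = nn.Linear(model.fc.in_features, 2)
    return model
```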
The KappaZeta dataset used in our paper was downloaded from here. To train FCN-based models for cloud type segmentation on KappaZeta, see the file `kappa_cloud_train.py`. In particular, a model is trained via the command

```bash
python kappa_cloud_train.py
```

Note that the model is trained on data corresponding to the months April, May and June (see the flag `MONTHS`). To evaluate a trained model, ensure that `MODEL_LOAD_PATH` points to the saved model weights and set `EVAL_ONLY = True` within `kappa_cloud_train.py`. In the paper, the models are evaluated on data corresponding to the months July, August and September (thus change `MONTHS` accordingly).

To instead train / refine / evaluate MLP-based models on KappaZeta, refer to the file `kappa_cloud_opt_thick.py`, which largely works in the same way as `kappa_cloud_train.py`.
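Since the COT models are per-pixel MLPs, applying one densely to a KappaZeta scene amounts to flattening the image into pixel spectra, roughly as below (a sketch only; `kappa_cloud_opt_thick.py` contains the actual logic):

```python
import torch

def mlp_dense_predict(mlp, image: torch.Tensor) -> torch.Tensor:
    """Apply a per-pixel MLP to a (C, H, W) image and return an (H, W) map.
    Sketch of dense prediction with a spectra-wise model."""
    c, h, w = image.shape
    pixels = image.permute(1, 2, 0).reshape(-1, c)  # (H*W, C) pixel spectra
    with torch.no_grad():
        out = mlp(pixels)                           # (H*W, 1) predictions
    return out.reshape(h, w)
```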
If you find our dataset(s), code, and/or our paper interesting or helpful, please consider citing:

```bibtex
@article{pirinen2024creating,
  title={Creating and Leveraging a Synthetic Dataset of Cloud Optical Thickness Measures for Cloud Detection in MSI},
  author={Pirinen, Aleksis and Abid, Nosheen and Paszkowsky, Nuria Agues and Timoudas, Thomas Ohlson and Scheirer, Ronald and Ceccobello, Chiara and Kov{\'a}cs, Gy{\"o}rgy and Persson, Anders},
  journal={Remote Sensing},
  volume={16},
  number={4},
  pages={694},
  year={2024},
  publisher={MDPI}
}
```
This work was funded by Vinnova (Swedish Space Data Lab 2.0, grant number 2021-03643), the Swedish National Space Agency and the Swedish Forest Agency.
There is a typo in the journal version of the paper, where "RMSE" (root mean squared error) should instead be read as "MAE" (mean absolute error) in Section 4.1. This has been fixed in the arXiv version.