In [1]:
%load_ext autoreload
%autoreload 2

# Data Curation

As we have seen before, datasets and models in EOTDL are categorized into different quality levels:

- **Q0**: datasets in the form of an archive with arbitary files without curation. This level is ideal for easy and fast upload/download of small datasets.
- **Q1**: datasets with STAC metadata but no QA. These datasets can leverage a limited set of EOTDL features.
- **Q2**: datasets with STAC metadata with the EOTDL custom extensions and automated QA. These datasets can leverage the full potential of the EOTDL.
- **Q3**: Q2 datasets that are manually curated. These datasets are the most reliable and can be used as benchmark datasets.

Up until now we have been focused on Q0 datasets and models, but in this notebook we are going to explore higher quality level datasets.

> Q1+ models are not yet supported, this is a work in progress. You can track the status by joining our Discord server.

## STAC metadata

The [STAC](https://stacspec.org/en) specification is a common language to describe geospatial information, so it can more easily be worked with, indexed, and discovered. Is a standarized way to expose, host, ingest and access geospatial collections that has been adopted as the EOTDL standard metadata format. For further information, check their [website](https://stacspec.org/en).

> Slido: What is your experience with STAC? 

> Slido: What other metadata formats or specifications do you use?

In order to facilitate the STAC generation, which can be painful and time-consuming, the EOTDL offers some tools tto streamline this process. Let's dive into them!

The first thing we have to understand is that the process starts with a `STACDataFrame`. This `STACDataFrame` is an interface between the datasets and STAC (catalogs, collections and items), with some variables that we can define and customise to ensure that the STAC metadata has the information we want, like `extensions`, which defines the [STAC extensions](https://stac-extensions.github.io/) that the image must have, or `bands`, with the bands we want to get from the image, if any.

Let's see the example below.

In [1]:
import pandas as pd

sample_df = pd.read_csv('workshop_data/sample_stacdataframe.csv')
sample_df

Unnamed: 0,image,label,ix,collection,extensions,bands
0,data/sentinel_2/Boadella_2.tif,Boadella,0,data/sentinel_2/source,,
1,data/sentinel_2/Boadella_3.tif,Boadella,0,data/sentinel_2/source,,
2,data/sentinel_2/Boadella_1.tif,Boadella,0,data/sentinel_2/source,,
3,data/sentinel_2/Boadella_4.tif,Boadella,0,data/sentinel_2/source,,
4,data/sentinel_2/Boadella_5.tif,Boadella,0,data/sentinel_2/source,,


This is a sample `STACDataFrame` already generated for our workshop. Here we can see a lot of information:
- image: the path to every image.
- label: the label assigned to every image. 
- ix: the index of the label.
- collection: the collection which the image belongs to. 
- extensions: a list with the STAC extensions we want the image STAC item to have.
- bands: a list of band we want the image STAC item to have.

Now we have seen this, let's generate the STAC metadata for our dataset. Don't worry, we are going to explain it step by step!

First of all, we need to import the `STACGenerator` class.

In [1]:
from eotdl.curation.stac.stac import STACGenerator

The `STACGenerator` class is the entry point and the STAC generation class, where magic happens. Before we declare it, we need to understand the parameters we can give to it:
- `image_format`: the extension of the images. Could be `png`, `jpg` and so on. By default is `tiff`.
- `catalog_type`: the STAC Catalog type. It is a specification defined [here](https://pystac.readthedocs.io/en/0.4/concepts.html#catalog-types). By default is `SELF_CONTAINED`. 
- `item_parser`: the item_parser defines the strategy that must be followed to search for satellite images within the folder. We have defined 2 item_parser strategies, and new ones can be added as needed. The strategies that are implemented right now are the following.
    - `StructuredParser`: this strategy is used when the images are each contained within a folder, so that the name of the item will be the name of the folder.
    
    <p align="center">
        <img src="images/structured_parser.png" alt="Structured parser typical folder structure" style="height:170px; width:200px;"/>
    </p>
    
    - `UnestructuredParser`: this strategy is used when there are multiple images contained in the same folder. We will use this strategy when using the EOTDL to download the dataset images, as it will always format the folder structure the same way. As this is what we have done, it is the strategy that we will use for the use case of this workshop, as all the images are in the same folder.

    <p align="center">
        <img src="images/unestructured_parser.png" alt="Structured parser typical folder structure" style="height:200px; width:200px;"/>
    </p>
    
- `assets_generator`: the assets_generator parameter defines the strategy to follow with the generation of assets from each image. In this way, it could be the case that from a Sentinel-2 image we want to extract all its bands as assets, or simply extract the RGB bands, or not extract any as assets. By default, three strategies have been established, which can be expanded according to needs.

    - `STACAssetGenerator`: does not extract new assets from the image bands, so a single asset is generated for the image.
    - `BandsAssetGenerator`: from the original image it creates a new file for each band established in the 'bands` column, deleting the original file. An asset is added to the STAC item for each band.
    - `ExtractedAssets`: indicates that the bands of an image have already been extracted as independent files, so it creates an item for each image taking the files as assets.
    
    

- `labeling_strategy`: the `labeling_strategy` parameter defines the strategy to extract a label from the filename of an image, to assign a label to it. By default, we have implemented 2 strategies:

    - `UnlabeledStrategy`: we will use it when the images do not have a label that identifies them or that has been placed on purpose. It is the one we will use in our case, since the filename is simply the name of the constellation, and it is the default option.
    - `LabeledStrategy`: we will use it when the images are labeled with labels in their filenames. An example would be that in a folder the images were called, for example, River_1.png, River_2.png, River_3.png, and so on. The file name must be the pattern <label>_<number>. This is the option we are going to use in the workshop, as all the images filenames are `boadella_<id>`.

    > If you plan to use SCANEO to label a Q1+ dataset, we suggest you first convert the dataset to Q1+ (generating STAC metadata) and then label the dataset with SCANEO. This way, SCANEO will work on STAC mode and the process will be simpler. However, it is still possilbe to upgrade a Q0 dataset with labels as we will later in this section.


For the specific case of our workshop, we will take into account the following:
- The images have been downloaded using the EOTDL and are each in the same folder, so as `item_parser` we will use `UnestructuredParser`.
- In this case we do not want to extract bands from the image as assets, so as `asset_generator` we will use `STACAssetGenerator`, which is the default option.
- As said, although the images do not have labels in their filenames we can use the `LabeledStrategy`, as all the images filenames are `boadella_<id>`.

In [2]:
from eotdl.curation.stac.stac import STACGenerator
from eotdl.curation.stac.assets import STACAssetGenerator
from eotdl.curation.stac.parsers import UnestructuredParser
from eotdl.curation.stac.dataframe_labeling import LabeledStrategy

stac_generator = STACGenerator(item_parser=UnestructuredParser, 
                               assets_generator=STACAssetGenerator, 
                               labeling_strategy=LabeledStrategy,
                               image_format='tif'
                               )

If we now decide to generate a `STACDataFrame` from the folder with the images it will do fine, but it will be incomplete.

In [3]:
df = stac_generator.get_stac_dataframe('workshop_data/sentinel_2')
df.head()

Unnamed: 0,image,label,ix,collection,extensions,bands
0,workshop_data/sentinel_2/Boadella_2021-07-16.tif,Boadella,0,workshop_data/sentinel_2/source,,
1,workshop_data/sentinel_2/Boadella_2020-06-21.tif,Boadella,0,workshop_data/sentinel_2/source,,
2,workshop_data/sentinel_2/Boadella_2020-02-02.tif,Boadella,0,workshop_data/sentinel_2/source,,
3,workshop_data/sentinel_2/Boadella_2020-01-28.tif,Boadella,0,workshop_data/sentinel_2/source,,
4,workshop_data/sentinel_2/Boadella_2022-03-08.tif,Boadella,0,workshop_data/sentinel_2/source,,


A key feature is the `label` column. Using the label of every image we are going to assign parameters like the STAC extensions that this image's item is going to have, or the bands we want to extract using the `BandsAssetGenerator`. We can obtain the existing labels in the STACDataFrame before adding new information.

In [4]:
labels = df.label.unique().tolist()
labels

['Boadella']

Starting from the `Boadella` label we are going to define the STAC extensions. As STAC extensions we are going to implement the [proj](https://github.com/stac-extensions/projection), [raster](https://github.com/stac-extensions/raster) and [eo](https://github.com/stac-extensions/eo) STAC extensions. 

> Note: the supported extensiones are `('eo', 'sar', 'proj', 'raster')`.

On the other hand, although we don't want to extract the image bands, we can define them to see their metadata using the `eo` STAC extension. To simplify, let's only define the bands `B04`, `B03` and `B02`, which are the RGB bands.

To define these parameters for each label, we simply have to declare a dictionary.

In [5]:
extensions = {'Boadella': ('proj', 'raster', 'eo')}
bands = {'Boadella': ('B02', 'B03', 'B04')}

> Note: if we wanted to extract the bands as assets, we should create a dict following the same convention, just as follows. As said it is not required for this workshop, but could be useful for other use cases.

```
bands = {'sentinel-2-l2a': ('B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B09', 'B11', 'B12')}
```

Now we are ready to generate a `STACDataFrame` with relevant information. Some extra parameters to take into account:
- `path`: is the root path where the images are located at. In our case is `data/sentinel_2`.
- `collections`: we can use this parameter to define the STAC collection to which we want each item with a specific label to go. There are several options:
    - The default option puts all the STAC items in a single collection called `source`.
    - The `*` option will consider folders located directly under the root folder as collections, so it will create a collection for each of them.

    <p align="center">
        <img src="images/collection.png" alt="* collection" style="height:170px; width:200px;"/>
    </p>

    - You can decide the collection you want an image to go to through its label, as we have seen in the case of extensions and bands. To give an example, we are going to define it like this.

In [6]:
collection = {'Boadella': 'boadella-sentinel-2'}

Let's generate the complete STACDataFrame!

In [7]:
df = stac_generator.get_stac_dataframe('workshop_data/sentinel_2', collections=collection, extensions=extensions, bands=bands)
df.head()

Unnamed: 0,image,label,ix,collection,extensions,bands
0,workshop_data/sentinel_2/Boadella_2021-07-16.tif,Boadella,0,workshop_data/sentinel_2/boadella-sentinel-2,"(proj, raster, eo)","(B02, B03, B04)"
1,workshop_data/sentinel_2/Boadella_2020-06-21.tif,Boadella,0,workshop_data/sentinel_2/boadella-sentinel-2,"(proj, raster, eo)","(B02, B03, B04)"
2,workshop_data/sentinel_2/Boadella_2020-02-02.tif,Boadella,0,workshop_data/sentinel_2/boadella-sentinel-2,"(proj, raster, eo)","(B02, B03, B04)"
3,workshop_data/sentinel_2/Boadella_2020-01-28.tif,Boadella,0,workshop_data/sentinel_2/boadella-sentinel-2,"(proj, raster, eo)","(B02, B03, B04)"
4,workshop_data/sentinel_2/Boadella_2022-03-08.tif,Boadella,0,workshop_data/sentinel_2/boadella-sentinel-2,"(proj, raster, eo)","(B02, B03, B04)"


It looks good! We have all we need to generate the STAC metadata. We only have to give the catalog an `id`, a `description` and an `output_folder`!

In [8]:
stac_generator.generate_stac_metadata(id='boadella-dataset',
                                      description='Boadella dataset',
                                      output_folder='workshop_dataa/sentinel_2_stac')

Generating boadella-sentinel-2 collection...


100%|██████████| 10/10 [00:00<00:00, 460.95it/s]
Exception while validating Catalog href: /home/juan/Desktop/eotdl/tutorials/workshops/bids23/workshop_dataa/sentinel_2_stac/catalog.json
__init__() got an unexpected keyword argument 'registry'
Traceback (most recent call last):
  File "/home/juan/miniconda3/envs/eotdl/lib/python3.8/site-packages/pystac/validation/stac_validator.py", line 197, in _validate_from_uri
    validator = cls(schema, registry=self.registry)
TypeError: __init__() got an unexpected keyword argument 'registry'


Validating and saving catalog...


TypeError: __init__() got an unexpected keyword argument 'registry'

Voilà! We have generated the STAC metadata from our Q0 dataset, converting it into a Q1 dataset! But we are not done yet, as we have to generate the STAC labels item of every image, using the GeoJSON files that we have generated when labeling our dataset using SCANEO.

In order to generate the labels collection of a source collection (understanding 'source' collection as the source where are the STAC items belonging to the images) we have implementated a customizable class named `LabelExtensionObject`. With this class you can decide how to create the labels of your dataset, wether you want to develop your own implementation or use the implementations we have already developed. Let's explain them!

- `ImageNameLabeler`: this implementation should be used when the images of the dataset are named with the corresponding labels, such as `River_1`, `Forest_1`, and so on. We are not going to use this implementation, as it's not our use case.
- `ScaneoLabeler`: this implementation should be used when the labels have been generated using SCANEO, so we have a folder with the `geoJSON` label files and their corresponding images. As seen, this is the implementation we are going to use. Let's check the parameters we should use:
    - `catalog`: the path to the STAC catalog, or the pystac Catalog itself, we want to add the labels collection. In our case, `data/sentinel_2_stac/catalog.json`.
    - `root_folder`: the path to the folder containing the `geoJSON` files.
    - `collection`: the STAC collection we want to add the labels to. By default is `source`, but in our case is `boadella-sentinel-2`.
    - Extra properties can be added using `kwargs`, such as `label:methods` or `label:overviews`. You can check them [here](https://github.com/stac-extensions/label#item-properties). We are going to add `label_methods` as `manual`.

Knowing this, we can generate our labels collection!

In [9]:
from eotdl.curation.stac.extensions import ScaneoLabeler

labeler = ScaneoLabeler()

catalog = 'workshop_data/sentinel_2_stac/catalog.json'
labels_extra_properties = {'label_methods': ["manual"]}
labeler.generate_stac_labels(
    catalog=catalog,
    root_folder='workshop_data/sentinel_2',
    collection='boadella-sentinel-2',
    **labels_extra_properties
)

FileNotFoundError: [Errno 2] No such file or directory: '/home/juan/Desktop/eotdl/tutorials/workshops/bids23/workshop_data/sentinel_2_stac/catalog.json'

## Dataset Q1 ingestion

Once the STAC metadata is generated and we have our Q1 dataset, we can ingest the dataset into EOTDL as seen before.

In [None]:
from eotdl.datasets import ingest_dataset

ingest_dataset('workshop_data/sentinel_2_stac')

## Dataset Q2 

Training Datasets (TDS) in EOTDL are categorized into different [quality levels](https://eotdl.com/docs/datasets/quality), which in turn will impact the range of functionality that will be available for each dataset.

In this section we will learn about Q2 datsets, datasets with STAC metadata and EOTDL's custom STAC extensions. 

### The ML-Dataset Extension

The main extension used by EOTDL for Q2 datasets is the ML-Dataset extension. It enhances the STAC metadata of a dataset including information such as data splits (train, validation, test), quality metrics, etc.

Let's see how to generate a Q2 dataset using the EOTDL library for the EuroSAT dataset. Q2 datasets are generated from Q1 datasets, datasets with STAC metadata. We already showed how to generate a Q1 dataset in the previous section.

The addition of the `ml-dataset` STAC extension to a STAC catalog is pretty straightforward, so it can be done with a simple function called `add_ml_extension`. Sounds easy, right? Let's see what we need.
- `catalog`: the path to the STAC catalog, or the pystac Catalog itself, we want to add the extension. In our case, `data/sentinel_2_stac/catalog.json`.
- `destination`: if we want we can define an output folder to save the catalog, but by default the function generates it in the same folder of the given catalog. In our case, `data/sentinel_2_q2`.
- `splits`: we should put is as `True` if we want to split the labels. By default is `False`, and the default values for the splits are `Train`, `Test` and `Validation` in a `80, 10, 10` proportion. In our case is `True`, and we are fine with the default proportions.
- `splits_collection_id`: the id of the collection we want to make the splits to. In our case, `labels`, which is the default option.
- `name`: the name of the dataset. In our case, `Boadella Q2 Dataset`, but feel free to customize it at your own.
- `tasks`: the tasks of the dataset. In our case, `[segmentation]`.
- `inputs_type`: the type of the dataset inputs. In our case, `[satellite imagery]`.
- `annotations_type`: the type of the annotations. In our case, `raster`.
- `version`: the version of our dataset. In our case, `0.1.0`.

Let's add the extension!

In [2]:
from eotdl.curation.stac.extensions import add_ml_extension

catalog = 'workshop_data/sentinel_2_stac/catalog.json'

add_ml_extension(
	catalog,
	destination='workshop_data/sentinel_2_q2',
	splits=True,
	splits_collection_id="labels",
	name='Boadella Q2 Dataset',
	tasks=['segmentation'],
	inputs_type=['satellite imagery'],
	annotations_type='raster',
	version='0.1.0'
)

Generating splits...
Total size: 10
Train size: 8
Test size: 1
Validation size: 1
Generating Training split...


100%|██████████| 8/8 [00:00<00:00, 20140.72it/s]


Generating Validation split...


100%|██████████| 1/1 [00:00<00:00, 6898.53it/s]


Generating Test split...


100%|██████████| 1/1 [00:00<00:00, 5497.12it/s]

Success on splits generation!
Validating and saving...
Success!





When ingesting a Q2 dataset, EOTDL will automatically compute quality metrics on your dataset, that will be reported in the metadata. Optionally, you can compute them to analyse your dataset before ingesting it with the `MLDatasetQualityMetrics` class.

In [1]:
from eotdl.curation.stac.extensions import MLDatasetQualityMetrics

catalog = 'workshop_data/sentinel_2_q2/catalog.json'

MLDatasetQualityMetrics.calculate(catalog)

Looking for spatial duplicates...


40it [00:00, 139926.74it/s]
Calculating classes balance...: 40it [00:00, 95001.22it/s]

Validating and saving...





Success!


Remember, however, that the metrics will be computed automatically when ingesting the dataset, so you don't need to do it yourself. These metrics incude aspects such as the number of samples, duplicates, missing values, class imbalance, etc.

## Dataset Q2 ingestion

Finally, ingest our Q2 dataset!

In [None]:
from eotdl.datasets import ingest_dataset

ingest_dataset('workshop_data/sentinel_2_q2')

An that's all! You have successfully created your very own dataset for AI4EO techniques! From downloading the images and labeling them to creating the STAC metadata and ingesting the catalog, we have covered the entire dataset creation process. We hope you have enjoyed and it have been profitable!

Don't forget to check the [EOTDL](https://www.eotdl.com/) website for further news and improvements, and the [EarthPulse](https://earthpulse.ai/) website to stay updated about our adventures!