In [2]:
%load_ext autoreload
%autoreload 2

%load_ext dotenv
%dotenv

# Dataset ingestion and STAC metadata generation

In this notebook we are going to ingest our dataset and generate its STAC metadata, converting it from a Q0 dataset to a [Q2 dataset](../00_eotdl.ipynb).

## Dataset Q0 ingestion

As a logged-in EOTDL user, you can ingest your own datasets into EOTDL. This allows you to use them in your own projects and share them with other users. You can do this through the user interface, by visiting [datasets](https://www.eotdl.com/datasets) and clicking the `INGEST` button, or through the API, the CLI or the library; of these, we recommend the CLI. We encourage you to check the [documentation](/docs/datasets/ingest) to learn more.

In this workshop, we are going to ingest our very own dataset using the library. So, let's get started!

In order to ingest a Q0 dataset we will need two things:
1. Obviously, the folder with the data that we want to upload, which in our case is `data/sentinel_2`. It must be compressed in a `zip` file.
2. A `metadata.yml` file with the following structure:

```yaml
name: dataset-name
authors: 
  - author 1
  - author 2
license: dataset-license
source: http://link-to-source
```

Let's generate the required files in a new folder, which we can name `data/sentinel_2_q0`. Here we must put our zipped dataset, which can be named `boadella.zip`, and a `metadata.yml`, which can look as follows. Feel free to edit it as you like.

```yaml
name: my-own-dataset   # Change as you wish
authors: 
  - Juan B. Pedro      # Put your name here
license: MIT
source: https://www.eotdl.com/
```
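The two files can also be prepared from Python. The following is a minimal sketch using only the standard library; the helper name `prepare_q0_folder` and its parameters are our own illustration and are not part of EOTDL. The demo runs on a throwaway folder, so swap in `data/sentinel_2` and `data/sentinel_2_q0` to use it in the workshop.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_q0_folder(images_dir, output_dir, zip_name, name, authors, license_name, source):
    """Zip `images_dir` into `output_dir` and write a metadata.yml next to it.

    Hypothetical helper for illustration; it is not an EOTDL function.
    """
    images_dir, output_dir = Path(images_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # make_archive appends ".zip" itself, so pass the name without extension
    archive = shutil.make_archive(str(output_dir / zip_name), "zip", root_dir=images_dir)

    # The structure is flat, so writing the four fields as plain text is enough
    lines = [f"name: {name}", "authors:"]
    lines += [f"  - {author}" for author in authors]
    lines += [f"license: {license_name}", f"source: {source}"]
    metadata = output_dir / "metadata.yml"
    metadata.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return Path(archive), metadata


# Demo on a temporary folder so nothing in the workshop data is touched
with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp) / "sentinel_2"
    images.mkdir()
    (images / "Boadella_1.tif").write_bytes(b"placeholder")
    archive, metadata = prepare_q0_folder(
        images, Path(tmp) / "sentinel_2_q0", "boadella",
        name="my-own-dataset", authors=["Juan B. Pedro"],
        license_name="MIT", source="https://www.eotdl.com/",
    )
    print(sorted(p.name for p in archive.parent.iterdir()))
```

Keep the output folder outside the images folder, otherwise the archive would end up inside the directory being zipped.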

In [3]:
import os 

os.listdir('data/sentinel_2_q0')

['.DS_Store', 'boadella.zip', 'metadata.yml']

Now that we have the required files, let's ingest our Q0 dataset!

In [5]:
from eotdl.datasets import ingest_dataset

ingest_dataset('data/sentinel_2_q0')

Uploading directory (only files, not recursive)


Exception: [{'type': 'missing', 'loc': ['body', 'file'], 'msg': 'Field required', 'input': None, 'url': 'https://errors.pydantic.dev/2.4/v/missing'}]

## STAC generation

The [STAC](https://stacspec.org/en) specification is a common language to describe geospatial information, so it can more easily be worked with, indexed, and discovered. It is a standardized way to expose, host, ingest and access geospatial collections, and it has been adopted as the EOTDL standard metadata format. For further information, check their [website](https://stacspec.org/en).

Generating STAC metadata can be painful and time-consuming. To make the process much more straightforward, the EOTDL environment provides several open source tools. Let's dive into them!

The first thing we have to understand is that the process starts with a `STACDataFrame`. This `STACDataFrame` is an interface between the images and the STAC catalogs and collections, with some variables that we can define and customise to ensure that the STAC metadata has the information we want, like `extensions`, which defines the [STAC extensions](https://stac-extensions.github.io/) that the image must have, or `bands`, with the bands we want to get from the image, if any.

Let's see the example below.

In [2]:
import pandas as pd

sample_df = pd.read_csv('data/sample_stacdataframe.csv')
sample_df

Unnamed: 0,image,label,ix,collection,extensions,bands
0,data/sentinel_2/Boadella_2019-06-07/sentinel-2...,sentinel-2-l2a,0,data/sentinel_2/source,"('proj', 'raster')","('B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B0..."
1,data/sentinel_2/Boadella_2019-06-02/sentinel-2...,sentinel-2-l2a,0,data/sentinel_2/source,"('proj', 'raster')","('B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B0..."
2,data/sentinel_2/Boadella_2019-07-02/sentinel-2...,sentinel-2-l2a,0,data/sentinel_2/source,"('proj', 'raster')","('B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B0..."
3,data/sentinel_2/Boadella_2019-06-17/sentinel-2...,sentinel-2-l2a,0,data/sentinel_2/source,"('proj', 'raster')","('B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B0..."
4,data/sentinel_2/Boadella_2019-06-27/sentinel-2...,sentinel-2-l2a,0,data/sentinel_2/source,"('proj', 'raster')","('B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B0..."


This is a sample `STACDataFrame` already generated for our workshop. Here we can see a lot of information:
- image: the path to every image.
- label: the label assigned to every image. 
- ix: the index of the label.
- collection: the collection which the image belongs to. 
- extensions: a list with the STAC extensions we want the image STAC item to have.
- bands: a list of the bands we want the image STAC item to have.
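Note that in the sample above, which was read back from a CSV file, the `extensions` and `bands` cells appear to be strings such as `"('proj', 'raster')"` rather than real tuples. The following is a small sketch of how such cells could be parsed with the standard library. The CSV content, the paths and the helper names are made up for illustration and are not part of EOTDL.

```python
import ast
import csv
import io

# A two-row stand-in for the CSV; paths and values are illustrative only
SAMPLE_CSV = '''image,label,ix,collection,extensions,bands
data/example/img_1.tif,sentinel-2-l2a,0,data/example/source,"('proj', 'raster')","('B02', 'B03', 'B04')"
data/example/img_2.tif,sentinel-2-l2a,0,data/example/source,,
'''


def parse_tuple_cell(cell):
    """Turn a cell such as "('proj', 'raster')" into a tuple; empty cells become ()."""
    if not cell:
        return ()
    value = ast.literal_eval(cell)  # safe evaluation of Python literals only
    # "('eo')" or "'eo'" evaluates to a plain string, so wrap it instead of splitting it
    return (value,) if isinstance(value, str) else tuple(value)


def read_rows(text):
    """Read the CSV text into a list of dicts with typed ix, extensions and bands."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["ix"] = int(row["ix"])
        row["extensions"] = parse_tuple_cell(row["extensions"])
        row["bands"] = parse_tuple_cell(row["bands"])
        rows.append(row)
    return rows


for row in read_rows(SAMPLE_CSV):
    print(row["image"], row["extensions"], row["bands"])
```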

Now that we have seen this, let's generate the STAC metadata for our dataset. Don't worry, we are going to explain it step by step!

First of all, we need to import the `STACGenerator` class.

In [6]:
from eotdl.curation.stac.stac import STACGenerator

The `STACGenerator` class is the entry point for STAC generation, where the magic happens. Before we instantiate it, we need to understand the parameters it accepts:
- `image_format`: the extension of the images, such as `png` or `jpg`. The default is `tiff`.
- `catalog_type`: the STAC Catalog type, as defined [here](https://pystac.readthedocs.io/en/0.4/concepts.html#catalog-types). The default is `SELF_CONTAINED`.
- `item_parser`: defines the strategy used to search for satellite images within the folder. We have defined 2 strategies, and new ones can be added as needed. The strategies currently implemented are the following.
    - `StructuredParser`: this strategy is used when the images are each contained within a folder, so that the name of the item will be the name of the folder.
    
    <p align="center">
        <img src="images/structured_parser.png" alt="Structured parser typical folder structure" style="height:170px; width:200px;"/>
    </p>
    
    - `UnestructuredParser`: this strategy is used when there are multiple images contained in the same folder. This is the layout you get when you download the dataset images with the EOTDL, as it always formats the folder structure the same way. Since that is how we obtained our images, and they are all in the same folder, it is the strategy we will use in this workshop.

    <p align="center">
        <img src="images/unestructured_parser.png" alt="Unstructured parser typical folder structure" style="height:200px; width:200px;"/>
    </p>
    
- `assets_generator`: defines the strategy used to generate assets from each image. For example, from a Sentinel-2 image we may want to extract all its bands as assets, extract only the RGB bands, or extract none at all. Three strategies are provided by default, and more can be added as needed.

    - `STACAssetGenerator`: does not extract new assets from the image bands, so a single asset is generated for the image.
    - `BandsAssetGenerator`: creates a new file from the original image for each band listed in the `bands` column, deleting the original file. An asset is added to the STAC item for each band.
    - `ExtractedAssets`: indicates that the bands of an image have already been extracted as independent files, so it creates an item for each image taking the files as assets.
    

- `labeling_strategy`: the `labeling_strategy` parameter defines the strategy to extract a label from the filename of an image, to assign a label to it. By default, we have implemented 2 strategies:

    - `UnlabeledStrategy`: used when the image filenames do not carry a label, for example when the filename is simply the name of the constellation. It is the default option.
    - `LabeledStrategy`: used when the images are labeled in their filenames, for example a folder with images called `River_1.png`, `River_2.png`, `River_3.png`, and so on. The filename must follow the pattern `<label>_<number>`. This is the option we are going to use in the workshop, as all the image filenames are `Boadella_<id>`.
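To make the `<label>_<number>` convention concrete, here is a minimal sketch of how a label could be read from a filename. The function `label_from_filename` is hypothetical and only illustrates the naming pattern; it is not how `LabeledStrategy` is necessarily implemented in EOTDL.

```python
from pathlib import Path


def label_from_filename(path):
    """Return the part of the file name before the last underscore, or None if there is none."""
    stem = Path(path).stem  # "Boadella_2.tif" -> "Boadella_2"
    label, separator, _number = stem.rpartition("_")
    return label if separator else None


for name in ["data/sentinel_2/Boadella_2.tif", "River_1.png", "unlabeled.tif"]:
    print(name, "->", label_from_filename(name))
```

Splitting on the last underscore keeps labels that themselves contain underscores intact, which is a design choice of this sketch rather than a documented EOTDL behaviour.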


For the specific case of our workshop, we will take into account the following:
- The images have been downloaded using the EOTDL and are all in the same folder, so as `item_parser` we will use `UnestructuredParser`.
- In this case we do not want to extract bands from the image as assets, so as `assets_generator` we will use `STACAssetGenerator`, which is the default option.
- Although the images were not labeled on purpose, their filenames all follow the pattern `Boadella_<id>`, so we can use the `LabeledStrategy`.

In [9]:
from eotdl.curation.stac.stac import STACGenerator
from eotdl.curation.stac.assets import STACAssetGenerator
from eotdl.curation.stac.parsers import UnestructuredParser
from eotdl.curation.stac.labeling import LabeledStrategy

stac_generator = STACGenerator(item_parser=UnestructuredParser, 
                               assets_generator=STACAssetGenerator, 
                               labeling_strategy=LabeledStrategy,
                               image_format='tif'
                               )

If we now generate a `STACDataFrame` from the folder with the images, it will work, but the result will be incomplete: the `extensions` and `bands` columns are empty.

In [10]:
df = stac_generator.get_stac_dataframe('data/sentinel_2')
df.head()

Unnamed: 0,image,label,ix,collection,extensions,bands
0,data/sentinel_2/Boadella_2.tif,Boadella,0,data/sentinel_2/source,,
1,data/sentinel_2/Boadella_3.tif,Boadella,0,data/sentinel_2/source,,
2,data/sentinel_2/Boadella_1.tif,Boadella,0,data/sentinel_2/source,,
3,data/sentinel_2/Boadella_4.tif,Boadella,0,data/sentinel_2/source,,
4,data/sentinel_2/Boadella_5.tif,Boadella,0,data/sentinel_2/source,,


A key feature is the `label` column. Using the label of every image we are going to assign parameters like the STAC extensions that this image's item is going to have, or the bands we want to extract using the `BandsAssetGenerator`. We can obtain the existing labels in the STACDataFrame before adding new information.

In [11]:
labels = df.label.unique().tolist()
labels

['Boadella']

Starting from the `Boadella` label we are going to define the STAC extensions. We are going to use the [proj](https://github.com/stac-extensions/projection), [raster](https://github.com/stac-extensions/raster) and [eo](https://github.com/stac-extensions/eo) STAC extensions. Note: the supported extensions are `('eo', 'sar', 'proj', 'raster')`.

To define these parameters for each label, we simply have to declare a dictionary.

In [18]:
extensions = {'Boadella': ('proj', 'raster', 'eo')}
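Because the dictionary is keyed by label, a typo in a label or an unsupported extension name is easy to miss. The following is a small sanity check, written as a sketch; `check_extensions` is a hypothetical helper of our own, and the list of supported names is copied from the note above.

```python
SUPPORTED_EXTENSIONS = ("eo", "sar", "proj", "raster")  # as listed in the note above


def check_extensions(extensions, labels):
    """Return a list of human-readable problems; an empty list means the mapping looks fine."""
    problems = []
    for label, names in extensions.items():
        if isinstance(names, str):
            names = (names,)  # tolerate a single name given without a tuple
        if label not in labels:
            problems.append(f"label {label!r} is not present in the STACDataFrame")
        for name in names:
            if name not in SUPPORTED_EXTENSIONS:
                problems.append(f"extension {name!r} for label {label!r} is not supported")
    return problems


# The mapping used in this workshop passes the check
print(check_extensions({"Boadella": ("proj", "raster", "eo")}, ["Boadella"]))
# A misspelled label and an unsupported extension are both reported
print(check_extensions({"boadella": ("proj", "label")}, ["Boadella"]))
```

In the notebook, the second argument would be the `labels` list obtained from `df.label.unique().tolist()`.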

> Note: if we wanted to extract the bands as assets, we would create a dict following the same convention, as shown below. It is not required for this workshop, but it could be useful for other use cases.

```python
bands = {'sentinel-2-l2a': ('B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B09', 'B11', 'B12')}
```

Now we are ready to generate a `STACDataFrame` with relevant information. Some extra parameters to take into account:
- `path`: the root path where the images are located. In our case it is `data/sentinel_2`.
- `collections`: we can use this parameter to define the STAC collection to which we want each item with a specific label to go. There are several options:
    - The default option puts all the STAC items in a single collection called `source`.
    - The `*` option will consider folders located directly under the root folder as collections, so it will create a collection for each of them.

    <p align="center">
        <img src="images/collection.png" alt="* collection" style="height:170px; width:200px;"/>
    </p>

    - You can decide the collection you want an image to go to through its label, as we have seen in the case of extensions and bands. To give an example, we are going to define it like this.

In [16]:
collection = {'Boadella': 'boadella-sentinel-2'}
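Judging from the `collection` column in the outputs of this notebook, the collection name is appended to the root path, and `source` is used when no mapping is given. The sketch below reproduces that observed behaviour only; `resolve_collection` is a hypothetical function, not the library's implementation, and it does not cover the `*` option.

```python
import posixpath


def resolve_collection(root, label, collections=None):
    """Pick the collection path for a label, falling back to `source` when nothing is mapped."""
    name = (collections or {}).get(label, "source")
    return posixpath.join(root, name)


# Default: every item goes to the `source` collection
print(resolve_collection("data/sentinel_2", "Boadella"))
# With a label-to-collection mapping like the one defined above
print(resolve_collection("data/sentinel_2", "Boadella", {"Boadella": "boadella-sentinel-2"}))
```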

Let's generate the complete STACDataFrame!

In [19]:
df = stac_generator.get_stac_dataframe('data/sentinel_2', collections=collection, extensions=extensions)
df.head()

Unnamed: 0,image,label,ix,collection,extensions,bands
0,data/sentinel_2/Boadella_2.tif,Boadella,0,data/sentinel_2/boadella-sentinel-2,"(proj, raster, eo)",
1,data/sentinel_2/Boadella_3.tif,Boadella,0,data/sentinel_2/boadella-sentinel-2,"(proj, raster, eo)",
2,data/sentinel_2/Boadella_1.tif,Boadella,0,data/sentinel_2/boadella-sentinel-2,"(proj, raster, eo)",
3,data/sentinel_2/Boadella_4.tif,Boadella,0,data/sentinel_2/boadella-sentinel-2,"(proj, raster, eo)",
4,data/sentinel_2/Boadella_5.tif,Boadella,0,data/sentinel_2/boadella-sentinel-2,"(proj, raster, eo)",


It looks good! We have all we need to generate the STAC metadata. We only have to give the catalog an `id`, a `description` and an `output_folder`!

In [10]:
stac_generator.generate_stac_metadata(id='boadella-dataset',
                                      description='Boadella dataset',
                                      output_folder='data/sentinel_2_stac')

Generating boadella-sentinel-2 collection...


100%|██████████| 5/5 [00:00<00:00, 47.99it/s]

Validating and saving catalog...
Success!
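A quick way to inspect the result is to open the generated catalog file and list its child links, since in STAC a catalog points to its collections through links with `rel` set to `child`. The sketch below uses only the standard library and runs on a stand-in catalog whose contents are illustrative; to inspect the real output, pass the path to the `catalog.json` written in the output folder. The helper name `child_links` is our own.

```python
import json
import tempfile
from pathlib import Path


def child_links(catalog_path):
    """Return the href of every link with rel == "child" in a STAC catalog file."""
    catalog = json.loads(Path(catalog_path).read_text(encoding="utf-8"))
    return [link["href"] for link in catalog.get("links", []) if link.get("rel") == "child"]


# Stand-in catalog so the sketch runs without the workshop data; the hrefs are illustrative
example = {
    "type": "Catalog",
    "id": "boadella-dataset",
    "description": "Boadella dataset",
    "links": [
        {"rel": "root", "href": "./catalog.json"},
        {"rel": "child", "href": "./boadella-sentinel-2/collection.json"},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "catalog.json"
    path.write_text(json.dumps(example), encoding="utf-8")
    print(child_links(path))
```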





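Next, we use the `ScaneoLabeler` to generate a labels collection for the `boadella-sentinel-2` collection of the catalog we have just created.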

In [1]:
from eotdl.curation.stac.extensions import ScaneoLabeler

labeler = ScaneoLabeler()

catalog = 'data/sentinel_2_stac/catalog.json'
labeler.generate_stac_labels(
    catalog=catalog,
    root_folder='data/sentinel_2',
    collection='boadella-sentinel-2'
)

Generating labels collection...


5it [00:00, 677.97it/s]


Voilà! We have generated the STAC metadata from our Q0 dataset, converting it into a Q1 dataset!