# STAC generation

The [STAC](https://stacspec.org/en) specification is a common language to describe geospatial information, so it can more easily be worked with, indexed, and discovered. It is a standardized way to expose, host, ingest and access geospatial collections, and it has been adopted as the EOTDL standard metadata format. For further information, check the STAC [website](https://stacspec.org/en).

To ease STAC generation, which can be tedious and time-consuming, the EOTDL environment provides several open source tools that make the process much more straightforward.

Uncomment the following line to install eotdl if needed.

In [None]:
# !pip install eotdl

The first thing we have to understand is that the process starts with a `STACDataFrame`. This `STACDataFrame` is an interface between the images and the STAC catalogs and collections. It has some variables that we can define and customise to ensure that the STAC metadata has the information we want. Examples are `extensions`, which defines the [STAC extensions](https://stac-extensions.github.io/) that the image must have, and `bands`, with the bands we want to get from the image, if any.

Let's see the example below.

In [None]:
import pandas as pd

sample_df = pd.read_csv('example_data/sample_stacdataframe.csv')
sample_df

This is a sample `STACDataFrame` already generated for our workshop. Here we can see a lot of information:
- image: the path to every image.
- label: the label assigned to every image. 
- ix: the index of the label.
- collection: the collection which the image belongs to. 
- extensions: a list with the STAC extensions we want the image STAC item to have.
- bands: a list of the bands we want the image STAC item to have.
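
To make the column layout concrete, here is a minimal sketch that builds a few rows with the same columns using only the standard library. Every value (paths, labels, extension names, band names) is a made-up placeholder rather than workshop data, and the way `ix` is derived here (one integer per distinct label, in order of first appearance) is an assumption for illustration, not necessarily how the EOTDL computes it.

```python
# Illustrative rows only: every value below is a placeholder, not workshop data.
rows = [
    {
        "image": "data/sentinel_2/boadella_1.tif",
        "label": "boadella",
        "ix": None,
        "collection": "data/sentinel_2/boadella",
        "extensions": ["proj", "raster"],
        "bands": ["B04", "B03", "B02"],
    },
    {
        "image": "data/sentinel_2/boadella_2.tif",
        "label": "boadella",
        "ix": None,
        "collection": "data/sentinel_2/boadella",
        "extensions": ["proj", "raster"],
        "bands": ["B04", "B03", "B02"],
    },
    {
        "image": "data/rivers/River_1.png",
        "label": "River",
        "ix": None,
        "collection": "data/rivers/River",
        "extensions": [],
        "bands": [],
    },
]

# Assumption: one integer index per distinct label, in order of first appearance.
label_index = {}
for row in rows:
    row["ix"] = label_index.setdefault(row["label"], len(label_index))

columns = list(rows[0].keys())
print(columns)
print([(row["label"], row["ix"]) for row in rows])
```

Passing `rows` to `pd.DataFrame` would give a table with the same columns as the sample shown above.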

Now that we have seen this, let's generate the STAC metadata for our dataset. Don't worry, we are going to explain it step by step!

First of all, we need to import the `STACGenerator` class.

In [None]:
from eotdl.curation.stac.stac import STACGenerator

The `STACGenerator` class is the entry point for STAC generation. Before we instantiate it, we need to understand the parameters it accepts:
- `image_format`: the extension of the images, such as `png`, `jpg` and so on. The default is `tiff`.
- `catalog_type`: the STAC Catalog type, as defined [here](https://pystac.readthedocs.io/en/0.4/concepts.html#catalog-types). The default is `SELF_CONTAINED`.
- `item_parser`: defines the strategy used to search for satellite images within the folder. Two strategies are currently implemented, and new ones can be added as needed:
    - `StructuredParser`: used when each image is contained in its own folder, so that the name of the item is the name of the folder.
    
    <p align="center">
        <img src="assets/structured_parser.png" alt="Structured parser typical folder structure" style="height:170px; width:200px;"/>
    </p>
    
    - `UnestructuredParser`: used when multiple images are contained in the same folder. This is the strategy to use when the dataset images have been downloaded with the EOTDL, as it always lays out the folder structure the same way. That is our case, with all the images in the same folder, so it is the strategy we will use in this workshop.

    <p align="center">
        <img src="assets/unestructured_parser.png" alt="Unstructured parser typical folder structure" style="height:200px; width:200px;"/>
    </p>
    
- `assets_generator`: defines the strategy for generating assets from each image. For example, from a Sentinel-2 image we may want to extract all its bands as assets, extract only the RGB bands, or extract none at all. Three strategies are provided, and more can be added as needed.

    - `STACAssetGenerator`: does not extract new assets from the image bands, so a single asset is generated for the image.
    - `BandsAssetGenerator`: creates a new file from the original image for each band listed in the `bands` column, deleting the original file. An asset is added to the STAC item for each band.
    - `ExtractedAssets`: indicates that the bands of an image have already been extracted as independent files, so it creates an item for each image taking the files as assets.
    

- `labeling_strategy`: defines how a label is extracted from the filename of an image and assigned to it. Two strategies are implemented:

    - `UnlabeledStrategy`: used when the images do not have a label that identifies them or that has been placed on purpose, for example when the filename is simply the name of the constellation. It is the default option.
    - `LabeledStrategy`: used when the images carry a label in their filenames, for example River_1.png, River_2.png, River_3.png, and so on. The filename must follow the pattern `<label>_<number>`. This is the option we are going to use in the workshop, as all the image filenames are `boadella_<id>`.
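
To illustrate what the parser and labeling strategies derive from a file path, here is a minimal sketch using only the standard library. The helpers `item_id` and `label_from_filename` are hypothetical names written for this explanation and are not part of the EOTDL API. Using the file name without its extension as the item name in the unstructured layout is also an assumption.

```python
from pathlib import PurePosixPath


def item_id(image_path: str, structured: bool) -> str:
    """Hypothetical helper: derive an item name from an image path."""
    path = PurePosixPath(image_path)
    # Structured layout: one folder per image, so the folder names the item.
    # Unstructured layout (assumption): the file name without extension is used.
    return path.parent.name if structured else path.stem


def label_from_filename(image_path: str) -> str:
    """Hypothetical helper: read the label from a <label>_<number> file name."""
    stem = PurePosixPath(image_path).stem
    # Split on the last underscore so labels may themselves contain underscores.
    label, _, suffix = stem.rpartition("_")
    if not label or not suffix:
        raise ValueError(f"{stem!r} does not follow the <label>_<number> pattern")
    return label


print(item_id("data/scene_a/image.tif", structured=True))
print(item_id("data/sentinel_2/boadella_1.tif", structured=False))
print(label_from_filename("River_2.png"))
print(label_from_filename("data/sentinel_2/boadella_1.tif"))
```

Splitting on the last underscore keeps the sketch tolerant of labels such as `annual_crop_3`, where the label itself contains an underscore.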


For the specific case of our workshop, we will take into account the following:
- The images have been downloaded using the EOTDL and are all in the same folder, so as `item_parser` we will use `UnestructuredParser`.
- In this case we do not want to extract bands from the image as assets, so as `assets_generator` we will use `STACAssetGenerator`, which is the default option.
- As mentioned, although the filenames do not carry a class label as such, we can use the `LabeledStrategy`, since all the image filenames follow the pattern `boadella_<id>`.

In [None]:
from eotdl.curation.stac.stac import STACGenerator
from eotdl.curation.stac.assets import STACAssetGenerator
from eotdl.curation.stac.parsers import UnestructuredParser
from eotdl.curation.stac.dataframe_labeling import LabeledStrategy

stac_generator = STACGenerator(item_parser=UnestructuredParser, 
                               assets_generator=STACAssetGenerator, 
                               labeling_strategy=LabeledStrategy,
                               image_format='tif'
                               )

In [None]:
df = stac_generator.get_stac_dataframe('data/sentinel_2')
df.head()

It looks good! We have all we need to generate the STAC metadata. We only have to give the catalog an `id`, a `description` and an `output_folder`!

In [None]:
stac_generator.generate_stac_metadata(id='jaca-dataset',
                                      description='Jaca dataset',
                                      output_folder='data/sentinel_2_stac')
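
As a rough idea of what ends up on disk, the sketch below builds the kind of root catalog document that the STAC specification defines, using the `id` and `description` from the call above. It is an illustration, not the actual output of `generate_stac_metadata`: the `stac_version` value and the child collection path are assumptions, and the real file will contain whatever links the generator writes. A self-contained catalog uses relative links and no `self` link, which is why every `href` here is relative.

```python
import json

# Minimal STAC Catalog document (illustrative; not the generator's real output).
catalog = {
    "type": "Catalog",
    "stac_version": "1.0.0",  # assumption: the generated version may differ
    "id": "jaca-dataset",
    "description": "Jaca dataset",
    "links": [
        {"rel": "root", "href": "./catalog.json", "type": "application/json"},
        # Hypothetical child collection path, for illustration only.
        {"rel": "child", "href": "./sentinel_2/collection.json", "type": "application/json"},
    ],
}

print(json.dumps(catalog, indent=2))
```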