# Ingesting a Q0 dataset

The dataset is now ready to be ingested into EOTDL. We just need to add the general metadata of the dataset in a README file.

In [1]:
text = """---
name: Sentinel-2-Ships
authors: 
  - Pierre-Jean Coquard
license: free
source: https://github.com/earthpulse/eotdl/tree/main/tutorials/usecases/useCaseD
---

# Sentinel-2-Ships

This is an example dataset created for the use case D.
"""

with open("data/sentinel_2/README.md", "w") as outfile:
    outfile.write(text)
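
Before ingesting, it can help to confirm that the front matter written above parses as intended. The following is a minimal sketch using only the standard library; `read_front_matter`, `missing_fields` and `EXPECTED_FIELDS` are hypothetical names, and the list of expected fields simply mirrors the README above (which fields EOTDL actually requires is an assumption).

```python
def read_front_matter(markdown_text):
    """Parse the simple 'key: value' / '- item' front matter used in the README."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("README does not start with a '---' front matter block")
    metadata, current_key = {}, None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing delimiter: front matter is complete
            return metadata
        if stripped.startswith("- ") and isinstance(metadata.get(current_key), list):
            # list item belonging to the most recent key (e.g. authors)
            metadata[current_key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            value = value.strip()
            # a key without an inline value starts a list
            metadata[current_key] = value if value else []
    raise ValueError("front matter block is never closed")


# Hypothetical list of fields to check; adjust it to what EOTDL actually requires.
EXPECTED_FIELDS = ("name", "authors", "license", "source")


def missing_fields(metadata, expected=EXPECTED_FIELDS):
    """Return the expected fields that are absent or empty."""
    return [field for field in expected if not metadata.get(field)]


example = """---
name: Sentinel-2-Ships
authors:
  - Pierre-Jean Coquard
license: free
source: https://github.com/earthpulse/eotdl/tree/main/tutorials/usecases/useCaseD
---
"""
metadata = read_front_matter(example)
print(missing_fields(metadata))
```

This parser only understands the flat structure shown here; a real YAML library would be the safer choice for anything more complex.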

In [2]:
from eotdl.datasets import ingest_dataset

ingest_dataset("data/sentinel_2")



Uploading directory data/sentinel_2...
generating list of files to upload...


100%|██████████| 321/321 [00:00<00:00, 10730.63it/s]


Exception: No new files to upload
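
The exception above appears to be raised because the files were already uploaded by an earlier run, although the source does not say so explicitly. Under that assumption, a re-runnable notebook could treat this specific message as "nothing to do". The sketch below takes the ingestion function as an argument so that it does not depend on any particular library; `ingest_unless_unchanged` is a hypothetical helper name.

```python
def ingest_unless_unchanged(ingest, path):
    """Call ingest(path); return False if it reports that nothing changed.

    Any other exception is re-raised untouched.
    """
    try:
        ingest(path)
    except Exception as error:
        # Message copied from the notebook output above.
        if "No new files to upload" in str(error):
            return False
        raise
    return True
```

In the notebook this would be called as `ingest_unless_unchanged(ingest_dataset, "data/sentinel_2")`.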

# Q1 dataset

We can upgrade this dataset to a Q1 dataset by adding STAC metadata. We use the `STACGenerator` class to automatically generate the STAC metadata for the whole dataset.


In [3]:
from eotdl.curation.stac.stac import STACGenerator
from eotdl.curation.stac.assets import STACAssetGenerator
from eotdl.curation.stac.parsers import UnestructuredParser
from eotdl.curation.stac.dataframe_labeling import UnlabeledStrategy, LabeledStrategy

stac_generator = STACGenerator(item_parser=UnestructuredParser, 
                               assets_generator=STACAssetGenerator, 
                               labeling_strategy=LabeledStrategy,
                               image_format='tif'
                               )


In [4]:
extensions = {'ship': ('proj', 'raster', 'eo')}
bands = {'ship': ('B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B09', 'B11', 'B12')}
collection = {'ship': 'sentinel-2-ships'}

df = stac_generator.get_stac_dataframe('data/sentinel_2', collections=collection, extensions=extensions, bands=bands)
df.head()

| | image | label | ix | collection | extensions | bands |
|---|---|---|---|---|---|---|
| 0 | data/sentinel_2/ship_211479160_2022-08-12.tif | ship | 0 | data/sentinel_2/sentinel-2-ships | (proj, raster, eo) | (B01, B02, B03, B04, B05, B06, B07, B08, B09, ... |
| 1 | data/sentinel_2/ship_219000836_2022-08-12.tif | ship | 0 | data/sentinel_2/sentinel-2-ships | (proj, raster, eo) | (B01, B02, B03, B04, B05, B06, B07, B08, B09, ... |
| 2 | data/sentinel_2/ship_259222000_2022-08-12.tif | ship | 0 | data/sentinel_2/sentinel-2-ships | (proj, raster, eo) | (B01, B02, B03, B04, B05, B06, B07, B08, B09, ... |
| 3 | data/sentinel_2/ship_352335000_2022-08-25.tif | ship | 0 | data/sentinel_2/sentinel-2-ships | (proj, raster, eo) | (B01, B02, B03, B04, B05, B06, B07, B08, B09, ... |
| 4 | data/sentinel_2/ship_219000733_2022-08-12.tif | ship | 0 | data/sentinel_2/sentinel-2-ships | (proj, raster, eo) | (B01, B02, B03, B04, B05, B06, B07, B08, B09, ... |
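
The image paths in the table appear to follow a `<label>_<identifier>_<date>.tif` naming scheme. Assuming that pattern holds for every file (the source only shows five rows), the parts can be recovered with a small helper. `parse_image_name` and `FILENAME_PATTERN` are hypothetical names, and this is not necessarily how `LabeledStrategy` derives the `label` column.

```python
import re
from datetime import date

# Assumed naming scheme, inferred from the rows shown above.
FILENAME_PATTERN = re.compile(
    r"^(?P<label>[a-z]+)_(?P<identifier>\d+)_(?P<day>\d{4}-\d{2}-\d{2})\.tif$"
)


def parse_image_name(path):
    """Split an image path into its label, identifier and acquisition date."""
    name = path.rsplit("/", 1)[-1]  # keep only the file name
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return {
        "label": match["label"],
        "identifier": match["identifier"],
        "date": date.fromisoformat(match["day"]),
    }


info = parse_image_name("data/sentinel_2/ship_211479160_2022-08-12.tif")
print(info["label"], info["identifier"], info["date"])
```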


We can then generate the STAC metadata from the `STACDataframe` produced in the previous step.

In [5]:
stac_generator.generate_stac_metadata(stac_id='ship-segmentation-dataset',
                                      description='Ship segmentation dataset',
                                      output_folder='data/sentinel_2_stac')

Generating sentinel-2-ships collection...


100%|██████████| 106/106 [00:00<00:00, 290.90it/s]


Validating and saving catalog...
Success!
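
To double-check the result, the generated catalog can be inspected with plain JSON tooling. The sketch below counts the `item` links of each child collection; it assumes a static catalog whose `child` links point to collection files on disk, which is the usual layout for a saved STAC catalog but is not confirmed by the source. `count_stac_items` is a hypothetical helper name.

```python
import json
from pathlib import Path


def count_stac_items(catalog_path):
    """Return {collection id: number of item links} for a static STAC catalog."""
    catalog_path = Path(catalog_path)
    catalog = json.loads(catalog_path.read_text())
    counts = {}
    for link in catalog.get("links", []):
        if link.get("rel") != "child":
            continue  # skip root, self and other link types
        # hrefs are resolved relative to the folder holding catalog.json
        collection_path = (catalog_path.parent / link["href"]).resolve()
        collection = json.loads(collection_path.read_text())
        counts[collection["id"]] = sum(
            1 for item_link in collection.get("links", []) if item_link.get("rel") == "item"
        )
    return counts
```

For the catalog generated above, `count_stac_items('data/sentinel_2_stac/catalog.json')` should report one entry per collection, which can be compared with the progress bar totals.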


We also add the STAC metadata for the labels:

In [6]:
from eotdl.curation.stac.extensions import ScaneoLabeler

labeler = ScaneoLabeler()

catalog = 'data/sentinel_2_stac/catalog.json'
labels_extra_properties = {'label_methods': ["automated"]}
labeler.generate_stac_labels(
    catalog=catalog,
    root_folder='data/sentinel_2',
    collection='sentinel-2-ships',
    label_type="raster",
    **labels_extra_properties
)

Generating labels collection...: 106it [00:00, 1362.59it/s]


Success on labels generation!


Once the STAC metadata is successfully generated, we can ingest the Q1 dataset into EOTDL.

In [7]:
from eotdl.datasets import ingest_dataset

ingest_dataset('data/sentinel_2_stac')

Loading STAC catalog...
New version created, version: 63


100%|██████████| 212/212 [00:48<00:00,  4.41it/s]


Ingesting STAC catalog...
Done


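# Q2 dataset

We can upgrade the Q1 dataset to a Q2 dataset by adding the ML extension to the STAC catalog with `add_ml_extension`, which also generates the training, validation and test splits.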

In [1]:
%load_ext autoreload
%autoreload 2

from eotdl.curation.stac.extensions import add_ml_extension
import pystac
catalog = 'data/sentinel_2_stac/catalog.json'

add_ml_extension(
    catalog,
    destination='data/sentinel_2_q2',
    splits=True,
    splits_collection_id="labels",
    name='Ship Segmentation Q2',
    tasks=['segmentation'],
    inputs_type=['satellite imagery'],
    annotations_type='raster',
    version='0.1.0',
)



Generating splits...
Total size: 106
Train size: 84
Test size: 10
Validation size: 10
Generating Training split...


100%|██████████| 84/84 [00:00<00:00, 3457.49it/s]


Generating Validation split...


100%|██████████| 10/10 [00:00<00:00, 2417.61it/s]


Generating Test split...


100%|██████████| 10/10 [00:00<00:00, 2033.11it/s]

Success on splits generation!





Validating and saving...
Success!
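
Note that the reported split sizes (84 + 10 + 10) add up to 104 of the 106 items. One possible explanation, which is an assumption and not something the source confirms, is that 80/10/10 proportions are applied with truncation. The sketch below reproduces the reported sizes under that assumption; `split_sizes` is a hypothetical helper, not part of the library.

```python
def split_sizes(total, train=0.8, test=0.1, validation=0.1):
    """Compute split sizes by truncating each proportion of the total."""
    sizes = {
        "train": int(total * train),
        "test": int(total * test),
        "validation": int(total * validation),
    }
    # Items that no split receives because of the truncation.
    sizes["unassigned"] = total - sum(sizes.values())
    return sizes


print(split_sizes(106))
# {'train': 84, 'test': 10, 'validation': 10, 'unassigned': 2}
```

If every item must belong to a split, it is worth checking the generated catalog to see where the remaining two items ended up.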


Let's compute the quality metrics for the dataset to ensure that it can be ingested.

In [2]:
from eotdl.curation.stac.extensions import MLDatasetQualityMetrics

catalog = 'data/sentinel_2_q2/catalog.json'

MLDatasetQualityMetrics.calculate(catalog)

Looking for spatial duplicates...: 0it [00:00, ?it/s]

Looking for spatial duplicates...: 424it [00:00, 4564.31it/s]
Calculating classes balance...: 424it [00:00, 157504.64it/s]


Validating and saving...
Success!
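
The output mentions two checks: spatial duplicates and class balance. The sketch below illustrates the general idea behind each one with the standard library only. It is not the implementation used by `MLDatasetQualityMetrics`; the function names and the choice of comparing exact bounding boxes are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def find_spatial_duplicates(items):
    """Group item ids that share exactly the same bounding box.

    items: iterable of (item_id, bbox) pairs, with bbox as a sequence of numbers.
    """
    by_bbox = defaultdict(list)
    for item_id, bbox in items:
        by_bbox[tuple(bbox)].append(item_id)
    return [ids for ids in by_bbox.values() if len(ids) > 1]


def class_balance(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


duplicates = find_spatial_duplicates([
    ("a", [0, 0, 1, 1]),
    ("b", [0, 0, 1, 1]),
    ("c", [2, 2, 3, 3]),
])
balance = class_balance(["ship", "ship", "sea", "ship"])
print(duplicates, balance)
```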


Finally, we can ingest the Q2 dataset into EOTDL.

In [3]:
from eotdl.datasets import ingest_dataset

ingest_dataset('data/sentinel_2_q2')


Loading STAC catalog...
New version created, version: 64


100%|██████████| 424/424 [01:32<00:00,  4.57it/s]


Ingesting STAC catalog...


Exception: HREF: '/home/coquarpj/git_eotdl/eotdl/tutorials/usecases/useCaseD/data/sentinel_2_q2/catalog.json' does not resolve to a STAC object
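
The source does not diagnose this last exception. A reasonable first step is to check whether the file named in the message exists and looks like a STAC object. The sketch below assumes STAC 1.0 conventions, where a catalog, collection or item carries a `type` of `Catalog`, `Collection` or `Feature` together with a `stac_version` field; `describe_stac_file` is a hypothetical helper name.

```python
import json
from pathlib import Path

# Values of the "type" field for STAC 1.0 objects.
STAC_TYPES = {"Catalog", "Collection", "Feature"}


def describe_stac_file(path):
    """Return "ok" or a short description of why the file is not a STAC object."""
    path = Path(path)
    if not path.is_file():
        return "missing file"
    try:
        content = json.loads(path.read_text())
    except json.JSONDecodeError:
        return "not valid JSON"
    if not isinstance(content, dict):
        return "not a JSON object"
    if content.get("type") not in STAC_TYPES:
        return f"unexpected type: {content.get('type')!r}"
    if "stac_version" not in content:
        return "missing stac_version"
    return "ok"
```

Running `describe_stac_file('data/sentinel_2_q2/catalog.json')` before calling `ingest_dataset` would show whether the problem lies in the file itself or elsewhere in the ingestion step.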