In [1]:
%load_ext autoreload
%autoreload 2

# Q2 Training Datasets

Training Datasets (TDS) in EOTDL are categorized into different [quality levels](https://eotdl.com/docs/datasets/quality), which determine the functionality available for each dataset.

In this tutorial you will learn about Q2 datasets: datasets with STAC metadata that also use EOTDL's custom STAC extensions.

## The ML-Dataset Extension

The main extension used by EOTDL for Q2 datasets is the ML-Dataset extension. It enriches a dataset's STAC metadata with information such as data splits (train, validation, test) and quality metrics.
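As a rough illustration of what the extension adds, the sketch below picks out the `ml-dataset:`-prefixed fields from a catalog dictionary. The sample dictionary mirrors a fragment of the catalog returned by the ingestion step at the end of this tutorial; the helper name `ml_dataset_fields` is hypothetical and not part of the EOTDL library.

```python
def ml_dataset_fields(catalog: dict) -> dict:
    """Return only the catalog fields added under the ml-dataset prefix."""
    prefix = "ml-dataset:"
    return {key: value for key, value in catalog.items() if key.startswith(prefix)}


# Fragment of a Q2 catalog, copied from the ingestion output shown in this tutorial.
catalog = {
    "type": "Catalog",
    "id": "eurosat-rgb-q2",
    "stac_version": "1.0.0",
    "stac_extensions": [
        "https://raw.githubusercontent.com/earthpulse/ml-dataset/main/json-schema/schema.json"
    ],
    "ml-dataset:name": "EuroSAT Q2 Dataset",
}

fields = ml_dataset_fields(catalog)
print(fields)
```

In practice you would load `catalog.json` with `json.load` and pass the resulting dictionary to the helper.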

Let's see how to generate a Q2 dataset for the EuroSAT dataset using the EOTDL library. Q2 datasets are generated from Q1 datasets (datasets with STAC metadata); the previous tutorial showed how to generate a Q1 dataset.

In [1]:
import os 

os.listdir('data')

['sen12floods-use-case',
 'EuroSAT-Q2',
 'sen12floods-eotdl',
 'EuroSAT',
 'EuroSAT-STAC',
 'sen12floods.zip']

In [13]:
from eotdl.curation.stac.ml_dataset import add_ml_extension

catalog = 'data/EuroSAT-STAC/catalog.json'

add_ml_extension(
	catalog,
	destination='data/EuroSAT-Q2',
	splits=True,
	splits_collection_id="labels",
	name='EuroSAT Q2 Dataset',
	tasks=['image classification'],
	inputs_type=['satellite imagery'],
	annotations_type='raster',
	version='0.1.0'
)

Generating splits...
Total size: 99
Train size: 79
Test size: 9
Validation size: 9
Generating Training split...


100%|██████████| 79/79 [00:00<00:00, 4887.60it/s]


Generating Validation split...


100%|██████████| 9/9 [00:00<00:00, 3773.74it/s]


Generating Test split...


100%|██████████| 9/9 [00:00<00:00, 4846.42it/s]

Success on splits generation!
Validating and saving...
Success!
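Note that the split sizes in the log above do not add up to the total (79 + 9 + 9 = 97 out of 99). One possible explanation, which is an assumption and not confirmed by the EOTDL source, is that each split size is an 80/10/10 fraction of the total rounded down. The hypothetical helper below reproduces the reported numbers under that assumption.

```python
def split_sizes(total, fractions=(0.8, 0.1, 0.1)):
    """Size of each split when every fraction of the total is rounded down.

    The 80/10/10 default is an assumption chosen to match the log above;
    it is not taken from the EOTDL implementation.
    """
    return tuple(int(total * fraction) for fraction in fractions)


train, test, validation = split_sizes(99)
unassigned = 99 - (train + test + validation)
print(train, test, validation, unassigned)
```

If your dataset is small, it may be worth checking the generated splits to see whether any items were left out.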





When you ingest a Q2 dataset, EOTDL automatically computes quality metrics and reports them in the dataset's metadata. Optionally, you can compute them yourself to analyse the dataset before ingesting it.

In [14]:
from eotdl.curation.stac.ml_dataset import MLDatasetQualityMetrics

catalog = 'data/EuroSAT-Q2/catalog.json'

MLDatasetQualityMetrics.calculate(catalog)

Looking for spatial duplicates...


198it [00:00, 3601.09it/s]


Calculating classes balance...


198it [00:00, 201424.25it/s]

Validating and saving...
Success!





Remember, however, that the metrics are computed automatically when the dataset is ingested, so this step is optional. These metrics include aspects such as the number of samples, duplicates, missing values and class imbalance.
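To give an intuition for two of these metrics, here is a minimal sketch of a class-balance calculation and a duplicate check based on identical bounding boxes. It is only an illustration: the function names, the sample labels and the bounding boxes are hypothetical, and the actual logic inside `MLDatasetQualityMetrics` may differ.

```python
from collections import Counter, defaultdict


def class_balance(labels):
    """Fraction of samples that belong to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def spatial_duplicates(bboxes):
    """Group item ids that share exactly the same bounding box."""
    groups = defaultdict(list)
    for item_id, bbox in bboxes.items():
        groups[tuple(bbox)].append(item_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical sample data, not taken from the EuroSAT catalog.
labels = ["Industrial", "Forest", "Forest", "River"]
bboxes = {"a": [0, 0, 1, 1], "b": [0, 0, 1, 1], "c": [2, 2, 3, 3]}

balance = class_balance(labels)
duplicates = spatial_duplicates(bboxes)
print(balance)
print(duplicates)
```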

## Ingesting Q2 datasets

Once the metadata has been generated, you can ingest, explore and download a Q2 dataset like any other dataset.
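Ingestion uploads the assets referenced by the catalog, so a file that is referenced but missing on disk surfaces as an upload error. Before ingesting, you may want to look for such files yourself. The sketch below is one way to do that with the standard library only: it assumes STAC items are JSON files of type `Feature` with an `assets` mapping whose `href` values are local paths. The helper `missing_assets` is hypothetical and not part of EOTDL.

```python
import json
from pathlib import Path


def missing_assets(stac_dir):
    """List local asset paths referenced by STAC items that do not exist on disk."""
    missing = []
    for json_path in Path(stac_dir).rglob("*.json"):
        try:
            data = json.loads(json_path.read_text())
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # not a readable JSON document, skip it
        if not isinstance(data, dict) or data.get("type") != "Feature":
            continue  # catalogs and collections carry no item assets
        for asset in data.get("assets", {}).values():
            href = asset.get("href", "")
            if "://" in href:
                continue  # remote asset, nothing to check locally
            path = Path(href)
            if not path.is_absolute():
                path = json_path.parent / path  # resolve relative to the item file
            if not path.exists():
                missing.append(path)
    return missing


# Example usage (uncomment to run against the dataset built above):
# for path in missing_assets("data/EuroSAT-Q2"):
#     print("missing:", path)
```

An empty result only means that every referenced local file exists; it does not validate the STAC metadata itself.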

In [3]:
from eotdl.datasets import ingest_dataset

ingest_dataset('data/EuroSAT-Q2')

Loading STAC catalog...
Uploading assets...


 50%|█████     | 99/198 [00:27<00:27,  3.58it/s]


Error uploading asset 102: [Errno 2] No such file or directory: '/home/juan/Desktop/eotdl/tutorials/data/EuroSAT-Q2/labels/Industrial_1743/vector_labels.geojson'
Ingesting STAC catalog...
Done


{'uid': 'auth0|616b0057af0c7500691a026e',
 'id': '6503f994d05a1b62cc273fdd',
 'name': 'eurosat-rgb-q2',
 'description': '',
 'tags': [],
 'createdAt': '2023-09-15T06:10:21.544',
 'updatedAt': '2023-09-15T08:29:04.656',
 'likes': 0,
 'downloads': 0,
 'quality': 2,
 'size': 453353,
 'catalog': {'type': 'Catalog',
  'id': 'eurosat-rgb-q2',
  'stac_version': '1.0.0',
  'description': 'EuroSAT-RGB dataset',
  'links': [{'rel': 'self',
    'href': '/home/juan/Desktop/eotdl/tutorials/data/EuroSAT-Q2/catalog.json',
    'type': 'application/json'},
   {'rel': 'root', 'href': './catalog.json', 'type': 'application/json'},
   {'rel': 'child',
    'href': './source/collection.json',
    'type': 'application/json'},
   {'rel': 'child',
    'href': './labels/collection.json',
    'type': 'application/json'}],
  'stac_extensions': ['https://raw.githubusercontent.com/earthpulse/ml-dataset/main/json-schema/schema.json'],
  'ml-dataset:name': 'EuroSAT Q2 Dataset',
  'ml-dataset:tasks': ['image classific