In [1]:
%load_ext autoreload
%autoreload 2

# Q1 Training Datasets

Training Datasets (TDS) in EOTDL are categorized into different [quality levels](https://eotdl.com/docs/datasets/quality), which determine the functionality available for each dataset.

In this tutorial you will learn about Q1 datasets: datasets with STAC metadata.

## Ingesting Q1 datasets

To ingest a Q1 dataset you will need its STAC metadata.


Some datasets already have STAC metadata and can be ingested directly into EOTDL. If your dataset does not have STAC metadata but you want to ingest it as a Q1 dataset, the EOTDL library also offers functionality to create it. Let's see an example using the EuroSAT dataset. You can download the dataset [here](https://www.eotdl.com/datasets/EuroSAT-RGB). Then extract it and put it in the `data` folder.

In [2]:
import os 

os.listdir('data')

['.DS_Store', 'Eurosat', 'EuroSAT-STAC', 'EuroSAT-Q2']

The EuroSAT dataset contains satellite images for classification, i.e. each image has a single associated label. In this case, the label can be extracted from the folder structure.

In [10]:
labels = os.listdir('data/EuroSAT/2750')
labels

['Forest',
 'River',
 'Highway',
 'AnnualCrop',
 'SeaLake',
 'HerbaceousVegetation',
 'Industrial',
 'Residential',
 'PermanentCrop',
 'Pasture']
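
As a minimal sketch of that idea, the label of an image can be read from the name of its parent folder. The helper `label_from_path` is hypothetical and not part of the EOTDL library; it only assumes the `<class>/<image>` layout shown above.

```python
from pathlib import PurePosixPath


def label_from_path(image_path: str) -> str:
    """Return the class label, assumed to be the name of the image's parent folder."""
    return PurePosixPath(image_path).parent.name


print(label_from_path('data/EuroSAT/2750/Forest/Forest_864.jpg'))  # Forest
```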

For faster processing, we will generate a copy of the dataset with only 10 images per class.

In [11]:
import shutil

src_root = 'data/EuroSAT/2750'
dst_root = 'data/EuroSAT-small'
os.makedirs(dst_root, exist_ok=True)
for label in labels:
    os.makedirs(os.path.join(dst_root, label), exist_ok=True)
    # keep only the first 10 images listed for each class
    images = os.listdir(os.path.join(src_root, label))[:10]
    for image in images:
        shutil.copy(os.path.join(src_root, label, image), os.path.join(dst_root, label, image))
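
Before generating metadata, it can be worth checking that the reduced copy really holds 10 images per class. The following is a rough sketch using only the standard library; `count_images_per_class` is a hypothetical helper, not an EOTDL function, and it assumes every subfolder of the root is a class folder.

```python
import os
from collections import Counter


def count_images_per_class(root: str) -> Counter:
    """Count the files found in each class subfolder of `root`."""
    counts = Counter()
    for label in sorted(os.listdir(root)):
        class_dir = os.path.join(root, label)
        if os.path.isdir(class_dir):  # skip stray files such as .DS_Store
            counts[label] = len(os.listdir(class_dir))
    return counts


# Only runs if the reduced dataset from the previous cell exists.
if os.path.isdir('data/EuroSAT-small'):
    print(count_images_per_class('data/EuroSAT-small'))
```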

You can use the `STACGenerator` to create the STAC metadata for your dataset in the form of a dataframe. The item parser will depend on the structure of your dataset. We offer some predefined parsers for common datasets, but you can also create your own parser.

> TODO: How to create a parser.

In [4]:
from eotdl.curation.stac.parsers import UnestructuredParser
from eotdl.curation.stac.stac import STACGenerator
from eotdl.curation.stac.dataframe_labeling import LabeledStrategy

stac_generator = STACGenerator(image_format='jpg', item_parser=UnestructuredParser, labeling_strategy=LabeledStrategy)

df = stac_generator.get_stac_dataframe('data/EuroSAT-small')
df.head()


|   | image | label | ix | collection | extensions | bands |
|---|-------|-------|----|------------|------------|-------|
| 0 | data/EuroSAT-small/Forest/Forest_864.jpg | Forest | 0 | data/EuroSAT-small/source | | |
| 1 | data/EuroSAT-small/Forest/Forest_2917.jpg | Forest | 0 | data/EuroSAT-small/source | | |
| 2 | data/EuroSAT-small/Forest/Forest_2903.jpg | Forest | 0 | data/EuroSAT-small/source | | |
| 3 | data/EuroSAT-small/Forest/Forest_870.jpg | Forest | 0 | data/EuroSAT-small/source | | |
| 4 | data/EuroSAT-small/Forest/Forest_680.jpg | Forest | 0 | data/EuroSAT-small/source | | |


Now we save the STAC metadata. The `id` given to the STAC catalog will be used as the name of the dataset in EOTDL, so it must meet the name requirements described in the [documentation](/docs/datasets/ingest).

In [5]:
output = 'data/EuroSAT-STAC'
stac_generator.generate_stac_metadata(id='eurosat-rgb', description='EuroSAT-RGB dataset', stac_dataframe=df, output_folder=output)

Generating source collection...

100%|██████████| 100/100 [00:00<00:00, 784.62it/s]

Validating and saving catalog...
Success!
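
A STAC catalog is plain JSON, so the result can be inspected without any extra dependency. The sketch below lists the collections linked from the catalog through `child` links; `child_collections` is a hypothetical helper, and it assumes the catalog was written to `data/EuroSAT-STAC/catalog.json` and uses relative hrefs.

```python
import json
import os


def child_collections(catalog_path: str) -> list:
    """Return the paths of the collections linked as children of a STAC catalog."""
    with open(catalog_path, 'r') as f:
        catalog = json.load(f)
    base = os.path.dirname(catalog_path)
    return [
        # hrefs are resolved relative to the folder that holds the catalog
        os.path.normpath(os.path.join(base, link['href']))
        for link in catalog.get('links', [])
        if link.get('rel') == 'child'
    ]


catalog_path = 'data/EuroSAT-STAC/catalog.json'
if os.path.exists(catalog_path):  # only runs once the metadata has been generated
    print(child_collections(catalog_path))
```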





Optionally, you can also generate the labels using the STAC label extension.

In [None]:
from eotdl.curation.stac.extensions.label import ImageNameLabeler

catalog = output + '/catalog.json'
labels_extra_properties = {'label_properties': ["label"],
                          'label_methods': ["manual"],
                          'label_tasks': ["classification"]}

labeler = ImageNameLabeler()
labeler.generate_stac_labels(catalog, stac_dataframe=df, **labels_extra_properties)

Once the STAC metadata is generated, we can ingest the dataset into EOTDL.

In [1]:
from eotdl.datasets import ingest_dataset

ingest_dataset('data/EuroSAT-STAC')

Loading STAC catalog...
Uploading assets...


100%|██████████| 200/200 [00:10<00:00, 18.29it/s]


Ingesting STAC catalog...
Done


{'uid': 'auth0|642adbfdb3da3ab51492d60a',
 'id': '651a8cbb98ed69fa11fec60d',
 'name': 'eurosat-rgb',
 'description': '',
 'tags': [],
 'createdAt': '2023-10-02T11:24:48.008000',
 'updatedAt': '2023-10-02T11:26:29.981000',
 'likes': 0,
 'downloads': 0,
 'quality': 1,
 'size': 355176,
 'catalog': {'type': 'Catalog',
  'id': 'eurosat-rgb',
  'stac_version': '1.0.0',
  'description': 'EuroSAT-RGB dataset',
  'links': [{'rel': 'self',
    'href': '/Users/fran/Documents/Projects/eotdl/tutorials/data/EuroSAT-STAC/catalog.json',
    'type': 'application/json'},
   {'rel': 'root', 'href': './catalog.json', 'type': 'application/json'},
   {'rel': 'child',
    'href': './source/collection.json',
    'type': 'application/json'},
   {'rel': 'child',
    'href': './labels/collection.json',
    'type': 'application/json'}],
  'extent': None,
  'license': None,
  'stac_extensions': None,
  'summaries': None,
  'properties': None,
  'assets': None,
  'bbox': None,
  'collection': None}}

After the ingestion, you can explore and download your dataset as shown in the previous tutorial.

In [2]:
from eotdl.datasets import list_datasets

datasets = list_datasets()
datasets

['eurosat-rgb', 'asd']

In [18]:
from eotdl.datasets import download_dataset

dst_path = download_dataset('eurosat-rgb')
dst_path

Downloading STAC metadata...
To download assets, set assets=True or -a in the CLI.


'/home/juan/.cache/eotdl/datasets/eurosat-rgb'

By default, only the STAC metadata is downloaded. If you also want to download the actual data, set the `assets` parameter to `True`.

> The `force` parameter will overwrite the dataset if it already exists.

In [20]:
from eotdl.datasets import download_dataset

dst_path = download_dataset('eurosat-rgb', force=True, assets=True)
dst_path

Downloading STAC metadata...
Downloading assets...


100%|██████████| 200/200 [00:34<00:00,  5.85it/s]


'/home/juan/.cache/eotdl/datasets/eurosat-rgb'

You will find the data in the `assets` subfolder, which contains one subfolder per item, named after the item's `id`, holding all the assets for that item.

In [25]:
from glob import glob

glob(dst_path + '/assets/**/*.jpg')[:3]

['/home/juan/.cache/eotdl/datasets/eurosat-rgb/assets/River_1655/River_1655.jpg',
 '/home/juan/.cache/eotdl/datasets/eurosat-rgb/assets/AnnualCrop_1142/AnnualCrop_1142.jpg',
 '/home/juan/.cache/eotdl/datasets/eurosat-rgb/assets/Industrial_435/Industrial_435.jpg']
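
Since each item folder is named after the item `id`, the downloaded images can be regrouped by class. This is only a sketch: `group_assets_by_label` is a hypothetical helper, and it assumes the ids follow the `<Label>_<number>` pattern seen in the output above.

```python
from collections import defaultdict
from pathlib import Path


def group_assets_by_label(assets_dir: str) -> dict:
    """Group downloaded .jpg assets by the label encoded in their item id."""
    groups = defaultdict(list)
    # One subfolder per item, named after the item id (e.g. River_1655).
    for image in sorted(Path(assets_dir).glob('*/*.jpg')):
        item_id = image.parent.name
        label = item_id.rsplit('_', 1)[0]  # assumes ids look like <Label>_<number>
        groups[label].append(str(image))
    return dict(groups)


# Usage in this notebook: group_assets_by_label(dst_path + '/assets')
```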

Alternatively, you can download an asset using its URL.

In [27]:
import json

with open(dst_path + '/eurosat-rgb/source/Highway_594/Highway_594.json', 'r') as f:
    data = json.load(f)

data['assets']

{'Highway_594': {'href': 'https://api.eotdl.com/datasets/6503f8a3d05a1b62cc273ea4/download/Highway_594.jpg',
  'type': 'image/jpeg',
  'title': 'Highway_594',
  'roles': ['data']}}
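
An item may carry more than one asset. As a small sketch, the URL and file name of every asset in an item can be collected in one pass; `asset_downloads` is a hypothetical helper, and the item below simply reuses the asset shown above.

```python
import posixpath
from urllib.parse import urlparse


def asset_downloads(item: dict) -> dict:
    """Map each asset key of a STAC item to (url, file name taken from the url)."""
    downloads = {}
    for key, asset in item.get('assets', {}).items():
        url = asset['href']
        file_name = posixpath.basename(urlparse(url).path)
        downloads[key] = (url, file_name)
    return downloads


item = {
    'assets': {
        'Highway_594': {
            'href': 'https://api.eotdl.com/datasets/6503f8a3d05a1b62cc273ea4/download/Highway_594.jpg',
            'type': 'image/jpeg',
            'title': 'Highway_594',
            'roles': ['data'],
        }
    }
}
print(asset_downloads(item)['Highway_594'][1])  # Highway_594.jpg
```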

In [28]:
from eotdl.datasets import download_file_url

url = data['assets']['Highway_594']['href']
download_file_url(url, 'data')

100%|██████████| 4.07k/4.07k [00:00<00:00, 743kiB/s]


'data/Highway_594.jpg'