In [2]:
%load_ext autoreload
%autoreload 2



# Q1 Training Datasets

Training Datasets (TDS) in EOTDL are categorized into different [quality levels](https://eotdl.com/docs/datasets/quality), which determine the range of functionality available for each dataset.

In this tutorial you will learn about Q1 datasets, i.e. datasets with STAC metadata.

To ingest a Q1 dataset you will need its STAC metadata.
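
For orientation, a STAC Item is a GeoJSON Feature with a few extra fields. The sketch below builds a minimal one as a plain Python dictionary. The `id`, file name, and datetime are illustrative placeholders, not values taken from EuroSAT, and the metadata generated by EOTDL may contain further fields.

```python
import json

# Minimal STAC Item sketch. All values are illustrative placeholders.
item = {
    "type": "Feature",              # a STAC Item is a GeoJSON Feature
    "stac_version": "1.0.0",
    "id": "example-item",           # hypothetical identifier
    "geometry": None,               # may be null when the data is not georeferenced
    "properties": {"datetime": "2020-01-01T00:00:00Z"},
    "links": [],
    "assets": {
        "example-item": {
            "href": "./example-item.jpg",   # hypothetical file name
            "type": "image/jpeg",
            "roles": ["data"],
        }
    },
}

print(json.dumps(item, indent=2))
```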

Some datasets already have STAC metadata and can be ingested directly into EOTDL. If your dataset does not have STAC metadata but you want to ingest it as a Q1 dataset, the EOTDL library also offers functionality to create the metadata. Let's see an example using the EuroSAT dataset.

In [11]:
from eotdl.datasets import download_dataset

download_dataset("EuroSAT-RGB", version=1, path="data")

Exception: Dataset `EuroSAT-RGB v1` already exists at data/EuroSAT-RGB/v1. To force download, use force=True or -f in the CLI.

In [2]:
!ls data/EuroSAT-RGB/v1

EuroSAT-RGB.zip


In [4]:
!unzip -q data/EuroSAT-RGB/v1/EuroSAT-RGB.zip -d data/EuroSAT-RGB

The EuroSAT dataset contains satellite images for classification, i.e. each image has exactly one associated label. In this case, the label can be extracted from the folder structure.

In [6]:
import os 

labels = os.listdir('data/EuroSAT-RGB/2750')
labels

['Industrial',
 'Forest',
 'HerbaceousVegetation',
 'PermanentCrop',
 'Highway',
 'Residential',
 'SeaLake',
 'River',
 'AnnualCrop',
 'Pasture']
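
Since the label is simply the name of the parent folder, it can be read from each image path. The following is a minimal sketch using only the standard library. The helper name `label_from_path` and the temporary files are illustrative and not part of EOTDL.

```python
import tempfile
from pathlib import Path

def label_from_path(image_path):
    """Return the class label, i.e. the name of the folder holding the image."""
    return Path(image_path).parent.name

# Build a tiny stand-in for the dataset layout: <root>/<label>/<image>.jpg
with tempfile.TemporaryDirectory() as root:
    for label in ("Forest", "River"):
        folder = Path(root) / label
        folder.mkdir()
        (folder / f"{label}_1.jpg").touch()

    images = sorted(Path(root).glob("*/*.jpg"))
    image_labels = {p.name: label_from_path(p) for p in images}

print(image_labels)
```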

For faster processing, we will generate a copy of the dataset with only 10 images per class.

In [12]:
import os
import shutil

src_root = 'data/EuroSAT-RGB/2750'
dst_root = 'data/EuroSAT-RGB-small'

# Recompute the class names so this cell also works on a fresh kernel
labels = os.listdir(src_root)

# Copy 10 images of each class into the smaller dataset
for label in labels:
    os.makedirs(os.path.join(dst_root, label), exist_ok=True)
    images = os.listdir(os.path.join(src_root, label))[:10]
    for image in images:
        shutil.copy(os.path.join(src_root, label, image), os.path.join(dst_root, label, image))

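
Note that `os.listdir` returns entries in arbitrary order, so the ten images selected per class may differ between machines. If a reproducible subset matters, sorting the file names first is one option. A minimal sketch, with `copy_subset` as a hypothetical helper and temporary folders standing in for the real data:

```python
import os
import shutil
import tempfile

def copy_subset(src_root, dst_root, per_class=10):
    """Copy the first `per_class` files (by sorted name) of each class folder."""
    for label in sorted(os.listdir(src_root)):
        src_dir = os.path.join(src_root, label)
        if not os.path.isdir(src_dir):
            continue  # skip stray files next to the class folders
        dst_dir = os.path.join(dst_root, label)
        os.makedirs(dst_dir, exist_ok=True)
        for name in sorted(os.listdir(src_dir))[:per_class]:
            shutil.copy(os.path.join(src_dir, name), os.path.join(dst_dir, name))

# Demonstration on a throwaway layout instead of the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = os.path.join(tmp, "src"), os.path.join(tmp, "dst")
    os.makedirs(os.path.join(src, "Forest"))
    for i in range(5):
        open(os.path.join(src, "Forest", f"Forest_{i}.jpg"), "w").close()
    copy_subset(src, dst, per_class=2)
    copied = sorted(os.listdir(os.path.join(dst, "Forest")))

print(copied)
```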

You can use the `STACGenerator` to create the STAC metadata for your dataset in the form of a dataframe. The item parser will depend on the structure of your dataset. We offer some predefined parsers for common datasets, but you can also create your own parser.

In [2]:
from eotdl.curation.stac.parsers import UnestructuredParser
from eotdl.curation.stac.stac import STACGenerator
from eotdl.curation.stac.dataframe_labeling import LabeledStrategy

stac_generator = STACGenerator(image_format='jpg', item_parser=UnestructuredParser, labeling_strategy=LabeledStrategy)

df = stac_generator.get_stac_dataframe('data/EuroSAT-RGB-small')
df.head()


| | image | label | ix | collection | extensions | bands |
|---|---|---|---|---|---|---|
| 0 | data/EuroSAT-RGB-small/Industrial/Industrial_1... | Industrial | 0 | data/EuroSAT-RGB-small/source | | |
| 1 | data/EuroSAT-RGB-small/Industrial/Industrial_1... | Industrial | 0 | data/EuroSAT-RGB-small/source | | |
| 2 | data/EuroSAT-RGB-small/Industrial/Industrial_1... | Industrial | 0 | data/EuroSAT-RGB-small/source | | |
| 3 | data/EuroSAT-RGB-small/Industrial/Industrial_1... | Industrial | 0 | data/EuroSAT-RGB-small/source | | |
| 4 | data/EuroSAT-RGB-small/Industrial/Industrial_1... | Industrial | 0 | data/EuroSAT-RGB-small/source | | |


Now we save the STAC metadata. The `id` given to the STAC catalog will be used as the name of the dataset in EOTDL, so it must meet the naming requirements described in the [documentation](/docs/datasets/ingest).

In [3]:
output = 'data/EuroSAT-RGB-small-STAC'
stac_generator.generate_stac_metadata(stac_id='eurosat-rgb', description='EuroSAT-RGB dataset', stac_dataframe=df, output_folder=output)


Generating source collection...


100%|██████████| 100/100 [00:00<00:00, 1004.50it/s]

Validating and saving catalog...
Success!
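
The generated `catalog.json` is ordinary JSON, so it can be inspected without extra tooling. The sketch below lists the `child` links of a catalog. The inline dictionary is an illustrative stand-in (its hrefs are assumptions about the layout), and `child_hrefs` is a hypothetical helper. In practice you would load `output + '/catalog.json'` with `json.load` and pass the result in.

```python
def child_hrefs(catalog):
    """Return the href of every link whose relation type is 'child'."""
    return [
        link["href"]
        for link in catalog.get("links", [])
        if link.get("rel") == "child"
    ]

# Illustrative stand-in for a catalog; real files may contain more links and fields.
example_catalog = {
    "type": "Catalog",
    "id": "eurosat-rgb",
    "description": "EuroSAT-RGB dataset",
    "links": [
        {"rel": "root", "href": "./catalog.json"},
        {"rel": "child", "href": "./source/collection.json"},  # assumed layout
    ],
}

print(child_hrefs(example_catalog))
```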





Optionally, you can also generate the labels using the STAC label extension.

In [4]:
from eotdl.curation.stac.extensions.label import ImageNameLabeler

catalog = output + '/catalog.json'
labels_extra_properties = {'label_properties': ["label"],
                          'label_methods': ["manual"],
                          'label_tasks': ["classification"]}

labeler = ImageNameLabeler()
labeler.generate_stac_labels(catalog, stac_dataframe=df, **labels_extra_properties)

Generating labels collection...


100it [00:00, 2450.82it/s]


Success on labels generation!
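
In the STAC label extension, label metadata is stored in item properties prefixed with `label:`. The sketch below shows how the keyword arguments above could map to those fields. This mapping is an assumption about what the labeler writes, not its actual implementation, and `to_label_properties` is a hypothetical helper.

```python
def to_label_properties(label_properties, label_methods, label_tasks):
    """Map the keyword arguments to STAC label-extension property names."""
    return {
        "label:properties": label_properties,
        "label:methods": label_methods,
        "label:tasks": label_tasks,
    }

props = to_label_properties(
    label_properties=["label"],
    label_methods=["manual"],
    label_tasks=["classification"],
)
print(props)
```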


Once the STAC metadata is generated, we can ingest the dataset into EOTDL.

In [6]:
from eotdl.datasets import ingest_dataset

ingest_dataset('data/EuroSAT-RGB-small-STAC')

Loading STAC catalog...
New version created, version: 1


100%|██████████| 200/200 [00:03<00:00, 57.19it/s]


Ingesting STAC catalog...
Done


After the ingestion, you can explore and download your dataset as shown in the previous tutorial.

In [7]:
from eotdl.datasets import retrieve_datasets

datasets = retrieve_datasets()
datasets

['eurosat-rgb']

In [10]:
from eotdl.datasets import download_dataset

dst_path = download_dataset('eurosat-rgb')
dst_path

Downloading STAC metadata...
To download assets, set assets=True or -a in the CLI.


'/home/juan/.cache/eotdl/datasets/eurosat-rgb/v1'

By default, only the STAC metadata is downloaded. To download the actual data as well, set the `assets` parameter to `True`.

> The `force` parameter will overwrite the dataset if it already exists.

In [2]:
from eotdl.datasets import download_dataset

dst_path = download_dataset('eurosat-rgb', force=True, assets=True)
dst_path

100%|██████████| 200/200 [00:01<00:00, 111.22it/s]


'/home/juan/.cache/eotdl/datasets/eurosat-rgb/v1'

You will find the data in the `assets` subfolder. In the listing below, each file is named after the `id` of its item.

In [3]:
from glob import glob

glob(dst_path + '/assets/*.jpg')[:3]

['/home/juan/.cache/eotdl/datasets/eurosat-rgb/v1/assets/AnnualCrop_1033.jpg',
 '/home/juan/.cache/eotdl/datasets/eurosat-rgb/v1/assets/HerbaceousVegetation_1743.jpg',
 '/home/juan/.cache/eotdl/datasets/eurosat-rgb/v1/assets/HerbaceousVegetation_1977.jpg']
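
The file names above follow a `<label>_<number>.jpg` pattern, so the class can be recovered from a downloaded asset path. A minimal sketch, assuming that pattern holds for every file (`label_from_filename` is a hypothetical helper):

```python
import os

def label_from_filename(path):
    """Extract the label from a '<label>_<number>.<ext>' file name."""
    stem = os.path.splitext(os.path.basename(path))[0]
    # Split on the last underscore so the numeric suffix is dropped
    label, _, _ = stem.rpartition("_")
    return label

paths = [
    "assets/AnnualCrop_1033.jpg",
    "assets/HerbaceousVegetation_1743.jpg",
]
asset_labels = [label_from_filename(p) for p in paths]
print(asset_labels)
```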

Alternatively, you can download a single asset using its URL, which is stored in the STAC item.

In [4]:
import json

with open(dst_path + '/eurosat-rgb/source/Highway_594/Highway_594.json', 'r') as f:
    data = json.load(f)

data['assets']

{'Highway_594': {'href': 'http://localhost:8010/datasets/6544efac6b6fb48e5294be7f/download/Highway_594.jpg',
  'type': 'image/jpeg',
  'title': 'Highway_594',
  'roles': ['data']}}
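
An item can hold more than one asset, so it may help to collect every `href` before downloading. A minimal sketch over an `assets` mapping like the one shown above. `asset_hrefs` is a hypothetical helper and the example URL is a placeholder.

```python
def asset_hrefs(item):
    """Return {asset_key: href} for every asset that declares an href."""
    return {
        key: asset["href"]
        for key, asset in item.get("assets", {}).items()
        if "href" in asset
    }

# Stand-in item with a placeholder URL; in practice pass the loaded `data` dictionary.
example_item = {
    "assets": {
        "Highway_594": {
            "href": "https://example.com/Highway_594.jpg",
            "type": "image/jpeg",
            "roles": ["data"],
        }
    }
}
print(asset_hrefs(example_item))
```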

In [7]:
from eotdl.datasets import download_file_url

url = data['assets']['Highway_594']['href']
download_file_url(url, 'data')

'data/assets/Highway_594.jpg'