In [1]:
%load_ext autoreload
%autoreload 2

import os
os.environ["EOTDL_API_URL"] = "http://localhost:8000/"

# Ingesting Datasets and Models

In this notebook we are going to showcase how to ingest an existing dataset or model into EOTDL.

Once it is ingested, you can use it in the same way as any other dataset or model in EOTDL (exploring, staging, etc.).

The recommended way to ingest a dataset is using the CLI.

In [2]:
!eotdl datasets ingest --help

 Usage: eotdl datasets ingest [OPTIONS]

 Ingest a dataset to the EOTDL.
 This command ingests the dataset to the EOTDL. The dataset must be a folder
 with the dataset files, and at least a README.md file (and a catalog.json file
 for Q1+). If these files are missing, the ingestion will not work. All the
 files in the folder will be uploaded to the EOTDL.

 The following constraints apply to the dataset name:
 - It must be unique

There are several ways in which you can ingest a dataset:

1. From a local folder in your system with the data you want to upload (with or without a STAC catalog).
2. From a list of links to assets in another repository (cloud bucket, Hugging Face, etc.).

## Ingesting a local dataset

In [3]:
!ls example_data

EuroSAT-RGB-small  EuroSAT-RGB-small-STAC  EuroSAT-small


For this tutorial we are going to work with a subsample of the [EuroSAT](https://www.eotdl.com/datasets/EuroSAT-RGB) dataset.

In [4]:
!rm -rf example_data/EuroSAT-small/catalog.parquet
!rm -rf example_data/EuroSAT-small/README.md

In [5]:
from glob import glob 

path = "example_data/EuroSAT-small"
files = glob(f'{path}/**/*.*', recursive=True)
files

['example_data/EuroSAT-small/Forest/Forest_3.tif',
 'example_data/EuroSAT-small/Forest/Forest_1.tif',
 'example_data/EuroSAT-small/Forest/Forest_2.tif',
 'example_data/EuroSAT-small/AnnualCrop/AnnualCrop_3.tif',
 'example_data/EuroSAT-small/AnnualCrop/AnnualCrop_1.tif',
 'example_data/EuroSAT-small/AnnualCrop/AnnualCrop_2.tif']

In all cases, a `README.md` file is required to ingest datasets and models. It must contain some basic information: the dataset name, its authors, the license and a link to the source.
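As a rough illustration, the required fields can be checked locally before attempting an ingestion. This is a minimal sketch, not the validation EOTDL itself performs: `missing_readme_fields` and `REQUIRED_FIELDS` are hypothetical names, and the parsing only looks for top-level `key:` lines in the front matter rather than doing real YAML parsing.

```python
# Hypothetical helper; the field list is taken from the text above.
REQUIRED_FIELDS = ("name", "authors", "license", "source")


def missing_readme_fields(readme_text):
    """Return the required front-matter keys that are absent from a README.

    Simplified check: it only looks for top-level `key:` lines between the
    first two `---` markers and is not a full YAML parser.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return list(REQUIRED_FIELDS)  # no front matter at all

    found = set()
    for line in lines[1:]:
        if line.strip() == "---":  # end of the front matter
            break
        # Top-level keys start at column 0; indented lines are list items.
        if line and not line[0].isspace() and ":" in line:
            found.add(line.split(":", 1)[0].strip())
    return [field for field in REQUIRED_FIELDS if field not in found]


example = """---
name: EuroSAT-small
authors:
  - Juan B. Pedro
license: free
---

# EuroSAT-small
"""

print(missing_readme_fields(example))  # ['source']
```

An empty list means all four fields are present. It says nothing about whether their values will be accepted by the platform.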

In [6]:
# create README.md

text = """---
name: EuroSAT-small
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/main/tutorials/notebooks/02_ingesting.ipynb
---

# EuroSAT-small

This is a small subset of the EuroSAT dataset.
"""

with open(f"{path}/README.md", "w") as outfile:
    outfile.write(text)

In [7]:
!cat {path}/README.md

---
name: EuroSAT-small
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/main/tutorials/notebooks/02_ingesting.ipynb
---

# EuroSAT-small

This is a small subset of the EuroSAT dataset.


The `name` property in the `README.md` file is used as the name of the dataset or model in the repository. It must therefore be unique, be between 3 and 45 characters long, and contain only alphanumeric characters and dashes (learn more at [https://www.eotdl.com/docs/datasets/ingest](https://www.eotdl.com/docs/datasets/ingest)).
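The length and character constraints can be checked locally with a regular expression. The sketch below is an assumption about how the rule could be expressed (the pattern and the `is_valid_name` helper are illustrative, not part of the EOTDL library), and it cannot check uniqueness, which only the platform knows.

```python
import re

# Assumed pattern: 3 to 45 characters, ASCII letters, digits and dashes only.
NAME_PATTERN = re.compile(r"[A-Za-z0-9-]{3,45}")


def is_valid_name(name):
    """Check the length and character rules described above (not uniqueness)."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["EuroSAT-small", "ab", "Euro SAT", "x" * 46]:
    print(f"{candidate[:20]!r}: {is_valid_name(candidate)}")
```

Here only `EuroSAT-small` passes: `ab` is too short, `Euro SAT` contains a space, and the last candidate is 46 characters long.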

Trying to ingest a dataset without a `README.md` file will fail.

If everything is correct, the ingestion process should succeed.

In [8]:
!eotdl datasets ingest -p example_data/EuroSAT-small

Ingesting directory: example_data/EuroSAT-small
current version:  1
Ingesting files: 100%|███████████████████████████| 7/7 [00:00<00:00, 144.96it/s]
No new version was created, your dataset has not changed.


Your dataset is now available in EOTDL.

In [9]:
!eotdl datasets list -n eurosat-small

['EuroSAT-small']


> Since the `EuroSAT-small` name is already taken, this process should fail for you. To solve it, upload the dataset with a different name. However, this will pollute EOTDL with test datasets, so we encourage you to try the ingestion process with a real dataset that you want to ingest (or to overwrite your test dataset in the future with useful data).

### Ingesting a local STAC catalog

Before the ingestion, the CLI will create a STAC-compliant `parquet` file with the metadata of the dataset.

In [10]:
import geopandas as gpd

catalog = f"{path}/catalog.parquet"

gdf = gpd.read_parquet(catalog)
gdf.head()

| | type | stac_version | stac_extensions | datetime | id | bbox | geometry | assets | links | repository |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Feature | 1.0.0 | [] | 2025-02-06 17:37:26.603699 | README.md | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': 'a6bb30a57d0f5ff0aaa65b... | [] | eotdl |
| 1 | Feature | 1.0.0 | [] | 2025-02-06 17:37:26.603847 | Forest/Forest_3.tif | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': '3e7bb982f9db5f7dabc556... | [] | eotdl |
| 2 | Feature | 1.0.0 | [] | 2025-02-06 17:37:26.604006 | Forest/Forest_1.tif | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': 'f3b8b9fef6b2df6f24792e... | [] | eotdl |
| 3 | Feature | 1.0.0 | [] | 2025-02-06 17:37:26.604157 | Forest/Forest_2.tif | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': '2e38dab64435bfbab25bab... | [] | eotdl |
| 4 | Feature | 1.0.0 | [] | 2025-02-06 17:37:26.604312 | AnnualCrop/AnnualCrop_3.tif | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': '59330fce6d0bf01078db3d... | [] | eotdl |


However, if your local dataset already contains a STAC catalog, the available schema will be used to create the EOTDL `parquet` catalog (including the different STAC extensions or properties that might be present).

In [11]:
path = 'example_data/EuroSAT-RGB-small-STAC'

files = os.listdir(path)
assert 'catalog.json' in files, "catalog.json not found"

!cat {path}/catalog.json

{
  "type": "Catalog",
  "id": "EuroSAT-RGB-Q1",
  "stac_version": "1.0.0",
  "description": "EuroSAT-RGB dataset",
  "links": [
    {
      "rel": "root",
      "href": "./catalog.json",
      "type": "application/json"
    },
    {
      "rel": "child",
      "href": "./source/collection.json",
      "type": "application/json"
    },
    {
      "rel": "child",
      "href": "./labels/collection.json",
      "type": "application/json"
    }
  ]
}

In [12]:
# create README.md

text = """---
name: EuroSAT-RGB-small-STAC
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb
---

# EuroSAT-RGB-small-STAC

This is a prototype of the EuroSAT dataset.
"""

with open(f"{path}/README.md", "w") as outfile:
    outfile.write(text)

In [13]:
!eotdl datasets ingest -p example_data/EuroSAT-RGB-small-STAC

Ingesting items from collection source: 100it [00:00, 273601.04it/s]
Ingesting items from collection labels: 100it [00:00, 279993.59it/s]
current version:  1
Ingesting files:   0%|                                  | 0/200 [00:00<?, ?it/s]Error uploading asset 0: 'checksum'
Ingesting files:   0%|                                  | 0/200 [00:00<?, ?it/s]
No new version was created, your dataset has not changed.


The resulting `catalog.parquet` file contains the same information as the STAC catalog for all the items.

In [14]:
import geopandas as gpd

gdf = gpd.read_parquet(path + "/catalog.parquet")
gdf.head()

| | assets | bbox | collection | geometry | id | links | stac_extensions | stac_version | type | datetime | label:classes | label:description | label:methods | label:properties | label:tasks | label:type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'asset': {'href': '/home/juan/Desktop/eotdl/t... | {'xmin': 0, 'ymin': 0, 'xmax': 0, 'ymax': 0} | source | POLYGON ((0.00000 0.00000, 0.00000 0.00000, 0.... | Industrial_1743 | [{'href': '/home/juan/Desktop/eotdl/tutorials/... | [] | 1.0.0 | Feature | 2000-01-01 00:00:00+00:00 |  |  |  |  |  |  |
| 1 | {'asset': {'href': '/home/juan/Desktop/eotdl/t... | {'xmin': 0, 'ymin': 0, 'xmax': 0, 'ymax': 0} | source | POLYGON ((0.00000 0.00000, 0.00000 0.00000, 0.... | Industrial_1273 | [{'href': '/home/juan/Desktop/eotdl/tutorials/... | [] | 1.0.0 | Feature | 2000-01-01 00:00:00+00:00 |  |  |  |  |  |  |
| 2 | {'asset': {'href': '/home/juan/Desktop/eotdl/t... | {'xmin': 0, 'ymin': 0, 'xmax': 0, 'ymax': 0} | source | POLYGON ((0.00000 0.00000, 0.00000 0.00000, 0.... | Industrial_1117 | [{'href': '/home/juan/Desktop/eotdl/tutorials/... | [] | 1.0.0 | Feature | 2000-01-01 00:00:00+00:00 |  |  |  |  |  |  |
| 3 | {'asset': {'href': '/home/juan/Desktop/eotdl/t... | {'xmin': 0, 'ymin': 0, 'xmax': 0, 'ymax': 0} | source | POLYGON ((0.00000 0.00000, 0.00000 0.00000, 0.... | Industrial_1121 | [{'href': '/home/juan/Desktop/eotdl/tutorials/... | [] | 1.0.0 | Feature | 2000-01-01 00:00:00+00:00 |  |  |  |  |  |  |
| 4 | {'asset': {'href': '/home/juan/Desktop/eotdl/t... | {'xmin': 0, 'ymin': 0, 'xmax': 0, 'ymax': 0} | source | POLYGON ((0.00000 0.00000, 0.00000 0.00000, 0.... | Industrial_1641 | [{'href': '/home/juan/Desktop/eotdl/tutorials/... | [] | 1.0.0 | Feature | 2000-01-01 00:00:00+00:00 |  |  |  |  |  |  |


## Ingesting with the library

You can also ingest datasets with the library (you will need to create a `README.md` file as well).
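Since the library route expects the same `README.md`, it can be convenient to generate the file from Python before calling the ingestion function. The following is only a sketch: `write_readme` is a hypothetical helper (not part of `eotdl`), it mirrors the front-matter layout used earlier in this notebook, and the dataset name, author and URL are placeholders.

```python
import tempfile
from pathlib import Path


def write_readme(folder, name, authors, license_name, source, body=""):
    """Write a README.md with the front-matter layout used in this notebook."""
    author_lines = "\n".join(f"  - {author}" for author in authors)
    text = (
        "---\n"
        f"name: {name}\n"
        "authors:\n"
        f"{author_lines}\n"
        f"license: {license_name}\n"
        f"source: {source}\n"
        "---\n\n"
        f"# {name}\n\n"
        f"{body}\n"
    )
    readme = Path(folder) / "README.md"
    readme.write_text(text, encoding="utf-8")
    return readme


# Demo on a temporary folder so that no real dataset is touched.
with tempfile.TemporaryDirectory() as tmp:
    readme = write_readme(
        tmp,
        name="my-dataset",  # placeholder values
        authors=["Jane Doe"],
        license_name="free",
        source="https://example.com/my-dataset",
        body="A short description of the dataset.",
    )
    readme_text = readme.read_text(encoding="utf-8")
    print(readme_text)
```

With the file in place, the folder can be passed to the ingestion function.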

In [15]:
from eotdl.datasets import ingest_dataset

ingest_dataset("example_data/EuroSAT-small")

Ingesting directory: example_data/EuroSAT-small
current version:  1


Ingesting files: 100%|██████████| 7/7 [00:00<00:00, 127.46it/s]

No new version was created, your dataset has not changed.





## Ingesting a virtual dataset

Option 2 consists of creating a `virtual dataset` from a list of links to assets in another repository (cloud bucket, Hugging Face, etc.), and is only available through the library.
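Because a virtual dataset is only a list of links, it may be worth sanity-checking the list before ingesting it. The sketch below assumes that links should be absolute `http` or `https` URLs, which is an assumption rather than a documented EOTDL rule. `invalid_links` is a hypothetical helper, and it does not check that the links are reachable.

```python
from urllib.parse import urlparse


def invalid_links(links):
    """Return the links that are not absolute http(s) URLs."""
    bad = []
    for link in links:
        parsed = urlparse(link)
        # An absolute URL needs both a scheme and a network location.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            bad.append(link)
    return bad


candidates = [
    "https://link1.com",
    "ftp://link2.com/file.tif",  # unsupported scheme under our assumption
    "link3.com/file.tif",        # missing scheme
]
print(invalid_links(candidates))
# ['ftp://link2.com/file.tif', 'link3.com/file.tif']
```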

In [16]:
links = [
	'https://link1.com',
	'https://link2.com',
	'https://link3.com',
]

metadata = {
	'name': 'Test-links',
	'authors': ['Juan B. Pedro'],
	'license': 'free',
	'source': 'https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb',
	'description': """# Test links

Testing the ingestion of a dataset from a list of links.
"""
}

In [17]:
from eotdl.datasets import ingest_virutal_dataset

path = 'data/test-links'

ingest_virutal_dataset(path, links, metadata)

current version:  1
ERROR generate_presigned_url File `catalog.v1.parquet` does not exist


Ingesting files: 100%|██████████| 4/4 [00:00<00:00, 244.10it/s]

PosixPath('data/test-links/catalog.parquet')

If you already have a `README.md` file in the target path, you can omit the `metadata` argument.

In [24]:
ingest_virutal_dataset(path, links)

current version:  1


Ingesting files: 100%|██████████| 4/4 [00:00<00:00, 175.19it/s]

A new version was created, your dataset has changed.
Num changes: 1



PosixPath('data/test-links/catalog.parquet')

The `catalog.parquet` file will be created in the provided path.

In [25]:
import geopandas as gpd

catalog = f"{path}/catalog.parquet"

gdf = gpd.read_parquet(catalog)
gdf.head()

| | type | stac_version | stac_extensions | datetime | id | bbox | geometry | assets | links | repository |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Feature | 1.0.0 | [] | 2025-02-06 17:38:12.598549 | https://link1.com | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': None, 'href': 'https://... | [] | eotdl |
| 1 | Feature | 1.0.0 | [] | 2025-02-06 17:38:12.598606 | https://link2.com | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': None, 'href': 'https://... | [] | eotdl |
| 2 | Feature | 1.0.0 | [] | 2025-02-06 17:38:12.598615 | https://link3.com | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': None, 'href': 'https://... | [] | eotdl |
| 3 | Feature | 1.0.0 | [] | 2025-02-06 17:38:12.598636 | README.md-287835 | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': '3ff7a31e45ae25456d787e... | [] | eotdl |


## Ingesting models

You can ingest a model in exactly the same way, using the `eotdl models ingest` command.

In [26]:
# !eotdl models ingest --help

## Versioning

By default, every time you re-upload a dataset or model a new version is created if any changes are detected (new files, modified files, removed files).
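The idea of detecting new, modified and removed files can be illustrated by comparing per-file checksums between two snapshots. This is a conceptual sketch only: `diff_versions` is a hypothetical function, the use of SHA-1 is an assumption, and EOTDL's actual change detection may work differently.

```python
import hashlib


def checksum(data):
    """Hash raw bytes; SHA-1 is an assumption made for this illustration."""
    return hashlib.sha1(data).hexdigest()


def diff_versions(previous, current):
    """Compare two {relative_path: checksum} mappings."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    modified = sorted(
        path
        for path in set(previous) & set(current)
        if previous[path] != current[path]
    )
    return {"added": added, "modified": modified, "removed": removed}


v1 = {
    "README.md": checksum(b"first readme"),
    "Forest/Forest_1.tif": checksum(b"pixels"),
}
v2 = {
    "README.md": checksum(b"edited readme"),
    "Forest/Forest_1.tif": checksum(b"pixels"),
    "test.txt": checksum(b"This is a new file!"),
}

changes = diff_versions(v1, v2)
print(changes)
# {'added': ['test.txt'], 'modified': ['README.md'], 'removed': []}

# A new version is only warranted when at least one list is non-empty.
needs_new_version = any(changes.values())
print(needs_new_version)  # True
```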

When you download a dataset, the latest version is used by default.

In [27]:
!eotdl datasets get EuroSAT-small

Data available at /home/juan/.cache/eotdl/datasets/EuroSAT-small


In [28]:
!ls $HOME/.cache/eotdl/datasets/EuroSAT-small

catalog.v1.parquet


However, you can request a specific version with the `-v` option.

In [29]:
!eotdl datasets get EuroSAT-small -v 2

Version 2 not found


Let's make some changes and reingest the dataset.

In [42]:
!rm -rf data/EuroSAT-small-modified
!cp -r example_data/EuroSAT-small data/EuroSAT-small-modified

In [43]:
from glob import glob 

path = "data/EuroSAT-small-modified"
files = glob(f'{path}/**/*.*', recursive=True)
files

['data/EuroSAT-small-modified/catalog.parquet',
 'data/EuroSAT-small-modified/README.md',
 'data/EuroSAT-small-modified/Forest/Forest_3.tif',
 'data/EuroSAT-small-modified/Forest/Forest_1.tif',
 'data/EuroSAT-small-modified/Forest/Forest_2.tif',
 'data/EuroSAT-small-modified/AnnualCrop/AnnualCrop_3.tif',
 'data/EuroSAT-small-modified/AnnualCrop/AnnualCrop_1.tif',
 'data/EuroSAT-small-modified/AnnualCrop/AnnualCrop_2.tif']

In [44]:
# modify README.md

text = """---
name: EuroSAT-small
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/main/tutorials/notebooks/02_ingesting.ipynb
---

# EuroSAT-small

This is a small subset of the EuroSAT dataset.
"""

with open(f"{path}/README.md", "w") as outfile:
    outfile.write(text)

In [45]:
# add a new file

with open(f"{path}/test.txt", "w") as outfile:
    outfile.write("This is a new file!")
    
files = glob(f'{path}/**/*.*', recursive=True)
files

['data/EuroSAT-small-modified/test.txt',
 'data/EuroSAT-small-modified/catalog.parquet',
 'data/EuroSAT-small-modified/README.md',
 'data/EuroSAT-small-modified/Forest/Forest_3.tif',
 'data/EuroSAT-small-modified/Forest/Forest_1.tif',
 'data/EuroSAT-small-modified/Forest/Forest_2.tif',
 'data/EuroSAT-small-modified/AnnualCrop/AnnualCrop_3.tif',
 'data/EuroSAT-small-modified/AnnualCrop/AnnualCrop_1.tif',
 'data/EuroSAT-small-modified/AnnualCrop/AnnualCrop_2.tif']

In [47]:
!eotdl datasets ingest -p data/EuroSAT-small-modified

Ingesting directory: data/EuroSAT-small-modified
current version:  1
Ingesting files: 100%|███████████████████████████| 8/8 [00:00<00:00, 115.22it/s]
A new version was created, your dataset has changed.
Num changes: 1


In [52]:
!eotdl datasets get EuroSAT-small -f -v 2

Version 2 not found


In [53]:
!ls $HOME/.cache/eotdl/datasets/EuroSAT-small

catalog.v1.parquet
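To see which versions are present in the local cache, the catalog filenames can be parsed. The sketch below assumes the `catalog.v<N>.parquet` naming seen in the listing above holds for every version. `cached_versions` is a hypothetical helper, and the demo runs on a temporary folder rather than the real cache.

```python
import re
import tempfile
from pathlib import Path

# Assumed naming scheme, inferred from the `catalog.v1.parquet` file above.
CATALOG_PATTERN = re.compile(r"catalog\.v(\d+)\.parquet")


def cached_versions(folder):
    """Return the sorted version numbers found in a dataset cache folder."""
    versions = []
    for entry in Path(folder).iterdir():
        match = CATALOG_PATTERN.fullmatch(entry.name)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


# Demo with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for filename in ["catalog.v1.parquet", "catalog.v2.parquet", "notes.txt"]:
        (Path(tmp) / filename).touch()
    found = cached_versions(tmp)
    print(found)  # [1, 2]
```

On a real installation the same function could be pointed at `Path.home() / ".cache/eotdl/datasets/EuroSAT-small"`, the location reported by `eotdl datasets get`.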


Versioning is applied at both the dataset/model level and the file level: only new or modified files are uploaded when you re-ingest, and downloading a given version retrieves the files that belong to it.

You can explore the different versions in the user interface.

## Ingesting through the Library

You can also ingest datasets and models using the library.

> TODO: example with STAC catalog and models