In [1]:
%load_ext autoreload
%autoreload 2

import os
os.environ["EOTDL_API_URL"] = "http://localhost:8000/"

# Ingesting Datasets and Models

In this notebook we are going to showcase how to ingest an existing dataset or model into EOTDL.

Once it is ingested, you can use it in the same way as any other dataset or model in EOTDL (exploring, staging, etc.).

The recommended way to ingest a dataset is using the CLI.

In [2]:
!eotdl datasets ingest --help

 Usage: eotdl datasets ingest [OPTIONS]

 Ingest a dataset to the EOTDL.
 This command ingests the dataset to the EOTDL. The dataset must be a folder
 with the dataset files, and at least a README.md file (and a catalog.json file
 for Q1+). If these files are missing, the ingestion will not work. All the
 files in the folder will be uploaded to the EOTDL.

 The following constraints apply to the dataset name:
 - It must be unique

There are several ways in which you can ingest a dataset:

1. From a local folder in your system with the data you want to upload (with or without a STAC catalog).
2. From a list of links to assets in another repository (cloud bucket, huggingface, etc.).

## Ingesting a local dataset

In [3]:
!ls example_data

[1m[36mEuroSAT-RGB-small[m[m      [1m[36mEuroSAT-small[m[m          [1m[36mRoadSegmentation[m[m
[1m[36mEuroSAT-RGB-small-STAC[m[m [1m[36mEuroSAT-small-private[m[m


For this tutorial we are going to work with a subsample of the [EuroSAT](https://www.eotdl.com/datasets/EuroSAT-RGB) dataset.

In [4]:
!rm -rf example_data/EuroSAT-small/catalog.parquet
!rm -rf example_data/EuroSAT-small/README.md

In [5]:
from glob import glob 

path = "example_data/EuroSAT-small"
files = glob(f'{path}/**/*.*', recursive=True)
files

['example_data/EuroSAT-small/Forest/Forest_1.tif',
 'example_data/EuroSAT-small/Forest/Forest_2.tif',
 'example_data/EuroSAT-small/Forest/Forest_3.tif',
 'example_data/EuroSAT-small/AnnualCrop/AnnualCrop_2.tif',
 'example_data/EuroSAT-small/AnnualCrop/AnnualCrop_3.tif',
 'example_data/EuroSAT-small/AnnualCrop/AnnualCrop_1.tif']

In all cases, ingesting a dataset or model requires a `README.md` file containing some basic information: the dataset name, authors, license, and a link to the source.

In [6]:
# create README.md

text = """---
name: EuroSAT-small
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/main/tutorials/notebooks/02_ingesting.ipynb
---

# EuroSAT-small

This is a small subset of the EuroSAT dataset.
"""

with open(f"{path}/README.md", "w") as outfile:
    outfile.write(text)

In [7]:
!cat {path}/README.md

---
name: EuroSAT-small
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/main/tutorials/notebooks/02_ingesting.ipynb
---

# EuroSAT-small

This is a small subset of the EuroSAT dataset.


The `name` property in the `README.md` file is used as the name of the dataset or model in the repository. It must therefore be unique, between 3 and 45 characters long, and contain only alphanumeric characters and dashes (learn more at [https://www.eotdl.com/docs/datasets/ingest](https://www.eotdl.com/docs/datasets/ingest)).
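
The length and character rules can be checked locally before running the CLI. The following is a minimal sketch, not part of the EOTDL library: `is_valid_name` and `NAME_PATTERN` are hypothetical names, and restricting "alphanumeric" to ASCII letters and digits is an assumption. Uniqueness can only be verified by the EOTDL itself.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits.
# The 3 to 45 length bound and the dash come from the rule stated in the text.
NAME_PATTERN = re.compile(r"[A-Za-z0-9-]{3,45}")


def is_valid_name(name: str) -> bool:
    """Return True if `name` satisfies the length and character constraints."""
    return NAME_PATTERN.fullmatch(name) is not None


# Try a few candidate names (hypothetical examples).
for candidate in ["EuroSAT-small", "ab", "my dataset", "RoadSegmentation"]:
    print(candidate, is_valid_name(candidate))
```

Names that are too short or contain spaces are rejected here, while the names used in this notebook pass.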

Trying to ingest a dataset without a `README.md` file will fail.
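
Since a missing `README.md` makes the ingestion fail, a quick pre-flight check can save a round trip. The sketch below is an illustration under assumptions rather than the validation EOTDL performs: `front_matter_keys` and `missing_readme_fields` are hypothetical helpers, the required fields are the ones listed earlier in this notebook (name, authors, license, source), and the front matter is scanned with plain string handling instead of a YAML parser.

```python
import tempfile
from pathlib import Path

# Assumption: these are the fields the notebook lists as required.
REQUIRED_FIELDS = {"name", "authors", "license", "source"}


def front_matter_keys(readme_text: str) -> set:
    """Collect the top-level keys found between the two leading '---' lines."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys  # closing delimiter reached
        # Top-level keys are unindented "key: value" lines.
        if line and not line[0].isspace() and ":" in line:
            keys.add(line.split(":", 1)[0].strip())
    return set()  # no closing delimiter: treat as no front matter


def missing_readme_fields(folder) -> set:
    """Return the required fields that are absent (all of them if no README.md)."""
    readme = Path(folder) / "README.md"
    if not readme.is_file():
        return set(REQUIRED_FIELDS)
    return REQUIRED_FIELDS - front_matter_keys(readme.read_text(encoding="utf-8"))


# Demo on a temporary folder so nothing in example_data is touched.
with tempfile.TemporaryDirectory() as tmp:
    print(sorted(missing_readme_fields(tmp)))  # no README.md yet
    partial = "---\nname: demo-dataset\nlicense: free\n---\n\n# demo-dataset\n"
    (Path(tmp) / "README.md").write_text(partial, encoding="utf-8")
    print(sorted(missing_readme_fields(tmp)))  # some fields still missing
```

An empty result means the folder has a `README.md` with all four fields present; it says nothing about whether the values themselves are acceptable.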

If everything is correct, the ingestion process should succeed.

In [9]:
!eotdl datasets ingest -p example_data/EuroSAT-small

Ingesting directory: example_data/EuroSAT-small
Ingesting files: 100%|███████████████████████████| 7/7 [00:00<00:00, 187.36it/s]
No new version was created, your dataset has not changed.


And now your dataset is available at EOTDL.

In [10]:
!eotdl datasets list -n eurosat-small

['EuroSAT-small']


> Since the `EuroSAT-small` name is already taken, this process should fail for you. To solve it, upload the dataset with a different name. However, this will pollute the EOTDL with test datasets, so we encourage you to try the ingestion process with a real dataset that you want to ingest (or to overwrite your test dataset with useful data in the future).

### Ingesting a local STAC catalog

Before the ingestion, the CLI will create a STAC-compliant `parquet` file with the metadata of the dataset.

In [11]:
import geopandas as gpd

catalog = f"{path}/catalog.parquet"

gdf = gpd.read_parquet(catalog)
gdf.head()

Unnamed: 0,type,stac_version,stac_extensions,datetime,id,bbox,geometry,assets,links,repository
0,Feature,1.0.0,[],2025-05-23 16:58:37.949512,README.md,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,{'asset': {'checksum': 'a6bb30a57d0f5ff0aaa65b...,[],eotdl
1,Feature,1.0.0,[],2025-05-23 16:58:37.949680,Forest/Forest_1.tif,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,{'asset': {'checksum': 'f3b8b9fef6b2df6f24792e...,[],eotdl
2,Feature,1.0.0,[],2025-05-23 16:58:37.949825,Forest/Forest_2.tif,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,{'asset': {'checksum': '2e38dab64435bfbab25bab...,[],eotdl
3,Feature,1.0.0,[],2025-05-23 16:58:37.949961,Forest/Forest_3.tif,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,{'asset': {'checksum': '3e7bb982f9db5f7dabc556...,[],eotdl
4,Feature,1.0.0,[],2025-05-23 16:58:37.950093,AnnualCrop/AnnualCrop_2.tif,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,{'asset': {'checksum': 'c406cb8920858b98898b9e...,[],eotdl


However, if your local dataset already contains a STAC catalog, the available schema will be used to create the EOTDL `parquet` catalog (including the different STAC extensions or properties that might be present).

In [12]:
import os

path = 'example_data/EuroSAT-RGB-small-STAC'

files = os.listdir(path)
assert 'catalog.json' in files, "catalog.json not found"

!cat example_data/EuroSAT-RGB-small-STAC/catalog.json

{
  "type": "Catalog",
  "id": "EuroSAT-RGB-Q1",
  "stac_version": "1.0.0",
  "description": "EuroSAT-RGB dataset",
  "links": [
    {
      "rel": "root",
      "href": "./catalog.json",
      "type": "application/json"
    },
    {
      "rel": "child",
      "href": "./source/collection.json",
      "type": "application/json"
    },
    {
      "rel": "child",
      "href": "./labels/collection.json",
      "type": "application/json"
    }
  ]
}
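
The `child` links in `catalog.json` point to the collections that make up the dataset. As a small illustration, using only the standard `json` module and the catalog content shown above (the helper `child_hrefs` is a hypothetical name, not an EOTDL function), they can be listed like this:

```python
import json

# Same content as the catalog.json printed above, embedded for a self-contained example.
catalog_text = """
{
  "type": "Catalog",
  "id": "EuroSAT-RGB-Q1",
  "stac_version": "1.0.0",
  "description": "EuroSAT-RGB dataset",
  "links": [
    {"rel": "root", "href": "./catalog.json", "type": "application/json"},
    {"rel": "child", "href": "./source/collection.json", "type": "application/json"},
    {"rel": "child", "href": "./labels/collection.json", "type": "application/json"}
  ]
}
"""


def child_hrefs(stac_catalog: dict) -> list:
    """Return the href of every link whose relation type is 'child'."""
    return [
        link["href"]
        for link in stac_catalog.get("links", [])
        if link.get("rel") == "child"
    ]


stac_catalog = json.loads(catalog_text)
print(child_hrefs(stac_catalog))
```

For this catalog the result contains the `source` and `labels` collection files.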

In [12]:
# create README.md

text = """---
name: EuroSAT-RGB-small-STAC
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb
---

# EuroSAT-RGB-small-STAC

This is a prototype of the EuroSAT dataset.
"""

with open(f"{path}/README.md", "w") as outfile:
    outfile.write(text)

In [13]:
!eotdl datasets ingest -p example_data/EuroSAT-RGB-small-STAC

Ingesting items from collection source: 100it [00:00, 11314.86it/s]
Ingesting items from collection labels: 100it [00:00, 15101.00it/s]
Ingesting files: 100%|████████████████████████| 200/200 [00:02<00:00, 90.34it/s]
A new version was created, your dataset has changed.
Num changes: 100


The resulting `catalog.parquet` file contains the same information as the STAC catalog for all the items.

In [14]:
import geopandas as gpd

gdf = gpd.read_parquet(path + "/catalog.parquet")
gdf.head()

Unnamed: 0,assets,collection,geometry,id,links,stac_extensions,stac_version,type,datetime,label:classes,label:description,label:methods,label:properties,label:tasks,label:type
0,{'asset': {'checksum': '582fb1e054885a609c1e25...,source,"POLYGON ((0 0, 0 0, 0 0, 0 0))",Industrial_1743,[{'href': '/Users/juan/Desktop/eotdl/tutorials...,[],1.1.0,Feature,2000-01-01 00:00:00+00:00,,,,,,
1,{'asset': {'checksum': '2d267caf0ef060780fec89...,source,"POLYGON ((0 0, 0 0, 0 0, 0 0))",Industrial_1273,[{'href': '/Users/juan/Desktop/eotdl/tutorials...,[],1.1.0,Feature,2000-01-01 00:00:00+00:00,,,,,,
2,{'asset': {'checksum': '0204dd4a3296ea8be3b388...,source,"POLYGON ((0 0, 0 0, 0 0, 0 0))",Industrial_1117,[{'href': '/Users/juan/Desktop/eotdl/tutorials...,[],1.1.0,Feature,2000-01-01 00:00:00+00:00,,,,,,
3,{'asset': {'checksum': 'a6d23a61d5c20d4b953117...,source,"POLYGON ((0 0, 0 0, 0 0, 0 0))",Industrial_1121,[{'href': '/Users/juan/Desktop/eotdl/tutorials...,[],1.1.0,Feature,2000-01-01 00:00:00+00:00,,,,,,
4,{'asset': {'checksum': 'ed45a188a146a26eae0f2a...,source,"POLYGON ((0 0, 0 0, 0 0, 0 0))",Industrial_1641,[{'href': '/Users/juan/Desktop/eotdl/tutorials...,[],1.1.0,Feature,2000-01-01 00:00:00+00:00,,,,,,


## Ingesting with the library

You can also ingest datasets with the library (you will need to create a `README.md` file as well).

In [19]:
from eotdl.datasets import ingest_dataset

ingest_dataset("example_data/EuroSAT-small")

Ingesting directory: example_data/EuroSAT-small


Ingesting files: 100%|██████████| 7/7 [00:00<00:00, 278.85it/s]

No new version was created, your dataset has not changed.





## Ingesting a virtual dataset

Option 2 consists of creating a `virtual dataset` from a list of links to assets in another repository (cloud bucket, huggingface, etc.). It is only available through the library.

In [22]:
links = [
	'https://link1.com',
	'https://link2.com',
	'https://link3.com',
]

metadata = {
	'name': 'Test-links',
	'authors': ['Juan B. Pedro'],
	'license': 'free',
	'source': 'https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb',
	'description': """# Test links

Testing the ingestion of a dataset from a list of links.
"""
}
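
Before creating a virtual dataset it can be useful to check that every entry is a well-formed URL. The following sketch uses only the standard library; `invalid_links` is a hypothetical helper, and accepting only `http` and `https` schemes is an assumption made for illustration (EOTDL may accept other schemes).

```python
from urllib.parse import urlparse

# Assumption: only absolute http(s) URLs are considered valid here.
ALLOWED_SCHEMES = ("http", "https")


def invalid_links(candidates):
    """Return the entries that are not absolute URLs with an allowed scheme."""
    bad = []
    for link in candidates:
        parsed = urlparse(link)
        if parsed.scheme not in ALLOWED_SCHEMES or not parsed.netloc:
            bad.append(link)
    return bad


# Hypothetical inputs: one valid link, one with another scheme, one without a scheme.
candidate_links = ["https://link1.com", "ftp://link2.com", "link3.com"]
print(invalid_links(candidate_links))
```

If your assets live behind a different scheme, adjust `ALLOWED_SCHEMES` accordingly.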

In [23]:
from eotdl.datasets import ingest_virtual_dataset

path = 'data/test-links'

ingest_virtual_dataset(path, links, metadata)

Ingesting files: 100%|██████████| 4/4 [00:00<00:00, 919.30it/s]

No new version was created, your dataset has not changed.





If you already have a `README.md` file in the target path, you can omit the metadata.

In [24]:
ingest_virtual_dataset(path, links)

Ingesting files: 100%|██████████| 4/4 [00:00<00:00, 931.96it/s]

No new version was created, your dataset has not changed.





The `catalog.parquet` file will be created in the provided path.

In [25]:
import geopandas as gpd

catalog = f"{path}/catalog.parquet"

gdf = gpd.read_parquet(catalog)
gdf.head()

Unnamed: 0,type,stac_version,stac_extensions,datetime,id,bbox,geometry,assets,links,repository
0,Feature,1.0.0,[],2025-04-22 14:49:53.975592,https://link1.com,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,"{'asset': {'checksum': None, 'href': 'https://...",[],eotdl
1,Feature,1.0.0,[],2025-04-22 14:49:53.975642,https://link2.com,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,"{'asset': {'checksum': None, 'href': 'https://...",[],eotdl
2,Feature,1.0.0,[],2025-04-22 14:49:53.975650,https://link3.com,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,"{'asset': {'checksum': None, 'href': 'https://...",[],eotdl
3,Feature,1.0.0,[],2025-04-22 14:49:53.975673,README.md,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,{'asset': {'checksum': '197885c67d6fca3c301d8e...,[],eotdl


## Ingesting models

You can ingest a model in exactly the same way.

In [26]:
!eotdl models ingest --help

 Usage: eotdl models ingest [OPTIONS]

 Ingest a model to the EOTDL.
 This command ingests the model to the EOTDL. The model must be a folder with
 the model files, and at least a metadata.yml file or a catalog.json file. If
 there are not these files, the ingestion will not work. All the files in the
 folder will be uploaded to the EOTDL.

 The following constraints apply to the model name:
 - It must be unique

In [27]:
from glob import glob 

path = "example_data/RoadSegmentation"
files = glob(f'{path}/**/*.*', recursive=True)
files

['example_data/RoadSegmentation/catalog.parquet',
 'example_data/RoadSegmentation/README.md',
 'example_data/RoadSegmentation/model.onnx']

In [28]:
# create README.md

text = """---
name: RoadSegmentation
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb
---

# RoadSegmentation

This is an ONNX model for road segmentation.
"""

with open(f"{path}/README.md", "w") as outfile:
    outfile.write(text)

In [29]:
!eotdl models ingest -p example_data/RoadSegmentation

Ingesting directory: example_data/RoadSegmentation
Ingesting files: 100%|████████████████████████████| 2/2 [00:00<00:00, 95.82it/s]
No new version was created, your dataset has not changed.


## Versioning

By default, every time you re-upload a dataset or model a new version is created if any changes are detected (new files, modified files, removed files).
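
To make the idea of change detection concrete, the sketch below compares two snapshots of a folder and classifies files as added, modified, or removed. It is an illustration only, not EOTDL's implementation: `manifest` and `diff_manifests` are hypothetical helpers, and SHA-256 is an assumed choice of checksum.

```python
import hashlib
import tempfile
from pathlib import Path


def manifest(folder):
    """Map each file's relative path to the SHA-256 digest of its content."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(old, new):
    """Classify paths as added, removed, or modified between two manifests."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }


# Demo on a temporary folder with made-up file contents.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "README.md").write_text("v1")
    (root / "a.tif").write_bytes(b"pixels")
    before = manifest(root)

    (root / "README.md").write_text("v2")   # modified file
    (root / "a.tif").unlink()               # removed file
    (root / "test.txt").write_text("new")   # new file
    after = manifest(root)

    changes = diff_manifests(before, after)
    print(changes)
    # Any non-empty category would justify a new version in this sketch.
    print("new version needed:", any(changes.values()))
```

Comparing content digests rather than timestamps means that re-saving an unchanged file does not count as a change.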

When you stage a dataset, the latest version is used by default.

In [30]:
!eotdl datasets get EuroSAT-small

Dataset `EuroSAT-small` already exists at /Users/juan/.cache/eotdl/datasets/EuroSAT-small. To force download, use force=True or -f in the CLI.


In [31]:
!ls $HOME/.cache/eotdl/datasets/EuroSAT-small

README.md          catalog.v1.parquet catalog.v2.parquet
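
The cache listing above suggests one catalog file per version, named `catalog.v<N>.parquet`. Assuming that naming pattern holds (it is inferred from this listing, not from documented behavior), the latest locally staged version can be found with a short helper; `latest_version` is a hypothetical name.

```python
import re

# Assumption: versioned catalogs follow the pattern seen in the listing above.
VERSION_FILE = re.compile(r"catalog\.v(\d+)\.parquet")


def latest_version(filenames):
    """Return the highest version number among the catalog files, or None."""
    versions = [
        int(match.group(1))
        for name in filenames
        if (match := VERSION_FILE.fullmatch(name))
    ]
    return max(versions) if versions else None


# File names copied from the listing above.
print(latest_version(["README.md", "catalog.v1.parquet", "catalog.v2.parquet"]))
```

Version numbers are compared as integers, so a hypothetical `catalog.v10.parquet` would rank above `catalog.v9.parquet`.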


However, you can request a specific version with the `-v` option.

In [32]:
!eotdl datasets get EuroSAT-small -v 2

Dataset `EuroSAT-small` already exists at /Users/juan/.cache/eotdl/datasets/EuroSAT-small. To force download, use force=True or -f in the CLI.


Let's make some changes and reingest the dataset.

In [33]:
!rm -rf data/EuroSAT-small-modified
!cp -r example_data/EuroSAT-small data/EuroSAT-small-modified

In [34]:
from glob import glob 

path = "data/EuroSAT-small-modified"
files = glob(f'{path}/**/*.*', recursive=True)
files

['data/EuroSAT-small-modified/catalog.parquet',
 'data/EuroSAT-small-modified/README.md',
 'data/EuroSAT-small-modified/Forest/Forest_1.tif',
 'data/EuroSAT-small-modified/Forest/Forest_2.tif',
 'data/EuroSAT-small-modified/Forest/Forest_3.tif',
 'data/EuroSAT-small-modified/AnnualCrop/AnnualCrop_2.tif',
 'data/EuroSAT-small-modified/AnnualCrop/AnnualCrop_3.tif',
 'data/EuroSAT-small-modified/AnnualCrop/AnnualCrop_1.tif']

In [35]:
# modify README.md

text = """---
name: EuroSAT-small
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/main/tutorials/notebooks/02_ingesting.ipynb
---

# EuroSAT-small

This is a small subset of the EuroSAT dataset.
"""

with open(f"{path}/README.md", "w") as outfile:
    outfile.write(text)

In [36]:
# add a new file

with open(f"{path}/test.txt", "w") as outfile:
    outfile.write("This is a new file!")
    
files = glob(f'{path}/**/*.*', recursive=True)
files

['data/EuroSAT-small-modified/catalog.parquet',
 'data/EuroSAT-small-modified/README.md',
 'data/EuroSAT-small-modified/test.txt',
 'data/EuroSAT-small-modified/Forest/Forest_1.tif',
 'data/EuroSAT-small-modified/Forest/Forest_2.tif',
 'data/EuroSAT-small-modified/Forest/Forest_3.tif',
 'data/EuroSAT-small-modified/AnnualCrop/AnnualCrop_2.tif',
 'data/EuroSAT-small-modified/AnnualCrop/AnnualCrop_3.tif',
 'data/EuroSAT-small-modified/AnnualCrop/AnnualCrop_1.tif']

In [37]:
!eotdl datasets ingest -p data/EuroSAT-small-modified

Ingesting directory: data/EuroSAT-small-modified
Ingesting files: 100%|███████████████████████████| 8/8 [00:00<00:00, 189.71it/s]
No new version was created, your dataset has not changed.


In [38]:
!eotdl datasets get EuroSAT-small -f -v 2

Data available at /Users/juan/.cache/eotdl/datasets/EuroSAT-small


In [39]:
!ls $HOME/.cache/eotdl/datasets/EuroSAT-small

README.md          catalog.v1.parquet catalog.v2.parquet


We apply versioning at both the dataset/model level and the file level. On a re-upload, only new or modified files are uploaded, and when you download a given version, the appropriate files for that version are retrieved.

You can explore the different versions in the user interface.