# Ingest an existing Dataset or Model


In this notebook we showcase how to ingest the dataset we created and labeled in the previous notebooks.

Once it is ingested, you can use it in the same way as any other dataset or model in EOTDL (exploring, staging, etc.).


## Ingesting through the CLI


The recommended way to ingest a dataset is using the CLI.


In [1]:
!eotdl datasets ingest --help

 Usage: eotdl datasets ingest [OPTIONS]

 Ingest a dataset to the EOTDL.asdf

 This command ingests the dataset to the EOTDL. The dataset must be a folder
 with the dataset files, and at least a README.md file (and a catalog.json file
 for Q1+). If these files are missing, the ingestion will not work. All the
 files in the folder will be uploaded to the EOTDL.

 The following constraints apply to the dataset name:
 - It 

In order to ingest a dataset you will need a folder in your system with the data you want to upload.


In [2]:
!ls SCANEO

README.md          catalog.v1.parquet models             samples


A `README.md` file is required for datasets and models. It must contain some basic information: the dataset name, its authors, the license and a link to the source.


In [3]:
!cat SCANEO/README.md

---
name: SCANEO
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/scaneo
---

Models and sample data for the SCANEO labelling tool.

We can overwrite it with a new `README.md` file by running the following cell:


In [4]:
text = """---
name: scaneo-bids25
authors: 
  - Fran Martín
license: open
source: https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/lps25/02_training.ipynb
---

# SCANEO-bids25

This is a toy dataset created and labeled with SCANEO in the BiDS 2025 tutorial.
"""

with open("SCANEO/README.md", "w") as outfile:
    outfile.write(text)

The chosen name is the one that will appear in the repository, so it must be unique, between 3 and 45 characters long, and contain only alphanumeric characters and dashes (learn more at [https://www.eotdl.com/docs/datasets/ingest](https://www.eotdl.com/docs/datasets/ingest)).
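The length and character rules above can be checked locally before ingesting. The following is a minimal sketch: the regular expression and the `is_valid_name` helper are illustrative assumptions derived from the constraints stated in this notebook, not code taken from EOTDL. Uniqueness cannot be checked this way, since it depends on the names already registered in the repository.

```python
import re

# Assumption: this pattern mirrors the constraints stated in this notebook
# (3 to 45 characters, alphanumeric characters and dashes only).
# It is not taken from the EOTDL source code.
NAME_PATTERN = re.compile(r"[A-Za-z0-9-]{3,45}")


def is_valid_name(name: str) -> bool:
    """Return True if `name` satisfies the documented length and character rules."""
    return NAME_PATTERN.fullmatch(name) is not None


# A valid name, one that is too short, one with a space, and one that is too long.
for candidate in ["scaneo-bids25", "ab", "my dataset", "x" * 46]:
    print(f"{candidate[:20]!r}: {is_valid_name(candidate)}")
```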

Trying to ingest a dataset without a `README.md` file will fail.
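A quick pre-flight check can catch a missing or incomplete `README.md` before calling the CLI. The sketch below is an assumption-laden illustration: `front_matter_keys` and `missing_readme_fields` are hypothetical helpers, and the front matter is parsed with a deliberately simplified routine rather than a real YAML parser. The list of required fields is the one given earlier in this notebook.

```python
from pathlib import Path

# Fields described in this notebook as required in the README.md front matter.
REQUIRED_FIELDS = ("name", "authors", "license", "source")


def front_matter_keys(readme_text: str) -> set:
    """Collect top-level keys from a '---' delimited front matter block.

    Simplified parser for illustration only; a real YAML parser is more robust.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        # Top-level keys start at column 0; list items ("  - ...") are indented.
        if line and not line[0].isspace() and ":" in line:
            keys.add(line.split(":", 1)[0].strip())
    return set()  # no closing '---' was found


def missing_readme_fields(folder: str) -> list:
    """Return the required fields that are missing from <folder>/README.md."""
    readme = Path(folder) / "README.md"
    if not readme.is_file():
        return list(REQUIRED_FIELDS)
    keys = front_matter_keys(readme.read_text(encoding="utf-8"))
    return [field for field in REQUIRED_FIELDS if field not in keys]


# An empty list means every required field was found.
print(missing_readme_fields("SCANEO"))
```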

If everything is correct, the ingestion process should work.


In [5]:
!eotdl datasets ingest -p SCANEO

Ingesting folder
Ingesting directory: SCANEO
Preparing files: 100%|██████████████████████████| 15/15 [00:00<00:00, 65.56it/s]
Ingesting files: 100%|██████████████████████████| 13/13 [00:19<00:00,  1.47s/it]


And now your dataset is available in EOTDL:


In [6]:
!eotdl datasets list -n scaneo-bids25

['scaneo-bids25']


> Since the `scaneo-bids25` name is already taken, this process should fail for you. To solve it, just upload the dataset with a different name. However, this will pollute the EOTDL with test datasets, so we encourage you to try the ingestion process with a real dataset that you want to ingest (or to overwrite your test dataset in the future with useful data). In any case, you can always delete the dataset from the EOTDL using the `DELETE` button in the UI.


During the ingestion process, a `catalog.parquet` file is created with STAC metadata. If your dataset already has STAC metadata (a `catalog.json` file exists at the root of the dataset), the metadata will be parsed and added to the `catalog.parquet` file. Otherwise, the CLI will create STAC-compatible metadata from the directory structure.


In [7]:
import geopandas as gpd

catalog = gpd.read_parquet("SCANEO/catalog.parquet")

catalog

| | type | stac_version | stac_extensions | datetime | id | bbox | geometry | assets | links | repository |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Feature | 1.0.0 | [] | 2025-09-28 19:12:02.597560 | README.md | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': '37cff8869dbac8b8894148... | [] | eotdl |
| 1 | Feature | 1.0.0 | [] | 2025-09-28 19:12:02.598612 | catalog.v1.parquet | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': '1e6ea39da4a5df0bde136f... | [] | eotdl |
| 2 | Feature | 1.0.0 | [] | 2025-09-28 19:12:02.598848 | models/hr-roads.onnx | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': '99581a42cd56f66b1649f5... | [] | eotdl |
| 3 | Feature | 1.0.0 | [] | 2025-09-28 19:12:02.735016 | models/s2-roads.onnx | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': 'f313a4bf02afd1a6b55829... | [] | eotdl |
| 4 | Feature | 1.0.0 | [] | 2025-09-28 19:12:02.816535 | samples/15928855_15.tif | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': 'ea9975a5a26681e17d2e4f... | [] | eotdl |
| 5 | Feature | 1.0.0 | [] | 2025-09-28 19:12:02.817615 | samples/17878735_15.tif | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': '618fafa07c048835e078fa... | [] | eotdl |
| 6 | Feature | 1.0.0 | [] | 2025-09-28 19:12:02.818592 | samples/22528900_15.tif | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': '6ebd2a950152c5c0bc130b... | [] | eotdl |
| 7 | Feature | 1.0.0 | [] | 2025-09-28 19:12:02.819825 | samples/deep_globe.jpg | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': '58f3fcd558510ad5888e2e... | [] | eotdl |
| 8 | Feature | 1.0.0 | [] | 2025-09-28 19:12:02.821376 | samples/12328750_15.tif | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': 'e569c8d909d29f2c33105e... | [] | eotdl |
| 9 | Feature | 1.0.0 | [] | 2025-09-28 19:12:02.822408 | samples/21929020_15.tif | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | {'asset': {'checksum': 'b4b880b2f84360df9dd13f... | [] | eotdl |
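To make the idea of deriving metadata from the directory structure more concrete, the sketch below builds one STAC-like record per file, using the relative path as the `id` as in the catalog above. It is only an illustration and not the EOTDL implementation: `sketch_items` is a hypothetical helper, the checksum algorithm (SHA-1) is an assumption, and the geometry is simply left empty.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sketch_items(folder: str) -> list:
    """Build one STAC-like record per file under `folder`.

    Illustrative only: field names follow the catalog shown in this notebook,
    but the checksum algorithm (SHA-1) and exact layout are assumptions.
    """
    root = Path(folder)
    items = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        items.append({
            "type": "Feature",
            "stac_version": "1.0.0",
            "id": path.relative_to(root).as_posix(),  # e.g. "samples/a.txt"
            "datetime": datetime.now(timezone.utc).isoformat(),
            "geometry": None,  # no spatial footprint is known for a plain file
            "assets": {
                "asset": {"checksum": hashlib.sha1(path.read_bytes()).hexdigest()}
            },
        })
    return items


# Small self-contained demo on a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "samples").mkdir()
    (Path(tmp) / "README.md").write_text("hello")
    (Path(tmp) / "samples" / "a.txt").write_text("data")
    demo_items = sketch_items(tmp)

print([item["id"] for item in demo_items])
```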


## Versioning


By default, every time you re-upload a dataset or model whose contents have changed, a new version is created. We apply versioning at both the dataset/model level and the file level: only new or modified files are uploaded on re-upload, and the appropriate files are downloaded for each version.


In [8]:
!echo "hello" > SCANEO/hello.txt
!eotdl datasets ingest -p SCANEO

Ingesting folder
Ingesting directory: SCANEO
Preparing files: 100%|██████████████████████████| 16/16 [00:00<00:00, 72.89it/s]
Ingesting files: 100%|██████████████████████████| 14/14 [00:03<00:00,  4.37it/s]
A new version was created, your dataset has changed.
Num changes: 1
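File-level versioning of this kind can be pictured as a comparison of checksum manifests between two uploads. The sketch below is a conceptual illustration under that assumption, not the EOTDL implementation: `changed_files` is a hypothetical helper and the checksums are placeholder strings. How EOTDL treats removed files is not covered in this notebook.

```python
def changed_files(previous: dict, current: dict) -> dict:
    """Compare two {relative_path: checksum} manifests."""
    added = sorted(p for p in current if p not in previous)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    removed = sorted(p for p in previous if p not in current)
    return {"added": added, "modified": modified, "removed": removed}


# Placeholder manifests: the second one adds a single new file.
v1 = {"README.md": "aaa", "samples/a.tif": "bbb"}
v2 = {"README.md": "aaa", "samples/a.tif": "bbb", "hello.txt": "ccc"}

diff = changed_files(v1, v2)
# Only new or modified files would need to be uploaded again.
to_upload = diff["added"] + diff["modified"]
print(to_upload)  # ['hello.txt']
```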


When you stage a dataset, the latest version is used by default.


In [9]:
!eotdl datasets get scaneo-bids25

Data available at /Users/fran/.cache/eotdl/datasets/scaneo-bids25


However, you can request a specific version with the `-v` option:


In [10]:
!eotdl datasets get scaneo-bids25 -v 1 -f

Data available at /Users/fran/.cache/eotdl/datasets/scaneo-bids25


In [11]:
!ls $HOME/.cache/eotdl/datasets/scaneo-bids25

README.md          catalog.v1.parquet catalog.v2.parquet
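The listing above suggests that each staged version leaves a `catalog.v<N>.parquet` file in the cache folder. Assuming that naming pattern holds (it is inferred from this output only), a small hypothetical helper such as `catalog_versions` can report which versions are present locally:

```python
import re

# Assumption: catalog files are named "catalog.v<N>.parquet",
# as seen in the directory listing in this notebook.
CATALOG_RE = re.compile(r"catalog\.v(\d+)\.parquet")


def catalog_versions(filenames) -> list:
    """Return the sorted version numbers found among `filenames`."""
    versions = []
    for name in filenames:
        match = CATALOG_RE.fullmatch(name)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


print(catalog_versions(["README.md", "catalog.v1.parquet", "catalog.v2.parquet"]))  # [1, 2]
```

In practice, the list of file names could come from `os.listdir` on the cache folder printed by `eotdl datasets get`.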


You can explore the different versions in the user interface.


## Ingesting through the Library


You can also ingest datasets and models using the library:


In [12]:
from eotdl.datasets import ingest_dataset

try:
    ingest_dataset("SCANEO")
except Exception as e:
    print(e)

Ingesting folder
Ingesting directory: SCANEO


Preparing files: 100%|██████████| 16/16 [00:00<00:00, 95.56it/s]
Ingesting files: 100%|██████████| 14/14 [00:02<00:00,  5.05it/s]

No new version was created, your dataset has not changed.





## Discussion and Contribution opportunities

Feel free to ask questions now (live or through Discord) and make suggestions for future improvements.

- How do you rate the user experience of ingesting a dataset?
- What features concerning ingestion for datasets would you like to see?
- What other features concerning versioning for datasets would you like to see?
