# Ingest an existing Dataset or Model

This notebook shows how to ingest an existing dataset or model into EOTDL.

Once it is ingested, you can use it in the same way as any other dataset or model in EOTDL (exploring, downloading, etc.).

## Ingesting through the CLI

The recommended way to ingest a dataset is using the CLI.

In [1]:
!eotdl datasets ingest --help

 Usage: eotdl datasets ingest [OPTIONS]

 Ingest a dataset to the EOTDL.
 This command ingests the dataset to the EOTDL. The dataset must be a folder
 with the dataset files, and at least a metadata.yml file or a catalog.json
 file. If there are not these files, the ingestion will not work. All the files
 in the folder will be uploaded to the EOTDL.

 The following constraints apply to the dataset name:
 - It must be unique

In order to ingest a dataset you will need a folder in your system with the data you want to upload.

In [2]:
!ls workshop_data

boadella_bbox.geojson  dates.csv      sample_stacdataframe.csv	sentinel_2_bck
boadella.geojson       EuroSAT-small  sentinel_2		sentinel_2_stac


For this tutorial we are going to work with a subsample of the [EuroSAT](https://www.eotdl.com/datasets/EuroSAT-RGB) dataset.

In [3]:
from glob import glob 

files = glob('workshop_data/EuroSAT-small/**/*.*', recursive=True)
files

['workshop_data/EuroSAT-small/metadata.yml',
 'workshop_data/EuroSAT-small/Forest/Forest_3.tif',
 'workshop_data/EuroSAT-small/Forest/Forest_1.tif',
 'workshop_data/EuroSAT-small/Forest/Forest_2.tif',
 'workshop_data/EuroSAT-small/AnnualCrop/AnnualCrop_3.tif',
 'workshop_data/EuroSAT-small/AnnualCrop/AnnualCrop_1.tif',
 'workshop_data/EuroSAT-small/AnnualCrop/AnnualCrop_2.tif']

A `metadata.yml` file is required for Q0 datasets and models. It contains the basic required information: dataset authors, license, link to the source and dataset name.

In [4]:
!cat workshop_data/EuroSAT-small/metadata.yml

authors:
- Patrick Helber
license: open
source: http://madm.dfki.de/downloads
name: EuroSAT-small
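
Before ingesting, it can be handy to check that the metadata carries the four required fields. The following is a minimal sketch, not part of the EOTDL library: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and treating an empty value as missing is an assumption.

```python
# Fields listed above as required for Q0 metadata.
REQUIRED_FIELDS = ("authors", "license", "source", "name")


def missing_fields(metadata: dict) -> list:
    """Return the required fields that are absent or empty in `metadata`."""
    # Assumption: an empty value (None, "", []) counts as missing.
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


# Same content as the metadata.yml shown above, written as a Python dict.
metadata = {
    "authors": ["Patrick Helber"],
    "license": "open",
    "source": "http://madm.dfki.de/downloads",
    "name": "EuroSAT-small",
}
print(missing_fields(metadata))  # []
```

In practice the dictionary would come from loading the `metadata.yml` file with a YAML parser.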


The chosen name is the one that will appear in the repository, hence it must be unique, between 3 and 45 characters long and can only contain alphanumeric characters and dashes (learn more at [https://www.eotdl.com/docs/datasets/ingest](https://www.eotdl.com/docs/datasets/ingest)).
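
The length and character constraints can be checked locally before attempting an ingestion. This is a rough sketch under stated assumptions: `is_valid_name` is a hypothetical helper, "alphanumeric" is taken to mean ASCII letters and digits, and uniqueness is not covered because only EOTDL itself can verify it.

```python
import re

# 3 to 45 characters, ASCII letters, digits and dashes only (assumed reading
# of the constraints described above).
NAME_PATTERN = re.compile(r"[A-Za-z0-9-]{3,45}")


def is_valid_name(name: str) -> bool:
    """Return True if `name` satisfies the length and character constraints."""
    return NAME_PATTERN.fullmatch(name) is not None


print(is_valid_name("EuroSAT-small"))  # True
print(is_valid_name("my dataset"))     # False, contains a space
```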

Trying to ingest a dataset without a `metadata.yml` file will fail.

If everything is correct, the ingestion process should succeed.

In [6]:
!eotdl datasets ingest -p workshop_data/EuroSAT-small/

Uploading directory workshop_data/EuroSAT-small...
generating list of files to upload...
100%|███████████████████████████████████████████| 7/7 [00:00<00:00, 4459.31it/s]
7 new files will be ingested
0 files already exist in dataset
0 large files will be ingested separately
New version created, version: 1
generating batches...
100%|█████████████████████████████████████████| 7/7 [00:00<00:00, 184654.89it/s]
Uploading 7 small files in 1 batches...
Uploading batches: 100%|█████████████████████| 1/1 [00:00<00:00,  2.03batches/s]


Your dataset is now available in EOTDL:

In [7]:
!eotdl datasets list -n eurosat-small

['EuroSAT-small']


> Since the `EuroSAT-small` name is already taken, this process should fail for you. To solve it, just upload the dataset with a different name. However, this will pollute the EOTDL with test datasets, so we encourage you to try the ingestion process with a real dataset that you want to ingest (or to overwrite your test dataset with useful data in the future).

In order to ingest Q1+ datasets, a valid STAC catalog is required instead of the `metadata.yml` file. We will explore this in the [data curation](tutorials/workshops/bids23/05_STAC_metadata.ipynb) notebook.

Let's now ingest the model that we trained in the previous notebook.

In [8]:
!ls data/*.onnx

data/model.onnx


In [9]:
!mkdir -p data/EuroSAT-RGB-BiDS-model
!cp data/model.onnx data/EuroSAT-RGB-BiDS-model/model.onnx

The model must live in a folder together with its `metadata.yml` file (and any other files that you want to include). The folder was created in the previous cell, so we only need to write the metadata.

In [10]:
import yaml

metadata = {
	"name": "EuroSAT-RGB-BiDS23",
	"authors": ["Juan B. Pedro"],
	"license": "open",
	"source": "https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/bids23/02_training.ipynb"
}

with open('data/EuroSAT-RGB-BiDS-model/metadata.yml', 'w') as outfile:
	yaml.dump(metadata, outfile, default_flow_style=False)

Now we can ingest the model into EOTDL.

In [12]:
!eotdl models ingest -p data/EuroSAT-RGB-BiDS-model

Uploading directory data/EuroSAT-RGB-BiDS-model...
generating list of files to upload...
100%|█████████████████████████████████████████████| 2/2 [00:00<00:00, 55.63it/s]
2 new files will be ingested
0 files already exist in dataset
1 large files will be ingested separately
New version created, version: 1
ingesting large files...
ingesting file: model.onnx
models
42.65/42.65 MB: : 5it [00:06,  1.24s/it]                                        
generating batches...
100%|██████████████████████████████████████████| 1/1 [00:00<00:00, 34952.53it/s]
Uploading 1 small files in 1 batches...
Uploading batches: 100%|█████████████████████| 1/1 [00:00<00:00,  1.76batches/s]


## Versioning

By default, every time you re-upload a dataset or model with new or modified files, a new version is created.

When you download a dataset, the latest version is used by default.

In [13]:
!eotdl datasets get EuroSAT-small

100%|███████████████████████████████████████████| 7/7 [00:04<00:00,  1.74file/s]
Data available at /home/juan/.cache/eotdl/datasets/EuroSAT-small/v1


However, you can also request a specific version with the `-v` option:

In [14]:
!eotdl datasets get EuroSAT-small -v 1

Dataset `EuroSAT-small v1` already exists at /home/juan/.cache/eotdl/datasets/EuroSAT-small/v1. To force download, use force=True or -f in the CLI.


In [15]:
!ls $HOME/.cache/eotdl/datasets/EuroSAT-small

v1  v10
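
Judging from the listing above, each downloaded version is cached in its own `v<N>` folder. When working with these folders from Python, note that a plain string sort would rank `v10` before `v2`. The sketch below picks the highest version numerically; `latest_version` is a hypothetical helper and the `v<N>` naming scheme is inferred from the listing rather than from documented behaviour.

```python
import re


def latest_version(dir_names):
    """Return the highest version number among names like 'v1' or 'v10'.

    Names that do not match the assumed 'v<N>' pattern are ignored.
    Returns None when no name matches.
    """
    versions = []
    for name in dir_names:
        match = re.fullmatch(r"v(\d+)", name)
        if match:
            versions.append(int(match.group(1)))
    return max(versions, default=None)


print(latest_version(["v1", "v10", "v2"]))  # 10
```

The folder names could be obtained with `os.listdir` on the cache path printed by `eotdl datasets get`.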


Versioning is applied at both the dataset/model level and the file level. When you re-upload, only new or modified files are uploaded; when you download, only the files that belong to the requested version are retrieved. The following cells illustrate this behaviour.
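
To build intuition for file-level versioning, here is one way new or modified files could be detected: compare a checksum of each local file against the checksums recorded for the previous version. This is only an illustrative sketch. It is not EOTDL's implementation, and the use of SHA-1 as well as the names `file_checksum` and `files_to_upload` are assumptions.

```python
import hashlib
from pathlib import Path


def file_checksum(path):
    """Return the SHA-1 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def files_to_upload(folder, known_checksums):
    """Return relative paths of files that are new or whose content changed.

    `known_checksums` maps relative paths to the checksums of the
    previous version (hypothetical structure).
    """
    root = Path(folder)
    changed = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            if known_checksums.get(relative) != file_checksum(path):
                changed.append(relative)
    return changed
```

With this approach, an unchanged folder yields an empty list, and adding a single file yields a list with just that file.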



In [18]:
!eotdl datasets ingest -p workshop_data/EuroSAT-small/

Uploading directory workshop_data/EuroSAT-small...
generating list of files to upload...
100%|███████████████████████████████████████████| 7/7 [00:00<00:00, 4241.57it/s]
No new files to upload


In [19]:
!touch workshop_data/EuroSAT-small/test.txt

In [20]:
!eotdl datasets ingest -p workshop_data/EuroSAT-small/

Uploading directory workshop_data/EuroSAT-small...
generating list of files to upload...
100%|███████████████████████████████████████████| 8/8 [00:00<00:00, 4639.72it/s]
1 new files will be ingested
7 files already exist in dataset
0 large files will be ingested separately
New version created, version: 2
generating batches...
100%|██████████████████████████████████████████| 1/1 [00:00<00:00, 35848.75it/s]
Uploading 1 small files in 1 batches...
Uploading batches: 100%|█████████████████████| 1/1 [00:00<00:00,  6.52batches/s]
100%|█████████████████████████████████████████| 7/7 [00:00<00:00, 257544.98it/s]
Ingesting existing files: 100%|██████████████| 1/1 [00:00<00:00,  1.96batches/s]


In [21]:
!eotdl datasets ingest -p workshop_data/EuroSAT-small/

Uploading directory workshop_data/EuroSAT-small...
generating list of files to upload...
100%|███████████████████████████████████████████| 8/8 [00:00<00:00, 4788.02it/s]
No new files to upload


You can explore the different versions in the user interface.

## Ingesting through the Library

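You can also ingest datasets and models using the library. In the cells below nothing has changed since the previous ingestion, so the calls report that there are no new files to upload.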

In [24]:
from eotdl.datasets import ingest_dataset

ingest_dataset("workshop_data/EuroSAT-small");

Uploading directory workshop_data/EuroSAT-small...
generating list of files to upload...


100%|██████████| 7/7 [00:00<00:00, 3549.77it/s]


Exception: No new files to upload

In [31]:
from eotdl.models import ingest_model

ingest_model("data/EuroSAT-RGB-BiDS-model");

Uploading directory data/EuroSAT-RGB-BiDS-model...
generating list of files to upload...


100%|██████████| 2/2 [00:00<00:00, 52.63it/s]


Exception: No new files to upload
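
As the outputs above show, the library raises an exception when there is nothing new to upload. In a script you may prefer to treat that case as a no-op. The wrapper below is a small sketch of that idea: `ingest_if_changed` is a hypothetical helper, and matching on the message text is an assumption based on the output shown above, so it may break if the message changes.

```python
def ingest_if_changed(ingest_fn, path):
    """Call `ingest_fn(path)` and report whether anything was ingested.

    Returns True after a successful ingestion and False when the call
    fails only because there are no new files. Other errors propagate.
    """
    try:
        ingest_fn(path)
        return True
    except Exception as error:
        # Message copied from the notebook output above (assumed stable).
        if "No new files to upload" in str(error):
            return False
        raise


# Intended usage (requires the eotdl library and valid credentials):
# from eotdl.datasets import ingest_dataset
# ingest_if_changed(ingest_dataset, "workshop_data/EuroSAT-small")
```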

## Ingesting through the API

Ingesting a dataset or model through the API is a multi-step (and error-prone) process:

1. Create/Retrieve a dataset
2. Create a version
3. Ingest files to version
	1. Ingest small files in batches
	2. Ingest large files in chunks as multipart upload
		1. Create multipart upload
		2. Ingest chunks
		3. Complete multipart upload
	3. Ingest existing files in batches to new version
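
The planning part of step 3 can be sketched without touching the API at all: split the files into small and large, group the small ones into batches, and work out how many chunks each large file needs. All thresholds and names below (`plan_upload`, `large_threshold`, `batch_size`, `chunk_size`) are hypothetical and are not taken from the EOTDL API or library.

```python
def plan_upload(file_sizes, large_threshold, batch_size, chunk_size):
    """Plan an upload from a mapping of file name to size in bytes.

    Returns a tuple (batches, chunks):
    - batches: lists of small-file names, at most `batch_size` per list
    - chunks: number of multipart chunks needed for each large file
    """
    small = sorted(n for n, size in file_sizes.items() if size <= large_threshold)
    large = sorted(n for n, size in file_sizes.items() if size > large_threshold)
    batches = [small[i:i + batch_size] for i in range(0, len(small), batch_size)]
    # Ceiling division: a trailing partial chunk still needs its own request.
    chunks = {n: -(-file_sizes[n] // chunk_size) for n in large}
    return batches, chunks


batches, chunks = plan_upload(
    {"a.tif": 1, "b.tif": 2, "model.onnx": 25},
    large_threshold=10, batch_size=2, chunk_size=10,
)
print(batches)  # [['a.tif', 'b.tif']]
print(chunks)   # {'model.onnx': 3}
```

Each batch and each chunk would then correspond to one request against the relevant endpoint described in the API documentation.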

The library and the CLI take care of these steps for you, so they are the recommended way to ingest a dataset.

However, if you still want to ingest datasets with the API, we recommend following the previous steps using the API [documentation](https://api.eotdl.com/docs) or reading the implementation of the ingestion functions in the library. If you need further help, reach out to us on the Discord server.

This is a process we would like to simplify in the future.

## Discussion and Contribution opportunities

Feel free to ask questions now (live or through Discord) and make suggestions for future improvements.

- What features concerning ingestion would you like to see?
- What other features concerning versioning would you like to see?