# Ingest an existing Dataset

In this notebook we are going to showcase how to ingest an existing dataset into EOTDL.

Once your dataset is ingested, you can use it in the same way as any other dataset in EOTDL (exploring, training, etc.).

## Ingesting through the CLI

The recommended way to ingest a dataset is through the CLI.

In [14]:
!eotdl datasets ingest --help

 Usage: eotdl datasets ingest [OPTIONS]

╭─ Options ────────────────────────────────────────────────────────────────────╮
│ *  --path     -p      PATH  Path to dataset [default: None] [required]       │
│    --verbose                Verbose output                                   │
│    --help                   Show this message and exit.                      │
╰──────────────────────────────────────────────────────────────────────────────╯



In order to ingest a dataset you will need a folder in your system with the data you want to upload.

In [4]:
!ls workshop_data

boadella.geojson  dates.csv  EuroSAT-small  sample_stacdataframe.csv


For this tutorial we are going to work with a subsample of the [EuroSAT](https://www.eotdl.com/datasets/EuroSAT-RGB) dataset.

In [15]:
from glob import glob 

files = glob('workshop_data/EuroSAT-small/**/*.*', recursive=True)
files

['workshop_data/EuroSAT-small/metadata.yml',
 'workshop_data/EuroSAT-small/Forest/Forest_3.tif',
 'workshop_data/EuroSAT-small/Forest/Forest_1.tif',
 'workshop_data/EuroSAT-small/Forest/Forest_2.tif',
 'workshop_data/EuroSAT-small/AnnualCrop/AnnualCrop_3.tif',
 'workshop_data/EuroSAT-small/AnnualCrop/AnnualCrop_1.tif',
 'workshop_data/EuroSAT-small/AnnualCrop/AnnualCrop_2.tif']

A `metadata.yml` file is required for Q0 datasets. It contains some basic required information: dataset authors, license, link to the source and dataset name.

In [11]:
!cat workshop_data/EuroSAT-small/metadata.yml

authors:
- Patrick Helber
license: open
source: http://madm.dfki.de/downloads
name: EuroSAT-small


The chosen name is the one that will appear in the repository, hence it must be unique, between 3 and 45 characters long and can only contain alphanumeric characters and dashes (learn more at [https://www.eotdl.com/docs/datasets/ingest](https://www.eotdl.com/docs/datasets/ingest)).

Trying to ingest a dataset without a `metadata.yml` file will fail.
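
Since a missing or malformed `metadata.yml` makes the ingestion fail, it can be handy to check the metadata locally first. The following is a minimal sketch, not part of the EOTDL library: `check_metadata`, `REQUIRED_FIELDS` and `NAME_PATTERN` are hypothetical names, and the pattern assumes that "alphanumeric" means ASCII letters and digits. Uniqueness of the name cannot be checked locally.

```python
import re

# Fields listed above as required for Q0 datasets.
REQUIRED_FIELDS = ("authors", "license", "source", "name")
# Assumption: 3 to 45 characters, ASCII letters, digits and dashes only.
NAME_PATTERN = re.compile(r"[A-Za-z0-9-]{3,45}")


def check_metadata(metadata):
    """Return a list of problems found in the parsed metadata (empty if none)."""
    problems = [
        f"missing field: {field}"
        for field in REQUIRED_FIELDS
        if not metadata.get(field)
    ]
    name = metadata.get("name")
    if name and not NAME_PATTERN.fullmatch(str(name)):
        problems.append(f"invalid name: {name!r}")
    return problems


print(check_metadata({
    "authors": ["Patrick Helber"],
    "license": "open",
    "source": "http://madm.dfki.de/downloads",
    "name": "EuroSAT-small",
}))
```

The dictionary can be obtained from the file itself, for example with PyYAML's `yaml.safe_load`, if that package is installed.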

If everything is correct, the ingestion process should succeed.

In [12]:
!eotdl datasets ingest -p workshop_data/EuroSAT-small/

Uploading directory workshop_data/EuroSAT-small...
Uploading files: 100%|█████████████████████████| 6/6 [00:15<00:00,  2.67s/files]


And now your dataset is available in EOTDL.

In [13]:
!eotdl datasets list -n eurosat-small

['EuroSAT-small']


> Since the `EuroSAT-small` name is already taken, this process should fail for you. To solve it, just upload the dataset with a different name. However, this will pollute EOTDL with test datasets, so we encourage you to try the ingestion process with a real dataset that you want to ingest (or overwrite your test dataset in the future with useful data).

In order to ingest Q1+ datasets, a valid STAC catalog is required instead of the `metadata.yml` file. We will explore this in the [data curation](tutorials/workshops/bids23/05_STAC_metadata.ipynb) notebook.

## Versioning

By default, every time you re-upload a dataset a new version is created. 

When you download a dataset, the latest version is used by default.

In [15]:
!eotdl datasets get EuroSAT-small

100%|███████████████████████████████████████████| 6/6 [00:15<00:00,  2.62s/file]
Data available at /home/juan/.cache/eotdl/datasets/EuroSAT-small/v3/AnnualCrop


However, you can specify the version with the `-v` option.

In [16]:
!eotdl datasets get EuroSAT-small -v 1

100%|███████████████████████████████████████████| 6/6 [00:22<00:00,  3.73s/file]
Data available at /home/juan/.cache/eotdl/datasets/EuroSAT-small/v1/AnnualCrop


In [17]:
!ls $HOME/.cache/eotdl/datasets/EuroSAT-small

v1  v2	v3


Versioning is applied at both the dataset and the file level: when you re-upload a dataset, only new or modified files are uploaded, and downloading a given version retrieves the files that belong to that version.
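
To illustrate the idea of file-level versioning, here is a minimal sketch that decides which files would need uploading by comparing content checksums with those of a previous version. This is an assumption made for illustration: the notebook does not state how EOTDL detects modified files, and `checksum_folder` and `files_to_upload` are hypothetical helpers, not library functions.

```python
import hashlib
from pathlib import Path


def checksum_folder(folder):
    """Map each file path (relative to folder) to the SHA-1 of its contents."""
    root = Path(folder)
    return {
        path.relative_to(root).as_posix(): hashlib.sha1(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def files_to_upload(local, previous):
    """Return paths that are new or whose checksum differs from the previous version."""
    return sorted(
        path for path, digest in local.items() if previous.get(path) != digest
    )


# Toy checksums: one unchanged file, one modified file and one new file.
previous = {"metadata.yml": "aaa", "Forest/Forest_1.tif": "bbb"}
local = {"metadata.yml": "aaa", "Forest/Forest_1.tif": "ccc", "Forest/Forest_2.tif": "ddd"}
print(files_to_upload(local, previous))
```

With a real folder, `checksum_folder("workshop_data/EuroSAT-small")` would produce the `local` mapping.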

## Ingesting through the Library

You can also ingest datasets using the library.

In [21]:
from eotdl.datasets import ingest_dataset

ingest_dataset("workshop_data/EuroSAT-small");

Uploading directory workshop_data/EuroSAT-small...


Uploading files: 100%|██████████| 6/6 [00:00<00:00,  6.03files/s]


## Ingesting through the API

Ingesting a dataset through the API is a multi-step process:

1. Create/Retrieve a dataset
2. Create a version
3. Ingest files (optionally, retrieve files to avoid uploading the same file)

The library/CLI will take care of these steps, so it is the recommended way to ingest a dataset. 

However, if you still want to ingest datasets with the API, first you'll need to authenticate as shown in the [exploring](tutorials/workshops/bids23/01_exploring.ipynb) notebook.

In [2]:
token = "<your-token>"  # paste the token obtained after authenticating; never commit a real one

Then, create a dataset passing the required metadata.

In [4]:
import requests

response = requests.post(
	'https://api.eotdl.com/datasets',
	headers={'Authorization': f'Bearer {token}'},
	json={
		'name': 'EuroSAT-small',
		"authors": ["author1", "author2"],
		"source": "https://link-to-source",
		"license": "the-license"
	}
)

response.json()

{'detail': 'Dataset already exists'}

If the dataset already exists, and you want to ingest a new version, you'll have to retrieve its information.

In [6]:
response = requests.get('https://api.eotdl.com/datasets?name=EuroSAT-small')
response.json()

{'uid': 'auth0|616b0057af0c7500691a026e',
 'id': '6526accffd974011abc2413a',
 'name': 'EuroSAT-small',
 'authors': ['juan'],
 'source': 'http://km.com',
 'license': 'open',
 'files': '6526accffd974011abc2413b',
 'versions': [{'version_id': 1,
   'createdAt': '2023-10-11T16:08:47.864',
   'size': 643464},
  {'version_id': 2, 'createdAt': '2023-10-11T16:08:47.864', 'size': 643464},
  {'version_id': 3, 'createdAt': '2023-10-12T07:14:16.642', 'size': 643464},
  {'version_id': 4, 'createdAt': '2023-10-12T07:14:16.642', 'size': 643464},
  {'version_id': 5, 'createdAt': '2023-10-12T07:14:16.642', 'size': 643464},
  {'version_id': 6, 'createdAt': '2023-10-12T07:14:16.642', 'size': 643464}],
 'description': '',
 'tags': [],
 'createdAt': '2023-10-11T16:08:47.865',
 'updatedAt': '2023-10-25T14:21:57.986',
 'likes': 0,
 'downloads': 0,
 'quality': 0}

Then, create a version.

In [9]:
dataset_id = response.json()['id']
response = requests.post(
	f'https://api.eotdl.com/datasets/version/{dataset_id}',
	headers={'Authorization': f'Bearer {token}'},
)
response.json()

{'dataset_id': '6526accffd974011abc2413a', 'version': 8}
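
The retrieve/create and version steps shown above can be wrapped in one helper. This is a minimal sketch rather than an official client: `create_version_for` is a hypothetical name, it only uses the three endpoints that appear in this notebook, and it assumes that looking up a dataset that does not exist returns JSON without an `id` key. The `http` argument is expected to be the `requests` module or a `requests.Session`.

```python
API_URL = "https://api.eotdl.com"


def create_version_for(http, token, metadata):
    """Retrieve (or create) the dataset named in metadata and open a new version.

    Returns a (dataset_id, version) tuple.
    """
    headers = {"Authorization": f"Bearer {token}"}

    def retrieve():
        # Same request as GET /datasets?name=<name> shown above.
        return http.get(f"{API_URL}/datasets", params={"name": metadata["name"]}).json()

    dataset = retrieve()
    if "id" not in dataset:
        # Assumption: no 'id' means the dataset does not exist yet, so create it.
        created = http.post(f"{API_URL}/datasets", headers=headers, json=metadata).json()
        dataset = retrieve()
        if "id" not in dataset:
            raise RuntimeError(f"Could not create dataset: {created}")

    version = http.post(
        f"{API_URL}/datasets/version/{dataset['id']}", headers=headers
    ).json()
    return dataset["id"], version["version"]
```

A call would look like `create_version_for(requests, token, {"name": ..., "authors": [...], "source": ..., "license": ...})` after `import requests`. Passing the HTTP client as an argument keeps the helper easy to try out with a stub instead of the live API.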

Now you can ingest as many files as you want into this version.

In [13]:
files

['workshop_data/EuroSAT-small/metadata.yml',
 'workshop_data/EuroSAT-small/Forest/Forest_3.tif',
 'workshop_data/EuroSAT-small/Forest/Forest_1.tif',
 'workshop_data/EuroSAT-small/Forest/Forest_2.tif',
 'workshop_data/EuroSAT-small/AnnualCrop/AnnualCrop_3.tif',
 'workshop_data/EuroSAT-small/AnnualCrop/AnnualCrop_1.tif',
 'workshop_data/EuroSAT-small/AnnualCrop/AnnualCrop_2.tif']

> TODO: ingest through API, can we simplify?