In [1]:
%load_ext autoreload
%autoreload 2

import os
os.environ["EOTDL_API_URL"] = "http://localhost:8000/"

# Ingest an existing Dataset or Model

In this notebook we are going to showcase how to ingest an existing dataset or model into EOTDL.

Once it is ingested, you can use it in the same way as any other dataset or model in EOTDL (exploring, staging, etc.).

## Ingesting through the CLI

The recommended way to ingest a dataset is using the CLI.

In [2]:
!eotdl datasets ingest --help

 Usage: eotdl datasets ingest [OPTIONS]

 Ingest a dataset to the EOTDL.
 This command ingests the dataset to the EOTDL. The dataset must be a folder
 with the dataset files, and at least a README.md file (and a catalog.json file
 for Q1+). If these files are missing, the ingestion will not work. All the
 files in the folder will be uploaded to the EOTDL.

 The following constraints apply to the dataset name:
 - It must be unique


There are several ways in which you can ingest a dataset:

1. From a local folder in your system with the data you want to upload.
2. TODO: From a list of links to assets in another repository (cloud bucket, huggingface, etc.).

Let's start with option 1.

In [3]:
!ls example_data

eurosat_rgb_dataset	 EuroSAT-small		   jaca_dataset_structured
EuroSAT-RGB-small	 jaca_dataset		   labels_scaneo
EuroSAT-RGB-small-STAC	 jaca_dataset_q2	   sample_stacdataframe.csv
eurosat_rgb_stac	 jaca_dataset_stac
eurosat_rgb_stac_labels  jaca_dataset_stac_labels


For this tutorial we are going to work with a subsample of the [EuroSAT](https://www.eotdl.com/datasets/EuroSAT-RGB) dataset.

In [5]:
from glob import glob 

path = "example_data/EuroSAT-small"
files = glob(f'{path}/**/*.*', recursive=True)
files

['example_data/EuroSAT-small/Forest/Forest_3.tif',
 'example_data/EuroSAT-small/Forest/Forest_1.tif',
 'example_data/EuroSAT-small/Forest/Forest_2.tif',
 'example_data/EuroSAT-small/AnnualCrop/AnnualCrop_3.tif',
 'example_data/EuroSAT-small/AnnualCrop/AnnualCrop_1.tif',
 'example_data/EuroSAT-small/AnnualCrop/AnnualCrop_2.tif']

In all cases, a `README.md` file is required to ingest datasets and models. It must contain some basic information: the dataset name, its authors, its license, and a link to the source.

In [6]:
# create README.md

text = """---
name: EuroSAT-small
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/main/tutorials/notebooks/02_ingesting.ipynb
---

# EuroSAT-small

This is a small subset of the EuroSAT dataset.
"""

with open(f"{path}/README.md", "w") as outfile:
    outfile.write(text)

In [7]:
!cat {path}/README.md

---
name: EuroSAT-small
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/main/tutorials/notebooks/02_ingesting.ipynb
---

# EuroSAT-small

This is a small subset of the EuroSAT dataset.


The `name` property in the `README.md` file is used as the name of the dataset or model in the repository. It must therefore be unique, be between 3 and 45 characters long, and contain only alphanumeric characters and dashes (learn more at [https://www.eotdl.com/docs/datasets/ingest](https://www.eotdl.com/docs/datasets/ingest)).
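
The length and character constraints can be checked locally before ingesting. The following is a minimal sketch, assuming the rule is exactly as stated above; `is_valid_name` and `NAME_PATTERN` are illustrative names, not part of the `eotdl` library, and the check performed by EOTDL itself may differ. Uniqueness cannot be verified this way, since it depends on the names already registered in the repository.

```python
import re

# Assumed pattern derived from the rule stated above: 3 to 45 characters,
# ASCII letters, digits and dashes only. The server-side check may differ.
NAME_PATTERN = re.compile(r"[A-Za-z0-9-]{3,45}")


def is_valid_name(name: str) -> bool:
    """Return True if `name` satisfies the length and character constraints."""
    return NAME_PATTERN.fullmatch(name) is not None


print(is_valid_name("EuroSAT-small"))  # True
print(is_valid_name("my dataset"))     # False: contains a space
print(is_valid_name("ab"))             # False: shorter than 3 characters
```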

Trying to ingest a dataset without a `README.md` file will fail.
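
A quick local check can catch a missing or incomplete `README.md` before calling the CLI. The sketch below assumes the four fields mentioned earlier (`name`, `authors`, `license`, `source`) are the required ones and that the front matter is delimited by `---` lines, as in the example above. `REQUIRED_FIELDS`, `read_front_matter_keys` and `missing_readme_fields` are hypothetical helpers written for this tutorial, not part of `eotdl`, and the parsing is deliberately simplistic (it only looks at top-level keys).

```python
from pathlib import Path

# Fields described above as required; the exact set enforced by EOTDL is an assumption.
REQUIRED_FIELDS = ("name", "authors", "license", "source")


def read_front_matter_keys(readme: Path) -> set:
    """Collect the top-level keys found between the two '---' delimiters."""
    lines = readme.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Top-level keys start at column 0 and contain a colon.
        if line and not line[0].isspace() and ":" in line:
            keys.add(line.split(":", 1)[0].strip())
    return keys


def missing_readme_fields(folder: str) -> list:
    """Return the required fields that are absent (all of them if README.md is missing)."""
    readme = Path(folder) / "README.md"
    if not readme.is_file():
        return list(REQUIRED_FIELDS)
    found = read_front_matter_keys(readme)
    return [field for field in REQUIRED_FIELDS if field not in found]
```

Calling `missing_readme_fields("example_data/EuroSAT-small")` returns an empty list when every assumed field is present in the front matter.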

If everything is correct, the ingestion process should succeed.

In [8]:
!eotdl datasets ingest -p example_data/EuroSAT-small

Ingesting directory:  example_data/EuroSAT-small
Ingesting files: 100%|████████████████████████████| 8/8 [00:00<00:00, 59.16it/s]


Your dataset is now available in EOTDL.

In [9]:
!eotdl datasets list -n eurosat-small

['EuroSAT-small']


> Since the `EuroSAT-small` name is already taken, this process should fail for you. To solve it, upload the dataset under a different name. However, this will pollute EOTDL with test datasets, so we encourage you to try the ingestion process with a real dataset that you want to ingest (or to overwrite your test dataset later with useful data).
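
Changing the name means editing the `name` entry in the `README.md` front matter. A minimal sketch of doing this programmatically is shown below; `set_readme_name` is a hypothetical helper (not part of `eotdl`) and it assumes the front matter contains a top-level `name:` line, as in the README created above.

```python
import re
from pathlib import Path


def set_readme_name(readme_path: str, new_name: str) -> None:
    """Replace the value of the first top-level `name:` line in a README."""
    path = Path(readme_path)
    text = path.read_text(encoding="utf-8")
    # count=1 touches only the first match (the front-matter entry);
    # MULTILINE makes ^ and $ match at line boundaries.
    updated, replaced = re.subn(
        r"^name:.*$",
        lambda _match: f"name: {new_name}",
        text,
        count=1,
        flags=re.MULTILINE,
    )
    if replaced == 0:
        raise ValueError(f"no 'name:' entry found in {readme_path}")
    path.write_text(updated, encoding="utf-8")
```

For example, `set_readme_name(f"{path}/README.md", "EuroSAT-small-yourname")` would rename the dataset, where `EuroSAT-small-yourname` is only a placeholder. The Markdown heading in the README body is left untouched.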

Before the ingestion, the CLI will create a STAC-compliant `parquet` file with the metadata of the dataset.

In [13]:
import geopandas as gpd

catalog = f"{path}/catalog.parquet"

gdf = gpd.read_parquet(catalog)
gdf.head()

| | stac_extensions | id | bbox | geometry | assets | links | collection | abc | 123 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | catalog.parquet | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | [{'href': 'http://localhost:8000/datasets/67a3... | [] | EuroSAT-small | [] | {'asfhjk': [1, 2, 3]} |
| 1 | [] | README.md | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | [{'href': 'http://localhost:8000/datasets/67a3... | [] | EuroSAT-small | [] | {'asfhjk': [1, 2, 3]} |
| 2 | [] | Forest/Forest_3.tif | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | [{'href': 'http://localhost:8000/datasets/67a3... | [] | EuroSAT-small | [] | {'asfhjk': [1, 2, 3]} |
| 3 | [] | Forest/Forest_1.tif | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | [{'href': 'http://localhost:8000/datasets/67a3... | [] | EuroSAT-small | [] | {'asfhjk': [1, 2, 3]} |
| 4 | [] | Forest/Forest_2.tif | {'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'... | POLYGON EMPTY | [{'href': 'http://localhost:8000/datasets/67a3... | [] | EuroSAT-small | [] | {'asfhjk': [1, 2, 3]} |


However, if your local dataset already contains a STAC catalog, the available schema will be used to create the EOTDL `parquet` catalog (including the different STAC extensions or properties that might be present).

> TODO: example with STAC catalog

> TODO: list of links

You can ingest a model in exactly the same way.

In [14]:
# !eotdl models ingest --help

## Versioning (TODO)

By default, every time you re-upload a dataset or model, a new version is created.

When you download a dataset, the latest version is used by default.

In [15]:
# !eotdl datasets get EuroSAT-small

However, you can also request a specific version:

In [16]:
# !eotdl datasets get EuroSAT-small -v 1

In [17]:
# !ls $HOME/.cache/eotdl/datasets/EuroSAT-small

Versioning is applied at both the dataset/model level and the file level. On a re-upload, only new or modified files are uploaded; on download, the files that belong to the requested version are retrieved.
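
To illustrate the general idea of file-level versioning, the sketch below detects new or modified files by comparing content checksums between two snapshots of a folder. This is only an illustration of the concept: it is not EOTDL's actual implementation, and `file_checksums` and `files_to_upload` are hypothetical helpers that do not exist in the `eotdl` library.

```python
import hashlib
from pathlib import Path


def file_checksums(folder: str) -> dict:
    """Map each file path (relative to `folder`) to the SHA-1 of its contents."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): hashlib.sha1(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def files_to_upload(previous: dict, current: dict) -> list:
    """Return the paths that are new or whose content changed since `previous`."""
    return sorted(
        path for path, digest in current.items() if previous.get(path) != digest
    )
```

Taking a snapshot with `file_checksums` before and after editing the folder, and passing both to `files_to_upload`, yields only the files that would need to be sent again under such a scheme.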

You can explore the different versions in the user interface.

## Ingesting through the Library

You can also ingest datasets and models using the Python library.

In [19]:
from eotdl.datasets import ingest_dataset

ingest_dataset("example_data/EuroSAT-small");

Ingesting directory:  example_data/EuroSAT-small


Ingesting files: 100%|██████████| 8/8 [00:00<00:00, 47.69it/s]


> TODO: example with STAC catalog and models