In [1]:
%load_ext autoreload
%autoreload 2

import os
os.environ["EOTDL_API_URL"] = "http://localhost:8000/"


New way to ingest datasets:

1. In order to ingest a dataset into EOTDL we require:
	- `eotdl.parquet`: A parquet file representing the STAC catalog/collection as a list of STAC items.
	- `README.md`: A markdown file with the metadata of the dataset.
2. The parquet file is autogenerated in all of these cases:
	- Ingesting all files in a folder (without STAC metadata)
	- Providing a list of links to files (virtual datasets)
	- Ingesting an existing STAC catalog

Only local assets will be ingested into EOTDL (not URLs).

# Ingesting a dataset from a folder

If the user wants to ingest a dataset from a folder without STAC metadata, we first read all files in the folder recursively and create a parquet file.

In [2]:
from glob import glob

path = 'data/EuroSAT-RGB-small'

# # retrieve all files in the folder recursively
# files = glob(path + '/**/*', recursive=True)

# len(files), files[:3]
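
The recursive listing can be sketched with the standard library alone. The helper names `list_files` and `file_to_item` are hypothetical, and the item layout is an assumption modelled on the item printed further down in this notebook; this is not EOTDL's actual implementation.

```python
from pathlib import Path
import tempfile


def list_files(folder):
    """Return every regular file under `folder`, recursively, as sorted POSIX paths."""
    return sorted(p.as_posix() for p in Path(folder).rglob("*") if p.is_file())


def file_to_item(file_path, collection):
    """Build a minimal STAC-like item for a file without geospatial metadata."""
    return {
        "id": file_path,
        "bbox": [0.0, 0.0, 0.0, 0.0],  # placeholder: no spatial extent is known
        "geometry": {"type": "Polygon", "coordinates": []},  # empty geometry
        "assets": [{"href": file_path}],
        "links": [],
        "collection": collection,
    }


# Demo on a throwaway folder so the sketch runs without the EuroSAT data.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Pasture").mkdir()
    (Path(tmp) / "Pasture" / "Pasture_1.jpg").write_bytes(b"")
    (Path(tmp) / "README.md").write_text("# demo\n")
    demo_files = list_files(tmp)
    demo_items = [file_to_item(f, "demo-collection") for f in demo_files]

print(len(demo_items))
```

A list of such items could then be written to a parquet file, one row per file.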

In order to ingest any dataset into EOTDL, we require a README.md file with some mandatory metadata.

In [3]:
# create README.md

text = """---
name: EuroSAT-RGB-small-prototype
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb
---

# EuroSAT-RGB-small-prototype

This is a prototype of the EuroSAT dataset.
"""

with open(f"{path}/README.md", "w") as outfile:
    outfile.write(text)
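
Before ingesting, it can help to check that the front matter is complete. The sketch below parses only the small YAML subset used in this README (`key: value` pairs and `- item` lists). The set of required fields is an assumption taken from the example above, and `parse_front_matter` and `missing_fields` are hypothetical helpers, not part of the `eotdl` library.

```python
REQUIRED_FIELDS = ("name", "authors", "license", "source")  # assumed from the example


def parse_front_matter(text):
    """Parse `key: value` pairs and `- item` lists between the two '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("README must start with a '---' front matter block")
    metadata, current_key = {}, None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # end of the front matter block
            return metadata
        if not stripped:
            continue
        if stripped.startswith("- ") and isinstance(metadata.get(current_key), list):
            # list entry belonging to the most recent key
            metadata[current_key].append(stripped[2:].strip())
        else:
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            # an empty value means that a list follows on the next lines
            metadata[current_key] = value.strip() if value.strip() else []
    raise ValueError("front matter block is not closed with '---'")


def missing_fields(metadata):
    """Return the required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


example = """---
name: EuroSAT-RGB-small-prototype
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb
---

# EuroSAT-RGB-small-prototype
"""

parsed = parse_front_matter(example)
print(parsed["authors"], missing_fields(parsed))
```

For anything beyond this simple layout, a real YAML parser is the safer choice.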

In [4]:
from eotdl.datasets import ingest_dataset

ingest_dataset(path)

Ingesting directory:  data/EuroSAT-RGB-small


Uploading files: 100%|██████████| 102/102 [00:01<00:00, 52.03it/s]


PosixPath('data/EuroSAT-RGB-small/catalog.parquet')

In [29]:
import geopandas as gpd

gpd.read_parquet(path + "/catalog.parquet")

Unnamed: 0,stac_extensions,id,bbox,geometry,assets,links,collection,abc,123
0,[],data/EuroSAT-RGB-small/catalog.parquet,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,[{'href': 'http://localhost:8001/datasets/'}],[],EuroSAT-RGB-small-prototype,[],"{'asfhjk': [1, 2, 3]}"
1,[],data/EuroSAT-RGB-small/README.md,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,[{'href': 'http://localhost:8001/datasets/'}],[],EuroSAT-RGB-small-prototype,[],"{'asfhjk': [1, 2, 3]}"
2,[],data/EuroSAT-RGB-small/Industrial/Industrial_1...,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,[{'href': 'http://localhost:8001/datasets/'}],[],EuroSAT-RGB-small-prototype,[],"{'asfhjk': [1, 2, 3]}"
3,[],data/EuroSAT-RGB-small/Industrial/Industrial_1...,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,[{'href': 'http://localhost:8001/datasets/'}],[],EuroSAT-RGB-small-prototype,[],"{'asfhjk': [1, 2, 3]}"
4,[],data/EuroSAT-RGB-small/Industrial/Industrial_1...,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,[{'href': 'http://localhost:8001/datasets/'}],[],EuroSAT-RGB-small-prototype,[],"{'asfhjk': [1, 2, 3]}"
...,...,...,...,...,...,...,...,...,...
97,[],data/EuroSAT-RGB-small/Pasture/Pasture_650.jpg,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,[{'href': 'http://localhost:8001/datasets/'}],[],EuroSAT-RGB-small-prototype,[],"{'asfhjk': [1, 2, 3]}"
98,[],data/EuroSAT-RGB-small/Pasture/Pasture_370.jpg,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,[{'href': 'http://localhost:8001/datasets/'}],[],EuroSAT-RGB-small-prototype,[],"{'asfhjk': [1, 2, 3]}"
99,[],data/EuroSAT-RGB-small/Pasture/Pasture_1976.jpg,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,[{'href': 'http://localhost:8001/datasets/'}],[],EuroSAT-RGB-small-prototype,[],"{'asfhjk': [1, 2, 3]}"
100,[],data/EuroSAT-RGB-small/Pasture/Pasture_839.jpg,"{'xmax': 0.0, 'xmin': 0.0, 'ymax': 0.0, 'ymin'...",POLYGON EMPTY,[{'href': 'http://localhost:8001/datasets/'}],[],EuroSAT-RGB-small-prototype,[],"{'asfhjk': [1, 2, 3]}"


In [122]:
import pyarrow.parquet as pq
import stac_geoparquet

table = pq.read_table(path + "/catalog.parquet")

for item in stac_geoparquet.arrow.stac_table_to_items(table):
	print(item)
	break

{'stac_extensions': [], 'id': 'data/EuroSAT-RGB-small/README.md', 'bbox': [0.0, 0.0, 0.0, 0.0], 'geometry': {'type': 'Polygon', 'coordinates': []}, 'assets': [{'href': 'data/EuroSAT-RGB-small/README.md'}], 'links': [], 'collection': 'EuroSAT-RGB-small-prototype', 'properties': {'abc': [], '123': {'asfhjk': [1, 2, 3]}, '123:asd': [1, 2, 3]}}


This gets all files in the folder recursively, creates a simple catalog (stored as `catalog.parquet` in the run above) and ingests it into EOTDL.

# Example 2 - ingesting a dataset from a list of links

We can ingest a new dataset from a list of links (Hugging Face, S3, etc.).


In [9]:
from eotdl.datasets import ingest_dataset_prototype

links = [
	'https://link1.com',
	'https://link2.com',
	'https://link3.com',
]

metadata = {
	'name': 'Test-links',
	'authors': ['Juan B. Pedro'],
	'license': 'free',
	'source': 'https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb',
	'description': """# Test links

Testing the ingestion of a dataset from a list of links.
"""
}

path = 'data/test-links'

ingest_dataset_prototype(path, metadata, links, replicate=False)

Using EOTDL API URL: http://localhost:8001/
Using EOTDL API URL: http://localhost:8001/
Loading STAC catalog...
Using EOTDL API URL: http://localhost:8001/
New version created, version: 1


100%|██████████| 3/3 [00:00<00:00, 7653.84it/s]

Ingesting STAC catalog...
Done





This creates a simple catalog.json with the links as items and ingests it into EOTDL. We can choose whether to replicate the assets in EOTDL or not (in which case the direct sources are used).
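
As a rough illustration of a virtual dataset, the sketch below turns links into minimal items and separates remote hrefs from local ones, since only local assets need uploading. `links_to_items` and `is_url` are hypothetical helpers and the item layout is an assumption; how `ingest_dataset_prototype` handles this internally is not shown in this notebook.

```python
from urllib.parse import urlparse


def is_url(href):
    """True when `href` has a network scheme; plain file paths have none."""
    return urlparse(href).scheme in {"http", "https", "s3"}


def links_to_items(links, collection):
    """Turn each link into a minimal STAC-like item whose asset points at the link."""
    return [
        {
            "id": link,
            "bbox": [0.0, 0.0, 0.0, 0.0],  # placeholder: no spatial extent is known
            "geometry": {"type": "Polygon", "coordinates": []},
            "assets": [{"href": link}],
            "links": [],
            "collection": collection,
        }
        for link in links
    ]


virtual_items = links_to_items(
    ["https://link1.com", "https://link2.com", "https://link3.com"],
    "Test-links",
)

# Only local hrefs would need uploading; remote ones are kept as direct sources.
to_upload = [
    asset["href"]
    for item in virtual_items
    for asset in item["assets"]
    if not is_url(asset["href"])
]
print(len(virtual_items), to_upload)
```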

In [10]:
!rm -rf data/test-links

# Example 3 - ingesting a dataset from a catalog


TODO: use stac-geoparquet to create the parquet file from the catalog.json

If a STAC catalog already exists, we can ingest it into EOTDL. In this case, create a README.md and place it in the root of the catalog.

In [48]:
!cp -r data/EuroSAT-RGB-small data/EuroSAT-RGB-small-stac

In [11]:
path = 'data/EuroSAT-RGB-small-stac'

files = os.listdir(path)
assert 'catalog.json' in files, "catalog.json not found"

!cat data/EuroSAT-RGB-small-stac/catalog.json

{
  "type": "Catalog",
  "id": "EuroSAT-RGB-small-prototype",
  "stac_version": "1.0.0",
  "description": "STAC catalog",
  "links": [
    {
      "rel": "root",
      "href": "./catalog.json",
      "type": "application/json",
      "title": "EuroSAT-RGB-small-prototype"
    },
    {
      "rel": "child",
      "href": "./collection/collection.json",
      "type": "application/json",
      "title": "collection"
    }
  ],
  "eotdl": {
    "name": "EuroSAT-RGB-small-catalog-prototype",
    "license": "free",
    "source": "https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb",
    "thumbnail": "",
    "authors": [
      "Juan B. Pedro"
    ],
    "description": "# EuroSAT-RGB-small-catalog-prototype\n\nThis is a prototype of the EuroSAT dataset."
  },
  "title": "EuroSAT-RGB-small-prototype"
}

In [14]:
# create README.md

text = """---
name: EuroSAT-RGB-small-catalog-prototype
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb
---

# EuroSAT-RGB-small-catalog-prototype

This is a prototype of the EuroSAT dataset.
"""

with open(f"{path}/README.md", "w") as outfile:
    outfile.write(text)
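
The README above repeats metadata that is already present in the `eotdl` field of `catalog.json`, so it could also be generated from the catalog. The following is a sketch only: `readme_from_catalog` is a hypothetical helper, and the dictionary is a trimmed copy of the catalog printed above (in practice it would come from `json.load`).

```python
def readme_from_catalog(catalog):
    """Render README.md text from the `eotdl` field of a STAC catalog dict."""
    meta = catalog["eotdl"]
    authors = "\n".join(f"  - {author}" for author in meta["authors"])
    return (
        "---\n"
        f"name: {meta['name']}\n"
        f"authors:\n{authors}\n"
        f"license: {meta['license']}\n"
        f"source: {meta['source']}\n"
        "---\n\n"
        f"{meta['description']}\n"
    )


# Trimmed copy of the catalog shown above; only the `eotdl` field is used.
catalog = {
    "type": "Catalog",
    "id": "EuroSAT-RGB-small-prototype",
    "eotdl": {
        "name": "EuroSAT-RGB-small-catalog-prototype",
        "license": "free",
        "source": "https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb",
        "authors": ["Juan B. Pedro"],
        "description": "# EuroSAT-RGB-small-catalog-prototype\n\nThis is a prototype of the EuroSAT dataset.",
    },
}

readme_text = readme_from_catalog(catalog)
print(readme_text)
```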

In [15]:
from eotdl.datasets import ingest_dataset_prototype

ingest_dataset_prototype(path, replicate=False)

Using EOTDL API URL: http://localhost:8001/
Using EOTDL API URL: http://localhost:8001/
Loading STAC catalog...
Using EOTDL API URL: http://localhost:8001/
New version created, version: 1


100%|██████████| 101/101 [00:02<00:00, 46.95it/s]

Ingesting STAC catalog...
Done





In [12]:
!rm -rf data/EuroSAT-RGB-small-stac/README.md

# Staging data

At this point, every dataset ingested into EOTDL is STAC-compliant, and the hrefs of the assets are links to the EOTDL API.

At staging time, the user can choose to get only the metadata, or both the metadata and the assets.

In [16]:
!rm -rf data/output

In [17]:
from eotdl.datasets import download_dataset_prototype

path = download_dataset_prototype(dataset_name='EuroSAT-RGB-small-catalog-prototype', path="data/output", force=True)
path

Using EOTDL API URL: http://localhost:8001/
Using EOTDL API URL: http://localhost:8001/
To download assets, set assets=True or -a in the CLI.


'data/output/EuroSAT-RGB-small-catalog-prototype/v1'

In [18]:
path = download_dataset_prototype(dataset_name='EuroSAT-RGB-small-catalog-prototype', path="data/output", assets=True, force=True)
path

Using EOTDL API URL: http://localhost:8001/
Using EOTDL API URL: http://localhost:8001/
Using EOTDL API URL: http://localhost:8001/


100%|██████████| 101/101 [00:01<00:00, 77.34it/s]


'data/output/EuroSAT-RGB-small-catalog-prototype/v1'

We can also work on the metadata first (filtering, cleaning, etc.) and then download only the selected assets.
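
A minimal sketch of that metadata-first workflow, assuming the staged metadata has been loaded as a list of item dictionaries with an `assets` list, as in the item printed earlier. The ids and hrefs below are placeholders, `select_assets` is a hypothetical helper, and the actual download call is left out because this notebook does not show an API for fetching individual assets.

```python
def select_assets(items, keep):
    """Return the asset hrefs of every item for which `keep(item)` is true."""
    return [
        asset["href"]
        for item in items
        if keep(item)
        for asset in item["assets"]
    ]


# Placeholder items standing in for the staged metadata.
staged_items = [
    {"id": "Industrial/Industrial_1.jpg", "assets": [{"href": "https://example.org/assets/1"}]},
    {"id": "Pasture/Pasture_650.jpg", "assets": [{"href": "https://example.org/assets/2"}]},
    {"id": "Pasture/Pasture_370.jpg", "assets": [{"href": "https://example.org/assets/3"}]},
]

# Keep only the Pasture class, then collect the hrefs that would be downloaded.
pasture_hrefs = select_assets(staged_items, lambda item: item["id"].startswith("Pasture/"))
print(pasture_hrefs)
```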