In [1]:
%load_ext autoreload
%autoreload 2

import os
os.environ["EOTDL_API_URL"] = "http://localhost:8001/"


# Example 1 - ingesting a dataset from a folder without a catalog.json

If the user wants to ingest a dataset from a folder without STAC metadata, we first read all files in the folder recursively and create a Parquet file.

In [2]:
from glob import glob

path = 'data/EuroSAT-RGB-small'

# retrieve all files in the folder recursively
files = glob(path + '/**/*', recursive=True)

len(files), files[:3]

(113,
 ['data/EuroSAT-RGB-small/catalog.parquet',
  'data/EuroSAT-RGB-small/PermanentCrop',
  'data/EuroSAT-RGB-small/README.md'])
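
The listing step can be sketched with the standard library alone. The helper names `list_files` and `write_catalog` below are hypothetical and are not part of EOTDL. Unlike the raw `glob` call above, this sketch keeps only regular files and skips the catalog file itself, which is an assumption about what the catalog should contain.

```python
from pathlib import Path


def list_files(root):
    """Return every regular file under root as sorted POSIX-style paths."""
    return sorted(p.as_posix() for p in Path(root).rglob("*") if p.is_file())


def write_catalog(root, out_name="catalog.parquet"):
    """Write the file listing to a Parquet file (requires pandas and pyarrow)."""
    import pandas as pd  # imported lazily so list_files works without pandas

    # Skip a previously written catalog so it does not list itself.
    files = [f for f in list_files(root) if not f.endswith(out_name)]
    out = Path(root) / out_name
    pd.DataFrame({"path": files}).to_parquet(out)
    return out
```

Calling `write_catalog('data/EuroSAT-RGB-small')` would produce a single-column table of paths, similar in spirit to the `catalog.parquet` inspected later in this notebook.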

To ingest any dataset into EOTDL, we require a README.md file with some mandatory metadata.

In [3]:
# create README.md

text = """---
name: EuroSAT-RGB-small-prototype
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb
---

# EuroSAT-RGB-small-prototype

This is a prototype of the EuroSAT dataset.
"""

with open(f"{path}/README.md", "w") as outfile:
    outfile.write(text)
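
Before ingesting, it can help to check the README front matter locally. The sketch below is a deliberately small parser for the simple `key: value` and `- item` layout used above. It is not a full YAML parser and not EOTDL's own validation. Treating `name`, `authors`, `license` and `source` as the mandatory fields is an assumption based on the README in this notebook.

```python
REQUIRED_FIELDS = ("name", "authors", "license", "source")  # assumed set


def parse_front_matter(text):
    """Split a README into (metadata dict, description) using '---' markers."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("README must start with a '---' front-matter block")
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        raise ValueError("front-matter block is not closed with '---'") from None

    metadata, current_key = {}, None
    for line in lines[1:end]:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("- ") and current_key is not None:
            # A list item belongs to the most recently seen key.
            if not isinstance(metadata[current_key], list):
                metadata[current_key] = []
            metadata[current_key].append(stripped[2:].strip())
        else:
            # Split on the first colon only, so URLs in values stay intact.
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            metadata[current_key] = value.strip()

    description = "\n".join(lines[end + 1:]).strip()
    return metadata, description


def missing_fields(metadata):
    """Return the assumed-mandatory fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]
```

With the README written above, `missing_fields(parse_front_matter(text)[0])` returns an empty list.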

In [7]:
from eotdl.datasets import ingest_dataset

ingest_dataset(path)

{'name': 'EuroSAT-RGB-small-prototype', 'authors': ['Juan B. Pedro'], 'license': 'free', 'source': 'https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb', 'description': '# EuroSAT-RGB-small-prototype\n\nThis is a prototype of the EuroSAT dataset.'}
authors=['Juan B. Pedro'] license='free' source='https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb' name='EuroSAT-RGB-small-prototype' thumbnail=''


AttributeError: 'Metadata' object has no attribute 'description'

In [24]:
import geopandas as gpd

gpd.read_parquet(path + "/catalog.parquet")

Unnamed: 0,path,geometry
0,data/EuroSAT-RGB-small/catalog.parquet,
1,data/EuroSAT-RGB-small/PermanentCrop,
2,data/EuroSAT-RGB-small/README.md,
3,data/EuroSAT-RGB-small/Forest,
4,data/EuroSAT-RGB-small/Residential,
...,...,...
108,data/EuroSAT-RGB-small/Industrial/Industrial_2...,
109,data/EuroSAT-RGB-small/Industrial/Industrial_2...,
110,data/EuroSAT-RGB-small/Industrial/Industrial_4...,
111,data/EuroSAT-RGB-small/Industrial/Industrial_1...,


In [25]:
import pyarrow.parquet as pq
from stac_geoparquet.arrow._api import stac_table_to_items
import pystac
import json

# Path to your GeoParquet file (which should have been created from STAC items)
parquet_path = path + "/catalog.parquet"

# Read the Parquet file into an Arrow table
table = pq.read_table(parquet_path)

# Convert the Arrow table to a list of STAC item dictionaries
stac_items = []
for item_dict in stac_table_to_items(table):
    print(item_dict)
    # # The 'assets' field is typically stored as a JSON string.
    # if 'assets' in item_dict:
    #     item_dict['assets'] = json.loads(item_dict['assets'])
    # # Convert the dictionary to a PySTAC Item
    # stac_item = pystac.Item.from_dict(item_dict)
    # stac_items.append(stac_item)

print(f"Converted {len(stac_items)} STAC items from the Parquet file.")

ValueError: Expected 4 or 6 fields in bbox struct.

In [22]:
import pandas as pd

pd.read_parquet(path + "/catalog.parquet")

Unnamed: 0,path
0,data/EuroSAT-RGB-small/catalog.parquet
1,data/EuroSAT-RGB-small/PermanentCrop
2,data/EuroSAT-RGB-small/README.md
3,data/EuroSAT-RGB-small/Forest
4,data/EuroSAT-RGB-small/Residential
...,...
108,data/EuroSAT-RGB-small/Industrial/Industrial_2...
109,data/EuroSAT-RGB-small/Industrial/Industrial_2...
110,data/EuroSAT-RGB-small/Industrial/Industrial_4...
111,data/EuroSAT-RGB-small/Industrial/Industrial_1...


Ingestion collects all files in the folder recursively, creates a simple catalog.json, and ingests it into EOTDL.

In [7]:
!rm -rf data/EuroSAT-RGB-small/catalog.json
!rm -rf data/EuroSAT-RGB-small/collection

# Example 2 - ingesting a dataset from a list of links

We can ingest a new dataset from a list of links (Hugging Face, S3, etc.).


In [9]:
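from eotdl.datasets import ingest_dataset_prototype

# placeholder links; replace them with real asset URLs
links = [
    'https://link1.com',
    'https://link2.com',
    'https://link3.com',
]

# same fields as the README.md front matter, plus the description
metadata = {
    'name': 'Test-links',
    'authors': ['Juan B. Pedro'],
    'license': 'free',
    'source': 'https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb',
    'description': """# Test links

Testing the ingestion of a dataset from a list of links.
"""
}

path = 'data/test-links'

# replicate=False keeps the assets at their original sources
ingest_dataset_prototype(path, metadata, links, replicate=False)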

Using EOTDL API URL: http://localhost:8001/
Using EOTDL API URL: http://localhost:8001/
Loading STAC catalog...
Using EOTDL API URL: http://localhost:8001/
New version created, version: 1


100%|██████████| 3/3 [00:00<00:00, 7653.84it/s]

Ingesting STAC catalog...
Done





This creates a simple catalog.json with the links as items and ingests it into EOTDL. We can choose whether to replicate the assets in EOTDL or keep pointing to the original sources (replicate=False).
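
To make "links as items" concrete, here is a minimal sketch of how one link could be wrapped in a STAC Item dictionary. It is only an illustration of the idea and not EOTDL's implementation. The helper name `link_to_item`, the `item-N` ids, the asset keys and the use of the current time as `datetime` are all assumptions.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def link_to_item(link, index):
    """Wrap a single URL in a minimal STAC Item dict (illustrative only)."""
    # Use the last path segment as the asset key, with a fallback for bare hosts.
    name = urlparse(link).path.rsplit("/", 1)[-1] or f"asset-{index}"
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": f"item-{index}",
        "geometry": None,  # no spatial footprint is known for a bare link
        "properties": {"datetime": datetime.now(timezone.utc).isoformat()},
        "links": [],
        "assets": {name: {"href": link}},  # href points at the original source
    }


example_links = ["https://link1.com", "https://link2.com", "https://link3.com"]
items = [link_to_item(link, i) for i, link in enumerate(example_links)]
```

Each asset `href` here still points at the external URL, which corresponds to the non-replicated case described above.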

In [10]:
!rm -rf data/test-links

# Example 3 - ingesting a dataset from a catalog


If a STAC catalog already exists, we can ingest it into EOTDL directly. In this case, create a README.md and place it in the root of the catalog.

In [48]:
!cp -r data/EuroSAT-RGB-small data/EuroSAT-RGB-small-stac

In [11]:
path = 'data/EuroSAT-RGB-small-stac'

files = os.listdir(path)
assert 'catalog.json' in files, "catalog.json not found"

!cat data/EuroSAT-RGB-small-stac/catalog.json

{
  "type": "Catalog",
  "id": "EuroSAT-RGB-small-prototype",
  "stac_version": "1.0.0",
  "description": "STAC catalog",
  "links": [
    {
      "rel": "root",
      "href": "./catalog.json",
      "type": "application/json",
      "title": "EuroSAT-RGB-small-prototype"
    },
    {
      "rel": "child",
      "href": "./collection/collection.json",
      "type": "application/json",
      "title": "collection"
    }
  ],
  "eotdl": {
    "name": "EuroSAT-RGB-small-catalog-prototype",
    "license": "free",
    "source": "https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb",
    "thumbnail": "",
    "authors": [
      "Juan B. Pedro"
    ],
    "description": "# EuroSAT-RGB-small-catalog-prototype\n\nThis is a prototype of the EuroSAT dataset."
  },
  "title": "EuroSAT-RGB-small-prototype"
}
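
The `eotdl` block in the catalog above carries the same fields as the README front matter. A small sketch for reading it and comparing it with README metadata follows. The helper names are hypothetical, and whether EOTDL requires the two to agree is not stated here, so treat the comparison as an optional sanity check.

```python
import json
from pathlib import Path


def read_eotdl_metadata(catalog_dir):
    """Return the 'eotdl' block of catalog.json, or an empty dict if absent."""
    catalog = json.loads((Path(catalog_dir) / "catalog.json").read_text())
    return catalog.get("eotdl", {})


def metadata_mismatches(catalog_meta, readme_meta,
                        fields=("name", "authors", "license", "source")):
    """List the fields whose values differ between the two metadata dicts."""
    return [f for f in fields if catalog_meta.get(f) != readme_meta.get(f)]
```

For example, `read_eotdl_metadata('data/EuroSAT-RGB-small-stac')` would return the dictionary shown under the `eotdl` key.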

In [14]:
# create README.md

text = """---
name: EuroSAT-RGB-small-catalog-prototype
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/develop/tutorials/workshops/philab24/02_prototype_ingesting.ipynb
---

# EuroSAT-RGB-small-catalog-prototype

This is a prototype of the EuroSAT dataset.
"""

with open(f"{path}/README.md", "w") as outfile:
    outfile.write(text)

In [15]:
from eotdl.datasets import ingest_dataset_prototype

ingest_dataset_prototype(path, replicate=False)

Using EOTDL API URL: http://localhost:8001/
Using EOTDL API URL: http://localhost:8001/
Loading STAC catalog...
Using EOTDL API URL: http://localhost:8001/
New version created, version: 1


100%|██████████| 101/101 [00:02<00:00, 46.95it/s]

Ingesting STAC catalog...
Done





In [12]:
!rm -rf data/EuroSAT-RGB-small-stac/README.md

# Staging data

At this point, every dataset ingested in EOTDL is STAC-compliant, and the hrefs of the assets are links to the EOTDL API.

At staging time, the user can choose to get only the metadata, or the metadata together with the assets.

In [16]:
!rm -rf data/output

In [17]:
from eotdl.datasets import download_dataset_prototype

path = download_dataset_prototype(dataset_name='EuroSAT-RGB-small-catalog-prototype', path="data/output", force=True)
path

Using EOTDL API URL: http://localhost:8001/
Using EOTDL API URL: http://localhost:8001/
To download assets, set assets=True or -a in the CLI.


'data/output/EuroSAT-RGB-small-catalog-prototype/v1'

In [18]:
path = download_dataset_prototype(dataset_name='EuroSAT-RGB-small-catalog-prototype', path="data/output", assets=True, force=True)
path

Using EOTDL API URL: http://localhost:8001/
Using EOTDL API URL: http://localhost:8001/
Using EOTDL API URL: http://localhost:8001/


100%|██████████| 101/101 [00:01<00:00, 77.34it/s]


'data/output/EuroSAT-RGB-small-catalog-prototype/v1'

We can also work on the metadata first (filtering, cleaning, etc.) and then download only the selected assets.
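
As a rough illustration of filtering on metadata first, the sketch below selects asset hrefs from STAC-style item dictionaries by keyword. The item shape (`assets` mapping to objects with an `href`) follows the STAC Item layout, but the helper name `select_asset_hrefs` and the example URLs are hypothetical, and how the staged metadata is loaded into such dictionaries is left out.

```python
def select_asset_hrefs(items, keyword):
    """Collect asset hrefs from items whose href contains the keyword."""
    selected = []
    for item in items:
        for asset in item.get("assets", {}).values():
            href = asset.get("href", "")
            if keyword in href:
                selected.append(href)
    return selected


# Hypothetical items standing in for staged metadata.
staged_items = [
    {"id": "a", "assets": {"img": {"href": "https://example.com/Forest/Forest_1.jpg"}}},
    {"id": "b", "assets": {"img": {"href": "https://example.com/Industrial/Industrial_1.jpg"}}},
]
forest_hrefs = select_asset_hrefs(staged_items, "Forest")
```

Only the hrefs kept by the filter would then need to be downloaded.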