Context
I'm building a data distribution platform where datasets need persistent, resolvable identifiers — similar to DOI, but for datasets (potentially ARK). The goal is that given a dataset identifier, the system can resolve it to the best available access method for the consumer: an S3 bucket via multistore, an IPFS/libp2p CID, a BitTorrent magnet link or torrent, or even a physical location for offline access (think requesting a library book through interlibrary loan).
multistore's provider system (DynamoDB, PostgreSQL, HTTP, static file) already solves the "where is this bucket and how do I reach it" problem for S3-compatible backends. I'm interested in how the maintainers see these providers fitting into a broader dataset metadata and discovery layer.
Concrete example: STAC GeoParquet
Consider a STAC GeoParquet catalog. Today, a consumer needs to know:
1. Which S3 bucket holds the GeoParquet file
2. What endpoint/region to use
3. What credentials (if any) are needed
With multistore, you can abstract away (2) and (3) via a virtual bucket. But the consumer still needs to discover that the dataset exists and that it's available through your proxy. There's no metadata layer that says: "Dataset X (version 2, covering Canada, 2024 vintage) is available at these locations via these protocols."
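That missing statement can be made machine-readable. A minimal sketch of what such a record might look like (the field names and the `dataset-x` identifier are illustrative, not a proposed schema):

```python
# Hypothetical machine-readable version of: "Dataset X (version 2,
# covering Canada, 2024 vintage) is available at these locations via
# these protocols." Field names are illustrative only.
dataset_record = {
    "identifier": "dataset-x",
    "version": 2,
    "coverage": "Canada",
    "vintage": 2024,
    "locations": [
        {"protocol": "s3", "bucket": "datasets-ca", "via": "multistore"},
        {"protocol": "ipfs", "cid": "bafy..."},
    ],
}

def protocols(record: dict) -> list[str]:
    """List the protocols a dataset is reachable through."""
    return [loc["protocol"] for loc in record["locations"]]
```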
What I'm envisioning
A metadata registry that sits alongside (or extends) the existing BucketRegistry and maps dataset identifiers to access methods:
┌──────────────────────────────────────────────────────────────┐
│                      Dataset Identifier                      │
│  (e.g., "ca_statcan_2021A000011124_d4c-datapkg-statistical_  │
│   census_pop_census_subdivisions_2021_v0.1.0-beta")          │
└──────────────────────────────┬───────────────────────────────┘
                               │ resolves to
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                        Access Methods                        │
│                                                              │
│  ┌───────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐   │
│  │ S3 proxy  │  │  IPFS /  │  │BitTorrent│  │  Offline   │   │
│  │(multistore│  │  libp2p  │  │          │  │  (request  │   │
│  │  bucket)  │  │          │  │          │  │   media)   │   │
│  └───────────┘  └──────────┘  └──────────┘  └────────────┘   │
│                                                              │
│  Metadata: format=geoparquet, size=2.3GB, region=ca,         │
│  license=OGL-CA, updated=2024-03-01, checksum=sha256:...     │
└──────────────────────────────────────────────────────────────┘
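The resolution step in the diagram can be sketched as a lookup against an in-memory registry. The entry shape and function names here are hypothetical, not multistore's API:

```python
# Hypothetical in-memory registry mapping a dataset identifier to its
# access methods and metadata, mirroring the diagram above.
DATASET_ID = ("ca_statcan_2021A000011124_d4c-datapkg-"
              "statistical_census_pop_census_subdivisions_2021_v0.1.0-beta")

REGISTRY = {
    DATASET_ID: {
        "access_methods": [
            {"type": "s3", "bucket": "census-ca", "via": "multistore"},
            {"type": "ipfs", "cid": "bafy..."},
            {"type": "bittorrent", "magnet": "magnet:?xt=..."},
            {"type": "offline", "note": "request media"},
        ],
        "metadata": {
            "format": "geoparquet",
            "size_bytes": 2_300_000_000,
            "region": "ca",
            "license": "OGL-CA",
            "updated": "2024-03-01",
            "checksum": "sha256:...",
        },
    },
}

def resolve(identifier: str) -> dict:
    """Resolve a dataset identifier to its access methods and metadata."""
    try:
        return REGISTRY[identifier]
    except KeyError:
        raise LookupError(f"unknown dataset: {identifier}") from None
```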
How this relates to existing providers
The current provider model maps a bucket name → BucketConfig with backend_type, backend_options, backend_prefix, etc. This is the right abstraction for S3 routing. My question is about the layer above it:
| Provider | Current role | Potential metadata role |
|---|---|---|
| Static file | Baked-in bucket definitions | Could carry dataset metadata (format, region, provider, etc.) alongside each bucket definition |
| PostgreSQL | Bucket definitions in a database table | Same as above |
| DynamoDB | Bucket definitions in a DynamoDB table | Same as above |
| HTTP | Fetch bucket definitions from another source | Could proxy to an external dataset catalog (e.g., a STAC API or custom registry) |
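As a sketch of that last row: an HTTP provider pointed at a STAC-like catalog could translate a catalog response into a bucket definition. The response shape and the exact BucketConfig fields used here (`backend_type`, `backend_options`, `backend_prefix`, named earlier in this issue) are assumptions about how such a mapping might look, not multistore's actual behaviour:

```python
def stac_asset_to_bucket_config(stac_item: dict, asset_key: str) -> dict:
    """Derive a bucket definition from a STAC item's asset href.

    Assumes an s3:// href of the form s3://<bucket>/<prefix>/<file>.
    """
    href = stac_item["assets"][asset_key]["href"]
    if not href.startswith("s3://"):
        raise ValueError(f"not an S3 asset: {href}")
    bucket, _, key = href[len("s3://"):].partition("/")
    prefix = key.rsplit("/", 1)[0] if "/" in key else ""
    return {
        "backend_type": "s3",
        "backend_options": {"bucket": bucket},
        "backend_prefix": prefix,
    }

# Hypothetical STAC-like item returned by the catalog:
item = {
    "id": "census-subdivisions-2021",
    "assets": {
        "data": {
            "href": "s3://census-ca/2021/subdivisions.parquet",
            "type": "application/x-parquet",
        },
    },
}
```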
End goal
The end goal is a fully transparent, federated data distribution system where:
- Dataset identifiers are persistent and protocol-agnostic — like DOI, but resolving to concrete access methods rather than a landing page
- Multiple providers can register access methods for the same dataset — my multistore deployment offers S3 access, someone else pins it on IPFS, a university library offers it on physical media
- Consumers can choose the optimal access method based on their location, bandwidth, cost constraints, and protocol support
- The metadata is machine-readable — so tools like DuckDB, GDAL, or pandas can automatically discover and connect to the best available source
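On the client side, "choose the optimal access method" could start as simply as ranking by a declared protocol preference (location, bandwidth, and cost scoring could layer on later). A sketch, with all names hypothetical:

```python
def pick_access_method(methods: list[dict], preferred: list[str]) -> dict:
    """Pick the first access method whose type matches the client's
    protocol preference order (e.g., prefer s3, fall back to ipfs)."""
    by_type = {m["type"]: m for m in methods}
    for proto in preferred:
        if proto in by_type:
            return by_type[proto]
    raise LookupError("no supported access method")

methods = [
    {"type": "bittorrent", "magnet": "magnet:?xt=..."},
    {"type": "s3", "bucket": "census-ca"},
]
```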
multistore already solves the "unified S3 interface to heterogeneous backends" problem. I'm asking whether the provider/registry layer could be extended to also solve the "which datasets exist and how can I reach them" problem — or whether that should be a separate system that uses multistore as one of its access methods.
Questions
- Do you see the provider layer (DynamoDB/Postgres/HTTP/static) as purely an internal config store, or is there appetite for it to also serve as a dataset discovery/metadata layer?
- For the HTTP provider specifically — could it be pointed at a STAC API or similar catalog to dynamically resolve dataset identifiers to bucket configurations?
I'm happy to contribute to the discussion or a prototype. The STAC GeoParquet use case is a concrete starting point, but the pattern generalizes to any dataset format.
Resources for Inspiration
- https://www.youtube.com/watch?v=GZvJ0H89G0A
- https://youtu.be/0B8Z-KKkPWM?si=CaCbu4kKeY7mw3ZV
- https://github.com/TomNicholas/FROST
Other
I am currently working on defining a rarity/risk framework for dataset disappearance that accounts for multiple factors (e.g., data sovereignty, funding, file format). While I'd rather not duplicate data for its own sake, replication is sometimes necessary, and it can be done intelligently: you don't need to replicate an entire dataset, only the parts most at risk.
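As an illustration only, such a framework might reduce to a weighted score over the factors mentioned above, used to prioritize which parts of a dataset to replicate. The weights and scale here are invented:

```python
# Hypothetical disappearance-risk score. Factor names follow the text
# above (sovereignty, funding, file format); weights are made up.
WEIGHTS = {"sovereignty": 0.4, "funding": 0.4, "file_format": 0.2}

def risk_score(factors: dict[str, float]) -> float:
    """Weighted sum of per-factor risk values in [0, 1]."""
    return sum(WEIGHTS[name] * value for name, value in factors.items())

def replication_priority(parts: dict[str, dict[str, float]], top: int) -> list[str]:
    """Replicate only the highest-risk parts of a dataset, not all of it."""
    return sorted(parts, key=lambda p: risk_score(parts[p]), reverse=True)[:top]
```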