Context
I'm building a data distribution platform where datasets need persistent, resolvable identifiers — similar to DOI, but for datasets (potentially ARK). The goal is that given a dataset identifier, the system can resolve it to the best available access method for the consumer: an S3 bucket via multistore, an IPFS/libp2p CID, a BitTorrent magnet link or torrent, or even a physical location for offline access (think requesting a library book through interlibrary loan).
multistore's provider system (DynamoDB, PostgreSQL, HTTP, static file) already solves the "where is this bucket and how do I reach it" problem for S3-compatible backends. I'm interested in how the maintainers see these providers fitting into a broader dataset metadata and discovery layer.
Concrete example: STAC GeoParquet
Consider a STAC GeoParquet catalog. Today, a consumer needs to know:
1. Which S3 bucket holds the GeoParquet file
2. What endpoint/region to use
3. What credentials (if any) are needed
With multistore, you can abstract away (2) and (3) via a virtual bucket. But the consumer still needs to discover that the dataset exists and that it's available through your proxy. There's no metadata layer that says: "Dataset X (version 2, covering Canada, 2024 vintage) is available at these locations via these protocols."
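That missing statement can be made machine-readable. A minimal sketch of what such a record might look like (the field names and the `dataset-x` identifier are illustrative, not a proposed schema):

```python
# Hypothetical machine-readable version of: "Dataset X (version 2,
# covering Canada, 2024 vintage) is available at these locations via
# these protocols." Field names are illustrative only.
dataset_record = {
    "identifier": "dataset-x",
    "version": 2,
    "coverage": "Canada",
    "vintage": 2024,
    "locations": [
        {"protocol": "s3", "bucket": "datasets-ca", "via": "multistore"},
        {"protocol": "ipfs", "cid": "bafy..."},
    ],
}

def protocols(record: dict) -> list[str]:
    """List the protocols a dataset is reachable through."""
    return [loc["protocol"] for loc in record["locations"]]
```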
What I'm envisioning
A metadata registry that sits alongside (or extends) the existing BucketRegistry and maps dataset identifiers to access methods:
┌──────────────────────────────────────────────────────────────┐
│                      Dataset Identifier                      │
│  (e.g., "ca_statcan_2021A000011124_d4c-datapkg-statistical_  │
│   census_pop_census_subdivisions_2021_v0.1.0-beta")          │
└──────────────────────────────┬───────────────────────────────┘
                               │ resolves to
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                        Access Methods                        │
│                                                              │
│  ┌───────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐   │
│  │ S3 proxy  │  │  IPFS /  │  │BitTorrent│  │  Offline   │   │
│  │(multistore│  │  libp2p  │  │          │  │  (request  │   │
│  │  bucket)  │  │          │  │          │  │   media)   │   │
│  └───────────┘  └──────────┘  └──────────┘  └────────────┘   │
│                                                              │
│  Metadata: format=geoparquet, size=2.3GB, region=ca,         │
│  license=OGL-CA, updated=2024-03-01, checksum=sha256:...     │
└──────────────────────────────────────────────────────────────┘
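The resolution step in the diagram can be sketched as a lookup against an in-memory registry. The entry shape and function names here are hypothetical, not multistore's API:

```python
# Hypothetical in-memory registry mapping a dataset identifier to its
# access methods and metadata, mirroring the diagram above.
DATASET_ID = ("ca_statcan_2021A000011124_d4c-datapkg-"
              "statistical_census_pop_census_subdivisions_2021_v0.1.0-beta")

REGISTRY = {
    DATASET_ID: {
        "access_methods": [
            {"type": "s3", "bucket": "census-ca", "via": "multistore"},
            {"type": "ipfs", "cid": "bafy..."},
            {"type": "bittorrent", "magnet": "magnet:?xt=..."},
            {"type": "offline", "note": "request media"},
        ],
        "metadata": {
            "format": "geoparquet",
            "size_bytes": 2_300_000_000,
            "region": "ca",
            "license": "OGL-CA",
            "updated": "2024-03-01",
            "checksum": "sha256:...",
        },
    },
}

def resolve(identifier: str) -> dict:
    """Resolve a dataset identifier to its access methods and metadata."""
    try:
        return REGISTRY[identifier]
    except KeyError:
        raise LookupError(f"unknown dataset: {identifier}") from None
```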
How this relates to existing providers
The current provider model maps a bucket name → BucketConfig with backend_type, backend_options, backend_prefix, etc. This is the right abstraction for S3 routing. My question is about the layer above it:
| Provider | Current role | Potential metadata role |
|---|---|---|
| Static file | Baked-in bucket definitions | Could carry dataset metadata (format, region, provider, etc.) alongside each bucket definition |
| PostgreSQL | Bucket definitions in a database table | Same as above |
| DynamoDB | Bucket definitions in a DynamoDB table | Same as above |
| HTTP | Fetch bucket definitions from another source | Could proxy to an external dataset catalog (e.g., a STAC API or custom registry) |
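As a sketch of that last row: an HTTP provider pointed at a STAC-like catalog could translate a catalog response into a bucket definition. The response shape and the exact BucketConfig fields used here (`backend_type`, `backend_options`, `backend_prefix`, named earlier in this issue) are assumptions about how such a mapping might look, not multistore's actual behaviour:

```python
def stac_asset_to_bucket_config(stac_item: dict, asset_key: str) -> dict:
    """Derive a bucket definition from a STAC item's asset href.

    Assumes an s3:// href of the form s3://<bucket>/<prefix>/<file>.
    """
    href = stac_item["assets"][asset_key]["href"]
    if not href.startswith("s3://"):
        raise ValueError(f"not an S3 asset: {href}")
    bucket, _, key = href[len("s3://"):].partition("/")
    prefix = key.rsplit("/", 1)[0] if "/" in key else ""
    return {
        "backend_type": "s3",
        "backend_options": {"bucket": bucket},
        "backend_prefix": prefix,
    }

# Hypothetical STAC-like item returned by the catalog:
item = {
    "id": "census-subdivisions-2021",
    "assets": {
        "data": {
            "href": "s3://census-ca/2021/subdivisions.parquet",
            "type": "application/x-parquet",
        },
    },
}
```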
End goal
The end goal is a fully transparent, federated data distribution system where:
- Dataset identifiers are persistent and protocol-agnostic — like DOI, but resolving to concrete access methods rather than a landing page
- Multiple providers can register access methods for the same dataset — my multistore deployment offers S3 access, someone else pins it on IPFS, a university library offers it on physical media
- Consumers can choose the optimal access method based on their location, bandwidth, cost constraints, and protocol support
- The metadata is machine-readable — so tools like DuckDB, GDAL, or pandas can automatically discover and connect to the best available source
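On the client side, "choose the optimal access method" could start as simply as ranking by a declared protocol preference (location, bandwidth, and cost scoring could layer on later). A sketch, with all names hypothetical:

```python
def pick_access_method(methods: list[dict], preferred: list[str]) -> dict:
    """Pick the first access method whose type matches the client's
    protocol preference order (e.g., prefer s3, fall back to ipfs)."""
    by_type = {m["type"]: m for m in methods}
    for proto in preferred:
        if proto in by_type:
            return by_type[proto]
    raise LookupError("no supported access method")

methods = [
    {"type": "bittorrent", "magnet": "magnet:?xt=..."},
    {"type": "s3", "bucket": "census-ca"},
]
```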
multistore already solves the "unified S3 interface to heterogeneous backends" problem. I'm asking whether the provider/registry layer could be extended to also solve the "which datasets exist and how can I reach them" problem — or whether that should be a separate system that uses multistore as one of its access methods.
Questions
- Do you see the provider layer (DynamoDB/Postgres/HTTP/static) as purely an internal config store, or is there appetite for it to also serve as a dataset discovery/metadata layer?
- For the HTTP provider specifically — could it be pointed at a STAC API or similar catalog to dynamically resolve dataset identifiers to bucket configurations?
I'm happy to contribute to the discussion or a prototype. The STAC GeoParquet use case is a concrete starting point, but the pattern generalizes to any dataset format.
Resources for Inspiration
- https://www.youtube.com/watch?v=GZvJ0H89G0A
- https://youtu.be/0B8Z-KKkPWM?si=CaCbu4kKeY7mw3ZV
- https://github.com/TomNicholas/FROST
Other
I am currently working on defining a rarity/risk framework for dataset disappearance that accounts for multiple factors (e.g., data sovereignty, funding, file format). While I'd rather not duplicate data for its own sake, replication is sometimes necessary, and it can be done intelligently: you don't need to replicate an entire dataset, only the parts most at risk.
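As an illustration only, such a framework might reduce to a weighted score over the factors mentioned above, used to prioritize which parts of a dataset to replicate. The weights and scale here are invented:

```python
# Hypothetical disappearance-risk score. Factor names follow the text
# above (sovereignty, funding, file format); weights are made up.
WEIGHTS = {"sovereignty": 0.4, "funding": 0.4, "file_format": 0.2}

def risk_score(factors: dict[str, float]) -> float:
    """Weighted sum of per-factor risk values in [0, 1]."""
    return sum(WEIGHTS[name] * value for name, value in factors.items())

def replication_priority(parts: dict[str, dict[str, float]], top: int) -> list[str]:
    """Replicate only the highest-risk parts of a dataset, not all of it."""
    return sorted(parts, key=lambda p: risk_score(parts[p]), reverse=True)[:top]
```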