Decentralized Data Lake Ideas #41

davidgasquez · 2023-08-08T20:15:30Z

Random thoughts around decentralized and permissionless data lakes.

An easy target is blockchain data.
Everything should be content addressed and immutable! That is easy to get with chain data. I should be able to query any CID without caring where it is stored.
Publish the CID of something like a Delta catalog JSON file on Ethereum. You can publish your fork or write contracts on top of it. Use any compute engine to run queries on top of that.
Collaborate on data TrueBlocks style, where more people using the service means better data reliability and speed. If there is a section missing, I can send something like a PR to fill in that data.
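
A rough sketch of the "publish one catalog CID" idea, to make it concrete. Everything here is an assumption for illustration: `ContentStore` is a toy in-memory stand-in for something like IPFS, a plain SHA-256 hex digest stands in for a real CID (real CIDs are multihash based), and `publish_catalog` is a hypothetical helper, not a Delta or Ethereum API.

```python
import hashlib
import json


class ContentStore:
    """Toy in-memory content-addressed store (hypothetical stand-in for IPFS)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the bytes, so identical data gets the same key.
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        # Anyone can verify the content against its address, wherever it came from.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("content does not match its address")
        return data


def publish_catalog(store: ContentStore, tables: dict) -> str:
    """Store every file, then store a JSON catalog listing their addresses.

    Returns the address of the catalog itself: the single pointer that
    would be published (for example on a chain).
    """
    catalog = {name: [store.put(blob) for blob in blobs]
               for name, blobs in sorted(tables.items())}
    return store.put(json.dumps(catalog, sort_keys=True).encode())


store = ContentStore()
root = publish_catalog(store, {"blocks": [b"blocks 1-100", b"blocks 101-200"]})

# A "fork" adds a missing section and gets a new root; unchanged files keep their addresses.
fork_root = publish_catalog(
    store, {"blocks": [b"blocks 1-100", b"blocks 101-200", b"blocks 201-300"]}
)

catalog = json.loads(store.get(root))
fork_catalog = json.loads(store.get(fork_root))
```

Because the catalog only lists addresses, a fork that fills in a missing section reuses every file it did not change, which is what would make the PR-like workflow cheap.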

Reading "The Database I Wish I Had" and thinking about something like that for OLAP workloads. Feels like OLAP use cases might be the "killer database" for IPFS/Hypercore/Dat. For analysis, you want data to be immutable, don't care that much about latency, and have to store large amounts of data.

davidgasquez · 2023-10-25T14:14:21Z

Chatted with some folks working on Subsquid. They're doing interesting things in the decentralized data lake area.

This is more or less what I understood about how the Subsquid Archive works.

Right now, data is indexed by Subsquid itself (running substrate-ingest). In the future, anyone will be able to publish their arbitrary datasets.
Indexed data is packaged into height-partitioned Parquet files and sent to an orchestrator/router that distributes them across nodes in the Subsquid network. This orchestrator takes into account dataset durability, response times, geographic distribution, ...
Users send (and pay for) queries to the Subsquid network (via a gateway or contract?), and the gateway selects the nodes to run these queries. Nodes run the query (DuckDB on the nodes) and send back the results.
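
To check my understanding of the flow above, here is a minimal sketch of height-partitioned chunks being spread across nodes and a gateway picking one replica per chunk. This is not Subsquid's actual algorithm: `CHUNK_SIZE`, the round-robin placement, the node names, and the latency-only selection rule are all assumptions made up for illustration.

```python
CHUNK_SIZE = 1000  # blocks per Parquet chunk (hypothetical value)


def chunk_id(height: int) -> int:
    """Map a block height to the chunk (partition) that contains it."""
    return height // CHUNK_SIZE


def assign_chunks(num_chunks: int, nodes: list, replicas: int = 2) -> dict:
    """Orchestrator side: place each chunk on `replicas` distinct nodes, round-robin.

    Assumes replicas <= len(nodes). A real orchestrator would also weigh
    durability, response times and geography.
    """
    return {
        chunk: [nodes[(chunk + r) % len(nodes)] for r in range(replicas)]
        for chunk in range(num_chunks)
    }


def plan_query(first_height: int, last_height: int,
               assignment: dict, latency_ms: dict) -> dict:
    """Gateway side: for every chunk the height range touches, pick the fastest replica."""
    return {
        chunk: min(assignment[chunk], key=lambda node: latency_ms[node])
        for chunk in range(chunk_id(first_height), chunk_id(last_height) + 1)
    }


nodes = ["node-a", "node-b", "node-c"]
assignment = assign_chunks(num_chunks=4, nodes=nodes)
latency_ms = {"node-a": 80, "node-b": 20, "node-c": 50}

# Query blocks 500..2500, which touches chunks 0, 1 and 2.
plan = plan_query(500, 2500, assignment, latency_ms)
```

Each selected node would then run its part of the query locally and return partial results, so any query spanning several chunks needs a merge step at the gateway.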

Subsquid Labs maintains public Archive endpoints and offers batch access via the Squid SDK free of charge.

Questions

How can you join across different heights/datasets that are on different machines? This will need a proper decentralized query engine (perhaps something where Substrait, DataFusion and Ballista can help!).
How do you guarantee fast response times?
Is there any mechanism by which commonly accessed data is distributed more widely?

davidgasquez · 2023-12-11T09:48:17Z

Adding a small note that Dagster is already relying on "hashes" to check when runs are needed! A step closer to fully content-addressed workflows.
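
The general idea, sketched in plain Python rather than with Dagster's actual API: key each step by a hash of its code version plus the hashes of its inputs, and only run it when that key has not been seen. `run_step`, `code_version`, and the dict-based cache are hypothetical names for illustration.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def run_step(fn, code_version: str, inputs: list, cache: dict):
    """Run `fn(*inputs)` only if this (code version, inputs) pair is new.

    Returns (output, ran), where `ran` says whether the step was executed.
    """
    # The step's identity is derived from what goes in, not from when it ran.
    key = digest(json.dumps([code_version, [digest(i) for i in inputs]]).encode())
    if key in cache:
        return cache[key], False
    output = fn(*inputs)
    cache[key] = output
    return output, True


def concat_upper(*parts: bytes) -> bytes:
    return b"".join(parts).upper()


cache = {}
out1, ran1 = run_step(concat_upper, "v1", [b"abc", b"def"], cache)
out2, ran2 = run_step(concat_upper, "v1", [b"abc", b"def"], cache)  # same inputs: skipped
out3, ran3 = run_step(concat_upper, "v1", [b"abc", b"xyz"], cache)  # changed input: runs
out4, ran4 = run_step(concat_upper, "v2", [b"abc", b"def"], cache)  # changed code: runs
```

If outputs were also stored by their hash, the cache key and the stored result would both be content addresses, and the whole workflow could be shared the same way as the data.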

davidgasquez · 2024-01-23T17:45:56Z

You can ATTACH to a remote DuckDB database! There might be a world where a bunch of people publish their small/medium databases and people just attach to them.
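
A small sketch of what "just attaching" could look like from Python. The URL and alias are placeholders, `attach_statement` and `query_remote` are hypothetical helpers, and `query_remote` assumes the third-party `duckdb` package is installed with HTTP(S) support available (the `httpfs` extension), so treat it as untested against a real endpoint.

```python
def attach_statement(url: str, alias: str) -> str:
    """Build a DuckDB ATTACH statement for a remote, read-only database file."""
    if not alias.isidentifier():
        raise ValueError("alias must be a simple identifier")
    escaped = url.replace("'", "''")  # escape single quotes inside the SQL literal
    return f"ATTACH '{escaped}' AS {alias} (READ_ONLY);"


def query_remote(url: str, alias: str, sql: str):
    """Attach a published database and run one query against it."""
    import duckdb  # third-party; imported lazily so the helper above works without it

    con = duckdb.connect()  # in-memory local database
    con.execute(attach_statement(url, alias))
    return con.execute(sql).fetchall()


# Placeholder URL: someone's published database file.
statement = attach_statement("https://example.org/data/chain.duckdb", "chain")
```

After attaching, tables would be addressed as `chain.<table>` in queries, so several published databases could be attached under different aliases and joined locally.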

davidgasquez added the question Further information is requested label Aug 8, 2023

davidgasquez self-assigned this Aug 8, 2023

davidgasquez transferred this issue from datonic/datadex Aug 5, 2024

datonic locked and limited conversation to collaborators Aug 5, 2024

davidgasquez converted this issue into discussion #42 Aug 5, 2024
