This issue was moved to a discussion.

Decentralized Data Lake Ideas #41

Closed
davidgasquez opened this issue Aug 8, 2023 · 3 comments

davidgasquez commented Aug 8, 2023

Random thoughts around decentralized and permissionless data lakes.

  • An easy target is blockchain data.
  • Everything should be content-addressed and immutable! That is easy to get with chain data. I should be able to query any CID without caring where it lives.
  • Publish the CID of something like a Delta catalog JSON file on Ethereum. You can publish your fork or write contracts on top of it. Use any compute engine to run queries on top of that.
  • Collaborate on data TrueBlocks style, where more people using the service means better data reliability and speed. If there is a section missing, I can send something like a PR to fill in that data.
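
A rough sketch of the content-addressing idea above, assuming a plain SHA-256 hex digest as a stand-in for a real CID (actual CIDs use multihash/multibase encoding) and an in-memory dict as a stand-in for the network. `ContentStore` and `publish_catalog` are hypothetical names for illustration, and the catalog is a simplified JSON mapping, not the Delta format.

```python
import hashlib
import json


def content_address(data: bytes) -> str:
    """Return a hex SHA-256 digest, used here as a stand-in for a real CID."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """Toy immutable store: keys are derived from the bytes they point to."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        address = content_address(data)
        # Writing the same bytes twice is a no-op, so entries never change.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]
        # Any reader can verify the bytes without trusting who served them.
        if content_address(data) != address:
            raise ValueError("content does not match its address")
        return data


def publish_catalog(store: ContentStore, table_files: dict[str, list[str]]) -> str:
    """Store a JSON catalog mapping table names to data-file addresses."""
    payload = json.dumps(table_files, sort_keys=True).encode("utf-8")
    return store.put(payload)


store = ContentStore()
chunk = store.put(b"block_number,tx_count\n1,0\n2,3\n")
catalog_v1 = publish_catalog(store, {"blocks": [chunk]})

# A "fork" is just a new catalog that reuses existing chunks and adds more.
extra = store.put(b"block_number,tx_count\n3,7\n")
catalog_v2 = publish_catalog(store, {"blocks": [chunk, extra]})
```

Only the catalog address would need to be published on-chain; everything else is reachable from it, and filling in a missing section amounts to proposing a new catalog address.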

Also from datonic/datadex#22 (comment).

Reading "The Database I Wish I Had" and thinking about something like that for OLAP workloads. Feels like OLAP might be the "killer database" use case for IPFS/Hypercore/Dat. For analysis, you want data to be immutable, don't care that much about latency, and have to store large amounts of data.

@davidgasquez davidgasquez added the question Further information is requested label Aug 8, 2023
@davidgasquez davidgasquez self-assigned this Aug 8, 2023

davidgasquez commented Oct 25, 2023

Chatted with some folks working on Subsquid. They're doing interesting things in the decentralized data lake area.

This is more or less what I understood about how the Subsquid Archive works.

  1. Right now, data is indexed by Subsquid itself (running substrate-ingest). In the future, anyone will be able to publish arbitrary datasets.
  2. Indexed data is packaged into height-partitioned Parquet files and sent to an orchestrator/router that distributes them across nodes in the Subsquid network. The orchestrator takes into account dataset durability, response times, geographic distribution, ...
  3. Users send (and pay for) queries to the Subsquid network (via a gateway or contract?), and the gateway selects the nodes to run them. Nodes run the query (DuckDB on the nodes) and send back the results.
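
To make the flow above concrete, here is a toy model of it. This is only my reading of the steps, not Subsquid's actual implementation: `Chunk`, `Router`, and `partition_by_height` are hypothetical names, and placement is simple round-robin rather than anything that weighs durability, latency, or geography.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chunk:
    """A height-partitioned file covering blocks [first, last], inclusive."""
    first: int
    last: int


@dataclass
class Router:
    """Hypothetical orchestrator: places chunks on nodes, then routes queries."""
    nodes: list[str]
    replicas: int = 2
    placement: dict[Chunk, list[str]] = field(default_factory=dict)

    def distribute(self, chunks: list[Chunk]) -> None:
        # Round-robin placement; a real router would also weigh durability,
        # response times, and geography.
        for i, chunk in enumerate(chunks):
            self.placement[chunk] = [
                self.nodes[(i + r) % len(self.nodes)] for r in range(self.replicas)
            ]

    def plan(self, from_height: int, to_height: int) -> dict[Chunk, str]:
        """Pick one node for every chunk overlapping the requested range."""
        return {
            chunk: holders[0]
            for chunk, holders in self.placement.items()
            if chunk.first <= to_height and chunk.last >= from_height
        }


def partition_by_height(first: int, last: int, size: int) -> list[Chunk]:
    """Split an inclusive block range into fixed-size chunks."""
    return [
        Chunk(start, min(start + size - 1, last))
        for start in range(first, last + 1, size)
    ]


chunks = partition_by_height(0, 2_999, size=1_000)
router = Router(nodes=["node-a", "node-b", "node-c"])
router.distribute(chunks)

# A query over heights 900..1100 touches two chunks on two different nodes.
plan = router.plan(from_height=900, to_height=1_100)
```

Each node in `plan` would run its part of the query locally (with DuckDB, per the description above) and the gateway would combine the partial results.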

Subsquid Labs maintains public Archive endpoints and offers batch access via the Squid SDK free of charge.

Questions

  • How can you join across different heights/datasets that live on different machines? This will need a proper decentralized query engine (perhaps something where Substrait, DataFusion and Ballista can help!).
  • How do you guarantee fast response times?
  • Is there any mechanism by which commonly accessed data gets replicated more widely?

@davidgasquez

Adding a small note that Dagster is already relying on "hashes" to check when runs are needed! A step closer to fully content-addressed workflows.
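
The general idea can be sketched without any orchestrator. This is not Dagster's API, just a minimal illustration with hypothetical helpers (`step_version`, `needs_run`): a step's version is a hash of its code version plus the versions of its inputs, and a run is needed only when that hash changes.

```python
import hashlib


def step_version(code_version: str, input_versions: list[str]) -> str:
    """Derive a step's version from its code version and its inputs' versions."""
    digest = hashlib.sha256()
    digest.update(code_version.encode("utf-8"))
    # Sort so the result does not depend on the order inputs are listed in.
    for version in sorted(input_versions):
        digest.update(b"\x00" + version.encode("utf-8"))
    return digest.hexdigest()


def needs_run(last_version: str, code_version: str, input_versions: list[str]) -> bool:
    """A step is stale when its recomputed version differs from the stored one."""
    return last_version != step_version(code_version, input_versions)


raw_version = hashlib.sha256(b"raw data").hexdigest()
clean_version = step_version("clean-v1", [raw_version])

unchanged = needs_run(clean_version, "clean-v1", [raw_version])
code_changed = needs_run(clean_version, "clean-v2", [raw_version])
new_raw_version = hashlib.sha256(b"raw data, updated").hexdigest()
input_changed = needs_run(clean_version, "clean-v1", [new_raw_version])
```

Here `unchanged` is false while `code_changed` and `input_changed` are true, so only steps downstream of an actual change get re-run.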

@davidgasquez

You can ATTACH to a remote DuckDB database! There might be a world where a bunch of people publish their small/medium databases and people just attach to them.
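
A minimal sketch of what that could look like from Python, assuming the `duckdb` package is installed and that the database is served read-only over HTTPS (which relies on DuckDB's httpfs extension). The helper names and any URL passed in are hypothetical.

```python
def attach_statement(url: str, alias: str) -> str:
    """Build a read-only ATTACH statement for a remote DuckDB file."""
    if not alias.isidentifier():
        raise ValueError(f"alias must be a simple identifier, got {alias!r}")
    escaped = url.replace("'", "''")  # escape single quotes for the SQL literal
    return f"ATTACH '{escaped}' AS {alias} (READ_ONLY)"


def list_remote_tables(url: str, alias: str = "remote") -> list[tuple]:
    """Attach a published database and list its tables (needs network access)."""
    import duckdb  # third-party; imported lazily so the helper above has no deps

    con = duckdb.connect()
    con.execute(attach_statement(url, alias))
    return con.execute(
        "SELECT table_name FROM information_schema.tables WHERE table_catalog = ?",
        [alias],
    ).fetchall()
```

Calling `list_remote_tables` with the URL of a published `.duckdb` file would attach it under the alias `remote`, after which its tables can be queried as `remote.<table>`.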

@davidgasquez davidgasquez transferred this issue from datonic/datadex Aug 5, 2024
@datonic datonic locked and limited conversation to collaborators Aug 5, 2024
@davidgasquez davidgasquez converted this issue into discussion #42 Aug 5, 2024
