Random thoughts around decentralized and permissionless data lakes.
An easy target is blockchain data.
Everything should be content-addressed and immutable! That is easy to get with chain data. I should be able to query any CID without caring where it is stored.
Publish the CID of something like a Delta Catalog JSON file on Ethereum. You can publish your fork or write contracts on top of it. Use any compute engine to run queries on top of that.
Collaborate on data TrueBlocks style, where more people using the service means better data reliability and speed. If there is a section missing, I can send something like a PR to fill in that data.
Reading "The Database I Wish I Had" and thinking about something like that for OLAP workloads. Feels like OLAP use cases might be the "killer database" for IPFS/Hypercore/Dat. For analysis, you want data to be immutable, you don't care that much about latency, and you have to store large amounts of data.
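To make the content-addressing idea concrete, here is a minimal sketch of a store where the key of a blob is derived from its bytes. Everything in it is an assumption for illustration: `ContentStore` is a hypothetical in-memory class, and the key is a plain SHA-256 hex digest rather than a real IPFS CID (which adds multihash and multibase encoding on top).

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content, so identical bytes
        # always map to the same key no matter who stores them.
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        # A reader can verify the bytes against the address without
        # trusting whoever served them.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("content does not match its address")
        return data


store = ContentStore()
block = b'{"height": 1, "hash": "0xabc"}'
cid = store.put(block)
```

Because the key depends only on the bytes, any node holding the blob can serve it and the reader can check the result locally, which is what makes "query any CID without caring where it is" workable.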
Right now, data is indexed by Subsquid itself (running substrate-ingest). In the future, anyone will be able to publish their arbitrary datasets.
Indexed data is packaged into height-partitioned Parquet files and sent to an orchestrator/router that distributes them across nodes in the Subsquid network. This orchestrator takes into account dataset durability, response times, geographic distribution, ...
Users send (and pay for) queries to the Subsquid network (via a gateway or contract?), and the gateway selects the nodes that run them. The nodes run the query (with DuckDB) and send back the results.
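As a rough sketch of what height partitioning could look like, the snippet below maps a block height to the Parquet file that would contain it. The partition size, the `blocks/` prefix, and the file naming are all assumptions made up for this example; the notes above do not say how Subsquid actually lays out its files.

```python
def partition_for(height: int, partition_size: int = 100_000) -> str:
    """Return the (hypothetical) Parquet path holding the given block height."""
    if height < 0:
        raise ValueError("block height must be non-negative")
    # Round down to the first height covered by this partition.
    start = (height // partition_size) * partition_size
    end = start + partition_size - 1
    return f"blocks/{start:010d}-{end:010d}.parquet"


def partitions_for_range(first: int, last: int, partition_size: int = 100_000) -> list[str]:
    """List every partition file a query over heights [first, last] must read."""
    start = (first // partition_size) * partition_size
    return [
        partition_for(h, partition_size)
        for h in range(start, last + 1, partition_size)
    ]
```

With a scheme like this, a router only needs the height range of a query to know which files (and therefore which nodes) are involved.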
Subsquid Labs maintains public Archive endpoints and offers batch access via the Squid SDK free of charge.
Questions
How can you join across different heights/datasets that live on different machines? This will need a proper decentralized query engine (perhaps something where Substrait, DataFusion and Ballista can help!).
How do you guarantee fast response times?
Is there any mechanism by which commonly accessed data gets replicated across more nodes?
You can ATTACH to a remote DuckDB database! There might be a world where a bunch of people publish their small/medium databases and people just attach to them.
Also from datonic/datadex#22 (comment).