
Bacalhau project report 20220128

lukemarsden edited this page Jan 29, 2022 · 13 revisions

Kai and I have made significant progress this week on understanding the landscape (Filecoin & IPFS), have developed a more complete vision for Bacalhau, and updated our prototype to do execution and a first pass on what will drive our reputation system (see below).

The design has been updated to include:

  • FVM based scheduler - implementing the job request & scheduler logic on the chain in FVM will make it easier to issue a coin, as well as providing persistence for job metadata (which our libp2p based prototype doesn't have). In advance of FVM being ready, we will prototype it in EVM.
  • Trace based reputation layer - we did a deep dive into how to establish confidence/trust in off-chain computation in an untrusted network. Several ideas were considered and rejected: zk-SNARKs, a checking function, a ransomware approach, and homomorphic encryption (we can explain why on request). We settled on a reputation layer. We also did a deep dive into the challenges around determinism: even with WASM, entropy is a killer, so we can't depend on determinism (we can explain more on request). Then we had a breakthrough idea: rather than hashing the output of the job to compare in a reputation system, we fingerprint the execution trace of the program. In particular, we need a method for inspecting a running VM, container or WASM process to determine a fingerprint that would be extremely hard to synthesize without actually running it. It's like using a variant of the halting problem to our advantage. The initial execution trace we'll use is CPU and memory usage over time; however, many other trace variables, such as the entropy profile of the output, can be used to increase our confidence in the result. We could call this approach "evidence of work" rather than "proof of work" :)
  • Storage - we will read from IPFS initially, and eventually implement the desired behavior of "read from IPFS if available, otherwise submit a retrieval deal to download the data from Filecoin". Of course, when Bacalhau is running on the same node as a Filecoin miner who has an unsealed copy of the file, it will want to preferentially fetch the file from the local cache - this is a later optimization but key to achieving the goal of "compute over data".
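The storage lookup order described in the last bullet can be sketched as a simple fallback chain. This is only an illustration: the three fetcher callables and the function name `fetch_input` are hypothetical placeholders, not real IPFS or Filecoin client APIs, and each is assumed to return `None` when it cannot serve the content.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical fetcher type: takes a content ID, returns the bytes,
# or None when that source cannot serve it.
Fetcher = Callable[[str], Optional[bytes]]


def fetch_input(
    cid: str,
    local_unsealed: Fetcher,
    ipfs: Fetcher,
    filecoin_retrieval: Fetcher,
) -> Tuple[str, bytes]:
    """Try the sources from cheapest to most expensive.

    Returns (source_name, data) for the first source that has the content.
    """
    sources: List[Tuple[str, Fetcher]] = [
        ("local-unsealed", local_unsealed),          # same-node miner copy
        ("ipfs", ipfs),                              # read from IPFS if available
        ("filecoin-retrieval", filecoin_retrieval),  # otherwise a retrieval deal
    ]
    for name, fetch in sources:
        data = fetch(cid)
        if data is not None:
            return name, data
    raise LookupError(f"no source could provide {cid}")
```

Passing the fetchers in as arguments keeps the ordering logic testable before any real storage client exists.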

Overall, this system feels like it will fit together and work, and we are excited to have a more complete plan.

We extended the existing libp2p based prototype to execute jobs in Firecracker and capture traces of their CPU and memory usage. You can imagine here how we can apply signal processing techniques to calculate the similarity between the traces, treated as "fingerprints", to drive the reputation system.
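As one possible illustration of comparing traces, the sketch below resamples two CPU (or memory) traces to a common length and computes their Pearson correlation. The choice of Pearson correlation, the function names and the sample values are all assumptions made for this example; the prototype may use a different similarity measure.

```python
from math import sqrt
from typing import List, Sequence


def resample(trace: Sequence[float], n: int) -> List[float]:
    """Linearly interpolate a trace onto n evenly spaced points (n >= 2)."""
    if not trace:
        raise ValueError("trace must not be empty")
    if len(trace) == 1:
        return [float(trace[0])] * n
    out = []
    for i in range(n):
        pos = i * (len(trace) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(trace) - 1)
        out.append(trace[lo] + (pos - lo) * (trace[hi] - trace[lo]))
    return out


def similarity(a: Sequence[float], b: Sequence[float], n: int = 64) -> float:
    """Pearson correlation of the two resampled traces, in [-1, 1]."""
    x, y = resample(a, n), resample(b, n)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    if denom == 0:
        return 0.0  # a flat trace carries no shape information
    return sum(p * q for p, q in zip(dx, dy)) / denom


# Made-up CPU percentages: two runs of the same job, and a flat "guess".
run_a = [5, 5, 80, 90, 85, 5, 5, 60, 5]
run_b = [6, 4, 78, 92, 83, 6, 4, 58, 6]
guess = [50] * 9
print(similarity(run_a, run_b), similarity(run_a, guess))
```

Correlation ignores absolute scale, so two machines of different speeds can still agree on the shape of a trace; a real system would probably also need to handle time shifts and stretching.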

Demo! https://user-images.githubusercontent.com/264658/151596920-79207ea6-56d2-428e-9691-e434e2a4adc0.mp4

For example, here are three traces of the following job: unzip <5 million row csv file>; sleep 10; sed 's/Office Supplies/Booze/' -i data.csv (10 times with 2 second sleeps in between):

[Images: three trace plots, boom1-2622846, boom1-2635569 and boom1-2647782]

Note the memory increases while the unzip command is happening - and the CPU spiking happens in a way that would be hard to predict from static analysis (i.e. without running the command).

You can see how these can be compared (using encryption between storage nodes and clients so that storage nodes can't just copy each other's traces until client job validation is complete) in order to implement a bidirectional reputation system.
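To make the "no copying before validation" idea concrete, here is a minimal sketch using a hash commitment (commit, then reveal). Note that this is a swapped-in technique chosen for illustration, not the encryption scheme mentioned above, and the function names and the JSON serialization of the trace are assumptions.

```python
import hashlib
import json
import secrets
from typing import List, Tuple


def commit(trace: List[float]) -> Tuple[str, bytes]:
    """Return (commitment, nonce). Only the commitment is published at first."""
    nonce = secrets.token_bytes(16)  # random salt so traces cannot be guessed
    payload = json.dumps(trace).encode()
    return hashlib.sha256(nonce + payload).hexdigest(), nonce


def verify(commitment: str, nonce: bytes, trace: List[float]) -> bool:
    """Check a revealed (nonce, trace) pair against the earlier commitment."""
    payload = json.dumps(trace).encode()
    return hashlib.sha256(nonce + payload).hexdigest() == commitment


# A node commits to its trace, then reveals it once all commitments are in.
commitment, nonce = commit([5.0, 80.0, 90.0, 5.0])
print(verify(commitment, nonce, [5.0, 80.0, 90.0, 5.0]))
```

Because every node is bound to its trace before any trace is revealed, a node that did not run the job has nothing to copy at commit time.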
