Bacalhau project report 20221021

Kai Davenport edited this page Oct 21, 2022 · 1 revision

Lotus publisher 🌸

Excellent progress has been made by Will on the Lotus publisher, which means we are one step closer to Bacalhau enabling compute providers to create storage deals directly. We've managed to get a local testnet Bacalhau node to publish a deal to the local testnet Lotus network, and for that deal to reach StorageDealActive status. This is a huge step forward and means we are closer to onboarding more data to Filecoin using Lotus directly.

A few more useful Lotus related issues have been completed:

  • discover the miner along with some reasonably arbitrary selection #925
  • Update the Lotus publisher to use the API #912
  • Updated the current test to retrieve data from Lotus

Airflow operator 👷

We've got two main tracks for DAG support in Bacalhau:

  • integrate with existing DAG systems
  • build a new DAG system within bacalhau

We are currently focusing on #1, and Enrico has been working on an Airflow operator for Bacalhau. To make this work, he has delved deep into the land of XComs to understand how we can pass the output of one Bacalhau job to the input of the next using Airflow.
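To illustrate the XCom idea in miniature, here is a hedged sketch of the pattern: the first task pushes its job's result CID, and the next task pulls that CID as its input. This is plain Python, not the actual operator - `submit_job` is a hypothetical stand-in for submitting a Bacalhau job, and the `xcom` dict stands in for Airflow's XCom store.

```python
import hashlib

def submit_job(image, inputs=None):
    """Hypothetical stand-in for submitting a Bacalhau job.

    A real implementation would call the Bacalhau API (or the CLI),
    wait for the job to complete, and return the published result CID.
    Here we just derive a fake-but-deterministic CID-like string.
    """
    payload = f"{image}:{inputs or 'no-input'}".encode()
    return "Qm" + hashlib.sha256(payload).hexdigest()[:16]

xcom = {}  # stands in for Airflow's XCom store

def stage_one():
    # In a real Airflow task: ti.xcom_push(key="cid", value=cid)
    cid = submit_job("ghcr.io/example/preprocess:latest")
    xcom["stage_one_cid"] = cid

def stage_two():
    # In a real Airflow task: upstream = ti.xcom_pull(key="cid")
    upstream_cid = xcom["stage_one_cid"]
    return submit_job("ghcr.io/example/train:latest", inputs=upstream_cid)

stage_one()
result = stage_two()
print(result)
```

The key design point is that only the CID crosses the task boundary - the data itself stays on IPFS/Filecoin, which keeps the XCom payload tiny regardless of how large each stage's output is.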

This is incredibly exciting - if you currently have an Airflow pipeline where each stage can be Dockerized, your entire pipeline could now run on Bacalhau, and the results of each stage would be published to Filecoin!

Example Bacalhau Airflow Operator

Re-enable and add resilience for Estuary publisher 🌳

Simon has done some great work making our Estuary publisher more resilient, and we now have the concept of a fallback publisher.

We have also re-enabled the Estuary publisher as the default for Bacalhau jobs - this means the results of jobs will be published to Filecoin via Estuary!
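The fallback-publisher idea can be sketched in a few lines: try the primary publisher and, if it fails, hand the result to a secondary one. This is an illustrative sketch only - the class and function names below are hypothetical, not Bacalhau's actual (Go) interfaces, and the "primary fails" behaviour is simulated.

```python
class PublishError(Exception):
    """Raised when a publisher cannot publish a result."""

class FallbackPublisher:
    """Try the primary publisher; on failure, use the fallback."""

    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback

    def publish(self, result):
        try:
            return self.primary(result)
        except PublishError:
            # Primary failed (e.g. an Estuary outage) - fall back.
            return self.fallback(result)

def flaky_estuary(result):
    # Simulate the primary publisher being unavailable.
    raise PublishError("estuary unavailable")

def ipfs_publish(result):
    return f"ipfs://{result}"

publisher = FallbackPublisher(flaky_estuary, ipfs_publish)
published = publisher.publish("bafy123")
print(published)  # → ipfs://bafy123
```

The nice property of this pattern is that job results still get published somewhere even when the preferred backend is down, rather than the job failing outright.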

WASM executor improvements and demo 🌅

Simon has also been making great progress on the WASM executor - throwing various workloads at it and really testing the limits of WASM workloads on Bacalhau.

There is an awesome demo of a WASM job using a Rust library to do seam carving on an image:

So here is a slightly less trivial WASM example: I created a stable diffusion image on Bacalhau using the prompt "underwater landscape with cod". I then fed the resulting IPFS CID into a WASM job that does a "content-aware shrink", i.e. it shrinks the width of the image by 25% whilst trying to preserve the important subjects. Here are the results – on the left is the original output from stable diffusion, in the middle is the shrunk output, and on the right are the seams in the original that were removed. This is the most advanced thing we've run on mainnet with WASM yet!

WASM image seam remover example

Stable diffusion examples running on GPUs 👀

Phil has been hard at work creating a GPU-enabled version of stable diffusion (shoutout to Vedant, who created the original example).

bacalhau docker run \
  --gpu 1 \
  ghcr.io/bacalhau-project/examples/stable-diffusion-gpu:0.0.1 -- \
  python main.py --o ./outputs --p "cod swimming through data"

This is an awesome demo of how Bacalhau can consume GPUs to run generative AI workloads.

Stable diffusion running on GPUs

Speaking of GPUs 💻

When adding more GPUs to the network, we came across a bug that prevented jobs from being scheduled to a GPU node. We initially thought this was a bug caused by network latency, because the second GPU we were adding was in Europe and the other machines were in the US.

To test this theory - Kai wrote some test tooling that:

  • lets us create a multi-node noop stack on a single machine
  • simulates network latency between nodes with an artificial delay in the transport
  • configures a network with a combination of GPU nodes and CPU nodes
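The latency-simulation piece of that tooling boils down to wrapping the transport so every message is delayed before delivery. The real tooling is written in Go inside Bacalhau; the following is just an illustrative Python sketch of the idea, with made-up names.

```python
import time

class DelayedTransport:
    """Wrap a message handler with an artificial delivery delay,
    simulating geographically distant nodes in an in-process stack."""

    def __init__(self, handler, delay_seconds):
        self.handler = handler
        self.delay_seconds = delay_seconds

    def send(self, message):
        time.sleep(self.delay_seconds)  # simulated network latency
        return self.handler(message)

def node_handler(message):
    # Stand-in for a node processing an incoming bid/message.
    return f"ack:{message}"

transport = DelayedTransport(node_handler, delay_seconds=0.05)
start = time.monotonic()
reply = transport.send("bid")
elapsed = time.monotonic() - start
print(reply, elapsed)
```

Because the whole "network" runs in one process, you can dial the delay up and down and reproduce latency-sensitive scheduling behaviour deterministically on a single machine - which is exactly how this bug was ruled out as latency-related.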

With these awesome test tools - we found that the bug was not related to network latency at all but instead was an edge case where bids were being "stuck" and nodes then not releasing their resources.

Shoutout to Walid, Phil and Kai for their work fixing this!

Now we have some useful test tooling, the bug is fixed, and we can start adding more GPUs to the network!

Various bug fixes and tasks 🔍

  • Examples CI: make current notebooks pass, remove time hogs to speed up execution
  • #917 Fixing a bug around size limits on our context upload tarballs
  • #923 Populate Estuary API keys when deploying from Terraform
  • #913 test for concurrent gpus
  • #894 Avoid reusing cancelled context in cleanup
  • #904 Add resilience to Estuary publishing
  • #910 Make Language jobs use Estuary publisher by default
  • #912 Update the Lotus publisher to use the API

What's next? ⏭️

  • resilience and performance tasks
  • more examples ready for Lisbon
  • Lotus publisher
  • WASM executor features (e.g. program arguments)
  • design doc for FIL+ integration