
Bacalhau project report 20220701

lukemarsden edited this page Jul 4, 2022 · 11 revisions

Latest demo!

Demo shows Python/WASM backend, see below...

bacalhau-demo-july-2022-small.mp4

Stress testing, reliability & performance!

This week we finished the stress-testing framework we started last week, ran it, and addressed a load of reliability and performance issues. We're really happy with the results; take a look for yourself:

Scale & Reliability

When we first ran the stress test against the code, it panicked constantly, as you might expect for code that hasn't yet faced the fires of high traffic.

We addressed all the panics, and started measuring with tiny jobs to see what sort of QPS & performance we could get!

Using noop executor to really stress the internals

We found that just `docker run busybox /bin/true` on its own was taking around 3 seconds, so we adapted devstack to support a special "noop" execution mode, in which the actual execution of a job is a no-op. This let us really stress the libp2p transport and the scheduler.
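The noop mode can be pictured as an executor that satisfies the same interface as a real one but does no work, so any remaining latency must come from the transport and the scheduler. This is a hypothetical sketch; the class and method names are illustrative, not Bacalhau's actual API.

```python
# Illustrative sketch of a "noop" executor: same interface as a real
# executor, but it "completes" every job instantly. With this plugged in,
# measured job latency reflects only transport + scheduling overhead.

class DockerExecutor:
    def run_job(self, job):
        # A real implementation would pull an image and run a container;
        # elided here.
        raise NotImplementedError

class NoopExecutor:
    """Accepts every job and completes it immediately, doing no work."""
    def run_job(self, job):
        return {"status": "complete", "stdout": ""}

def execute(job, noop=False):
    # devstack-style switch: use the noop executor when stress testing
    executor = NoopExecutor() if noop else DockerExecutor()
    return executor.run_job(job)

result = execute({"id": "job-1"}, noop=True)
print(result["status"])  # -> complete
```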


And after all our fixes and hardening, this is the result!

We are able to run 10,000 jobs at a concurrency level of 10 and have 100% of them succeed, with each job taking on average 630ms to complete!
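As a back-of-envelope check on those numbers, with 10 jobs in flight at a time and 630ms average latency per job:

```python
# Back-of-envelope throughput from the stress-test figures above.
jobs = 10_000
concurrency = 10
avg_latency_s = 0.63

throughput = concurrency / avg_latency_s          # jobs completed per second
total_time_s = jobs * avg_latency_s / concurrency # wall-clock time for the run

print(f"~{throughput:.1f} jobs/sec, ~{total_time_s / 60:.1f} minutes total")
```

That works out to roughly 16 jobs/second, so the whole 10,000-job run takes on the order of ten minutes.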

We are pretty happy with this initial level of performance and stability! Of course there's lots more work to do here, but it already significantly exceeds the goal:

Goal for phase: Extend the system to work 99% of the time. Submit 10,000 jobs and show that at most 100 of them fail. It might take several minutes to resolve each job.

We have done significant work on the Terraform setup to enable us to launch large scale testing clusters on Google Cloud. We're not quite at the goal for testing yet:

Goal for phase: Extend the system to work when 100 nodes and have access to 100TB of data in IPFS that jobs are operating on. Still, the error rate might be high.

But between the scale-testing progress (above) and the Terraform support now in place, we are close to being able to run real-world stress tests at that scale.

We have also developed reliability mechanisms as described in the project plan:

  • max job cpu & memory usage
  • total system cpu & memory usage
  • introduce compute node config
  • use limits as part of job selection
  • control loop to bid on jobs previously seen where there was not enough capacity to run

This means that nodes will no longer take on more work than they can handle, which had been causing nodes to crash.
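The mechanisms in the list above can be sketched as a node that only bids on jobs fitting within its configured CPU/memory limits, and a control loop that re-bids on previously seen jobs once capacity frees up. This is a minimal illustration; the `Node` class and its methods are hypothetical, not Bacalhau's real types.

```python
# Hypothetical sketch of capacity-limited job selection plus the re-bid
# control loop. All names here are illustrative.

class Node:
    def __init__(self, total_cpu, total_mem_gb):
        # compute node config: hard limits on total resource usage
        self.total_cpu = total_cpu
        self.total_mem_gb = total_mem_gb
        self.used_cpu = 0
        self.used_mem_gb = 0
        self.deferred = []  # jobs seen but skipped for lack of capacity

    def has_capacity(self, job):
        # use limits as part of job selection
        return (self.used_cpu + job["cpu"] <= self.total_cpu
                and self.used_mem_gb + job["mem_gb"] <= self.total_mem_gb)

    def maybe_bid(self, job):
        if self.has_capacity(job):
            self.used_cpu += job["cpu"]
            self.used_mem_gb += job["mem_gb"]
            return True
        self.deferred.append(job)  # remember it for the control loop
        return False

    def job_finished(self, job):
        self.used_cpu -= job["cpu"]
        self.used_mem_gb -= job["mem_gb"]

    def retry_deferred(self):
        # control loop: bid on previously seen jobs that didn't fit before
        pending, self.deferred = self.deferred, []
        return [j for j in pending if not self.maybe_bid(j)]
```

With a 4-CPU node, a second 3-CPU job is deferred rather than accepted, and is picked up by `retry_deferred()` once the first job finishes.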

Overall, huge progress on scale & reliability.

WASM Python demo!

We also got WASM working with Python this week. Two commands now work. The first runs an inline command against the (soon-to-be) deterministic Python WASM backend:

```
bacalhau run python --deterministic -c "print(1+1)"
```

The second runs against a file:

```
bacalhau run python --deterministic main.py
```


The latter is brilliantly simple for users, since they can just point Bacalhau at an existing Python file (or files) they have, and under the hood it is quite clever. The CLI sends the current working directory context to the requestor node over the REST API, and the requestor node pins it to IPFS. The executor then downloads the context from IPFS and exposes it to the Python WASM executor. That executor wraps the Docker executor with our new pyodide Docker image (as an implementation detail of the Python WASM executor), which executes the Python code in the pyodide/WASM context inside the container, using the existing Docker functionality to attach the code to the container via IPFS volume mounts. Nice composition of executors!
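The composition described above can be sketched as one executor delegating to another. This is an illustrative outline only; the class names, the `bacalhau/pyodide` image name, and the mount path are assumptions, not the real implementation.

```python
# Hedged sketch of executor composition: the Python WASM executor wraps
# the Docker executor with a pyodide image, mounting the IPFS-pinned job
# context into the container. All names here are illustrative.

class DockerExecutor:
    def run(self, image, entrypoint, volumes):
        # A real implementation would start a container with IPFS volume
        # mounts; here we just report what would run.
        return f"ran {entrypoint!r} in {image} with mounts {sorted(volumes)}"

class PythonWasmExecutor:
    """Delegates to the Docker executor (an implementation detail)."""
    PYODIDE_IMAGE = "bacalhau/pyodide"  # illustrative image name

    def __init__(self, docker):
        self.docker = docker

    def run(self, context_cid, command):
        # The job context (the user's cwd) was pinned to IPFS by the
        # requestor node; attach it to the container as a volume.
        volumes = {context_cid: "/job"}
        return self.docker.run(self.PYODIDE_IMAGE, command, volumes)

executor = PythonWasmExecutor(DockerExecutor())
print(executor.run("QmExampleCID", "python3 main.py"))
```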

Other pieces

We've made great progress in several other areas this week too:

  • We now serve prometheus job-related metrics, useful for observability of the production network (Guy)
  • We generate a client (CLI) keypair and use it to sign API requests, so that clients have an identity in the network (you'll be able to ask "show me my own jobs"); these API requests are now signed and verified (Guy)
  • We are starting a migration to CircleCI from GitHub Actions (Dave)
  • We've cleaned up our CI system so we no longer need to use an M1 Mac runner (Dave)
  • We've scaled up the production network:
    • We’ve upgraded the production machines to 16 CPUs and 64GB RAM per node
    • We’ve upgraded the disks to 500GB of boot disk (used for docker images and tmp files for jobs) and 500GB of IPFS disk
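The signed-API-request flow from the list above is, in outline, sign-on-the-client then verify-on-the-server. Bacalhau uses a real asymmetric client keypair; the sketch below substitutes a symmetric HMAC from the standard library purely to keep the example self-contained, so it shows the flow but not the actual cryptography.

```python
# Simplified sign/verify flow for API requests. NOTE: real Bacalhau uses
# an asymmetric client keypair; HMAC is used here only for brevity.
import hashlib
import hmac
import json

def sign_request(key: bytes, payload: dict) -> str:
    # Canonicalise the payload so both sides hash identical bytes.
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()

def verify_request(key: bytes, payload: dict, signature: str) -> bool:
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(sign_request(key, payload), signature)

key = b"client-secret"  # illustrative; a keypair in the real system
request = {"client_id": "abc123", "job": {"image": "busybox"}}
signature = sign_request(key, request)
print(verify_request(key, request, signature))  # -> True
```

A verifying node that knows the client's identity can then attribute jobs to it, which is what makes "show me my own jobs" possible.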

What's next

  • Video demo for Dave's keynote on Monday
  • Stress tests against terraform clusters on GCP
  • Make wasm actually deterministic
  • Raw wasm mode (non-python flavor)
  • Support requirements.txt in python wasm+docker flavors
  • --determinism=false for bacalhau run python with plain docker
  • (multi-file) Extend that system to work when jobs consist of many (thousands) of files, rather than a single file, and we want to distribute the work across the network and run it in parallel.
  • (scale-2) Extend the system to work when 1000 nodes are participating, over 1PB of data. Resolving jobs now may take a very long time (10s of minutes).