
Bacalhau project report 20220711

lukemarsden edited this page Jul 11, 2022 · 3 revisions

Goals for July

The main goals for July are implementing parallelism & sharding, plus further performance and scale work. We will continue pushing the boundaries of our code with increasingly large-scale testnets, and we are designing the main feature for this month:

Design for parallelism & sharding

One of the original design goals for the network was to be able to run "embarrassingly parallel" workloads: workloads which parallelize easily across a partitioning of the dataset. This is also known as "data parallel". By way of example, if you have 10,000,000 images and you want to resize them, you can do this slowly on one machine, or up to ~1000x faster if you split the images into 1000 groups of 10,000 and process the groups in parallel. Of course, this depends on data locality to some extent, but even if each node needs to download its data before it can process the job, parallelizing across 1000 nodes benefits from the aggregate bandwidth available across those machines.
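The partitioning described above can be sketched in a few lines. This is a scaled-down, hypothetical illustration (10,000 items instead of 10,000,000, and a round-robin split); it is not Bacalhau's actual sharding code.

```python
def make_shards(items, shard_count):
    """Split items into shard_count roughly equal groups (round-robin)."""
    shards = [[] for _ in range(shard_count)]
    for i, item in enumerate(items):
        shards[i % shard_count].append(item)
    return shards

# Scaled-down analogue of the example above: 10,000 "images" in 10 shards,
# so each shard of 1,000 could be resized on a different node in parallel.
shards = make_shards(list(range(10_000)), 10)
assert len(shards) == 10 and all(len(s) == 1000 for s in shards)
```

Each shard can then be dispatched to a separate compute node, which is where the speedup comes from.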

Kai and I have worked together on the design for this feature this week. The basic design sketch is as follows:

  • First we implement a glob pattern: in the style of Spark and Pachyderm, this allows the user to specify how they want their job to be parallelized, if at all. For example, a glob pattern of "/" means "everything in this CID should run in one job", while "/*" means "every top-level file or directory can be split into a separate job". It also allows users to filter which data gets processed; for example, "/2022/01/*.csv" would create a job per CSV file, but only for January 2022.
  • Then we implement parallelization: in particular, make it possible for a pattern whose glob results in N shards to be split across N sub-jobs. Compute nodes can bid on sub-jobs individually.
  • Then we implement batching: support /* on 10K files being batched by 1000, so you get 10 jobs instead of 10K jobs (for example). This will reduce the scheduling impact. Batch size would also be a per-job parameter.
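The three steps above can be sketched together: a glob pattern filters a file listing into shards, and batching then groups those shards so the scheduler sees fewer sub-jobs. The file names and the batch_size parameter here are illustrative assumptions, not Bacalhau's actual API.

```python
import fnmatch

files = [
    "/2022/01/a.csv", "/2022/01/b.csv",
    "/2022/02/c.csv", "/2021/12/d.csv", "/2022/01/notes.txt",
]

# Step 1 (glob pattern): select which paths become shards.
shards = [f for f in files if fnmatch.fnmatch(f, "/2022/01/*.csv")]

# Step 3 (batching): group shards so N files become ceil(N / batch_size)
# sub-jobs, reducing scheduling overhead. Compute nodes would bid on
# each batch (sub-job) individually, as in step 2.
def batch(shards, batch_size):
    return [shards[i:i + batch_size] for i in range(0, len(shards), batch_size)]

jobs = batch(shards, batch_size=2)
```

With two matching CSV files and a batch size of 2, this yields a single sub-job containing both shards.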

We will use IPLD throughout to make the system more sympathetic to IPFS and the wider PL technology stack.

Design for scale testing

The July goal for scale testing is 1000 nodes with access to 1PB of data. Getting 1000 VMs on Google Cloud is actually quite hard, since quota requests that large are not granted, so we plan to develop a "hybrid" devstack: 1000 bacalhau processes on a libp2p network, each with its own IPFS server, packed onto a smaller number of chunky machines (say, 10 or 100). We believe this will still give a representative test of network behavior. We also plan to distribute the network across regions to expose latency-induced issues and instability.
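The hybrid layout above amounts to packing many node processes per machine, each with its own ports so every process can run its own libp2p and IPFS endpoints. A minimal sketch, assuming evenly divisible counts and an arbitrary base port (both illustrative, not the real devstack configuration):

```python
def place_nodes(total_nodes, machines, base_port=20000):
    """Assign total_nodes bacalhau processes across machines,
    giving each process on a machine a distinct local port."""
    per_machine = total_nodes // machines
    layout = {}
    for m in range(machines):
        # Ports are per-machine: each machine has its own address,
        # so the same port range can be reused on every machine.
        layout[f"machine-{m}"] = [base_port + i for i in range(per_machine)]
    return layout

# 1000 processes across 10 chunky machines -> 100 processes each.
layout = place_nodes(1000, 10)
```

The point is that network behavior (gossip, bidding, latency) depends on the number of *processes*, not the number of physical machines, so 100 processes per machine should still exercise a 1000-node network.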

Work done this week

  • Finished migrating the CI system to CircleCI - we are still fighting some flakiness in CI, an ongoing battle we will win!
  • Capacity manager - deals with CPU / Memory & Disk limits in an agnostic way
  • Compute node control loop - means we can bid on jobs after the first time we hear about them (e.g. when enough space frees up for a job we previously heard about)
  • Implement disk space controls on jobs and their volumes (to make sure we don't run jobs that we don't have enough disk space for)
  • Log handling - return logs back to the client even in error cases to help with debugging failed user containers
  • WASM volume mounts - present volumes into the Python WASM runtime
  • Addressed more data races/concurrency issues discovered by the Go race detector
  • Implemented bacalhau version for reporting client & server versions
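The capacity manager and disk-space controls described above boil down to one check: a compute node only bids on a job when the job's CPU, memory, and disk requirements all fit within the node's remaining capacity. A hypothetical sketch of that check; the field names and types are illustrative, not Bacalhau's actual structures.

```python
from dataclasses import dataclass

@dataclass
class Resources:
    cpu: float   # cores
    memory: int  # bytes
    disk: int    # bytes

def fits(available: Resources, required: Resources) -> bool:
    """True if a job's requirements fit within a node's free capacity."""
    return (required.cpu <= available.cpu
            and required.memory <= available.memory
            and required.disk <= available.disk)

node = Resources(cpu=8, memory=16 * 2**30, disk=100 * 2**30)
job = Resources(cpu=2, memory=4 * 2**30, disk=200 * 2**30)
# The disk-space control rejects this job: it needs 200GiB but only
# 100GiB is free, so the node would not bid on it.
assert not fits(node, job)
```

Treating all three dimensions uniformly is what makes the capacity manager "agnostic": adding a new resource type is just another comparison.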

What's next?

  • Focus on July goals: parallelism/sharding & enhanced scale testing
  • Get CI reliably green 💪
  • Work with early users to get feedback & refine product
  • SAME project backend for Bacalhau for running notebooks on the network