Bacalhau project report 20220601

lukemarsden edited this page Jun 1, 2022 · 7 revisions

Launch day! 🚀

Big news for Bacalhau: the production network is now live. And, we are proud to say, on time as planned at the end of May :-)

There is now a 3-node cluster running on Google Cloud Platform with IPFS and Bacalhau running on each node.

You can run jobs against the network using this guide! Please give it a go.

Code & infra changes

The following changes enabled the production network to be deployed:

  • Disabling network access for jobs as a security precaution before going live. Jobs should only need the dependencies baked into their Docker images and the input files mounted from IPFS to produce their output, so they shouldn't need access to the network.

  • Various fixes to make bacalhau serve actually work and pick up the network interface to bind to from the right config parameter.

  • Persisting the libp2p keypair so that nodes can be restarted and not lose their identity on the network. This is necessary since we hardcode the bootstrap node identities in the multiaddresses for the bootstrap list.

  • A new ops folder in the repo containing Terraform code for deploying to GCP

The Terraform is structured in such a way that you can dial up and down the number of nodes, the size of them in terms of CPU and memory (instance type/machine type) and the size of the attached disks simply by editing the production.tfvars file and re-running terraform apply.

The VMs themselves have stateless boot disks. In particular, any state (such as the keypair, and soon the IPFS data directory) is stored on the external attached disks. Therefore, we can easily upgrade IPFS, Bacalhau, or change the parameters of the machines without worrying about losing the boot disk state.
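As a rough illustration of the "persist the identity on the attached disk" idea, here is a minimal load-or-create sketch. It is not the Bacalhau implementation: the function name and the path used in the usage note are hypothetical, and random bytes stand in for a real libp2p private key.

```python
import os
import secrets
from pathlib import Path


def load_or_create_identity(key_path: Path, key_size: int = 32) -> bytes:
    """Return the key stored at key_path, generating it on first boot only.

    Assumption: key_path lives on the external attached disk, so it
    survives the boot disk being replaced.
    """
    if key_path.exists():
        # A restart or upgrade: reuse the existing identity.
        return key_path.read_bytes()

    key_path.parent.mkdir(parents=True, exist_ok=True)
    # Stand-in for real key generation (hypothetical, not a libp2p key).
    key = secrets.token_bytes(key_size)

    # O_EXCL refuses to overwrite an existing key; 0o600 requests owner-only access.
    fd = os.open(key_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(key)
    return key
```

Calling `load_or_create_identity(Path("/data/identity.key"))` on every start would then return the same bytes across restarts, which is what keeps the hardcoded bootstrap peer IDs valid.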

Bootstrap list, multiaddresses and DNS, oh my!

Initially we tried just pointing the nodes to /dns4/bootstrap.production.bacalhau.org to peer with each other, but of course this doesn't include the peer IDs (the hash of their public key), so the nodes weren't able to join up to each other.

Then we tried using dnsaddr like IPFS does:

luke@chunky:~$ dig +short txt _dnsaddr.bootstrap.production.bacalhau.org
"dnsaddr=/ip4/35.245.251.239/tcp/1235/p2p/QmYgxZiySj3MRkwLSL4X2MF5F9f2PMhAE3LV49XkfNL1o3"
"dnsaddr=/ip4/35.245.115.191/tcp/1235/p2p/QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL"
"dnsaddr=/ip4/35.245.61.251/tcp/1235/p2p/QmXaXu9N5GNetatsvwnTfQqNtSeKAD6uCmarbh3LMRYAcF"

But libp2p complains when you give it a /dnsaddr/bootstrap.production.bacalhau.org because it doesn't have a /p2p segment, even though resolving the dnsaddr gives you addresses which do have the /p2p segment in them. If anyone knows how to fix this, please let us know. Failing that, we will go read the go-ipfs code to figure out how it does it.
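To make the mismatch concrete, here is a small sketch that treats the TXT records above as plain strings and looks for the /p2p segment. It is not how libp2p parses multiaddrs: the helper names are made up for illustration, and splitting on "/" is a simplification that only works for simple addresses like these.

```python
DNSADDR_PREFIX = "dnsaddr="


def parse_dnsaddr_txt(records):
    """Extract multiaddr strings from dnsaddr TXT record values."""
    addrs = []
    for record in records:
        value = record.strip().strip('"')  # dig prints TXT values in quotes
        if value.startswith(DNSADDR_PREFIX):
            addrs.append(value[len(DNSADDR_PREFIX):])
    return addrs


def peer_id(multiaddr):
    """Return the value after the /p2p segment, or None if there is none."""
    parts = multiaddr.strip("/").split("/")
    for i in range(len(parts) - 1):
        if parts[i] == "p2p":
            return parts[i + 1]
    return None


# The TXT records shown in the dig output above.
txt_records = [
    '"dnsaddr=/ip4/35.245.251.239/tcp/1235/p2p/QmYgxZiySj3MRkwLSL4X2MF5F9f2PMhAE3LV49XkfNL1o3"',
    '"dnsaddr=/ip4/35.245.115.191/tcp/1235/p2p/QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL"',
    '"dnsaddr=/ip4/35.245.61.251/tcp/1235/p2p/QmXaXu9N5GNetatsvwnTfQqNtSeKAD6uCmarbh3LMRYAcF"',
]

resolved = parse_dnsaddr_txt(txt_records)
unresolved = "/dnsaddr/bootstrap.production.bacalhau.org"

print(peer_id(unresolved))             # None: no /p2p segment before resolution
print([peer_id(a) for a in resolved])  # one peer ID per resolved address
```

The unresolved /dnsaddr address carries no peer ID at all, while every resolved address does, so anything that validates the address before resolving it will reject it.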

So we just hardcoded the bootstrap list in the Bacalhau CLI for now. Turns out IPFS does the same thing anyway.

Deploying the bootstrap nodes was kinda funny, because we first had to deploy the nodes, let them generate their own keys, then update the code and release a new version which had all the peers' public keys and IP addresses in it, then upgrade the production cluster. It felt kinda circular, like a snake eating its own tail :-) but it worked!

Guy Paterson-Jones

Guy has just started working with us on the Bacalhau project, and has hit the ground running! He has already ported the JSON-RPC API to a REST API to make it easier to add OpenTelemetry support. We are looking forward to many more great contributions and value add from Guy. Welcome, Guy! :-)

Demo

Well, try the demo yourself ;-)

https://docs.bacalhau.org/getting-started/installation

Plan for next week

  • Monitoring the nodes and the service
  • CLI: sort by job date and maybe filter the list by job ID
  • Persistence for ipfs on the production nodes
  • Make the nodes bigger (disks, CPU, mem); this requires a quota increase
  • Sketch for architecture diagram for Wes
  • Continue work on WASM/Python FaaS executor