Commit 8b3746a

Add "Remote Persistent Workers" (#219)
* Add "Remote Persistent Workers"
* Add coeuvre
1 parent e3202d5 commit 8b3746a

2 files changed, +123 -0 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ list.

| Last updated | Title | Author(s) alias | Category |
| ------------ | ----- | --------------- | -------- |
| 2021-03-06 | [Remote Persistent Workers](designs/2021-03-06-remote-persistent-workers.md) | [@ulfjack](https://github.com/ulfjack) | Remote Execution |
| 2021-02-18 | [Toolchainifying proto rules](https://docs.google.com/document/d/1go3UMwm2nM8JHhI3MZrtyoFEy3BYg5omGqQmOwcjmNE/edit?usp=sharing) | [@Yannic](https://github.com/Yannic) | Protobuf |
| 2021-02-10 | [Remote Output Service: place bazel-out/ on a FUSE file system](designs/2021-02-09-remote-output-service.md) | [@EdSchouten](https://github.com/EdSchouten) | Remote Execution |
| 2020-10-19 | [Bazel External Dependencies Overhaul](https://docs.google.com/document/d/1moQfNcEIttsk6vYanNKIy3ZuK53hQUFq1b1r0rmsYVg/edit) | [@meteorcloudy](https://github.com/meteorcloudy), [@philwo](https://github.com/philwo), [@Wyverald](https://github.com/Wyverald) | External Repositories |

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
---
created: 2021-03-06
last updated: 2021-03-06
status: To be reviewed
reviewers:
  - bergsieker
  - coeuvre
  - EricBurnett
  - larsrc-google
title: "Remote Persistent Workers"
authors:
  - ulfjack
---

# Abstract

This document proposes that Bazel pass information through the existing
[remote execution protocol](https://github.com/bazelbuild/remote-apis) to a
remote execution system so that the system can run actions using persistent
workers. Remote execution systems that do not support this silently ignore the
additional information and execute these actions in the normal way.

This document does not discuss how the remote execution system implements this
feature, e.g., how to find matching persistent worker processes in a potentially
large distributed system.

In our testing, we have achieved 2x improvements in build time for large
remotely executed builds. We expect further speedups with improvements to the
scheduling algorithm.

# Background

Bazel's
[persistent workers](https://github.com/bazelbuild/bazel/blob/master/site/docs/persistent-workers.md)
significantly improve local build times; the original
[blog post](https://blog.bazel.build/2015/12/10/java-workers.html) indicates an
impressive 4x improvement for Java builds. Unfortunately, workers are not
available with remote execution, making it impossible for a remote execution
system to achieve comparable build times at comparable CPU availability. If
there are a significant number of worker-supported actions on the critical path,
then this also applies to the end-to-end build time.

This document describes a way for Bazel to pass enough information through the
existing [remote execution API](https://github.com/bazelbuild/remote-apis) to a
remote execution system so that the system can run actions using persistent
workers. This is backwards compatible with existing remote execution systems,
which silently ignore the additional information and fall back to normal
execution.

In order to use persistent workers, Bazel rules annotate specific actions as
worker-compatible and also annotate a subset of action inputs as 'tool inputs';
these are files that are required for the persistent worker process. Bazel then
takes the action command line, removes some of the arguments (those starting
with '@', '--flagfile' or '-flagfile', with some additional exceptions), and
appends an additional `--persistent_worker` argument.
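The command-line rewrite described above can be sketched as follows. This is a minimal illustration, not Bazel's actual implementation: the function name `rewrite_for_worker` is hypothetical, and the "additional exceptions" mentioned above are deliberately not modeled.

```python
from typing import List, Tuple

# Prefixes that mark an argument as a parameter/flag file reference.
FLAGFILE_PREFIXES = ("@", "--flagfile", "-flagfile")


def rewrite_for_worker(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split an action command line into worker startup args and removed args.

    Hypothetical helper: argv[0] is assumed to be the executable and is
    always kept. Bazel's additional exceptions are not modeled here.
    """
    kept = [argv[0]]
    removed = []
    for arg in argv[1:]:
        if arg.startswith(FLAGFILE_PREFIXES):
            removed.append(arg)  # later sent to the worker per request
        else:
            kept.append(arg)
    kept.append("--persistent_worker")
    return kept, removed
```

The removed arguments are returned rather than discarded because they are the per-request part of the action; only the kept arguments describe the long-lived worker process.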

Furthermore, Bazel computes a 'worker key' consisting of the names and hashes of
the tool inputs and runfiles (these are implicitly considered tool inputs), the
action's environment variables, the rewritten command line, action mnemonic, as
well as a few internal flags. Actions with equal keys are routed to the same
pool of persistent worker processes for execution. Bazel then uses a simple
protocol over stdin/stdout to send the persistent worker process the parameter
files that were removed earlier, as well as some metadata about the inputs.
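To illustrate the idea of a worker key, here is a minimal sketch that hashes the components listed above into a single string. The function name, the JSON encoding, and the choice of SHA-256 are assumptions made for illustration; Bazel's real key computation differs and also includes internal flags that are omitted here.

```python
import hashlib
import json
from typing import Dict, List


def worker_key(
    tool_inputs: Dict[str, str],  # tool input path -> content hash
    env: Dict[str, str],          # action environment variables
    worker_argv: List[str],       # rewritten command line
    mnemonic: str,                # action mnemonic, e.g. "Javac"
) -> str:
    """Hypothetical worker key: equal components yield equal keys."""
    payload = {
        # Sort map entries so that insertion order does not affect the key.
        "tools": sorted(tool_inputs.items()),
        "env": sorted(env.items()),
        # Argument order is significant, so argv is not sorted.
        "argv": worker_argv,
        "mnemonic": mnemonic,
    }
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()
```

Two actions that share the same tools, environment, startup arguments, and mnemonic get the same key and can therefore share a worker pool; changing any tool hash produces a different key.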

As of 2021-03-06, Bazel supports both a binary protobuf and a JSON protocol, and
also has support for multiplex workers.

# Proposal

In order for a remote system to replicate the steps that Bazel performs, it
requires the same inputs. The environment and command line are already provided
to the remote system. What's missing are the tool inputs and the fact that the
action supports workers.

The remote execution protocol already has a generic 'node properties' field that
can be used to annotate action inputs as tool inputs. In addition, there is a
generic platform definition that can indicate support for workers or multiplex
workers.

In our prototype, we use a node property key of `bazel_tool_input` with an empty
value to indicate that an input file is a tool input. As of 2021-03-06, our
prototype only supports non-multiplex, binary protobuf persistent workers, so
the service assumes that the presence of node properties indicates support for
this.
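The tool-input annotation can be sketched with a simplified stand-in for the protocol's file nodes. The `InputFile` class and both helper functions below are hypothetical and do not mirror the real remote execution API messages; only the `bazel_tool_input` key and its empty value come from the prototype described above.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

TOOL_INPUT_KEY = "bazel_tool_input"


@dataclass
class InputFile:
    """Simplified, hypothetical stand-in for an input file node."""
    path: str
    digest: str
    node_properties: Dict[str, str] = field(default_factory=dict)


def mark_tool_inputs(files: List[InputFile], tool_paths: Iterable[str]) -> None:
    """Client side: tag each tool input with the property and an empty value."""
    tools = set(tool_paths)
    for f in files:
        if f.path in tools:
            f.node_properties[TOOL_INPUT_KEY] = ""


def split_inputs(files: List[InputFile]) -> Tuple[List[InputFile], List[InputFile]]:
    """Server side: separate tool inputs from per-request inputs."""
    tools = [f for f in files if TOOL_INPUT_KEY in f.node_properties]
    others = [f for f in files if TOOL_INPUT_KEY not in f.node_properties]
    return tools, others
```

A service that does not understand the property never looks at `node_properties` and simply treats every file as a regular input, which is the fallback behavior this proposal relies on.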

In addition, our prototype implementation in Bazel sets `persistentWorkerKey`
as a platform option, with the value being the computed worker key. This is used
by the remote execution scheduler to route actions to a matching worker machine.

It would be trivial to extend the prototype to set a `persistentWorkerProtocol`
platform option to indicate the protocol (JSON or protobuf) as well as a
`persistentWorkerMultiplex` platform option to indicate support for multiplex
workers.
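As a rough sketch, the platform options named above could be assembled like this. The function is hypothetical, and the value encodings (`"json"`/`"protobuf"` and `"true"`/`"false"`) are assumptions, since the proposal does not fix them. The result is sorted by name on the understanding that the remote execution API expects platform properties in sorted order.

```python
from typing import Dict, List, Optional, Tuple


def worker_platform_properties(
    key: str,
    protocol: Optional[str] = None,
    multiplex: Optional[bool] = None,
) -> List[Tuple[str, str]]:
    """Build (name, value) platform properties for a worker-capable action."""
    props: Dict[str, str] = {"persistentWorkerKey": key}
    if protocol is not None:
        # Assumed encoding; the proposal only names the two protocols.
        if protocol not in ("json", "protobuf"):
            raise ValueError(f"unknown worker protocol: {protocol}")
        props["persistentWorkerProtocol"] = protocol
    if multiplex is not None:
        # Assumed encoding of the boolean as a string value.
        props["persistentWorkerMultiplex"] = "true" if multiplex else "false"
    return sorted(props.items())
```

Leaving `protocol` and `multiplex` unset reproduces the prototype's behavior, where only `persistentWorkerKey` is sent.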

There are two open issues:
- Bazel supports a `--worker_extra_flag` flag which it uses to non-hermetically
  pass flags to persistent workers. These flags could be passed to the remote
  execution system as well.
- The persistent worker protocol is not formally specified and is currently just
  'whatever Bazel implements'. This is not ideal since we'd like multiple
  implementations that are fully compatible.

# Alternatives considered

- We considered not using persistent workers in the remote execution system.
  However, the benefits are very significant.
- We considered using dynamic execution. Dynamic execution automatically decides
  whether to execute actions locally or remotely, and can even do both in
  parallel. However, this has a number of shortcomings:
  - It requires the local machine to be compatible with the remote machines;
    e.g., it cannot be used if the local machine is a Mac and the remote
    execution system runs on Linux (or if one is x86_64 and the other ARM64).
    This can also be an issue if the remote execution system runs actions
    inside Docker with a different OS version or set of tools.
  - It requires a beefy local machine, since the local machine is now again on
    the critical path compared to full remote execution.
  - It interacts badly with 'build without the bytes'.

# Backward-compatibility

Remote execution systems that do not support these node and platform properties
can silently ignore them and execute these actions in the normal way.
