"X actions, Y running" is misleading with the dynamic spawn scheduler #7345

Open
jmmv opened this Issue Feb 4, 2019 · 0 comments

jmmv commented Feb 4, 2019

[ At this point I'm not fully convinced my theory is correct so blaming the dynamic spawn scheduler might be wrong. ]

When using the dynamic spawn scheduler with e.g. --jobs=100, it's common to see Bazel report progress messages of the form "100 actions, 23 running", whereas a purely remote build typically shows "100 actions running". Thus the question arises: "If there are 100 possible actions and running purely remotely shows that they can run in parallel, why isn't that the case with the dynamic spawn scheduler?"

I believe action scheduling is working fine and that this is a problem with UI reporting.

We use ActionStatusMessage to report action status changes via the event bus. In particular, we report "scheduling" and "running" messages: the former indicates that the action is waiting for some kind of resource and the latter indicates that the action is truly running. The UI in ExperimentalStateTracker inspects the active actions, tallies the ones reported as "running", and prints the totals in the offending message.

And hence the problem appears. In dynamic scheduling, we have two strategies running at once for each action. Each strategy reports its own view of the scheduling/running state of the action... and therefore the statuses overwrite each other. For example: the remote strategy may have reported that 100 actions are already running, but soon after the sandboxed strategy comes in and reports that those same 100 actions are waiting for local resources (thus scheduling)... and voilà, the UI can end up reporting 0 actions as running. See e.g. https://source.bazel.build/bazel/+/master:src/main/java/com/google/devtools/build/lib/sandbox/AbstractSandboxSpawnRunner.java;l=71 .
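To make the suspected failure mode concrete, here is a minimal, self-contained sketch of a tracker that keeps only the most recent status per action. It is not Bazel's code: LastWriterWinsTracker, Status, report and the action ids are all hypothetical names, and the progress string only approximates the real message. It assumes, as the theory above does, that the last status reported for an action wins regardless of which strategy sent it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, simplified model of a UI tracker; not Bazel's ExperimentalStateTracker. */
public class LastWriterWinsTracker {
    public enum Status { SCHEDULING, RUNNING }

    // Latest reported status per action id. A later report from any
    // strategy overwrites an earlier one (the assumed bug).
    private final Map<String, Status> latest = new LinkedHashMap<>();

    public void report(String actionId, Status status) {
        latest.put(actionId, status);
    }

    public long running() {
        return latest.values().stream().filter(s -> s == Status.RUNNING).count();
    }

    public String progress() {
        return latest.size() + " actions, " + running() + " running";
    }

    public static void main(String[] args) {
        LastWriterWinsTracker tracker = new LastWriterWinsTracker();

        // The remote strategy reports all 100 actions as running.
        for (int i = 0; i < 100; i++) {
            tracker.report("action-" + i, Status.RUNNING);
        }
        System.out.println(tracker.progress()); // 100 actions, 100 running

        // The sandboxed strategy then reports the same actions as waiting
        // for local resources, overwriting the "running" state.
        for (int i = 0; i < 100; i++) {
            tracker.report("action-" + i, Status.SCHEDULING);
        }
        System.out.println(tracker.progress()); // 100 actions, 0 running
    }
}
```

If the real tracker behaves like this model, the remote executions are still in flight after the second loop even though the tally says otherwise.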

If this theory is correct, we should start by fixing the UI to understand that an action cannot regress from running to a past state. But this seems like a hack given that we are handling a case that "should never happen", right? Instead, and given that we are moving towards a world where multiple strategies can be active at once, it'd be great if the UI could break down progress reporting by strategy.
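As a rough illustration of the per-strategy idea, the sketch below keys statuses by both action and strategy, so a "scheduling" report from one strategy cannot hide a "running" report from another. Everything here is an assumption made for illustration: PerStrategyTracker, the strategy names "remote" and "sandboxed", and the output format are hypothetical and do not correspond to existing Bazel classes or messages.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of progress tracking broken down by strategy. */
public class PerStrategyTracker {
    public enum Status { SCHEDULING, RUNNING }

    // actionId -> (strategy name -> latest status reported by that strategy)
    private final Map<String, Map<String, Status>> statuses = new HashMap<>();

    public void report(String actionId, String strategy, Status status) {
        statuses.computeIfAbsent(actionId, k -> new HashMap<>()).put(strategy, status);
    }

    /** An action counts as running if at least one strategy says it is running. */
    public long running() {
        return statuses.values().stream()
            .filter(byStrategy -> byStrategy.containsValue(Status.RUNNING))
            .count();
    }

    /** Number of actions each strategy currently reports as running, sorted by name. */
    public Map<String, Long> runningByStrategy() {
        Map<String, Long> counts = new TreeMap<>();
        for (Map<String, Status> byStrategy : statuses.values()) {
            byStrategy.forEach((strategy, status) -> {
                if (status == Status.RUNNING) {
                    counts.merge(strategy, 1L, Long::sum);
                }
            });
        }
        return counts;
    }

    public String progress() {
        return statuses.size() + " actions, " + running() + " running " + runningByStrategy();
    }

    public static void main(String[] args) {
        PerStrategyTracker tracker = new PerStrategyTracker();
        for (int i = 0; i < 100; i++) {
            tracker.report("action-" + i, "remote", Status.RUNNING);
            tracker.report("action-" + i, "sandboxed", Status.SCHEDULING);
        }
        System.out.println(tracker.progress()); // 100 actions, 100 running {remote=100}
    }
}
```

Keeping one status per (action, strategy) pair also makes the "cannot regress" special case unnecessary in this model, since each strategy only ever overwrites its own state.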

CC @philwo @aehlig .
