[skymeld-dogfood] build unexpectedly exits causing following commands to wait for non-existing command to complete #19211

Closed
BalestraPatrick opened this issue Aug 9, 2023 · 12 comments
Labels
awaiting-user-response Awaiting a response from the author team-Performance Issues for Performance teams type: bug

Comments

@BalestraPatrick
Member

Description of the bug:

We added common --experimental_merged_skyframe_analysis_execution to our bazelrc a few days ago. Shortly after that, developers started reporting issues regarding builds never completing locally. Upon inspection, builds were stuck in a state such as Another command (pid=76180) is running. Waiting for it to complete on the server (server_pid=27106)....

We had never seen this behavior before, and given the multiple reports from developers, we reverted our change. Since then, we haven't received any reports of builds getting stuck. From our BES, we can see that some of these developers had a build unexpectedly killed or stopped right before their subsequent builds failed to start (the BES stream is truncated, so those builds show up as "Disconnected" in our BES service).

In the above example, there was no process with pid=76180 but there was a process for server_pid=27106. Running jstack against the server process reveals that it's stuck in some waiting state (not sure if that's expected, but hopefully it's helpful). The workaround for developers was to run kill -9 27106.
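The check described above (client pid gone, server pid still alive) can be sketched in Python. This is only an illustration under stated assumptions: it assumes a POSIX system such as macOS, it assumes the client message has exactly the shape quoted in this report, and the helper names `parse_wait_message`, `pid_exists`, and `looks_stale` are hypothetical, not part of Bazel.

```python
import os
import re

# Matches the client message quoted in this report (assumed format).
_WAIT_RE = re.compile(r"pid=(\d+)\) is running\..*server_pid=(\d+)")


def parse_wait_message(message):
    """Return (client_pid, server_pid) from the 'Another command' message, or None."""
    match = _WAIT_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pid_exists(pid):
    """Best-effort POSIX liveness check: signal 0 probes without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def looks_stale(message):
    """True when the blocking client pid is gone but the server pid is still alive."""
    pids = parse_wait_message(message)
    if pids is None:
        return False
    client_pid, server_pid = pids
    return not pid_exists(client_pid) and pid_exists(server_pid)
```

Passing the message a stuck client prints to `looks_stale` would flag the situation we saw, where the server keeps waiting on a command whose process no longer exists.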

Let me know if I can somehow provide more logs or details.

cc: @joeleba

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Unfortunately, I don't have a reproducible example. We didn't see this on CI during the short timeframe in which the flag was enabled, before we turned it off. Our IDE integration uses multiple output bases, so it's possible that something specific to that integration triggers this error.

Which operating system are you running Bazel on?

macOS 13.5

What is the output of bazel info release?

6.3.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

stacktrace2.txt

@joeleba
Member

joeleba commented Aug 9, 2023

Thanks so much for trying out skymeld and filing the issue!

I noticed that you're using Bazel 6.3.1. In the last 3 months we've fixed some interrupt issues with Skymeld, and with luck this issue is among them. Would you mind trying your builds again at HEAD to see whether it still reproduces?

@BalestraPatrick
Member Author

Thanks for the fast answer! There are some breakages at HEAD with rules_apple so we can't jump on it just yet. I will report back as soon as those get resolved and we're able to test again.

@iancha1992 iancha1992 added the team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. label Aug 9, 2023
@meteorcloudy meteorcloudy added team-Performance Issues for Performance teams and removed team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. labels Aug 22, 2023
@oquenchil oquenchil added awaiting-user-response Awaiting a response from the author and removed untriaged labels Aug 29, 2023
@brentleyjones
Contributor

@joeleba This is reproducing on 7.2.0rc1.

jstack.txt

@brentleyjones
Contributor

I also sent SIGINT to the server a couple of times before killing it, and got logs like these:
SIGINT.1.log
SIGINT.2.log
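For anyone who wants to capture similar logs, the "SIGINT a couple of times, then kill" sequence can be sketched as follows. This is a hedged illustration for POSIX systems, not a Bazel tool: `interrupt_then_kill` and its parameters are hypothetical names, and the wait between signals is an arbitrary choice.

```python
import os
import signal
import time


def interrupt_then_kill(pid, attempts=2, wait_seconds=5.0,
                        send=os.kill, sleep=time.sleep):
    """Send SIGINT up to `attempts` times, then SIGKILL if the process survives.

    `send` and `sleep` are injectable so the escalation logic can be exercised
    without signalling a real process. Returns the list of signals sent.
    """
    sent = []
    for _ in range(attempts):
        try:
            send(pid, signal.SIGINT)
        except ProcessLookupError:
            return sent  # the process is already gone
        sent.append(signal.SIGINT)
        sleep(wait_seconds)  # give the process time to react and write its logs
    try:
        send(pid, signal.SIGKILL)
        sent.append(signal.SIGKILL)
    except ProcessLookupError:
        pass  # the process exited after the last SIGINT
    return sent
```

Calling `interrupt_then_kill(server_pid)` with the stuck server's pid would mirror what I did by hand.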

@fmeum
Collaborator

fmeum commented May 23, 2024

@bazel-io fork 7.2.0

@fmeum
Collaborator

fmeum commented May 23, 2024

I marked this as a release blocker for now. Brentley's stack trace makes this look like a regression caused by 52adf0b.

@brentleyjones
Contributor

That exact stack trace has appeared in multiple users' SIGINT logs, and we only started seeing the issue after upgrading to 7.2.0rc1.

@fmeum
Collaborator

fmeum commented May 28, 2024

CC @joeleba

@joeleba joeleba self-assigned this May 28, 2024
@brentleyjones
Contributor

Seems like I hijacked this issue, sorry @BalestraPatrick. Is your original issue resolved? If so, we could close this once mine is resolved too.

@BalestraPatrick
Member Author

We haven't seen this issue since bumping to Bazel 7.0, where skymeld is enabled by default, so from my side it can be closed once that is done.

@meteorcloudy
Member

@brentleyjones Can you please file a new issue for the regression?

@brentleyjones
Contributor

#22586


9 participants