
Windows: bazel clean --expunge sometimes causes Bazel server to crash #3956

Closed
meteorcloudy opened this issue Oct 24, 2017 · 14 comments

Labels
P1 I'll work on this now. (Assignee required) · platform: windows · type: bug

Comments

@meteorcloudy (Member)

When I run bazel clean --expunge on Windows, I sometimes get

INFO: Starting clean (this may take a while). Consider using --expunge_async if the clean takes more than several minutes.

Server terminated abruptly (error code: 14, error message: '', log file: 'c:\tmp\_bazel_pcloudy\2a_d8ear/server/jvm.out')

Both the error message and the log file are empty.

Bazel version: HEAD

meteorcloudy added the platform: windows, P1 (I'll work on this now. Assignee required), and type: bug labels on Oct 24, 2017
@meteorcloudy (Member, Author)

Possible culprit: 4869c4e

@meteorcloudy (Member, Author)

git bisect confirmed my suspicion.
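
For context, a bisect like that typically looks something like the sketch below; the "good" commit is a placeholder, since the actual range isn't given in this thread:

# Sketch of a bisect over the suspected range (placeholder commits, not the ones actually used):
git bisect start
git bisect bad HEAD                  # the crash reproduces here
git bisect good <last-known-good>    # e.g. the 0.7.0 release commit
# At each step, rebuild and try the repro, then mark the commit accordingly.
# The crash is flaky, so it may be worth repeating the repro a few times per step.
bazel build src:bazel && bazel clean --expunge && git bisect good || git bisect bad
git bisect reset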

@meteorcloudy (Member, Author)

@kush-c @lberki Could you take a look please?

@meteorcloudy (Member, Author)

I can reproduce this when running bazel build src:bazel && bazel clean --expunge on Windows.

@BenTheElder

I think we've actually been seeing this in an Ubuntu-based Docker image (kubernetes/test-infra#5137), and now with bazel version too.

@meteorcloudy (Member, Author)

@BenTheElder Which Bazel version are you using?

@BenTheElder commented Oct 25, 2017 via email

@BenTheElder commented Oct 28, 2017

Using --batch seems to be working as a mitigation, at least in our case. I'm not sure yet, but so far since switching it on we haven't seen a Bazel failure. cc @ixdy
Edit: obviously this isn't ideal, but we only run a few commands per job in our environment, so this may be acceptable for the moment. We're canary testing this.
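
For anyone else who wants to try this, a minimal sketch of the two ways to switch --batch on (the ~/.bazelrc path below is just an example location):

# Per invocation: --batch is a startup option, so it goes before the command.
bazel --batch clean --expunge
bazel --batch build src:bazel

# Or enable it for every invocation by adding this startup line to a bazelrc
# (e.g. ~/.bazelrc; the path is just an example):
#   startup --batch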

@znull (Contributor) commented Nov 6, 2017

I just started seeing this with Bazel 0.8.0rc1 on Linux (Ubuntu). We hadn't seen it previously.

@meteorcloudy (Member, Author)

Adding this issue as a release blocker because it's present in 0.8.0rc1.
@lberki @kush-c Could you take a look?

@kush-c (Contributor) commented Nov 11, 2017

I wasn't able to reproduce it on my Linux box after multiple runs of this loop:

for i in {1..10}; do bazel build src:bazel && bazel clean --expunge && bazel shutdown; done

If I can't reproduce it on Monday either, we'll probably have to consider a rollback again :(

@BenTheElder

@kush-c we're fairly confident it's related to memory pressure, since we only sometimes saw it in our CI but never locally with 0.7 (and the CI nodes are heavily loaded). My best guess is that it was triggered when the job landed on a node under extra memory pressure, though as far as I know we did not see this with previous Bazel versions.

At a suggestion from @ixdy we tried flipping on --batch with 0.7 and have not seen any similar issues in the roughly two weeks since turning it on. This is in an Ubuntu-based Docker image running on Kubernetes for the kubernetes/test-infra CI.

@kush-c (Contributor) commented Nov 13, 2017

That's very helpful, @BenTheElder. So the issue was present with 0.7 but has gotten worse at HEAD? Commit 4869c4e may well be responsible, since it has the server kill itself when the system is low on memory; perhaps we ran into a race condition when a bazel clean is issued while Bazel is in the process of killing itself.

That would also explain why I wasn't able to reproduce this issue even at HEAD on my 48 GB Linux box, and why --batch solves the problem, since commit 4869c4e shouldn't affect --batch invocations.

@meteorcloudy Given that this issue also occurred, rarely, with 0.7, and that --batch may be a reasonable workaround for workflows that want to use bazel clean --expunge anyway, does this need to be a release blocker? Since I don't have a local repro yet, if it is a release blocker we'll probably need to roll back the commit, and with it the important functionality of Bazel shutting itself down when the system runs low on memory.

@meteorcloudy (Member, Author) commented Nov 14, 2017

Hmm... I doubt the Linux and Windows failures are the same issue.
On Windows I can always reproduce it within 3~5 runs with 0.8.0rc, but not with 0.7.0, so it does look like a regression to me.
Can you try using one of the CI Windows slaves to repro?

The shell script I used:

pcloudy@pcloudy0-w MSYS ~/workspace/my_tests/bazel
$ cat ./test.sh

for (( i=1; i<=10; i++))
do
 bazel clean --expunge && bazel build src:bazel
done
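
A possible variant of that script (just a sketch; the jvm.out path is assumed from the log file mentioned at the top of this issue) that stops at the first failure and prints the server log, to check whether jvm.out really is empty when the crash happens:

#!/bin/bash
# Capture the output base up front; server/jvm.out lives under it.
output_base="$(bazel info output_base)"
for (( i=1; i<=10; i++ ))
do
  if ! bazel clean --expunge || ! bazel build src:bazel
  then
    echo "Failed on iteration $i"
    # The report says this file is empty after the crash; print it anyway.
    cat "$output_base/server/jvm.out" 2>/dev/null || echo "no jvm.out found"
    break
  fi
done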

bazel-io pushed a commit that referenced this issue Nov 15, 2017
*** Reason for rollback ***

Causing Bazel server to crash when running bazel clean --expunge
#3956

*** Original change description ***

Delayed rollforward of commit 8fb311b.

This was rolled back due to Tensorflow breakage but the patch I exported to gerrit (https://bazel-review.googlesource.com/c/bazel/+/18590) passed Tensorflow (https://ci.bazel.io/job/bazel/job/presubmit/52/Downstream_projects/). Confirmed with jcater@ that the "newly failing" projects in the Global Tests are known issues. I think we can check this in now.

Additionally I had attempted to reproduce any tensorflow issues with this by building and testing Tensor...

***

ROLLBACK_OF=172361085

RELNOTES:None
PiperOrigin-RevId: 175821671
dslomov pushed further commits with the same rollback message that referenced this issue on Nov 17 (×3), Nov 21, Nov 22, Dec 1, and Dec 4 (×2), 2017.