MacOSXFsEventsDiffAwarenessTest is flaky #10776

Open
philwo opened this issue Feb 13, 2020 · 3 comments
Member

@philwo philwo commented Feb 13, 2020

//src/test/java/com/google/devtools/build/lib/skyframe:SkyframeTests has been failing on CI recently due to MacOSXFsEventsDiffAwarenessTest being flaky. The current theory is that the test fails when the machine is under high load.

When we looked at the test initially, it seemed quite obvious why it would be flaky: it made some changes to the filesystem, waited 200 ms, and then checked that FSEvents reported them correctly. It's conceivable that the events aren't delivered within that window when the machine is under load.

@jmmv kindly fixed the test and added logic to retry for up to 60 seconds; however, the test is still flaky (although apparently less often than before):

Log: https://storage.googleapis.com/bazel-untrusted-buildkite-artifacts/391e674d-e4dc-4d2d-9b2d-e12d7e4082da/src/test/java/com/google/devtools/build/lib/skyframe/SkyframeTests/attempt_1.log
CI job: https://buildkite.com/bazel/bazel-bazel/builds/11455

This is bad, because it might indicate that our --watchfs implementation on macOS sometimes doesn't work correctly and Bazel might miss changed files during incremental rebuilds. 🤔

I'll disable the test for now, but we should look into this.
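
For reference, the difference between a fixed 200 ms wait and retrying until a deadline can be sketched roughly as below. This is only an illustration, not the code in the actual test: `PollSketch`, `pollUntil`, and the simulated 300 ms "slow notifier" are hypothetical names and values made up for this sketch.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; none of these names come from the Bazel test.
class PollSketch {
  /**
   * Re-checks the condition every intervalMillis until it holds or
   * timeoutMillis has elapsed. Returns whether the condition became true.
   */
  static boolean pollUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false; // Deadline reached without the condition holding.
      }
      try {
        Thread.sleep(intervalMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // Preserve the interrupt for callers.
        return false;
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Simulates an event that is delivered late, e.g. on a loaded machine.
    AtomicBoolean eventSeen = new AtomicBoolean(false);
    Thread slowNotifier = new Thread(() -> {
      try {
        Thread.sleep(300);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      eventSeen.set(true);
    });
    slowNotifier.setDaemon(true);
    slowNotifier.start();

    // Fixed wait: checks once, possibly before the event has arrived.
    Thread.sleep(200);
    System.out.println("after fixed 200 ms wait: " + eventSeen.get());

    // Polling: keeps checking until the event arrives or 60 seconds pass.
    System.out.println("after polling: " + pollUntil(eventSeen::get, 60_000, 50));
  }
}
```

Polling only helps if the event is delivered late; it cannot recover an event that was never delivered at all.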

bazel-io pushed a commit that referenced this issue Feb 13, 2020
Test is still flaky: #10776

RELNOTES: None.
PiperOrigin-RevId: 294865777
Contributor

@jmmv jmmv commented Feb 13, 2020

Did anything change at all on the CI machines? I find it quite strange that "heavy load" has only now become a problem and is triggering this failure so frequently when we never saw it in the past.

Member Author

@philwo philwo commented Feb 13, 2020

@jmmv It's very hard to say. We upgraded the production fleet to macOS Catalina on 2020-01-29 (the testing CI was upgraded much earlier, I think since the beta), and the test worked fine until yesterday. I remember seeing the same failure occasionally during a presubmit of my CL in the testing org, but that was a completely unrelated change; it went away a few days later and I was able to submit it.

Maybe it's a bug in macOS Catalina, but it's not clear what triggers it :|

Contributor

@jmmv jmmv commented Feb 13, 2020

Alright, I think I know what happens. The test starts by doing:

View view1 = underTest.getCurrentView(watchFsEnabledProvider);

which in turn calls MacOSXFsEventsDiffAwareness#init to initialize the monitor. But the monitor is started in a separate thread. There is no guarantee that by the end of init, fsevents is already listening for changes, so if the thread is delayed, we can modify the client before fsevents starts to listen -- and thus we lose the initial events and we don't see anything ever changing.

This is trivial to expose by adding a sleep right before the call to MacOSXFsEventsDiffAwareness#run.
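
To make the race concrete, here is a toy model of it in plain Java. It does not use fsevents or any Bazel code: `ListenerRaceSketch`, `ToyMonitor`, `fileChanged`, and the `startupDelayMillis` parameter (standing in for the sleep mentioned above) are all hypothetical. Blocking on a `CountDownLatch` until the listener thread reports it is ready is one possible way to close the window, shown here as an assumption rather than as the fix that was or should be applied.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical toy model of the race; not Bazel or fsevents code.
class ListenerRaceSketch {
  static class ToyMonitor {
    private final List<String> seen = new CopyOnWriteArrayList<>();
    private final CountDownLatch listening = new CountDownLatch(1);
    private volatile boolean isListening = false;

    /**
     * Starts the listener on a background thread. startupDelayMillis simulates
     * a delayed thread start; waitUntilListening makes init block until the
     * listener has signalled that it is ready.
     */
    void init(long startupDelayMillis, boolean waitUntilListening) {
      Thread listener = new Thread(() -> {
        try {
          Thread.sleep(startupDelayMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        isListening = true;
        listening.countDown(); // Signal that events are now being observed.
      });
      listener.setDaemon(true);
      listener.start();

      if (waitUntilListening) {
        try {
          if (!listening.await(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("listener did not start in time");
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException(e);
        }
      }
    }

    /** Simulates a file modification; it is dropped if nobody is listening yet. */
    void fileChanged(String path) {
      if (isListening) {
        seen.add(path);
      }
    }

    List<String> changes() {
      return seen;
    }
  }

  public static void main(String[] args) {
    // init returns immediately, so the change usually precedes the listener.
    ToyMonitor racy = new ToyMonitor();
    racy.init(200, false);
    racy.fileChanged("foo/bar");
    System.out.println("without waiting: " + racy.changes());

    // init returns only once the listener is ready, so the change is observed.
    ToyMonitor synced = new ToyMonitor();
    synced.init(200, true);
    synced.fileChanged("foo/bar");
    System.out.println("with latch: " + synced.changes());
  }
}
```

In the toy model the lost change never shows up no matter how long the caller retries afterwards, which matches the "we don't see anything ever changing" symptom described above.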
