Browser tests are flaky on Linux #3074
Okay, patches are landing, so I'm lowering this to P1.
Here's at least one problem: during …
We've seen scenarios where DesktopBrowserFinder unit tests fail because sys.platform reports a wrong platform, presumably due to being stubbed out too aggressively by system_stub. This CL adds some temporary logging to make sure that the reason browsers sometimes can't be found on the Catapult CQ is due to this stubbing problem.

BUG=catapult:#3074
NOTRY=true
Review-Url: https://codereview.chromium.org/2567783002
To provide an update here: I added logging to confirm that this was the case, and all of the failing tests on Linux reported that … Currently I have a dry run of a CL that deletes …
So, while deleting desktop_browser_finder_unittest cleans up flake, it does reveal that there seems to be a problem with using the reference browser on Linux in parallel. We get this stdout:
I wonder if that last line is significant.
I'm going to try running the tests again on my machine but pass --disable-gpu to Chrome. It shouldn't matter anyway since we're running under xvfb.
Okay, I have two concrete suggestions for unblocking patches from landing, since I don't know how much more time I have to work on this at present:
@nedn WDYT?
For 2., @kenrussell do you know if it's true that we can run browser tests in parallel without "--disable-gpu"?
Talked with @vmiura about this. We both agree that passing --disable-gpu when using xvfb on Linux is a reasonable workaround. While it's likely that Chrome would detect Mesa / Gallium / llvmpipe and a blacklist entry would disable GPU functionality, it's unfortunate that Mesa seems to be spinning up threads before we get a chance to detect that it's the OpenGL implementation in use.
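A minimal sketch of what the workaround could look like. This is not Telemetry's actual API; the function name and the `started_xvfb` flag are illustrative, standing in for however the harness knows it launched the virtual display itself:

```python
def browser_startup_args(base_args, started_xvfb):
    """Append --disable-gpu when the display is an xvfb virtual screen.

    Under xvfb there is no real GPU, so skipping GPU initialization up
    front avoids Mesa/llvmpipe spinning up threads before Chrome's
    blacklist would have disabled GPU functionality anyway.
    """
    args = list(base_args)
    if started_xvfb and '--disable-gpu' not in args:
        args.append('--disable-gpu')
    return args
```

The check for an existing `--disable-gpu` keeps the helper idempotent if a test has already set the flag explicitly.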
@nedn, to confirm: instead of doing 1., it seems like you're actively looking at the cause of the system_stub flake now as of https://codereview.chromium.org/2564413002/; is that correct? We'll still need the patch that adds the --disable-gpu xvfb warning; I'll write that one.
All the browser tests were flaky because we fail to restore the |sys| module in desktop_browser_finder. The reason was that the desktop_browser_finder module was overridden twice, so the second time, its "original" |desktop_browser_finder.sys| was already stubbed. The system_stub framework then tries to restore desktop_browser_finder.sys with this "original" sys module, and therefore fails to restore desktop_browser_finder to its real original state. This CL removes the duplicate stubbing and adds an assertion in system_stub to avoid overriding a module twice.

BUG=catapult:#3074
Review-Url: https://codereview.chromium.org/2574453002
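A self-contained illustration of the failure mode described in that CL. The names below are stand-ins, not Catapult's actual system_stub code: when the same module attribute is stubbed twice, the second stub records the first stub as the "original", so restoring puts a stub back instead of the real module.

```python
import sys
import types

# Stand-in for desktop_browser_finder, which holds a reference to sys.
module = types.SimpleNamespace(sys=sys)

def override(mod):
    """Replace mod.sys with a fake, returning what was saved as 'original'."""
    saved = mod.sys
    mod.sys = types.SimpleNamespace(platform='stubbed-linux')
    return saved

saved_first = override(module)    # saved_first is the real sys module
saved_second = override(module)   # BUG: saved_second is the FIRST stub

module.sys = saved_second         # "restore": still a stub, not the real sys
print(module.sys is sys)          # False: sys was never actually restored
print(module.sys.platform)        # 'stubbed-linux', not the real platform
```

An assertion in the stubbing helper that `mod.sys` is still the real module before overriding, which is what the CL adds to system_stub, turns this silent corruption into an immediate test failure at the second `override` call.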
This CL works around issues with Mesa spinning up processes too early on Linux, which causes browser tests to fail with the current reference build.

BUG=catapult:#3074
Review-Url: https://codereview.chromium.org/2573563002
I lowered this to P2 because Ned's patch fixes the …
… (patchset #1 id:1 of https://codereview.chromium.org/2567783002/ )

Reason for revert: No longer needed; root cause diagnosed.

Original issue's description:
> [Telemetry] Add temporary logging to desktop_browser_finder
>
> We've seen scenarios where DesktopBrowserFinder unit tests fail because
> sys.platform reports a wrong platform, presumably due to being stubbed out too
> aggressively by system_stub. This CL adds some temporary logging to make sure
> that the reason browsers sometimes can't be found on the Catapult CQ is due to
> this stubbing problem.
>
> BUG=catapult:#3074
> NOTRY=true
>
> Review-Url: https://codereview.chromium.org/2567783002
> Committed: https://chromium.googlesource.com/external/github.com/catapult-project/catapult/+/707aaac64b3c0a86d253e1ab502e73996d63927e

TBR=sullivan@chromium.org,nednguyen@google.com
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG=catapult:#3074
Review-Url: https://codereview.chromium.org/2574483002
Could you post some links to example failing jobs?
Here are a couple where retries caused the whole build to time out: https://build.chromium.org/p/tryserver.client.catapult/builders/Catapult%20Linux%20Tryserver/builds/5939
Here's one from before my --disable-gpu CL that shows the error I pasted above: https://build.chromium.org/p/tryserver.client.catapult/builders/Catapult%20Linux%20Tryserver/builds/5894
Looks like the browser's intermittently deadlocking upon startup. This is bad news; behavior like this was seen on macOS a few months ago, and Emily added retries to the SeriallyExecutedBrowserTestCase framework, which seemed to suppress the hangs. At the time they were investigated, it wasn't clear whether the hangs on macOS were caused by machine misconfigurations.

If it were possible to get a minidump at the point where the browser hung, and thereby a stack trace, we could file a P1 issue against the appropriate Chrome sub-team. I'm not sure why one isn't being generated. How does Telemetry tear down the browser if it determines that it's hung?

https://www.chromium.org/developers/crash-reports#TOC-On-Linux1 indicates that setting the environment variable CHROME_HEADLESS=1 is supposed to make the browser produce .dmp files, but that variable is being set in your logs and no .dmp file is generated.

Maybe if Telemetry sent the browser a more cooperative signal like SIGTERM and waited ~10 seconds before sending SIGKILL, the browser would be more likely to generate a minidump, which could then be symbolized?
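For reference, enabling minidump generation comes down to that one environment variable. A sketch, assuming the caller supplies the browser command line (the helper name is invented, not part of Telemetry):

```python
import os
import subprocess

def launch_with_minidumps(browser_cmd, **popen_kwargs):
    """Launch the browser with CHROME_HEADLESS=1 in its environment, so
    that Breakpad writes .dmp files on a crash instead of showing the
    crash-reporting UI."""
    env = dict(os.environ)
    env['CHROME_HEADLESS'] = '1'
    return subprocess.Popen(browser_cmd, env=env, **popen_kwargs)
```

Whether a hang (as opposed to a crash) produces a .dmp at all is exactly the open question in this thread; the variable only makes dumps possible, it doesn't force one on SIGTERM.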
Looks like we do send SIGTERM right now, but the timeout is 5 seconds. Do you think that might not be long enough?
Yeah, I'd suggest increasing it to 20 seconds and seeing if that changes the results. I'd also suggest introducing a hang in the browser process's UI thread and seeing whether the harness catches it.
CL: https://codereview.chromium.org/2568033003 @nedn Another question I have here is, why do we wait 240 seconds for the browser to finish launching? That seems like a crazy long amount of time. Any way we could halve that to 120s? I think that would allow many more Catapult CLs to pass the CQ. |
Answered my own question: this was actually done recently by https://codereview.chromium.org/2526853003. At that time, the flake was caused by the system_stub problem rather than this hang. If the 240s timeout means that jobs are much more likely to time out in the event of flake and there's only one test in a different repository that needs the timeout to be bumped, could we change the default back to 120 and allow this option to be configurable, so that environments that would like a longer timeout can have it? @achuith Would it make sense for you to make use of such an option if it were introduced? |
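The proposal above could look like an ordinary command-line option with the old 120 s default; the flag name here is invented for illustration, not Telemetry's actual option:

```python
import argparse

def add_startup_timeout_option(parser):
    """Expose the browser-startup timeout as a configurable option
    (hypothetical flag name), defaulting back to 120 seconds."""
    parser.add_argument(
        '--browser-startup-timeout', type=int, default=120,
        help='Seconds to wait for the browser to finish launching. '
             'Environments with slow launches can raise this to 240.')
    return parser

args = add_startup_timeout_option(argparse.ArgumentParser()).parse_args([])
print(args.browser_startup_timeout)  # 120
```

This keeps CQ jobs from spending 240 s per flaky launch while still letting the one test that needs the longer timeout opt in.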
To shut down the browser on desktop, we first try to quit the browser normally, then send SIGTERM, then wait a while, then send SIGKILL if we still don't see that the browser is shut down.

Apparently, with CHROME_HEADLESS=1, we should be getting a .dmp file on SIGTERM, but we see no such file.

kbr@ suggests that this may simply be because we're not waiting long enough. The best result of this CL would be that we start getting useful stack traces from these launch failures.

BUG=catapult:#3074
Review-Url: https://codereview.chromium.org/2568033003
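The escalation described in that CL, sketched below. This is not Telemetry's actual implementation; it omits the initial "quit normally" step, and the 20 s default reflects the bumped timeout under discussion:

```python
import signal
import subprocess
import time

def shut_down_browser(proc, sigterm_timeout_s=20.0):
    """Cooperative shutdown first, escalating to SIGKILL.

    SIGTERM gives the browser a chance to run its crash handler and
    write a .dmp; SIGKILL is the last resort and produces no minidump.
    """
    proc.send_signal(signal.SIGTERM)
    deadline = time.monotonic() + sigterm_timeout_s
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return proc.returncode      # exited in response to SIGTERM
        time.sleep(0.1)
    proc.kill()                         # SIGKILL: no dump possible now
    return proc.wait()
```

The later comments suggest the Linux browser never actually exits on SIGTERM, in which case every shutdown falls through to the `proc.kill()` branch and no timeout value would help.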
We're still failing to get stack traces (or even observe a successful response to SIGTERM) after bumping the timeout; I'll try to introduce a hang locally and see if I can get a stack trace out, as @kenrussell suggested.
Alright, @ehanley324 has noticed that benchmark runtimes ballooned on Linux as a result of my CL, and is reverting it. However, further investigation reveals that we apparently never manage to shut down the browser cooperatively on Linux, which is definitely a red flag.
…#2 id:20001 of https://codereview.chromium.org/2568033003/ )

Reason for revert: Testing to see if it caused the issue on Linux on the waterfall: https://bugs.chromium.org/p/chromium/issues/detail?id=674146

Original issue's description:
> [Telemetry] Increase timeout after sending SIGTERM
>
> To shut down the browser on desktop, we first try to quit the browser normally,
> then send SIGTERM, then wait a while, then send SIGKILL if we still don't see
> that the browser is shut down.
>
> Apparently, with CHROME_HEADLESS=1, we should be getting a .dmp file on
> SIGTERM, but we see no such file.
>
> kbr@ suggests that this may simply be because we're not waiting long enough.
> The best result of this CL would be that we start getting useful stack traces
> from these launch failures.
>
> BUG=catapult:#3074
>
> Review-Url: https://codereview.chromium.org/2568033003
> Committed: https://chromium.googlesource.com/external/github.com/catapult-project/catapult/+/e6e0862c81652393de2dd878322e8c0c1e43d428

TBR=kbr@chromium.org,nednguyen@google.com,eakuefner@chromium.org
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG=catapult:#3074
Review-Url: https://codereview.chromium.org/2570413002
Sorry about that. That's discouraging. It'd be useful to know under what circumstances a minidump can be gathered on Linux; right now it sounds like it only happens when the browser actually crashes. If you can force a hang locally and experiment with sending various signals, that'd be useful.
Right, I think that before we can figure out when we can get minidumps, we have to resolve the issue on Linux where we don't shut down the browser correctly. This may actually solve both of our problems, because maybe we aren't giving the browser a chance to produce the minidump.
…ards to 2 (linux)

It's a known problem that launching browsers in parallel on Linux can cause tests to hang due to deadlock (catapult-project/catapult#3074 (comment)), and since each hung test takes a 1-minute timeout, we disable Telemetry parallelization and shard the tests across 2 machines to reduce the overall wait time.

BUG=676742
Review-Url: https://codereview.chromium.org/2653163002
Cr-Commit-Position: refs/heads/master@{#445998}
Archiving.
Not sure if these are outright failures or flake, since some of these tests were previously flaky. At the moment this seems to be blocking Telemetry CLs from landing.
If this represents outright failure, we should consider reverting the ref build roll; @benshayden will keep trying to land https://codereview.chromium.org/2556223002 so we can figure out whether it's still possible to land Telemetry changes at all.
I'm going to spend a little time looking into whether I can get browser_finder being flaky to repro locally.
@Apeliotes @nedn @anniesullie