This repository has been archived by the owner on Dec 3, 2019. It is now read-only.

Browser tests are flaky on Linux #3074

Closed

eakuefner opened this issue Dec 8, 2016 · 23 comments

@eakuefner
Contributor

Not sure if these are outright failures or flake since some of these tests were previously flaky. At the moment it seems to be blocking Telemetry CLs from landing.

If this represents outright failure, we should consider reverting the ref build roll; @benshayden will keep trying to land https://codereview.chromium.org/2556223002 so we can figure out whether it's still possible to land Telemetry changes at all.

I'm going to spend a little time looking into whether I can reproduce the browser_finder flakiness locally.

@Apeliotes @nedn @anniesullie

@eakuefner eakuefner self-assigned this Dec 8, 2016
@eakuefner eakuefner added P0 and removed P1 labels Dec 8, 2016
@eakuefner
Contributor Author

Okay, patches are landing, so I'm lowering this to P1.

@eakuefner eakuefner added P1 and removed P0 labels Dec 8, 2016
@eakuefner eakuefner changed the title All browser tests are flaky/failing All browser tests are flaky Dec 8, 2016
@eakuefner
Contributor Author

Here's at least one problem: during LinuxFindTest, sys.platform seems to be returning win32. My best guess is that something funky is going on with system_stub. We've wanted to kill system_stub for a while (see https://bugs.chromium.org/p/chromium/issues/detail?id=547237), so maybe this is finally an opportunity to do that.

chromium-infra-bot pushed a commit that referenced this issue Dec 9, 2016
We've seen scenarios where DesktopBrowserFinder unit tests fail because
sys.platform reports a wrong platform, presumably due to being stubbed out too
aggressively by system_stub. This CL adds some temporary logging to make sure
that the reason browsers sometimes can't be found on the Catapult CQ is due to
this stubbing problem.

BUG=catapult:#3074
NOTRY=true

Review-Url: https://codereview.chromium.org/2567783002
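For concreteness, the temporary logging that CL describes might look something like the minimal sketch below; the helper name is illustrative, not the actual desktop_browser_finder code.

import logging
import sys

def _log_platform_for_debugging():
  # Temporary diagnostic: if system_stub has leaked a stubbed |sys| module,
  # sys.platform will report the wrong OS here (e.g. 'win32' on a Linux bot).
  logging.warning('desktop_browser_finder sees sys.platform=%r', sys.platform)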
@eakuefner
Contributor Author

To provide an update here, I added logging to confirm that this was the case and all of the failing tests on Linux reported that sys.platform was returning win32.

I currently have a dry run going for a CL that deletes desktop_browser_finder_unittest, to confirm my hypothesis that our usage of system_stub in these suites is flaky: https://codereview.chromium.org/2565003002

@eakuefner
Contributor Author

eakuefner commented Dec 9, 2016

So, while deleting desktop_browser_finder_unittest cleans up flake, it does reveal that there seems to be a problem with using the reference browser on Linux in parallel. We get this stdout:

Xlib:  extension "RANDR" missing on display ":99".
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: EnableMemoryInfo
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: RemoteAccessClientFirewallTraversal
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
Xlib:  extension "RANDR" missing on display ":99".
[9224:9224:1209/154602:ERROR:sandbox_linux.cc(343)] InitializeSandbox() called with multiple threads in process gpu-process. 

I wonder if that last line is significant.

@eakuefner
Contributor Author

I'm going to try running the tests again on my machine, but passing --disable-gpu to Chrome. It shouldn't matter anyway since we're running under xvfb.

@eakuefner
Contributor Author

Okay, --disable-gpu definitely reduces the flake: only two browser tests failed, and they were retried successfully.

I have two concrete suggestions for unblocking patches from landing, since I don't know how much more time I have to work on this at present:

  1. Make desktop_browser_finder unit tests only run on the platform they are targeting. This should solve the immediate problem of system_stub being overly aggressive on Linux.
  2. Pass --disable-gpu if we are running tests under Xvfb.

@nedn WDYT?
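Suggestion 2 could be roughly the sketch below; the DISPLAY check (':99' is taken from the log output above) and the browser_args hook are assumptions, not Telemetry's actual plumbing.

import os

def maybe_disable_gpu(browser_args):
  # Assumption: the Catapult bots run Xvfb on display ':99', as seen in the
  # log output above.  When that display is in use there is no real GPU, so
  # ask Chrome not to spin up the GPU process at all.
  if os.environ.get('DISPLAY') == ':99' and '--disable-gpu' not in browser_args:
    browser_args.append('--disable-gpu')
  return browser_args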

@nedn
Contributor

nedn commented Dec 9, 2016

  1. sounds ok to me

For 2., @kenrussell, do you know whether it's true that we can run browser tests in parallel without "--disable-gpu"?

@kenrussell
Member

Talked with @vmiura about this. We both agree that passing --disable-gpu when using xvfb on Linux is a reasonable workaround. While it's likely that Chrome would detect Mesa / Gallium / llvmpipe and a blacklist entry would disable GPU functionality, it's unfortunate that Mesa seems to be spinning up threads before we get a chance to detect that it's the OpenGL implementation in use.

@eakuefner
Contributor Author

@nedn to confirm: instead of doing 1., it seems like you're now actively looking into the cause of the system_stub flake as of https://codereview.chromium.org/2564413002/, is that correct?

We'll still need the patch that adds the --disable-gpu xvfb warning; I'll write that one.

chromium-infra-bot pushed a commit that referenced this issue Dec 12, 2016
All the browser tests were flaky because we failed to restore the |sys| module in
desktop_browser_finder. The reason was that the desktop_browser_finder module was
overridden twice, so by the second override its original |desktop_browser_finder.sys|
was already stubbed. The system_stub framework then tried to restore
desktop_browser_finder.sys with this "original" sys module, and so failed to restore
desktop_browser_finder to its real original state.

This CL removes the duplicate stubbing and adds an assertion in system_stub to avoid
overriding a module twice.

BUG=catapult:#3074

Review-Url: https://codereview.chromium.org/2574453002
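To illustrate the failure mode (and the assertion fix) described in that commit, here is a minimal stand-in for the stubbing logic; ModuleStub and FakeSys are invented names for this sketch, not the real system_stub API.

class FakeSys(object):
  platform = 'win32'  # what the stub reports regardless of the real OS

class ModuleStub(object):
  # Minimal stand-in for system_stub's override/restore behavior.
  _already_stubbed = set()

  def __init__(self, module, attr, fake):
    self._module, self._attr, self._fake = module, attr, fake
    self._original = None

  def override(self):
    key = (id(self._module), self._attr)
    # The fix, in spirit: refuse to stub the same attribute twice.  Without
    # this, a second override() records the *stub* as the "original", and
    # restore() can never bring the real module back.
    assert key not in ModuleStub._already_stubbed, (
        '%s.%s is already stubbed' % (self._module.__name__, self._attr))
    ModuleStub._already_stubbed.add(key)
    self._original = getattr(self._module, self._attr)
    setattr(self._module, self._attr, self._fake)

  def restore(self):
    setattr(self._module, self._attr, self._original)
    ModuleStub._already_stubbed.discard((id(self._module), self._attr))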
chromium-infra-bot pushed a commit that referenced this issue Dec 12, 2016
This CL works around issues with Mesa spinning up processes too early on Linux,
which causes browser tests to fail with the current reference build.

BUG=catapult:#3074

Review-Url: https://codereview.chromium.org/2573563002
@eakuefner eakuefner changed the title All browser tests are flaky Browser tests are flaky on Linux Dec 12, 2016
@eakuefner eakuefner removed the P1 label Dec 12, 2016
@eakuefner
Contributor Author

I lowered this to P2 because Ned's patch fixes the system_stub problem and my patch fixes the GPU error. However, we're still seeing flake on Linux, and now I can't determine the cause. Fortunately, tryjobs appear to no longer be timing out, but the runtime of telemetry_unittests on the Catapult CQ could probably be on the order of a few minutes instead of roughly 20 if we addressed this flake issue.

chromium-infra-bot pushed a commit that referenced this issue Dec 12, 2016
… (patchset #1 id:1 of https://codereview.chromium.org/2567783002/ )

Reason for revert:
No longer needed; root cause diagnosed.

Original issue's description:
> [Telemetry] Add temporary logging to desktop_browser_finder
>
> We've seen scenarios where DesktopBrowserFinder unit tests fail because
> sys.platform reports a wrong platform, presumably due to being stubbed out too
> aggressively by system_stub. This CL adds some temporary logging to make sure
> that the reason browsers sometimes can't be found on the Catapult CQ is due to
> this stubbing problem.
>
> BUG=catapult:#3074
> NOTRY=true
>
> Review-Url: https://codereview.chromium.org/2567783002
> Committed: https://chromium.googlesource.com/external/github.com/catapult-project/catapult/+/707aaac64b3c0a86d253e1ab502e73996d63927e

TBR=sullivan@chromium.org,nednguyen@google.com
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG=catapult:#3074

Review-Url: https://codereview.chromium.org/2574483002
@kenrussell
Member

Could you post some links to example failing jobs?

@eakuefner
Contributor Author

Here's one from before my --disable-gpu CL that shows the error I pasted above: https://build.chromium.org/p/tryserver.client.catapult/builders/Catapult%20Linux%20Tryserver/builds/5894

@kenrussell
Member

Looks like the browser's intermittently deadlocking upon startup. This is bad news; some behavior like this was seen on macOS a few months ago, and Emily added retries to the SeriallyExecutedBrowserTestCase framework which seemed to suppress them.

At the time they were investigated, it wasn't clear whether the hangs on macOS were caused by machine misconfigurations.

If it were possible to get a minidump at the point where the browser hung, and thereby a stack trace, we could file a P1 issue against the appropriate Chrome sub-team.

I'm not sure why one isn't being generated. How does Telemetry tear down the browser if it determines that it's hung? https://www.chromium.org/developers/crash-reports#TOC-On-Linux1 indicates that setting the environment variable CHROME_HEADLESS=1 is supposed to make it produce .dmp files, but that's being set in your logs and none is being generated. Maybe if it sent the browser a more cooperative signal like SIGTERM and waited ~10 seconds before sending it SIGKILL, it might be more likely to generate a minidump which could then be symbolized?

@eakuefner
Contributor Author

Looks like we do send SIGTERM right now, but the timeout is 5 seconds. Do you think that might not be long enough?

https://github.com/catapult-project/catapult/blob/master/telemetry/telemetry/internal/backends/chrome/desktop_browser_backend.py#L619

@kenrussell
Member

Yeah, I'd suggest increasing it to 20 seconds and see if it changes the results. I'd also actually suggest introducing a hang in the browser process's UI thread and see if the harness catches it.

@eakuefner
Contributor Author

CL: https://codereview.chromium.org/2568033003

@nedn Another question I have here is, why do we wait 240 seconds for the browser to finish launching? That seems like a crazy long amount of time. Any way we could halve that to 120s? I think that would allow many more Catapult CLs to pass the CQ.

@eakuefner
Contributor Author

Answered my own question: this was actually done recently by https://codereview.chromium.org/2526853003.

At that time, the flake was caused by the system_stub problem rather than this hang. If the 240s timeout means that jobs are much more likely to time out in the event of flake and there's only one test in a different repository that needs the timeout to be bumped, could we change the default back to 120 and allow this option to be configurable, so that environments that would like a longer timeout can have it?

@achuith Would it make sense for you to make use of such an option if it were introduced?
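A configurable version of that timeout might look like the sketch below; the 120-second default is the value proposed above, but the option name and the plumbing are assumptions, not an existing Telemetry flag.

DEFAULT_BROWSER_LAUNCH_TIMEOUT_S = 120

def GetBrowserLaunchTimeout(finder_options):
  # Environments that genuinely need a longer launch window (such as the one
  # test in another repository mentioned above) can override this; everyone
  # else keeps the shorter default so hung launches fail fast on the CQ.
  return getattr(finder_options, 'browser_launch_timeout',
                 DEFAULT_BROWSER_LAUNCH_TIMEOUT_S)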

chromium-infra-bot pushed a commit that referenced this issue Dec 13, 2016
To shut down the browser on desktop, we first try to quit the browser normally,
then send SIGTERM, then wait a while, then send SIGKILL if we still don't see
that the browser is shut down.

Apparently, with CHROME_HEADLESS=1, we should be getting a .dmp file on
SIGTERM, but we see no such file.

kbr@ suggests that this may simply be because we're not waiting long enough.
The best result of this CL would be that we start getting useful stack traces
from these launch failures.

BUG=catapult:#3074

Review-Url: https://codereview.chromium.org/2568033003
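For reference, the SIGTERM-then-wait-then-SIGKILL portion of the sequence that CL describes is roughly the following (the initial cooperative quit is omitted); the function name, the grace period, and the polling interval are illustrative values for this sketch, not the actual desktop_browser_backend code.

import os
import signal
import time

def shut_down_browser(pid, grace_period_s=20):
  try:
    # Ask Chrome to exit cleanly; with CHROME_HEADLESS=1 this is when a .dmp
    # would hopefully get written.
    os.kill(pid, signal.SIGTERM)
  except OSError:
    return  # the process is already gone
  deadline = time.time() + grace_period_s
  while time.time() < deadline:
    if not _is_alive(pid):
      return
    time.sleep(0.1)
  try:
    os.kill(pid, signal.SIGKILL)  # last resort; no minidump after this
  except OSError:
    pass

def _is_alive(pid):
  try:
    os.kill(pid, 0)  # signal 0 only checks that the process exists
    return True
  except OSError:
    return False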
@eakuefner
Contributor Author

We're still failing to get stack traces (or indeed observe a successful response to SIGTERM) even though the timeout has been bumped; I'll try to introduce a hang locally and see if I can get a stack trace out, as @kenrussell suggested.

@eakuefner
Contributor Author

Alright, @ehanley324 has noticed that benchmark runtimes have ballooned on Linux as a result of my CL, and is reverting it. However, further investigation reveals that we apparently never manage to shut down the browser cooperatively on Linux, so that's definitely a red flag.

chromium-infra-bot pushed a commit that referenced this issue Dec 14, 2016
…#2 id:20001 of https://codereview.chromium.org/2568033003/ )

Reason for revert:
Testing to see if it caused the issue on Linux on the waterfall: https://bugs.chromium.org/p/chromium/issues/detail?id=674146

Original issue's description:
> [Telemetry] Increase timeout after sending SIGTERM
>
> To shut down the browser on desktop, we first try to quit the browser normally,
> then send SIGTERM, then wait a while, then send SIGKILL if we still don't see
> that the browser is shut down.
>
> Apparently, with CHROME_HEADLESS=1, we should be getting a .dmp file on
> SIGTERM, but we see no such file.
>
> kbr@ suggests that this may simply be because we're not waiting long enough.
> The best result of this CL would be that we start getting useful stack traces
> from these launch failures.
>
> BUG=catapult:#3074
>
> Review-Url: https://codereview.chromium.org/2568033003
> Committed: https://chromium.googlesource.com/external/github.com/catapult-project/catapult/+/e6e0862c81652393de2dd878322e8c0c1e43d428

TBR=kbr@chromium.org,nednguyen@google.com,eakuefner@chromium.org
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG=catapult:#3074

Review-Url: https://codereview.chromium.org/2570413002
@kenrussell
Member

Sorry about that. That's discouraging.

It'd be useful to know under what circumstances a minidump can be gathered on Linux. Right now it sounds like it's only happening when the browser actually crashes. If you can force a hang locally and experiment with sending various signals that'd be useful.
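One way to run that local experiment: force a hang, then poke the browser process with different signals and see which ones leave a new .dmp behind. The dump directory and the 20-second wait below are assumptions for this sketch.

import glob
import os
import signal
import time

def probe_signal(pid, sig, dump_dir, wait_s=20):
  # Returns any minidump files that appear in dump_dir after sending |sig|.
  before = set(glob.glob(os.path.join(dump_dir, '*.dmp')))
  os.kill(pid, sig)
  time.sleep(wait_s)
  after = set(glob.glob(os.path.join(dump_dir, '*.dmp')))
  return sorted(after - before)

# e.g. probe_signal(browser_pid, signal.SIGTERM, '/tmp/chromium-dumps')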

@ehanley324
Contributor

Right, I think that before we can figure out when we can get minidumps, we have to resolve the issue on Linux where we don't shut down the browser correctly. That may actually solve both of our problems, because maybe we aren't giving the browser a chance to produce this minidump.

vade pushed a commit to vade/chromium that referenced this issue Jan 25, 2017
…ards to 2 (linux)

It's a known problem that launching browsers in parallel on Linux causes the
tests to hang due to deadlock (
catapult-project/catapult#3074 (comment)),
and since each hung test takes a 1-minute timeout, we disable telemetry
parallelization and shard it across 2 machines to reduce the overall wait time.

BUG=676742

Review-Url: https://codereview.chromium.org/2653163002
Cr-Commit-Position: refs/heads/master@{#445998}
@eakuefner
Contributor Author

Archiving.
