This repository has been archived by the owner on Dec 3, 2019. It is now read-only.

Browser tests are flaky on Linux #3074

Closed

eakuefner opened this issue Dec 8, 2016 · 23 comments

@eakuefner
Contributor

Not sure if these are outright failures or flake since some of these tests were previously flaky. At the moment it seems to be blocking Telemetry CLs from landing.

If this represents outright failure, we should consider reverting the ref build roll; @benshayden will keep trying to land https://codereview.chromium.org/2556223002 so we can figure out whether it's still possible to land Telemetry changes at all.

I'm going to spend a little time looking into whether I can reproduce the browser_finder flakiness locally.

@Apeliotes @nedn @anniesullie

@eakuefner eakuefner self-assigned this Dec 8, 2016
@eakuefner eakuefner added P0 and removed P1 labels Dec 8, 2016
@eakuefner
Contributor Author

Okay, patches are landing, so I'm lowering this to P1.

@eakuefner eakuefner added P1 and removed P0 labels Dec 8, 2016
@eakuefner eakuefner changed the title All browser tests are flaky/failing All browser tests are flaky Dec 8, 2016
@eakuefner
Contributor Author

Here's at least one problem: during LinuxFindTest, sys.platform seems to be returning win32. My best guess is that something funky is going on with system_stub. We've wanted to kill system_stub for a while (see https://bugs.chromium.org/p/chromium/issues/detail?id=547237), so maybe this is finally an opportunity to do that.

chromium-infra-bot pushed a commit that referenced this issue Dec 9, 2016
We've seen scenarios where DesktopBrowserFinder unit tests fail because
sys.platform reports a wrong platform, presumably due to being stubbed out too
aggressively by system_stub. This CL adds some temporary logging to make sure
that the reason browsers sometimes can't be found on the Catapult CQ is due to
this stubbing problem.

BUG=catapult:#3074
NOTRY=true

Review-Url: https://codereview.chromium.org/2567783002
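For concreteness, the temporary logging that CL describes might look something like the minimal sketch below; the helper name is illustrative, not the actual desktop_browser_finder code.

import logging
import sys

def _log_platform_for_debugging():
  # Temporary diagnostic: if system_stub has leaked a stubbed |sys| module,
  # sys.platform will report the wrong OS here (e.g. 'win32' on a Linux bot).
  logging.warning('desktop_browser_finder sees sys.platform=%r', sys.platform)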
@eakuefner
Contributor Author

To provide an update here, I added logging to confirm that this was the case and all of the failing tests on Linux reported that sys.platform was returning win32.

I currently have a dry run going for a CL that deletes desktop_browser_finder_unittest, to confirm my hypothesis that our usage of system_stub in these suites is flaky: https://codereview.chromium.org/2565003002

@eakuefner
Contributor Author

eakuefner commented Dec 9, 2016

So, while deleting desktop_browser_finder_unittest cleans up flake, it does reveal that there seems to be a problem with using the reference browser on Linux in parallel. We get this stdout:

Xlib:  extension "RANDR" missing on display ":99".
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: EnableMemoryInfo
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: RemoteAccessClientFirewallTraversal
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
[9158:9158:1209/154601:ERROR:config_dir_policy_loader.cc(65)] Ignoring unknown platform policy: _comment
Xlib:  extension "RANDR" missing on display ":99".
[9224:9224:1209/154602:ERROR:sandbox_linux.cc(343)] InitializeSandbox() called with multiple threads in process gpu-process. 

I wonder if that last line is significant.

@eakuefner
Contributor Author

I'm going to try running the tests again on my machine, but passing --disable-gpu to Chrome. It shouldn't matter anyway since we're running under xvfb.

@eakuefner
Contributor Author

Okay, --disable-gpu definitely reduces the flake: only two browser tests failed, and they were retried successfully.

I have two concrete suggestions for unblocking patches from landing, since I don't know how much more time I have to work on this at present:

  1. Make desktop_browser_finder unit tests only run on the platform they are targeting. This should solve the immediate problem of system_stub being overly aggressive on Linux.
  2. Pass --disable-gpu if we are running tests under Xvfb.

@nedn WDYT?
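Suggestion 2 could be roughly the sketch below; the DISPLAY check (':99' is taken from the log output above) and the browser_args hook are assumptions, not Telemetry's actual plumbing.

import os

def maybe_disable_gpu(browser_args):
  # Assumption: the Catapult bots run Xvfb on display ':99', as seen in the
  # log output above.  When that display is in use there is no real GPU, so
  # ask Chrome not to spin up the GPU process at all.
  if os.environ.get('DISPLAY') == ':99' and '--disable-gpu' not in browser_args:
    browser_args.append('--disable-gpu')
  return browser_args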

@nedn
Contributor

nedn commented Dec 9, 2016

  1. sounds ok to me

For 2., @kenrussell, do you know whether it's true that we can run browser tests in parallel without "--disable-gpu"?

@kenrussell
Member

Talked with @vmiura about this. We both agree that passing --disable-gpu when using xvfb on Linux is a reasonable workaround. While it's likely that Chrome would detect Mesa / Gallium / llvmpipe and a blacklist entry would disable GPU functionality, it's unfortunate that Mesa seems to be spinning up threads before we get a chance to detect that it's the OpenGL implementation in use.

@eakuefner
Contributor Author

@nedn to confirm: instead of doing 1., it seems like you're now actively looking into the cause of the system_stub flake as of https://codereview.chromium.org/2564413002/, is that correct?

We'll still need the patch that adds the --disable-gpu xvfb warning; I'll write that one.

chromium-infra-bot pushed a commit that referenced this issue Dec 12, 2016
All the browser tests were flaky because we failed to restore the |sys| module in
desktop_browser_finder. The reason was that the desktop_browser_finder module was
overridden twice, so by the second override its original |desktop_browser_finder.sys|
was already stubbed. The system_stub framework then tried to restore
desktop_browser_finder.sys with this "original" sys module, and so failed to restore
desktop_browser_finder to its real original state.

This CL removes the duplicate stubbing and adds an assertion in system_stub to avoid
overriding a module twice.

BUG=catapult:#3074

Review-Url: https://codereview.chromium.org/2574453002
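To illustrate the failure mode (and the assertion fix) described in that commit, here is a minimal stand-in for the stubbing logic; ModuleStub and FakeSys are invented names for this sketch, not the real system_stub API.

class FakeSys(object):
  platform = 'win32'  # what the stub reports regardless of the real OS

class ModuleStub(object):
  # Minimal stand-in for system_stub's override/restore behavior.
  _already_stubbed = set()

  def __init__(self, module, attr, fake):
    self._module, self._attr, self._fake = module, attr, fake
    self._original = None

  def override(self):
    key = (id(self._module), self._attr)
    # The fix, in spirit: refuse to stub the same attribute twice.  Without
    # this, a second override() records the *stub* as the "original", and
    # restore() can never bring the real module back.
    assert key not in ModuleStub._already_stubbed, (
        '%s.%s is already stubbed' % (self._module.__name__, self._attr))
    ModuleStub._already_stubbed.add(key)
    self._original = getattr(self._module, self._attr)
    setattr(self._module, self._attr, self._fake)

  def restore(self):
    setattr(self._module, self._attr, self._original)
    ModuleStub._already_stubbed.discard((id(self._module), self._attr))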
chromium-infra-bot pushed a commit that referenced this issue Dec 12, 2016
This CL works around issues with Mesa spinning up processes too early on Linux,
which causes browser tests to fail with the current reference build.

BUG=catapult:#3074

Review-Url: https://codereview.chromium.org/2573563002
@eakuefner eakuefner changed the title All browser tests are flaky Browser tests are flaky on Linux Dec 12, 2016
@eakuefner eakuefner removed the P1 label Dec 12, 2016
@eakuefner
Contributor Author

I lowered this to P2 because Ned's patch fixes the system_stub problem and my patch fixes the GPU error. However, we're still seeing flake on Linux, and now I can't determine the cause. Fortunately, tryjobs appear to no longer be timing out, but the runtime of telemetry_unittests on the Catapult CQ could probably be on the order of a few minutes instead of roughly 20 if we addressed this flake issue.

chromium-infra-bot pushed a commit that referenced this issue Dec 12, 2016
… (patchset #1 id:1 of https://codereview.chromium.org/2567783002/ )

Reason for revert:
No longer needed; root cause diagnosed.

Original issue's description:
> [Telemetry] Add temporary logging to desktop_browser_finder
>
> We've seen scenarios where DesktopBrowserFinder unit tests fail because
> sys.platform reports a wrong platform, presumably due to being stubbed out too
> aggressively by system_stub. This CL adds some temporary logging to make sure
> that the reason browsers sometimes can't be found on the Catapult CQ is due to
> this stubbing problem.
>
> BUG=catapult:#3074
> NOTRY=true
>
> Review-Url: https://codereview.chromium.org/2567783002
> Committed: https://chromium.googlesource.com/external/github.com/catapult-project/catapult/+/707aaac64b3c0a86d253e1ab502e73996d63927e

TBR=sullivan@chromium.org,nednguyen@google.com
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG=catapult:#3074

Review-Url: https://codereview.chromium.org/2574483002
@kenrussell
Member

Could you post some links to example failing jobs?

@eakuefner
Contributor Author

Here's one from before my --disable-gpu CL that shows the error I pasted above: https://build.chromium.org/p/tryserver.client.catapult/builders/Catapult%20Linux%20Tryserver/builds/5894

@kenrussell
Member

Looks like the browser's intermittently deadlocking upon startup. This is bad news; some behavior like this was seen on macOS a few months ago, and Emily added retries to the SeriallyExecutedBrowserTestCase framework which seemed to suppress them.

At the time they were investigated, it wasn't clear whether the hangs on macOS were caused by machine misconfigurations.

If it were possible to get a minidump at the point where the browser hung, and thereby a stack trace, we could file a P1 issue against the appropriate Chrome sub-team.

I'm not sure why one isn't being generated. How does Telemetry tear down the browser if it determines that it's hung? https://www.chromium.org/developers/crash-reports#TOC-On-Linux1 indicates that setting the environment variable CHROME_HEADLESS=1 is supposed to make it produce .dmp files, but that's being set in your logs and none is being generated. Maybe if it sent the browser a more cooperative signal like SIGTERM and waited ~10 seconds before sending it SIGKILL, it might be more likely to generate a minidump which could then be symbolized?

@eakuefner
Contributor Author

Looks like we do send SIGTERM right now, but the timeout is 5 seconds. Do you think that might not be long enough?

https://github.com/catapult-project/catapult/blob/master/telemetry/telemetry/internal/backends/chrome/desktop_browser_backend.py#L619

@kenrussell
Member

Yeah, I'd suggest increasing it to 20 seconds and see if it changes the results. I'd also actually suggest introducing a hang in the browser process's UI thread and see if the harness catches it.

@eakuefner
Contributor Author

CL: https://codereview.chromium.org/2568033003

@nedn Another question I have here is, why do we wait 240 seconds for the browser to finish launching? That seems like a crazy long amount of time. Any way we could halve that to 120s? I think that would allow many more Catapult CLs to pass the CQ.

@eakuefner
Contributor Author

Answered my own question: this was actually done recently by https://codereview.chromium.org/2526853003.

At that time, the flake was caused by the system_stub problem rather than this hang. If the 240s timeout means that jobs are much more likely to time out in the event of flake and there's only one test in a different repository that needs the timeout to be bumped, could we change the default back to 120 and allow this option to be configurable, so that environments that would like a longer timeout can have it?

@achuith Would it make sense for you to make use of such an option if it were introduced?
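A configurable version of that timeout might look like the sketch below; the 120-second default is the value proposed above, but the option name and the plumbing are assumptions, not an existing Telemetry flag.

DEFAULT_BROWSER_LAUNCH_TIMEOUT_S = 120

def GetBrowserLaunchTimeout(finder_options):
  # Environments that genuinely need a longer launch window (such as the one
  # test in another repository mentioned above) can override this; everyone
  # else keeps the shorter default so hung launches fail fast on the CQ.
  return getattr(finder_options, 'browser_launch_timeout',
                 DEFAULT_BROWSER_LAUNCH_TIMEOUT_S)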

chromium-infra-bot pushed a commit that referenced this issue Dec 13, 2016
To shut down the browser on desktop, we first try to quit the browser normally,
then send SIGTERM, then wait a while, then send SIGKILL if we still don't see
that the browser is shut down.

Apparently, with CHROME_HEADLESS=1, we should be getting a .dmp file on
SIGTERM, but we see no such file.

kbr@ suggests that this may simply be because we're not waiting long enough.
The best result of this CL would be that we start getting useful stack traces
from these launch failures.

BUG=catapult:#3074

Review-Url: https://codereview.chromium.org/2568033003
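For reference, the SIGTERM-then-wait-then-SIGKILL portion of the sequence that CL describes is roughly the following (the initial cooperative quit is omitted); the function name, the grace period, and the polling interval are illustrative values for this sketch, not the actual desktop_browser_backend code.

import os
import signal
import time

def shut_down_browser(pid, grace_period_s=20):
  try:
    # Ask Chrome to exit cleanly; with CHROME_HEADLESS=1 this is when a .dmp
    # would hopefully get written.
    os.kill(pid, signal.SIGTERM)
  except OSError:
    return  # the process is already gone
  deadline = time.time() + grace_period_s
  while time.time() < deadline:
    if not _is_alive(pid):
      return
    time.sleep(0.1)
  try:
    os.kill(pid, signal.SIGKILL)  # last resort; no minidump after this
  except OSError:
    pass

def _is_alive(pid):
  try:
    os.kill(pid, 0)  # signal 0 only checks that the process exists
    return True
  except OSError:
    return False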
@eakuefner
Contributor Author

We're still failing to get stack traces (or indeed observe a successful response to SIGTERM) even though the timeout has been bumped; I'll try to introduce a hang locally and see if I can get a stack trace out, as @kenrussell suggested.

@eakuefner
Contributor Author

Alright, @ehanley324 has noticed that benchmark runtimes have ballooned on Linux as a result of my CL, and is reverting it. However, further investigation reveals that we apparently never manage to shut down the browser cooperatively on Linux, so that's definitely a red flag.

chromium-infra-bot pushed a commit that referenced this issue Dec 14, 2016
…#2 id:20001 of https://codereview.chromium.org/2568033003/ )

Reason for revert:
Testing to see if it caused the issue on Linux on the waterfall: https://bugs.chromium.org/p/chromium/issues/detail?id=674146

Original issue's description:
> [Telemetry] Increase timeout after sending SIGTERM
>
> To shut down the browser on desktop, we first try to quit the browser normally,
> then send SIGTERM, then wait a while, then send SIGKILL if we still don't see
> that the browser is shut down.
>
> Apparently, with CHROME_HEADLESS=1, we should be getting a .dmp file on
> SIGTERM, but we see no such file.
>
> kbr@ suggests that this may simply be because we're not waiting long enough.
> The best result of this CL would be that we start getting useful stack traces
> from these launch failures.
>
> BUG=catapult:#3074
>
> Review-Url: https://codereview.chromium.org/2568033003
> Committed: https://chromium.googlesource.com/external/github.com/catapult-project/catapult/+/e6e0862c81652393de2dd878322e8c0c1e43d428

TBR=kbr@chromium.org,nednguyen@google.com,eakuefner@chromium.org
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG=catapult:#3074

Review-Url: https://codereview.chromium.org/2570413002
@kenrussell
Member

Sorry about that. That's discouraging.

It'd be useful to know under what circumstances a minidump can be gathered on Linux. Right now it sounds like it's only happening when the browser actually crashes. If you can force a hang locally and experiment with sending various signals that'd be useful.
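One way to run that local experiment: force a hang, then poke the browser process with different signals and see which ones leave a new .dmp behind. The dump directory and the 20-second wait below are assumptions for this sketch.

import glob
import os
import signal
import time

def probe_signal(pid, sig, dump_dir, wait_s=20):
  # Returns any minidump files that appear in dump_dir after sending |sig|.
  before = set(glob.glob(os.path.join(dump_dir, '*.dmp')))
  os.kill(pid, sig)
  time.sleep(wait_s)
  after = set(glob.glob(os.path.join(dump_dir, '*.dmp')))
  return sorted(after - before)

# e.g. probe_signal(browser_pid, signal.SIGTERM, '/tmp/chromium-dumps')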

@ehanley324
Contributor

Right, I think that before we can figure out when we can get minidumps, we have to resolve the issue on Linux where we don't shut down the browser correctly. That may actually solve both of our problems, because maybe we aren't giving the browser a chance to produce this minidump.

vade pushed a commit to vade/chromium that referenced this issue Jan 25, 2017
…ards to 2 (linux)

It's a known problem that launching browsers in parallel on Linux causes the
tests to hang due to deadlock (
catapult-project/catapult#3074 (comment)),
and since each hung test takes a 1-minute timeout, we disable telemetry
parallelization and shard it across 2 machines to reduce the overall wait time.

BUG=676742

Review-Url: https://codereview.chromium.org/2653163002
Cr-Commit-Position: refs/heads/master@{#445998}
@eakuefner
Contributor Author

Archiving.
