High priority: Investigate flakiness on dart2js-windows #28955

Open
mkustermann opened this issue Mar 2, 2017 · 18 comments
Assignees
Labels: area-infrastructure, gardening, web-libraries

Comments

@mkustermann
Member

There are a lot of flaky timeouts on all dart2js-windows builders, and they do not seem test-specific.

The underlying problem might be an infrastructure issue in test.dart's browser controller in connection with IE10/IE11. Our buildbots don't surface the debug log information from test.dart so it's really hard to diagnose.

I think it should be a high priority to look into those flakes and fix the underlying issue.

@mkustermann added the area-infrastructure and gardening labels Mar 2, 2017
@mkustermann changed the title from "Investigate flakiness on dart2js-windows" to "High priority: Investigate flakiness on dart2js-windows" Mar 2, 2017
@mkustermann
Member Author

This issue has been going on for a long time. We should consider moving all of these builders to FYI until they are fixed.

@whesse
Contributor

whesse commented Mar 2, 2017

I'm rebooting these machines because IE10 doesn't seem to be working well, even in a remote desktop session. IE10 is old and unsupported, so we should switch to testing the new IE, Edge, on Windows 10.

@whesse
Contributor

whesse commented Mar 15, 2017

Looking at just one builder, the ie11 shard 2 of 4, and collecting all of the timeouts in the 10 failing ie11 tests in the last 100 runs, we found two tests that timed out repeatedly, and several more that timed out only once.

FAILED: dart2js-ie11 release_x64 html/custom/constructor_calls_created_synchronously_test failed 3 times
FAILED: dart2js-ie11 release_x64 pkg/testing/test/hello_test failed 4 times

These each failed once:
FAILED: dart2js-ie11 release_x64 html/custom_elements_test/preregister
FAILED: dart2js-ie11 release_x64 html/js_typed_interop_test/avoid leaks on dart:core
FAILED: dart2js-ie11 release_x64 html/xsltprocessor_test/supported
FAILED: dart2js-ie11 release_x64 pkg/front_end/test/scanner_test
FAILED: dart2js-ie11 release_x64 pkg/front_end/test/src/async_dependency_walker_test

All failures are timeouts of tests that don't normally time out.

Further investigation of the tests that time out across all Windows browser builders should help us figure out what the problem is.
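A tally like the one above could be produced from downloaded logs with a short script. This is only a sketch: it assumes each failure appears in the log as a line of the form `FAILED: <configuration> <mode_arch> <test name>`, as in the lines quoted in this comment, and `count_failures` is a hypothetical helper, not part of the SDK tooling.

```python
import re
from collections import Counter

# Assumed line shape, taken from the lines quoted in the comment above:
#   FAILED: dart2js-ie11 release_x64 html/xsltprocessor_test/supported
FAILED_LINE = re.compile(r"^FAILED: (\S+) (\S+) (.+)$")


def count_failures(logs):
    """Count, per test name, how many logs report that test as FAILED.

    `logs` is an iterable of log texts, one string per build.
    A test is counted at most once per log.
    """
    counts = Counter()
    for text in logs:
        seen_in_this_log = set()
        for line in text.splitlines():
            match = FAILED_LINE.match(line.strip())
            if match:
                # Group 3 is the test name; it may contain spaces.
                seen_in_this_log.add(match.group(3))
        counts.update(seen_in_this_log)
    return counts


if __name__ == "__main__":
    sample_logs = [
        "FAILED: dart2js-ie11 release_x64 pkg/testing/test/hello_test\n",
        "FAILED: dart2js-ie11 release_x64 pkg/testing/test/hello_test\n"
        "FAILED: dart2js-ie11 release_x64 html/xsltprocessor_test/supported\n",
    ]
    for name, times in count_failures(sample_logs).most_common():
        print(f"{name} failed {times} times")
```

Counting each test once per log keeps a test that is printed twice in one build (for example in a summary section) from being over-counted.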

@whesse
Contributor

whesse commented Mar 15, 2017

@sigmundch
Member

/cc @stereotype441
I'd skip the pkg/front_end/* tests for now. I don't think there is much value in running these tests in ie11 at the moment. We do want to test that we can run the frontend in chrome (the main use case is to one day run ddc as part of the chrome-debugger), but I'm OK skipping all other browsers for now.

@peter-ahe-google
Contributor

I want to be sure that there's no confusion here: running pkg/front_end/* tests on the Dart VM is absolutely critical, but yes, we can skip them on browsers if that's helpful. However, it might also be valuable to check these tests for asynchronous pitfalls.

@whesse added the web-dart2js and web-libraries labels Mar 16, 2017
@whesse
Contributor

whesse commented Mar 16, 2017

The work that needs to be done first, I think, is:
Using a script, get the logs with the flaky timeouts from the dart2js-windows columns in the buildbot. Each column can be handled separately, since every test runs on only one of the shards, and which shard that is is deterministic. Each step can also be handled separately: most steps don't have any failures and can be skipped, and the co19 steps can be handled apart from the other steps.

The logs that are held on the buildbot (about 2 weeks' worth) can be fetched using the stdio/text URLs. For example, I just ran this command line:
for i in 4121 4112 4110 4107 4100 4090 4062 4061 4045; do curl -o log$i https://build.chromium.org/p/client.dart/builders/dart2js-win7-ie11ff-2-4-be/builds/$i/steps/dart2js%20ie11%20tests/logs/stdio/text; done
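The same loop can be written as a small script, which makes it easier to vary the builder and step. This is a sketch that only rebuilds the URL pattern used in the curl command above; `stdio_text_url` and `fetch_logs` are hypothetical helper names, and nothing here checks that the server still serves these URLs.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base taken from the curl command above.
BASE = "https://build.chromium.org/p/client.dart/builders"


def stdio_text_url(builder, build, step):
    """Build the stdio/text URL for one step of one build."""
    # quote() percent-encodes the spaces in the step name as %20.
    return (
        f"{BASE}/{quote(builder)}/builds/{build}"
        f"/steps/{quote(step)}/logs/stdio/text"
    )


def fetch_logs(builder, builds, step):
    """Download each log and save it as log<build>, like the curl loop."""
    for build in builds:
        with urlopen(stdio_text_url(builder, build, step)) as response:
            data = response.read()
        with open(f"log{build}", "wb") as out:
            out.write(data)


if __name__ == "__main__":
    # Prints the URL only; call fetch_logs(...) to actually download.
    print(stdio_text_url("dart2js-win7-ie11ff-2-4-be", 4121, "dart2js ie11 tests"))
```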

There is also a permanent record of older logs in cloud storage, stored in logdog, viewable by the links in square brackets in a build, which use the logdog viewer:
https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fclient.dart%2Fdart2js-win7-ie11ff-3-4-be%2F4076%2F%2B%2Frecipes%2Fsteps%2Fdart2js_ff_observatory_ui_tests%2F0%2Fstdout
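The viewer link carries the log stream path percent-encoded in its `s` query parameter, so the builder name, build number and step can be read back out of it. A minimal sketch using only the standard library (`logdog_stream_path` is a hypothetical helper name, and the meaning of the individual path segments is inferred from the link above rather than from logdog documentation):

```python
from urllib.parse import parse_qs, urlparse


def logdog_stream_path(viewer_url):
    """Return the decoded value of the `s` parameter of a viewer link."""
    query = parse_qs(urlparse(viewer_url).query)
    # parse_qs returns a list of values per key; the link has a single `s`.
    return query["s"][0]


if __name__ == "__main__":
    url = (
        "https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fclient.dart"
        "%2Fdart2js-win7-ie11ff-3-4-be%2F4076%2F%2B%2Frecipes%2Fsteps"
        "%2Fdart2js_ff_observatory_ui_tests%2F0%2Fstdout"
    )
    print(logdog_stream_path(url))
```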

There is also a different way of seeing all the runs in a column, using milo, the replacement for buildbot. This can also go back arbitrarily long, rather than dropping runs more than a month old:
https://luci-milo.appspot.com/buildbot/client.dart/dart2js-win7-ie11ff-4-4-be/
https://luci-milo.appspot.com/

There is a command-line tool, called logdog, for fetching the file as a direct text download:
https://github.com/luci/luci-go/blob/master/logdog/client/cmd/logdog/README.md
I will find out how we can get these command-line tools without downloading and building them ourselves.

@floitschG
Contributor

I just committed a small library that allows us to download logs via logdog:
https://github.com/dart-lang/sdk/blob/master/tools/gardening/lib/src/logdog.dart

This should allow us to fetch logs from a longer period, thus showing us which tests fail the most often.

@efortuna efortuna self-assigned this Apr 4, 2017
@efortuna
Contributor

efortuna commented Apr 4, 2017

Hi Bill,
What's the latest on this bug? Last I heard, the Chrome people were re-imaging the Windows machines. That has been done, and now it looks like we're back to spurious flaky timeout failures (as opposed to the return status failures of last week that required the re-imaging). What's the current status of handling flaky tests? It looks like in some cases we're re-running, but I don't see any re-running happening for these tests that timed out: https://uberchromegw.corp.google.com/i/client.dart/builders/dart2js-win7-ie11ff-3-4-be/builds/75/steps/dart2js%20ie11%20tests/logs/stdio

@mkustermann
Member Author

During my gardening shift I saw this build failure today.

One very interesting thing is that the number of failures equals precisely the number of cores this machine has (which corresponds to the number of parallel tests we run).

This leads me to suspect that Internet Explorer might get into a bad state, which will affect all open tabs / windows, thereby affecting all concurrently running tests.

One situation in which this could happen is if Internet Explorer (or the system) pops up a modal dialog (e.g. "this script is no longer responding, do you want to wait for it? y/n"). Similar issues have occurred in the past.

Since I'm the gardener today, I'll make a far-fetched attempt: take a screenshot before killing the browsers and upload it to cloud storage (see cl).

mkustermann added a commit that referenced this issue Apr 12, 2017
The hypothesis is that a modal dialog from Internet Explorer causes
the currently running test to hang until test.dart kills the browser.

Capturing a screenshot might give an insight into whether there is a dialog showing up.

BUG=#28955
R=johnniwinther@google.com

Review-Url: https://codereview.chromium.org/2811093003 .
whesse added a commit that referenced this issue May 11, 2017
@whesse
Contributor

whesse commented May 11, 2017

Right now, IE11 is mainly timing out in the first run of IE on a build. I have watched this with a window open: the IE window is open, trying to load the driver page, and nothing is happening.
The option to open developer tools is grayed out in the system menu, so I can't see the network traffic.

This timing-out browser is killed, and a new one is opened, and it times out too, while loading the driver page. Only on the third attempt to open the browser is the connection made.

The strange thing is that this only happens in the first IE11 step in a build, so it seems like there could be some IE11 state that is initialized, and kept warm, even when the browsers are killed.

@whesse
Contributor

whesse commented May 15, 2017

The debug log for IE11 bots shows that the 60 second timeout for fetching a test is being hit when starting ie11 for the first time in a build. The browser is then killed, and a new instance is started, which again takes more than 60 seconds to start up. This cycle continues until the max number of failures is reached, and then the test run is stopped.

I am increasing the time allowed for a browser to fetch a test to 120 seconds, which should fix the problem on IE. This does not increase the timeout for tests that take too long, which is controlled by a different timer.
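The effect of the longer fetch timeout can be illustrated with a toy model of the kill-and-restart cycle described above. This is not test.dart's code: `attempts_until_connected`, the startup times and the failure limit are all made up for illustration.

```python
def attempts_until_connected(startup_seconds, fetch_timeout, max_failures):
    """Return the 1-based attempt on which the browser connects, or None.

    startup_seconds: how long each successive browser launch would need
        before it fetches its first test (hypothetical numbers).
    fetch_timeout: seconds allowed for a browser to fetch a test.
    max_failures: launches that may time out before the run is stopped.
    """
    for attempt, startup in enumerate(startup_seconds, start=1):
        if startup <= fetch_timeout:
            return attempt  # The browser connected in time.
        # Otherwise the browser is killed and counted as a failure.
        if attempt >= max_failures:
            return None  # Too many failures: the test run is stopped.
    return None


if __name__ == "__main__":
    cold_start = [70, 65, 30]  # Seconds; the first launches are slow.
    print(attempts_until_connected(cold_start, fetch_timeout=60, max_failures=3))
    print(attempts_until_connected(cold_start, fetch_timeout=120, max_failures=3))
```

With these made-up numbers, a 60 second limit only connects on the third launch, while a 120 second limit connects on the first, which is the behavior the change is aiming for. The per-test timeout is deliberately absent from the model because it is controlled by a separate timer.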

whesse added a commit that referenced this issue May 15, 2017
On a cold, overloaded system, IE sometimes takes more than 60 seconds to start.

BUG=#28955
R=kustermann@google.com

Review-Url: https://codereview.chromium.org/2878423002 .
mithro pushed a commit to mithro/chromium-build that referenced this issue Jun 2, 2017
Bug: dart-lang/sdk#28955
Change-Id: I304136572dc81d2954fecb45a5abcebe6e22090a
Reviewed-on: https://chromium-review.googlesource.com/509610
Reviewed-by: Martin Kustermann <kustermann@google.com>
Commit-Queue: William Hesse <whesse@google.com>
@whesse
Contributor

whesse commented Jun 14, 2017

The commit d6ca1a5 seems not to be working correctly, so I am introducing a new CL to catch the timeouts more robustly. If this works, then we know there was a problem with the previous attempt.

whesse added a commit that referenced this issue Jun 14, 2017
@whesse
Contributor

whesse commented Jun 14, 2017

https://codereview.chromium.org/2938813002, which should stop ie11 timeouts from being reported as errors, has landed as d98d32b

whesse added a commit that referenced this issue Jun 14, 2017
Only content_shell testing uses the BrowserCommandOutputImpl class.
Rename the class to ContentShellCommandOutputImpl.

BUG=#28955
BUG=#29869
R=kustermann@google.com

Review-Url: https://codereview.chromium.org/2934243002 .
@efortuna
Contributor

I feel like these bots have been a lot more stable. Am I mistaken? Can we close this issue?

@whesse
Contributor

whesse commented Nov 16, 2017

Actually, we were just thinking about reverting the change that ignores Windows IE timeouts. The builders have been stable because up to 5 or 10 timeouts are ignored, for each run of test.py.
But now I realize that the fix by @mraleph for Windows probably won't help these bots, because they are timing out in IE. So if we revert the hack, and they become unstable, we would recommit the hack (that ignores up to 10 IE timeouts).

I would not close the issue, because it is not fixed, just hidden.
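To make "hidden, not fixed" concrete, here is a rough model of a per-run ignore budget. It is a sketch under assumptions: the real logic lives in test.py and may differ, `report_outcomes` and the outcome strings are hypothetical, and it assumes that timeouts beyond the budget surface as ordinary timeouts.

```python
def report_outcomes(outcomes, max_ignored_timeouts=10):
    """Rewrite up to `max_ignored_timeouts` timeouts in one run as "ignore".

    outcomes: list of per-test results such as "pass", "fail", "timeout".
    Returns a new list; timeouts past the budget are reported unchanged.
    """
    ignored = 0
    reported = []
    for outcome in outcomes:
        if outcome == "timeout" and ignored < max_ignored_timeouts:
            ignored += 1
            reported.append("ignore")
        else:
            reported.append(outcome)
    return reported


if __name__ == "__main__":
    run = ["pass", "timeout", "pass", "timeout", "fail"]
    print(report_outcomes(run, max_ignored_timeouts=10))
    print(report_outcomes(run, max_ignored_timeouts=1))
```

A run whose timeouts all fit within the budget reports no timeouts at all, so the builder looks stable even though the timeouts are still happening.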

@whesse
Contributor

whesse commented Dec 17, 2018

After the ie11 builders were moved from the golo lab (VMs permanently assigned to them) to swarming (Windows GCE VMs taken at random from a pool), the flakiness and the real timeouts have increased to more than 10 per shard.

To investigate this and turn the builders green, we are dropping the code that ignores the timeouts. Real timeouts will show up as errors, and flaky ones will enter the flakiness system and eventually be forgiven. @sortie

@whesse
Contributor

whesse commented Dec 17, 2018

The high number of ignore results could also come from the other dart2js Windows condition that returns ignore: a dart2js hang. We are leaving that code in place, to verify that the ignores that came with the move to swarming are ie11 timeouts. The issue tracking dart2js hangs is #26060


6 participants