High priority: Investigate flakiness on dart2js-windows #28955

mkustermann · 2017-03-02T11:54:22Z

There are a lot of flaky timeouts on all dart2js-windows builders, and these do not seem to be test-specific.

The underlying problem might be an infrastructure issue in test.dart's browser controller in connection with IE10/IE11. Our buildbots don't surface the debug log information from test.dart so it's really hard to diagnose.

I think it should be a high priority to look into those flakes and fix the underlying issue.

mkustermann · 2017-03-02T11:57:34Z

This issue has been going on for a long time. We should consider moving all of these builders to FYI until they are fixed.

whesse · 2017-03-02T13:23:49Z

I'm rebooting these machines, because IE10 doesn't seem to be working well, even in a remote desktop session. IE10 is old and unsupported, so we should switch to testing the new IE, Edge, on Windows 10.

whesse · 2017-03-15T16:25:12Z

Looking at just one builder, the ie11 shard 2 of 4, and collecting all of the timeouts in the 10 failing ie11 tests in the last 100 runs, we found two tests that timed out repeatedly, and more that only timed out once.

FAILED: dart2js-ie11 release_x64 html/custom/constructor_calls_created_synchronously_test failed 3 times
FAILED: dart2js-ie11 release_x64 pkg/testing/test/hello_test failed 4 times

These each failed once:
FAILED: dart2js-ie11 release_x64 html/custom_elements_test/preregister
FAILED: dart2js-ie11 release_x64 html/js_typed_interop_test/avoid leaks on dart:core
FAILED: dart2js-ie11 release_x64 html/xsltprocessor_test/supported
FAILED: dart2js-ie11 release_x64 pkg/front_end/test/scanner_test
FAILED: dart2js-ie11 release_x64 pkg/front_end/test/src/async_dependency_walker_test

All failures are timeouts of tests that don't normally time out.
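
A tally like the one above can be produced mechanically once the logs are saved locally. This is a minimal sketch, assuming each failure appears on its own line in the form `FAILED: <configuration> <mode_arch> <test name>` as quoted above. The function name `tally_failures` is hypothetical and is not part of test.dart.

```python
from collections import Counter


def tally_failures(log_texts):
    """Count, per test, the number of logs that list it on a FAILED: line."""
    counts = Counter()
    for text in log_texts:
        failed_in_this_log = set()
        for line in text.splitlines():
            line = line.strip()
            if not line.startswith("FAILED:"):
                continue
            # Assumed layout: "FAILED: <configuration> <mode_arch> <test name>".
            # The test name may itself contain spaces, so split at most twice.
            parts = line[len("FAILED:"):].split(None, 2)
            if len(parts) == 3:
                failed_in_this_log.add(parts[2])
        # Count each test once per log, even if the log repeats the line.
        counts.update(failed_in_this_log)
    return counts
```

Calling `tally_failures(texts).most_common()` lists the tests that fail in the most runs first, which separates repeat offenders from one-off timeouts.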

More investigation of the tests that are timing out on all windows browser tests can help us figure out what the problem is.

whesse · 2017-03-15T16:25:54Z

@peter-ahe-google @sigmundch @sortie

sigmundch · 2017-03-15T16:41:25Z

/cc @stereotype441
I'd skip for now the pkg/front_end/* tests - I don't think there is much value in running these tests in ie11 at the moment. We do want to test that we can run the frontend in chrome (the main use case is to one day run ddc as part of the chrome-debugger), but I'm OK skipping all other browsers for now.

peter-ahe-google · 2017-03-16T09:57:55Z

I want to be sure that there's no confusion here: running pkg/front_end/* tests on the Dart VM is absolutely critical, but yes, we can skip them on browsers if that's helpful. However, it might also be valuable to check these tests for asynchronous pitfalls.

whesse · 2017-03-16T13:48:57Z

The current work that needs to be done first, I think, is:
Using a script, get the logs with the flaky timeouts from the dart2js-windows columns in the buildbot. Each column can be done separately, since every test runs on only one of the shards, and which shard that is is deterministic. Each step can also be done separately: most steps don't have any failures and can be skipped, and the co19 steps can be handled separately from the other steps.

The logs that are held on the buildbot (about 2 weeks' worth) can be fetched using the stdio/text URLs. For example, I just ran this command line:
for i in 4121 4112 4110 4107 4100 4090 4062 4061 4045; do curl -o log$i https://build.chromium.org/p/client.dart/builders/dart2js-win7-ie11ff-2-4-be/builds/$i/steps/dart2js%20ie11%20tests/logs/stdio/text; done
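
The same URLs can be generated for any builder, step and list of build numbers. This is a minimal sketch that only builds the URL strings, following the pattern of the command line above. The function names are hypothetical, and the sketch does not check whether the host still serves these logs.

```python
from urllib.parse import quote

BASE = "https://build.chromium.org/p/client.dart/builders"


def stdio_text_url(builder, build, step):
    """Return the plain-text stdio log URL for one step of one build."""
    # quote() percent-encodes the spaces in step names such as
    # "dart2js ie11 tests".
    return (f"{BASE}/{quote(builder)}/builds/{build}"
            f"/steps/{quote(step)}/logs/stdio/text")


def stdio_text_urls(builder, builds, step):
    """Return one log URL per build number, in the order given."""
    return [stdio_text_url(builder, b, step) for b in builds]
```

Each returned URL can then be passed to curl, as in the loop above, or to any other HTTP client.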

There is also a permanent record of older logs in cloud storage, stored in logdog, viewable by the links in square brackets in a build, which use the logdog viewer:
https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fclient.dart%2Fdart2js-win7-ie11ff-3-4-be%2F4076%2F%2B%2Frecipes%2Fsteps%2Fdart2js_ff_observatory_ui_tests%2F0%2Fstdout
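
The `s` parameter in the viewer link is a percent-encoded stream path. This is a minimal sketch of decoding such a link and encoding it again, using only the URL shown above. The helper names are hypothetical and are not part of any logdog tool.

```python
from urllib.parse import parse_qs, quote, urlsplit

VIEWER = "https://luci-logdog.appspot.com/v/"


def stream_path(viewer_url):
    """Return the decoded stream path from the 's' query parameter."""
    query = urlsplit(viewer_url).query
    return parse_qs(query)["s"][0]


def viewer_url(path):
    """Build a viewer link for a stream path, encoding '/' and '+' too."""
    return VIEWER + "?s=" + quote(path, safe="")
```

Decoding the link above gives `chromium/bb/client.dart/dart2js-win7-ie11ff-3-4-be/4076/+/recipes/steps/dart2js_ff_observatory_ui_tests/0/stdout`, which shows where the builder name, build number and step name sit in the path.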

There is also a different way of seeing all the runs in a column, using milo, the replacement for buildbot. This can also go back arbitrarily long, rather than dropping runs more than a month old:
https://luci-milo.appspot.com/buildbot/client.dart/dart2js-win7-ie11ff-4-4-be/
https://luci-milo.appspot.com/

There is a command-line tool, called logdog, for fetching the file as a direct text download:
https://github.com/luci/luci-go/blob/master/logdog/client/cmd/logdog/README.md
I will be finding out how we get these command-line tools other than downloading and building them ourselves.

floitschG · 2017-03-16T18:53:15Z

I just committed a small library that allows us to download logs via logdog:
https://github.com/dart-lang/sdk/blob/master/tools/gardening/lib/src/logdog.dart

This should allow us to fetch logs from a longer period, thus showing us which tests fail the most often.

efortuna · 2017-04-04T19:28:00Z

Hi Bill,
What's the latest on this bug? Last I heard, the Chrome people were re-imaging the Windows machines. That has been done, and now it looks like we're just back to spurious flaky timeout failures (as opposed to the return status failures of last week that required the re-imaging). What's the current status of handling flaky tests? It looks like in some cases we're re-running, but I don't see any re-running happening for these tests that timed out: https://uberchromegw.corp.google.com/i/client.dart/builders/dart2js-win7-ie11ff-3-4-be/builds/75/steps/dart2js%20ie11%20tests/logs/stdio

mkustermann · 2017-04-11T14:24:11Z

During my gardening shift I saw this build failure today.

One very interesting thing is that the number of failures equals precisely the number of cores this machine has (which corresponds to the number of parallel tests we run).

This leads me to suspect that Internet Explorer might get into a bad state, which will affect all open tabs / windows, thereby affecting all concurrently running tests.

One possible situation in which this could happen is if Internet Explorer (or the system) pops up a modal dialog (e.g. "this script is no longer responding, do you want to wait for it? y/n"). Similar issues have occurred in the past.

Since I'm the gardener today, I'll make a far-fetched attempt: take a screenshot before killing the browsers and upload it to cloud storage (see cl).

The hypothesis is that a modal dialog from Internet Explorer causes the currently running test to hang until test.dart kills the browser. Capturing a screenshot might give insight into whether a dialog is showing up. BUG=#28955 R=johnniwinther@google.com Review-Url: https://codereview.chromium.org/2811093003 .

BUG=#28955 R=sigmund@google.com Review-Url: https://codereview.chromium.org/2875683002 .

whesse · 2017-05-11T13:30:45Z

Right now, IE11 is mainly timing out in the first run of IE on a build. I have watched this with a window open, and the IE window is open, trying to load the driver page, and nothing is happening.
The option to open developer tools is grayed out in the system menu, so I can't see the network traffic.

This timing-out browser is killed, and a new one is opened, and it times out too, while loading the driver page. Only on the third attempt to open the browser is the connection made.

The strange thing is that this only happens in the first IE11 step in a build, so it seems like there could be some IE11 state that is initialized, and kept warm, even when the browsers are killed.

whesse · 2017-05-15T11:28:09Z

The debug log for IE11 bots shows that the 60 second timeout for fetching a test is being hit when starting ie11 for the first time in a build. The browser is then killed, and a new instance is started, which again takes more than 60 seconds to start up. This cycle continues until the max number of failures is reached, and then the test run is stopped.

Increasing the time allowed for a browser to fetch a test to 120 seconds. This should fix the problem on IE. And this does not increase the timeout for tests that take too long, which is controlled by a different timer.
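
The effect of the longer fetch timeout can be illustrated with a small simulation. This is a minimal sketch, not the test.dart implementation: the startup times and the `max_failures` value are made-up illustrative numbers, and the function name is hypothetical.

```python
def start_browser_with_retries(startup_seconds, fetch_timeout, max_failures):
    """Simulate relaunching a browser until it fetches a test in time.

    startup_seconds lists how long each successive launch needs before it
    requests its first test (hypothetical measurements).
    Returns (succeeded, timeouts_hit).
    """
    timeouts = 0
    for seconds in startup_seconds:
        if seconds <= fetch_timeout:
            return True, timeouts
        # The launch missed the deadline: the browser is killed and a
        # fresh instance is started.
        timeouts += 1
        if timeouts >= max_failures:
            break
    return False, timeouts
```

With hypothetical cold-start times of `[95, 80, 20]` seconds, a 60 second timeout hits two timeouts before the third launch connects, while a 120 second timeout lets the first launch succeed.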

On a cold, overloaded system, IE sometimes takes more than 60 seconds to start. BUG=#28955 R=kustermann@google.com Review-Url: https://codereview.chromium.org/2878423002 .

Bug: dart-lang/sdk#28955 Change-Id: I304136572dc81d2954fecb45a5abcebe6e22090a Reviewed-on: https://chromium-review.googlesource.com/509610 Reviewed-by: Martin Kustermann <kustermann@google.com> Commit-Queue: William Hesse <whesse@google.com>

whesse · 2017-06-14T12:08:46Z

The commit d6ca1a5 seems not to be working correctly, so I am introducing a new CL to catch the timeouts more robustly. If this works, then we know there was a problem with the previous attempt.

BUG=#28955 R=kustermann@google.com Review-Url: https://codereview.chromium.org/2938813002 .

whesse · 2017-06-14T12:19:05Z

https://codereview.chromium.org/2938813002, which should stop ie11 timeouts from being reported as errors, has landed as d98d32b.

Only content_shell testing uses the BrowserCommandOutputImpl class. Rename the class to ContentShellCommandOutputImpl. BUG=#28955 BUG=#29869 R=kustermann@google.com Review-Url: https://codereview.chromium.org/2934243002 .

The land of https://codereview.chromium.org/2933973002 merged badly with https://codereview.chromium.org/2934243002/ BUG=#28955 R=sortie@google.com Review-Url: https://codereview.chromium.org/2938383002 .

efortuna · 2017-11-16T18:49:11Z

I feel like these bots have been a lot more stable. Am I mistaken? Can we close this issue?

whesse · 2017-11-16T19:03:50Z

Actually, we were just thinking about reverting the change that ignores Windows IE timeouts. The builders have been stable because up to 5 or 10 timeouts are ignored, for each run of test.py.
But now I realize that the fix by @mraleph for Windows probably won't help these bots, because they are timing out in IE. So if we revert the hack, and they become unstable, we would recommit the hack (that ignores up to 10 IE timeouts).

I would not close the issue, because it is not fixed, just hidden.
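
A simplified model shows why ignoring timeouts hides the problem instead of fixing it. This is a minimal sketch, not the test.py code: the outcome labels, the default limit of 10, and the behavior once the limit is exceeded are all assumptions made for illustration.

```python
def summarize_run(outcomes, ignore_limit=10):
    """Return (status, hidden_timeouts) for one run.

    outcomes holds one string per test: "pass", "fail" or "ie_timeout"
    (hypothetical labels).
    """
    ie_timeouts = outcomes.count("ie_timeout")
    real_failures = outcomes.count("fail")
    if ie_timeouts > ignore_limit:
        # Assumption: beyond the limit the timeouts are reported, so the
        # run goes red and nothing is hidden.
        return "red", 0
    status = "red" if real_failures else "green"
    return status, ie_timeouts
```

A run with 8 IE timeouts and no other failures comes out green with 8 hidden timeouts, which is why a stable-looking builder does not show that the underlying problem is gone.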

whesse · 2018-12-17T11:38:16Z

After the ie11 builders were moved from the golo lab (where VMs were permanently assigned to them) to swarming (where Windows GCE VMs are taken randomly from a pool), the flakiness and the real timeouts have increased to more than 10 per shard.

To investigate this, and turn it green, we are dropping the code that ignores the timeouts. Real timeouts will show up as errors, and flaky ones will enter the flakiness system, and eventually be forgiven. @sortie

whesse · 2018-12-17T11:58:39Z

The high number of ignore results could also come from the other dart2js windows thing that returns ignore: a dart2js hang. We are leaving that code in place, to verify that the ignores coming with the move to swarming are ie11 timeouts. The issue tracking dart2js hangs is #26060

mkustermann added the area-infrastructure and gardening labels Mar 2, 2017

mkustermann changed the title ~~Investigate flakiness on dart2js-windows~~ High priority: Investigate flakiness on dart2js-windows Mar 2, 2017

nex3 assigned whesse Mar 10, 2017

whesse added the web-dart2js and web-libraries labels Mar 16, 2017

efortuna self-assigned this Apr 4, 2017

whesse added a commit that referenced this issue May 11, 2017

Report IE11 timeouts in debug log, not as failing tests.

d6ca1a5

BUG=#28955 R=sigmund@google.com Review-Url: https://codereview.chromium.org/2875683002 .

whesse added a commit that referenced this issue May 15, 2017

Increase startup time allowed for browsers

c235977

On a cold, overloaded system, IE sometimes takes more than 60 seconds to start. BUG=#28955 R=kustermann@google.com Review-Url: https://codereview.chromium.org/2878423002 .

whesse added a commit that referenced this issue Jun 14, 2017

Report ie11 timeouts in debug log, not as errors (attempt 2)

d98d32b

BUG=#28955 R=kustermann@google.com Review-Url: https://codereview.chromium.org/2938813002 .

sigmundch mentioned this issue Nov 16, 2017

Builder dart2js-win7-ie11ff-1-4-be has many timeouts #28662

Closed

sigmundch removed the web-dart2js label Jun 22, 2018
