High priority: Investigate flakiness on dart2js-windows #28955
Comments
This issue has been going on for a long time. We should consider moving all of these builders to FYI until they are fixed.
I'm rebooting these machines, because IE10 doesn't seem to be working well, even in a remote desktop session. IE10 is old and unsupported, so we should switch to testing the new browser, Edge, on Windows 10.
Looking at just one builder, ie11 shard 2 of 4, and collecting all of the timeouts from the 10 failing ie11 tests in the last 100 runs, we found two tests that timed out repeatedly, and more that timed out only once. dart2js-ie11 release_x64 html/custom/constructor_calls_created_synchronously_test failed 3 times. These each failed once: All failures are timeouts of tests that don't normally time out. Further investigation of the tests that time out across all Windows browser builders may help us figure out what the problem is.
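A tally like this can be scripted once the logs are available as text. The following is a minimal Python sketch, not the tooling actually used here. It assumes a hypothetical log line format (`FAILED: <configuration> <test> (Timeout)`); the real test.py output may look different, in which case the regular expression needs adjusting. The names `tally_timeouts` and `repeat_offenders` are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical line format, for example:
#   "FAILED: dart2js-ie11 release_x64 html/some_test (Timeout)"
# This pattern is an assumption, not the actual test.py output format.
TIMEOUT_LINE = re.compile(
    r"^FAILED: (?P<config>\S+ \S+) (?P<test>\S+) \(Timeout\)$"
)


def tally_timeouts(logs):
    """Count in how many runs each test timed out.

    `logs` is an iterable of log texts, one per build run.
    A test is counted at most once per run.
    """
    counts = Counter()
    for log in logs:
        seen = set()
        for line in log.splitlines():
            match = TIMEOUT_LINE.match(line.strip())
            if match:
                seen.add(match.group("test"))
        counts.update(seen)
    return counts


def repeat_offenders(counts, min_runs=2):
    """Return (test, runs) pairs for tests that timed out in at least `min_runs` runs."""
    return [(test, n) for test, n in counts.most_common() if n >= min_runs]
```

Tests that appear in `repeat_offenders` are the ones worth looking at individually; tests that show up only once are more likely victims of a machine-wide problem.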
/cc @stereotype441
I want to be sure that there's no confusion here: running
The current work that needs to be done first, I think, is: The logs that are held on the buildbot (about 2 weeks' worth) can be fetched using the stdio/text URLs. For example, I just did a command line: There is also a permanent record of older logs in cloud storage, stored in logdog, viewable by the links in square brackets in a build, which use the logdog viewer: There is also a different way of seeing all the runs in a column, using milo, the replacement for buildbot. This can also go back arbitrarily long, rather than dropping runs more than a month old: There is a command-line tool, called logdog, for fetching the file as a direct text download:
I just committed a small library that allows us to download logs via logdog: This should allow us to fetch logs from a longer period, thus showing us which tests fail the most often.
Hi Bill,
During my gardening shift I saw this build failure today. One very interesting thing is that the number of failures equals precisely the number of cores this machine has (which corresponds to the number of tests we run in parallel). This leads me to suspect that Internet Explorer might get into a bad state that affects all open tabs and windows, and therefore all concurrently running tests. One way this could happen is if Internet Explorer (or the system) pops up a modal dialog (e.g. "this script is no longer responding, do you want to wait for it? y/n"). Similar issues have occurred in the past. Since I'm the gardener today, I'll make a far-fetched attempt: taking a screenshot before killing the browsers and uploading it to cloud storage (see cl).
The hypothesis is that a modal dialog from Internet Explorer causes the currently running test to hang until test.dart kills the browser. Capturing a screenshot might give insight into whether a dialog is showing up. BUG=#28955 R=johnniwinther@google.com Review-Url: https://codereview.chromium.org/2811093003 .
BUG=#28955 R=sigmund@google.com Review-Url: https://codereview.chromium.org/2875683002 .
Right now, IE11 mainly times out in the first run of IE on a build. I have watched this with a window open: the IE window is open, trying to load the driver page, and nothing is happening. The timed-out browser is killed and a new one is opened, and it times out too while loading the driver page. Only on the third attempt to open the browser is the connection made. The strange thing is that this only happens in the first IE11 step in a build, so it seems there could be some IE11 state that is initialized and kept warm, even when the browsers are killed.
The debug log for the IE11 bots shows that the 60-second timeout for fetching a test is being hit when ie11 is started for the first time in a build. The browser is then killed, and a new instance is started, which again takes more than 60 seconds to start up. This cycle continues until the maximum number of failures is reached, and then the test run is stopped. Increasing the time allowed for a browser to fetch a test to 120 seconds should fix the problem on IE. This does not increase the timeout for tests that take too long, which is controlled by a different timer.
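To make the kill-and-restart cycle concrete, here is a small Python model of it. This is only a sketch of the behavior described in this comment, not the test.dart browser controller code; the function name, parameters, and the start-up times in the usage note are invented for illustration.

```python
def attempts_until_connected(startup_seconds, fetch_timeout, max_failures):
    """Return the 1-based launch on which the browser connects, or None.

    startup_seconds: seconds each successive browser launch needs before it
        fetches its first test (hypothetical measurements).
    fetch_timeout: seconds allowed for a browser to fetch a test.
    max_failures: number of timed-out launches after which the run stops.
    """
    failures = 0
    for attempt, needed in enumerate(startup_seconds, start=1):
        if needed <= fetch_timeout:
            return attempt  # browser fetched a test in time
        failures += 1  # browser is killed and a new instance is started
        if failures >= max_failures:
            return None  # too many failures: the test run is stopped
    return None  # ran out of launches without connecting
```

With made-up cold-start times of `[95, 80, 20]` seconds, a 60-second limit only connects on the third launch (and gives up entirely if `max_failures` is 2), while a 120-second limit connects on the first launch. The per-test timeout is not part of this model, matching the note that it is controlled by a different timer.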
On a cold, overloaded system, IE sometimes takes more than 60 seconds to start. BUG=#28955 R=kustermann@google.com Review-Url: https://codereview.chromium.org/2878423002 .
Bug: dart-lang/sdk#28955 Change-Id: I304136572dc81d2954fecb45a5abcebe6e22090a Reviewed-on: https://chromium-review.googlesource.com/509610 Reviewed-by: Martin Kustermann <kustermann@google.com> Commit-Queue: William Hesse <whesse@google.com>
The commit d6ca1a5 seems not to be working correctly, so I am introducing a new CL to catch the timeouts more robustly. If this works, then we know there was a problem with the previous attempt.
BUG=#28955 R=kustermann@google.com Review-Url: https://codereview.chromium.org/2938813002 .
https://codereview.chromium.org/2938813002, which should stop ie11 timeouts from being reported as errors, has landed as d98d32b.
Only content_shell testing uses the BrowserCommandOutputImpl class. Rename the class to ContentShellCommandOutputImpl. BUG=#28955 BUG=#29869 R=kustermann@google.com Review-Url: https://codereview.chromium.org/2934243002 .
The land of https://codereview.chromium.org/2933973002 merged badly with https://codereview.chromium.org/2934243002/ BUG=#28955 R=sortie@google.com Review-Url: https://codereview.chromium.org/2938383002 .
I feel like these bots have been a lot more stable. Am I mistaken? Can we close this issue?
Actually, we were just thinking about reverting the change that ignores Windows IE timeouts. The builders have been stable because up to 5 or 10 timeouts are ignored for each run of test.py. I would not close the issue, because it is not fixed, just hidden.
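A rough Python illustration of why ignoring timeouts hides the problem rather than fixing it. It assumes a run is reported green whenever its timeout count is at most the ignore limit; that rule, the function name, and the numbers in the usage note are assumptions for illustration, not the actual test.py logic.

```python
def summarize_runs(timeouts_per_run, ignore_limit):
    """Split runs into green and red and count the timeouts hidden in green runs.

    Assumption: a run is reported green when its number of timeouts is at
    most `ignore_limit`, and red otherwise.
    """
    green = [n for n in timeouts_per_run if n <= ignore_limit]
    red = [n for n in timeouts_per_run if n > ignore_limit]
    return {
        "green_runs": len(green),
        "red_runs": len(red),
        "hidden_timeouts": sum(green),
    }
```

For example, `summarize_runs([0, 4, 9, 12], 10)` reports three green runs that together contain 13 timeouts nobody sees, and a single red run.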
After the ie11 builders were moved from the golo lab (VMs permanently assigned to them) to swarming (Windows GCE VMs taken at random from a pool), the flakiness and the real timeouts have increased to more than 10 per shard. To investigate this, and turn the builders green, we are dropping the code that ignores the timeouts. Real timeouts will show up as errors, and flaky ones will enter the flakiness system and eventually be forgiven. @sortie
The high number of ignored results could also come from the other dart2js Windows case that returns ignore: a dart2js hang. We are leaving that code in place, to verify that the ignores that came with the move to swarming are ie11 timeouts. The issue tracking dart2js hangs is #26060
There are a lot of flaky timeouts on all dart2js-windows builders, and these do not seem to be test-specific.
The underlying problem might be an infrastructure issue in test.dart's browser controller in connection with IE10/IE11. Our buildbots don't surface the debug log information from test.dart so it's really hard to diagnose.
I think it should be a high priority to look into those flakes and fix the underlying issue.