
Conversation

@SteveSandersonMS (Member) commented May 8, 2019

This might wipe out a huge amount of the flakiness we've experienced in the last few days.

I said we needed something unique to signal the start of each test case, but thinking a step further, that should be unnecessary. We can explicitly wait for the last test case to be removed from the DOM, and then wait for the DOM to be updated a further time for the new test case.
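
For reference, here's a minimal sketch of that waiting strategy using Selenium's WebDriverWait (from the Selenium.Support package). The "test-case-root" ID and the 10-second timeout are illustrative assumptions, not the PR's actual code:

```csharp
// Sketch only: "test-case-root" and the 10-second timeout are assumptions.
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

public static class TestCaseSync
{
    public static void WaitForNextTestCase(IWebDriver driver)
    {
        var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));

        // First wait for the previous test case's root element to be removed from the DOM...
        wait.Until(d => d.FindElements(By.Id("test-case-root")).Count == 0);

        // ...then wait for the DOM to be updated again with the new test case's root element.
        wait.Until(d => d.FindElements(By.Id("test-case-root")).Count == 1);
    }
}
```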

@SteveSandersonMS SteveSandersonMS added area-mvc Includes: MVC, Actions and Controllers, Localization, CORS, most templates area-blazor Includes: Blazor, Razor Components labels May 8, 2019
@ajaybhargavb (Contributor) left a comment

Nice 👏

@rynowak (Member) left a comment

👏 👏 👏

@SteveSandersonMS (Member, Author) commented

Well, some more tests still failed randomly.

For the new failures, the failing step was not "finding the first element immediately on initial render", so it's possible that the fix in this PR has helped with that case.

Instead, the only way I can make sense of the newer failures is that the CI runners are sometimes spectacularly slow, so a test just times out randomly partway through, even though we perform all assertions and element lookups with the "wait and retry" pattern.
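
To make the pattern concrete, here's a hedged sketch of a "wait and retry" assertion helper with a deliberately generous timeout; the WaitAssert name and the 30-second value are assumptions for illustration, and the repo's real helpers differ:

```csharp
// Illustrative only: retries a lookup until it matches or a generous
// timeout elapses, then asserts once more for a precise failure message.
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;
using Xunit;

public static class WaitAssert
{
    // Deliberately generous to absorb slow CI runners.
    private static readonly TimeSpan Timeout = TimeSpan.FromSeconds(30);

    public static void Equal(IWebDriver driver, string expected, Func<string> actual)
    {
        try
        {
            new WebDriverWait(driver, Timeout).Until(_ => actual() == expected);
        }
        catch (WebDriverTimeoutException)
        {
            // Fall through so the final assertion reports expected vs. actual.
        }
        Assert.Equal(expected, actual());
    }
}
```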

In case CI runner slowness really is the underlying reason why this has become much less reliable in the last few days, I'm extending the timeouts.

```csharp
// interference between test classes, for example test A observing the UI state of test B, which
// suggests that either Selenium has thread-safety issues that direct commands to the wrong
// browser instance, or something in our tracking of browser instances goes wrong.
[assembly: CollectionBehavior(DisableTestParallelization = true)]
```
(Member) commented on the diff hunk above:

How did you get to diagnose this?

@SteveSandersonMS (Member, Author) replied:

Posted below

@SteveSandersonMS (Member, Author) commented May 9, 2019

I've now investigated this very extensively.

  • Did we change something in our sources that made it turn flaky?
    • Very hard to tell. The flakiness started about a day ago, but none of the recent changes under src/Components affect how the E2E tests run, and nothing under src/Components that changed around the time the flakiness started appears related.
    • Likewise, no other changes in the rest of the aspnetcore source tree around this time appear related.
    • However, considering how intermittent this is, and how sensitive it seems to be to the amount of load and parallelism on the server, it's conceivable that some really low-level thing (e.g., in Kestrel) has changed. That wouldn't necessarily mean a defect in that other component; maybe we were just relying on a fluke for correct execution before.
  • Did Chrome do a release that affects this?
    • I don't think so. The latest Chrome 74 was released 16 days ago, ages before this started.
  • Did something in the Azure Pipelines (or our config for it) change?
    • It's theoretically possible. It's worth noting that the flakiness started in both the public and internal builds on the same day.

I've tried changing lots of things about how the E2E tests run, none of which fixed the problem:

  • Updating to newer versions of selenium-standalone, the Selenium Chrome driver, and the .NET Selenium packages - problem still occurs after each update
  • Creating a separate browser instance per test - problem still occurs
  • Being incredibly generous with timeouts - problem still occurs

... and along the way I got some error cases that were more interesting than others. The most interesting ones were very rare: a test from class A (e.g., ServerComponentRenderingTest) made an assertion that failed because it observed a UI state that was clearly put there by a completely different test from a different class B (e.g., ServerBindTest). How is it possible that, while executing one test, we see output from a different test? It shouldn't happen, because:

  • We have different browser instances per test class (sketched below), so it can't be just that we didn't tear down old state adequately
  • We even have different server instances per test class, so this can't be due to an actual server-side bug where different users interfere with each other. [Edit: Highly misleading/incorrect. We do have separate web host instances per test class, but they all run in the same .NET process and hence share state (e.g., in static fields), so they are not independent. This is a very good thing in this case, as we wouldn't have discovered the issue until it shipped otherwise.]
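
As a rough illustration of that per-class isolation (an assumption-laden sketch, not the repo's actual infrastructure): each xUnit test class can get its own browser and web host via IClassFixture<T>, yet, as noted in the edit above, every host still runs in the same .NET process:

```csharp
// Sketch under assumptions: each test class gets a fresh browser and a
// fresh Kestrel host, but all hosts share one process (and static state).
using System;
using Microsoft.AspNetCore.Hosting;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using Xunit;

public class BrowserServerFixture : IDisposable
{
    public IWebDriver Browser { get; } = new ChromeDriver();
    public IWebHost Host { get; }

    public BrowserServerFixture()
    {
        Host = new WebHostBuilder()
            .UseKestrel(options => options.ListenLocalhost(0)) // port 0: OS picks a free port
            .Configure(app => { /* app under test */ })
            .Build();
        Host.Start();
    }

    public void Dispose()
    {
        Browser.Quit();
        Host.Dispose();
    }
}

// xUnit creates one fixture instance per test class that uses it.
public class ExampleTest : IClassFixture<BrowserServerFixture>
{
    private readonly BrowserServerFixture _fixture;
    public ExampleTest(BrowserServerFixture fixture) => _fixture = fixture;
}
```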

I don't have an explanation, other than to posit that either Selenium itself has thread-safety issues (we are sharing a Selenium server instance across all tests), or that somehow our tracking of Browser instances goes wrong and we are invoking methods on the wrong instance (but looking at the implementation, I don't see how).
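
For context, the sharing meant here looks roughly like this (a sketch; the actual wiring differs): every browser is a RemoteWebDriver pointed at the same locally running Selenium server, so all commands from all test classes funnel through one hub:

```csharp
// Sketch: many RemoteWebDriver instances, one shared Selenium server.
using System;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Remote;

public static class BrowserFactory
{
    // All test classes connect to this single shared hub (4444 is Selenium's default port).
    private static readonly Uri SharedServer = new Uri("http://localhost:4444/wd/hub");

    public static RemoteWebDriver CreateBrowser()
        => new RemoteWebDriver(SharedServer, new ChromeOptions().ToCapabilities());
}
```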

A fix

One thing that does reliably make the problem go away is disabling parallelism on the E2E tests entirely. At least, it fixes it when I run locally; we've yet to see how it fares in CI. This shouldn't affect our level of test coverage, since we weren't covering simultaneous users on a single server anyway (we have separate server instances per test class, and tests within a class run serially anyway). The only drawbacks are:

  • The tests take a bit longer to run, though it doesn't really move the needle on our overall build times
  • It confirms that we don't really know what the underlying problem is, except to theorize that Selenium server has unknown thread-safety issues

Making this change also makes some bits of our E2E runner infrastructure somewhat redundant, e.g., all the tracking of browser instances, since we could now just have a single browser. I'm not proposing to rip out that infrastructure, since it's meant to be consistent with how we run Selenium in other places (e.g., template tests, of which there are far fewer, so it's not surprising they don't trip up as much), and hopefully we'll reintroduce the parallelism at some point.

@javiercn I know you've done some work on the E2E infrastructure, browser tracking, etc. What's your take on all this? Any theories about what we might be doing wrong in our browser tracking or assumptions around parallelism?

@javiercn (Member) commented May 9, 2019

@SteveSandersonMS Thanks for the impressive post-mortem. You are totally right in your analysis. I would have concluded the same thing.

Regarding Selenium thread-safety, there are a couple of things to note (not asking you to do any of this):

  • We can probably capture more detailed logs from the Selenium server. (I think it exposes them.)
  • We could try switching from a single global server to one server per test collection (those essentially run sequentially); see the sketch after this list.
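
A sketch of what that second option could look like, assuming selenium-standalone is installed and on PATH; the port-picking helper is an illustration, and real code would poll the server URL until it's ready before returning:

```csharp
// Hypothetical sketch: one selenium-standalone server per xUnit test
// collection via a collection fixture. Startup/readiness handling is
// simplified for brevity.
using System;
using System.Diagnostics;
using System.Net;
using System.Net.Sockets;
using Xunit;

public class SeleniumServerFixture : IDisposable
{
    private readonly Process _server;
    public Uri ServerUri { get; }

    public SeleniumServerFixture()
    {
        var port = GetFreePort();
        ServerUri = new Uri($"http://localhost:{port}/wd/hub");
        _server = Process.Start(new ProcessStartInfo
        {
            FileName = "selenium-standalone", // assumed installed globally via npm
            Arguments = $"start -- -port {port}",
            UseShellExecute = false,
        });
    }

    public void Dispose()
    {
        _server.Kill();
        _server.Dispose();
    }

    private static int GetFreePort()
    {
        // Bind to port 0 so the OS assigns an unused port, then release it.
        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start();
        var port = ((IPEndPoint)listener.LocalEndpoint).Port;
        listener.Stop();
        return port;
    }
}

[CollectionDefinition("Selenium server")]
public class SeleniumServerCollection : ICollectionFixture<SeleniumServerFixture> { }
```

Since xUnit runs the tests within a collection sequentially, each fixture's server would only ever serve one test at a time.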

I think your analysis has convinced me of the idea that we should also run the E2E SPA tests sequentially. They are already mostly sequential (as NPM restore happens sequentially), and we've seen flakiness in those too that could be associated with this.

@SteveSandersonMS (Member, Author) commented

Update: I was unsatisfied about not really understanding why this only just started happening so investigated further. Turns out it's actually a real defect introduced in a recent commit. And appropriately enough, it's my bug. Working on an actual fix now.

@javiercn (Member) commented May 9, 2019

> Update: I was unsatisfied about not really understanding why this only just started happening so investigated further. Turns out it's actually a real defect introduced in a recent commit. And appropriately enough, it's my bug. Working on an actual fix now.

I would be interested in knowing what went on here. Is there an explicit new test that we can write to capture this situation?

@SteveSandersonMS (Member, Author) commented

Closing as superseded by #10112

We could apply some of the changes here (particularly the "explicit wait for first render"). However since we're not seeing flakiness currently, I'm not massively enthusiastic about putting in more defensiveness about something that isn't broken. The more retry-type logic we put into the E2E tests, the less they are able to warn us about very rare intermittent faults.

> I would be interested in knowing what went on here. Is there an explicit new test that we can write to capture this situation?

Described in #10112. The E2E tests as they stand do provide validation that the bug no longer exists, in that they were failing on most builds before and no longer do so. However, there isn't a single explicit E2E test that reliably captures the idea of cross-user interference, and I'm not sure how we'd write one, since that's not meant to happen, and if it did happen we wouldn't be able to predict in advance what might trigger it (besides just running a lot of tests in parallel, which we already do).

@dougbu deleted the stevesa/more-components-e2e-reliability branch August 21, 2021