
Conversation

@SteveSandersonMS (Member) commented May 8, 2019

This might wipe out a huge amount of the flakiness we've experienced in the last few days.

I said we needed something unique to signal the start of each test case, but thinking a step further, that should be unnecessary. We can explicitly wait for the last test case to be removed from the DOM, and then wait for the DOM to be updated a further time for the new test case.
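
For reference, here's a minimal sketch of that waiting strategy using Selenium's WebDriverWait (from the Selenium.Support package). The "test-case-root" ID and the 10-second timeout are illustrative assumptions, not the PR's actual code:

```csharp
// Sketch only: "test-case-root" and the 10-second timeout are assumptions.
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

public static class TestCaseSync
{
    public static void WaitForNextTestCase(IWebDriver driver)
    {
        var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));

        // First wait for the previous test case's root element to be removed from the DOM...
        wait.Until(d => d.FindElements(By.Id("test-case-root")).Count == 0);

        // ...then wait for the DOM to be updated again with the new test case's root element.
        wait.Until(d => d.FindElements(By.Id("test-case-root")).Count == 1);
    }
}
```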

@SteveSandersonMS SteveSandersonMS added area-mvc Includes: MVC, Actions and Controllers, Localization, CORS, most templates area-blazor Includes: Blazor, Razor Components labels May 8, 2019
@ajaybhargavb (Contributor) left a comment

Nice 👏

@rynowak (Member) left a comment

👏 👏 👏

@SteveSandersonMS (Member, Author) commented

Well, some more tests still failed randomly.

For the new failures, the failing step was not "finding the first element immediately on initial render", so it's possible that the fix in this PR has helped with that case.

Instead, the only way I can make sense of the newer failures is that the CI runners are sometimes spectacularly slow, so a test just times out randomly partway through, even though we perform all assertions and element lookups with the "wait and retry" pattern.
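
To make the pattern concrete, here's a hedged sketch of a "wait and retry" assertion helper with a deliberately generous timeout; the WaitAssert name and the 30-second value are assumptions for illustration, and the repo's real helpers differ:

```csharp
// Illustrative only: retries a lookup until it matches or a generous
// timeout elapses, then asserts once more for a precise failure message.
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;
using Xunit;

public static class WaitAssert
{
    // Deliberately generous to absorb slow CI runners.
    private static readonly TimeSpan Timeout = TimeSpan.FromSeconds(30);

    public static void Equal(IWebDriver driver, string expected, Func<string> actual)
    {
        try
        {
            new WebDriverWait(driver, Timeout).Until(_ => actual() == expected);
        }
        catch (WebDriverTimeoutException)
        {
            // Fall through so the final assertion reports expected vs. actual.
        }
        Assert.Equal(expected, actual());
    }
}
```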

In case CI runner slowness really is the underlying reason why this has become much less reliable in the last few days, I'm extending the timeouts.

```csharp
// interference between test classes, for example test A observing the UI state of test B, which
// suggests that either Selenium has thread-safety issues that direct commands to the wrong
// browser instance, or something in our tracking of browser instances goes wrong.
[assembly: CollectionBehavior(DisableTestParallelization = true)]
```
(Member) commented on the diff hunk above:

How did you get to diagnose this?

@SteveSandersonMS (Member, Author) replied:

Posted below

@SteveSandersonMS (Member, Author) commented May 9, 2019

I've now investigated this very extensively.

  • Did we change something in our sources that made it turn flaky?
    • Very hard to tell. The flakiness started about a day ago, but none of the recent changes under src/Components affect how the E2E tests run, and nothing under src/Components that changed around the time the flakiness started appears related.
    • Likewise, no other changes in the rest of the aspnetcore source tree around this time appear related.
    • However, considering how intermittent this is, and how sensitive it seems to be to the amount of load and parallelism on the server, it's conceivable that some really low-level thing (e.g., in Kestrel) has changed. That wouldn't necessarily mean a defect in that other component; maybe we were just relying on a fluke for correct execution before.
  • Did Chrome do a release that affects this?
    • I don't think so. The latest Chrome 74 was released 16 days ago, ages before this started.
  • Did something in the Azure Pipelines (or our config for it) change?
    • It's theoretically possible. It's worth noting that the flakiness started in both the public and internal builds on the same day.

I've tried changing lots of things about how the E2E tests run, none of which fixed the problem:

  • Updating to newer versions of selenium-standalone, the Selenium Chrome driver, and the .NET Selenium packages - problem still occurs after each update
  • Creating a separate browser instance per test - problem still occurs
  • Being incredibly generous with timeouts - problem still occurs

... and along the way I got some error cases that were more interesting than others. The most interesting ones were very rare: a test from class A (e.g., ServerComponentRenderingTest) made an assertion that failed because it observed a UI state that was clearly put there by a completely different test from a different class B (e.g., ServerBindTest). How is it possible that, while executing one test, we see output from a different test? It shouldn't happen, because:

  • We have different browser instances per test class (sketched below), so it can't be just that we didn't tear down old state adequately
  • We even have different server instances per test class, so this can't be due to an actual server-side bug where different users interfere with each other. [Edit: Highly misleading/incorrect. We do have separate web host instances per test class, but they all run in the same .NET process and hence share state (e.g., in static fields), so they are not independent. This is a very good thing in this case, as we wouldn't have discovered the issue until it shipped otherwise.]
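
As a rough illustration of that per-class isolation (an assumption-laden sketch, not the repo's actual infrastructure): each xUnit test class can get its own browser and web host via IClassFixture<T>, yet, as noted in the edit above, every host still runs in the same .NET process:

```csharp
// Sketch under assumptions: each test class gets a fresh browser and a
// fresh Kestrel host, but all hosts share one process (and static state).
using System;
using Microsoft.AspNetCore.Hosting;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using Xunit;

public class BrowserServerFixture : IDisposable
{
    public IWebDriver Browser { get; } = new ChromeDriver();
    public IWebHost Host { get; }

    public BrowserServerFixture()
    {
        Host = new WebHostBuilder()
            .UseKestrel(options => options.ListenLocalhost(0)) // port 0: OS picks a free port
            .Configure(app => { /* app under test */ })
            .Build();
        Host.Start();
    }

    public void Dispose()
    {
        Browser.Quit();
        Host.Dispose();
    }
}

// xUnit creates one fixture instance per test class that uses it.
public class ExampleTest : IClassFixture<BrowserServerFixture>
{
    private readonly BrowserServerFixture _fixture;
    public ExampleTest(BrowserServerFixture fixture) => _fixture = fixture;
}
```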

I don't have an explanation, other than to posit that either Selenium itself has thread-safety issues (we are sharing a Selenium server instance across all tests), or that somehow our tracking of Browser instances goes wrong and we are invoking methods on the wrong instance (but looking at the implementation, I don't see how).
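
For context, the sharing meant here looks roughly like this (a sketch; the actual wiring differs): every browser is a RemoteWebDriver pointed at the same locally running Selenium server, so all commands from all test classes funnel through one hub:

```csharp
// Sketch: many RemoteWebDriver instances, one shared Selenium server.
using System;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Remote;

public static class BrowserFactory
{
    // All test classes connect to this single shared hub (4444 is Selenium's default port).
    private static readonly Uri SharedServer = new Uri("http://localhost:4444/wd/hub");

    public static RemoteWebDriver CreateBrowser()
        => new RemoteWebDriver(SharedServer, new ChromeOptions().ToCapabilities());
}
```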

A fix

One thing that does reliably make the problem go away is disabling parallelism on the E2E tests entirely. At least, it fixes it when I run locally; we've yet to see how it fares in CI. This shouldn't affect our level of test coverage, since we weren't covering simultaneous users on a single server anyway (we have separate server instances per test class, and tests within a class run serially anyway). The only drawbacks are:

  • The tests take a bit longer to run, though it doesn't really move the needle on our overall build times
  • It confirms that we don't really know what the underlying problem is, except to theorize that Selenium server has unknown thread-safety issues

Making this change also makes some bits of our E2E runner infrastructure somewhat redundant, e.g., all the tracking of browser instances, since we could now just have a single browser. I'm not proposing to rip out that infrastructure, since it's meant to be consistent with how we run Selenium in other places (e.g., template tests, of which there are far fewer, so it's not surprising they don't trip up as much), and hopefully we'll reintroduce the parallelism at some point.

@javiercn I know you've done some work on the E2E infrastructure, browser tracking, etc. What's your take on all this? Any theories about what we might be doing wrong in our browser tracking or assumptions around parallelism?

@javiercn (Member) commented May 9, 2019

@SteveSandersonMS Thanks for the impressive post-mortem. You are totally right in your analysis. I would have concluded the same thing.

Regarding Selenium thread-safety, there are a couple of things to note (not asking you to do any of this):

  • We can probably capture more detailed logs from the Selenium server. (I think it exposes them.)
  • We could try switching from a single global server to one server per test collection (those essentially run sequentially); see the sketch after this list.
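
A sketch of what that second option could look like, assuming selenium-standalone is installed and on PATH; the port-picking helper is an illustration, and real code would poll the server URL until it's ready before returning:

```csharp
// Hypothetical sketch: one selenium-standalone server per xUnit test
// collection via a collection fixture. Startup/readiness handling is
// simplified for brevity.
using System;
using System.Diagnostics;
using System.Net;
using System.Net.Sockets;
using Xunit;

public class SeleniumServerFixture : IDisposable
{
    private readonly Process _server;
    public Uri ServerUri { get; }

    public SeleniumServerFixture()
    {
        var port = GetFreePort();
        ServerUri = new Uri($"http://localhost:{port}/wd/hub");
        _server = Process.Start(new ProcessStartInfo
        {
            FileName = "selenium-standalone", // assumed installed globally via npm
            Arguments = $"start -- -port {port}",
            UseShellExecute = false,
        });
    }

    public void Dispose()
    {
        _server.Kill();
        _server.Dispose();
    }

    private static int GetFreePort()
    {
        // Bind to port 0 so the OS assigns an unused port, then release it.
        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start();
        var port = ((IPEndPoint)listener.LocalEndpoint).Port;
        listener.Stop();
        return port;
    }
}

[CollectionDefinition("Selenium server")]
public class SeleniumServerCollection : ICollectionFixture<SeleniumServerFixture> { }
```

Since xUnit runs the tests within a collection sequentially, each fixture's server would only ever serve one test at a time.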

I think your analysis has convinced me of the idea that we should also run the E2E SPA tests sequentially. They are already mostly sequential (as NPM restore happens sequentially), and we've seen flakiness in those too that could be associated with this.

@SteveSandersonMS (Member, Author) commented

Update: I was unsatisfied about not really understanding why this only just started happening so investigated further. Turns out it's actually a real defect introduced in a recent commit. And appropriately enough, it's my bug. Working on an actual fix now.

@javiercn (Member) commented May 9, 2019

> Update: I was unsatisfied about not really understanding why this only just started happening so investigated further. Turns out it's actually a real defect introduced in a recent commit. And appropriately enough, it's my bug. Working on an actual fix now.

I would be interested in knowing what went on here. Is there an explicit new test that we can write to capture this situation?

@SteveSandersonMS (Member, Author) commented

Closing as superseded by #10112

We could apply some of the changes here (particularly the "explicit wait for first render"). However since we're not seeing flakiness currently, I'm not massively enthusiastic about putting in more defensiveness about something that isn't broken. The more retry-type logic we put into the E2E tests, the less they are able to warn us about very rare intermittent faults.

> I would be interested in knowing what went on here. Is there an explicit new test that we can write to capture this situation?

Described in #10112. The E2E tests as they stand do provide validation that the bug no longer exists, in that they were failing on most builds before and no longer do so. However, there isn't a single explicit E2E test that reliably captures the idea of cross-user interference, and I'm not sure how we'd write one, since that's not meant to happen, and if it did happen we wouldn't be able to predict in advance what might trigger it (besides just running a lot of tests in parallel, which we already do).

@dougbu deleted the stevesa/more-components-e2e-reliability branch August 21, 2021