In components E2E tests, always wait for initial render after each MountTestComponent call #10082
Nice 👏
👏 👏 👏
Well, some more still randomly failed. For the new failures, the thing that failed was not "finding the first element immediately on initial render", so it's possible that the fix in this PR has helped with that case. Instead, the only way I can make sense of the newer failures is if the CI runners are spectacularly slow sometimes and it just times out randomly in the middle of a test, even if we've done everything right to perform all assertions and element lookups with the "wait and retry" pattern. In case CI runner slowness actually is the underlying reason why this has become much less reliable in the last few days, I'm experimenting with extending the timeouts.
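The "wait and retry" pattern mentioned above can be sketched as a generic polling helper. This is a minimal illustration, not the actual test infrastructure from the repo; the `WaitUntil` name, the poll interval, and the exception-swallowing behavior are all assumptions:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class WaitAssert
{
    // Polls a condition until it passes or the timeout elapses, so a slow
    // CI runner gets extra chances before the test is declared failed.
    public static bool WaitUntil(Func<bool> condition, TimeSpan timeout, TimeSpan? pollInterval = null)
    {
        var interval = pollInterval ?? TimeSpan.FromMilliseconds(100);
        var stopwatch = Stopwatch.StartNew();
        while (true)
        {
            try
            {
                if (condition())
                {
                    return true;
                }
            }
            catch
            {
                // Swallow transient failures (e.g. element not found yet) and retry.
            }

            if (stopwatch.Elapsed >= timeout)
            {
                return false;
            }

            Thread.Sleep(interval);
        }
    }
}
```

Extending the timeouts, as described above, would then just mean passing a larger `timeout` value at each call site rather than restructuring any test logic.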
```csharp
// interference between test classes, for example test A observing the UI state of test B, which
// suggests that either Selenium has thread-safety issues that direct commands to the wrong
// browser instance, or something in our tracking of browser instances goes wrong.
[assembly: CollectionBehavior(DisableTestParallelization = true)]
```
How did you go about diagnosing this?
Posted below
I've now investigated this very extensively.
I've tried changing lots of things about how the E2E tests run that didn't work:
... and along the way, I got some error cases that were more interesting than others. The most interesting ones were very rare, but were cases where a test from class A (e.g.,
I don't have an explanation, other than to posit that either Selenium itself has thread-safety issues (we are sharing a Selenium server instance across all tests), or that something in our tracking of browser instances goes wrong.
A fix
One thing that does actually reliably make the problem go away is disabling parallelism on the E2E tests entirely. At least, it fixes it when I run locally - we're yet to see about CI. This shouldn't actually affect our level of test coverage, since we weren't covering simultaneous users on a single server anyway (we have separate server instances per test class, and tests within a class are serial anyway). The only drawbacks are:
Making this change also makes some bits of our E2E runner infrastructure somewhat redundant, e.g., all the tracking of browser instances, since we could now just have a single browser. I'm not proposing to rip out that infrastructure, since it's meant to be consistent with how we run Selenium in other places (e.g., template tests, of which there are far fewer, so it's not surprising they don't trip up as much), and hopefully we'll reintroduce the parallelism at some point. @javiercn I know you've done some work on the E2E infrastructure, browser tracking, etc. What's your take on all this? Any theories about what we might be doing wrong in our browser tracking or assumptions around parallelism?
@SteveSandersonMS Thanks for the impressive post-mortem. You are totally right in your analysis; I would have concluded the same thing. Regarding Selenium thread-safety, there are a couple of things to note (not asking you to do any of this):
I think your analysis has convinced me that we should also run the E2E SPA tests sequentially. They are already mostly sequential (as NPM restore happens sequentially), and we've seen flakiness in those too that can be associated with this.
Update: I was unsatisfied about not really understanding why this only just started happening, so I investigated further. Turns out it's actually a real defect introduced in a recent commit. And appropriately enough, it's my bug. Working on an actual fix now.
I would be interested in knowing what went on here. Is there an explicit new test that we can write to capture this situation?
Closing as superseded by #10112. We could apply some of the changes here (particularly the "explicit wait for first render"). However, since we're not seeing flakiness currently, I'm not massively enthusiastic about putting in more defensiveness against something that isn't broken. The more retry-type logic we put into the E2E tests, the less they are able to warn us about very rare intermittent faults.
Described in #10112. The E2E tests as they are do provide validation that the bug no longer exists, in that they were failing on most builds before and no longer do so. However, there isn't a single explicit E2E test that reliably captures the idea of cross-user interference, and I'm not sure how we'd write one, since that's not meant to happen, and if it did happen, we wouldn't be able to predict in advance what might trigger it (besides just running a lot of tests in parallel, which we already do).
This might wipe out a huge amount of the flakiness we've experienced in the last few days.
I said we needed something unique to signal the start of each test case, but thinking a step further, that should be unnecessary. We can explicitly wait for the last test case to be removed from the DOM, and then wait for the DOM to be updated a further time for the new test case.