Skip to content

Conversation

SteveSandersonMS
Copy link
Member

@SteveSandersonMS SteveSandersonMS commented May 9, 2019

I eventually found that the newly-introduced components E2E test flakiness was actually due to a real product bug that I created in the "key support" feature.

As a workaround to what seems like a Chrome bug (or at least a bizarre inconsistency between Chrome and Firefox), I had put in some new logic to ensure events are only ever dispatched in series, and are not nested inside each other. This was only intended to affect client-side execution, since the same problem can't occur on the server because of its sync context. However, I made a mistake and put the change into RendererRegistryEventDispatcher which is actually shared between server and client (I had thought it was client only). Therefore the queuing system I implemented was inadvertently being shared across all users on the server, creating random timing effects where it's possible for one user's event to be dispatched to a different user. Hence the interference between simultaneous tests.

The fix here is to move the new dispatch logic into a place where it really only runs on the client, where there really only is one user.

Secondly, while trying to verify this through E2E tests, I noticed that we were missing server-side execution coverage for some of the E2E test classes. Once I added that, I found that some minor tweaks were needed to create consistency between the two environments and have all tests pass everywhere.

What we can learn from this:

  • Big PRs are bad because nobody can really review them. Maybe if the "key support" PR was a series of much smaller ones, someone might have noticed that I was introducing a serious defect where state would leak between users.
  • It's not great that we have to remember to manually add "server" versions of any new E2E test classes we add. We forgot about several until now. I don't have a proposal to fix this right now because we have more urgent things to do, so I'm just going to try to point this out when people next add new E2E test classes.
  • It's good that our E2E runner blasts the machine as fast as it can with as many tests as possible in parallel. This can uncover reliability issues that you'd never see if you just ran tests one at a time locally.
  • ... anything else?

{
var info = deferredIncomingEvents.Dequeue();
var task = DispatchEventAsync(info.EventHandlerId, info.EventArgs);
task.ContinueWith(_ =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a really good reason to use ContinueWith here instead of await. I know there's no scheduler in this use case, but it's seemed like we've had bugs in most of our usage of ContinueWith so I like to avoid it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just avoiding async void since there's no drawback to the ContinueWith here. However it doesn't really matter, so based on your preference I've changed this now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also async Task and then invoke it as fire-and-forget.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be more misleading. It suggests the upstream code is responsible for error handling.

Copy link
Member

@rynowak rynowak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should try to get this in and stabilize things ASAP

@Eilon Eilon added the area-mvc Includes: MVC, Actions and Controllers, Localization, CORS, most templates label May 9, 2019
@SteveSandersonMS SteveSandersonMS added the area-blazor Includes: Blazor, Razor Components label May 9, 2019
@javiercn
Copy link
Member

Holly molly.

I'm glad that we found that. I actually noticed the RenderRegistry change, I just didn't realize the implications. It might have been an oversight on my part too, as the change was in a commit after I signed off on the PR I might have dismissed it as minor and didn't think that much of it.

WRT big PRs I agree. The smaller the better.

@SteveSandersonMS
Copy link
Member Author

It might have been an oversight on my part too

Not your fault. I wrote that code!

I am also a bit shaken that we got so close to shipping such a bad bug (security implications, but only once your site gets to a high level of traffic -- absolutely the worst). I'm taking it as a lesson to hunt down anything weird relentlessly, and not wave it away with vague ideas that external things (Selenium, in this case) might just be unreliable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-blazor Includes: Blazor, Razor Components area-mvc Includes: MVC, Actions and Controllers, Localization, CORS, most templates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants