
Disconnect client/Dispose Circuit on error#12857

Merged
rynowak merged 5 commits into release/3.0 from rynowak/forced-disconnection
Aug 11, 2019
Conversation

@rynowak
Member

@rynowak rynowak commented Aug 4, 2019

Fixes: #11845

Heads up on a few things if you are reviewing this bad-boi:

  • This assumes Javier's changes about disconnect handling will go in first (I did a mad katz cherry pick)
  • You have no hope of reviewing this without looking at it commit-by-commit - it will just look like I massacred CircuitHost.

This does, however, express some of the main ideas, along with fixing some of the things that were needed as follow-ups from previous work I did in this area.

My goal is that at the end of this we all feel really happy with the overall strategy for error handling, and agree that it's complete.

Main points:

  • Hub methods handle obvious error cases as part of the call to the hub where possible
    • Respond to the call with something that indicates success or failure
    • Terminate the connection where it makes sense
  • CircuitHost/Registry methods are fire-and-forget from the Hub and do async error handling
    • Async error handler means sending an "error" to the client
    • Clients should hang up when they see an error
    • Misbehaving clients might not hang up
    • Therefore we should dispose the circuit
    • The next message a misbehaving client sends will disconnect them
  • We don't want to let clients interact with a circuit in an unknown state
    • We err on the side of tearing them down when something goes wrong
  • Errors that occur when the client is disconnected should tear down the circuit
    • Reporting back to the client that something bad happened is best-effort
  • We want to make a distinction between invalid input and user-code throwing
    • Log invalid input and other shenanigans with debug verbosity, and respond with generic errors
    • Log user-code throwing with error verbosity and respond with exception details in development mode
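The split above between synchronous hub-level error handling and async fire-and-forget circuit work could be sketched roughly as follows. This is a hedged Python analogue, not the actual ASP.NET Core code; names like `CircuitHost` and `Registry` mirror the PR's concepts, but everything else is hypothetical:

```python
# Hypothetical sketch of the error-handling split: circuit work is
# fire-and-forget from the hub's perspective, and the async error handler
# notifies the client and disposes the circuit so a misbehaving client
# is cut off on its next message.
import asyncio

class CircuitHost:
    def __init__(self):
        self.disposed = False

    async def do_work(self):
        # Stand-in for user code throwing during circuit work.
        raise RuntimeError("user code threw")

class Registry:
    """Outlives circuits; owns async error handling and teardown."""
    def __init__(self):
        self.errors_sent = []

    def fire_and_forget(self, circuit, coro):
        async def run():
            try:
                await coro
            except Exception as exc:
                # Best-effort: report the error to the client, then
                # dispose the circuit rather than trust the client to
                # hang up on its own.
                self.errors_sent.append(str(exc))
                circuit.disposed = True
        return asyncio.ensure_future(run())

async def main():
    registry = Registry()
    circuit = CircuitHost()
    await registry.fire_and_forget(circuit, circuit.do_work())
    return registry.errors_sent, circuit.disposed

errors, disposed = asyncio.run(main())
print(errors, disposed)  # ['user code threw'] True
```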

I think what's here covers all of these points pretty comprehensively, and I think these are already details we agree about in principle.


So the major change here is that we now subscribe to unhandled exceptions in the registry rather than the Hub. This is something I tried to address in an earlier PR because it was super wrong. We can't attach event handlers from the Hub, because Hubs need to go away. The registry is guaranteed to outlive a circuit and already knows how to clean one up.
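The lifetime argument in that paragraph can be reduced to a small sketch (hypothetical Python names, not the real implementation): a handler attached from a per-invocation Hub dies with the Hub, so the long-lived registry subscribes instead and is guaranteed to be around to clean the circuit up.

```python
# Sketch: the long-lived registry, not the transient hub, subscribes to
# a circuit's unhandled-exception notification and performs the cleanup.
class Circuit:
    def __init__(self):
        self.on_unhandled = None  # single subscriber, for brevity
        self.disposed = False

    def report(self, exc):
        if self.on_unhandled:
            self.on_unhandled(self, exc)

class CircuitRegistry:
    def __init__(self):
        self.circuits = set()

    def register(self, circuit):
        self.circuits.add(circuit)
        # The registry outlives the circuit, so this subscription is safe.
        circuit.on_unhandled = self.handle_unhandled

    def handle_unhandled(self, circuit, exc):
        # Cleanup lives in one place: dispose and forget the circuit.
        circuit.disposed = True
        self.circuits.discard(circuit)

registry = CircuitRegistry()
c = Circuit()
registry.register(c)
c.report(RuntimeError("boom"))
print(c.disposed, len(registry.circuits))  # True 0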

@rynowak rynowak force-pushed the rynowak/forced-disconnection branch from 8f1618a to 8cd6cf9 Compare August 4, 2019 20:06
@rynowak
Member Author

rynowak commented Aug 4, 2019

OK this is now ready for review. This included a lot of test updates and small tweaks to complicated fiddly logic.

@rynowak rynowak requested review from javiercn and pranavkm August 4, 2019 20:07
@rynowak rynowak force-pushed the rynowak/forced-disconnection branch from 8cd6cf9 to a71dee8 Compare August 5, 2019 01:06
@rynowak rynowak added area-blazor Includes: Blazor, Razor Components Blazor ♥ SignalR This issue is related to the experience of Signal R and Blazor working together tell-mode Indicates a PR which is being merged during tell-mode labels Aug 5, 2019
@rynowak rynowak requested a review from BrennanConroy August 5, 2019 01:13
@BrennanConroy
Member

Nice label 😄

@rynowak
Member Author

rynowak commented Aug 5, 2019

Nice label 😄

I dunno who made it. As is widely known, I am not able to create labels.

@BrennanConroy
Member

Looks like Andrew probably did #10407

@rynowak
Member Author

rynowak commented Aug 5, 2019

The race is on to create a "Blazor 💔 SignalR - now EF is my best friend" label.

@rynowak rynowak force-pushed the rynowak/forced-disconnection branch from ada1cfe to d71aa60 Compare August 9, 2019 03:14
@rynowak rynowak changed the base branch from master to release/3.0 August 9, 2019 03:15
Member

@SteveSandersonMS SteveSandersonMS left a comment


This is so detailed! It's good to see a lot of comments in here. Without them it would be very tough to keep track of how the responsibilities for error handling, disposal, etc. are distributed.

Marking as approved because this looks super-well-thought-out now and I don't want to delay you when you're ready to merge. @BrennanConroy's comments (e.g., typos) and this would still be worth addressing.

@rynowak rynowak force-pushed the rynowak/forced-disconnection branch from 4635849 to 80c0e69 Compare August 9, 2019 22:20
rynowak pushed a commit that referenced this pull request Aug 9, 2019
Fixes: #11845

See: #12857 for detailed notes.
@rynowak rynowak force-pushed the rynowak/forced-disconnection branch from 80c0e69 to c3e351b Compare August 10, 2019 20:29
Fixes: #11845

See: #12857 for detailed notes.
@rynowak
Member Author

rynowak commented Aug 10, 2019

Where this is stuck right now...

I finished cleaning up all of these cases, and I've added a lot of reliability and logging to our tests.

We now fail somewhat unpredictably in all server-side tests when RemoteRenderer.UpdateDisplayAsync runs into empty frames. This usually looks like a case where the first 20 or so frames of the reference frames in RenderBatch have been cleared.

My guess is that we've had this bug for some time, but now that we have more robust error handling, it's causing failures to be visible.

I'm investigating this now inside the renderer, as a possible sharing violation, or a possible double-return bug with the array pool.

This change prevents thread pool starvation when running a bunch of
selenium-based tests, by turning the blocking wait for a WebDriver to
start into an async wait.

This also seems to help with speed and reliability, since we're not
running too many browsers at once. I was experiencing timeouts, and
seeing them in the debugger while running tests locally; this no longer
happens.
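The actual change is in the C#/Selenium test infrastructure; this Python sketch just shows the general shape of the fix, under the assumption that the blocking wait was a poll loop: yielding to the scheduler while waiting instead of pinning a worker thread.

```python
# Sketch: an async poll loop for "resource is ready" that yields to the
# event loop between checks, rather than blocking a thread until the
# WebDriver (or any slow-starting resource) comes up.
import asyncio
import time

async def wait_until_ready(is_ready, timeout=5.0, interval=0.01):
    """Return True once is_ready() holds, False if the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        await asyncio.sleep(interval)  # yields; does not block a thread
    return False

async def main():
    started_at = time.monotonic()
    # Stand-in for "the driver has started": becomes true after ~50ms.
    ready = lambda: time.monotonic() - started_at > 0.05
    return await wait_until_ready(ready)

result = asyncio.run(main())
print(result)  # True
```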
This is used from a bunch of static methods. Dictionary isn't thread-safe.
Encountered this while debugging some other things.
Since we're using the ArrayPool, it's really essential that we prevent
use-after-free bugs. I'm currently tracking one down.
It turns out we frequently have errors in the browser console in cases
where we're hitting a "timeout".
@rynowak rynowak force-pushed the rynowak/forced-disconnection branch from c3e351b to 540cac0 Compare August 11, 2019 02:36
@rynowak rynowak merged commit 5d4c4d6 into release/3.0 Aug 11, 2019
rynowak pushed a commit that referenced this pull request Aug 11, 2019
Fixes: #11845

See: #12857 for detailed notes.
@rynowak
Member Author

rynowak commented Aug 11, 2019

I'm investigating this now inside the renderer, as a possible sharing violation, or a possible double-return bug with the array pool.

The issue here is that it's possible for the RemoteRenderer to operate on a buffer that was returned during disposal, by doing more rendering work after it's been disposed.

I made a fix that hardens ArrayBuilder and makes it throw when you try to reuse it after disposal (also preventing use-after-free on the underlying buffer). It looks like that turns the problem into an exception after the circuit is already disposed (harmless, but we should follow up on it).

TLDR we've had a reliability bug since pooling was introduced where a circuit could break other circuits by continuing to do work after disposal, and sharing the underlying buffer with another circuit.
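The hardening described above could look roughly like this. `ArrayBuilder` is the real type's name, but this Python analogue (and the toy `Pool`) is illustrative only, not the actual fix: after dispose, any further use throws instead of silently touching a buffer the pool may have handed to someone else.

```python
# Sketch of a dispose guard: null out the buffer on dispose so
# use-after-dispose raises instead of corrupting a pooled buffer.
class Pool:
    def __init__(self):
        self._free = []
    def rent(self):
        return self._free.pop() if self._free else []
    def return_(self, buf):
        buf.clear()
        self._free.append(buf)

class ArrayBuilder:
    def __init__(self, pool):
        self._pool = pool
        self._buffer = pool.rent()

    def append(self, item):
        if self._buffer is None:
            raise RuntimeError("ObjectDisposedException: ArrayBuilder")
        self._buffer.append(item)

    def dispose(self):
        if self._buffer is not None:
            self._pool.return_(self._buffer)
            self._buffer = None  # guard: later use throws, never aliases the pool

pool = Pool()
b = ArrayBuilder(pool)
b.append(1)
b.dispose()
try:
    b.append(2)   # use-after-dispose...
    raised = False
except RuntimeError:
    raised = True  # ...is now an exception, not silent corruption
print(raised)  # True
```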

@SteveSandersonMS
Member

TLDR we've had a reliability bug since pooling was introduced where a circuit could break other circuits by continuing to do work after disposal, and sharing the underlying buffer with another circuit.

Does that mean there was also a bug in the E2E tests? It sounds like there must be, since otherwise I wouldn't expect them to try to use buffers after disposal. Is this because some tests failed to stop background work after disposal or something, and if so, did you find out which tests it was so we can fix them?

@javiercn
Member

TLDR we've had a reliability bug since pooling was introduced where a circuit could break other circuits by continuing to do work after disposal, and sharing the underlying buffer with another circuit.

Sounds to me like it wasn't a test issue but a product issue that manifested as test flakiness? If that's the case, can we/should we do anything to prevent these types of bugs?

@SteveSandersonMS
Member

It's not clear to me. It sounds like previously the product code was too tolerant to use-after-dispose, and Ryan has hardened that here.

What I don't know is whether our product code also intrinsically causes use-after-dispose (in which case we need to understand exactly how), or whether that was only in the test code (in which case we should fix tests to avoid perpetuating bad patterns). Let's find out from @rynowak later today.

@rynowak rynowak deleted the rynowak/forced-disconnection branch August 12, 2019 14:56
@rynowak
Member Author

rynowak commented Aug 12, 2019

TLDR we've had a reliability bug since pooling was introduced where a circuit could break other circuits by continuing to do work after disposal, and sharing the underlying buffer with another circuit.

Does that mean there was also a bug in the E2E tests? It sounds like there must be, since otherwise I wouldn't expect them to try to use buffers after disposal. Is this because some tests failed to stop background work after disposal or something, and if so, did you find out which tests it was so we can fix them?

This is (both is and was) definitely a product issue. Any pending rendering work you have queued at the time of disposal will still render, and would hit this case. Before, it was a use-after-free; now it's an ObjectDisposedException. This no longer causes flakiness because the circuit is already disposed and the exception just gets ignored. However, it's still not a great design.

The interleaving looks like this (before):

  • Component is rendering in Circuit1
  • Component starts some async work (will queue more rendering in the future)
  • Circuit1 shuts down and returns its buffer to the pool
  • Circuit2 asks for a buffer and gets the one that Circuit1 is using
  • Circuit1 has another queued render after async work completes, now Circuit1 and Circuit2 are sharing a buffer

The interleaving looks like this (after):

  • Component is rendering in Circuit1
  • Component starts some async work (will queue more rendering in the future)
  • Circuit1 shuts down and returns its buffer to the pool
  • Circuit2 asks for a buffer and gets the one that Circuit1 was using (but Circuit1 is no longer using it)
  • Circuit1 has another queued render after async work completes, now Circuit1 throws an ObjectDisposedException.
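The "before" interleaving above, reduced to a runnable sketch (hypothetical names; a bare list stands in for the pooled buffer): Circuit1 returns its buffer on shutdown, Circuit2 rents the very same buffer, and Circuit1's late queued render then writes into Circuit2's data.

```python
# Minimal model of the pooling bug: two circuits end up aliasing one buffer.
free = []

def rent():
    return free.pop() if free else []

def return_buffer(buf):
    free.append(buf)

circuit1_buf = rent()
circuit1_buf.append("c1-frame")
return_buffer(circuit1_buf)         # Circuit1 shuts down, buffer goes back

circuit2_buf = rent()               # Circuit2 gets the same list object
circuit1_buf.append("late-render")  # Circuit1's queued work fires after disposal

shared = circuit2_buf is circuit1_buf
print(shared)  # True: Circuit1's late write landed in Circuit2's buffer
```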

see: #13056
