
Disconnect client/Dispose Circuit on error#12857

Merged
rynowak merged 5 commits into release/3.0 from rynowak/forced-disconnection
Aug 11, 2019
Conversation

@rynowak
Member

@rynowak rynowak commented Aug 4, 2019

Fixes: #11845

Heads up on a few things if you are reviewing this bad-boi:

  • This assumes Javier's changes about disconnect handling will go in first (I did a mad katz cherry pick)
  • You have no hope of reviewing this without looking at it commit-by-commit - it will just look like I massacred CircuitHost.

This does, however, express some of the main ideas, along with fixing some of the things that were needed as follow-ups from previous work I did in this area.

My goal is that at the end of this we all feel really happy with the overall strategy for error handling, and agree that it's complete.

Main points:

  • Hub methods handle obvious error cases as part of the call to the hub where possible
    • Respond to the call with something that indicates success or failure
    • Terminate the connection where it makes sense
  • CircuitHost/Registry methods are fire-and-forget from the Hub and do async error handling
    • Async error handler means sending an "error" to the client
    • Clients should hang up when they see an error
    • Misbehaving clients might not hang up
    • Therefore we should dispose the circuit
    • The next message a misbehaving client sends will disconnect them
  • We don't want to let clients interact with a circuit in an unknown state
    • We err on the side of tearing them down when something goes wrong
  • Errors that occur when the client is disconnected should tear down the circuit
    • Reporting back to the client that something bad happened is best-effort
  • We want to make a distinction between invalid input and user-code throwing
    • Log invalid input and other shenanigans with debug verbosity, and respond with generic errors
    • Log user-code throwing with error verbosity and respond with exception details in development mode
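The split above between synchronous hub-level error handling and async fire-and-forget circuit work could be sketched roughly as follows. This is a hedged Python analogue, not the actual ASP.NET Core code; names like `CircuitHost` and `Registry` mirror the PR's concepts, but everything else is hypothetical:

```python
# Hypothetical sketch of the error-handling split: circuit work is
# fire-and-forget from the hub's perspective, and the async error handler
# notifies the client and disposes the circuit so a misbehaving client
# is cut off on its next message.
import asyncio

class CircuitHost:
    def __init__(self):
        self.disposed = False

    async def do_work(self):
        # Stand-in for user code throwing during circuit work.
        raise RuntimeError("user code threw")

class Registry:
    """Outlives circuits; owns async error handling and teardown."""
    def __init__(self):
        self.errors_sent = []

    def fire_and_forget(self, circuit, coro):
        async def run():
            try:
                await coro
            except Exception as exc:
                # Best-effort: report the error to the client, then
                # dispose the circuit rather than trust the client to
                # hang up on its own.
                self.errors_sent.append(str(exc))
                circuit.disposed = True
        return asyncio.ensure_future(run())

async def main():
    registry = Registry()
    circuit = CircuitHost()
    await registry.fire_and_forget(circuit, circuit.do_work())
    return registry.errors_sent, circuit.disposed

errors, disposed = asyncio.run(main())
print(errors, disposed)  # ['user code threw'] True
```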

I think what's here covers all of these points pretty comprehensively, and I think these are already details we agree about in principle.


So the major change here is that we now subscribe to unhandled exceptions in the registry rather than the Hub. This is something I tried to address in an earlier PR because it was super wrong. We can't attach event handlers from the Hub, because Hubs need to go away. The registry is guaranteed to outlive a circuit and already knows how to clean one up.
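The lifetime argument in that paragraph can be reduced to a small sketch (hypothetical Python names, not the real implementation): a handler attached from a per-invocation Hub dies with the Hub, so the long-lived registry subscribes instead and is guaranteed to be around to clean the circuit up.

```python
# Sketch: the long-lived registry, not the transient hub, subscribes to
# a circuit's unhandled-exception notification and performs the cleanup.
class Circuit:
    def __init__(self):
        self.on_unhandled = None  # single subscriber, for brevity
        self.disposed = False

    def report(self, exc):
        if self.on_unhandled:
            self.on_unhandled(self, exc)

class CircuitRegistry:
    def __init__(self):
        self.circuits = set()

    def register(self, circuit):
        self.circuits.add(circuit)
        # The registry outlives the circuit, so this subscription is safe.
        circuit.on_unhandled = self.handle_unhandled

    def handle_unhandled(self, circuit, exc):
        # Cleanup lives in one place: dispose and forget the circuit.
        circuit.disposed = True
        self.circuits.discard(circuit)

registry = CircuitRegistry()
c = Circuit()
registry.register(c)
c.report(RuntimeError("boom"))
print(c.disposed, len(registry.circuits))  # True 0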

@rynowak rynowak force-pushed the rynowak/forced-disconnection branch from 8f1618a to 8cd6cf9 Compare August 4, 2019 20:06
@rynowak
Member Author

rynowak commented Aug 4, 2019

OK this is now ready for review. This included a lot of test updates and small tweaks to complicated fiddly logic.

@rynowak rynowak requested review from javiercn and pranavkm August 4, 2019 20:07
@rynowak rynowak force-pushed the rynowak/forced-disconnection branch from 8cd6cf9 to a71dee8 Compare August 5, 2019 01:06
@rynowak rynowak added area-blazor Includes: Blazor, Razor Components Blazor ♥ SignalR This issue is related to the experience of Signal R and Blazor working together tell-mode Indicates a PR which is being merged during tell-mode labels Aug 5, 2019
@rynowak rynowak requested a review from BrennanConroy August 5, 2019 01:13
@BrennanConroy
Member

Nice label 😄

@rynowak
Member Author

rynowak commented Aug 5, 2019

Nice label 😄

I dunno who made it. As is widely known, I am not able to create labels.

@BrennanConroy
Member

Looks like Andrew probably did #10407

@rynowak
Member Author

rynowak commented Aug 5, 2019

The race is on to create a "Blazor 💔 SignalR - now EF is my best friend" label.

@rynowak rynowak force-pushed the rynowak/forced-disconnection branch from ada1cfe to d71aa60 Compare August 9, 2019 03:14
@rynowak rynowak changed the base branch from master to release/3.0 August 9, 2019 03:15
Member

@SteveSandersonMS SteveSandersonMS left a comment


This is so detailed! It's good to see a lot of comments in here. Without them it would be very tough to keep track of how the responsibilities for error handling, disposal, etc. are distributed.

Marking as approved because this looks super-well-thought-out now and I don't want to delay you when you're ready to merge. @BrennanConroy's comments (e.g., typos) and this would still be worth addressing.

@rynowak rynowak force-pushed the rynowak/forced-disconnection branch from 4635849 to 80c0e69 Compare August 9, 2019 22:20
rynowak pushed a commit that referenced this pull request Aug 9, 2019
Fixes: #11845

See: #12857 for detailed notes.
@rynowak rynowak force-pushed the rynowak/forced-disconnection branch from 80c0e69 to c3e351b Compare August 10, 2019 20:29
Fixes: #11845

See: #12857 for detailed notes.
@rynowak
Member Author

rynowak commented Aug 10, 2019

Where this is stuck right now...

I finished cleaning up all of these cases, and I've added a lot of reliability and logging to our tests.

We now fail somewhat unpredictably in all server-side tests when RemoteRenderer.UpdateDisplayAsync runs into empty frames. This usually looks like a case where the first 20 or so frames of the reference frames in RenderBatch have been cleared.

My guess is that we've had this bug for some time, but now that we have more robust error handling, it's causing failures to be visible.

I'm investigating this now inside the renderer, as a possible sharing violation, or a possible double-return bug with the array pool.

This change prevents thread pool starvation when running a bunch of
selenium-based tests, by turning the blocking wait for a WebDriver to
start into an async wait.

This also seems to help with speed and reliability, since we're not
running too many browsers at once. I was experiencing timeouts, and
seeing them in the debugger while running tests locally; this no longer
happens.
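The actual change is in the C#/Selenium test infrastructure; this Python sketch just shows the general shape of the fix, under the assumption that the blocking wait was a poll loop: yielding to the scheduler while waiting instead of pinning a worker thread.

```python
# Sketch: an async poll loop for "resource is ready" that yields to the
# event loop between checks, rather than blocking a thread until the
# WebDriver (or any slow-starting resource) comes up.
import asyncio
import time

async def wait_until_ready(is_ready, timeout=5.0, interval=0.01):
    """Return True once is_ready() holds, False if the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        await asyncio.sleep(interval)  # yields; does not block a thread
    return False

async def main():
    started_at = time.monotonic()
    # Stand-in for "the driver has started": becomes true after ~50ms.
    ready = lambda: time.monotonic() - started_at > 0.05
    return await wait_until_ready(ready)

result = asyncio.run(main())
print(result)  # True
```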
This is used from a bunch of static methods. Dictionary isn't thread-safe.
Encountered this while debugging some other things.
Since we're using the ArrayPool, it's really essential that we prevent
use-after-free bugs. I'm currently tracking one down.
It turns out we frequently have errors in the browser console in cases
where we're hitting a "timeout".
@rynowak rynowak force-pushed the rynowak/forced-disconnection branch from c3e351b to 540cac0 Compare August 11, 2019 02:36
@rynowak rynowak merged commit 5d4c4d6 into release/3.0 Aug 11, 2019
rynowak pushed a commit that referenced this pull request Aug 11, 2019
Fixes: #11845

See: #12857 for detailed notes.
@rynowak
Member Author

rynowak commented Aug 11, 2019

I'm investigating this now inside the renderer, as a possible sharing violation, or a possible double-return bug with the array pool.

The issue here is that it's possible for the RemoteRenderer to operate on a buffer that was returned during disposal, by doing more rendering work after it's been disposed.

I made a fix that hardens ArrayBuilder and makes it throw when you try to reuse it after disposal (also preventing use-after-free on the underlying buffer). It looks like that turns the problem into an exception after the circuit is already disposed (harmless, but we should follow up on it).

TLDR we've had a reliability bug since pooling was introduced where a circuit could break other circuits by continuing to do work after disposal, and sharing the underlying buffer with another circuit.
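The hardening described above could look roughly like this. `ArrayBuilder` is the real type's name, but this Python analogue (and the toy `Pool`) is illustrative only, not the actual fix: after dispose, any further use throws instead of silently touching a buffer the pool may have handed to someone else.

```python
# Sketch of a dispose guard: null out the buffer on dispose so
# use-after-dispose raises instead of corrupting a pooled buffer.
class Pool:
    def __init__(self):
        self._free = []
    def rent(self):
        return self._free.pop() if self._free else []
    def return_(self, buf):
        buf.clear()
        self._free.append(buf)

class ArrayBuilder:
    def __init__(self, pool):
        self._pool = pool
        self._buffer = pool.rent()

    def append(self, item):
        if self._buffer is None:
            raise RuntimeError("ObjectDisposedException: ArrayBuilder")
        self._buffer.append(item)

    def dispose(self):
        if self._buffer is not None:
            self._pool.return_(self._buffer)
            self._buffer = None  # guard: later use throws, never aliases the pool

pool = Pool()
b = ArrayBuilder(pool)
b.append(1)
b.dispose()
try:
    b.append(2)   # use-after-dispose...
    raised = False
except RuntimeError:
    raised = True  # ...is now an exception, not silent corruption
print(raised)  # True
```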

@SteveSandersonMS
Member

TLDR we've had a reliability bug since pooling was introduced where a circuit could break other circuits by continuing to do work after disposal, and sharing the underlying buffer with another circuit.

Does that mean there was also a bug in the E2E tests? It sounds like there must be, since otherwise I wouldn't expect them to try to use buffers after disposal. Is this because some tests failed to stop background work after disposal or something, and if so, did you find out which tests it was so we can fix them?

@javiercn
Member

TLDR we've had a reliability bug since pooling was introduced where a circuit could break other circuits by continuing to do work after disposal, and sharing the underlying buffer with another circuit.

Sounds to me like it wasn't a test issue but a product issue that manifested as test flakiness? If that's the case, can we/should we do anything to prevent these types of bugs?

@SteveSandersonMS
Member

It's not clear to me. It sounds like previously the product code was too tolerant to use-after-dispose, and Ryan has hardened that here.

What I don't know is whether our product code also intrinsically causes use-after-dispose (in which case we need to understand exactly how), or whether that was only in the test code (in which case we should fix tests to avoid perpetuating bad patterns). Let's find out from @rynowak later today.

@rynowak rynowak deleted the rynowak/forced-disconnection branch August 12, 2019 14:56
@rynowak
Member Author

rynowak commented Aug 12, 2019

TLDR we've had a reliability bug since pooling was introduced where a circuit could break other circuits by continuing to do work after disposal, and sharing the underlying buffer with another circuit.

Does that mean there was also a bug in the E2E tests? It sounds like there must be, since otherwise I wouldn't expect them to try to use buffers after disposal. Is this because some tests failed to stop background work after disposal or something, and if so, did you find out which tests it was so we can fix them?

This is (both is and was) definitely a product issue. Any pending rendering work you have queued at the time of disposal will still render, and would hit this case. Before, it was a use-after-free; now it's an ObjectDisposedException. This no longer causes flakiness because the circuit is already disposed and the exception just gets ignored. However, it's still not a great design.

The interleaving looks like this (before):

  • Component is rendering in Circuit1
  • Component starts some async work (will queue more rendering in the future)
  • Circuit1 shuts down and returns its buffer to the pool
  • Circuit2 asks for a buffer and gets the one that Circuit1 is using
  • Circuit1 has another queued render after async work completes, now Circuit1 and Circuit2 are sharing a buffer

The interleaving looks like this (after):

  • Component is rendering in Circuit1
  • Component starts some async work (will queue more rendering in the future)
  • Circuit1 shuts down and returns its buffer to the pool
  • Circuit2 asks for a buffer and gets the one that Circuit1 was using (but Circuit1 is no longer using it)
  • Circuit1 has another queued render after async work completes, now Circuit1 throws an ObjectDisposedException.
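The "before" interleaving above, reduced to a runnable sketch (hypothetical names; a bare list stands in for the pooled buffer): Circuit1 returns its buffer on shutdown, Circuit2 rents the very same buffer, and Circuit1's late queued render then writes into Circuit2's data.

```python
# Minimal model of the pooling bug: two circuits end up aliasing one buffer.
free = []

def rent():
    return free.pop() if free else []

def return_buffer(buf):
    free.append(buf)

circuit1_buf = rent()
circuit1_buf.append("c1-frame")
return_buffer(circuit1_buf)         # Circuit1 shuts down, buffer goes back

circuit2_buf = rent()               # Circuit2 gets the same list object
circuit1_buf.append("late-render")  # Circuit1's queued work fires after disposal

shared = circuit2_buf is circuit1_buf
print(shared)  # True: Circuit1's late write landed in Circuit2's buffer
```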

see: #13056
