Disconnect client/Dispose Circuit on error #12857
OK, this is now ready for review. It includes a lot of test updates and small tweaks to complicated, fiddly logic.
Nice label 😄
I dunno who made it. As is widely known, I am not able to create labels.
Looks like Andrew probably did #10407
The race is on to create a "Blazor 💔 SignalR - now EF is my best friend" label.
SteveSandersonMS left a comment
This is so detailed! It's good to see a lot of comments in here. Without them it would be very tough to keep track of how the responsibilities for error handling, disposal, etc. are distributed.
Marking as approved because this looks super-well-thought-out now and I don't want to delay you when you're ready to merge. @BrennanConroy's comments (e.g., typos) and this would still be worth addressing.
Where this is stuck right now: I finished cleaning up all of these cases, and I've added a lot of reliability and logging to our tests. We now fail somewhat unpredictably in all server-side tests when … My guess is that we've had this bug for some time, but now that we have more robust error handling, the failures are visible. I'm investigating this now inside the renderer as a possible sharing violation or a possible double-return bug with the array pool.
This change prevents thread pool starvation when running a bunch of Selenium-based tests by turning the blocking wait for a WebDriver to start into an async wait. It also seems to help with speed and reliability, since we're not running too many browsers at once. I was experiencing timeouts and seeing them in the debugger while running tests locally; this no longer happens.
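For readers who want the shape of that idea, here is a minimal, language-neutral sketch in Python rather than the actual test infrastructure code. It assumes the concurrency cap is enforced with a semaphore, which the commit message does not state; `start_browser`, `run_all`, and `MAX_BROWSERS` are hypothetical names, and the sleep stands in for waiting on a real driver to start.

```python
import asyncio

MAX_BROWSERS = 2  # assumed cap on simultaneously starting browsers


async def start_browser(name, gate, stats):
    # Wait for a slot without blocking a thread while we wait.
    async with gate:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        try:
            # Stand-in for asynchronously waiting on a driver to start.
            await asyncio.sleep(0.01)
            return f"driver-{name}"
        finally:
            stats["running"] -= 1


async def run_all(names):
    gate = asyncio.Semaphore(MAX_BROWSERS)
    stats = {"running": 0, "peak": 0}
    # gather returns results in the same order as the inputs.
    drivers = await asyncio.gather(
        *(start_browser(n, gate, stats) for n in names)
    )
    return drivers, stats["peak"]
```

Callers that would previously have blocked a pool thread now simply await their turn, so the number of in-flight browser starts stays bounded.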
This is used from a bunch of static methods. Dictionary isn't thread safe. Encountered this while debugging some other things.
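The commit message does not say which fix was applied. In .NET the usual options are a lock around the dictionary or `ConcurrentDictionary`. As a rough illustration of the lock-based variant (a Python sketch with hypothetical names, not the code from this PR), the check and the insert happen under one lock so concurrent callers cannot interleave between them:

```python
import threading


class SharedCache:
    """Hypothetical cache shared by static helpers; guarded by one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = {}

    def get_or_add(self, key, factory):
        # Check-then-insert must be atomic, otherwise two threads can both
        # miss the key and both run the factory.
        with self._lock:
            if key not in self._items:
                self._items[key] = factory(key)
            return self._items[key]
```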
Since we're using the ArrayPool, it's really essential that we prevent use-after-free bugs. I'm currently tracking one down.
It turns out we frequently have errors in the browser console in cases where we're hitting a "timeout".
The issue here is that it's possible for the … I made a fix that hardens … TL;DR: we've had a reliability bug since pooling was introduced, where a circuit could break other circuits by continuing to do work after disposal and sharing the underlying buffer with another circuit.
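To make the failure mode concrete, here is a small Python sketch of the general pattern, not the renderer's actual code. `BufferPool` and `PooledBuilder` are hypothetical names. The guard in `write` turns a silent write into a buffer that another owner may already have rented into an immediate error, and `dispose` is idempotent so the same buffer is never returned to the pool twice.

```python
class BufferPool:
    """Toy pool; a real pool would also match buffers by size."""

    def __init__(self):
        self._free = []

    def rent(self, size):
        return self._free.pop() if self._free else bytearray(size)

    def give_back(self, buf):
        self._free.append(buf)


class PooledBuilder:
    def __init__(self, pool, size=4):
        self._pool = pool
        self._buf = pool.rent(size)
        self._disposed = False

    def write(self, index, value):
        # Without this check, a disposed builder would scribble on a buffer
        # that may already belong to someone else.
        if self._disposed:
            raise RuntimeError("builder used after dispose")
        self._buf[index] = value

    def dispose(self):
        if self._disposed:
            return  # idempotent: prevents a double return to the pool
        self._disposed = True
        self._pool.give_back(self._buf)
        self._buf = None
```

With the guard in place, late work from a disposed owner fails loudly in that owner instead of corrupting the next renter's data.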
Does that mean there was also a bug in the E2E tests? It sounds like there must be, since otherwise I wouldn't expect them to try to use buffers after disposal. Is this because some tests failed to stop background work after disposal or something, and if so, did you find out which tests it was so we can fix them?
Sounds to me like it wasn't a test issue but a product issue that manifested in test flakiness? If that's the case, can we/should we do anything to prevent these types of bugs?
It's not clear to me. It sounds like previously the product code was too tolerant of use-after-dispose, and Ryan has hardened that here. What I don't know is whether our product code also intrinsically causes use-after-dispose (in which case we need to understand exactly how), or whether that was only in the test code (in which case we should fix tests to avoid perpetuating bad patterns). Let's find out from @rynowak later today.
This is (both is and was) definitely a product issue. Any pending rendering work you have queued at the time of disposal will still render, and would hit this case. Before, it was a use-after-free; now it's an …
The interleaving looks like this (before):
The interleaving looks like this (after):
see: #13056
Fixes: #11845
Heads up on a few things if you are reviewing this bad-boi:
CircuitHost. This does, however, express some of the main ideas, along with fixing some of the things that were needed as follow-ups from previous work I did in this area.
My goal is that at the end of this we all feel really happy with the overall strategy for error handling, and agree that it's complete.
Main points:
I think what's here covers all of these points pretty comprehensively, and I think these are already details we agree about in principle.
So the major change here is that we now subscribe to unhandled exceptions in the registry rather than the Hub. This is something I tried to address in an earlier PR because it was super wrong. We can't attach event handlers from the Hub, because Hubs need to go away. The registry is guaranteed to outlive a circuit and already knows how to clean one up.
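A rough sketch of that ownership arrangement, written in Python with hypothetical class and method names rather than the real types: the long-lived registry subscribes to each circuit's unhandled-exception notification when the circuit is registered, and the handler removes and disposes the circuit. No short-lived object (the Hub, in the PR) holds a subscription.

```python
class Circuit:
    def __init__(self, circuit_id):
        self.circuit_id = circuit_id
        self.disposed = False
        self._handlers = []

    def on_unhandled_exception(self, handler):
        self._handlers.append(handler)

    def report_unhandled_exception(self, error):
        # Iterate over a copy: handlers may dispose the circuit,
        # which clears the handler list.
        for handler in list(self._handlers):
            handler(self, error)

    def dispose(self):
        self.disposed = True
        self._handlers.clear()


class CircuitRegistry:
    """Outlives every circuit, so it is a safe place to subscribe."""

    def __init__(self):
        self._circuits = {}
        self.errors = []

    def register(self, circuit):
        self._circuits[circuit.circuit_id] = circuit
        circuit.on_unhandled_exception(self._handle_error)

    def _handle_error(self, circuit, error):
        self.errors.append((circuit.circuit_id, error))
        self._circuits.pop(circuit.circuit_id, None)
        circuit.dispose()

    def __contains__(self, circuit_id):
        return circuit_id in self._circuits
```

Because the subscriber is the same object that already knows how to clean up a circuit, error handling and disposal end up in one place.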