Fix | Fixes incorrect data errors due to timeout running very late #907
Looks reasonable. I've arrived at the timer state solution as well and mentioned it in the draft I posted earlier. We need a way to identify the item that the callback is targeted at. I'm not sure if the identity of the owner is unique enough for this; if you've traced the lifetime and it is, then OK. I was considering just an auto-incremented counter. My problem is that I really don't want to create a timer for each invocation. Does this fix even the fast-running cases from the original repro?
I have only run some of Cheena's repros with 6000+ tasks that repro'd well for me and the one (from today) that interleaved multiple queries on the same connection with varying waits + cancels. I have not run the original repro. It would be great if someone else could run that one. I haven't had time to run them all. I'm even okay if someone wants to take my PR and run with it on their own.
Yes, the root cause related to Timeout delegate delays seems to be fixed here completely. To explain more, the issues below are addressed with this PR:
I'm working on adding tests so that we can cover these scenarios in pipelines using the repro we have available. @David-Engel this also needs to be done in the NetFx source, so I would request the same. :)
Another note: In the high load scenario, we haven't been able to reliably dispose of the timer or keep it from firing when it shouldn't. That's the reason for this particular solution.
I almost went with an auto-incremented counter, too, but noticed the owner link and thought that would be more reliable and easier than yet another counter. I couldn't think of a reliable way to ensure OnTimeout wouldn't run incorrectly without re-creating the timer with a new state reference. You and Cheena probably have a better overall picture of all the async lifetime and interleaving than I do, so I'd rely more on your confidence in this solution than my own. 😄
I thought this through overnight while I was supposed to be sleeping and believe that we have to recreate the timer. It's annoying, but the only way to reach the state object is through the constructor because of how its internals work; even with private reflection we couldn't do it reliably. So I'm going to write this one off against the performance capital gained from other changes and just accept it. We have to recreate the timer to set the state object. We need the state object to be a ref type to avoid silly boxing behaviour, so we'll probably end up with something like
I believe that the changes in my draft PR #906 are still useful because they increase the reliability of the timeout. I think I should add the logic described above to that draft. What do you think?
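To illustrate the idea in the comment above (a fresh timer carrying a fresh reference-type state object for every invocation), here is a minimal Python sketch. It is an analogy only, not the code from this PR or from draft #906: `TimeoutState`, `Parser`, `begin`, `end`, and `_on_timeout` are hypothetical names, and Python's `threading.Timer` stands in for the .NET timer. It also omits the locking that real code would need.

```python
import threading


class TimeoutState:
    """Hypothetical reference-type token; one instance per armed timer."""

    def __init__(self, label):
        self.label = label


class Parser:
    """Hypothetical reusable object that arms one timer per invocation."""

    def __init__(self):
        self._current = None   # token for the invocation in progress
        self._timer = None
        self.timed_out = []    # labels whose timeout was honoured

    def begin(self, label, timeout_seconds):
        # A fresh token and a fresh timer for every invocation, so a stale
        # callback can never carry the current token.
        token = TimeoutState(label)
        self._current = token
        self._timer = threading.Timer(
            timeout_seconds, self._on_timeout, args=(token,)
        )
        self._timer.start()
        return token

    def end(self):
        # cancel() is best effort: it cannot stop a callback that is
        # already running, which is why the token check below exists.
        self._timer.cancel()
        self._current = None

    def _on_timeout(self, token):
        if token is not self._current:
            return             # stale timer from an earlier invocation
        self.timed_out.append(token.label)


parser = Parser()
old_token = parser.begin("query-1", timeout_seconds=60)
parser.end()
new_token = parser.begin("query-2", timeout_seconds=60)

parser._on_timeout(old_token)  # simulate the first timer firing very late
parser._on_timeout(new_token)  # the current timer firing is honoured
parser.end()
```

The late callback is invoked by hand here so the outcome does not depend on thread scheduling. Comparing tokens by identity is what makes a reference type convenient: no counter has to be maintained, and no two invocations can share a token.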
Just to complete this PR, tests are here: David-Engel#1 |
Add tests
Attempted fix for #659.
Many thanks to Cheena for the legwork in locating the troublesome area.
The underlying problem is that the timer that fires OnTimeout can fire very late when there are no free threads in the pool, even after the state object has been recycled and put back into use on another query. My solution is to track the owning object that the timeout is supposed to be associated with. If the owner has changed by the time the timer runs, OnTimeout should simply return rather than send attention for the now incorrect query.
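The owner check described above can be sketched roughly as follows. This is a simplified Python illustration, not the actual change in this PR: `StateObject`, `start`, `release`, and `attention_sent` are hypothetical names, and the late timer is simulated by calling the callback by hand after the state has been reused.

```python
class StateObject:
    """Hypothetical stand-in for a pooled, reusable state object."""

    def __init__(self):
        self.owner = None          # the query currently using this state
        self.attention_sent = []   # owners an attention was sent for

    def start(self, owner):
        """Begin a query and return the timeout callback bound to its owner."""
        self.owner = owner
        expected_owner = owner     # captured when the timer is armed

        def on_timeout():
            # A late callback may run after the state was recycled. If the
            # owner changed, the timeout belongs to an earlier query.
            if self.owner is not expected_owner:
                return False
            self.attention_sent.append(expected_owner)
            return True

        return on_timeout

    def release(self):
        """Return the state object to the pool."""
        self.owner = None


state = StateObject()
first_query, second_query = object(), object()

late_callback = state.start(first_query)   # timer armed for the first query
state.release()                            # first query finishes in time
state.start(second_query)                  # state is reused by another query

fired = late_callback()                    # the old timer finally runs
```

In this sketch the late callback returns without recording an attention, because the state object now belongs to `second_query`. A callback armed for the current owner still takes effect.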
Cheena also observed timeouts not working on subsequent re-uses of the state object. I think this also fixes those. We should add tests specific to these scenarios to this PR before merging.
This seems to fix the incorrect data repros for 659 that I've run. Hopefully others can validate it, too, so that we can get a solution finalized.
CC: @cheenamalhotra @Wraith2