New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop a stuck BlockingCollection.Take operation that caused thread leak on the client. #1351
Conversation
return null; | ||
} | ||
|
||
// Don't pass CancellationToken to Take. It causes too much spinning. | ||
Message msg = PendingInboundMessages.Take(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the comment at #1350 (comment) :
hmmm, that's very odd, but I guess it has to do with Disposing the blocking collection at the same time that the Take
is initiating, and getting into a weird state. I was trying to follow the code in the framework, but couldn't find an ArgumentNullException
being thrown from SemaphoreSlim.Wait
directly. Probably some function got inlined and it's not showing in the stack trace, and the code is very tough to follow. Any chance we can delay disposing somehow? If not, it might make to catch ArgumentNullException (although I dislike that, but it's better than having a catch-all clause for all exceptions to be treated equally, when we know this exception happens harmlessly)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I will just remove the Dispose completely. Its redundant. I should not have put it. CompleteAdding is just enough. No need to "smash kill with a hammer" if a clean stop does the work.
If we did want Dispose (which I think we do not now), then we need to wait for this thread to finish first and dispose only after we know there are no usages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, although this actually might have some unmanaged resources, since it's using semaphores. Still it might be fine.
Catching InvalidOperationException still makes sense though according to docs
Removed Dispose and added 2 extra exception catch blocks. |
Looks good. Do you want to squash before I merge? (and potentially address that minor comment?) |
5efcd6a
to
97f298d
Compare
Done: fixed logging, squashed, rebased. |
I assume this is going into master (1.2, right?). If there is a 1.1.3 could I get this fix in there as well? |
Alex, I not sure what is the release plan, but there is also an easy short term fix that you could do on your side: just put a short sleep in between the client init attempts, instead of a tight loop. It will cause much less threads to leak. Also worth mentionint that those threads are just stuck, not really consuming CPU. So it doesn't appear to be a critical bug. |
We are currently using polly with an exponential backoff on re-init. The threads do appear to be "just stuck" but when the total thread count gets excessive (400+ oooops) the context switch/sec kills the box. |
{ | ||
logger.Verbose(ErrorCode.ProxyClient_OperationCancelled, "Received operation cancelled exception -- exiting. {0}", oce); | ||
logger.Verbose(ErrorCode.ProxyClient_ReceiveError, "Received Invalid Operation exception -- exiting. {0}", exc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe these new exceptions should log with a different error code than the general exception case (ErrorCode.ProxyClient_ReceiveError
). Maybe not worth creating an entirely new code, but at least maybe reuse ErrorCode.ProxyClient_OperationCancelled
, which is logically very similar to canceling the Take
call by passing a cancellation token, just a different implementation due to perf, but for the end user reading the log it should be no different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, if you agree I can pull locally and do the change before committing. No need to submit the change yourself if you aren't near your dev computer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I agree. Go ahead.
Also, those are at Verbose, so they wont appear in the log at all, since we very rarely run at Verbsoe.
@amccool glad that you could work around it. Not high priority I assume, but yes, if there is a 1.1.3 we can definitely include this fix as well. |
Merged with b7784c4 |
Fixes #1350.