Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop a stuck BlockingCollection.Take operation that caused thread leak on the client. #1351

Closed
wants to merge 1 commit into from

Conversation

gabikliot
Copy link
Contributor

Fixes #1350.

return null;
}

// Don't pass CancellationToken to Take. It causes too much spinning.
Message msg = PendingInboundMessages.Take();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the comment at #1350 (comment) :

hmmm, that's very odd, but I guess it has to do with Disposing the blocking collection at the same time that the Take is initiating, and getting into a weird state. I was trying to follow the code in the framework, but couldn't find an ArgumentNullException being thrown from SemaphoreSlim.Wait directly. Probably some function got inlined and it's not showing in the stack trace, and the code is very tough to follow. Any chance we can delay disposing somehow? If not, it might make to catch ArgumentNullException (although I dislike that, but it's better than having a catch-all clause for all exceptions to be treated equally, when we know this exception happens harmlessly)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I will just remove the Dispose completely. Its redundant. I should not have put it. CompleteAdding is just enough. No need to "smash kill with a hammer" if a clean stop does the work.
If we did want Dispose (which I think we do not now), then we need to wait for this thread to finish first and dispose only after we know there are no usages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, although this actually might have some unmanaged resources, since it's using semaphores. Still it might be fine.
Catching InvalidOperationException still makes sense though according to docs

@gabikliot
Copy link
Contributor Author

Removed Dispose and added 2 extra exception catch blocks.

@jdom
Copy link
Member

jdom commented Jan 30, 2016

Looks good. Do you want to squash before I merge? (and potentially address that minor comment?)

@gabikliot
Copy link
Contributor Author

Done: fixed logging, squashed, rebased.

@amccool
Copy link
Contributor

amccool commented Jan 31, 2016

I assume this is going into master (1.2, right?). If there is a 1.1.3 could I get this fix in there as well?

@gabikliot
Copy link
Contributor Author

Alex, I not sure what is the release plan, but there is also an easy short term fix that you could do on your side: just put a short sleep in between the client init attempts, instead of a tight loop. It will cause much less threads to leak. Also worth mentionint that those threads are just stuck, not really consuming CPU. So it doesn't appear to be a critical bug.

@amccool
Copy link
Contributor

amccool commented Feb 1, 2016

We are currently using polly with an exponential backoff on re-init. The threads do appear to be "just stuck" but when the total thread count gets excessive (400+ oooops) the context switch/sec kills the box.
Because we can pull nugets on a private repo I can live with not getting something released in the short term. I was just hoping to identify what release this will be going into.

{
logger.Verbose(ErrorCode.ProxyClient_OperationCancelled, "Received operation cancelled exception -- exiting. {0}", oce);
logger.Verbose(ErrorCode.ProxyClient_ReceiveError, "Received Invalid Operation exception -- exiting. {0}", exc);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe these new exceptions should log with a different error code than the general exception case (ErrorCode.ProxyClient_ReceiveError). Maybe not worth creating an entirely new code, but at least maybe reuse ErrorCode.ProxyClient_OperationCancelled, which is logically very similar to canceling the Take call by passing a cancellation token, just a different implementation due to perf, but for the end user reading the log it should be no different.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, if you agree I can pull locally and do the change before committing. No need to submit the change yourself if you aren't near your dev computer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I agree. Go ahead.
Also, those are at Verbose, so they wont appear in the log at all, since we very rarely run at Verbsoe.

@jdom
Copy link
Member

jdom commented Feb 1, 2016

@amccool glad that you could work around it. Not high priority I assume, but yes, if there is a 1.1.3 we can definitely include this fix as well.

@jdom
Copy link
Member

jdom commented Feb 1, 2016

Merged with b7784c4

@jdom jdom closed this Feb 1, 2016
@gabikliot gabikliot deleted the CancelClientThread branch February 15, 2016 07:08
@github-actions github-actions bot locked and limited conversation to collaborators Dec 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants