log-courier input plugin hanging in thread state WAITING (on object monitor) #243
The issue is definitely in the input plugin if the ACKN is successful. ACKN of the same payload over and over again means the input plugin isn't able to clear its queue, so it's stalling the senders. The input thread (…) Looking at the filter worker (…) This generally means the elasticsearch output is stuck (…)
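The stall described above can be sketched with a bounded queue. This is a minimal illustration, not log-courier's actual code: when the downstream output stops draining, the input thread's push blocks and the thread parks in a waiting state, which upstream looks like the same payload being ACKNed forever.

```ruby
require 'thread'

# A full SizedQueue makes the producer (the "input thread") block on
# push until a consumer (the "output") makes room again.
queue = SizedQueue.new(2)

producer = Thread.new do
  5.times { |i| queue.push(i) } # blocks once the queue holds 2 items
  :done
end

sleep 0.2
blocked_state = producer.status # "sleep" while the push is blocked
puts "input thread state while output is stuck: #{blocked_state}"

5.times { queue.pop } # the "output" drains the queue again
producer.join
puts "input thread finished: #{producer.value.inspect}"
```

While the queue is full the producer thread reports status `"sleep"`, which at the JVM level corresponds to the waiting-on-monitor states seen in the thread dumps below.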
Thanks for the quick response!
Right, that was my explanation of what I was seeing as well. Thanks for confirming that.
That's what I suspected as well. I'm relatively certain then that the issue is somewhere in logstash itself here (quite possibly the elasticsearch output plugin as you mentioned) rather than in the log-courier input.
Unfortunately, there wasn't anything that stood out. I'm going to take another look at them but will probably have to resort to bumping it up to debug logging (which will likely also make me lose the reproduction case as logstash restarts) and see if anything comes out. I'm going to leave this issue open a little while longer just in case, but will update when I have more information.
On a side note, would you consider accepting the following additions to log-courier?
The first would ensure unhealthy servers are detected and avoided (a naive implementation may still hit the same server a few times if it's unlucky, but eventually it should hit a healthy one) while the second would allow us to more easily detect this scenario.
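The naive selection behaviour described above could be sketched roughly as follows. All names here are illustrative, not log-courier's actual API: skip servers that failed recently, and fall back to plain round-robin when every server in the pool is marked bad.

```ruby
# Illustrative sketch of client-side failure avoidance over a pool of
# logstash servers (hypothetical class, not log-courier's real code).
class ServerPool
  BACKOFF = 30 # seconds to avoid a server after a failure

  def initialize(servers, clock: -> { Time.now.to_f })
    @servers = servers
    @failed_at = {}
    @next = 0
    @clock = clock
  end

  def mark_failed(server)
    @failed_at[server] = @clock.call
  end

  def pick
    @servers.length.times do
      server = @servers[@next % @servers.length]
      @next += 1
      last = @failed_at[server]
      return server if last.nil? || @clock.call - last > BACKOFF
    end
    # Every server failed recently: fall back to plain round-robin
    # rather than refusing to send at all.
    @servers[(@next - 1) % @servers.length]
  end
end

pool = ServerPool.new(%w[ls1 ls2 ls3])
pool.mark_failed('ls2')
picks = 4.times.map { pool.pick }
puts picks.inspect # cycles the healthy servers until ls2's backoff expires
```

As noted, an unlucky naive implementation still probes a bad server occasionally, but it converges on healthy ones quickly.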
The additional fields are already on the roadmap.
I'll close this as the lc-admin pieces are already in the roadmap, and load balancing will be much better in 2.0 |
(Note: Please bear with me, there's a lot of info here to explain what steps we've taken to debug the situation so far. You may wish to skip to the section marked "Looking at the input plugin" first, then backtrack if needed)
We are running into an issue with log-courier/logstash/elasticsearch where it _appears_ that the log-courier input plugin ends up hanging and no longer sends data through to elasticsearch. It is unclear at this point whether this is a bug on the side of log-courier or whether the issue is with logstash itself.
Background
Our journey begins with Nagios alerts going off about servers running out of available disk space. Some investigation pointed to log-courier being the culprit, as it was keeping open large amounts of rotated files.
At first we believed log-courier was not properly detecting rotated files; however, I have mostly ruled that scenario out. Rather, this appears to be by design: on the affected server(s), log-courier has not yet successfully sent all data from these logs to logstash, so it keeps the rotated files open.
lc-admin output shows orphaned files and pending payloads.
Turning the log level up to debug shows that for some time everything is fine, but at Oct 18 11:51:45 log-courier loses its connection to logstash and reconnects. It then starts receiving ACKNs for the same payload in an endless loop.
I first thought this may be an issue with log-courier on the client side (there's an strace here as well if you want it) but after some time I noticed there is just one specific logstash server out of our cluster that triggers this behavior.
Our log-courier agents pick a logstash server from a pool using round-robin DNS and in my experiments, I found that all servers in the pool are fine except for one. Connecting to this one immediately triggers the loop of identical ACKN messages. Connect to another server and everything goes through fine.
Looking at the input plugin
There is nothing noteworthy in the logs on the logstash side, though I should note that we haven't turned these up to debug yet. However, in issue #219 I found mention of some symptoms that appear to correspond with some things we are seeing.
I dumped thread traces of both a healthy and the unhealthy logstash server.
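For anyone wanting to reproduce the comparison: a dump can be captured with the JDK's `jstack <logstash-pid> > threads.txt`, and the states summarized with a few lines of Ruby. The dump content below is a tiny hand-written stand-in for a real trace, just to show the counting.

```ruby
# Write a miniature stand-in for a real jstack dump (real dumps come
# from `jstack <logstash-pid> > threads.txt`).
File.write('threads.txt', <<~DUMP)
  "Ruby-0-Thread-15" prio=10 tid=0x1 nid=0x2 waiting on condition
     java.lang.Thread.State: TIMED_WAITING (on object monitor)
  "Ruby-0-Thread-16" prio=10 tid=0x3 nid=0x4 runnable
     java.lang.Thread.State: RUNNABLE
DUMP

# Count threads per JVM state; a stalled node shows nearly everything
# parked in TIMED_WAITING/WAITING, a healthy one plenty of RUNNABLE.
counts = File.readlines('threads.txt')
             .filter_map { |l| l[/java\.lang\.Thread\.State: (\w+)/, 1] }
             .tally
puts counts # counts by state
```

Comparing these per-state counts between nodes makes the difference described below immediately visible.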
On the unhealthy one I am seeing that all the log-courier threads (all the ones marked `"Ruby-0-Thread-0123456: /opt/logstash/vendor/bundle/jruby/1.9/gems/log-courier-1.7/lib/log-courier/server_tcp.rb:166"`) are in `java.lang.Thread.State: TIMED_WAITING (on object monitor)`. Comparing that to the healthy one, I see a lot of them in state `java.lang.Thread.State: RUNNABLE`.

In #219 @driskell mentions:
I couldn't quite follow along with that statement but I fear that the same may be going on in our case. Could somebody more knowledgeable about Java threads and logstash internals perhaps confirm this somehow?
There is nothing in the logstash log files that would shed any more light on this.
Additionally, some people here are playing the blame game and advocating dropping log-courier in favor of lumberjack/logstash-forwarder. I think courier is vastly superior to logstash-forwarder, so if it is logstash itself that is to blame, such confirmation would be very welcome.