
log-courier input plugin hanging in thread state WAITING (on object monitor) #243

Closed
zoni opened this issue Oct 21, 2015 · 5 comments

zoni commented Oct 21, 2015

(Note: Please bear with me, there's a lot of info here to explain what steps we've taken to debug the situation so far. You may wish to skip to the section marked "Looking at the input plugin" first, then backtrack if needed)

We are running into an issue with log-courier/logstash/elasticsearch where it _appears_ that the log-courier input plugin ends up hanging and no longer sends data through to elasticsearch. It is unclear at this point whether this is a bug on the side of log-courier or whether the issue is with logstash itself.

Background

Our journey begins with Nagios alerts going off about servers running out of available disk space. Some investigation pointed to log-courier as the culprit, as it was keeping large numbers of rotated files open.

At first we believed log-courier was not properly detecting rotated files; however, I have mostly ruled that scenario out. Rather, this appears to be by design: on the affected server(s), log-courier has not yet successfully sent all data from these logs to logstash, so it keeps them open.

lc-admin output shows orphaned files and pending payloads.

Turning the log level up to debug shows that everything is fine for some time, but at Oct 18 11:51:45 log-courier loses its connection to logstash and reconnects. It then starts receiving ACKNs for the same payload in an endless loop.

I first thought this may be an issue with log-courier on the client side (there's an strace here as well if you want it) but after some time I noticed there is just one specific logstash server out of our cluster that triggers this behavior.

Our log-courier agents pick a logstash server from a pool using round-robin DNS, and in my experiments I found that all servers in the pool are fine except one. Connecting to this one immediately triggers the loop of identical ACKN messages; connecting to any other server, everything goes through fine.

Looking at the input plugin

There is nothing noteworthy in the logs on the logstash side, though I should note that we haven't turned these up to debug yet. However, in issue #219 I found mention of symptoms that appear to correspond to what we are seeing.

I dumped thread traces of both a healthy and the unhealthy logstash server.

On the unhealthy one I am seeing that all the log-courier threads (all the ones marked "Ruby-0-Thread-0123456: /opt/logstash/vendor/bundle/jruby/1.9/gems/log-courier-1.7/lib/log-courier/server_tcp.rb:166") are in java.lang.Thread.State: TIMED_WAITING (on object monitor).

Comparing that to the healthy one, I see a lot of them in state java.lang.Thread.State: RUNNABLE.
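For anyone triaging the same thing, the per-state comparison above can be pulled out of a jstack dump with standard shell tools. This is a generic sketch, not tied to our setup: the threads.txt filename and the two sample threads are illustrative, and the jstack invocation assumes you know the Logstash JVM's pid.

```shell
# Capture a thread dump from the running Logstash JVM first, e.g.:
#   jstack <logstash-pid> > threads.txt
# Illustrative sample in the same format as the traces quoted above:
cat > threads.txt <<'EOF'
"Ruby-0-Thread-7: /opt/logstash/vendor/bundle/jruby/1.9/gems/log-courier-1.7/lib/log-courier/server_tcp.rb:166" daemon prio=5
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
"Ruby-0-Thread-8: /opt/logstash/vendor/bundle/jruby/1.9/gems/log-courier-1.7/lib/log-courier/server_tcp.rb:166" daemon prio=5
   java.lang.Thread.State: RUNNABLE
EOF

# Count the log-courier server threads by state: in a jstack dump, each
# thread's name line is immediately followed by its Thread.State line.
grep -A1 'server_tcp.rb' threads.txt | grep 'Thread.State' | sort | uniq -c
```

On a healthy server this should show mostly RUNNABLE; on the unhealthy one, everything piles up in TIMED_WAITING.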

In #219 @driskell mentions:

You'll notice courier threads trying to push to the filter queue and waiting because it's full. Then you'll notice >output trying to pop from the output queue but waiting because empty.

I couldn't quite follow along with that statement but I fear that the same may be going on in our case. Could somebody more knowledgeable about Java threads and logstash internals perhaps confirm this somehow?

There is nothing in the logstash log files that would shed any more light on this.

Additionally, some people here are playing the blame game and advocating dropping log-courier in favor of lumberjack/logstash-forwarder. I think courier is vastly superior to logstash-forwarder, so if it is logstash itself that is to blame, such confirmation would be very welcome.

@driskell
Owner

The issue is definitely in the input plugin if the ACKN is successful. ACKN of the same payload over and over again means the input plugin isn't able to clear its queue so it's stalling the senders.

The input thread (<courier) is stuck on a push to a SizedQueue. That's Logstash's own queue, so that is stalled too.

Looking at the filter worker (|worker), it is likewise blocked on a push to the output queue, so that is stalled as well.

This generally means the elasticsearch output is stuck (the >elasticsearch threads). Looking at the healthy server, we can see many of them waiting on a SizedQueue pop, which takes an item from the queue; that means the output queue is empty and they are waiting for new events. On the unhealthy one, all of them are stuck in receive calls of some sort. I've not explored this kind of problem in detail before, but generally: check that elasticsearch is OK and that the connection to elasticsearch is fine, then restart Logstash. Maybe there are some logs from the elasticsearch plugin on the Logstash side?
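To illustrate the back-pressure chain being described, here is a generic Ruby sketch of SizedQueue behavior (not Logstash's actual pipeline code): once a SizedQueue is full, every push blocks until a consumer pops, so one stuck consumer stalls every pusher upstream of it.

```ruby
# SizedQueue blocks pushers once it hits capacity. If nothing pops
# (e.g. the output stage is stuck), every upstream push stalls, which
# is why the input threads end up waiting on an object monitor.
queue = SizedQueue.new(2)

queue.push(:event1)
queue.push(:event2)                            # queue is now full

producer = Thread.new { queue.push(:event3) }  # this push blocks
sleep 0.2
state_while_full = producer.status             # "sleep": blocked on the full queue

queue.pop                                      # a consumer frees one slot...
producer.join                                  # ...and the blocked push completes
```

Per the description above, the queues between Logstash's input, filter, and output stages are exactly this kind of bounded queue, so a stuck elasticsearch output surfaces as blocked <courier and |worker threads.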


zoni commented Oct 21, 2015

Thanks for the quick response!

The issue is definitely in the input plugin if the ACKN is successful. ACKN of the same payload over and over again means the input plugin isn't able to clear its queue so it's stalling the senders.

Right, that was my explanation of what I was seeing as well. Thanks for confirming that.

This generally means the elasticsearch output is stuck (>elasticsearch ones).

That's what I suspected as well. I'm relatively certain then that the issue is somewhere in logstash itself here (quite possibly the elasticsearch output plugin as you mentioned) rather than in the log-courier input.

Maybe there are some logs from the elasticsearch plugin on Logstash side?

Unfortunately, there wasn't anything that stood out. I'm going to take another look at them but will probably have to resort to bumping it up to debug logging (which will likely also make me lose the reproduction case as logstash restarts) and see if anything comes out.

I'm going to leave this issue open a little while longer just in case, but will update when I have more information.


zoni commented Oct 21, 2015

On a side note, would you consider accepting the following additions to log-courier?

  • An option that would allow it to trigger a re-connection on the client if it receives the same ACKN for <x> period of time
  • An additional field in the lc-admin output displaying the time (unix timestamp perhaps? both?) the last payload was accepted, or perhaps the time elapsed since the last payload was successfully delivered?

The first would ensure unhealthy servers are detected and avoided (a naive implementation may still hit the same server a few times if it's unlucky, but eventually it should hit a healthy one), while the second would allow us to detect this scenario more easily.
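A rough sketch of what the first option could look like on the client side. Everything here is hypothetical: StaleAckDetector, max_stale_ack_time, and the sequence-number argument are illustrative names, not log-courier's actual configuration or internals.

```ruby
# Hypothetical helper: track the last acknowledged sequence number and
# report staleness when the same ACKN repeats for longer than a threshold.
class StaleAckDetector
  def initialize(max_stale_ack_time)
    @max = max_stale_ack_time  # seconds before a repeated ACKN counts as stale
    @last_seq = nil
    @since = nil
  end

  # Returns true when the connection should be recycled.
  def stale?(seq, now = Time.now)
    if seq != @last_seq
      @last_seq = seq            # progress was made; reset the clock
      @since = now
      false
    else
      (now - @since) > @max      # same sequence number for too long
    end
  end
end
```

The client would call stale? on every incoming ACKN and tear down the connection when it returns true, letting the round-robin DNS pool offer a (hopefully) healthy server on reconnect.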


driskell commented Nov 2, 2015

The additional fields are already on the roadmap.
The option to reconnect would, I think, come once I get 2.0 out the door, as it would be useful in a load balancing / failover scenario.
I have a new, improved plugin in the works as a 2.0 plugin, though I'm trying to clean it up first.

@driskell
Owner

I'll close this, as the lc-admin pieces are already on the roadmap and load balancing will be much better in 2.0.
