Lumberjack input, maximum connection exceeded #3277

Closed
svenmueller opened this issue May 19, 2015 · 21 comments
@svenmueller

After running for a while, logstash doesn't output any events to ES. Instead, lots of warnings are written to the log file.

{:timestamp=>"2015-05-19T15:13:13.366000+0000", :message=>"Lumberjack input, maximum connection exceeded, new connection are rejected.", :max_clients=>nil, :level=>:warn}

(Btw, why does it say ":max_clients=>nil"?)

When executing sudo lsof | grep logstash a lot of unclosed connections are listed.

Any idea why this happens from time to time?

@ph
Contributor

ph commented May 19, 2015

:max_clients=>nil is definitely a typo, I will fix it.

The "Lumberjack input, maximum connection exceeded, new connection are rejected" message means that logstash is applying back pressure to the LSF clients instead of going OOM and dying. The connections are kept open until they time out when the queue is blocked for too long.

The back pressure happens when something slows down the consumption of events and the queue is blocked for a certain period of time. This slowdown can be caused by multiple things: one output slowing down the others, an external webservice call, or a lot of text manipulation.

@svenmueller
Author

Thanks for the fast reply. We only use the ES output plugin, and ES seems to be running fine.

you can see our full config here: https://gist.github.com/svenmueller/4faac33ac051263f1f96

@ph
Contributor

ph commented May 19, 2015

@svenmueller ty for your config!

Do you get the maximum connection warning really often?

One way to get more throughput out of logstash is to run it with multiple workers for the filtering (the -w option, 1 by default). The problem with this option is that the multiline FILTER is not thread safe; you have to replace it with multiple inputs using the multiline codec.
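For illustration, a rough sketch of what that could look like (the port, certificate paths, and multiline pattern below are placeholders, not taken from the linked config): the multiline handling moves into the input as a codec, and logstash is started with more filter workers:

input {
  lumberjack {
    port => 5000
    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
    ssl_key => "/etc/pki/tls/private/logstash-forwarder.key"
    # the multiline codec replaces the (non-thread-safe) multiline filter
    codec => multiline {
      pattern => "^\s"
      what => "previous"
    }
  }
}

Then start logstash with several filter workers, e.g. bin/logstash -f /etc/logstash/conf.d/ -w 4.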

@svenmueller
Author

The problem occurs about 1-2 times per day. How can I find out the underlying problem? Is it possible to see which input/filter/output plugin is slowing down the whole logstash pipeline?

@svenmueller
Author

I also noticed that stopping/restarting the logstash service is not possible whenever the problem described above occurs:

log.nodexyz.com [2015-05-20 19:38:50]:~$ sudo tail -f /var/log/logstash/logstash.log
...
{:timestamp=>"2015-05-20T19:38:50.535000+0000", :message=>"Lumberjack input, maximum connection exceeded, new connection are rejected.", :max_clients=>nil, :level=>:warn}
{:timestamp=>"2015-05-20T19:38:50.546000+0000", :message=>"Lumberjack input, maximum connection exceeded, new connection are rejected.", :max_clients=>nil, :level=>:warn}
{:timestamp=>"2015-05-20T19:38:50.552000+0000", :message=>"Lumberjack input, maximum connection exceeded, new connection are rejected.", :max_clients=>nil, :level=>:warn}

log.nodexyz.com [2015-05-20 19:38:50]:~$ sudo service logstash restart
Killing logstash (pid 30181) with SIGTERM
Waiting logstash (pid 30181) to die...
Waiting logstash (pid 30181) to die...
Waiting logstash (pid 30181) to die...
Waiting logstash (pid 30181) to die...
Waiting logstash (pid 30181) to die...
logstash stop failed; still running.
logstash started.

log.nodexyz.com [2015-05-20 19:38:50]:~$ sudo tail -f /var/log/logstash/logstash.log
...
{:timestamp=>"2015-05-20T19:39:12.175000+0000", :message=>"The error reported is: \n  Address already in use - bind - Address already in use"}

@m1k3ga

m1k3ga commented May 21, 2015

Hi there,
we experience a similar behaviour in logstash.
After 3-4 hours our logstash receiver hangs with the message: "Socket limit exceeded".
No more events are processed from that time on.

We use LSF (forwarder 0.4.0) on about 20 machines and push the events (5 MB/s) to a logstash instance (1.5.0) via lumberjack (with SSL).
The logstash instance then pushes the events to elasticsearch.

Before 1.5.0 GA we regularly ran into OOM, now we exceed the sockets ;)

Next step is to use 3 logstash instances for receiving log events.

Any idea what else we can do?

@mogmismo

Hey, we are also experiencing similar behavior in logstash. After 2-3 hours we run out of sockets. We are also using logstash-forwarder on about 8 machines that push to logstash over SSL and into ES. We are running the latest logstash-1.5.0-1.noarch on CentOS 7.

Just like @svenmueller, we can't stop logstash except with a kill command, and we get the same message in the logs:

{:timestamp=>"2015-05-22T15:35:35.475000-0400", :message=>"Lumberjack input, maximum connection exceeded, new connection are rejected.", :max_clients=>nil, :level=>:warn}

@ph
Contributor

ph commented May 26, 2015

Hello @m1k3ga, would you mind testing a PR that implements another fix for the lumberjack input to prevent the OOM and hopefully the socket issue you are currently experiencing? See logstash-plugins/logstash-input-lumberjack#12.

You can install this unreleased plugin by editing your Gemfile and changing the line:

gem "logstash-input-lumberjack"

to

gem "logstash-input-lumberjack", :github => "ph/logstash-input-lumberjack", :branch => "fix/circuit-breaker"

And run this command:

bin/plugin install --no-verify

@ph
Contributor

ph commented May 26, 2015

Concerning the hang, we have a PR in the works (#3211) to refactor the semantics of the shutdown of the input plugins. Some of them don't stop gracefully, and that is the case here.

@ph
Contributor

ph commented May 26, 2015

@mogmismo this error is a symptom of a slowdown somewhere; before we were going OOM, now we run out of sockets. 😞

But this actually means there is a slowdown somewhere, either in the filters or in the outputs. Do you have any other errors in the log file? Maybe retry timeouts on the elasticsearch output?

@mabdelfattah

I have the same problem... So, how can I increase the number of workers if I am running logstash as a service?

@ph
Contributor

ph commented May 28, 2015

I think there is a misunderstanding in this issue about whether this message is problematic or not.

Let me explain a bit about how Logstash works with events. We have multiple types of plugins, and between every stage of the pipeline we use a small queue which is fixed at 20 events.
So the system looks a bit like this:

<INPUTS> -> SizeQueue.new(20) -> <FILTERS> -> SizeQueue.new(20) -> <OUTPUTS> -> external service

Now let's take the scenario where something is wrong on the external service side. It could be multiple things: the service is down, there is a network slowdown, or the service cannot keep up with the throughput we have. The inputs need to be informed of this situation so they stop accepting new events; the way we currently do this is by using a blocking queue.

The maximum connection exceeded errors you see are due to how the LSF client handles this scenario. The logstash-forwarder connects to the lumberjack input, and this connection has a timeout. When the queue is blocked, the LSF will detect this as a timeout. When the forwarder detects a timeout it will try to reconnect; if the queue is blocked for a long time the LSF clients will reconnect multiple times and eventually make logstash run out of connections until the queue is unblocked.

This limit is a way to prevent logstash from going OOM under back pressure; in 1.4.2 it was going OOM in this situation.
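As a rough illustration of the blocking behaviour described above (a minimal Ruby sketch using the standard library's SizedQueue, not the actual pipeline code), the producer blocks as soon as the queue is full, which is how a slow output propagates back pressure all the way to the inputs:

require "thread"

queue = SizedQueue.new(20)            # small fixed-size queue between stages

# producer thread, standing in for an input
producer = Thread.new do
  100.times do |i|
    queue.push("event #{i}")          # blocks here once 20 events are queued
  end
end

# slow consumer thread, standing in for a slow output / external service
consumer = Thread.new do
  loop { queue.pop; sleep 0.1 }       # simulate a slow downstream service
end

producer.join                         # the producer only finishes as fast as the consumer drains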

Is it a problem?

  • If you see this error from time to time and logstash recovers, it is probably a transient error.
  • If logstash never recovers from this error, it could be a bug and we will address it.

What can I do to improve the situation?

  • Check that the outputs are not under-allocated and can sustain the throughput.
  • Give a higher timeout to your LSF clients to help logstash recover (see the sketch after this list).
  • If you have a lot of filters and you are not using the multiline filter (THIS FILTER IS NOT THREADSAFE), you can give more worker threads to logstash; this will speed up the processing of the events and could help.
  • Some people also use a broker setup with kafka or redis to buffer the events and distribute high load.
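On the LSF timeout point mentioned above, the setting lives in the network section of the logstash-forwarder config. A hedged sketch (server address, certificate path, and the value itself are placeholders to tune for your environment):

{
  "network": {
    "servers": [ "logstash.example.com:5000" ],
    "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt",
    "timeout": 60
  },
  "files": [
    { "paths": [ "/var/log/syslog" ] }
  ]
}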

What the developers are working on

  • We are looking at improving the stability of the lumberjack input.
  • Move to an async model for the network inputs.
  • Add slow start to the logstash-forwarder.

@ph
Contributor

ph commented May 28, 2015

The need to do a kill -9 is a bug; we are addressing it with #3211.

@mabdelfattah

@ph, I'll try to stop that postgresql filter for a while (https://gist.github.com/bdelbosc/9508821), and will let you know if logstash stops again.

@ph
Contributor

ph commented May 28, 2015

@mabdelfattah are you getting the "maximum connection exceeded" error in your log? This issue is related to the lumberjack input, and I don't see one in your configuration. Can you create a new issue with details about your configuration and any error messages you can find in the logstash log?

@monstrocious

I too am getting the same error in the logstash logs, and am also getting Read error looking for ack: read tcp xxxxxx i/o timeout in the logstash-forwarder logs...

@vankhoa011

I got the same problem too. Logstash died and can't parse logs anymore. It worked again after restarting logstash.

@phungle

phungle commented Jun 2, 2015

You cannot restart logstash after that with the command "sudo service logstash restart".
You should find the old process with "ps aux | grep logstash",
then kill that process with "kill -9 pid_you_found_above",
then run "sudo service logstash restart".
Hope it helps!
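Roughly, the sequence described above (the PID placeholder matches the wording above):

ps aux | grep logstash              # find the old, hung logstash process
sudo kill -9 pid_you_found_above    # replace with the PID found above
sudo service logstash restart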

@mabdelfattah

@ph I'm sorry for the late response, here is my lumberjack input config:

input {
  lumberjack {
   port => 5000
   type => "logs"
   ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
   ssl_key => "/etc/pki/tls/private/logstash-forwarder.key"
  }
}

@slava-vishnyakov

For me, this solved the problem of Logstash hanging and the messages in the logs:

input {
  lumberjack {
    ...
    max_clients => 30000
    ...
  }
}

Logstash complains about a deprecated option, but at least it is stable and works.

@suyograo suyograo added v1.5.2 and removed v1.5.1 labels Jun 17, 2015
@ph
Contributor

ph commented Jun 19, 2015

I've released a new version of the lumberjack input that drops support for max_clients; instead it uses an internal SizeQueue that can time out if the input queue is blocked. So limiting the number of clients is now based on the ingestion capacity.
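Assuming the standard 1.5 plugin manager and no Gemfile pinning left over from the earlier workaround, updating the plugin should be enough to pick up that release:

bin/plugin update logstash-input-lumberjack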

@ph ph closed this as completed Jun 19, 2015
@ph ph reopened this Jun 19, 2015
@ph ph closed this as completed Jun 19, 2015