
Logstash heap size exceeds default memory cap of 500M (1.5.0.rc2-1) #3003

Closed

svenmueller opened this issue Apr 14, 2015 · 37 comments

@svenmueller

The default heap size is capped at 500M. This does not seem to be enough in our environment, since logstash keeps dying, mostly without leaving any logs. Sometimes the following error can be found in /var/log/logstash/logstash.err:

$ cat /var/log/logstash/logstash.err
Error: Your application used more memory than the safety cap of 500M.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace
  • I know that a maximum of 20 events is buffered in the pipeline for each phase (input, filter, output), so why does the heap still grow beyond 500M?
  • Which heap size should be used in a production environment?
  • Any hints for debugging why the logstash service keeps dying?

We are using the following plugins:

  • inputs
    • exec
    • lumberjack
    • tcp
    • pipe
  • filters
    • multiline
    • grok
    • mutate
    • drop
    • date
    • json
  • outputs
    • ES

Thanks,
Sven

ph self-assigned this Apr 14, 2015
@ph
Contributor

ph commented Apr 14, 2015

@svenmueller I am investigating some reports of logstash hanging or dying. Did you upgrade from 1.4.2 to 1.5.0rc2? If it was stable before, it could mean we have introduced a memory leak somewhere.

Would you mind copying your configuration into this issue? It will give us some clues about what kind of transformations you are doing. Can you also include the output of bin/plugin list --verbose?

@colinsurprenant
Contributor

Are you using the multiline codec by any chance? It has a known potential memory leak, see logstash-plugins/logstash-codec-multiline#4.

@svenmueller
Author

@colinsurprenant yes, we are using the multiline plugin

@ph we upgraded from a very old logstash version and also had kind of the same problems before the migration

logstash config:
https://gist.github.com/svenmueller/acdf51b85d3c1cc86001

plugin list

logstash-codec-collectd (0.1.7)
logstash-codec-dots (0.1.5)
logstash-codec-edn (0.1.5)
logstash-codec-edn_lines (0.1.6)
logstash-codec-es_bulk (0.1.5)
logstash-codec-fluent (0.1.5)
logstash-codec-graphite (0.1.5)
logstash-codec-json (0.1.6)
logstash-codec-json_lines (0.1.6)
logstash-codec-line (0.1.5)
logstash-codec-msgpack (0.1.6)
logstash-codec-multiline (0.1.6)
logstash-codec-netflow (0.1.4)
logstash-codec-oldlogstashjson (0.1.5)
logstash-codec-plain (0.1.5)
logstash-codec-rubydebug (0.1.6)
logstash-filter-anonymize (0.1.4)
logstash-filter-checksum (0.1.5)
logstash-filter-clone (0.1.4)
logstash-filter-csv (0.1.4)
logstash-filter-date (0.1.5)
logstash-filter-dns (0.1.4)
logstash-filter-drop (0.1.4)
logstash-filter-fingerprint (0.1.4)
logstash-filter-geoip (0.1.7)
logstash-filter-grok (0.1.8)
logstash-filter-json (0.1.4)
logstash-filter-kv (0.1.5)
logstash-filter-metrics (0.1.6)
logstash-filter-multiline (0.1.5)
logstash-filter-mutate (0.1.5)
logstash-filter-ruby (0.1.4)
logstash-filter-sleep (0.1.4)
logstash-filter-split (0.1.5)
logstash-filter-syslog_pri (0.1.4)
logstash-filter-throttle (0.1.4)
logstash-filter-urldecode (0.1.4)
logstash-filter-useragent (0.1.6)
logstash-filter-uuid (0.1.4)
logstash-filter-xml (0.1.4)
logstash-input-couchdb_changes (0.1.3)
logstash-input-elasticsearch (0.1.4)
logstash-input-eventlog (0.1.4)
logstash-input-exec (0.1.3)
logstash-input-file (0.1.6)
logstash-input-ganglia (0.1.3)
logstash-input-gelf (0.1.3)
logstash-input-generator (0.1.3)
logstash-input-graphite (0.1.3)
logstash-input-imap (0.1.3)
logstash-input-irc (0.1.3)
logstash-input-kafka (0.1.11)
logstash-input-log4j (0.1.3)
logstash-input-lumberjack (0.1.4)
logstash-input-pipe (0.1.4)
logstash-input-rabbitmq (0.1.3)
logstash-input-redis (0.1.3)
logstash-input-s3 (0.1.8)
logstash-input-snmptrap (0.1.3)
logstash-input-sqs (0.1.3)
logstash-input-stdin (0.1.3)
logstash-input-syslog (0.1.3)
logstash-input-tcp (0.1.3)
logstash-input-twitter (0.1.5)
logstash-input-udp (0.1.3)
logstash-input-unix (0.1.3)
logstash-input-xmpp (0.1.3)
logstash-input-zeromq (0.1.3)
logstash-output-cloudwatch (0.1.3)
logstash-output-csv (0.1.3)
logstash-output-elasticsearch (0.1.18)
logstash-output-email (0.1.4)
logstash-output-exec (0.1.3)
logstash-output-file (0.1.5)
logstash-output-ganglia (0.1.3)
logstash-output-gelf (0.1.3)
logstash-output-graphite (0.1.6)
logstash-output-hipchat (0.1.3)
logstash-output-http (0.1.3)
logstash-output-irc (0.1.3)
logstash-output-juggernaut (0.1.3)
logstash-output-kafka (0.1.8)
logstash-output-lumberjack (0.1.4)
logstash-output-nagios (0.1.3)
logstash-output-nagios_nsca (0.1.3)
logstash-output-null (0.1.3)
logstash-output-opentsdb (0.1.3)
logstash-output-pagerduty (0.1.3)
logstash-output-pipe (0.1.3)
logstash-output-rabbitmq (0.1.4)
logstash-output-redis (0.1.3)
logstash-output-s3 (0.1.6)
logstash-output-sns (0.1.3)
logstash-output-sqs (0.1.3)
logstash-output-statsd (0.1.5)
logstash-output-stdout (0.1.4)
logstash-output-tcp (0.1.4)
logstash-output-udp (0.1.4)
logstash-output-xmpp (0.1.3)
logstash-output-zeromq (0.1.4)
logstash-patterns-core (0.1.6)

@ph
Contributor

ph commented Apr 14, 2015

Not exactly related to your issue, but you can replace this exec input with the heartbeat plugin: https://github.com/logstash-plugins/logstash-input-heartbeat

  exec {
    command => "/usr/bin/touch /var/run/logstash-monit/input.touch && /bin/echo OK."
    interval => 60
    type => "heartbeat"
  }
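
A rough sketch of what the replacement could look like (interval and type are assumed here to mirror the exec settings above; check the plugin docs for the exact options):

  heartbeat {
    # interval in seconds, matching the 60s exec schedule above
    interval => 60
    type => "heartbeat"
  }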

@ph
Contributor

ph commented Apr 14, 2015

@svenmueller How many clients do you have on the lumberjack side?

@suyograo
Contributor

@svenmueller could you also get us the output of jmap -histo <pid>? This will tell us which objects are on the heap.
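
For example, something along these lines (a sketch; assuming the JDK's jmap is installed, and keeping only the largest classes):

$ sudo -u logstash jmap -histo <pid> | head -n 40   # <pid> = the logstash java process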

@ph
Contributor

ph commented Apr 14, 2015

@svenmueller it is true that 20 events are kept in each phase of the pipeline, but the multiline filter aggregates events in an array inside the plugin and flushes them periodically. It is possible for this queue to get big and OOM logstash before the content is flushed. Running @suyograo's command will give us a clue whether something is accumulating.

@svenmueller
Author

@suyograo
Contributor

76 MB of byte[], 13 MB of org.jruby.RubyString

suyograo added the bug label Apr 14, 2015
@colinsurprenant
Contributor

@svenmueller per your config you are using the multiline filter, not the codec - just clarifying. Normally the multiline filter will accumulate at most 5 seconds of events if a multiline boundary is not reached, so I doubt this is the problem here. If it were, we would see a lot of LogStash::Event objects showing up in the heap.
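
For reference, a sketch of how that accumulation window shows up in config (pattern and what are placeholders; max_age is assumed to be the setting behind the 5-second bound mentioned above, so verify against the filter docs):

  multiline {
    # pattern/what are placeholders for your actual multiline rules
    pattern => "^\s"
    what => "previous"
    # assumed option: maximum age in seconds before a pending event is flushed
    max_age => 5
  }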

@colinsurprenant
Contributor

So, either we are dealing with a memory leak, which means that regardless of the heap size you configure you will OOM at some point. I would like you to test that: could you increase your heap size to, say, 750M and see how that works? I would also suggest installing VisualVM and looking at your heap usage pattern; if there is a memory leak it will be very easy to spot visually as a slowly but constantly increasing heap usage graph.

On the other hand, given the complexity of your configuration, your setup may simply require more than 500M of heap.
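
For reference, a sketch of one way to bump the heap on a package install (assuming the 1.5.x init script honors LS_HEAP_SIZE from /etc/default/logstash or /etc/sysconfig/logstash; the value is only an example):

# /etc/default/logstash (Debian/Ubuntu) or /etc/sysconfig/logstash (RPM)
# LS_HEAP_SIZE is assumed to be read by the packaged startup script
LS_HEAP_SIZE="750m"
# then restart the service:
sudo service logstash restart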

Probably unrelated but: why are you using

  pipe {
    command => "tail -F /var/run/logstash-clb-access"
    type => "clb-access"
  }

instead of a file input plugin?
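
i.e. a sketch of the equivalent, reusing the path from the pipe command above:

  file {
    # same path that was being tailed by the pipe input
    path => "/var/run/logstash-clb-access"
    type => "clb-access"
  }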

@ph
Contributor

ph commented Apr 16, 2015

I've been doing some tests with the multiline plugins to investigate memory leaks, and I seriously doubt they are the source of the problem.

I would follow @colinsurprenant's advice to bump the memory a bit and monitor it. Also consider rewriting the pipe and exec parts of the configuration to use the file input and the heartbeat plugin, and monitor the situation.

@colinsurprenant
Contributor

I suggest doing one change at a time, starting with increasing the memory and checking usage in VisualVM. If we confirm a leak, let's see if we can replace the pipe and check the result.

@mfeldheim

We have the same problems with a "complex" config (lumberjack input, several mutations, grok and multiline filters, ES output).
It is still crashing with the heap max set to 8G.

Config
https://gist.github.com/mfeldheim/c1b333c0560e7c4d0351

Failed with error

Error: Your application used more memory than the safety cap of 8G.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace

jmap right after launch

sudo -u logstash jmap -histo 24097 

https://gist.github.com/mfeldheim/96bf3f35d7f3d53a1771

jmap ~100k events / 5 minutes later

https://gist.github.com/mfeldheim/65e057be7a84877fd22e

I have started with the -w option and will post the output after the next crash.

Not sure if this is related: logstash.log gives me these messages, and I'm not sure what causes them or what they mean.

==> logstash.log <==
{:timestamp=>"2015-04-18T08:48:57.410000+0000", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2015-04-18T08:48:57.411000+0000", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2015-04-18T08:48:57.415000+0000", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2015-04-18T08:48:57.415000+0000", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2015-04-18T08:48:57.416000+0000", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2015-04-18T08:48:57.416000+0000", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2015-04-18T08:48:57.417000+0000", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2015-04-18T08:48:57.421000+0000", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2015-04-18T08:48:57.422000+0000", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2015-04-18T08:48:57.423000+0000", :message=>"retrying failed action with response code: 429", :level=>:warn}

@svenmueller
Author

I can confirm that changing the memory settings (raising the heap from 500M to 2048M) only means it takes longer until the OOM happens (I only changed the memory settings; the logstash config is the same).

Error: Your application used more memory than the safety cap of 2048M.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace

@mfeldheim

Any hint on how to pass -w as suggested in the crash message?

# Arguments to pass to logstash agent
LS_OPTS="-w"

but the -w there seems to serve a different purpose:

    -w, --filterworkers COUNT     Sets the number of filter workers to run.
                                   (default: 1)

@mfeldheim

Memory graph for logstash over the last 6 hours.

[image: memory usage graph over the last 6 hours]

Steadily rising.
The last part is with the minimal configuration below.
Looks like the filters do have a memory leak, but either the lumberjack input, logstash itself, or the ES output has the bad leak.

input {
  lumberjack {
    port => 5000
    type => "logs"
    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
    ssl_key => "/etc/pki/tls/private/logstash-forwarder.key"
  }
}

output {
  elasticsearch { host => localhost }
}

jmap close to max heap
https://gist.github.com/mfeldheim/c4a663d9a661ae8f22c5

@mfeldheim

Only a short period so far, but you can clearly see that this looks healthier:

[image: heap usage graph with the generator input]

Disabled the lumberjack input and used the generator instead, which generated 3.5m messages in the last 15 minutes:

#input {
#  lumberjack {
#    port => 5000
#    type => "logs"
#    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
#    ssl_key => "/etc/pki/tls/private/logstash-forwarder.key"
#  }
#}

input {
    generator {
    }
}

output {
  elasticsearch { host => localhost }
}

@mfeldheim

The lumberjack input clearly is the big leak here:

  • 06:00 first crash with full config
  • 08:45 second crash with just lumberjack input and ES output
  • since then, stable heap size with generator input and ES output

[image: heap usage graph showing both crashes and the stable period since]

@torrancew
Contributor

@mfeldheim Are your forwarders experiencing connectivity issues while Logstash is running? I noticed a socket leak related to lumberjack recently as well, and suspect the two may be related.

@mfeldheim

@torrancew Not on a regular basis. I am partially using public networks and forwarding from several locations around the globe, so I might see higher latency and occasional timeouts.
Of course, my forwarder logs are full of timeout errors, because logstash has crashed every few hours so far.

@suyograo
Contributor

@mfeldheim thanks for narrowing down the problem, this is super useful. We'll reproduce this on our end.

suyograo added this to the v1.5.0 milestone Apr 22, 2015
@suyograo
Contributor

After discussing this internally, we concluded this is a known issue with the Lumberjack input and a stalled pipeline. That said, we will look at fixing the OOM issue before GA.

When an output stalls, Logstash by design applies back pressure and stops accepting events into the input <--> filter queue and the filter <--> output queue. This propagates all the way back to the source of the events. We see evidence of a stalled pipeline in both reports above: retrying failed action with response code: 429 indicates the ES cluster is overwhelmed and cannot handle any more events.

Unfortunately, the Lumberjack input does not handle this back pressure correctly, which eventually leads to an OOM. While the Logstash pipeline is stalled, the Lumberjack clients continue to connect and send messages, which are kept in memory on the input side. Lumberjack times out after N seconds of waiting for an acknowledgement and tries again, resulting in even more events being stored in the heap.

@mfeldheim

@suyograo is back pressure handled correctly in other inputs, like the tcp input?

@ph
Contributor

ph commented Apr 28, 2015

@mfeldheim Sadly yes, they could be affected by the same issue, depending on how the connection is handled on the client side. Currently we spawn a thread per new connection (Socket.accept), so if the client side detects a timeout due to a blocked queue and tries to reconnect, this will create another blocked thread on our side. @suyograo I think it's worth giving this more thought.

@timbunce

@ph, I hadn't appreciated that. We've hit assorted kinds of random-seeming failures due to hitting thread-count, filehandle-count, and process-memory limits that I can now see were caused by this problem with the TCP input. They haven't happened very often, but often enough on production systems to cause significant pain (and concerns about the choice of logstash as a good tool for the job). I'd like to see this fixed, even if just via a band-aid like limiting the number of threads.

@mfeldheim

@ph I've tested your PR logstash-plugins/logstash-input-lumberjack#10, which solves the problem for now. Thank you!

I had to drastically reduce the default thread pool limit to 100 (from 1000) to stay below 1GB of heap.

@ph
Contributor

ph commented Apr 30, 2015

@mfeldheim thanks for the early feedback! I have another PR in the works that should not require this setting.

@dschneller

Seeing defunct logstash instances as well, clients reporting connection timeouts like this:

May 15 14:00:27 loadbalancer01 logstash-forwarder[6993]: 2015/05/15 14:00:27.498079 Registrar: processing 100 events
May 15 14:00:28 loadbalancer01 logstash-forwarder[6993]: 2015/05/15 14:00:28.455298 Registrar: processing 100 events
May 15 14:00:29 loadbalancer01 logstash-forwarder[6993]: 2015/05/15 14:00:29.232197 Registrar: processing 100 events
May 15 14:00:44 loadbalancer01 logstash-forwarder[6993]: 2015/05/15 14:00:44.232208 Read error looking for ack: read tcp 10.102.9.18:5044: i/o timeout
May 15 14:00:44 loadbalancer01 logstash-forwarder[6993]: 2015/05/15 14:00:44.232500 Setting trusted CA from file: /etc/logstash-forwarder/ssl/centerdevice-intermediate-ca.crt
May 15 14:00:44 loadbalancer01 logstash-forwarder[6993]: 2015/05/15 14:00:44.235565 Connecting to [10.102.9.18]:5044 (logstash01.in-xchg.clusterb.centerdevice.local) 
May 15 14:00:44 loadbalancer01 logstash-forwarder[6993]: 2015/05/15 14:00:44.364721 Connected to 10.102.9.18
May 15 14:00:59 loadbalancer01 logstash-forwarder[6993]: 2015/05/15 14:00:59.365191 Read error looking for ack: read tcp 10.102.9.18:5044: i/o timeout
May 15 14:00:59 loadbalancer01 logstash-forwarder[6993]: 2015/05/15 14:00:59.365502 Setting trusted CA from file: /etc/logstash-forwarder/ssl/centerdevice-intermediate-ca.crt
May 15 14:00:59 loadbalancer01 logstash-forwarder[6993]: 2015/05/15 14:00:59.368829 Connecting to [10.102.9.18]:5044 (logstash01.in-xchg.clusterb.centerdevice.local) 
May 15 14:00:59 loadbalancer01 logstash-forwarder[6993]: 2015/05/15 14:00:59.496895 Connected to 10.102.9.18

When I restart the logstash server(s), a couple of events are received, then the problem comes back.
Updated from 1.5.0-beta1 to 1.5.0rc4 today, no change. At first I noticed an OutOfMemory error, so I increased the heap from 500MB to 1025MB. No effect. Shortly after a restart the server enters the broken state again.

I created a stack dump of one of the running but defunct servers (never mind the gist's file extension): https://gist.githubusercontent.com/dschneller/eddcdcbd7c5a15f775d4/raw/gistfile1.java

The server process has more than 1100 CLOSE_WAIT connections according to netstat.
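
(A quick way to count them, as a sketch; 5044 is the lumberjack port from the logs above:)

$ netstat -tn | grep ':5044' | grep -c CLOSE_WAIT   # count half-closed connections on the lumberjack port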

Anything else I can provide?

@ph
Contributor

ph commented May 15, 2015

@dschneller This scenario appears when back pressure is applied from the output to the input. It means something is slowing down event consumption somewhere in your architecture; it could be a slow output or some network problem. If you look at your logstash log you should see something.

When this problem occurs the queue blocks, the LSF connections time out and retry to reconnect.
The lumberjack input will keep accepting them, up to a maximum of ~1000 clients by default.

There are two workarounds to mitigate the situation:

  • You can set a hard limit on the number of connections on the lumberjack input using the max_clients option. Set it higher than your actual number of LSF instances; this prevents the input from generating a lot of open connection threads and going OOM (see the sketch below).
  • You can also raise the LSF timeout tolerance; this gives logstash a better chance to consume the events.

We also have another PR that provides a cleaner fix without the need for a max_clients limit.
This PR needs more testing and review before it can be released.
logstash-plugins/logstash-input-lumberjack#12
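
A sketch of the first workaround, using max_clients as named above (port and certificate paths are taken from the earlier report in this thread; the limit value is only an example and should be sized above your real number of forwarders):

  lumberjack {
    port => 5000
    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
    ssl_key => "/etc/pki/tls/private/logstash-forwarder.key"
    # example value; size it above your actual number of LSF clients
    max_clients => 500
  }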

@felipegs

I need to investigate more, but I think logstash-input-zeromq has the same problem as logstash-input-lumberjack.

@avcherkasov

I have a logstash 1.5.0 cluster.
My config looks like this:

input {

  heartbeat {
    interval => 300
    type => "heartbeat"
  }

  udp {
    port => 60606
    type => syslog
  }

  tcp {
    port => 10514
    type => syslog
  }

  tcp {
    port => 1514
    type => syslog
  }

  gelf {
    codec => "plain"
    host => "0.0.0.0"
    port => 12201
    type => "gelf"
  }

  udp {
    codec => "json"
    host => "0.0.0.0"
    port => 12211
    type => "erlang"
  }

}


filter {
  if [type] == "syslog" {
    grok {
      match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
      add_field => [ "received_at", "%{@timestamp}" ]
      add_field => [ "received_from", "%{host}" ]
    }
    syslog_pri { }
    date {
      match => [ "syslog_timestamp", "ISO8601", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ]
      add_tag => "Syslog Timestamped"
      locale => "en"

    }
  }
}


output {
  elasticsearch {
    cluster => "elasticsearch"
  }
}

But I still see the logstash heap size increasing, and I am not using lumberjack.
Is the issue still not fixed for the tcp input?
What else could be a reason for the OOM?

@dschneller

The issue is not with Lumberjack alone. In our case it turned out Elasticsearch did not accept data due to an out-of-memory condition we had run into. That led Logstash to stall, and connections from all sorts of inputs (lumberjack and TCP) spawned new sockets without limit, IIRC. We added the max_clients limit to at least not let logstash die as well. The underlying problem with ES was simply too much load for the amount of hardware dedicated to it.

@ph
Contributor

ph commented May 27, 2015

@avcherkasov Like @dschneller said, the tcp input has the same behaviour as the old lumberjack input and doesn't support max_clients yet. I'm not a big fan of this solution (even if I made the PR); it is a temporary fix. My plan is to experiment with NIO to make the input evented and stop creating so many threads.

@avcherkasov

@ph thanks for the answer!
Will the fix be included in the next release of logstash?

@colinsurprenant
Contributor

Is this solved with the new lumberjack input safety checks? For the tcp input, should we open a new issue specific to that?

suyograo added v1.5.2 and removed v1.5.1 labels Jun 17, 2015
@ph
Contributor

ph commented Jun 19, 2015

I've released a new version of the lumberjack input addressing the OOM issue; please update your plugin and try it out.
