Logstash heap size exceeds default memory cap of 500M (1.5.0.rc2-1) #3003
@svenmueller I am investigating some reports of Logstash hanging or dying. Did you upgrade from 1.4.2 to 1.5.0rc2? If it was stable before, it could mean we have introduced a memory leak somewhere. Would you mind copying your configuration? This will give us some clues on what kind of transformations you are doing. Can you also include the output of …
Are you using the multiline codec by any chance? It has a known potential memory leak; see logstash-plugins/logstash-codec-multiline#4.
@colinsurprenant yes, we are using the multiline plugin. @ph we upgraded from a very old Logstash version and also had roughly the same problems before the migration.

logstash config: …

plugin list:
logstash-codec-collectd (0.1.7)
logstash-codec-dots (0.1.5)
logstash-codec-edn (0.1.5)
logstash-codec-edn_lines (0.1.6)
logstash-codec-es_bulk (0.1.5)
logstash-codec-fluent (0.1.5)
logstash-codec-graphite (0.1.5)
logstash-codec-json (0.1.6)
logstash-codec-json_lines (0.1.6)
logstash-codec-line (0.1.5)
logstash-codec-msgpack (0.1.6)
logstash-codec-multiline (0.1.6)
logstash-codec-netflow (0.1.4)
logstash-codec-oldlogstashjson (0.1.5)
logstash-codec-plain (0.1.5)
logstash-codec-rubydebug (0.1.6)
logstash-filter-anonymize (0.1.4)
logstash-filter-checksum (0.1.5)
logstash-filter-clone (0.1.4)
logstash-filter-csv (0.1.4)
logstash-filter-date (0.1.5)
logstash-filter-dns (0.1.4)
logstash-filter-drop (0.1.4)
logstash-filter-fingerprint (0.1.4)
logstash-filter-geoip (0.1.7)
logstash-filter-grok (0.1.8)
logstash-filter-json (0.1.4)
logstash-filter-kv (0.1.5)
logstash-filter-metrics (0.1.6)
logstash-filter-multiline (0.1.5)
logstash-filter-mutate (0.1.5)
logstash-filter-ruby (0.1.4)
logstash-filter-sleep (0.1.4)
logstash-filter-split (0.1.5)
logstash-filter-syslog_pri (0.1.4)
logstash-filter-throttle (0.1.4)
logstash-filter-urldecode (0.1.4)
logstash-filter-useragent (0.1.6)
logstash-filter-uuid (0.1.4)
logstash-filter-xml (0.1.4)
logstash-input-couchdb_changes (0.1.3)
logstash-input-elasticsearch (0.1.4)
logstash-input-eventlog (0.1.4)
logstash-input-exec (0.1.3)
logstash-input-file (0.1.6)
logstash-input-ganglia (0.1.3)
logstash-input-gelf (0.1.3)
logstash-input-generator (0.1.3)
logstash-input-graphite (0.1.3)
logstash-input-imap (0.1.3)
logstash-input-irc (0.1.3)
logstash-input-kafka (0.1.11)
logstash-input-log4j (0.1.3)
logstash-input-lumberjack (0.1.4)
logstash-input-pipe (0.1.4)
logstash-input-rabbitmq (0.1.3)
logstash-input-redis (0.1.3)
logstash-input-s3 (0.1.8)
logstash-input-snmptrap (0.1.3)
logstash-input-sqs (0.1.3)
logstash-input-stdin (0.1.3)
logstash-input-syslog (0.1.3)
logstash-input-tcp (0.1.3)
logstash-input-twitter (0.1.5)
logstash-input-udp (0.1.3)
logstash-input-unix (0.1.3)
logstash-input-xmpp (0.1.3)
logstash-input-zeromq (0.1.3)
logstash-output-cloudwatch (0.1.3)
logstash-output-csv (0.1.3)
logstash-output-elasticsearch (0.1.18)
logstash-output-email (0.1.4)
logstash-output-exec (0.1.3)
logstash-output-file (0.1.5)
logstash-output-ganglia (0.1.3)
logstash-output-gelf (0.1.3)
logstash-output-graphite (0.1.6)
logstash-output-hipchat (0.1.3)
logstash-output-http (0.1.3)
logstash-output-irc (0.1.3)
logstash-output-juggernaut (0.1.3)
logstash-output-kafka (0.1.8)
logstash-output-lumberjack (0.1.4)
logstash-output-nagios (0.1.3)
logstash-output-nagios_nsca (0.1.3)
logstash-output-null (0.1.3)
logstash-output-opentsdb (0.1.3)
logstash-output-pagerduty (0.1.3)
logstash-output-pipe (0.1.3)
logstash-output-rabbitmq (0.1.4)
logstash-output-redis (0.1.3)
logstash-output-s3 (0.1.6)
logstash-output-sns (0.1.3)
logstash-output-sqs (0.1.3)
logstash-output-statsd (0.1.5)
logstash-output-stdout (0.1.4)
logstash-output-tcp (0.1.4)
logstash-output-udp (0.1.4)
logstash-output-xmpp (0.1.3)
logstash-output-zeromq (0.1.4)
logstash-patterns-core (0.1.6)
Not exactly related to your issue, but you can replace this exec with this plugin: https://github.com/logstash-plugins/logstash-input-heartbeat
@svenmueller How many clients do you have on the lumberjack side?
@svenmueller also, could you get us the output of …
@svenmueller it is true that only 20 events are kept in each stage of the pipeline, but the multiline filter aggregates events in an array inside the plugin and flushes them periodically. It is possible that this queue can get big and OOM Logstash before its content is flushed. Running @suyograo's command will give us a clue whether something is accumulating.
@suyograo here is the output: https://gist.github.com/svenmueller/15297d88d85665f26381
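For anyone trying to reproduce this, here is a sketch of one way to capture such a heap histogram with the JDK's jmap tool; the exact command requested above was cut off, so the invocation below is an assumption:

# print the 20 biggest object classes on the Logstash heap (hypothetical pgrep pattern)
jmap -histo $(pgrep -f logstash) | head -20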
@svenmueller per your config you are using the multiline filter, not the codec - just clarifying. Normally the multiline filter will accumulate a maximum of 5 seconds of events if a multiline boundary is not reached, so I doubt this is the problem here. If this was the case we would have a lot of …
So, we are either dealing with a memory leak, which means that regardless of the heap size you configure you will OOM at some point. I would like you to test that: could you increase your heap size to, say, 750M and see how that works? I would also suggest you install VisualVM and look at your heap usage pattern; if there is a memory leak it will be very easy to recognize visually as a slowly and constantly increasing heap usage graph. On the other hand, given the complexity of your configuration, your setup may simply require more than 500M of heap. Probably unrelated, but why are you using

pipe {
  command => "tail -F /var/run/logstash-clb-access"
  type => "clb-access"
}

instead of the file input plugin?
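For comparison, a minimal sketch of the file input equivalent, assuming the same path as the pipe example above (whether that path is a plain file or a FIFO is not stated in the thread):

file {
  path => "/var/run/logstash-clb-access"  # same path the tail -F above reads
  type => "clb-access"
}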
I've been doing some tests with the … I would follow @colinsurprenant's advice to bump the memory a bit and monitor it. Also consider rewriting the pipe and exec parts of the configuration to use the file input and the heartbeat plugin, and monitor the situation.
I suggest doing one change at a time: start with increasing memory and checking usage in VisualVM. If we confirm a leak, let's see if we can replace the pipe and check the result.
We have the same problems with a "complex" config (lumberjack input, several mutations, grok and multiline filters, ES output).

Config: …

Failed with error: …

jmap right after launch:
https://gist.github.com/mfeldheim/96bf3f35d7f3d53a1771

jmap ~100k events / 5 minutes later:
https://gist.github.com/mfeldheim/65e057be7a84877fd22e

I have started with the -w option and will post the output after the crash. Not sure if that's related: …
I can confirm that changing the memory settings (raising the heap from 500M to 2048M) only means it takes longer until the OOM happens (only the memory settings were changed; the Logstash config is the same):

Error: Your application used more memory than the safety cap of 2048M.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace
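For reference, a sketch of how the heap cap can be raised, assuming the stock 1.5 startup scripts, which honor the LS_HEAP_SIZE environment variable (size and config path here are illustrative):

# raise the JVM heap cap for this Logstash run
export LS_HEAP_SIZE=2048m
bin/logstash -f /etc/logstash/conf.d/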
Any hint how to pass -w? In Logstash the -w flag does seem to have another purpose (filter workers).
Memory graph for Logstash over the last 6 hours: steadily rising. jmap close to max heap: …
Short period only, but you can clearly see that this looks healthier: I disabled the lumberjack input and used the generator input instead, which generated 3.5M messages in the last 15 minutes.
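A sketch of the kind of generator input swap described above; the option values are assumptions, not the poster's actual config:

generator {
  count => 0                 # 0 = emit events indefinitely
  message => "test message"  # hypothetical payload
  type => "generator-test"
}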
@mfeldheim Are your forwarders experiencing connectivity issues while Logstash is running? I noticed a socket leak related to lumberjack recently as well, and suspect the two may be related.
@torrancew not on a regular basis. I am partially using public networks and forwarding from several locations around the globe, so I might have higher latency and occasional timeouts.
@mfeldheim thanks for narrowing down the problem, this is super useful. We'll reproduce this on our end.
After discussing this internally, we concluded this is a known issue with the Lumberjack input and a stalled pipeline. That said, we will look at fixing the OOM issue before GA. When an output stalls, Logstash by design applies back pressure and stops accepting events into the pipeline. Unfortunately, the Lumberjack input does not handle this back pressure correctly, which eventually leads to an OOM. While the Logstash pipeline is stalled, the Lumberjack server will continue to accept connections and messages, which are kept in memory on the input side. Lumberjack times out after N seconds of waiting for an acknowledgement and tries again, resulting in more events being stored on the heap.
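A minimal Ruby sketch of the intended back pressure mechanism, assuming the bounded in-memory queue the 1.5 pipeline uses between stages (20 slots, as mentioned earlier in this thread):

require 'thread'

# Bounded queue between pipeline stages; push blocks when the queue is full,
# which is how a stalled output is supposed to slow the inputs down.
queue = SizedQueue.new(20)

input = Thread.new do
  1000.times { |i| queue.push("event #{i}") }  # blocks once 20 events are in flight
end

output = Thread.new do
  loop { queue.pop; sleep 1 }  # a slow or stalled output drains almost nothing
end

input.join  # with the output stalled, back pressure holds the input here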
@suyograo is backpressure handled correctly in other inputs like the TCP input?
@mfeldheim Sadly yes, they could be affected by the same issue, depending on how the connection is handled on the client side. Currently we spawn a thread per new connection (Socket.accept), so if the client side detects a timeout due to a locked queue and tries to reconnect, this will create another blocked thread on our side. @suyograo I think it's worth giving this more thought.
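A sketch of the thread-per-connection pattern described above, and why it accumulates blocked threads; the names and port are illustrative, not the plugin's actual code:

require 'socket'
require 'thread'

queue = SizedQueue.new(20)       # the pipeline's bounded queue
server = TCPServer.new(5043)     # illustrative port

loop do
  conn = server.accept           # every (re)connecting client gets a fresh thread
  Thread.new(conn) do |c|
    while line = c.gets
      queue.push(line)           # blocks while the pipeline is stalled, so client
    end                          # retries pile up blocked threads and buffered events
  end
end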
@ph, I hadn't appreciated that. We've hit assorted kinds of random-seeming failures due to hitting thread-count, file-handle-count, and process-memory limits that I can now see were caused by this problem with the TCP input. They haven't happened very often, but often enough on production systems to cause significant pain (and concerns about the choice of Logstash as a good tool for the job). I'd like to see this fixed, even if just via a band-aid like limiting the number of threads.
@ph I've tested your PR logstash-plugins/logstash-input-lumberjack#10, which solves the problem for now. Thank you! I had to drastically reduce the default thread pool limit to 100 (from 1000) to stay below 1 GB of heap.
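For anyone applying the same mitigation, a sketch of what that setting looks like in the input config, assuming the max_clients option introduced by the PR (the certificate paths are placeholders):

lumberjack {
  port => 5043
  ssl_certificate => "/etc/logstash/server.crt"  # placeholder path
  ssl_key => "/etc/logstash/server.key"          # placeholder path
  max_clients => 100   # reduced from the default of 1000, as described above
}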
@mfeldheim ty for the early feedback! I have another PR in the works that should not require this setting.
Seeing defunct Logstash instances as well; clients are reporting connection timeouts like this: …

When I restart the Logstash server(s), a couple of events will be received, then the problem comes back. I created a stack dump of one of the running but defunct servers (never mind the gist's file extension): https://gist.githubusercontent.com/dschneller/eddcdcbd7c5a15f775d4/raw/gistfile1.java The server process has more than 1100 CLOSE_WAIT connections according to netstat. Anything else I can provide?
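Not from the original post, but a sketch of how such a CLOSE_WAIT count can be gathered with standard tools:

# count TCP connections stuck in CLOSE_WAIT on the host
netstat -tanp | grep CLOSE_WAIT | wc -l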
@dschneller This scenario appears when back pressure is applied from the output to the input; it means something is slowing down event consumption somewhere in your architecture. It could be a slow output or some network problem; if you look at your Logstash log you should see something. When this problem occurs the queue will block, and LSF connections will time out and retry to reconnect. There are two workarounds to mitigate the situation: …

We also have another PR that provides a cleaner fix without the need to set a …
I need to investigate more, but I think logstash-input-zeromq has the same problem as logstash-input-lumberjack.
I have a Logstash 1.5.0 cluster, but I still see the Logstash heap size increasing. I am not using lumberjack.
The issue is not with Lumberjack alone. In our case it turned out Elasticsearch did not accept data due to an out-of-memory condition we had run into. That led Logstash to stall, and connections from all sorts of inputs (lumberjack and TCP) spawned new sockets without limit, IIRC. We added the max_clients limit to at least not let Logstash die as well. The underlying problem with ES was simply too much load for the amount of hardware dedicated to it.
@avcherkasov as @dschneller said, the tcp input has the same behaviour as the old lumberjack input and doesn't support a max_clients setting.
@ph thanks for the answer!
Is this solved with the new lumberjack input safety checks? For the tcp input, should we open a new issue specific to that?
I've released a new version of lumberjack addressing the OOM issue; please update your plugin and try it out.
The standard heap size is set to a max of 500M. It seems that this is not enough in our environment, since Logstash keeps dying often, mostly without logs. Sometimes the following event can be found in /var/log/logstash/logstash.err: …

Is it safe to increase the heap size above 500M? We are using the following plugins: …
Thanks,
Sven