Logstash Auto-Reload may have resource leak #5235

kenwdelong · 2016-05-02T22:49:20Z

I'm running a small ELK stack in a Docker container. When I updated to Logstash 2.3.x, I started to experience resource leaks.

I finally traced it down to the auto-reload feature for Logstash, which I had unknowingly merged from the upstream of my fork.

With auto reload on, within 24 hours Logstash was running at 50% CPU and 16% memory on my EC2 t2.medium instance. Without it, Logstash stays at about 1-2% CPU and 5.5% memory indefinitely. My ELK stack takes in about 787,000 messages per day. Nothing else runs on that instance. I'm using 2016-03 Amazon Linux and Docker 1.9.1.

You can see what I'm running successfully here: https://github.com/kenwdelong/elk-docker/tree/ES232LS232K450, as compared to what I was running before unsuccessfully: https://github.com/kenwdelong/elk-docker/tree/ES231L231K450. If you diff the commits you'll see the difference in logstash-init. (there are also minor version upgrades to ES and LS, but I confirmed that was not the problem).

jsvd · 2016-05-02T23:45:01Z

Thanks for the detailed report @kenwdelong.

Could you post logs from Logstash and a heap dump taken while Logstash is consuming those resources? That would really help with debugging.

kenwdelong · 2016-05-03T00:34:06Z

I don't think I can get a heap dump - I'd have to put the buggy Docker image back into production for 24h, and that's not really an option. I did grab a thread dump when it was in a bad state. I'll attach that here. I didn't find anything interesting (at least to me).

logstashStackTrace.txt

The Docker build that was bad is tagged here: https://hub.docker.com/r/kenwdelong/elk-docker/tags/ as ES231LS231K450. I could try spinning up an EC2 instance, and then just running netcat to pump in dummy data and see if I can reproduce it.

I created a new t2.small instance and ran these commands:
sudo su
yum update
yum install docker
service docker start
docker pull kenwdelong/elk-docker:ES231L231K450
docker run -d -p 5601:5601 -p 5000:5000 --name elk kenwdelong/elk-docker:ES231L231K450
echo '{ "@timestamp": 1461677555757, "@Version": 1, message: "this is a test" }' > test.json
while true; do nc localhost 5000 < test.json; sleep .1; done &

I'll report back if/when I get some good data!

jsvd · 2016-05-03T11:28:12Z

I've also set up a plain generator -> stdout pipeline with reload_interval set to 0.2 seconds; hopefully I'll catch the same issue by forcing far more frequent reload_state calls. I'll update after running it for a couple of hours.

kenwdelong · 2016-05-03T16:02:05Z

I was able to reproduce the problem with the above script.

When I started the process at 17:35 my time, top looked like this:

This morning, at 08:00, 14.5 hours later, it is like this:

In this container, user 103 is Elasticsearch and user 999 is logstash.

I was able to take a heap dump, using

sudo -u logstash jmap -dump:format=b,file=logstash.heap 126

The dump is here: https://www.dropbox.com/s/hwd6gyseiubpsqd/logstash.heap?dl=0.

I hope all this helps! I will let the EC2 instance run a bit longer in case you would like me to do more forensics on the instance. You should also be able to recreate this fairly easily with the above script.

PS the "top" screenshots were taken from the host. Inside the docker container, PID 126 is logstash.

jsvd · 2016-05-03T17:47:51Z

I believe I found the issue here, though I'm still not sure why JRuby behaves like this:

- When checking if the config has changed, we create a new pipeline and then throw it away. In itself this is a relatively light operation: it takes around 0.01 seconds and is only done every 3 seconds.
- When creating a pipeline we parse the configuration, generate code from the AST, and create instance variables on the new instance of the pipeline.
- These instance variables are created with an incremental counter in their name to prevent collisions within the same config (2 tcp input plugins, 10 mutate filters, etc).
- Instead of being reset every time we create a pipeline, this global counter grows continuously. That is not a problem by itself, because it's a single number, and even if it gets converted to Bignum it has a negligible memory footprint.
- So, theoretically, every pipeline that gets created and thrown away should only result in allocations that are cleaned up by the GC, and there should be no memory leak.

However it seems JRuby stores some information about all ivars and methods a class has had since the application has started, even if they weren't defined at the class level, but at "instance" level, in an anonymous class.
I was able to replicate this using the following code:

$i = 1

class A
  def initialize
    instance_eval "define_singleton_method(:m_#{$i}) { print \".\" }"
    instance_eval "@v_#{$i} = 1"
    $i = $i + 1
  end
end

def new_instance
  A.new
  nil
end
while true do
  new_instance()
end

In VisualVM, taking periodic heap dumps, we see a continuously increasing number of instances of things like org.jruby.runtime.ivars.StampedVariableAccessor.

This may explain the huge increase in CPU usage, as JRuby is probably juggling a class that has had hundreds of thousands of instance variables and methods, while the memory increase is noticeable but reasonably small.

I still need to better understand the scope of this leak, but a simple fix will be to ensure the global counter here is reset for each pipeline compilation, so that variable and method names are reused.
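
A minimal sketch of why resetting the counter helps, assuming a hypothetical `ToyCompiler` class that stands in for the config code generator (the class and method names are illustrative and are not Logstash's actual code). With a process-wide counter every compilation invents names that were never used before; with a per-compilation counter the same config always yields the same names.

```ruby
# Hypothetical stand-in for the code generator; not Logstash's real compiler.
class ToyCompiler
  @@global_counter = 0

  # Leaky variant: the counter lives for the whole process, so every
  # compilation produces names that have never been seen before.
  def self.compile_with_global_counter(plugin_types)
    plugin_types.map { |type| "@#{type}_#{@@global_counter += 1}" }
  end

  # Fixed variant: the counter restarts for each compilation, so the
  # same config always yields the same names.
  def self.compile_with_local_counter(plugin_types)
    counter = 0
    plugin_types.map { |type| "@#{type}_#{counter += 1}" }
  end
end

config = %w[tcp mutate mutate]

# Simulate three reloads of an unchanged config.
leaky = 3.times.flat_map { ToyCompiler.compile_with_global_counter(config) }
fixed = 3.times.flat_map { ToyCompiler.compile_with_local_counter(config) }

puts "distinct names, global counter: #{leaky.uniq.size}"
puts "distinct names, local counter:  #{fixed.uniq.size}"
```

In this toy run the global counter produces a new set of names on every reload, while the local counter keeps reusing the same three.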

jsvd · 2016-05-04T14:17:16Z

Note: this leak also happens on JRuby 9k.
I guess there are two ways of working around the leak:

1. ensure we reset @@i between pipeline compilations
2. avoid compiling a pipeline if the fetched configuration string is not different from the previous one
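
The two workarounds could be combined roughly as follows. This is only an illustrative sketch: `ToyAgent`, `reload`, and `compile` are hypothetical names, not the Logstash agent's API, and the "config" is just a whitespace-separated list of plugin names.

```ruby
# Hypothetical agent that polls for config changes; names are illustrative.
class ToyAgent
  attr_reader :compile_count

  def initialize
    @last_config = nil
    @compile_count = 0
  end

  # Skip compilation entirely when the fetched string matches the
  # previous one, so no throwaway pipeline is built.
  def reload(config_string)
    return false if config_string == @last_config
    compile(config_string)
    @last_config = config_string
    true
  end

  private

  # The counter is local to this call, so generated names restart on
  # every compilation instead of growing for the life of the process.
  def compile(config_string)
    counter = 0
    @compile_count += 1
    config_string.split.map { |plugin| "@#{plugin}_#{counter += 1}" }
  end
end

agent = ToyAgent.new
polled = ["tcp mutate", "tcp mutate", "tcp mutate", "tcp grok"]
results = polled.map { |config| agent.reload(config) }

puts results.inspect
puts "compilations: #{agent.compile_count}"
```

Only the first poll and the poll that sees a different string trigger a compilation; the unchanged polls return early.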

ph · 2016-05-04T15:25:55Z

@jsvd I think we need 1 and 2.

colinsurprenant · 2016-05-04T18:24:08Z

+1 on 1 + 2 :)

colinsurprenant · 2016-05-04T21:14:46Z

@jsvd I have been running these two scripts and monitoring both on VisualVM and so far, after ~1h running, I don't see any JVM leak, GC patterns are healthy.

class A
  def initialize(i)
    instance_eval "@v_#{i} = i"
  end
end

def new_instance(i)
  A.new(i)
end

i = 0
while true do
  new_instance(i)
  i += 1
end

class A
  def initialize(i)
    instance_eval "define_singleton_method(:m_#{i}) { print \".\" }"
  end
end

def new_instance(i)
  A.new(i)
end

i = 0
while true do
  new_instance(i)
  i += 1
end

[edit] I realized I tested with JRuby 1.7.22 (and Java 8). Will run them with 1.7.25 to see if it's different.

jsvd · 2016-05-04T21:58:16Z

Here's my latest script that shows the cpu slowdown:

$i = 1

class A
  def initialize
    instance_eval "@a_#{$i} = #{$i}"
    instance_eval "@b_#{$i} = #{$i}"
    instance_eval "@c_#{$i} = #{$i}"
    instance_eval "@d_#{$i} = #{$i}"
    instance_eval "@e_#{$i} = #{$i}"
    instance_eval "@f_#{$i} = #{$i}"
    instance_eval "@g_#{$i} = #{$i}"
    instance_eval "@h_#{$i} = #{$i}"
    instance_eval "@v = @a_#{$i}+@b_#{$i}+@c_#{$i}+@d_#{$i}+@e_#{$i}+@f_#{$i}+@g_#{$i}+@h_#{$i}"   
    $i = $i + 1
  end

  def m
    @v
  end
end

def new_instance
  i = A.new
  i.instance_variables.size + i.m
  nil
end

s = Time.now
1_000_000.times do |i|
  if i % 1000 == 0
    puts "last iteration took #{Time.now - s}" 
    s = Time.now
    new_instance()
    puts "#{i} - #{Time.now - s}"
  else
    new_instance()
  end
end

running it:

$ jruby leak_test.rb 
last iteration took 0.004
0 - 0.002
last iteration took 5.551
1000 - 0.008
last iteration took 9.554
2000 - 0.009
last iteration took 16.221
3000 - 0.033
last iteration took 28.257
4000 - 0.057
last iteration took 39.449
5000 - 0.061
last iteration took 52.973
6000 - 0.071
^C

colinsurprenant · 2016-05-04T22:25:24Z

@jsvd I can reproduce - I reduced the script to a simpler form. Note that even a single instance_eval produces the same result in the long run, it just takes a bit longer.

class A
  def initialize(i)
    instance_eval "@a_#{i} = #{i}"
    instance_eval "@b_#{i} = #{i}"
    instance_eval "@c_#{i} = #{i}"
    instance_eval "@d_#{i} = #{i}"
    instance_eval "@e_#{i} = #{i}"
    instance_eval "@f_#{i} = #{i}"
    instance_eval "@g_#{i} = #{i}"
    instance_eval "@h_#{i} = #{i}"
  end
end

i = 0
s = Time.now

while true
  if (i % 1000) == 0
    puts "#{i} last iteration took #{Time.now - s}"
    s = Time.now
    A.new(i)
    puts "#{i} new_instance took #{Time.now - s}"
  else
    A.new(i)
  end
  i += 1
end

1000 last iteration took 1.141
1000 new_instance took 0.001
2000 last iteration took 2.254
2000 new_instance took 0.004
3000 last iteration took 4.228
3000 new_instance took 0.005
4000 last iteration took 8.412
4000 new_instance took 0.014
5000 last iteration took 14.903
5000 new_instance took 0.017
6000 last iteration took 17.554
6000 new_instance took 0.019
7000 last iteration took 22.764
7000 new_instance took 0.032
8000 last iteration took 28.199
8000 new_instance took 0.03
9000 last iteration took 33.683
9000 new_instance took 0.035
10000 last iteration took 39.142
10000 new_instance took 0.049

I don't think the problem is GC/leak related - it feels more like a growing table with O(~n) kind of lookup...

jsvd · 2016-05-05T11:40:09Z

yeah I created the more complex script to make the cpu usage more evident.
This behavior occurs also in MRI https://gist.github.com/jsvd/e9769fdd219b8d549216ab93c0fe5390 so it's inherent to the ruby object model? Although I can find no class/object that shows this growing list of ivars.

Today I'll work on fixing this, a patch will come soon.
The workaround for now is to either turn off auto reload or greatly increase the reload interval, for example to once every 1800 seconds.

colinsurprenant · 2016-05-05T13:11:55Z

@jsvd we could also move the config file change verification out of the Pipeline class and make that check before going through the config parsing etc? this way, checks could be done at any faster interval without the need to instantiate a Pipeline object every time? I mean, regardless of anything else, there is no point in doing anything if there is no change in the config file in the first place... a simple hash of the config file content could be kept in memory?
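
A minimal sketch of that idea, assuming a hypothetical `ConfigWatcher` helper that lives outside any Pipeline class (the name and the choice of SHA-256 are assumptions for illustration; any stable digest of the file content would do):

```ruby
require "digest"
require "tempfile"

# Hypothetical helper: remembers a digest of the last config content and
# reports whether the file has changed since the previous check.
class ConfigWatcher
  def initialize(path)
    @path = path
    @last_digest = nil
  end

  # Returns true only when the content differs from the last check.
  def changed?
    digest = Digest::SHA256.hexdigest(File.read(@path))
    return false if digest == @last_digest
    @last_digest = digest
    true
  end
end

file = Tempfile.new("logstash.conf")
file.write("input { stdin {} }")
file.flush

watcher = ConfigWatcher.new(file.path)
checks = []
checks << watcher.changed?            # first check: content is new
checks << watcher.changed?            # nothing written since last check
file.write(" output { stdout {} }")
file.flush
checks << watcher.changed?            # content was modified
puts checks.inspect
file.close!
```

Because the check only reads the file and compares digests, it can run at a short interval without parsing the config or instantiating anything.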

colinsurprenant · 2016-05-06T19:45:30Z

@jsvd also, maybe instead of dynamically creating ivars and methods, we could set state values into a @state Hash and create lambdas and put them into a @procs Hash too?

For example, doing this:

class A
  def initialize(i)
    @state = {}
    @procs = {}

    @state["a_#{i}"] = i
    @procs["m_#{i}"] = lambda { print(i) }
  end
end

i = 0
start = Time.now

while true
  i += 1
  A.new(i)

  if (i % 1000) == 0
    puts "#{i} last iteration took #{Time.now - start}"
    start = Time.now
  end
end

there is no slowdown.

colinsurprenant · 2016-05-06T20:04:21Z

As suspected, per @headius explanation in jruby/jruby/issues/3859 - creating ivars with new names grows a mapping table from ivar names to objects, and the more it grows the slower it gets. We should definitely get rid of that pattern in the pipeline.
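
To make the shape of the problem concrete, here is a toy model of a name table that only ever grows. It is not how JRuby is implemented; the linear scan is an assumption chosen purely to mirror the slowdown measured above, and `ToyNameTable` is a made-up name.

```ruby
# Toy model of a per-class name table; NOT JRuby's actual implementation.
class ToyNameTable
  attr_reader :comparisons

  def initialize
    @names = []
    @comparisons = 0
  end

  # Linear scan for the name; append it when it has never been seen.
  def index_for(name)
    @names.each_with_index do |known, idx|
      @comparisons += 1
      return idx if known == name
    end
    @names << name
    @names.size - 1
  end

  def size
    @names.size
  end
end

unique = ToyNameTable.new
1000.times { |i| unique.index_for("@a_#{i}") }   # a new name every time

reused = ToyNameTable.new
1000.times { reused.index_for("@a") }            # the same name every time

puts "unique names: size=#{unique.size} comparisons=#{unique.comparisons}"
puts "reused name:  size=#{reused.size} comparisons=#{reused.comparisons}"
```

In the toy model, unique names make the total work grow quadratically with the number of instances created, while a reused name keeps the table at a single entry.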

jsvd · 2016-05-17T07:52:19Z

fix in master, 5.0, 2.x and 2.3 branches, see #5250

headius · 2016-05-18T22:46:44Z

👍

treksler · 2016-06-29T23:46:01Z

confirmed fixed in 2.3.3

suyograo · 2016-06-30T00:05:36Z

Thanks @treksler

suyograo added bug unconfirmed labels May 2, 2016

suyograo assigned jsvd May 2, 2016

kenwdelong mentioned this issue May 3, 2016

Auto Reload considered dangerous? spujadas/elk-docker#41

Closed

jsvd mentioned this issue May 5, 2016

optimize resource usage when recompiling pipeline #5250

Closed

jsvd removed the unconfirmed label May 6, 2016

jsvd mentioned this issue May 6, 2016

ivar creation with instance_eval triggers lookup hell in other instances jruby/jruby#3859

Closed

jsvd closed this as completed May 17, 2016

suyograo added the v2.3.3 label Jun 14, 2016
