New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logstash Auto-Reload may have resource leak #5235

Closed
kenwdelong opened this Issue May 2, 2016 · 19 comments

Comments

Projects
None yet
7 participants
@kenwdelong

kenwdelong commented May 2, 2016

I'm running a small ELK stack in a Docker container. When I updated to Logstash 2.3.x, I started to experience resource leaks.

image

I finally traced it down to the auto-reload feature for Logstash, which I had unknowingly merged from the upstream of my fork.

With auto reload on, within 24 hours Logstash was running at 50% CPU and 16% memory on my EC2 t2.medium instance. Without it, Logstash stays at about 1-2% CPU and 5.5% memory indefinitely. My ELK stack takes in about 787,000 messages per day. Nothing else runs on that instance. I'm using 2016-03 Amazon Linux and Docker 1.9.1.

You can see what I'm running successfully here: https://github.com/kenwdelong/elk-docker/tree/ES232LS232K450, as compared to what I was running before unsuccessfully: https://github.com/kenwdelong/elk-docker/tree/ES231L231K450. If you diff the commits you'll see the difference in logstash-init. (there are also minor version upgrades to ES and LS, but I confirmed that was not the problem).

@jsvd

This comment has been minimized.

Show comment
Hide comment
@jsvd

jsvd May 2, 2016

Member

Thanks for the detailed report @kenwdelong.

Could you post logs from logstash and a heapdump when logstash is consuming those resource? That would really help with debugging.

Member

jsvd commented May 2, 2016

Thanks for the detailed report @kenwdelong.

Could you post logs from logstash and a heapdump when logstash is consuming those resource? That would really help with debugging.

@kenwdelong

This comment has been minimized.

Show comment
Hide comment
@kenwdelong

kenwdelong May 3, 2016

I don't think I can get a heap dump - I'd have to put the buggy Docker image back into production for 24h, and that's not really an option. I did grab a thread dump when it was in a bad state. I'll attach that here. I didn't find anything interesting (at least to me).

logstashStackTrace.txt

The Docker build that was bad is tagged here: https://hub.docker.com/r/kenwdelong/elk-docker/tags/ as ES231LS231K450. I could try spinning up an EC2 instance, and then just running netcat to pump in dummy data and see if I can reproduce it.

I created a new t2.small instance and ran these commands:
sudo su
yum update
yum install docker
service docker start
docker pull kenwdelong/elk-docker:ES231L231K450
docker run -d -p 5601:5601 -p 5000:5000 --name elk kenwdelong/elk-docker:ES231L231K450
echo '{ "@timestamp": 1461677555757, "@Version": 1, message: "this is a test" }' > test.json
while true; do nc localhost 5000 < test.json; sleep .1; done &

I'll report back if/when I get some good data!

kenwdelong commented May 3, 2016

I don't think I can get a heap dump - I'd have to put the buggy Docker image back into production for 24h, and that's not really an option. I did grab a thread dump when it was in a bad state. I'll attach that here. I didn't find anything interesting (at least to me).

logstashStackTrace.txt

The Docker build that was bad is tagged here: https://hub.docker.com/r/kenwdelong/elk-docker/tags/ as ES231LS231K450. I could try spinning up an EC2 instance, and then just running netcat to pump in dummy data and see if I can reproduce it.

I created a new t2.small instance and ran these commands:
sudo su
yum update
yum install docker
service docker start
docker pull kenwdelong/elk-docker:ES231L231K450
docker run -d -p 5601:5601 -p 5000:5000 --name elk kenwdelong/elk-docker:ES231L231K450
echo '{ "@timestamp": 1461677555757, "@Version": 1, message: "this is a test" }' > test.json
while true; do nc localhost 5000 < test.json; sleep .1; done &

I'll report back if/when I get some good data!

@jsvd

This comment has been minimized.

Show comment
Hide comment
@jsvd

jsvd May 3, 2016

Member

I've also set up a plain generator -> stdout pipeline with a reload_interval set to 0.2 seconds, hopefuly I'll catch the same issue by forcing a lot more reload_state. I'll update after running for a couple of hours

Member

jsvd commented May 3, 2016

I've also set up a plain generator -> stdout pipeline with a reload_interval set to 0.2 seconds, hopefuly I'll catch the same issue by forcing a lot more reload_state. I'll update after running for a couple of hours

@kenwdelong

This comment has been minimized.

Show comment
Hide comment
@kenwdelong

kenwdelong May 3, 2016

I was able to reproduce the problem with the above script.

When I started the process at 17:35 my time, top looked like this:
top17_35

This morning, at 08:00, 14.5 hours later, it is like this:
top0800

In this container, user 103 is Elasticsearch and user 999 is logstash.

I was able to take a heap dump, using

sudo -u logstash jmap -dump:format=b,file=logstash.heap 126

The dump is here: https://www.dropbox.com/s/hwd6gyseiubpsqd/logstash.heap?dl=0.

I hope all this helps! I will let the EC2 instance run a bit longer in case you would like me to do more forensics on the instance. You should also be able to recreate this fairly easily with the above script.

PS the "top" screenshots were taken from the host. Inside the docker container, PID 126 is logstash.

kenwdelong commented May 3, 2016

I was able to reproduce the problem with the above script.

When I started the process at 17:35 my time, top looked like this:
top17_35

This morning, at 08:00, 14.5 hours later, it is like this:
top0800

In this container, user 103 is Elasticsearch and user 999 is logstash.

I was able to take a heap dump, using

sudo -u logstash jmap -dump:format=b,file=logstash.heap 126

The dump is here: https://www.dropbox.com/s/hwd6gyseiubpsqd/logstash.heap?dl=0.

I hope all this helps! I will let the EC2 instance run a bit longer in case you would like me to do more forensics on the instance. You should also be able to recreate this fairly easily with the above script.

PS the "top" screenshots were taken from the host. Inside the docker container, PID 126 is logstash.

@jsvd

This comment has been minimized.

Show comment
Hide comment
@jsvd

jsvd May 3, 2016

Member

I believe I found the issue here, though I'm still not sure why JRuby behaves like this:

  • when checking if the config has changed, we create a new pipeline then throw it away (which in itself is a relatively light operation, takes around 0.01 seconds, specially when done every 3 seconds).
  • when creating a pipeline we parse the configuration and then generate code from the AST, and create instance variables on the new instance of the pipeline
  • these instance variables are created with an incremental counter in their name as to prevent collisions in the same config (2 tcp input plugins, 10 mutate filters, etc)
  • instead of resetting this global counter every time we create a pipeline, it grows continuously, which is not really a problem because it's a single number and even if it gets converted to Bignum it has a negligible memory footprint
  • so, theoretically, every pipeline that gets created and thrown away should result in allocations that would be cleaned up by the GC, so there should be no memory leak

However it seems JRuby stores some information about all ivars and methods a class has had since the application has started, even if they weren't defined at the class level, but at "instance" level, in an anonymous class.
I was able to replicate this using the following code:

$i = 1

class A
  def initialize
    instance_eval "define_singleton_method(:m_#{$i}) { print \".\" }"
    instance_eval "@v_#{$i} = 1"
    $i = $i + 1
  end
end

def new_instance
  A.new
  nil
end
while true do
  new_instance()
end

In visualvm, we can check that, doing perioding heap dumps, we see continously increasing numbers of instances of things like org.jruby.runtime.ivars.StampedVariableAccessor

This may explain the huge increase in cpu usage, as jruby is probably juggling a class that has had hunders of thousands of instance variables and methods, while the memory increase is noticeable but reasonably small.

I still need to better understand the scope of this leak, but a simple fix will be to ensure the global counter here is reset for each pipeline compilation, so that variable and method names are reused

Member

jsvd commented May 3, 2016

I believe I found the issue here, though I'm still not sure why JRuby behaves like this:

  • when checking if the config has changed, we create a new pipeline then throw it away (which in itself is a relatively light operation, takes around 0.01 seconds, specially when done every 3 seconds).
  • when creating a pipeline we parse the configuration and then generate code from the AST, and create instance variables on the new instance of the pipeline
  • these instance variables are created with an incremental counter in their name as to prevent collisions in the same config (2 tcp input plugins, 10 mutate filters, etc)
  • instead of resetting this global counter every time we create a pipeline, it grows continuously, which is not really a problem because it's a single number and even if it gets converted to Bignum it has a negligible memory footprint
  • so, theoretically, every pipeline that gets created and thrown away should result in allocations that would be cleaned up by the GC, so there should be no memory leak

However it seems JRuby stores some information about all ivars and methods a class has had since the application has started, even if they weren't defined at the class level, but at "instance" level, in an anonymous class.
I was able to replicate this using the following code:

$i = 1

class A
  def initialize
    instance_eval "define_singleton_method(:m_#{$i}) { print \".\" }"
    instance_eval "@v_#{$i} = 1"
    $i = $i + 1
  end
end

def new_instance
  A.new
  nil
end
while true do
  new_instance()
end

In visualvm, we can check that, doing perioding heap dumps, we see continously increasing numbers of instances of things like org.jruby.runtime.ivars.StampedVariableAccessor

This may explain the huge increase in cpu usage, as jruby is probably juggling a class that has had hunders of thousands of instance variables and methods, while the memory increase is noticeable but reasonably small.

I still need to better understand the scope of this leak, but a simple fix will be to ensure the global counter here is reset for each pipeline compilation, so that variable and method names are reused

@jsvd

This comment has been minimized.

Show comment
Hide comment
@jsvd

jsvd May 4, 2016

Member

note: this leak also happens on JRuby 9k
I guess there's 2 ways of working around the leak:

  1. ensuring we reset @@i between pipeline compilations
  2. avoid compiling a pipeline if the fetched configuration string is not different from previous one
Member

jsvd commented May 4, 2016

note: this leak also happens on JRuby 9k
I guess there's 2 ways of working around the leak:

  1. ensuring we reset @@i between pipeline compilations
  2. avoid compiling a pipeline if the fetched configuration string is not different from previous one
@ph

This comment has been minimized.

Show comment
Hide comment
@ph

ph May 4, 2016

Member

@jsvd I think we need 1 and 2.

Member

ph commented May 4, 2016

@jsvd I think we need 1 and 2.

@colinsurprenant

This comment has been minimized.

Show comment
Hide comment
@colinsurprenant

colinsurprenant May 4, 2016

Contributor

+1 on 1 + 2 :)

Contributor

colinsurprenant commented May 4, 2016

+1 on 1 + 2 :)

@colinsurprenant

This comment has been minimized.

Show comment
Hide comment
@colinsurprenant

colinsurprenant May 4, 2016

Contributor

@jsvd I have been running these two scripts and monitoring both on VisualVM and so far, after ~1h running, I don't see any JVM leak, GC patterns are healthy.

class A
  def initialize(i)
    instance_eval "@v_#{i} = i"
  end
end

def new_instance(i)
  A.new(i)
end

i = 0
while true do
  new_instance(i)
  i += 1
end
class A
  def initialize(i)
    instance_eval "define_singleton_method(:m_#{i}) { print \".\" }"
  end
end

def new_instance(i)
  A.new(i)
end

i = 0
while true do
  new_instance(i)
  i += 1
end

[edit] I realized I tested with JRuby 1.7.22 (and Java 8). Will run them with 1.7.25 to see if it's different.

Contributor

colinsurprenant commented May 4, 2016

@jsvd I have been running these two scripts and monitoring both on VisualVM and so far, after ~1h running, I don't see any JVM leak, GC patterns are healthy.

class A
  def initialize(i)
    instance_eval "@v_#{i} = i"
  end
end

def new_instance(i)
  A.new(i)
end

i = 0
while true do
  new_instance(i)
  i += 1
end
class A
  def initialize(i)
    instance_eval "define_singleton_method(:m_#{i}) { print \".\" }"
  end
end

def new_instance(i)
  A.new(i)
end

i = 0
while true do
  new_instance(i)
  i += 1
end

[edit] I realized I tested with JRuby 1.7.22 (and Java 8). Will run them with 1.7.25 to see if it's different.

@jsvd

This comment has been minimized.

Show comment
Hide comment
@jsvd

jsvd May 4, 2016

Member

Here's my latest script that shows the cpu slowdown:

$i = 1

class A
  def initialize
    instance_eval "@a_#{$i} = #{$i}"
    instance_eval "@b_#{$i} = #{$i}"
    instance_eval "@c_#{$i} = #{$i}"
    instance_eval "@d_#{$i} = #{$i}"
    instance_eval "@e_#{$i} = #{$i}"
    instance_eval "@f_#{$i} = #{$i}"
    instance_eval "@g_#{$i} = #{$i}"
    instance_eval "@h_#{$i} = #{$i}"
    instance_eval "@v = @a_#{$i}+@b_#{$i}+@c_#{$i}+@d_#{$i}+@e_#{$i}+@f_#{$i}+@g_#{$i}+@h_#{$i}"   
    $i = $i + 1
  end

  def m
    @v
  end
end

def new_instance
  i = A.new
  i.instance_variables.size + i.m
  nil
end

s = Time.now
1_000_000.times do |i|
  if i % 1000 == 0
    puts "last iteration took #{Time.now - s}" 
    s = Time.now
    new_instance()
    puts "#{i} - #{Time.now - s}"
  else
    new_instance()
  end
end

running it:

$ jruby leak_test.rb 
last iteration took 0.004
0 - 0.002
last iteration took 5.551
1000 - 0.008
last iteration took 9.554
2000 - 0.009
last iteration took 16.221
3000 - 0.033
last iteration took 28.257
4000 - 0.057
last iteration took 39.449
5000 - 0.061
last iteration took 52.973
6000 - 0.071
^C
Member

jsvd commented May 4, 2016

Here's my latest script that shows the cpu slowdown:

$i = 1

class A
  def initialize
    instance_eval "@a_#{$i} = #{$i}"
    instance_eval "@b_#{$i} = #{$i}"
    instance_eval "@c_#{$i} = #{$i}"
    instance_eval "@d_#{$i} = #{$i}"
    instance_eval "@e_#{$i} = #{$i}"
    instance_eval "@f_#{$i} = #{$i}"
    instance_eval "@g_#{$i} = #{$i}"
    instance_eval "@h_#{$i} = #{$i}"
    instance_eval "@v = @a_#{$i}+@b_#{$i}+@c_#{$i}+@d_#{$i}+@e_#{$i}+@f_#{$i}+@g_#{$i}+@h_#{$i}"   
    $i = $i + 1
  end

  def m
    @v
  end
end

def new_instance
  i = A.new
  i.instance_variables.size + i.m
  nil
end

s = Time.now
1_000_000.times do |i|
  if i % 1000 == 0
    puts "last iteration took #{Time.now - s}" 
    s = Time.now
    new_instance()
    puts "#{i} - #{Time.now - s}"
  else
    new_instance()
  end
end

running it:

$ jruby leak_test.rb 
last iteration took 0.004
0 - 0.002
last iteration took 5.551
1000 - 0.008
last iteration took 9.554
2000 - 0.009
last iteration took 16.221
3000 - 0.033
last iteration took 28.257
4000 - 0.057
last iteration took 39.449
5000 - 0.061
last iteration took 52.973
6000 - 0.071
^C
@colinsurprenant

This comment has been minimized.

Show comment
Hide comment
@colinsurprenant

colinsurprenant May 4, 2016

Contributor

@jsvd I can reproduce - I reduced the script to a simpler form. Note that only a single instance_eval produces the same result in the long run, it just takes a bit longer.

class A
  def initialize(i)
    instance_eval "@a_#{i} = #{i}"
    instance_eval "@b_#{i} = #{i}"
    instance_eval "@c_#{i} = #{i}"
    instance_eval "@d_#{i} = #{i}"
    instance_eval "@e_#{i} = #{i}"
    instance_eval "@f_#{i} = #{i}"
    instance_eval "@g_#{i} = #{i}"
    instance_eval "@h_#{i} = #{i}"
  end
end

i = 0
s = Time.now

while true
  if (i % 1000) == 0
    puts "#{i} last iteration took #{Time.now - s}"
    s = Time.now
    A.new(i)
    puts "#{i} new_instance took #{Time.now - s}"
  else
    A.new(i)
  end
  i += 1
end
1000 last iteration took 1.141
1000 new_instance took 0.001
2000 last iteration took 2.254
2000 new_instance took 0.004
3000 last iteration took 4.228
3000 new_instance took 0.005
4000 last iteration took 8.412
4000 new_instance took 0.014
5000 last iteration took 14.903
5000 new_instance took 0.017
6000 last iteration took 17.554
6000 new_instance took 0.019
7000 last iteration took 22.764
7000 new_instance took 0.032
8000 last iteration took 28.199
8000 new_instance took 0.03
9000 last iteration took 33.683
9000 new_instance took 0.035
10000 last iteration took 39.142
10000 new_instance took 0.049

I don't think the problem is GC/leak related - it feels more like a growing table with O(~n) kind of lookup...

Contributor

colinsurprenant commented May 4, 2016

@jsvd I can reproduce - I reduced the script to a simpler form. Note that only a single instance_eval produces the same result in the long run, it just takes a bit longer.

class A
  def initialize(i)
    instance_eval "@a_#{i} = #{i}"
    instance_eval "@b_#{i} = #{i}"
    instance_eval "@c_#{i} = #{i}"
    instance_eval "@d_#{i} = #{i}"
    instance_eval "@e_#{i} = #{i}"
    instance_eval "@f_#{i} = #{i}"
    instance_eval "@g_#{i} = #{i}"
    instance_eval "@h_#{i} = #{i}"
  end
end

i = 0
s = Time.now

while true
  if (i % 1000) == 0
    puts "#{i} last iteration took #{Time.now - s}"
    s = Time.now
    A.new(i)
    puts "#{i} new_instance took #{Time.now - s}"
  else
    A.new(i)
  end
  i += 1
end
1000 last iteration took 1.141
1000 new_instance took 0.001
2000 last iteration took 2.254
2000 new_instance took 0.004
3000 last iteration took 4.228
3000 new_instance took 0.005
4000 last iteration took 8.412
4000 new_instance took 0.014
5000 last iteration took 14.903
5000 new_instance took 0.017
6000 last iteration took 17.554
6000 new_instance took 0.019
7000 last iteration took 22.764
7000 new_instance took 0.032
8000 last iteration took 28.199
8000 new_instance took 0.03
9000 last iteration took 33.683
9000 new_instance took 0.035
10000 last iteration took 39.142
10000 new_instance took 0.049

I don't think the problem is GC/leak related - it feels more like a growing table with O(~n) kind of lookup...

@jsvd

This comment has been minimized.

Show comment
Hide comment
@jsvd

jsvd May 5, 2016

Member

yeah I created the more complex script to make the cpu usage more evident.
This behavior occurs also in MRI https://gist.github.com/jsvd/e9769fdd219b8d549216ab93c0fe5390 so it's inherent to the ruby object model? Although I can find no class/object that shows this growing list of ivars.

Today I'll work on fixing this, a patch will come soon.
The workaround for now is to either turn off auto reload or large increase the reload interval, to once every 1800 seconds or similar

Member

jsvd commented May 5, 2016

yeah I created the more complex script to make the cpu usage more evident.
This behavior occurs also in MRI https://gist.github.com/jsvd/e9769fdd219b8d549216ab93c0fe5390 so it's inherent to the ruby object model? Although I can find no class/object that shows this growing list of ivars.

Today I'll work on fixing this, a patch will come soon.
The workaround for now is to either turn off auto reload or large increase the reload interval, to once every 1800 seconds or similar

@colinsurprenant

This comment has been minimized.

Show comment
Hide comment
@colinsurprenant

colinsurprenant May 5, 2016

Contributor

@jsvd we could also move the config file change verification out of the Pipeline class and make that check before going through the config parsing etc? this way, checks could be done at any faster interval without the need to instantiate a Pipeline object every time? I mean, regardless of anything else, there is no point in doing anything if there is no change in the config file in the first place... a simple hash of the config file content could be kept in memory?

Contributor

colinsurprenant commented May 5, 2016

@jsvd we could also move the config file change verification out of the Pipeline class and make that check before going through the config parsing etc? this way, checks could be done at any faster interval without the need to instantiate a Pipeline object every time? I mean, regardless of anything else, there is no point in doing anything if there is no change in the config file in the first place... a simple hash of the config file content could be kept in memory?

@colinsurprenant

This comment has been minimized.

Show comment
Hide comment
@colinsurprenant

colinsurprenant May 6, 2016

Contributor

@jsvd also, maybe instead of dynamically creating ivars and methods, we could set state values into a @state Hash and create lambdas and put them into a @procs Hash too?

For example, doing this:

class A
  def initialize(i)
    @state = {}
    @procs = {}

    @state["a_#{i}"] = i
    @procs["m_#{i}"] = lambda { print(i) }
  end
end

i = 0
start = Time.now

while true
  i += 1
  A.new(i)

  if (i % 1000) == 0
    puts "#{i} last iteration took #{Time.now - start}"
    start = Time.now
  end
end

there is no slowdown.

Contributor

colinsurprenant commented May 6, 2016

@jsvd also, maybe instead of dynamically creating ivars and methods, we could set state values into a @state Hash and create lambdas and put them into a @procs Hash too?

For example, doing this:

class A
  def initialize(i)
    @state = {}
    @procs = {}

    @state["a_#{i}"] = i
    @procs["m_#{i}"] = lambda { print(i) }
  end
end

i = 0
start = Time.now

while true
  i += 1
  A.new(i)

  if (i % 1000) == 0
    puts "#{i} last iteration took #{Time.now - start}"
    start = Time.now
  end
end

there is no slowdown.

@colinsurprenant

This comment has been minimized.

Show comment
Hide comment
@colinsurprenant

colinsurprenant May 6, 2016

Contributor

As suspected, per @headius explanation in jruby/jruby/issues/3859 - creating ivars with new names grows a mapping table from ivar names to objects, and the more it grows the slower it gets. We should definitely get rid of that pattern in the pipeline.

Contributor

colinsurprenant commented May 6, 2016

As suspected, per @headius explanation in jruby/jruby/issues/3859 - creating ivars with new names grows a mapping table from ivar names to objects, and the more it grows the slower it gets. We should definitely get rid of that pattern in the pipeline.

@jsvd

This comment has been minimized.

Show comment
Hide comment
@jsvd

jsvd May 17, 2016

Member

fix in master, 5.0, 2.x and 2.3 branches, see #5250

Member

jsvd commented May 17, 2016

fix in master, 5.0, 2.x and 2.3 branches, see #5250

@jsvd jsvd closed this May 17, 2016

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius commented May 18, 2016

👍

@suyograo suyograo added the v2.3.3 label Jun 14, 2016

@treksler

This comment has been minimized.

Show comment
Hide comment
@treksler

treksler Jun 29, 2016

confirmed fixed in 2.3.3

treksler commented Jun 29, 2016

confirmed fixed in 2.3.3

@suyograo

This comment has been minimized.

Show comment
Hide comment
@suyograo
Member

suyograo commented Jun 30, 2016

Thanks @treksler

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment