Logstash Auto-Reload may have resource leak #5235
Thanks for the detailed report @kenwdelong. Could you post logs from Logstash and a heap dump from when Logstash is consuming those resources? That would really help with debugging.
I don't think I can get a heap dump - I'd have to put the buggy Docker image back into production for 24 hours, and that's not really an option. I did grab a thread dump when it was in a bad state; I'll attach that here. I didn't find anything interesting in it (at least to me). The Docker build that was bad is tagged as ES231LS231K450 here: https://hub.docker.com/r/kenwdelong/elk-docker/tags/. I could try spinning up an EC2 instance and just running netcat to pump in dummy data to see if I can reproduce it. I created a new t2.small instance and ran these commands... I'll report back if/when I get some good data!
I've also set up a plain generator -> stdout pipeline with a reload_interval set to 0.2 seconds; hopefully I'll catch the same issue by forcing far more frequent reloads. I'll update after running for a couple of hours.
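For reference, a minimal sketch of such a test pipeline (assuming the stock generator and stdout plugins and the reload flags Logstash 2.3 documents; the exact config used here isn't shown in the thread):

```sh
# write a trivial pipeline config (illustrative path)
cat > /tmp/reload-test.conf <<'EOF'
input  { generator { } }
output { stdout { } }
EOF

# start with auto-reload enabled, polling for config changes every 0.2s
bin/logstash -f /tmp/reload-test.conf --auto-reload --reload-interval 0.2
```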
I was able to reproduce the problem with the above script. When I started the process at 17:35 my time, top looked like this: [top screenshot]. This morning, at 08:00, 14.5 hours later, it looks like this: [top screenshot]. In this container, user 103 is Elasticsearch and user 999 is Logstash. I was able to take a heap dump; it is here: https://www.dropbox.com/s/hwd6gyseiubpsqd/logstash.heap?dl=0. I hope all this helps! I will let the EC2 instance run a bit longer in case you would like me to do more forensics on it. You should also be able to recreate this fairly easily with the above script. P.S. The "top" screenshots were taken from the host; inside the Docker container, PID 126 is Logstash.
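(For anyone following along: the exact heap-dump command isn't recorded above, but a dump like this can be taken with the stock JDK jmap tool. The output path is illustrative; 126 is the Logstash PID inside the container.)

```sh
# run as (or with access to) the user owning the Logstash JVM
jmap -dump:format=b,file=/tmp/logstash.heap 126
```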
I believe I found the issue here, though I'm still not sure why JRuby behaves like this: it seems JRuby stores some information about all ivars and methods a class has ever had since the application started, even when they weren't defined at the class level but at the "instance" level, in an anonymous class.

```ruby
$i = 1

class A
  def initialize
    instance_eval "define_singleton_method(:m_#{$i}) { print \".\" }"
    instance_eval "@v_#{$i} = 1"
    $i = $i + 1
  end
end

def new_instance
  A.new
  nil
end

while true do
  new_instance()
end
```

In VisualVM, taking periodic heap dumps, we can see continuously increasing instance counts for several JRuby-internal classes. This may explain the huge increase in CPU usage, as JRuby is probably juggling a class that has had hundreds of thousands of instance variables and methods, while the memory increase is noticeable but reasonably small. I still need to better understand the scope of this leak, but a simple fix would be to ensure the global counter here is reset for each pipeline compilation, so that variable and method names are reused.
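A minimal sketch of that fix idea (hypothetical names, not the actual Logstash patch): reset the counter at the start of every compilation, so the generated names repeat across reloads instead of growing without bound.

```ruby
# Hypothetical sketch of a per-compilation counter; the real fix landed later in #5250.
module ConfigCompiler
  def self.compile(config_source)
    @counter = 0  # reset on every pipeline compilation, so names like "var_1" are reused
    # ... walk the config, calling next_name for each generated ivar/method ...
  end

  def self.next_name(prefix)
    @counter += 1
    "#{prefix}_#{@counter}"
  end
end
```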
Note: this leak also happens on JRuby 9k.
@jsvd I think we need 1 and 2.
+1 on 1 + 2 :)
@jsvd I have been running these two scripts and monitoring both in VisualVM, and so far, after ~1h of running, I don't see any JVM leak; GC patterns are healthy.

```ruby
class A
  def initialize(i)
    instance_eval "@v_#{i} = i"
  end
end

def new_instance(i)
  A.new(i)
end

i = 0
while true do
  new_instance(i)
  i += 1
end
```

```ruby
class A
  def initialize(i)
    instance_eval "define_singleton_method(:m_#{i}) { print \".\" }"
  end
end

def new_instance(i)
  A.new(i)
end

i = 0
while true do
  new_instance(i)
  i += 1
end
```

[edit] I realized I tested with JRuby 1.7.22 (and Java 8). I will run them with 1.7.25 to see if it's different.
Here's my latest script, which shows the CPU slowdown:

```ruby
$i = 1

class A
  def initialize
    instance_eval "@a_#{$i} = #{$i}"
    instance_eval "@b_#{$i} = #{$i}"
    instance_eval "@c_#{$i} = #{$i}"
    instance_eval "@d_#{$i} = #{$i}"
    instance_eval "@e_#{$i} = #{$i}"
    instance_eval "@f_#{$i} = #{$i}"
    instance_eval "@g_#{$i} = #{$i}"
    instance_eval "@h_#{$i} = #{$i}"
    instance_eval "@v = @a_#{$i}+@b_#{$i}+@c_#{$i}+@d_#{$i}+@e_#{$i}+@f_#{$i}+@g_#{$i}+@h_#{$i}"
    $i = $i + 1
  end

  def m
    @v
  end
end

def new_instance
  i = A.new
  i.instance_variables.size + i.m
  nil
end

s = Time.now
1_000_000.times do |i|
  if i % 1000 == 0
    puts "last iteration took #{Time.now - s}"
    s = Time.now
    new_instance()
    puts "#{i} - #{Time.now - s}"
  else
    new_instance()
  end
end
```

Running it, the times printed every 1,000 iterations keep growing.
@jsvd I can reproduce - I reduced the script to a simpler form. Note that only a single...

```ruby
class A
  def initialize(i)
    instance_eval "@a_#{i} = #{i}"
    instance_eval "@b_#{i} = #{i}"
    instance_eval "@c_#{i} = #{i}"
    instance_eval "@d_#{i} = #{i}"
    instance_eval "@e_#{i} = #{i}"
    instance_eval "@f_#{i} = #{i}"
    instance_eval "@g_#{i} = #{i}"
    instance_eval "@h_#{i} = #{i}"
  end
end

i = 0
s = Time.now
while true
  if (i % 1000) == 0
    puts "#{i} last iteration took #{Time.now - s}"
    s = Time.now
    A.new(i)
    puts "#{i} new_instance took #{Time.now - s}"
  else
    A.new(i)
  end
  i += 1
end
```

I don't think the problem is GC/leak related; it feels more like a growing table with O(~n) lookup...
Yeah, I created the more complex script to make the CPU usage more evident. Today I'll work on fixing this; a patch will come soon.
@jsvd we could also move the config file change verification out of the...
@jsvd also, maybe instead of dynamically creating ivars and methods, we could set state values into a Hash. For example, doing this:

```ruby
class A
  def initialize(i)
    @state = {}
    @procs = {}
    @state["a_#{i}"] = i
    @procs["m_#{i}"] = lambda { print(i) }
  end
end

i = 0
start = Time.now
while true
  i += 1
  A.new(i)
  if (i % 1000) == 0
    puts "#{i} last iteration took #{Time.now - start}"
    start = Time.now
  end
end
```

there is no slowdown, since only the two fixed ivar names (@state and @procs) are ever registered on the class.
As suspected, per @headius's explanation in jruby/jruby/issues/3859: creating ivars with new names grows a mapping table from ivar names to objects, and the more it grows, the slower it gets. We should definitely get rid of that pattern in the pipeline.
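To make the contrast concrete, here is a small benchmark sketch (class names and batch sizes are mine, not from the thread); on an affected JRuby, the growing-names batches get progressively slower while the fixed-name batches stay flat:

```ruby
require "benchmark"

class GrowingNames
  def initialize(i)
    instance_eval "@v_#{i} = #{i}"  # a brand-new ivar name each time: the class's name table grows
  end
end

class FixedName
  def initialize(i)
    @v = i                          # one reused ivar name: the table stays at a single entry
  end
end

counter = 0
5.times do
  growing = Benchmark.realtime { 10_000.times { GrowingNames.new(counter += 1) } }
  fixed   = Benchmark.realtime { 10_000.times { FixedName.new(counter) } }
  puts format("growing: %.3fs  fixed: %.3fs", growing, fixed)
end
```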
Fix is in the master, 5.0, 2.x, and 2.3 branches; see #5250.
👍
Confirmed fixed in 2.3.3.
Thanks @treksler!
I'm running a small ELK stack in a Docker container. When I updated to Logstash 2.3.x, I started to experience resource leaks.
I finally traced it down to the auto-reload feature for Logstash, which I had unknowingly merged from the upstream of my fork.
With auto-reload on, within 24 hours Logstash was running at 50% CPU and 16% memory on my EC2 t2.medium instance. Without it, Logstash stays at about 1-2% CPU and 5.5% memory indefinitely. My ELK stack takes in about 787,000 messages per day, and nothing else runs on that instance. I'm using 2016-03 Amazon Linux and Docker 1.9.1.
You can see what I'm running successfully here: https://github.com/kenwdelong/elk-docker/tree/ES232LS232K450, as compared to what I was running before unsuccessfully: https://github.com/kenwdelong/elk-docker/tree/ES231L231K450. If you diff the commits, you'll see the difference in logstash-init. (There are also minor version upgrades to ES and LS, but I confirmed those were not the problem.)
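The logstash-init diff between those two tags shows the actual change; as a sketch, assuming the flag names documented for Logstash 2.3 (paths and interval here are illustrative), the difference amounts to:

```sh
# leaking configuration: auto-reload enabled
bin/logstash -f /etc/logstash/conf.d --auto-reload --reload-interval 3

# stable configuration: auto-reload off (the default)
bin/logstash -f /etc/logstash/conf.d
```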