Fluentd 0.14.11 deadlocked in file output #1549
Comments
Thanks for the report.
It was reproducible in my environment for about a day, but I can no longer reproduce it after rolling back and then forward again. I was hoping the sigdump would have enough information to debug.
I caught something similar running td-agent 0.14.16. There are a ton more threads, and I'm not sure if it's the same issue or not, but the symptoms are the same: a deadlocked daemon with no files getting written.
I am also experiencing what appears to be the same issue. RHEL 6, fluentd gem 0.14.16. The sigdump from the deadlocked worker is below, followed by the sigdump from a functional worker (note the lack of locks). Edit: moved to attachments instead of giant code blocks.
Here is another thread dump, from after all 4 worker processes deadlocked. I have just updated to fluentd 0.14.20; I'll let you know if the issue persists.
0.14.20 has had one of our 4 worker processes deadlock already; here is its thread dump. I expect the others to deadlock shortly.
Thanks for the testing. Could you paste your configuration and traffic info?
Greetings, we are seeing what appear to be deadlocks like these. We're processing a few hundred million events a day on 4 servers. What can I do to help with diagnostics? These are all 0.14.20, and a couple of the servers (though they have identical gems and versions) run well, while others have issues.
I'm waiting on some corporate approval for how much information I can share. What I can say right now is that our config looks generally like this, and it's operating on messages originally received from syslog.
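(The configuration itself did not survive in this excerpt. Purely as a hypothetical sketch of the kind of setup being described, a syslog-fed pipeline flushing to a buffered file output with several flush threads, it might look roughly like this; every tag, path, and value below is illustrative, not the reporter's actual config.)

```
<source>
  @type forward            # events relayed from upstream syslog collectors
  port 24224
</source>

<match syslog.**>
  @type file
  path /var/log/fluentd/syslog
  <buffer time>
    @type file
    path /var/spool/fluentd/syslog
    timekey 3600
    timekey_wait 10m
    flush_thread_count 8   # multiple flush threads, which this thread later ties to the deadlock
  </buffer>
</match>
```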
Thanks for the info! I will try to reproduce it in my environment.
Hmm... I used a modified version of scottbuettner's conf, but I still can't reproduce the problem on my Mac or on Ubuntu 16.04. Here is the load-generation script:

```ruby
require 'fluent-logger'
require 'json'

port = ARGV[0].nil? ? 24224 : Integer(ARGV[0])
log = Fluent::Logger::FluentLogger.new(nil, :host => 'localhost', :port => port)

msgs = []
# hoge.json contains a number of nginx log lines; different content is ok.
File.read('hoge.json').each_line { |l|
  m = JSON.parse(l)
  m['host'] = "host#{rand(50)}"
  m['test'] = "test#{rand(10)}"
  msgs << m
}

size = msgs.size
loop {
  (0..100000).each { |i|
    msg = msgs[i % size]
    msg['num'] = i
    log.post("test.tag", msg)
    sleep 0.0001
  }
  sleep 1
}
```

I ran 2 or 3 clients in parallel, but the problem was not reproduced in 12 hours.
@repeatedly Can you track how many messages you're actually sending with that? We're receiving somewhere around 3,000/second in the system that fails, and it usually does so within a few hours. I don't think it's relevant, but we're running on RHEL/CentOS 6 rather than Ubuntu. td-agent bundles the Ruby interpreter, so I don't think there would be much difference there, but I thought I'd mention it.
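(One way to answer the rate question, as an editor's sketch rather than something posted in the thread, is to instrument a sender with a counter and print the observed throughput. The tag and payload below are illustrative; only `Fluent::Logger::FluentLogger#post` is taken from the fluent-logger gem used above.)

```ruby
require 'fluent-logger'

log = Fluent::Logger::FluentLogger.new(nil, host: 'localhost', port: 24224)

sent = 0
started = Time.now

100_000.times do |i|
  # post returns true on success, false if the event could not be buffered/sent
  sent += 1 if log.post('test.tag', 'num' => i, 'host' => "host#{rand(50)}")
  if i.positive? && (i % 10_000).zero?
    puts format('%.0f msgs/sec so far', sent / (Time.now - started))
  end
end
```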
Reporter here, also running CentOS 6.
I set up a CentOS 6 box and restarted the reproduction steps at 3,000 msgs/second.
Hmm... I still can't reproduce this problem on CentOS 6. And if you hit the problem again, could you try […]? In addition, Ruby 2.4 has better deadlock detection.
t2.large in AWS; it was never doing much throughput, and CPU usage was pretty minimal, even during bursts.
More deadlock info:
Note @repeatedly: the deadlock posted above happens almost immediately with flush_thread_count > 1 (I had it set to 8). With flush_thread_count set to 1, the deadlock certainly does not happen immediately, but I am waiting to see whether it happens over a longer period of time. Hopefully that helps.
With threads at 1, it took about 2 hours to fail.
pid2line.rb result from Slack:
@repeatedly I've been working through this deadlock issue and I had a question about the retry process: is that a separate thread, or does it consume the thread that is active? How are retries stored? In memory? Is it reasonable that a huge number of chunks in a retry state could lead to what we are seeing here, or at least in my case? I was looking at some libraries and behavior, and it seems that some of the plugins were stuck in a retry loop because of a library version upgraded via the gemspec that has an issue. The question is whether that matters: if a retry only consumes one thread and just keeps retrying the one chunk until it's tossed, then it won't cause increasing resource usage; if retries pile up, though, the behavior would be different.
Does "retry process" mean buffer chunk retry?
Yes. See the code: fluentd/lib/fluent/plugin/output.rb, line 1134 (at 629c470).
Is this 'stuck' different from this issue? Is it a known issue between plugin/library versions?
Yeah, retry itself doesn't increase resource usage much. Does this deadlock happen after some errors?
I tried to reproduce this issue at 629c470 by the following steps, and I found that a deadlock occurs if an input plugin thread locks chunks in […].

1. Apply the following change:

```diff
diff --git a/lib/fluent/plugin/buffer.rb b/lib/fluent/plugin/buffer.rb
index be66f3e8..d4c6be87 100644
--- a/lib/fluent/plugin/buffer.rb
+++ b/lib/fluent/plugin/buffer.rb
@@ -496,6 +496,7 @@ module Fluent
           if stored
             block.call(chunk, adding_bytesize)
+            sleep 3
           end
         end
```

2. Prepare a configuration file (a minimal example follows this list).
3. Launch fluentd.
4. Post events:

```ruby
require 'fluent-logger'

logger = Fluent::Logger::FluentLogger.new(nil, host: 'localhost', port: 24224)
# two events whose timestamps are a day apart, so they map to different chunks
logger.send(:write, ['test.a', [[Time.now.to_i - 86400, { msg: 0 }], [Time.now.to_i, { msg: 1 }]]])
```

5. Take a sigdump.
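(The configuration file used in step 2 was not captured in this excerpt. A guessed minimal configuration that is consistent with the steps above, namely a forward input on port 24224 and a time-keyed, buffered file output so the two posted events land in different chunks, might look like this; the paths and parameter values are illustrative.)

```
<source>
  @type forward
  port 24224
</source>

<match test.**>
  @type file
  path /tmp/fluentd-test
  <buffer time>
    timekey 3600        # the two posted events are a day apart, so they land in different chunks
    timekey_wait 0
    flush_mode interval
    flush_interval 1s
  </buffer>
</match>
```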
@abicky Thanks. I will check it!
It is reproduced in my environment.
@repeatedly I'm not 100% up on fluentd internals, so I'm just trying to be helpful, but is it possible to create a system like the one used for timekey and timekey_wait, only in reverse? Stop writing to a buffer chunk before it is enqueued; then, when the enqueue occurs, the lock is simply unnecessary. This could operate by monitoring a "flush percentage" or something similar. If the lock is problematic, try to avoid needing one.
@repeatedly Thank you for your effort :)
@abicky If you have a patch for this, it would be very helpful. We don't have an actual production environment that hits this problem, so if your patch resolves your production issue, that is reliable evidence for other users :)
@repeatedly Okay, I'll try to write a patch 💪
@abicky I am currently facing the same issue. I am hoping for this patch, too!
@repeatedly @abicky We are seeing this issue happening all the time (v0.14.22.rc1). Our archive servers deadlock within about 5 minutes of a restart. Using append and compress may make this happen more easily(?). Here's our out_file configuration:
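(The out_file configuration itself was lost from this excerpt. Based only on the parameters named in the comment, an illustrative, hypothetical equivalent using out_file with append and compress might look like the following; every path and value is made up for the example.)

```
<match archive.**>
  @type file
  path /data/archive/%Y%m%d/${tag}
  append true
  compress gzip
  <buffer tag,time>
    @type file
    path /data/fluentd-buffer/archive
    timekey 300
    timekey_wait 60
    flush_thread_count 4
  </buffer>
</match>
```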
This commit resolves fluent#1549. For example, before this commit a deadlock occurs by the following steps:
1. An input plugin thread receives multiple events (metadata_and_data.size > 1).
2. The input plugin thread processes the first metadata_and_data element and acquires the lock of chunk1 (chunk.mon_exit in Buffer#write).
3. The enqueue thread acquires the lock of the buffer (synchronize in Buffer#enqueue_chunk).
4. The enqueue thread tries to acquire the lock of chunk1 (chunk.synchronize in Buffer#enqueue_chunk).
5. The input plugin thread processes the second metadata_and_data element and tries to acquire the lock of the buffer (synchronize in Buffer#write_once).
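(The five steps above are a classic lock-ordering deadlock: one thread holds the chunk lock and wants the buffer lock, while the other holds the buffer lock and wants the chunk lock. As a rough illustration only, using plain Ruby monitors rather than Fluentd's actual Buffer and Chunk classes, the pattern looks like this:)

```ruby
require 'monitor'

buffer_lock = Monitor.new  # stands in for the buffer-wide lock
chunk_lock  = Monitor.new  # stands in for a single chunk's lock

# "input plugin thread": holds the chunk lock, then wants the buffer lock
input_thread = Thread.new do
  chunk_lock.synchronize do
    sleep 0.1                    # let the other thread grab the buffer lock first
    buffer_lock.synchronize { }  # blocks forever: the enqueue thread holds it
  end
end

# "enqueue thread": holds the buffer lock, then wants the chunk lock
enqueue_thread = Thread.new do
  buffer_lock.synchronize do
    sleep 0.1
    chunk_lock.synchronize { }   # blocks forever: the input thread holds it
  end
end

# with every thread blocked, the Ruby VM aborts with a deadlock error
[input_thread, enqueue_thread].each(&:join)
```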
I've created #1721 to resolve this issue.
Has anyone tested the above patch with your traffic / settings?
As I commented in #1721 (comment), I've checked that it works fine in our development environment, although the number of events is lower than in the production environment. Of course, I also checked that the patch resolves the problem in the case of #1549 (comment).
I've just applied the patched gem and have started testing it. My environment was usually deadlocking within 1-3 hours, so I should be able to report back tomorrow with the results.
So far so good! 22 hours in and no deadlocks, running with 4 worker processes. @repeatedly, any idea when this can hit an RC or an actual release version?
@scottbuettner any notable impact on performance? I'm very encouraged that this patch will solve the issues we've been seeing as well!
@GLStephen I don't have any benchmarks to compare against or anything that detailed, but I didn't notice a drop in performance just from comparing CPU/load/etc.
I released v0.14.22.rc2 for testing.
Looks promising for us too! No deadlocks seen so far with 0.14.22.rc2. Thanks a lot @abicky for the patch!
Thanks for the patch @abicky. @repeatedly, this seems to fix our issue as well; we'll fully load it over the weekend and know for sure, but we are not seeing the problematic behaviors we were seeing before at all.
Original report: td-agent 3 / fluentd 0.14.11, running on CentOS 6, writing to a file. Senders were all buffering. Multiple sigdumps were identical other than the allocation count, so I feel fairly confident things were deadlocked.
I had a hard time killing the daemon when it was in this state.
We rolled back to td-agent 2 / fluentd 0.12 and may have seen the same thing, but I am not sure, and we can no longer reproduce the problem.
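(For readers unfamiliar with the dumps referenced throughout this issue: they come from the sigdump gem, which fluentd loads for its processes; sending SIGCONT to a worker writes every thread's backtrace and held locks to a file under /tmp. A minimal standalone sketch of that mechanism outside fluentd, assuming the gem's default output path is used:)

```ruby
require 'sigdump/setup'  # installs a SIGCONT handler, as fluentd does for its workers

# a couple of parked threads so the dump has something to show
_workers = Array.new(2) { Thread.new { sleep } }

Process.kill(:CONT, Process.pid)  # equivalent to `kill -CONT <pid>` from a shell
sleep 1                           # give the handler time to write the file

puts File.read("/tmp/sigdump-#{Process.pid}.log")  # sigdump's default dump location
```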