New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always lock chunks first to avoid deadlock #1721

Merged
merged 4 commits into from Oct 23, 2017

Conversation

Projects
None yet
2 participants
@abicky
Copy link
Contributor

abicky commented Oct 20, 2017

This PR resolves #1549.

For example, deadlock occurs by the following steps before this commit:

  1. input plugin thread receives multiple events
    (metadata_and_data.size > 1)
  2. input plugin thread processes the first metadata_and_data element
    and aquires the lock of chunk1 (chunk.mon_exit in Buffer#write)
  3. enqueue thread aquires the lock of buffer
    (synchronize in Buffer#enqueue_chunk)
  4. enqueue thread tries to aquire the lock of chunk1
    (chunk.synchronize in Buffer#enqueue_chunk)
  5. input plugin thread processes the second metadata_and_data element
    and tries to aquire the lock of buffer (synchronize in Buffer#write_once)

cf. #1549 (comment)

Refactoring: Remove unnecessary conditions
All chunk classes which inherit Chunk have enqueued! method.
@abicky

This comment has been minimized.

Copy link
Contributor

abicky commented Oct 20, 2017

@repeatedly Could you review the change? I want to try the latest version (v0.14.22.rc2?) which includes this change in our production environment if it looks good to you.

@repeatedly repeatedly self-requested a review Oct 20, 2017

@repeatedly repeatedly self-assigned this Oct 20, 2017

@repeatedly

This comment has been minimized.

Copy link
Member

repeatedly commented Oct 20, 2017

Thanks! I will review it!

@abicky abicky changed the title Always lock chunks first to avoid deadlock [WIP] Always lock chunks first to avoid deadlock Oct 20, 2017

@abicky

This comment has been minimized.

Copy link
Contributor

abicky commented Oct 20, 2017

I found there were errors like unexpected error while checking flushed chunks. ignored. error_class=RuntimeError error="can't enqueue buffer file: path = /fluentd/log/..., error = 'closed stream'" and I'll investigate the cause 🙇

Always lock chunks first to avoid deadlock
This commit resolves #1549.

For example, deadlock occurs by the following steps before this commit:

1. input plugin thread receives multiple events
  (metadata_and_data.size > 1)
2. input plugin thread processes the first metadata_and_data element
  and aquires the lock of chunk1 (chunk.mon_exit in Buffer#write)
3. enqueue thread aquires the lock of buffer
  (synchronize in Buffer#enqueue_chunk)
4. enqueue thread tries to aquire the lock of buffer
  (chunk.synchronize in Buffer#enqueue_chunk)
5. input plugin thread processes the second metadata_and_data element
  and tries to aquire the lock of buffer (synchronize in Buffer#write_once)

@abicky abicky force-pushed the abicky:hotfix/lock-chunks-first branch from 93088bf to 723038c Oct 20, 2017

Process chunks with buffer lock in enqueue_chunk
In flush threads @buffer could call destructive
methods of chunks in @Queue without chunk lock,
so buffer lock is required.

@abicky abicky force-pushed the abicky:hotfix/lock-chunks-first branch from 723038c to 1a6f16b Oct 21, 2017

@abicky abicky changed the title [WIP] Always lock chunks first to avoid deadlock Always lock chunks first to avoid deadlock Oct 21, 2017

@abicky

This comment has been minimized.

Copy link
Contributor

abicky commented Oct 21, 2017

I fixed "closed stream" error in 1a6f16b.

@repeatedly Please review. I've checked the behavior in our development environment and it works fine. I want your review before I try it in our production environment.
Some tests failed in Travis CI, but these tests seem to sometimes fails also in master branch, aren't they?

end
if block_given?
synchronize{ @stage.keys }.each do |metadata|
chunk = @stage[metadata]

This comment has been minimized.

@repeatedly

repeatedly Oct 23, 2017

Member

No need synchronized guard here?

This comment has been minimized.

@repeatedly

repeatedly Oct 23, 2017

Member

Hmm... deleting metadata from @stage happens in only enqueue_chunk so it seems safe in CRuby.

This comment has been minimized.

@abicky

abicky Oct 23, 2017

Contributor

I think data race doesn't occur in CRuby thanks to GIL and synchronized is not necessary, but I'll change the code like below for other Ruby implementations:

diff --git a/lib/fluent/plugin/buffer.rb b/lib/fluent/plugin/buffer.rb
index 337b3b79..61a904ed 100644
--- a/lib/fluent/plugin/buffer.rb
+++ b/lib/fluent/plugin/buffer.rb
@@ -351,11 +351,15 @@ module Fluent
       def enqueue_all
         log.trace "enqueueing all chunks in buffer", instance: self.object_id
         if block_given?
-          synchronize{ @stage.keys }.each do |metadata|
-            chunk = @stage[metadata]
-            next unless chunk
-            v = yield metadata, chunk
-            enqueue_chunk(metadata) if v
+          metadata_array = []
+          synchronize do
+            @stage.each do |metadata, chunk|
+              v = yield metadata, chunk
+              metadata_array << metadata if v
+            end
+          end
+          metadata_array.each do |metadata|
+            enqueue_chunk(metadata)
           end
         else
           synchronize{ @stage.keys }.each do |metadata|

This comment has been minimized.

@repeatedly

repeatedly Oct 23, 2017

Member

Yes and currently fluentd focus on CRuby so I think code change is not needed for now.
But adding comment is better for the future.

This comment has been minimized.

@abicky

abicky Oct 23, 2017

Contributor

Okey, I added a comment 2261428

@repeatedly

This comment has been minimized.

Copy link
Member

repeatedly commented Oct 23, 2017

I re-run failed tests and all tests are passed.
So your patch doesn't break existing expected behaviour.

abicky added a commit to abicky/fluentd that referenced this pull request Oct 23, 2017

@repeatedly repeatedly merged commit 0848224 into fluent:master Oct 23, 2017

@abicky abicky deleted the abicky:hotfix/lock-chunks-first branch Oct 24, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment