Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

in_tail: prevent wrongly unwatching with follow_inodes #4237

Merged
merged 1 commit into from Jul 13, 2023

Conversation

daipom
Copy link
Contributor

@daipom daipom commented Jul 12, 2023

Which issue(s) this PR fixes:

What this PR does / why we need it:
We must not unwatch targets that still exist.
If unwatching an existing target, it causes log duplication.

This problem exists from v1.13.3 (the following fix).

This can occur when enabling follow_inodes.

In the existing implementation, update_watcher needs to unwatch the old TailWatcher since refresh_watcher can't unwatch it when update_watcher is called first (when rotation happens).
It is because update_watcher discards the old TailWatcher and refresh_watcher can't recognize the old inode is disappeared.

if @follow_inodes
# When follow_inodes is true, it's not cleaned up by refresh_watcher.
# So it should be unwatched here explicitly.
tail_watcher.unwatched = true

However, it can wrongly unwatch an existing inode because the old inode may still exist.
(See the diff of test cases.)

Thus, we need a new mechanism to unwatch targets correctly.

This fix is based on the idea that we should unwatch based on directly on PositionFile's data.

Docs Changes:
No need.

Release Note:
Same as the title.

How to Reproduce

Config:

<source>
  @type tail
  tag test
  path /test/fluentd/input4/test.log*
  pos_file /test/fluentd/pos/pos
  read_from_head true
  follow_inodes true
  refresh_interval 15s
  enable_stat_watcher false
  <parse>
    @type none
  </parse>
</source>

<match test.**>
  @type stdout
</match>
  • note about enable_stat_watcher
    • enable_stat_watcher false is for making sure to detect rotation.
      • I think this is problematic too... If stat_watcher notifies too quickly before the new current file is created, TailWatcher sets nil to it's rotate_handler. This causes some problems. (I will create an issue or PR about this later.)
      • This may have prevented this problem from occurring in a real environment. If in_tail can't detect the rotation of the current file because of this, then this problem does not occur. The focus is on whether the rotation of the current file can be detected when enable_stat_watcher is enabled. If not, then this can occur only when disabling enable_stat_watcher.

Files:

$ ls /test/fluentd/input4
test.log

$ cat /test/fluentd/input4/test.log
foooooaaa
Barrrrraaa

Script to rotate:

mv /test/fluentd/input4/test.log /test/fluentd/input4/test.log.1
echo "aaa" > /test/fluentd/input4/test.log

Run:

  • Run Fluentd
  • Run the script to rotate

Result:

2023-06-29 09:55:07 +0900 [info]: #0 starting fluentd worker pid=686256 ppid=686236 worker=0
2023-06-29 09:55:07 +0900 [info]: #0 following tail of /test/fluentd/input4/test.log
2023-06-29 09:55:07.232909021 +0900 test: {"message":"foooooaaa"}
2023-06-29 09:55:07.232911541 +0900 test: {"message":"Barrrrraaa"}
2023-06-29 09:55:07 +0900 [info]: #0 fluentd worker is now running worker=0
2023-06-29 09:55:12 +0900 [info]: #0 detected rotation of /test/fluentd/input4/test.log; waiting 5 seconds
2023-06-29 09:55:12 +0900 [info]: #0 following tail of /test/fluentd/input4/test.log
2023-06-29 09:55:12.234238467 +0900 test: {"message":"aaa"}
2023-06-29 09:55:22 +0900 [info]: #0 following tail of /test/fluentd/input4/test.log.1
2023-06-29 09:55:22.234924294 +0900 test: {"message":"foooooaaa"}
2023-06-29 09:55:22.234935622 +0900 test: {"message":"Barrrrraaa"}
  • The old rotated file was read in duplicate.
  • This occurs if detach is done before refresh_watcher.
    • It is because detach of update_watcher wrongly unwatches the existing inode.

lib/fluent/plugin/in_tail.rb Outdated Show resolved Hide resolved
@daipom
Copy link
Contributor Author

daipom commented Jul 12, 2023

TODO: I'm checking if it is OK to unwatch the position tied to TailWatcher that is not detached yet.

@daipom
Copy link
Contributor Author

daipom commented Jul 12, 2023

https://github.com/fluent/fluentd/actions/runs/5526720226/jobs/10081747103?pr=4237

The test test_move_truncate_move_back looks unstable on macOS. I will check if it is related to this fix.

Windows too. It may have something to do with this fix.

https://github.com/fluent/fluentd/actions/runs/5526720222/jobs/10081747201?pr=4237

log.debug {
target_paths_str = target_paths_hash.collect { |key, target_info| target_info.path }.join(",")
existence_paths_str = existence_paths_hash.collect { |key, target_info| target_info.path }.join(",")
"tailing paths: target = #{target_paths_str} | existing = #{existence_paths_str}"
}

unwatched_hash = existence_paths_hash.reject {|key, value| target_paths_hash.key?(key)}
@pf&.unwatch_removed_targets(target_paths_hash)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I might overlook something, but could you let me ask the following?

I considered the pattern that /var/log/test1.log was removed then its inode(inode-1) was reused by /var/log/test2.log in the setting path = /var/log/*.log*.

  • target_paths_hash will include inode-1, since /var/log/test2.log has the same inode as the removed /var/log/test1.log.
  • So @pf&.unwatch_removed_targets(target_paths_hash) will skip inode-1 so the position info of inode-1 won't be reset.
  • removed_hash also won't include inode-1 due to the same reason.

In this case, won't fluentd misrecognize the previous position for /var/log/test2.log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for considering it!
That scenario can happen in the current master branch, right?

I think this logic should work in that scenario, and the position should be reset to zero.
I'm not sure this works in that scenario, though...
We should check it.

# rotated file has the same inode number with the last file.
# assuming following situation:
# a) file was once renamed and backed, or
# b) symlink or hardlink to the same file is recreated
# in either case of a and b, seek to the saved position
# c) file was once renamed, truncated and then backed
# in this case, consider it truncated
@pe.update(inode, 0) if fsize < @pe.read_pos

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That scenario can happen in the current master branch, right?

I wanted to say that #4237 (comment) is the side effect introduced by adding @pf&.unwatch_removed_targets(target_paths_hash) in your patch.

But maybe I was wrong, because added_hash also won't include inode-1 in my suggested pattern.
I rewrote the scenario a bit as follows.

In the pattern that /var/log/test1.log was removed then its inode(inode-1) was reused by /var/log/test2.log in the setting path = /var/log/*.log*:

  • target_paths_hash will include inode-1, since /var/log/test2.log has the same inode as the removed /var/log/test1.log.
    • It means that @pf&.unwatch_removed_targets(target_paths_hash) will skip inode-1 so the position info of inode-1 won't be reset.
    • removed_hash also won't include inode-1 due to the same reason.
    • added_hash also won't include inode-1 due to the same reason.
  • So refresh_watchers() will call neither start_watchers() nor stop_watchers() for both of /var/log/test1.log and /var/log/test2.log.

Yes, the above scenario can happen even in the current master branch.

The situation will be resolved after a new /var/log/test1.log was created and @tails["/var/log/test1.log"] got updated with a new inode.
So it won't be a big problem in real environment, I believe...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it rotation? (Is the test1.log the same file as test2.log?)

If so, it is correct that refresh_watcher does nothing for inode-1 because it means the target inode-1 still exists.
In this case, update_watcher update the TailWatcher.

However, I wonder the old TailWathcer can collect duplicate log in rotate_wait interval...
I'll check it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I wonder the old TailWathcer can collect duplicate log in rotate_wait interval...
I'll check it.

This may not be the point that you were originally concerned about, but I found this can cause log duplication.
This seems to be a problem that has existed for a long time.

This is a similar problem to #4237 (comment)
It seems that in_tail has a log duplication problem since a long time ago with rotate_wait.

I will explain this in another comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented on this in #4237 (comment)

We must not unwatch targets that still exist.
If unwatching an existing target, it causes log duplication.

In the existing implementation, `update_watcher` needs to unwatch
the old TailWatcher since `refresh_watcher` can't unwatch it when
`update_watcher` is called first.
It is because `update_watcher` discards the old TailWatcher and
`refresh_watcher` can't recognize the old inode is disappeared.

However, it can wrongly unwatch an existing inode because the
old inode may still exist.
(See the diff of test cases.)

Thus, we need a new mechanism to correctly unwatch targets when
follow_inodes.

This fix is based on the idea that we should unwatch based on
directly on PositionFile's data.

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
Co-authored-by: Takuro Ashie <ashie@clear-code.com>
@daipom daipom force-pushed the in_tail-prevent-wrongly-unwatching-2 branch from d9bfe4a to 51848da Compare July 12, 2023 09:00
@daipom
Copy link
Contributor Author

daipom commented Jul 12, 2023

I fixed this so that this logic works only for follow_inodes true.
Now this does not change the behavior of follow_indoes false.

@daipom
Copy link
Contributor Author

daipom commented Jul 12, 2023

TODO: I'm checking if it is OK to unwatch the position tied to TailWatcher that is not detached yet.

This fix changes that unwatching is done immediately (without waiting rotate_wait).
I worry that this can change some behaviors...
To limit the impact, we should ensure that unwatching is done at the same time as detaching, just as before.
However, ensuring this will be a high-impact fix. So I gave up. 😢

Even if the existing TailWatcher's position is unwatched, the target file is already removed.
So I think we don't need to wait rotate_wait for unwatching...
but I worry that it might result in unexpected changes in behavior...

@daipom
Copy link
Contributor Author

daipom commented Jul 12, 2023

but I worry that it might result in unexpected changes in behavior...

The following test failure is one example of this.
This test is for follow_inodes false, so now it will succeed, but I think the same phenomenon can occur for follow_inodes true as well.

https://github.com/fluent/fluentd/actions/runs/5526720226/jobs/10081747103?pr=4237

The test test_move_truncate_move_back looks unstable on macOS. I will check if it is related to this fix.

Windows too. It may have something to do with this fix.

https://github.com/fluent/fluentd/actions/runs/5526720222/jobs/10081747201?pr=4237

It is difficult to explain why this test was unstable before rebasing this branch.
We need to understand the following scenario and mechanism first.

  • While the existing TailWatcher is detaching and waiting for the rotate_wait interval, new TailWatcher can be created that collects the same target.
  • In this case, there are multiple TailWatchers for the same target, so this risks log duplication.
  • However, log duplication does not occur because those multiple TailWatchers refer to the same PositionEntry.

This fix changes this behavior.
If referring to the inode that is already unwatched, then PositionFile returns a new PositionEntry.
So, if the old TailWatcher is unwatched immediately without waiting rotate_wait, the new TailWatcher for the same target will have another PositionEntry for the same target.

Because of the following logic, TailWathcer can update the position to zero even after being unwatched.

# rotated file has the same inode number with the last file.
# assuming following situation:
# a) file was once renamed and backed, or
# b) symlink or hardlink to the same file is recreated
# in either case of a and b, seek to the saved position
# c) file was once renamed, truncated and then backed
# in this case, consider it truncated
@pe.update(inode, 0) if fsize < @pe.read_pos

Thus, this fix can cause the log duplication in this scenario.

This happens only when the target having the same inode resurrects (disappears and appears again) with the smaller file size in the interval rotate_wait.

@daipom
Copy link
Contributor Author

daipom commented Jul 12, 2023

About #4237 (comment), I think we should fix the entire logic so that a new TailWatcher for the same target should be added after the detach of the old TailWatcher is completed.

But it doesn't look like we'll have time to get there before the next release of 1.16... 😢

@daipom
Copy link
Contributor Author

daipom commented Jul 13, 2023

I have commented on some cases where multiple TailWatcher exists due to rotate_wait.
I have checked the behavior of v1.16.1, and I found out that in_tail has this problem since a long time ago.

So now I think we should not be concerned about that point in this PR.
(It's been wrong since a long time ago)

#4237 (comment)
#4237 (comment)

The scenario: rotate -> update_watcher -> refresh_watcher -> rotate_wait elapsed

  • Initial state
path-1: inode-b, TailWatcher-1 
path-2: inode-a, TailWathcer-2
  • Rotate happens
path-1: inode-c, TailWatcher-1-new
path-2: inode-b, TailWathcer-2-new, TailWatcher-1-detaching
path-3: inode-a, TailWatcher-2-detaching
  • refresh_watcher
path-1: inode-c, TailWatcher-1-new
path-2: inode-b, TailWathcer-2-new, TailWatcher-1-detaching
path-3: inode-a, TailWatcher-2-detaching, TailWatcher-3
  • rotate_wait elapsed
path-1: inode-c, TailWatcher-1-new
path-2: inode-b, TailWathcer-2-new
path-3: inode-a, TailWatcher-3

As above, rotate_wait causes the situation that multiple TailWatchers exist for the same target temporarily.
This causes log duplication.
I have confirmed this on v1.16.1.

<source>
  @type tail
  tag test
  path /path/to/test.log*
  pos_file /test/fluentd/pos/pos
  read_from_head true
  follow_inodes true
  refresh_interval 5s
  enable_stat_watcher false # To ensure that TailWathcer recognizes rotation
  rotate_wait 30s
  <parse>
    @type none
  </parse>
</source>

<match test.**>
  @type stdout
</match>
  • Start to tail the following file
test.log
  • Rotate
test.log
test.log.1
  • Append a log to test.log.1.
    • When refresh_interval NOT elapsed and rotate_wait NOT elapsed
      • No duplication
    • When refresh_interval elapsed and rotate_wait NOT elapsed
      • DUPLICATION
    • When refresh_interval elapsed and rotate_wait elapsed
      • No duplication

@daipom
Copy link
Contributor Author

daipom commented Jul 13, 2023

In some cases, as in the case test_move_truncate_move_back , it seems that referring to the same PositionEntry can prevent duplication, but in other cases, this has not been possible before.
(We can reproduce it easily like #4237 (comment))

This problem should be solved fundamentally.
(in_tail should wait for the old TailWatcher being completely detached)

@ashie
Copy link
Member

ashie commented Jul 13, 2023

The test test_move_truncate_move_back looks unstable on macOS. I will check if it is related to this fix.

The situation that this test cares seems rare, and it doesn't necessarily have to be resolved on follow_inodes case, because files are assumed to be rotated normally without such odd behaviour in this case.

This happens only when the target having the same inode resurrects (disappears and appears again) with the smaller file size in the interval rotate_wait.

There are 2 possible reasons for same inode resurrection:

  1. File is moved and moved back like test test_move_truncate_move_back
  2. The inode is reused for a new file

For 1, we don't always need to resolve it (at least in this PR) as I described above.

For 2, it's not possible in this case because the TailWatcher keeps the file opened while rotate_wait.
(Although there is an exception open_on_every_update, rotate_wait doesn't happen in this case.)
When someone still open the file, the inode won't be reused.
Here is a simplified verification for it:

test.rb:

#!/usr/bin/env ruby                                                                                                                                           

require 'fileutils'

FileUtils.rm_rf("hoge.txt")

File.open("hoge.txt", "wb") { |f| f.puts("hoge") }

s1 = File.stat("hoge.txt")
puts "ino: #{s1.ino}"

if ARGV[0] == "keep-opened"
  f = File.open("hoge.txt", "wb")
end

FileUtils.rm_rf("hoge.txt")

File.open("hoge.txt", "wb") { |f| f.puts("hage") }

s2 = File.stat("hoge.txt")
puts "ino: #{s2.ino}, same-inode?: #{s1.ino == s2.ino}"
$ ./test.rb
ino: 20192736
ino: 20192736, same-inode?: true
$ ./test.rb keep-opened
ino: 20192736
ino: 20197162, same-inode?: false

Copy link
Member

@ashie ashie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have commented on some cases where multiple TailWatcher exists due to rotate_wait.
I have checked the behavior of v1.16.1, and I found out that in_tail has this problem since a long time ago.

Not needed to resolve all of this kine of problems by this PR.
Since we need more time to resolve it fundamentally, I think that's enough for now as it solves part of them.

@daipom
Copy link
Contributor Author

daipom commented Jul 13, 2023

Thanks for your review! I agree.
I will create an issue for #4237 (comment) later

@ashie ashie merged commit d05f592 into fluent:master Jul 13, 2023
12 of 16 checks passed
@ashie
Copy link
Member

ashie commented Jul 13, 2023

Thanks!

@daipom daipom deleted the in_tail-prevent-wrongly-unwatching-2 branch July 13, 2023 08:52
@daipom
Copy link
Contributor Author

daipom commented Jul 13, 2023

Thanks! @ashie @masaki-hatada

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants