Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Call get_fd() to avoid missing file rotations #65

Closed
wants to merge 1 commit into from

Conversation

portante
Copy link

@portante portante commented Jun 5, 2018

See systemd/systemd#7998 and rsyslog/rsyslog#2437 for more information and background.

@errm
Copy link
Member

errm commented Jun 10, 2018

Thanks for the patch ...

Could you explain the issue you are seeing?
Is there a way we can add a test to exercise the behaviour? I am not a big fan of merging bugfixes without a test or at the very least a detailed comment to explain the issue ... not pinning this down is a recipe for a regression later ...

Also looking at CI this breaks many tests ... I haven't looked closely, but it does seem that this change does indeed break the build.

I would also like to document the issue upstream with https://github.com/ledbettj/systemd-journal if we can

@portante
Copy link
Author

@errm, here is the comment that describes what is happening in systemd/systemd#7998 (comment).

If a journald client does not immediately call sd_journal_get_fd() (which sets up the internal-to-journal API inotify FD for all open journal files), any journal file rotation that happens before the first sd_journal_*() API that sets up the inotify FD will result in "leaked" journal files, marked for delete but still held open by clients. In certain configurations, this can lead to disk space problems on the volume where those files live (/run/log or /var/log).

To test this one has to:

  1. start a journald client running (which does not immediately call sd_journal_get_fd())
  2. start a load on the journal to cause journal file rotation
  3. use lsof -p <client pid> | grep -E "/(run|var)/log" | grep deleted to observe the deleted journal files
  4. stop the load on the journal
  5. observe that the deleted journal files are still there over time

With the fix in place, one will still periodically observe deleted journal files, but over time they'll be cleaned up.

Regarding the existing unit test failures, it seems odd that making such a call changes the behavior that way. It is likely causing journal to ignore the corrupt journal fixture, and that we'll need to "corrupt" them another way.

Regarding upstream, filing an issue is fine, but not sure how much that code base should change if the journald APIs haven't changed (they did fix journald to close and vacate journal files, see systemd/systemd#8144).

@errm
Copy link
Member

errm commented Jun 11, 2018

The error occurs because you are calling a method #get_fd that does not exist ...

#<NoMethodError: undefined method `get_fd' for #<Systemd::Journal:0x0000000273d690>>

In order to ensure we call the native get_fd we have two options ...

  • @journal.send(:file_descriptor) - Not a fan of relying on a private method here though
  • Change the @jounal.wait(0) on the following line to @jounal.wait(0, select: true)

@ledbettj do you have a view about what we should do here?

  • We could make a change upstream so all calls to wait call the native get_fd
  • Or we could make the (currently private) file_descriptor from waitable part of the public api of systemd-journal ...

@ledbettj
Copy link

I don't see any issue with exposing file_descriptor, I think I marked it private only because I didn't think there was any external use for it.

If there is an auto-magical fix for this that we can make we can do that as well. The linked issue is supposedly resolved by systemd/systemd#8144 -- so is this not a problem in newer versions of libsystemd?

@portante
Copy link
Author

@ledbettj, newer versions of systemd still leak FDs, but they won't take up diskspace on systems that let you "vacuum" out the contents of a file on disk. It is not a complete fix for all environments, unfortunately. By always calling sd_journal_get_fd() immediately following a sd_journal_open() gives the inotify setup a chance to avoid loosing track of a file rotation (there is still a small window between the open and get_fd calls, though).

Hiding this in the open method might be the best way to avoid having to force clients to perform this sequence properly.

@reevoo-samuel
Copy link

Er, hello?

This Pull Request is too damn old! Merge or close this, sucka.

@errm
Copy link
Member

errm commented Jun 12, 2018

I opened a PR upstream that I think should solve these issues ledbettj/systemd-journal#78

I think it will also mean that I can clean up a lot of the mess around dealing with rotations ...

errm added a commit that referenced this pull request Jun 12, 2018
@errm errm closed this in #67 Jun 12, 2018
@errm
Copy link
Member

errm commented Jun 12, 2018

Thanks for your help @portante I just pushed 1.0.1 to rubygems ...

@portante portante deleted the patch-1 branch June 12, 2018 23:55
errm added a commit to errm/fluentd-kubernetes-daemonset that referenced this pull request Jun 13, 2018
Pulls in version 1.0.1 or 0.0.11  (for v0.12) of fluent-plugin-systemd

See fluent-plugin-systemd/fluent-plugin-systemd#65 for more
info about the issue that it fixes.

This may manifest in journal files not being correctly deleted by
some versions of systemd after they have been rotated.
errm added a commit to errm/fluentd-kubernetes-daemonset that referenced this pull request Jun 13, 2018
Pulls in version 1.0.1 or 0.0.11  (for v0.12) of fluent-plugin-systemd

See fluent-plugin-systemd/fluent-plugin-systemd#65 for more
info about the issue that it fixes.

This may manifest in journal files not being correctly deleted by
some versions of systemd after they have been rotated.

Signed-off-by: Ed Robinson <edward-robinson@cookpad.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants