Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix filebeat file rotation registrar issue #1281 #1375

Merged
merged 1 commit into from
Apr 18, 2016

Conversation

ruflin
Copy link
Member

@ruflin ruflin commented Apr 12, 2016

When multiple files were rotated at the same time, the first rotation was overwriting the states for the other rotated files. This happened because the new file directly had new content inside and the state for all the other files was fetched from disk, resetting the changes which were in memory.

The problem is now fixed by not writing the state to disk every time state changes are discovered by the registrar, as it is up to the prospector to decide if these changes are the most recent ones. Now changes get written to disk every time a scan is done. The downside of this solution is that the registrar is also updated, if no changes did happen.

@ruflin ruflin added the Filebeat Filebeat label Apr 12, 2016
@ruflin ruflin changed the title Add system test to verify https://github.com/elastic/beats/issues/1281 Add system test to verify #1281 Apr 12, 2016
@ruflin ruflin changed the title Add system test to verify #1281 Fix filebeat file rotation registrar issue #1281 Apr 12, 2016
@ruflin ruflin added the review label Apr 12, 2016
@tsg
Copy link
Contributor

tsg commented Apr 12, 2016

There are some test failures which might or might not be related.

@ruflin
Copy link
Member Author

ruflin commented Apr 12, 2016

Windows errors are definitively related. I need to exclude these tests for windows as I check for the inode specifically. For Jenkins it is strange that the tests were working locally and on travis, but no Jenkins. Will look into it, could be a race condition.

@@ -140,6 +140,9 @@ func (p *ProspectorLog) scanGlob(glob string) {
// Track the stat data for this file for later comparison to check for
// rotation/etc
p.prospectorList[h.Path] = *h.Stat
// TODO: Registrar now updated also if not changes happened. Improve efficiency here
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/not/no/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@ruflin
Copy link
Member Author

ruflin commented Apr 13, 2016

@urso @andrewkroh Seems to be finally stable. Will rebase and squash.


copy := make(map[string]FileState)
for k, v := range r.state {
copy[k] = *v
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check out definition of FileState. The FileStateOS is not copied (only pointer). Is there a chance the content in FileStateOS can be overwritten when operations are run on this copy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All occurence here except for Decode which happens only once at the begining, should be read only. So this should not be an issue right now, but I should still refactor and decouple it in the future to prevent potential problems.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though "Decode" happens once at startup, the method that does it, LoadState, is public and could be called at any time. The write operation within LoadState should be protected by the mutex.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Lets see if it breaks something ;-)

@andrewkroh andrewkroh added the in progress Pull request is currently in progress. label Apr 14, 2016
@ruflin ruflin force-pushed the fb-registry branch 2 times, most recently from 98fff3d to 2ac9daa Compare April 15, 2016 08:55
@ruflin ruflin removed the in progress Pull request is currently in progress. label Apr 15, 2016
@ruflin ruflin force-pushed the fb-registry branch 2 times, most recently from fdd0243 to 91b6957 Compare April 15, 2016 14:20
@@ -25,7 +26,8 @@ import (
type Harvester struct {
Path string /* the file path to harvest */
Config *config.HarvesterConfig
Offset int64
offset int64
offsetLock sync.Mutex
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why still required? Due to prospector asking for state in order to forward to registrar?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I hope to remove this in the future again.

ignoreOlder="1s",
closeOlder="1s",
ignoreOlder="10s",
closeOlder="2s",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just making sure, are these changes intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is intentional, as in a previous version of the change the file was sometimes closed too fast. I will reset it to 1s again as I think with the most recent changes filebeat should be faster again with rotating files. But I renamed the tests to close_older instead of ignore_older as the name is not accurate anymore. It tests close_older. Will push a new commit.

h.offsetLock.Lock()
defer h.offsetLock.Unlock()

h.offset = offset
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just thinking, but wouldn't atomic.Store/Load be as good as mutex?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably also work. Would it make a difference? If not, would keep it like this for the reason that it seems to working now and that it is hopefully removed anyways in the future ...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just leave it as is.

When multiple files were rotated at the same time, the first rotation was overwriting the states for the other rotated files. This happened because the new file directly had new content inside and the state for all the other files was fetched from disk, resetting the changes which were in memory.

The problem is now fixed by not writing the state to disk every time state changes are discovered by the registrar, as it is up to the prospector to decide if these changes are the most recent ones. Now changes get written to disk every time a scan is done. The downside of this solution is that the registrar is also updated, if no changes did happen.

* System test added to verify that issue is fixed
* Closes elastic#1281
* Refactor some variables for better sync handling and fix tests
* Make offset private and only set it through method
* Rename system tests to close_older as this is what is actually checked
* Update registrar only when file was rotated or changed
* Add test to check state between shutdown and rotation.Only update registry when changes happen.
@urso urso merged commit a4d9086 into elastic:master Apr 18, 2016
@ruflin ruflin deleted the fb-registry branch April 18, 2016 12:03
ruflin added a commit to ruflin/beats that referenced this pull request Apr 19, 2016
A direct commit backport was not possible as too many things changed in the prospector and harvester
tsg pushed a commit that referenced this pull request Apr 19, 2016
A direct commit backport was not possible as too many things changed in the prospector and harvester
monicasarbu pushed a commit that referenced this pull request Apr 19, 2016
* New versions for the stack elements

* Remove link to github repo for dashboards

* Change config example that shows how to turn off process monitoring

* Clarify meaning of spool_size and harvester_buffer_size (#1388)

* Add more info to doc about configuring TLS (#1414)

* Backport of #1375 (#1418)

A direct commit backport was not possible as too many things changed in the prospector and harvester
@aellisv aellisv mentioned this pull request Apr 26, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Filebeat 1.1.2 does not record correct inode values of rotated logs on exit
4 participants