
Fix broken state w/ multiple logs; improve rename handling; add reconnect timeout option; resume old files; start from beginning of file if created after startup (same as LogStash) #151

Closed
wants to merge 37 commits

Conversation

@driskell (Contributor) commented Feb 1, 2014

Hi Jordan / People,

I needed to make some changes to get the forwarder to work safely for me, as the registrar persistence was not working properly at all. On each iteration it only saved persistence for the log files mentioned in the current batch of events, due to a scoping problem on the state. But even with the scope fixed, there is a risk of state loss during startup if another restart happens before every log file has been touched.

  • Fixed the above by adding a new channel during startup, which allows resume_tracking to flag that it resumed a file. The previous persistence is now loaded in main(), and every entry that resume_tracking starts a harvester for is passed to the registrar, so that the initial rewrite of the persistence ALWAYS contains the FULL list of files that still exist and the offsets they were resumed from.

Also, if reading an ACK failed (which seems to happen very often when the remote is Logstash with a redis output and redis has hit congestion_threshold), it would reconnect immediately and spam reconnects because of a missing sleep. When it does sleep on other kinds of errors, it reconnects after 1s, which for me was too fast.

  • I added a reconnect configuration option to the network settings in the JSON config file, defaulting to 1s. I've replaced the hard-coded sleeps with this, and also added a sleep for when an ACK error occurs.

Lastly, I needed old files to be reopened even after harvesting stopped at the 24-hour mark. Currently the harvester gives up on a file after 24 hours; the prospector has no knowledge of this, so it keeps the file in fileinfo[path] and never starts it again, even if the file is modified later.

  • I've made substantial changes to the prospector. The 24h dead time is now also (re?)implemented: I saw the bits in config.go and finished them, so each file can have a different dead time configured. When a harvester finishes, it passes its last known offset into a channel. When the prospector sees this, it checks the file's modification time, and if the file has been modified again since (so it is back within the dead time), it respawns the harvester with its last known offset.

I also found a possible logic problem during rotation, where it might reopen the rotated file and re-import it all if it detected the new file BEFORE the rotated one. On detecting the new file, it would overwrite fileinfo[path] for the previous one, meaning the is_renamed_file check fails by the time it gets to the rotated file. I think nobody sees this because Glob() probably returns results in alphabetical order, and most logrotate setups happen to preserve the correct order. Minor, but I adjusted it to avoid this edge case, as it would be bound to confuse the hell out of the first person who messes with their logrotate settings.

  • This just needed the old file to be remembered, so I added a missing_files[] to complement fileinfo[], since there was no way to keep two files with the same path in fileinfo[], and prefixing the key or something similar just seemed dirty. If I'm honest I'm not fully happy with this, but it works.

I have posted in the following pull regarding the changes I have done:
#123

I've also just seen this appear from jhanson for the dead time config, which unfortunately conflicts with mine. His approach was different and adds it as a command line option.
#150

I've signed the CLA for a Kibana change I did earlier, so that should be sorted. I really do apologise for not discussing all this on the mailing list first. I did look through the mailing list, but activity for the forwarder seems low. It was something I just needed to get working, and I got a little carried away.

I welcome all feedback and comments; not all of my changes will be perfect or desired. I've cleaned up my commits today in a new branch, and I'm willing to create other branches with cherry-picked commits if desired.

Also, apologies for the long post. I do ramble on sometimes!

Regards,

Jason

@mleklund

Looks like Travis is failing on this PR. stud is locked to 0.0.5 in the Gemfile.lock, but Stud.Temporary was not introduced until 0.0.14.

@jordansissel (Contributor)

Sorry for the delay in response. We'll get to this soon!

@driskell (Contributor, Author)

No worries Jordan! I noticed the gofmt commits, so I'm just merging them in to fix the conflicts.
Will do some testing today.

… startup, while running always start beginning (match LogStash behaviour) - also store the start offset straight away in statefile to ensure it persists through restart even if file does not change
@driskell (Contributor, Author)

OK, so I fixed #144 too in this pull, as I needed it and it didn't make sense to fix it twice on two branches.

If we want to reduce the changes, let me know: we can close this PR and I'll reopen one with just the original fixes from the title (I kept a branch for it), then resubmit #144 once that's happy.

I also fixed the Gemfile.lock version for stud in the hope of fixing the Travis-CI build - the commit message vanished in my rebase though.

Fixed: file offset still calculated wrong. File was detected as
truncated again :-(
Fixed: losing the beginning of a partial line
Fixed: a CR is also removed from the EOL

Known issues:
If the last line was logged with a CRLF EOL, the file offset stored in
the registry is not calculated correctly (it is missing one byte for the CR)
@driskell (Contributor, Author)

OK, that didn't fix it because it still refers to lumberjack... I've fixed the tests now, and I will add some new tests for the bugs we're fixing. Just adding one for #164 at the moment too.

@driskell (Contributor, Author)

OK, this PR passes 4/5 of the tests I put over in PR #172.
The last test will be passed by PR #164 once it's finished.

@driskell (Contributor, Author) commented Mar 1, 2014

Cool. I have merged #164 (which is now complete), this PR (#151), and the tests PR #172 into https://github.com/driskell/logstash-forwarder/tree/driskell_merge_pr164
Travis now passes all the tests for the various bugs I've written tests for:
https://travis-ci.org/driskell/logstash-forwarder/builds/19859652

@jordansissel (Contributor)

I'm working with @polyfractal to get another Elasticsearch dev up to speed on this project so we can hopefully put more energy into fixes and PR management. Again, sorry for the delay.

@driskell (Contributor, Author)

Thanks @jordansissel

I've done some more restructuring on the branch for this pull - but not pushed it to GitHub just yet:

  • Brought the new functionality to Windows (it can detect file renames, but only while running, not across a restart - still better than nothing)
  • During startup, detect file renames that happened while the service was stopped
  • DRY work in the prospector, logic tidying, and made it an object

I also branched from this PR to implement a multiline codec with "negate" and "pattern" options - the default is a codec called "plain", which simply passes each line onwards. This is merged into driskell_merge_pr164 (which turned out to be a poorly chosen branch name) along with the other changes on this branch, so it's all visible online there.

Lastly, the protocol work has smoothed my network out beyond expectation and fixed all of our duplication issues - throughput for the evening shipping from legacy systems is also phenomenally smoother and quicker! I stuck with what I said in my email and did the less weird data frame version 2 for partial ACK - the added benefit is that it's compatible with the old LogStash server gem.
I also think I fixed the issue of SSL connections getting severed when a single connection failed, and implemented a :logger option so LogStash can pass @logger in - the server will then log socket failures along with which peer they came from.

Jason

@driskell (Contributor, Author) commented Apr 8, 2014

Hi @jordansissel

I'm keen to help fix this up. Is it best we chat at some point (IRC?), or have you started working on your own fixes?

The channel from harvester to prospector that facilitates resume after the dead time should be OK.

The design deviation - going from harvester/prospector directly to the registrar - is probably the main concern here, I think. I believe it's important that all offset changes are saved as soon as they need to be, on start as well as on acknowledgement. But I can split that into another branch and another PR if it would unblock this one.

Thanks for your time - I appreciate the pain of such a large code review!

@kiranos commented Apr 17, 2014

It would be great to discuss how to proceed with this - splitting it into multiple pull requests, or reviewing it as a whole.

@driskell (Contributor, Author)

@jordansissel

Not sure if you've made a decision yet on whether to do things yourself and only use this PR for reference, or whether you're still open to merging if certain things are changed.

Would it help if I removed the harvester -> registrar channel (the one that saves initial offsets immediately) to reduce the PR? That would put it back to your original design, with the only adjustment being the two-way prospector <-> harvester link for offset resume after the dead time:

Prospector <-> Harvester -> Spooler -> Publisher -> Registrar

I know some of the variable names are a bit off too; I can go through making them more explanatory, or that can wait if you're happy with them.

@driskell (Contributor, Author) commented May 1, 2014

Tidied things up a bit and improved some variable naming. Two functional tweaks: always save a state file even if it's empty (simplifies the code), and don't clean out the memorised file state for removed files until the harvester stops (more logical).

Removed the harvester -> registrar channel. (My plan is a future PR with an event channel to allow runtime clean-up of deleted files from the state file - currently the only way to clean it is to restart LSF.)

@driskell (Contributor, Author) commented May 1, 2014

Merged master to resolve conflict with merged #198

…ry so we can safely track file rotations across restarts and maintain exact behaviour across all platforms. If the Go library changes to become incompatible a WARNING message will print and we will resort to old behaviour: having no ability to track rotations across restarts.
@driskell (Contributor, Author) commented May 4, 2014

Managed to get Windows to save the vol/idxhi/idxlo values that Go uses for os.SameFile, so we can detect file rotation during startup. Seems to work. Might be something you want or not, but it's there for review.

@driskell (Contributor, Author) commented May 4, 2014

OK, FYI - there will be no more changes to this (unless to tidy, or if requested).

This contains only the wider changes I made to fix the various issues with rename detection, start offsets, dead time and Windows compatibility. The only additional feature should be the configurable reconnect timeout, which can be removed if needed. I can split the bug fixes further, but from the beginning I'd fixed so much that they are tied together, so it may be difficult.

If this is not to be and you're doing your own thing then let me know.

@driskell (Contributor, Author) commented May 5, 2014

I'm going to give you a cut-down version with minimal changes to start with.
Worst case I'll just keep a fork, but hopefully I can gradually get some of the changes across.

@driskell closed this May 5, 2014
@kiranos commented May 5, 2014

Hm, what happened to this? I was eagerly waiting for logrotate support. Why was it closed?

@jordansissel (Contributor)

@kiranos read @driskell's last comment

@teu commented Jul 8, 2014

Hi Guys,

I'm not sure what the problem is, but I am restarting LF on a constantly changing file and I still get "Skipped old file". The file is an Apache access JSON log, which obviously is never idle for long. Is there any way to manually tell LF to reconsider it? (I'm using version 0.3.1.)

@wyaeld commented Sep 28, 2014

@teu the simplest workaround seems to be to touch the file.
