
Fix broken state w/ multiple logs; improve rename handling; add reconnect timeout option; resume old files; start from beginning of file if created after startup (same as LogStash) #151

Closed
wants to merge 37 commits

Conversation

@driskell (Contributor) commented Feb 1, 2014

Hi Jordan / People,

I needed to make some changes to get the forwarder to work safely for me, as the registrar persistence was not working properly at all. On each iteration it only saved persistence for the log files mentioned in the current batch of events, due to a scoping problem on the state. But even with the scope fixed, there is a risk of state loss during startup if another restart happens before every log file has been touched.

  • Fixed the above by adding a new channel during startup, which allows resume_tracking to flag that it resumed a file. The previous persistence is now loaded in main(), and every entry that resume_tracking starts a harvester for is passed to the registrar, so that the initial rewrite of the persistence ALWAYS contains the FULL list of files that still exist and the offsets they were resumed from.

Also, if reading an ACK failed (which seems to happen very often when the remote is Logstash with a redis output and redis has hit congestion_threshold), it would reconnect immediately and spam reconnects because of a missing sleep. When it does sleep on other kinds of errors, it reconnects after 1s, which for me was too fast.

  • I added a reconnect configuration option to the network settings in the JSON config file, defaulting to 1s. I've replaced the hard-coded sleeps with this, and also added a sleep for when an ACK error occurs.

Lastly, I needed old files to be reopened even after harvesting stopped at the 24-hour mark. Currently the harvester gives up on a file after 24 hours; the prospector has no knowledge of this, so it keeps the file in fileinfo[path] and never starts it again, even if the file is modified later.

  • I've made substantial changes to the prospector. The 24h dead time is now also (re?)implemented: I saw the bits in config.go and finished them, so each file can have a different dead time configured. When a harvester finishes, it passes its last known offset into a channel. When the prospector sees this, it checks the file's modification time, and if the file has been modified again since (so it is back within the dead time), it respawns the harvester with its last known offset.

I also found a possible logic problem during rotation, where it might reopen the rotated file and re-import it all if it detected the new file BEFORE the rotated one. On detecting the new file, it would overwrite fileinfo[path] for the previous one, meaning the is_renamed_file check fails by the time it gets to the rotated file. I think nobody sees this because Glob() probably returns results in alphabetical order, and most logrotate setups happen to preserve the correct order. Minor, but I adjusted it to avoid this edge case, as it would be bound to confuse the hell out of the first person who messes with their logrotate settings.

  • This just needed the old file to be remembered, so I added a missing_files[] to complement fileinfo[], since there was no way to keep two files with the same path in fileinfo[], and prefixing the key or something similar just seemed dirty. If I'm honest I'm not fully happy with this, but it works.

I have posted in the following pull regarding the changes I have done:
#123

I've also just seen this appear from jhanson for the dead time config, which unfortunately conflicts with mine. His approach was different and adds it as a command line option.
#150

I've signed the CLA for a Kibana change I did earlier, so that should be sorted. I really do apologise for not discussing all this on the mailing list first. I did look through the mailing list, but activity for the forwarder seems low. It was something I just needed to get working, and I got a little carried away.

I welcome all feedback and comments; not all of my changes will be perfect or desired. I've cleaned up my commits today in a new branch, and I'm willing to create other branches with cherry-picked commits if desired.

Also, apologies for the long post. I do ramble on sometimes!

Regards,

Jason

@mleklund

Looks like Travis is failing on this PR. stud is locked to 0.0.5 in the Gemfile.lock, but Stud.Temporary was not introduced until 0.0.14.

@jordansissel (Contributor)

Sorry for the delay in response. We'll get to this soon!

@driskell (Contributor, Author)

No worries Jordan! I noticed the gofmt commits, so I'm just merging them in to fix the conflicts.
Will do some testing today.

… startup, while running always start beginning (match LogStash behaviour) - also store the start offset straight away in statefile to ensure it persists through restart even if file does not change
@driskell (Contributor, Author)

OK, so I fixed #144 too in this pull, as I needed it and it didn't make sense to fix it twice on two branches.

If we want to reduce the changes, let me know: we can close this PR and I'll reopen one with just the original fixes from the title (I kept a branch for it), then resubmit #144 once that's happy.

I also fixed the Gemfile.lock version for stud in the hope of fixing the Travis-CI build - the commit message vanished in my rebase though.

Fixed: file offset still calculated wrong. File was detected as
truncated again :-(
Fixed: losing the beginning of a partial line
Fixed: a CR is also removed from the EOL

Known issues:
If the last line was logged with a CRLF EOL, the file offset stored in
the registry is not calculated correctly (it is missing one byte for the CR)
@driskell (Contributor, Author)

OK, that didn't fix it because it still refers to lumberjack... I've fixed the tests now, and I will add some new tests for the bugs we're fixing. Just adding one for #164 at the moment too.

@driskell (Contributor, Author)

OK, this PR passes 4/5 of the tests I put over in PR #172.
The last test will be passed by PR #164 once it's finished.

@driskell (Contributor, Author) commented Mar 1, 2014

Cool. I have merged #164 (which is now complete), this PR (#151), and the tests PR #172 into https://github.com/driskell/logstash-forwarder/tree/driskell_merge_pr164
Travis now passes all the tests for the various bugs I've written tests for:
https://travis-ci.org/driskell/logstash-forwarder/builds/19859652

@jordansissel (Contributor)

I'm working with @polyfractal to get another Elasticsearch dev up to speed on this project so we can hopefully put more energy into fixes and PR management. Again, sorry for the delay.

@driskell (Contributor, Author)

Thanks @jordansissel

I've done some more restructuring on the branch for this pull - but not pushed it to GitHub just yet:

  • Brought the new functionality to Windows (it can detect file renames, but only while running, not across a restart - still better than nothing)
  • During startup, detect file renames that happened while the service was stopped
  • DRY work in the prospector, logic tidying, and made it an object

I also branched from this PR to implement a multiline codec with "negate" and "pattern" options - the default is a codec called "plain", which simply passes each line onwards. This is merged into driskell_merge_pr164 (which turned out to be a poorly chosen branch name) along with the other changes on this branch, so it's all visible online there.

Lastly, the protocol work has smoothed my network out beyond expectation and fixed all of our duplication issues - throughput for the evening shipping from legacy systems is also phenomenally smoother and quicker! I stuck with what I said in my email and did the less weird data frame version 2 for partial ACK - the added benefit is that it's compatible with the old LogStash server gem.
I also think I fixed the issue of SSL connections getting severed when a single connection failed, and implemented a :logger option so LogStash can pass @logger in - the server will then log socket failures along with which peer they came from.

Jason

@driskell (Contributor, Author) commented Apr 8, 2014

Hi @jordansissel

I'm keen to help fix this up. Is it best we chat at some point (IRC?), or have you started working on your own fixes?

The channel from harvester to prospector that facilitates resume after the dead time should be OK.

The design deviation - going from harvester/prospector directly to the registrar - is probably the main concern here, I think. I believe it's important that all offset changes are saved as soon as they need to be, on start as well as on acknowledgement. But I can split that into another branch and another PR if it would unblock this one.

Thanks for your time - I appreciate the pain of such a large code review!

@kiranos commented Apr 17, 2014

It would be great to discuss how to proceed with this - splitting it into multiple pull requests, or reviewing it as a whole.

@driskell (Contributor, Author)

@jordansissel

Not sure if you've made a decision yet on whether to do things yourself and only use this PR for reference, or whether you're still open to merging if certain things are changed.

Would it help if I removed the harvester -> registrar channel (the one that saves initial offsets immediately) to reduce the PR? That would put it back to your original design, with the only adjustment being the two-way prospector <-> harvester link for offset resume after the dead time:

Prospector <-> Harvester -> Spooler -> Publisher -> Registrar

I know some of the variable names are a bit off too; I can go through making them more explanatory, or that can wait if you're happy with them.

@driskell (Contributor, Author) commented May 1, 2014

Tidied things up a bit and improved some variable naming. Two functional tweaks: always save a state file even if it's empty (simplifies the code), and don't clean out the memorised file state for removed files until the harvester stops (more logical).

Removed the harvester -> registrar channel. (My plan is a future PR with an event channel to allow runtime clean-up of deleted files from the state file - currently the only way to clean it is to restart LSF.)

@driskell (Contributor, Author) commented May 1, 2014

Merged master to resolve conflict with merged #198

…ry so we can safely track file rotations across restarts and maintain exact behaviour across all platforms. If the Go library changes to become incompatible a WARNING message will print and we will resort to old behaviour: having no ability to track rotations across restarts.
@driskell (Contributor, Author) commented May 4, 2014

Managed to get Windows to save the vol/idxhi/idxlo values that Go uses for os.SameFile, so we can detect file rotation during startup. Seems to work. Might be something you want or not, but it's there for review.

@driskell (Contributor, Author) commented May 4, 2014

OK, FYI - there will be no more changes to this (unless to tidy, or if requested).

This contains only the wider changes I made to fix the various issues with rename detection, start offsets, dead time and Windows compatibility. The only additional feature should be the configurable reconnect timeout, which can be removed if needed. I can split the bug fixes further, but from the beginning I'd fixed so much that they are tied together, so it may be difficult.

If this is not to be and you're doing your own thing then let me know.

@driskell (Contributor, Author) commented May 5, 2014

I'm going to give you a cut-down version with minimal changes to start with.
Worst case I'll just keep a fork, but hopefully I can gradually get some of the changes across.

@driskell closed this May 5, 2014
@kiranos commented May 5, 2014

Hm, what happened to this? I was eagerly waiting for logrotate support. Why was it closed?

@jordansissel (Contributor)

@kiranos read @driskell's last comment

@teu commented Jul 8, 2014

Hi Guys,

I'm not sure what the problem is, but I am restarting LF on a constantly changing file and I still get "Skipped old file". The file is an Apache access JSON log, which obviously is never idle for long. Is there any way to manually tell LF to reconsider it? (I'm using version 0.3.1.)

@wyaeld commented Sep 28, 2014

@teu the simplest workaround seems to be to touch the file.
