
Add support for gzip compressed files #3070

Closed
ruflin wants to merge 1 commit from the gz-support branch

Conversation

@ruflin (Member) commented Nov 29, 2016

This PR adds support for reading gzip files. As gzip files are expected to be log files, this is part of the log type. All features of the log harvester type are supported. The main difference is that the files are expected to never be updated: they are read only once and then closed.

The compression-based implementation makes it possible to add support for other compressed file formats in the future.

This implementation currently has the following limitations:

  • No (correct) offset is reported when reading gzip files. This has the consequences listed below.
  • If a gzip file changes its size, the full file will be read again from the beginning.
  • If the reading of a gzip file is interrupted, the state will be set to the file size on closing the harvester, and the data will not be sent again.
  • Filebeat cannot detect whether content now inside a gzip file was previously read as a normal file, as the inode of the file changes.

This is an alternative implementation of #2227
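
For readers unfamiliar with the mechanics, reading a gzip log boils down to wrapping the opened file in a gzip reader before line-splitting. Below is a minimal self-contained sketch of that idea, not the PR's actual harvester code; the path is illustrative:

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"log"
	"os"
)

func main() {
	// Open the compressed log and wrap it in a gzip reader; the reader
	// then consumes decompressed lines until EOF and closes the file,
	// since gzip files are never expected to be appended to.
	f, err := os.Open("/var/log/app.log.gz") // illustrative path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	zr, err := gzip.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}
	defer zr.Close()

	sc := bufio.NewScanner(zr)
	for sc.Scan() {
		fmt.Println(sc.Text()) // each decompressed log line
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}
```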

@ruflin added the "Filebeat" and "in progress" labels on Nov 29, 2016
@@ -286,6 +285,15 @@ func (h *Harvester) close() {
	// If file was never opened, it can't be closed
	if h.file != nil {

		if h.config.Compression == config.GZipCompression {

Reviewer comment:

Why do we need this only if compression?

ruflin (Member Author) replied:

In all other cases, offset == size of the file once the harvester has finished reading. For gzip files the offset is > Size(). But to avoid having to change the prospector logic, it is set to the size of the file. It is kind of a hack, which means I should add a doc comment here.
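
To make the hack concrete, here is a minimal self-contained sketch of the idea; the function and field names are assumptions for illustration, not the PR's actual code:

```go
package harvester

import "os"

// Hedged sketch: for gzip, the decompressed read offset exceeds the
// on-disk size, so on close the harvester stores the file's size
// instead; the prospector then sees offset == size and considers the
// file fully read.
func gzipCloseOffset(file *os.File, readOffset int64) int64 {
	info, err := file.Stat()
	if err != nil {
		return readOffset // fall back to whatever we actually read
	}
	return info.Size() // hack: clamp offset to the compressed file size
}
```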


func NewFile(file *os.File, compression string) (FileSource, error) {

	if config.GZipCompression == compression {

Reviewer comment:

Overall I see multiple places in the code with logic specific to gzip. When a new compression type is supported, we'd need to modify all these places. Is it possible to have all the logic per compression type encapsulated in a single place?

ruflin (Member Author) replied:

We should definitely extract it and potentially abstract it if we introduce further compressions. But for now I think it is OK as long as we only have one special case. My usual approach is to start abstracting as soon as there are more than two specific implementations.
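
For illustration, one way such encapsulation could look is a decoder registry. This is a hedged sketch with assumed names (decoders, the io.ReadCloser return type), not the PR's actual API:

```go
package harvester

import (
	"compress/gzip"
	"io"
	"os"
)

// decoders maps a compression name to a constructor for its reader.
// Supporting a new format means adding exactly one entry here.
var decoders = map[string]func(io.Reader) (io.ReadCloser, error){
	"gzip": func(r io.Reader) (io.ReadCloser, error) { return gzip.NewReader(r) },
}

// NewFile wraps the raw file in the decoder for its compression type,
// or returns the file unchanged if no compression is configured.
func NewFile(file *os.File, compression string) (io.ReadCloser, error) {
	newDecoder, ok := decoders[compression]
	if !ok {
		return file, nil // no compression: read the file directly
	}
	return newDecoder(file)
}
```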

@ruflin force-pushed the gz-support branch 2 times, most recently from e022217 to 6ffb5a8 on January 31, 2017
@ruflin added the "review" label and removed the "in progress" label on Jan 31, 2017
@tsg (Contributor) commented Jan 31, 2017

Is the option intentionally not added to the full config?

@ruflin (Member Author) commented Jan 31, 2017

@tsg No, I think I added it at the wrong spot and ran make update ... Will add it again.

@ruflin (Member Author) commented Feb 3, 2017

We decided to enable gzip support in the first version only when the -once flag is enabled. This heavily reduces the possible edge cases.

One issue at the moment is that harvesters are not aware of the -once flag, while only harvesters are aware of the compression, so currently it's not possible to check both in one place. This needs further refactoring.

@ruflin added the "in progress" label and removed the "review" label on Feb 6, 2017
@ninja- commented Apr 19, 2017

any progress on this?

@ruflin (Member Author) commented Apr 20, 2017

@ninja- Not at the moment. We put this on hold as it didn't cover the use cases that were popping up. Can you share some details on how you would use this feature?

@ninja- commented Apr 26, 2017

@ruflin I wanted to properly support log rotation with gzip in a scenario where the logstash database could go down for a few hours, so it would be properly backfilled... Would it be better for now to disable log compression in my app and manually remove old files when free disk space gets low because of the logs?

@ruflin (Member Author) commented Apr 28, 2017

@ninja- That is exactly the case most people are looking into, but it would not work with this solution. The reason is that the inode of the gzip file is different from that of the previous file, so Filebeat would resend all data, as it can't tell that the zipped file and the unzipped file are the same.

For your use case, the question probably remains how many files you rotate before compression and whether losing some log lines is OK. If you are OK with losing log lines, have a look at close_timeout; see the config sketch below.
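
For reference, close_timeout is set per prospector in filebeat.yml. A minimal sketch for the 5.x config layout; the paths and timeout value are illustrative:

```yaml
filebeat.prospectors:
  - input_type: log
    paths:
      - /var/log/app/*.log
    # Force the harvester to close after 5 minutes, even if the file
    # is not fully read, so rotated files can be released.
    close_timeout: 5m
```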

@ruflin (Member Author) commented Oct 9, 2017

The implementation of prospectors has completely changed in Filebeat in the meantime: prospector types can now be registered. This also lays the foundation for a different implementation that makes this a separate prospector type. I'm closing this PR, but if we pick this up again, parts of the code in here can be reused for a new prospector type.

@ruflin closed this on Oct 9, 2017
@sgreszcz commented Oct 9, 2017

@ruflin is gzip support ever going to be developed as a feature of FB? Is there anywhere else to follow this topic?

@ruflin (Member Author) commented Oct 10, 2017

@sgreszcz For me this is definitely still on the list to be developed one day. Best follow this issue here: #637

@Denagar commented Nov 6, 2017

+1 for gz support

@Kareny commented Dec 7, 2017

Hi @ruflin, which version of beats is this change based on? I tried to apply the same change on version 5.6. Filebeat seems to be running, but nothing gets output. I'm using the file output to test.

Is there still some work needed to make it work? Can you provide some guidance? Thanks!

Here's the logs:

2017/12/07 20:02:16.664811 beat.go:297: INFO Home path: [/home/gzipbeat] Config path: [/home/gzipbeat] Data path: [/home/data] Logs path: [/home/gzipbeat/logs]
2017/12/07 20:02:16.664855 beat.go:192: INFO Setup Beat: filebeat; Version: 5.6.5
2017/12/07 20:02:16.664924 file.go:54: INFO File output path set to: /home
2017/12/07 20:02:16.664947 file.go:55: INFO File output base filename set to: gzipoutput
2017/12/07 20:02:16.664960 file.go:58: INFO Rotate every bytes set to: 10485760
2017/12/07 20:02:16.664977 file.go:62: INFO Number of files set to: 7
2017/12/07 20:02:16.664999 outputs.go:108: INFO Activated file as output plugin.
2017/12/07 20:02:16.664947 metrics.go:23: INFO Metrics logging every 30s
2017/12/07 20:02:16.665089 publish.go:300: INFO Publisher name: ip-10-10-2-93.us-west-2.compute.internal
2017/12/07 20:02:16.665737 async.go:63: INFO Flush Interval set to: -1s
2017/12/07 20:02:16.665753 async.go:64: INFO Max Bulk Size set to: -1
2017/12/07 20:02:16.665962 beat.go:233: INFO filebeat start running.
2017/12/07 20:02:16.665998 registrar.go:85: INFO Registry file set to: /home/gzipbeat/data/registry
2017/12/07 20:02:16.666041 registrar.go:106: INFO Loading registrar data from /home/gzipbeat/data/registry
2017/12/07 20:02:16.666210 registrar.go:123: INFO States Loaded from registrar: 2
2017/12/07 20:02:16.666268 crawler.go:38: INFO Loading Prospectors: 1
2017/12/07 20:02:16.666409 prospector_log.go:65: INFO Prospector with previous states loaded: 1
2017/12/07 20:02:16.666473 prospector.go:124: INFO Starting prospector of type: log; id: 17958342878063090343 
2017/12/07 20:02:16.666486 crawler.go:58: INFO Loading and starting Prospectors completed. Enabled prospectors: 1
2017/12/07 20:02:16.666511 registrar.go:236: INFO Starting Registrar
2017/12/07 20:02:16.666529 sync.go:41: INFO Start sending events to output
2017/12/07 20:02:16.666568 spooler.go:62: INFO Starting spooler: spool_size: 2048; idle_timeout: 5s
2017/12/07 20:02:16.666813 log.go:91: INFO Harvester started for file: /home/logs/test.log.gz
2017/12/07 20:02:46.665847 metrics.go:39: INFO Non-zero metrics in the last 30s: filebeat.harvester.open_files=1 filebeat.harvester.running=1 filebeat.harvester.started=1 publish.events=2 registrar.states.current=2 registrar.states.update=2 registrar.writes=1
2017/12/07 20:03:16.665347 metrics.go:34: INFO No non-zero metrics in the last 30s
^C2017/12/07 20:03:37.391476 filebeat.go:267: INFO Stopping filebeat
2017/12/07 20:03:37.391535 crawler.go:90: INFO Stopping Crawler
2017/12/07 20:03:37.391547 crawler.go:100: INFO Stopping 1 prospectors
2017/12/07 20:03:37.391583 prospector.go:180: INFO Prospector ticker stopped
2017/12/07 20:03:37.391594 prospector.go:232: INFO Stopping Prospector: 17958342878063090343
2017/12/07 20:03:37.391634 log.go:112: INFO Reader was closed: /home/logs/test.log.gz. Closing.
2017/12/07 20:03:37.391637 prospector.go:137: INFO Prospector channel stopped because beat is stopping.
2017/12/07 20:03:37.391684 crawler.go:112: INFO Crawler stopped
2017/12/07 20:03:37.391693 spooler.go:101: INFO Stopping spooler
2017/12/07 20:03:37.391708 registrar.go:291: INFO Stopping Registrar
2017/12/07 20:03:37.391716 registrar.go:248: INFO Ending Registrar
2017/12/07 20:03:37.394018 metrics.go:51: INFO Total non-zero values:  filebeat.harvester.closed=1 filebeat.harvester.started=1 publish.events=2 registrar.states.current=2 registrar.states.update=2 registrar.writes=2
2017/12/07 20:03:37.394041 metrics.go:52: INFO Uptime: 1m20.733076233s
2017/12/07 20:03:37.394055 beat.go:237: INFO filebeat stopped.

@ruflin (Member Author) commented Dec 8, 2017

@Kareny Seeing that my last rebase happened in February this year, I assume you would have to try it on top of 5.1 or 5.2. But now that we have better support for different prospectors, it's probably better to redo these changes on top of the 6.0 code base.

@Kareny commented Jun 13, 2018

@ruflin I tested this PR based on 5.2 and found the following issue on Windows:

  • If multiple gz logs are added together at the beginning, all logs can be ingested.
  • If the gz logs are added to the monitoring path one by one, only the first one gets ingested. For newly added logs, the offset advances, but no events are published. Here's a sample log:
2018-06-13T05:43:12Z DBG  Start harvester for new file: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz
2018-06-13T05:43:12Z WARN EXPERIMENTAL: Compression setting is enabled.
2018-06-13T05:43:12Z DBG  Setting offset for file based on seek: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz
2018-06-13T05:43:12Z DBG  Setting offset for file: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz. Offset: 0 
2018-06-13T05:43:12Z DBG  New state added for C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz
2018-06-13T05:43:12Z DBG  Check file for harvesting: C:\Users\Administrator\Desktop\test\cloud-init.log.gz
2018-06-13T05:43:12Z DBG  Update existing file for harvesting: C:\Users\Administrator\Desktop\test\cloud-init.log.gz, offset: 4250
2018-06-13T05:43:12Z DBG  File didn't change: C:\Users\Administrator\Desktop\test\cloud-init.log.gz
2018-06-13T05:43:12Z DBG  Prospector states cleaned up. Before: 2, After: 2
2018-06-13T05:43:12Z INFO Harvester started for file: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz
2018-06-13T05:43:12Z DBG  Source is not continuable: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz
2018-06-13T05:43:12Z INFO End of file reached: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz. Closing because close_eof is enabled.
2018-06-13T05:43:12Z DBG  Stopping harvester for file: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz
2018-06-13T05:43:12Z DBG  Closing file: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz
2018-06-13T05:43:12Z DBG  Update state: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz, offset: 3788
2018-06-13T05:43:12Z DBG  Flushing spooler because of timeout. Events flushed: 220
2018-06-13T05:43:17Z DBG  Flushing spooler because of timeout. Events flushed: 0

I tried the same test on Linux and it works. I'm not sure what to look for to fix this issue. Do you have any clues as to why this is happening?

@ruflin deleted the gz-support branch on June 13, 2018
@ruflin (Member Author) commented Jun 13, 2018

@Kareny You have to keep in mind that this PR was a proposal we decided not to pursue further, as too many things have changed. The PR was never fully tested and was more of a POC, so it might have quite a few flaws I haven't thought of yet.

Based on your log above, is there only one .gz file?

@Kareny commented Jun 13, 2018

@ruflin there are two .gz files; I added them one by one. Only the first spooler flush seems to work. In my other test, if I add a new gz file to the watched folder in the middle of harvesting the first file, only the events flushed the first time get published. In the following example only 203 events got published successfully.

I went through the Filebeat code from filebeat.go to spooler.go. I didn't notice a Windows-specific function that might be causing this issue on the Windows platform. Did I miss something?

The log provided in my last comment is for the second one I added to the monitoring path. Here's the full log:

2018-06-13T05:40:42Z INFO Publisher name: WIN-I5R5J89EMPU
2018-06-13T05:40:42Z INFO Flush Interval set to: 1s
2018-06-13T05:40:42Z INFO Max Bulk Size set to: 2048
2018-06-13T05:40:42Z DBG  create bulk processing worker (interval=1s, bulk size=2048)
2018-06-13T05:40:42Z INFO Flush Interval set to: 1s
2018-06-13T05:40:42Z INFO Max Bulk Size set to: 2048
2018-06-13T05:40:42Z DBG  create bulk processing worker (interval=1s, bulk size=2048)
2018-06-13T05:40:42Z INFO gzipbeat start running.
2018-06-13T05:40:42Z DBG  Windows is interactive: false
2018-06-13T05:40:42Z INFO No registry file found under: C:\Program Files\Exabeam\Windows Gzip Collector\data\registry. Creating a new registry file.
2018-06-13T05:40:42Z DBG  Write registry file: C:\Program Files\Exabeam\Windows Gzip Collector\data\registry
2018-06-13T05:40:42Z DBG  delete old: remove C:\Program Files\Exabeam\Windows Gzip Collector\data\registry.old: The system cannot find the file specified.
2018-06-13T05:40:42Z DBG  rotate to old: rename C:\Program Files\Exabeam\Windows Gzip Collector\data\registry C:\Program Files\Exabeam\Windows Gzip Collector\data\registry.old: The system cannot find the file specified.
2018-06-13T05:40:42Z DBG  Registry file updated. 0 states written.
2018-06-13T05:40:42Z INFO Loading registrar data from C:\Program Files\Exabeam\Windows Gzip Collector\data\registry
2018-06-13T05:40:42Z INFO States Loaded from registrar: 0
2018-06-13T05:40:42Z INFO Loading Prospectors: 1
2018-06-13T05:40:42Z DBG  exclude_files: []
2018-06-13T05:40:42Z INFO Prospector with previous states loaded: 0
2018-06-13T05:40:42Z WARN EXPERIMENTAL: Compression setting is enabled.
2018-06-13T05:40:42Z DBG  File Configs: [C:\\Users\\Administrator\\Desktop\\test\*]
2018-06-13T05:40:42Z INFO Loading Prospectors completed. Number of prospectors: 1
2018-06-13T05:40:42Z INFO All prospectors are initialised and running with 0 states to persist
2018-06-13T05:40:42Z INFO Starting Registrar
2018-06-13T05:40:42Z INFO Start sending events to output
2018-06-13T05:40:42Z INFO Starting spooler: spool_size: 2048; idle_timeout: 5s
2018-06-13T05:40:42Z DBG  Starting prospector 0
2018-06-13T05:40:42Z INFO Starting prospector of type: log
2018-06-13T05:40:42Z DBG  Start next scan
2018-06-13T05:40:42Z DBG  Check file for harvesting: C:\Users\Administrator\Desktop\test\cloud-init.log.gz
2018-06-13T05:40:42Z DBG  Start harvester for new file: C:\Users\Administrator\Desktop\test\cloud-init.log.gz
2018-06-13T05:40:42Z WARN EXPERIMENTAL: Compression setting is enabled.
2018-06-13T05:40:42Z DBG  Setting offset for file based on seek: C:\Users\Administrator\Desktop\test\cloud-init.log.gz
2018-06-13T05:40:42Z DBG  Setting offset for file: C:\Users\Administrator\Desktop\test\cloud-init.log.gz. Offset: 0 
2018-06-13T05:40:42Z DBG  New state added for C:\Users\Administrator\Desktop\test\cloud-init.log.gz
2018-06-13T05:40:42Z DBG  Prospector states cleaned up. Before: 1, After: 1
2018-06-13T05:40:42Z INFO Harvester started for file: C:\Users\Administrator\Desktop\test\cloud-init.log.gz
2018-06-13T05:40:42Z DBG  Source is not continuable: C:\Users\Administrator\Desktop\test\cloud-init.log.gz
2018-06-13T05:40:42Z INFO End of file reached: C:\Users\Administrator\Desktop\test\cloud-init.log.gz. Closing because close_eof is enabled.
2018-06-13T05:40:42Z DBG  Stopping harvester for file: C:\Users\Administrator\Desktop\test\cloud-init.log.gz
2018-06-13T05:40:42Z DBG  Closing file: C:\Users\Administrator\Desktop\test\cloud-init.log.gz
2018-06-13T05:40:42Z DBG  Update state: C:\Users\Administrator\Desktop\test\cloud-init.log.gz, offset: 4250
2018-06-13T05:40:47Z DBG  Flushing spooler because of timeout. Events flushed: 203
2018-06-13T05:40:47Z DBG  Publish: {
  "@timestamp": "2018-06-13T05:40:42.662Z",
  "beat": {
    "hostname": "WIN-I5R5J89EMPU",
    "name": "WIN-I5R5J89EMPU",
    "version": "5.2.3"
  },
  "input_type": "log",
  "message": "May 18 16:53:24 localhost cloud-init: Cloud-init v. 0.7.5 running 'init-local' at Fri, 18 May 2018 16:53:24 +0000. Up 10.68 seconds.",
  "offset": 133,
  "source": "C:\\Users\\Administrator\\Desktop\\test\\cloud-init.log.gz",
  "type": "log"
}
2018-06-13T05:40:47Z DBG  Publish: {
  "@timestamp": "2018-06-13T05:40:42.662Z",
  "beat": {
    "hostname": "WIN-I5R5J89EMPU",
    "name": "WIN-I5R5J89EMPU",
    "version": "5.2.3"
  },
  "input_type": "log",
  "message": "May 18 16:53:25 localhost cloud-init: Cloud-init v. 0.7.5 running 'init' at Fri, 18 May 2018 16:53:25 +0000. Up 11.38 seconds.",
  "offset": 260,
  "source": "C:\\Users\\Administrator\\Desktop\\test\\cloud-init.log.gz",
  "type": "log"
}
...
2018-06-13T05:43:12Z DBG  Run prospector
2018-06-13T05:43:12Z DBG  Start next scan
2018-06-13T05:43:12Z DBG  Check file for harvesting: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz
2018-06-13T05:43:12Z DBG  Start harvester for new file: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz
2018-06-13T05:43:12Z WARN EXPERIMENTAL: Compression setting is enabled.
2018-06-13T05:43:12Z DBG  Setting offset for file based on seek: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz
2018-06-13T05:43:12Z DBG  Setting offset for file: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz. Offset: 0 
2018-06-13T05:43:12Z DBG  New state added for C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz
2018-06-13T05:43:12Z DBG  Check file for harvesting: C:\Users\Administrator\Desktop\test\cloud-init.log.gz
2018-06-13T05:43:12Z DBG  Update existing file for harvesting: C:\Users\Administrator\Desktop\test\cloud-init.log.gz, offset: 4250
2018-06-13T05:43:12Z DBG  File didn't change: C:\Users\Administrator\Desktop\test\cloud-init.log.gz
2018-06-13T05:43:12Z DBG  Prospector states cleaned up. Before: 2, After: 2
2018-06-13T05:43:12Z INFO Harvester started for file: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz
2018-06-13T05:43:12Z DBG  Source is not continuable: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz
2018-06-13T05:43:12Z INFO End of file reached: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz. Closing because close_eof is enabled.
2018-06-13T05:43:12Z DBG  Stopping harvester for file: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz
2018-06-13T05:43:12Z DBG  Closing file: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz
2018-06-13T05:43:12Z DBG  Update state: C:\Users\Administrator\Desktop\test\cloud-init-output.log.gz, offset: 3788
2018-06-13T05:43:12Z DBG  Flushing spooler because of timeout. Events flushed: 220
2018-06-13T05:43:17Z DBG  Flushing spooler because of timeout. Events flushed: 0

@ruflin (Member Author) commented Jun 15, 2018

Let's take a step back here. What is your long-term plan to get the above working? Assuming you get it working, what comes next? As a lot of the underlying code has changed, you would be stuck with a self-built Filebeat version.

I suggest we rather spend the time figuring out the details of what should be built.

@Kareny commented Jun 16, 2018

My goal is to get this working first for the urgent needs we have now and wait for the official release later. I really appreciate your help. I'm pretty stuck at this point; if there's no further help I can get from here, I'll have to work on my own to figure it out...

I really hope that I can get help from the community.

@ruflin (Member Author) commented Jun 19, 2018

It's not that I don't want to help, but I would prefer to spend our time on getting this feature, or something similar, into a future release.

My way of approaching the above problem would be as follows: I would create additional system tests to reproduce the above problem. This would give you an easy way to reproduce the problem multiple times and very quickly. Based on your description, it seems like something goes wrong on a second scan. I would investigate why the file is not discovered, or, if it is discovered and opened, what content the harvester tries to read. I remember one of the trickiest parts for the gz files was hacking around the way our current state works, as the state that needs to be stored is for the .gz file and not the content inside. I hope this brings you a step closer to the solution.
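
As a concrete starting point, here is a hedged sketch of such a reproduction test in plain Go (helper names, paths, and the test layout are illustrative; Filebeat's actual system tests are organized differently):

```go
package gzsupport

import (
	"bufio"
	"compress/gzip"
	"os"
	"testing"
)

// writeGz writes lines into a fresh gzip file, simulating a rotated
// log dropped into the watched directory.
func writeGz(t *testing.T, path string, lines []string) {
	f, err := os.Create(path)
	if err != nil {
		t.Fatal(err)
	}
	defer f.Close()
	zw := gzip.NewWriter(f)
	defer zw.Close()
	for _, l := range lines {
		if _, err := zw.Write([]byte(l + "\n")); err != nil {
			t.Fatal(err)
		}
	}
}

// TestReadSecondGzFile adds two gzip files one after the other and
// checks that the second can still be read line by line, which is the
// scenario that failed on Windows.
func TestReadSecondGzFile(t *testing.T) {
	dir := t.TempDir()
	writeGz(t, dir+"/first.log.gz", []string{"line 1", "line 2"})
	writeGz(t, dir+"/second.log.gz", []string{"line 3"})

	f, err := os.Open(dir + "/second.log.gz")
	if err != nil {
		t.Fatal(err)
	}
	defer f.Close()
	zr, err := gzip.NewReader(f)
	if err != nil {
		t.Fatal(err)
	}
	defer zr.Close()

	var got int
	sc := bufio.NewScanner(zr)
	for sc.Scan() {
		got++
	}
	if got != 1 {
		t.Fatalf("expected 1 line, got %d", got)
	}
}
```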

@jundl77 commented Jul 7, 2018

For us too, a solution is urgently needed. @Kareny, have you made any progress? Have you built your own version of Filebeat?

@westernspion commented:
Bump for .gz support. A lot of what I'm trying to ship is saved on archive drives as zipped files. It would be handy to read these out and then stream them.
