Add support for gzip compressed files #3070
Conversation
filebeat/harvester/log.go (outdated)

@@ -286,6 +285,15 @@ func (h *Harvester) close() {
	// If file was never opened, it can't be closed
	if h.file != nil {

		if h.config.Compression == config.GZipCompression {
Why do we need this only if compression is enabled?
In all other cases, offset == size of the file once the harvester has finished reading. For gzip files the offset is > Size(). But to not have to change the prospector logic, it is set to the size of the file. It is kind of a hack, which means I should add a doc comment here.
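A minimal sketch of the hack described above; the field names (`h.state.Offset`, `h.file`) are assumptions based on the diff context, not the exact PR code:

```go
// On close, pin the gzip harvester's offset to the on-disk file size.
// The offset counts uncompressed bytes and can exceed the size of the
// .gz file, so without this the prospector would treat the file as
// changed and re-read it from the beginning.
func (h *Harvester) fixGzipOffsetOnClose() {
	if h.config.Compression != config.GZipCompression {
		return
	}
	if info, err := h.file.Stat(); err == nil {
		h.state.Offset = info.Size()
	}
}
```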
func NewFile(file *os.File, compression string) (FileSource, error) {
	...
	if config.GZipCompression == compression {
Overall I see multiple places in the code with logic specific to gzip. When a new compression type is supported, we'd need to modify all these places. Is it possible to have all logic per compression type encapsulated in a single place?
We should definitely extract it and potentially abstract it if we introduce further compressions. But for now I think it is ok as long as we only have one special case. My usual approach is to start abstracting as soon as there are more than two specific implementations.
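A hypothetical sketch (not code from this PR) of the encapsulation the review asks for: each compression type maps to a reader factory, so supporting a new format only means adding an entry to one table instead of touching every call site.

```go
package source

import (
	"compress/gzip"
	"io"
	"os"
)

// readerFactory wraps an opened file in a decompressing reader.
type readerFactory func(*os.File) (io.Reader, error)

var readers = map[string]readerFactory{
	"gzip":  func(f *os.File) (io.Reader, error) { return gzip.NewReader(f) },
	"plain": func(f *os.File) (io.Reader, error) { return f, nil },
}

// newReader picks the factory for the configured compression type.
// Unknown or empty compression falls back to plain reads.
func newReader(file *os.File, compression string) (io.Reader, error) {
	if factory, ok := readers[compression]; ok {
		return factory(file)
	}
	return file, nil
}
```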
Force-pushed from e022217 to 6ffb5a8
Is the option intentionally not added to the full config?
@tsg No, I think I added it at the wrong spot and ran …
We decided to enable … One issue with this at the moment is that harvesters are not aware of the …
This PR adds support for reading gzip files. As the gzip files are expected to be log files, it is part of the log type. All features of the log harvester type are supported. The main difference is that the files are expected to never be updated: they are read only once and then closed. The implementation with compression allows adding other compressed file formats in the future.

This implementation currently has the following limitations:

* No (correct) offset is reported for reading gzip files. This has the consequences listed below.
* In case a gzip file changes its size, the full file will be read again from the beginning.
* If the reading of a gzip file is interrupted, the state will be set to the file size on closing the harvester and the data will not be sent again.
* Filebeat cannot detect whether content now inside a gzip file was read before as a normal file, as the inode of the file changes.

This is an alternative implementation of elastic#2227
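To make the core mechanism concrete, here is a self-contained illustration (assumed file name, not PR code) of the reading path the PR adds: open the file, wrap it in a gzip reader from the standard library, and scan it line by line exactly once before closing.

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("app.log.gz") // hypothetical gzip log file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}
	defer gz.Close()

	// Read each decompressed line once; gzip files are never appended to,
	// so there is no need to keep the handle open and poll for new data.
	scanner := bufio.NewScanner(gz)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```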
Any progress on this?
@ninja- Not at the moment. We put this on hold as it didn't cover the use cases that were popping up. Can you share some details on how you would use this feature?
@ruflin I wanted to properly support log rotation with gzip, in a scenario where the Logstash database could go down for a few hours, so it would be properly backfilled... Would it be better for now to disable log compression in my app and manually remove old files when free disk space gets low because of the logs?
@ninja- That is exactly the case most people are looking into, but it would not work with this solution here. The reason is that the inode of the gzip file is different than that of the previous file, so Filebeat would resend all data, as it can't tell that the file unzipped and zipped is the same. For your use case, the question probably remains how many files you rotate before compression and whether losing some log lines is ok. If you are ok with losing log lines, have a look into …
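A small Linux-specific sketch (not PR code) of why the identity check fails: Filebeat keys file state by inode and device, and compressing a rotated file writes a brand-new inode.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"syscall"
)

// inode returns the inode number of a file (Linux-only type assertion).
func inode(path string) uint64 {
	info, err := os.Stat(path)
	if err != nil {
		log.Fatal(err)
	}
	return info.Sys().(*syscall.Stat_t).Ino
}

func main() {
	// Hypothetical paths: a rotated log and its compressed copy. The two
	// inodes differ, so Filebeat cannot tell they hold the same content.
	fmt.Println(inode("app.log.1"), inode("app.log.1.gz"))
}
```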
The implementation of prospectors completely changed in the meantime in Filebeat, which means prospector types can now be registered. This also lays the foundation for a different implementation that has this as a separate prospector type. I'm closing this PR, but if we pick this up again, parts of the code in here can be reused for a new prospector type.
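As a rough illustration of that registration pattern (names are hypothetical, not Filebeat's actual API), a gzip prospector could plug into a factory registry without touching the core:

```go
package prospector

import "fmt"

// Prospector is the minimal behaviour a registered type must provide.
type Prospector interface {
	Run()
	Stop()
}

// Factory builds a prospector from its raw configuration.
type Factory func(cfg map[string]interface{}) (Prospector, error)

var registry = map[string]Factory{}

// Register adds a factory under a type name, e.g. "log" or "gzip".
func Register(name string, factory Factory) {
	registry[name] = factory
}

// New instantiates a registered prospector type by name.
func New(name string, cfg map[string]interface{}) (Prospector, error) {
	factory, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown prospector type: %s", name)
	}
	return factory(cfg)
}
```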
@ruflin is gzip support ever going to be a developed feature as part of FB? Is there anywhere else to follow this topic?
+1 for gz support
Hi @ruflin, which version of Beats is this change based on? I tried to apply the same change on version 5.6. Filebeat seems to be running, but nothing gets output. I'm using the file output to test. Is there still some work needed to make it work? Can you provide some guidance? Thanks! Here are the logs:
@Kareny Seeing that my last rebase happened in February this year, I assume you would have to try on top of 5.1 or 5.2, I would think. But now that we have better support for different prospectors, it's probably better to try to redo these changes on top of the 6.0 code base.
@ruflin I tested this PR based on 5.2 and found the following issue on Windows:
I tried the same test on Linux and it works. I'm not sure what to look for to fix this issue. Do you have any clues as to why this is happening?
@Kareny You have to keep in mind that this PR was a proposal we decided not to pursue further, as too many things have changed. The PR was never fully tested and was more of a POC, so it might have quite a few flaws I haven't thought of yet. Based on your log above, is there only one .gz file?
@ruflin There are two .gz files. I added the gz files one by one. Only the first spooler flush seems to work. In my other test, if I add a new gz file to the watched folder in the middle of harvesting the first file, only the events flushed the first time get published. In the following example, only 203 events got published successfully. I went through the Filebeat code from filebeat.go to spooler.go. I didn't notice a Windows-specific function that might be causing this issue on the Windows platform. Did I miss something? The log provided in my last comment is for the second file I added to the monitored path. Here's the full log:
Let's take a step back here. What is your long-term plan to get the above working? Assuming you get it working, what comes next? As lots of the underlying code has changed, you will be stuck with a self-built Filebeat version. I suggest we rather spend the time on figuring out the details of what should be built.
My goal is to get this working first for the urgent needs we have now and wait for the official release later. I really appreciate your help. I'm pretty stuck at this point; if there's no further help I can get from here, I'll have to work on my own to figure it out... I really hope that I can get help from the community.
It's not that I don't want to help, but I would prefer to spend our time on getting the feature, or something similar, into a future release. My way of approaching the above problem would be as follows: I would create additional system tests to reproduce the above problem. This would give you an easy way to reproduce the problem multiple times and very quickly. Based on your description, it seems like something goes wrong on a second scan. I would investigate why the file is not discovered, or if it is discovered and opened, what content the harvester tries to read. I remember one of the trickiest parts for the gz files was hacking around the way our current state works, as the state that needs to be stored is for the .gz file and not the content inside. I hope this brings you a step closer to the solution.
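For the test-data side of such a reproduction, a small helper (all paths and counts are assumptions; the real Filebeat system tests are Python) that writes two gzip log files into a watched directory, the second one mid-run, to mimic the failing scenario:

```go
package main

import (
	"compress/gzip"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"
)

// writeGzLog writes n numbered log lines into a gzip file under dir.
func writeGzLog(dir, name string, n int) {
	f, err := os.Create(filepath.Join(dir, name))
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	gz := gzip.NewWriter(f)
	defer gz.Close() // flushes compressed data before the file handle closes
	for i := 0; i < n; i++ {
		fmt.Fprintf(gz, "line %d from %s\n", i, name)
	}
}

func main() {
	dir := "watched" // hypothetical path Filebeat is configured to scan
	if err := os.MkdirAll(dir, 0755); err != nil {
		log.Fatal(err)
	}
	writeGzLog(dir, "first.log.gz", 203)
	// Drop the second file while the first is presumably being harvested,
	// matching the report that only the first spooler flush is published.
	time.Sleep(2 * time.Second)
	writeGzLog(dir, "second.log.gz", 100)
}
```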
The two of us would also need an urgent solution. @Kareny, have you made any progress? Have you built your own version of Filebeat?
Bump for .gz support