Fluentbit S3 PlugIn created the Files with Strange Character Names #2905
@PettitWesley: Can you please have a look? @edsiper: FYI.
Can you attach your Fluent Bit config? In your case, it looks like somehow your S3 key is set so that the name is "00" (and then the other bit is appended for randomness). In the future, we are planning to make the randomness customizable, but I think it is needed in the default experience to ensure users don't accidentally overwrite existing files in their bucket.
Hi @PettitWesley, please find attached an extract of the Fluent Bit config: fluentbit.txt
Hi @PettitWesley, I also tried to make the name more unique by adding minutes to the config. Still no change, even though the names are different:
I would expect the randomized suffix not to be added, since the names already differ by default.
Hey @VF-mbrauer, I played with your config... I didn't get any names that looked like what you showed in your original comment ("00-objectHfqpBisk"). I am not sure how that happened.
What you show in that latest comment is expected behavior. It will always add the randomness, and there is no way to turn it off. As I mentioned, we are planning to change how this works in the future; S3 enhancements are tracked here: #2700
I should note: the plugin only adds this randomness when it uses the PutObject API. Looking at your config, you have it set to use multipart uploads when it can (that's the default). However, it can only use the multipart API when there is enough log data, because each part in a multipart upload must be at least 5 MiB. I see you have an upload timeout of 5 minutes; if you do not accumulate at least 5 MiB of logs in that time, it will use the PutObject API and will create files with randomness appended. The logic behind this feature was that when the PutObject API is used, new files tend to be created more quickly, whereas the multipart API tends to create larger files more slowly. Since files are being created quickly, we need to make sure their names are random. I don't regret building it that way and I do not plan to remove this feature, but in the future we will make it more configurable.
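To make the thresholds above concrete, here is a minimal `[OUTPUT]` sketch; the bucket name and values are hypothetical illustrations, not taken from the reporter's attached config:

```
[OUTPUT]
    Name               s3
    Match              *
    bucket             my-example-bucket   # hypothetical bucket name
    region             us-east-1
    upload_timeout     5m                  # flush pending data after 5 minutes
    upload_chunk_size  5M                  # multipart parts must be at least ~5 MiB
```

With settings like these, any flush that fires before roughly 5 MiB has accumulated falls back to the PutObject API, so the randomized suffix gets appended to the object name.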
Hi @PettitWesley,
I think I know why we get this name: it is because of the hour format (%H), so the 00 is the hour. I also have 01, 02, and so on through the whole day. I just did not want to paste all the logs from the complete day here; maybe my fault, otherwise you would probably have seen that this is a result of the hour format %H. But one last thing is still strange. The first entry of an hour is always fine:
OR
If I understand you correctly, it should always use randomized object keys?
I can understand that this is part of the design and is meant to make sure that overwriting will not happen, but it looks odd for the customer to have those kinds of files rather than cleanly and properly named ones. So I would appreciate it if customers could adapt this more to their needs and get a better cosmetic view of files without cryptic characters.
I agree, so what's happening in your case probably is:
PutObject uploads add the randomness. I think the main reason your uploads are primarily being sent as PutObject uploads is that you have a very large upload_chunk_size (50M). Normally folks use the default, which is close to 5M. This means that Fluent Bit will not try to do a multipart upload until it has 50M of data. If it has less than that when the timeout passes, you will end up with PutObject uploads. It does this because a multipart upload requires at least 3 API requests; it's more expensive in terms of the work that needs to be done. I would not necessarily recommend changing your config though; I am guessing you set a large chunk size to save S3 upload costs. The solution in the future will be what is mentioned in that issue: we will find some way to make the filenames more configurable. We can also make sure that the file names are the same regardless of which API is used. I think we will always require some randomness in the file name, just for safety. However, we may allow you to configure where the randomness is placed in the file name, and how much randomness. So imagine in the future all your names might look like: And you might set your S3 key as:
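As a sketch of the situation described above (all values hypothetical, not the reporter's exact config), a combination like this would send most uploads through PutObject:

```
[OUTPUT]
    Name               s3
    Match              *
    bucket             my-example-bucket   # hypothetical bucket name
    region             us-east-1
    upload_chunk_size  50M                 # multipart is not attempted below 50M of buffered data
    upload_timeout     5m                  # timeout usually fires before 50M accumulates
```

If fewer than 50M of logs arrive in any 5-minute window, nearly every flush takes the PutObject path and gets the random suffix appended.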
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Hi @VF-mbrauer, recently we merged a PR for S3 key format enhancements. With this feature, a UUID and key extension can be specified, so you can make the S3 key name more customized. More explanations can be found in this PR, and we will merge it soon. Thanks
Hi @zhonghui12, we updated our aws-for-fluent-bit to 2.12.0, which includes this extension enhancement, with a config that looks like this:
But we are still getting random extension names. Any ideas?
Hi @atitan, the extension enhancement code hasn't been released by Fluent Bit yet: https://fluentbit.io/announcements/v1.7.2/. It should be included in the next release, so this feature is not supported for now.
@zhonghui12 are you sure? The documentation https://docs.fluentbit.io/manual/pipeline/outputs/s3 says that
Hi @atitan @macropin, I am sorry for the mistake. The doc was updated before the feature was released. I have submitted a PR to revert the docs update: fluent/fluent-bit-docs#497. We will update the docs as soon as the feature is released. Thanks again for your understanding.
Hi @zhonghui12, is there an ETA for the release? We are using Fluent Bit to write gzip-compressed files to S3, and plan on processing the files using PySpark (Databricks). Spark currently requires .gz to be specified as the file extension: https://issues.apache.org/jira/browse/SPARK-29280 Looking forward to this feature! In the meantime, we're able to use uncompressed files for our workflow.
Hello @bgweber @atitan @VF-mbrauer @macropin, the feature is released in 1.7.3: https://fluentbit.io/announcements/v1.7.3/. We also updated the S3 docs: https://github.com/fluent/fluent-bit-docs/blob/master/pipeline/outputs/s3.md. It may take time for the change to appear on the documentation site, but the feature is already available. Thanks.
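For readers landing here later: with the 1.7.3 feature, the placement of the random part can be controlled via $UUID in s3_key_format. A minimal sketch, where the bucket name and key layout are illustrative assumptions rather than a recommended setup:

```
[OUTPUT]
    Name            s3
    Match           *
    bucket          my-example-bucket              # hypothetical bucket name
    region          us-east-1
    compression     gzip
    use_put_object  On
    s3_key_format   /nginx_logs/%Y%m%d%H/$UUID.gz  # random part and .gz extension placed explicitly
```

Placing $UUID and the .gz extension in s3_key_format yourself means the random string no longer gets appended after the extension, which also addresses the Spark requirement mentioned above.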
Closing as this is part of the 1.7.3 release.
Bug Report
Describe the bug
When the logs are delivered to S3 storage, the filenames look strange compared to what I remember
from Fluentd, where the filenames were meaningful.
As an example, a name looks like this: 00-objecthy79ndzk
Is there any chance to get this customized, or is this something that will be corrected at some point?
To Reproduce
Expected behavior
nginx_logs-2020121601-0.json
nginx_logs-2020121602-0.json
nginx_logs-2020121603-0.json
nginx_logs-2020121604-0.json
We run the LTS version of Fluent Bit.