
Fluentbit S3 PlugIn created the Files with Strange Character Names #2905

Closed
VF-mbrauer opened this issue Jan 5, 2021 · 19 comments

Assignees: PettitWesley
Labels: AWS (Issues with AWS plugins or experienced by users running on AWS)

Comments

@VF-mbrauer

Bug Report

Describe the bug
When the logs have been delivered to S3 storage, the filenames look strange compared to what I remember from Fluentd, where the filenames had meaningful names.
As an example, a name looks like this: 00-objecthy79ndzk

Is there any chance to get this customized, or is this something that will be corrected at some point?

To Reproduce

  • Rubular link if applicable:
  • Example log message if applicable:
Name              | Type | Last modified                            | Size     | Storage class
00-objectHfqpBisk |  -   | December 16, 2020, 01:28:24 (UTC+01:00) | 218.0 KB | Standard
00-objecthy79ndzk |  -   | December 16, 2020, 01:28:24 (UTC+01:00) | 156.4 KB | Standard
00-objecthYVixAQc |  -   | December 16, 2020, 01:38:24 (UTC+01:00) | 209.9 KB | Standard
00-objectiexMUMuZ |  -   | December 16, 2020, 01:43:24 (UTC+01:00) | 156.4 KB | Standard
00-objectjpyqHunt |  -   | December 16, 2020, 01:43:24 (UTC+01:00) | 216.0 KB | Standard
00-objectK4GBr4gA |  -   | December 16, 2020, 01:58:24 (UTC+01:00) | 209.9 KB | Standard
00-objectKcXoISsk |  -   | December 16, 2020, 01:48:24 (UTC+01:00) | 156.4 KB | Standard
00-objectlHBPIPDw |  -   | December 16, 2020, 01:18:24 (UTC+01:00) | 217.6 KB | Standard
00-objectlnjHyipg |  -   | December 16, 2020, 01:08:24 (UTC+01:00) | 209.9 KB | Standard
00-objectOcejgB8C |  -   | December 16, 2020, 01:13:24 (UTC+01:00)


  • Steps to reproduce the problem:

Expected behavior
nginx_logs-2020121601-0.json
nginx_logs-2020121602-0.json
nginx_logs-2020121603-0.json
nginx_logs-2020121604-0.json

We run the LTS version of Fluent Bit.

@VF-mbrauer
Author

@PettitWesley: Can you please have a look?

@edsiper: FYI.

@PettitWesley
Contributor

Can you attach your Fluent Bit config?

The -objectHfqpBisk is expected behavior. When Fluent Bit uses the PutObject API, it automatically appends a random string to the object name to ensure that the object name is unique; if the name were not unique, it would overwrite the old object with the same name.

In your case, it looks like somehow your S3 key is set so that the name is "00" (and then the other bit is appended for randomness).

In the future, we are planning to make the randomness customizable, but I think it is needed in the default experience to ensure users don't accidentally overwrite existing files in their bucket.

@PettitWesley PettitWesley self-assigned this Jan 5, 2021
@PettitWesley PettitWesley added the AWS Issues with AWS plugins or experienced by users running on AWS label Jan 5, 2021
@VF-mbrauer
Author

Hi @PettitWesley, please find attached an extract of the fluentbit config: fluentbit.txt

@VF-mbrauer
Author

Hi @PettitWesley,

I also tried to make the name more unique by adding minutes to the config. But still no change, even though the names are different:

2021-01-06 14:51:10   28623441 apiserver_logs/2021/01/06/1351.json
2021-01-06 15:01:17   15048952 apiserver_logs/2021/01/06/1356.json-objectpAHsm4Bj
2021-01-06 15:06:20   14100871 apiserver_logs/2021/01/06/1401.json-object7Vrz6rMv
2021-01-06 15:11:24   14434855 apiserver_logs/2021/01/06/1406.json-objectSLWK3DdP
2021-01-06 15:16:28   14210086 apiserver_logs/2021/01/06/1411.json-objectBlqLYtOx
2021-01-06 15:21:32   14122207 apiserver_logs/2021/01/06/1416.json-objectm2CyEAwh
2021-01-06 15:26:36   14450949 apiserver_logs/2021/01/06/1421.json-objectKuZnQwZ7
2021-01-06 15:31:40   14000509 apiserver_logs/2021/01/06/1426.json-objecta4SNOeaF

Now I would expect that the random suffix is not added, since the names already differ by default.

@PettitWesley
Contributor

Hey @VF-mbrauer I played with your config... I didn't get any names that looked like what you showed in your original comment "00-objectHfqpBisk". I am not sure how that happened.

Now I would expect that the random suffix is not added, since the names already differ by default.

What you show in that latest comment is expected behavior. It will always add the randomness and there is no way to turn it off.

As I mentioned, we are planning to change how this works in the future; S3 enhancements are tracked here: #2700

@PettitWesley
Contributor

I should note: the plugin only adds this randomness when it uses the PutObject API.

Looking at your config, you have it set to use multipart uploads when it can (that's the default). However, it can only use the multipart API when there is enough log data, because each part in a multipart upload must be at least 5 MiB. I see you have an upload timeout of 5 minutes; if you do not accumulate at least 5 MiB of logs in that time, it will use the PutObject API and will create files with randomness appended.

The logic behind this feature was that the PutObject API tends to be used when new files are being created more quickly, whereas the multipart API tends to create larger files more slowly. Since files are being created quickly in that case, we need to make sure their names are random.

I don't regret building it that way and I do not plan to remove this feature, but in the future we will make it more configurable.
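
To make the interplay concrete, here is a minimal sketch of an [OUTPUT] stanza (region, bucket, and the exact values are illustrative placeholders, not taken from the attached config). If fewer than roughly 5 MiB of logs accumulate before upload_timeout fires, that flush goes out via the PutObject API and gets the random suffix:

[OUTPUT]
    Name              s3
    Match             *
    # region and bucket are placeholders
    region            us-east-1
    bucket            example-bucket
    # multipart parts must be at least ~5 MiB, so this is the effective
    # threshold for starting a multipart upload
    upload_chunk_size 5M
    # if less than ~5 MiB of logs arrive within this window, the upload
    # falls back to PutObject and a random suffix is appended
    upload_timeout    5m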

@VF-mbrauer
Author

Hi @PettitWesley,

Hey @VF-mbrauer I played with your config... I didn't get any names that looked like what you showed in your original comment "00-objectHfqpBisk". I am not sure how that happened.

I think I know why we get this name. It is because of the hour (%H):
s3_key_format /k8s_logs/%Y/%m/%d/%H

So the 00 is the hour; I also have 01 and 02, up to 23, so basically the whole day. I just did not want to paste the logs from the complete day here. Maybe my fault; otherwise you would probably have seen that this is a result of the hour format %H.
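
For context, here is the relevant key line again as a minimal sketch; only the key format quoted above is used, nothing else is assumed:

# With the key ending in %H, every object uploaded during the same hour
# targets the same key, e.g. /k8s_logs/2021/01/06/00, so the random
# PutObject suffix is the only thing telling those objects apart.
s3_key_format /k8s_logs/%Y/%m/%d/%H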

But one last thing is still strange: the first entry of an hour is always fine.
Here is an example:

2021-01-06 11:16:40   28553728 apiserver_logs/2021/01/06/10
2021-01-06 11:06:35   14317385 apiserver_logs/2021/01/06/10-object0zFYOqVD
...
2021-01-06 12:32:39   14007158 apiserver_logs/2021/01/06/11
2021-01-06 12:37:43   14341676 apiserver_logs/2021/01/06/11-object5pEt65Bl
...
2021-01-06 13:28:20   28408077 apiserver_logs/2021/01/06/12
2021-01-06 13:58:42   14468679 apiserver_logs/2021/01/06/12-object0PZbe70u

OR

2021-01-06 19:09:30   28191815 apiserver_logs/2021/01/06/1809.json
2021-01-06 19:19:37   14377651 apiserver_logs/2021/01/06/1814.json-objectW59DtvnL
2021-01-06 19:24:41   13923075 apiserver_logs/2021/01/06/1819.json-object2rCGpXuq
...
2021-01-06 14:51:10   28623441 apiserver_logs/2021/01/06/1351.json
2021-01-06 15:01:17   15048952 apiserver_logs/2021/01/06/1356.json-objectpAHsm4Bj
2021-01-06 15:06:20   14100871 apiserver_logs/2021/01/06/1401.json-object7Vrz6rMv

If I understand you correctly, it should always append the randomized suffix to the object keys?

I don't regret building it that way and I do not plan to remove this feature, but in the future we will make it more configurable.

I can understand that this is part of the design and ensures that overwriting will not happen, but it looks odd for the customer to have those kinds of files instead of cleanly and properly named ones. So I would appreciate it if customers could adapt this more to their needs and get a cleaner view of the files, without cryptic characters in the names.
So if you plan something like this, as also mentioned in #2700, that would be great.

@PettitWesley
Contributor

I agree. So what's happening in your case is probably:

  • Multipart Upload (because enough data was collected before timeout): 28191815 apiserver_logs/2021/01/06/1809.json
  • PutObject Upload (because too little data was collected in time): 14377651 apiserver_logs/2021/01/06/1814.json-objectW59DtvnL
  • PutObject Upload (because too little data was collected in time): 13923075 apiserver_logs/2021/01/06/1819.json-object2rCGpXuq

PutObject uploads add the randomness.

I think the main reason your uploads are primarily being sent as PutObject uploads is that you have a very large upload_chunk_size of 50M. Normally folks use the default, which is close to 5M.

This means that Fluent Bit will not try to do a multipart upload until it has 50M of data. If it has less than that when the timeout passes, you will end up with PutObject uploads. It does this because a multipart upload requires at least 3 API requests; it's more expensive in terms of the work that needs to be done.

I would not necessarily recommend changing your config though; I am guessing you set a large chunk size to save S3 upload costs.

The solution in the future will be what is mentioned in that issue: we will find some way to make the filenames more configurable. We can also make sure that the file names are the same regardless of which API is used.

I think we will always require some randomness in the file name just for safety. However, we may allow you to configure where the randomness is placed in the file name, and how much randomness.

So imagine in the future all your names might look like: apiserver_logs/2021/01/06/1351-pAHsm4Bj.json

And you might set your S3 key as: /apiserver_logs/%Y/%m/%d/%H-${uuid}.json. That is the proposal in the issue: we require randomness for all files but let you specify where the randomness is placed using a special character sequence ${uuid}.
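
A sketch of what such a future configuration might look like, using the proposed ${uuid} placeholder (not released at this point; region and bucket are placeholders):

[OUTPUT]
    Name          s3
    Match         *
    # region and bucket are placeholders
    region        us-east-1
    bucket        example-bucket
    # proposed syntax: the randomness is placed explicitly inside the key
    s3_key_format /apiserver_logs/%Y/%m/%d/%H-${uuid}.json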

@github-actions
Contributor

github-actions bot commented Mar 6, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@PettitWesley
Contributor

@zhonghui12

@zhonghui12
Contributor

Hi @VF-mbrauer, we recently merged a PR for S3 key format enhancements. With this feature, a UUID and key extension can be specified, so you can customize the S3 key name further. More explanation can be found in this PR, and we will merge this one soon.

Thanks

@atitan

atitan commented Mar 23, 2021

Hi @zhonghui12, we updated our aws-for-fluent-bit to 2.12.0, which includes this extension enhancement.

And with a config that looks like this:

[OUTPUT]
    Name                          s3
    Match                         host.*
    region                        ${AWS_REGION}
    bucket                        ${S3_BUCKET}
    total_file_size               1M
    upload_timeout                10s
    use_put_object                On
    s3_key_format                 /${CLUSTER_NAME}_host/%Y/%m/%d/%H/$TAG-$UUID.gz
    s3_key_format_tag_delimiters  .-_
    compression                   gzip

But we are still getting random extension names:
[screenshot: S3 object listing showing keys with random suffixes]

Any ideas?

@zhonghui12
Contributor

Hi @atitan, the extension enhancement code hasn't been released by Fluent Bit yet: https://fluentbit.io/announcements/v1.7.2/. It should be included in the next release, so this feature is not supported for now.

@macropin

macropin commented Mar 24, 2021

@zhonghui12 are you sure? The documentation https://docs.fluentbit.io/manual/pipeline/outputs/s3 says that $UUID is supported.

Add $UUID in the format string to insert a random string.

@zhonghui12
Contributor

Hi @atitan @macropin, I am sorry for the mistake. The doc was updated before the feature was released. I have submitted a PR to revert the docs update: fluent/fluent-bit-docs#497. We will update the docs as soon as $UUID is available, and I will let you know.

Thanks again for your understanding.

@bgweber

bgweber commented Apr 2, 2021

Hi @zhonghui12, is there an ETA for the release? We are using Fluent Bit to write gzip-compressed files to S3 and plan on processing the files using PySpark (Databricks). Spark currently requires .gz to be specified as the file extension: https://issues.apache.org/jira/browse/SPARK-29280

Looking forward to this feature! In the meantime, we're able to use uncompressed files for our workflow.
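
Once the $UUID token ships, a sketch along the lines of the config @atitan posted above should keep .gz as the final extension that Spark sees (bucket, region, and cluster name are placeholders supplied via environment variables):

[OUTPUT]
    Name                          s3
    Match                         host.*
    region                        ${AWS_REGION}
    bucket                        ${S3_BUCKET}
    use_put_object                On
    compression                   gzip
    s3_key_format_tag_delimiters  .-_
    # $UUID before the extension keeps ".gz" as the final suffix
    s3_key_format                 /${CLUSTER_NAME}_host/%Y/%m/%d/%H/$TAG-$UUID.gz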

@zhonghui12
Contributor

Hello @bgweber, I have emailed @edsiper and I think the release should come by the end of this week.

@zhonghui12
Contributor

zhonghui12 commented Apr 6, 2021

Hello @bgweber @atitan @VF-mbrauer @macropin, the feature is released in 1.7.3: https://fluentbit.io/announcements/v1.7.3/. We have also updated the S3 docs page: https://github.com/fluent/fluent-bit-docs/blob/master/pipeline/outputs/s3.md. It may take some time for it to show up in the rendered documentation, but the feature is already available.

Thanks.

@agup006
Member

agup006 commented Apr 28, 2021

Closing as this is part of the 1.7.3 release

@agup006 agup006 closed this as completed Apr 28, 2021