
S3 Output Compression not working #3676

Closed
justchris1 opened this issue Jun 23, 2021 · 31 comments

@justchris1

Bug Report

Describe the bug
Using td-agent-bit version 1.7.8 with the S3 output, the compression setting seems to be ignored, even with use_put_object true.

To Reproduce
Here is my configuration of the output s3 block.

[OUTPUT]
    name s3
    match *
    region us-east-2
    bucket my-bucket-name
    s3_key_format /fluent-bit-logs/$TAG/%Y/%m/%d/%H/%M/%S/$UUID.gz
    use_put_object On
    total_file_size 40M
    upload_timeout 1m
    compression gzip

Regardless of whether the compression setting is missing (implying none) or set to gzip, the uploaded files are always cleartext / uncompressed.

Expected behavior
Logs uploaded would be compressed with gzip before upload.

Your Environment

  • Version used: 1.7.8
  • Configuration: (See above)
  • Environment name and version (e.g. Kubernetes? What version?): RPM install
  • Server type and version: AWS t3a instance
  • Operating System and version: Centos 8, fully patched as of 2021-06-23
  • Filters and plugins: none

I can find nothing in the error logs about a failed compression. On every upload I get a 'happy' message: Successfully uploaded object. However, the file is still cleartext. I saw references in @PettitWesley's thread in #2700 that this was working, so I am unsure whether this is a regression or something else.
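
A quick way to check what is actually stored in S3, independent of any browser-side decoding (a sketch; the bucket and key below are placeholders):

$ aws s3 cp s3://my-bucket-name/fluent-bit-logs/example/ABCDEFGH.gz ./check.gz
$ file check.gz     # a gzipped upload should report "gzip compressed data"
$ gzip -t check.gz  # exits 0 only if the file is valid gzip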

@mtparet

mtparet commented Jun 24, 2021

I have the same issue. I am wondering whether setting the content encoding to gzip is the problem. Does S3 automatically decompress the file on its side?

@justchris1
Author

> I have the same issue. I am wondering whether setting the content encoding to gzip is the problem. Does S3 automatically decompress the file on its side?

I know of no AWS S3 function that would be capable of doing that. S3 is just an object store. I verified this issue by downloading the S3 object directly after it was uploaded to eliminate the fluentd input that was pulling it down as the source of the problem.

@justchris1
Author

justchris1 commented Jun 24, 2021 via email

@canidam

canidam commented Jul 6, 2021

I have the same issue using fluent/fluent-bit:1.7.9. Any idea if the configuration is wrong, or this is an actual bug?

Is it possible there's a threshold for compression? For example, if the file is less than 1 KB, does it skip the compression step?

@justchris1
Author

> I have the same issue using fluent/fluent-bit:1.7.9. Any idea if the configuration is wrong, or this is an actual bug?

Nope. I haven't seen anyone from the project even acknowledge the issue.

@PettitWesley
Contributor

@DrewZhang13 @zhonghui12

@DrewZhang13
Contributor

ACK, the issue is reproduced with the same config and Fluent Bit version.
The uploaded file is always cleartext / uncompressed.

@DrewZhang13
Contributor

DrewZhang13 commented Jul 9, 2021

@justchris1 @mtparet After more testing with the same config provided in this issue, the file is automatically decompressed when downloaded on a MacBook, but the file uploaded to S3 is compressed, based on a size comparison.
So I did not see an actual uncompressed-file issue on my testing machine.

Could you compare the size of the file in S3 with the size of the local file before upload, to confirm whether it is really uncompressed?
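
For a size comparison that is independent of any client-side decoding, aws s3api head-object reports the object exactly as stored (a sketch; bucket and key are placeholders):

$ aws s3api head-object --bucket my-bucket-name --key fluent-bit-logs/example/ABCDEFGH.gz
# "ContentLength" is the stored size; for a gzipped upload it should be much
# smaller than the decompressed size of the same data on the sending host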

@justchris1
Author

> @justchris1 @mtparet After more testing with the same config provided in this issue, the file is automatically decompressed when downloaded on a MacBook, but the file uploaded to S3 is compressed, based on a size comparison.
> So I did not see an actual uncompressed-file issue on my testing machine.
>
> Could you compare the size of the file in S3 with the size of the local file before upload, to confirm whether it is really uncompressed?

@DrewZhang13 - When I was debugging this, I eliminated the automated ingestion of the file into fluentd on the other side. To confirm it was uncompressed, I downloaded the file directly from S3 after it was uploaded by fluent-bit. When I inspect the file stored in S3, it is uncompressed. S3 has no 'auto-compress' or 'uncompress' functions, so downloading it represents what was stored in S3. The content is plaintext & readable.

@DrewZhang13
Contributor

@justchris1 I have verified from both a MacBook and Linux, and I don't see a similar uncompressed situation on my side.
I used the same configuration you provided and compared the file I downloaded from S3.
The downloaded file is 679 B before decompression and 31 KB after decompression.
I can use vi to see cleartext even when the file is compressed. I wonder if that is what you mean by cleartext?

@justchris1
Author

> I can use vi to see cleartext even when the file is compressed. I wonder if that is what you mean by cleartext?

No, I meant a 'dumb' text editor like Windows Notepad. When I download, from the AWS console, the file that fluent-bit uploaded with the configuration shown in this issue, I can open it in Notepad immediately and see clear text.

@canidam

canidam commented Jul 18, 2021

@DrewZhang13
There's some weird behavior here. I have a file on S3 with a size of 2.3 KB.
When I download it on my Mac, the size grows to 17 KB and the file type is JSON data:

➜ file 0N285h0f.gz
0N285h0f.gz: JSON data

I did another test. I use fluentd to consume these log files: when I use the text type it prints binary data, and when I use the gzip type it works. So I guess compression works, but something weird is going on when downloading the objects from S3 on a Mac.

@DrewZhang13
Contributor

@canidam Yeah, the Mac will automatically decompress the file when you download it from S3. I think this is why you are seeing the weird behavior.

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Aug 19, 2021
@justchris1
Author

I still see this behavior. Please do not close.

@github-actions github-actions bot removed the Stale label Aug 21, 2021
@github-actions
Contributor

github-actions bot commented Oct 1, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Oct 1, 2021
@ssc-ksaitou

ssc-ksaitou commented Oct 1, 2021

This is caused by the Content-Encoding: gzip attribute that fluent-bit sets on the uploaded log .gz file.

[screenshot: the S3 object metadata showing Content-Encoding: gzip]

When you download a file tagged with Content-Encoding: gzip, the user agent (e.g. Chrome, or curl with --compressed) will automatically decode the content, just as it does for any gzipped HTTP stream, because the Content-Encoding: gzip header is returned in the response.
Yes, the file has actually been compressed on S3.

An easy workaround is to simply remove the .gz extension from s3_key_format.

There seems to be no way to turn off Content-Encoding: gzip.
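
One way to see this metadata without a browser (a sketch; bucket and key are placeholders):

$ aws s3api head-object --bucket my-bucket --key fluent-bit-logs/example/ABCDEFGH.gz
# the output should include "ContentEncoding": "gzip", which is what triggers
# the automatic decoding in browsers; the stored bytes themselves remain gzipped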

@github-actions github-actions bot removed the Stale label Oct 5, 2021
@github-actions
Contributor

github-actions bot commented Nov 5, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Nov 5, 2021
@justchris1
Author

This would not explain why I get parsing errors in fluentd when compression is turned on (with the fluentd side configured to expect compressed input), yet everything works immediately after changing only the fluentd side to expect uncompressed input. I still see this behavior. Please do not close.

@github-actions github-actions bot removed the Stale label Nov 9, 2021
@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Dec 10, 2021
@justchris1
Author

I still see this behavior. Issue is not resolved.

@gjirm

gjirm commented Dec 13, 2021

I see this behavior on multiple systems too with Fluent Bit 1.8.10 (standalone and in a Docker container). I also experienced it on previous 1.8.x versions.

This is my config:

[OUTPUT]
    Name                            s3
    Match                           auth
    bucket                          server-logs
    region                          eu-west-1
    tls                             On
    s3_key_format                   /auth-logs/$TAG/%Y/%m/%d/h%H/%M-$UUID.gz
    s3_key_format_tag_delimiters    .-_
    compression                     gzip
    use_put_object                  On
    total_file_size                 50M
    upload_timeout                  10m

@github-actions github-actions bot removed the Stale label Dec 15, 2021
@Spritekin

Spritekin commented Feb 23, 2022

I don't think this is working.
I have a similar configuration to the ones reported before:

        [OUTPUT]
            Name s3
            Match *
            bucket mybucket
            region ap-southeast-2
            store_dir /home/ec2-user/buffer
            s3_key_format /fluentbit/$TAG[2]/$TAG[0]/%Y/%m/%d/%H/%M/%S/$UUID.gz
            s3_key_format_tag_delimiters .-
            compression gzip
            use_put_object On
            total_file_size 50M

And I get my files in S3, e.g.:
s3://mybucket/fluentbit/log/kube/2022/02/23/01/35/48/15BRQR03.gz

Then I select the file in S3 and in the object actions I select "Query with S3 Select"
[screenshot: the "Query with S3 Select" option in the S3 object actions menu]

In the S3 Select dialog I configure it like this:

[screenshots: S3 Select input settings with JSON (one record per line) and GZIP compression selected]

Notice that I select one JSON record per line and GZIP compression, since that is the expected format; however, it returns an error saying GZIP is not applicable.

However, if I change the compression to None, I get a proper response on the same query:
[screenshot: the same S3 Select query succeeding with compression set to None]

Although I am on a Mac, these queries run inside AWS and the files never touch my laptop, so I can say with some certainty that the files are not being gzipped.
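
A rough CLI equivalent of that S3 Select test (a sketch; bucket and key are the example values above), which also avoids the console and any local machine:

$ aws s3api select-object-content \
    --bucket mybucket \
    --key fluentbit/log/kube/2022/02/23/01/35/48/15BRQR03.gz \
    --expression "SELECT * FROM S3Object s LIMIT 5" \
    --expression-type SQL \
    --input-serialization '{"JSON": {"Type": "LINES"}, "CompressionType": "GZIP"}' \
    --output-serialization '{"JSON": {"RecordDelimiter": "\n"}}' \
    records.json
# if the object really is gzipped JSON lines, records.json gets data back;
# if it is plaintext, the call fails just like the console query did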

@marcosdiez
Contributor

marcosdiez commented Feb 28, 2022

It works for me (i.e. I checked on S3 and the results are gzipped; Athena can read them because the files end with .gz).
I am using Ubuntu 20.04 and I got Fluent Bit v1.8.12 from the official deb package (https://packages.fluentbit.io/ubuntu/focal).

Here are my settings:

[OUTPUT]
    name s3
    match *
    bucket XXXXXXXXXXXXX
    region us-east-1
    s3_key_format /prod-sslv-nginx/$TAG/%Y/%m/%d/%H/%M/%S-$UUID.gz
    total_file_size 1M
    upload_timeout 1m
    compression gzip

@Spritekin

Spritekin commented Feb 28, 2022

@marcosdiez

Sure, I tried that; please see my test configuration above. Maybe it has been fixed, but I was using a recent Helm installation.

  repository = "https://fluent.github.io/helm-charts"
  chart      = "fluent-bit"
  version    = "0.19.19"

One thing: I'm not sure the configuration you use works, as I'm quite sure you need to set the "use_put_object On" option (when I omitted it, I got an error saying I had to turn it on and the container wouldn't start). If you don't get that error, then it's another sign the version you are testing might have been updated.

@PettitWesley
Contributor

@Spritekin Compression has only ever worked with use_put_object On.
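
For reference, the minimal pairing looks like this (a sketch; region and bucket are placeholders, the other values mirror the configs earlier in this thread):

[OUTPUT]
    name            s3
    match           *
    region          us-east-1
    bucket          my-bucket
    use_put_object  On
    compression     gzip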

@Spritekin

@PettitWesley
I'm not claiming otherwise; as you can see in my analysis above, the flag is configured. My comment was because Marcos submitted a configuration with gzip compression enabled but no "use_put_object On" option and said it worked fine. I was just pointing out that his config would be wrong because the use_put_object flag was not set.

@logston

logston commented Mar 16, 2022

$ aws --profile 1234567890 s3 cp s3://mybucket/path/to/file/ItWLhdDe.log.gz ~/Downloads/ItWLhdDe.log.gz
$ ls -la ~/Downloads/ItWLhdDe.*
-rw-r--r--  1 paul  staff  2554 Mar 15 17:00 /Users/paul/Downloads/ItWLhdDe.log.gz
$ gunzip ~/Downloads/ItWLhdDe.log.gz
$ ls -la ~/Downloads/ItWLhdDe.*     
-rw-r--r--  1 paul  staff  22264 Mar 15 17:00 /Users/paul/Downloads/ItWLhdDe.log

WHY CHROME, WHY!?

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Jun 14, 2022
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@xposionn

For anyone still facing this issue:
After a deep dive, it seems that compression does work, but looking at the response headers, my .log file (which was compressed into .gz) has a content type of application/octet-stream.
After adding content_type text/plain to the s3 output plugin config, the downloaded file ends in .txt instead of .gz (it was compressed on S3 but decompressed after downloading, with the wrong file extension).
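
For reference, that change applied to the earlier configs looks like this (a sketch; region and bucket are placeholders, and content_type is the option this comment refers to):

[OUTPUT]
    name            s3
    match           *
    region          us-east-1
    bucket          my-bucket
    use_put_object  On
    compression     gzip
    content_type    text/plain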
