
Add S3 bucket Output plugin #1004

Closed
amit-uc opened this issue Jan 2, 2019 · 50 comments

@amit-uc

amit-uc commented Jan 2, 2019

Feature:
I have always wanted to push my logs directly to an AWS S3 bucket.

Would it be possible to have an output plugin that pushes logs to an S3 bucket in real time, creating a file per day or per week? Similar to pushing logs to Elasticsearch.

@jpds

jpds commented Jan 29, 2019

This appears to have been previously discussed at #658 but it was closed with a workaround. I would prefer a direct way to log to S3.

@cosmo0920
Contributor

AWS provides aws-sdk-cpp (C++, not C).
If we provide a direct S3 bucket output plugin, we would need to use this C++ SDK and create a C wrapper around it.
Is that acceptable?

@cosmo0920
Contributor

I've also created an S3 bucket output plugin as a PoC which uses Fluent Bit's golang plugin interface:
https://github.com/cosmo0920/fluent-bit-go-s3

Anyone want to try it out?

@devils-ey3

I've also created an S3 bucket output plugin as a PoC which uses Fluent Bit's golang plugin interface:
https://github.com/cosmo0920/fluent-bit-go-s3

Anyone want to try it out?

It seems fine, but it would be better if you made a Dockerfile for it.

@cosmo0920
Contributor

I'd provided it:

@PettitWesley
Contributor

I'm working on a core C plugin for S3; hopefully I'll have a pull request up in a few weeks.

@PettitWesley
Contributor

Supporting Parquet as a data format has been requested as well: aws/aws-for-fluent-bit#31

@PettitWesley PettitWesley self-assigned this Mar 30, 2020
@dnascimento

@PettitWesley could you share your branch? Thanks for picking this issue!

@PettitWesley
Contributor

@dnascimento I'll put up a PR as soon as I have something working... and link it here.

My current, non-finished, non-working code is here: https://github.com/PettitWesley/fluent-bit/commits/s3-plugin

The only thing that works is the Sigv4 change; though I plan to refactor that.

If you or anyone else would like to try to finish the prototype, let me know. If not, wait about a month and I will finish it :)

@gebi

gebi commented Apr 27, 2020

It would be awesome to have compression support for upload to s3 too.

@jbnjohnathan

We are currently using Fluentd to compress and send logs directly to S3, but its CPU usage is pretty high. I've seen benchmarks showing Fluent Bit is faster, and since we want to send to S3, built-in support for this (including compression) would make it very attractive.

@PettitWesley
Contributor

PettitWesley commented Jun 8, 2020

Ok I lied when I said it would be a month... I'm now starting to work on this. We realized that first we need to make core changes to enable per-output buffering like what Fluentd has (that way you can send large chunks at a time).

What types of compression or output formats are highest priority?

@shinji62

shinji62 commented Jun 8, 2020

@PettitWesley Do you think this plugin will be able to create the file / S3 key dynamically? For example, I want to create only one file per pod, or similar.

@PettitWesley
Contributor

@shinji62

I was thinking S3 key name for the file would be something like {prefix}-{tag}-timestamp where prefix is set with some plugin config option like s3_key_prefix. Open to other ideas though.

There might also be some way to configure the size of files that are uploaded. That's the part I'm working on designing/figuring out first. How important do you (or anyone else) think that is?

I'm thinking there would be config options to set the size of files uploaded to S3. You do something like configure uploads every X minutes, or every X megabytes. This is similar to what you can do with Fluentd.

Fluentd lets you accomplish that by configuring a buffer stage per-output. The data is stored in memory or on disk between uploads.

We may or may not do that. I'm thinking it's best to store as little data locally as possible, and to get the data off to S3 as quickly as possible. Multipart uploads would allow that. Multipart lets you send a large file in chunks over a long period of time. You can send 10,000 chunks, which must be at least 5 MB each. So the S3 plugin could buffer data until it gets 5 MB, and then upload a part. The user could configure how many 5 MB chunks they want per file (define your S3 file size in increments of 5 MB). That way, if something goes wrong, you can never lose more than the last 5 MB of data.
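
A rough sketch of what that could look like in config terms (option names and values are illustrative only, not final syntax; the bucket and region are placeholders):

[OUTPUT]
    Name              s3
    Match             *
    bucket            my-bucket
    region            us-west-2
    upload_chunk_size 5M
    total_file_size   50M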

@shinji62

shinji62 commented Jun 8, 2020

@PettitWesley Understood. I think having "/" is better than "_", as in S3 "/" can be viewed as a folder.

@jbnjohnathan

@shinji62
We may or may not do that. I'm thinking it's best to store as little data locally as possible, and to get the data off to S3 as quickly as possible. Multipart uploads would allow that. Multipart lets you send a large file in chunks over a long period of time. You can send 10,000 chunks, which must be at least 5 MB each. So the S3 plugin could buffer data until it gets 5 MB, and then upload a part. The user could configure how many 5 MB chunks they want per file (define your S3 file size in increments of 5 MB). That way, if something goes wrong, you can never lose more than the last 5 MB of data.

One thing to consider, if you are thinking of keeping as little buffer as possible locally, is how many files will be created in S3.
I used Fluent Bit with a log forwarder to S3 and set Fluent Bit to send every 1 second. Each file in S3 then contained about 3 log rows, but after a while there were millions of files.
You cannot delete an S3 bucket unless you delete all the files first, and you can only delete so many files per request via Boto3. It took over an hour to delete all the files due to their sheer number.

@PettitWesley
Contributor

@jbnjohnathan Yeah, that's what I'm trying to solve, and multipart uploads would accomplish that. You can upload a chunk every time you accumulate 5 MB of logs (maybe a few minutes for an average user), but a file can be multiple chunks, and I believe a chunked upload can take place over hours/days, so the total file size could be huge, on the order of gigabytes.

@krushik

krushik commented Jun 11, 2020

For S3 logs, it's excellent when the prefix includes a configurable date pattern, e.g. "foo/%Y/%m/%d/" (something similar is done with Logstash_Prefix/Logstash_DateFormat in the elasticsearch output settings). It is also beneficial to be able to use ${MY_VAR} somewhere in the object name or at the end of the prefix. This way, we can pass in a hostname/ECS-task-id/pod-uid or whatever helps identify the specific copy of the application that generated the log, and simultaneously eliminate the possibility of collisions between multiple streams all going to the same prefix.
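
Something like this, purely hypothetical (assuming the plugin ends up supporting strftime patterns and ${ENV_VAR} substitution in the key; the bucket and region are placeholders):

[OUTPUT]
    Name          s3
    Match         *
    bucket        my-bucket
    region        us-west-2
    s3_key_format foo/%Y/%m/%d/${HOSTNAME}/%H%M%S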

@gebi

gebi commented Jun 12, 2020

What types of compression or output formats are highest priority?

We'd prefer zstd compression with jsonl format, with the compression spanning multiple jsonl lines (really a requirement for the compression to have any noticeable effect).

It would also be awesome to have config options to limit the output file size AND to set a flush interval (to limit the maximum age of a chunk when fewer log lines are coming in).

@tyrken

tyrken commented Jun 16, 2020

I'd +1 for json-lines format & accept simple stream-based (and so across-lines) gzip compression as easy/compatible.

One related note (sorry for a very slight diversion) is from a discussion I'm having about file naming and S3 Object Metadata on compressed files with another log-shipper, vectordotdev/vector#2769.

Try uploading a "dummy.log.gz" gzip'ed text file containing json-lines data with Content-Encoding: gzip and Content-Type: text/x-log as the AWS S3 PutObject API suggests is correct for a compressed file.

If you then view it via the AWS S3 Web Console, the "Open" button will take its cue from the Content-Encoding and transparently un-gzip the file for display, which is nice. However, the "Download" button will also decompress, but gives you a file still called "dummy.log.gz", which will be regarded as corrupt by at least Linux & Windows, since they use the file extension to pick a compressed-file viewer when opening it.

You might think the correct option is to NOT include the ".gz" extension but still set the Content-Encoding; yes, this then works fine in the browser for both "Open" and "Download" (though the file icon is now wrong), but the aws s3 CLI ignores object metadata and just gives you a "dummy.log" which is actually still gzipped.

There seems to be no combination of data/filename/settings which works in all cases. The best I can see is using Content-Type: application/x-gzip only, which gets the "Download" button and the CLI to give you the file you asked for (still compressed & named correctly), and "Open" then does merely the same as "Download". At least none of them is wrong in some way.

Maybe if we accepted naming all uploaded files "*.txt.gz" then we could use Content-Type: text/plain and Content-Encoding: gzip, then browser-enforced auto-renaming of that content-type to end in ".txt" would make both Open & Download work. It's just that extension for json-lines format seems wrong to me & the "Download" button still isn't just download, it's download-and-decompress.

@PettitWesley PettitWesley added this to the Fluent Bit v1.6 milestone Jul 16, 2020
@PettitWesley
Contributor

FYI: I am still actively working on this; it is tentatively planned for v1.6 release in September.

@sibidass

sibidass commented Jul 27, 2020

What types of compression or output formats are highest priority?

@PettitWesley Yes, we should have per-output buffering to properly flush out contiguous data pertaining to specific inputs. This will dramatically reduce the number of files created in S3 compared to a free-flowing flush (2M as per the current setup), which has a direct impact on cost.
Also, it would be really nice to see gzip compression, as it is a standard and is supported by Splunk endpoints; I believe other major log aggregators also support gzip (need to confirm). So for a full-fledged logging solution, if we need to transport from S3 to Splunk or any other aggregator, it would be seamless.

@shailegu

FYI: I am still actively working on this; it is tentatively planned for v1.6 release in September.

@PettitWesley Will the 1.6 release also have support for uploading logs via pre-signed S3 URLs?

@PettitWesley
Contributor

@shailegu Can you explain the use case? I am not very familiar with pre-signed URLs... it sounds like they only allow you to upload a single object per URL, which won't quite work with the plugin, since it uploads a new file every time it accumulates a set amount of data. The plugin needs to be able to upload multiple files.

@PettitWesley
Contributor

@kmajic It's still in progress... about 60% done right now... still planned for Fluent Bit 1.6, which is still planned for mid/end of September. Full code will be available in 2ish weeks for the first pre-release version for testing.

(As always, though, no hard guarantees; that's just the plan.)

@eyalengel-pagaya

Thanks for the update, we're waiting for it!

@tarunwadhwa13

@PettitWesley - do we have any update? Any pre-release version available for testing?

@PettitWesley
Contributor

PettitWesley commented Sep 14, 2020

@fujimotos Sorry, I forgot your comment.

The plugin is mostly done. Internally we are in the testing phase. Though I am still working on some small enhancements and fixes.

This branch will continue to be updated as I finalize things: https://github.com/PettitWesley/fluent-bit/tree/1_6-pen-test

If you want to start playing with it, it's essentially feature complete.

CC @tarunwadhwa13

@fujimotos
Member

@PettitWesley Thank you for the update. I'll check the branch out.

@PettitWesley
Contributor

PettitWesley commented Sep 23, 2020

@fujimotos @amit-uc @tarunwadhwa13 @eyalengel-pagaya @kmajic A Pre-release version is now available!!!

The image is here: 094266400487.dkr.ecr.us-west-2.amazonaws.com/aws-fluent-bit-1_6-preview:latest

Based on the code here (testing and review still in progress): https://github.com/PettitWesley/fluent-bit/tree/pen-test-fixes-2

That repository has the same permissions as the ECR repos we publish AWS for Fluent Bit to (public read, private write). You can download that image using any AWS account.

Please remember that this is a preview version. We look forward to your feedback and any bug reports. However, there is absolutely zero guarantee of support or anything for this image; do not use it in prod. (I added a log line that will print this every single flush, so that no one can possibly miss it).

Below is your documentation.

User Interface

Plugin Configuration Options

  • bucket: The S3 bucket name
    • Default value: None
  • region: The AWS region that your S3 bucket is located in
    • Default value: None
  • upload_chunk_size: This plugin uses the S3 Multipart Upload API to stream data to S3, ensuring your data gets off the box as quickly as possible. This parameter configures the size of each “part” in the upload. The total_file_size option configures the size of the file you will see in S3; this option determines the size of the chunks uploaded until that size is reached. Chunks are temporarily stored in chunk_buffer_dir until their size reaches upload_chunk_size, at which point the chunk is uploaded to S3.
    • Default value: 5 MB
    • Max: 50 MB, Min: 5 MB
  • chunk_buffer_dir: Local directory on disk to temporarily buffer data before uploading to S3. This plugin will never buffer more than a few megabytes on disk at a time; multipart uploads are used to achieve large file sizes in S3 with frequent uploads of chunks of data.
    • Default value: /fluent-bit/s3/
  • s3_key_format: Format string for keys in S3. This option supports strftime time formatters and a syntax, inspired by the rewrite_tag filter, for selecting parts of the Fluent log tag. Add $TAG in the format string to insert the full log tag; add $TAG[0] to insert the first part of the tag in the S3 key. The tag is split into “parts” using the characters specified with the s3_key_format_tag_delimiters option. See the in-depth examples and tutorial in the following section. Depending on your setup (say, a large k8s cluster), you may have hundreds of instances of Fluent Bit performing simultaneous uploads to the same S3 bucket. Therefore, you must think carefully and set your s3_key_format so that each upload will be unique. For example, if you only include the day, month, and year in your S3 key, then all uploads each day will have the same key and will overwrite each other. The default value for this parameter is safe in essentially all cases since it includes the full Fluent log tag and a timestamp with second precision. Note: whenever the PutObject API is used to send a smaller chunk of data, Fluent Bit will automatically append -object plus 8 random alphanumeric characters to the S3 key to ensure uniqueness (e.g. -object67ac4b55). The plugin will always use the PutObject API to send all locally buffered data chunks on shutdown; you can also enable/disable it for all uploads with the use_put_object option.
    • Default value: /fluent-bit-logs/$TAG/%Y/%m/%d/%H/%M/%S
  • s3_key_format_tag_delimiters: A series of characters which will be used to split the tag into “parts” for use with the s3_key_format option. See the in depth examples and tutorial in the following section.
    • Default value: .
  • total_file_size: Specify the total size of each of the files that you want to create in S3, using standard units; the minimum valid value is 1M (1 megabyte) and the maximum is 50G. For example, if you specify 50M, Fluent Bit will create a new file in S3 every time it has sent 50 megabytes of data. This parameter can be used with upload_timeout; an upload will be completed when the timeout is reached or when total_file_size has been reached, whichever happens first.
    • Default value: 100M
  • use_put_object: By default, this plugin uses the S3 multipart upload API to send data in chunks to S3. If this option is enabled, the S3 PutObject API will be used instead, and the max value for total_file_size is 50M. Remember, when this parameter is enabled, the entire file will be sent in one request, so a total_file_size of 50M means the request body will be 50M. For this reason, we recommend setting a low total_file_size when use_put_object is enabled.
    • Default value: Off
  • upload_timeout: Optionally specify a timeout for uploads using an integer number of minutes. Whenever this amount of time has elapsed, Fluent Bit will complete an upload and create a new file in S3. For example, set this value to 60 and you will get a new file in S3 every hour.
    • Default value: 10

Example Config

[OUTPUT]
    Name                  s3
    Match                 *
    bucket                my-bucket
    region                us-west-2
    total_file_size       250M
    s3_key_format         fluent-bit/$TAG/%Y/%m/%d/%H/%M/%S

Assume that the user is running Fluent Bit in FireLens and the tag is app-firelens-01dce3798d7c17a58. With these settings, Fluent Bit will create 250 MB files in the bucket my-bucket with file names like:
fluent-bit/app-firelens-01dce3798d7c17a58/2020/06/14/12/05/05
fluent-bit/app-firelens-01dce3798d7c17a58/2020/06/14/14/05/05

S3 Key Format Option

In Fluent Bit and Fluentd, all logs have an associated tag. The s3_key_format option lets you inject the tag into the s3 key using the following syntax:

  • $TAG => the full tag
  • $TAG[n] => the nth part of the tag (index starting at zero). This syntax is copied from the rewrite tag filter. By default, “parts” of the tag are separated with dots, but you can change this with s3_key_format_tag_delimiters.

In the example below, assume the date is January 1st, 2020 00:00:00 and the tag associated with the logs in question is my_app_name-logs.prod.

[OUTPUT]
    Name                         s3
    Match                        *
    bucket                       my-bucket
    region                       us-west-2
    total_file_size              250M
    s3_key_format                $TAG[2]/$TAG[0]/%Y/%m/%d/%H/%M/%S
    s3_key_format_tag_delimiters .-

With the delimiters as . and -, the tag will be split into parts as follows:

  • $TAG[0] = my_app_name
  • $TAG[1] = logs
  • $TAG[2] = prod
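
Putting those together with the date above, the resulting S3 key would look like:
prod/my_app_name/2020/01/01/00/00/00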

@Blaumer

Blaumer commented Sep 24, 2020

@PettitWesley It looks like you can't place the = character, and potentially other special characters, in the s3_key_format field. When doing so, I receive: CreateMultipartUpload API responded with error='SignatureDoesNotMatch', message='The request signature we calculated does not match the signature you provided. Check your key and signing method.'

@PettitWesley
Contributor

@Blaumer That's probably from some URL encoding which Fluent Bit is doing differently than S3 expects. I will look into it. However, I will note that S3 recommends not using special characters in key names: https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html

@Blaumer

Blaumer commented Sep 29, 2020

@PettitWesley Any update on this? Ideally I would not like to use special characters, but unfortunately hive partitioning in S3 requires using the = character in the object path.

@PettitWesley
Contributor

PettitWesley commented Oct 7, 2020

@Blaumer Support for using the = character in the S3 key will be included when this plugin launches in 1.6 (possibly in a few days).

I think actually that almost any special character should work, in the sense that it at least won't throw the error you saw. The issue was the way URI encoding was done in the Sigv4 code, which affected = since that's used in query parameters. I think I've fixed it now in a way that doesn't break any of the other AWS outputs.
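
For reference, a hive-partitioned key layout like the one you described should then be expressible with something along these lines (an illustration only, not tested config; the bucket and region are placeholders):

[OUTPUT]
    Name          s3
    Match         *
    bucket        my-bucket
    region        us-west-2
    s3_key_format /logs/$TAG/dt=%Y-%m-%d/%H-%M-%S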

Anyway- as a general note though- there will be things missing from this first launch which folks will want. Please submit feature requests, and then we will slowly get to them in future minor version bumps. (For example, no support for compression at first launch).

@PettitWesley
Contributor

The preview image has been updated with the latest version of the code in master.

@elrob

elrob commented Oct 19, 2020

@PettitWesley Thank you for your work on this.

Feature requests: 🙏

  • gzip compression support (I know this is mentioned above but this is currently a blocker for migrating to using fluent-bit with S3 for me so I want to express my desire for this)
  • make the position of the automatically added object key suffix configurable, so it is possible to have the key end with it (e.g. -objectqZ7jv9Qt.jsonl)
  • make it possible to disable the date injection into the output JSON. I can't seem to disable it, although I've tried, and I've even hunted through the source code to work out how it might be disabled.

@elrob

elrob commented Oct 19, 2020

For disabling the date I also tried using false, as in the stdout plugin, but that just set the JSON key to "false".
https://docs.fluentbit.io/manual/v/master/pipeline/outputs/standard-output#configuration-parameters

@PettitWesley
Contributor

@elrob Yeah, unfortunately I don't think you can disable injecting the date key.

make the position of the automatically added object key suffix configurable, so it is possible to have the key end with it (e.g. -objectqZ7jv9Qt.jsonl)

For this one, I'm considering adding another special format string in the S3 key, $UUID (or maybe $RANDOM), which will give you some number of random characters. If you enable use_put_object then having $UUID in the S3 key would be required.

That's not a perfect solution though...

The PutObject API is called under two circumstances:

  1. Normal uploads when you explicitly enable it with use_put_object.
  2. When Fluent Bit is stopped/restarted and there is leftover data to send.

In both cases I want to force some sort of UUID interpolation to ensure the key is unique. I suppose one thing I could do is split the S3 Key on . and then add the UUID before the last piece (if there were dots in the key). That way if you have an S3 key in the form of something.extension the UUID will come before the extension.

Another option would just be to include the $UUID special format string and require that it is always used.

Thoughts?

@PettitWesley
Contributor

Oh also in case everyone hasn't realized... this was released in 1.6.

I am going to close this and open a new issue for S3 output enhancements.

caleb15 added a commit to caleb15/fluent-bit that referenced this issue Nov 2, 2020
recently got added - see fluent#1004
caleb15 added a commit to caleb15/fluent-bit that referenced this issue Nov 18, 2020
recently got added - see fluent#1004

Signed-off-by: caleb15 <caleb@15five.com>
PettitWesley pushed a commit that referenced this issue Nov 19, 2020
recently got added - see #1004

Signed-off-by: caleb15 <caleb@15five.com>
edsiper pushed a commit that referenced this issue Nov 20, 2020
recently got added - see #1004

Signed-off-by: caleb15 <caleb@15five.com>
@Eliasi1

Eliasi1 commented Dec 2, 2020

Hello,
I am trying to use the s3 output plugin for Fluent Bit in a Docker container. The container is up and running, but nothing is being written to S3. The output shows that Fluent Bit failed to initialize the 's3' plugin:
[screenshot of Fluent Bit output showing the s3 plugin initialization error]

Below is my Fluent Bit configuration:
[SERVICE]
    Flush        5
    Daemon       Off
    parsers_file parsers.conf
    Log_Level    debug

[INPUT]
    Name   forward
    Listen 0.0.0.0
    Port   24224
    tag    general

[INPUT]
    Name tail
    Path /fluent-bit/log/test/*.log
    tag  sample

[OUTPUT]
    Name                         s3
    Match                        *
    use_put_object               true
    Bucket                       ohad-test
    Region                       eu-west-1
    s3_key_format                /fluentbit/%Y/%m/%d/%H/%M/%S
    s3_key_format_tag_delimiters .-_

Maybe it is a problem of authentication? How is Fluent Bit authorized to upload logs to the S3 bucket?

@Eliasi1

Eliasi1 commented Dec 2, 2020

Update:
I removed the "use_put_object true" line from the output, so the plugin will use the multipart upload API, and started the container again. The s3 plugin initialized successfully, but no credentials were found. I put the credentials in the /root/.aws/credentials file in the container and it's now working.
I have sample log files in Fluent Bit that need to be shipped to S3, but for some reason nothing is shipped.
Below are the logs I see:

[screenshot of Fluent Bit logs]

The S3 bucket is empty. I am able to put an object using the aws s3 CLI with the same credentials.
What could be the problem?

@PettitWesley
Contributor

@Eliasi1 That screenshot doesn't tell me much... I don't see any errors.

With the multipart upload API, you will not see any data in S3 until all parts have been uploaded and the upload is marked as complete.

You can of course instead use the PutObject API. As shown in the error log from your first screenshot, when you use PutObject you must set a total_file_size which is below 50 megabytes.

If you feel data is taking too long to get into S3, then your file size is probably too large for your data ingestion rate. You are probably only ingesting a few MB of logs, so the plugin is buffering them and waiting to upload.

One option is to use the upload_timeout option to specify a max time window for the upload completion.
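
For example, a config along these lines (the values are only an illustration for a low-volume setup) would give you small files at a predictable cadence:

[OUTPUT]
    Name            s3
    Match           *
    use_put_object  true
    bucket          ohad-test
    region          eu-west-1
    total_file_size 10M
    upload_timeout  5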

Please read the docs: https://docs.fluentbit.io/manual/pipeline/outputs/s3

@Eliasi1

Eliasi1 commented Dec 3, 2020

Thank you for the reply,
The size of the log files is about 30 KB each. The point I want to test is just posting to S3.
If I change the Fluent Bit input plugin to "dummy", it works well and the dummy logs are posted to the S3 bucket,
but the tail input, shipping logs from local storage, does not work.
Although, when sniffing packets on my host with Wireshark, it seems that there is communication with the Amazon endpoint.
[screenshot of the Wireshark capture]
Maybe I am doing something wrong in my config?

@PettitWesley
Contributor

@Eliasi1 This might be because the behavior of tail changed recently in 1.6. It is now more like the tail Linux command: it starts reading files from the end and only reads new content.

If you want to read from the start of a log file, you have to add read_from_head On to your tail config I think.
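
Something like this for your tail input (Path and tag taken from your config above; read_from_head is the only addition):

[INPUT]
    Name           tail
    Path           /fluent-bit/log/test/*.log
    tag            sample
    read_from_head On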

@Eliasi1

Eliasi1 commented Dec 11, 2020

@PettitWesley Hey, and thank you for the reply.
Apparently you were right, and adding the parameter solved my problem. I just couldn't find it in the Fluent Bit docs.
Thank you for the help!

@nithin-kumar
Contributor

Yes, pre-signed URLs are one-time use only. We want to use different pre-signed URLs periodically for uploading the logs, probably by updating the OUTPUT section in fluent-bit.conf dynamically (not sure if that's possible without restarting Fluent Bit, still exploring). The receiver will then extract the logs to get business value out of them.

@shailegu Do we have a workaround to achieve this, i.e. to support pre-signed POST?
