Add S3 bucket Output plugin #1004
This appears to have been previously discussed at #658 but it was closed with a workaround. I would prefer a direct way to log to S3. |
aws-sdk-cpp (not C) is provided. |
I've also created an S3 bucket Output plugin as a PoC which uses fluent-bit's golang plugin interface: Anyone want to try it out? |
It seems fine, but it could be better if you made a Dockerfile for it. |
I'm working on a core C plugin for S3; hopefully I'll have a pull request up in a few weeks. |
Supporting Parquet as a data format has been requested as well: aws/aws-for-fluent-bit#31 |
@PettitWesley could you share your branch? Thanks for picking this issue! |
@dnascimento I'll put up a PR as soon as I have something working... and link it here.. My current, non-finished, non-working code is here: https://github.com/PettitWesley/fluent-bit/commits/s3-plugin The only thing that works is the Sigv4 change; though I plan to refactor that. If you or anyone else would like to try to finish the prototype, let me know. If not, wait about a month and I will finish it :) |
It would be awesome to have compression support for upload to s3 too. |
We are currently using fluentd to compress and send logs directly to S3, but the CPU usage on it is pretty high. I've seen benchmarks showing Fluent Bit is faster, but as we want to send to S3, built-in support for this (including compression) would make it very attractive. |
Ok I lied when I said it would be a month... I'm now starting to work on this. We realized that first we need to make core changes to enable per-output buffering like what Fluentd has (that way you can send large chunks at a time). What types of compression or output formats are highest priority? |
@PettitWesley Do you think this plugin will be able to create the file / key (S3) dynamically? For example, I want to create only one file per pod or similar. |
I was thinking the S3 key name for the file would be something like

There might also be some way to configure the size of files that are uploaded. That's the part I'm working on designing/figuring out first. How important do you (or anyone else) think that is?

I'm thinking there would be config options to set the size of files uploaded to S3. You would do something like configure uploads every X minutes, or every X megabytes. This is similar to what you can do with Fluentd. Fluentd lets you accomplish that by configuring a buffer stage per-output. The data is stored in memory or on disk between uploads. We may or may not do that. I'm thinking it's best to store as little data locally as possible, and to get the data off to S3 as quickly as possible.

Multi-part uploads would allow that. Multi-part lets you send a large file in chunks over a long period of time. You can send 10,000 chunks, which must be at least 5 MB each. So the S3 plugin could buffer data till it gets 5 MB, and then upload a part. The user could configure how many 5 MB chunks they want per file (define your S3 file size in increments of 5 MB). That way, if something goes wrong, you can never lose more than the last 5 MB of data. |
@PettitWesley Understood, I think having "/" is better than "_", as in S3 "/" can be viewed as a folder. |
One thing to consider, if you are thinking of keeping as little buffer as possible locally, is how many files will be created in S3. |
@jbnjohnathan Yeah, that's what I'm trying to solve, and multipart uploads would accomplish that. You can upload a chunk every time you accumulate 5 MB of logs (maybe a few minutes for an average user), but a file can be multiple chunks, and I believe a chunked upload can take place over hours/days, so the total file size could be huge, on the order of gigabytes. |
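To make the sizing discussion concrete, here is a rough sketch of how these knobs could look in a Fluent Bit config. At this point in the thread the design was not settled, so the option names (total_file_size, upload_chunk_size, upload_timeout) are borrowed from the plugin as it eventually shipped and should be read as assumptions, not the final interface:

```
[OUTPUT]
    Name               s3
    Match              *
    # Bucket and region are illustrative values.
    bucket             my-log-bucket
    region             us-east-1
    # Buffer roughly 5 MB locally, then upload it as one multipart part,
    # so at most ~5 MB of data is at risk if the agent dies mid-file.
    upload_chunk_size  5M
    # Complete the multipart upload once ~100 MB (about 20 parts) has been sent.
    total_file_size    100M
    # Also complete the file after an hour, even if it is still smaller than that.
    upload_timeout     60m
```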
For S3 logs, it's excellent when the prefix includes a configurable date pattern, e.g., "foo/%Y/%m/%d/" (something similar is done with Logstash_Prefix/Logstash_DateFormat in the elasticsearch output settings). It is also beneficial to be able to use ${MY_VAR} somewhere in the object name or at the end of the prefix. This way, we can pass a hostname/ECS-task-id/pod-uid or whatever else helps identify the specific copy of the application which generated that log, and simultaneously eliminate the possibility of collisions between those multiple streams all going to the same prefix. |
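A minimal sketch of the requested prefix style, assuming strftime-style date tokens in the key format and Fluent Bit's ${...} environment-variable substitution in the config file (both are my assumptions here, not confirmed by this comment):

```
[OUTPUT]
    Name           s3
    Match          *
    # Illustrative bucket/region.
    bucket         my-log-bucket
    region         eu-west-1
    # Date-partitioned prefix plus a per-instance identifier (hostname, task id,
    # pod uid, ...) so parallel copies of the same app never collide on one key.
    s3_key_format  /foo/%Y/%m/%d/${HOSTNAME}/$TAG/logs
```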
We'd prefer zstd compression with jsonl format, and the compression going over multiple jsonl lines (really a requirement for the compression to have any noticeable effect). It would also be awesome to have configs to limit the output file size AND to have a flush interval (to limit the maximum age of the chunk when fewer log lines are coming). |
I'd +1 for json-lines format & accept simple stream-based (and so across-lines) gzip compression as easy/compatible.

One related note (sorry for a very slight diversion) is from a discussion I'm having about file naming and S3 Object Metadata on compressed files with another log-shipper, vectordotdev/vector#2769.

Try uploading a "dummy.log.gz" gzip'ed text file containing json-lines data with

If you then view it via the AWS S3 Web Console, the "Open" button will take its cue from the Content-Encoding to transparently un-gzip the file for display, which is nice. However, the "Download" button will also decompress, but give you a file still called "dummy.log.gz", so it will be regarded as corrupt by at least Linux & Windows, which use the file extension to pick a compressed file viewer on opening it.

You might think the correct option is to NOT include the ".gz" extension but do set the Content-Encoding; yes, this then works fine in the browser for both "Open" and "Download" (though the file icon is now wrong), but the aws s3 CLI ignores Object Metadata and just gives you a "dummy.log" which is actually still gzipped.

There seems to be no combination of data/filename/settings which works in all cases. The best I can see is using

Maybe if we accepted naming all uploaded files "*.txt.gz" then we could use |
FYI: I am still actively working on this; it is tentatively planned for v1.6 release in September. |
@PettitWesley Yeah, we should have per-output buffering to properly flush out contiguous data pertaining to specific inputs. This will dramatically reduce the number of files created in S3, as opposed to the free-flowing flush (2 MB as per the current setup), which has a direct impact on cost. |
@PettitWesley Will the 1.6 release also have support for uploading logs via pre-signed S3 URLs? |
@shailegu Explain the use case? I am not very familiar with pre-signed URLs... it sounds like it only allows you to upload a single object per URL. Which won't quite work with the plugin since it uploads a file every time it gets a set amount of data. The plugin needs to be able to upload multiple files. |
@kmajic It's still in progress... about 60% done right now... still planned for Fluent Bit 1.6, which is still planned for mid/end of September. Full code will be available in 2ish weeks for the first pre-release version for testing. (As always though, no hard guarantees; that's just the plan.) |
Thanks for the update, we're waiting for it! |
@PettitWesley - do we have any update? Any pre-release version available for testing? |
@fujimotos Sorry, I forgot your comment. The plugin is mostly done. Internally we are in the testing phase. Though I am still working on some small enhancements and fixes. This branch will continue to be updated as I finalize things: https://github.com/PettitWesley/fluent-bit/tree/1_6-pen-test If you want to start playing with it, it's essentially feature complete. |
@PettitWesley Thank you for the update. I'll check the branch out. |
@fujimotos @amit-uc @tarunwadhwa13 @eyalengel-pagaya @kmajic A pre-release version is now available!!! The image is here:

Based on the code here (testing and review still in progress): https://github.com/PettitWesley/fluent-bit/tree/pen-test-fixes-2

That repository has the same permissions as the ECR repos we publish AWS for Fluent Bit to (public read, private write). You can download that image using any AWS account.

Please remember that this is a preview version. We look forward to your feedback and any bug reports. However, there is absolutely zero guarantee of support or anything for this image; do not use it in prod. (I added a log line that will print this every single flush, so that no one can possibly miss it.)

Below is your documentation.

User Interface

Plugin Configuration Options

Example Config

Assume that the user is running Fluent Bit in FireLens and the tag is

S3 Key Format Option

In Fluent Bit and Fluentd, all logs have an associated tag.

In the example below, assume the date is January 1st, 2020 00:00:00 and the tag associated with the logs in question is

With the delimiters as
|
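The embedded documentation above lost its tables and examples in extraction, so here is a reconstructed sketch of an S3 output configuration in that spirit. Option names follow the docs linked later in the thread; the FireLens-style tag, the bucket name, and the exact key layout are my own illustrative choices, not a verbatim copy of the stripped example:

```
[OUTPUT]
    Name                          s3
    Match                         *
    # Illustrative bucket/region.
    bucket                        my-firelens-logs
    region                        us-west-2
    total_file_size               50M
    upload_timeout                10m
    # $TAG is the full tag; $TAG[n] is the n-th piece of the tag after splitting
    # it on the configured delimiters. With a FireLens-style tag such as
    # app-firelens-<task-id> and the "-" delimiter, $TAG[2] is the task id.
    s3_key_format                 /my-app/%Y/%m/%d/%H/$TAG[2]
    s3_key_format_tag_delimiters  -
```

With the example date of January 1st, 2020 00:00:00, a key produced by this format would start with /my-app/2020/01/01/00/.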
@PettitWesley It looks like you can't place the |
@Blaumer That's probably from some URL encoding which Fluent Bit is doing differently than S3 expects. I will look into it. However, I will note that S3 recommends not using special characters in key names: https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html |
@PettitWesley Any update on this? Ideally I would prefer not to use special characters, but unfortunately hive partitioning in S3 requires using the |
@Blaumer Support for using the

I think actually that almost any special character should work, in the sense that it at least won't throw the error you saw. The issue was the way URI encoding was done in the Sigv4 code, which affected

Anyway, as a general note, there will be things missing from this first launch which folks will want. Please submit feature requests, and then we will slowly get to them in future minor version bumps. (For example, no support for compression at first launch.) |
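For the Hive-partitioning use case mentioned above, the point of the URI-encoding fix (as I read it) is that key=value path segments survive into the S3 key unmangled. A sketch under that assumption, with all names illustrative:

```
[OUTPUT]
    Name           s3
    Match          *
    # Illustrative bucket/region.
    bucket         my-data-lake
    region         us-east-1
    # Hive-style partition layout: each path segment is a column=value pair,
    # which is why the "=" character has to pass through to the key unencoded.
    s3_key_format  /events/year=%Y/month=%m/day=%d/$TAG
```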
The preview image has been updated with the latest version of the code in master. |
@PettitWesley Thank you for your work on this. Feature requests: 🙏
|
For disabling date I also tried using |
@elrob Yeah, unfortunately I don't think you can disable injecting the date key.

For this one, I'm considering adding another special format string in the S3 key. That's not a perfect solution though... The PutObject API is called under two circumstances:

In both cases I want to force some sort of UUID interpolation to ensure the key is unique.

I suppose one thing I could do is split the S3 Key on

Another option would just be to include the

Thoughts? |
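For reference, this is roughly what the UUID idea looks like as a key format, assuming a $UUID variable of the kind being discussed (hypothetical at the time of this comment):

```
[OUTPUT]
    Name           s3
    Match          *
    # Illustrative bucket/region.
    bucket         my-log-bucket
    region         us-east-1
    # A random $UUID at the end keeps every uploaded object's key unique, even
    # when the timestamp portion collides (e.g. on a retried PutObject).
    s3_key_format  /logs/$TAG/%Y/%m/%d/%H/%M/%S/$UUID
```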
Oh also in case everyone hasn't realized... this was released in 1.6. I am going to close this and open a new issue for S3 output enhancements. |
Update: The S3 bucket is empty. I am able to put an object using the aws s3 CLI with the same credentials. |
@Eliasi1 That screenshot doesn't tell me much... I don't see any errors.

With the multipart upload API, you will not see any data in S3 until all parts have been uploaded and the upload is marked as complete. You can of course instead use the PutObject API. As shown in the error log from your first screenshot, when you use PutObject you must set a total_file_size which is below 50 megabytes.

If you feel data is taking too long to get into S3, then your file size is probably too large for your data ingestion rate. You are probably only ingesting a few MB of logs, and then the plugin is buffering it and waiting to upload. One option is to use the upload_timeout option to specify a max time window for the upload to complete.

Please read the docs: https://docs.fluentbit.io/manual/pipeline/outputs/s3 |
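A concrete config along the lines of that advice (option names are from the linked docs; the bucket name and sizes are just examples): use PutObject with a total_file_size below 50 MB, and set an upload_timeout so low-volume senders still see data land in S3 promptly.

```
[OUTPUT]
    Name             s3
    Match            *
    # Illustrative bucket/region.
    bucket           my-log-bucket
    region           us-east-1
    # PutObject mode: each file is buffered locally and sent in one request,
    # so total_file_size must stay below 50M.
    use_put_object   On
    total_file_size  20M
    # Flush whatever has been buffered after 5 minutes, even if it is < 20M.
    upload_timeout   5m
```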
@Eliasi1 This might be because the behavior of tail changed recently in 1.6. It is now more like the tail linux command: it starts reading files from the end and only reads new content. If you want to read from the start of a log file, you have to add |
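The option being referred to is presumably the tail input's read_from_head setting; a minimal sketch under that assumption:

```
[INPUT]
    Name            tail
    # Illustrative path.
    Path            /var/log/app/*.log
    # With the 1.6 behavior, tail starts at the end of existing files by default;
    # this makes it read existing content from the beginning as well.
    Read_from_Head  On
```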
@PettitWesley hey and thank you for the reply, |
@shailegu Do we have a workaround to achieve this? i.e., to support presigned POST? |
Feature:
I have always wanted to push my logs to an AWS S3 bucket directly.
Will it be possible for us to have an output plugin that pushes the logs to an S3 bucket in real time and creates a file based on a daily or weekly log file? Similar to pushing logs to Elasticsearch.