out_s3: Add parquet compression type #8837

Open: cosmo0920 wants to merge 31 commits into base: master

cosmo0920 (Contributor) commented May 20, 2024

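This PR adds parquet as a compression type for out_s3. Conversion to Parquet is delegated to the external columnify command, so columnify must be installed and available on PATH.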


Enter [N/A] in the box if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
[SERVICE]
    Flush        5
    Daemon       Off
    Log_Level    trace
    HTTP_Server  Off
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

[INPUT]
    Name dummy
    Tag  dummy.local
    dummy {"boolean": false, "int": 1, "long": 1, "float": 1.1, "double": 1.1, "bytes": "foo", "string": "foo"}

[OUTPUT]
    Name  s3
    Match dummy*
    Region us-east-2
    bucket fbit-parquet-s3
    Use_Put_object true
    compression parquet
    parquet.schema_file schema-dummy.avsc

schema-dummy.avsc

{
  "type": "record",
  "name": "DummyMessages",
  "fields" : [
    {"name": "boolean", "type": "boolean"},
    {"name": "int",     "type": "int"},
    {"name": "long",    "type": "long"},
    {"name": "float",   "type": "float"},
    {"name": "double",  "type": "double"},
    {"name": "bytes",   "type": "bytes"},
    {"name": "string",  "type": "string"}
  ]
}
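
As a quick sanity check on the schema above, the following sketch compares the dummy record from the example configuration against the field list in schema-dummy.avsc. It is not part of this PR and does not use an Avro library. The names CHECKS and mismatches are hypothetical, and the JSON-to-Avro mapping (for instance, treating bytes as a JSON string) is an assumption that only covers the primitive types used here.

```python
import json

def _is_int(v):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(v, int) and not isinstance(v, bool)

def _is_number(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool)

# Assumed mapping from Avro primitive type names to checks on decoded JSON values.
CHECKS = {
    "boolean": lambda v: isinstance(v, bool),
    "int": lambda v: _is_int(v) and -2**31 <= v < 2**31,
    "long": lambda v: _is_int(v) and -2**63 <= v < 2**63,
    "float": _is_number,
    "double": _is_number,
    "bytes": lambda v: isinstance(v, str),  # assumption: bytes arrive as a JSON string
    "string": lambda v: isinstance(v, str),
}

SCHEMA = json.loads("""
{
  "type": "record",
  "name": "DummyMessages",
  "fields": [
    {"name": "boolean", "type": "boolean"},
    {"name": "int",     "type": "int"},
    {"name": "long",    "type": "long"},
    {"name": "float",   "type": "float"},
    {"name": "double",  "type": "double"},
    {"name": "bytes",   "type": "bytes"},
    {"name": "string",  "type": "string"}
  ]
}
""")

RECORD = json.loads(
    '{"boolean": false, "int": 1, "long": 1, "float": 1.1, '
    '"double": 1.1, "bytes": "foo", "string": "foo"}'
)

def mismatches(schema, record):
    """Return a list of problems; an empty list means the record fits the schema.

    Only primitive field types given as plain strings are handled.
    """
    problems = []
    for field in schema["fields"]:
        name, typ = field["name"], field["type"]
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not CHECKS[typ](record[name]):
            problems.append(f"{name}: expected {typ}, got {record[name]!r}")
    return problems

print(mismatches(SCHEMA, RECORD))  # prints []
```

A non-empty list points at the field whose value does not fit the declared type, which may help narrow down a failed conversion before looking at the plugin itself.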
  • Debug log output from testing the change
Fluent Bit v3.2.4
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _           _____  _____ 
|  ___| |                | |   | ___ (_) |         |____ |/ __  \
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __   / /`' / /'
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / /   \ \  / /  
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /.___/ /./ /___
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/ \____(_)_____/


[2024/12/23 12:08:36] [ info] Configuration:
[2024/12/23 12:08:36] [ info]  flush time     | 5.000000 seconds
[2024/12/23 12:08:36] [ info]  grace          | 5 seconds
[2024/12/23 12:08:36] [ info]  daemon         | 0
[2024/12/23 12:08:36] [ info] ___________
[2024/12/23 12:08:36] [ info]  inputs:
[2024/12/23 12:08:36] [ info]      dummy
[2024/12/23 12:08:36] [ info] ___________
[2024/12/23 12:08:36] [ info]  filters:
[2024/12/23 12:08:36] [ info] ___________
[2024/12/23 12:08:36] [ info]  outputs:
[2024/12/23 12:08:36] [ info]      s3.0
[2024/12/23 12:08:36] [ info] ___________
[2024/12/23 12:08:36] [ info]  collectors:
[2024/12/23 12:08:37] [ info] [fluent bit] version=3.2.4, commit=ec656f12b6, pid=1311541
[2024/12/23 12:08:37] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2024/12/23 12:08:37] [ info] [storage] ver=1.1.6, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/12/23 12:08:37] [ info] [simd    ] SSE2
[2024/12/23 12:08:37] [ info] [cmetrics] version=0.9.9
[2024/12/23 12:08:37] [ info] [ctraces ] version=0.5.7
[2024/12/23 12:08:37] [ info] [input:dummy:dummy.0] initializing
[2024/12/23 12:08:37] [ info] [input:dummy:dummy.0] storage_strategy='memory' (memory only)
[2024/12/23 12:08:37] [debug] [dummy:dummy.0] created event channels: read=25 write=26
[2024/12/23 12:08:37] [debug] [s3:s3.0] created event channels: read=27 write=28
[2024/12/23 12:08:37] [ info] [output:s3:s3.0] Using upload size 100000000 bytes
[2024/12/23 12:08:37] [debug] [output:s3:s3.0] parquet.compression format is SNAPPY
[2024/12/23 12:08:37] [ info] [output:s3:s3.0] parquet.record_type format is jsonl
[2024/12/23 12:08:37] [ info] [output:s3:s3.0] parquet.schema_type format is avro
[2024/12/23 12:08:37] [debug] [aws_credentials] Initialized Env Provider in standard chain
[2024/12/23 12:08:37] [debug] [aws_credentials] creating profile (null) provider
[2024/12/23 12:08:37] [debug] [aws_credentials] Initialized AWS Profile Provider in standard chain
[2024/12/23 12:08:37] [debug] [aws_credentials] Not initializing EKS provider because AWS_ROLE_ARN was not set
[2024/12/23 12:08:37] [debug] [aws_credentials] Not initializing ECS Provider because AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is not set
[2024/12/23 12:08:37] [debug] [aws_credentials] Initialized EC2 Provider in standard chain
[2024/12/23 12:08:37] [debug] [aws_credentials] Sync called on the EC2 provider
[2024/12/23 12:08:37] [debug] [aws_credentials] Init called on the env provider
[2024/12/23 12:08:37] [debug] [aws_credentials] upstream_set called on the EC2 provider
[2024/12/23 12:08:37] [ info] [sp] stream processor started
[2024/12/23 12:08:37] [ info] [output:s3:s3.0] worker #0 started
[2024/12/23 12:08:42] [debug] [task] created task=0x6146220 id=0 OK
[2024/12/23 12:08:42] [debug] [output:s3:s3.0] task_id=0 assigned to thread #0
[2024/12/23 12:08:42] [debug] [output:s3:s3.0] Creating upload timer with frequency 60s
[2024/12/23 12:08:42] [debug] [out flush] cb_destroy coro_id=0
[2024/12/23 12:08:42] [debug] [task] destroy task=0x6146220 (task_id=0)
^C[2024/12/23 12:08:43] [engine] caught signal (SIGINT)
[2024/12/23 12:08:43] [debug] [task] created task=0x61b6b20 id=0 OK
[2024/12/23 12:08:43] [debug] [output:s3:s3.0] task_id=0 assigned to thread #0
[2024/12/23 12:08:43] [ warn] [engine] service will shutdown in max 5 seconds
[2024/12/23 12:08:43] [ info] [input] pausing dummy.0
[2024/12/23 12:08:43] [debug] [out flush] cb_destroy coro_id=1
[2024/12/23 12:08:43] [debug] [task] destroy task=0x61b6b20 (task_id=0)
[2024/12/23 12:08:44] [ info] [engine] service has stopped (0 pending tasks)
[2024/12/23 12:08:44] [ info] [input] pausing dummy.0
[2024/12/23 12:08:44] [ info] [output:s3:s3.0] Sending all locally buffered data to S3
[2024/12/23 12:08:44] [ info] [output:s3:s3.0] thread worker #0 stopping...
[2024/12/23 12:08:44] [ info] [output:s3:s3.0] thread worker #0 stopped
[2024/12/23 12:08:44] [ info] [output:s3:s3.0] Pre-compression chunk size is 756, After compression, chunk is 981 bytes
[2024/12/23 12:08:45] [debug] [upstream] KA connection #31 to s3.us-east-2.amazonaws.com:443 is connected
[2024/12/23 12:08:45] [debug] [http_client] not using http_proxy for header
[2024/12/23 12:08:45] [debug] [aws_credentials] Requesting credentials from the env provider..
[2024/12/23 12:08:45] [debug] [upstream] KA connection #31 to s3.us-east-2.amazonaws.com:443 is now available
[2024/12/23 12:08:45] [debug] [output:s3:s3.0] PutObject http status=200
[2024/12/23 12:08:45] [ info] [output:s3:s3.0] Successfully uploaded object /fluent-bit-logs/dummy.local/2024/12/23/03/08/42-objectirl68bju

Install columnify with:

$ go install github.com/reproio/columnify/cmd/columnify@latest
# ...
$ which columnify
/path/to/columnify
$ echo $?
0
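
To try columnify by hand with the same settings the log above reports (schema type avro, record type jsonl, compression SNAPPY), a minimal sketch follows. The flag names are taken from columnify's usage text as best I recall it, so please confirm them with columnify -h. Whether out_s3 invokes columnify with exactly these arguments is an assumption, and columnify_cmd, dummy.jsonl, and dummy.parquet are hypothetical names used only for illustration.

```python
import os
import shutil
import subprocess

def columnify_cmd(schema_file, input_file, output_file,
                  schema_type="avro", record_type="jsonl", codec="SNAPPY"):
    """Build an argv list for the columnify CLI (flag names assumed; check columnify -h)."""
    return [
        "columnify",
        "-schemaType", schema_type,
        "-schemaFile", schema_file,
        "-recordType", record_type,
        "-parquetCompressionCodec", codec,
        "-output", output_file,
        input_file,
    ]

# dummy.jsonl is a hypothetical file with one JSON record per line.
cmd = columnify_cmd("schema-dummy.avsc", "dummy.jsonl", "dummy.parquet")
print(" ".join(cmd))

# Run the conversion only when columnify is installed and both inputs exist.
inputs_present = all(os.path.isfile(p) for p in ("schema-dummy.avsc", "dummy.jsonl"))
if shutil.which("columnify") and inputs_present:
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues with file paths.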
  • Attached Valgrind output that shows no leaks or memory corruption was found
==1311541== 
==1311541== HEAP SUMMARY:
==1311541==     in use at exit: 0 bytes in 0 blocks
==1311541==   total heap usage: 20,307 allocs, 20,307 frees, 3,608,269 bytes allocated
==1311541== 
==1311541== All heap blocks were freed -- no leaks are possible
==1311541== 
==1311541== For lists of detected and suppressed errors, rerun with: -s
==1311541== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

fluent/fluent-bit-docs#1380

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0; by submitting this pull request I understand that this code will be released under the terms of that license.

cosmo0920 and others added 22 commits December 23, 2024 11:46
Signed-off-by: Hiroshi Hatake <hatake@calyptia.com>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
@cosmo0920 (Contributor, Author)

I rebased onto the current master and am waiting for the CI results.
