in_forward: plugin doesn't close TCP port when it catches SIGTERM #2610
Comments
Fixed in f303cdb. The binary now uses library mode, so it can do a graceful shutdown (within the grace period).
@edsiper Thank you for the fix! I tested with the latest master (13735a5) and could see that the forward input was paused consistently during shutdown. However, when I started a new container after the shutdown, the records blocked during shutdown appeared to be simply discarded.

Steps to reproduce:
1. Build a docker image from the latest master (13735a5)
2. Run a fluent-bit container:
3. Run a port scan container:
4. Stop the fluent-bit container:
5. Start a new fluent-bit container:
I expected the records blocked during shutdown to be retried by the Docker engine after the restart, but they were actually dropped.
troubleshooting...
This patch adds some extra handling for service termination: when the engine pauses ingestion, the server socket is closed to avoid accepting incoming connections and accumulating unprocessed records. This patch works when the service is not under back-pressure, so it can be considered a partial workaround. An upcoming redesign of the server-side API for scalability will address all of these issues. Signed-off-by: Eduardo Silva <eduardo@treasure-data.com>
I've pushed a workaround for the issue: 4e0e687. The only downside is that if the plugin is under high back-pressure, the server socket won't be terminated before the grace period on shutdown/SIGTERM. Note that after the 1.6 release we will refactor the server-side API to address these issues.
For now, can you provide some feedback on 4e0e687?
@edsiper thank you for the fix! I'll test it. |
@edsiper I tested 4e0e687. I could see that the port was closed during shutdown, but some records were still dropped.

Steps to reproduce:
1. Build a docker image from 4e0e687:
2. Run a fluent-bit container:
3. Run a port scan container:
4. Stop the fluent-bit container:
5. Start a new fluent-bit container:
The engine caught
I'm reading the implementation of Docker's fluentd log driver. If my understanding is correct, it seems to check an error returned by https://github.com/moby/moby/blob/v19.03.12/daemon/logger/fluentd/fluentd.go#L175-L177 so I assumed it would retry if the connection was closed. I'm not sure why this doesn't work as intended. Good news: it will check the response if Ack mode is enabled. Even though without the
I captured packets with tcpdump.
In the below test case,
tcpdump:
From the above, IP address mappings are probably as follows:
As you can see, new connections from
@edsiper I noticed that the v1.6.0 release note says: https://fluentbit.io/announcements/v1.6.0/
What does that mean for this issue?
<eduardo@treasure-data.com> * lib: mbedtls: CMakeLists.txt: fix a typo for CFLAGS (fluent#2662) CMAKE_C_CFLAGS -> CMAKE_C_FLAGS, this avoids C_FLAGS from environment being overridden. Signed-off-by: Ming Liu <liu.ming50@gmail.com> * build: bump to v1.7.0 Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> * Dockerfile: bump to v1.7.0 Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> * bitbake: bump to v1.7.0 Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> * doc: add users: DigitalOcean Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> * tests: fuzzers: added a genera parser fuzzer (fluent#2665) Signed-off-by: davkor <david@adalogics.com> * tests: internal: fuzzers: fix typo in cmakelists (fluent#2675) Signed-off-by: davkor <david@adalogics.com> * scheduler: use flb_random_bytes() instead of /dev/urandom (fluent#2679) Windows does not support /dev/urandom. For this reason, it ended up using a weak entropy source for task scheduling. Avoid this issue by using flb_random_bytes() instead. 
Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net> * pack: gelf: on flatten, only process string key types (oss-fuzz 26294) Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> * parser: on exception, use flb_parser_destroy() (oss-fuzz 26308) Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> * tests: internal: fuzzer: initialize timestamp Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> * tests: internal: fuzzer: fix missing data type declaration Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> * Import libmaxminddb-1.3.2 Signed-off-by: okkez <okkez000@gmail.com> * Add files to implement filter geoip2 Signed-off-by: okkez <okkez000@gmail.com> * Add missing header Signed-off-by: okkez <okkez000@gmail.com> * Implement filter_geoip2 Signed-off-by: okkez <okkez000@gmail.com> * Add simple test for filter_geoip2 Signed-off-by: okkez <okkez000@gmail.com> * Duplicate string Signed-off-by: okkez <okkez000@gmail.com> * Use make install as INSTALL_COMMAND Signed-off-by: okkez <okkez000@gmail.com> * Remove unused variables Signed-off-by: okkez <okkez000@gmail.com> * Free lookup_key Signed-off-by: okkez <okkez000@gmail.com> * Reset counter Signed-off-by: okkez <okkez000@gmail.com> * Add more test for filter_geoip2 Signed-off-by: okkez <okkez000@gmail.com> * Plug memory leaks reported by valgrind Signed-off-by: okkez <okkez000@gmail.com> * Improve log messages Signed-off-by: okkez <okkez000@gmail.com> * Move libmaxminddb-1.3.2 to plugins/filter_geoip/ Signed-off-by: okkez <okkez000@gmail.com> * add cmake Co-authored-by: Takahiro YAMASHITA <nokute78@gmail.com> Co-authored-by: DavidKorczynski <david@adalogics.com> Co-authored-by: Wesley Pettit <wppttt@amazon.com> Co-authored-by: Eduardo Silva <eduardo@treasure-data.com> Co-authored-by: Jeff Luo <jeffluoo@google.com> Co-authored-by: Fujimoto Seiji <fujimoto@ceptord.net> Co-authored-by: Aleks <git001@users.noreply.github.com> Co-authored-by: Zero King <l2dy@icloud.com> Co-authored-by: turettn 
<turettn@gmail.com> Co-authored-by: ziheng <zihenglv@gmail.com> Co-authored-by: Matej Staroň <25978390+yang-padawan@users.noreply.github.com> Co-authored-by: Jonathan Gonzalez V <jonathan.abdiel@gmail.com> Co-authored-by: Jonathan Gonzalez V <jonathan.gonzalez@2ndquadrant.com> Co-authored-by: Martin Dojcak <martin.dojcak@lablabs.io> Co-authored-by: Meghna Prabhu <meghnapr@amazon.com> Co-authored-by: Yen-Cheng Chou <ycchou@google.com> Co-authored-by: Joey Paskhay <joey.paskhay@gmail.com> Co-authored-by: Yen-Cheng Chou <3055688+StevenYCChou@users.noreply.github.com> Co-authored-by: Dominik Rosiek <drosiek@sumologic.com> Co-authored-by: Drew Zhang <zhayuzhu@amazon.com> Co-authored-by: pugkung <tanapol.raa@gmail.com> Co-authored-by: Jeff Luo <40296741+JeffLuoo@users.noreply.github.com> Co-authored-by: Jie WU <wujie@google.com> Co-authored-by: Rayhan Hossain <hossain.rayhan@outlook.com> Co-authored-by: Masoud Koleini <7603254+koleini@users.noreply.github.com> Co-authored-by: André <andreprado88@gmail.com> Co-authored-by: Ming <liu.ming50@gmail.com> Co-authored-by: okkez <okkez000@gmail.com> Co-authored-by: Vincent Auclair <auclair.vincent@gmail.com>
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
Any update on this? |
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
I tested the fluentd-request-ack option in Docker 20.10 with the latest fluent-bit v1.7.4, but some records were lost during shutdown.
Start a fluent-bit v1.7.4 container:
Start a nc container with --log-opt fluentd-request-ack=true:
Stop the fluent-bit container:
Start a new fluent-bit container:
Stop the 2nd fluent-bit container:
As you can see, the last record processed by the 1st fluent-bit container:
The first record processed by the 2nd fluent-bit container:
This means some records were lost during shutdown. I expected that the fluentd-request-ack option would solve this issue, but am I misunderstanding something? |
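For context, the fluentd-request-ack option is meant to give at-least-once delivery: the sender tags each chunk with an id, waits for the receiver to acknowledge that id, and resends unacknowledged chunks. The following is a minimal sketch of that handshake, using newline-delimited JSON over TCP as a stand-in for the real msgpack-based forward protocol — all names here are illustrative, not Docker's or Fluent Bit's actual code:

```python
import json
import socket
import threading
import uuid

def ack_server(sock, received):
    """Toy receiver: read one JSON chunk per line, ack it back.

    Stand-in for an in_forward-style receiver; the real protocol is
    msgpack-based, this JSON framing is purely illustrative.
    """
    conn, _ = sock.accept()
    with conn, conn.makefile("rw") as f:
        for line in f:
            msg = json.loads(line)
            received.append(msg["data"])
            # Acknowledge the chunk id so the sender can drop its copy.
            f.write(json.dumps({"ack": msg["chunk"]}) + "\n")
            f.flush()

def send_with_ack(host, port, record, retries=3, timeout=1.0):
    """At-least-once sender: resend the chunk until the matching ack arrives."""
    chunk_id = uuid.uuid4().hex
    for _ in range(retries):
        try:
            with socket.create_connection((host, port), timeout=timeout) as c:
                f = c.makefile("rw")
                f.write(json.dumps({"chunk": chunk_id, "data": record}) + "\n")
                f.flush()
                reply = json.loads(f.readline())
                if reply.get("ack") == chunk_id:
                    return True  # receiver confirmed the chunk
        except (OSError, ValueError):
            pass  # connection refused, timed out, or truncated: retry
    return False  # caller must keep the chunk buffered or report loss
```

If the receiver shuts down between accepting the chunk and sending the ack, the sender resends on reconnect — which is why an ack-based driver should not lose records unless its own retry gives up.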
@edsiper Any update on this? I think we have seen reports that match the last comment from @minamijoyo |
@edsiper Ping- we are still seeing reports that Fluent Bit stops accepting records on the forward input when it receives a SIGTERM, which causes log loss. Customers prefer Fluent Bit to continue accepting records until SIGKILL. If that behavior is not your preferred default, then we request that this alternate behavior be made opt-in. |
@edsiper could we have any update on this one? Other customers are also facing this issue and hope to get a fix. |
@edsiper Do we have any update on this? We have other customer issues similar to this one and are expecting a response from you. |
@edsiper Ping for update. This issue is still a risk and could cause log loss for customers. |
When Fluent Bit receives a SIGTERM, internally it was only taking care of the server socket leaving the active connections in an open state. This patch changes that behavior and if the plugin is paused or if the engine receives SIGTERM (which will trigger a pause), now the agent will drop the active TCP connections. Signed-off-by: Eduardo Silva <eduardo@calyptia.com>
I just pushed a fix on PR #4339 extra tests are welcome (so we can merge the change in the next release) |
@edsiper Thanks. We tested this, and you merged it, so this issue can be closed? |
@edsiper Oh I see, this hasn't been released yet? When is it coming? in 1.8 or 1.9? |
@PettitWesley, my bad. I said that I didn't see this in 1.8 because I was looking for a backport PR, and only saw a PR into master. I just checked the commits to 1.8 and now see the one referenced above: ea0fba8 This looks good, and is in 1.8.10. I think this issue can be closed. |
@edsiper Thank you for the fix! I tested the latest fluent-bit v1.8.10 and confirmed that it now correctly closes the port on shutdown. It's a great improvement. Thanks! However, I can still see that some records were lost during shutdown, even with the fluentd-request-ack=true option enabled. I'm not sure where the root cause is; anyway, I'll share the log.
Start a fluent-bit v1.8.10 container:
Start a nc container with --log-opt fluentd-request-ack=true:
Stop the fluent-bit container:
Start a new fluent-bit container:
As you can see, the last record processed by the 1st fluent-bit container:
The first record processed by the 2nd fluent-bit container:
This means two records were lost during shutdown.
If you think this is another issue, please let me know. I'll open another issue. |
Can you try building a container from Git master? We made other improvements to shutdown and also added more verbose messages. |
@edsiper Thank you for your reply. I built and tested the latest master.
It looks like it shuts down immediately when it catches SIGTERM, but I can still see that some records were lost, the same as with v1.8.10. There may be a problem on the side of Docker's log driver retry mechanism. How can we identify what is wrong, or is it possible to prove that fluent-bit is innocent? |
Hmm, you might try enabling debug mode on the Fluent Bit side |
@edsiper Thanks. I'll make time to try other environments. For anyone who is interested in this topic, please help with debugging and report which conditions succeed or fail. |
I tested without the Docker fluentd log driver. In this case I still used Docker as a runtime so anyone can reproduce the case easily, but sent records with the fluent-bit output forward plugin.
Start a fluent-bit container as a receiver:
Start another fluent-bit container as a sender with the following configuration:
Stop the receiver's fluent-bit container:
Start a new receiver's fluent-bit container:
As you can see, the last record processed by the 1st receiver's fluent-bit container:
The first record processed by the 2nd receiver's fluent-bit container:
In this case, you can see that the records between 1640682585 and 1640682590 were also received; there were no lost records. What happens when the receiver is restarted depends on the retry logic of the sender, so it may be a problem with the Docker fluentd log driver. |
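This result supports the interpretation that what survives a receiver restart depends entirely on the sender's retry logic: Fluent Bit's forward output keeps failed chunks buffered and retries them after the receiver comes back, while a sender with a small bounded retry budget drops the in-flight records. A toy backoff loop illustrating the difference — the send callback is a hypothetical stand-in, not any real driver API:

```python
import time

def deliver(record, send, retries=5, base_delay=0.1):
    """Retry send(record) with exponential backoff.

    `send` stands in for writing a chunk to the forward receiver; it
    raises ConnectionError while the receiver is down. Returns True on
    success, False when the retry budget is exhausted (at which point
    the record is lost unless the caller keeps it buffered).
    """
    delay = base_delay
    for attempt in range(retries + 1):
        try:
            send(record)
            return True
        except ConnectionError:
            if attempt == retries:
                return False  # budget exhausted: this is where loss happens
            time.sleep(delay)
            delay *= 2  # back off before the next attempt
```

A sender whose retry budget outlasts the receiver's restart window delivers everything; one whose budget is shorter drops records, regardless of what the receiver does on shutdown.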
This was fixed in 31d26a2 |
Hi @edsiper, Thank you for the fix! Although I can still see some records being lost on shutdown, even with the latest Docker 23.0.1 + fluent-bit v2.0.10 and --log-opt fluentd-request-ack=true, I think there is another problem on the side of Docker's fluentd log driver. I found a related issue on the moby project: moby/moby#43690 (comment). I linked this to the above moby issue and I'm going to close this one. Thanks! |
Bug Report
Describe the bug
The input forward plugin doesn't close the TCP port when it catches SIGTERM. This causes some logs to be lost on shutdown.
To Reproduce
Run an nc command (port scan) container with the Docker fluentd log driver and forward its logs to the fluent-bit container:
The outputs of the commands are as follows:
According to the above output:
Docker tried to stop the fluent-bit container at 00:10:04.
The engine caught the signal (SIGTERM) at 00:10:04, but some logs were still ingested: the timestamp of the last log entry is 1601338207 (= 2020/09/29 00:10:07), and the result of the nc command shows the TCP port was still open.
The engine completed shutdown at 00:10:09, after the grace period (5 sec).
I tested multiple times and found its behavior to be inconsistent.
The input was sometimes paused (but the TCP port was still open).
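The nc port scan used above can be expressed as a small probe: during a graceful shutdown it should start failing as soon as the listener is closed, while with this bug it keeps succeeding through the grace period. A sketch:

```python
import socket

def port_open(host, port, timeout=0.5):
    """Return True if a TCP connect to (host, port) succeeds, i.e. the
    listener is still accepting connections (what the nc scan checks)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Polling this against the container's published port (conventionally 24224 for forward) while running docker stop shows whether the listener was closed before the grace period expired.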
Expected behavior
The input forward plugin should close the TCP port when it catches SIGTERM, so as not to lose any logs on shutdown.
Your Environment
Additional context
I'm running fluent-bit containers on AWS ECS Fargate behind an AWS NLB and forwarding other containers' logs with the Docker fluentd log driver.
When I need to update my fluent-bit configuration, I use the Blue-Green deployment strategy so as not to lose any logs during the deployment.
However, AWS NLB doesn't reset existing TCP connections even after the deregistration delay has passed. The deregistration delay blocks new connections, but it doesn't reset existing ones. If my understanding is correct, the grace period doesn't ensure that all buffers are flushed. This means that if fluent-bit ingests new log entries during shutdown, they will be lost.
I don't think this only affects fluent-bit behind AWS NLB. I can reproduce it with the simple Docker container setup described above.
Blocking input on shutdown consistently may be a minimal solution, but it would be great if fluent-bit closed the TCP port on shutdown, because that allows the sender to quickly detect that the receiver is no longer available.
Related issue
#1496