
Fluent Bit 1.6 ignores Grace Period #2715

Closed
PettitWesley opened this issue Oct 21, 2020 · 5 comments

@PettitWesley (Contributor)

Bug Report

Describe the bug

The grace period is entirely ignored and Fluent Bit quits instantly. This happens with both 1.6.0 and 1.6.1.

To Reproduce

[SERVICE]
    Log_Level debug
    Grace 30

[INPUT]
    Name dummy
    Tag  dummy

[OUTPUT]
    Name cloudwatch_logs
    Match *
    log_stream_prefix hi-hello-fluent-bitttt-
    log_group_name fluent
    region us-west-2

Send Fluent Bit a SIGTERM, and on shutdown you see:

[2020/10/21 22:50:00] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Sent events to hi-hello-fluent-bitttt-0da87724773c
[2020/10/21 22:50:00] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Sent 50 events to CloudWatch
[2020/10/21 22:50:00] [debug] [task] destroy task=0x7f5ff00cb2b0 (task_id=0)
[2020/10/21 22:50:03] [engine] caught signal (SIGTERM)
[2020/10/21 22:50:03] [ info] [input] pausing forward.0

It also stops immediately: there is no grace period and no wait, it just stops.

Expected behavior
Prior to 1.6, on SIGTERM there was a message and then Fluent Bit waited:

[2020/10/21 22:57:25] [debug] [http_client] server 169.254.169.254:80 will close connection #16
[2020/10/21 22:57:25] [debug] [router] match rule forward.0:cloudwatch_logs.0
[2020/10/21 22:57:25] [ info] [sp] stream processor started
[engine] caught signal (SIGTERM)
[2020/10/21 22:57:34] [ warn] [engine] service will stop in 30 seconds
[2020/10/21 22:58:04] [ info] [engine] service stopped

And it actually waits the full 30 seconds.

Your Environment

  • Version used: 1.6.0, 1.6.1, and the latest 1.6 branch

Additional context

aws/aws-for-fluent-bit#81

@edsiper (Member) commented Oct 24, 2020

I cannot reproduce the problem.

@PettitWesley (Contributor, Author)

@edsiper I can reproduce it with a simple config using the stdout output:

[SERVICE]
     Grace 30

[INPUT]
     Name dummy
     Tag dummy

[OUTPUT]
     Name stdout
     Match *

I built Fluent Bit from the latest commit on branch 1.6 on a fresh EC2 instance:

./bin/fluent-bit -c ~/fluent-bit.conf

In another window, send it a SIGTERM:

$ ps -e | grep flu
27799 pts/0    00:00:00 fluent-bit
$ kill -15 27799

Fluent Bit quits immediately; the 30-second grace period is ignored.

I built branch 1.5 on the same instance and went through the same process; the Grace period is respected when I send Fluent Bit a SIGTERM.

Output continues and you get a message saying that Fluent Bit will quit soon (in 1.6 this message no longer appears):

[0] dummy: [1603521497.377335622, {"message"=>"dummy"}]
[1] dummy: [1603521498.377261063, {"message"=>"dummy"}]
[2] dummy: [1603521499.377258684, {"message"=>"dummy"}]
[3] dummy: [1603521500.377250324, {"message"=>"dummy"}]
[4] dummy: [1603521501.377252626, {"message"=>"dummy"}]
[engine] caught signal (SIGTERM)
[0] dummy: [1603521502.377294435, {"message"=>"dummy"}]
[2020/10/24 06:38:23] [ warn] [engine] service will stop in 30 seconds
[0] dummy: [1603521503.377260033, {"message"=>"dummy"}]
[1] dummy: [1603521504.377255466, {"message"=>"dummy"}]
...
[3] dummy: [1603521525.377258357, {"message"=>"dummy"}]
[4] dummy: [1603521526.377261706, {"message"=>"dummy"}]
[2020/10/24 06:38:52] [ info] [engine] service stopped

I went through the commits on the 1.6 branch and checked whether each one has this issue; this commit introduced it: 70771be

I also verified that if you remove the call to _exit on 1.6, the grace period returns. However, the "double free" issue on shutdown comes back (which I originally alerted you to before the 1.6.0 release).
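
For context on why that one call changes the behavior, here is a minimal, hypothetical C sketch (not Fluent Bit's actual shutdown code): a SIGTERM handler that only records the request lets the main loop run the grace countdown, while a handler that calls _exit() ends the process on the spot, so the countdown never runs.

/* Minimal hypothetical sketch (not Fluent Bit's actual code) of why a
 * direct _exit() call defeats the grace period: the graceful path only
 * records the signal and lets the main loop count down, while _exit()
 * terminates the process immediately. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t stop_requested = 0;

static void on_sigterm(int sig)
{
    (void) sig;
    stop_requested = 1;   /* graceful path: just record the request       */
    /* _exit(0); */       /* immediate path: process ends right here, so  */
                          /* the grace countdown below never executes     */
}

int main(void)
{
    const int grace = 30; /* analogous to Grace in the [SERVICE] section */

    signal(SIGTERM, on_sigterm);

    while (!stop_requested) {
        /* regular event-loop work: collect and flush records */
        sleep(1);
    }

    printf("service will stop in %d seconds\n", grace);
    sleep(grace);         /* keep flushing pending data during the grace window */
    printf("service stopped\n");
    return 0;
}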

We have a large number of AWS customers who have been unable to upgrade to 1.6.x because of this issue.

@edsiper (Member) commented Oct 29, 2020

@PettitWesley I am trying to reproduce the issue described with the 1.6 branch (v1.6.2), and this is the output:

$ ../../build/bin/fluent-bit -c conf 
Fluent Bit v1.6.2
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/10/28 21:01:40] [ info] [engine] started (pid=1719983)
[2020/10/28 21:01:40] [ info] [storage] version=1.0.6, initializing...
[2020/10/28 21:01:40] [ info] [storage] in-memory
[2020/10/28 21:01:40] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/10/28 21:01:40] [ info] [sp] stream processor started
[0] dummy: [1603940501.354723113, {"message"=>"dummy"}]
[1] dummy: [1603940502.354722239, {"message"=>"dummy"}]
[2] dummy: [1603940503.354688760, {"message"=>"dummy"}]
[3] dummy: [1603940504.354833207, {"message"=>"dummy"}]
[2020/10/28 21:01:48] [engine] caught signal (SIGTERM)
[0] dummy: [1603940505.354638619, {"message"=>"dummy"}]
[1] dummy: [1603940506.356208382, {"message"=>"dummy"}]
[2] dummy: [1603940507.354874306, {"message"=>"dummy"}]
[3] dummy: [1603940508.354687246, {"message"=>"dummy"}]
[2020/10/28 21:01:48] [ warn] [engine] service will stop in 30 seconds
[2020/10/28 21:02:18] [ info] [engine] service stopped

The message was triggered at 21:01:48 and the service stopped at 21:02:18, exactly 30 seconds later.

Which distro and compiler version are you testing this on?

@github-actions (Contributor)

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@giuliano-sider commented Apr 11, 2023

If I docker run fluent/fluent-bit:1.8.12 with the following config,

[SERVICE]
    Grace         30
    Log_Level  debug

[INPUT]
    Name dummy
    Tag dummy

[INPUT]
    Name             tail
    Tag              plain_old_text.txt
    Path             /testdata/plain_old_text.txt
    DB               /plain_old_text.db
    Skip_Long_Lines  On
    Refresh_Interval 5
    Read_from_Head   True

[OUTPUT]
    Name stdout
    Match *

in which /testdata/plain_old_text.txt is a file to which I write "hi\n" every second, I get the following after sending a termination signal:

^C[2023/04/11 20:38:31] [engine] caught signal (SIGINT)
[2023/04/11 20:38:31] [ info] [input] pausing tail.1
[2023/04/11 20:38:31] [debug] [task] created task=0x7fb146c37a00 id=0 OK
[0] dummy: [1681245511.154544603, {"message"=>"dummy"}]
[2023/04/11 20:38:31] [ warn] [engine] service will shutdown in max 30 seconds
[2023/04/11 20:38:31] [debug] [out coro] cb_destroy coro_id=8
[2023/04/11 20:38:31] [debug] [task] destroy task=0x7fb146c37a00 (task_id=0)
[2023/04/11 20:38:32] [ info] [engine] service has stopped (0 pending tasks)
[2023/04/11 20:38:32] [debug] [input:tail:tail.1] inode=8280729 removing file name /testdata/plain_old_text.txt
[2023/04/11 20:38:32] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=8280729 watch_fd=1

It pauses the input plugin immediately after getting a signal, and it exits immediately after flushing whatever has been ingested (it doesn't wait for the 30s grace period, claiming that there are "0 pending tasks"). Is this expected behavior?

When I docker run fluent/fluent-bit:1.6.1, I get similar behavior: it pauses the input plugin. It does wait the 30s grace period, though rather uselessly, since it's not ingesting logs anymore. Is that expected behavior as well? Why is there a difference between the two?
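
To make the difference concrete, here is a hypothetical C sketch of the two shutdown policies as observed above (not the actual engine code): 1.6.1 always sleeps the full grace window, while 1.8.x returns as soon as the task queue is drained, with the Grace value acting only as an upper bound.

/* Hypothetical sketch of the two shutdown policies observed above,
 * not Fluent Bit's actual engine code. */
#include <stdio.h>
#include <unistd.h>

/* stand-in for however many flush tasks are still queued at shutdown */
static int pending_tasks(void)
{
    return 0;
}

/* 1.6.1-style: always sleep the full grace window, even with nothing queued */
static void shutdown_fixed_grace(int grace)
{
    printf("service will stop in %d seconds\n", grace);
    sleep(grace);
    printf("service stopped\n");
}

/* 1.8.x-style: return as soon as the task queue drains; grace is only a cap */
static void shutdown_until_drained(int grace)
{
    int waited = 0;

    printf("service will shutdown in max %d seconds\n", grace);
    while (pending_tasks() > 0 && waited < grace) {
        sleep(1);
        waited++;
    }
    printf("service has stopped (%d pending tasks)\n", pending_tasks());
}

int main(void)
{
    /* with 0 pending tasks, the first call still blocks for 30 seconds,
     * while the second returns immediately */
    shutdown_fixed_grace(30);
    shutdown_until_drained(30);
    return 0;
}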
