Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block excessive gracefulReload requests #3396

Closed
wants to merge 3 commits into from

Conversation

kenhys
Copy link
Contributor

@kenhys kenhys commented May 26, 2021

Which issue(s) this PR fixes:

Fixes #3341

What this PR does / why we need it:

In the previous versions, /api/config.gracefulReload call doesn't
restrict excessive API calls. It causes the following error when
already gracefulReload is executing.

Worker 0 finished unexpectedly with signal SIGKILL

This commit mitigates such a situation by limit a API call.
(it gives an interval in 60 seconds by default)

Docs Changes:

https://docs.fluentd.org/deployment/system-config
@blocking_reload_interval should be added in another PR.

Release Note:

N/A

NOTE: Ideally it should wait and detects graceful reload finish, but
there is no easy way to synchronize internal state between
ServerModule(RPC::Server#mount_proc) and WorkerModule (reload_config).

@kenhys
Copy link
Contributor Author

kenhys commented May 26, 2021

rerun windows CI again.

@kenhys kenhys marked this pull request as ready for review May 26, 2021 08:53
lib/fluent/supervisor.rb Outdated Show resolved Hide resolved
lib/fluent/supervisor.rb Outdated Show resolved Hide resolved
@kenhys
Copy link
Contributor Author

kenhys commented May 27, 2021

TODO: @system_config.blocking_reload_interval returns default value instead of customized one. 🤔

@kenhys kenhys changed the title Block excessive gracefulReload requests WIP: Block excessive gracefulReload requests May 27, 2021
@kenhys
Copy link
Contributor Author

kenhys commented May 28, 2021

This PR should be ready for v1.13.1? or later release. (not for v1.13.0)

@cosmo0920
Copy link
Contributor

@kenhys All of tests which are running on GitHub Actions is failed. Could you check them?

@kenhys
Copy link
Contributor Author

kenhys commented May 28, 2021

The problem of the current approach to customize blocking_read_interval:

  • config[:blocking_read_interval] is nil in Fluent::ServerModule#before_run

  • because of the above, @block_reload_until is not overriden in Fluent::ServerModule#run

@kenhys kenhys changed the title WIP: Block excessive gracefulReload requests Block excessive gracefulReload requests Jun 1, 2021
@kenhys
Copy link
Contributor Author

kenhys commented Jun 1, 2021

Broken test cases were fixed.

@kenhys
Copy link
Contributor Author

kenhys commented Jun 1, 2021

Rebased with recent master.

@kenhys kenhys requested a review from ashie June 2, 2021 02:32
lib/fluent/supervisor.rb Outdated Show resolved Hide resolved
@kenhys kenhys force-pushed the guard-reload branch 3 times, most recently from ad6e5fd to 98e4ce9 Compare June 2, 2021 07:18
@kenhys
Copy link
Contributor Author

kenhys commented Jun 2, 2021

Removed if @blocking_reload_interval check, and tweak test case a bit.

lib/fluent/supervisor.rb Outdated Show resolved Hide resolved
@kenhys
Copy link
Contributor Author

kenhys commented Jun 3, 2021

Changed to use @blocking_reload_interval instead of config[:blocking_reload_interval] and removed a redundant conditional assignment.

@ashie
Copy link
Member

ashie commented Jun 7, 2021

It can block excessive RPC call but can't block excessive SIGUSR2 call.
Could you consider also SIGUSR2 case?

@kenhys
Copy link
Contributor Author

kenhys commented Jun 7, 2021

I'll fix it, too.

In the previous versions, RPC::Server#mount_proc assumes that
the content of API response is either {'ok':true} or {'ok':false}.
And response must be instance of HTTPResponse or similar one
.(which respond to #body)

This commit accept more simple hash object to customize
API response.

Signed-off-by: Kentaro Hayashi <hayashi@clear-code.com>
@kenhys
Copy link
Contributor Author

kenhys commented Jun 7, 2021

SIGUSR2 is also supported.

@ashie
Copy link
Member

ashie commented Jun 7, 2021

CI failed on Windows:

2021-06-07T05:42:25.0561362Z Error: test_blocking_signal_handler(SupervisorTest::gracefulReload): NoMethodError: undefined method `stop_rpc_server' for nil:NilClass
2021-06-07T05:42:25.0565425Z D:/a/fluentd/fluentd/test/test_supervisor.rb:299:in `teardown'

Closes: fluent#3341

In the previous versions, /api/config.gracefulReload call doesn't
restrict excessive API calls. It causes the following error when
already gracefulReload is executing.

  Worker 0 finished unexpectedly with signal SIGKILL

This commit mitigates such a situation by restricting a API call.
(it gives an some interval and it is customizable in system configuration
 - blocking_reload_interval parameter)

NOTE: Ideally it should wait and detects graceful reload finish, but
there is no easy way to synchronize internal state between
ServerModule(RPC::Server#mount_proc) and WorkerModule (reload_config).

Signed-off-by: Kentaro Hayashi <hayashi@clear-code.com>
The reload is kicked from not only RPC call, but also SIGUSR2.

Signed-off-by: Kentaro Hayashi <hayashi@clear-code.com>
@Cryptophobia
Copy link

Cryptophobia commented Jun 8, 2021

When gracefulReload is triggered, a new thread for reloading is created at both supervisor process and worker process.

If the threads run sequentially, would that be enough to keep consistency, no matter how many times gracefulReload is called? This would the similar as implementing a queue for the threads?

I have no opinions against trying to solve this as long as there is a flag to keep current behavior the same because in projects like KFO, we do not wait or check the status for a return from gracefulReload endpoint when issuing multiple requests sequentially. We hope that fluentd will be able to handle this logic of what to do, in situations when configuration has changed in a short amount of time.

The other important consideration is, if a gracefulReload worker thread fails, how do we make sure that unfinished configuration change is still gracefullyReloaded? If it only fails once and blocks, the subsequent configuration change will not be able to gracefulReload, and that would be bad design, because then subsequent calls to an API (gracefulReload in this case) depend on the processing of previous API calls. Should the API not be independent and the processing of the config reloading need to be idempotent?

By idempotent, I just mean if we send 2 calls to gracefulReload, fluentd should just finish processing both sequentially. Any blocking would change behavior of the API and configuration changes could be missed, also API calls ignored.

@kenhys
Copy link
Contributor Author

kenhys commented Jun 9, 2021

Thanks @alex-vmw, @Cryptophobia, @ashie
Current approach makes it problematic, I withdraw PR.

@kenhys kenhys closed this Jun 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

worker-0 is getting SIGKILLed instead of nicely reloading upon issuing gracefulReload
5 participants