Rework fluent-bit-watcher and make use of the hot-reload mechanism #1051

markusthoemmes · 2024-01-30T15:40:39Z

Note: I recognize that I haven't proposed this change per se, and that this rework isn't strictly necessary to implement hot-reload. I'm happy to change things around and I don't want to come across as "here's a rework, plz merge" 😅. This is more of a basis for discussion for now.

What this PR does / why we need it:

This reworks the fluent-bit-watcher to be "single-child-process" only. Since we can now use hot-reload, we don't need to account for potential fluent-bit restarts "under the hood" and can instead rely on a single process to live "forever". Whenever that one fluent-bit process exits, the watcher should exit as well, and Kubernetes' CrashLoopBackOff takes care of the rest.

That fundamental change in assumption allows us to vastly simplify the lifecycle in the watcher, which this PR does. It gets rid of all globals, some of which were shared unsafely, in favor of a single process being started, watched, kicked and eventually stopped.

SIGTERM is still forwarded to the child process. There is no need for a SIGKILL timeout anymore since Kubernetes will eventually send a SIGKILL itself if the process isn't exiting as expected.

Does this PR introduce a user-facing change?

The fluent-bit config is now reloaded through the hot-reload mechanism and the underlying process is no longer restarted to force a config reload.

ACTION REQUIRED: Due to the usage of the hot-reload feature, this version is no longer suitable for fluent-bit versions below 2.1.
ACTION REQUIRED: Due to the usage of the hot-reload feature, the `--exit-on-failure` and `--flb-timeout` flags have become deprecated and have no effect. `--exit-on-failure` is no longer necessary as the fluent-bit pod will exit whenever fluent-bit exits (error or not) and `--flb-timeout` can be recreated by adding a `terminationGracePeriod` to the `FluentBit` resource.
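As an illustration of keeping deprecated flags for backwards compatibility, the following sketch registers both flags so that existing command lines still parse, ignores their values, and reports which ones were set. The `parseFlags` helper, the flag types (bool and duration) and the warning text are assumptions, not the code from this PR.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// deprecatedFlags maps each still-accepted flag to a hint about its replacement.
var deprecatedFlags = map[string]string{
	"exit-on-failure": "the watcher now always exits when fluent-bit exits",
	"flb-timeout":     "set a termination grace period on the FluentBit resource instead",
}

// parseFlags (hypothetical helper) parses args and returns the names of the
// deprecated flags that were explicitly set.
func parseFlags(args []string) ([]string, error) {
	fs := flag.NewFlagSet("fluent-bit-watcher", flag.ContinueOnError)
	// Registered only so that old command lines keep working; the values are
	// never read. The types are assumptions.
	_ = fs.Bool("exit-on-failure", false, "Deprecated: has no effect.")
	_ = fs.Duration("flb-timeout", 0, "Deprecated: has no effect.")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}

	var used []string
	// Visit walks only the flags that were set, in lexicographical order.
	fs.Visit(func(f *flag.Flag) {
		if _, ok := deprecatedFlags[f.Name]; ok {
			used = append(used, f.Name)
		}
	})
	return used, nil
}

func main() {
	used, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	for _, name := range used {
		fmt.Fprintf(os.Stderr, "warning: --%s is deprecated and has no effect: %s\n", name, deprecatedFlags[name])
	}
}
```

Accepting and ignoring the flags means existing manifests that pass them keep starting, while the warning points users at the replacement.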

Additional documentation, usage docs, etc.:

This reworks the fluent-bit-watcher to be "single-child-process" only. Since we can now use hot-reload, we don't need to account for potential fluent-bit restarts "under the hood" and can instead rely on a single process to live "forever". Whenever that one fluent-bit process exits, the watcher should exit as well, and Kubernetes' `CrashLoopBackOff` takes care of the rest.

That fundamental change in assumption allows us to vastly simplify the lifecycle in the watcher, which this PR does. It gets rid of all globals, some of which were shared unsafely, in favor of a single process being started, watched, kicked and eventually stopped.

SIGTERM is still forwarded to the child process. There is no need for a SIGKILL timeout anymore since Kubernetes will eventually send a SIGKILL itself if the process isn't exiting as expected.

Signed-off-by: Markus Thömmes <markusthoemmes@me.com>

benjaminhuo · 2024-01-31T01:34:55Z

@markusthoemmes Thanks very much for this change!
@wanjunlei @wenchajun Please help to review this

wenchajun · 2024-01-31T02:52:33Z

A note: because this change relies on Fluent Bit's hot-reload feature, the Fluent Operator image built from it is not suitable for Fluent Bit versions below v2.1.

markusthoemmes · 2024-01-31T08:36:15Z

Quick check wrt. the --exit-on-failure and --flb-timeout flags: I've just dropped them in the current state. I wonder though if we should keep them for backwards compatibility and log a deprecation?

wenchajun · 2024-01-31T08:45:54Z

I think we can retain them and leave documentation notes. Of course, this is a point worth discussing, as both options are feasible.

markusthoemmes · 2024-01-31T09:33:30Z

Hmm. In terms of feasibility, I think --exit-on-failure is moot since we're exiting regardless (the lifecycle has changed). --flb-timeout actually becomes terminationGracePeriod on the respective DaemonSet/StatefulSet, I think. That'll instruct Kubernetes to send a SIGKILL after the respective time, which makes it behave exactly the same. I'll have another look.

Signed-off-by: Markus Thömmes <markusthoemmes@me.com>

markusthoemmes · 2024-01-31T10:45:41Z

I've reflected my thoughts on how to deal with the flags in the current state.

I'm running into a bit of an issue with the reload, though, caused by the parsers.conf file. I'll have to check whether that's due to my very old fluent-operator version. I'll report back.

Signed-off-by: Markus Thömmes <markusthoemmes@me.com>

benjaminhuo · 2024-02-01T02:29:51Z

@markusthoemmes Thanks for all the thoughts.
@wanjunlei what do you think?

benjaminhuo · 2024-02-18T07:51:01Z

@markusthoemmes Thanks for this big change. We're working on a new fluentbit/fluentd config generation mechanism that watches CRDs in the watcher instead of putting the configs in a secret, which is more scalable.

I'm going to invite you as fluent-operator maintainer @markusthoemmes
cc @wanjunlei @wenchajun @Gentleelephant @patrick-stephens

benjaminhuo · 2024-02-18T07:52:50Z

I'm going to invite you as fluent-operator maintainer @markusthoemmes

@patrick-stephens @agup006 would you help to invite @markusthoemmes as a member of the fluent org as well?

markusthoemmes · 2024-02-18T07:57:53Z

Thanks for the invite. Do note that this change has a bug in that config reloads can be swallowed if they happen in too quick succession. I've opened a bug against fluent-bit wrt that, see fluent/fluent-bit#8457

There's a workaround by fetching the hot reload count before the kick and then waiting for the count to converge afterwards but that ain't implemented in this change. I don't know when and if I'll find the time to contribute that.

benjaminhuo · 2024-02-18T08:10:03Z

> Thanks for the invite. Do note that this change has a bug in that config reloads can be swallowed if they happen in too quick succession. I've opened a bug against fluent-bit wrt that, see fluent/fluent-bit#8457
>
> There's a workaround by fetching the hot reload count before the kick and then waiting for the count to converge afterwards but that ain't implemented in this change. I don't know when and if I'll find the time to contribute that.

Looks like we need to wait for this fluentbit PR to be merged: fluent/fluent-bit#8461
And when a new fluentbit release is out after this PR, we can add a retry mechanism to the reload?

benjaminhuo · 2024-03-08T07:07:45Z

> Thanks for the invite. Do note that this change has a bug in that config reloads can be swallowed if they happen in too quick succession. I've opened a bug against fluent-bit wrt that, see fluent/fluent-bit#8457
>
> There's a workaround by fetching the hot reload count before the kick and then waiting for the count to converge afterwards but that ain't implemented in this change. I don't know when and if I'll find the time to contribute that.

@patrick-stephens @agup006 The fluentbit watcher now uses fluentbit's hot reload feature instead of killing the fluentbit process, but it looks like the hot reload has some issues. Would you please take a look?

markusthoemmes force-pushed the rework-fluent-bit-watcher branch from 963f00e to 4c1fb3a Compare January 30, 2024 15:48

markusthoemmes force-pushed the rework-fluent-bit-watcher branch from 4c1fb3a to 90d364b Compare January 30, 2024 15:54

benjaminhuo requested review from wanjunlei and wenchajun January 31, 2024 01:32

wenchajun previously approved these changes Jan 31, 2024

markusthoemmes added 2 commits January 31, 2024 11:38

Silence errors if the process has already exited

3a41f7a

Signed-off-by: Markus Thömmes <markusthoemmes@me.com>

Keep old flags and deprecate them for backwards compatibility

051ca71

Signed-off-by: Markus Thömmes <markusthoemmes@me.com>

markusthoemmes dismissed wenchajun’s stale review via 051ca71 January 31, 2024 10:44

Fully qualify default parsers.conf

468acdc

Signed-off-by: Markus Thömmes <markusthoemmes@me.com>

markusthoemmes force-pushed the rework-fluent-bit-watcher branch from d0b52b9 to 468acdc Compare January 31, 2024 11:16

wanjunlei approved these changes Feb 18, 2024

benjaminhuo approved these changes Feb 18, 2024

benjaminhuo merged commit 2f254cc into fluent:master Feb 18, 2024
6 checks passed
