chore: restart fluentbit on failures [DET-4665] #1696

stoksc · 2020-12-10T15:40:41Z

Description

This change introduces logic to restart fluentbit on failures.

Test Plan

Run a cluster and kill the fluent container; ensure it is restarted.
Run an experiment on a cluster and kill fluentbit; make sure it is restarted and logging resumes.

dzhu

Nothing really blocking. A few tweaks to how errors are passed around would make it easier to figure out what happened if this code path is ever hit, so it's probably good to get those in if there's time.

dzhu · 2020-12-10T20:49:08Z

agent/internal/agent.go

@@ -136,6 +153,27 @@ func (a *agent) Receive(ctx *actor.Context) error {
 	return nil
 }

+func (a *agent) restartFluent(ctx *actor.Context) error {
+	i := 0
+	for {


[Discussed offline and we can pretty much leave this, but documenting my thoughts.]

This retrying only applies to errors that come up inside Docker while starting the container, not failures of Fluent Bit itself. So if, e.g., the desired port is already bound, the agent is just going to retry once a second forever. Which is not terrible behavior in and of itself, but I feel like this logic here may not be pulling its weight, especially since newFluentActor must have succeeded once already before this can run.
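For reference, one way to keep the retry from spinning forever is to cap the number of attempts. This is only a sketch using the standard library, not the code in this PR: `retryStart`, `maxAttempts`, `delay`, and `errPortBound` are hypothetical names, and the `start` callback stands in for whatever actually starts the container.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errPortBound is a hypothetical stand-in for a Docker start error.
var errPortBound = errors.New("port already bound")

// retryStart calls start up to maxAttempts times (maxAttempts must be >= 1),
// sleeping delay between failed attempts. It returns nil on the first
// success, or the last error wrapped with the attempt count.
func retryStart(start func() error, maxAttempts int, delay time.Duration) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = start(); err == nil {
			return nil
		}
		// Do not sleep after the final failed attempt.
		if i < maxAttempts-1 {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Simulated starter: fails twice, then succeeds.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errPortBound
		}
		return nil
	}
	fmt.Println(retryStart(flaky, 5, time.Millisecond), calls)
}
```

With a cap like this, a persistent failure such as an already-bound port surfaces as an error carrying the last cause instead of retrying once a second indefinitely.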

dzhu · 2020-12-10T20:58:25Z

agent/internal/agent.go

 		}
 		return errors.Wrapf(msg.Error, "unexpected child failure: %s", msg.Child.Address())

+	case fluentFailed:
+		return errors.Wrapf(msg.err, "unexpected unrecoverable fluent failure: %s", msg.err)


Don't need to wrap it and %s it.

yeah, was an oversight
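To illustrate the point: wrapping an error already appends the cause to the message, so formatting the same error into the message prints it twice. The sketch below uses a small hypothetical `wrap` helper built on the standard library to mimic the "message: cause" shape that `errors.Wrap` from `github.com/pkg/errors` produces; `errExited`, `redundant`, and `concise` are illustrative names, not code from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errExited is a hypothetical underlying failure.
var errExited = errors.New("container exited")

// wrap mimics the "msg: cause" message shape of pkg/errors.Wrap
// using only the standard library.
func wrap(err error, msg string) error {
	return fmt.Errorf("%s: %w", msg, err)
}

// redundant both wraps err and formats it into the message.
func redundant(err error) error {
	return wrap(err, fmt.Sprintf("unexpected unrecoverable fluent failure: %s", err))
}

// concise only wraps err; the cause still appears once.
func concise(err error) error {
	return wrap(err, "unexpected unrecoverable fluent failure")
}

func main() {
	// unexpected unrecoverable fluent failure: container exited: container exited
	fmt.Println(redundant(errExited))
	// unexpected unrecoverable fluent failure: container exited
	fmt.Println(concise(errExited))
}
```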

dzhu · 2020-12-10T21:01:23Z

agent/internal/agent.go

+				err := a.restartFluent(ctx)
+				if err != nil {
+					ctx.Tell(ctx.Self(), fluentFailed{
+						err: errors.New("failed to restart fluent with retries"),


This should probably wrap the err that led to this branch.
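A small sketch of the difference, using the standard library's `%w` verb rather than `github.com/pkg/errors` as in the PR. `errDocker`, `opaque`, `wrapped`, and `keepsCause` are hypothetical names for illustration only.

```go
package main

import (
	"errors"
	"fmt"
)

// errDocker is a hypothetical error returned by the restart attempt.
var errDocker = errors.New("docker: port is already allocated")

// opaque discards the cause, as a bare errors.New does.
func opaque(cause error) error {
	return errors.New("failed to restart fluent with retries")
}

// wrapped keeps the cause in the message and in the error chain.
func wrapped(cause error) error {
	return fmt.Errorf("failed to restart fluent with retries: %w", cause)
}

// keepsCause reports whether errDocker is reachable from err.
func keepsCause(err error) bool {
	return errors.Is(err, errDocker)
}

func main() {
	fmt.Println(keepsCause(opaque(errDocker)))  // false: the cause is lost
	fmt.Println(keepsCause(wrapped(errDocker))) // true: the cause is preserved
	fmt.Println(wrapped(errDocker))
}
```

With the wrapped form, whoever reads the agent's failure message sees why the restart gave up, not just that it did.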

dzhu · 2020-12-10T21:53:59Z

agent/internal/fluent.go

@@ -439,6 +439,11 @@ func newFluentActor(
 	}, nil
 }

+// fluentFailed is a message sent when the trackLogs sees fluent has failed.
+type fluentFailed struct {


non-blocking: I find it mildly confusing that fluentFailed is used internally by both fluentActor and agent but is not passed between them. Maybe use different types? Or you could just have fluentActor Stop() instead of returning an error, since the parent is going to print out plenty of stuff if that happens anyway.
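A rough sketch of the "different types" option, with made-up names: `fluentExited` would be the message the fluent actor sends itself, and `fluentRestartFailed` the one the agent sends itself. Neither name is from the PR, and `describe` is just a stand-in for each actor's message handler.

```go
package main

import (
	"errors"
	"fmt"
)

// fluentExited is a hypothetical message used only inside the fluent actor
// when log tracking sees the container fail.
type fluentExited struct{ err error }

// fluentRestartFailed is a hypothetical message used only inside the agent
// when restarting fluent gives up.
type fluentRestartFailed struct{ err error }

// errBoom is a placeholder failure for the example.
var errBoom = errors.New("boom")

// describe shows that separate types make the origin of a message obvious.
func describe(msg interface{}) string {
	switch m := msg.(type) {
	case fluentExited:
		return fmt.Sprintf("fluent actor: container failed: %v", m.err)
	case fluentRestartFailed:
		return fmt.Sprintf("agent: restart failed: %v", m.err)
	default:
		return "unhandled message"
	}
}

func main() {
	fmt.Println(describe(fluentExited{err: errBoom}))
	fmt.Println(describe(fluentRestartFailed{err: errBoom}))
}
```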

stoksc requested a review from dzhu December 10, 2020 15:40

stoksc assigned dzhu Dec 10, 2020

chore: restart fluentbit on failures

226d758

cla-bot bot added the cla-signed label Dec 10, 2020

justin-determined-ai changed the title ~~chore: restart fluentbit on failures~~ chore: restart fluentbit on failures [DET-4665] Dec 10, 2020

dzhu approved these changes Dec 10, 2020

dzhu assigned stoksc and unassigned dzhu Dec 10, 2020

feedback

9c863df

stoksc merged commit d5a913d into determined-ai:master Dec 10, 2020

stoksc deleted the agent-fluent-restart branch December 10, 2020 22:51

justin-determined-ai pushed a commit to justin-determined-ai/determined that referenced this pull request Dec 10, 2020

chore: restart fluentbit on failures [DET-4665] (determined-ai#1696)

6e6fa1c

dannysauer added this to the 0.13.11 milestone Feb 6, 2024
