New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: restart fluentbit on failures [DET-4665] #1696
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing really blocking, though there are some tweaks to how errors are passed around that will improve the experience of figuring out what happened if this needs to be used, so probably good to get those in if there's time.
@@ -136,6 +153,27 @@ func (a *agent) Receive(ctx *actor.Context) error { | |||
return nil | |||
} | |||
|
|||
func (a *agent) restartFluent(ctx *actor.Context) error { | |||
i := 0 | |||
for { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Discussed offline and we can pretty much leave this, but documenting my thoughts.]
This retrying only applies to errors that come up inside Docker while starting the container, not failures of Fluent Bit itself. So if, e.g., the desired port is already bound, the agent is just going to retry once a second forever. Which is not terrible behavior in and of itself, but I feel like this logic here may not be pulling its weight, especially since newFluentActor
must have succeeded once already before this can run.
agent/internal/agent.go
Outdated
} | ||
return errors.Wrapf(msg.Error, "unexpected child failure: %s", msg.Child.Address()) | ||
|
||
case fluentFailed: | ||
return errors.Wrapf(msg.err, "unexpected unrecoverable fluent failure: %s", msg.err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need to wrap it and %s
it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, was an oversight
agent/internal/agent.go
Outdated
err := a.restartFluent(ctx) | ||
if err != nil { | ||
ctx.Tell(ctx.Self(), fluentFailed{ | ||
err: errors.New("failed to restart fluent with retries"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably wrap the err
that led to this branch.
agent/internal/fluent.go
Outdated
@@ -439,6 +439,11 @@ func newFluentActor( | |||
}, nil | |||
} | |||
|
|||
// fluentFailed is a message sent when the trackLogs sees fluent has failed. | |||
type fluentFailed struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking: I find it mildly confusing that fluentFailed
is used internally by both fluentActor
and agent
but is not passed between them. Maybe use different types? Or you could just have fluentActor
Stop()
instead of returning an error, since the parent is going to print out plenty of stuff if that happens anyway.
This change introduces logic to restart fluentbit on failures.
Description
This change introduces logic to restart fluentbit on failures.
Test Plan
Commentary (optional)
Checklist