Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: restart fluentbit on failures [DET-4665] #1696

Merged
merged 2 commits into from Dec 10, 2020
Merged

chore: restart fluentbit on failures [DET-4665] #1696

merged 2 commits into from Dec 10, 2020

Conversation

stoksc
Copy link
Contributor

@stoksc stoksc commented Dec 10, 2020

Description

This change introduces logic to restart fluentbit on failures.

Test Plan

  • run a cluster and kill the fluent container, ensure it is restarted
  • run an experiment on a cluster and kill fluentbit, make sure it is restarted and logs are restarted.

Commentary (optional)

Checklist

@stoksc stoksc requested a review from dzhu December 10, 2020 15:40
@cla-bot cla-bot bot added the cla-signed label Dec 10, 2020
@justin-determined-ai justin-determined-ai changed the title chore: restart fluentbit on failures chore: restart fluentbit on failures [DET-4665] Dec 10, 2020
Copy link
Contributor

@dzhu dzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing really blocking, though there are some tweaks to how errors are passed around that will improve the experience of figuring out what happened if this needs to be used, so probably good to get those in if there's time.

@@ -136,6 +153,27 @@ func (a *agent) Receive(ctx *actor.Context) error {
return nil
}

func (a *agent) restartFluent(ctx *actor.Context) error {
i := 0
for {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Discussed offline and we can pretty much leave this, but documenting my thoughts.]

This retrying only applies to errors that come up inside Docker while starting the container, not failures of Fluent Bit itself. So if, e.g., the desired port is already bound, the agent is just going to retry once a second forever. Which is not terrible behavior in and of itself, but I feel like this logic here may not be pulling its weight, especially since newFluentActor must have succeeded once already before this can run.

}
return errors.Wrapf(msg.Error, "unexpected child failure: %s", msg.Child.Address())

case fluentFailed:
return errors.Wrapf(msg.err, "unexpected unrecoverable fluent failure: %s", msg.err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need to wrap it and %s it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, was an oversight

err := a.restartFluent(ctx)
if err != nil {
ctx.Tell(ctx.Self(), fluentFailed{
err: errors.New("failed to restart fluent with retries"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably wrap the err that led to this branch.

@@ -439,6 +439,11 @@ func newFluentActor(
}, nil
}

// fluentFailed is a message sent when the trackLogs sees fluent has failed.
type fluentFailed struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: I find it mildly confusing that fluentFailed is used internally by both fluentActor and agent but is not passed between them. Maybe use different types? Or you could just have fluentActor Stop() instead of returning an error, since the parent is going to print out plenty of stuff if that happens anyway.

@dzhu dzhu assigned stoksc and unassigned dzhu Dec 10, 2020
@stoksc stoksc merged commit d5a913d into determined-ai:master Dec 10, 2020
@stoksc stoksc deleted the agent-fluent-restart branch December 10, 2020 22:51
justin-determined-ai pushed a commit to justin-determined-ai/determined that referenced this pull request Dec 10, 2020
This change introduces logic to restart fluentbit on failures.
@dannysauer dannysauer added this to the 0.13.11 milestone Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants