Some broker pods get stuck caused by a terminated container when there is no healthy broker in the kafka cluster. #802

Merged
bartam1 merged 5 commits into master from fixbrokerrestart on May 5, 2022

Contributor

@bartam1 bartam1 commented Apr 27, 2022

| Q | A |
| --- | --- |
| Bug fix? | yes |
| New feature? | no |
| API breaks? | no |
| Deprecations? | no |
| License | Apache 2.0 |

What's in this PR?

If reconcilePerBrokerDynamicConfig returns an error, the loop now continues with the next broker instead of returning immediately. The errors are combined and returned after the broker loop, so the operator will try again later.
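
The change described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the operator's code: `reconcileAll`, `failOdd`, and `errBrokerUnreachable` are hypothetical names, and the standard library's `errors.Join` (Go 1.20+) stands in for the `errors.Combine` call used in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errBrokerUnreachable is a hypothetical sentinel error for this sketch.
var errBrokerUnreachable = errors.New("broker unreachable")

// reconcileAll calls reconcile for every broker ID. A failure does not stop
// the loop; failures are joined and returned once all brokers were visited.
func reconcileAll(brokerIDs []int32, reconcile func(id int32) error) (visited []int32, combined error) {
	for _, id := range brokerIDs {
		visited = append(visited, id)
		if err := reconcile(id); err != nil {
			// An early "return err" here would skip the remaining brokers.
			combined = errors.Join(combined, fmt.Errorf("broker %d: %w", id, err))
		}
	}
	return visited, combined
}

// failOdd is a hypothetical reconcile step that fails for odd broker IDs.
func failOdd(id int32) error {
	if id%2 != 0 {
		return errBrokerUnreachable
	}
	return nil
}

func main() {
	visited, err := reconcileAll([]int32{0, 1, 2, 3}, failOdd)
	fmt.Println("visited:", visited)
	fmt.Println("error:", err)
}
```

Because the combined error is non-nil whenever any broker failed, the caller still sees a failure and can requeue, while every broker gets its turn in the loop.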

Why?

When no working Kafka broker is present in a Kafka cluster but multiple broker pods have a terminated container, only one broker pod is restarted. The other brokers with terminated containers are not restarted until at least one healthy broker is present in the cluster. The root cause is that the brokers are unreachable, so the reconcilePerBrokerDynamicConfig function cannot set the dynamic configs and returns an error, which ends the broker loop before the remaining pods are handled.

Checklist

  • Implementation tested
  • Error handling code meets the guideline
  • Logging code meets the guideline

@bartam1 bartam1 requested a review from a team as a code owner April 27, 2022 14:12
Kuvesz
Kuvesz previously approved these changes Apr 27, 2022
Contributor

@Kuvesz Kuvesz left a comment

LGTM

@bartam1 bartam1 changed the title Fix broker pods doesn't restarting beause no healthy brokers present Some broker pods get stuck caused by a terminated container when there is no healthy broker in the kafka cluster. Apr 27, 2022
  if err = r.reconcilePerBrokerDynamicConfig(broker.Id, brokerConfig, configMap, log); err != nil {
-     return err
+     perBrokerDynamicConfigCombineError = errors.Combine(perBrokerDynamicConfigCombineError, err)
Member

We could add some details to the error, such as broker.Id, brokerConfig, and configMap, if they are not already included.
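
One way to attach such details, sketched here under assumptions: `brokerConfigError`, `withBrokerID`, `failedBrokerID`, and `errConnRefused` are hypothetical names invented for illustration, and only the broker ID is recorded. The PR's own code may well do this differently, for example with the `errors.WrapIfWithDetails` helper quoted elsewhere in this review.

```go
package main

import (
	"errors"
	"fmt"
)

// errConnRefused is a hypothetical low-level failure used by this sketch.
var errConnRefused = errors.New("connection refused")

// brokerConfigError is a hypothetical error type that records which broker failed.
type brokerConfigError struct {
	BrokerID int32
	Err      error
}

func (e *brokerConfigError) Error() string {
	return fmt.Sprintf("setting dynamic config failed for broker %d: %v", e.BrokerID, e.Err)
}

// Unwrap exposes the underlying error to errors.Is and errors.As.
func (e *brokerConfigError) Unwrap() error { return e.Err }

// withBrokerID wraps err with the broker ID; a nil error stays nil.
func withBrokerID(id int32, err error) error {
	if err == nil {
		return nil
	}
	return &brokerConfigError{BrokerID: id, Err: err}
}

// failedBrokerID extracts the broker ID from a wrapped error, if present.
func failedBrokerID(err error) (int32, bool) {
	var bce *brokerConfigError
	if errors.As(err, &bce) {
		return bce.BrokerID, true
	}
	return 0, false
}

func main() {
	err := withBrokerID(2, errConnRefused)
	fmt.Println(err)
	if id, ok := failedBrokerID(err); ok {
		fmt.Println("failed broker:", id)
	}
}
```

Carrying the ID as a field rather than only in the message lets log or retry code read it back without parsing strings.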

}
}

if perBrokerDynamicConfigCombineError != nil {
return perBrokerDynamicConfigCombineError
Member

If any of the other operations failed in the loop, e.g.

return errors.WrapIf(err, "failed to reconcile resource")
,
return errors.WrapIfWithDetails(err, "failed to reconcile resource", "resource", configMap.GetObjectKind().GroupVersionKind())
,
return errors.WrapIfWithDetails(err, "failed to reconcile resource", "resource", configMap.GetObjectKind().GroupVersionKind())

and perBrokerDynamicConfigCombineError is already non-nil because reconciling the dynamic configs failed for earlier brokers, then perBrokerDynamicConfigCombineError is not going to be returned, which doesn't seem ideal to me...
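
A rough sketch of one way to address this concern, not necessarily what the PR ended up doing: fold the accumulated error into every early return. All names here (`reconcileLoop`, `failOn`, `errDynamicConfig`, `errResource`) are hypothetical, the order of the two steps is an assumption, and `errors.Join` from the standard library (Go 1.20+) again stands in for `errors.Combine`.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors for the two kinds of failure.
var (
	errDynamicConfig = errors.New("dynamic config failed")
	errResource      = errors.New("failed to reconcile resource")
)

// reconcileLoop runs two hypothetical steps per broker. A dynamicConfig
// failure is accumulated and the loop continues; a resource failure returns
// early, but carries the errors accumulated so far along with it.
func reconcileLoop(ids []int32, dynamicConfig, resource func(id int32) error) error {
	var accumulated error
	for _, id := range ids {
		if err := dynamicConfig(id); err != nil {
			accumulated = errors.Join(accumulated, fmt.Errorf("broker %d: %w", id, err))
		}
		if err := resource(id); err != nil {
			// Join instead of returning err alone, so earlier failures are not lost.
			return errors.Join(accumulated, fmt.Errorf("broker %d: %w", id, err))
		}
	}
	return accumulated
}

// failOn returns a hypothetical step that fails with err for one broker ID.
func failOn(target int32, err error) func(int32) error {
	return func(id int32) error {
		if id == target {
			return err
		}
		return nil
	}
}

func main() {
	err := reconcileLoop([]int32{0, 1, 2}, failOn(0, errDynamicConfig), failOn(1, errResource))
	fmt.Println(errors.Is(err, errDynamicConfig), errors.Is(err, errResource))
	fmt.Println(err)
}
```

The trade-off is that every early return in the loop has to remember to include the accumulated error; a deferred combine on a named return value would be another option.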

pregnor
pregnor previously approved these changes Apr 29, 2022
Member

@pregnor pregnor left a comment

LGTM

Kuvesz
Kuvesz previously approved these changes Apr 29, 2022
pkg/resources/kafka/kafka.go (outdated, resolved)
pkg/resources/kafka/kafka.go (outdated, resolved)
@bartam1 bartam1 dismissed stale reviews from Kuvesz and pregnor via 37f613c May 2, 2022 14:56
pregnor
pregnor previously approved these changes May 2, 2022
stoader
stoader previously approved these changes May 3, 2022
@bartam1 bartam1 dismissed stale reviews from stoader and pregnor via 9c13738 May 4, 2022 11:09
@bartam1
Contributor Author

bartam1 commented May 4, 2022

I have temporarily disabled the cyclo check for the reconcile function.
I'm going to refactor the reconcile function and re-enable the cyclo check in another PR.

@bartam1 bartam1 merged commit 8f589ca into master May 5, 2022
@bartam1 bartam1 deleted the fixbrokerrestart branch May 5, 2022 08:40
6 participants