Some broker pods get stuck with a terminated container when there is no healthy broker in the Kafka cluster. #802
LGTM
pkg/resources/kafka/kafka.go
Outdated
if err = r.reconcilePerBrokerDynamicConfig(broker.Id, brokerConfig, configMap, log); err != nil {
return err
perBrokerDynamicConfigCombineError = errors.Combine(perBrokerDynamicConfigCombineError, err)
We could add some details to the error, such as broker.Id, brokerConfig, and configMap, if they are not already included.
pkg/resources/kafka/kafka.go
Outdated
}
}

if perBrokerDynamicConfigCombineError != nil {
return perBrokerDynamicConfigCombineError
If any of the other operations failed in the loop, e.g.
koperator/pkg/resources/kafka/kafka.go
Line 247 in f0fc247
return errors.WrapIf(err, "failed to reconcile resource")
koperator/pkg/resources/kafka/kafka.go
Line 255 in f0fc247
return errors.WrapIfWithDetails(err, "failed to reconcile resource", "resource", configMap.GetObjectKind().GroupVersionKind())
koperator/pkg/resources/kafka/kafka.go
Line 262 in f0fc247
return errors.WrapIfWithDetails(err, "failed to reconcile resource", "resource", configMap.GetObjectKind().GroupVersionKind())
and perBrokerDynamicConfigCombineError is not nil because there were problems reconciling the dynamic configs of earlier brokers, then perBrokerDynamicConfigCombineError is not going to be returned, which doesn't seem ideal to me...
LGTM
I have disabled the cyclo check for the reconcile function temporarily.
What's in this PR?
If reconcilePerBrokerDynamicConfig returns an error, the loop now continues to the next broker instead of returning immediately. A combined error is returned after the broker loop, so the operator will try again later.
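The pattern described above can be illustrated with a minimal, self-contained sketch. It is not the operator's code: `reconcileBroker` and `reconcileAll` are hypothetical stand-ins for reconcilePerBrokerDynamicConfig and the broker loop, the `reachable` map is an invented way to simulate unreachable brokers, and the standard library's `errors.Join` (Go 1.20 or later) is used in place of the `errors.Combine` call from the diff.

```go
package main

import (
	"errors"
	"fmt"
)

// reconcileBroker is a hypothetical stand-in for
// reconcilePerBrokerDynamicConfig: it fails when the broker is unreachable.
func reconcileBroker(id int32, reachable map[int32]bool) error {
	if !reachable[id] {
		return fmt.Errorf("broker %d: could not set dynamic config: broker unreachable", id)
	}
	return nil
}

// reconcileAll visits every broker and collects failures instead of
// returning on the first one. It reports how many brokers were visited
// and the combined error (nil if every broker succeeded).
func reconcileAll(ids []int32, reachable map[int32]bool) (int, error) {
	var combined error
	visited := 0
	for _, id := range ids {
		visited++
		if err := reconcileBroker(id, reachable); err != nil {
			// Remember the failure and move on to the next broker.
			combined = errors.Join(combined, err)
		}
	}
	// A non-nil result lets the caller requeue and try again later.
	return visited, combined
}

func main() {
	// Only broker 1 is reachable; brokers 0 and 2 fail but are still visited.
	visited, err := reconcileAll([]int32{0, 1, 2}, map[int32]bool{1: true})
	fmt.Println("brokers visited:", visited)
	fmt.Println("combined error:", err)
}
```

With an early `return err`, the loop would have stopped at broker 0 and the remaining brokers would never have been processed in that pass.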
Why?
When a Kafka cluster has no working broker but has multiple broker pods with a terminated container, only one broker pod is restarted. The other brokers with terminated containers are not restarted until at least one healthy broker is present in the Kafka cluster. The root cause is that the brokers are unreachable, so the reconcilePerBrokerDynamicConfig function cannot set the dynamic configs and returns an error.