Update Kafka PDB to use live Pods instead of the CR Spec #770

Merged
merged 2 commits on Mar 22, 2022

Conversation

alungu
Contributor

@alungu commented Feb 11, 2022

Q A
Bug fix? yes
New feature? no
API breaks? no
Deprecations? no
Related tickets #769
License Apache 2.0

What's in this PR?

The Kafka PDB logic is updated to consider the total number of healthy Kafka brokers to be the number of Running Kafka pods (instead of the number of Kafka brokers specified in the Kafka CR).
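
As a rough illustration of the approach (not the exact koperator code; the helper name, pod labels, and the budget of one allowed disruption are assumptions):

```go
// Illustrative sketch: count Running Kafka pods and derive the PDB minAvailable
// from that count, instead of from the broker count in the KafkaCluster Spec.
package kafka

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// computeMinAvailable is an assumed helper name; the label selector is an assumption.
func computeMinAvailable(ctx context.Context, c client.Client, namespace, clusterName string) (intstr.IntOrString, error) {
	podList := &corev1.PodList{}
	if err := c.List(ctx, podList,
		client.InNamespace(namespace),
		client.MatchingLabels{"app": "kafka", "kafka_cr": clusterName},
	); err != nil {
		return intstr.IntOrString{}, err
	}

	// Count only pods that are actually Running.
	running := 0
	for _, pod := range podList.Items {
		if pod.Status.Phase == corev1.PodRunning {
			running++
		}
	}

	// Tolerate one voluntary disruption among the currently healthy brokers.
	if running < 1 {
		return intstr.FromInt(0), nil
	}
	return intstr.FromInt(running - 1), nil
}
```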

Why?

The previous implementation led to a faulty PDB during scaling (both in and out), since adding or removing brokers takes some time.

Additional context

Checklist

  • Implementation tested
  • Error handling code meets the guideline
  • Logging code meets the guideline
  • User guide and development docs updated (if needed)

@alungu requested a review from a team as a code owner on February 11, 2022 at 10:10
@alungu force-pushed the kafka-pdb branch 2 times, most recently from 3a3f558 to 8855602 on February 11, 2022 at 12:40
Contributor

@bartam1 left a comment

Thank you for the PR!

pkg/resources/kafka/poddisruptionbudget.go: review comment (outdated, resolved)
@stoader
Member

stoader commented Mar 2, 2022

@alungu relying on the number of Kafka pods might not be the best approach. For example, if you have a 6-broker cluster and all 6 broker pods are up and running, the PDB minAvailable will be 5. When a broker pod is deleted for whatever reason outside of koperator, that event will trigger a reconcile in koperator, which will update the PDB minAvailable to 4, as only 5 running broker pods are found before the missing pod is recreated. After the missing pod comes back up, the next reconcile flow will set the PDB minAvailable back to 5.

If the PDB handling https://github.com/banzaicloud/koperator/blob/master/pkg/resources/kafka/kafka.go#L162 is moved to be the last step, after https://github.com/banzaicloud/koperator/blob/master/pkg/resources/kafka/kafka.go#L287, then it might not be an issue.

What I'm thinking of is to determine the number of brokers for PDB handling from the Status, as that contains the list of running brokers and the brokers pending graceful upscale/downscale. In this case the PDB handling could stay where it is now: https://github.com/banzaicloud/koperator/blob/master/pkg/resources/kafka/kafka.go#L162

What do you think?

@alungu
Contributor Author

alungu commented Mar 18, 2022

@stoader Thank you for mentioning this issue!
I performed a few more tests with all the options (Spec vs Status vs Pod List). All of them have blind spots, but I agree with the issue you highlighted for the "Pod List" option.
As such, I will update this PR to use the KafkaCluster Status instead of listing the Kafka pods.
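
For illustration, a minimal sketch of what that could look like (hypothetical, not the actual patch; the import path, the BrokersState field, the helper name, and the one-disruption budget are assumptions based on the koperator API):

```go
// Hypothetical sketch: derive the PDB minAvailable from the KafkaCluster
// Status rather than from the CR Spec or a live pod listing.
package kafka

import (
	"k8s.io/apimachinery/pkg/util/intstr"

	banzaiv1beta1 "github.com/banzaicloud/koperator/api/v1beta1"
)

// minAvailableFromStatus is an assumed helper name. Status.BrokersState tracks
// running brokers plus brokers pending graceful upscale/downscale, so it is
// more stable across single pod restarts than a live pod count.
func minAvailableFromStatus(cluster *banzaiv1beta1.KafkaCluster) intstr.IntOrString {
	brokers := len(cluster.Status.BrokersState)
	if brokers < 1 {
		return intstr.FromInt(0)
	}
	// Allow one voluntary disruption.
	return intstr.FromInt(brokers - 1)
}
```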

@alungu
Contributor Author

alungu commented Mar 18, 2022

@bartam1 @stoader could you review the new solution, please? Thanks!
