Skip to content

KAFKA-12309 The revocation algorithm produces uneven distributions#10077

Closed
chia7712 wants to merge 11 commits intoapache:trunkfrom
chia7712:KAFKA-12309
Closed

KAFKA-12309 The revocation algorithm produces uneven distributions#10077
chia7712 wants to merge 11 commits intoapache:trunkfrom
chia7712:KAFKA-12309

Conversation

@chia7712
Copy link
Copy Markdown
Member

@chia7712 chia7712 commented Feb 7, 2021

issue: https://issues.apache.org/jira/browse/KAFKA-12309

issue description

When adding a new worker, the revocation algorithm tries to revoke some connectors/tasks from existent workers so as to move them to new worker to balance load. However, the algorithm can revoke incorrect number of connectors/tasks for following cases.

Assignments:

"worker_0" -> 8 connectors/tasks
"worker_1" -> 8 connectors/tasks
(New) "worker_2" -> 0 connectors/tasks

Result

"worker_0" -> 6 connectors/tasks
"worker_1" -> 6 connectors/tasks
"worker_2" -> 4 connectors/tasks

(expected) Result

"worker_0" -> 6 connectors/tasks
"worker_1" -> 5 connectors/tasks
"worker_2" -> 5 connectors/tasks
or
"worker_0" -> 5 connectors/tasks
"worker_1" -> 6 connectors/tasks
"worker_2" -> 5 connectors/tasks

root cause

the algorithm makes all existent workers keep ceil (numToRevoke = existing.tasksSize() - ceilTasks;) connectors/tasks. That results in that there is no enough revoked connectors/tasks to be assigned to new worker.

extra changes included by this PR

  1. move revocation algorithm to a helper method in order to test it

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@chia7712 chia7712 requested a review from kkonstantine February 7, 2021 16:27
tasks.values().size(),
tasks.values().stream().distinct().collect(Collectors.toList()).size());
assertTrue("Connectors are imbalanced: " + formatAssignment(connectors), maxConnectors - minConnectors < 2);
if (minConnectors > 1) assertEquals("Some workers have no connectors", connectors.size(), connect.workers().size());
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure there is no idle worker

@chia7712
Copy link
Copy Markdown
Member Author

@rhauch Could you take a look? this uneven distributions can be reproduced easily so it would be nice to fix it asap.

Copy link
Copy Markdown
Member

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chia7712 , thanks for the patch! This is a good finding! Left some comments. Thanks.

for (int i = existing.tasksSize(); i > floorTasks && numToRevoke > 0; --i, --numToRevoke) {
int currentSize = forTask ? existing.tasksSize() : existing.connectorsSize();
int expectedSize;
if (existingWorkers.size() == 1 || currentSize == 1) expectedSize = ceil;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this condition? And why under this case, we don't need to have numberOfBiggers--? After all, we assigned one ceiling to this worker, didn't we?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a specify case as the node having single task is already in balance. We can move the job to another node but it does not balance anything but it brings extra down-time.

I will add comments for that case.

Copy link
Copy Markdown
Contributor

@kkonstantine kkonstantine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @chia7712

I'd like to return for a detailed review after we exclude stylistic changes that are not necessary. Also, it'd be good to enhance the description besides the example and call out the corner cases for which logic has been added.

@chia7712
Copy link
Copy Markdown
Member Author

@kkonstantine thanks for your feedback.

I'd like to return for a detailed review after we exclude stylistic changes that are not necessary. Also, it'd be good to enhance the description besides the example and call out the corner cases for which logic has been added.

Sorry that I did not follow the code style in connect module. I have reverted all incorrect style. Also, I have updated the description.

Copy link
Copy Markdown
Contributor

@kkonstantine kkonstantine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick turnaround @chia7712 !
Much easier to review line my line now (at least for me).

I'll do my best to review by the end of the week.

@chia7712 chia7712 closed this Mar 25, 2024
@chia7712 chia7712 deleted the KAFKA-12309 branch March 25, 2024 15:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants