Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-5505: Incremental cooperative rebalancing in Connect (KIP-415) #6363

Merged
merged 66 commits into from May 17, 2019

Conversation

@kkonstantine
Copy link
Contributor

commented Mar 4, 2019

  • Resolve the stop-the-world effect in Connect by allowing connectors and tasks to keep running if possible.
  • Maintain connector and task assignment in the presence of quick restarts or rolling upgrades of Connect workers
  • Upgrade Connect protocol to allow workers to report their active assignments and the leader to revoke assignments.
  • Select the Connect protocol during runtime in a fully backwards compatible way.
  • Introduce ConnectAssignor and make task assignment scheduling pluggable.
  • Implement incremental cooperative rebalancing of connectors and tasks
  • https://cwiki.apache.org/confluence/display/KAFKA/KIP-415%3A+Incremental+Cooperative+Rebalancing+in+Kafka+Connect

Tested via:

  • Unit tests for protocol compatibility
  • Unit tests on task assignment scheduling
  • Unit tests on incremental cooperative rebalancing
  • Integration tests on incremental cooperative rebalancing

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@kkonstantine kkonstantine force-pushed the kkonstantine:kafka-5505 branch 3 times, most recently from 0dd3063 to ef8d122 Mar 6, 2019

@kkonstantine
Copy link
Contributor Author

left a comment

Marked a set of files that belong to different PRs and will be removed once those PRs are merged.
The PRs are:
#6340
#6342

@kkonstantine kkonstantine force-pushed the kkonstantine:kafka-5505 branch 2 times, most recently from 24cffd0 to abbf617 Mar 7, 2019

@kkonstantine

This comment has been minimized.

Copy link
Contributor Author

commented Mar 11, 2019

@ewencp @hachikuji @rhauch @mumrah the PR is ready for review. The following items are expected to be addressed along with your comments:

  • The code on task assignment is currently written to be readable and evaluate correctness. I'd like to add micro-benchmarks and I expect it'll be optimized.
  • Unit tests for DistributedHerder and WorkerCoordinatorIncrementalTest will be expanded to guard against changes in the new protocol more extensively.
  • More logging will be added appropriately and some integration with metrics will be considered.
  • A few more integration tests will be added.

Some files contain changes that are introduced by other outstanding PRs. If still present here, please skip or review the changes in their respective PRs.

Really looking forward to your comments! Thanks!

@rhauch
Copy link
Contributor

left a comment

Nice job, @kkonstantine. Took my first pass, and overall it looks good. With your review guidance above, I couldn't find any major issues, but I have quite a few comments/questions. Found and logged some nits when I happened to notice them, but I wasn't looking for them. :-D

BTW, #6342 is now merged.

@kkonstantine kkonstantine force-pushed the kkonstantine:kafka-5505 branch from f209531 to da659db Apr 3, 2019

@kkonstantine
Copy link
Contributor Author

left a comment

Thanks for your comments. @rayokota I think I've replied to all your comments.
@rhauch I didn't get to all your comments yet. Next I'll update the config and continue with the rest of the comments.

@kkonstantine kkonstantine force-pushed the kkonstantine:kafka-5505 branch from e5b22b1 to 4f25d17 Apr 16, 2019

@mumrah
Copy link
Contributor

left a comment

Looks great @kkonstantine! My only real concern is the complexity and length of the methods in IncrementalCooperativeAssignor. I kind of wonder if a pattern other than procedural is warranted? I think we should at least consider changing the private methods to package-private and adding some unit tests.

@kkonstantine kkonstantine force-pushed the kkonstantine:kafka-5505 branch from 4f25d17 to 28402f1 Apr 17, 2019

@rayokota

This comment has been minimized.

Copy link
Contributor

commented Apr 18, 2019

@kkonstantine , thanks for responding to my feedback! I just had one remaining comment. Looking great!

@kkonstantine kkonstantine force-pushed the kkonstantine:kafka-5505 branch 2 times, most recently from 9ce15ad to 662a259 Apr 20, 2019

@kkonstantine
Copy link
Contributor Author

left a comment

@rhauch @mumrah I've addressed almost all your comments with changes or replies.
Would you mind returning to these discussions to see if we can resolve them?
There are a couple remaining items regarding javadocs and error handling during assignment, if I'm not mistaken, that I will definitely address before merging. Thanks!

@kkonstantine kkonstantine force-pushed the kkonstantine:kafka-5505 branch from fc4cb39 to 4bebbdd Apr 25, 2019

@kkonstantine kkonstantine force-pushed the kkonstantine:kafka-5505 branch 2 times, most recently from b6d69f2 to 79677ae Apr 26, 2019

@kkonstantine
Copy link
Contributor Author

left a comment

@ewencp thanks a lot for what I assume is your first round of comments!
I fixed/replied to the majority of the comments. Will return to the ones that need more work very soon in a second pass.

@kkonstantine kkonstantine force-pushed the kkonstantine:kafka-5505 branch 2 times, most recently from 145a2fd to 736668b May 16, 2019

@kkonstantine kkonstantine force-pushed the kkonstantine:kafka-5505 branch from 736668b to eb1853d May 17, 2019

@kkonstantine

This comment has been minimized.

Copy link
Contributor Author

commented May 17, 2019

Thanks @rhauch @mumrah @rayokota @ryannedolan and @ewencp for all the insightful and useful comments! I believe I've addressed everything, except a few cleanup/refactoring suggestions that deemed high risk at this point and will be addressed in a follow up PR after this feature is merged.

Soak testing has been also performed and has confirmed correct execution for several days. More extensive testing and performance benchmarking will follow up in the next few days.

I'll be glad if we can get this in. Thanks!

@rhauch

rhauch approved these changes May 17, 2019

Copy link
Contributor

left a comment

Fantastic work, @kkonstantine. I wish this weren't such a big PR, but I've been steadily tracking the progress of the latest commits as you've been running multiple tests. As you say, there are some minor things that could be cleaned up and improved, but given the size of the PR it'd be good to handle those separately in the coming days, since they shouldn't affect behavior or functionality but will be more about maintainability.

I'm approving pending a green build and successful Connect tests. Most of the recent PR builds have been great, but I know you changed just a few test-related things (e.g., Jenkinsfile to run the Connect tests many times) that you've now reverted, and they theoretically shouldn't affect the build.

@rhauch rhauch merged commit ce584a0 into apache:trunk May 17, 2019

1 of 2 checks passed

JDK 8 and Scala 2.11 Build started for merge commit.
Details
JDK 11 and Scala 2.12 SUCCESS 11202 tests run, 69 skipped, 0 failed.
Details

Pengxiaolong added a commit to Pengxiaolong/kafka that referenced this pull request Jun 14, 2019

KAFKA-5505: Incremental cooperative rebalancing in Connect (KIP-415) (a…
…pache#6363)

Added the incremental cooperative rebalancing in Connect to avoid global rebalances on all connectors and tasks with each new/changed/removed connector. This new protocol is backward compatible and will work with heterogeneous clusters that exist during a rolling upgrade, but once the clusters consist of new workers only some affected connectors and tasks will be rebalanced: connectors and tasks on existing nodes still in the cluster and not added/changed/removed will continue running while the affected connectors and tasks are rebalanced.

This commit attempted to minimize the changes to the existing V0 protocol logic, though that was not entirely possible.

This commit adds extensive unit and integration tests for both the old V0 protocol and the new v1 protocol. Soak testing has been performed multiple times to verify behavior while connectors and added, changed, and removed and while workers are added and removed from the cluster.

Author: Konstantine Karantasis <konstantine@confluent.io>
Reviewers: Randall Hauch <rhauch@gmail.com>, Ewen Cheslack-Postava <me@ewencp.org>, Robert Yokota <rayokota@gmail.com>, David Arthur <mumrah@gmail.com>, Ryanne Dolan <ryannedolan@gmail.com>

jsancio pushed a commit to jsancio/kafka that referenced this pull request Aug 6, 2019

Merge remote-tracking branch 'confluent/master' into sync-upstream-17…
…-may

* confluent/master:
  KAFKA-8265: Initial implementation for ConnectorClientConfigPolicy to enable overrides (KIP-458) (apache#6624)
  KAFKA-5505: Incremental cooperative rebalancing in Connect (KIP-415) (apache#6363)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.