Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reconfig support for CH Keeper #49450

Merged
merged 15 commits into from
Jul 21, 2023

Conversation

myrrc
Copy link
Contributor

@myrrc myrrc commented May 3, 2023

Resolves: #45886
Depends on: eBay/NuRaft#432, ClickHouse/NuRaft#63
Related: #50251

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Support ZooKeeper reconfig command for CH Keeper with incremental reconfiguration which can be enabled via keeper_server.enable_reconfiguration setting.
Support adding servers, removing servers, and changing server priorities.

Documentation entry for user-facing changes

Docs PR: ClickHouse/clickhouse-docs#1164

  • Documentation is written (mandatory for new features)

@myrrc myrrc force-pushed the feature/keeper-dyn-reconf branch from a5991e3 to c873025 Compare May 3, 2023 12:49
@myrrc myrrc changed the title "reconfig" for Keeper, ZooKeeper and TestKeeper "reconfig" support for CH Keeper May 5, 2023
@myrrc myrrc force-pushed the feature/keeper-dyn-reconf branch 2 times, most recently from 46b1c5c to b0c4626 Compare May 15, 2023 12:20
@robot-ch-test-poll robot-ch-test-poll added the pr-feature Pull request with new product feature label May 15, 2023
@robot-ch-test-poll
Copy link
Contributor

robot-ch-test-poll commented May 15, 2023

This is an automated comment for commit 59ad2d9 with description of existing statuses. It's updated for the latest CI running
The full report is available here
The overall status of the commit is 🟡 pending

Check nameDescriptionStatus
CI runningA meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR🟡 pending
ClickHouse build checkBuilds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process🟢 success
Compatibility checkChecks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help🟢 success
Docs CheckBuilds and tests the documentation🟡 pending
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here🟢 success
Mergeable CheckChecks if all other necessary checks are successful🟢 success
Performance ComparisonMeasure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests🟢 success
Push to DockerhubThe check for building and pushing the CI related docker images to docker hub🟢 success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc🟢 success
Style CheckRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report🟢 success

@clickhouse-ci
Copy link

clickhouse-ci bot commented May 15, 2023

This is an automatic comment. The PR descriptions does not match the template.

Please, edit it accordingly.

The error is: Category 'Backward incompatible' is not valid

@myrrc myrrc force-pushed the feature/keeper-dyn-reconf branch 2 times, most recently from 491ee5b to 2574487 Compare May 16, 2023 12:26
@robot-ch-test-poll4 robot-ch-test-poll4 added pr-backward-incompatible Pull request with backwards incompatible changes and removed pr-feature Pull request with new product feature labels May 16, 2023
@myrrc myrrc force-pushed the feature/keeper-dyn-reconf branch 2 times, most recently from 122cbbf to 486cb36 Compare May 22, 2023 16:46
Copy link
Member

@alesapin alesapin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antonio2368 , please also take a look when it will be finished.

base/base/move_extend.h Outdated Show resolved Hide resolved
src/Common/ZooKeeper/TestKeeper.cpp Outdated Show resolved Hide resolved
src/Common/ZooKeeper/TestKeeper.cpp Outdated Show resolved Hide resolved
src/Coordination/KeeperDispatcher.cpp Outdated Show resolved Hide resolved
src/Coordination/KeeperDispatcher.cpp Outdated Show resolved Hide resolved
src/Coordination/KeeperServer.cpp Outdated Show resolved Hide resolved
src/Coordination/KeeperServer.cpp Show resolved Hide resolved
src/Coordination/KeeperDispatcher.h Outdated Show resolved Hide resolved
base/base/find_symbols.h Outdated Show resolved Hide resolved
src/Common/ConcurrentBoundedQueue.h Outdated Show resolved Hide resolved
@antonio2368
Copy link
Member

We have a conflict with #50091 which will add an HTTP endpoint for cluster management.

I'm okay with reconfig compatibility from ZK clients but I would make the code much simpler by pushing the reconfig requests directly to update configuration queue and not let it go through RAFT - same what we do in #50091

The problem is that NuRaft already handles internally the cluster configuration and it would be hell to keep those things in sync with the way you implemented it.

I would put this PR on hold until #50091 is merged and then we can keep only part for adding ReconfigRequest as an alternative (more limited) version for HTTP endpoint.

@myrrc myrrc force-pushed the feature/keeper-dyn-reconf branch 2 times, most recently from 4ef9f28 to 7405a57 Compare May 24, 2023 17:36
@robot-ch-test-poll robot-ch-test-poll added pr-feature Pull request with new product feature and removed pr-backward-incompatible Pull request with backwards incompatible changes labels May 24, 2023
@myrrc myrrc force-pushed the feature/keeper-dyn-reconf branch 4 times, most recently from 284e17c to 1f65efe Compare May 30, 2023 20:03
@myrrc myrrc requested a review from alesapin May 31, 2023 16:20
@myrrc myrrc marked this pull request as ready for review May 31, 2023 16:20
@myrrc
Copy link
Contributor Author

myrrc commented May 31, 2023

Things that need to be addressed:

  • Backoff interval while trying to apply configuration update in dispatcher thread. It's currently set to 50ms which may be too slow (I've encountered situations where replica fails to push configuration update in seconds which would be a lot of unnecessary network traffic; this situation is usually encountered while removing a leader from large clusters).
  • Setting /keeper/config node on server start when there are no reconfigurations. In Zookeeper this node is populated on first leader elections. However, at the point of leader elections we don't get configuration like in Zookeeper, so I update /keeper/config from state manager's config as a workaround:
    • Nuraft allows us to transfer a payload with every message though sending configuration with every message is obviously wrong
    • Updating from state manager is correct if any reconfiguration has been issued, because state manager would have this information (transferred from leader) at the point we query it on initialization
    • If there was no reconfiguration, /keeper/config may possibly get out of sync if e.g. a new node has incorrect configuration. However, incorrect configuration should be handled on other levels so I believe it's not of an issue as server is broken anyway.

Currently test_reconfig_remove_add is flaky. In this test we remove a leader from a 3-node cluster and simultaneously add another replica. Maybe I'd have to prohibit having joining and leaving filled if I don't find a way to fix it

@myrrc myrrc force-pushed the feature/keeper-dyn-reconf branch from f69580f to 9b232ea Compare May 31, 2023 16:37
@myrrc myrrc requested a review from antonio2368 July 5, 2023 18:24
@myrrc myrrc force-pushed the feature/keeper-dyn-reconf branch from 811a48f to 4bb3e74 Compare July 5, 2023 22:47
src/Coordination/KeeperDispatcher.cpp Outdated Show resolved Hide resolved
src/Coordination/KeeperDispatcher.cpp Show resolved Hide resolved
src/Coordination/KeeperStateMachine.cpp Outdated Show resolved Hide resolved
src/Coordination/KeeperStateMachine.cpp Outdated Show resolved Hide resolved
src/Coordination/RaftServerConfig.cpp Outdated Show resolved Hide resolved
@myrrc myrrc requested a review from antonio2368 July 6, 2023 17:16
@myrrc myrrc force-pushed the feature/keeper-dyn-reconf branch from fd59c29 to 5302b47 Compare July 6, 2023 21:20
src/Coordination/KeeperDispatcher.h Outdated Show resolved Hide resolved
src/Coordination/KeeperReconfiguration.cpp Outdated Show resolved Hide resolved
src/Coordination/KeeperReconfiguration.cpp Outdated Show resolved Hide resolved
src/Coordination/KeeperServer.cpp Outdated Show resolved Hide resolved
src/Coordination/RaftServerConfig.cpp Outdated Show resolved Hide resolved
src/Coordination/RaftServerConfig.cpp Outdated Show resolved Hide resolved
@antonio2368 antonio2368 changed the title "reconfig" support for CH Keeper reconfig support for CH Keeper Jul 21, 2023
@antonio2368 antonio2368 merged commit 6ed97a9 into ClickHouse:master Jul 21, 2023
44 of 157 checks passed
@myrrc myrrc deleted the feature/keeper-dyn-reconf branch July 21, 2023 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-feature Pull request with new product feature submodule changed At least one submodule changed in this PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Keeper Dynamic Reconfiguration
6 participants