Fix hanging DDL queries on Replicated database #29328
Merged: tavplubix merged 1 commit into ClickHouse:master from aiven:kmichel-recover-replica-race on Oct 27, 2021
There was a race condition when issuing a DDL query on a replica just after a new replica was added.

If the DDL query is issued after the new replica adds itself to the list of replicas, but before the new replica has finished its recovery, then the first replica includes the new replica in the set of replicas it waits on to confirm that the query was replicated.

Meanwhile, the new replica is still in recovery and applies queries from the `/metadata` snapshot. When it is done, it bumps its `log_ptr` without marking the corresponding log entries (if any) as finished. The first replica then waits until `distributed_ddl_task_timeout` expires and wrongly assumes the query was not replicated.

The issue is fixed by remembering the `max_log_ptr` at the exact point where the replica adds itself to the list of replicas, then marking as finished all queries that happened between that `max_log_ptr` and the `max_log_ptr` of the metadata snapshot used in recovery.

The bug was randomly observed during a downstream test. It can be reproduced more easily by inserting a sleep of a few seconds at the end of `createReplicaNodesInZooKeeper`, long enough to issue a DDL query on the first replica.
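The race and the fix described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the ClickHouse implementation: `ToyKeeper`, `register_replica`, `enqueue_ddl`, `recover` and `simulate` are hypothetical names invented for the example, the coordination state is a plain in-memory object, and the range of entries marked as finished is assumed to run from just after the pointer remembered at registration up to and including the snapshot's pointer.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKeeper:
    """Hypothetical in-memory stand-in for the shared coordination state."""
    max_log_ptr: int = 0
    log_ptr: dict = field(default_factory=dict)   # replica name -> its log_ptr
    finished: dict = field(default_factory=dict)  # log entry -> replicas that confirmed it

    def register_replica(self, name: str) -> int:
        """Add the replica to the list; return max_log_ptr as seen at that moment."""
        self.log_ptr[name] = 0
        return self.max_log_ptr

    def enqueue_ddl(self, initiator: str):
        """Append a log entry; the initiator waits on every replica listed right now."""
        self.max_log_ptr += 1
        entry = self.max_log_ptr
        self.finished[entry] = {initiator}
        self.log_ptr[initiator] = entry
        return entry, set(self.log_ptr)


def recover(keeper: ToyKeeper, name: str, ptr_at_registration: int, apply_fix: bool) -> None:
    """Recover from the metadata snapshot, then bump the replica's log_ptr."""
    snapshot_ptr = keeper.max_log_ptr  # the snapshot already reflects these entries
    if apply_fix:
        # Assumed bounds: entries created after registration, up to the snapshot.
        for entry in range(ptr_at_registration + 1, snapshot_ptr + 1):
            keeper.finished[entry].add(name)
    keeper.log_ptr[name] = snapshot_ptr


def simulate(apply_fix: bool) -> set:
    """Return the replicas the initiator is still waiting on after recovery."""
    keeper = ToyKeeper()
    keeper.register_replica("r1")
    ptr_at_registration = keeper.register_replica("r2")    # r2 is listed, not yet recovered
    entry, waited_on = keeper.enqueue_ddl(initiator="r1")  # DDL issued in the race window
    recover(keeper, "r2", ptr_at_registration, apply_fix)
    return waited_on - keeper.finished[entry]
```

In this model `simulate(False)` returns `{"r2"}`, which corresponds to the first replica waiting until the timeout expires, while `simulate(True)` returns an empty set because the recovering replica confirms the entry it skipped.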
robot-clickhouse added the pr-bugfix label (Pull request with bugfix, not backported by default) on Sep 24, 2021
tavplubix approved these changes on Oct 6, 2021
robot-clickhouse pushed three commits that referenced this pull request on Oct 27, 2021
tavplubix added a commit that referenced this pull request on Oct 28, 2021: Backport #29328 to 21.8: Fix hanging DDL queries on Replicated database
tavplubix added a commit that referenced this pull request on Oct 28, 2021: Backport #29328 to 21.10: Fix hanging DDL queries on Replicated database
tavplubix added a commit that referenced this pull request on Oct 28, 2021: Backport #29328 to 21.9: Fix hanging DDL queries on Replicated database
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix hanging DDL queries on Replicated database while adding a new replica