Do not reset snapshot replication when a single request timedout #16971
Retry sending the same chunk, otherwise large snapshots can end up in a retry loop just because a single request timed out.
Force-pushed from 0ffcf25 to 4670203.
If a leader observes a timeout for an InstallRequest, it resends the same chunk. If the follower has already processed the previous request successfully, it should simply accept the retried request instead of rejecting it with an out-of-order error. If the chunk was actually out of order, this is detected during commit because the checksum will not match.
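The follower-side behavior described in this commit can be sketched roughly as follows. This is only an illustration under simplifying assumptions: `ChunkReceiver`, its `Result` enum, and the integer chunk indices are hypothetical names invented for the example, not the types used in `PassiveRole`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical follower-side receiver; chunk indices are plain ints here for simplicity.
final class ChunkReceiver {
  enum Result { APPLIED, ALREADY_APPLIED, OUT_OF_ORDER }

  private final List<String> applied = new ArrayList<>();

  Result receive(final int chunkIndex, final String payload) {
    if (chunkIndex == applied.size()) {
      // The expected next chunk: apply it.
      applied.add(payload);
      return Result.APPLIED;
    }
    if (!applied.isEmpty() && chunkIndex == applied.size() - 1) {
      // A retry of the chunk that was just processed (the leader's request timed out
      // although the follower handled it): acknowledge without applying it twice.
      return Result.ALREADY_APPLIED;
    }
    // Anything else is treated as out of order in this sketch.
    return Result.OUT_OF_ORDER;
  }

  List<String> appliedChunks() {
    return List.copyOf(applied);
  }
}
```

In this sketch a duplicate is recognized by its index. The commit message instead relies on the checksum comparison at commit time to catch chunks that were genuinely out of order.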
When the request is retried, the reader might already be at the end, so it returns before seeking to the chunk that should be re-sent.
The reader will be pointing to the next chunk when we want to retry the first chunk, so we must reset the reader so that the first chunk can be re-read.
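The two reader issues above can be illustrated with a minimal in-memory reader. This is a sketch under stated assumptions: `ListChunkReader` and its `seek`, `reset`, and `reread` methods are hypothetical names for this example and are not the project's reader interface.

```java
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical in-memory chunk reader used only to illustrate the retry problem.
final class ListChunkReader {
  private final List<String> chunks;
  private int position;

  ListChunkReader(final List<String> chunks) {
    this.chunks = List.copyOf(chunks);
  }

  boolean hasNext() {
    return position < chunks.size();
  }

  String next() {
    if (!hasNext()) {
      throw new NoSuchElementException("no more chunks");
    }
    return chunks.get(position++);
  }

  void seek(final int chunkIndex) {
    if (chunkIndex < 0 || chunkIndex >= chunks.size()) {
      throw new IllegalArgumentException("no chunk at index " + chunkIndex);
    }
    position = chunkIndex;
  }

  void reset() {
    position = 0;
  }

  // Re-reads a chunk for a retry. The reader is repositioned BEFORE any hasNext() check,
  // because after the last chunk was read hasNext() is already false.
  String reread(final int chunkIndex) {
    if (chunkIndex == 0) {
      reset(); // retrying the first chunk: start over from the beginning
    } else {
      seek(chunkIndex);
    }
    return next();
  }
}
```

Checking `hasNext()` before repositioning is what would make a retry of the last chunk return early in this model.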
Force-pushed from 040e6cd to 18e6598.
🚀
member.setNextSnapshotChunk(null);
final boolean isTimeout =
    error instanceof TimeoutException
        || (error != null && error.getCause() instanceof TimeoutException);
❓ What was the case where it would be the cause? Do you think it could also be more deeply nested?
`error` is usually a `CompletionException`, so we always have to get the cause. I don't think it can be more deeply nested.
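To illustrate the point in this thread, here is a small standalone sketch of the same check. `TimeoutCheck` and `isTimeout` are hypothetical names for this example. The sketch relies on the standard `CompletableFuture` behavior that a dependent stage sees an upstream failure wrapped in a `CompletionException`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeoutException;

final class TimeoutCheck {
  // Same shape as the check in the diff: look at the error itself and one level of cause.
  static boolean isTimeout(final Throwable error) {
    return error instanceof TimeoutException
        || (error != null && error.getCause() instanceof TimeoutException);
  }

  public static void main(final String[] args) {
    final CompletableFuture<String> response = new CompletableFuture<>();
    response.completeExceptionally(new TimeoutException("request timed out"));

    // The callback is attached to a dependent stage, so the TimeoutException
    // arrives wrapped in a CompletionException rather than directly.
    response
        .thenApply(String::length)
        .whenComplete(
            (result, error) ->
                System.out.println(error.getClass().getSimpleName() + " " + isTimeout(error)));
  }
}
```

A timeout nested more than one level deep would not be detected by this check, which matches the assumption stated in the reply above.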
zeebe/atomix/cluster/src/main/java/io/atomix/raft/roles/PassiveRole.java
zeebe/atomix/cluster/src/main/java/io/atomix/raft/roles/LeaderAppender.java
/backport
Created backport PR for
Please cherry-pick the changes locally and resolve any conflicts.
git fetch origin backport-16971-to-stable/8.2
git worktree add --checkout .worktree/backport-16971-to-stable/8.2 backport-16971-to-stable/8.2
cd .worktree/backport-16971-to-stable/8.2
git reset --hard HEAD^
git cherry-pick -x 281eb95c0f593c2cc4ffecc873f3738b895bd09b d32d65050d5c8a498ec6a60ddfa5b8dc3eb3ce89 be3c7d703e8ac9c84c3bf2ada4efcde1e39ba5da c16f3c8c9f66fdae85ca800fac4b978546d4f566 13e8d923c8892c03a155615eebc573bc96f079e4 fc7828a67a8cbe5b225abc83eb1d234c8502947e 18e6598d84e3944d10538ec6fe2813900ba7fe71 05ed2dce937338e936387fa6345993b78b14498b dd76387d68891fe97266a671cb4a9bc247610b5e
git push --force-with-lease
Created backport PR for
Please cherry-pick the changes locally and resolve any conflicts.
git fetch origin backport-16971-to-stable/8.3
git worktree add --checkout .worktree/backport-16971-to-stable/8.3 backport-16971-to-stable/8.3
cd .worktree/backport-16971-to-stable/8.3
git reset --hard HEAD^
git cherry-pick -x 281eb95c0f593c2cc4ffecc873f3738b895bd09b d32d65050d5c8a498ec6a60ddfa5b8dc3eb3ce89 be3c7d703e8ac9c84c3bf2ada4efcde1e39ba5da c16f3c8c9f66fdae85ca800fac4b978546d4f566 13e8d923c8892c03a155615eebc573bc96f079e4 fc7828a67a8cbe5b225abc83eb1d234c8502947e 18e6598d84e3944d10538ec6fe2813900ba7fe71 05ed2dce937338e936387fa6345993b78b14498b dd76387d68891fe97266a671cb4a9bc247610b5e
git push --force-with-lease
Created backport PR for
Please cherry-pick the changes locally and resolve any conflicts.
git fetch origin backport-16971-to-stable/8.4
git worktree add --checkout .worktree/backport-16971-to-stable/8.4 backport-16971-to-stable/8.4
cd .worktree/backport-16971-to-stable/8.4
git reset --hard HEAD^
git cherry-pick -x 281eb95c0f593c2cc4ffecc873f3738b895bd09b d32d65050d5c8a498ec6a60ddfa5b8dc3eb3ce89 be3c7d703e8ac9c84c3bf2ada4efcde1e39ba5da c16f3c8c9f66fdae85ca800fac4b978546d4f566 13e8d923c8892c03a155615eebc573bc96f079e4 fc7828a67a8cbe5b225abc83eb1d234c8502947e 18e6598d84e3944d10538ec6fe2813900ba7fe71 05ed2dce937338e936387fa6345993b78b14498b dd76387d68891fe97266a671cb4a9bc247610b5e
git push --force-with-lease
… request timedout (#19030) # Description Backport of #16971 to `stable/8.4`. relates to #11496 original author: @deepthidevaki
…lication when a single request timedout (#19038) # Description Backport of #19030 to `stable/8.3`. relates to #16971 #11496 original author: @backport-action
…lication when a single request timedout (#19037) # Description Backport of #19030 to `stable/8.2`. relates to #16971 #11496 original author: @backport-action
Description
Previously, when replicating a snapshot chunk timed out, the leader restarted snapshot replication from the first chunk. This can result in a never-ending retry loop if one of the other chunks hits a timeout frequently, which becomes more likely when network latency is high and/or the snapshot is large. To ensure that snapshot replication can complete in such cases, the leader now resends the same chunk when a request times out instead of resetting replication.
Tests are added where we can inject a timeout and observe the behavior.
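The difference between restarting from the first chunk and retrying the same chunk can be illustrated with a small deterministic simulation. This is a sketch, not the project's test code: `ChunkRetrySimulation`, `replicate`, and the "every fourth request times out" pattern are assumptions made up for the example.

```java
import java.util.function.IntPredicate;

final class ChunkRetrySimulation {
  // Returns the number of requests sent until all chunks are acknowledged,
  // or -1 if replication did not finish within maxRequests.
  static int replicate(
      final int chunkCount,
      final IntPredicate timesOut, // decides whether the n-th request (1-based) times out
      final boolean retrySameChunk,
      final int maxRequests) {
    if (chunkCount <= 0) {
      return 0;
    }
    int nextChunk = 0;
    for (int request = 1; request <= maxRequests; request++) {
      if (timesOut.test(request)) {
        if (!retrySameChunk) {
          nextChunk = 0; // old behavior: restart replication from the first chunk
        }
        continue; // new behavior: nextChunk is unchanged, so the same chunk is resent
      }
      nextChunk++;
      if (nextChunk == chunkCount) {
        return request;
      }
    }
    return -1;
  }

  public static void main(final String[] args) {
    final IntPredicate everyFourth = n -> n % 4 == 0;
    System.out.println(replicate(10, everyFourth, true, 1000));  // 13
    System.out.println(replicate(10, everyFourth, false, 1000)); // -1
  }
}
```

With ten chunks and every fourth request timing out, the restart policy never gets past the third chunk, while retrying the same chunk finishes after three extra requests.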
`RaftRandomizedTest` also covers snapshot replication.
Related issues
closes #11496
Definition of Done
Not all items need to be done depending on the issue and the pull request.
Code changes:
backport stable/1.3) to the PR, in case that fails you need to create backports manually.
Testing:
Documentation:
Other teams:
If the change impacts another team an issue has been created for this team, explaining what they need to do to support this change.
Please refer to our review guidelines.