Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not reset snapshot replication when a single request timedout #16971

Merged
merged 9 commits into from
Mar 20, 2024

Conversation

deepthidevaki
Copy link
Contributor

@deepthidevaki deepthidevaki commented Mar 19, 2024

Description

Previously, when replicating a snapshot chunk timedout, leader restarts snapshot replication from the first chunk. This can result in a never ending retry loop if one of the other chunk hits a timeout frequently. This can happen more frequent when network latency is high and/or snapshot is large. To ensure that snapshot replication can complete in such cases,

  1. Leader do not reset snapshot replication, instead re-send the same chunk if it encounters a timeout
  2. Follower accepts the chunk even if it is a duplicate.

Tests are added where we can inject timeout and observer the behavior. RaftRandomizedTest also covers snapshot replication.

Related issues

closes #11496

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatibility with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criterias of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Other teams:
If the change impacts another team an issue has been created for this team, explaining what they need to do to support this change.

Please refer to our review guidelines.

@github-actions github-actions bot added the component/zeebe Related to the Zeebe component/team label Mar 19, 2024
Retry sending the same chunk, otherwise large snapshots
can end up in a retry loop just because a single request
timed out.
@deepthidevaki deepthidevaki force-pushed the dd-11496-snapshot-replicaiton-timeout branch from 0ffcf25 to 4670203 Compare March 19, 2024 08:10
If a leader observed a time out for an InstallRequest,
it will resend the same chunk. If the follower had
already processed the previous request successfully, it
should simply accept the request instead of rejecting
it with an out of order error. If it was actually out
of order, it will be identified during commit  as the
checksum would not match.
When the request is retried, the reader might be already
at the end. So it will return before seeking to the chunk
that should be re-send.
Reader will be pointing to the next chunk, when we want to retry
the first chunk. So we must reset the reader, so that it can be re-read.
@deepthidevaki deepthidevaki force-pushed the dd-11496-snapshot-replicaiton-timeout branch from 040e6cd to 18e6598 Compare March 19, 2024 11:12
@deepthidevaki deepthidevaki marked this pull request as ready for review March 19, 2024 12:12
Copy link
Member

@npepinpe npepinpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

member.setNextSnapshotChunk(null);
final boolean isTimeout =
error instanceof TimeoutException
|| (error != null && error.getCause() instanceof TimeoutException);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ What were the case where it would be the cause? Do you think it could also be more deeply nested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error is usually CompletionException so we always have to get the cause. I don't think it can be more deeply nested.

@deepthidevaki deepthidevaki added this pull request to the merge queue Mar 20, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 20, 2024
@deepthidevaki deepthidevaki added this pull request to the merge queue Mar 20, 2024
Merged via the queue into main with commit 9c34814 Mar 20, 2024
31 checks passed
@deepthidevaki deepthidevaki deleted the dd-11496-snapshot-replicaiton-timeout branch March 20, 2024 14:53
@deepthidevaki
Copy link
Contributor Author

This was not backported to older versions. But there was an incident today for which this would have helped. I don't see any reason not to backport this, although original issue was tagged as an improvement.

I'm backporting this. /cc @npepinpe

@deepthidevaki deepthidevaki added backport stable/8.4 Backport a pull request to 8.4.x backport stable/8.3 Backport a pull request to 8.3.x backport stable/8.2 Backport a pull request to 8.2.x labels Jun 3, 2024
@deepthidevaki
Copy link
Contributor Author

/backport

@backport-action
Copy link
Collaborator

Created backport PR for stable/8.2:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-16971-to-stable/8.2
git worktree add --checkout .worktree/backport-16971-to-stable/8.2 backport-16971-to-stable/8.2
cd .worktree/backport-16971-to-stable/8.2
git reset --hard HEAD^
git cherry-pick -x 281eb95c0f593c2cc4ffecc873f3738b895bd09b d32d65050d5c8a498ec6a60ddfa5b8dc3eb3ce89 be3c7d703e8ac9c84c3bf2ada4efcde1e39ba5da c16f3c8c9f66fdae85ca800fac4b978546d4f566 13e8d923c8892c03a155615eebc573bc96f079e4 fc7828a67a8cbe5b225abc83eb1d234c8502947e 18e6598d84e3944d10538ec6fe2813900ba7fe71 05ed2dce937338e936387fa6345993b78b14498b dd76387d68891fe97266a671cb4a9bc247610b5e
git push --force-with-lease

@backport-action
Copy link
Collaborator

Created backport PR for stable/8.3:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-16971-to-stable/8.3
git worktree add --checkout .worktree/backport-16971-to-stable/8.3 backport-16971-to-stable/8.3
cd .worktree/backport-16971-to-stable/8.3
git reset --hard HEAD^
git cherry-pick -x 281eb95c0f593c2cc4ffecc873f3738b895bd09b d32d65050d5c8a498ec6a60ddfa5b8dc3eb3ce89 be3c7d703e8ac9c84c3bf2ada4efcde1e39ba5da c16f3c8c9f66fdae85ca800fac4b978546d4f566 13e8d923c8892c03a155615eebc573bc96f079e4 fc7828a67a8cbe5b225abc83eb1d234c8502947e 18e6598d84e3944d10538ec6fe2813900ba7fe71 05ed2dce937338e936387fa6345993b78b14498b dd76387d68891fe97266a671cb4a9bc247610b5e
git push --force-with-lease

@backport-action
Copy link
Collaborator

Created backport PR for stable/8.4:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-16971-to-stable/8.4
git worktree add --checkout .worktree/backport-16971-to-stable/8.4 backport-16971-to-stable/8.4
cd .worktree/backport-16971-to-stable/8.4
git reset --hard HEAD^
git cherry-pick -x 281eb95c0f593c2cc4ffecc873f3738b895bd09b d32d65050d5c8a498ec6a60ddfa5b8dc3eb3ce89 be3c7d703e8ac9c84c3bf2ada4efcde1e39ba5da c16f3c8c9f66fdae85ca800fac4b978546d4f566 13e8d923c8892c03a155615eebc573bc96f079e4 fc7828a67a8cbe5b225abc83eb1d234c8502947e 18e6598d84e3944d10538ec6fe2813900ba7fe71 05ed2dce937338e936387fa6345993b78b14498b dd76387d68891fe97266a671cb4a9bc247610b5e
git push --force-with-lease

github-merge-queue bot pushed a commit that referenced this pull request Jun 3, 2024
… request timedout (#19030)

# Description
Backport of #16971 to `stable/8.4`.

relates to #11496
original author: @deepthidevaki
github-merge-queue bot pushed a commit that referenced this pull request Jun 4, 2024
…lication when a single request timedout (#19038)

# Description
Backport of #19030 to `stable/8.3`.

relates to #16971 #11496
original author: @backport-action
@github-actions github-actions bot added version:8.3.12 Label that represents issues released on version 8.3.12 version:8.4.8 Label that represents issues released on version 8.4.8 labels Jun 4, 2024
github-merge-queue bot pushed a commit that referenced this pull request Jun 12, 2024
…lication when a single request timedout (#19037)

# Description
Backport of #19030 to `stable/8.2`.

relates to #16971 #11496
original author: @backport-action
@github-actions github-actions bot added the version:8.2.29 Marks an issue as being completely or in parts released in 8.2.29 label Jun 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport stable/8.2 Backport a pull request to 8.2.x backport stable/8.3 Backport a pull request to 8.3.x backport stable/8.4 Backport a pull request to 8.4.x component/zeebe Related to the Zeebe component/team version:8.2.29 Marks an issue as being completely or in parts released in 8.2.29 version:8.3.12 Label that represents issues released on version 8.3.12 version:8.4.8 Label that represents issues released on version 8.4.8
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve raft snapshot replication failure handling
3 participants