backport-2.0: kv: try next replica on RangeNotFoundError #31350

Merged
tschottdorf merged 1 commit into cockroachdb:release-2.0 from tschottdorf:backport2.1-31013 Oct 15, 2018

tschottdorf commented Oct 15, 2018

Previously, if a Batch RPC came back with a RangeNotFoundError,
we would immediately stop trying to send to more replicas, evict the
range descriptor, and start a new attempt after a back-off.

This new attempt could end up using the same replica, since DistSender
doesn't aggressively shuffle the replicas. If the RangeNotFoundError
persisted for some amount of time, the retries against that replica
would keep failing for just as long.

It turns out that there are such situations, and the
election-after-restart roachtest spuriously hit one of them:

  1. new replica receives a preemptive snapshot and the ConfChange
  2. cluster restarts
  3. the new replica now stays in this state (answering requests with
    RangeNotFoundError) until the range wakes up, which may not happen
    for some time.
  4. the first request to the range runs into the above problem
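
To make the scenario above concrete, here is a toy model of it. It is only a sketch under stated assumptions: the replica order stays the same across attempts, the newly added replica (called `n4` here) comes first, and all names (`attemptOld`, `attemptNew`, `failedAttempts`) are hypothetical rather than taken from the code base.

```go
package main

import (
	"errors"
	"fmt"
)

// errRangeNotFound is a hypothetical stand-in for RangeNotFoundError.
var errRangeNotFound = errors.New("range not found")

// attemptOld models the previous behavior: only the first replica in the
// order is contacted, and its errRangeNotFound ends the attempt.
func attemptOld(order []string, stuck map[string]bool) error {
	if len(order) == 0 || stuck[order[0]] {
		return errRangeNotFound
	}
	return nil
}

// attemptNew models the new behavior: replicas that report errRangeNotFound
// are skipped, and the attempt fails only if all of them do.
func attemptNew(order []string, stuck map[string]bool) error {
	for _, name := range order {
		if !stuck[name] {
			return nil
		}
	}
	return errRangeNotFound
}

// failedAttempts runs up to maxAttempts attempts against an unchanged replica
// order and returns how many of them failed before the first success.
func failedAttempts(
	attempt func([]string, map[string]bool) error,
	order []string, stuck map[string]bool, maxAttempts int,
) int {
	failed := 0
	for i := 0; i < maxAttempts; i++ {
		if attempt(order, stuck) == nil {
			break
		}
		failed++
	}
	return failed
}

func main() {
	order := []string{"n4", "n1", "n2"}  // n4 is the newly added replica
	stuck := map[string]bool{"n4": true} // n4 has not been woken up yet
	fmt.Println("old:", failedAttempts(attemptOld, order, stuck, 5))
	fmt.Println("new:", failedAttempts(attemptNew, order, stuck, 5))
}
```

With the old logic every attempt in this model fails for as long as `n4` stays in that state, while the new logic succeeds on the first attempt by falling through to `n1`.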

Fixes #30613.

Release note (bug fix): Avoid repeatedly trying a replica that was found
to be in the process of being added.

@nvanbenschoten: I think there is an issue to be filed about the
tendency of DistSender to get stuck in unfortunate configurations.


@tschottdorf tschottdorf requested a review from cockroachdb/core-prs as a code owner Oct 15, 2018

cockroach-teamcity commented Oct 15, 2018

This change is Reviewable

@tschottdorf tschottdorf changed the title from kv: try next replica on RangeNotFoundError to backport-2.0: kv: try next replica on RangeNotFoundError Oct 15, 2018

tschottdorf commented Oct 15, 2018

PS there was lots of manual labor involved in this cherry-pick. Please review thoroughly.

@tschottdorf tschottdorf requested a review from bdarnell Oct 15, 2018

@bdarnell

Reviewed 11 of 11 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@tschottdorf tschottdorf merged commit 366deb2 into cockroachdb:release-2.0 Oct 15, 2018

2 checks passed

GitHub CI (Cockroach) TeamCity build finished
license/cla Contributor License Agreement is signed.

@tschottdorf tschottdorf deleted the tschottdorf:backport2.1-31013 branch Oct 15, 2018
