Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvclient: sort by region before sorting by latency #112993

Merged
merged 1 commit into from Nov 16, 2023

Conversation

andrewbaptist
Copy link
Collaborator

Follower reads attempt to be efficient by sorting to the nearest follower first. The nearest follower can be either the closest by latency or by locality. There is a complex relationship between dollar cost, latency and throughput for different nodes, and just sorting by latency is not ideal in many cases. This PR changes the behavior to first sort by the locality definition and then the latency. In most cases this won't make a difference to the user, but in some cases this behavior can be better.

Epic: none

Release note (performance improvement): Sorting by locality can improve the performance of follower reads.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andrewbaptist
Copy link
Collaborator Author

I'm not 100% sure we want to do this, but wanted to get your thoughts on it. I have a fix for the follower reads shutdown during drain which is independent of this other than the test uses localities to "pin" the follower reads. I can change it to use a latency signal instead if we decide not to merge this. Another option would be to have a config flag so that we can keep the existing behavior. Thoughts?

@andrewbaptist andrewbaptist marked this pull request as ready for review October 24, 2023 20:08
@andrewbaptist andrewbaptist requested a review from a team as a code owner October 24, 2023 20:08
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change makes sense to me, but I don’t know why it was written the way that it was and whether there are important cases where latency ordering does not match the locality hierarchy. We should dig through the git diff history here and try to determine whether the current ordering is deliberate or whether the rationale was closer to "use latency if available because why not, otherwise fall back to locality."

If we do make this change, one approach to roll it out (in v24.1, right?) would be to add a hidden cluster setting to fall back to the old behavior. That gives us an escape hatch in case we break something in a customer's cluster. If we don't need the cluster setting for a release or two after that, we can remove it.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)


pkg/kv/kvclient/kvcoord/replica_slice_test.go line 238 at r1 (raw file):

		},
		{
			name:     "order by locality than latency",

s/than/then/

@andrewbaptist andrewbaptist requested review from a team as code owners October 25, 2023 14:25
@andrewbaptist andrewbaptist requested review from mgartner and removed request for a team October 25, 2023 14:25
@yuzefovich yuzefovich removed the request for review from mgartner October 25, 2023 16:27
Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)


-- commits line 8 at r2:

just sorting by latency is not ideal in many cases.

but in some cases this behavior can be better.

Could you describe the cases you had in mind? Naively, I would expect localities matching to result in lower latency. If that isn't the case (for some reason), we'd want to pick the lower latency node.


pkg/kv/kvclient/kvcoord/replica_slice.go line 221 at r2 (raw file):

		if sortByLocalityFirst.Get(&st.SV) {
			// Longer locality matches sort first. The assumption is that they are
			// closer and will be better choices If the locality match is the same,

nit: missing period before "If"

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 8 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)


pkg/kv/kvclient/kvcoord/replica_slice.go line 219 at r2 (raw file):

		// If this setting is false, ignore locality to match pre 24.1 behavior.
		if sortByLocalityFirst.Get(&st.SV) {

What if !sortByLocalityFirst.Get(&st.SV)? Are we no longer sorting by locality after latency (or if latencyFn is nil)? Are you assuming that this setting will only be enabled in prod, where latencyFn is never nil?

Copy link
Collaborator Author

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original PR that established the order was here: d72a431 And I can see the argument both ways for this. I added the cluster setting to minimize the risk of this setting, and it is only for 24.1.

There are two related changes in the PR, one which switches from Attributes to Tiers and the other that adds the latencyFn. I can't tell if this has caused us problems, but it was strange from a testing perspective to see how it was done since in practice we will currently never look at the Locality in production systems.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


-- commits line 8 at r2:

Previously, arulajmani (Arul Ajmani) wrote…

just sorting by latency is not ideal in many cases.

but in some cases this behavior can be better.

Could you describe the cases you had in mind? Naively, I would expect localities matching to result in lower latency. If that isn't the case (for some reason), we'd want to pick the lower latency node.

I cleaned up this text a little. A majority of the time the two will sort the same way. There are a few cases where they might be different:

  1. Localities are not set up correctly (the hierarchy doesn't reflect reality).
  2. The network is bad between different AZs within a region.
  3. One of the nodes is temporarily (or permanently) slow within a region.
  4. The variance in network ping time is higher than the processing time for the messages.

It's not clear to me in many of these cases which is better, but from a customer perspective, it might be more clear to know that the follower reads are choosing locality first as they might expect.


pkg/kv/kvclient/kvcoord/replica_slice.go line 219 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What if !sortByLocalityFirst.Get(&st.SV)? Are we no longer sorting by locality after latency (or if latencyFn is nil)? Are you assuming that this setting will only be enabled in prod, where latencyFn is never nil?

Right - I had assumed that we would only use this in prod. I suppose we could sort by locality after latency if this setting is set to handle the nil latencyFn but that seemed unnecessary since it wasn't showing up in any test failures.


pkg/kv/kvclient/kvcoord/replica_slice.go line 221 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: missing period before "If"

Fixed.


pkg/kv/kvclient/kvcoord/replica_slice_test.go line 238 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/than/then/

Done

@andrewbaptist andrewbaptist force-pushed the 2023-10-24-sort-by-region branch 4 times, most recently from a3083fc to a2ba551 Compare November 14, 2023 16:15
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 8 of 8 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist and @arulajmani)


pkg/kv/kvclient/kvcoord/replica_slice.go line 241 at r4 (raw file):

		// If the region is different choose the closer one.
		// If the setting is true(default) consider locality before latency.
		if sortByLocalityFirst.Get(&st.SV) {

sortByLocalityFirst.Get is an atomic load. We should pull it into a local variable outside of the sort.Slice loop. In fact, we should probably do the same for FollowerReadsUnhealthy as well.


pkg/kv/kvclient/kvcoord/replica_slice_test.go line 196 at r4 (raw file):

		// ordering between replicas on the same node is not deterministic).
		expOrdered      []roachpb.NodeID
		disableLocality bool

Is disableLocality the right name for this option? How about sortByLocalityFirst or (to avoid shadowing) sortByLocalityFirstSetting?

Follower reads attempt to be efficient by sorting to the nearest
follower first. The nearest follower can be either the closest by
latency or by locality. There is a complex relationship between dollar
cost, latency and throughput for different nodes, and just sorting by
latency may not be ideal in many cases. This PR changes the behavior to
first sort by the locality definition and then the latency.

Specifically where this can matter is when the network latency between
nodes is close to or below the processing time for a PingRequest
including goroutine scheduling, CPU and OS queueing on both sides.

Epic: none

Release note (performance improvement): Sorting by locality can improve
the performance and predictability of follower reads.
Copy link
Collaborator Author

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR and the suggestions

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/replica_slice.go line 241 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

sortByLocalityFirst.Get is an atomic load. We should pull it into a local variable outside of the sort.Slice loop. In fact, we should probably do the same for FollowerReadsUnhealthy as well.

Done - I pulled these both out as bool and only read once. This also could help some weird race where the setting changes "mid-loop".


pkg/kv/kvclient/kvcoord/replica_slice_test.go line 196 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is disableLocality the right name for this option? How about sortByLocalityFirst or (to avoid shadowing) sortByLocalityFirstSetting?

It is the opposite of this setting. I renamed it to dontSortByLocalityFirst. Since I would like to remove it eventually its easier to not have it in all the places.

@andrewbaptist
Copy link
Collaborator Author

bors r=nvanbenschoten

@craig
Copy link
Contributor

craig bot commented Nov 16, 2023

Build succeeded:

@craig craig bot merged commit 23d8585 into cockroachdb:master Nov 16, 2023
8 checks passed
@andrewbaptist andrewbaptist deleted the 2023-10-24-sort-by-region branch December 15, 2023 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants