
release-2.1: storage: Fix GetNeededReplicas to count non-live, non-dead nodes #33047

Merged

merged 2 commits into cockroachdb:release-2.1 from a-robinson:backport-32949 on Dec 12, 2018

@a-robinson (Collaborator) commented Dec 12, 2018

Manual backport of #32949 on top of a hand-cherry-picked minimal version of #30559.

@tbg @cockroachdb/release

storage: refactor GetNeededReplicas to use a notion of available nodes
This is the commit from #30559 but without all the changes that aren't
necessary to enable backporting #32949.

Release note: None
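For context, the computation the backport enables can be sketched as follows. This is a simplified, self-contained rendition of the idea behind GetNeededReplicas (clamp the configured replication factor to the number of available nodes, avoiding even replication factors and never dropping below three); the function name is real, but the body and corner cases here are illustrative, not the exact release-2.1 code.

```go
package main

import "fmt"

// getNeededReplicas is a simplified sketch of GetNeededReplicas after this
// change: the configured replication factor is clamped to the number of
// "available" nodes (live plus non-live-but-not-yet-dead), and even
// replication factors are avoided when clamping kicks in. The exact corner
// cases in the real code may differ.
func getNeededReplicas(zoneReplicas, availableNodes int) int {
	need := zoneReplicas
	if availableNodes < need {
		need = availableNodes
	}
	if need == zoneReplicas {
		return need // no clamping was necessary
	}
	if need%2 == 0 {
		need-- // avoid even replication factors; they tolerate no extra failure
	}
	if need < 3 {
		need = 3 // never down-replicate below three replicas
	}
	if need > zoneReplicas {
		need = zoneReplicas
	}
	return need
}

func main() {
	fmt.Println(getNeededReplicas(5, 5)) // 5: enough nodes for the configured factor
	fmt.Println(getNeededReplicas(5, 4)) // 3: clamp to 4, then avoid the even factor
	fmt.Println(getNeededReplicas(3, 2)) // 3: don't down-replicate below three
}
```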

@a-robinson a-robinson requested review from cockroachdb/admin-ui-prs as code owners Dec 12, 2018

@CLAassistant commented Dec 12, 2018

CLA assistant check: all committers have signed the CLA.


@a-robinson a-robinson changed the base branch from master to release-2.1 Dec 12, 2018

@tbg tbg removed request for cockroachdb/rfc-prs Dec 12, 2018

@tbg tbg removed request for cockroachdb/sql-async-prs Dec 12, 2018

@tbg (Member) commented Dec 12, 2018

Ok, I reproduced this in a roachtest. Now I just have to confirm that it's fixed with this patch applied. (And polish up the roachtest).

storage: Fix GetNeededReplicas to count non-live, non-dead nodes
Release note (bug fix): Fixes an issue where ranges with high replication
factors would be incorrectly down-replicated if enough nodes stopped
running that the number of desired replicas exceeded the number of
running nodes. This could cause under-replication and potentially
unavailability.
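The distinction the fix relies on can be illustrated with a small sketch. The status names below are hypothetical stand-ins for the store-pool's liveness states, not the actual identifiers:

```go
package main

import "fmt"

// nodeStatus is a simplified stand-in for the store-pool's liveness states.
type nodeStatus int

const (
	statusLive            nodeStatus = iota
	statusNonLive                    // not heartbeating, but not yet past the dead threshold
	statusDead
	statusDecommissioning
)

// availableNodes counts the nodes that should factor into the desired
// replication factor. The point of the fix is to also count non-live nodes
// that are not yet dead: a briefly unreachable node should not shrink the
// target replication factor and trigger down-replication.
func availableNodes(statuses []nodeStatus) int {
	n := 0
	for _, s := range statuses {
		if s == statusLive || s == statusNonLive {
			n++
		}
	}
	return n
}

func main() {
	// A seven-node cluster where three nodes briefly stop heartbeating: all
	// seven still count as available, so a range with a replication factor of
	// seven keeps its target instead of being down-replicated to four.
	cluster := []nodeStatus{
		statusLive, statusLive, statusLive, statusLive,
		statusNonLive, statusNonLive, statusNonLive,
	}
	fmt.Println(availableNodes(cluster)) // 7
}
```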

@tbg tbg force-pushed the a-robinson:backport-32949 branch from ac87b29 to 4aa0b52 Dec 12, 2018

@tbg (Member) approved these changes Dec 12, 2018 and left a comment

Reviewed 10 of 10 files at r1, 4 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@tbg (Member) commented Dec 12, 2018

For anyone else wondering, the bug looks "new" in the diff of #30559 but was actually already present in the old code. The reason it doesn't show up in the diff is that the old code wasn't changed in place; instead, the new method GetAvailableNodes is basically a copy of getStoreList:

```go
// getStoreListFromIDsRLocked is the same function as getStoreList but requires
// that the detailsMu read lock is held.
func (sp *StorePool) getStoreListFromIDsRLocked(
	storeIDs roachpb.StoreIDSlice, rangeID roachpb.RangeID, filter storeFilter,
) (StoreList, int, int) {
	if sp.deterministic {
		sort.Sort(storeIDs)
	} else {
		shuffle.Shuffle(storeIDs)
	}

	var aliveStoreCount int
	var throttledStoreCount int
	var storeDescriptors []roachpb.StoreDescriptor

	now := sp.clock.PhysicalTime()
	timeUntilStoreDead := TimeUntilStoreDead.Get(&sp.st.SV)

	for _, storeID := range storeIDs {
		detail := sp.detailsMu.storeDetails[storeID]
		switch s := detail.status(now, timeUntilStoreDead, rangeID, sp.nodeLivenessFn); s {
		case storeStatusThrottled:
			aliveStoreCount++
			throttledStoreCount++
			if filter != storeFilterThrottled {
				storeDescriptors = append(storeDescriptors, *detail.desc)
			}
		case storeStatusReplicaCorrupted:
			aliveStoreCount++
		case storeStatusAvailable:
			aliveStoreCount++
			storeDescriptors = append(storeDescriptors, *detail.desc)
		case storeStatusDead, storeStatusUnknown, storeStatusDecommissioning:
			// Do nothing; this node cannot be used.
		default:
			panic(fmt.Sprintf("unknown store status: %d", s))
		}
	}
	return makeStoreList(storeDescriptors), aliveStoreCount, throttledStoreCount
}
```

That method remains untouched by all of these fixes. I think this is OK because its callers want a certified live store to transfer ranges or leases to, so we don't actually want to return nodes that are non-live. Still, the method description doesn't properly explain the semantics.

@tbg tbg merged commit 67e4d98 into cockroachdb:release-2.1 Dec 12, 2018

2 checks passed:
- GitHub CI (Cockroach): TeamCity build finished
- license/cla: Contributor License Agreement is signed
@a-robinson (Collaborator) commented Dec 12, 2018

> That method remains untouched by all of these fixes. I think this is OK because its callers want a certified live store to transfer ranges or leases to, so we don't actually want to return nodes that are non-live.

Yes, that's correct. Most of the rebalancing code really does only want nodes that are currently live. It's a subtlety that we aren't handling very well, though, as we discussed last night.
