Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage/reports: change node liveness considerations #43825

Merged
merged 4 commits into from Jan 14, 2020

Conversation

@andreimatei
Copy link
Member

andreimatei commented Jan 8, 2020

Before this patch, whether a node was "alive" or "dead" did not matter
for the under-replicated ranges count in system.replication_stats. This patch
makes replicas on dead stores be ignored when counting replicas for
purposes of the underreplicated counter.
Liveness used to matter for the unavailable count and for the
system.critical_localities report (and continues to matter).

Note that this change interestingly means that a range can be considered
both under-replicated and over-replicated at the same time - if there's
too many replicas, but sufficiently many of them are dead.

This patch also changes the liveness criteria across the board for the
reports that cared about liveness (unavailable ranges, under-replicated
ranges and critical localities). It used to be that a node was
considered dead if its liveness record had not been pinged for
server.time_until_store_dead (5 min by default).
Now, a node is considered dead as soon as its liveness record expires
(seconds). So all the reports become much more sensitive to node
unresponsiveness.

Release note (sql change): Ranges are now considered under-replicated by
the system.replication_stats report when one of the replicas is
unresponsive (or the respective node is not running).
Release note (sql change): The system.critical_localities and
system.replication_stats (fields unavailable_ranges and
under_replicated_ranges) are now quicker to reflect dead or unresponsive
nodes in their accounting. A node used to be considered dead if it was
unresponsive for server.time_until_store_dead (5 min by default), now it
is considered dead if it's been unresponsive for a few seconds.

@andreimatei andreimatei requested review from ajwerner, darinpp and andy-kimball Jan 8, 2020
@cockroach-teamcity

This comment has been minimized.

Copy link
Member

cockroach-teamcity commented Jan 8, 2020

This change is Reviewable

Copy link
Member Author

andreimatei left a comment

hold off the review, I want to do something else

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @darinpp)

@andreimatei andreimatei force-pushed the andreimatei:reports.underreplicated-on-dead branch 2 times, most recently from 2974b45 to 6ab7b7d Jan 10, 2020
Copy link
Member Author

andreimatei left a comment

good to go

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @darinpp)

Copy link
Collaborator

ajwerner left a comment

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 2 of 2 files at r3, 4 of 4 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @darinpp)


pkg/storage/reports/replication_stats_report_test.go, line 292 at r4 (raw file):

					// Under-replicated.
					{key: "/Table/t1/pk/102", stores: []int{1, 2}},
					// Under-replicated because 3 is dead.

Under-replicated because 4 is dead.?

andreimatei added 4 commits Jan 8, 2020
Localize liveness state.

Release note: None
Before this patch, whether a node was "alive" or "dead" did not matter
for the under-replicated ranges count in system.replication_stats. This patch
makes replicas on dead stores be ignored when counting replicas for
purposes of the underreplicated counter.
Liveness used to matter for the unavailable count and for the
system.critical_localities report (and continues to matter).

Note that this change interestingly means that a range can be considered
both under-replicated and over-replicated at the same time - if there's
too many replicas, but sufficiently many of them are dead.

This patch also changes the liveness criteria across the board for the
reports that cared about liveness (unavailable ranges, under-replicated
ranges and critical localities). The code was buggy in that a
decomissioning node was considered live for 5 minutes after it stopped
heartbeating its liveness record, whereas a non-decomissioning one was
only considered live for a few seconds. The patch fixes it by using the
same logic to make liveness determinations as the underreplicated metric
does - you're dead as soon as the liveness record expires regardless of
decomissioning status.

Release note (sql change): Ranges are now considered under-replicated by
the system.replication_stats report when one of the replicas is
unresponsive (or the respective node is not running).
@andreimatei andreimatei force-pushed the andreimatei:reports.underreplicated-on-dead branch from 6ab7b7d to bd9a8af Jan 10, 2020
Copy link
Member Author

andreimatei left a comment

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @darinpp)


pkg/storage/reports/replication_stats_report_test.go, line 292 at r4 (raw file):

Previously, ajwerner wrote…

Under-replicated because 4 is dead.?

done

craig bot pushed a commit that referenced this pull request Jan 14, 2020
43825: storage/reports: change node liveness considerations r=andreimatei a=andreimatei

Before this patch, whether a node was "alive" or "dead" did not matter
for the under-replicated ranges count in system.replication_stats. This patch
makes replicas on dead stores be ignored when counting replicas for
purposes of the underreplicated counter.
Liveness used to matter for the unavailable count and for the
system.critical_localities report (and continues to matter).

Note that this change interestingly means that a range can be considered
both under-replicated and over-replicated at the same time - if there's
too many replicas, but sufficiently many of them are dead.

This patch also changes the liveness criteria across the board for the
reports that cared about liveness (unavailable ranges, under-replicated
ranges and critical localities). It used to be that a node was
considered dead if its liveness record had not been pinged for
server.time_until_store_dead (5 min by default).
Now, a node is considered dead as soon as its liveness record expires
(seconds). So all the reports become much more sensitive to node
unresponsiveness.

Release note (sql change): Ranges are now considered under-replicated by
the system.replication_stats report when one of the replicas is
unresponsive (or the respective node is not running).
Release note (sql change): The system.critical_localities and
system.replication_stats (fields unavailable_ranges and
under_replicated_ranges) are now quicker to reflect dead or unresponsive
nodes in their accounting. A node used to be considered dead if it was
unresponsive for server.time_until_store_dead (5 min by default), now it
is considered dead if it's been unresponsive for a few seconds.

43943: Revert "roachprod: Make multiple set [provider]-zones always geo-distribute nodes" r=ajwerner a=jlinder

This reverts commit d24e40e.

Reverting back to the prior roachprod functionality for how --geo and
--zones worked because nightly roachtest running from teamcity is failing
numerous tests and this will allow fixing roachtest without the time pressure.

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: James H. Linder <jamesl@cockroachlabs.com>
@craig

This comment has been minimized.

Copy link

craig bot commented Jan 14, 2020

Build succeeded

@craig craig bot merged commit bd9a8af into cockroachdb:master Jan 14, 2020
3 checks passed
3 checks passed
GitHub CI (Cockroach) TeamCity build finished
Details
bors Build succeeded
Details
license/cla Contributor License Agreement is signed.
Details
@andreimatei andreimatei deleted the andreimatei:reports.underreplicated-on-dead branch Jan 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.