Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: add replicas.uninitialized metric #73975

Merged
merged 1 commit into from
Dec 22, 2021

Conversation

kvoli
Copy link
Collaborator

@kvoli kvoli commented Dec 17, 2021

This change adds replicas.unitialized metric, tracking the number of non-init replicas on a store.

Note that this metric doesn't give the exact number of uninitialized replicas because, after a process restart, uninitialized replicas can remain on disk without an in memory representation.

resolves #73368

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@kvoli kvoli added do-not-merge bors won't merge a PR with this label. A-kv Anything in KV that doesn't belong in a more specific category. A-kv-observability C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) E-starter Might be suitable for a starter project for new employees or team members. T-kv KV Team labels Dec 17, 2021
@kvoli kvoli self-assigned this Dec 17, 2021
Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet turnaround time! And congrats on your first PR! In addition to the comments below, let's squash these changes into a single commit. We have this wiki-page on good commit hygiene that you might find interesting. For here I'd just suggest rewording the title to read something similar to what you have for the PR: "kvserver: add metric for uninitialized replicas".

@@ -3011,6 +3012,9 @@ func (s *Store) updateReplicationGauges(ctx context.Context) error {
if metrics.Quiescent {
quiescentCount++
}

uninitializedCount = int64(len(s.mu.uninitReplicas))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird, github reviews don't show where this map is being updated, but I could see it locally. In any case, some notes from when we looked at this offline:

  • The per-replica visitor above (newStoreReplicaVisitor) also iterates over uninitialized replicas. Instead of tracking+maintaining the set of uninitialized replicas in a separate map (probably more error prone), using the existing replica iteration + checking if the current replica is initialized is I think sufficient.
  • We'd want a test. We looked at another test to steal from where we could block the reception of a snapshot on a second node, ensure that our uninitialized count is as we'd expect, unblock the replica initialization, and ensure it's back to 0.

pkg/kv/kvserver/metrics.go Show resolved Hide resolved
@irfansharif irfansharif removed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) E-starter Might be suitable for a starter project for new employees or team members. A-kv Anything in KV that doesn't belong in a more specific category. A-kv-observability T-kv KV Team labels Dec 20, 2021
@irfansharif
Copy link
Contributor

Note on labels: for most of them (mod do-not-merge, X-noremind, etc) we really only use them for filed issues.

@kvoli kvoli force-pushed the replicas-uninit-metrics branch 5 times, most recently from c5d9d32 to 78acdf2 Compare December 21, 2021 22:46
Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test! I only have a few minor suggestions to trim it down and we should be all set. The CI failures also suggest an update to TestChartCatalogMetrics.

pkg/kv/kvserver/client_replica_test.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/client_replica_test.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/client_replica_test.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/client_replica_test.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/client_replica_test.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/client_replica_test.go Outdated Show resolved Hide resolved
@irfansharif irfansharif marked this pull request as ready for review December 21, 2021 23:16
@irfansharif irfansharif requested a review from a team as a code owner December 21, 2021 23:16
Copy link
Collaborator Author

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)


pkg/kv/kvserver/store.go, line 3016 at r4 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Weird, github reviews don't show where this map is being updated, but I could see it locally. In any case, some notes from when we looked at this offline:

  • The per-replica visitor above (newStoreReplicaVisitor) also iterates over uninitialized replicas. Instead of tracking+maintaining the set of uninitialized replicas in a separate map (probably more error prone), using the existing replica iteration + checking if the current replica is initialized is I think sufficient.
  • We'd want a test. We looked at another test to steal from where we could block the reception of a snapshot on a second node, ensure that our uninitialized count is as we'd expect, unblock the replica initialization, and ensure it's back to 0.

Done. We resolved to use the original method and reuse the test.


pkg/kv/kvserver/client_replica_test.go, line 3912 at r5 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I think we can get rid of all usages of stickyEngineRegistry, raftConfig, and stickySpecTestServerArgs. We're not restarting an existing test server and don't need server state to be persisted across the restarts, which is what all this machinery sets out to do. Trying it out locally, all we really need is the following:

	tc := testcluster.StartTestCluster(t, 1, base.TestClusterArgs{
		ReplicationMode: base.ReplicationManual,
		ServerArgs:      base.TestServerArgs{Insecure: true},
	})

And below:

	tc.AddAndStartServer(t, base.TestServerArgs{
		Insecure: true,

Done.


pkg/kv/kvserver/client_replica_test.go, line 3945 at r5 (raw file):

Previously, irfansharif (irfan sharif) wrote…

While we technically can split out a separate range using a tenant prefix, we can do something more direct:

	scratchKey := tc.ScratchRange(t)
	_, repl := getFirstStoreReplica(t, tc.Server(0), scratchKey)

Done.


pkg/kv/kvserver/client_replica_test.go, line 3949 at r5 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Since there's only one sub-test, let's get rid of this t.Run outdent and make it part of the main test itself. We typically use sub-tests when testing minor alterations of a single test run (perhaps with shared setup), which isn't quite the case here.

Done.


pkg/kv/kvserver/client_replica_test.go, line 3960 at r5 (raw file):

Previously, irfansharif (irfan sharif) wrote…

[nit] Given these are unused, s/request_type/_ and s/strings/_.

Done.


pkg/kv/kvserver/client_replica_test.go, line 3997 at r5 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/iota/0

Done.


pkg/kv/kvserver/client_replica_test.go, line 4001 at r5 (raw file):

Previously, irfansharif (irfan sharif) wrote…

[nit] Inline things here and below:

require.NoError(t, uninitStore.ComputeMetrics(ctx, 0))
// ...
require.Equal(t, int64(1), uninitStore.Metrics().UninitializedCount.Value())
// ...
require.NoError(t, uninitStore.ComputeMetrics(ctx, 0))
// ...
require.Equal(t, int64(0), uninitStore.Metrics().UninitializedCount.Value()) // now initialized

Done.


pkg/ui/workspaces/cluster-ui/src/store/nodes/nodes.fixtures.ts, line 891 at r5 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Is this diff still relevant given we're not adding it to the cluster UI? (anything in addition to the advanced debug page i.e.)

You are right, I thought it might be necessary to declare a default value. I tested without this line and it still shows for advanced debugging. Done.

Copy link
Collaborator Author

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@irfansharif thanks for the comments. I've adopted your suggestions.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)

@kvoli kvoli linked an issue Dec 22, 2021 that may be closed by this pull request
@kvoli kvoli removed the do-not-merge bors won't merge a PR with this label. label Dec 22, 2021
Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet, LGTM! For this kind of test I'd typically stress it locally to shake out any flakes, but because it's a new test our CI should already do that for you. For posterity:

$ dev test pkg/kv/kvserver -f=TestUninitializedMetric --stress
INFO: Found 1 test target
16 runs so far, 0 failures, over 5s
59 runs so far, 0 failures, over 10s
95 runs so far, 0 failures, over 15s
125 runs so far, 0 failures, over 20s
...

Had to first rebase on master to pick up #73982. Could also use make stress PKG=./pkg/kv/kvserver TESTS=TestUninitializedMetric.

@@ -3901,6 +3901,89 @@ func TestTenantID(t *testing.T) {

}

// TestUninitializedMetric
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// TestUninitializedMetric ensures that uninitialized replicas are reflected in the UninitializedCount store metric.

forceMetricTick++

require.NoError(t, <-addReplicaErr)
require.NoError(t, targetStore.ComputeMetrics(ctx, forceMetricTick))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, instead of maintaining the forceMetricCounter, just passing 0 directly in ComputeMetrics also works (looking at other usages that's what we seem to do).

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 5 files at r5, 1 of 2 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @kvoli)


-- commits, line 5 at r7:
You'll want to link this to #73368 here and the in PR description. You have "Addresses" in the PR description, but unfortunately, that is not one of the supported keywords. See https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword.


pkg/kv/kvserver/metrics.go, line 73 at r4 (raw file):

Previously, kvoli (Austen) wrote…

This is from: #73368. @nvanbenschoten mentions

uninitialized replicas can lie dormant in persistent state without any corresponding in-memory representation

Right, there's a small amount of state (typically just the HardState) that can be persisted by an uninitialized replica. We don't look for this on process startup and don't instantiate a Replica object for these unless that replica hears a new message. Undercounting here is mostly fine, as this persisted state is tiny and we're mostly interested in the cost of these in-memory Replica objects.

There's some discussion about this in here and here.


pkg/kv/kvserver/metrics.go, line 1249 at r7 (raw file):

	RaftLeaderNotLeaseHolderCount *metric.Gauge
	LeaseHolderCount              *metric.Gauge
	UninitializedCount            *metric.Gauge

nit: keep the ordering consistent. So move this below QuiescentCount.


pkg/kv/kvserver/store.go, line 3016 at r4 (raw file):

Previously, kvoli (Austen) wrote…

Done. We resolved to use the original method and reuse the test.

An alternative to this approach is to add a new WithUninitialized() method (similar to InOrder) to storeReplicaVisitor which avoids filtering out uninitialized replicas in storeReplicaVisitor.Visit. We could then add a small branch at the top of this visitor function that checks if the current replica is uninitialized ((*Replica).IsInitialized). If it is, increment uninitializedCount and return true.

But what we have here works too.


pkg/kv/kvserver/store.go, line 3019 at r7 (raw file):

			quiescentCount++
		}

nit: now that you've addressed Irfan's comment, there's a stray line here.

@kvoli kvoli force-pushed the replicas-uninit-metrics branch 2 times, most recently from 1acc811 to b51b14c Compare December 22, 2021 17:03
Copy link
Collaborator Author

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @kvoli, and @nvanbenschoten)


-- commits, line 5 at r7:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

You'll want to link this to #73368 here and the in PR description. You have "Addresses" in the PR description, but unfortunately, that is not one of the supported keywords. See https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword.

Updated with a supported keyword.


pkg/kv/kvserver/metrics.go, line 73 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Right, there's a small amount of state (typically just the HardState) that can be persisted by an uninitialized replica. We don't look for this on process startup and don't instantiate a Replica object for these unless that replica hears a new message. Undercounting here is mostly fine, as this persisted state is tiny and we're mostly interested in the cost of these in-memory Replica objects.

There's some discussion about this in here and here.

Noted, thanks for the context.


pkg/kv/kvserver/metrics.go, line 1249 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: keep the ordering consistent. So move this below QuiescentCount.

Done.


pkg/kv/kvserver/store.go, line 3016 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

An alternative to this approach is to add a new WithUninitialized() method (similar to InOrder) to storeReplicaVisitor which avoids filtering out uninitialized replicas in storeReplicaVisitor.Visit. We could then add a small branch at the top of this visitor function that checks if the current replica is uninitialized ((*Replica).IsInitialized). If it is, increment uninitializedCount and return true.

But what we have here works too.

So it is being filtered out here. I wasn't certain. I'll keep it as is atm, since I don't know what else it might be necessary for yet.


pkg/kv/kvserver/client_replica_test.go, line 3904 at r6 (raw file):

Previously, irfansharif (irfan sharif) wrote…
// TestUninitializedMetric ensures that uninitialized replicas are reflected in the UninitializedCount store metric.

Done.


pkg/kv/kvserver/client_replica_test.go, line 3981 at r6 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Btw, instead of maintaining the forceMetricCounter, just passing 0 directly in ComputeMetrics also works (looking at other usages that's what we seem to do).

Adopted this change.

@kvoli kvoli force-pushed the replicas-uninit-metrics branch 2 times, most recently from 9f2d01d to db1e944 Compare December 22, 2021 17:26
This change adds a metric `replica.uninitialized` counting the number
of uninitialized replicas within a store.

resolves cockroachdb#73368

Release note (ops change): Added a metric `replica.uninitialized` which
tracks the number of `Uninitialized` replicas in a store.
@kvoli
Copy link
Collaborator Author

kvoli commented Dec 22, 2021

bors r+

@craig
Copy link
Contributor

craig bot commented Dec 22, 2021

This PR was included in a batch that was canceled, it will be automatically retried

@craig
Copy link
Contributor

craig bot commented Dec 22, 2021

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv: add replicas.uninitialized metric
4 participants