pkg/sql: export sql.aggregated_livebytes metric for tenants #119140

jaylim-crl · 2024-02-13T18:07:14Z

pkg/util/metric: support metrics removal from the metrics registry

Previously, once a metric has been added to the metrics registry, it will
always be registered forever, and there isn't a mechanism to remove it. For
multi-tenancy, we plan to implement a job that exports global metrics for
tenants (i.e. such metrics should only exist on one SQL node at any point in
time). Given that jobs can be cancelled and resumed on a different SQL node,
the only option to support such a behavior is to remove metrics from the
registry when the job is no longer running, and this commit adds such support
to it.

Epic: none

Release note: None

pkg/server: add ApproximateTotalStats field to SpanStats proto message

Previously, the SpanStats proto message only kept track of the logical MVCC
stats in the TotalStats field. This is insufficient for the work that exposes
the aggregated livebytes as a metric for tenants as the metric value needs to
take into account all replicas for a given range. To address that, this commit
adds a new ApproximateTotalStats field to the SpanStats proto message, and it
represents post-replicated MVCC stats for the span.

Epic: none

Release note: None

pkg/sql: export sql.aggregated_livebytes metric for out-of-process tenants

Previously, in order to obtain livebytes metrics for tenants, one would need
to query such values via the KV servers, and this can be problematic if we
only have access to just the SQL servers. For example, in CockroachDB Cloud,
only metrics from the SQL servers are exported to end-users, and is done so
directly from the cockroachdb process. It is not trivial to export an
additional subset of metrics from the KV servers filtered by tenant ID.

To address that, this commit exposes livebytes for tenants directly via an
aggregated metric on the SQL nodes. The aggregated metric will be updated
every 60 seconds by default, and will be exported via the existing MVCC
statistics update job. Unlike other job metrics where metrics are registered
at initialization time and stays forever, this aggregated metric is tied to
the lifespan of the job (i.e. it is only exported if the job is running, and
unexported otherwise).

This feature is scoped to standalone SQL servers only, which at this point of
writing, is only supported in CockroachDB Cloud. If we wanted to backport this
into 23.2, it should be straightforward as well since the permanent upgrade
to insert the job is already in release-23.2.

Fixes: #119139

Epic: none

Release note (sql change): Out-of-process SQL servers will start exporting a
new sql.aggregated_livebytes metric. This metric gets updated once every 60
seconds by default, and its update interval can be configured via the
tenant_global_metrics_exporter_interval cluster setting.

Previously, once a metric has been added to the metrics registry, it will always be registered forever, and there isn't a mechanism to remove it. For multi-tenancy, we plan to implement a job that exports global metrics for tenants (i.e. such metrics should only exist on one SQL node at any point in time). Given that jobs can be cancelled and resumed on a different SQL node, the only option to support such a behavior is to remove metrics from the registry when the job is no longer running, and this commit adds such support to it. Epic: none Release note: None

Previously, the SpanStats proto message only kept track of the logical MVCC stats in the TotalStats field. This is insufficient for the work that exposes the aggregated livebytes as a metric for tenants as the metric value needs to take into account all replicas for a given range. To address that, this commit adds a new ApproximateTotalStats field to the SpanStats proto message, and it represents post-replicated MVCC stats for the span. Epic: none Release note: None

cockroach-teamcity · 2024-02-13T18:07:25Z

This change is

pkg/jobs/registry.go

…nants Previously, in order to obtain livebytes metrics for tenants, one would need to query such values via the KV servers, and this can be problematic if we only have access to just the SQL servers. For example, in CockroachDB Cloud, only metrics from the SQL servers are exported to end-users, and is done so directly from the cockroachdb process. It is not trivial to export an additional subset of metrics from the KV servers filtered by tenant ID. To address that, this commit exposes livebytes for tenants directly via an aggregated metric on the SQL nodes. The aggregated metric will be updated every 60 seconds by default, and will be exported via the existing MVCC statistics update job. Unlike other job metrics where metrics are registered at initialization time and stays forever, this aggregated metric is tied to the lifespan of the job (i.e. it is only exported if the job is running, and unexported otherwise). This feature is scoped to standalone SQL servers only, which at this point of writing, is only supported in CockroachDB Cloud. If we wanted to backport this into 23.2, it should be straightforward as well since the permanent upgrade to insert the job is already in release-23.2. Fixes: cockroachdb#119139 Epic: none Release note (sql change): Out-of-process SQL servers will start exporting a new sql.aggregated_livebytes metric. This metric gets updated once every 60 seconds by default, and its update interval can be configured via the `tenant_global_metrics_exporter_interval` cluster setting.

jaylim-crl

TFTR!

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @dt, @JeffSwenson, and @michae2)

pkg/jobs/registry.go line 282 at r3 (raw file):

Previously, dt (David Taylor) wrote…

I'd have a slight preference for putting this directly in JobExecContext or in ExecutorConfig since those is supposed the thing jobs/sql code need to execute, whereas jobs.Registry is supposed to just be the thing that executes them rather than something they depend on. Obviously it already is used as a dep in a couple places so this is already a little circular and I not something I feel strongly about, but this feels like it'd be more at home in ExecutorConfig than hanging off of job registry.

Done. I moved it into ExecutorConfig via MetricsRecorder.

pkg/jobs/registry.go line 303 at r3 (raw file):

Previously, dt (David Taylor) wrote…

nit: this doesn't really feel like it belongs on jobs.Registry to me. I think you can get this bool in your job off the existing execCtx argument, e.g. execCtx.ExecCfg().NodeInfo.NodeID.OptionalNodeID()

Done.

dt · 2024-02-13T20:31:50Z

job infra changes (the very few that are left now) LTGM.

(I'll leave chiming in on the actual business logic of the job itself to your other reviewers)

JeffSwenson

LGTM

jaylim-crl · 2024-02-17T17:47:13Z

TFTRs!

bors r+

craig · 2024-02-17T18:29:19Z

Build succeeded:

Bazel Essential CI (Cockroach)

jaylim-crl · 2024-02-19T22:31:17Z

blathers backport 23.2

blathers-crl · 2024-02-19T22:31:26Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error setting reviewers, but backport branch blathers/backport-release-23.2-119140 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/119371/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.2 failed. See errors above.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

jaylim-crl added 2 commits February 13, 2024 13:06

jaylim-crl mentioned this pull request Feb 13, 2024

pkg/sql: export sql.aggregated_livebytes metric for tenants #118796

Closed

jaylim-crl marked this pull request as ready for review February 13, 2024 18:09

jaylim-crl requested review from a team as code owners February 13, 2024 18:09

jaylim-crl requested review from abarganier, msbutler, JeffSwenson and andy-kimball and removed request for a team, abarganier and msbutler February 13, 2024 18:09

dt reviewed Feb 13, 2024

View reviewed changes

pkg/jobs/registry.go Outdated Show resolved Hide resolved

dt reviewed Feb 13, 2024

View reviewed changes

pkg/jobs/registry.go Outdated Show resolved Hide resolved

jaylim-crl force-pushed the jay/240212-aggregated-livebytes-external-tenants branch from 903bd23 to 7260245 Compare February 13, 2024 20:27

jaylim-crl requested review from a team as code owners February 13, 2024 20:27

jaylim-crl requested review from michae2 and dt and removed request for a team February 13, 2024 20:27

jaylim-crl commented Feb 13, 2024

View reviewed changes

jaylim-crl removed the request for review from michae2 February 13, 2024 20:29

JeffSwenson approved these changes Feb 16, 2024

View reviewed changes

craig bot merged commit 356b76f into cockroachdb:master Feb 17, 2024
12 checks passed

blathers-crl bot mentioned this pull request Feb 19, 2024

release-23.2: pkg/sql: export sql.aggregated_livebytes metric for tenants #119371

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pkg/sql: export sql.aggregated_livebytes metric for tenants #119140

pkg/sql: export sql.aggregated_livebytes metric for tenants #119140

jaylim-crl commented Feb 13, 2024

cockroach-teamcity commented Feb 13, 2024

jaylim-crl left a comment

dt commented Feb 13, 2024

JeffSwenson left a comment

jaylim-crl commented Feb 17, 2024

craig bot commented Feb 17, 2024

jaylim-crl commented Feb 19, 2024

blathers-crl bot commented Feb 19, 2024

pkg/sql: export sql.aggregated_livebytes metric for tenants #119140

pkg/sql: export sql.aggregated_livebytes metric for tenants #119140

Conversation

jaylim-crl commented Feb 13, 2024

pkg/util/metric: support metrics removal from the metrics registry

pkg/server: add ApproximateTotalStats field to SpanStats proto message

pkg/sql: export sql.aggregated_livebytes metric for out-of-process tenants

cockroach-teamcity commented Feb 13, 2024

jaylim-crl left a comment

Choose a reason for hiding this comment

dt commented Feb 13, 2024

JeffSwenson left a comment

Choose a reason for hiding this comment

jaylim-crl commented Feb 17, 2024

craig bot commented Feb 17, 2024

jaylim-crl commented Feb 19, 2024

blathers-crl bot commented Feb 19, 2024