Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/sql: export sql.aggregated_livebytes metric for tenants #119140

Conversation

jaylim-crl
Copy link
Collaborator

pkg/util/metric: support metrics removal from the metrics registry

Previously, once a metric has been added to the metrics registry, it will
always be registered forever, and there isn't a mechanism to remove it. For
multi-tenancy, we plan to implement a job that exports global metrics for
tenants (i.e. such metrics should only exist on one SQL node at any point in
time). Given that jobs can be cancelled and resumed on a different SQL node,
the only option to support such a behavior is to remove metrics from the
registry when the job is no longer running, and this commit adds such support
to it.

Epic: none

Release note: None

pkg/server: add ApproximateTotalStats field to SpanStats proto message

Previously, the SpanStats proto message only kept track of the logical MVCC
stats in the TotalStats field. This is insufficient for the work that exposes
the aggregated livebytes as a metric for tenants as the metric value needs to
take into account all replicas for a given range. To address that, this commit
adds a new ApproximateTotalStats field to the SpanStats proto message, and it
represents post-replicated MVCC stats for the span.

Epic: none

Release note: None

pkg/sql: export sql.aggregated_livebytes metric for out-of-process tenants

Previously, in order to obtain livebytes metrics for tenants, one would need
to query such values via the KV servers, and this can be problematic if we
only have access to just the SQL servers. For example, in CockroachDB Cloud,
only metrics from the SQL servers are exported to end-users, and is done so
directly from the cockroachdb process. It is not trivial to export an
additional subset of metrics from the KV servers filtered by tenant ID.

To address that, this commit exposes livebytes for tenants directly via an
aggregated metric on the SQL nodes. The aggregated metric will be updated
every 60 seconds by default, and will be exported via the existing MVCC
statistics update job. Unlike other job metrics where metrics are registered
at initialization time and stays forever, this aggregated metric is tied to
the lifespan of the job (i.e. it is only exported if the job is running, and
unexported otherwise).

This feature is scoped to standalone SQL servers only, which at this point of
writing, is only supported in CockroachDB Cloud. If we wanted to backport this
into 23.2, it should be straightforward as well since the permanent upgrade
to insert the job is already in release-23.2.

Fixes: #119139

Epic: none

Release note (sql change): Out-of-process SQL servers will start exporting a
new sql.aggregated_livebytes metric. This metric gets updated once every 60
seconds by default, and its update interval can be configured via the
tenant_global_metrics_exporter_interval cluster setting.

Previously, once a metric has been added to the metrics registry, it will
always be registered forever, and there isn't a mechanism to remove it. For
multi-tenancy, we plan to implement a job that exports global metrics for
tenants (i.e. such metrics should only exist on one SQL node at any point in
time). Given that jobs can be cancelled and resumed on a different SQL node,
the only option to support such a behavior is to remove metrics from the
registry when the job is no longer running, and this commit adds such support
to it.

Epic: none

Release note: None
Previously, the SpanStats proto message only kept track of the logical MVCC
stats in the TotalStats field. This is insufficient for the work that exposes
the aggregated livebytes as a metric for tenants as the metric value needs to
take into account all replicas for a given range. To address that, this commit
adds a new ApproximateTotalStats field to the SpanStats proto message, and it
represents post-replicated MVCC stats for the span.

Epic: none

Release note: None
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jaylim-crl jaylim-crl marked this pull request as ready for review February 13, 2024 18:09
@jaylim-crl jaylim-crl requested review from a team as code owners February 13, 2024 18:09
@jaylim-crl jaylim-crl requested review from abarganier, msbutler, JeffSwenson and andy-kimball and removed request for a team, abarganier and msbutler February 13, 2024 18:09
pkg/jobs/registry.go Outdated Show resolved Hide resolved
pkg/jobs/registry.go Outdated Show resolved Hide resolved
…nants

Previously, in order to obtain livebytes metrics for tenants, one would need
to query such values via the KV servers, and this can be problematic if we
only have access to just the SQL servers. For example, in CockroachDB Cloud,
only metrics from the SQL servers are exported to end-users, and is done so
directly from the cockroachdb process. It is not trivial to export an
additional subset of metrics from the KV servers filtered by tenant ID.

To address that, this commit exposes livebytes for tenants directly via an
aggregated metric on the SQL nodes. The aggregated metric will be updated
every 60 seconds by default, and will be exported via the existing MVCC
statistics update job. Unlike other job metrics where metrics are registered
at initialization time and stays forever, this aggregated metric is tied to
the lifespan of the job (i.e. it is only exported if the job is running, and
unexported otherwise).

This feature is scoped to standalone SQL servers only, which at this point of
writing, is only supported in CockroachDB Cloud. If we wanted to backport this
into 23.2, it should be straightforward as well since the permanent upgrade
to insert the job is already in release-23.2.

Fixes: cockroachdb#119139

Epic: none

Release note (sql change): Out-of-process SQL servers will start exporting a
new sql.aggregated_livebytes metric. This metric gets updated once every 60
seconds by default, and its update interval can be configured via the
`tenant_global_metrics_exporter_interval` cluster setting.
@jaylim-crl jaylim-crl force-pushed the jay/240212-aggregated-livebytes-external-tenants branch from 903bd23 to 7260245 Compare February 13, 2024 20:27
@jaylim-crl jaylim-crl requested review from a team as code owners February 13, 2024 20:27
@jaylim-crl jaylim-crl requested review from michae2 and dt and removed request for a team February 13, 2024 20:27
Copy link
Collaborator Author

@jaylim-crl jaylim-crl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @dt, @JeffSwenson, and @michae2)


pkg/jobs/registry.go line 282 at r3 (raw file):

Previously, dt (David Taylor) wrote…

I'd have a slight preference for putting this directly in JobExecContext or in ExecutorConfig since those is supposed the thing jobs/sql code need to execute, whereas jobs.Registry is supposed to just be the thing that executes them rather than something they depend on. Obviously it already is used as a dep in a couple places so this is already a little circular and I not something I feel strongly about, but this feels like it'd be more at home in ExecutorConfig than hanging off of job registry.

Done. I moved it into ExecutorConfig via MetricsRecorder.


pkg/jobs/registry.go line 303 at r3 (raw file):

Previously, dt (David Taylor) wrote…

nit: this doesn't really feel like it belongs on jobs.Registry to me. I think you can get this bool in your job off the existing execCtx argument, e.g. execCtx.ExecCfg().NodeInfo.NodeID.OptionalNodeID()

Done.

@jaylim-crl jaylim-crl removed the request for review from michae2 February 13, 2024 20:29
@dt
Copy link
Member

dt commented Feb 13, 2024

job infra changes (the very few that are left now) LTGM.

(I'll leave chiming in on the actual business logic of the job itself to your other reviewers)

Copy link
Collaborator

@JeffSwenson JeffSwenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jaylim-crl
Copy link
Collaborator Author

TFTRs!

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 17, 2024

Build succeeded:

@craig craig bot merged commit 356b76f into cockroachdb:master Feb 17, 2024
12 checks passed
@jaylim-crl
Copy link
Collaborator Author

blathers backport 23.2

Copy link

blathers-crl bot commented Feb 19, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-23.2-119140 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/119371/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: export aggregated livebytes for tenants in SQL
4 participants