pkg/sql: export sql.aggregated_livebytes metric for tenants #119140
903bd23 to 7260245
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @dt, @JeffSwenson, and @michae2)
pkg/jobs/registry.go
line 282 at r3 (raw file):
Previously, dt (David Taylor) wrote…
I'd have a slight preference for putting this directly in JobExecContext or in ExecutorConfig, since those are supposed to be the things jobs/sql code needs to execute, whereas jobs.Registry is supposed to just be the thing that executes them rather than something they depend on. Obviously it is already used as a dep in a couple of places, so this is already a little circular and not something I feel strongly about, but this feels like it'd be more at home in ExecutorConfig than hanging off of the job registry.
Done. I moved it into ExecutorConfig via MetricsRecorder.
pkg/jobs/registry.go
line 303 at r3 (raw file):
Previously, dt (David Taylor) wrote…
nit: this doesn't really feel like it belongs on jobs.Registry to me. I think you can get this bool in your job off the existing execCtx argument, e.g. execCtx.ExecCfg().NodeInfo.NodeID.OptionalNodeID()
Done.
job infra changes (the very few that are left now) LGTM. (I'll leave chiming in on the actual business logic of the job itself to your other reviewers)
LGTM
TFTRs!
bors r+
Build succeeded.
blathers backport 23.2
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error setting reviewers, but backport branch blathers/backport-release-23.2-119140 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/119371/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []
Backport to branch 23.2 failed. See errors above.
pkg/util/metric: support metrics removal from the metrics registry
Previously, once a metric had been added to the metrics registry, it stayed
registered forever, and there was no mechanism to remove it. For
multi-tenancy, we plan to implement a job that exports global metrics for
tenants (i.e. such metrics should only exist on one SQL node at any point in
time). Given that jobs can be cancelled and resumed on a different SQL node,
the only way to support this behavior is to remove metrics from the registry
when the job is no longer running. This commit adds that support.
Epic: none
Release note: None
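As a rough illustration of the add/remove behavior described in this commit message, here is a minimal sketch of a registry that supports removal. The `Registry`, `Gauge`, `AddMetric`, `RemoveMetric`, and `Contains` names are hypothetical and are not taken from pkg/util/metric; the real registry is more involved.

```go
package main

import (
	"fmt"
	"sync"
)

// Gauge is a toy stand-in for a metric holding a single int64 value.
type Gauge struct {
	mu    sync.Mutex
	value int64
}

// Update sets the gauge's current value.
func (g *Gauge) Update(v int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.value = v
}

// Registry is a hypothetical, simplified metrics registry keyed by name.
type Registry struct {
	mu      sync.Mutex
	metrics map[string]*Gauge
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{metrics: make(map[string]*Gauge)}
}

// AddMetric registers g under name and rejects duplicate names.
func (r *Registry) AddMetric(name string, g *Gauge) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.metrics[name]; ok {
		return fmt.Errorf("metric %q is already registered", name)
	}
	r.metrics[name] = g
	return nil
}

// RemoveMetric unregisters name and reports whether it was present.
func (r *Registry) RemoveMetric(name string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	_, ok := r.metrics[name]
	delete(r.metrics, name)
	return ok
}

// Contains reports whether name is currently registered.
func (r *Registry) Contains(name string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	_, ok := r.metrics[name]
	return ok
}

func main() {
	r := NewRegistry()
	if err := r.AddMetric("sql.aggregated_livebytes", &Gauge{}); err != nil {
		panic(err)
	}
	fmt.Println("registered:", r.Contains("sql.aggregated_livebytes"))
	r.RemoveMetric("sql.aggregated_livebytes")
	fmt.Println("registered after removal:", r.Contains("sql.aggregated_livebytes"))
}
```

Because removal frees the name, the same metric can be registered again later, which is what a job resumed on this or another node would need.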
pkg/server: add ApproximateTotalStats field to SpanStats proto message
Previously, the SpanStats proto message only kept track of the logical MVCC
stats in the TotalStats field. This is insufficient for the work that exposes
the aggregated livebytes as a metric for tenants, as the metric value needs to
take into account all replicas of a given range. To address that, this commit
adds a new ApproximateTotalStats field to the SpanStats proto message, which
represents the post-replication MVCC stats for the span.
Epic: none
Release note: None
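One way to picture the difference between logical and post-replication stats is the sketch below. It assumes, purely for illustration, that every replica of a range holds about the same number of live bytes as the range's logical value; the `rangeStats` and `spanStats` types and the `aggregate` function are hypothetical and do not reflect how the SpanStats server actually computes ApproximateTotalStats.

```go
package main

import "fmt"

// rangeStats is a hypothetical per-range summary.
type rangeStats struct {
	LiveBytes   int64 // logical live bytes, counted once per range
	NumReplicas int64 // number of replicas of the range
}

// spanStats mirrors the idea of TotalStats versus ApproximateTotalStats,
// reduced to a single live-bytes counter each.
type spanStats struct {
	TotalLiveBytes            int64 // logical: each range counted once
	ApproximateTotalLiveBytes int64 // post-replication: every replica counted
}

// aggregate sums the logical live bytes and, under the assumption stated
// above, approximates the replicated total as LiveBytes * NumReplicas.
func aggregate(ranges []rangeStats) spanStats {
	var s spanStats
	for _, r := range ranges {
		s.TotalLiveBytes += r.LiveBytes
		s.ApproximateTotalLiveBytes += r.LiveBytes * r.NumReplicas
	}
	return s
}

func main() {
	s := aggregate([]rangeStats{
		{LiveBytes: 100, NumReplicas: 3},
		{LiveBytes: 50, NumReplicas: 5},
	})
	fmt.Println("logical:", s.TotalLiveBytes)
	fmt.Println("post-replication (approx.):", s.ApproximateTotalLiveBytes)
}
```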
pkg/sql: export sql.aggregated_livebytes metric for out-of-process tenants
Previously, in order to obtain livebytes metrics for tenants, one would need
to query such values via the KV servers, and this can be problematic if we
only have access to the SQL servers. For example, in CockroachDB Cloud,
only metrics from the SQL servers are exported to end-users, and this is done
directly from the cockroachdb process. It is not trivial to export an
additional subset of metrics from the KV servers filtered by tenant ID.
To address that, this commit exposes livebytes for tenants directly via an
aggregated metric on the SQL nodes. The aggregated metric will be updated
every 60 seconds by default, and will be exported via the existing MVCC
statistics update job. Unlike other job metrics, which are registered
at initialization time and stay forever, this aggregated metric is tied to
the lifespan of the job (i.e. it is only exported if the job is running, and
unexported otherwise).
This feature is scoped to standalone SQL servers only, which, at the time of
writing, are only supported in CockroachDB Cloud. If we wanted to backport this
into 23.2, it should be straightforward as well since the permanent upgrade
to insert the job is already in release-23.2.
Fixes: #119139
Epic: none
Release note (sql change): Out-of-process SQL servers will start exporting a
new sql.aggregated_livebytes metric. This metric gets updated once every 60
seconds by default, and its update interval can be configured via the
`tenant_global_metrics_exporter_interval` cluster setting.