admission: revamp db console overload page to have useful metrics #121572

Closed
Tracked by #121574
aadityasondhi opened this issue Apr 2, 2024 · 1 comment
Labels
A-admission-control
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-admission-control: Admission Control

Comments

@aadityasondhi
Collaborator

aadityasondhi commented Apr 2, 2024

Through escalations, we have found that some metrics on this page are not useful, while some useful metrics are missing. We should be deliberate about each chart on this page.

Examples to remove:

  • admission delay rate
  • admission slots

Jira issue: CRDB-37347

Epic CRDB-36319

@aadityasondhi aadityasondhi added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control T-admission-control Admission Control labels Apr 2, 2024
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue May 2, 2024
In investigations, we have found that the following charts are not
useful and frequently cause confusion:
- Admission work rate
- Admission Delay rate
- Requests Waiting For Flow Tokens

Informs cockroachdb#121572

Release note (ui change): This patch removes "Admission Delay Rate",
"Admission Work Rate", and "Requests Waiting For Flow Tokens". These
charts often cause confusion and are not useful for general overload
investigations.
@aadityasondhi aadityasondhi self-assigned this May 2, 2024
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue May 2, 2024
This patch reorders the existing metrics in a more usable order:
1. Metrics to help determine which resource is constrained (IO, CPU)
2. Metrics to narrow down which AC queues are seeing requests waiting
3. More advanced metrics about the system health (goroutine scheduler,
   L0 sublevels, etc.)

Informs cockroachdb#121572.

Release note (ui change): Reordering of metrics on the overload page to
help categorize them better. They are roughly in the following order:
1. Metrics to help determine which resource is constrained (IO, CPU)
2. Metrics to narrow down which AC queues are seeing requests waiting
3. More advanced metrics about the system health (goroutine scheduler,
   L0 sublevels, etc.)
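
The three-tier ordering described above can be sketched roughly as follows. This is only an illustration, not the DB Console's actual code: the `Chart` type, the `orderCharts` helper, and the tier assigned to each chart are assumptions made for the example (in particular, placing `Admission IO Tokens Exhausted` in tier 1 is a guess).

```typescript
// Hypothetical sketch: the type, helper name, and tier assignments are
// assumptions for illustration, not the DB Console's real implementation.
type Tier = 1 | 2 | 3;

interface Chart {
  title: string;
  tier: Tier; // 1: constrained resource, 2: AC queues, 3: advanced health
}

const charts: Chart[] = [
  { title: "cr.store.storage.l0-sublevels", tier: 3 },
  { title: "Admission Queueing Delay – Store", tier: 2 },
  { title: "Admission IO Tokens Exhausted", tier: 1 },
  { title: "cr.node.go.scheduler_latency-p99.9", tier: 3 },
];

// Order charts by tier, keeping the original relative order within a tier.
function orderCharts(input: Chart[]): Chart[] {
  return input
    .map((chart, index) => ({ chart, index }))
    .sort((a, b) => a.chart.tier - b.chart.tier || a.index - b.index)
    .map(({ chart }) => chart);
}

const ordered: string[] = orderCharts(charts).map((c) => c.title);
console.log(ordered.join("\n"));
```

Sorting on an explicit tier keeps the page layout a matter of data rather than the order in which charts happen to be declared.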
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue May 2, 2024
This patch adds additional metrics to the overload page that allow for
a more granular look at the system:
- cr.store.storage.l0-sublevels
- cr.node.go.scheduler_latency-p99.9

Informs cockroachdb#121572.

Release note (ui change): Two additional metrics on the overload page
for better visibility into overloaded resources:
- cr.store.storage.l0-sublevels
- cr.node.go.scheduler_latency-p99.9
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue May 7, 2024
Informs cockroachdb#121572.

Release note (ui change): There are now 4 graphs for Admission Queue
Delay:
1. Foreground (regular) CPU work
2. Store (IO) work
3. Background (elastic) CPU work
4. Replication Admission Control, store overload on replicas
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue May 17, 2024
This patch uses the new separated `elastic-stores` metrics for queueing
delay from cockroachdb#123890.

Informs cockroachdb#121572.

Release note (ui change): The `Admission Queueing Delay – Store` chart
now separates elastic (background) work from the regular foreground
work.
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue May 17, 2024
This patch adds the metric `elastic_io_tokens_exhausted_duration.kv`
introduced in cockroachdb#124078.

Informs cockroachdb#121572.

Release note (ui change): The `Admission IO Tokens Exhausted` chart now
separates elastic and regular IO work.
craig bot pushed a commit that referenced this issue May 21, 2024
123522: dbconsole: overload page improvements r=sumeerbhola a=aadityasondhi

This PR contains a series of improvements to the overload page of the DB console as part of #121574. It is separated into multiple commits for ease of review.

____

dbconsole: remove non useful charts on the overload page

In investigations, we have found that the following charts are not
useful and frequently cause confusion:
- Admission work rate
- Admission Delay rate
- Requests Waiting For Flow Tokens

Informs #121572

Release note (ui change): This patch removes "Admission Delay Rate",
"Admission Work Rate", and "Requests Waiting For Flow Tokens". These
charts often cause confusion and are not useful for general overload
investigations.

___

dbconsole: reorder overload page metrics for better readability

This patch reorders the existing metrics in a more usable order:
1. Metrics to help determine which resource is constrained (IO, CPU)
2. Metrics to narrow down which AC queues are seeing requests waiting
3. More advanced metrics about the system health (goroutine scheduler,
   L0 sublevels, etc.)

Informs #121572.

Release note (ui change): Reordering of metrics on the overload page to
help categorize them better. They are roughly in the following order:
1. Metrics to help determine which resource is constrained (IO, CPU)
2. Metrics to narrow down which AC queues are seeing requests waiting
3. More advanced metrics about the system health (goroutine scheduler,
   L0 sublevels, etc.)

___

dbconsole: include better names and descriptions for overload page

This patch improves the metric descriptions for the metrics on the
overload page.

Fixes #120853.

Release note (ui change): The overload page now includes descriptions for all
metrics.

___

dbconsole: additional higher granularity metrics for overload

This patch adds additional metrics to the overload page that allow for
a more granular look at the system:
- cr.store.storage.l0-sublevels
- cr.node.go.scheduler_latency-p99.9

Informs #121572.

Release note (ui change): Two additional metrics on the overload page
for better visibility into overloaded resources:
- cr.store.storage.l0-sublevels
- cr.node.go.scheduler_latency-p99.9

___

dbconsole: split Admission Queue graphs to avoid overcrowding

Informs #121572.

Release note (ui change): There are now 4 graphs for Admission Queue
Delay:
1. Foreground (regular) CPU work
2. Store (IO) work
3. Background (elastic) CPU work
4. Replication Admission Control, store overload on replicas

___

dbconsole: add elastic store metric to the overload page 

This patch uses the new separated `elastic-stores` metrics for queueing
delay from #123890.

Informs #121572.

Release note (ui change): The `Admission Queueing Delay – Store` chart
now separates elastic (background) work from the regular foreground
work.

___

dbconsole: add elastic io token exhausted duration to overload page 

This patch adds the metric `elastic_io_tokens_exhausted_duration.kv`
introduced in #124078.

Informs #121572.

Release note (ui change): The `Admission IO Tokens Exhausted` chart now
separates elastic and regular IO work.

124493: packer: only try emulating via Docker on x86 r=rail a=rickystewart

Epic: none

Release note: None

Co-authored-by: Aaditya Sondhi <20070511+aadityasondhi@users.noreply.github.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue May 21, 2024
In investigations, we have found that the following charts are not
useful and frequently cause confusion:
- Admission work rate
- Admission Delay rate
- Requests Waiting For Flow Tokens

Informs #121572

Release note (ui change): This patch removes "Admission Delay Rate",
"Admission Work Rate", and "Requests Waiting For Flow Tokens". These
charts often cause confusion and are not useful for general overload
investigations.
blathers-crl bot pushed a commit that referenced this issue May 21, 2024
This patch reorders the existing metrics in a more usable order:
1. Metrics to help determine which resource is constrained (IO, CPU)
2. Metrics to narrow down which AC queues are seeing requests waiting
3. More advanced metrics about the system health (goroutine scheduler,
   L0 sublevels, etc.)

Informs #121572.

Release note (ui change): Reordering of metrics on the overload page to
help categorize them better. They are roughly in the following order:
1. Metrics to help determine which resource is constrained (IO, CPU)
2. Metrics to narrow down which AC queues are seeing requests waiting
3. More advanced metrics about the system health (goroutine scheduler,
   L0 sublevels, etc.)
blathers-crl bot pushed a commit that referenced this issue May 21, 2024
This patch adds additional metrics to the overload page that allow for
a more granular look at the system:
- cr.store.storage.l0-sublevels
- cr.node.go.scheduler_latency-p99.9

Informs #121572.

Release note (ui change): Two additional metrics on the overload page
for better visibility into overloaded resources:
- cr.store.storage.l0-sublevels
- cr.node.go.scheduler_latency-p99.9
blathers-crl bot pushed a commit that referenced this issue May 21, 2024
Informs #121572.

Release note (ui change): There are now 4 graphs for Admission Queue
Delay:
1. Foreground (regular) CPU work
2. Store (IO) work
3. Background (elastic) CPU work
4. Replication Admission Control, store overload on replicas
blathers-crl bot pushed a commit that referenced this issue May 21, 2024
This patch uses the new separated `elastic-stores` metrics for queueing
delay from #123890.

Informs #121572.

Release note (ui change): The `Admission Queueing Delay – Store` chart
now separates elastic (background) work from the regular foreground
work.
blathers-crl bot pushed a commit that referenced this issue May 21, 2024
This patch adds the metric `elastic_io_tokens_exhausted_duration.kv`
introduced in #124078.

Informs #121572.

Release note (ui change): The `Admission IO Tokens Exhausted` chart now
separates elastic and regular IO work.
@aadityasondhi
Collaborator Author

Merged #123522.
