New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ui: improve sql stats fingerprints pages response time #97875
Comments
I'm going to include/address the following issues as a part of this: |
|
@j82w Great questions --
In a lot of our default time picker cases we'll be fetching in-memory stats along with the flushed data already. I think it's best not to always incur an additional request if the sql stats system is performing alright. This feature is moreso to offer a reliable alternative while we stabilize sql stats.
I think just labeling 'in memory' might be more confusing since they won't know what the date range is. The flush interval is a cluster setting used by all nodes, so my thinking here was just to surface what that interval is so we could show that when we surface the in-memory option (e.g. 'Last 10 Minutes'). WDYT? |
I was hoping this feature would also reduce the initial page load times for everyone and not just the customers with large datasets. If we want to focus on just getting the one scenario to work that is fine.
I'm trying to think of a way to convey that it's not all data from the last 10 minutes, because different nodes will have different times they flush the data. If it just says 'Last 10 Minutes' we will likely get customer issues asking why they don't see a query that ran 2minutes ago because the flush occurred 1 minute ago. @kevin-v-ngo and/or @dongniwang do you have any suggestions? |
Data point for
(select * from system.statement_statistics
order by statistics->'execution_statistics'->>'cnt' limit 40)
union
(select * from system.statement_statistics
order by statistics->'execution_statistics'->'contentionTime'->>'mean' limit 40)
union (
select * from system.statement_statistics
order by statistics->'execution_statistics'->'cpuSQLNanos'->>'mean' limit 40)
union (
select * from system.statement_statistics
order by statistics->'execution_statistics'->'rowsRead'->>'mean' limit 40)
union (
select * from system.statement_statistics
order by statistics->'execution_statistics'->'runLat'->>'mean' limit 40)
union (
select * from system.statement_statistics
order by statistics->'execution_statistics'->'svcLat'->>'mean' limit 40) Data point for
Taking the above numbers into consideration would a better solution be to:
This solution would allow users to select time frames and avoids the overhead of sending all the data back to the ui to do the computation. The trade off is that each time there is a sort/filter it will take longer. WDYT? |
Adding onto the above -- posting from slack for tracking. I spoke with @j82w on Friday who initially had some concerns about the in-memory fallback, and after doing some testing of my own with lots of in-memory data (~40k unique fingerprints), I also now feel like providing the in-memory option may not be the reliable solution we want, rather I feel like we should actually do the opposite -- default to not reading from in-memory stats and querying from the system table only (with the new limit and order by features to get interesting statements). When we have a lot of in-memory data (which I assume customers running into the loading issue often do, since the default is past 1 hour) it seems to cause all sorts of problems --
Given the virtual table generation and merging weith in-memory stats seems to the be bottleneck in seemingly problematic workloads (ones with high unique fingerprints), how does product feel about not including in-mem stats by default? Our flush interval is already set to 10 minutes, which is pretty frequent. One would then need to wait 10 minutes to see any data appear (after resetting). If it's important to surface that data right after resetting / starting the cluster, maybe an idea there is if we don't return anything from the system table, to then query from in-memory in the interim (we'd still need to make improvments on our in-memory response time to service that reasonably). |
The in-memory results have the following issues.
Here is a rough diagram explaining the process. Let me know if you have any suggestions or notice any errors. sequenceDiagram
title crdb_internal fanout logic
participant user
participant sql
participant crdb_internal
participant SQLStatusServer
participant Local
user->>sql: Request crdb_internal.cluster_statement_statistics
sql->>crdb_internal: Generate the view
crdb_internal->>SQLStatusServer: Requests all stats. No filters or limits are passed in.
SQLStatusServer->>Local: No filters
activate Local
Local-->>SQLStatusServer: Return up to 100k
deactivate Local
loop All nodes in cluster. Up to 100 parallel
SQLStatusServer->>Node2: No filters
activate Node2
SQLStatusServer->>Node3: No filters
activate Node3
SQLStatusServer->>Node4: No filters
activate Node4
Node2-->>SQLStatusServer: Return up to 100k
deactivate Node2
Node4-->>SQLStatusServer: Return up to 100k
deactivate Node4
Node3-->>SQLStatusServer: Return up to 100k
deactivate Node3
end
SQLStatusServer-->>crdb_internal: Return up to 100k
crdb_internal->>crdb_internal: Generate the view with 100k rows
crdb_internal->>sql: Return 100k row view
sql->>sql: Filter/sort rows
sql->>user: Results
|
Previously, we were automatically fetching new data on the finegerprints pages every 5m if the time period selected was of the form 'Past xxxx of data'. We should not automatically poll because: 1. this is an expensive request, we don't want to unnecessarily send new ones if the user does not need to refresh their data. 2. It can be a weird experience to be viewing your table and have things suddenly update/change, since we don't communicate that we're fetching new data every 5m. Epic: none Part of: cockroachdb#97875 Release note (ui change): New data is not auto fetched every 5m on the stmt / txn fingerprints pages.
Previously, we were automatically fetching new data on the finegerprints pages every 5m if the time period selected was of the form 'Past xxxx of data'. We should not automatically poll because: 1. this is an expensive request, we don't want to unnecessarily send new ones if the user does not need to refresh their data. 2. It can be a weird experience to be viewing your table and have things suddenly update/change, since we don't communicate that we're fetching new data every 5m. As part of this change, we move the invalidation of data depending on the global time scale object to the saga observing the timescale update in redux, rather than in the saga that dispatches the update time scale action. This ensures that the update ts aciton has been reduced at the time of invalidation. In the future, we should now remove the `SET_GLOBAL_TIME_SCALE` action as it is just a wrapper dispatching `SET_SCALE`. Epic: none Part of: cockroachdb#97875 Release note (ui change): New data is not auto fetched every 5m on the stmt / txn fingerprints pages.
Previously, we were automatically fetching new data on the finegerprints pages every 5m if the time period selected was of the form 'Past xxxx of data'. We should not automatically poll because: 1. this is an expensive request, we don't want to unnecessarily send new ones if the user does not need to refresh their data. 2. It can be a weird experience to be viewing your table and have things suddenly update/change, since we don't communicate that we're fetching new data every 5m. As part of this change, we move the invalidation of data depending on the global time scale object to the saga observing the timescale update in redux, rather than in the saga that dispatches the update time scale action. This ensures that the update ts aciton has been reduced at the time of invalidation. In the future, we should now remove the `SET_GLOBAL_TIME_SCALE` action as it is just a wrapper dispatching `SET_SCALE`. Epic: none Part of: cockroachdb#97875 Release note (ui change): New data is not auto fetched every 5m on the stmt / txn fingerprints pages.
This commit adds limit and sort fields to the combined statements request. These params can be used to specify how much data is returned and the priority at which to return data (e.g. top-k). The current sort options are: - service latency - contention time - execution count Epic: none Part of: cockroachdb#97876 Part of: cockroachdb#97875 Release note: None
This commit adds new knobs to the sql stats fingerprint pages. Users can now specify a limit and sort priority with their sql stats fingerprints requests. Closes: cockroachdb#97876 Part of: cockroachdb#97875 Release note: None
This commit adds a new field, fetch_mode, which is a message containing an enum specifying what kind of stats to return from the combined stmts endpoint. The options are - StmtStatsOnly - TxnStatsOnly Leaving this field null will fetch both stmts and txns. If `TxnStatsOnly` is specified, we will also include the stmt stats for the stms in each txn in the returned txn payload. This is because in the client app we need the stmt stats to properly build the txn fingerprint pages. In the future, we will split this API instead of using this fetch mode flag, but this is currently preferred to make it easier to backport for perf improvements. Epic: none Part of: cockroachdb#97252 Part of: cockroachdb#97875 Release note: None
Previously, both the stmt and txns fingerprints pages were using the same api response from the /statements. This was because we need the stmts response to build txns pages. Howevever having to fetch both types of stats can slow down the request. Now that we can specify the fetch_mode as part of the request, we can split the 2 into their own api calls and redux fields. Epic: none Part of: cockroachdb#97875 Release note: None
This commit adds limit and sort fields to the combined statements request. These params can be used to specify how much data is returned and the priority at which to return data (e.g. top-k). The current sort options are: - service latency - contention time - execution count Epic: none Part of: cockroachdb#97876 Part of: cockroachdb#97875 Release note: None
This commit adds new knobs to the sql stats fingerprint pages. Users can now specify a limit and sort priority with their sql stats fingerprints requests. Closes: cockroachdb#97876 Part of: cockroachdb#97875 Release note: None
Previously, we were automatically fetching new data on the fingerprints pages every 5m if the time period selected was of the form 'Past xxxx of data'. We should not automatically poll because: 1. this is an expensive request, we don't want to unnecessarily send new ones if the user does not need to refresh their data. 2. It can be a weird experience to be viewing your table and have things suddenly update/change, since we don't communicate that we're fetching new data every 5m. As part of this change, we move the invalidation of data depending on the global time scale object to the saga observing the timescale update in redux, rather than in the saga that dispatches the update time scale action. This ensures that the update ts action has been reduced at the time of invalidation. In the future, we should remove the `SET_GLOBAL_TIME_SCALE` action as it is just a wrapper dispatching `SET_SCALE`. This commit also removes issuing a stats request from the reset sql 14 stats saga, since invalidating the statements 15 will cause a refresh in the page already. Epic: none Part of: cockroachdb#97875 Release note (ui change): New data is not auto fetched every 5m on the stmt / txn fingerprints pages.
Previously, we were automatically fetching new data on the fingerprints pages every 5m if the time period selected was of the form 'Past xxxx of data'. We should not automatically poll because: 1. this is an expensive request, we don't want to unnecessarily send new ones if the user does not need to refresh their data. 2. It can be a weird experience to be viewing your table and have things suddenly update/change, since we don't communicate that we're fetching new data every 5m. As part of this change, we move the invalidation of data depending on the global time scale object to the saga observing the timescale update in redux, rather than in the saga that dispatches the update time scale action. This ensures that the update ts action has been reduced at the time of invalidation. In the future, we should remove the `SET_GLOBAL_TIME_SCALE` action as it is just a wrapper dispatching `SET_SCALE`. This commit also removes issuing a stats request from the reset sql 14 stats saga, since invalidating the statements 15 will cause a refresh in the page already. Epic: none Part of: cockroachdb#97875 Release note (ui change): New data is not auto fetched every 5m on the stmt / txn fingerprints pages.
This commit adds a new field, fetch_mode, which is a message containing an enum specifying what kind of stats to return from the combined stmts endpoint. The options are - StmtStatsOnly - TxnStatsOnly Leaving this field null will fetch both stmts and txns. If `TxnStatsOnly` is specified, we will also include the stmt stats for the stms in each txn in the returned txn payload. This is because in the client app we need the stmt stats to properly build the txn fingerprint pages. In the future, we will split this API instead of using this fetch mode flag, but this is currently preferred to make it easier to backport for perf improvements. Epic: none Part of: cockroachdb#97252 Part of: cockroachdb#97875 Release note: None
Previously, both the stmt and txns fingerprints pages were using the same api response from the /statements. This was because we need the stmts response to build txns pages. Howevever having to fetch both types of stats can slow down the request. Now that we can specify the fetch_mode as part of the request, we can split the 2 into their own api calls and redux fields. Epic: none Part of: cockroachdb#97875 Release note: None
This commit adds limit and sort fields to the combined statements request. These params can be used to specify how much data is returned and the priority at which to return data (e.g. top-k). The current sort options are: - service latency - cpu time - p99 (statements only) - contention time - execution count Epic: none Part of: cockroachdb#97876 Part of: cockroachdb#97875 Release note: None
This commit adds new knobs to the sql stats fingerprint pages. Users can now specify a limit and sort priority with their sql stats fingerprints requests. Closes: cockroachdb#97876 Part of: cockroachdb#97875 Release note: None
This commit adds new knobs to the sql stats fingerprint pages. Users can now specify a limit and sort priority with their sql stats fingerprints requests. Closes: cockroachdb#97876 Part of: cockroachdb#97875 Release note: None
On our stmt and txn fingerprints pages, we have seen increasingly long loading times when requesting persisted sql stats for clusters with large amounts of data. This can lead to the page being unusable, as the minimum time range we allow to be requested is the past 1 hour.
We need to provide a fallback so that users can see at least some of their live workload.
Proposed experience:
cc @kevin-v-ngo , @dongniwang
EDIT - NEW APPROACH
We will be reading only from the system table (see discussion below), as the in-memory stats are actually the source of a lot of our response time issues. Limit and sort options will be added as a part of this plan.
Jira issue: CRDB-25071
The text was updated successfully, but these errors were encountered: