ui: improve sql stats fingerprints pages response time #97875

xinhaoz · 2023-03-01T21:28:35Z

On our stmt and txn fingerprints pages, we have seen increasingly long loading times when requesting persisted sql stats for clusters with large amounts of data. This can lead to the page being unusable, as the minimum time range we allow to be requested is the past 1 hour.

We need to provide a fallback so that users can see at least some of their live workload.

Proposed experience:

Provide a new time window option that is only in-memory stats - this uses the stats endpoint pre persisted sql stats that is still lying around (we can display the exact time period by surfacing the cluster setting value)
If a request to persisted stats takes over (some threshold), suggest the above view
Continue to fetch the pending persisted sql stats request in the background -> need to display this to user somehow (stack of previous, cached requests(?) - probably want to keep the cap small as sql stats has large payloads).
Communicate to user when pending request is completed

cc @kevin-v-ngo , @dongniwang

EDIT - NEW APPROACH

We will be reading only from the system table (see discussion below), as the in-memory stats are actually the source of a lot of our response time issues. Limit and sort options will be added as a part of this plan.

Jira issue: CRDB-25071

xinhaoz · 2023-03-01T22:02:51Z

I'm going to include/address the following issues as a part of this:

j82w · 2023-03-02T12:17:52Z

Why not make the default to load the in memory first to get the page to load the fastest and then in the background pull persisted stats? That way it's always the same workflow.
Won't the exact time cause confusion because different nodes will flush to persisted table at different intervals, and some nodes might even get restarted? Could the date time picker have a in memory option?

xinhaoz · 2023-03-02T18:29:25Z

@j82w Great questions --

Why not make the default to load the in memory first to get the page to load the fastest and then in the background pull persisted stats? That way it's always the same workflow.

In a lot of our default time picker cases we'll be fetching in-memory stats along with the flushed data already. I think it's best not to always incur an additional request if the sql stats system is performing alright. This feature is moreso to offer a reliable alternative while we stabilize sql stats.

Won't the exact time cause confusion because different nodes will flush to persisted table at different intervals, and some nodes might even get restarted? Could the date time picker have a in memory option?

I think just labeling 'in memory' might be more confusing since they won't know what the date range is. The flush interval is a cluster setting used by all nodes, so my thinking here was just to surface what that interval is so we could show that when we surface the in-memory option (e.g. 'Last 10 Minutes').

WDYT?

j82w · 2023-03-02T22:07:23Z

In a lot of our default time picker cases we'll be fetching in-memory stats along with the flushed data already. I think it's best not to always incur an additional request if the sql stats system is performing alright. This feature is moreso to offer a reliable alternative while we stabilize sql stats.

I was hoping this feature would also reduce the initial page load times for everyone and not just the customers with large datasets. If we want to focus on just getting the one scenario to work that is fine.

I think just labeling 'in memory' might be more confusing since they won't know what the date range is. The flush interval is a cluster setting used by all nodes, so my thinking here was just to surface what that interval is so we could show that when we surface the in-memory option (e.g. 'Last 10 Minutes').

I'm trying to think of a way to convey that it's not all data from the last 10 minutes, because different nodes will have different times they flush the data. If it just says 'Last 10 Minutes' we will likely get customer issues asking why they don't see a query that ran 2minutes ago because the flush occurred 1 minute ago. @kevin-v-ngo and/or @dongniwang do you have any suggestions?

j82w · 2023-03-03T13:47:44Z

Data point for system.statement_statistics table with 118k rows:

Over 3 minutes to return all the results
About 8 seconds without an index select * from system.statement_statistics order by statistics->'execution_statistics'->>'cnt' limit 40;
This union to get the top query of most popular columns returns in 3 seconds for some reason. The data might be skewed from how the queries were created.

(select * from system.statement_statistics
order by statistics->'execution_statistics'->>'cnt' limit 40)
union
(select * from system.statement_statistics
order by statistics->'execution_statistics'->'contentionTime'->>'mean' limit 40)
union (
select * from system.statement_statistics
order by statistics->'execution_statistics'->'cpuSQLNanos'->>'mean' limit 40)
union (
select * from system.statement_statistics
order by statistics->'execution_statistics'->'rowsRead'->>'mean' limit 40)
union (
select * from system.statement_statistics
order by statistics->'execution_statistics'->'runLat'->>'mean' limit 40)
union (
select * from system.statement_statistics
order by statistics->'execution_statistics'->'svcLat'->>'mean' limit 40)

Data point for crdb_internal.cluster_statement_statistics table with 100k rows:

About 3 seconds to just generate the view. Probably longer because it's has to fan out to all the nodes.
About 4 seconds to return all the results

Taking the above numbers into consideration would a better solution be to:

Add indexes to the system table and only return the top 40 (2 pages) to always get the result in under 7 seconds. Every time a sort or search occurs it would require a query to the db to get the new results. We could do the union above to get back the top 3 most sorted columns if we think it's an issue like the union query above.
Only show the in memory stats if the time frame is limited to under 30 minutes or the persisted results are less than an hour old.

This solution would allow users to select time frames and avoids the overhead of sending all the data back to the ui to do the computation. The trade off is that each time there is a sort/filter it will take longer.

WDYT?

xinhaoz · 2023-03-06T20:47:53Z

Adding onto the above -- posting from slack for tracking.

I spoke with @j82w on Friday who initially had some concerns about the in-memory fallback, and after doing some testing of my own with lots of in-memory data (~40k unique fingerprints), I also now feel like providing the in-memory option may not be the reliable solution we want, rather I feel like we should actually do the opposite -- default to not reading from in-memory stats and querying from the system table only (with the new limit and order by features to get interesting statements). When we have a lot of in-memory data (which I assume customers running into the loading issue often do, since the default is past 1 hour) it seems to cause all sorts of problems --

One thing I've noticed is that using the Statements rpc directly over querying the virtual table to get in-memory data is usually a lot faster -- (we don't need to create the virtual table and which aggregates across node boundaries and sorts the data). Unfortunately we would likely not be able to use that since we'd want to aggregate + order in order to limit the data returned.
When we have that much in-memory data, the flush is extremely slow. Since we only clear the in-mem container after a successful flush, it often means we don't clear the in-memory stats as often as we would assume when we have so much data because it takes so long to flush all those entries -- in my local 5 node cluster, my 40k worth of fingerprints didn't even finish flushing after an hour. We should probably put this on our roadmap as a problem to fix. Maybe it would help to lower the number of in-memory stats we allow (currently 100k)?
System table query without needing to join with in-mem seems to perform much better, and we'll be making even more improvements there with the added indexes

Given the virtual table generation and merging weith in-memory stats seems to the be bottleneck in seemingly problematic workloads (ones with high unique fingerprints), how does product feel about not including in-mem stats by default? Our flush interval is already set to 10 minutes, which is pretty frequent. One would then need to wait 10 minutes to see any data appear (after resetting). If it's important to surface that data right after resetting / starting the cluster, maybe an idea there is if we don't return anything from the system table, to then query from in-memory in the interim (we'd still need to make improvments on our in-memory response time to service that reasonably).
With these changes I think we could reliably show data as long as there have been some workload data flushed to disk, which is easier to guarantee.

j82w · 2023-03-07T14:52:51Z

The in-memory results have the following issues.

Latency increases as number of nodes increase. It's required to fan out to all the nodes. The persisted can target just the kv nodes with the data.
The filters/limits are not passed over the wire to the remote nodes. The number of results will likely cause memory issues for large number of statements.
If there is a unhealthy node the results will fail.

Here is a rough diagram explaining the process. Let me know if you have any suggestions or notice any errors.

sequenceDiagram 
    title crdb_internal fanout logic
    participant user
    participant sql
    participant crdb_internal
    participant SQLStatusServer
    participant Local
    user->>sql: Request crdb_internal.cluster_statement_statistics
    sql->>crdb_internal: Generate the view
    crdb_internal->>SQLStatusServer: Requests all stats. No filters or limits are passed in.
    SQLStatusServer->>Local: No filters
    activate Local
    Local-->>SQLStatusServer: Return up to 100k
    deactivate Local
    loop All nodes in cluster. Up to 100 parallel
    SQLStatusServer->>Node2: No filters
    activate Node2
    SQLStatusServer->>Node3: No filters
    activate Node3
    SQLStatusServer->>Node4: No filters
    activate Node4
    Node2-->>SQLStatusServer: Return up to 100k
    deactivate Node2
    Node4-->>SQLStatusServer: Return up to 100k
    deactivate Node4
    Node3-->>SQLStatusServer: Return up to 100k
    deactivate Node3
    end
    SQLStatusServer-->>crdb_internal: Return up to 100k
    crdb_internal->>crdb_internal: Generate the view with 100k rows
    crdb_internal->>sql: Return 100k row view
    sql->>sql: Filter/sort rows
    sql->>user: Results

Previously, we were automatically fetching new data on the finegerprints pages every 5m if the time period selected was of the form 'Past xxxx of data'. We should not automatically poll because: 1. this is an expensive request, we don't want to unnecessarily send new ones if the user does not need to refresh their data. 2. It can be a weird experience to be viewing your table and have things suddenly update/change, since we don't communicate that we're fetching new data every 5m. Epic: none Part of: cockroachdb#97875 Release note (ui change): New data is not auto fetched every 5m on the stmt / txn fingerprints pages.

Previously, we were automatically fetching new data on the finegerprints pages every 5m if the time period selected was of the form 'Past xxxx of data'. We should not automatically poll because: 1. this is an expensive request, we don't want to unnecessarily send new ones if the user does not need to refresh their data. 2. It can be a weird experience to be viewing your table and have things suddenly update/change, since we don't communicate that we're fetching new data every 5m. As part of this change, we move the invalidation of data depending on the global time scale object to the saga observing the timescale update in redux, rather than in the saga that dispatches the update time scale action. This ensures that the update ts aciton has been reduced at the time of invalidation. In the future, we should now remove the `SET_GLOBAL_TIME_SCALE` action as it is just a wrapper dispatching `SET_SCALE`. Epic: none Part of: cockroachdb#97875 Release note (ui change): New data is not auto fetched every 5m on the stmt / txn fingerprints pages.

This commit adds limit and sort fields to the combined statements request. These params can be used to specify how much data is returned and the priority at which to return data (e.g. top-k). The current sort options are: - service latency - contention time - execution count Epic: none Part of: cockroachdb#97876 Part of: cockroachdb#97875 Release note: None

This commit adds new knobs to the sql stats fingerprint pages. Users can now specify a limit and sort priority with their sql stats fingerprints requests. Closes: cockroachdb#97876 Part of: cockroachdb#97875 Release note: None

This commit adds a new field, fetch_mode, which is a message containing an enum specifying what kind of stats to return from the combined stmts endpoint. The options are - StmtStatsOnly - TxnStatsOnly Leaving this field null will fetch both stmts and txns. If `TxnStatsOnly` is specified, we will also include the stmt stats for the stms in each txn in the returned txn payload. This is because in the client app we need the stmt stats to properly build the txn fingerprint pages. In the future, we will split this API instead of using this fetch mode flag, but this is currently preferred to make it easier to backport for perf improvements. Epic: none Part of: cockroachdb#97252 Part of: cockroachdb#97875 Release note: None

Previously, both the stmt and txns fingerprints pages were using the same api response from the /statements. This was because we need the stmts response to build txns pages. Howevever having to fetch both types of stats can slow down the request. Now that we can specify the fetch_mode as part of the request, we can split the 2 into their own api calls and redux fields. Epic: none Part of: cockroachdb#97875 Release note: None

This commit adds limit and sort fields to the combined statements request. These params can be used to specify how much data is returned and the priority at which to return data (e.g. top-k). The current sort options are: - service latency - contention time - execution count Epic: none Part of: cockroachdb#97876 Part of: cockroachdb#97875 Release note: None

This commit adds new knobs to the sql stats fingerprint pages. Users can now specify a limit and sort priority with their sql stats fingerprints requests. Closes: cockroachdb#97876 Part of: cockroachdb#97875 Release note: None

Previously, we were automatically fetching new data on the fingerprints pages every 5m if the time period selected was of the form 'Past xxxx of data'. We should not automatically poll because: 1. this is an expensive request, we don't want to unnecessarily send new ones if the user does not need to refresh their data. 2. It can be a weird experience to be viewing your table and have things suddenly update/change, since we don't communicate that we're fetching new data every 5m. As part of this change, we move the invalidation of data depending on the global time scale object to the saga observing the timescale update in redux, rather than in the saga that dispatches the update time scale action. This ensures that the update ts action has been reduced at the time of invalidation. In the future, we should remove the `SET_GLOBAL_TIME_SCALE` action as it is just a wrapper dispatching `SET_SCALE`. This commit also removes issuing a stats request from the reset sql 14 stats saga, since invalidating the statements 15 will cause a refresh in the page already. Epic: none Part of: cockroachdb#97875 Release note (ui change): New data is not auto fetched every 5m on the stmt / txn fingerprints pages.

This commit adds a new field, fetch_mode, which is a message containing an enum specifying what kind of stats to return from the combined stmts endpoint. The options are - StmtStatsOnly - TxnStatsOnly Leaving this field null will fetch both stmts and txns. If `TxnStatsOnly` is specified, we will also include the stmt stats for the stms in each txn in the returned txn payload. This is because in the client app we need the stmt stats to properly build the txn fingerprint pages. In the future, we will split this API instead of using this fetch mode flag, but this is currently preferred to make it easier to backport for perf improvements. Epic: none Part of: cockroachdb#97252 Part of: cockroachdb#97875 Release note: None

Previously, both the stmt and txns fingerprints pages were using the same api response from the /statements. This was because we need the stmts response to build txns pages. Howevever having to fetch both types of stats can slow down the request. Now that we can specify the fetch_mode as part of the request, we can split the 2 into their own api calls and redux fields. Epic: none Part of: cockroachdb#97875 Release note: None

This commit adds limit and sort fields to the combined statements request. These params can be used to specify how much data is returned and the priority at which to return data (e.g. top-k). The current sort options are: - service latency - cpu time - p99 (statements only) - contention time - execution count Epic: none Part of: cockroachdb#97876 Part of: cockroachdb#97875 Release note: None

This commit adds new knobs to the sql stats fingerprint pages. Users can now specify a limit and sort priority with their sql stats fingerprints requests. Closes: cockroachdb#97876 Part of: cockroachdb#97875 Release note: None

xinhaoz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-observability Related to observability of the SQL layer T-sql-observability labels Mar 1, 2023

xinhaoz self-assigned this Mar 1, 2023

blathers-crl bot added this to Triage in Cluster Observability Mar 1, 2023

maryliag moved this from Triage to Active Issues in Cluster Observability Mar 6, 2023

xinhaoz changed the title ~~ui: enable degradation of sql stats when persisted stats calls hangs~~ ui: enable degradation of sql stats when stats calls hangs Mar 6, 2023

xinhaoz changed the title ~~ui: enable degradation of sql stats when stats calls hangs~~ ui: improve sql stats response time Mar 9, 2023

xinhaoz changed the title ~~ui: improve sql stats response time~~ ui: improve sql stats fingerprints pages response time Mar 9, 2023

xinhaoz closed this as completed May 10, 2023

Cluster Observability automation moved this from Active Issues to Done May 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ui: improve sql stats fingerprints pages response time #97875

ui: improve sql stats fingerprints pages response time #97875

xinhaoz commented Mar 1, 2023 •

edited

xinhaoz commented Mar 1, 2023

j82w commented Mar 2, 2023

xinhaoz commented Mar 2, 2023

j82w commented Mar 2, 2023

j82w commented Mar 3, 2023 •

edited

xinhaoz commented Mar 6, 2023

j82w commented Mar 7, 2023

ui: improve sql stats fingerprints pages response time #97875

ui: improve sql stats fingerprints pages response time #97875

Comments

xinhaoz commented Mar 1, 2023 • edited

EDIT - NEW APPROACH

xinhaoz commented Mar 1, 2023

j82w commented Mar 2, 2023

xinhaoz commented Mar 2, 2023

j82w commented Mar 2, 2023

j82w commented Mar 3, 2023 • edited

xinhaoz commented Mar 6, 2023

j82w commented Mar 7, 2023

xinhaoz commented Mar 1, 2023 •

edited

j82w commented Mar 3, 2023 •

edited