Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ui: improve sql stats fingerprints pages response time #97875

Closed
xinhaoz opened this issue Mar 1, 2023 · 7 comments
Closed

ui: improve sql stats fingerprints pages response time #97875

xinhaoz opened this issue Mar 1, 2023 · 7 comments
Assignees
Labels
A-sql-observability Related to observability of the SQL layer C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@xinhaoz
Copy link
Member

xinhaoz commented Mar 1, 2023

On our stmt and txn fingerprints pages, we have seen increasingly long loading times when requesting persisted sql stats for clusters with large amounts of data. This can lead to the page being unusable, as the minimum time range we allow to be requested is the past 1 hour.

We need to provide a fallback so that users can see at least some of their live workload.

Proposed experience:

  • Provide a new time window option that is only in-memory stats - this uses the stats endpoint pre persisted sql stats that is still lying around (we can display the exact time period by surfacing the cluster setting value)
  • If a request to persisted stats takes over (some threshold), suggest the above view
  • Continue to fetch the pending persisted sql stats request in the background -> need to display this to user somehow (stack of previous, cached requests(?) - probably want to keep the cap small as sql stats has large payloads).
  • Communicate to user when pending request is completed

cc @kevin-v-ngo , @dongniwang

EDIT - NEW APPROACH

We will be reading only from the system table (see discussion below), as the in-memory stats are actually the source of a lot of our response time issues. Limit and sort options will be added as a part of this plan.

Jira issue: CRDB-25071

@xinhaoz xinhaoz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-observability Related to observability of the SQL layer T-sql-observability labels Mar 1, 2023
@xinhaoz xinhaoz self-assigned this Mar 1, 2023
@blathers-crl blathers-crl bot added this to Triage in Cluster Observability Mar 1, 2023
@xinhaoz
Copy link
Member Author

xinhaoz commented Mar 1, 2023

@j82w
Copy link
Contributor

j82w commented Mar 2, 2023

  1. Why not make the default to load the in memory first to get the page to load the fastest and then in the background pull persisted stats? That way it's always the same workflow.
  2. Won't the exact time cause confusion because different nodes will flush to persisted table at different intervals, and some nodes might even get restarted? Could the date time picker have a in memory option?

@xinhaoz
Copy link
Member Author

xinhaoz commented Mar 2, 2023

@j82w Great questions --

Why not make the default to load the in memory first to get the page to load the fastest and then in the background pull persisted stats? That way it's always the same workflow.

In a lot of our default time picker cases we'll be fetching in-memory stats along with the flushed data already. I think it's best not to always incur an additional request if the sql stats system is performing alright. This feature is moreso to offer a reliable alternative while we stabilize sql stats.

Won't the exact time cause confusion because different nodes will flush to persisted table at different intervals, and some nodes might even get restarted? Could the date time picker have a in memory option?

I think just labeling 'in memory' might be more confusing since they won't know what the date range is. The flush interval is a cluster setting used by all nodes, so my thinking here was just to surface what that interval is so we could show that when we surface the in-memory option (e.g. 'Last 10 Minutes').

WDYT?

@j82w
Copy link
Contributor

j82w commented Mar 2, 2023

In a lot of our default time picker cases we'll be fetching in-memory stats along with the flushed data already. I think it's best not to always incur an additional request if the sql stats system is performing alright. This feature is moreso to offer a reliable alternative while we stabilize sql stats.

I was hoping this feature would also reduce the initial page load times for everyone and not just the customers with large datasets. If we want to focus on just getting the one scenario to work that is fine.

I think just labeling 'in memory' might be more confusing since they won't know what the date range is. The flush interval is a cluster setting used by all nodes, so my thinking here was just to surface what that interval is so we could show that when we surface the in-memory option (e.g. 'Last 10 Minutes').

I'm trying to think of a way to convey that it's not all data from the last 10 minutes, because different nodes will have different times they flush the data. If it just says 'Last 10 Minutes' we will likely get customer issues asking why they don't see a query that ran 2minutes ago because the flush occurred 1 minute ago. @kevin-v-ngo and/or @dongniwang do you have any suggestions?

@j82w
Copy link
Contributor

j82w commented Mar 3, 2023

Data point for system.statement_statistics table with 118k rows:

  1. Over 3 minutes to return all the results
  2. About 8 seconds without an index select * from system.statement_statistics order by statistics->'execution_statistics'->>'cnt' limit 40;
  3. This union to get the top query of most popular columns returns in 3 seconds for some reason. The data might be skewed from how the queries were created.
(select * from system.statement_statistics
order by statistics->'execution_statistics'->>'cnt' limit 40)
union
(select * from system.statement_statistics
order by statistics->'execution_statistics'->'contentionTime'->>'mean' limit 40)
union (
select * from system.statement_statistics
order by statistics->'execution_statistics'->'cpuSQLNanos'->>'mean' limit 40)
union (
select * from system.statement_statistics
order by statistics->'execution_statistics'->'rowsRead'->>'mean' limit 40)
union (
select * from system.statement_statistics
order by statistics->'execution_statistics'->'runLat'->>'mean' limit 40)
union (
select * from system.statement_statistics
order by statistics->'execution_statistics'->'svcLat'->>'mean' limit 40)

Data point for crdb_internal.cluster_statement_statistics table with 100k rows:

  1. About 3 seconds to just generate the view. Probably longer because it's has to fan out to all the nodes.
  2. About 4 seconds to return all the results

Taking the above numbers into consideration would a better solution be to:

  1. Add indexes to the system table and only return the top 40 (2 pages) to always get the result in under 7 seconds. Every time a sort or search occurs it would require a query to the db to get the new results. We could do the union above to get back the top 3 most sorted columns if we think it's an issue like the union query above.
  2. Only show the in memory stats if the time frame is limited to under 30 minutes or the persisted results are less than an hour old.

This solution would allow users to select time frames and avoids the overhead of sending all the data back to the ui to do the computation. The trade off is that each time there is a sort/filter it will take longer.

WDYT?

@maryliag maryliag moved this from Triage to Active Issues in Cluster Observability Mar 6, 2023
@xinhaoz
Copy link
Member Author

xinhaoz commented Mar 6, 2023

Adding onto the above -- posting from slack for tracking.

I spoke with @j82w on Friday who initially had some concerns about the in-memory fallback, and after doing some testing of my own with lots of in-memory data (~40k unique fingerprints), I also now feel like providing the in-memory option may not be the reliable solution we want, rather I feel like we should actually do the opposite -- default to not reading from in-memory stats and querying from the system table only (with the new limit and order by features to get interesting statements). When we have a lot of in-memory data (which I assume customers running into the loading issue often do, since the default is past 1 hour) it seems to cause all sorts of problems --

  • One thing I've noticed is that using the Statements rpc directly over querying the virtual table to get in-memory data is usually a lot faster -- (we don't need to create the virtual table and which aggregates across node boundaries and sorts the data). Unfortunately we would likely not be able to use that since we'd want to aggregate + order in order to limit the data returned.
  • When we have that much in-memory data, the flush is extremely slow. Since we only clear the in-mem container after a successful flush, it often means we don't clear the in-memory stats as often as we would assume when we have so much data because it takes so long to flush all those entries -- in my local 5 node cluster, my 40k worth of fingerprints didn't even finish flushing after an hour. We should probably put this on our roadmap as a problem to fix. Maybe it would help to lower the number of in-memory stats we allow (currently 100k)?
  • System table query without needing to join with in-mem seems to perform much better, and we'll be making even more improvements there with the added indexes

Given the virtual table generation and merging weith in-memory stats seems to the be bottleneck in seemingly problematic workloads (ones with high unique fingerprints), how does product feel about not including in-mem stats by default? Our flush interval is already set to 10 minutes, which is pretty frequent. One would then need to wait 10 minutes to see any data appear (after resetting). If it's important to surface that data right after resetting / starting the cluster, maybe an idea there is if we don't return anything from the system table, to then query from in-memory in the interim (we'd still need to make improvments on our in-memory response time to service that reasonably).
With these changes I think we could reliably show data as long as there have been some workload data flushed to disk, which is easier to guarantee.

@xinhaoz xinhaoz changed the title ui: enable degradation of sql stats when persisted stats calls hangs ui: enable degradation of sql stats when stats calls hangs Mar 6, 2023
@j82w
Copy link
Contributor

j82w commented Mar 7, 2023

The in-memory results have the following issues.

  1. Latency increases as number of nodes increase. It's required to fan out to all the nodes. The persisted can target just the kv nodes with the data.
  2. The filters/limits are not passed over the wire to the remote nodes. The number of results will likely cause memory issues for large number of statements.
  3. If there is a unhealthy node the results will fail.

Here is a rough diagram explaining the process. Let me know if you have any suggestions or notice any errors.

sequenceDiagram 
    title crdb_internal fanout logic
    participant user
    participant sql
    participant crdb_internal
    participant SQLStatusServer
    participant Local
    user->>sql: Request crdb_internal.cluster_statement_statistics
    sql->>crdb_internal: Generate the view
    crdb_internal->>SQLStatusServer: Requests all stats. No filters or limits are passed in.
    SQLStatusServer->>Local: No filters
    activate Local
    Local-->>SQLStatusServer: Return up to 100k
    deactivate Local
    loop All nodes in cluster. Up to 100 parallel
    SQLStatusServer->>Node2: No filters
    activate Node2
    SQLStatusServer->>Node3: No filters
    activate Node3
    SQLStatusServer->>Node4: No filters
    activate Node4
    Node2-->>SQLStatusServer: Return up to 100k
    deactivate Node2
    Node4-->>SQLStatusServer: Return up to 100k
    deactivate Node4
    Node3-->>SQLStatusServer: Return up to 100k
    deactivate Node3
    end
    SQLStatusServer-->>crdb_internal: Return up to 100k
    crdb_internal->>crdb_internal: Generate the view with 100k rows
    crdb_internal->>sql: Return 100k row view
    sql->>sql: Filter/sort rows
    sql->>user: Results

@xinhaoz xinhaoz changed the title ui: enable degradation of sql stats when stats calls hangs ui: improve sql stats response time Mar 9, 2023
@xinhaoz xinhaoz changed the title ui: improve sql stats response time ui: improve sql stats fingerprints pages response time Mar 9, 2023
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Mar 9, 2023
Previously, we were automatically fetching new data
on the finegerprints pages every 5m if the time period
selected was of the form 'Past xxxx of data'. We should
not automatically poll because:
1. this is an expensive request, we don't want to
unnecessarily send new ones if the user does not need
to refresh their data.
2. It can be a weird experience to be viewing your table
and have things suddenly update/change, since we don't
communicate that we're fetching new data every 5m.

Epic: none
Part of: cockroachdb#97875

Release note (ui change): New data is not auto fetched
every 5m on the stmt / txn fingerprints pages.
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Mar 13, 2023
Previously, we were automatically fetching new data
on the finegerprints pages every 5m if the time period
selected was of the form 'Past xxxx of data'. We should
not automatically poll because:
1. this is an expensive request, we don't want to
unnecessarily send new ones if the user does not need
to refresh their data.
2. It can be a weird experience to be viewing your table
and have things suddenly update/change, since we don't
communicate that we're fetching new data every 5m.

As part of this change, we move the invalidation of data
depending on the global time scale object to the saga
observing the timescale update in redux, rather than in the
saga that dispatches the update time scale action. This
ensures that the update ts aciton has been reduced at the time
of invalidation. In the future, we should now remove the
`SET_GLOBAL_TIME_SCALE` action as it is just a wrapper
dispatching `SET_SCALE`.

Epic: none
Part of: cockroachdb#97875

Release note (ui change): New data is not auto fetched
every 5m on the stmt / txn fingerprints pages.
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Mar 13, 2023
Previously, we were automatically fetching new data
on the finegerprints pages every 5m if the time period
selected was of the form 'Past xxxx of data'. We should
not automatically poll because:
1. this is an expensive request, we don't want to
unnecessarily send new ones if the user does not need
to refresh their data.
2. It can be a weird experience to be viewing your table
and have things suddenly update/change, since we don't
communicate that we're fetching new data every 5m.

As part of this change, we move the invalidation of data
depending on the global time scale object to the saga
observing the timescale update in redux, rather than in the
saga that dispatches the update time scale action. This
ensures that the update ts aciton has been reduced at the time
of invalidation. In the future, we should now remove the
`SET_GLOBAL_TIME_SCALE` action as it is just a wrapper
dispatching `SET_SCALE`.

Epic: none
Part of: cockroachdb#97875

Release note (ui change): New data is not auto fetched
every 5m on the stmt / txn fingerprints pages.
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Mar 23, 2023
This commit adds limit and sort fields to the combined
statements request. These params can be used to specify
how much data is returned and the priority at which to
return data (e.g. top-k). The current sort options are:
- service latency
- contention time
- execution count

Epic: none
Part of: cockroachdb#97876
Part of: cockroachdb#97875

Release note: None
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Mar 23, 2023
This commit adds new knobs to the sql stats
fingerprint pages. Users can now specify a
limit and sort priority with their sql stats
fingerprints requests.

Closes: cockroachdb#97876
Part of: cockroachdb#97875

Release note: None
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Apr 5, 2023
This commit adds a new field, fetch_mode, which
is a message containing an enum specifying what
kind of stats to return from the combined stmts
endpoint. The options are
- StmtStatsOnly
- TxnStatsOnly

Leaving this field null will fetch both stmts
and txns.

If `TxnStatsOnly` is specified, we will also include
the stmt stats for the stms in each txn in the returned
txn payload. This is because in the client app we need
the stmt stats to properly build the txn fingerprint pages.

In the future, we will split this API instead of using
this fetch mode flag, but this is currently preferred to
make it easier to backport for perf improvements.

Epic: none
Part of: cockroachdb#97252
Part of: cockroachdb#97875

Release note: None
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Apr 5, 2023
Previously, both the stmt and txns fingerprints
pages were using the same api response from the
/statements. This was because we need the stmts
response to build txns pages. Howevever having to
fetch both types of stats can slow down the request.
Now that we can specify the fetch_mode as part of
the request, we can split the 2 into their own api
calls and redux fields.

Epic: none
Part of: cockroachdb#97875

Release note: None
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Apr 5, 2023
This commit adds limit and sort fields to the combined
statements request. These params can be used to specify
how much data is returned and the priority at which to
return data (e.g. top-k). The current sort options are:
- service latency
- contention time
- execution count

Epic: none
Part of: cockroachdb#97876
Part of: cockroachdb#97875

Release note: None
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Apr 6, 2023
This commit adds new knobs to the sql stats
fingerprint pages. Users can now specify a
limit and sort priority with their sql stats
fingerprints requests.

Closes: cockroachdb#97876
Part of: cockroachdb#97875

Release note: None
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Apr 17, 2023
Previously, we were automatically fetching new data
on the fingerprints pages every 5m if the time period
selected was of the form 'Past xxxx of data'. We should
not automatically poll because:
1. this is an expensive request, we don't want to
unnecessarily send new ones if the user does not need
to refresh their data.
2. It can be a weird experience to be viewing your table
and have things suddenly update/change, since we don't
communicate that we're fetching new data every 5m.

As part of this change, we move the invalidation of data
depending on the global time scale object to the saga
observing the timescale update in redux, rather than in the
saga that dispatches the update time scale action. This
ensures that the update ts action has been reduced at the
time of invalidation. In the future, we should remove the
`SET_GLOBAL_TIME_SCALE` action as it is just a wrapper
dispatching `SET_SCALE`.

This commit also removes issuing a stats request from the
reset sql 14 stats saga, since invalidating the statements
15 will cause a refresh in the page already.

Epic: none
Part of: cockroachdb#97875

Release note (ui change): New data is not auto fetched
every 5m on the stmt / txn fingerprints pages.
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Apr 17, 2023
Previously, we were automatically fetching new data
on the fingerprints pages every 5m if the time period
selected was of the form 'Past xxxx of data'. We should
not automatically poll because:
1. this is an expensive request, we don't want to
unnecessarily send new ones if the user does not need
to refresh their data.
2. It can be a weird experience to be viewing your table
and have things suddenly update/change, since we don't
communicate that we're fetching new data every 5m.

As part of this change, we move the invalidation of data
depending on the global time scale object to the saga
observing the timescale update in redux, rather than in the
saga that dispatches the update time scale action. This
ensures that the update ts action has been reduced at the
time of invalidation. In the future, we should remove the
`SET_GLOBAL_TIME_SCALE` action as it is just a wrapper
dispatching `SET_SCALE`.

This commit also removes issuing a stats request from the
reset sql 14 stats saga, since invalidating the statements
15 will cause a refresh in the page already.

Epic: none
Part of: cockroachdb#97875

Release note (ui change): New data is not auto fetched
every 5m on the stmt / txn fingerprints pages.
@xinhaoz xinhaoz closed this as completed May 10, 2023
Cluster Observability automation moved this from Active Issues to Done May 10, 2023
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue May 15, 2023
This commit adds a new field, fetch_mode, which
is a message containing an enum specifying what
kind of stats to return from the combined stmts
endpoint. The options are
- StmtStatsOnly
- TxnStatsOnly

Leaving this field null will fetch both stmts
and txns.

If `TxnStatsOnly` is specified, we will also include
the stmt stats for the stms in each txn in the returned
txn payload. This is because in the client app we need
the stmt stats to properly build the txn fingerprint pages.

In the future, we will split this API instead of using
this fetch mode flag, but this is currently preferred to
make it easier to backport for perf improvements.

Epic: none
Part of: cockroachdb#97252
Part of: cockroachdb#97875

Release note: None
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue May 15, 2023
Previously, both the stmt and txns fingerprints
pages were using the same api response from the
/statements. This was because we need the stmts
response to build txns pages. Howevever having to
fetch both types of stats can slow down the request.
Now that we can specify the fetch_mode as part of
the request, we can split the 2 into their own api
calls and redux fields.

Epic: none
Part of: cockroachdb#97875

Release note: None
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue May 15, 2023
This commit adds limit and sort fields to the combined
statements request. These params can be used to specify
how much data is returned and the priority at which to
return data (e.g. top-k). The current sort options are:
- service latency
- cpu time
- p99 (statements only)
- contention time
- execution count

Epic: none
Part of: cockroachdb#97876
Part of: cockroachdb#97875

Release note: None
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue May 16, 2023
This commit adds new knobs to the sql stats
fingerprint pages. Users can now specify a
limit and sort priority with their sql stats
fingerprints requests.

Closes: cockroachdb#97876
Part of: cockroachdb#97875

Release note: None
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue May 17, 2023
This commit adds new knobs to the sql stats
fingerprint pages. Users can now specify a
limit and sort priority with their sql stats
fingerprints requests.

Closes: cockroachdb#97876
Part of: cockroachdb#97875

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-observability Related to observability of the SQL layer C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

No branches or pull requests

2 participants