
sql: fix sql activity cache for large cardinality #112350

Merged
merged 2 commits into cockroachdb:master from fix-cache on Oct 21, 2023

Conversation

maryliag (Contributor)

@maryliag maryliag commented Oct 13, 2023

Note to reviewers: first commit being reviewed on #112609


Recreated from #112071

tldr: fix the activity update job failing from a conflict, caching too much info when there is high cardinality in the stats, and fix the cache to store the correct top stats.

  1. Fix activity update failure from upserting the same row multiple times. This was caused by 2 nodes having the same statement, but one only executed it 2 or 3 times while the other node executed it enough to generate an index recommendation. The differing index recommendations caused duplicate rows (based on the primary key) to be inserted. The job now uses the index recommendation from the merged stats, which keeps the latest existing recommendation (we don't update if there is no recommendation). See the sketch after this list.
  2. Customers can have transactions with over 1k unique statement fingerprint ids. This makes it unreasonable to cache all of the statement information for all the top transactions. This commit removes the logic that cached all the statement information for the top transactions, which could cause the table to grow to over 100k rows in a single hour. The txn details call now has to go to the statistics table to get all the necessary statement information.
  3. The activity cache logic was not correctly selecting the top CPU json column. It was using the computed index column name instead of the json field name. The values shown in the UI were still correct because the endpoint aggregated several columns from the aggregated stats rather than reading the separate column.
  4. The activity cache logic was using the execution count instead of the statistics count to select the top queries. In most cases this still generated the correct top queries, and the UI, which uses the aggregated stats, showed the correct value.
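
For item 1, a minimal sketch of the failure mode, using a toy table and made-up names rather than the real statement_activity schema: an UPSERT whose source emits two rows with the same primary key fails outright.

    -- Toy stand-in for the activity table; the schema is illustrative only.
    CREATE TABLE activity (fp STRING PRIMARY KEY, idx_recs STRING[]);

    -- Two nodes report the same fingerprint with different recommendations
    -- (one crossed the execution threshold for a recommendation, the other
    -- did not), so the source yields two rows with the same PK and the
    -- UPSERT fails with a "cannot affect row a second time"-style error:
    UPSERT INTO activity (fp, idx_recs)
    SELECT fp, idx_recs FROM (VALUES
      ('fp1', ARRAY[]::STRING[]),
      ('fp1', ARRAY['creation : CREATE INDEX ON t (a)'])
    ) AS v (fp, idx_recs);

Merging the per-node stats first, and taking the recommendation from the merged row, collapses the source to a single row per primary key.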

Fixes: #111780
Fixes: #111224

Release note (sql change): Fix the SQL activity update job to avoid conflicts on update, reduce the amount of data cached to just what the overview page requires, and fix the correctness of the top queries.

@maryliag maryliag requested review from a team as code owners October 13, 2023 21:32
@maryliag maryliag requested review from dhartunian and yuzefovich and removed request for a team October 13, 2023 21:32
@cockroach-teamcity (Member)

This change is Reviewable

@yuzefovich yuzefovich (Member) left a comment


Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @maryliag)


-- commits line 34 at r1:
This fixes #111224, right?


pkg/server/combined_statement_stats.go line 1016 at r1 (raw file):

// This does not use the activity tables because the statement information is
// aggregated to remove the transaction fingerprint id to keep the size of the
// statement_activity manageable when transaction have over 1k+ statement ids.

nit: "transaction have".


pkg/sql/sql_activity_update_job.go line 368 at r1 (raw file):

		// each of execution_count, total execution time, service_latency,cpu_sql_nanos,
		// contention_time and insert into transaction_activity table.
		// Up to 3000 rows (sql.stats.activity.top.max * 5) may be added to

Why did this change? Because we only need 5 columns? Also, do we need to reduce numberOfTopColumns to 5 too? Also, s/3000/2500/.


pkg/sql/sql_activity_update_job.go line 388 at r1 (raw file):

            merge_stats,
            ''  AS query,
            	(merge_stats -> 'statistics' ->> 'cnt')::int,

nit: seems like a few extra spaces upfront.


pkg/sql/sql_activity_update_job.go line 522 at r1 (raw file):

       merged_stats,
       max_plan,
			 (select COALESCE(array_agg(o.rec::string), (array[]::string[])) FROM jsonb_array_elements_text(merged_stats -> 'index_recommendations') o(rec)) as idx_rec,

nit: indentation.


pkg/sql/sql_activity_update_job.go line 537 at r1 (raw file):

             max(ss.agg_interval) AS max_agg_interval,
             max(ss.plan) AS max_plan,
						 merge_stats_metadata(ss.metadata) AS metadata,

nit: indentation.


pkg/sql/sql_activity_update_job.go line 540 at r1 (raw file):

    				 merge_statement_stats(ss.statistics) AS merged_stats
      FROM system.statement_statistics ss
		  inner join limit_stmt_stats using (aggregated_ts, fingerprint_id, app_name)

nit: we usually capitalize INNER JOIN and GROUP BY


pkg/sql/opt/exec/execbuilder/testdata/observability line 347 at r1 (raw file):

                    │ group by: fingerprint_id, plan_hash, app_name
                    │
                    └── • apply join (left outer)

This apply join seems a bit concerning.


pkg/sql/opt/exec/execbuilder/testdata/observability line 491 at r1 (raw file):

                                 app_name,
                                 fingerprint_id),
     limit_stmt_stats AS (SELECT aggregated_ts,

nit: indentation.


pkg/sql/opt/exec/execbuilder/testdata/observability line 606 at r1 (raw file):

                    │ group by: fingerprint_id, plan_hash, app_name
                    │
                    └── • apply join (left outer)

Ditto.


pkg/sql/opt/exec/execbuilder/testdata/observability line 885 at r1 (raw file):

----
distribution: local
            vectorized: true

nit: indentation.


pkg/sql/sql_activity_update_job_test.go line 379 at r1 (raw file):

			    WHERE app_name = $3;`, 10000, 1, "topTransaction")

		// Help check for primary key conflicts

nit: missing period.

@maryliag maryliag force-pushed the fix-cache branch 2 times, most recently from 1e10c95 to 4702fde on October 13, 2023 23:29
@maryliag maryliag changed the title sql: fix sql activity cache for large cardinatlity sql: fix sql activity cache for large cardinality Oct 13, 2023
@maryliag maryliag (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @yuzefovich)


-- commits line 34 at r1:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

This fixes #111224, right?

Right, added it


pkg/server/combined_statement_stats.go line 1016 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: "transaction have".

Done


pkg/sql/sql_activity_update_job.go line 368 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Why did this change? Because we only need 5 columns? Also, do we need to reduce numberOfTopColumns to 5 too? Also, s/3000/2500/.

For Stmt we have 6 and for Txn we now have 5, so I updated the comment here to 2500 since it is about Txn. I also created separate numberOfTopColumns constants for stmt and txn to make the distinction clear.
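
For reference, the arithmetic behind the 2500/3000 figures, assuming the default of 500 for the sql.stats.activity.top.max cluster setting (the default could differ across versions):

    -- 500 top rows per ranked column:
    SELECT 500 * 6 AS stmt_activity_max_rows, -- 6 top columns for stmts -> 3000
           500 * 5 AS txn_activity_max_rows;  -- 5 top columns for txns  -> 2500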


pkg/sql/sql_activity_update_job.go line 388 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: seems like a few extra spaces upfront.

Fixed


pkg/sql/sql_activity_update_job.go line 522 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: indentation.

Fixed


pkg/sql/sql_activity_update_job.go line 537 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: indentation.

Fixed


pkg/sql/sql_activity_update_job.go line 540 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: we usually capitalize INNER JOIN and GROUP BY

Fixed


pkg/sql/opt/exec/execbuilder/testdata/observability line 347 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

This apply join seems a bit concerning.

any suggestions?


pkg/sql/opt/exec/execbuilder/testdata/observability line 491 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: indentation.

Fixed


pkg/sql/opt/exec/execbuilder/testdata/observability line 606 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Ditto.

any suggestions?


pkg/sql/opt/exec/execbuilder/testdata/observability line 885 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: indentation.

Fixed


pkg/sql/sql_activity_update_job_test.go line 379 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: missing period.

Fixed

@yuzefovich yuzefovich (Member) left a comment


Reviewed 2 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @maryliag)


pkg/sql/opt/exec/execbuilder/testdata/observability line 606 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

any suggestions?

(select COALESCE(array_agg(o.rec::string), (array[]::string[])) FROM jsonb_array_elements_text(merged_stats -> 'index_recommendations') o(rec)) as idx_rec,

is what's causing us to use the apply join here. What we're trying to do is to array aggregate all index recommendations.

However, if we look at how merge_statement_stats does aggregation of index recommendations, StatementStatistics.Add simply keeps the latest IndexRecommendations as the code is written currently. Thus, I think we can change the query to simply use

merged_stats -> 'index_recommendations' as idx_rec,

which removes the need for apply join. WDYT?

@maryliag maryliag (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @yuzefovich)


pkg/sql/opt/exec/execbuilder/testdata/observability line 606 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…
(select COALESCE(array_agg(o.rec::string), (array[]::string[])) FROM jsonb_array_elements_text(merged_stats -> 'index_recommendations') o(rec)) as idx_rec,

is what's causing us to use the apply join here. What we're trying to do is to array aggregate all index recommendations.

However, if we look at how merge_statement_stats does aggregation of index recommendations, StatementStatistics.Add simply keeps the latest IndexRecommendations as the code is written currently. Thus, I think we can change the query to simply use

merged_stats -> 'index_recommendations' as idx_rec,

which removes the need for apply join. WDYT?

Query is now using the new builtin and the plan is updated. Let me know what you think.
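
For anyone following along, the shape of the change, plus a standalone example of the new builtin (it only runs on a build that includes it; the input list below is made up):

    -- The correlated subquery that forced the apply join:
    --   (SELECT COALESCE(array_agg(o.rec::string), array[]::string[])
    --      FROM jsonb_array_elements_text(merged_stats -> 'index_recommendations') o(rec)) AS idx_rec
    -- becomes a plain scalar expression:
    --   jsonb_array_to_string_array(merged_stats -> 'index_recommendations') AS idx_rec

    SELECT jsonb_array_to_string_array('["creation : CREATE INDEX ON t (a)"]'::JSONB);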

@yuzefovich yuzefovich (Member) left a comment


Reviewed 10 of 10 files at r4, 6 of 6 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @maryliag)


-- commits line 27 at r5:
nit: probably this sentence is no longer correct - we now use the last index recommendations that we happen to see when merging.


pkg/sql/sql_activity_update_job.go line 68 at r5 (raw file):

	settings.WithPublic)

const numberOfStmtTopColumns = 6

Should one of these be 5?


pkg/sql/appstatspb/app_stats.go line 200 at r5 (raw file):

	// Use the LastExecTimestamp to decide which object has the last error
	// and index recommendations.

nit: the comment needs an update.


pkg/sql/opt/exec/execbuilder/testdata/observability line 606 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

Query is now using the new builtin and the plan is updated. Let me know what you think

The plans no longer seem concerning.

Ideally, we'd run EXPLAIN ANALYZE of the read portions of these queries (similar to what I did here) on a reasonably active database to see the performance. I guess it requires usage of multiple different fingerprints - perhaps we could use sqlsmith to generate such random queries.


pkg/sql/opt/exec/execbuilder/testdata/observability line 251 at r5 (raw file):

query T retry
EXPLAIN (VERBOSE)
      UPSERT

nit: perhaps squash UPSERT with the next line?


pkg/sql/opt/exec/execbuilder/testdata/observability line 806 at r5 (raw file):

            merge_stats,
            ''  AS query,
              (merge_stats -> 'statistics' ->> 'cnt')::int,

nit: indentation is a bit off.

@maryliag maryliag (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @yuzefovich)


-- commits line 27 at r5:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: probably this sentence is no longer correct - we now use the last index recommendations that we happen to see when merging.

Fixed


pkg/sql/sql_activity_update_job.go line 68 at r5 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Should one of these be 5?

🤦 duh, fixed


pkg/sql/appstatspb/app_stats.go line 200 at r5 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: the comment needs an update.

Done


pkg/sql/opt/exec/execbuilder/testdata/observability line 606 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

The plans no longer seem concerning.

Ideally, we'd run EXPLAIN ANALYZE of the read portions of these queries (similar to what I did here) on a reasonably active database to see the performance. I guess it requires usage of multiple different fingerprints - perhaps we could use sqlsmith to generate such random queries.

Isn't this what the INJECT STATISTICS part above is trying to simulate? We are adding stats to pretend those tables are full.


pkg/sql/opt/exec/execbuilder/testdata/observability line 251 at r5 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: perhaps squash UPSERT with the next line?

Done


pkg/sql/opt/exec/execbuilder/testdata/observability line 806 at r5 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: indentation is a bit off.

Fixed

@yuzefovich yuzefovich (Member) left a comment


Reviewed 6 of 6 files at r6, 6 of 6 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @maryliag)


pkg/sql/opt/exec/execbuilder/testdata/observability line 606 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

isn't this what the above part of INJECT STATISTICS is trying to simulate? We are adding stats to pretend those tables are full

Yes, this shows the right logical plan, but I'm wondering whether the actual performance (thus EXPLAIN ANALYZE) is fast enough. It's hard to say that simply by glancing at the plan. I mean, it looks like the plan has improved, but it would still be nice to see the actual runtime of the queries, especially so given that we'd be backporting this to 23.1.x I think.


pkg/sql/sql_activity_update_job_test.go line 99 at r7 (raw file):

	verifyActivityTablesAreEmpty(t, db)

	// Us random name to keep isolated during stress tests.

nit: s/Us/Use/.

@yuzefovich yuzefovich (Member) left a comment


:lgtm:

Probably want to backport this to 23.2 and 23.1, right?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian and @maryliag)


pkg/sql/opt/exec/execbuilder/testdata/observability line 606 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Yes, this shows the right logical plan, but I'm wondering whether the actual performance (thus EXPLAIN ANALYZE) is fast enough. It's hard to say that simply by glancing at the plan. I mean, it looks like the plan has improved, but it would still be nice to see the actual runtime of the queries, especially so given that we'd be backporting this to 23.1.x I think.

Oh, I forgot that we already addressed - probably - the main reason for slowness we observed in a customer ticket (#111303), and this PR seems to improve the plan further, so I no longer think that we need to run EXPLAIN ANALYZE to see the actual performance, so I'm ok with merging this.

That said, if you do see an easy / quick way to replicate the customer ticket that found these problems, getting a stmt bundle to see whether further improvements can be made would be a nice-to-have.

@maryliag maryliag added the backport-23.1.x and backport-23.2.x labels Oct 19, 2023
@maryliag maryliag requested review from a team as code owners October 20, 2023 03:00
@maryliag maryliag requested review from dt, herkolategan, DarrylWong and sumeerbhola and removed request for a team, dt, herkolategan, DarrylWong and sumeerbhola October 20, 2023 03:00
@maryliag maryliag force-pushed the fix-cache branch 3 times, most recently from e6c24ba to 398c76e on October 20, 2023 21:02
This commit introduces the function
`jsonb_array_to_string_array` that converts a JSONB array
to a string array format.

Epic: none

Release note (sql change): New builtin `jsonb_array_to_string_array`
that converts JSONB array to string array.
Recreated from cockroachdb#112071

tldr: fix the activity update job failing from a conflict, caching too
much info when there is high cardinality in the stats, and fix the
cache to store the correct top stats.

1. Fix activity update failure from upserting same row multiple times.
This was caused by 2 nodes having the same statement, but one only
executed it 2 or 3 times and the other node executed it enough to
generate an index recommendation. The different index recommendation
caused duplicate rows based on the primary key to be inserted. It now
uses the index recommendation from the merged stats, which keeps the
latest existing recommendation (we don't update if there is no
recommendation).
2. Customers can have transactions that have over 1k unique statement
fingerprint ids. This makes it unreasonable to cache all of the
statement information for all the top transactions. This commit removes
the logic to cache all the statement information for the top
transactions, which could cause the table to grow to over 100k rows in
a single hour. This requires the call to go to the statistics table to
get all the necessary statement information for the txn details.
3. The activity cache logic was not correctly selecting the top CPU json
column. It was using the computed index column name instead of the json
field name. The values shown on the UI were correct because the endpoint
still aggregated several columns using the aggregated stats rather than
the separate column.
4. The activity cache logic was using the execution count instead of the
statistics count to select the top queries. In most cases this still
generates the correct top queries, and the UI uses the aggregated stats
which showed the correct value.

Fixes: cockroachdb#111780
Fixes: cockroachdb#111224

Release note (sql change): Fix the SQL activity update job to avoid
conflicts on update, reduce the amount of data cached to just what the
overview page requires, and fix the correctness of the top queries.
@maryliag maryliag (Contributor, Author)

TFTRs! I'll go ahead and merge this now

bors r+

@craig craig bot (Contributor) commented Oct 21, 2023

Build succeeded.

@craig craig bot merged commit 4f26bd1 into cockroachdb:master Oct 21, 2023
8 checks passed
@blathers-crl blathers-crl bot commented Oct 21, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from e252306 to blathers/backport-release-23.1-112350: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


error creating merge commit from e252306 to blathers/backport-release-23.2-112350: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
