pkg/sql/sqlstats: move failures out of stmt key/meta, into stats #120236

abarganier · 2024-03-11T18:55:43Z

In the SQL stats feature, we consider whether the statement failed as part of the statement key/statement fingerprint ID, which is used to bucket statement statistics aggregations.

This contributes to the cardinality of statement statistics by a constant factor of 2x. A statement that has failed will show up twice in the SQL Activity page, once for the successful executions, and once for the failed executions. Despite not being very useful, the contributions it makes to cardinality are detrimental to the SQL activity feature, as it has a non-negligible impact on data volume.

To improve upon this, we should remove the failure bool from the statement key, and instead include failure counts as part of the statistics object.

This patch does the following:

Removes the failure boolean from the statement key
Removes failure counts from the statement statistics metadata
Moves the failure counts into the aggrgeated statement statistics
Updates the Statement Details page to display "Failure Count" instead of "Failed?", with the associated numeric count instead of a bool val.

Release note (ui change): In the DB Console's SQL Activity page, the Statement Details page will now show "Failure Count" in place of "Failed?", showing the number of failed executions for the given statement fingerprint instead of a "Yes"/"No" value. Furthermore, previously in the SQL Activity table the same statement fingeprint would appear twice - once for failed executions, and once for successful executions. With these changes, we no longer consider these to be separate rows. Instead, we combine them into a single statement fingerprint.

Release note (sql change): A new column has been added to crdb_internal.node_statement_statistics, failure_count INT NOT NULL. It represents the number of recorded statement execution failures for the given statement, as a new component of the overall statistics.

cockroach-teamcity · 2024-03-11T18:55:53Z

This change is

abarganier

I wanted to point out a few conditionals below that deserve a look, as we're switching away from the failed bool in the stmt key, meaning conditionals that used the old boolean had to be reworked.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @koorosh, and @xinhaoz)

pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go line 252 at r1 (raw file):

		if stmtStats.mu.data.SensitiveInfo.LastErr == "" && statistics[i].Stats.SensitiveInfo.LastErr != "" {
			stmtStats.mu.data.SensitiveInfo.LastErr = statistics[i].Stats.SensitiveInfo.LastErr
		}

NB: Without key.Failed, this seemed to me like an appropriate way to rework the conditional. PTAL.

Code quote:

		// Setting all metadata fields.
		if stmtStats.mu.data.SensitiveInfo.LastErr == "" && statistics[i].Stats.SensitiveInfo.LastErr != "" {
			stmtStats.mu.data.SensitiveInfo.LastErr = statistics[i].Stats.SensitiveInfo.LastErr
		}

pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go line 446 at r1 (raw file):

	if s.mu.data.SensitiveInfo.LastErr == "" && statistics.Stats.SensitiveInfo.LastErr != "" {
		s.mu.data.SensitiveInfo.LastErr = statistics.Stats.SensitiveInfo.LastErr
	}

Same as above, LMK if anyone disagrees with this reworked conditional.

Code quote:

	// Setting all metadata fields.
	if s.mu.data.SensitiveInfo.LastErr == "" && statistics.Stats.SensitiveInfo.LastErr != "" {
		s.mu.data.SensitiveInfo.LastErr = statistics.Stats.SensitiveInfo.LastErr
	}

pkg/server/application_api/sql_stats_test.go line 453 at r1 (raw file):

		var statementsInResponse []string
		for _, respStatement := range resp.Statements {
			if respStatement.Stats.FailureCount > 0 {

In this instance, instead of looking at Failed in the key, we replace with a check against the failure count.

Code quote:

		if respStatement.Stats.FailureCount > 0 {

pkg/server/application_api/sql_stats_test.go line 799 at r1 (raw file):

			// to avoid counting such failures. If a statement that we expect to be
			// successful is not found in the response, the test will fail later.
			if respStatement.Key.KeyData.App == testAppName && respStatement.Stats.FailureCount == 0 {

Ditto as above - we use failure count instead of the old Failed bool from the key as a replacement.

Code quote:

		if respStatement.Key.KeyData.App == testAppName && respStatement.Stats.FailureCount == 0 {

abarganier · 2024-03-11T20:47:52Z

Ah... A bunch of tests need to be updated because the fingerprint hashes have changed with the removal of the failure component. Working on updating them now.

koorosh

Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, and @xinhaoz)

pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go line 252 at r1 (raw file):

Previously, abarganier (Alex Barganier) wrote…

NB: Without key.Failed, this seemed to me like an appropriate way to rework the conditional. PTAL.

I would omit this check && statistics[i].Stats.SensitiveInfo.LastErr != "" for simplicity reasons. In case statistics[i].Stats.SensitiveInfo.LastErr is empty then it will reassign empty string with another empty string and preserve the same logic.

pkg/server/application_api/sql_stats_test.go line 799 at r1 (raw file):

Previously, abarganier (Alex Barganier) wrote…

Ditto as above - we use failure count instead of the old Failed bool from the key as a replacement.

Curious, if it makes sense to have .Stats.HasFailures bool stat?

abarganier

Okay, the majority of tests have been cleaned up! RFAL.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @koorosh, and @xinhaoz)

pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go line 252 at r1 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

I would omit this check && statistics[i].Stats.SensitiveInfo.LastErr != "" for simplicity reasons. In case statistics[i].Stats.SensitiveInfo.LastErr is empty then it will reassign empty string with another empty string and preserve the same logic.

Good idea! Done.

pkg/server/application_api/sql_stats_test.go line 799 at r1 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

Curious, if it makes sense to have .Stats.HasFailures bool stat?

I think it would be duplicate information that's already represented by Stats.FailureCount, no?

koorosh

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, and @xinhaoz)

pkg/server/application_api/sql_stats_test.go line 799 at r1 (raw file):

Previously, abarganier (Alex Barganier) wrote…

I think it would be duplicate information that's already represented by Stats.FailureCount, no?

You're right, my initial thoughts were to avoid checks like this and above with dedicated prop, but it definitely duplicates logic.

dhartunian

nice work on this. lots of details to manage. Might be worth keeping as a single commit to keep it all in sync.

Reviewed 23 of 27 files at r1, 4 of 4 files at r2, 13 of 13 files at r5, 16 of 16 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier and @xinhaoz)

pkg/sql/crdb_internal.go line 1811 at r2 (raw file):

  latency_seconds_p90 FLOAT,
  latency_seconds_p99 FLOAT,
  failure_count INT NOT NULL

nit: trailing whitespace

pkg/sql/logictest/testdata/logic_test/crdb_internal line 845 at r6 (raw file):

# - one for the DROP SEQUENCE statement.
#
# We expect the SELECT entry to have max_retries > 0 because auto-retries

nit: there are two select entires below.

In the SQL stats feature, we consider whether the statement failed as part of the statement key/statement fingerprint ID, which is used to bucket statement statistics aggregations. This contributes to the cardinality of statement statistics by a constant factor of 2x. A statement that has failed will show up twice in the SQL Activity page, once for the successful executions, and once for the failed executions. Despite not being very useful, the contributions it makes to cardinality are detrimental to the SQL activity feature, as it has a non-negligible impact on data volume. To improve upon this, we should remove the failure bool from the statement key, and instead include failure counts as part of the statistics object. This patch does the following: * Removes the failure boolean from the statement key * Removes failure counts from the statement statistics metadata * Moves the failure counts into the aggrgeated statement statistics * Updates the Statement Details page to display "Failure Count" instead of "Failed?", with the associated numeric count instead of a bool val. Addresses: cockroachdb#97176 Release note (ui change): In the DB Console's SQL Activity page, the Statement Details page will now show "Failure Count" in place of "Failed?", showing the number of failed executions for the given statement fingerprint instead of a "Yes"/"No" value. Furthermore, previously in the SQL Activity table the same statement fingeprint would appear twice - once for failed executions, and once for successful executions. With these changes, we no longer consider these to be separate rows. Instead, we combine them into a single statement fingerprint. Release note (sql change): A new column has been added to `crdb_internal.node_statement_statistics`, `failure_count INT NOT NULL`. It represents the number of recorded statement execution failures for the given statement, as a new component of the overall statistics.

abarganier · 2024-03-15T16:39:14Z

TFTR!

bors r=dhartunian

abarganier

Ah, forgot to submit my replies to your comments before I bors'd 😅

Good call squashing the commits, I went ahead and did so.

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian and @xinhaoz)

pkg/sql/crdb_internal.go line 1811 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: trailing whitespace

Done.

pkg/sql/logictest/testdata/logic_test/crdb_internal line 845 at r6 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: there are two select entires below.

Thanks for the catch. Fixed.

craig · 2024-03-15T17:36:24Z

Build succeeded:

In cockroachdb#120236, we removed the `failed` metadata from the `crdb_internal.statement_statistics` table. Fixes cockroachdb#120586. Fixes cockroachdb#120587. Fixes cockroachdb#120588. Fixes cockroachdb#120589. Release note: None

120857: compose: remove PG ComposeCompare test r=rafiss a=rafiss This test does not provide us much value and is too flaky to be useful. Most of the time it fails are due to minor differences in things like names, formatting, or precision, and accommodating each of these differences is not worth it. fixes #109400 fixes #116150 fixes #112154 Release note: None 121052: base,testutils,kvserver: disable lease queue in replication manual r=arulajmani a=kvoli A replica lease queue was introduced in #119155, which processes lease transfers for replicas. Previously, the replicate queue handled lease transfers. The linked change omitted updating the `ReplicationManual` test cluster mode to disable the lease queue, resulting in unexpected lease transfers in some tests. Disable the lease queue under `ReplicationManual` replication mode. Misc tests are also updated to disable/enable the lease queue appropriately. Epic: None Touches: #120605 Release note: None 121120: roachtest: deflake admission-control/multitenant-fairness r=sumeerbhola a=aadityasondhi In #120236, we removed the `failed` metadata from the `crdb_internal.statement_statistics` table. Fixes #120586. Fixes #120587. Fixes #120588. Fixes #120589. Release note: None 121132: batcheval: move test to `large` RBE pool r=rail a=rickystewart Epic: CRDB-8308 Release note: None Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com> Co-authored-by: Austen McClernon <austen@cockroachlabs.com> Co-authored-by: Aaditya Sondhi <20070511+aadityasondhi@users.noreply.github.com> Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>

abarganier requested review from dhartunian, koorosh, xinhaoz and a team March 11, 2024 18:55

abarganier requested review from a team as code owners March 11, 2024 18:55

abarganier force-pushed the sql-stats-failures-fingerprint branch from fef4fa3 to 3756c81 Compare March 11, 2024 19:45

abarganier commented Mar 11, 2024

View reviewed changes

koorosh reviewed Mar 12, 2024

View reviewed changes

abarganier force-pushed the sql-stats-failures-fingerprint branch from afbf760 to 997fdf4 Compare March 12, 2024 20:34

abarganier requested a review from a team as a code owner March 12, 2024 20:34

abarganier force-pushed the sql-stats-failures-fingerprint branch from 997fdf4 to 893bab1 Compare March 12, 2024 20:42

abarganier requested a review from koorosh March 12, 2024 20:44

abarganier commented Mar 12, 2024

View reviewed changes

abarganier force-pushed the sql-stats-failures-fingerprint branch from 893bab1 to de73f8e Compare March 13, 2024 14:25

koorosh reviewed Mar 13, 2024

View reviewed changes

abarganier force-pushed the sql-stats-failures-fingerprint branch from de73f8e to aa18f90 Compare March 13, 2024 20:16

dhartunian approved these changes Mar 14, 2024

View reviewed changes

abarganier force-pushed the sql-stats-failures-fingerprint branch from aa18f90 to a44e43c Compare March 15, 2024 16:38

abarganier requested a review from dhartunian March 15, 2024 16:39

abarganier commented Mar 15, 2024

View reviewed changes

craig bot merged commit 89414b8 into cockroachdb:master Mar 15, 2024
14 of 22 checks passed

aadityasondhi mentioned this pull request Mar 26, 2024

roachtest: deflake admission-control/multitenant-fairness #121120

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pkg/sql/sqlstats: move failures out of stmt key/meta, into stats #120236

pkg/sql/sqlstats: move failures out of stmt key/meta, into stats #120236

abarganier commented Mar 11, 2024

cockroach-teamcity commented Mar 11, 2024

abarganier left a comment

abarganier commented Mar 11, 2024

koorosh left a comment

abarganier left a comment

koorosh left a comment

dhartunian left a comment

abarganier commented Mar 15, 2024

abarganier left a comment

craig bot commented Mar 15, 2024

pkg/sql/sqlstats: move failures out of stmt key/meta, into stats #120236

pkg/sql/sqlstats: move failures out of stmt key/meta, into stats #120236

Conversation

abarganier commented Mar 11, 2024

cockroach-teamcity commented Mar 11, 2024

abarganier left a comment

Choose a reason for hiding this comment

abarganier commented Mar 11, 2024

koorosh left a comment

Choose a reason for hiding this comment

abarganier left a comment

Choose a reason for hiding this comment

koorosh left a comment

Choose a reason for hiding this comment

dhartunian left a comment

Choose a reason for hiding this comment

abarganier commented Mar 15, 2024

abarganier left a comment

Choose a reason for hiding this comment

craig bot commented Mar 15, 2024