Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/sql/sqlstats: move failures out of stmt key/meta, into stats #120236

Merged
merged 1 commit into from Mar 15, 2024

Conversation

abarganier
Copy link
Member

Addresses: #97176
Epic: CRDB-24527

In the SQL stats feature, we consider whether the statement failed as part of the statement key/statement fingerprint ID, which is used to bucket statement statistics aggregations.

This contributes to the cardinality of statement statistics by a constant factor of 2x. A statement that has failed will show up twice in the SQL Activity page, once for the successful executions, and once for the failed executions. Despite not being very useful, the contributions it makes to cardinality are detrimental to the SQL activity feature, as it has a non-negligible impact on data volume.

To improve upon this, we should remove the failure bool from the statement key, and instead include failure counts as part of the statistics object.

This patch does the following:

  • Removes the failure boolean from the statement key
  • Removes failure counts from the statement statistics metadata
  • Moves the failure counts into the aggrgeated statement statistics
  • Updates the Statement Details page to display "Failure Count" instead of "Failed?", with the associated numeric count instead of a bool val.

Release note (ui change): In the DB Console's SQL Activity page, the Statement Details page will now show "Failure Count" in place of "Failed?", showing the number of failed executions for the given statement fingerprint instead of a "Yes"/"No" value. Furthermore, previously in the SQL Activity table the same statement fingeprint would appear twice - once for failed executions, and once for successful executions. With these changes, we no longer consider these to be separate rows. Instead, we combine them into a single statement fingerprint.

Release note (sql change): A new column has been added to crdb_internal.node_statement_statistics, failure_count INT NOT NULL. It represents the number of recorded statement execution failures for the given statement, as a new component of the overall statistics.

@abarganier abarganier requested review from dhartunian, koorosh, xinhaoz and a team March 11, 2024 18:55
@abarganier abarganier requested review from a team as code owners March 11, 2024 18:55
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@abarganier abarganier force-pushed the sql-stats-failures-fingerprint branch from fef4fa3 to 3756c81 Compare March 11, 2024 19:45
Copy link
Member Author

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to point out a few conditionals below that deserve a look, as we're switching away from the failed bool in the stmt key, meaning conditionals that used the old boolean had to be reworked.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @koorosh, and @xinhaoz)


pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go line 252 at r1 (raw file):

		if stmtStats.mu.data.SensitiveInfo.LastErr == "" && statistics[i].Stats.SensitiveInfo.LastErr != "" {
			stmtStats.mu.data.SensitiveInfo.LastErr = statistics[i].Stats.SensitiveInfo.LastErr
		}

NB: Without key.Failed, this seemed to me like an appropriate way to rework the conditional. PTAL.

Code quote:

		// Setting all metadata fields.
		if stmtStats.mu.data.SensitiveInfo.LastErr == "" && statistics[i].Stats.SensitiveInfo.LastErr != "" {
			stmtStats.mu.data.SensitiveInfo.LastErr = statistics[i].Stats.SensitiveInfo.LastErr
		}

pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go line 446 at r1 (raw file):

	if s.mu.data.SensitiveInfo.LastErr == "" && statistics.Stats.SensitiveInfo.LastErr != "" {
		s.mu.data.SensitiveInfo.LastErr = statistics.Stats.SensitiveInfo.LastErr
	}

Same as above, LMK if anyone disagrees with this reworked conditional.

Code quote:

	// Setting all metadata fields.
	if s.mu.data.SensitiveInfo.LastErr == "" && statistics.Stats.SensitiveInfo.LastErr != "" {
		s.mu.data.SensitiveInfo.LastErr = statistics.Stats.SensitiveInfo.LastErr
	}

pkg/server/application_api/sql_stats_test.go line 453 at r1 (raw file):

		var statementsInResponse []string
		for _, respStatement := range resp.Statements {
			if respStatement.Stats.FailureCount > 0 {

In this instance, instead of looking at Failed in the key, we replace with a check against the failure count.

Code quote:

		if respStatement.Stats.FailureCount > 0 {

pkg/server/application_api/sql_stats_test.go line 799 at r1 (raw file):

			// to avoid counting such failures. If a statement that we expect to be
			// successful is not found in the response, the test will fail later.
			if respStatement.Key.KeyData.App == testAppName && respStatement.Stats.FailureCount == 0 {

Ditto as above - we use failure count instead of the old Failed bool from the key as a replacement.

Code quote:

		if respStatement.Key.KeyData.App == testAppName && respStatement.Stats.FailureCount == 0 {

@abarganier
Copy link
Member Author

Ah... A bunch of tests need to be updated because the fingerprint hashes have changed with the removal of the failure component. Working on updating them now.

Copy link
Collaborator

@koorosh koorosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, and @xinhaoz)


pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go line 252 at r1 (raw file):

Previously, abarganier (Alex Barganier) wrote…

NB: Without key.Failed, this seemed to me like an appropriate way to rework the conditional. PTAL.

I would omit this check && statistics[i].Stats.SensitiveInfo.LastErr != "" for simplicity reasons. In case statistics[i].Stats.SensitiveInfo.LastErr is empty then it will reassign empty string with another empty string and preserve the same logic.


pkg/server/application_api/sql_stats_test.go line 799 at r1 (raw file):

Previously, abarganier (Alex Barganier) wrote…

Ditto as above - we use failure count instead of the old Failed bool from the key as a replacement.

Curious, if it makes sense to have .Stats.HasFailures bool stat?

@abarganier abarganier force-pushed the sql-stats-failures-fingerprint branch from afbf760 to 997fdf4 Compare March 12, 2024 20:34
@abarganier abarganier requested a review from a team as a code owner March 12, 2024 20:34
@abarganier abarganier force-pushed the sql-stats-failures-fingerprint branch from 997fdf4 to 893bab1 Compare March 12, 2024 20:42
@abarganier abarganier requested a review from koorosh March 12, 2024 20:44
Copy link
Member Author

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, the majority of tests have been cleaned up! RFAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @koorosh, and @xinhaoz)


pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go line 252 at r1 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

I would omit this check && statistics[i].Stats.SensitiveInfo.LastErr != "" for simplicity reasons. In case statistics[i].Stats.SensitiveInfo.LastErr is empty then it will reassign empty string with another empty string and preserve the same logic.

Good idea! Done.


pkg/server/application_api/sql_stats_test.go line 799 at r1 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

Curious, if it makes sense to have .Stats.HasFailures bool stat?

I think it would be duplicate information that's already represented by Stats.FailureCount, no?

@abarganier abarganier force-pushed the sql-stats-failures-fingerprint branch from 893bab1 to de73f8e Compare March 13, 2024 14:25
Copy link
Collaborator

@koorosh koorosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, and @xinhaoz)


pkg/server/application_api/sql_stats_test.go line 799 at r1 (raw file):

Previously, abarganier (Alex Barganier) wrote…

I think it would be duplicate information that's already represented by Stats.FailureCount, no?

You're right, my initial thoughts were to avoid checks like this and above with dedicated prop, but it definitely duplicates logic.

@abarganier abarganier force-pushed the sql-stats-failures-fingerprint branch from de73f8e to aa18f90 Compare March 13, 2024 20:16
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: nice work on this. lots of details to manage. Might be worth keeping as a single commit to keep it all in sync.

Reviewed 23 of 27 files at r1, 4 of 4 files at r2, 13 of 13 files at r5, 16 of 16 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @abarganier and @xinhaoz)


pkg/sql/crdb_internal.go line 1811 at r2 (raw file):

  latency_seconds_p90 FLOAT,
  latency_seconds_p99 FLOAT,
  failure_count INT NOT NULL                                                   

nit: trailing whitespace


pkg/sql/logictest/testdata/logic_test/crdb_internal line 845 at r6 (raw file):

# - one for the DROP SEQUENCE statement.
#
# We expect the SELECT entry to have max_retries > 0 because auto-retries

nit: there are two select entires below.

In the SQL stats feature, we consider whether the statement failed as
part of the statement key/statement fingerprint ID, which is used to
bucket statement statistics aggregations.

This contributes to the cardinality of statement statistics by a
constant factor of 2x. A statement that has failed will show up twice in
the SQL Activity page, once for the successful executions, and once for
the failed executions. Despite not being very useful, the contributions
it makes to cardinality are detrimental to the SQL activity feature, as
it has a non-negligible impact on data volume.

To improve upon this, we should remove the failure bool from the
statement key, and instead include failure counts as part of the
statistics object.

This patch does the following:
* Removes the failure boolean from the statement key
* Removes failure counts from the statement statistics metadata
* Moves the failure counts into the aggrgeated statement statistics
* Updates the Statement Details page to display "Failure Count" instead
  of "Failed?", with the associated numeric count instead of a bool val.

Addresses: cockroachdb#97176

Release note (ui change): In the DB Console's SQL Activity page, the
Statement Details page will now show "Failure Count" in place of
"Failed?", showing the number of failed executions for the given
statement fingerprint instead of a "Yes"/"No" value. Furthermore,
previously in the SQL Activity table the same statement fingeprint
would appear twice - once for failed executions, and once for
successful executions. With these changes, we no longer consider
these to be separate rows. Instead, we combine them into a single
statement fingerprint.

Release note (sql change): A new column has been added to
`crdb_internal.node_statement_statistics`, `failure_count INT NOT NULL`.
It represents the number of recorded statement execution failures for
the given statement, as a new component of the overall statistics.
@abarganier abarganier force-pushed the sql-stats-failures-fingerprint branch from aa18f90 to a44e43c Compare March 15, 2024 16:38
@abarganier
Copy link
Member Author

TFTR!

bors r=dhartunian

Copy link
Member Author

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, forgot to submit my replies to your comments before I bors'd 😅

Good call squashing the commits, I went ahead and did so.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian and @xinhaoz)


pkg/sql/crdb_internal.go line 1811 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: trailing whitespace

Done.


pkg/sql/logictest/testdata/logic_test/crdb_internal line 845 at r6 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: there are two select entires below.

Thanks for the catch. Fixed.

@craig
Copy link
Contributor

craig bot commented Mar 15, 2024

@craig craig bot merged commit 89414b8 into cockroachdb:master Mar 15, 2024
14 of 22 checks passed
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this pull request Mar 26, 2024
In cockroachdb#120236, we removed the
`failed` metadata from the `crdb_internal.statement_statistics` table.

Fixes cockroachdb#120586.
Fixes cockroachdb#120587.
Fixes cockroachdb#120588.
Fixes cockroachdb#120589.

Release note: None
craig bot pushed a commit that referenced this pull request Mar 26, 2024
120857: compose: remove PG ComposeCompare test r=rafiss a=rafiss

This test does not provide us much value and is too flaky to be useful. Most of the time it fails are due to minor differences in things like names, formatting, or precision, and accommodating each of these differences is not worth it.

fixes #109400
fixes #116150
fixes #112154
Release note: None


121052: base,testutils,kvserver: disable lease queue in replication manual  r=arulajmani a=kvoli

A replica lease queue was introduced in #119155, which processes lease
transfers for replicas. Previously, the replicate queue handled lease
transfers. The linked change omitted updating the `ReplicationManual`
test cluster mode to disable the lease queue, resulting in unexpected
lease transfers in some tests.

Disable the lease queue under `ReplicationManual` replication mode. Misc
tests are also updated to disable/enable the lease queue appropriately.

Epic: None
Touches: #120605
Release note: None

121120: roachtest: deflake admission-control/multitenant-fairness r=sumeerbhola a=aadityasondhi

In #120236, we removed the `failed` metadata from the `crdb_internal.statement_statistics` table.

Fixes #120586.
Fixes #120587.
Fixes #120588.
Fixes #120589.

Release note: None

121132: batcheval: move test to `large` RBE pool r=rail a=rickystewart

Epic: CRDB-8308
Release note: None

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Aaditya Sondhi <20070511+aadityasondhi@users.noreply.github.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants