Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: optimize persistedsqlstats flush size check #110173

Merged
merged 1 commit into from
Sep 12, 2023
Merged

sql: optimize persistedsqlstats flush size check #110173

merged 1 commit into from
Sep 12, 2023

Conversation

j82w
Copy link
Contributor

@j82w j82w commented Sep 7, 2023

Problem:
The persistedsqlstats size check to make sure the table is not 1.5x the max size is done on every flush which is done on every node every 10 minutes by default. This can cause serialization issues as it is over the entire table. The check is unnecessary most of the time, because it should only fail if the compaction job is failing.

Solution:

  1. Reduce the check interval to only be done once an hour by default, and make it configurable.
  2. The system table is split in to 8 shards. Instead of checking the entire table count limit it to only one shard. This reduces the scope of the check and reduces the chance of serialization issues.

This was preivously reverted because of a flakey test because the size check is only done on a single shard. The tests are updated to increase the limit and the number of statements to make sure every shard has data.

Fixes: #109619

Release note (sql change): The persistedsqlstats table max size check is now done once an hour instead of every 10 minutes. This reduces the risk of serialization errors on the statistics tables.

@j82w j82w requested a review from a team September 7, 2023 13:43
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w)


pkg/sql/sqlstats/persistedsqlstats/cluster_settings.go line 140 at r1 (raw file):

)

// sqlStatsLimitTableCheckInterval is the cluster setting that controls the

this comment is not very clear, since it doesn't mention this is controlling the frequency of the check and not if we allow or not to pass the limit


pkg/sql/sqlstats/persistedsqlstats/cluster_settings.go line 146 at r1 (raw file):

	settings.TenantWritable,
	"sql.stats.limit_table_size.check_interval",
	"controls what interval the check is done on if the statement and "+

this sentence is a little confusing (especially this "on")


pkg/sql/sqlstats/persistedsqlstats/flush_test.go line 505 at r1 (raw file):

	})

	// Set table size check interval to 1 second.

comment needs to be updated


pkg/sql/sqlstats/persistedsqlstats/flush_test.go line 611 at r1 (raw file):

	require.False(t, limitReached)

	// Set table size check interval to .00001 second. So the next check doesn't

comment needs to be updated

Problem:
The `persistedsqlstats` size check to make sure the table is not 1.5x
the max size is done on every flush which is done on every node every
10 minutes by default. This can cause serialization issues as it is over
the entire table. The check is unnecessary most of the time, because it
should only fail if the compaction job is failing.

Solution:
1. Reduce the check interval to only be done once an hour by default,
   and make it configurable.
2. The system table is split in to 8 shards. Instead of checking the
   entire table count limit it to only one shard. This reduces the scope
   of the check and reduces the chance of serialization issues.

This was preivously reverted because of a flakey test because the size
check is only done on a single shard. The tests are updated to increase
the limit and the number of statements to make sure every shard has
data.

Fixes: #109619

Release note (sql change): The persistedsqlstats table max size check
is now done once an hour instead of every 10 minutes. This reduces the
risk of serialization errors on the statistics tables.
Copy link
Contributor Author

@j82w j82w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @maryliag)


pkg/sql/sqlstats/persistedsqlstats/cluster_settings.go line 140 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

this comment is not very clear, since it doesn't mention this is controlling the frequency of the check and not if we allow or not to pass the limit

Done.


pkg/sql/sqlstats/persistedsqlstats/cluster_settings.go line 146 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

this sentence is a little confusing (especially this "on")

Done.


pkg/sql/sqlstats/persistedsqlstats/flush_test.go line 505 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

comment needs to be updated

Done.


pkg/sql/sqlstats/persistedsqlstats/flush_test.go line 611 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

comment needs to be updated

Done.

Copy link
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @j82w)

@j82w
Copy link
Contributor Author

j82w commented Sep 12, 2023

bors r+

@craig
Copy link
Contributor

craig bot commented Sep 12, 2023

Build failed (retrying...):

@craig craig bot merged commit 40dd180 into cockroachdb:master Sep 12, 2023
3 checks passed
@craig
Copy link
Contributor

craig bot commented Sep 12, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql stats: reduce flush count check
3 participants