Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: fix StatementStatistics.Nodes list #106587

Merged
merged 1 commit into from Jul 12, 2023
Merged

sql: fix StatementStatistics.Nodes list #106587

merged 1 commit into from Jul 12, 2023

Conversation

j82w
Copy link
Contributor

@j82w j82w commented Jul 11, 2023

  1. Fixes the nodes list to show all the nodes in StatementStatistics.Nodes.
  2. The node list is now empty if tracing is disabled. Previously it would always include the current gateway node id, but it would be missing all the other nodes. This causes confusion because it's uncertain whether the node list is complete or not.
  3. Fixes regions on EXPLAIN ANALYSE (DISTSQL) to show regions information on secondary tenenants. It was not shown before because only system tenants have acces to gossip which is used under the covers to get the node descriptors.
  4. Fixes the performance issues previously listed in sqlstats: "get regions for node" logic is extremely inefficient #102170.
  5. Fixes the test to actually validate the nodes list.

The fix was done by adding the region name to the Flow ComponentID. This means the region name is now part of the traces for the Flow ComponentID, so it no longer needs figure out the region. It gets the region information from the same trace the SQL Instance ID is obtained. Moving the collection to the QueryLevelStats avoids iterating the traces multiple times.

Fixes: #102170, fixes: #96647, fixes: #91219;
Epic: none
Release note (sql change): Fixes the StatementStatistics.Nodes to
contain all the nodes involved in the query. Adds region info to
EXPLAIN ANALYSE (DISTSQL) for seconary tenants.

@j82w j82w requested a review from yuzefovich July 11, 2023 14:44
@j82w j82w requested review from a team as code owners July 11, 2023 14:44
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@j82w j82w added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Jul 11, 2023
Copy link
Member

@xinhaoz xinhaoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

Copy link
Member

@xinhaoz xinhaoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Github was being weird and wouldn't let me submit my Reviewable before posting a comment ^^
LGTM, butI don't have enough context on the flow component stuff to give a stamp, so just left some nits.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @yuzefovich)


pkg/sql/executor_statement_metrics.go line 185 at r1 (raw file):

		}

		if len(sqlInstanceIds) > 1 {

nit: is this necessary? I think the Sort function already has this check.


pkg/sql/execstats/traceanalyzer.go line 386 at r1 (raw file):

			if v.Component.Region != "" {
				exist := false
				if len(regions) > 0 {

nit: same here, not sure if this guard is really necessary


pkg/sql/execstats/traceanalyzer.go line 395 at r1 (raw file):

				if !exist {
					regions = append(regions, v.Component.Region)

How come we can't use CombineUnique here?


pkg/sql/execstats/traceanalyzer.go line 418 at r1 (raw file):

	a.queryLevelStats = QueryLevelStats{}
	a.queryLevelStats.SqlInstanceIds = instanceIds
	if len(regions) > 1 {

Same comment here.

Copy link
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 15 of 15 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @xinhaoz, and @yuzefovich)


pkg/sql/executor_statement_metrics.go line 182 at r1 (raw file):

		sqlInstanceIds = make([]int64, 0, len(queryLevelStats.SqlInstanceIds))
		for sqlInstanceId := range queryLevelStats.SqlInstanceIds {
			sqlInstanceIds = append(sqlInstanceIds, int64(sqlInstanceId))

is there any chance of duplication of nodes? This is why we had combineUnique here previously


pkg/sql/execinfrapb/component_stats.proto line 61 at r1 (raw file):

  optional string region = 5 [
    (gogoproto.nullable) = false,
    (gogoproto.customname) = "Region"];

you don't need a custom name here, because the default already creates with the format you requestes. We use custom when we want to create as ID and not Id for example


pkg/sql/execstats/traceanalyzer.go line 395 at r1 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

How come we can't use CombineUnique here?

+1 on using the combineUnique that does that check for you

Copy link
Contributor Author

@j82w j82w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @maryliag, @xinhaoz, and @yuzefovich)


pkg/sql/executor_statement_metrics.go line 182 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

is there any chance of duplication of nodes? This is why we had combineUnique here previously

No, queryLevelStats.SqlInstanceIds is a map.


pkg/sql/executor_statement_metrics.go line 185 at r1 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

nit: is this necessary? I think the Sort function already has this check.

The sort function doesn't have an initial check to just no-op. It's probably not necessary so I'll just remove it.


pkg/sql/execstats/traceanalyzer.go line 395 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

+1 on using the combineUnique that does that check for you

I was trying to avoid the additional overhead of allocating a new array each time. CombineUnique doesn't support a single item added to an array.


pkg/sql/execstats/traceanalyzer.go line 418 at r1 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

Same comment here.

Done.

@j82w
Copy link
Contributor Author

j82w commented Jul 11, 2023

pkg/sql/execinfrapb/component_stats.proto line 61 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

you don't need a custom name here, because the default already creates with the format you requestes. We use custom when we want to create as ID and not Id for example

Done.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 12 of 15 files at r1, 4 of 4 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @maryliag, and @xinhaoz)


-- commits line 26 at r1:
nit: should this be in bug fix category? Perhaps multiple release notes are warranted (with different categories)?


pkg/sql/conn_executor_exec.go line 1755 at r1 (raw file):

// and the plan's flow metadata. It also populates the regions field and
// annotates the explainPlan field of the instrumentationHelper.
func populateQueryLevelStatsAndRegions(

nit: the method should be renamed, perhaps to just populateQueryLevelStats. The comment on the method also needs an adjustment.


pkg/sql/conn_executor_exec.go line 1784 at r1 (raw file):

		log.VInfof(ctx, 1, msg, ih.fingerprint, err)
	} else {
		ih.regions = queryLevelStats.Regions

nit: I think instrumentationHelper.regions field can now be removed - we can access the Regions field from the query level stats object.


pkg/sql/instrumentation.go line 681 at r3 (raw file):

// annotateExplain aggregates the statistics in the trace and annotates
// explain.Nodes with execution stats.
// It returns a list of all regions on which any of the statements

nit: this sentence should be removed now.


pkg/sql/instrumentation.go line 690 at r3 (raw file):

	// Retrieve which region each node is on.
	regionsInfo := make(map[int64]string)
	descriptors, _ := getAllNodeDescriptors(p)

I think we want to refactor regions-related logic in this method. We can remove call to getAllNodeDescriptors since that's no longer needed. Instead, we should iterate over statsMap to find all FlowComponentIDs in order to build regionsInfo map (mapping from SQL Instance ID to the region).


pkg/sql/execinfrapb/component_stats.go line 53 at r3 (raw file):

// FlowComponentID returns a ComponentID for the given flow.
func FlowComponentID(instanceID base.SQLInstanceID, flowID FlowID, region string) ComponentID {

nit: I think we could move this method into execinfra/flow_context.go and define it with *FlowCtx as the receiver. This method is called in three places, and only instanceID argument seems - on the first glance - to be different; however, that's not the case - in makeGetStatsFnForOutbox, originSQLInstanceID is the same as flowCtx.NodeID.SQLInstance(), so we should refactor that code to unify makeGetStatsFnForOutbox with two other callers, and then call newly-introduced FlowCtx.FlowComponentID in all three places.


pkg/sql/execinfrapb/component_stats.proto line 58 at r3 (raw file):

    (gogoproto.casttype) = "github.com/cockroachdb/cockroach/pkg/base.SQLInstanceID"];

  // The region the component is associated with.

nit: perhaps expand the comment that at the time of writing only ComponentIDs of Flow type might have this set.


pkg/sql/execstats/traceanalyzer.go line 371 at r3 (raw file):

	instanceIds := make(map[base.SQLInstanceID]struct{}, len(a.flowStats))
	// Default to 1 since most queries only use a single region

nit: missing period.


pkg/sql/execstats/traceanalyzer.go line 382 at r3 (raw file):

		instanceIds[instanceID] = struct{}{}
		for _, v := range stats.stats {
			// Avoid duplicates and empty string

nit: ditto.


pkg/ccl/testccl/sqlstatsccl/sql_stats_test.go line 52 at r3 (raw file):

	sql.SecondaryTenantZoneConfigsEnabled.Override(ctx, &st.SV, true)

	// Shorten the closed timestamp target duration so that span configs

nit: this seems like a leftover?

@j82w
Copy link
Contributor Author

j82w commented Jul 12, 2023

pkg/sql/instrumentation.go line 690 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I think we want to refactor regions-related logic in this method. We can remove call to getAllNodeDescriptors since that's no longer needed. Instead, we should iterate over statsMap to find all FlowComponentIDs in order to build regionsInfo map (mapping from SQL Instance ID to the region).

Done.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice cleanup! :lgtm:

Reviewed 9 of 9 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @j82w, @maryliag, and @xinhaoz)


pkg/sql/conn_executor_exec.go line 1753 at r4 (raw file):

// and populates it in the instrumentationHelper's queryLevelStatsWithErr field.
// Query-level execution statistics are collected using the statement's trace
// and the plan's flow metadata. It also populates the regions field and

nit: in the last sentence we should remove the mention of "regions" field because it's no more.


pkg/sql/colflow/vectorized_flow.go line 472 at r4 (raw file):

	flowCtx *execinfra.FlowCtx,
	statsCollectors []colexecop.VectorizedStatsCollector,
	originSQLInstanceID base.SQLInstanceID,

nit: let's remove originSQLInstanceID parameter since it's no longer used.

Copy link
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like one of the new tests are failing:


sql\_stats\_test.go:128: condition failed to evaluate within 4m0s: rows are not replicated to all regions \[gcp-us-east1 gcp-us-west1\]

\--- FAIL: TestSQLStatsRegions/secondary\_tenant (247.98s)

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @j82w and @xinhaoz)

@j82w
Copy link
Contributor Author

j82w commented Jul 12, 2023

It passed locally for 10 runs, so not sure why it fails in the CI. I added several new cluster settings based on other multi-region tests that should fix it.

1. Fixes the nodes list to show all the nodes in
    StatementStatistics.Nodes.
2. The node list is now empty if tracing is disabled. Previously it
    would always include the current gateway node id, but it would be
    missing all the other nodes. This causes confusion because it's
    uncertain whether the node list is complete or not.
3. Fixes regions on `EXPLAIN ANALYSE (DISTSQL)` to show regions
    information on secondary tenenants. It was not shown before because
    only system tenants have acces to gossip which is used under the
    covers to get the node descriptors.
4. Fixes the performance issues previously listed in #102170.
5. Fixes the test to actually validate the nodes list.

The fix was done by adding the region name to the Flow ComponentID. This
means the region name is now part of the traces for the Flow
ComponentID, so it no longer needs figure out the region. It gets the
region information from the same trace the SQL Instance ID is obtained.
Moving the collection to the QueryLevelStats avoids iterating the traces
multiple times.

Fixes: #102170, #96647, #91219
Epic: none
Release note (bug fix): Fixed the StatementStatistics.Nodes to
 contain all the nodes involved in the query. Fixed region info in
 `EXPLAIN ANALYSE (DISTSQL)` for seconary tenants.
@j82w
Copy link
Contributor Author

j82w commented Jul 12, 2023

The test failures were caused by there being 9 nodes. On the secondary tenants the nodes are not deterministic so it would likely go to all the nodes instead of just the expected 3 nodes. This was fixed by reducing the node count down to 3.

Copy link
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!
:lgtm:

Reviewed 3 of 4 files at r2, 6 of 9 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @j82w and @xinhaoz)

@j82w
Copy link
Contributor Author

j82w commented Jul 12, 2023

bors r+

@craig
Copy link
Contributor

craig bot commented Jul 12, 2023

Build succeeded:

@craig craig bot merged commit 269b9e3 into cockroachdb:master Jul 12, 2023
7 checks passed
@blathers-crl
Copy link

blathers-crl bot commented Jul 12, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 44e6054 to blathers/backport-release-23.1-106587: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
5 participants