New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: fix StatementStatistics.Nodes list #106587
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Github was being weird and wouldn't let me submit my Reviewable before posting a comment ^^
LGTM, butI don't have enough context on the flow component stuff to give a stamp, so just left some nits.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @yuzefovich)
pkg/sql/executor_statement_metrics.go
line 185 at r1 (raw file):
} if len(sqlInstanceIds) > 1 {
nit: is this necessary? I think the Sort function already has this check.
pkg/sql/execstats/traceanalyzer.go
line 386 at r1 (raw file):
if v.Component.Region != "" { exist := false if len(regions) > 0 {
nit: same here, not sure if this guard is really necessary
pkg/sql/execstats/traceanalyzer.go
line 395 at r1 (raw file):
if !exist { regions = append(regions, v.Component.Region)
How come we can't use CombineUnique
here?
pkg/sql/execstats/traceanalyzer.go
line 418 at r1 (raw file):
a.queryLevelStats = QueryLevelStats{} a.queryLevelStats.SqlInstanceIds = instanceIds if len(regions) > 1 {
Same comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 15 of 15 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @xinhaoz, and @yuzefovich)
pkg/sql/executor_statement_metrics.go
line 182 at r1 (raw file):
sqlInstanceIds = make([]int64, 0, len(queryLevelStats.SqlInstanceIds)) for sqlInstanceId := range queryLevelStats.SqlInstanceIds { sqlInstanceIds = append(sqlInstanceIds, int64(sqlInstanceId))
is there any chance of duplication of nodes? This is why we had combineUnique here previously
pkg/sql/execinfrapb/component_stats.proto
line 61 at r1 (raw file):
optional string region = 5 [ (gogoproto.nullable) = false, (gogoproto.customname) = "Region"];
you don't need a custom name here, because the default already creates with the format you requestes. We use custom when we want to create as ID
and not Id
for example
pkg/sql/execstats/traceanalyzer.go
line 395 at r1 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
How come we can't use
CombineUnique
here?
+1 on using the combineUnique that does that check for you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @maryliag, @xinhaoz, and @yuzefovich)
pkg/sql/executor_statement_metrics.go
line 182 at r1 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
is there any chance of duplication of nodes? This is why we had combineUnique here previously
No, queryLevelStats.SqlInstanceIds is a map.
pkg/sql/executor_statement_metrics.go
line 185 at r1 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
nit: is this necessary? I think the Sort function already has this check.
The sort function doesn't have an initial check to just no-op. It's probably not necessary so I'll just remove it.
pkg/sql/execstats/traceanalyzer.go
line 395 at r1 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
+1 on using the combineUnique that does that check for you
I was trying to avoid the additional overhead of allocating a new array each time. CombineUnique doesn't support a single item added to an array.
pkg/sql/execstats/traceanalyzer.go
line 418 at r1 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
Same comment here.
Done.
Previously, maryliag (Marylia Gutierrez) wrote…
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 12 of 15 files at r1, 4 of 4 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @maryliag, and @xinhaoz)
-- commits
line 26 at r1:
nit: should this be in bug fix
category? Perhaps multiple release notes are warranted (with different categories)?
pkg/sql/conn_executor_exec.go
line 1755 at r1 (raw file):
// and the plan's flow metadata. It also populates the regions field and // annotates the explainPlan field of the instrumentationHelper. func populateQueryLevelStatsAndRegions(
nit: the method should be renamed, perhaps to just populateQueryLevelStats
. The comment on the method also needs an adjustment.
pkg/sql/conn_executor_exec.go
line 1784 at r1 (raw file):
log.VInfof(ctx, 1, msg, ih.fingerprint, err) } else { ih.regions = queryLevelStats.Regions
nit: I think instrumentationHelper.regions
field can now be removed - we can access the Regions
field from the query level stats object.
pkg/sql/instrumentation.go
line 681 at r3 (raw file):
// annotateExplain aggregates the statistics in the trace and annotates // explain.Nodes with execution stats. // It returns a list of all regions on which any of the statements
nit: this sentence should be removed now.
pkg/sql/instrumentation.go
line 690 at r3 (raw file):
// Retrieve which region each node is on. regionsInfo := make(map[int64]string) descriptors, _ := getAllNodeDescriptors(p)
I think we want to refactor regions-related logic in this method. We can remove call to getAllNodeDescriptors
since that's no longer needed. Instead, we should iterate over statsMap
to find all FlowComponentID
s in order to build regionsInfo
map (mapping from SQL Instance ID to the region).
pkg/sql/execinfrapb/component_stats.go
line 53 at r3 (raw file):
// FlowComponentID returns a ComponentID for the given flow. func FlowComponentID(instanceID base.SQLInstanceID, flowID FlowID, region string) ComponentID {
nit: I think we could move this method into execinfra/flow_context.go
and define it with *FlowCtx
as the receiver. This method is called in three places, and only instanceID
argument seems - on the first glance - to be different; however, that's not the case - in makeGetStatsFnForOutbox
, originSQLInstanceID
is the same as flowCtx.NodeID.SQLInstance()
, so we should refactor that code to unify makeGetStatsFnForOutbox
with two other callers, and then call newly-introduced FlowCtx.FlowComponentID
in all three places.
pkg/sql/execinfrapb/component_stats.proto
line 58 at r3 (raw file):
(gogoproto.casttype) = "github.com/cockroachdb/cockroach/pkg/base.SQLInstanceID"]; // The region the component is associated with.
nit: perhaps expand the comment that at the time of writing only ComponentID
s of Flow
type might have this set.
pkg/sql/execstats/traceanalyzer.go
line 371 at r3 (raw file):
instanceIds := make(map[base.SQLInstanceID]struct{}, len(a.flowStats)) // Default to 1 since most queries only use a single region
nit: missing period.
pkg/sql/execstats/traceanalyzer.go
line 382 at r3 (raw file):
instanceIds[instanceID] = struct{}{} for _, v := range stats.stats { // Avoid duplicates and empty string
nit: ditto.
pkg/ccl/testccl/sqlstatsccl/sql_stats_test.go
line 52 at r3 (raw file):
sql.SecondaryTenantZoneConfigsEnabled.Override(ctx, &st.SV, true) // Shorten the closed timestamp target duration so that span configs
nit: this seems like a leftover?
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 9 of 9 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @j82w, @maryliag, and @xinhaoz)
pkg/sql/conn_executor_exec.go
line 1753 at r4 (raw file):
// and populates it in the instrumentationHelper's queryLevelStatsWithErr field. // Query-level execution statistics are collected using the statement's trace // and the plan's flow metadata. It also populates the regions field and
nit: in the last sentence we should remove the mention of "regions" field because it's no more.
pkg/sql/colflow/vectorized_flow.go
line 472 at r4 (raw file):
flowCtx *execinfra.FlowCtx, statsCollectors []colexecop.VectorizedStatsCollector, originSQLInstanceID base.SQLInstanceID,
nit: let's remove originSQLInstanceID
parameter since it's no longer used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like one of the new tests are failing:
sql\_stats\_test.go:128: condition failed to evaluate within 4m0s: rows are not replicated to all regions \[gcp-us-east1 gcp-us-west1\]
\--- FAIL: TestSQLStatsRegions/secondary\_tenant (247.98s)
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @j82w and @xinhaoz)
It passed locally for 10 runs, so not sure why it fails in the CI. I added several new cluster settings based on other multi-region tests that should fix it. |
1. Fixes the nodes list to show all the nodes in StatementStatistics.Nodes. 2. The node list is now empty if tracing is disabled. Previously it would always include the current gateway node id, but it would be missing all the other nodes. This causes confusion because it's uncertain whether the node list is complete or not. 3. Fixes regions on `EXPLAIN ANALYSE (DISTSQL)` to show regions information on secondary tenenants. It was not shown before because only system tenants have acces to gossip which is used under the covers to get the node descriptors. 4. Fixes the performance issues previously listed in #102170. 5. Fixes the test to actually validate the nodes list. The fix was done by adding the region name to the Flow ComponentID. This means the region name is now part of the traces for the Flow ComponentID, so it no longer needs figure out the region. It gets the region information from the same trace the SQL Instance ID is obtained. Moving the collection to the QueryLevelStats avoids iterating the traces multiple times. Fixes: #102170, #96647, #91219 Epic: none Release note (bug fix): Fixed the StatementStatistics.Nodes to contain all the nodes involved in the query. Fixed region info in `EXPLAIN ANALYSE (DISTSQL)` for seconary tenants.
The test failures were caused by there being 9 nodes. On the secondary tenants the nodes are not deterministic so it would likely go to all the nodes instead of just the expected 3 nodes. This was fixed by reducing the node count down to 3. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 4 files at r2, 6 of 9 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @j82w and @xinhaoz)
bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 44e6054 to blathers/backport-release-23.1-106587: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
EXPLAIN ANALYSE (DISTSQL)
to show regions information on secondary tenenants. It was not shown before because only system tenants have acces to gossip which is used under the covers to get the node descriptors.The fix was done by adding the region name to the Flow ComponentID. This means the region name is now part of the traces for the Flow ComponentID, so it no longer needs figure out the region. It gets the region information from the same trace the SQL Instance ID is obtained. Moving the collection to the QueryLevelStats avoids iterating the traces multiple times.
Fixes: #102170, fixes: #96647, fixes: #91219;
Epic: none
Release note (sql change): Fixes the StatementStatistics.Nodes to
contain all the nodes involved in the query. Adds region info to
EXPLAIN ANALYSE (DISTSQL)
for seconary tenants.