
kvserver: add cluster settings to log traces for slow lease transfers and splits #81152

Closed
irfansharif opened this issue May 9, 2022 · 1 comment · Fixed by #120943
Labels: A-kv-distribution, A-kv-observability, C-enhancement, C-performance, O-support, P-2, T-kv

Comments


irfansharif commented May 9, 2022

Is your feature request related to a problem? Please describe.

In an internal escalation (https://github.com/cockroachlabs/support/issues/1533) we attributed high tail latencies to KV-internal requests, like lease transfers and splits, taking a long time. This presented in the traces as follows:

‹     0.974ms      0.024ms                    event:kv/kvserver/spanlatch/manager.go:517 [n250,s576,r625905/2:/Table/62/16/1498341{40…-63…}] waiting to acquire read latch /Table/62/16/14983414721{8-9}@1651156512.523480877,0, held by write latch /M{in-ax}@0,0›
‹   447.326ms    446.352ms                    event:kv/kvserver/concurrency/concurrency_manager.go:300 [n250,s576,r625905/2:/Table/62/16/1498341{40…-63…}] scanning lock table for conflicting locks›

Here we're latched behind the lease transfer request, which declares a non-MVCC latch over the entire keyspace, i.e. /M{in-ax}@0,0. It's difficult, however, to go from this to an understanding of why the lease transfer request itself took that long. Was it waiting for in-flight reads to complete? Was it something else?

The same is true for split requests (they also declare a non-MVCC write latch over the RHS):

‹     1.121ms      0.030ms event:kv/kvserver/spanlatch/manager.go:517 [n220,s459,r625867/1:/Table/60/1{0/2022-…-1/2022-…}] waiting to acquire write latch /Table/60/10/2022-04-28T14:30:03.790531Z/17049428001/0@1651156203.790535407,0, held by write latch /Table/60/1{0/2022-04-28T08:04:21.873826Z/9576597821-1/2022-03-15T03:13:16.248485Z/85629368671}@0,0›
‹  1297.601ms   1296.480ms event:kv/kvserver/concurrency/concurrency_manager.go:300 [n220,s459,r625867/1:/Table/60/10/2022-04-28T0…] scanning lock table for conflicting locks›
‹  1298.162ms      0.561ms event:kv/kvserver/replica_send.go:120 [n220,s459,r625867/1:/Table/60/10/2022-04-28T0…] replica.Send got error: key range /Table/60/10/2022-04-28T14:30:03.790531Z/17049428001/0-/Table/60/10/2022-04-28T14:30:03.790531Z/17049428001/0/NULL outside of bounds of range /Table/60/10/2022-04-28T03:03:48.097657Z/19000424899-/Table/60/10/2022-04-28T08:04:21.873826Z/9576597821; suggested ranges: [desc: r625867:/Table/60/10/2022-04-28T0{3:03:48.097657Z/19000424899-8:04:21.873826Z/9576597821} [(n220,s459):1, (n254,s595):2, (n243,s551):3, (n232,s506):4, (n246,s562):5, next=6, gen=106966, sticky=1651115328.121799069,0], lease: repl=(n220,s459):1 seq=1980 start=1650759449.968454111,0 epo=3 pro=1650759449.967785279,0, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING]›

Is the split waiting for long-running in-flight reads to finish? Is stats evaluation taking a while?

Describe the solution you'd like

We have `sql.trace.{stmt,txn}.enable_threshold` to log traces for individual statements/transactions that take longer than a configured threshold (anything faster is filtered out). We could do something similar here for internal requests (I'm specifically interested in lease transfers and splits).
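As a rough sketch of that shape (not actual CockroachDB code: the threshold variable and the `runWithSlowTrace` helper below are made up for illustration), the idea is to time the internal request and only emit its trace once it crosses a configurable threshold:

```go
// Hypothetical sketch only: slowLeaseTransferThreshold stands in for a
// proposed cluster setting; 0 would disable the logging entirely.
package main

import (
	"context"
	"log"
	"time"
)

var slowLeaseTransferThreshold = 1 * time.Second

// runWithSlowTrace runs op and, if it took longer than the threshold, logs a
// message where the real system would dump the request's verbose trace.
func runWithSlowTrace(ctx context.Context, name string, op func(context.Context) error) error {
	start := time.Now()
	err := op(ctx) // in CockroachDB this would run with a verbose tracing span attached
	if d := time.Since(start); slowLeaseTransferThreshold > 0 && d > slowLeaseTransferThreshold {
		log.Printf("slow %s took %s (threshold %s); trace would be logged here",
			name, d, slowLeaseTransferThreshold)
	}
	return err
}

func main() {
	_ = runWithSlowTrace(context.Background(), "lease transfer",
		func(ctx context.Context) error {
			time.Sleep(1500 * time.Millisecond) // simulate a slow transfer
			return nil
		})
}
```

Operators could then tune the threshold per cluster, the same way the SQL-level thresholds are tuned today.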

Additional Context

Arguably `sql.trace.{stmt,txn}.enable_threshold` isn't great given it applies to all statements/txns (we don't have something like #70614 to target specific tables/indexes, or a specific stmt fingerprint). Setting thresholds for splits/lease transfers in the same way doesn't seem that bad though, given they're down in KV and (should) apply uniformly across ranges and finish in a timely manner (and if they don't, say because of slow stats computations, that's exactly what we'd want the threshold to capture).

+cc @cockroachdb/kv-notifications.

Jira issue: CRDB-15123

irfansharif added the C-enhancement, C-performance, A-kv-distribution, and A-kv-observability labels May 9, 2022
irfansharif added this to To do in Observability Infrastructure via automation May 9, 2022
irfansharif added this to Incoming in KV via automation May 9, 2022
jlinder added and removed the sync-me label May 20, 2022
mwang1026 moved this from Incoming to Prioritized in KV Jun 1, 2022
pav-kv added the O-support label Feb 13, 2024
shralex added the P-2 label Feb 13, 2024
miraradeva (Contributor) commented:

Similar example in the replicate queue: filterTracingSpans.
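For illustration only (this is not the actual `filterTracingSpans` code; the `span` type and cutoff below are invented for the sketch), the underlying idea is to prune a recording down to the spans where the time actually went before logging it:

```go
// Toy version of the span-filtering idea: keep only the slow parts of a
// recording so the logged trace stays focused.
package main

import (
	"fmt"
	"time"
)

type span struct {
	operation string
	duration  time.Duration
}

// filterSlowSpans keeps spans at or above the given duration.
func filterSlowSpans(rec []span, minDur time.Duration) []span {
	out := rec[:0:0]
	for _, s := range rec {
		if s.duration >= minDur {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	rec := []span{
		{"acquire latches", 446 * time.Millisecond},
		{"scan lock table", 2 * time.Millisecond},
		{"evaluate stats", 1200 * time.Millisecond},
	}
	for _, s := range filterSlowSpans(rec, 100*time.Millisecond) {
		fmt.Printf("%s: %s\n", s.operation, s.duration)
	}
}
```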

miraradeva added a commit to miraradeva/cockroach that referenced this issue Mar 23, 2024
This change enables logging a trace of a split request if it exceeds a
certain duration. The duration is controlled by a new cluster setting:
`kv.split.slow_split_tracing_threshold`, which defaults to 2s.

Fixes: cockroachdb#81152

Release note: None
craig bot pushed a commit that referenced this issue Mar 26, 2024
120943: kvserver: log a trace for slow splits r=kvoli a=miraradeva

This change enables logging a trace of a split request if it exceeds a certain duration. The duration is controlled by a new cluster setting: `kv.split.slow_split_tracing_threshold`, which defaults to 2s.

Fixes: #81152

Release note: None

Co-authored-by: Mira Radeva <mira@cockroachlabs.com>
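As a hedged sketch of the mechanism described in the PR above (the real change is in #120943; only the setting name `kv.split.slow_split_tracing_threshold` and its 2s default come from the PR text, everything else here is illustrative), the trace-or-not decision is made once the split has finished:

```go
// Sketch only: the threshold would come from the cluster setting
// kv.split.slow_split_tracing_threshold rather than a package constant.
package main

import (
	"context"
	"log"
	"time"
)

const defaultSlowSplitTracingThreshold = 2 * time.Second // default per the PR text

// maybeLogSlowSplitTrace logs the (stand-in) recording only when the split
// exceeded the threshold; a non-positive threshold disables logging.
func maybeLogSlowSplitTrace(elapsed, threshold time.Duration, recording string) {
	if threshold <= 0 || elapsed <= threshold {
		return
	}
	log.Printf("slow split took %s (threshold %s):\n%s", elapsed, threshold, recording)
}

func runSplit(ctx context.Context) (time.Duration, string) {
	start := time.Now()
	time.Sleep(10 * time.Millisecond) // placeholder for split evaluation work
	return time.Since(start), "(verbose trace recording would go here)"
}

func main() {
	elapsed, rec := runSplit(context.Background())
	// With the 2s default, this fast split logs nothing; lowering the
	// threshold widens the net.
	maybeLogSlowSplitTrace(elapsed, defaultSlowSplitTracingThreshold, rec)
}
```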
craig bot closed this as completed in 6f9b7ec Mar 26, 2024
KV automation moved this from Prioritized to Closed Mar 26, 2024
msbutler pushed a commit to msbutler/cockroach that referenced this issue Mar 29, 2024
This change enables logging a trace of a split request if it exceeds a
certain duration. The duration is controlled by a new cluster setting:
`kv.split.slow_split_tracing_threshold`, which defaults to 2s.

Fixes: cockroachdb#81152

Release note: None