
kv: log on excessive latch hold duration #114609

Closed
8 tasks
nvanbenschoten opened this issue Nov 16, 2023 · 1 comment · Fixed by #115746
Labels: A-kv-observability · A-kv-transactions (Relating to MVCC and the transactional model) · C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception) · T-kv (KV Team)

Comments

@nvanbenschoten (Member) commented Nov 16, 2023

We have logging for slow latch acquisitions due to conflicts, but not for when a request itself holds on to a latch for an excessive amount of time. This kind of logging would help catch bugs and surface unexpected slowness in the system that could cascade to other requests if there were contention.

Some notes (a rough sketch of how these pieces could fit together follows the list):

  • plumb cluster settings into spanlatch.Manager.
  • add a new spanlatch/settings.go file.
  • in it, add a new public cluster setting called kv.concurrency.long_latch_hold_duration (or something better). Give it a default value of 3s.
  • add an acquireTime time.Time field to spanlatch.Guard.
  • assign it in Manager.Acquire and Manager.WaitUntilAcquired after the wait succeeds.
  • consult it in Manager.Release after releasing latches. If acquireTime is set and the latch was acquired more than kv.concurrency.long_latch_hold_duration ago, log a warning.
  • put the warning behind a log.Every(1 * time.Second) to avoid log spam.
  • make sure the warning explains some of the possible causes for long latch hold times. Also make sure it includes the request information added in kv: include conflicting request information in latch manager traces/logs #114601.
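
A rough sketch of the pieces above, written against the usual pkg/settings and util/log patterns. The helper name maybeLogLongLatchHold, the Manager.settings field, the setting class, and the warning text are hypothetical; the eventual PR #115746 may structure this differently.

// Hypothetical sketch of the proposal above; not the merged code.
package spanlatch

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/settings"
	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/cockroach/pkg/util/timeutil"
)

// spanlatch/settings.go: the proposed cluster setting, default 3s.
var longLatchHoldDuration = settings.RegisterDurationSetting(
	settings.SystemOnly,
	"kv.concurrency.long_latch_hold_duration",
	"duration threshold above which a released latch is logged as held too long",
	3*time.Second,
)

// Rate-limit the warning to at most one per second to avoid log spam.
var longLatchHoldLogEvery = log.Every(1 * time.Second)

// Called from Manager.Release after the latches have been dropped. Guard is
// assumed to carry the new acquireTime field, and Manager is assumed to have
// had the cluster settings plumbed into it.
func (m *Manager) maybeLogLongLatchHold(ctx context.Context, lg *Guard) {
	if lg.acquireTime.IsZero() {
		return
	}
	held := timeutil.Since(lg.acquireTime)
	if held < longLatchHoldDuration.Get(&m.settings.SV) {
		return
	}
	if longLatchHoldLogEvery.ShouldLog() {
		// The warning should name likely causes (slow evaluation, slow
		// replication, a leaked latch) and include the request details
		// added in #114601.
		log.Warningf(ctx, "request held latches for %s, exceeding the warning threshold", held)
	}
}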

Jira issue: CRDB-33593

Epic: CRDB-34227

@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-transactions Relating to MVCC and the transactional model. A-kv-observability T-kv KV Team labels Nov 16, 2023
@blathers-crl blathers-crl bot added this to Incoming in KV Nov 16, 2023
@lyang24 (Contributor) commented Jan 8, 2024

Hi Nathan, I have worked on a draft PR for this issue. I have two questions:

  1. Is there a good way to test the logger output?
    I could add a test like the one below. When I run it I can see the output in my IDE, but I am not sure how to capture the logger output as a string (a possible capture pattern is sketched at the end of this comment).
// Exercises the long-latch-hold warning path, but does not yet assert on the
// emitted log output.
func TestLatchManagerLongLatchHeldLogOutput(t *testing.T) {
	settings := cluster.MakeClusterSettings()
	m := Make(nil, nil, settings)
	gr := &kvpb.GetRequest{
		RequestHeader: kvpb.RequestHeader{
			Key: roachpb.Key("test"),
		},
	}
	ba := &kvpb.BatchRequest{}
	ba.Add(gr)
	// Acquire latches, no conflict.
	lg1, err := m.Acquire(context.Background(), spans("d", "f", write, zeroTS), poison.Policy_Error, ba)
	require.NoError(t, err)
	// Hold the latch past the default 3s threshold so Release logs a warning.
	time.Sleep(4 * time.Second)
	m.Release(lg1)
}
  2. Regarding "make sure the warning explains some of the possible causes for long latch hold times": I listed two of the reasons, but I do not have much experience here, so let me know if there are major causes I missed in the message.
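
One possible way to assert on the warning, sketched from memory of how other tests in the repo scan log output with log.Scope and log.FetchEntriesFromFiles. The flush helper name, argument order, and sensitive-data constant may differ across versions, and the grep pattern is a hypothetical fragment of the warning text, so treat this as a starting point rather than the approach the final PR necessarily used.

func TestLatchManagerLongLatchHeldLogOutput(t *testing.T) {
	// Direct this test's log output to a scratch directory it owns.
	s := log.Scope(t)
	defer s.Close(t)

	settings := cluster.MakeClusterSettings()
	m := Make(nil, nil, settings)
	ba := &kvpb.BatchRequest{}
	ba.Add(&kvpb.GetRequest{RequestHeader: kvpb.RequestHeader{Key: roachpb.Key("test")}})

	lg1, err := m.Acquire(context.Background(), spans("d", "f", write, zeroTS), poison.Policy_Error, ba)
	require.NoError(t, err)
	time.Sleep(4 * time.Second) // exceed the 3s default threshold
	m.Release(lg1)

	// Flush buffered log output, then grep the test's log files for the warning.
	log.FlushFiles()
	entries, err := log.FetchEntriesFromFiles(
		0, math.MaxInt64, 100,
		regexp.MustCompile(`held latches`), // hypothetical fragment of the warning
		log.WithFlattenedSensitiveData,
	)
	require.NoError(t, err)
	require.NotEmpty(t, entries)
}

Overriding kv.concurrency.long_latch_hold_duration to a few milliseconds in the test, rather than sleeping for 4 seconds, would also keep the test fast; either way the assertion only checks that some matching warning was emitted.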

lyang24 added a commit to lyang24/cockroach that referenced this issue Jan 13, 2024
This commit aims to help observability by logging requests that hold latches
longer than a threshold. long_latch_hold_duration is a new cluster setting
introduced to set the latch hold time threshold; latches held over the
threshold will be logged at most once per second. To enable this logging,
spanlatch.Manager now contains a pointer to the cluster settings.

Fixes: cockroachdb#114609

Release note (ops change): The cluster setting
kv.concurrency.long_latch_hold_duration has been added. Latches held longer
than this threshold may be logged along with request information.
lyang24 added a commit to lyang24/cockroach that referenced this issue Jan 26, 2024
This commit aims to help observability by logging requests that hold latches
longer than a threshold. long_latch_hold_duration is a new cluster setting
introduced to set the latch hold time threshold; latches held over the
threshold will be logged at most once per second. To enable this logging,
spanlatch.Manager now contains a pointer to the cluster settings.

Fixes: cockroachdb#114609

Release note: None
lyang24 added a commit to lyang24/cockroach that referenced this issue Jan 26, 2024
This commit passes the context from the concurrency manager's finishReq method
through to the latch manager's Release method. The goal is to supply the right
context when logging events on latch release.

Informs: cockroachdb#114609

Release note: None
lyang24 added a commit to lyang24/cockroach that referenced this issue Feb 1, 2024
This commit passes the context from the concurrency manager's finishReq method
through to the latch manager's Release method. The goal is to supply the right
context when logging events on latch release.

Informs: cockroachdb#114609

Release note: None
craig bot pushed a commit that referenced this issue Feb 1, 2024
115746: kv: log on excessive latch hold duration r=lyang24 a=lyang24

This commit aims to help observability by logging requests that hold latches
longer than a threshold. long_latch_hold_duration is a new cluster setting
introduced to set the latch hold time threshold; latches held over the
threshold will be logged at most once per second. To enable this logging,
spanlatch.Manager now contains a pointer to the cluster settings.

Fixes: #114609

Release note: None

118300: kv: ship high-resolution tscache summaries during lease transfers and range merges r=nvanbenschoten a=nvanbenschoten

Fixes #61986.
Fixes #117486.
Unblocks #118000.

This commit uses the new `tscache.Cache.Serialize` method introduced in #118299 to ship high-resolution summaries of the timestamp cache during lease transfers and range merges. In doing so, it eliminates the loss of precision that occurs in an incoming leaseholder's timestamp cache when it receives a lease transfer or range merge.

This loss of precision was a source of transaction retries for three reasons:
1. txn tombstone marker keys would have their timestamp advanced, leading to TransactionAbortedError with the `ABORT_REASON_NEW_LEASE_PREVENTS_TXN` reason.
2. txn push marker keys would have their timestamp advanced, leading to transactions having their commit timestamp pushed, which could lead to TransactionRetryError with the `RETRY_SERIALIZABLE` reason.
3. global keys would have their timestamp advanced as if they had been read, also leading to transactions having their commit timestamp pushed if they wrote to those keys, which could also lead to TransactionRetryError with the `RETRY_SERIALIZABLE` reason.

The first issue here is the most severe, both because it cannot be refreshed away and because it affects transactions of all isolation levels.

This commit introduces two new cluster settings to control the maximum size of these timestamp cache read summaries:
- `kv.lease_transfer_read_summary.local_budget`
- `kv.lease_transfer_read_summary.global_budget`

It configures the local keyspace budget to 4MB and the global keyspace budget to 0B. This default configuration should be sufficient to eliminate the first two sources of retries described above. The third has not been observed as a serious issue in practice, so we configure the global budget to 0 so that we can hit a serialization fast-path.

Release note (ops change): Transaction replay protection state is now passed between the outgoing and incoming leaseholder for a range during a lease transfer. This avoids cases where lease transfers can cause transactions to throw TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN) errors.

Co-authored-by: lyang24 <lanqingy@usc.edu>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in 71ef3d3 Feb 2, 2024
KV automation moved this from Incoming to Closed Feb 2, 2024
@exalate-issue-sync exalate-issue-sync bot reopened this Feb 20, 2024
KV automation moved this from Closed to Incoming Feb 20, 2024
KV automation moved this from Incoming to Closed Feb 20, 2024
wenyihu6 pushed a commit to wenyihu6/cockroach that referenced this issue Feb 21, 2024
This commit aims to help observability by logging requests that hold latches
longer than a threshold. long_latch_hold_duration is a new cluster setting
introduced to set the latch hold time threshold; latches held over the
threshold will be logged at most once per second. To enable this logging,
spanlatch.Manager now contains a pointer to the cluster settings.

Fixes: cockroachdb#114609

Release note: None
wenyihu6 pushed a commit to wenyihu6/cockroach that referenced this issue Feb 21, 2024
This commit passes the context from the concurrency manager's finishReq method
through to the latch manager's Release method. The goal is to supply the right
context when logging events on latch release.

Informs: cockroachdb#114609

Release note: None
@exalate-issue-sync exalate-issue-sync bot reopened this Feb 26, 2024
KV automation moved this from Closed to Incoming Feb 26, 2024
KV automation moved this from Incoming to Closed Feb 26, 2024