sqlliveness: add timeouts to heartbeats #87533

aadityasondhi · 2022-09-07T20:26:04Z

Previously, sqlliveness heartbeat operations could block on the transactions that were involved. This change introduces some timeouts of the length of the heartbeat during the create and refresh operations.

Resolves #85541

Release note: None

Release justification: low-risk bugfix to existing functionality

cockroach-teamcity · 2022-09-07T20:26:25Z

This change is

ajwerner · 2022-09-08T15:41:59Z

pkg/sql/sqlliveness/slstorage/test_helpers.go

@@ -21,7 +22,8 @@ import (

 // FakeStorage implements the sqlliveness.Storage interface.
 type FakeStorage struct {
-	mu struct {
+	InsertSleep time.Duration


Sleeps are generally an anti-pattern in our repo. Especially sleeps that are intended to wait for something to happen. Anything like this would need to be well motivated. I have a hard time understanding how these sleeps fit into the test.

I think I'd prefer if we did something more explicit, like have a the FakeStorage send on a channel here or. One approach is to use a chan chan struct{} where the Insert call will send a channel on a channel, and then block until it receives from the sent channel. In that way, we can be sure that the Insert does not make progress, and we don't need to wait for timing.

Thanks for pointing it out. Made changes and introduced the pattern you suggested.

ajwerner · 2022-09-08T15:46:17Z

pkg/sql/sqlliveness/slinstance/slinstance.go

+func (l *Instance) createSessionWithTimeout(
+	ctx context.Context, timeout time.Duration,
+) (*session, error) {
+	var s *session
+	var err error
+	var createChan = make(chan struct{})
+	go func() {
+		s, err = l.createSession(ctx)
+		close(createChan)
+	}()
+
+	t := timeutil.NewTimer()
+	t.Reset(timeout)
+	select {
+	case <-t.C:
+		return nil, &TimeoutError{}
+	case <-createChan:
+		return s, err
+	}
+}


this seems complex

This fails to actually cancel the underlying work, which, I think, is bad

Can we instead with context.WithTimeout and then detect the deadline exceeded error?

Changed to use context.WithTimeout

ajwerner · 2022-09-12T17:51:51Z

pkg/sql/sqlliveness/slinstance/slinstance.go

@@ -103,6 +104,7 @@ func (s *session) invokeSessionExpiryCallbacks(ctx context.Context) {
 	for _, callback := range s.mu.sessionExpiryCallbacks {
 		callback(ctx)
 	}
+	log.Fatal(ctx, "sqlliveness session expired")


I don't think this is at all reasonable. The kv node expects to be robust to this failure and the sql pod installs a callback for exactly this purpose. Please remove this from this PR and look at those callbacks. If you want to change the behavior which happens in those callbacks, please do, but in a different PR.

Makes sense, I will add this to the callback and put it in a separate PR.

ajwerner · 2022-09-12T17:53:12Z

pkg/sql/sqlliveness/slinstance/slinstance.go

@@ -237,6 +239,48 @@ func (l *Instance) extendSession(ctx context.Context, s *session) (bool, error)
 	return true, nil
 }

+func (l *Instance) createSessionWithTimeout(


I'm still not quite getting it. The context cancellation should just flow through the underlying calls. Why are you creating these separate goroutines? It feels like all you want to do is context the timeout context in the existing createSession and extendSession methods.

Please correct me if my understanding is wrong. From #85541, my understanding was that the underlying storage could hang for whatever reason. When this happens, it would hang the current goroutine (which is running the heartbeat), not really solving the initial problem. By doing this in a separate goroutine, the caller can exit when the context is canceled (after exceeding the deadline) even if the child is stuck on a call. Whenever the child recovers, it will see that the context is canceled and exit.

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @ajwerner, and @joshimhoff)

pkg/sql/sqlliveness/slinstance/slinstance.go line 242 at r3 (raw file):

Previously, aadityasondhi (Aaditya Sondhi) wrote…

Please correct me if my understanding is wrong. From #85541, my understanding was that the underlying storage could hang for whatever reason. When this happens, it would hang the current goroutine (which is running the heartbeat), not really solving the initial problem. By doing this in a separate goroutine, the caller can exit when the context is canceled (after exceeding the deadline) even if the child is stuck on a call. Whenever the child recovers, it will see that the context is canceled and exit.

It's a good deal of complexity. I think if we're going to cover that case we need to understand it and have tests which exercise the scenario. In general, my feeling is that we should rely on context cancellation achieving our goals down through the RPC stack. Just adding the timeout to the context would be a 99+% win in my book, whereas this feels like an unclear additional step.

aadityasondhi

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, and @joshimhoff)

pkg/sql/sqlliveness/slinstance/slinstance.go line 242 at r3 (raw file):

Previously, ajwerner wrote…

It's a good deal of complexity. I think if we're going to cover that case we need to understand it and have tests which exercise the scenario. In general, my feeling is that we should rely on context cancellation achieving our goals down through the RPC stack. Just adding the timeout to the context would be a 99+% win in my book, whereas this feels like an unclear additional step.

I see, that makes sense. I updated the code and tests to rely on context cancellation for exiting early.

ajwerner

Reviewed 2 of 5 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @ajwerner, and @joshimhoff)

pkg/sql/sqlliveness/slinstance/slinstance.go line 288 at r6 (raw file):

				// Unable to extend session due to unknown error.
				// Clear and stop heartbeat loop.
				log.Errorf(ctx, "sqlliveness failed to extend session: %v", err)

I wonder if this is the right policy. I know this isn't changed code. Really just asking the question.

Code quote:

				// Unable to extend session due to unknown error.
				// Clear and stop heartbeat loop.
				log.Errorf(ctx, "sqlliveness failed to extend session: %v", err)

pkg/sql/sqlliveness/slinstance/slinstance_test.go line 155 at r6 (raw file):

	slinstance.DefaultTTL.Override(ctx, &settings.SV, 2*time.Millisecond)
	// Must be shorter than the storage sleep amount below
	slinstance.DefaultHeartBeat.Override(ctx, &settings.SV, 1*time.Millisecond)

In my experience, 1ms timers can cause trouble under race. You may want to move them up to something on the order of 10ms if the race detector is enabled.

Code quote:

	slinstance.DefaultTTL.Override(ctx, &settings.SV, 2*time.Millisecond)
	// Must be shorter than the storage sleep amount below
	slinstance.DefaultHeartBeat.Override(ctx, &settings.SV, 1*time.Millisecond)

pkg/sql/sqlliveness/slinstance/slinstance_test.go line 162 at r6 (raw file):

	// verify that eventually session is created successfully
	require.Eventually(

I think testutils.SucceedsSoon will both be easier to use here and will behave better under stress because it backs off its loop.

pkg/sql/sqlliveness/slstorage/test_helpers.go line 24 at r6 (raw file):

// FakeStorage implements the sqlliveness.Storage interface.
type FakeStorage struct {
	BlockCh chan struct{}

My guess is that you're going to need to put this BlockCh under the mutex and add a setter for it. It feels to me like the test is not synchronizing setting it with its use.

aadityasondhi

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, and @joshimhoff)

pkg/sql/sqlliveness/slinstance/slinstance.go line 288 at r6 (raw file):

Previously, ajwerner wrote…

I wonder if this is the right policy. I know this isn't changed code. Really just asking the question.

I am not sure personally. Who would be the right team/person to ask?

pkg/sql/sqlliveness/slstorage/test_helpers.go line 24 at r6 (raw file):

Previously, ajwerner wrote…

My guess is that you're going to need to put this BlockCh under the mutex and add a setter for it. It feels to me like the test is not synchronizing setting it with its use.

Good call, it did have a race condition. Thanks! Made the change in the latest revision.

pkg/sql/sqlliveness/slinstance/slinstance_test.go line 155 at r6 (raw file):

Previously, ajwerner wrote…

In my experience, 1ms timers can cause trouble under race. You may want to move them up to something on the order of 10ms if the race detector is enabled.

Thanks for the pointer. Changed.

pkg/sql/sqlliveness/slinstance/slinstance_test.go line 162 at r6 (raw file):

Previously, ajwerner wrote…

I think testutils.SucceedsSoon will both be easier to use here and will behave better under stress because it backs off its loop.

Thanks for the pointer. Changed.

ajwerner

Reviewed 2 of 4 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @ajwerner, and @joshimhoff)

pkg/sql/sqlliveness/slstorage/test_helpers.go line 52 at r7 (raw file):

	ctx context.Context, sid sqlliveness.SessionID, expiration hlc.Timestamp,
) error {
	s.mu.Lock()

is there an issue here if you block while holding the mutex? should you pull the channel out under the mutex and then block? I feel like this is a classic case where mutexes and channels don't compose well.

Code quote:

	defer s.mu.Unlock()

aadityasondhi

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, and @joshimhoff)

pkg/sql/sqlliveness/slstorage/test_helpers.go line 52 at r7 (raw file):

Previously, ajwerner wrote…

is there an issue here if you block while holding the mutex? should you pull the channel out under the mutex and then block? I feel like this is a classic case where mutexes and channels don't compose well.

Thanks, fixed in the latest revision

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @ajwerner, and @joshimhoff)

pkg/sql/sqlliveness/slinstance/slinstance.go line 285 at r6 (raw file):

				found, err = l.extendSession(ctx, s)
				return err
			}); err != nil && !errors.HasType(err, (*contextutil.TimeoutError)(nil)) {

It seems to me like if you hit the timeout, you don't know whether the session exists or not, so you probably should not clear the session -- you should keep retrying to extend the current session, no?

pkg/sql/sqlliveness/slinstance/slinstance.go line 288 at r6 (raw file):

Previously, aadityasondhi (Aaditya Sondhi) wrote…

I am not sure personally. Who would be the right team/person to ask?

I think you also want to avoid logging if the ctx.Err() != nil because we shouldn't log during clean server shutdown.

pkg/sql/sqlliveness/slinstance/slinstance.go line 285 at r10 (raw file):

				found, err = l.extendSession(ctx, s)
				return err
			}); err != nil && !errors.HasType(err, (*contextutil.TimeoutError)(nil)) {

I think you should also check if the context is canceled, which it will be during shutdown.

aadityasondhi

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, and @joshimhoff)

pkg/sql/sqlliveness/slinstance/slinstance.go line 285 at r6 (raw file):

Previously, ajwerner wrote…

It seems to me like if you hit the timeout, you don't know whether the session exists or not, so you probably should not clear the session -- you should keep retrying to extend the current session, no?

Yes, you are correct. Fixed the behavior and tests to match it.

pkg/sql/sqlliveness/slinstance/slinstance.go line 288 at r6 (raw file):

Previously, ajwerner wrote…

I think you also want to avoid logging if the ctx.Err() != nil because we shouldn't log during clean server shutdown.

That makes sense. Changed.

pkg/sql/sqlliveness/slinstance/slinstance.go line 285 at r10 (raw file):

Previously, ajwerner wrote…

I think you should also check if the context is canceled, which it will be during shutdown.

Thanks for pointing it out, fixed.

ajwerner

Very close. My refactor is a nit. Feel free to take it or leave it and merge.

Reviewed 1 of 3 files at r9, 2 of 4 files at r11, 2 of 2 files at r12, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @joshimhoff)

pkg/sql/sqlliveness/slinstance/slinstance.go line 292 at r12 (raw file):

				return err
			})
			if err != nil {

this logic is getting a bit messy and hard to read how about:

switch {
case errors.HasType(err, (*contextutil.TimeoutError)(nil)):
    // Retry without clearing the session because we don't know the current status.
    t.Reset(0)
    continue
case err != nil && ctx.Err() == nil:
   log.Errorf(ctx, "sqlliveness failed to extend session: %v", err)
   fallthrough
case err != nil:
   // TODO(ajwerner): Decide whether we actually should exit the heartbeat loop here if the context is not
   // canceled. Consider the case of an ambiguous result error: shouldn't we try again?
   l.clearSession()
   return
case !found:
  // No existing session found, immediately create one.
  l.clearSession(ctx)
  // Start next loop iteration immediately to insert a new session.
  t.Reset(0)
default:
  if log.V(2) {
     log.Infof(ctx, "extended SQL liveness session %s", s.ID()) 
  }
  t.Reset(l.gb()
}

Previously, sqlliveness heartbeat operations could block on the transactions that were involved. This change introduces some timeouts of the length of the heartbeat during the create and refresh operations. Resolves cockroachdb#85541 Release note: None Release justification: low-risk bugfix to existing functionality

Release note: None

aadityasondhi · 2022-09-23T17:25:29Z

bors r=ajwerner

craig · 2022-09-23T18:39:24Z

Build succeeded:

Bazel Essential CI (Cockroach)

joshimhoff · 2022-09-26T12:55:46Z

Thank you!

Can we get a 22.1 backport? CC serverless is running 22.1. Major upgrade can take some time in practice. The lack of these HBs has led to inactionable alerts hitting the pagers of our SREs.

ajwerner · 2022-09-26T14:31:07Z

Why do major upgrades lead to unavailability? I feel like I was missing some context on the motivation here.

joshimhoff · 2022-09-26T14:43:17Z

Ah sorry for confusion!

Major upgrades don't lead to unavailability. The issue fixed by this PR has nothing to do with major upgrades. CC serverless is simply running 22.1, and I expect it'll take some time to move to 22.2, based on past experience with serverless major upgrades. As a result, I'd love it if we backport this change into 22.1, not just 22.2, so as to avoid reoccurrence of the issue fixed by this PR.

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/sql/sqlliveness/slinstance/BUILD.bazel line 40 at r14 (raw file):

        "//pkg/sql/sqlliveness",
        "//pkg/sql/sqlliveness/slstorage",
        "//pkg/testutils",

for future reference, you definitely want to squash a diff like this into the prior commit. All commits must build.

aadityasondhi requested review from joshimhoff and abarganier September 7, 2022 20:26

joshimhoff requested a review from ajwerner September 7, 2022 20:26

aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from 8eeebb5 to d305a16 Compare September 8, 2022 15:28

aadityasondhi mentioned this pull request Sep 8, 2022

[WIP] sqlliveness: add timeouts to heartbeats #85551

Closed

ajwerner suggested changes Sep 8, 2022

View reviewed changes

aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from d305a16 to d13363b Compare September 12, 2022 14:40

aadityasondhi requested a review from ajwerner September 12, 2022 14:43

aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch 4 times, most recently from 4822910 to 934e219 Compare September 12, 2022 17:36

ajwerner reviewed Sep 12, 2022

View reviewed changes

aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from 934e219 to 34bf5ef Compare September 12, 2022 18:14

aadityasondhi requested a review from a team as a code owner September 12, 2022 18:14

aadityasondhi removed the request for review from a team September 12, 2022 18:23

ajwerner reviewed Sep 12, 2022

View reviewed changes

aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch 4 times, most recently from 0e443d6 to f356350 Compare September 12, 2022 19:52

aadityasondhi commented Sep 12, 2022

View reviewed changes

ajwerner reviewed Sep 13, 2022

View reviewed changes

aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from a89e27f to 32c64d7 Compare September 16, 2022 14:32

aadityasondhi requested a review from a team September 16, 2022 14:32

aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from 32c64d7 to 47c41e4 Compare September 16, 2022 14:34

aadityasondhi removed the request for review from a team September 16, 2022 14:35

aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch 2 times, most recently from a05af60 to 664f2c3 Compare September 16, 2022 17:23

aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from 664f2c3 to b0add23 Compare September 16, 2022 17:45

aadityasondhi commented Sep 19, 2022

View reviewed changes

aadityasondhi requested a review from ajwerner September 19, 2022 16:48

ajwerner reviewed Sep 20, 2022

View reviewed changes

aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from b0add23 to 5bbf619 Compare September 21, 2022 15:46

aadityasondhi requested a review from ajwerner September 21, 2022 15:46

aadityasondhi added the backport-22.2.x Flags PRs that need to be backported to 22.2. label Sep 21, 2022

aadityasondhi commented Sep 21, 2022

View reviewed changes

ajwerner reviewed Sep 21, 2022

View reviewed changes

aadityasondhi commented Sep 22, 2022

View reviewed changes

aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from 5bbf619 to 3b43176 Compare September 22, 2022 18:29

ajwerner approved these changes Sep 22, 2022

View reviewed changes

dhartunian and others added 2 commits September 23, 2022 12:45

ccl: lint fixes in tenant_side_test.go

59a70ba

Release note: None

aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from 3b43176 to 59a70ba Compare September 23, 2022 16:45

craig bot merged commit 2c18ed3 into cockroachdb:master Sep 23, 2022

blathers-crl bot mentioned this pull request Sep 23, 2022

release-22.2: sqlliveness: add timeouts to heartbeats #88601

Closed

aadityasondhi deleted the aadityas/add-timeouts-to-sqlliveness-heartbeat branch September 23, 2022 18:47

aadityasondhi mentioned this pull request Sep 26, 2022

release-22.1: sqlliveness: add timeouts to heartbeats #88731

Closed

ajwerner mentioned this pull request Sep 26, 2022

ccl/multitenantccl/tenantcostclient: TestSQLLivenessExemption flakes #88743

Closed

This was referenced Sep 27, 2022

release-22.2: sqlliveness: add timeouts to heartbeats #88805

Merged

release-22.1: sqlliveness: add timeouts to heartbeats #88807

Merged

andreimatei mentioned this pull request Oct 19, 2022

sqlliveness: sql liveness heartbeat loop can exit silently #90262

Closed

andreimatei reviewed Oct 19, 2022

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sqlliveness: add timeouts to heartbeats #87533

sqlliveness: add timeouts to heartbeats #87533

aadityasondhi commented Sep 7, 2022 •

edited

cockroach-teamcity commented Sep 7, 2022

ajwerner Sep 8, 2022

aadityasondhi Sep 12, 2022

ajwerner Sep 8, 2022

aadityasondhi Sep 12, 2022

ajwerner Sep 12, 2022

aadityasondhi Sep 12, 2022

ajwerner Sep 12, 2022

aadityasondhi Sep 12, 2022

ajwerner left a comment

aadityasondhi left a comment

ajwerner left a comment

aadityasondhi left a comment

ajwerner left a comment

aadityasondhi left a comment

ajwerner left a comment

aadityasondhi left a comment

ajwerner left a comment

aadityasondhi commented Sep 23, 2022

craig bot commented Sep 23, 2022

joshimhoff commented Sep 26, 2022

ajwerner commented Sep 26, 2022

joshimhoff commented Sep 26, 2022 •

edited

andreimatei left a comment

sqlliveness: add timeouts to heartbeats #87533

sqlliveness: add timeouts to heartbeats #87533

Conversation

aadityasondhi commented Sep 7, 2022 • edited

cockroach-teamcity commented Sep 7, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ajwerner left a comment

Choose a reason for hiding this comment

aadityasondhi left a comment

Choose a reason for hiding this comment

ajwerner left a comment

Choose a reason for hiding this comment

aadityasondhi left a comment

Choose a reason for hiding this comment

ajwerner left a comment

Choose a reason for hiding this comment

aadityasondhi left a comment

Choose a reason for hiding this comment

ajwerner left a comment

Choose a reason for hiding this comment

aadityasondhi left a comment

Choose a reason for hiding this comment

ajwerner left a comment

Choose a reason for hiding this comment

aadityasondhi commented Sep 23, 2022

craig bot commented Sep 23, 2022

joshimhoff commented Sep 26, 2022

ajwerner commented Sep 26, 2022

joshimhoff commented Sep 26, 2022 • edited

andreimatei left a comment

Choose a reason for hiding this comment

aadityasondhi commented Sep 7, 2022 •

edited

joshimhoff commented Sep 26, 2022 •

edited