Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlliveness: add timeouts to heartbeats #87533

Conversation

aadityasondhi
Copy link
Collaborator

@aadityasondhi aadityasondhi commented Sep 7, 2022

Previously, sqlliveness heartbeat operations could block on the transactions that were involved. This change introduces some timeouts of the length of the heartbeat during the create and refresh operations.

Resolves #85541

Release note: None

Release justification: low-risk bugfix to existing functionality

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@@ -21,7 +22,8 @@ import (

// FakeStorage implements the sqlliveness.Storage interface.
type FakeStorage struct {
mu struct {
InsertSleep time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sleeps are generally an anti-pattern in our repo. Especially sleeps that are intended to wait for something to happen. Anything like this would need to be well motivated. I have a hard time understanding how these sleeps fit into the test.

I think I'd prefer if we did something more explicit, like have a the FakeStorage send on a channel here or. One approach is to use a chan chan struct{} where the Insert call will send a channel on a channel, and then block until it receives from the sent channel. In that way, we can be sure that the Insert does not make progress, and we don't need to wait for timing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing it out. Made changes and introduced the pattern you suggested.

Comment on lines 252 to 260
func (l *Instance) createSessionWithTimeout(
ctx context.Context, timeout time.Duration,
) (*session, error) {
var s *session
var err error
var createChan = make(chan struct{})
go func() {
s, err = l.createSession(ctx)
close(createChan)
}()

t := timeutil.NewTimer()
t.Reset(timeout)
select {
case <-t.C:
return nil, &TimeoutError{}
case <-createChan:
return s, err
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. this seems complex
  2. This fails to actually cancel the underlying work, which, I think, is bad

Can we instead with context.WithTimeout and then detect the deadline exceeded error?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to use context.WithTimeout

@aadityasondhi aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from d305a16 to d13363b Compare September 12, 2022 14:40
@aadityasondhi aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch 4 times, most recently from 4822910 to 934e219 Compare September 12, 2022 17:36
@@ -103,6 +104,7 @@ func (s *session) invokeSessionExpiryCallbacks(ctx context.Context) {
for _, callback := range s.mu.sessionExpiryCallbacks {
callback(ctx)
}
log.Fatal(ctx, "sqlliveness session expired")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is at all reasonable. The kv node expects to be robust to this failure and the sql pod installs a callback for exactly this purpose. Please remove this from this PR and look at those callbacks. If you want to change the behavior which happens in those callbacks, please do, but in a different PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I will add this to the callback and put it in a separate PR.

@@ -237,6 +239,48 @@ func (l *Instance) extendSession(ctx context.Context, s *session) (bool, error)
return true, nil
}

func (l *Instance) createSessionWithTimeout(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not quite getting it. The context cancellation should just flow through the underlying calls. Why are you creating these separate goroutines? It feels like all you want to do is context the timeout context in the existing createSession and extendSession methods.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please correct me if my understanding is wrong. From #85541, my understanding was that the underlying storage could hang for whatever reason. When this happens, it would hang the current goroutine (which is running the heartbeat), not really solving the initial problem. By doing this in a separate goroutine, the caller can exit when the context is canceled (after exceeding the deadline) even if the child is stuck on a call. Whenever the child recovers, it will see that the context is canceled and exit.

@aadityasondhi aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from 934e219 to 34bf5ef Compare September 12, 2022 18:14
@aadityasondhi aadityasondhi requested a review from a team as a code owner September 12, 2022 18:14
@aadityasondhi aadityasondhi removed the request for review from a team September 12, 2022 18:23
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @ajwerner, and @joshimhoff)


pkg/sql/sqlliveness/slinstance/slinstance.go line 242 at r3 (raw file):

Previously, aadityasondhi (Aaditya Sondhi) wrote…

Please correct me if my understanding is wrong. From #85541, my understanding was that the underlying storage could hang for whatever reason. When this happens, it would hang the current goroutine (which is running the heartbeat), not really solving the initial problem. By doing this in a separate goroutine, the caller can exit when the context is canceled (after exceeding the deadline) even if the child is stuck on a call. Whenever the child recovers, it will see that the context is canceled and exit.

It's a good deal of complexity. I think if we're going to cover that case we need to understand it and have tests which exercise the scenario. In general, my feeling is that we should rely on context cancellation achieving our goals down through the RPC stack. Just adding the timeout to the context would be a 99+% win in my book, whereas this feels like an unclear additional step.

@aadityasondhi aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch 4 times, most recently from 0e443d6 to f356350 Compare September 12, 2022 19:52
Copy link
Collaborator Author

@aadityasondhi aadityasondhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, and @joshimhoff)


pkg/sql/sqlliveness/slinstance/slinstance.go line 242 at r3 (raw file):

Previously, ajwerner wrote…

It's a good deal of complexity. I think if we're going to cover that case we need to understand it and have tests which exercise the scenario. In general, my feeling is that we should rely on context cancellation achieving our goals down through the RPC stack. Just adding the timeout to the context would be a 99+% win in my book, whereas this feels like an unclear additional step.

I see, that makes sense. I updated the code and tests to rely on context cancellation for exiting early.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 5 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @ajwerner, and @joshimhoff)


pkg/sql/sqlliveness/slinstance/slinstance.go line 288 at r6 (raw file):

				// Unable to extend session due to unknown error.
				// Clear and stop heartbeat loop.
				log.Errorf(ctx, "sqlliveness failed to extend session: %v", err)

I wonder if this is the right policy. I know this isn't changed code. Really just asking the question.

Code quote:

				// Unable to extend session due to unknown error.
				// Clear and stop heartbeat loop.
				log.Errorf(ctx, "sqlliveness failed to extend session: %v", err)

pkg/sql/sqlliveness/slinstance/slinstance_test.go line 155 at r6 (raw file):

	slinstance.DefaultTTL.Override(ctx, &settings.SV, 2*time.Millisecond)
	// Must be shorter than the storage sleep amount below
	slinstance.DefaultHeartBeat.Override(ctx, &settings.SV, 1*time.Millisecond)

In my experience, 1ms timers can cause trouble under race. You may want to move them up to something on the order of 10ms if the race detector is enabled.

Code quote:

	slinstance.DefaultTTL.Override(ctx, &settings.SV, 2*time.Millisecond)
	// Must be shorter than the storage sleep amount below
	slinstance.DefaultHeartBeat.Override(ctx, &settings.SV, 1*time.Millisecond)

pkg/sql/sqlliveness/slinstance/slinstance_test.go line 162 at r6 (raw file):

	// verify that eventually session is created successfully
	require.Eventually(

I think testutils.SucceedsSoon will both be easier to use here and will behave better under stress because it backs off its loop.


pkg/sql/sqlliveness/slstorage/test_helpers.go line 24 at r6 (raw file):

// FakeStorage implements the sqlliveness.Storage interface.
type FakeStorage struct {
	BlockCh chan struct{}

My guess is that you're going to need to put this BlockCh under the mutex and add a setter for it. It feels to me like the test is not synchronizing setting it with its use.

@aadityasondhi aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from a89e27f to 32c64d7 Compare September 16, 2022 14:32
@aadityasondhi aadityasondhi requested a review from a team September 16, 2022 14:32
@aadityasondhi aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from 32c64d7 to 47c41e4 Compare September 16, 2022 14:34
@aadityasondhi aadityasondhi removed the request for review from a team September 16, 2022 14:35
@aadityasondhi aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch 2 times, most recently from a05af60 to 664f2c3 Compare September 16, 2022 17:23
@aadityasondhi aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from 664f2c3 to b0add23 Compare September 16, 2022 17:45
Copy link
Collaborator Author

@aadityasondhi aadityasondhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, and @joshimhoff)


pkg/sql/sqlliveness/slinstance/slinstance.go line 288 at r6 (raw file):

Previously, ajwerner wrote…

I wonder if this is the right policy. I know this isn't changed code. Really just asking the question.

I am not sure personally. Who would be the right team/person to ask?


pkg/sql/sqlliveness/slstorage/test_helpers.go line 24 at r6 (raw file):

Previously, ajwerner wrote…

My guess is that you're going to need to put this BlockCh under the mutex and add a setter for it. It feels to me like the test is not synchronizing setting it with its use.

Good call, it did have a race condition. Thanks! Made the change in the latest revision.


pkg/sql/sqlliveness/slinstance/slinstance_test.go line 155 at r6 (raw file):

Previously, ajwerner wrote…

In my experience, 1ms timers can cause trouble under race. You may want to move them up to something on the order of 10ms if the race detector is enabled.

Thanks for the pointer. Changed.


pkg/sql/sqlliveness/slinstance/slinstance_test.go line 162 at r6 (raw file):

Previously, ajwerner wrote…

I think testutils.SucceedsSoon will both be easier to use here and will behave better under stress because it backs off its loop.

Thanks for the pointer. Changed.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 4 files at r7.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @ajwerner, and @joshimhoff)


pkg/sql/sqlliveness/slstorage/test_helpers.go line 52 at r7 (raw file):

	ctx context.Context, sid sqlliveness.SessionID, expiration hlc.Timestamp,
) error {
	s.mu.Lock()

is there an issue here if you block while holding the mutex? should you pull the channel out under the mutex and then block? I feel like this is a classic case where mutexes and channels don't compose well.

Code quote:

	defer s.mu.Unlock()

@aadityasondhi aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from b0add23 to 5bbf619 Compare September 21, 2022 15:46
@aadityasondhi aadityasondhi added the backport-22.2.x Flags PRs that need to be backported to 22.2. label Sep 21, 2022
Copy link
Collaborator Author

@aadityasondhi aadityasondhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, and @joshimhoff)


pkg/sql/sqlliveness/slstorage/test_helpers.go line 52 at r7 (raw file):

Previously, ajwerner wrote…

is there an issue here if you block while holding the mutex? should you pull the channel out under the mutex and then block? I feel like this is a classic case where mutexes and channels don't compose well.

Thanks, fixed in the latest revision

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @ajwerner, and @joshimhoff)


pkg/sql/sqlliveness/slinstance/slinstance.go line 285 at r6 (raw file):

				found, err = l.extendSession(ctx, s)
				return err
			}); err != nil && !errors.HasType(err, (*contextutil.TimeoutError)(nil)) {

It seems to me like if you hit the timeout, you don't know whether the session exists or not, so you probably should not clear the session -- you should keep retrying to extend the current session, no?


pkg/sql/sqlliveness/slinstance/slinstance.go line 288 at r6 (raw file):

Previously, aadityasondhi (Aaditya Sondhi) wrote…

I am not sure personally. Who would be the right team/person to ask?

I think you also want to avoid logging if the ctx.Err() != nil because we shouldn't log during clean server shutdown.


pkg/sql/sqlliveness/slinstance/slinstance.go line 285 at r10 (raw file):

				found, err = l.extendSession(ctx, s)
				return err
			}); err != nil && !errors.HasType(err, (*contextutil.TimeoutError)(nil)) {

I think you should also check if the context is canceled, which it will be during shutdown.

Copy link
Collaborator Author

@aadityasondhi aadityasondhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, and @joshimhoff)


pkg/sql/sqlliveness/slinstance/slinstance.go line 285 at r6 (raw file):

Previously, ajwerner wrote…

It seems to me like if you hit the timeout, you don't know whether the session exists or not, so you probably should not clear the session -- you should keep retrying to extend the current session, no?

Yes, you are correct. Fixed the behavior and tests to match it.


pkg/sql/sqlliveness/slinstance/slinstance.go line 288 at r6 (raw file):

Previously, ajwerner wrote…

I think you also want to avoid logging if the ctx.Err() != nil because we shouldn't log during clean server shutdown.

That makes sense. Changed.


pkg/sql/sqlliveness/slinstance/slinstance.go line 285 at r10 (raw file):

Previously, ajwerner wrote…

I think you should also check if the context is canceled, which it will be during shutdown.

Thanks for pointing it out, fixed.

@aadityasondhi aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from 5bbf619 to 3b43176 Compare September 22, 2022 18:29
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very close. My refactor is a nit. Feel free to take it or leave it and merge.

Reviewed 1 of 3 files at r9, 2 of 4 files at r11, 2 of 2 files at r12, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @joshimhoff)


pkg/sql/sqlliveness/slinstance/slinstance.go line 292 at r12 (raw file):

				return err
			})
			if err != nil {

this logic is getting a bit messy and hard to read how about:

switch {
case errors.HasType(err, (*contextutil.TimeoutError)(nil)):
    // Retry without clearing the session because we don't know the current status.
    t.Reset(0)
    continue
case err != nil && ctx.Err() == nil:
   log.Errorf(ctx, "sqlliveness failed to extend session: %v", err)
   fallthrough
case err != nil:
   // TODO(ajwerner): Decide whether we actually should exit the heartbeat loop here if the context is not
   // canceled. Consider the case of an ambiguous result error: shouldn't we try again?
   l.clearSession()
   return
case !found:
  // No existing session found, immediately create one.
  l.clearSession(ctx)
  // Start next loop iteration immediately to insert a new session.
  t.Reset(0)
default:
  if log.V(2) {
     log.Infof(ctx, "extended SQL liveness session %s", s.ID()) 
  }
  t.Reset(l.gb()
}

dhartunian and others added 2 commits September 23, 2022 12:45
Previously, sqlliveness heartbeat operations could block on the transactions
that were involved. This change introduces some timeouts of the length of the
heartbeat during the create and refresh operations.

Resolves cockroachdb#85541

Release note: None

Release justification: low-risk bugfix to existing functionality
@aadityasondhi aadityasondhi force-pushed the aadityas/add-timeouts-to-sqlliveness-heartbeat branch from 3b43176 to 59a70ba Compare September 23, 2022 16:45
@aadityasondhi
Copy link
Collaborator Author

bors r=ajwerner

@craig
Copy link
Contributor

craig bot commented Sep 23, 2022

Build succeeded:

@craig craig bot merged commit 2c18ed3 into cockroachdb:master Sep 23, 2022
@aadityasondhi aadityasondhi deleted the aadityas/add-timeouts-to-sqlliveness-heartbeat branch September 23, 2022 18:47
@joshimhoff
Copy link
Collaborator

Thank you!

Can we get a 22.1 backport? CC serverless is running 22.1. Major upgrade can take some time in practice. The lack of these HBs has led to inactionable alerts hitting the pagers of our SREs.

@ajwerner
Copy link
Contributor

Why do major upgrades lead to unavailability? I feel like I was missing some context on the motivation here.

@joshimhoff
Copy link
Collaborator

joshimhoff commented Sep 26, 2022

Ah sorry for confusion!

Major upgrades don't lead to unavailability. The issue fixed by this PR has nothing to do with major upgrades. CC serverless is simply running 22.1, and I expect it'll take some time to move to 22.2, based on past experience with serverless major upgrades. As a result, I'd love it if we backport this change into 22.1, not just 22.2, so as to avoid reoccurrence of the issue fixed by this PR.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/sqlliveness/slinstance/BUILD.bazel line 40 at r14 (raw file):

        "//pkg/sql/sqlliveness",
        "//pkg/sql/sqlliveness/slstorage",
        "//pkg/testutils",

for future reference, you definitely want to squash a diff like this into the prior commit. All commits must build.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-22.2.x Flags PRs that need to be backported to 22.2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sqlliveness: no deadlines on heartbeats
6 participants