Logs aren't clear about when SQL pod exits due to a sqlliveness session expiring #85540

joshimhoff · 2022-08-03T16:21:27Z

Describe the problem
The logs aren't clear about when a SQL pod exits due to a sqlliveness session expiring. See the code here:

cockroach/pkg/sql/sqlinstance/instanceprovider/instanceprovider.go

Lines 174 to 180 in 71e1af4

    
           // Stop the instance asynchronously. This callback runs in a stopper task, 
        
           // so it can't do the shutdown (as the shutdown stops the stopper). 
        
           go func() { 
        
           	ctx, sp := p.stopper.Tracer().StartSpanCtx(context.Background(), "instance shutdown") 
        
           	defer sp.Finish() 
        
           	p.shutdownSQLInstance(ctx) 
        
           }()

.

To Reproduce
N/A

Expected behavior
CRDB should log the shutdown reason at fatal severity. CRDB should also probably return a non-zero exit code, since it is not expected for sqlliveness sessions to expire.

Additional data / screenshots
Link to a related internal support issue: https://github.com/cockroachlabs/support/issues/1736

Jira issue: CRDB-18301

Previously, sqlliveness heartbeat operations could block on the transactions that were involved. This change introduces some timeouts of the length of the heartbeat during the create and refresh operations. This change also adds a `log.Fatal()` if the session expires while trying to extend it. Resolves cockroachdb#85541 Resolves cockroachdb#85540 Release note: None Release justification: low-risk bugfix to existing functionality

This is being added to improve observability for when a sqlliveness session expires. Previously, it was difficult to know if the session ended due to expiry. Resolves cockroachdb#85540 Release note: None

Previously, it was difficult to know if the session ended due to expiry. This change passes an err to the shutdown logic through the `errChan` to exit gracefully but with an error. Resolves cockroachdb#85540 Release note: None

In a previous patch, timeouts and retries were added to the sqlliveness heartbeat loop. This introduced cases where the session could potentially go between valid, invalid, and valid state again. This patch introduces a tenant flag for the sqlliveness instance. When enabled, the heartbeat loop is terminated if there are any errors in the session create or extend calls. This patch also introduces an error channel which is used to propogate sqlliveness errors to the server and cli level for better visibility on the reason for process termination. Resolves cockroachdb#85540 Release note: None

…lems This patch changes how stopping the stopper and, in turn, terminating the process happens for tenant SQL pods when the sqlliveness session is expired. Before this patch, the liveness code was calling the Stopper.Stop() directly. This patch replaces that with bubbling up an event up to the cli layer, which has more complex shutdown logic - it prints progress as the stopper stops (or at least it will in cockroachdb#90143), and it also prints a message with the cause of the exit. Fixes cockroachdb#85540 Release note: None Epic: None

…lems This patch changes how stopping the stopper and, in turn, terminating the process happens for tenant SQL pods when the sqlliveness session is expired. Before this patch, the liveness code was calling the Stopper.Stop() directly. This patch replaces that with bubbling up an event up to the cli layer, which is where the logic around terminating the process lives. That guy has more complex shutdown logic - it has the opportunity to perform draining if it wants (or at least it will in cockroachdb#90143), and it also prints a message with the cause of the exit. An alternative considered was to extend the Stopper.Stop() interface somehow with a termination message, and continue to use the stopper as the mechanism used by arbitrary modules to signal the desire to terminate the server. I've decided against it, though, in favor of introducing another mechanism for signaling requests to terminate up to the cli layer, on the argument that the stopper is passed around so widely and used in a lot of non-production libraries, such that it's hard to trace and tell what production modules call Stop() on it and under which circumstances. Thus, the goal of this patch is to keep the number of callers to Stopper.Stop() small - right now there are two of them: one in the cli, and one in the drainServer. In a followup patch, I might remove the drainServer call, and then we'd be left with exactly one caller to Stopper.Stop(). Fixes cockroachdb#85540 Release note: None Epic: None

…lems This patch changes how stopping the stopper and, in turn, terminating the process happens for tenant SQL pods when the sqlliveness session is expired. Before this patch, the liveness code was calling the Stopper.Stop() directly. This patch replaces that with bubbling up an event up to the cli layer, which is where the logic around terminating the process lives. That guy has more complex shutdown logic - it has the opportunity to perform draining if it wants (or at least it will in cockroachdb#90143), and it also prints a message with the cause of the exit. An alternative considered was to extend the Stopper.Stop() interface somehow with a termination message, and continue to use the stopper as the mechanism used by arbitrary modules to signal the desire to terminate the server. I've decided against it, though, in favor of introducing another mechanism for signaling requests to terminate up to the cli layer, on the argument that the stopper is passed around so widely and used in a lot of non-production libraries, such that it's hard to trace and tell what production modules call Stop() on it and under which circumstances. Also, stopper.Stop() cannot be deadlocks if called from inside a stopper task (and our code generally runs inside stopper tasks), so we'd also need to add a new async stop interface to the stopper. Thus, the goal of this patch is to keep the number of callers to Stopper.Stop() small - right now there are two of them: one in the cli, and one in the drainServer. In a followup patch, I might remove the drainServer call, and then we'd be left with exactly one caller to Stopper.Stop(). Fixes cockroachdb#85540 Release note: None Epic: None

90208: server,sqlliveness: rework shutting down SQL pods on sqlliveness problems r=andreimatei a=andreimatei This patch changes how stopping the stopper and, in turn, terminating the process happens for tenant SQL pods when the sqlliveness session is expired. Before this patch, the liveness code was calling the Stopper.Stop() directly. This patch replaces that with bubbling up an event up to the cli layer, which is where the logic around terminating the process lives. That guy has more complex shutdown logic - it has the opportunity to perform draining if it wants (or at least it will in #90143), and it also prints a message with the cause of the exit. An alternative considered was to extend the Stopper.Stop() interface somehow with a termination message, and continue to use the stopper as the mechanism used by arbitrary modules to signal the desire to terminate the server. I've decided against it, though, in favor of introducing another mechanism for signaling requests to terminate up to the cli layer, on the argument that the stopper is passed around so widely and used in a lot of non-production libraries, such that it's hard to trace and tell what production modules call Stop() on it and under which circumstances. Also, stopper.Stop() cannot be deadlocks if called from inside a stopper task (and our code generally runs inside stopper tasks), so we'd also need to add a new async stop interface to the stopper. Thus, the goal of this patch is to keep the number of callers to Stopper.Stop() small - right now there are two of them: one in the cli, and one in the drainServer. In a followup patch, I might remove the drainServer call, and then we'd be left with exactly one caller to Stopper.Stop(). Fixes #85540 Release note: None Epic: None Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>

joshimhoff added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sre For issues SRE opened or otherwise cares about tracking. labels Aug 3, 2022

dhartunian added the T-observability-inf label Aug 3, 2022

blathers-crl bot added the A-observability-inf label Aug 3, 2022

exalate-issue-sync bot assigned aadityasondhi Sep 6, 2022

aadityasondhi mentioned this issue Sep 7, 2022

sqlliveness: add timeouts to heartbeats #87533

Merged

aadityasondhi mentioned this issue Sep 27, 2022

sqlliveness: fix error handling for sqlliveness sessions #88837

Closed

andreimatei mentioned this issue Oct 18, 2022

server,sqlliveness: rework shutting down SQL pods on sqlliveness problems #90208

Merged

craig bot closed this as completed in e873a0e Oct 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Logs aren't clear about when SQL pod exits due to a sqlliveness session expiring #85540

Logs aren't clear about when SQL pod exits due to a sqlliveness session expiring #85540

joshimhoff commented Aug 3, 2022 •

edited

Logs aren't clear about when SQL pod exits due to a sqlliveness session expiring #85540

Logs aren't clear about when SQL pod exits due to a sqlliveness session expiring #85540

Comments

joshimhoff commented Aug 3, 2022 • edited

joshimhoff commented Aug 3, 2022 •

edited