Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logs aren't clear about when SQL pod exits due to a sqlliveness session expiring #85540

Closed
joshimhoff opened this issue Aug 3, 2022 · 0 comments · Fixed by #90208
Closed
Assignees
Labels
A-observability-inf C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sre For issues SRE opened or otherwise cares about tracking.

Comments

@joshimhoff
Copy link
Collaborator

joshimhoff commented Aug 3, 2022

Describe the problem
The logs aren't clear about when a SQL pod exits due to a sqlliveness session expiring. See the code here:

// Stop the instance asynchronously. This callback runs in a stopper task,
// so it can't do the shutdown (as the shutdown stops the stopper).
go func() {
ctx, sp := p.stopper.Tracer().StartSpanCtx(context.Background(), "instance shutdown")
defer sp.Finish()
p.shutdownSQLInstance(ctx)
}()
.

To Reproduce
N/A

Expected behavior
CRDB should log the shutdown reason at fatal severity. CRDB should also probably return a non-zero exit code, since it is not expected for sqlliveness sessions to expire.

Additional data / screenshots
Link to a related internal support issue: https://github.com/cockroachlabs/support/issues/1736

Jira issue: CRDB-18301

@joshimhoff joshimhoff added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sre For issues SRE opened or otherwise cares about tracking. labels Aug 3, 2022
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue Sep 7, 2022
Previously, sqlliveness heartbeat operations could block on the transactions
that were involved. This change introduces some timeouts of the length of the
heartbeat during the create and refresh operations.

This change also adds a `log.Fatal()` if the session expires while
trying to extend it.

Resolves cockroachdb#85541
Resolves cockroachdb#85540

Release note: None

Release justification: low-risk bugfix to existing functionality
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue Sep 8, 2022
Previously, sqlliveness heartbeat operations could block on the transactions
that were involved. This change introduces some timeouts of the length of the
heartbeat during the create and refresh operations.

This change also adds a `log.Fatal()` if the session expires while
trying to extend it.

Resolves cockroachdb#85541
Resolves cockroachdb#85540

Release note: None

Release justification: low-risk bugfix to existing functionality
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue Sep 12, 2022
Previously, sqlliveness heartbeat operations could block on the transactions
that were involved. This change introduces some timeouts of the length of the
heartbeat during the create and refresh operations.

This change also adds a `log.Fatal()` if the session expires while
trying to extend it.

Resolves cockroachdb#85541
Resolves cockroachdb#85540

Release note: None

Release justification: low-risk bugfix to existing functionality
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Sep 27, 2022
This is being added to improve observability for when a sqlliveness
session expires. Previously, it was difficult to know if the session
ended due to expiry.

Resolves cockroachdb#85540

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Sep 28, 2022
Previously, it was difficult to know if the session ended
due to expiry. This change passes an err to the shutdown logic
through the `errChan` to exit gracefully but with an error.

Resolves cockroachdb#85540

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Sep 28, 2022
Previously, it was difficult to know if the session ended
due to expiry. This change passes an err to the shutdown logic
through the `errChan` to exit gracefully but with an error.

Resolves cockroachdb#85540

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Sep 29, 2022
Previously, it was difficult to know if the session ended
due to expiry. This change passes an err to the shutdown logic
through the `errChan` to exit gracefully but with an error.

Resolves cockroachdb#85540

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Oct 17, 2022
In a previous patch, timeouts and retries were added to the sqlliveness
heartbeat loop. This introduced cases where the session could potentially
go between valid, invalid, and valid state again.

This patch introduces a tenant flag for the sqlliveness instance. When
enabled, the heartbeat loop is terminated if there are any errors in the
session create or extend calls.

This patch also introduces an error channel which is used to propogate
sqlliveness errors to the server and cli level for better visibility on
the reason for process termination.

Resolves cockroachdb#85540

Release note: None
andreimatei added a commit to andreimatei/cockroach that referenced this issue Oct 18, 2022
…lems

This patch changes how stopping the stopper and, in turn, terminating
the process happens for tenant SQL pods when the sqlliveness session is
expired. Before this patch, the liveness code was calling the
Stopper.Stop() directly. This patch replaces that with bubbling up an
event up to the cli layer, which has more complex shutdown logic - it
prints progress as the stopper stops (or at least it will in cockroachdb#90143),
and it also prints a message with the cause of the exit.

Fixes cockroachdb#85540

Release note: None
Epic: None
andreimatei added a commit to andreimatei/cockroach that referenced this issue Oct 19, 2022
…lems

This patch changes how stopping the stopper and, in turn, terminating
the process happens for tenant SQL pods when the sqlliveness session is
expired. Before this patch, the liveness code was calling the
Stopper.Stop() directly. This patch replaces that with bubbling up an
event up to the cli layer, which is where the logic around terminating
the process lives. That guy has  more complex shutdown logic - it has
the opportunity to perform draining if it wants (or at least it will
in cockroachdb#90143), and it also prints a message with the cause of the exit.

An alternative considered was to extend the Stopper.Stop() interface
somehow with a termination message, and continue to use the stopper as
the mechanism used by arbitrary modules to signal the desire to
terminate the server. I've decided against it, though, in favor of
introducing another mechanism for signaling requests to terminate up to
the cli layer, on the argument that the stopper is passed around so
widely and used in a lot of non-production libraries, such that it's
hard to trace and tell what production modules call Stop() on it and
under which circumstances. Thus, the goal of this patch is to keep the
number of callers to Stopper.Stop() small - right now there are two of
them: one in the cli, and one in the drainServer. In a followup patch, I
might remove the drainServer call, and then we'd be left with exactly
one caller to Stopper.Stop().

Fixes cockroachdb#85540

Release note: None
Epic: None
andreimatei added a commit to andreimatei/cockroach that referenced this issue Oct 21, 2022
…lems

This patch changes how stopping the stopper and, in turn, terminating
the process happens for tenant SQL pods when the sqlliveness session is
expired. Before this patch, the liveness code was calling the
Stopper.Stop() directly. This patch replaces that with bubbling up an
event up to the cli layer, which is where the logic around terminating
the process lives. That guy has  more complex shutdown logic - it has
the opportunity to perform draining if it wants (or at least it will
in cockroachdb#90143), and it also prints a message with the cause of the exit.

An alternative considered was to extend the Stopper.Stop() interface
somehow with a termination message, and continue to use the stopper as
the mechanism used by arbitrary modules to signal the desire to
terminate the server. I've decided against it, though, in favor of
introducing another mechanism for signaling requests to terminate up to
the cli layer, on the argument that the stopper is passed around so
widely and used in a lot of non-production libraries, such that it's
hard to trace and tell what production modules call Stop() on it and
under which circumstances. Also, stopper.Stop() cannot be deadlocks if
called from inside a stopper task (and our code generally runs inside
stopper tasks), so we'd also need to add a new async stop interface to
the stopper.
Thus, the goal of this patch is to keep the number of callers to
Stopper.Stop() small - right now there are two of them: one in the cli,
and one in the drainServer. In a followup patch, I might remove the
drainServer call, and then we'd be left with exactly one caller to
Stopper.Stop().

Fixes cockroachdb#85540

Release note: None
Epic: None
andreimatei added a commit to andreimatei/cockroach that referenced this issue Oct 24, 2022
…lems

This patch changes how stopping the stopper and, in turn, terminating
the process happens for tenant SQL pods when the sqlliveness session is
expired. Before this patch, the liveness code was calling the
Stopper.Stop() directly. This patch replaces that with bubbling up an
event up to the cli layer, which is where the logic around terminating
the process lives. That guy has  more complex shutdown logic - it has
the opportunity to perform draining if it wants (or at least it will
in cockroachdb#90143), and it also prints a message with the cause of the exit.

An alternative considered was to extend the Stopper.Stop() interface
somehow with a termination message, and continue to use the stopper as
the mechanism used by arbitrary modules to signal the desire to
terminate the server. I've decided against it, though, in favor of
introducing another mechanism for signaling requests to terminate up to
the cli layer, on the argument that the stopper is passed around so
widely and used in a lot of non-production libraries, such that it's
hard to trace and tell what production modules call Stop() on it and
under which circumstances. Also, stopper.Stop() cannot be deadlocks if
called from inside a stopper task (and our code generally runs inside
stopper tasks), so we'd also need to add a new async stop interface to
the stopper.
Thus, the goal of this patch is to keep the number of callers to
Stopper.Stop() small - right now there are two of them: one in the cli,
and one in the drainServer. In a followup patch, I might remove the
drainServer call, and then we'd be left with exactly one caller to
Stopper.Stop().

Fixes cockroachdb#85540

Release note: None
Epic: None
andreimatei added a commit to andreimatei/cockroach that referenced this issue Oct 24, 2022
…lems

This patch changes how stopping the stopper and, in turn, terminating
the process happens for tenant SQL pods when the sqlliveness session is
expired. Before this patch, the liveness code was calling the
Stopper.Stop() directly. This patch replaces that with bubbling up an
event up to the cli layer, which is where the logic around terminating
the process lives. That guy has  more complex shutdown logic - it has
the opportunity to perform draining if it wants (or at least it will
in cockroachdb#90143), and it also prints a message with the cause of the exit.

An alternative considered was to extend the Stopper.Stop() interface
somehow with a termination message, and continue to use the stopper as
the mechanism used by arbitrary modules to signal the desire to
terminate the server. I've decided against it, though, in favor of
introducing another mechanism for signaling requests to terminate up to
the cli layer, on the argument that the stopper is passed around so
widely and used in a lot of non-production libraries, such that it's
hard to trace and tell what production modules call Stop() on it and
under which circumstances. Also, stopper.Stop() cannot be deadlocks if
called from inside a stopper task (and our code generally runs inside
stopper tasks), so we'd also need to add a new async stop interface to
the stopper.
Thus, the goal of this patch is to keep the number of callers to
Stopper.Stop() small - right now there are two of them: one in the cli,
and one in the drainServer. In a followup patch, I might remove the
drainServer call, and then we'd be left with exactly one caller to
Stopper.Stop().

Fixes cockroachdb#85540

Release note: None
Epic: None
andreimatei added a commit to andreimatei/cockroach that referenced this issue Oct 25, 2022
…lems

This patch changes how stopping the stopper and, in turn, terminating
the process happens for tenant SQL pods when the sqlliveness session is
expired. Before this patch, the liveness code was calling the
Stopper.Stop() directly. This patch replaces that with bubbling up an
event up to the cli layer, which is where the logic around terminating
the process lives. That guy has  more complex shutdown logic - it has
the opportunity to perform draining if it wants (or at least it will
in cockroachdb#90143), and it also prints a message with the cause of the exit.

An alternative considered was to extend the Stopper.Stop() interface
somehow with a termination message, and continue to use the stopper as
the mechanism used by arbitrary modules to signal the desire to
terminate the server. I've decided against it, though, in favor of
introducing another mechanism for signaling requests to terminate up to
the cli layer, on the argument that the stopper is passed around so
widely and used in a lot of non-production libraries, such that it's
hard to trace and tell what production modules call Stop() on it and
under which circumstances. Also, stopper.Stop() cannot be deadlocks if
called from inside a stopper task (and our code generally runs inside
stopper tasks), so we'd also need to add a new async stop interface to
the stopper.
Thus, the goal of this patch is to keep the number of callers to
Stopper.Stop() small - right now there are two of them: one in the cli,
and one in the drainServer. In a followup patch, I might remove the
drainServer call, and then we'd be left with exactly one caller to
Stopper.Stop().

Fixes cockroachdb#85540

Release note: None
Epic: None
andreimatei added a commit to andreimatei/cockroach that referenced this issue Oct 26, 2022
…lems

This patch changes how stopping the stopper and, in turn, terminating
the process happens for tenant SQL pods when the sqlliveness session is
expired. Before this patch, the liveness code was calling the
Stopper.Stop() directly. This patch replaces that with bubbling up an
event up to the cli layer, which is where the logic around terminating
the process lives. That guy has  more complex shutdown logic - it has
the opportunity to perform draining if it wants (or at least it will
in cockroachdb#90143), and it also prints a message with the cause of the exit.

An alternative considered was to extend the Stopper.Stop() interface
somehow with a termination message, and continue to use the stopper as
the mechanism used by arbitrary modules to signal the desire to
terminate the server. I've decided against it, though, in favor of
introducing another mechanism for signaling requests to terminate up to
the cli layer, on the argument that the stopper is passed around so
widely and used in a lot of non-production libraries, such that it's
hard to trace and tell what production modules call Stop() on it and
under which circumstances. Also, stopper.Stop() cannot be deadlocks if
called from inside a stopper task (and our code generally runs inside
stopper tasks), so we'd also need to add a new async stop interface to
the stopper.
Thus, the goal of this patch is to keep the number of callers to
Stopper.Stop() small - right now there are two of them: one in the cli,
and one in the drainServer. In a followup patch, I might remove the
drainServer call, and then we'd be left with exactly one caller to
Stopper.Stop().

Fixes cockroachdb#85540

Release note: None
Epic: None
craig bot pushed a commit that referenced this issue Oct 27, 2022
90208: server,sqlliveness: rework shutting down SQL pods on sqlliveness problems r=andreimatei a=andreimatei

This patch changes how stopping the stopper and, in turn, terminating
the process happens for tenant SQL pods when the sqlliveness session is
expired. Before this patch, the liveness code was calling the
Stopper.Stop() directly. This patch replaces that with bubbling up an
event up to the cli layer, which is where the logic around terminating
the process lives. That guy has  more complex shutdown logic - it has
the opportunity to perform draining if it wants (or at least it will
in #90143), and it also prints
a message with the cause of the exit.

An alternative considered was to extend the Stopper.Stop() interface
somehow with a termination message, and continue to use the stopper as
the mechanism used by arbitrary modules to signal the desire to
terminate the server. I've decided against it, though, in favor of
introducing another mechanism for signaling requests to terminate up to
the cli layer, on the argument that the stopper is passed around so
widely and used in a lot of non-production libraries, such that it's
hard to trace and tell what production modules call Stop() on it and
under which circumstances. Also, stopper.Stop() cannot be deadlocks if
called from inside a stopper task (and our code generally runs inside
stopper tasks), so we'd also need to add a new async stop interface to
the stopper.
Thus, the goal of this patch is to keep the number of callers to
Stopper.Stop() small - right now there are two of them: one in the cli,
and one in the drainServer. In a followup patch, I might remove the
drainServer call, and then we'd be left with exactly one caller to
Stopper.Stop().

Fixes #85540

Release note: None
Epic: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@craig craig bot closed this as completed in e873a0e Oct 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-observability-inf C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sre For issues SRE opened or otherwise cares about tracking.
Projects
None yet
3 participants