sqlliveness: no deadlines on heartbeats #85541

Closed
joshimhoff opened this issue Aug 3, 2022 · 15 comments · Fixed by #87533
Labels
A-observability-inf C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sre For issues SRE opened or otherwise cares about tracking.

Comments

@joshimhoff
Collaborator

joshimhoff commented Aug 3, 2022

Describe the problem
From what I can see, there are no deadlines on sqlliveness heartbeats. This means a single hung heartbeat will expire a sqlliveness session. A hung heartbeat can happen for various reasons, e.g. temporary disk overload. Multi-tenant SQL pods currently crash if a sqlliveness session expires, so a single hung heartbeat can actually crash a multi-tenant SQL pod. I believe we have observed this in practice, though I'm only moderately confident about that claim (link to internal support issue: https://github.com/cockroachlabs/support/issues/1736).

To Reproduce
N/A

Expected behavior
There should be deadlines on sqlliveness heartbeats.
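For illustration, a minimal Go sketch of the kind of per-attempt deadline being asked for; `heartbeatOnce` and the 5s bound are hypothetical stand-ins, not the actual sqlliveness code:

```go
package sketch

import (
	"context"
	"time"
)

// heartbeatWithDeadline bounds a single heartbeat attempt so that one hung
// attempt cannot silently outlive the session it is supposed to extend.
// heartbeatOnce stands in for the query that updates the session record.
func heartbeatWithDeadline(ctx context.Context, heartbeatOnce func(context.Context) error) error {
	// 5s is purely illustrative; a real bound would presumably be derived
	// from the heartbeat interval or the session TTL.
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	return heartbeatOnce(ctx)
}
```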

Additional data / screenshots
Link to internal support issue: https://github.com/cockroachlabs/support/issues/1736

Jira issue: CRDB-18302

@joshimhoff joshimhoff added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sre For issues SRE opened or otherwise cares about tracking. labels Aug 3, 2022
@ajwerner ajwerner changed the title No deadlines on sqlliveness heartbeats sqlliveness: no deadlines on sqlliveness heartbeats Aug 3, 2022
@ajwerner ajwerner changed the title sqlliveness: no deadlines on sqlliveness heartbeats sqlliveness: no deadlines on heartbeats Aug 3, 2022
dhartunian added a commit to dhartunian/cockroach that referenced this issue Aug 3, 2022
Previously, sqlliveness heartbeat operations could block on the transactions
that were involved. This change introduces some timeouts of the length of the
heartbeat during the create and refresh operations.

TODO: tests
TODO: is this the right way and place to introduce the timeouts?

Resolves cockroachdb#85541

Release note: None
dhartunian added a commit to dhartunian/cockroach that referenced this issue Sep 6, 2022
Previously, sqlliveness heartbeat operations could block on the transactions
that were involved. This change introduces some timeouts of the length of the
heartbeat during the create and refresh operations.

Resolves cockroachdb#85541

Release note: None

Release justification: low-risk bugfix to existing functionality
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue Sep 7, 2022
Previously, sqlliveness heartbeat operations could block on the transactions
that were involved. This change introduces some timeouts of the length of the
heartbeat during the create and refresh operations.

This change also adds a `log.Fatal()` if the session expires while
trying to extend it.

Resolves cockroachdb#85541
Resolves cockroachdb#85540

Release note: None

Release justification: low-risk bugfix to existing functionality
craig bot pushed a commit that referenced this issue Sep 23, 2022
87533: sqlliveness: add timeouts to heartbeats r=ajwerner a=aadityasondhi

Previously, sqlliveness heartbeat operations could block on the transactions that were involved. This change introduces some timeouts of the length of the heartbeat during the create and refresh operations.

Resolves #85541

Release note: None

Release justification: low-risk bugfix to existing functionality

88293: backupccl: elide expensive ShowCreate call in SHOW BACKUP r=stevendanna a=adityamaru

In #88376 we see the call to `ShowCreate` taking ~all the time on a cluster with
2.5K empty tables. In all cases except `SHOW BACKUP SCHEMAS` we do not
need to construct the SQL representation of the table's schema. This
results in a marked improvement in the performance of `SHOW BACKUP`
as can be seen in #88376 (comment).

Fixes: #88376

Release note (performance improvement): `SHOW BACKUP` on a backup containing
several table descriptors is now more performant

88471: sql/schemachanger: plumb context, check for cancelation sometimes r=ajwerner a=ajwerner

Fixes #87246

This will also improve tracing.

Release note: None

88557: testserver: add ShareMostTestingKnobsWithTenant option r=msbutler a=stevendanna

The new ShareMostTestingKnobs copies nearly all of the testing knobs specified for a TestServer to any tenant started for that server.

The goal here is to make it easier to write tests that depend on testing hooks that work under probabilistic tenant testing.

Release justification: non-production code change

Release note: None

88562: upgrade grpc to v.1.49.0 r=erikgrinaker a=pavelkalinnikov

Fixes #81881
Touches #72083

Release note: upgraded grpc to v1.49.0 to fix a few panics that the old version caused

88568: sql: fix panic due to missing schema r=ajwerner a=ajwerner

A schema might not exist because it has been dropped. We need to mark the lookup as required.

Fixes #87895

Release note (bug fix): Fixed a bug in pg_catalog tables which could result in an internal error if a schema is concurrently dropped.

Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Aaditya Sondhi <aadityas@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
@craig craig bot closed this as completed in 4bea89f Sep 23, 2022
blathers-crl bot pushed a commit that referenced this issue Sep 23, 2022
Previously, sqlliveness heartbeat operations could block on the transactions
that were involved. This change introduces some timeouts of the length of the
heartbeat during the create and refresh operations.

Resolves #85541

Release note: None

Release justification: low-risk bugfix to existing functionality
@andreimatei
Contributor

@joshimhoff , the timeout that was introduced as a result of this issue having been filed has been removed because its implementation was buggy. Further, we couldn't convince ourselves that timeouts over there serve any purpose - why would one want to fail the respective query that's trying to heartbeat the session record, only for it to be retried, as opposed to letting it hang?
There's a link above to a Slack conversation that suggests that perhaps observability would be better if these queries failed instead of hanging, but that seems unconvincing - there is a metric for the success of these queries, and the absence of success tells a story.

If you have thoughts, feel free to reopen the issue.

@joshimhoff
Collaborator Author

I don't think it's about o11y. I think it's about reliability. The ticket description re: this is admittedly a bit vague. I include it below to see if we can make the argument more concrete.

This means a single hung heartbeat will expire a sqlliveness session. A hung heartbeat can happen for various reasons, e.g. temporary disk overload. Multi-tenant SQL pods currently crash if a sqlliveness session expires, so a single hung heartbeat can actually crash a multi-tenant SQL pod.

I think we can rephrase ^^ as a question: Are there situations that happen in production where without deadlines & retries a hung heartbeat will stay hung forever leading to a SQL pod crash? Unless we are very sure the answer is no, I think setting a deadline plus doing retries is a good pattern, in particular since the consequence of session expiration is currently a SQL pod crashing. CC @ajwerner in case he has thoughts.
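For concreteness, a rough sketch of the deadline-plus-retries pattern described above (names, values, and the backoff are hypothetical, not the sqlliveness implementation):

```go
package sketch

import (
	"context"
	"time"
)

// keepAlive retries a bounded heartbeat attempt until one succeeds or the
// outer context (e.g. the session/server lifetime) is done. A hung attempt
// is abandoned at attemptTimeout rather than being allowed to hang forever.
func keepAlive(ctx context.Context, attemptTimeout time.Duration, heartbeatOnce func(context.Context) error) error {
	for {
		attemptCtx, cancel := context.WithTimeout(ctx, attemptTimeout)
		err := heartbeatOnce(attemptCtx)
		cancel()
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second): // illustrative backoff between retries
		}
	}
}
```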

@joshimhoff joshimhoff reopened this Nov 14, 2022
@andreimatei
Contributor

Are there situations that happen in production where without deadlines & retries a hung heartbeat will stay hung forever leading to a SQL pod crash?

A heartbeat query can hang when faced with various outages, but in those situations the retries will hang as well - so I don't think they serve any purpose other than to muddy the code. You can imagine bugs that would cause a retry to succeed where the original query is hanging, but I don't think we want to protect against such things.

@joshimhoff
Collaborator Author

joshimhoff commented Nov 14, 2022

You can imagine bugs that would cause a retry to succeed where the original query is hanging, but I don't think we want to protect against such things.

By "bug" do you mean a bug in the CRDB code, or do you mean certain poor infra conditions, e.g. blips in the network? I think we certainly want to be robust to the latter. Re: the former, I think we do too, but I can see that as being a more contentious view.

@andreimatei
Contributor

I mean a bug in CRDB code. I don't think sprinkling timeouts in more or less random places to protect against hypothetical bugs is a good general policy. To be pedantic: in my opinion, as far as most code is concerned, the only thing that matters when a bug is present is how easy it is to find the bug - so things that help debugging are good; things that hide bugs are bad; and things that transparently recover from bugs are generally not worth it. In this case, the timeout we're talking about, at least in the straightforward way to implement it, would destroy the state of the stuck operations (i.e. the stack traces involved are gone).

@joshimhoff
Collaborator Author

joshimhoff commented Dec 8, 2022

We don't have the same view about the utility of higher level controls reducing the impact of bugs in lower level code, but the main reason I want a timeout in place is NOT to protect against bugs in the CRDB code. I want it to keep poor infra conditions, e.g. in the network, from leading to SQL pod crashes. I guess the basic idea here is that there are infra conditions that happen in practice that lead to a hung heartbeat that succeeds on retry. I don't know if that is actually a real thing (e.g. TCP is already doing retries). As I think more about it, I realize I haven't introspected deeply on why I want timeouts until this discussion, so thanks for having it with me.

At the same time, there is a more basic argument, though, which is that we should not crash the server unless that actually benefits the user in some way. In the case of a stuck SQL liveness heartbeat, crashing the SQL pod does nothing useful for the user, and it reduces reliability, at the very least by slowing down recovery (restarts take time, especially in some environments) and breaking a bunch of stateful SQL connections.

I am interested in hearing what some other people think about this, as I realize my view is a bit uncertain.

CC @ajwerner @bdarnell

@andreimatei
Contributor

I guess the basic idea here is that there are infra conditions that happen in practice that lead to a hung heartbeat that succeeds on retry. I don't know if that is actually a real thing (e.g. TCP is already doing retries).

I don't think it's a real thing.


I agree that killing the process on session expiration is a bad idea. The only tiny bit written about it that I can find is here. I have failed to convince others to rethink this, and I didn't see it within my power to change when I was touching this code (and I was touching it, btw, to fix bugs partially introduced by this timeout having been added). Killing the process is unfortunately tied to other decisions about how the sql instance IDs are allocated - other instances are quick to steal the ID corresponding to an expired session, and having multiple servers use the same ID is not good for various reasons.

@ajwerner
Contributor

ajwerner commented Dec 8, 2022

We can definitely do more to avoid killing the process. I don't think that work is particularly related to the discussion on timeouts.

@bdarnell
Member

bdarnell commented Dec 8, 2022

I guess the basic idea here is that there are infra conditions that happen in practice that lead to a hung heartbeat that succeeds on retry. I don't know if that is actually a real thing (e.g. TCP is already doing retries).

I think this is possible if the network is acting up and one TCP connection is horked but the packets are still flowing and new connections can be created. This is why we have RPC-level heartbeats on our inter-node connections. Are those being used for SQL pods? I think they should be. Without them we're reliant on TCP keepalives which IIRC have limited configurability and long timeouts.

Other than network-level issues which are better addressed with network-level heartbeats and timeouts, I think most things that would cause one heartbeat to hang would also cause the next attempt to hang, so there's little value in using a timeout to retry the exact same request.

However, AIUI, it's not the exact same request - the update includes an expiration time which was in the future when the attempt started. If that time is now in the past, it's better to cancel the previous attempt and start over. I don't think that's going to actually make much difference because the new attempt will still hang, but at least it means that things have a chance of working as soon as the blockage is cleared, instead of leaving the now-useless original request in flight.

We implemented timeouts for this reason in node liveness, and I think it's turned out to be important there. So it seems like it's probably going to be worthwhile to do something similar for sqlliveness.
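A rough sketch of that cancel-and-restart idea (hypothetical names; not the node liveness or sqlliveness code): each attempt proposes an expiration of now+TTL and is abandoned once that proposed expiration has itself passed, so the next attempt carries a fresh, still-useful expiration:

```go
package sketch

import (
	"context"
	"time"
)

// extendSession keeps trying to write a new expiration for the session.
// Each attempt is cancelled once the expiration it proposed is already in
// the past, since extending a session to a time that has passed is useless.
func extendSession(ctx context.Context, ttl time.Duration, writeExpiration func(context.Context, time.Time) error) error {
	for {
		proposed := time.Now().Add(ttl)
		attemptCtx, cancel := context.WithDeadline(ctx, proposed)
		err := writeExpiration(attemptCtx, proposed)
		cancel()
		if err == nil {
			return nil
		}
		if ctx.Err() != nil {
			return ctx.Err() // outer context (e.g. server shutdown) is done
		}
		// Otherwise retry immediately with a fresh proposed expiration.
	}
}
```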

@ajwerner
Contributor

ajwerner commented Dec 8, 2022

Are those being used for SQL pods?

Yes, the connections with the sql pods use the same heartbeat logic.
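For context, one common form of RPC-level keepalive is gRPC's own keepalive pings, sketched below with illustrative values; this is background only, not how the CockroachDB rpc package implements its inter-node heartbeats, and the insecure credentials are for illustration:

```go
package sketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive opens a client connection that pings the server when the
// connection is idle and fails fast if the ping is not acknowledged, instead
// of relying on long OS-level TCP keepalive timeouts.
func dialWithKeepalive(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()), // illustration only
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // ping after this much idle time
			Timeout:             3 * time.Second,  // drop the connection if the ping isn't acked
			PermitWithoutStream: true,             // ping even with no active RPCs
		}),
	)
}
```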

@andreimatei
Contributor

However, AIUI, it's not the exact same request - the update includes an expiration time which was in the future when the attempt started. If that time is now in the past, it's better to cancel the previous attempt and start over. I don't think that's going to actually make much difference because the new attempt will still hang, but at least it means that things have a chance of working as soon as the blockage is cleared, instead of leaving the now-useless original request in flight.

We implemented timeouts for this reason in node liveness, and I think it's turned out to be important there. So it seems like it's probably going to be worthwhile to do something similar for sqlliveness.

The context is different between the sql liveness session being heartbeated here and node liveness. For node liveness, recovering as soon as possible is pretty important: if the liveness range is only able to process one heartbeat every, say, 10s, the one heartbeat it eventually processes had better be one that's not already expired. (Although flapping between alive and expired is probably worse than being consistently expired, so I question even that timeout.)
Here, there are two differences:

  1. The expiration of sql liveness is much slower - 40s. So the respective range has much more time to process the heartbeats before sessions expire.
  2. Once the session expires, the process shuts down. So, given the 40s liveness expiration, there's quite a limited window where new requests are better than the original, stuck request.

Do you buy that this color makes a difference?

How I feel about this timeout is also influenced by the fact that, when we added this timeout, we also introduced a fat bug. I take that as proof that this kind of complication to the code should not be added gratuitously.

@ajwerner
Contributor

ajwerner commented Dec 8, 2022

It seems to me like the order of operations here should be to make it unlikely that the process needs to shut down upon expiration first, and then to talk about adding a timeout.

@joshimhoff
Collaborator Author

Appreciate the color from everyone; it's clarifying.

It seems to me like the order of operations here should be to make it unlikely that the process needs to shut down upon expiration first, and then to talk about adding a timeout.

Seems reasonable, esp. given (IIUC) the point Andrei is making about there being little time between session expiration and server shutdown.

How I feel about this timeout is also influenced by the fact that, when we added this timeout, we also introduced a fat bug. I take that as proof that this kind of complication to the code should not be added gratuitously.

Alternatively, that code is too complex, with the proof being that we introduced a bug by adding timeouts ;)

@bdarnell
Member

bdarnell commented Dec 8, 2022

Once the session expires, the process shuts down. So, given the 40s liveness expiration, there's quite a limited window where new requests are better than the original, stuck request.
Do you buy that this color makes a difference?

Yes. We'd need to make session expiration non-fatal before adding a timeout would make sense.

@andreimatei
Contributor

Thanks all. Closing...
