
draft node failure KB issue; clean up cluster troubleshooting h2 nav #14528

Merged
taroface merged 8 commits into master from kb-failure-recovery on Jul 20, 2022

Conversation

@taroface (Contributor) commented on Jul 13, 2022

Fixes DOC-2405.

  • Added troubleshooting issue for >10s node/region/AZ failure scenarios, based on the KB article.
  • Also cleaned up the cluster troubleshooting side nav by bringing all non-h2s down a level. The nav had become very difficult to read/navigate.

@lunevalex

  • I'll also request a broader KV team review as you suggested. Note that there are two TKs in one of the bullet points; I don't know the correct CRDB versions to call out here. Can you help?
  • I also split the "liveness range" and "system.users" topics into two separate scenarios, but let me know if that was mistaken.

@taroface taroface requested a review from lunevalex July 13, 2022 16:05
@cockroach-teamcity (Member) commented: This change is Reviewable

@netlify bot commented Jul 13, 2022

Netlify Preview

🔨 Latest commit: 73b63d6
🔍 Latest deploy log: https://app.netlify.com/sites/cockroachdb-docs/deploys/62d88bbe027f8a000842a0ef
😎 Deploy Preview: https://deploy-preview-14528--cockroachdb-docs.netlify.app

@erikgrinaker erikgrinaker self-requested a review July 13, 2022 22:59
@@ -479,6 +479,25 @@ To view command commit latency:

**Expected values for a healthy cluster**: On SSDs, this should be between 1 and 100 milliseconds. On HDDs, this should be no more than 1 second. Note that we [strongly recommend running CockroachDB on SSDs](recommended-production-settings.html#storage).

#### Impact of node failure is greater than 10 seconds

When a node fails its liveness check, [each of its leases is transferred to a healthy node](architecture/replication-layer.html#how-leases-are-transferred-from-a-dead-node).
Contributor commented:

This statement is somewhat misleading because it implies all the leases are transferred proactively, when in reality they are transferred based on need. In other words, a lease transfer only occurs when someone needs something from the range. This someone could be a user query or one of the queues, e.g. the replicate queue. If no one needs anything from the range, then the range will remain without a lease, which is OK.
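
To make the on-demand behavior concrete, here is a minimal sketch (the table name is hypothetical, and the `SHOW RANGES` output is assumed to include a `lease_holder` column, as in the versions discussed): any read forces lease acquisition on the ranges it touches, after which the leaseholders can be inspected.

```sql
-- Hypothetical table; a read forces lease acquisition on the ranges it touches.
SELECT count(*) FROM test_table;

-- Inspect which node now holds each lease (lease_holder column assumed).
SHOW RANGES FROM TABLE test_table;
```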

@taroface (Author) replied:

Thanks for this clarification! I decided to reword both this and the corresponding line in the Replication Layer doc. Let me know if it's now accurate.

When a node fails its liveness check, [each of its leases is transferred to a healthy node](architecture/replication-layer.html#how-leases-are-transferred-from-a-dead-node
). In theory, this process should take no more than 9 seconds for liveness expiration plus the cost of 2 network roundtrips: 1 for Raft leader election, and 1 for lease acquisition.

In production, depending on the version of CockroachDB you are running, lease transfer upon node failure can be "non-cooperative" and can take longer than expected. This is observed in the following scenarios:
Contributor commented:

Since these are per-version pages, I would say "In this version of Cockroach it works this way, etc." All lease transfers during a failure are non-cooperative.

@taroface (Author) replied:

Makes sense; I have amended all 4 pages to be version-specific. PTAL.

v20.2/cluster-setup-troubleshooting.md (outdated; resolved)
@erikgrinaker (Contributor) left a comment:

Thanks for writing this up! This is a complex and vague topic, and I don't think anyone really knows all of the details and failure modes here, so I appreciate you starting to enumerate them.

Added a couple of comments, let me know if you'd like to go deeper on any of this, but I think it's also fine to gloss over the edge cases.

v22.1/cluster-setup-troubleshooting.md (outdated; resolved)

- **A node comes in and out of liveness, due to network failures or overload.** Prior to version TK, a "flapping" node could repeatedly lose and acquire leases in between its intermittent heartbeats, causing wide-ranging unavailability. In versions TK and later, a node must successfully heartbeat for 30 seconds after failing a heartbeat before it is eligible to acquire leases.

- **Network issues cause connection issues between nodes or DNS.** Prior to v20.2.18 and v21.1.12, this could cause 10 to 60 seconds of unavailability as the system stalled on network throughput, preventing a speedy movement of leases and recovery. In subsequent versions, CockroachDB avoids contacting unresponsive nodes or DNS during certain performance-critical operations.
@erikgrinaker (Contributor) commented on Jul 14, 2022:

We still struggle with these scenarios. If a failure doesn't cause an immediate error to be returned, but instead prevents/delays a response at all, then CockroachDB can still appear to stall for prolonged periods of time. This can occur e.g. because of an internal deadlock, a disk stall, a network outage, or for other reasons. We solved some related issues e.g. here, which was backported to 21.2 and 22.1:

cockroachdb/cockroach#81136

However, issues still remain. As an illustration of the problem, let's say a node becomes unresponsive (rather than crashing), and a client runs a SELECT * FROM table that will scan 10 ranges. This will first attempt to contact the old leaseholder of the first range, and only after that times out (after 6 seconds) will it contact a different replica which can then acquire the lease. If there hasn't been any activity on the second range in the meanwhile, then contacting the second range will also wait for 6 seconds before timing out, and so on. So in total, this query can take up to 60 seconds before returning, in the worst case where all 10 leaseholders were on the faulty node.

Another issue is asynchronous network partitions, where a node either loses connectivity with some but not all of the nodes in the cluster, or a node is able to send but not receive messages. We do not handle these scenarios very well at all, in that leases may remain on the faulty node even though clients are not able to contact the leaseholder. Do we have any documentation around this already? If not, it might be worth writing something up separately.

In all of these scenarios, the right thing to do is to shut down the node such that network requests will immediately become hard errors rather than stalling.
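
As a rough way to gauge exposure to this worst case, one could check how a table's leaseholders are distributed across nodes before a failure. A sketch, assuming the `crdb_internal.ranges` columns available in the versions discussed and a hypothetical table name:

```sql
-- Count leaseholders per node; a heavy skew toward one node means a failure of
-- that node stalls more ranges at once while leases time out and move.
SELECT lease_holder, count(*) AS leases
FROM crdb_internal.ranges
WHERE table_name = 'test_table'
GROUP BY lease_holder
ORDER BY leases DESC;
```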

@taroface (Author) commented on Jul 14, 2022:

TFTR Erik! I tried to cover the "unresponsive node" topic in a new, separate bullet. But now I wonder if the "Network issues cause connection issues between nodes or DNS." bullet needs to be rephrased to sound more distinct. What do you think?

About the network partition, do you mean an asymmetric network partition, or is asynchronous another type? If the former, we have a pretty light section on the same page. It is overdue for an update. I have a high-priority ticket to expand our network troubleshooting docs, and @mwang1026 has suggested that this could focus on asymmetric network partitions, and how to proactively look for them. So I think I should save your latter example for that issue. If you have further details (on what happens in these network partition scenarios, and how one might address them), please feel free to comment in that ticket.

Contributor replied:

> I tried to cover the "unresponsive node" topic in a new, separate bullet. But now I wonder if the "Network issues cause connection issues between nodes or DNS." bullet needs to be rephrased to sound more distinct. What do you think?

There are many failure modes that can render a node unresponsive. I think it makes sense to split these different modes out into separate bullet points. Maybe something like this (not verbatim):

  • Network or DNS issues cause connection issues between nodes. In particular, the case where there is no live server for the IP address or DNS lookup, such that connection attempts will hang until they time out rather than getting an immediate error from the remote node. We've done some work on this, as you've outlined, and generally we expect these to resolve within a reasonable period (10-30 seconds or thereabouts), but there could be cases we've missed. For example, we have TCP-level circuit breakers that will detect an unresponsive node and only occasionally try to connect to it, otherwise erroring immediately.

  • Disk stalls. If a node's disk stalls, this can cause write operations to stall indefinitely. This will also cause node heartbeats to fail, since we do a synchronous disk write as part of the heartbeat, and it may cause read requests to fail if they're waiting for a conflicting write to complete. Prior to the PRs listed in cockroachdb/cockroach#81100 ("kvserver: disk stall prevents lease transfer"), these could cause leases to get stuck indefinitely, with the only mitigation being restarting the node. Pebble will often detect these stalls and kill the node (after 20 seconds in 22.1, 60 seconds in earlier versions, configurable via `storage.max_sync_duration`), but there have been known gaps in this detection that were fixed in 22.1. (See the sketch after this list for inspecting that setting.)

  • Otherwise unresponsive nodes. The typical example is an internal deadlock due to faulty code, but there can also be other cases (e.g. resource exhaustion, OS/hardware issues, and other arbitrary failures). This can cause leases to get stuck in certain cases (typically where we rely on a response from the old leaseholder before moving the lease).

In all of these cases, shutting down the node will resolve the issue.
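
For the disk-stall bullet above, the detection threshold mentioned there can be inspected and adjusted through the cluster setting named in the comment. A minimal sketch, assuming the setting accepts an interval value as shown:

```sql
-- Current stall-detection threshold (60s default in earlier versions, 20s in
-- 22.1, per the discussion above).
SHOW CLUSTER SETTING storage.max_sync_duration;

-- Illustrative only: lowering the threshold to 20 seconds.
SET CLUSTER SETTING storage.max_sync_duration = '20s';
```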

> About the network partition, do you mean an asymmetric network partition, or is asynchronous another type?

Ah, sorry, you're right -- I meant an asymmetric network partition. I agree that this is better documented separately, like in that ticket.

@taroface (Author) replied:

I added these across the docs versions - let me know how they look.

Regarding circuit breakers, is that the same feature documented here: https://www.cockroachlabs.com/docs/v22.1/architecture/replication-layer.html#per-replica-circuit-breakers

If so, do you mean to emphasize that the circuit breakers introduced in v22.1 can usually throw errors (good), or that they sometimes try to connect and don't return an error (bad)?

@erikgrinaker (Contributor) commented on Jul 20, 2022:

Thanks, I think these struck the right balance in terms of detail. One nit in the intro: I don't think 2 network round trips is accurate for election+lease; I would rephrase it as "a few network round trips".

> Regarding circuit breakers, is that the same feature documented here

No, that's a different circuit breaker, but it serves a similar purpose. That circuit breaker is for scenarios where ranges lose quorum and become unresponsive -- normally, this would cause all requests/queries to that range to hang while waiting for quorum, which can be problematic since they e.g. take up slots in connection pools, which can cascade into broader outages. The circuit breaker will detect this and instead return immediate hard errors for the range.

The circuit breaker we discussed above is at the RPC layer, when attempting to connect to nodes. If a node does not have a live TCP stack (i.e. the server is powered off), then connection attempts will hang until they time out. We detect this and instead immediately error on further connection attempts, to prevent such delays from affecting client requests.

I don't think we necessarily need to spell any of this out, I think what you've written here is fine.
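
As a side note, the per-replica circuit breaker mentioned above is governed by a v22.1 cluster setting; a sketch for inspecting it (the setting name is assumed from the v22.1 docs):

```sql
-- How long a range may be unavailable before its circuit breaker trips and
-- requests fail fast instead of hanging (v22.1; setting name assumed).
SHOW CLUSTER SETTING kv.replica_circuit_breaker.slow_replication_threshold;
```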

@erikgrinaker (Contributor) left a comment:

Thanks again for writing this up!

@taroface taroface requested a review from kathancox July 20, 2022 16:04
@taroface (Author) commented:

TFTR! I really appreciate the details and examples.

@kathancox (Contributor) left a comment:

Just a couple of comments that apply to all versions, but otherwise looks good to me.


- **Network or DNS issues cause connection issues between nodes.** If there is no live server for the IP address or DNS lookup, connection attempts to a node will not return an immediate error, but will hang until timing out. This can cause unavailability and prevent a speedy movement of leases and recovery. In **v20.2.18, v21.1.12, and later**, CockroachDB avoids contacting unresponsive nodes or DNS during certain performance-critical operations, and the connection issue should generally resolve in 10-30 seconds. However, an attempt to contact an unresponsive node could still occur in other scenarios that are not yet addressed.

- **A node's disk stalls.** A disk stall on a node can cause write operations to stall indefinitely, causes the node's heartbeats to fail since the storage engine cannot write to disk as part of the heartbeat, and may cause read requests to fail if they are waiting for a conflicting write to complete. Lease acquisition from this node can stall indefinitely until the node is shut down or recovered. Pebble detects most stalls and will terminate the `cockroach` process after 60 seconds, but there are gaps in its detection. In **v21.2.13, v22.1.2, and later**, each lease acquisition attempt on an unresponsive node times out after 6 seconds. However, CockroachDB can still appear to stall as these timeouts are occurring.
@kathancox (Contributor) commented:

"A disk stall on a node can cause write operations to stall indefinitely, causes the node's heartbeats to fail"

Should this be: "indefinitely, causing the node's..." OR "indefinitely, which causes the nodes'"


- **A node's disk stalls.** A disk stall on a node can cause write operations to stall indefinitely, causes the node's heartbeats to fail since the storage engine cannot write to disk as part of the heartbeat, and may cause read requests to fail if they are waiting for a conflicting write to complete. Lease acquisition from this node can stall indefinitely until the node is shut down or recovered. Pebble detects most stalls and will terminate the `cockroach` process after 60 seconds, but there are gaps in its detection. In **v21.2.13, v22.1.2, and later**, each lease acquisition attempt on an unresponsive node times out after 6 seconds. However, CockroachDB can still appear to stall as these timeouts are occurring.

- **Otherwise unresponsive nodes.** A node can be made unresponsive by internal deadlock due to faulty code, resource exhaustion, OS/hardware issues, and other arbitrary failures. This can cause leases to become stuck in certain cases, such as when a response from the previous leaseholder is needed in order to move the lease.
@kathancox (Contributor) commented:

I wonder if reversing the sentence here and removing the passive could be a good option:

"Internal deadlock due to faulty code, resource exhaustion, OS/hardware issues, and other arbitrary failures, can make a node unresponsive."


If a node has become unresponsive without returning an error, [shut down the node](node-shutdown.html) so that network requests immediately become hard errors rather than stalling.

If you are running a version of CockroachDB that is affected by an issue described here, [upgrade to a version](upgrade-cockroach-version.html) that contains the fix for the issue.
@kathancox (Contributor) commented:

Would it be worth linking to or mentioning where readers could find that information? I'm assuming release notes? Perhaps directing to the major release note pages — not sure. The upgrade page mentions the breaking changes, but that doesn't necessarily contain the fixes that readers might be looking for.


- **Network or DNS issues cause connection issues between nodes.** If there is no live server for the IP address or DNS lookup, connection attempts to a node will not return an immediate error, but will hang until timing out. This can cause unavailability and prevent a speedy movement of leases and recovery. In **v21.1.12 and later**, CockroachDB avoids contacting unresponsive nodes or DNS during certain performance-critical operations, and the connection issue should generally resolve in 10-30 seconds. However, an attempt to contact an unresponsive node could still occur in other scenarios that are not yet addressed.

- **A node's disk stalls.** A [disk stall](#disk-stalls) on a node can cause write operations to stall indefinitely, causes the node's heartbeats to fail since the storage engine cannot write to disk as part of the heartbeat, and may cause read requests to fail if they are waiting for a conflicting write to complete. Lease acquisition from this node can stall indefinitely until the node is shut down or recovered. Pebble detects most stalls and will terminate the `cockroach` process after 60 seconds, but there are gaps in its detection. In **v21.2.13, v22.1.2, and later**, each lease acquisition attempt on an unresponsive node times out after 6 seconds. However, CockroachDB can still appear to stall as these timeouts are occurring.
@kathancox (Contributor) commented:

Same comment here as per the other version! re the "causes the node's..."


- **A node's disk stalls.** A [disk stall](#disk-stalls) on a node can cause write operations to stall indefinitely, causes the node's heartbeats to fail since the storage engine cannot write to disk as part of the heartbeat, and may cause read requests to fail if they are waiting for a conflicting write to complete. Lease acquisition from this node can stall indefinitely until the node is shut down or recovered. Pebble detects most stalls and will terminate the `cockroach` process after 60 seconds, but there are gaps in its detection. In **v21.2.13, v22.1.2, and later**, each lease acquisition attempt on an unresponsive node times out after 6 seconds. However, CockroachDB can still appear to stall as these timeouts are occurring.

- **Otherwise unresponsive nodes.** A node can be made unresponsive by internal deadlock due to faulty code, resource exhaustion, OS/hardware issues, and other arbitrary failures. This can cause leases to become stuck in certain cases, such as when a response from the previous leaseholder is needed in order to move the lease.
@kathancox (Contributor) commented:

Same comment here as per the other version re reversing the sentence.


If a node has become unresponsive without returning an error, [shut down the node](node-shutdown.html) so that network requests immediately become hard errors rather than stalling.

If you are running a version of CockroachDB that is affected by an issue described here, [upgrade to a version](upgrade-cockroach-version.html) that contains the fix for the issue.
@kathancox (Contributor) commented:

And same comment here re the possible explicit direction to release notes.

@taroface (Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @kathancox, @lunevalex, @mwang1026, and @taroface)


v20.2/cluster-setup-troubleshooting.md line 496 at r6 (raw file):

Previously, kathancox (Kathryn Hancox) wrote…

"A disk stall on a node can cause write operations to stall indefinitely, causes the node's heartbeats to fail"

Should this be: "indefinitely, causing the node's..." OR "indefinitely, which causes the nodes'"

Clarified!


v20.2/cluster-setup-troubleshooting.md line 498 at r6 (raw file):

Previously, kathancox (Kathryn Hancox) wrote…

I wonder if reversing the sentence here and removing the passive could be a good option:

"Internal deadlock due to faulty code, resource exhaustion, OS/hardware issues, and other arbitrary failures, can make a node unresponsive."

Done.


v20.2/cluster-setup-troubleshooting.md line 504 at r6 (raw file):

Previously, kathancox (Kathryn Hancox) wrote…

Would it be worth linking to or mentioning where readers could find that information? I'm assuming release notes? Perhaps directing to the major release note pages — not sure. The upgrade page mentions the breaking changes, but that doesn't necessarily contain the fixes that readers might be looking for.

The fix versions are meant to be conveyed by the list in this very section - I've spelled this out a bit more!

@taroface taroface merged commit d8a38cb into master Jul 20, 2022
@taroface taroface deleted the kb-failure-recovery branch July 20, 2022 23:19
@taroface taroface mentioned this pull request Jul 20, 2022