roachtest: election-after-restart failed #30613
Comments
These errors likely originate in DistSQL. I assume they get sent back to the client verbatim (or there would be a structured error return in the matches below, which I've checked there isn't).
DistSender definitely works differently: it's going to wrap these in either a SendError or an AmbiguousResult, but it won't leak them verbatim. What is it that we want to happen here in a DistSQL world? cc @jordanlewis |
SHA: https://github.com/cockroachdb/cockroach/commits/9f191cdad64747f7f19ec4990e559cbb19e6d372
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=928248&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/137b282943695b8d2aabcbd5d69be85b58746acd
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=933504&tab=buildLog
|
See cockroachdb#30613 (comment). Touches cockroachdb#30613 Release note: None
30384: workload/tpcc: adjust audit checks for 1 warehouse r=solongordon a=solongordon
I updated the TPCC audit checks not to complain about a lack of remote orders/payments when there is only one warehouse.
Fixes #29275
Release note: None

30817: roachtest: trace slow query in election test r=benesch a=tschottdorf
See #30613 (comment).
Touches #30613
Release note: None

30820: roachtest: update comment on scaledata tests r=petermattis a=tschottdorf
We're already running them under basic chaos.
Release note: None

30822: Update issue templates r=knz a=knz
I took the proposed default Github templates and "blended" our existing issue template into it. I also created a new "performance inquiry" template given the large number of incoming inquiries that are miscategorized as bugs.

30825: rpc,server: authenticate all gRPC methods r=bdarnell,petermattis a=benesch
Previously only the roachpb.Batch RPC was correctly checking for an authenticated user. All other RPCs were open to the public, even when the server was running in secure mode. To prevent future accidents of this kind, hoist the authentication check to a gRPC interceptor that is guaranteed to run before all RPCs.
Release note (bug fix): A security vulnerability in which data could be leaked from or tampered with in a cluster in secure mode has been fixed.

Co-authored-by: Solon Gordon <solon@cockroachlabs.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: kena <knz@users.noreply.github.com>
Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
SHA: https://github.com/cockroachdb/cockroach/commits/f78cc21bb4d7bd3a79e44b4608c04684edf28179
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=939960&tab=buildLog
|
Repro'ed in a few minutes of running
on gceworker, with the output intact (not as above). Artifacts attached: out.zip, which has the trace in test.log. |
Looks like the only thing we really need to look at is the below. There are basically two ranges that return RangeNotFound for significant periods of time (i.e. dozens of seconds).
|
Got this error spuriously too (probably what happened in the nightly failure above; I tweaked the test to report stderr in that case):
|
Heh, I incremented the connection timeout, now I see
cc @petermattis that thing really is everywhere. |
Meh, now it's just sitting there and not repro'ing anything. Well. It'll fail eventually. |
Ack. I'm working on it. |
Still not failing (except with the occasional breaker open). I must've gotten really lucky the first time around. |
Finally got one. The RangeNotFoundErrors all originate here:
This gets called a bunch in lots of places, but I think we're getting it from the lease renewal code. This smells like replica removal and waiting for GC, but in a 3x cluster? Time to sleuth some logs. |
I looked at another couple repros (which, btw, happened every couple minutes now) but didn't catch the snapshot thing again. Oh well. I assume this is unrelated and a sign that when lots of Raft log gets created right after the cluster start, we may sometimes truncate too aggressively (maybe the Raft status isn't fully populated at that point). |
SHA: https://github.com/cockroachdb/cockroach/commits/dc0e73c728e533fdb3bec63e53eec174e920ff22
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=952220&tab=buildLog
|
Previously, if a Batch RPC came back with a RangeNotFoundError, we would immediately stop trying to send to more replicas, evict the range descriptor, and start a new attempt after a back-off.

This new attempt could end up using the same replica, so if the RangeNotFoundError persisted for some amount of time, so would the unsuccessful retries for requests to it as DistSender doesn't aggressively shuffle the replicas.

It turns out that there are such situations, and the election-after-restart roachtest spuriously hit one of them:

1. new replica receives a preemptive snapshot and the ConfChange
2. cluster restarts
3. now the new replica is in this state until the range wakes up, which may not happen for some time.
4. the first request to the range runs into the above problem

@nvanbenschoten: I think there is an issue to be filed about the tendency of DistSender to get stuck in unfortunate configurations.

Fixes cockroachdb#30613.

Release note (bug fix): Avoid repeatedly trying a replica that was found to be in the process of being added.
SHA: https://github.com/cockroachdb/cockroach/commits/ac2f39fcc6be7366bc786d231890ee91e84f1c3c
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=955173&tab=buildLog
|
31013: kv: try next replica on RangeNotFoundError r=nvanbenschoten,bdarnell a=tschottdorf
Previously, if a Batch RPC came back with a RangeNotFoundError, we would immediately stop trying to send to more replicas, evict the range descriptor, and start a new attempt after a back-off.
This new attempt could end up using the same replica, so if the RangeNotFoundError persisted for some amount of time, so would the unsuccessful retries for requests to it as DistSender doesn't aggressively shuffle the replicas.
It turns out that there are such situations, and the election-after-restart roachtest spuriously hit one of them:
1. new replica receives a preemptive snapshot and the ConfChange
2. cluster restarts
3. now the new replica is in this state until the range wakes up, which may not happen for some time.
4. the first request to the range runs into the above problem
@nvanbenschoten: I think there is an issue to be filed about the tendency of DistSender to get stuck in unfortunate configurations.
Fixes #30613.
Release note (bug fix): Avoid repeatedly trying a replica that was found to be in the process of being added.

31187: roachtest: add synctest r=bdarnell a=tschottdorf
This new roachtest sets up a charybdefs on a single (Ubuntu) node and runs the `synctest` cli command against a nemesis that injects random I/O errors.
The synctest command is new. It simulates a Raft log and can be directed at a filesystem that is being hit with random failures.
The workload essentially writes ascending keys (flushing each one to disk synchronously) until an I/O error occurs, at which point it re-opens the instance to verify that all persisted writes are still there. If the RocksDB instance was permanently corrupted, it switches to a new, pristine, directory.
This is used in the roachtest, but is also useful for manual use in user deployments in which we suspect there is a failure to persist data to disk.
This hasn't found anything, but it's fun to watch and also shows us a number of errors that we know and love from sentry.
Release note: None

31215: storage: deflake TestStoreRangeMergeWatcher r=tschottdorf a=benesch
This test could deadlock if the LHS replica on store2 was shut down before it processed the split at "b". Teach the test to wait for the LHS replica on store2 to process the split before blocking Raft traffic to it.
Fixes #31096. Fixes #31149. Fixes #31160. Fixes #31167.
Release note: None

31217: importccl: add explicit default to mysql testdata timestamp r=dt a=dt
this makes the testdata work on mysql 8.0.2+, where the timestamp type no longer has the implicit defaults.
Release note: none.

31221: cluster: Create final cluster version for 2.1 r=bdarnell a=bdarnell
Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Ben Darnell <ben@bendarnell.com>
SHA: https://github.com/cockroachdb/cockroach/commits/b2bd8e8b6446a566b667be6094e019c1040ed98d
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=923560&tab=buildLog