sql: fatal storage/replica_corruption when running a schema change #36937
Comments
When trying to run debug.zip on the non-failed nodes I see:
@andreimatei, @nvanbenschoten, @tbg is it possible this is another symptom of #36861?
@awoods187 you were running `debug zip` on node 2 but gave it node 1 to talk to (…).

Two nodes failed with this error: n13 (r33237/1) and n1 (r33237/3). The error happens while evaluating the end transaction for a split. That two nodes died here suggests that the split was retried by another leaseholder after the first crash, which then crashed itself. The good thing about this is that the data causing the problem is likely still there, on all three replicas of the range. I was just able to verify that …

The Raft log is... long, so we might have a good shot at learning something from it, too. It's too late to look here today, but I'll take a look tomorrow.

It would strike me as odd for this to be caused by #36861, simply because it looks so different (and skipping individual Raft commands shouldn't give you MVCC-level inconsistencies), though we'll see.

The error occurs in cockroach/c-deps/libroach/mvcc.cc, lines 186 to 189 in d7788f9.
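For illustration, here's a minimal Go sketch of the invariant that assertion enforces — names and types are mine; the real check lives in C++ and operates on the raw MVCC iterator:

```go
package main

import "fmt"

// versionedKV models a decoded versioned MVCC key/value pair. These names
// are hypothetical; the actual check lives in c-deps/libroach/mvcc.cc.
type versionedKV struct {
	key        string
	ts         int64 // wall time of this version
	valueBytes int   // raw value length; 0 means a deletion tombstone
}

// checkTopVersion mirrors the invariant the stats computation relies on:
// if the meta record says the most recent version is a deletion, then the
// first versioned key following the meta key must carry an empty value.
func checkTopVersion(metaDeleted bool, top versionedKV) error {
	if metaDeleted && top.valueBytes != 0 {
		return fmt.Errorf("replica corruption: expected deletion tombstone at %s/%d, found %d value bytes",
			top.key, top.ts, top.valueBytes)
	}
	return nil
}

func main() {
	// The crash corresponds to this case: meta claims a deletion, but the
	// adjacent versioned key holds a non-empty value.
	fmt.Println(checkTopVersion(true, versionedKV{key: "/Table/57/…", ts: 1555570000, valueBytes: 12}))
}
```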
@awoods187 please make sure this cluster doesn't expire until Wednesday.
In the logs, I'm seeing this fire again (cockroach/pkg/storage/engine/mvcc.go, lines 2247 to 2250 in c65b71a).
@nvanbenschoten I think I asked you about this on another thread, but I don't know if I got an answer -- do we really not expect this to fire (on deletions, perhaps)? I guess this was running TPCC. The table is …

This one also fired (cockroach/pkg/storage/engine/mvcc.go, lines 2216 to 2218 in c65b71a).
I think we can see that case if:
I'm guessing the comment says it's not possible because it thinks that it called
I'm actually kind of confused by these cases. It seems like the …

We can discuss in #36947.
r33237.n1.txt

The Raft logs span ~600 entries. There is one SSTable ingestion just before the crash (the crash happened after 28021, the end of the log on n13). The logs agree, so if #36861 caused this, it would have had to happen more than 600 entries back. If I had to take a wild guess, I'd say that looking into that SSTable at …

The other thing to fix here is the error. It should've told us much more than it did, for example which key this happened on and what the rest of the meta looked like. I suppose I'll poke at the data dir with a bespoke tool and figure out what key I'm supposed to look at.
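As a sketch of the kind of context the error could carry — field names here are illustrative, not the actual enginepb types:

```go
package main

import "fmt"

// corruptionError shows the extra detail the assertion could have reported:
// the offending key, its timestamp, and the decoded meta record. Field
// names are hypothetical.
type corruptionError struct {
	Key                        string
	Timestamp                  int64
	MetaDeleted                bool
	MetaKeyBytes, MetaValBytes int64
}

func (e *corruptionError) Error() string {
	return fmt.Sprintf("expected deletion tombstone at %s/%d; meta={deleted:%t keyBytes:%d valBytes:%d}",
		e.Key, e.Timestamp, e.MetaDeleted, e.MetaKeyBytes, e.MetaValBytes)
}

func main() {
	fmt.Println(&corruptionError{Key: "/Table/57/…", Timestamp: 1555570000, MetaDeleted: true})
}
```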
Here's the bespoke debugging. On n1 and n15, the output is identical. On n13 it definitely looks similar but looking closer there are two differences:
Note that the key and value here are for the versioned key that causes the trouble. The associated meta key is read earlier. That n13 has a different key with that problem might be unexpected, but I think the explanation might just be that they're at different log positions. n1 and n15 are at applied index 32448; n13 is behind at 28021:
This, in combination with the logs, is actually a good sign: it indicates that something happened between these two log positions that "fixed" one problem and introduced one on an adjacent key. We don't have all of the log indexes in between, but there's hope that we can reconstruct something.
Hmm, something I missed earlier is that there are write batches that I can't seem to decode:
For example, index 27651. Going to have to look into that first.
Via …

Going to soup up the write batch pretty-printer so that it at least prints these two entries.
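For reference, the batch repr format the pretty-printer walks is RocksDB's: a 12-byte header (8-byte little-endian sequence number, 4-byte little-endian entry count) followed by tagged entries, each a one-byte type tag plus uvarint-length-prefixed strings. Here's a simplified, hedged sketch of such a decoder, handling only puts and deletions:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Entry type tags from the RocksDB write batch format.
const (
	tagDeletion = 0x0
	tagValue    = 0x1
)

// decodeVarString reads a uvarint-prefixed string and returns it along with
// the unconsumed remainder of the repr.
func decodeVarString(repr []byte) (s, rest []byte, err error) {
	v, n := binary.Uvarint(repr)
	if n <= 0 {
		return nil, nil, fmt.Errorf("malformed varint")
	}
	repr = repr[n:]
	if v > uint64(len(repr)) {
		return nil, nil, fmt.Errorf("varstring needs %d bytes, only %d remain", v, len(repr))
	}
	return repr[:v], repr[v:], nil
}

// decodeBatch walks the repr: header, then tagged entries. Deletions carry
// only a key; puts carry a key and a value.
func decodeBatch(repr []byte) error {
	if len(repr) < 12 {
		return fmt.Errorf("batch repr too small: %d bytes", len(repr))
	}
	count := binary.LittleEndian.Uint32(repr[8:12])
	repr = repr[12:]
	for i := uint32(0); i < count; i++ {
		if len(repr) == 0 {
			return fmt.Errorf("expected %d entries, ran out after %d", count, i)
		}
		tag := repr[0]
		repr = repr[1:]
		key, rest, err := decodeVarString(repr)
		if err != nil {
			return err
		}
		repr = rest
		switch tag {
		case tagDeletion:
			fmt.Printf("Delete %q\n", key)
		case tagValue:
			var value []byte
			if value, repr, err = decodeVarString(repr); err != nil {
				return err
			}
			fmt.Printf("Put %q = %d bytes\n", key, len(value))
		default:
			return fmt.Errorf("unsupported entry tag %d", tag)
		}
	}
	return nil
}

func main() {
	// Header: seq=0, count=1; then a Put of key "a" with an empty value --
	// the zero-length-varstring shape that shows up in these batches.
	repr := append(make([]byte, 8), 1, 0, 0, 0, tagValue, 1, 'a', 0)
	fmt.Println(decodeBatch(repr))
}
```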
Maybe these entries are actually AddSSTables? I say this just because... what else would they be? Usually SSTs are not printed by …
(but I'm like 99% convinced this isn't the case, because the thing decodes so cleanly into two ops that make sense:
Never mind, there's nothing to check here. If this were an inlined SST, it'd be inlined into the …
You may already have guessed it, but the botched WriteBatch repr actually applies cleanly to a RocksDB engine (otherwise, we'd have seen crashes before the assertion). All of this is extremely worrying. I want to blame the distributed backfill, but I don't know how. The backfill would send SSTs; these are not SSTs.

There's a clear pattern in these broken write batches -- always two entries, the first writing a meta key deletion and the second writing the deletion tombstone version (i.e. putting an empty value). This is certainly somehow related to the actual crash. For example, if the write batch didn't clip cleanly after the first two entries but, say, after three, it's conceivable that we would write a meta key saying that the next version is a deletion, but with that deletion missing, we might see an older version of the key that has nonzero value bytes. That would then trigger the assertion.

Actually, I discovered that there's more variety in these clipped batches. Go see for yourself, but here's a sample of only the errors:
These large batches look like mostly deletions, but not only that. Anyway, this all seems pretty severe. There's some chance that this is a problem in my debugging, but what problem would give these kinds of perfect results? Especially the fact that these clipped write batches apply to RocksDB just fine is deeply worrying.

I'll hazard a guess that if we taught RocksDB to explode on those batches, we'd get repros pretty quickly. There's nothing rare about this problem; I can't see it being caused by the distributed backfill; not sure what's going on here, but something is messed up so badly that I can't imagine it won't happen again. In fact, it's difficult to understand how exactly this could not have happened earlier.

Unfortunately, I'm about to head out for the day early, but I think there's enough here for someone else to run with: teach RocksDB to assert the write batch count against what's actually in it, then run this test again. I cannot see any connection to #36861.
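The suggested assertion could look roughly like this — a hedged sketch that reuses `decodeVarString` and `tagValue` from the decoder sketch above, comparing the entry count claimed in the header against what the body actually contains:

```go
// verifyBatchCount decodes the batch repr and confirms the entry count in
// the 12-byte header matches the entries actually present in the body.
func verifyBatchCount(repr []byte) error {
	if len(repr) < 12 {
		return fmt.Errorf("batch repr too small")
	}
	want := binary.LittleEndian.Uint32(repr[8:12])
	body := repr[12:]
	var got uint32
	for len(body) > 0 {
		tag := body[0]
		body = body[1:]
		// Every entry carries a key; puts carry a value too.
		nStrings := 1
		if tag == tagValue {
			nStrings = 2
		}
		for j := 0; j < nStrings; j++ {
			var err error
			if _, body, err = decodeVarString(body); err != nil {
				return err
			}
		}
		got++
	}
	if got != want {
		return fmt.Errorf("batch header claims %d entries, decoded %d", want, got)
	}
	return nil
}
```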
I'm coming up to speed here, following in @tbg's footsteps. The split of …
I too see all of the …
The …

```diff
diff --git a/pkg/storage/engine/batch.go b/pkg/storage/engine/batch.go
index 130a523404..a729f99a76 100644
--- a/pkg/storage/engine/batch.go
+++ b/pkg/storage/engine/batch.go
@@ -369,7 +369,7 @@ func rocksDBBatchVarString(repr []byte) (s []byte, orepr []byte, err error) {
 	}
 	repr = repr[n:]
 	if v == 0 {
-		return nil, nil, nil
+		return nil, repr, nil
 	}
 	if v > uint64(len(repr)) {
 		return nil, nil, fmt.Errorf("malformed varstring, expected %d bytes, but only %d remaining", v, len(repr))
```

That seems bad. We would truncate any batch we're reading whenever we encountered a 0-length varstring. This also seems so fundamentally bad that I'm super surprised we haven't seen this before.
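To make the failure mode concrete: under the old `return nil, nil, nil`, any entry with a zero-length value — e.g. the empty version written alongside a deletion tombstone — made the reader hand back a nil remainder, so everything after that entry in the repr was silently dropped and a later `Next` saw fewer entries than the header claimed. A hedged, self-contained sketch of the buggy helper:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// buggyVarString reproduces the pre-fix behavior of rocksDBBatchVarString:
// on a zero-length varstring it returned a nil remainder, silently
// truncating the rest of the batch repr.
func buggyVarString(repr []byte) (s, rest []byte, err error) {
	v, n := binary.Uvarint(repr)
	if n <= 0 {
		return nil, nil, fmt.Errorf("malformed varint")
	}
	repr = repr[n:]
	if v == 0 {
		return nil, nil, nil // the bug: should be `return nil, repr, nil`
	}
	if v > uint64(len(repr)) {
		return nil, nil, fmt.Errorf("varstring needs %d bytes, only %d remain", v, len(repr))
	}
	return repr[:v], repr[v:], nil
}

func main() {
	// A zero-length varstring (0x00) followed by more batch data: the buggy
	// version loses the trailing bytes; the fixed version keeps them.
	repr := []byte{0x00, 0x01, 'k'}
	_, rest, _ := buggyVarString(repr)
	fmt.Printf("remaining after decode: %v (want %v)\n", rest, repr[1:])
}
```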
Looks like the bug in batch decoding was introduced in 563c010 (by me!) back in Jan 2018. |
Note that …
I should note that it isn't clear whether this batch decoding bug is the cause of the stats discrepancy. It seems possible they're connected, but I haven't worked out exactly how.
Any KV …
The first key is the intent, while the second is the deletion tombstone. Note that …
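To spell out the ordering assumption at play, here's a hedged Go model of engine key ordering — the encoding details are mine; the real comparator lives in the storage engine:

```go
package main

import "fmt"

// mvccKey is a simplified model of engine key ordering. ts == 0 stands for
// the bare meta/intent record, which sorts before all versioned keys of
// the same user key; versions sort by descending timestamp.
type mvccKey struct {
	key string
	ts  int64
}

func less(a, b mvccKey) bool {
	if a.key != b.key {
		return a.key < b.key
	}
	if a.ts == 0 || b.ts == 0 {
		return a.ts == 0 && b.ts != 0 // meta sorts first
	}
	return a.ts > b.ts // newer versions sort first
}

func main() {
	// The MVCC layer assumes the provisional value written by an intent is
	// the key immediately after its meta record -- nothing may sit between.
	keys := []mvccKey{
		{"k", 0}, // intent meta record
		{"k", 5}, // provisional value at the intent's timestamp
		{"k", 3}, // older committed version
	}
	for i := 1; i < len(keys); i++ {
		fmt.Println(less(keys[i-1], keys[i])) // true, true
	}
}
```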
https://gist.github.com/petermattis/ea649419688e70ffda0b181e90493dec shows how this stats corruption could happen. I think we're in a situation that looks like:
I'm speculating that this happened because the sstable index backfill added …

@dt does this seem feasible? Is this a situation you considered?
The schema change timestamp is …
Hrmm, unfortunately there are no deletion tombstones that are older than …
Even if we don't observe one here, I think that situation is quite possible -- we wait until we know there is online traffic, including deletes, to the index before we start reading the table and writing SSTs. So if we can construct a possible problem with that scenario, then we probably have a lurking problem there, whether or not it proves to be what happened here.
@tbg's earlier comment (#36937 (comment)) notes that the problematic key during the stats computation has exactly the timestamp of the schema change. Time to look at what his bespoke debugging code is doing.
I was looking at the data on …
This is exactly the problematic scenario that my gist reproduces. The second entry there is from the index backfill (the timestamp is precisely the schema change timestamp of …).

I think this more or less proves the theory correct: an sstable ingest is slipping a versioned key in between an intent and the key the intent corresponds to. This isn't something that we can support at the MVCC layer, at least not in the short term. There is a ton of code which assumes that the key corresponding to an intent immediately follows it.

I think the best avenue for a fix would be to delay the index backfill for some period of time, until we can be guaranteed transactions won't be writing before the schema change timestamp. Cc @tbg, @nvanbenschoten, @bdarnell for what that wait time should be. Presumably this is related to the closed timestamp settings.
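Reusing the `mvccKey` model from the ordering sketch above (timestamps illustrative), the layout produced by the ingest is exactly the forbidden one:

```go
// Extending the ordering sketch above: an ingested backfill key at the
// schema change timestamp (7) sorts between the intent's meta record and
// its provisional value at a lower timestamp (5). MVCC code that reads the
// meta record and expects the provisional value on the very next key now
// finds the ingested key instead.
keys := []mvccKey{
	{"k", 0}, // intent meta: "my provisional value is the next key"
	{"k", 7}, // ingested SST key -- slipped in between
	{"k", 5}, // the intent's provisional value, no longer adjacent
}
_ = keys
```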
Since transactions are constantly happening for online traffic, I assume that means we pick a timestamp …

Alternatively, I'm now trying to figure out if we can just backdate the SSTs' keys, i.e. to the statement timestamp, which we know is before any txns started writing to the index key space. We'd still be reading to generate those keys at a time after those txns started, so we know we won't miss anything, but we'd then write in the past. I think this could avoid interleaving, but I just want to convince myself it is OK for an online write at time 3 to shadow a backfill write derived from a read at time 4. I think it is -- the online writes should be correct for everything that came in after they started -- but I'm still trying to think of a case where it isn't.
The closed timestamp settings don't provide any strong guarantees; we'd still need to find out what timestamps have actually been closed on each range. Intent resolution could be delayed arbitrarily, so there's no fixed delay that can work here. Instead of using closed timestamps, maybe we should just do a scan of the entire new index span (which will be nearly empty) at timestamp …
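A rough sketch of that idea — `ScanAt` is a hypothetical stand-in for the KV client, not an actual CockroachDB API:

```go
package main

import "context"

// indexScanner is a hypothetical interface standing in for the KV client;
// the real fix would go through the SQL backfiller's existing plumbing.
type indexScanner interface {
	ScanAt(ctx context.Context, start, end []byte, ts int64) error
}

// calcifyIndexSpan sketches the proposal: before ingesting backfill SSTs
// at timestamp ts, scan the (nearly empty) new index span at ts. The scan
// forces any intents in the span to be resolved and bumps the timestamp
// cache, so no later write can commit below ts in that span.
func calcifyIndexSpan(ctx context.Context, sc indexScanner, start, end []byte, ts int64) error {
	// Results are discarded; only the side effects matter.
	return sc.ScanAt(ctx, start, end, ts)
}

func main() {}
```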
@dt and @vivekmenezes like @bdarnell's suggestion. We'll work together on a PR to fix this. Thanks!
And to answer the question I raised above: no, we can't backdate the SSTs' keys to before we started the online writes. @vivekmenezes helped walk me through an example where we rely on the backfiller's write to shadow an online-traffic write:
If the backfiller above then writes its entries at …

Anyway, we're going to work on using a Scan to calcify history at the currently used time -- after the state machine has advanced and all nodes are using it -- and stick to reading and writing at that time, knowing that the scan has made doing so safe.
I like this idea. For posterity, here is my understanding of what happened:
…ader

Previously, 0-length varstrings inadvertently caused the reader to truncate the batch repr, which would usually result in a call to `Next` complaining about the batch containing an unexpected number of entries. So far, it looks like the only effect of this would be "invalid batch" errors while using `cockroach debug` commands. It is possible there is a more serious effect, though. See cockroachdb#36937

Release note (bug fix): Fixed a bug in write batch decoding that could cause "invalid batch" errors while using `cockroach debug` commands to analyze data.
36598: sql: Unify the SQL type system r=andy-kimball a=andy-kimball

This PR unifies the three different SQL type systems in CRDB:

- `sqlbase.ColumnType` - protobuf representation of table column types
- `coltypes.T` - in-memory representation of table column types
- `types.T` - computation types

The type systems are quite similar, but also subtly different. This causes bugs on a regular basis throughout the codebase. Each type system supports a different subset of the PG type functionality. For example, `sqlbase.ColumnType` stores column widths. `types.T` does not store widths, but it can store a Postgres OID value, which `sqlbase.ColumnType` is unable to do.

In addition, even for supported functionality, the APIs are often tricky and error-prone. For example, `types.T` values can sometimes be compared using `==`, but other times `Equivalent` must be called, and it is up to the user to know when to use which. As another example, support for OID values requires wrapping `types.T` instances in a special `TOIDWrapper` type. Our developers are expected to know when this needs to be done, but typically they either don't understand when it's needed, or else forget to do it. It also breaks `==` comparison semantics, leading to subtle bugs that only manifest in edge cases with unusual OID values.

Another big problem that comes up repeatedly is lossy conversions. Throughout the codebase, there are frequent conversions to/from the various type systems. Many conversions discard important type information, which often leads to incompatibilities with Postgres, as well as providing a breeding ground for bugs.

This PR unifies the three type systems into just one. The unified type is called `types.T` (in the new `sql/types` package). It uses a protobuf-generated struct that is backwards-compatible with `sqlbase.ColumnType` so that it can be embedded in place of `ColumnType` in larger protobuf messages (like `sqlbase.ColumnDescriptor`). The API is locked down so that it's difficult to ignorantly or accidentally create invalid types. The unified type supports everything that the three previous type systems did, and is more compatible with PG as a result (with more work, even more type compatibility is possible).

Resolves #32639

36681: distsqlplan: reset MergeOrdering if windower is planned r=yuzefovich a=yuzefovich

The windower doesn't guarantee maintaining the order, so the merge ordering of the plan should be set to empty.

Fixes: #35179.

Release note: None

36962: storage/engine: fix handling of 0-length varstrings in RocksDBBatchReader r=tbg a=petermattis

Previously, 0-length varstrings inadvertently caused the reader to truncate the batch repr, which would usually result in a call to `Next` complaining about the batch containing an unexpected number of entries. So far, it looks like the only effect of this would be "invalid batch" errors while using `cockroach debug` commands. It is possible there is a more serious effect, though. See #36937

Release note (bug fix): Fixed a bug in write batch decoding that could cause "invalid batch" errors while using `cockroach debug` commands to analyze data.

Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Peter Mattis <petermattis@gmail.com>
36969: sql: Scan index range to calcify it before starting backfill r=dt a=dt

Sending a scan request to the index span will a) ensure all intents are resolved as of that time and b) populate the tscache, ensuring nothing else can sneak under that time. This is needed since the backfill will ingest SSTs with keys at that time, and ingestion just adds all the keys blindly. This could otherwise allow a key to slide in between an intent and the key it is adding, violating the assumption that they are always sequential.

Fixes #36937.

Release note: none.

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
36947: mvcc: clean up warnings around failed intent resolution r=nvanbenschoten a=nvanbenschoten

We saw "impossible" warnings fire in #36937, which indicated that these warnings had issues. It turns out that they were not correct. The "impossible" warning appears to be possible if an intent from an earlier epoch is removed and replaced by either a new intent or a committed value with a higher timestamp. Meanwhile, one of the warnings that had "false positives" doesn't actually appear to.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Describe the problem
I ran a schema change on a partitioned TPC-C 10K cluster and it killed two nodes:
Here's the error:
Here is the full logtostderr output for the first node:
https://gist.github.com/awoods187/5f73970297988cb7ba511eaa75bef118
To Reproduce
Then run TPC-C:
Then make the index:
Then crash!
Environment:
v19.1.0-rc.3-11-gc65b71a