
sql: fatal storage/replica_corruption when running a schema change #36937

Closed
awoods187 opened this issue Apr 18, 2019 · 31 comments · Fixed by #36969
Labels: A-schema-changes, C-bug, S-1-stability

Comments

@awoods187
Contributor

awoods187 commented Apr 18, 2019

Describe the problem
I ran a schema change on a partitioned TPC-C 10K cluster and it killed two nodes:

root@localhost:26257/tpcc> CREATE UNIQUE INDEX ON tpcc.order (o_w_id ASC, o_d_id ASC, o_carrier_id ASC, o_id ASC) PARTITION BY RANGE (o_w_id) (
    PARTITION p4_0 VALUES FROM (0) TO (2000),
    PARTITION p4_1 VALUES FROM (2000) TO (4000),
    PARTITION p4_2 VALUES FROM (4000) TO (6000),
    PARTITION p4_3 VALUES FROM (6000) TO (8000),
    PARTITION p4_4 VALUES FROM (8000) TO (10000));
driver: bad connection
warning: connection lost!
opening new connection: all session settings will be lost
warning: error retrieving the transaction status: dial tcp 127.0.0.1:26257: connect: connection refused
warning: connection lost!
opening new connection: all session settings will be lost
warning: error retrieving the database name: dial tcp 127.0.0.1:26257: connect: connection refused
root@localhost:26257/? ?>

Here's the error:

F190418 19:43:26.083552 195202485 storage/replica_corruption.go:45  [n1,s1,r33237/3:/Table/58/6/{7431/7…-8360/1…}] replica is corrupted: replica corruption (processed=true): unable to compute stats for RHS range after split: expected mvcc metadata val bytes to equal 5; got 0

Here is the full logstostderr for the first node:
https://gist.github.com/awoods187/5f73970297988cb7ba511eaa75bef118
To Reproduce

set -euxo pipefail
export CLUSTER=andy-10k
roachprod create $CLUSTER -n 16 --clouds=aws  --aws-machine-type-ssd=c5d.4xlarge
roachprod run $CLUSTER --  "sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-15 cockroach c65b71a27e4d0941bf9427b5dec1ff7f096bba7b
roachprod stage $CLUSTER:16 workload
roachprod start $CLUSTER:1-15 --racks=5 
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=10000 --db=tpcc"
roachprod run $CLUSTER:16 "./workload run tpcc --warehouses=10000 --partitions=5 --duration=1s --split --scatter {pgurl:1}"
roachprod sql $CLUSTER:1
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate='64MiB'

Then run TPC-C:

roachprod run $CLUSTER:16 "./workload run tpcc --warehouses=10000 --ramp=10m --active-warehouses=10000 --partitions=5 --duration=5h --tolerate-errors --split --scatter {pgurl:1-15}"

Then make the index:

CREATE UNIQUE INDEX ON tpcc.order (o_w_id ASC, o_d_id ASC, o_carrier_id ASC, o_id ASC) PARTITION BY RANGE (o_w_id) (
       PARTITION p4_0 VALUES FROM (0) TO (2000),
       PARTITION p4_1 VALUES FROM (2000) TO (4000),
       PARTITION p4_2 VALUES FROM (4000) TO (6000),
       PARTITION p4_3 VALUES FROM (6000) TO (8000),
       PARTITION p4_4 VALUES FROM (8000) TO (10000))

Then crash!
Environment:
v19.1.0-rc.3-11-gc65b71a

@awoods187 awoods187 added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Apr 18, 2019
@awoods187 awoods187 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-schema-changes labels Apr 18, 2019
@awoods187
Contributor Author

awoods187 commented Apr 18, 2019

When trying to run debug zip against the non-failed nodes I see:

Andrews-MBP-2:~ andrewwoods$ roachprod run $CLUSTER:2 -- ./cockroach debug zip --url {pgurl:1} debug.zip
warning: --url specifies user/password, but command "zip" does not accept user/password details - details ignored
Error: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.

initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.33.197:26257: connect: connection refused"
Failed running "debug zip"
Error:  exit status 1


@petermattis
Collaborator

@andreimatei, @nvanbenschoten, @tbg is it possible this is another symptom of #36861?

@tbg
Member

tbg commented Apr 18, 2019

@awoods187 you were running debug zip on node 2 but gave it node 1 to talk to (pgurl:1). You have to change it to pgurl:2 (or any node that hasn't crashed). You can still run the command on $CLUSTER:1, btw.
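
In other words, something like this should work (any live node works in place of node 2):

roachprod run $CLUSTER:2 -- ./cockroach debug zip --url {pgurl:2} debug.zip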

Two nodes failed with this error: n13 (r33237/1) and n1 (r33237/3). The error happens while evaluating the end transaction for a split. That two nodes died here suggests that the split was retried by another leaseholder after the first crash, and that leaseholder then crashed as well. The good thing about this is that the data causing the problem is likely still there, on all three replicas of the range. I was just able to verify that debug check-store (which also computes the stats) falls over with the same error.

The Raft log is... long, so we might have a good shot at learning something from it, too.

It's too late to look here today, but I'll take a look tomorrow. It would strike me as odd that this would be caused by #36861 simply because it looks so different (and skipping individual raft commands shouldn't give you MVCC-level inconsistencies), though we'll see.

The error occurs in

if (meta.val_bytes() != value.size()) {
  stats.status = FmtStatus("expected mvcc metadata val bytes to equal %d; got %d",
                           int(value.size()), int(meta.val_bytes()));
  return stats;

@awoods187 please make sure this cluster doesn't expire until Wednesday.

@tbg
Member

tbg commented Apr 18, 2019

In the logs, I'm seeing this fire again (for this range -- edit: not actually on this range, probably also on others):

} else if v == nil {
	// This should never happen as ok is true above.
	log.Warningf(ctx, "unable to find value for %s @ %s (%+v vs %+v)",
		intent.Key, intent.Txn.Timestamp, meta, intent.Txn)

@nvanbenschoten I think I asked you about this on another thread, but I don't know if I got an answer -- do we really not expect this to fire (on deletions perhaps)? I guess this was running TPCC. The table is /Table/58/6/{7431/7…-8360/1…}; not sure if that's the unique index or a stock TPCC one.

This one also fired:

if intent.Status == roachpb.COMMITTED && intent.Txn.Epoch == 0 {
	log.Warningf(ctx, "unable to find value for %s (%+v)",
		intent.Key, intent.Txn)

@nvanbenschoten
Member

do we really not expect this to fire (on deletions perhaps)? I guess this was running TPCC. The table is /Table/58/6/{7431/7…-8360/1…}; not sure if that's the unique index or a stock TPCC one.

I think we can see that case if:

  1. txn A writes an intent at key "a", epoch 1
  2. txn A bumps its epoch to 2
  3. txn A commits at timestamp 10
  4. txn B resolves txn A's intent, removing it because it was from an old epoch
  5. txn B writes an intent at timestamp 15
  6. txn A tries to resolve its intent on key "a"

I'm guessing the comment says it's not possible because it thinks that it called mvccGetInternal with the meta timestamp, so it thinks it should always get back the first versioned value. But this isn't true for two separate reasons:

  1. we don't call mvccGetInternal with meta.Timestamp, we call it with intent.Txn.Timestamp, so we may be reading beneath the first versioned value (intent or otherwise).
  2. we call mvccGetInternal with consistent=false, so we'll ignore intents even at lower timestamp.

I'm actually kind of confused by these cases. It seems like the v == nil case should be possible but the v.Timestamp != intent.Txn.Timestamp case shouldn't. Even if we guarded the v == nil case with an intent.Txn.Epoch == 0 condition like we have in the other case, a GC could cause us to fall into that case. However, I don't see how we would ever fall into the other one. mvccGetInternal returning a value with a different timestamp means that it's returning a value with a lower timestamp. This rules out the case where the value we're searching for was GCed. So where did it go?

We can discuss in #36947.

@tbg
Member

tbg commented Apr 19, 2019

r33237.n1.txt
r33237.n13.txt
r33237.n15.txt

The raft logs span ~600 entries. There is one SSTable ingestion just before the crash (crash happened after 28021, the end of the log on n13). The logs agree, so if #36861 caused this it would have had to happen more than 600 entries back.

If I had to take a wild guess, I'd say that looking into that SSTable at Term:6 Index:28020 might turn up something, for example an MVCC metadata key with a zero ValBytes when it should be 5. @dt could you take a look? The sideloaded file should sit on all of the replicas.

The other thing to fix here is the error. It should've told us much more than it did, for example which key this happened on and what the rest of the meta looked like. I suppose I'll poke at the data dir with a bespoke tool and figure out what key I'm supposed to look at.

@tbg
Member

tbg commented Apr 19, 2019

Here's the bespoke debugging. On n1 and n15, the output is identical. On n13 it definitely looks similar, but on closer inspection there are two differences:

  1. the n13 version has seqno 29, not 39, and
  2. n13 is for a different key:
    n13 1555616543.558774271,0 /Table/58/6/8062/6/NULL/2341/0: "\x00\x00\x00\x00\x03"
    n15 1555616543.558774271,0 /Table/58/6/8062/8/NULL/2341/0: "\x00\x00\x00\x00\x03"
    
n15: expected (implicit=false) mvcc metadata val bytes to equal 5; got 0 from meta {Txn:id:9f721313-01b1-460f-9c3f-13f8c60e046a key:"\302\211\367\037~\211\206\366\332\210" timestamp:<wall_time:1555616541751560067 > priority:105649 sequence:39  Timestamp:1555616541.751560067,0 Deleted:true KeyBytes:12 ValBytes:0 RawBytes:[] IntentHistory:[] MergeTimestamp:<nil> XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} (hex key c28ef71f7e9000f7092588001596a9011d8aadff09 value 0000000003)

n13: expected (implicit=false) mvcc metadata val bytes to equal 5; got 0 from meta {Txn:id:9f721313-01b1-460f-9c3f-13f8c60e046a key:"\302\211\367\037~\211\206\366\332\210" timestamp:<wall_time:1555616541751560067 > priority:105649 sequence:29  Timestamp:1555616541.751560067,0 Deleted:true KeyBytes:12 ValBytes:0 RawBytes:[] IntentHistory:[] MergeTimestamp:<nil> XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} (hex key c28ef71f7e8e00f7092588001596a9011d8aadff09 value 0000000003)

Note that the key and value here are for the versioned key that causes the trouble. The associated meta key is read earlier. implicit=false tells us (this is all bespoke code, btw) that the stats computation first read a meta kv pair (which gives us the meta pretty-printed in the output) and then positioned the iterator at the next key, finding the version with the 000....003 value. However, the meta record is a deletion, so the version should not have any bytes; it should be a deletion tombstone, which is just a key -> nil mapping.
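
Paraphrased in Go, the invariant the stats computation enforces looks roughly like the sketch below (an illustration only, not the actual stats code; checkMetaAgainstVersion is a made-up name):

package mvccsketch

import "fmt"

// checkMetaAgainstVersion: the newest versioned value under a key must be
// exactly as large as its MVCCMetadata record claims. For a deletion the
// meta has Deleted=true and ValBytes=0, so the newest version must be an
// empty tombstone; here we instead found the 5-byte value printed above.
func checkMetaAgainstVersion(metaValBytes int64, newestVersion []byte) error {
	if int64(len(newestVersion)) != metaValBytes {
		return fmt.Errorf("expected mvcc metadata val bytes to equal %d; got %d",
			len(newestVersion), metaValBytes)
	}
	return nil
}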

That n13 has a different key with that problem might be unexpected, but I think the explanation might just be that they're at different log positions. n1 and n15 are at applied index 32448; n13 is behind at 28021:

$ roachprod ssh andy-10k:1,13,15 ./cockroach-tobias debug range-data /mnt/data1/cockroach/ 33237 | grep RangeAppliedState
0.000000000,0 /Local/RangeID/33237/r/RangeAppliedState (0x0169f781d5727261736b00): raft_applied_index:32448 lease_applied_index:30047 range_stats:<contains_estimates:true last_update_nanos:1555616603174128090 intent_age:27619 gc_bytes_age:32414842 live_bytes:124866335 live_count:4316528 key_bytes:103585184 key_count:4335806 val_bytes:21795619 val_count:4336418 intent_bytes:48611 intent_count:3033 sys_bytes:23907 sys_count:313 >
0.000000000,0 /Local/RangeID/33237/r/RangeAppliedState (0x0169f781d5727261736b00): raft_applied_index:28021 lease_applied_index:25625 range_stats:<contains_estimates:true last_update_nanos:1555616576881956145 intent_age:4597 gc_bytes_age:20328512 live_bytes:94101749 live_count:3255315 key_bytes:78171720 key_count:3272137 val_bytes:16370429 val_count:3272385 intent_bytes:20472 intent_count:1341 sys_bytes:15634 sys_count:205 >
0.000000000,0 /Local/RangeID/33237/r/RangeAppliedState (0x0169f781d5727261736b00): raft_applied_index:32448 lease_applied_index:30047 range_stats:<contains_estimates:true last_update_nanos:1555616603174128090 intent_age:27619 gc_bytes_age:32414842 live_bytes:124866335 live_count:4316528 key_bytes:103585184 key_count:4335806 val_bytes:21795619 val_count:4336418 intent_bytes:48611 intent_count:3033 sys_bytes:23907 sys_count:313 >

This in combination with the logs is actually a good sign, because this indicates that something happened between these two log positions that "fixed" one problem and introduced one on an adjacent key. We don't have all of the log indexes in between, but there's a hope that we can reconstruct something.

@tbg
Member

tbg commented Apr 19, 2019

Hmm, something I missed earlier is that there are write batches that I can't seem to decode:

failed to decode: invalid batch: expected 40 entries but found 2

For example index 27651. Going to have to look into that first.

@tbg
Member

tbg commented Apr 19, 2019

Via TestFoo in https://github.com/tbg/cockroach/tree/debug/mvcc36937 I can see that this batch at that index (according to the batch reader) contains only two BatchTypeValue entries. Other than that the writebatch seems well-formed (i.e. the two entries are read cleanly, and then we end up with exactly nothing left), so if this writebatch got clipped somehow, it got clipped in an unexpectedly clean place.

Going to soup up the write batch pretty printer so that it at least prints these two entries.

@tbg
Member

tbg commented Apr 19, 2019

Maybe these entries are actually AddSSTables. I say this just because... what else would they be? Usually SSTs are not printed by debug raft-log because they're on disk, but snapshots can inline them. There's a backfill going on here, which sends small SSTs, so I'll take a look at whether I'm just trying to print an SST as a WriteBatch.

@tbg
Member

tbg commented Apr 19, 2019

But I'm like 99% convinced this isn't the case, because the thing decodes so cleanly into two ops that make sense:

failed to decode: invalid batch: expected 40 entries but found 2
after:
Put: 0.000000000,0 /Table/58/6/8017/1/NULL/2347/0 (0xc28ef71f518900f7092b8800): 1555616575.206539576,0 {Txn:id:7c65cc7a-d9dc-4542-b73c-b64f4cc7b760 key:"\302\211\367\037Q\211\206\366\324\210" timestamp:<wall_time:1555616575206539576 > priority:165508 sequence:4  Timestamp:1555616575.206539576,0 Deleted:true KeyBytes:12 ValBytes:0 RawBytes:[] IntentHistory:[] MergeTimestamp:<nil> XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Put: 1555616575.206539576,0 /Table/58/6/8017/1/NULL/2347/0 (0xc28ef71f518900f7092b88001596a9087be5413809): ""

@tbg
Member

tbg commented Apr 19, 2019

Nevermind, there's nothing to check here. If this were an inlined SST, it'd be inlined into the AddSST portion of the replicated eval result, it wouldn't just reside in the Entry.Data part.

@tbg
Member

tbg commented Apr 19, 2019

You may already have guessed it, but the botched WriteBatch repr actually applies cleanly to a RocksDB engine (otherwise, we'd have seen crashes before the assertion).

All of this is extremely worrying. I want to blame the distributed backfill, but I don't know how. The backfill would send SSTs. These are not SSTs. There's a clear pattern in these broken write batches -- always two entries, the first writing a meta key deletion, and the second writing the deletion tombstone version (i.e. putting an empty value).

This is certainly somehow related to the actual crash. For example, if a writebatch were clipped not after the first two entries but, say, after three, it's conceivable that we would write a meta key saying that the next version is a deletion, but with the deletion itself missing; we might then see an older version of the key that has nonzero value bytes. That would trigger the assertion.

r33237.n1.txt

Actually I discovered that there's more variety in these clipped batches. Go see for yourself, but here's a sample of only the errors:

$ grep "but found" awoods-36937/r33237.n1.txt  | sort | uniq -c | sort -rn
 302 failed to decode: invalid batch: expected 40 entries but found 2
   1 failed to decode: invalid batch: expected 80 entries but found 2
   1 failed to decode: invalid batch: expected 60 entries but found 2
   1 failed to decode: invalid batch: expected 143 entries but found 2
   1 failed to decode: invalid batch: expected 140 entries but found 65
   1 failed to decode: invalid batch: expected 140 entries but found 60
   1 failed to decode: invalid batch: expected 121 entries but found 93

These large batches look like mostly deletions, but not only that.

Anyway, this all seems pretty severe. There's some chance that this is a problem in my debugging, but what problem would give these kinds of perfect results? Especially the fact that these clipped write batches apply to RocksDB just fine is deeply worrying.

I'll hazard a guess that if we taught RocksDB to explode on those batches, we'd get repros pretty quickly. There's nothing rare about this problem; I can't see it being caused by the distributed backfill; not sure what's going on here, but something is messed up so badly that I can't imagine it won't happen again. In fact, it's difficult to understand how exactly this hasn't happened earlier.

Unfortunately, I'm about to head out for the day early, but I think there's enough here for someone else to run with: teach RocksDB to assert the write batch count against what's actually in it. Run this test again.

I cannot see any connection to #36861.

@petermattis
Collaborator

I'm coming up to speed here, following in @tbg's steps. The split of r33237 was initiated a few seconds before the crash on n13:

I190418 19:42:56.058156 194555898 storage/replica_command.go:247 [n13,split,s13,r33237/1:/Table/58/6/{7431/7…-8360/1…}] initiating a split of this range at key /Table/58/6/8034/10/4/1729 [r33257] (71 MiB above threshold size 64 MiB)
...
E190418 19:43:00.618670 194555898 storage/replica_corruption.go:42 [n13,s13,r33237/1:/Table/58/6/{7431/7…-8360/1…}] stalling replica due to: unable to compute stats for RHS range after split: expected mvcc metadata val bytes to equal 5; got 0

I too see all of the invalid batch messages. I'm going to see if I notice any commonality in those batches.

@petermattis
Collaborator

The invalid batch errors have a simple explanation: a bug in rocksDBBatchVarString when the value is 0 bytes.

diff --git a/pkg/storage/engine/batch.go b/pkg/storage/engine/batch.go
index 130a523404..a729f99a76 100644
--- a/pkg/storage/engine/batch.go
+++ b/pkg/storage/engine/batch.go
@@ -369,7 +369,7 @@ func rocksDBBatchVarString(repr []byte) (s []byte, orepr []byte, err error) {
        }
        repr = repr[n:]
        if v == 0 {
-               return nil, nil, nil
+               return nil, repr, nil
        }
        if v > uint64(len(repr)) {
                return nil, nil, fmt.Errorf("malformed varstring, expected %d bytes, but only %d remaining",

That seems bad. We would truncate any batch that we're reading when we encountered a 0-length varstring. This also seems so fundamentally bad that I'm super surprised we haven't seen this before.
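
In isolation, the decode looks something like the following self-contained sketch (illustrative only, not the actual batch.go code; decodeVarString is a made-up name). The buggy version returns a nil remainder when it sees a zero-length string, so the caller believes the batch ended right there:

package batchsketch

import (
	"encoding/binary"
	"fmt"
)

// decodeVarString reads one RocksDB batch varstring (a uvarint length
// followed by that many bytes) and returns the string plus the unread
// remainder of the batch repr.
func decodeVarString(repr []byte) (s, rest []byte, err error) {
	v, n := binary.Uvarint(repr)
	if n <= 0 {
		return nil, nil, fmt.Errorf("invalid uvarint length prefix")
	}
	repr = repr[n:]
	if v == 0 {
		// Buggy version: `return nil, nil, nil` -- the remainder is dropped,
		// so every entry after a zero-length value (e.g. the empty value put
		// for an MVCC deletion tombstone) vanishes from the reader's view,
		// which is why it reports "expected N entries but found 2".
		return nil, repr, nil
	}
	if v > uint64(len(repr)) {
		return nil, nil, fmt.Errorf("malformed varstring, expected %d bytes, but only %d remaining",
			v, len(repr))
	}
	return repr[:v], repr[v:], nil
}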

@petermattis
Collaborator

Looks like the bug in batch decoding was introduced in 563c010 (by me!) back in Jan 2018.

@petermattis
Collaborator

Note that /Table/58/6 is the key prefix that corresponds to the new index being added by the schema change (table ID 58 is tpcc.order), as the crdb_internal.schema_changes table shows:

root@localhost:26257/tpcc> select * from crdb_internal.schema_changes WHERE table_id=58;
  table_id | parent_id | name  | type  | target_id |                target_name                |         state         | direction
+----------+-----------+-------+-------+-----------+-------------------------------------------+-----------------------+-----------+
        58 |        52 | order | INDEX |         6 | order_o_w_id_o_d_id_o_carrier_id_o_id_key | DELETE_AND_WRITE_ONLY | ADD
(1 row)

@petermattis
Collaborator

I should note that it isn't clear if this batch decoding bug is the cause of the stats discrepancy. Seems possible they are connected, but I haven't worked out exactly how.

@petermattis
Collaborator

Any KV Delete op will create a batch with a 0-length value. @tbg called this and I misunderstood. For example, the raft log for r33237 shows a write batch with:

Put: 0.000000000,0 /Table/58/6/7478/1/NULL/2339/0 (0xc28ef71d368900f709238800): 1555616574.158765376,0 {Txn:id:5fc0e1e7-a945-46ca-9f80-adaa8351ffc6 key:"\302\211\367\0356\211\206\366\334\210" timestamp:<wall_time:1555616574158765376 > priority:254941 sequence:4 Timestamp:1555616574.158765376,0 Deleted:true KeyBytes:12 ValBytes:0 RawBytes:[] IntentHistory:[] MergeTimestamp:<nil> XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Put: 1555616574.158765376,0 /Table/58/6/7478/1/NULL/2339/0 (0xc28ef71d368900f7092388001596a9083d717d4009): ""

The first key is the intent while the second is the deletion tombstone.

Note that rocksDBBatchVarString is only used from RocksDBBatchReader. The latter isn't used in many places: buildIntents (but those batches always have non-zero values) and VerifyBatchRepr (which doesn't seem to have been called). So far I haven't seen any evidence that this batch decoding bug is causing the original stats discrepancy.

@petermattis
Collaborator

https://gist.github.com/petermattis/ea649419688e70ffda0b181e90493dec shows how this stats corruption could happen. I think we're in a situation that looks like:

"foo" -> MVCCMetadata{val_bytes: 0, deleted: true}
"foo"/0.000000000,2 -> 5
"foo"/0.000000000,1 -> 0

I'm speculating that this happened because sstable index backfill added "foo"/0.000000000,2 -> 5 while a foreground transaction deleted "foo"/0.000000000,1.

@dt does this seem feasible? Is this a situation you considered?

@petermattis
Collaborator

The schema change timestamp is 1555616543558774271 (aka 1555616543.558774271). I got this by looking at the crdb.ts.* timestamps on two of the ingested sstables (the timestamps matched as expected). Next thing to check is if there were recent transactional deletes to r33237 that had an earlier timestamp.

@petermattis
Collaborator

Hrmm, unfortunately there are no deletion tombstones older than 1555616543. The oldest Put I see in the raft log is at 1555616563, which is 20 seconds after the schema change timestamp.

@dt
Member

dt commented Apr 19, 2019

Even if we don't observe one here, I think that situation is quite possible -- since we wait until we know there is online traffic, including deletes, going to the index before we start reading the table and writing SSTs -- so if we can construct a possible problem from that scenario, then we probably have a lurking problem there, whether or not it proves to be what happened here.

@petermattis
Collaborator

@tbg's earlier comment (#36937 (comment)) notes that the problematic key during the stats computation has exactly the timestamp of the schema change. Time to look at what his bespoke debugging code is doing.

@petermattis
Collaborator

I was looking at the data on r33237 and couldn't find the problematic bits. Silly of me as it is the RHS (r33257) that experienced the error. Dumping the data for that range on n13 and grep'ing for the problematic keys shows:

0.000000000,0 /Table/58/6/8062/6/NULL/2341/0 (0xc28ef71f7e8e00f709258800): 1555616541.751560067,0 {Txn:id:9f721313-01b1-460f-9c3f-13f8c60e046a key:"\302\211\367\037~\211\206\366\332\210" timestamp:<wall_time:1555616541751560067 > priority:105649 sequence:29  Timestamp:1555616541.751560067,0 Deleted:true KeyBytes:12 ValBytes:0 RawBytes:[] IntentHistory:[] MergeTimestamp:<nil> XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
1555616543.558774271,0 /Table/58/6/8062/6/NULL/2341/0 (0xc28ef71f7e8e00f7092588001596a9011d8aadff09): "\x00\x00\x00\x00\x03"
1555616541.751560067,0 /Table/58/6/8062/6/NULL/2341/0 (0xc28ef71f7e8e00f7092588001596a900b1d2c78309): ""

This is exactly the problematic scenario that my gist reproduces. The second entry there is from the index backfill (the timestamp is precisely the schema change timestamp of 1555616543.558774271). The first and third entries are the intent and the corresponding value. We know these are tied together because the timestamps in the intent and the value are identical (1555616541.751560067).

I think this more or less proves the theory correct that an sstable ingest is slipping a versioned key in between an intent and the key the intent corresponds to. This isn't something that we can support at the MVCC layer, at least not in the short term. There is a ton of code which assumes that the key corresponding to an intent immediately follows it.

I think the best avenue to fix this would be to delay the index backfill for some period of time, until we can guarantee that transactions won't be writing below the schema change timestamp. Cc @tbg, @nvanbenschoten, @bdarnell for what that wait time should be. Presumably this is related to the closed timestamp settings.

@dt
Member

dt commented Apr 19, 2019

delay the index backfill

Since transactions are constantly happening for online traffic, I assume that means we pick a timestamp t, wait until t is far enough in the past that we don't think it'll encounter any txns, then run at t?

Alternatively, I'm now trying to figure out if we can just backdate the SSTs' keys, i.e. to the statement timestamp, which we know is before any txns started writing to the index key space. We'd still be reading at a time after those txns started in order to generate those keys, so we know we won't miss anything, but we'd then write in the past. I think this could avoid interleaving, but I just want to convince myself it is OK for an online write at time 3 to shadow a backfill write derived from a read at time 4. I think it is: the online writes should be correct for everything that came in after they started -- but I'm still trying to think of a case where it isn't.

@bdarnell
Contributor

what that wait time should be. Presumably this is related to the closed timestamp settings.

The closed timestamp settings don't provide any strong guarantees; we'd still need to find out what timestamps have actually been closed on each range. Intent resolution could be delayed arbitrarily so there's no fixed delay that can work here.

Instead of using closed timestamps, maybe we should just do a scan of the entire new index span (which will be nearly empty) at timestamp t. That will set the timestamp cache and prevent any new intents being written below that time. And a consistent scan will force any existing intents below that point to resolve.
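
Very roughly, the shape of that idea looks like the sketch below (kvScanner and calcifyIndexSpan are hypothetical names standing in for whatever KV client handle the backfiller has; this is not the actual backfill code):

package backfillsketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// kvScanner is a hypothetical stand-in for the KV client: all we need is a
// consistent ranged scan evaluated at a fixed timestamp.
type kvScanner interface {
	ScanAt(ctx context.Context, start, end roachpb.Key, ts hlc.Timestamp) error
}

// calcifyIndexSpan runs a consistent scan over the new (nearly empty) index
// span at the backfill timestamp ts before any SSTs are ingested. The scan
// (1) forces resolution of any intents below ts in the span and (2) bumps the
// timestamp cache, so no later transaction can write an intent below ts.
// After that, ingested keys versioned at ts can no longer land between an
// intent and its provisional value.
func calcifyIndexSpan(ctx context.Context, kv kvScanner, span roachpb.Span, ts hlc.Timestamp) error {
	return kv.ScanAt(ctx, span.Key, span.EndKey, ts)
}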

@vivekmenezes
Contributor

vivekmenezes commented Apr 19, 2019

@dt and @vivekmenezes like @bdarnell's suggestion. We'll work together on a PR to fix this. Thanks!

@dt
Member

dt commented Apr 19, 2019

And to answer the question I raised above: no, we can't backdate the SSTs' keys to before we started the online writes. @vivekmenezes helped walk me through an example where we are relying on the backfiller's write to shadow an online-traffic write:

  • time t1: node1 has the updated schema and gets a write that should delete an index entry.
  • time t2: node2 still has the old schema and gets an insert. It does not write an index entry, though -- an entry which would have shadowed the tombstone at t1.
  • time t3: all nodes have the updated schema and are using it -- the backfill can proceed.
  • time t4: the backfiller picks a time > t3 to run at, to be sure it doesn't miss a row that didn't get online updates. It picks t4 to read at, so it will see the row written by node2 at t2 and makes a correct index entry for it.

If the backfiller above then writes its entries at t4, it is fine: the entry for the row is correct. If it backdated its write to before any other traffic, the tombstone at t1 would shadow it, and we'd be missing an entry.

Anyway, we're going to work on using a Scan to calcify history at the currently used time -- after the state machine has advanced and all nodes are using it -- and stick to reading and writing at that time, knowing that the scan has made doing so safe.

@petermattis
Collaborator

Instead of using closed timestamps, maybe we should just do a scan of the entire new index span (which will be nearly empty) at timestamp t.

I like this idea.

For posterity, here is my understanding of what happened:

  • Node 1 starts a schema change to add an index, bumping the table descriptor version and then proceeds to wait until all nodes have acknowledged the version bump.
  • Node 2 is notified of the updated table descriptor (via gossip).
  • Txn A starts on node 2 at timestamp T1 and uses the updated table descriptor (causing it to perform writes and deletes into the new index).
  • Node 1 sees that all of the nodes have acknowledged the version bump and starts up the backfill at timestamp T2. Note that T2 > T1.
  • Txn A performs deletions into range 33237 (these are transactional deletes, so they include intents).
  • An sstable is ingested into 33237 with timestamps all set at T2, including a user-key that was modified by Txn A.
  • Range 33237 is split. As part of committing the split we recompute the stats for the RHS and an error is generated due to this unexpected and unsupported situation.

petermattis added a commit to petermattis/cockroach that referenced this issue Apr 19, 2019
…ader

Previously, 0-length varstrings inadvertently caused the reader to
truncate the batch repr which would usually result in a call to `Next`
complaining about the batch containing an unexpected number of
entries. So far, it looks like the only effect of this would be "invalid
batch" errors while using `cockroach debug` commands. It is possible
there is a more serious effect, though.

See cockroachdb#36937

Release note (bug fix): Fixed a bug in write batch decoding that could
cause "invalid batch" errors while using `cockroach debug` commands to
analyze data.
craig bot pushed a commit that referenced this issue Apr 19, 2019
36598: sql: Unify the SQL type system r=andy-kimball a=andy-kimball

This PR unifies the three different SQL type systems in CRDB:

  `sqlbase.ColumnType` - protobuf representation of table column types
  `coltypes.T` - in-memory representation of table column types
  `types.T` - computation types

The type systems are quite similar, but also subtly different. This causes bugs on a regular basis throughout the codebase. Each type system supports a different subset of the PG type functionality. For example, `sqlbase.ColumnType` stores column widths. `types.T` does not store widths, but it can store a Postgres OID value, which `sqlbase.ColumnType` is unable to do.

In addition, even for supported functionality, the API's are often tricky and error-prone. For example, `types.T` values can sometimes be compared using `==`, but other times `Equivalent` must be called, and it is up to the user to know when to use which. As another example, support for OID values require wrapping `types.T` instances in a special `TOIDWrapper` type. Our developers are expected to know when it needs to be done, but typically they either don't understand when it's needed, or else forget to do it. It also breaks `==` comparison semantics, leading to subtle bugs that only manifest in edge cases with unusual OID values.

Another big problem that comes up repeatedly are lossy conversions. Throughout the codebase, there are frequent conversions to/from the various type systems. Many conversions discard important type information, which often leads to incompatibilities with Postgres, as well as providing a breeding ground for bugs.

This PR unifies the three type systems into just one. The unified type is called `types.T` (in new `sql/types` package). It uses a protobuf-generated struct that is backwards-compatible with `sqlbase.ColumnType` so that it can be embedded in place of `ColumnType` in larger protobuf messages (like `sqlbase.ColumnDescriptor`). The API is locked down so that it's difficult to ignorantly or accidentally create invalid types. The unified type supports everything that the three previous type systems did, and is more compatible with PG as a result (with more work, even more type compatibility is possible).

Resolves #32639

36681: distsqlplan: reset MergeOrdering if windower is planned r=yuzefovich a=yuzefovich

Windower doesn't guarantee maintaining the order, so merge ordering
of the plan should be set to empty.

Fixes: #35179.

Release note: None

36962: storage/engine: fix handling of 0-length varstrings in RocksDBBatchReader r=tbg a=petermattis

Previously, 0-length varstrings inadvertently caused the reader to
truncate the batch repr which would usually result in a call to `Next`
complaining about the batch containing an unexpected number of
entries. So far, it looks like the only effect of this would be "invalid
batch" errors while using `cockroach debug` commands. It is possible
there is a more serious effect, though.

See #36937

Release note (bug fix): Fixed a bug in write batch decoding that could
cause "invalid batch" errors while using `cockroach debug` commands to
analyze data.

Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Peter Mattis <petermattis@gmail.com>
craig bot pushed a commit that referenced this issue Apr 22, 2019
36969: sql: Scan index range to calcify it before starting backfill r=dt a=dt

Sending a scan request to the index span will a) ensure all intents are resolved as of that time and b) populate
the tscache ensuring nothing else can sneak under that time.

This is needed since the backfill will ingest SSTs with keys at that time and ingestion just adds all the keys blindly.
This could mean allowing an ingested key to slide in between an intent and the key it covers, violating the assumption that they are
always sequential.

Fixes #36937.

Release note: none.

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
@craig craig bot closed this as completed in #36969 Apr 22, 2019
craig bot pushed a commit that referenced this issue Apr 29, 2019
36947: mvcc: clean up warnings around failed intent resolution r=nvanbenschoten a=nvanbenschoten

We saw "impossible" warnings fire in #36937, which indicated that these
warnings had issues. It turns out that they were not correct.
The "impossible" warning appears to be possible if an intent from an
earlier epoch is removed and replaced by either a new intent or a
committed value with a higher timestamp. Meanwhile, one of the warnings
that supposedly had "false positives" doesn't actually appear to.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>