storage: put is inline=true, but existing value is inline=false #28205

Closed
petermattis opened this issue Aug 2, 2018 · 42 comments
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

@petermattis
Collaborator

pq: "/Local/RangeID/136/r/AbortSpan/\"c5278963-f8eb-421b-80ba-0e0030c600fe\"": put is inline=true, but existing value is inline=false

I've seen the error above a handful of times while running the ycsb roachtests in a loop. The error has only ever occurred with workloads A and B. I'm testing historical binaries and so far the latest binary I've seen this error on was built from b5edd2a (5/29).

I have logs from these roachtest runs, but the clusters have been wiped. Reproduction seems to occur in a few hours. Running ycsb workloads A and B in a loop might reproduce on master:

~ roachtest run --count=100 -d 'ycsb/(A|B)'
@petermattis petermattis added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-transactions Relating to MVCC and the transactional model. labels Aug 2, 2018
@petermattis petermattis added this to the 2.1 milestone Aug 2, 2018
@petermattis
Collaborator Author

Finished the backfill of ycsb perf numbers and I saw this error up through August binaries. So whatever is happening has not been fixed. @tschottdorf Any thoughts on debugging this?

@tbg
Member

tbg commented Aug 13, 2018

I think printing the existing value whenever this error occurs should put us on the hot trail. The AbortSpan is supposed to be all inline.

@petermattis
Collaborator Author

@tschottdorf The existing value is nil. That's why meta.IsInline() is returning false:

func (meta MVCCMetadata) IsInline() bool {
	return meta.RawBytes != nil
}

Is your suspicion that there is a non-metadata key present?
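For context, the distinction in question is roughly the following (a sketch only; MVCCKey and MakeMVCCMetadataKey are the forms used elsewhere in this thread, `ts` is a placeholder timestamp):

// An inline/metadata record lives at the zero-timestamp form of the key,
// while versioned values and intents live at key@timestamp.
metaKey := engine.MakeMVCCMetadataKey(key)              // key with zero timestamp -> MVCCMetadata
versionedKey := engine.MVCCKey{Key: key, Timestamp: ts} // key at a specific timestamp -> value
_, _ = metaKey, versionedKey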

@petermattis
Collaborator Author

I added a bit of additional logging to mvccPutInternal to log whenever an abort span key is written:

--- a/pkg/storage/engine/mvcc.go
+++ b/pkg/storage/engine/mvcc.go
@@ -1166,6 +1166,11 @@ func mvccPutInternal(
                return emptyKeyError()
        }

+       var isAbortSpanKey bool
+       if _, err := keys.DecodeAbortSpanKey(key, nil); err == nil {
+               isAbortSpanKey = true
+       }
+
        metaKey := MakeMVCCMetadataKey(key)
        ok, origMetaKeySize, origMetaValSize, err := mvccGetMetadata(iter, metaKey, &buf.meta)
        if err != nil {
@@ -1174,6 +1179,9 @@ func mvccPutInternal(

        // Verify we're not mixing inline and non-inline values.
        putIsInline := timestamp == (hlc.Timestamp{})
+       if isAbortSpanKey {
+               log.Infof(ctx, "%s: put inline=%t value=[% x]", metaKey, putIsInline, value)
+       }
        if ok && putIsInline != buf.meta.IsInline() {
                return errors.Errorf("%q: put is inline=%t, but existing value is inline=%t",
                        metaKey, putIsInline, buf.meta.IsInline())

I then ran the ycsb/B roachtest in a loop, eventually getting a failure:

Error: pq: "/Local/RangeID/112/r/AbortSpan/\"44daad1e-9f00-415b-ac57-b54e9301104e\"": put is inline=true, but existing value is inline=false

But the only reference to that transaction ID is:

I180814 00:53:54.199183 4048872 storage/engine/mvcc.go:1183  [n3,s3,r112/2:/Table/53/1/"user7{559…-933…}] /Local/RangeID/112/r/AbortSpan/"44daad1e-9f00-415b-ac57-b54e9301104e": put inline=true value=[80 9a f6 c3 03 0a 1d bd 89 12 75 73 65 72 37 37 37 30 35 32 37 32 37 35 36 34 31 30 34 33 32 30 35 00 01 88 12 0a 08 e4 e1 f4 c5 c1 c2 a6 a5 15 18 98 dd 15]

Seems like this is the only write of the abort span key. I'm not sure what this means. Perhaps something is taking a code path to write an abort span key without going through mvccPutInternal. Need to page this code back into memory.

@tbg tbg added this to On the horizon in KV Aug 14, 2018
@tbg
Member

tbg commented Aug 14, 2018

Interesting. So what do you know about the raw kv pair that is read when the put error occurs? Only that the decoded meta has len(RawBytes) == 0? If you print out the whole meta, do you get a nonzero timestamp or just a zero MVCCMetadata? Can you also print out the raw iter key and value?
We touch AbortSpan keys so little that I'm tempted to think that this is tickling some engine bug.
Or perhaps a bug in the split logic, where we copy these keys around.

@petermattis
Collaborator Author

Yeah, something funky is going on. I'm going to add a bunch more logging for abort span keys and reproduce again. Could be an engine bug. Could be something else.

@petermattis
Collaborator Author

Here's another reproduction with the contents of MVCCMetadata:

1.logs/cockroach.log:I180814 14:55:19.288882 68397 storage/engine/mvcc.go:1183 [n1,s1,r86/1:/Table/53/1/"user5{316…-613…}] /Local/RangeID/86/r/AbortSpan/"ab4f3599-bc74-4158-9df1-39992a00f92c": put inline=true value=94b1fcbc030a1cbd891275736572353535333138343539323332393536303238000188120a08bdb7f398e9ffb1a515188cf30c meta={Txn:id:<c11eebda-c5bd-44bc-b6ed-95251fb9c3df> key:"\275\211\022user555318459232956028\000\001\210" epoch:1 timestamp:<wall_time:1534258519271844397 logical:1 > priority:2657718 sequence:3 Timestamp:1534258519.258279173,2 Deleted:false KeyBytes:12 ValBytes:1025 RawBytes:[] MergeTimestamp:<nil>}

Is the txn-ID stored in the MVCCMetadata supposed to differ from the one stored in the abort-span key?

@petermattis
Collaborator Author

Note that I also instrumented the low-level code paths for batch.{Put,Clear} and didn't see the abort-span key in question ever being written via those code paths. Very curious.

@tbg
Member

tbg commented Aug 14, 2018

Wait, are you telling me that you're seeing an intent on the abort span? That's very wrong. Even seeing a versioned value would be very wrong. I guess we knew that something was very wrong before, but this is still surprising. I bet this is an intent written by the workload and it accidentally gets read back to the abort cache (i.e. it isn't really in the abort cache, just something goes wrong on the engine level).

This is in line with your findings that the abort span key was never actually written - it probably wasn't! It was just about to be written for the first time.

I think reasonable next steps are saving a data dir and dumping the keys (to see whether we persistently get this key displayed under the offending abortspan key for that run). If yes, great, problem almost fixed because we're going to track it down (I assume you're certain that you would know if it had actually been written there). If there just isn't an abortspan key, it must be a transient bug and I don't have a better idea other than printf-debugging all the way into RocksDB :(

@petermattis
Collaborator Author

Wait, are you telling me that you're seeing an intent on the abort span? That's very wrong.

Yes, yes it is. This is reproducible, so you don't have to worry. The bug is already dead, it just doesn't know it yet.

@petermattis
Collaborator Author

Got another two failures with a minor twist to the logging. Just before we generate the error, I seek the iterator to roachpb.KeyMin and then read the mvcc metadata again. This forces the iterator state to reset and ensures we're not seeing something cached inappropriately (i.e. we're actually reading the key from rocksdb). Same problem as before:

[n2,s2,r101/3:/Table/53/1/"user5{316…-613…}] /Local/RangeID/101/r/AbortSpan/"d4758249-4849-4b9b-a31b-ec1592ed7f00": ok=false meta={Txn:id:<7360af2b-5bf3-42d1-86be-70d9e0de6478> key:"\275\211\022user555318459232956028\000\001\210" epoch:1 timestamp:<wall_time:1534308806206257025 logical:2 > priority:185105 sequence:3 Timestamp:1534308806.206257025,2 Deleted:false KeyBytes:12 ValBytes:1025 RawBytes:[] MergeTimestamp:<nil>}

And from a different run:

I180815 05:24:06.902673 1956067 storage/engine/mvcc.go:1187 [n3,s3,r92/2:/Table/53/1/"user5{316…-613…}] /Local/RangeID/92/r/AbortSpan/"771bcc4d-1b82-416b-9d01-dfbaa1ed8700": ok=false meta={Txn:id:<25efa1f2-2e21-4f14-b665-3bf59d2ef653> key:"\275\211\022user555318459232956028\000\001\210" epoch:1 timestamp:<wall_time:1534310646888529814 logical:1 > priority:694545 sequence:3 Timestamp:1534310646.880123188,2 Deleted:false KeyBytes:12 ValBytes:1025 RawBytes:[] MergeTimestamp:<nil>}

In both cases we seem to have an intent on an abort span key which should be impossible. Time to go instrument at a lower level to find out where this intent is coming from.

Something that might be a coincidence: the Txn.key in both failures is \275\211\022user555318459232956028\000\001\210. The user... part is from a write to the table key, and that key is supposed to be a random number. Notice that it is the same as in the earlier failure: three failures, all involving the same Txn.key. That seems unlikely. But perhaps this is the first key in the skewed ycsb distribution. I wonder if that is related. (Just thinking out loud.)
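For reference, the forced re-read described at the top of this comment amounts to roughly the following (a sketch, not the exact instrumentation; mvccGetMetadata and MakeMVCCMetadataKey as used elsewhere in mvcc.go):

// Reset the iterator by seeking to KeyMin, then decode the metadata for
// metaKey again, so the result must come from RocksDB rather than any
// cached iterator state.
iter.Seek(MakeMVCCMetadataKey(roachpb.KeyMin))
ok, _, _, err := mvccGetMetadata(iter, metaKey, &buf.meta)
log.Infof(ctx, "%s: ok=%t meta=%+v err=%v", metaKey, ok, buf.meta, err)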

@nvanbenschoten
Member

If my memory serves me correctly, @spencerkimball ran into something very similar when working on the contentionQueue. I'm having a hard time remembering the specifics, but I think it had something to do with transactions that hadn't been assigned keys yet being pushed. I know that's not much to go on, and I wouldn't let it distract you unless further evidence points to something going wrong with contention. It's also possible that he was seeing exactly this and it went away before he could track it down.

@spencerkimball
Member

Yes, I was seeing something like this when dealing with high contention. I put in tons of logging and kept arriving at impossible conclusions. Then something I changed caused the problem to disappear and I lost track of it. The details are hazy, but I remember even creating a new engine to look up the spurious value (which seemed impossible) through a fresh code path, and my debugging efforts still kept foundering. It's something subtle.

@petermattis
Collaborator Author

Well I can still reproduce on current master. Takes between 20m and 3h to reproduce, but that's something I can track down. @spencerkimball Do you have any notes from your previous debugging effort?

@nvanbenschoten
Member

As a quick sanity check, it might be worth trying to reproduce with 0969698 (parent of 9a54256), although there is some risk that the speedup from 9a54256 simply revealed an existing problem.

@spencerkimball
Member

No, I don't have notes. Damn, I was able to reproduce it in a minute or so using the YCSB workload A generator, but the changes I ended up with avoided the problem. I want to say the problem was much worse when I had a bug where I'd push txns which didn't yet have a txn "key" because they were still read-only. This of course caused those pushes to go to range 0, because the key was "", where they'd pile up in strange ways. I'll poke around and try to repro, but the eventual strategy for high contention is much evolved from those earlier points where I'd see the issue.

@petermattis
Collaborator Author

@nvanbenschoten I definitely encountered this problem in binaries from April. The problem has existed for at least that long. 9a54256 was merged on May 31.

@spencerkimball
Member

@nvanbenschoten I was just running the YCSB load generator and saw this error, which is now repeating endlessly: split failed while applying backpressure: could not find valid split key

@nvanbenschoten
Member

s/repeating endlessly/repeating until old versions of a row get GCed/

That's almost certainly due to a single row consuming more than 128MB of space, which makes some sense given the nature of YCSB. The result is expected and actually desirable given our current handling of old MVCC versions, see #24215. The alternatives are to let the range grow without bound or to violate the GC policy.

How long were you running for before you hit that? I'd try dropping your gc_ttl and see if it goes away.

@petermattis
Collaborator Author

The split problem sounds like a different issue. Let's file a new bug to track it.

@petermattis
Collaborator Author

Btw, my hope and expectation is that I'll track this down tomorrow (or this weekend if reproduction trends to the longer side).

@petermattis
Collaborator Author

I finally fixed my test setup to not wipe the cluster when a failure occurs. The next failure shows the same pattern as above, what appears to be an intent on an abort span key:

I180817 15:02:55.825840 17781 storage/engine/mvcc.go:1179 [n2,s2,r132/3:/Table/53/1/"user11392{6…-7…}] /Local/RangeID/132/r/AbortSpan/"251b5a15-06f8-466c-8630-a0c40059c100": put inline=true value=9c686186030a1ebd8912757365723131333932363934313632393930353935313930000188120a08888bcc97e984eda51518efa224 meta={Txn:id:<0724303f-3f34-45de-a556-c712f9d1af3c> key:"\275\211\022user11392694162990595190\000\001\210" epoch:1 timestamp:<wall_time:1534518175811124572 logical:2 > priority:870120 sequence:3 Timestamp:1534518175.811124572,2 Deleted:false KeyBytes:12 ValBytes:1025 RawBytes:[] MergeTimestamp:<nil>}

I stopped the cluster and dumped the range-data for r132. The only mention of that 251b5a15-06f8-466c-8630-a0c40059c100 is the txn record:

0.000000000,0 /Local/Range/Table/53/1/"user11392694162990595190"/0/Transaction/"251b5a15-06f8-466c-8630-a0c40059c100": id=251b5a15 key=/Table/53/1/"user11392694162990595190"/0 rw=false pri=0.02767364 iso=SERIALIZABLE stat=ABORTED epo=1 ts=1534518175.816942984,0 orig=1534518175.816942984,0 max=0.000000000,0 wto=false rop=false seq=3

The stack trace I added at mvcc.go:1179 indicates that the failure is coming out of the ResolveIntent code path. Not sure if that is related. I have code that verifies that the engine iterator is definitely seeing an abort span with an intent, but when I looked on disk there was no such abort span. One possibility is that the batch contains the messed-up key. I'm going to add more logging to dump the batch contents when this error occurs.

Also, I still have the instrumentation that detects all abort span puts, and it indicates that the above abort span key was never written. The current theory is that some rare code path is stomping on or reusing memory inappropriately.

@petermattis
Collaborator Author

Happened again and my instrumentation to print out the contents of the batch didn't output anything. Either that instrumentation was incorrect or the batch was empty. I added some more instrumentation to figure out which it is. Again on this failure, the abort span key with an intent was not present on disk.

@petermattis
Collaborator Author

Additional instrumentation shows the batch is empty, which was probably to be expected. So wtf is going on that we see an abort-span key with an intent in memory, but it doesn't appear to exist on disk?

@spencerkimball
Member

This is where I ended up. No sensible explanation presented itself.

@petermattis
Collaborator Author

I added some more instrumentation so that when AbortSpan.Put fails I dump all of the abort span keys for the associated range. The most recent occurrence of the error was:

I180817 17:57:55.647593 161386 storage/engine/mvcc.go:1179 [n2,s2,r121/2:/Table/53/1/"user11{375…-414…}] /Local/RangeID/121/r/AbortSpan/"e22301fe-04ce-45b4-9ff2-4c00006c64fc": put inline=true value=804d4712030a1ebd8912757365723131333932363934313632393930353935313930000188120a08eeb0d789b4b6efa51518f4812b meta={Txn:id:<3e048ad2-d748-44b0-a34f-86faa8f437e9> key:"\275\211\022user11392694162990595190\000\001\210" epoch:1 timestamp:<wall_time:1534528675641077829 logical:1 > priority:1247268 sequence:3 Timestamp:1534528675.614086902,4 Deleted:false KeyBytes:12 ValBytes:1025 RawBytes:[] MergeTimestamp:<nil>}

And the logging that occurs immediately after this (on the same goroutine, using the same batch) shows 155 abort-span keys, but the key above isn't present. Hmm, I just noticed I used MVCCIterate to iterate over the abort-span keys. I'm going to change that to use Reader.Iterate instead, which is lower level.

@petermattis
Collaborator Author

Another repro, this time with the Reader.Iterate version. The range had 4027 keys in the /Local/RangeID/<id>/r/AbortSpan space, but all of them were unversioned.

@petermattis
Collaborator Author

Ok, making progress. Rather than using Reader.Iterate, which creates a new RocksDB iterator, I also added a version which uses Reader.NewIterator(IterOptions{Prefix: true}). This uses the iterator cached inside of the RocksDB batch. Lo and behold, the abort-span key in question is present. Here is the original error message:

I180817 20:36:48.859271 42987 storage/engine/mvcc.go:1179 [n1,s1,r118/1:/Table/53/1/"user11392{6…-7…}] /Local/RangeID/118/r/AbortSpan/"610053c2-c0c8-47f5-8a7d-d7d56a07ac00": put inline=true value=06cda006030a1ebd8912757365723131333932363934313632393930353935313930000188120a08b088ae87eecbf1a51518c5b729 meta={Txn:id:<4bc808eb-3169-431b-a770-befb5ad0c203> key:"\275\211\022user11392694162990595190\000\001\210" epoch:1 timestamp:<wall_time:1534538208839715285 logical:2 > priority:423827 sequence:3 Timestamp:1534538208.839715285,2 Deleted:false KeyBytes:12 ValBytes:1025 RawBytes:[] MergeTimestamp:<nil>}

And then the subsequent iteration using the prefix iterator:

I180817 20:36:48.859382 42987 storage/abortspan/abortspan.go:162 [n1,s1,r118/1:/Table/53/1/"user11392{6…-7…}] [iter] /Local/RangeID/118/r/AbortSpan/"610053c2-c0c8-47f5-8a7d-d7d56a07ac00"

Immediately after that message is logged, I iterate over all of the abort-span keys for the range with:

	min := engine.MakeMVCCMetadataKey(sc.min())
	max := engine.MakeMVCCMetadataKey(sc.max())
	_ = e.Iterate(min, max, func(kv engine.MVCCKeyValue) (bool, error) {
		log.Infof(ctx, "%s\n", kv.Key)
		return false, nil
	})

And the key /Local/RangeID/118/r/AbortSpan/"610053c2-c0c8-47f5-8a7d-d7d56a07ac00" is not present! So something funky is going on either with prefix iteration or the cached iterator.
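For comparison, the check that does find the key goes through the batch's cached prefix iterator; roughly like this (a sketch with approximate names; key stands for the abort-span key being written):

	// Sketch of the prefix-iterator check: this reuses the iterator
	// cached inside the RocksDB batch rather than creating a fresh one.
	it := e.NewIterator(engine.IterOptions{Prefix: true})
	defer it.Close()
	it.Seek(engine.MakeMVCCMetadataKey(key))
	if ok, err := it.Valid(); err == nil && ok {
		log.Infof(ctx, "[iter] %s", it.Key())
	}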

@petermattis
Collaborator Author

(I really hope this doesn't turn out to be a bug in RocksDB)

@spencerkimball
Member

Sure sounds like it.

@petermattis
Collaborator Author

My instrumentation has evolved to the following:

	if ok && putIsInline != buf.meta.IsInline() {
		log.Infof(ctx, "%s: put inline=%t value=%x meta=%+v iter=%T",
			metaKey, putIsInline, value, buf.meta, iter)

		switch t := iter.(type) {
		case *reusableBatchIterator:
			log.Infof(ctx, "%s: [pre] key=%s", metaKey, cToGoKey(t.iter.key))
			t.iter.reseek = true
		case *rocksDBIterator:
			log.Infof(ctx, "%s: [pre] key=%s", metaKey, cToGoKey(t.key))
			t.reseek = true
		case *batchIterator:
			log.Infof(ctx, "%s: [pre] key=%s", metaKey, cToGoKey(t.iter.key))
			t.iter.reseek = true
		}

		iter.Seek(metaKey)
		if ok, err := iter.Valid(); !ok {
		} else if err != nil {
			log.Infof(ctx, "%s: [seek] err=%v", metaKey, err)
		} else {
			log.Infof(ctx, "%s: [seek] key=%s", metaKey, iter.Key())
		}

		return errors.Errorf("%q: put is inline=%t, but existing value is inline=%t",
			metaKey, putIsInline, buf.meta.IsInline())
	}

The type switch was necessary because I wasn't quite sure which iterator type I was dealing with, and I wanted to reach inside its internals.

The logs show:

I180817 21:57:20.208318 1589898 storage/engine/mvcc.go:1178 [n1,s1,r126/1:/Table/53/1/"user11392{6…-7…}] /Local/RangeID/126/r/AbortSpan/"004da400-e39c-491b-aaf3-85c210d57713": put inline=true value=8378115b030a1ebd8912757365723131333932363934313632393930353935313930000188120a0886b4af9dbcd8f2a5151894842b meta={Txn:id:<d7f45d77-2432-4ead-9467-47897988f950> key:"\275\211\022user11392694162990595190\000\001\210" epoch:1 timestamp:<wall_time:1534543040192493874 logical:2 > priority:2288140 sequence:3 Timestamp:1534543040.192493874,2 Deleted:false KeyBytes:12 ValBytes:1025 RawBytes:[] MergeTimestamp:<nil>} iter=*engine.reusableBatchIterator

Then:

I180817 21:57:20.208364 1589898 storage/engine/mvcc.go:1183 [n1,s1,r126/1:/Table/53/1/"user11392{6…-7…}] /Local/RangeID/126/r/AbortSpan/"004da400-e39c-491b-aaf3-85c210d57713": [pre] key=/Local/RangeID/126/r/AbortSpan/"004da400-e39c-491b-aaf3-85c210d57713"

Then nothing. I didn't output a log message for the case where the iterator wasn't valid. (Not sure why I did that; it's been a long day.) reusableBatchIterator.iter.key is the key cached for the iterator's current position. It is very interesting that forcing the iterator to seek (setting reseek = true) causes that key to not be found on the subsequent Seek call.

reusableBatchIterator.iter.key (aka rocksDBIterator.key) is only set by rocksDBIterator.setState. We only access that field if the iterator is valid. Need to think for a bit about my next move.

I'd be suspicious of the recent RocksDB changes except that we've seen this bug back in April, well before those changes were made.

@petermattis
Collaborator Author

AFAICT (at the moment), at some point we make a call to rocksDBIterator.Seek(some-abort-span-key). This is the very first time some-abort-span-key has been seen. The call finds some-abort-span-key without actually seeking on the RocksDB iterator because rocksDBIterator.key was already some-abort-span-key. This is impossible! Time to start printing out the memory addresses.

@petermattis
Collaborator Author

Current theory: we have 93 uses of UnsafeKey and one of them isn't kosher.
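For anyone following along, the hazard with UnsafeKey is roughly this (an illustrative sketch, not a specific call site):

// UnsafeKey returns a key that aliases memory owned by the iterator
// (ultimately the C++ iterator's buffers), so it is only valid until the
// iterator is repositioned or closed. Holding on to it across a Seek/Next
// is the kind of misuse being audited here.
k := iter.UnsafeKey() // aliases iterator-owned memory
iter.Next()           // k may now describe different bytes
safe := iter.Key()    // Key() returns a copy, which remains valid
_, _ = k, safe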

@petermattis
Collaborator Author

I instrumented rocksDBIterator.setState to log whenever we set the key to an abort-span key. Another repro occurred, this time for the key /Local/RangeID/117/r/AbortSpan/"08510084-005c-4ff8-b4c9-ba7a1c350e8f". Again, rocksDBIterator.Seek is finding that key without actually seeking because rocksDBIterator.key already contains that value. But my logging in rocksDBIterator.setState never output a message with that key. This is looking more and more like some sort of unintentional memory sharing.

In both the rocksDBIterator.Seek and rocksDBIterator.setState instrumentation I output the address of the C++ key. Here is the seek log message in question:

I180818 01:33:34.856396 1477577 storage/engine/rocksdb.go:2030 seek(/Local/RangeID/117/r/AbortSpan/"08510084-005c-4ff8-b4c9-ba7a1c350e8f"): /Local/RangeID/117/r/AbortSpan/"08510084-005c-4ff8-b4c9-ba7a1c350e8f" [0xc428145ea0/0x7fed10c93060]: storage/engine.(*rocksDBIterator).Seek:2031,storage/engine.(*batchIterator).Seek:1364,storage/engine.mvccGetMetadata:765,storage/engine.mvccPutInternal:1170,storage/engine.mvccPutUsingIter:1107

Hard to see, but the value 0x7fed10c93060 corresponds to the C++ key data. This is the only occurrence of that value in the log files.

I audited the uses of UnsafeKey and didn't see anything funky. Perhaps there is some memory misuse in the libroach code.

@petermattis
Collaborator Author

Yet more instrumentation shows how funky this bug is. Whenever rocksDBIterator.setState is called, I record the pointer value and the key that was read, along with a small trace (util.GetSmallTrace). Then, when a rocksDBIterator.Seek call finds the iterator already located at an abort-span key, I log this info:

I180818 03:44:35.416860 1688180 storage/engine/rocksdb.go:2031 0xc42ac67ba8/0x7fb23381f180: seek(/Local/RangeID/119/r/AbortSpan/"9cfb0027-4a0e-40b2-afb6-0087ac6acdb5"): /Local/RangeID/119/r/AbortSpan/"9cfb0027-4a0e-40b2-afb6-0087ac6acdb5": storage/engine.(*rocksDBIterator).Seek:2032,storage/engine.(*batchIterator).Seek:1364,storage/engine.mvccGetMetadata:765,storage/engine.mvccPutInternal:1170,storage/engine.mvccPutUsingIter:1107 [0x7fb23381f180: /Table/53/1/"user11392694162990595190"/0: storage/engine.(*rocksDBIterator).setState:2121,storage/engine.(*rocksDBIterator).Seek:2036,storage/engine.(*batchIterator).Seek:1364,storage/engine.mvccGetMetadata:765,storage/engine.mvccResolveWriteIntent:2203]

This corresponds to the following bit of logging:

		if r.valid && !r.reseek && key.Equal(r.UnsafeKey()) {
			if isAbortSpanKey(r.UnsafeKey()) {
				log.Infof(context.TODO(), "%p/%p: seek(%s): %s: %s [%s]",
					r, &r.UnsafeKey().Key[0], key, r.UnsafeKey(), util.GetSmallTrace(2), r.stateMsg)
			}
			return
		}
		r.setState(C.DBIterSeek(r.iter, goToCKey(key)))

Let's break this down. The key passed to rocksDBIterator.Seek is /Local/RangeID/119/r/AbortSpan/"9cfb0027-4a0e-40b2-afb6-0087ac6acdb5". These same bytes were already present in rocksDBIterator.key.

rocksDBIterator.Seek was called from:

storage/engine.(*rocksDBIterator).Seek:2032
storage/engine.(*batchIterator).Seek:1364
storage/engine.mvccGetMetadata:765
storage/engine.mvccPutInternal:1170
storage/engine.mvccPutUsingIter:1107

Now the super interesting part. The last setState call saw a different value in rocksDBIterator.key. Specifically, it saw /Table/53/1/"user11392694162990595190"/0. rocksDBIterator.key.key.data has the same pointer value in both the setState call and the Seek call, but the data changed. That's not supposed to happen.

The small trace for the setState call is:

storage/engine.(*rocksDBIterator).setState:2121
storage/engine.(*rocksDBIterator).Seek:2036
storage/engine.(*batchIterator).Seek:1364
storage/engine.mvccGetMetadata:765
storage/engine.mvccResolveWriteIntent:2203

I believe this corresponds to the engine.MVCCResolveWriteIntent call from batcheval.ResolveIntent.

So we made a call to MVCCResolveWriteIntent. That call seeked the iterator to /Table/53/1/"user11392694162990595190"/0 and cached that location with setState, and then sometime later, without another call to setState, the key transformed into /Local/RangeID/119/r/AbortSpan/"9cfb0027-4a0e-40b2-afb6-0087ac6acdb5". (I wonder if the encoded lengths of those keys are identical.)

@petermattis
Collaborator Author

I found a bug, and it is a doozy; it's a minor miracle that it wasn't causing more problems. The easy-to-understand version: rocksDBIterator caches the currently-pointed-to key and value inside its struct. Every Cgo call performed by rocksDBIterator needs to be associated with a call to rocksDBIterator.setState. A handful of rocksDBIterator methods were not doing this correctly.

In 2.0 (and earlier) several additions were made to the Iterator interface that didn't follow this pattern: in particular ComputeStats, FindSplitKey, MVCCGet, and MVCCScan. The latter two calls were added in 2.0; FindSplitKey was added in 1.1 and ComputeStats before 1.0. While the bug has been present in these methods since their inception, actually tickling it requires an unusual combination of circumstances. It is likely that only the recent addition of Iterator.MVCC{Get,Scan} could result in a problem in practice.
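To make the failure mode concrete, here is a self-contained toy model (this is not the actual rocksDBIterator code; names and structure are simplified): a wrapper caches its current key on the Go side, its Seek short-circuits when the cache already matches the target, and a method that moves the underlying iterator without refreshing the cache leaves that shortcut trusting stale state.

package main

import "fmt"

// cachedIter is a toy stand-in for a Cgo-backed iterator that mirrors its
// current position into Go-side fields (like rocksDBIterator.key).
type cachedIter struct {
	keys []string // stands in for data owned by the C++ iterator
	pos  int
	key  string // Go-side cache of the current key
}

// setState refreshes the Go-side cache from the underlying position.
func (it *cachedIter) setState() { it.key = it.keys[it.pos] }

// Seek short-circuits if the cached key already equals the target, skipping
// the (expensive) underlying seek -- the optimization that misfires when the
// cache is stale.
func (it *cachedIter) Seek(k string) {
	if it.key == k {
		return
	}
	for i, c := range it.keys {
		if c == k {
			it.pos = i
			break
		}
	}
	it.setState()
}

// Scan moves the underlying iterator but forgets to call setState, which is
// the shape of the bug in ComputeStats/FindSplitKey/MVCCGet/MVCCScan.
func (it *cachedIter) Scan() { it.pos = len(it.keys) - 1 }

func main() {
	it := &cachedIter{keys: []string{`/Table/53/1/"user..."/0`, `/Local/.../AbortSpan/"..."`}}
	it.Seek(`/Table/53/1/"user..."/0`)
	it.Scan() // cache is now stale
	// The shortcut fires even though the underlying iterator has moved.
	it.Seek(`/Table/53/1/"user..."/0`)
	fmt.Printf("cached=%s actual=%s\n", it.key, it.keys[it.pos])
}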

I'm feeling pretty confident that this was causing the put-is-inline discrepancy, though I'll be running ycsb in a loop all day to verify.

@spencerkimball
Member

Amazing debugging.

@petermattis
Collaborator Author

While the bug has been present in these methods since their inception, actually tickling it requires an unusual combination of circumstances.

Specifically, we need a sequence of MVCC calls: one that uses mvccGetMetadata, then an MVCCGet, followed by another mvccGetMetadata. And we need the keys in question to be exactly the same length (because the key length is cached on the Go side). Lastly, the target key for the final mvccGetMetadata needs to match (which is particularly difficult to arrange). Intent resolution followed by setting the abort span follows this pattern:

batcheval.ResolveIntent
-> MVCCResolveWriteIntent('/Table/53/1/"user11392694162990595190"/0')
  -> mvccGetMetadata('/Table/53/1/"user11392694162990595190"/0')
-> batcheval.SetAbortSpan()
  -> abortspan.AbortSpan.Get('/Local/RangeID/130/r/AbortSpan/"7fc79000-82d8-48bc-abd5-5b9b00b5320a"')
    -> MVCCGet('/Local/RangeID/130/r/AbortSpan/"7fc79000-82d8-48bc-abd5-5b9b00b5320a"')
  -> abortspan.AbortSpan.Put('/Local/RangeID/130/r/AbortSpan/"7fc79000-82d8-48bc-abd5-5b9b00b5320a"')
    -> mvccGetMetadata('/Local/RangeID/130/r/AbortSpan/"7fc79000-82d8-48bc-abd5-5b9b00b5320a"')

What's not obvious from the pretty-printed representation of the keys above is that /Table/53/1/"user11392694162990595190"/0 and /Local/RangeID/130/r/AbortSpan/"7fc79000-82d8-48bc-abd5-5b9b00b5320a" are both 30 bytes long when encoded. A slightly longer table key would not cause a problem, nor would a larger tableID, which would make the variable-length encoding of the tableID a byte longer. The variable-length encoding of the rangeID also has to line up.

A bit of additional instrumentation shows that the first 2 conditions mentioned above are reasonably common in our tests, but the 3rd condition doesn't occur.

@petermattis
Collaborator Author

Fixed by #28794 (linked the wrong issue in the commit message).

@nvanbenschoten
Member

Awesome job tracking this down!

A bit of additional instrumentation shows that the first 2 conditions mentioned above are reasonably common in our tests, but the 3rd condition doesn't occur.

Doesn't occur or doesn't readily occur? If it's easy enough to detect these scenarios after the fact, it may be worth doing so loudly on master after the code freeze and leaving the detection in for a few days to search for any test flakes that were caused by this bug. That may help close a few long-standing test flakes that won't be reproducible from this point onward.

I'm also curious whether you think this could have materialized as inconsistencies other than just this inline=true, but existing value is inline=false error.

@petermattis
Collaborator Author

Doesn't occur or doesn't readily occur?

AFAICT, it doesn't occur at all. I'm still scratching my head as to how it does occur in this scenario. As near as I can tell, the original memory the iterator pointed to needs to be freed, then reallocated and used to store the key when C.MVCCGet() is called, and that same key needs to be used in the next call to rocksDBIterator.Seek.

It is straightforward to add instrumentation to detect this scenario, though I'm not sure how to do it for particular builds (such as race-only) without increasing the size of the rocksDBIterator structure. With the added instrumentation I didn't detect any instances in our tests of all 3 conditions holding. I might take another crack at this.
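One shape the detection could take (a rough sketch, gated on the race build so it doesn't need extra state in rocksDBIterator; util.RaceEnabled is an assumption, DBIterSeek/goToCKey as in the snippet further up): whenever the Seek shortcut would fire, do the real seek anyway and compare.

	// Sketch: when the "already positioned here" shortcut in Seek would
	// fire, perform the real DBIterSeek anyway and fatal if the cached
	// state no longer matches what RocksDB reports.
	if util.RaceEnabled && r.valid && !r.reseek && key.Equal(r.UnsafeKey()) {
		cached := append(roachpb.Key(nil), r.UnsafeKey().Key...)
		r.setState(C.DBIterSeek(r.iter, goToCKey(key)))
		if ok, err := r.Valid(); err != nil || !ok || !r.UnsafeKey().Key.Equal(cached) {
			log.Fatalf(context.TODO(), "iterator cache diverged at %s (was %s)", key, cached)
		}
	}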

I'm also curious whether you think this could have materialized as inconsistencies other than just this inline=true, but existing value is inline=false error.

Certainly possible.

@petermattis
Collaborator Author

Certainly possible.

... though rare. I would guess that the recent spate of test flakiness is not due to this. Remember that this bug has been present since 2.0.

@tbg tbg moved this from On the horizon to Closed in KV Aug 21, 2018