New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read skew leading to logical data corruption with tablet moves #4543
Comments
I ran these tests with the release/v1.1 branch (e18986f) that will be released as v1.1.2. This contains the fix from PR #4496.
This build passes 10 individual 600-second test runs of the |
It looks like this bug is either rare or wasn't present in 1.1.1; I haven't managed to reproduce it there in a few hours of testing. |
I've managed to confirm this issue was present in 1.1.1 as well: here's a failing test case. You can reproduce this with Jepsen b7c7bdc5f8476e009d591927060e7e8e786f015a by running: lein run test-all --username admin --nodes-file ~/nodes -w bank --nemesis move-tablet --version 1.1.1 --time-limit 600 --test-count 30 It's not as common as the null record problem we saw in #4534, but it does happen. This can't be caused by the null record issue, because in this case, the account total rose from 100 to 102: if we'd propagated null state back into the DB via a transfer transaction, we could only lower, rather than raise, the account balance. |
Reopening based on last comment by @aphyr . |
Digging into this 1.1.1 failure a little more, I think this might help shed some light on things. Process 5 completes a transfer of $2 from 6 to 3 while a tablet move is taking place. The increment of account 3 takes place, but the decrement doesn't. Then process 5 fails with an {:type :ok, :f :transfer, :value {:from 6, :to 3, :amount 2}, :process 5, :time 263508090019, :index 58018}
{:type :invoke, :f :transfer, :value {:from 6, :to 1, :amount 1}, :process 5, :time 263508241599, :index 58019}
{:type :ok, :f :read, :process 16, :time 263508906869, :value {0 51, 2 1, 3 5, 4 11, 5 3, 6 8, 7 21}, :index 58020}
{:type :invoke, :f :transfer, :value {:from 0, :to 5, :amount 2}, :process 16, :time 263509029440, :index 58021}
{:type :ok, :f :read, :process 14, :time 263516597337, :value {0 51, 2 1, 3 5, 4 11, 5 3, 6 8, 7 21}, :index 58022}
{:type :invoke, :f :transfer, :value {:from 3, :to 5, :amount 2}, :process 14, :time 263516725225, :index 58023}
{:type :fail, :f :transfer, :value {:from 0, :to 5, :amount 2}, :process 16, :time 263592684274, :error :conflict, :index 58024}
{:type :invoke, :f :read, :process 16, :time 263592858323, :index 58025}
{:type :fail, :f :transfer, :value {:from 6, :to 1, :amount 1}, :process 5, :time 263596806299, :error :conflict, :index 58026}
{:type :invoke, :f :read, :process 5, :time 263596947851, :index 58027}
{:type :fail, :f :transfer, :value {:from 3, :to 5, :amount 2}, :process 14, :time 263790314458, :error :conflict, :index 58028}
{:type :invoke, :f :read, :process 14, :time 263790469958, :index 58029}
{:type :ok, :f :read, :process 16, :time 263833251597, :value {0 51, 2 1, 3 7, 4 11, 5 3, 6 8, 7 21}, :message {:type :wrong-total, :total 102, :span-id "SpanId{spanId=dec5e56fcb430e47}", :trace-id "TraceId{traceId=8811112f1f038549f03f4ecf10c33fae}"}, :error :checker-violation, :index 58030}
{:type :invoke, :f :read, :process 16, :time 263833381441, :index 58031}
{:type :fail, :f :transfer, :value {:from 2, :to 4, :amount 4}, :process 2, :time 263836386394, :error :insufficient-funds, :index 58032}
{:type :invoke, :f :read, :process 2, :time 263836519182, :index 58033}
{:type :info, :f :move-tablet, :value {"amount_0" [2 1], "amount_2" [2 1], "dgraph.type" [1 2], "key_2" [2 1], "type_2" [1 2]}, :process :nemesis, :time 264319082817, :index 58034}
{:type :fail, :f :read, :process 5, :time 264612363612, :error :old-timestamp, :index 58035} |
After extensive testing, I haven't managed to reproduce this on 1.1.1-57-g4abba793a. |
Nope, it's still present on 1.1.1-57-g4abba793a. I need to design a more aggressive variant of the bank test, I think; it's taking too long for us to catch these bugs. |
Haven't seen this in 1.1.1-59-g191232226 yet, but I also haven't given it as much time as it took to detect in 1.1.1-57-g4abba793a. |
Github issues have been deprecated. |
What version of Dgraph are you using?
1.1.1-48-g157896305
Have you tried reproducing the issue with the latest release?
1.1.1-48-g157896305 is a few days old; I'm trying 1.1.1-65 next.
What is the hardware spec (RAM, OS)?
A 5-node EC2 cluster, running m4.larges.
Steps to reproduce the issue (command/config used to run Dgraph).
With Jepsen 9f780a144248776e0143bc9699647802a29e1d28, run
I hit this case after roughly 1800 seconds of testing.
Expected behaviour and actual result.
In this bank test run, at index 39931, the total of all accounts changes from 100 to 98, and remains at 98 for the remainder of the test. Here's an excerpt from the history, restricted to just operations involving account 6:
Because there's no obvious proximate transfer operation here, this makes me wonder if maybe this single account got rolled back somehow? Whatever the cause, the effects persisted for the rest of the test!
The text was updated successfully, but these errors were encountered: