Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Yet more missing values with move-tablet #4575

Closed
aphyr opened this issue Jan 14, 2020 · 2 comments
Closed

Yet more missing values with move-tablet #4575

aphyr opened this issue Jan 14, 2020 · 2 comments
Assignees
Labels
area/testing/jepsen kind/bug status/accepted

Comments

@aphyr
Copy link

aphyr commented Jan 14, 2020

What version of Dgraph are you using?

1.1.1-56-ge18986f1c

Have you tried reproducing the issue with the latest release?

1.1.1-56-ge189861fc is yesterday's dev build; I haven't started with today's yet.

What is the hardware spec (RAM, OS)?

5-node LXC Jepsen cluster, on a 48-way Xeon, 128GB of ram.

Steps to reproduce the issue (command/config used to run Dgraph).

This appears to be an infrequent error. Spurious nulls were extremely common in 1.1.1 and subsequent development builds, but reproducing it with ge189 appears more difficult. With Jepsen b7c7bdc5f8476e009d591927060e7e8e786f015a, try:

lein run test --workload bank --nemesis move-tablet --time-limit 600 --test-count 20

Expected behaviour and actual result.

In this test run, Dgraph returned a single, transient nil value in response to a read of all accounts during what looks like a tablet move. I'm wondering if there could be a rare race condition here? Something that slips through before the move-ts checks kick in and prevent reads from executing on a group that doesn't host a tablet?

; The value of account 0 is ~$22; the total of all accounts is $100
{:type :ok, :f :read, :process 109, :time 323160578607, :value {0 22, 1 2, 2 14, 3 3, 4 11, 5 10, 6 32, 7 6}, :index 31059}
{:type :invoke, :f :read, :process 109, :time 323160720211, :index 31060}
{:type :fail, :f :transfer, :value {:from 4, :to 0, :amount 1}, :process 105, :time 323229577276, :error :conflict, :index 31061}
{:type :invoke, :f :transfer, :value {:from 7, :to 3, :amount 1}, :process 105, :time 323229862994, :index 31062}
{:type :fail, :f :transfer, :value {:from 7, :to 6, :amount 5}, :process 103, :time 323245905458, :error :conflict, :index 31063}
{:type :ok, :f :transfer, :value {:from 5, :to 1, :amount 2}, :process 106, :time 323246165119, :index 31064}
{:type :invoke, :f :transfer, :value {:from 7, :to 1, :amount 1}, :process 103, :time 323246360212, :index 31065}
{:type :invoke, :f :transfer, :value {:from 7, :to 4, :amount 3}, :process 106, :time 323246525060, :index 31066}
{:type :fail, :f :transfer, :value {:from 0, :to 3, :amount 3}, :process 104, :time 323364199257, :error :conflict, :index 31067}
{:type :invoke, :f :read, :process 104, :time 323364634847, :index 31068}
; We fail to observe a value for account 0; the observed total is only 78
{:type :ok, :f :read, :process 109, :time 323371946833, :value {0 nil, 1 2, 2 14, 3 3, 4 11, 5 10, 6 32, 7 6}, :message {:type :nil-balance, :nils {0 nil}, :span-id "SpanId{spanId=1c8e896d950daad6}", :trace-id "TraceId{traceId=cbd2f037e1e266644d3bb3964b5b43ee}"}, :error :checker-violation, :index 31069}
{:type :invoke, :f :read, :process 109, :time 323372719597, :index 31070}
{:type :fail, :f :read, :process 107, :time 323459240076, :error :tablet-not-served-by-instance, :index 31071}
{:type :invoke, :f :transfer, :value {:from 3, :to 5, :amount 1}, :process 107, :time 323459733773, :index 31072}
; Reads fail because the tablet isn't served by the instance we're talking to
{:type :fail, :f :read, :process 100, :time 323463996489, :error :tablet-not-served-by-instance, :index 31073}
{:type :invoke, :f :transfer, :value {:from 6, :to 0, :amount 3}, :process 100, :time 323464587264, :index 31074}
{:type :ok, :f :transfer, :value {:from 7, :to 3, :amount 1}, :process 105, :time 323521941137, :index 31075}
{:type :invoke, :f :read, :process 105, :time 323522250533, :index 31076}
{:type :fail, :f :transfer, :value {:from 7, :to 4, :amount 1}, :process 102, :time 323542895467, :error :conflict, :index 31077}
{:type :invoke, :f :read, :process 102, :time 323543269600, :index 31078}
{:type :fail, :f :transfer, :value {:from 3, :to 5, :amount 1}, :process 107, :time 323569064832, :error :conflict, :index 31079}
{:type :invoke, :f :transfer, :value {:from 5, :to 0, :amount 5}, :process 107, :time 323569537170, :index 31080}
; Dgraph prevents a transaction from executing on a tablet that's no longer held
{:type :info, :f :read, :process 108, :time 323641205824, :exception {:via [{:type io.grpc.StatusRuntimeException, :message "UNKNOWN: : StartTs: 17843 is from before MoveTs: 17853 for pred: \"amount_2\"", ...
@aphyr
Copy link
Author

aphyr commented Jan 16, 2020

I haven't been able to see this on 1.1.1-57-g4abba793a--maybe that helps us narrow down a potential cause?

@sleto-it sleto-it added kind/bug status/accepted labels Jan 17, 2020
@minhaj-shakeel
Copy link
Contributor

minhaj-shakeel commented Jul 20, 2020

Github issues have been deprecated.
This issue has been moved to discuss. You can follow the conversation there and also subscribe to updates by changing your notification preferences.

drawing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/testing/jepsen kind/bug status/accepted
Projects
None yet
Development

No branches or pull requests

5 participants