Group checksum mismatch and pending mutation blocks read-only queries indefinitely #5368
Comments
Right now the code is explicitly blocking until the checksums match in two places. 1 In
I am still not entirely sure why uncommitted transactions cause the checksums to change, but I suppose the blocking makes sense for normal queries. I don't think read-only queries should get blocked, though. Exempting them could be the fix, and it would be a very easy one. I'll discuss this with Manish.
Found the root cause, but I still don't know what the proper fix is. The bug is in this part of the
The latest delta gets appended to the delta that was previously received from the channel (my guess is that this was done to reduce the number of proposals). However, the GroupChecksum is lost at this step, so the queries get stuck waiting for the checksums to match. Adding another tablet via the mutation in step 3 unblocks the process. If I can safely apply the mutations in the order they are received from the stream, then the fix is simply to overwrite the group checksums. I am not sure whether this is the case, so I'll keep looking.
The function processing the oracle delta stream combines multiple deltas into one to reduce the number of proposals. However, the group checksums were not being updated during this merge, so some group checksum updates were lost. gRPC guarantees the ordering of the deltas coming from the stream, so overwriting the group checksums is safe. Tested by following the steps in #5368. Fixes #5368
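The merge described above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the actual Dgraph code: `OracleDelta`, `TxnStatus`, and `mergeDelta` are simplified, hypothetical stand-ins for the real types and function. It shows the accumulated delta taking the newer group checksums instead of silently keeping the stale ones.

```go
package main

import "fmt"

// TxnStatus and OracleDelta are simplified, hypothetical stand-ins
// for the messages carried on the oracle delta stream.
type TxnStatus struct {
	StartTs  uint64
	CommitTs uint64
}

type OracleDelta struct {
	Txns           []TxnStatus
	MaxAssigned    uint64
	GroupChecksums map[uint32]uint64 // group ID -> checksum
}

// mergeDelta folds next into acc so that several stream messages can be
// proposed as one. Because the stream delivers deltas in order, the
// checksums in next are at least as recent as those in acc, so they
// overwrite the existing entries.
func mergeDelta(acc, next *OracleDelta) {
	acc.Txns = append(acc.Txns, next.Txns...)
	if next.MaxAssigned > acc.MaxAssigned {
		acc.MaxAssigned = next.MaxAssigned
	}
	if acc.GroupChecksums == nil {
		acc.GroupChecksums = make(map[uint32]uint64)
	}
	// Omitting this loop is the failure mode described in the issue:
	// the newer checksums would be dropped during the merge.
	for gid, sum := range next.GroupChecksums {
		acc.GroupChecksums[gid] = sum
	}
}

func main() {
	acc := &OracleDelta{
		MaxAssigned:    10,
		GroupChecksums: map[uint32]uint64{1: 0xAAAA},
	}
	next := &OracleDelta{
		Txns:           []TxnStatus{{StartTs: 11, CommitTs: 12}},
		MaxAssigned:    12,
		GroupChecksums: map[uint32]uint64{1: 0xBBBB},
	}
	mergeDelta(acc, next)
	fmt.Printf("txns=%d maxAssigned=%d checksum[1]=%#x\n",
		len(acc.Txns), acc.MaxAssigned, acc.GroupChecksums[1])
}
```

Overwriting per group ID, rather than replacing the whole map, keeps checksums for groups that the newer delta does not mention. Whether the real fix does the same is not stated in this thread.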
What version of Dgraph are you using?
v20.03.1
Have you tried reproducing the issue with the latest release?
Yes
What is the hardware spec (RAM, OS)?
Ubuntu Linux
Steps to reproduce the issue (command/config used to run Dgraph).
Create a 3 Alpha replica cluster, run a whole series of read-only queries. At the same time, open a new transaction to send a mutation that writes a new predicate. The new predicate changes the group checksum, and the read-only queries fail to respond.
These are the steps to reproduce (and here's an asciinema recording):
1. Start a cluster with 3 Alpha replicas.
2. Run dgraph increment to create a predicate, and then run many read-only queries as quickly as possible (no --wait flag, or --wait=0.1s should work too).
3. Send a request to /mutate to open a new txn without committing it.
4. Repeat Step 3 until the read-only queries in Step 2 get blocked.
In Jaeger/zPages, you'll see a trace error for api.Dgraph.Query with the error message "Group checksum mismatch for id: 1": http://localhost:8180/z/tracez?zspanname=api.Dgraph.Query&ztype=2&zsubtype=0
Eventually, when the open transaction gets aborted, the queries become unblocked. By default, open transactions are aborted after 5 minutes of inactivity.
Expected behaviour and actual result.
Queries should not get blocked by a pending transaction.