Inserts can be lost during a predicate move #2338
With server-side ordering, @upsert schemas, no crashes or network faults, roughly 10 inserts/sec, and no updates or deletes, Dgraph can occasionally (once every five hours or so) lose successfully inserted records: 20180412T161038.000-0500.zip
The lost records occur during a predicate move, which suggests this issue might be related to #2321. This occurs with
and can be reproduced with Jepsen 23329ead4c4e3d8352234658026d09792f15c406 via
You can now reproduce this problem significantly faster in Jepsen eb796cfcc204c592545965968bd28ad1e6b2eff0 by using the move-tablet nemesis, which shuffles tablets around every 15 seconds or so.
With predicate moves, we can get Dgraph to lose 99% of acknowledged inserts in 60 seconds:
- The reason for bug #2338 was a race condition between a mutation and a predicate move. Zero was not checking whether a predicate was being moved before allowing a commit. Thus, a mutation could be proposed in a group, a move could then start, and Zero could commit the mutation after the move had started.
- This change fixes the issue by ensuring that Zero checks whether a predicate is being moved before allowing a commit.
- Any pending transactions are also cancelled once the move starts, so this would only happen as part of a race condition and not afterward.

Mechanism:
- Send the real keys back to Zero as part of the Transaction Context.
- Zero uses these keys to parse the predicate and checks whether that predicate is currently moving. If so, it aborts the transaction.
- Also check whether `_predicate_` is being moved. For some reason, if we don't consider this predicate, we could still lose data.
- Before doing a mutation in Dgraph alpha, check whether that tablet can be written to.
- Loop until all transactions corresponding to the predicate move are aborted. Only then start the move.

Tangential changes:
- Update the port number for the bank integration test.
- Remove the separate key value or clean channel. Make it run as part of the main Node.Run loop.
- Add a max function.
- Small refactoring here and there.
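The Zero-side commit check described under "Mechanism" can be sketched roughly as follows. This is a minimal illustration, not Dgraph's actual code: `moveTracker`, `startMove`, `finishMove`, `predicateOf`, `checkCommit`, and `ErrPredicateMoving` are hypothetical names, and the `predicate|uid` key format is a simplifying assumption (real Dgraph keys are binary-encoded).

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// ErrPredicateMoving is a hypothetical sentinel error for this sketch.
var ErrPredicateMoving = errors.New("predicate is being moved")

// internalPredicate is the `_predicate_` entry that the fix also checks.
const internalPredicate = "_predicate_"

// moveTracker is a hypothetical stand-in for Zero's record of in-flight moves.
type moveTracker struct {
	mu     sync.Mutex
	moving map[string]bool
}

func newMoveTracker() *moveTracker {
	return &moveTracker{moving: make(map[string]bool)}
}

// startMove marks a predicate as being moved.
func (m *moveTracker) startMove(pred string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.moving[pred] = true
}

// finishMove clears the moving mark once the move completes.
func (m *moveTracker) finishMove(pred string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.moving, pred)
}

// predicateOf extracts the predicate from a key.
// Assumption: keys look like "predicate|uid" in this sketch.
func predicateOf(key string) string {
	pred, _, _ := strings.Cut(key, "|")
	return pred
}

// checkCommit rejects a commit if `_predicate_` or any predicate
// touched by the transaction's keys is currently being moved.
func (m *moveTracker) checkCommit(keys []string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.moving[internalPredicate] {
		return fmt.Errorf("%w: %s", ErrPredicateMoving, internalPredicate)
	}
	for _, key := range keys {
		if pred := predicateOf(key); m.moving[pred] {
			return fmt.Errorf("%w: %s", ErrPredicateMoving, pred)
		}
	}
	return nil
}

func main() {
	zero := newMoveTracker()
	keys := []string{"name|0x1", "age|0x1"}

	fmt.Println("before move:", zero.checkCommit(keys))
	zero.startMove("age")
	fmt.Println("during move:", zero.checkCommit(keys))
	zero.finishMove("age")
	fmt.Println("after move:", zero.checkCommit(keys))
}
```

In this sketch the commit check and the move bookkeeping share one mutex, so a commit cannot pass the check while a move is being marked as started. The sketch covers only the commit-time check; cancelling pending transactions before the move begins is left out.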