stability: cyan crashing after on disk/memory state diverged #10723
details for all nodes, showing fatal line (includes affected key) and startup line including sha.
btw, I have paused continuous deployment while we look at this. We already went through three upgrades since the problem started manifesting.
silenced alerts on cyan for 6h.
Here's one of the failures on
Here are some entries from the raft log (ignoring all the "regular" commands and just focusing on leader and lease stuff):
In the assertion, the "on-disk" values are from the lease at index 434494, and the "in-memory" values are from the lease at index 434502 (434504 is the last entry in the log at this point, and has apparently not been processed yet). Note that this is an extension; the start time and replica descriptor are the same. The proposed timestamps on the leases translate to Wed Nov 16 04:00:31 2016, just before this node was restarted with 981b8aa. So it looks like these leases were proposed just before the restart, and then the latter was applied after it. This suggests that #10681 is handling commands that were proposed by the earlier version (which called evaluateProposalTwice) differently than before. I think a brand-new cluster would be fine, as would a cluster upgrading from beta-20161103, so this would only be an issue for clusters upgrading from 1110 or unreleased master builds. I'm still re-reviewing #10681 to figure out what might have changed, though.
I think I see it: This line before my PR sets a non-nil but empty WriteBatch, but this line after my PR prefers the WriteBatch that was serialized into the command over the one we just generated if it is non-nil. So there should be a quick fix to just change the tag numbers, since we're not yet using these fields for anything real. My upcoming cleanup PR to break up the ProposalData struct should make this logic less fragile.
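To make the nil-versus-empty pitfall described above concrete, here is a minimal Go sketch. All type and function names (`WriteBatch`, `Command`, `pickFragile`, `pickDefensive`) are hypothetical stand-ins and not the actual CockroachDB code. It only illustrates how "prefer the serialized batch if it is non-nil" can select an empty batch written by an older version. The fix chosen in this thread was to renumber the protobuf tags; the defensive check below is shown purely for contrast.

```go
package main

import "fmt"

// WriteBatch is a hypothetical stand-in for a serialized batch of writes.
type WriteBatch struct {
	Data []byte
}

// Command is a hypothetical stand-in for a proposed Raft command that may
// carry a pre-evaluated write batch.
type Command struct {
	WriteBatch *WriteBatch
}

// pickFragile prefers the batch carried in the command whenever the pointer
// is non-nil, even if that batch is empty.
func pickFragile(cmd Command, local *WriteBatch) *WriteBatch {
	if cmd.WriteBatch != nil {
		return cmd.WriteBatch
	}
	return local
}

// pickDefensive treats a nil or empty serialized batch as absent and falls
// back to the locally generated one.
func pickDefensive(cmd Command, local *WriteBatch) *WriteBatch {
	if cmd.WriteBatch != nil && len(cmd.WriteBatch.Data) > 0 {
		return cmd.WriteBatch
	}
	return local
}

func main() {
	// A command as an older version might have serialized it:
	// the field is set, but the batch is empty.
	old := Command{WriteBatch: &WriteBatch{}}
	// The batch the current node just generated by evaluating the command.
	local := &WriteBatch{Data: []byte("lease-update")}

	fmt.Printf("fragile picks:   %q\n", pickFragile(old, local).Data)
	fmt.Printf("defensive picks: %q\n", pickDefensive(old, local).Data)
}
```

In this sketch the fragile selection silently applies an empty batch, which is one way in-memory state could be updated while the on-disk state is not. Renumbering the fields avoids the problem differently: data written under the old tag numbers is never read back into the new fields.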
Older versions were writing non-nil but empty values into these fields, leading to problems on upgrade. Since there is no non-experimental use of these fields yet, renumber them to discard all old data. Fixes cockroachdb#10723
15 minutes and everything looks good.
Cyan node
and keeps going for a while. attaching full log:
@petermattis's proposal is to revert cyan back to 85f80d9 and see if it stabilizes, then try latest master again.
Cyan wiped and restarted with 85f80d9. Will run block_writer against it for a little bit, then push the latest master.
Cyan is wedged with the same issue as delta: #10733
To be clear, it got stuck while still running 85f80d9 and was never upgraded to master, right?
That's correct, it was started with 85f80d9 and was never upgraded. On Thu, Nov 17, 2016 at 3:57 AM, Ben Darnell notifications@github.com
Cyan has been upgraded and we haven't seen this issue recur, so it looks like we can close it.
cyan nodes started crashing after an upgrade to 981b8aa at 20161116-04:00 UTC.
The previous build on cyan was 85f80d9 at 20161115-23:00 UTC.
Nodes crashed within a minute with: