Offline members after signalling key shouldn't be in group #752
Conversation
There are 3 choices for what happens when a node fails during the reshare:
1 (current): there's a case where the group transitions to a new group that includes the failed node. The failed node never saves its share, so the resulting group is deficient.
2 (what I believe this change moves us to): the group transitions to a new group that excludes the failed node. Despite having fewer members, the threshold remains at its initial setting.
3: the reshare is cancelled and the original group is left intact.
(The downside is that one misbehaving node in the incoming group can prevent reshares, but if that node can be identified out of band and not added, it's not a permanent DoS.)
There's a case to be made that option 3 is what we'd most like to have happen.
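To make the difference concrete, here is a minimal sketch of option 2's semantics. This is not drand's actual API; `newGroupAfterReshare`, its parameters, and the string node IDs are all hypothetical illustration: failed nodes are dropped from the new group, the original threshold is kept, and the reshare is only viable if enough members remain.

```go
package main

import "fmt"

// newGroupAfterReshare is a hypothetical sketch (not drand's real code):
// under option 2 the new group keeps the original threshold but drops the
// nodes that failed during the reshare. If fewer members remain than the
// threshold, the resulting group could never sign, so we abort instead.
func newGroupAfterReshare(members []string, failed map[string]bool, threshold int) ([]string, error) {
	var out []string
	for _, m := range members {
		if !failed[m] {
			out = append(out, m)
		}
	}
	if len(out) < threshold {
		return nil, fmt.Errorf("only %d members left, threshold is %d", len(out), threshold)
	}
	return out, nil
}

func main() {
	members := []string{"A", "B", "C", "D"}
	failed := map[string]bool{"D": true} // D failed mid-reshare
	group, err := newGroupAfterReshare(members, failed, 3)
	fmt.Println(group, err) // [A B C] <nil>
}
```

The threshold check at the end is the "as long as we're good with the resulting threshold" caveat discussed below: option 2 degrades gracefully only while the surviving members still meet the threshold.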
@@ -347,6 +351,7 @@ func (d *DrandTest2) RunReshare(oldRun, newRun, newThr int, timeout time.Duratio
 	require.NoError(d.t, err)
 	_, err = client.InitReshare(leader.drand.priv.Public, secret, d.groupPath, force)
 	if err != nil {
+		fmt.Println("error in NON LEADER: ", err)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
printlns should be turned into logs or removed before merge :)
wait until I mark it as ready before reviewing !! :D
From a network management perspective, I also have a preference for (3). Nevertheless, (2) is better than (1) as it makes recovery easier.
The general goal (at least the one I have) is to make the sharing process as smooth as possible. Option 2 is what lets us simply not care about failing nodes: no need to restart or have everybody re-run things, we can just wait for the next reshare session (as long as we're good with the resulting threshold, of course). Option 3 forces us to isolate the bad actor, which means more manual operations and more manual digging through logs. Less smooth.
Also, while thinking about it, I'm not sure we can implement option 3 at all, as it requires a consensus. When node A tells all the other nodes "my connection with node B dropped", everybody is supposed to stop the DKG. But how do we trust node A ;) ?
Only before the leader starts the DKG (by sending the first packet) can it still stop the process. So if a node fails before the leader sends the final group, it's possible to stop; we can do that, it's fine. But as soon as the DKG has started we no longer trust the leader, so it's not possible.
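The rule above boils down to a simple state gate. A hypothetical sketch (the `session`/`reshareState` types and the phase names are mine, not drand's): cancellation is accepted only while the session is still in the setup phase, i.e. before the first DKG packet has gone out.

```go
package main

import (
	"errors"
	"fmt"
)

// reshareState models the two phases discussed above (hypothetical names).
type reshareState int

const (
	setupPhase reshareState = iota // collecting participants, nothing sent yet
	dkgRunning                     // first DKG packet has been sent
)

type session struct{ state reshareState }

// cancel succeeds only before the DKG starts: once it is running, no
// single node (leader included) is trusted to stop it unilaterally.
func (s *session) cancel() error {
	if s.state != setupPhase {
		return errors.New("DKG already started: cancellation would require consensus")
	}
	return nil
}

func main() {
	s := &session{state: setupPhase}
	fmt.Println(s.cancel()) // <nil>: still in setup, leader may abort
	s.state = dkgRunning
	fmt.Println(s.cancel()) // error: too late to abort unilaterally
}
```

This matches the split in the comment: the leader's abort authority ends exactly at the first DKG packet; after that, stopping would need the consensus that option 3 lacks.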
Appears to be breaking the
Mhhh it passes locally :|
I've softened the parameters of the test a bit for Travis