Offline members after signalling key shouldn't be in group #752

Merged
merged 8 commits on Sep 29, 2020

Conversation

@nikkolasg (Collaborator)

No description provided.

@willscott (Member) left a comment

There are three choices for what happens when a node fails during the reshare:
1 (current): the group transitions to a new group that includes the failed node. The failed node never saves its share, so the resulting group is deficient.
2 (what I believe this change moves us to): the group transitions to a new group that excludes the failed node. It has fewer members, but the threshold remains at its initial setting.
3: the reshare cancels and the original group is left intact.
(The downside is that one misbehaving node in the incoming group can prevent reshares, but if that node can be identified and left out out-of-band, it's not a permanent DoS.)

There's a case to be made that option 3 is what we'd most like to have happen.
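
For concreteness, here is a rough Go sketch of the three policies as an enum; the type and names are purely illustrative and are not drand's API:

```go
// Illustrative only: the three failure-handling policies discussed above.
package main

import "fmt"

type ResharePolicy int

const (
	// Option 1 (current): the failed node stays in the new group but never
	// saves its share, so the resulting group is deficient.
	KeepFailedNode ResharePolicy = iota
	// Option 2 (this change): the new group excludes the failed node;
	// membership shrinks but the threshold keeps its initial value.
	DropFailedNode
	// Option 3: the reshare aborts and the original group stays intact.
	CancelReshare
)

func (p ResharePolicy) String() string {
	return [...]string{"keep-failed-node", "drop-failed-node", "cancel-reshare"}[p]
}

func main() {
	for _, p := range []ResharePolicy{KeepFailedNode, DropFailedNode, CancelReshare} {
		fmt.Println(p)
	}
}
```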

@@ -347,6 +351,7 @@ func (d *DrandTest2) RunReshare(oldRun, newRun, newThr int, timeout time.Duratio
	require.NoError(d.t, err)
	_, err = client.InitReshare(leader.drand.priv.Public, secret, d.groupPath, force)
	if err != nil {
		fmt.Println("error in NON LEADER: ", err)
@willscott (Member)

printlns should be turned into logs or removed before merge :)
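
For illustration, a minimal sketch of the kind of swap being asked for, using the standard-library logger rather than drand's own logger type (whose exact API isn't shown here):

```go
// Illustration only: replacing a stray fmt.Println with a log call.
package main

import (
	"errors"
	"log"
)

func main() {
	err := errors.New("simulated InitReshare failure")
	if err != nil {
		// Instead of fmt.Println("error in NON LEADER: ", err):
		log.Printf("non-leader reshare error: %v", err)
	}
}
```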

@nikkolasg (Collaborator, Author)

wait until I mark it as ready before reviewing !! :D

@jsoares (Member) commented Sep 24, 2020

From a network management perspective, I also have a preference for (3). Nevertheless, (2) is better than (1) as it makes recovery easier.

@nikkolasg (Collaborator, Author)

The general goal (at least the one I have) is to make the sharing process as smooth as possible. Option 2 is what lets us simply not care about failing nodes: no need to restart or to have everybody re-run things; we can just wait for the next reshare session (as long as we're good with the resulting threshold, of course).

Option 3 forces us to isolate the bad node, which means more manual operations and more manual digging through logs. Less smooth.
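
A minimal sketch (not drand code) of the arithmetic behind "as long as we're good with the resulting threshold": after dropping failed members, the reshare is only acceptable if enough nodes remain to reach the threshold.

```go
// Illustrative viability check after excluding failed members.
package main

import "fmt"

// viableAfterDrop reports whether a group of `size` members with a signing
// `threshold` can still operate once `failed` members are excluded.
func viableAfterDrop(size, threshold, failed int) bool {
	return size-failed >= threshold
}

func main() {
	fmt.Println(viableAfterDrop(10, 6, 2)) // true: 8 remaining >= 6
	fmt.Println(viableAfterDrop(10, 6, 5)) // false: 5 remaining < 6
}
```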

@nikkolasg (Collaborator, Author)

Also, while thinking about it, I'm not sure we can implement option 3 at all, as that requires consensus. When node A tells all the other nodes "my connection with node B dropped", everybody is supposed to stop the DKG. But how do we trust node A? ;)
And if we rely on "I only stop the DKG if my connection with node B also dropped", then we run into potential weaknesses/bugs where some nodes see the connection down and some don't. This kind of thing requires consensus.
This is why the DKG is "threshold resistant": it tolerates some nodes misbehaving AND some being offline; that's the whole point :)

@nikkolasg (Collaborator, Author)

Only before the leader starts the DKG (by sending the first packet) can it still stop the process. So if a node fails before the leader sends the final group, it's possible to stop, and we can do that, that's fine. But as soon as the DKG has started, we're not trusting the leader anymore, so it's not possible.
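
A hypothetical sketch of that abort window; all names here are illustrative, not drand's code:

```go
// Illustrative only: the leader can still cancel while collecting keys,
// but not once the final group is out and the DKG has started.
package main

import (
	"errors"
	"fmt"
)

type leaderState int

const (
	collectingKeys leaderState = iota // before the leader sends the final group
	dkgStarted                        // DKG packets flowing; leader no longer trusted to stop it
)

// abortReshare cancels the ceremony only if the DKG has not started yet.
func abortReshare(s leaderState) error {
	if s == dkgStarted {
		return errors.New("DKG already started: cannot abort, fall back to threshold resilience")
	}
	return nil
}

func main() {
	fmt.Println(abortReshare(collectingKeys)) // <nil>: abort is allowed
	fmt.Println(abortReshare(dkgStarted))     // too late to abort
}
```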

@nikkolasg marked this pull request as ready for review on September 28, 2020, 19:57
@willscott (Member)

Appears to be breaking the TestDrandDKGBroadcastDeny - core test:

ERRROR:  rpc error: code = Unknown desc = drand: err during DKG: drand: error from dkg: node can only process justifications after processing responses

@nikkolasg (Collaborator, Author)

Mhhh it passes locally :|

@nikkolasg (Collaborator, Author)

I've softened the test parameters a bit for Travis.
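
For illustration only (not necessarily the actual change in this PR): a common way to soften test parameters for CI is to widen the timeout when a CI environment variable is set, since shared CI machines are slower than a dev laptop.

```go
// Illustrative only: pick a more generous timeout when running under CI.
package main

import (
	"fmt"
	"os"
	"time"
)

func reshareTestTimeout() time.Duration {
	if os.Getenv("CI") != "" { // Travis sets CI=true
		return 2 * time.Minute
	}
	return 30 * time.Second
}

func main() {
	fmt.Println(reshareTestTimeout())
}
```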
