Offline members after signalling key shouldn't be in group #752
Conversation
There are 3 choices for what happens when a node fails during the reshare:
1 (current): there's a case where the group transitions to a new group that includes the failed node. The failed node never saves its share, so the resulting group is deficient.
2 (what I believe this change moves us to): the group transitions to a new group that excludes the failed node. Despite having fewer members, the threshold remains at its initial setting.
3: the reshare is cancelled and the original group is left intact.
(The downside is that one misbehaving node in the incoming group can prevent reshares, but if that node can be identified out of band and not added, it's not a permanent DoS.)
There's a case to be made that option 3 is what we'd most like to have happen.
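To make the difference concrete, here is a minimal sketch of option 2's semantics. This is not drand's actual API; `newGroupAfterReshare`, its parameters, and the string node IDs are all hypothetical illustration: failed nodes are dropped from the new group, the original threshold is kept, and the reshare is only viable if enough members remain.

```go
package main

import "fmt"

// newGroupAfterReshare is a hypothetical sketch (not drand's real code):
// under option 2 the new group keeps the original threshold but drops the
// nodes that failed during the reshare. If fewer members remain than the
// threshold, the resulting group could never sign, so we abort instead.
func newGroupAfterReshare(members []string, failed map[string]bool, threshold int) ([]string, error) {
	var out []string
	for _, m := range members {
		if !failed[m] {
			out = append(out, m)
		}
	}
	if len(out) < threshold {
		return nil, fmt.Errorf("only %d members left, threshold is %d", len(out), threshold)
	}
	return out, nil
}

func main() {
	members := []string{"A", "B", "C", "D"}
	failed := map[string]bool{"D": true} // D failed mid-reshare
	group, err := newGroupAfterReshare(members, failed, 3)
	fmt.Println(group, err) // [A B C] <nil>
}
```

The threshold check at the end is the "as long as we're good with the resulting threshold" caveat discussed below: option 2 degrades gracefully only while the surviving members still meet the threshold.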
@@ -347,6 +351,7 @@ func (d *DrandTest2) RunReshare(oldRun, newRun, newThr int, timeout time.Duratio
 	require.NoError(d.t, err)
 	_, err = client.InitReshare(leader.drand.priv.Public, secret, d.groupPath, force)
 	if err != nil {
+		fmt.Println("error in NON LEADER: ", err)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
printlns should be turned into logs or removed before merge :)
wait until I mark it as ready before reviewing !! :D
From a network management perspective, I also have a preference for (3). Nevertheless, (2) is better than (1) as it makes recovery easier.
The general goal (at least the one I have) is to make the sharing process as smooth as possible. Option 2 is what lets us simply not care about failing nodes: no need to restart or have everybody re-run things, we can just wait for the next reshare session (as long as we're good with the resulting threshold, of course). Option 3 forces us to isolate the bad actor, which means more manual operations and more manual digging through logs. Less smooth.
Also, while thinking about it, I'm not sure we can implement option 3 at all, as it requires a consensus. When node A tells all the other nodes "my connection with node B dropped", everybody is supposed to stop the DKG. But how do we trust node A ;) ?
Only before the leader starts the DKG (by sending the first packet) can it still stop the process. So if a node fails before the leader sends the final group, it's possible to stop; we can do that, it's fine. But as soon as the DKG has started we no longer trust the leader, so it's not possible.
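The rule above boils down to a simple state gate. A hypothetical sketch (the `session`/`reshareState` types and the phase names are mine, not drand's): cancellation is accepted only while the session is still in the setup phase, i.e. before the first DKG packet has gone out.

```go
package main

import (
	"errors"
	"fmt"
)

// reshareState models the two phases discussed above (hypothetical names).
type reshareState int

const (
	setupPhase reshareState = iota // collecting participants, nothing sent yet
	dkgRunning                     // first DKG packet has been sent
)

type session struct{ state reshareState }

// cancel succeeds only before the DKG starts: once it is running, no
// single node (leader included) is trusted to stop it unilaterally.
func (s *session) cancel() error {
	if s.state != setupPhase {
		return errors.New("DKG already started: cancellation would require consensus")
	}
	return nil
}

func main() {
	s := &session{state: setupPhase}
	fmt.Println(s.cancel()) // <nil>: still in setup, leader may abort
	s.state = dkgRunning
	fmt.Println(s.cancel()) // error: too late to abort unilaterally
}
```

This matches the split in the comment: the leader's abort authority ends exactly at the first DKG packet; after that, stopping would need the consensus that option 3 lacks.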
Appears to be breaking the
Mhhh it passes locally :|
I've softened the parameters of the test a bit for Travis