
Restarting a node with a different number of peers prevents it from connecting to the cluster #1174

Closed
ch1bo opened this issue Nov 22, 2023 · 1 comment · Fixed by #1179
ch1bo (Member) commented Nov 22, 2023

Context & versions

All recent versions, seen on 9abb099

Analyses

Misbehaviour

This bug was observed through the following sequence of operations:

  1. A group of 5 people share keys and network addresses to form a Head among themselves
  2. One of the parties (alice) inadvertently misconfigures their node by "forgetting" another party's (bob) configuration
  3. The nodes connect to each other and start sending Ping notifications
  4. alice's node is not seen by bob's
  5. alice stops her node, reconfigures it, then restarts it
  6. Now alice is not seen by any other party's node

Troubleshooting

Looking at their logs, alice's peers notice the following message appearing repeatedly:

{"timestamp":"2023-11-22T09:34:57.526354646Z","threadId":3675,"namespace":"HydraNode-\" bob\"","message":{"reliability":{"fromParty":{"vkey":"0f23a3124d401d89e8f6cfb724eb00ac073c74edc93a4db9b0889c999f2415fe"},"numberOfParties":5,"partyAcks":[0,0,0,0],"tag":"ReceivedMalformedAcks"},"tag":"Reliability"}}

This means that alice is still sending Reliability layer messages as if she only had 3 peers. Investigating the issue further, we realised this was caused by the acknowledgment persistence mechanism put in place as part of #1101:

  • In its first run, alice's node saves the acknowledged messages' vector for 4 parties
  • In the second run, alice's node loads the saved vector, which still has length 4, although the number of parties should now be 5 (see the sketch below)
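For illustration, here is a minimal sketch of the receiver-side check that produces this log. The type and function names are illustrative stand-ins, not the actual hydra-node code:

```haskell
import Data.Vector (Vector)
import qualified Data.Vector as V

-- Illustrative stand-in for the node's reliability-layer log type;
-- NOT the actual hydra-node definition.
data ReliabilityLog = ReceivedMalformedAcks
  { fromParty :: String
  , partyAcks :: Vector Int
  , numberOfParties :: Int
  }
  deriving (Show)

-- Receiver-side check: a message whose acknowledgement vector does not
-- have one entry per configured party is logged and dropped. After her
-- restart, alice keeps sending 4-element vectors into a 5-party network,
-- so every peer rejects her messages with ReceivedMalformedAcks.
checkMessageAcks :: Int -> String -> Vector Int -> Either ReliabilityLog (Vector Int)
checkMessageAcks nParties sender acks
  | V.length acks /= nParties =
      Left ReceivedMalformedAcks
        { fromParty = sender
        , partyAcks = acks
        , numberOfParties = nParties
        }
  | otherwise = Right acks
```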

Expected behaviour

The user should be warned that there is an inconsistency between the currently configured network peers and the saved state. It should probably not be possible to start a node in such a situation unknowingly: the mismatch could stem not only from a misconfiguration, but also from an unsuspecting party forming a new head with a different configuration while inadvertently reusing persisted state from a previous run.
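As a rough idea of what such a startup guard could look like (a sketch with hypothetical names, not necessarily the actual fix in #1179):

```haskell
import Control.Exception (Exception, throwIO)
import Data.Vector (Vector)
import qualified Data.Vector as V

-- Hypothetical error type for the inconsistency described above.
newtype InconsistentPersistedState = InconsistentPersistedState String
  deriving (Show)

instance Exception InconsistentPersistedState

-- On startup, compare any persisted acknowledgement vector against the
-- configured number of parties and refuse to start on a mismatch,
-- instead of silently reusing stale state.
loadAcks :: Int -> Maybe (Vector Int) -> IO (Vector Int)
loadAcks nParties persisted =
  case persisted of
    Nothing -> pure (V.replicate nParties 0) -- fresh start: all-zero acks
    Just acks
      | V.length acks == nParties -> pure acks -- consistent resume
      | otherwise ->
          throwIO . InconsistentPersistedState $
            "Persisted acks cover " <> show (V.length acks)
              <> " parties, but the node is configured with "
              <> show nParties
```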

In general, this issue highlights the need for a better strategy on how the hydra-node persists its state and what the user can do about it.

@ch1bo ch1bo added the bug 🐛 Something isn't working label Nov 22, 2023
@ffakenz ffakenz self-assigned this Nov 22, 2023
@ffakenz ffakenz added this to the 0.14.0 milestone Nov 22, 2023
ffakenz (Contributor) commented Nov 23, 2023

This problem was reproduced by running a node that is missing the --hydra-verification-key and --cardano-verification-key for one of its peers.

In the code we check that the number of hydra-verification-keys and cardano-verification-keys matches, but we do not check that they also match the number of configured --peer entries.

So here we have 2 different things going on:

  1. we should verify that the numbers of peers, hydra-verification-keys and cardano-verification-keys all match (see the sketch below)
  2. when we restart the network, we need to check that its configuration is consistent with the persisted acks
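A minimal sketch of the first check, assuming a hypothetical RunOptions record rather than the actual hydra-node options type:

```haskell
-- Illustrative option record; the actual hydra-node RunOptions differs.
data RunOptions = RunOptions
  { peers :: [String]                     -- --peer (one per remote party)
  , hydraVerificationKeys :: [FilePath]   -- --hydra-verification-key
  , cardanoVerificationKeys :: [FilePath] -- --cardano-verification-key
  }

-- Besides checking the two key lists against each other (the existing
-- check), also check them against the --peer list.
validatePeerConfiguration :: RunOptions -> Either String RunOptions
validatePeerConfiguration opts
  | nHydra /= nCardano =
      Left "--hydra-verification-key and --cardano-verification-key counts differ"
  | nPeers /= nHydra =
      Left "--peer count does not match the number of verification keys"
  | otherwise = Right opts
 where
  nPeers = length (peers opts)
  nHydra = length (hydraVerificationKeys opts)
  nCardano = length (cardanoVerificationKeys opts)
```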

@ghost ghost changed the title from "Network produces MalformedAcks even after fixing --peer list" to "Restarting a node with a different number of peers prevents it from connecting to the cluster" Nov 24, 2023
@ghost ghost closed this as completed in #1179 Nov 28, 2023