Don't persist the network messages and their acknowledgements #1417

v0d1ch · 2024-05-07T14:50:23Z

Why

We have now seen a couple of times that resending lost/missed messages is not robust: the Head still got stuck, and it was not clear which values we would need to set for the acks counter in order to fix the problem and resend the lost messages.

The initial idea of having the reliability layer store the index of sent/seen messages and replay them does not seem to work well, because of a couple of factors:

~~The node can crash after sending a message but before storing it on disk~~
The node can crash after storing the message on disk but before putting it on the queue for processing

On top of that, the saved acks do not correspond directly to the stored network messages: we store them separately, and this is not an atomic process, which can lead to failures.
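The non-atomic write described above can be illustrated with a small model. This is only a sketch under assumptions: the `Disk` record, `CrashPoint` and the consistency rule (our own ack counter equals the number of stored messages) are hypothetical simplifications, not the hydra-node implementation.

```haskell
-- Hypothetical, simplified model of the two separately persisted pieces of
-- state. None of these names come from hydra-node.
data Disk = Disk
  { storedMessages :: [String] -- messages appended to the message log
  , storedAcks :: [Int] -- last persisted ack counters, one per party
  }
  deriving (Eq, Show)

data CrashPoint = NoCrash | CrashBetweenWrites
  deriving (Eq, Show)

-- Persist a sent message and the updated ack counters as two separate writes.
-- A crash between the writes leaves the message stored but the acks stale.
persistSend :: CrashPoint -> String -> [Int] -> Disk -> Disk
persistSend crash msg newAcks disk =
  let afterFirstWrite = disk{storedMessages = storedMessages disk ++ [msg]}
   in case crash of
        CrashBetweenWrites -> afterFirstWrite
        NoCrash -> afterFirstWrite{storedAcks = newAcks}

-- Assumed invariant: the counter at our own index equals the number of
-- messages we have stored.
consistent :: Int -> Disk -> Bool
consistent ownIndex disk =
  storedAcks disk !! ownIndex == length (storedMessages disk)
```

In this model, a restart after `CrashBetweenWrites` loads a message log and an ack vector that disagree, which is the kind of state that is hard to repair by hand.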

In general, storing the network messages and acks does not seem beneficial enough to justify it, so I propose removing the persistence completely and keeping them in memory as before.

By doing this we rely on the other nodes not crashing at the same time, so that a node that crashed can catch up by re-receiving the lost network messages from the other node(s).

There are a couple of issues related to the implementation of the reliability layer of hydra-node, like #1079, and a bug item that could be related: #1202

What

Remove the persistence handle from the networking part of hydra-node.

How

Completely remove the MessagePersistence argument of the withNetwork function, which eliminates storing messages to and reading them from disk.
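To make the proposal concrete, here is a minimal sketch of what keeping messages and acks in memory could look like. The `MessageStore` record, its field names and `newInMemoryStore` are hypothetical stand-ins chosen for illustration; they are not the actual `MessagePersistence` or `withNetwork` definitions from hydra-node.

```haskell
import Data.IORef (modifyIORef', newIORef, readIORef, writeIORef)

-- Hypothetical stand-in for a handle that stores acks and sent messages.
-- Field names are illustrative only.
data MessageStore msg = MessageStore
  { loadAcks :: IO [Int]
  , saveAcks :: [Int] -> IO ()
  , loadMessages :: IO [msg]
  , appendMessage :: msg -> IO ()
  }

-- In-memory variant: all state lives in IORefs, so it starts empty on every
-- restart and nothing is read from or written to disk.
newInMemoryStore :: Int -> IO (MessageStore msg)
newInMemoryStore numberOfParties = do
  acksRef <- newIORef (replicate numberOfParties 0)
  messagesRef <- newIORef []
  pure
    MessageStore
      { loadAcks = readIORef acksRef
      , saveAcks = writeIORef acksRef
      , -- Messages are consed onto the front, so reverse to get send order.
        loadMessages = reverse <$> readIORef messagesRef
      , appendMessage = \msg -> modifyIORef' messagesRef (msg :)
      }
```

With this shape, a restarted node always begins with zeroed counters and an empty log, so there is no stale on-disk state that can disagree with what the peers hold.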


ch1bo · 2024-05-07T15:44:52Z

I think we should re-open #1079 and maybe reconsider the persisted queue too, to mitigate messages that were transmitted but not handled in the head logic (i.e. on the receiving side).

v0d1ch · 2024-05-08T07:42:14Z

You mean also adding the option to persist whatever was in the queue at a certain point in time? At which point would we persist the data?

ffakenz · 2024-05-16T11:49:51Z

If you remove the network messages, then you cannot re-transmit messages anymore.

On the other hand, removing the acknowledgements would result in lots of messages over the network. However, it might work and lead to a head getting unstuck.
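The trade-off in this comment can be sketched as a single function. This is a hypothetical simplification, not code from hydra-node: `toResend` and its arguments are made up for illustration, and the sender's log is assumed to be ordered by send time.

```haskell
-- Messages a sender still has to retransmit to one peer.
-- `Just n` means the peer has acknowledged the first n messages of the log;
-- `Nothing` models a sender that keeps no ack information at all.
toResend :: Maybe Int -> [msg] -> [msg]
toResend (Just acked) sentLog = drop acked sentLog
toResend Nothing sentLog = sentLog -- without acks, everything is resent
```

Without a stored log there is nothing to pass as `sentLog`, and without acks the whole log goes over the network again, which matches the two cases described above.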

locallycompact · 2024-05-16T11:52:46Z

Why do we need to persist messages at all? Why can we not do what the cardano-node is doing? A cardano-node only needs to ask whoever is available whether they have a longer chain. It does not rely on people storing outbound messages. Why should our situation be different?

ch1bo · 2024-05-23T10:30:22Z

> A cardano-node only needs to ask whoever is available whether they have a longer chain.

This is a good idea and something that was considered in the past (as mentioned in this ADR; there are other PRs and logbook items we could dig up). It was called a "pull-based approach", which I think is the essence of your comment: have the network participants pull data from each other rather than sending it.

However, that approach will ultimately lead to similar questions about persistence. How much data would we keep on the other end so that network participants can pull it?

locallycompact · 2024-05-23T11:17:36Z

Nothing on the network layer. You only ask another node what it believes to be the case regarding the actual chain state, and then you verify it. That state is the chain (history of snapshots) and the working area (signatures from unconfirmed snapshots).
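A rough sketch of the "ask and verify" idea, under stated assumptions: `PeerState`, `isVerified` and `catchUp` are hypothetical names, and counting signatures is only a placeholder for real multi-signature verification of a snapshot.

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))

-- Hypothetical summary of what a peer reports about the head state.
data PeerState = PeerState
  { snapshotNumber :: Int -- latest snapshot the peer claims is confirmed
  , signatures :: Int -- number of valid signatures (placeholder check)
  }
  deriving (Eq, Show)

-- Placeholder verification: a snapshot counts as verified when all parties
-- have signed it.
isVerified :: Int -> PeerState -> Bool
isVerified parties st = signatures st == parties

-- Pick the most advanced reported state that verifies and is ahead of ours.
catchUp :: Int -> Int -> [PeerState] -> Maybe PeerState
catchUp parties ownSnapshot reported =
  case sortOn (Down . snapshotNumber) (filter ok reported) of
    best : _ -> Just best
    [] -> Nothing
 where
  ok st = isVerified parties st && snapshotNumber st > ownSnapshot
```

For example, `catchUp 3 5 [PeerState 7 3, PeerState 9 2, PeerState 4 3]` selects the snapshot 7 report: snapshot 9 is ignored because it does not verify, and snapshot 4 is behind the local state.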

ch1bo · 2024-07-22T15:47:20Z

We just stumbled over the acks being persisted again today. In this case someone had wiped their acks persistence, but someone else did not (and/or the --peer configuration also changed).

We concluded that having the acks persisted is very fragile and it's unclear when and when not to reset that piece of state!

v0d1ch added the 💭 idea label May 7, 2024

v0d1ch self-assigned this May 7, 2024

locallycompact mentioned this issue May 16, 2024

Stress test the network reliability #1436

Open

2 tasks

locallycompact added the 💬 feature and green 💚 labels and removed the 💭 idea label May 27, 2024

ch1bo mentioned this issue Jul 23, 2024

NOT MERGE: Less reliable persistence prototype #1495

Closed

4 tasks
