Side load of fully-signed snapshot #1858

@noonio

Todo:

  • Refine this description and potential solutions
  • Define a scenario that we can use to test solutions
    • [ ] Need to find out how to construct these "diverging views" and how to resolve them (pumba sets? Maybe, if any still fail after raft!)
  • Understand why this is a better solution than Clear pending transactions API command #1284 (i.e. doesn't fall into the same trap)

Description

Processing transactions in a Hydra head requires all nodes to agree on them. Each node validates transactions (on the NewTx command) against its local view of the ledger state, using the passed --ledger-protocol-parameters. As transactions can be valid or invalid depending on configuration (or, to some extent, on the exact build version of hydra-node), it is possible that one node accepts a transaction while its peers do not.

Currently, this means that the node which accepted the transaction now has a different local state than the other nodes and might try to spend outputs that the other nodes do not see as available. For example, when using hydraw, the node would use outputs introduced by the previous pixel paint transaction, but the other nodes would deem any new transaction invalid with a BadInputs error.
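
The divergence described above can be reproduced with a toy model. This is only a sketch: the `Node` and `Tx` classes, the `TxTooLarge` label and the concrete size limits are hypothetical and do not reflect the hydra-node implementation. Only the `BadInputs` outcome is taken from the description.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tx:
    tx_id: str
    inputs: frozenset   # output references consumed
    outputs: frozenset  # output references produced
    size: int           # serialized size in bytes (illustrative)


@dataclass
class Node:
    name: str
    max_tx_size: int  # stands in for one field of --ledger-protocol-parameters
    utxo: set = field(default_factory=set)

    def new_tx(self, tx):
        # Validate against this node's own parameters and local ledger view.
        if tx.size > self.max_tx_size:
            return "TxTooLarge"  # hypothetical error label
        if not tx.inputs <= self.utxo:
            return "BadInputs"
        self.utxo = (self.utxo - tx.inputs) | tx.outputs
        return "Valid"


# Same starting ledger, different configuration.
alice = Node("alice", max_tx_size=16384, utxo={"o0"})
bob = Node("bob", max_tx_size=4096, utxo={"o0"})

tx1 = Tx("tx1", frozenset({"o0"}), frozenset({"o1"}), size=8000)
tx2 = Tx("tx2", frozenset({"o1"}), frozenset({"o2"}), size=200)

results = [(n.name, n.new_tx(tx1), n.new_tx(tx2)) for n in (alice, bob)]
print(results)
```

In this model alice accepts both transactions, while bob rejects the first for its size and then rejects the second with BadInputs, because the output it spends never appeared in bob's local view.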

Within this feature, we want to improve the UX of hydra-node in the presence of such misalignments.

Note: We should only adopt snapshots that are enforceable on L1.

Suggested solution

  • Allow adoption of a new snapshot. This snapshot has to be:

    • the same for everyone, with a snapshot number and version strictly greater than the previous one.
    • enforceable and valid against the current state of the protocol on L1:
      • must be signed by everyone (somehow).
  • Allow introspection of the current snapshot in a particular node

    • We want to be able to notice if the head has become stuck. The why might be tricky, but the who would be sufficient for the time being.
    • We want to be able to observe who has not yet signed the snapshot currently in flight (which prevents it from being confirmed).
    • Flow metrics would not help in this scenario, since a single tx can be rejected by one of the peers regardless of volume.
  • Work out what constraints are required to accept a new snapshot
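
As a starting point for working out those constraints, the checks listed above could look roughly as follows. This is a sketch under stated assumptions, not the hydra-node API: `Snapshot`, `missing_signers` and `can_adopt` are hypothetical names, signatures are modelled as a plain set of party names instead of a real multi-signature, and "strictly greater" is read here as a lexicographic comparison on (version, number), which is only one possible interpretation. The L1 enforceability check is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    version: int
    number: int
    utxo: frozenset
    signed_by: frozenset  # party names; stands in for a multi-signature


def missing_signers(snapshot, parties):
    # Introspection: who has not yet signed this snapshot?
    return sorted(set(parties) - snapshot.signed_by)


def can_adopt(candidate, current, parties):
    # Assumption: "strictly greater" means lexicographic order on (version, number).
    newer = (candidate.version, candidate.number) > (current.version, current.number)
    # Must be signed by everyone.
    return newer and not missing_signers(candidate, parties)


parties = ["alice", "bob", "carol"]
current = Snapshot(0, 3, frozenset({"o3"}), frozenset(parties))
in_flight = Snapshot(0, 4, frozenset({"o4"}), frozenset({"alice", "carol"}))
fully_signed = Snapshot(0, 4, frozenset({"o4"}), frozenset(parties))

print(missing_signers(in_flight, parties))
print(can_adopt(in_flight, current, parties))
print(can_adopt(fully_signed, current, parties))
```

Here `missing_signers` answers the "who" question for a stuck head, and `can_adopt` refuses both a partially signed snapshot and one that is not newer than the current snapshot.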

What

In this setting, every peer has to cooperate manually by posting a command that resets their local state to a previously confirmed snapshot.
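
A minimal sketch of what such a reset could mean for a node's local state is given below. The `LocalState` fields and the `reset_to_confirmed` function are hypothetical, and dropping all pending transactions on reset is an assumption of this sketch, not something the description specifies.

```python
from dataclasses import dataclass, field


@dataclass
class LocalState:
    confirmed_number: int
    confirmed_utxo: frozenset  # UTxO of the last confirmed snapshot
    local_utxo: set            # local view, including unconfirmed txs
    pending_txs: list = field(default_factory=list)


def reset_to_confirmed(state):
    # Assumption: the reset discards the diverged local view and pending txs.
    return LocalState(
        confirmed_number=state.confirmed_number,
        confirmed_utxo=state.confirmed_utxo,
        local_utxo=set(state.confirmed_utxo),
        pending_txs=[],
    )


diverged = LocalState(
    confirmed_number=3,
    confirmed_utxo=frozenset({"o0"}),
    local_utxo={"o1"},
    pending_txs=["tx1"],
)
clean = reset_to_confirmed(diverged)
print(clean.local_utxo, clean.pending_txs)
```

After the reset, the node's local view matches the confirmed snapshot again, so new transactions are validated against outputs that all peers can see.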

Scenarios

  1. A configuration discrepancy (e.g. in --ledger-protocol-parameters) arises after the Head is open. For example, maxTxSize/maxTxExecutionUnits could accommodate the first NewTx but not the second, bringing a peer into disagreement.

    • this also requires the misconfigured peer to reset its node and fix its config to align with the rest of the parties.
  2. A peer goes offline for too long and fails to catch up or to resend AckSn.

Additional context

Compared to #1284, this solution does not depend on how long a peer stays offline.
When a peer comes back online, the networking layer makes sure the reconnecting peer catches up. If someone clears their pending txs while this happens, it creates a worse scenario in which parties end up with different confirmed snapshots.

Metadata

Labels: 💭 idea (An idea or feature request)

Projects: Status Done ✔, 2025 Q3
