Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

State sync from local snapshot: Understand problem, explore solutions #28

Closed
4 tasks done
Tracked by #27
sergio-mena opened this issue Dec 23, 2022 · 4 comments · Fixed by #801
Closed
4 tasks done
Tracked by #27

State sync from local snapshot: Understand problem, explore solutions #28

sergio-mena opened this issue Dec 23, 2022 · 4 comments · Fixed by #801
Assignees
Labels
adr Issue relating to the implementation of a specific ADR P:integrator-experience Priority: Improve experience for integrators P:operator-experience Priority: Improve experience for operators
Milestone

Comments

@sergio-mena
Copy link
Contributor

sergio-mena commented Dec 23, 2022

Tasks:

DoD:

  • We have made sure that users want the problem solved
  • We have decided on a solution, and implemented it
  • The ADR is merged

Original issue: tendermint/tendermint#9947

@sergio-mena sergio-mena added the backlog A prioritized task in the team's backlog label Dec 23, 2022
@adizere adizere added this to the 2023-Q2 milestone Apr 21, 2023
@adizere adizere added adr Issue relating to the implementation of a specific ADR P:operator-experience Priority: Improve experience for operators P:integrator-experience Priority: Improve experience for integrators labels May 2, 2023
@cason
Copy link
Contributor

cason commented May 5, 2023

So, some feedback on the state here:

Analyze and understand the problem ADR083 is trying to address

We understand.

Gauge users' interest in solving this problem (how painful it is not to solve it?)

They definitely want. It is very painful not to, they are ready to solve the problem by them own.

Adapt ADR083 according to the solution decided, finalize and merge it

My question here is another. Should we use old ADR 083, new ADR 103 (it will be renumbered, but anyway) to propose the new solution discussed on May 4th meeting? Or should we start a new ADR X with that solution?

A possible solution would be:

  • Keep ADR 103 as it is, with the original solution. Cite ADR X as an alternative. Deprecate ADR 103 in favor of ADR X.
  • Create a new ADR X proposing the new solution. Point to ADR 103 as an alternative.

@cason
Copy link
Contributor

cason commented May 5, 2023

Keep ADR 103 as it is, with the original solution. Cite ADR X as an alternative. Deprecate ADR 103 in favor of ADR X.

What I mean here is to keep ADR 103 proposing a solution that is not a local sync, as it involves no snapshot. Make it clear on its wording. Point out some disadvantages of this solution.

@sergio-mena
Copy link
Contributor Author

The way I see it (with the context I have):

  • Set ADR 103 as rejected
  • Write ADR X as proposed
  • Add references to each other in a visible place

@thanethomson thanethomson removed the backlog A prioritized task in the team's backlog label May 16, 2023
@cason
Copy link
Contributor

cason commented May 23, 2023

So, we have to consider here another option that was implemented (is already merged) by the SDK. It consists of two parts:

  1. feat: add local snapshots management commands cosmos/cosmos-sdk#16067

This PR creates a snapshot manager binary simd to manage snapshots of a SDK-based application. From my understanding, the SDK has already a module to manage application snapshots, so what this program does is to offer a CLI interface to interact with the snapshots manager.

In the context of this issue, 3 commands are relevant:

  • dump: exports a snapshot produced by the application to a tarball (a tar.gz file). The snapshot is identified by the height it was taken.
  • load: loads a snapshot from a tarball, adding it to the snapshots store. Together with dump, this enables out-of-band snapshot transfer.
  • restore: apply a snapshot present in the store to the application. Since the application is stored in a leveldb-style DB, this is possible even with a running application, it is enough to lock the store.

These three commands, plus the snapshot manager CLI, enables transferring the application state between two nodes in a pretty safe way. The problem at this point is that the restored snapshot might produce an application state that is not compatible with the CometBFT state (putting it simple, the heights don't match).

  1. feat!: bootstrap comet cmd for local state sync  cosmos/cosmos-sdk#16061

This second PR bootstraps CometBFT's state (namely, state store and block store) at a given height, relying on a light client. The light client configuration is set on the CometBFT's configuration file. The light client does its magic and, given a height of the blockchain, produces a state and a commit object. The state store is created/initialized from the state object. The block store stores the commit. With this information set to height H, and provided that the application state reflects the execution of block H, consensus can start from height H+1.

The code added by this PR is identical to the code used by State Sync to bootstrap a node. In the case of State Sync, the light client is configured with the height of the candidate snapshot retrieved from the p2p network. Assuming that the snapshot is valid and can be fully downloaded and installed, CometBFT is bootstrap at the snapshot's height, and starts its operation (either via block sync, or directly via consensus) from the next height.

Regarding the SDK solution, specifically, the bootstrap-state command has an optional height parameter. If set, the light client is used to produce the relevant state for that height. When not set, this should be the most common case, height is set to the application's latest commit height. Notice that in this case, after running the above described procedure, the state of CometBFT (state and block store) will be consistent with the application state.

This solution, putting in simple works, setups a state for all components (state store, block store, and application) that enables the famous Handshake procedure to execute without errors. This procedure in this case will essentially submit an Info ABCI call to the application in order to retrieve its appHash and compare it with the information stored in the state store, which should match by construction.

Raneet10 pushed a commit to 0xPolygon/cometbft that referenced this issue Nov 20, 2023
dev: chg: solve vulns and set maxInboundPeer and maxOutBoundPeer to 100
cometcrafter pushed a commit to graphprotocol/cometbft that referenced this issue May 1, 2024
cometbft#28) (cometbft#32)

* perf(internal/bits): Significantly speedup bitArray.PickRandom (backport cometbft#2841) (cometbft#2887)

This PR significantly speeds up bitArray.PickRandom which is used in
VoteGossip and BlockPart gossip. We saw for a query serving full node,
over an hour, this was a very large amount of RAM allocations. (75GB of
RAM!)

![image](https://github.com/cometbft/cometbft/assets/6440154/755918a5-0cef-4e67-a47e-ce8a56aa1cd5)

This PR drops it down to 0 allocations, and makes the routine 10x faster
on my machine.

OLD:
```
BenchmarkPickRandomBitArray-12           1545199               846.1 ns/op          1280 B/op          1 allocs/op
```
NEW:
```
BenchmarkPickRandomBitArray-12          22192857                75.39 ns/op            0 B/op          0 allocs/op
```

I think the new tests I wrote make this more tested than the old code
that was here tbh, but pls let me know if theres more tests we'd like to
see!

---

- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request cometbft#2841 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>

* fix failing test

* changelog

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
(cherry picked from commit f6abfce)

Co-authored-by: Adam Tucker <adam@osmosis.team>
cometcrafter pushed a commit to graphprotocol/cometbft that referenced this issue May 1, 2024
cometbft#28)

* perf(internal/bits): Significantly speedup bitArray.PickRandom (backport cometbft#2841) (cometbft#2887)

This PR significantly speeds up bitArray.PickRandom which is used in
VoteGossip and BlockPart gossip. We saw for a query serving full node,
over an hour, this was a very large amount of RAM allocations. (75GB of
RAM!)

![image](https://github.com/cometbft/cometbft/assets/6440154/755918a5-0cef-4e67-a47e-ce8a56aa1cd5)

This PR drops it down to 0 allocations, and makes the routine 10x faster
on my machine.

OLD:
```
BenchmarkPickRandomBitArray-12           1545199               846.1 ns/op          1280 B/op          1 allocs/op
```
NEW:
```
BenchmarkPickRandomBitArray-12          22192857                75.39 ns/op            0 B/op          0 allocs/op
```

I think the new tests I wrote make this more tested than the old code
that was here tbh, but pls let me know if theres more tests we'd like to
see!

---

- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request cometbft#2841 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>

* fix failing test

* changelog

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
adr Issue relating to the implementation of a specific ADR P:integrator-experience Priority: Improve experience for integrators P:operator-experience Priority: Improve experience for operators
Projects
No open projects
Status: Done
5 participants