Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable local statesync/snapshot restore #13521

Closed
wants to merge 3 commits into from

Conversation

chillyvee
Copy link
Contributor

@chillyvee chillyvee commented Oct 12, 2022

Seeking guidance for commit:
82f708f (rebased on main for ease of viewing, not tested, but good for comment)

Branch which works with Juno Uni Testnet https://github.com/chillyvee/cosmos-sdk/tree/cv_local_statesync_juno_uni

This PR enables cosmos-sdk to initiate local snapshot restore with tendermint finishing up the stateStore and blockStore data restoration.

This development is based on juno's uni testnet (cosmos-sdk v0.45.6, tendermint v0.34.20) and pairs with tendermint PR enhancement tendermint/tendermint#9541

This PR enables a validator quickly restore without needing to search for remote snapshot servers. In basis, if a validator already takes periodic snapshots, it is wasteful in time and bandwidth to search over P2P in hopes of finding a snapshot.

However, an RPC server is still required to fetch the last, current, and next blocks to fully restore stateStore and blockStore.

statesync/stateprovider.go

146:    lastLightBlock, err := s.lc.VerifyLightBlockAtHeight(ctx, int64(height), time.Now())
150:    currentLightBlock, err := s.lc.VerifyLightBlockAtHeight(ctx, int64(height+1), time.Now())
154:    nextLightBlock, err := s.lc.VerifyLightBlockAtHeight(ctx, int64(height+2), time.Now())

The use case is to quickly reduce disk usage which normally grows over time to an arguably large size without the need to slowly prune IAVL trees.

Before further development, we'd like to ask:

Do you prefer such work to be done on the main branch and then backport?

If testing is required, how much additional verification is preferred? For example, juno/uni was chosen because this proves that cosmwasm stores are properly restored. However it would not be possible to verify with just cosmos-sdk tests.

Please take a look at the structure of code and indicate whether any rules regarding ABCI or API/Responsibilities have been broken.

The current code design is meant for ease of review/comment before adding multiple additional layers of abstraction to put each bit of code in the right place.

What is an efficient way to check if local stores already have existing height data prior to restoring the snapshot?


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • included the correct type prefix in the PR title
  • added ! to the type prefix if API or client breaking change
  • targeted the correct branch (see PR Targeting)
  • provided a link to the relevant issue or specification
  • followed the guidelines for building modules
  • included the necessary unit and integration tests
  • added a changelog entry to CHANGELOG.md
  • included comments for documenting Go code
  • updated the relevant documentation or specification
  • reviewed "Files changed" and left comments if necessary
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed ! in the type prefix if API or client breaking change
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic
  • reviewed API design and naming
  • reviewed documentation is accurate
  • reviewed tests and test coverage
  • manually tested (if applicable)

@tac0turtle
Copy link
Member

hey looks like there was an issue with the branch you have locally. any way you could rebase to main?

@alexanderbez
Copy link
Contributor

Ok, with that frame, I will re-review. I also would encourage we name this "feature" (I think it's a feature?) something like "Lazy State Sync". WDYT?

server/start.go Outdated Show resolved Hide resolved
server/start.go Outdated Show resolved Hide resolved
server/start.go Outdated Show resolved Hide resolved
server/start.go Outdated Show resolved Hide resolved
server/start.go Outdated
@@ -244,6 +247,53 @@ func startStandAlone(ctx *Context, appCreator types.AppCreator) error {
return WaitForQuitSignals()
}

func restoreLocalSnapshot(ctx *Context, app types.Application, ssRestoreHeight uint64) error {
ctx.Logger.Info(fmt.Sprintf("Local Statesync Restore Snapshot With Height: %d", ssRestoreHeight))
ctx.Config.StateSync.RestoreHeight = ssRestoreHeight // Should be a one-shot command
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to update the config?

Copy link
Contributor Author

@chillyvee chillyvee Oct 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove (maybe can't, see next comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: Used here in the tendermint layer to detect a statesync, but there might be a way to remove it.

https://github.com/tendermint/tendermint/pull/9541/files#diff-4eaef000463512634f080db6f15d6aeb1b856e995a148bbeea5ee94758210fe1R281

Copy link
Contributor Author

@chillyvee chillyvee Oct 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussing with the tendermint maintainers if there's a more preferred way to send a signal that cosmos-sdk performed a "Lazy State Sync".

They are asking to consider a slightly different code path which might require a variable in a different place (or no variable at all).

tendermint/tendermint#9541 (comment)

I'm having some trouble implementing the idea they provided due to some objects not being in scope to call so I'll post back here as soon as I get some assistance.

server/start.go Outdated Show resolved Hide resolved
server/start.go Outdated Show resolved Hide resolved
server/start.go Outdated Show resolved Hide resolved
server/start.go Outdated Show resolved Hide resolved
server/start.go Outdated Show resolved Hide resolved
@chillyvee
Copy link
Contributor Author

Thank you for the review. Will clean up the code as recommended in a day or two.

@chillyvee
Copy link
Contributor Author

Ok, with that frame, I will re-review. I also would encourage we name this "feature" (I think it's a feature?) something like "Lazy State Sync". WDYT?

"Lazy State Sync" works.

Earlier thought of "Local State Sync", but perhaps the word "Local" is a bit overused.

@chillyvee
Copy link
Contributor Author

chillyvee commented Oct 25, 2022

Regarding the usage of ctx.Logger.Error, do you have any recommendations on how to return an error after Logging it to console?

Latest changes based on your feedback are added to this branch.

@alexanderbez
Copy link
Contributor

Regarding the usage of ctx.Logger.Error, do you have any recommendations on how to return an error after Logging it to console?

IIRC, error should not cause the process to halt. You can return the error after you log it.

@chillyvee
Copy link
Contributor Author

Appreciate all the reviews and recommendations.

Please hold off merging until we finalize the tendermint side of this effort. :)

@alexanderbez alexanderbez added the S:blocked Status: Blocked label Oct 27, 2022
@github-actions
Copy link
Contributor

github-actions bot commented Dec 3, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Dec 3, 2022
@chillyvee
Copy link
Contributor Author

Hey who's calling us stale?!

@tac0turtle tac0turtle removed the stale label Dec 3, 2022
@github-actions
Copy link
Contributor

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jan 13, 2023
@alexanderbez
Copy link
Contributor

alexanderbez commented Jan 16, 2023

@chillyvee this feature is really awesome and I personally would love to see this land. I know this is contingent upon TM work being delivered first. Any progress on that?

Or is there anyway we can define or merge this w/o waiting on TM s.t. when TM finally does include the necessary changes, this will just automatically work?

Also, I'm more than happy to push this over the finish line myself too if you're too busy. Either way, just lmk 👍

@github-actions github-actions bot removed the stale label Jan 17, 2023
@chillyvee
Copy link
Contributor Author

Any progress on that?

Or is there anyway we can define or merge this w/o waiting on TM s.t. when TM finally does include the necessary changes, this will just automatically work?

On TM - Do you know if we're switching the official tendermint repo? If so we may need to issue a new PR.

There was a bit of talk about how to structure the code in TM (how to take the implementation, restructure where various lines of code go). In the process, a lot of intermediate changes were requested, which raised questions, but not all were answered because the maintainer there had other priorities.

Also TM wants to merge against main/head and backport, so it will take a few more iterations before we can reference from cosmos-sdk.

Do you think it's possible to work together on both sides of cosmos-sdk/tendermint repo?

@chillyvee
Copy link
Contributor Author

Or is there anyway we can define or merge this w/o waiting

Will check again :) There are a few more follow in features that might be worth adding in, but trying to keep code scope small so we can merge "quickly" without additional delays

@yihuang
Copy link
Collaborator

yihuang commented Jan 17, 2023

it could be useful even without the tm part, for example transform an archive node to a non-archive one fast, by restore to a previous snapshot and replay the blocks from tm.

@alexanderbez
Copy link
Contributor

So the TM repo is moved to Informal's org and will be renamed in the coming days/weeks. Work will continue there, so yes, be sure to make sure all relevant issues and PRs are there now @chillyvee

@aaronc aaronc removed their assignment Jan 25, 2023
@tac0turtle
Copy link
Member

hey, really love this work, I will close it until we have the change in comet. We dont have a timeline for that, so dont want to leave this pr open for a unknown amount of time. Once we have a timeline i will contact you. Sorry for the situation. The comet team is working hard so i can see this happening soon.

@tac0turtle tac0turtle closed this Feb 19, 2023
@yihuang yihuang mentioned this pull request May 8, 2023
19 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S:blocked Status: Blocked
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants