
Add upgrade module #4233

Merged

Conversation

aaronc (Contributor) commented Apr 29, 2019

This is a reopening of PR #3979 which was closed when the develop branch was removed. See that PR for previous discussion.

Upgrading live chains has been previously discussed in #1079 and there is a WIP spec in #2116. Neither of these provide an actual implementation of how to coordinate a live chain upgrade on the software level. My understanding and experience with Tendermint chains is that without a software coordination mechanism, validators can easily get into inconsistent state because they all need to be stopped at precisely the same point in the state machine cycle.

This PR provides a module for performing live chain upgrades that has been developed for Regen Ledger and tested against our testnets. It may or may not be what Cosmos SDK wants, but just sharing it in case it is...

This module attempts to take a minimalist approach to coordinating a live chain upgrade and can be integrated with any governance mechanism. Here are a few of its features:

  • allows upgrades to be scheduled at a future block height or after a future block time
  • crashes the blockchain state machine in BeginBlock when an upgrade is required and doesn't allow it to restart until new software with the expected upgrade is started
  • provides a hook for performing state migrations once upgraded software is started
  • allows for custom "crash" behavior that could be used to trigger automatic installation of the new software

This PR doesn't currently include any integration with the Cosmos gov module, but that could be easily done if this upgrade method works for Cosmos hub.

  • Linked to github-issue with discussion and accepted design OR link to spec that describes this work (linked to issues, the specification is described in the docs which are live here: https://godoc.org/github.com/regen-network/regen-ledger/x/upgrade).
  • Wrote tests
  • Updated relevant documentation (docs/), including Go package docs
  • Added a relevant changelog entry: sdkch add [section] [stanza] [message]
  • Re-reviewed Files changed in the GitHub PR explorer

For Admin Use:

  • Added appropriate labels to PR (ex. wip, ready-for-review, docs)
  • Reviewers Assigned
  • Squashed all commits, uses message "Merge pull request #XYZ: [title]" (coding standards)
codecov bot commented Apr 29, 2019 (marked as outdated)

Codecov Report

Merging #4233 into master will decrease coverage by 0.13%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #4233      +/-   ##
==========================================
- Coverage   60.18%   60.04%   -0.14%     
==========================================
  Files         212      215       +3     
  Lines       15187    15248      +61     
==========================================
+ Hits         9140     9156      +16     
- Misses       5418     5448      +30     
- Partials      629      644      +15
codecov bot commented Apr 29, 2019

Codecov Report

Merging #4233 into master will decrease coverage by 1%.
The diff coverage is 62.91%.

@@            Coverage Diff             @@
##           master    #4233      +/-   ##
==========================================
- Coverage   54.61%   53.61%   -1.01%     
==========================================
  Files         299      298       -1     
  Lines       18177    17897     -280     
==========================================
- Hits         9928     9596     -332     
- Misses       7464     7550      +86     
+ Partials      785      751      -34
aaronc (Contributor, Author) commented May 6, 2019

@rigelrozanski I was just thinking I'd mention after our chat last week that the approach here is really agnostic to whether it's governance or validator signaling that makes the upgrade happen. I think that really comes down to a larger governance discussion. All this module does is coordinate chain halts and restarts based on whoever calls ScheduleUpgrade. So ScheduleUpgrade could be called by the gov module, it could be called by a "validator readiness" module once an 80% threshold is reached, or maybe even some combination of the two.
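To illustrate that governance-agnostic design, here is a hedged sketch of a hypothetical "validator readiness" module driving the single ScheduleUpgrade entry point; every type and method name below is invented for illustration and is not the PR's API:

```go
package main

import "fmt"

// Scheduler is the only surface this upgrade module would ask of its
// caller: anything (the gov module, validator signaling, or both) can
// drive it. Hypothetical interface for illustration.
type Scheduler interface {
	ScheduleUpgrade(name string, height int64)
}

// recordScheduler stands in for the real upgrade keeper and records
// the last scheduled plan.
type recordScheduler struct {
	name   string
	height int64
}

func (r *recordScheduler) ScheduleUpgrade(name string, height int64) {
	r.name, r.height = name, height
	fmt.Printf("scheduled %s at height %d\n", name, height)
}

// SignalTally is an illustrative validator-readiness module: it calls
// ScheduleUpgrade once signaled voting power crosses a threshold.
type SignalTally struct {
	sched      Scheduler
	totalPower int64
	signaled   int64
	threshold  float64 // e.g. 0.8 for an 80% threshold
}

func (t *SignalTally) Signal(power int64, upgradeName string, height int64) {
	t.signaled += power
	if float64(t.signaled)/float64(t.totalPower) >= t.threshold {
		t.sched.ScheduleUpgrade(upgradeName, height)
	}
}

func main() {
	tally := &SignalTally{sched: &recordScheduler{}, totalPower: 100, threshold: 0.8}
	tally.Signal(50, "v2", 1000) // 50% signaled: nothing happens
	tally.Signal(35, "v2", 1000) // 85% signaled: upgrade is scheduled
}
```

A gov proposal handler could call the same Scheduler in its passed-proposal branch, which is what makes the module agnostic to the decision mechanism.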

ethanfrey (Contributor) left a comment

Nice idea and good start on providing migration/upgrade tooling.
I would love more comments and thoughts from the cosmos-sdk team, as "kill the blockchain, dump state, and start a new chain" doesn't seem like a viable long-term solution.

I wonder what happens to all the exchanges next time this happens....

The app must then integrate the upgrade keeper with its governance module as appropriate. The governance module
should call ScheduleUpgrade to schedule an upgrade and ClearUpgradePlan to cancel a pending upgrade.
Performing Upgrades

ethanfrey (Contributor) commented May 10, 2019

This is an interesting idea. And wonderful documentation. Both these package-level comments, as well as all the comments on types.

For halting, this works well. However, I think we need a more complete strategy for upgrades. I will expand on this on the issue. But yeah, this seems a nice first step.

jackzampolin (Member) commented Oct 8, 2019

Does this file need to get updated with the changes that have been made?

fedekunze (Contributor) commented Oct 9, 2019

we need to eventually transition or duplicate part of this godoc into a new docs/ guideline for how to upgrade a live chain. cc: @gamarin2 @hschoenburg


upgradeTime := keeper.plan.Time
upgradeHeight := keeper.plan.Height
if (!upgradeTime.IsZero() && !blockTime.Before(upgradeTime)) || upgradeHeight <= blockHeight {

ethanfrey (Contributor) commented May 10, 2019

Nice logic. I think the feature switch needs to be more integrated in the rest of the framework.

ethanfrey (Contributor) commented May 10, 2019

I think this is a great beginning to the upgrade plans for cosmos-sdk.

What this does provide is a way to halt the chain (in the future, at a predefined time/height schedule - smart), and a way to restart the chain when the new software is deployed. However, if I understand properly, it still requires: chain halts, validators deploy new software, chain restarts (as soon as >2/3 are live on the new code).

Also, if I want to replay the chain, in order to reproduce it, I will have to run v1 until it halts, then replace with v2, then start up again.

TL;DR: looks great. Key addition would be some switch so that, when syncing a chain later, v2 of the software can detect where the upgrade happened and replay the earlier blocks correctly.

I think a nice addition here would be something like "feature switches" (which may be possible with the current infrastructure, using the onUpgrade callback, but I am proposing more tooling here). For example, go-ethereum has a lookup of the block heights at which certain features are enabled. These are then checked at points in the code to, e.g., enable replay protection or not.

Better than storing the config files, one could store this information on-chain. This means v2 will query for "cip 21 enabled" for example and decide whether to run in v1 or v2 mode. Then, when replaying a chain, v2 will run in v1 mode until the halt point. Since the handler is already registered, it will immediately execute onUpgrade (setting the feature switch to on) and continue the rest of the blocks using the new v2 functionality.
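A minimal sketch of that on-chain feature-switch lookup, assuming activation heights are recorded somewhere queryable; the names here are illustrative, not a proposed API:

```go
package main

import "fmt"

// FeatureStore maps a feature name (e.g. "cip-21") to the block height
// at which it activates, analogous to go-ethereum's per-height forks.
// In the scenario described above this mapping would live on-chain.
type FeatureStore map[string]int64

// Enabled reports whether a feature is active at the given height, so
// v2 software can replay pre-upgrade blocks in v1 mode and switch over
// exactly at the recorded activation height.
func (fs FeatureStore) Enabled(name string, height int64) bool {
	h, ok := fs[name]
	return ok && height >= h
}

func main() {
	fs := FeatureStore{"cip-21": 5000}
	fmt.Println(fs.Enabled("cip-21", 4999)) // false: run v1 behavior
	fmt.Println(fs.Enabled("cip-21", 5000)) // true: run v2 behavior
}
```

The onUpgrade handler would write the activation height at the halt point, so a later replay flips to v2 behavior at exactly the same block.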

Then we also need a way to see if users are sending v1 or v2 messages, whether models are v1 or v2. We recently added migrations support in weave and one side-effect was adding a Metadata/Schema attribute to top-level Message and Model structs (which are the ones that we first get on de-serialization). We then check this against which are enabled, possibly do on-the-fly upgrades, or simply reject false versions.

This code has not been tried out on a live chain yet, and likely has many points for improvement. Glad to get feedback and cross-pollination on such techniques.

rigelrozanski (Member) commented May 10, 2019

brain dump from concepts discussed during an sdk core-dev call on chain upgrades:

  • folks seem agreeable to the idea of direct state updates, as we've previously discussed
  • this fits well with some of the ideas proposed in this PR for init-chain upgrades
  • an important point raised by the team is that we need the capability to halt both Tendermint and the ABCI application before restarting - each version of the application will often be paired with a new version of both for a while.
    • due to this ^ we discussed that it would be positive to introduce a new middleware responsible for determining which versions of both the application and tendermint should be running and from between which blocks
    • version and block range information could be loaded from a config, for dynamic upgrades the sdk could write to this config directly
      • this may have negative security implications, we ought to consider alternatives
    • this middleware would exist between the tendermint-abci and the application
    • this middleware would effectively require the power to shutdown and restart tendermint and application instances, potentially through running these in containers.

CC @sunnya97 @jackzampolin @jaekwon @alexanderbez

aaronc (Contributor, Author) commented May 15, 2019

@ethanfrey

Also, if I want to replay the chain, in order to reproduce it, I will have to run v1 until it halts, then replace with v2, then start up again.

I have been working on having a smooth solution to this using NixOS, which may be a solution for nodes that are willing to run that. I could see something similar being done in a docker-based deployment scenario, or even with some sort of binary manager that downloads the correct binaries at upgrade points. Validators, however, might have more complicated setups that wouldn't be covered by these approaches. What I would suggest as a general principle is that, at a minimum, we create an easy replay recipe for casual users who want to run a full node. The replay path for validators may or may not be as simple.

I think a nice addition here would be something like "feature switches" (which may be possible with the current infrastructure, using the onUpgrade callback, but I am proposing more tooling here). For example, go-ethereum has a lookup of the block heights at which certain features are enabled. These are then checked at points in the code to, e.g., enable replay protection or not.

I think the biggest challenge with doing a feature switch type approach is that it places quite a bit of burden on engineers to correctly code the feature switches. I actually think it would be good if projects were coded with that sort of discipline, but it might not be too realistic near term. Using some sort of binary switching (like I'm doing with NixOS) would make things a bit easier so that the exact same binary would be replayed at each phase of the upgrade.

aaronc (Contributor, Author) commented May 15, 2019

@rigelrozanski

  • an important point raised by the team is that we need the capability to halt both tendermint and the abci-application before restarting - each version of the application will often have a new version of both for a while.

Why does panic'ing in an ABCI handler like BeginBlock not halt Tendermint as well? Won't Tendermint's state machine fail to advance without some sort of non-error response from the ABCI app?

rigelrozanski (Member) commented May 16, 2019

You're correct, I believe it should. What are you thinking? I'll also mention a point I forgot earlier: we need the capability for the validator set to hard fork the hub without governance approval under a broken-governance or last-resort scenario. Kinda like having manual controls for the blockchain if need be.

ethanfrey (Contributor) commented May 17, 2019

I think the biggest challenge with doing a feature switch type approach is that it places quite a bit of burden on engineers to correctly code the feature switches. I actually think it would be good if projects were coded with that sort of discipline, but it might not be too realistic near term. Using some sort of binary switching (like I'm doing with NixOS) would make things a bit easier so that the exact same binary would be replayed at each phase of the upgrade.

I agree and the overhead of testing multiple code paths and transitions is rather high.

I love the idea of triggering an OS-level (NixOS / docker / etc.) switch of the binary at some point. Like we register binaries with tags chain v1.0, chain v1.1 (maybe even self-built), and then the app (or better, some higher-level supervisor) can swap them out on some trigger from the chain.

Great idea

aaronc (Contributor, Author) commented May 20, 2019

@rigelrozanski

You're correct I believe it should. What are you thinking?

Well just that panic'ing in the ABCI app may be all that's needed to stop both the app and Tendermint. There might not be any special changes required at the Tendermint level

I'll also mention a point I forgot earlier: we need the capability for the validator set to hard fork the hub without governance approval under a broken-governance or last-resort scenario. Kinda like having manual controls for the blockchain if need be.

Does Tendermint have some sort of "backdoor" that allows one to set the expected validator set outside of the ABCI process? Would that be maybe the main functionality needed to support a hard-fork?

Another similar scenario that's occurred to me is what if some indeterminism causes a consensus failure when there is no bad behavior, just bad code. I think a similar hard-fork like fix would be needed, but in this case you might need to delete the last block because it causes consensus failure on the ABCI side. I think for this just the ability to import blocks only up to a certain height would support this. Although, maybe it's not needed because the failure is only on the ABCI app side and the consensus failure can probably be fixed with an app upgrade without having to delete the failing Tendermint block.

ethanfrey (Contributor) commented May 20, 2019

I second @aaronc here

Seems like it is doable. And it's fine if this upgrade path only covers 95% of cases, not a DAO-hack-style state revert or ETC fork craziness. For such extreme cases, some custom upgrade coordination would be needed. But that should be the exception, not the rule.

How are you going to adjust the inflation rate calculation logic gracefully? State dump and chain restart? I think this proposal would work there and provide a much nicer experience

aaronc (Contributor, Author) commented May 20, 2019

Seems like it is doable. And it's fine if this upgrade path only covers 95% of cases, not a DAO-hack-style state revert or ETC fork craziness. For such extreme cases, some custom upgrade coordination would be needed. But that should be the exception, not the rule.

Yes, this PR is for the happy path. I agree there should be some mechanism to support the unhappy path but let's make that a separate issue.

How are you going to adjust the inflation rate calculation logic gracefully? State dump and chain restart? I think this proposal would work there and provide a much nicer experience

@ethanfrey I'm not really familiar with how this happens. My hope has been to avoid state dumps because that approach causes transaction history to be lost (not viable for our use case).

ethanfrey (Contributor) commented May 23, 2019

@ethanfrey I'm not really familiar with how this happens. My hope has been to avoid state dumps because that approach causes transaction history to be lost (not viable for our use case).

Sorry for my unclear comment. I was asking how any changes to the Cosmos hub can be made in the current state. I think this proposal would allow a way of gracefully upgrading binaries at the proper locations, thus not requiring a state dump / chain restart as was done on the last hub upgrade. So far that seems to be the only existing path, and I encourage the core dev team to support useful tools that cover 90+% of upgrades, rather than reject them because of some possible edge cases where they would not work (and which would have to fall back to the current extreme upgrade path).

Basically, I am asking.... @rigelrozanski why is this proposal frozen for many weeks without any real feedback? (except a braindump saying you are generally cool with this line of thought) If there is a serious design or code error here, please point it out. If not, it would be great to have a path forward on this.

(I have also been the victim of my PRs hanging months with little to no feedback, and I think this doesn't encourage open source contributions outside of the core team. If you (cosmos/icf/all in bits) wants open source contributions from the community, it would be good to give a bit more feedback, direction, and support to such initiatives as this. I think some healthy feedback here could help evolve a very nice solution with input from all parties).

aaronc (Contributor, Author) commented May 28, 2019

After discussing a bit with @zmanian , one thing that is clear to me now that wasn't before is that it will be a while before the Tendermint block structure is stable. So while it may be possible to restructure state smoothly without creating a new chain, rewriting Tendermint blocks is impossible because signatures will be invalid.

So, this upgrade approach could still be useful for cases where an upgrade doesn't involve any breaking changes on the Tendermint side.

We are planning to test this with a public https://github.com/regen-network/regen-ledger testnet, hopefully as soon as next week.

An alternate idea proposed by @AFDudley for maintaining the continuity of transactions even when a new chain needs to start from height 0 is including some reference to the block hash and chain-id of the previous chain in the genesis file of the new chain. Then some sort of transaction indexer could build up a continuous transaction history.

But again, it doesn't sound like this issue negates the usefulness of this "happy path" upgrade support in cases where it will work.

@aaronc aaronc force-pushed the regen-network:regen-network/upgrade-module branch 2 times, most recently from 9a2b150 to 0889aa6 Jun 3, 2019
@alexanderbez alexanderbez mentioned this pull request Jun 5, 2019
0 of 4 tasks complete
aaronc (Contributor, Author) commented Jun 18, 2019

We discussed the plans for doing a test of this upgrade module with Regen Network's testnet in our community meeting today. The planned timing is as follows:

  • have code ready for our community meeting next Tuesday and submit a governance proposal to do the upgrade with a pre-defined commit hash and upgrade time (not block height, but actual consensus time)
  • have 4 days to vote on the gov proposal
  • do the upgrade at the pre-determined time, possibly Mon, July 1st

We also discussed governance deciding on a predetermined time vs upgrade signalling as proposed here: #1079 (comment). It was brought up that the downside of the signalling approach is that it sort of forces validators to race to get the upgrade and could produce anxiety because you can't predict how quickly others will upgrade and if you are in the last third you could get slashed for being slow. So a pre-determined time or block-height decided in the governance process seemed preferable to those present because it gives a sense of predictability and allows for planning.

In Berlin, I chatted briefly with @ebuchman about Tendermint stability. While there are some important changes coming, it seems like there is the possibility and willingness to do this in such a way that "happy path" upgrades could still be possible. We could make that process easier to manage by using Prototool breaking change checker on the Tendermint block .proto definitions.

@aaronc aaronc mentioned this pull request Jul 2, 2019
1 of 6 tasks complete
@alexanderbez alexanderbez mentioned this pull request Jul 5, 2019
0 of 4 tasks complete
x/upgrade/doc.go (outdated, resolved)
x/upgrade/doc.go (outdated, resolved)
@ethanfrey ethanfrey force-pushed the regen-network:regen-network/upgrade-module branch 2 times, most recently from 4b99328 to aba301f Jul 8, 2019
@ethanfrey ethanfrey force-pushed the regen-network:regen-network/upgrade-module branch from 4d95253 to 9009405 Jul 15, 2019
@ethanfrey ethanfrey mentioned this pull request Jul 15, 2019
4 of 5 tasks complete
aaronc (Contributor, Author) commented Jul 16, 2019

Please note that this PR will soon depend on #4724 in order to perform store migrations that can't be done within the ABCI methods (because the store won't even load without these migrations). #4724 handles cases when KVStores are deleted or renamed as has happened with v0.36.0.

Also as a follow-up to our discussion last week @sunnya97, I want to point out that having a managing process that downloads new binaries would work well on top of this upgrade module approach. At a very basic level the managing process could watch stdout of the gaiad/xrnd daemon and look for UPGRADE REQUIRED messages which indicate upgrade points when a new binary is needed.

@fedekunze fedekunze added the WIP label Jul 19, 2019
x/upgrade/internal/types/keys.go (outdated, resolved)
x/upgrade/internal/keeper/keeper.go (outdated, resolved)

// QuerierKey is used to handle abci_query requests
QuerierKey = ModuleName

golangcibot (Contributor) commented Nov 5, 2019

File is not gofmt-ed with -s (from gofmt)

upgradeHandlers map[string]types.UpgradeHandler
}


golangcibot (Contributor) commented Nov 5, 2019

File is not gofmt-ed with -s (from gofmt)

@ethanfrey ethanfrey mentioned this pull request Nov 5, 2019
3 of 5 tasks complete
ethanfrey (Contributor) commented Nov 5, 2019

@bez all issues should be addressed now, along with a few more detected while integrating with gaia.

Also, please check out cosmos/gaia#184 and run through the demo upgrade procedure (you will want a machine where ~/.gaiad and ~/.gaiacli don't have anything valuable). It should give you much more confidence this works well in production

fedekunze and others added 2 commits Nov 5, 2019
@alexanderbez alexanderbez mentioned this pull request Nov 8, 2019
5 of 8 tasks complete
alexanderbez (Contributor) left a comment

ACK

@alexanderbez alexanderbez merged commit d81d461 into cosmos:master Nov 8, 2019
9 of 10 checks passed
ci/circleci: build-docs-1: Your tests failed on CircleCI
GolangCI: No issues found!
ci/circleci: setup-dependencies: Your tests passed on CircleCI!
ci/circleci: test-cover: Your tests passed on CircleCI!
ci/circleci: test-sim-after-import: Your tests passed on CircleCI!
ci/circleci: test-sim-import-export: Your tests passed on CircleCI!
ci/circleci: test-sim-multi-seed-short: Your tests passed on CircleCI!
ci/circleci: test-sim-nondeterminism: Your tests passed on CircleCI!
ci/circleci: update-swagger-docs: Your tests passed on CircleCI!
ci/circleci: upload-coverage: Your tests passed on CircleCI!
SDK Sprint: 03 (10/29 - 11/12) automation moved this from Review in progress to Done Nov 8, 2019
@fedekunze fedekunze mentioned this pull request Nov 13, 2019