Skip to content
This repository has been archived by the owner on Jul 27, 2022. It is now read-only.

Problem (Fix #1835): non-determinism in reward distribution #1859

Closed
wants to merge 1 commit into from

Conversation

yihuang
Copy link
Collaborator

@yihuang yihuang commented Jul 1, 2020

Solution:

  • Add block height assertion in begin block to prevent consensus
    connection re-connect

@devashishdxt
Copy link
Collaborator

What was the cause of non-determinism?

@yihuang
Copy link
Collaborator Author

yihuang commented Jul 1, 2020

What was the cause of non-determinism?

  • begin block n
  • consensus connection closed
  • tendermint restarted
  • begin block n again

begin block n event might be processed more than once before commit.

Solution:
- Add block height assertion in begin block to prevent consensus
  connection re-connect
@codecov
Copy link

codecov bot commented Jul 1, 2020

Codecov Report

Merging #1859 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1859      +/-   ##
==========================================
+ Coverage   66.15%   66.16%   +0.01%     
==========================================
  Files         205      205              
  Lines       26088    26090       +2     
==========================================
+ Hits        17258    17263       +5     
+ Misses       8830     8827       -3     
Impacted Files Coverage Δ
chain-abci/src/app/mod.rs 83.55% <100.00%> (+0.07%) ⬆️
chain-abci/tests/abci_app.rs 94.69% <100.00%> (+<0.01%) ⬆️
chain-abci/tests/punishment.rs 100.00% <100.00%> (ø)
chain-core/src/state/tendermint.rs 63.18% <0.00%> (+1.64%) ⬆️

@devashishdxt
Copy link
Collaborator

What was the cause of non-determinism?

  • begin block n
  • consensus connection closed
  • tendermint restarted
  • begin block n again

begin block n event might be processed more than once before commit.

Why tendermint’s consensus connection got closed? Was there any error on tendermint side that caused tendermint to terminate connection?

@yihuang
Copy link
Collaborator Author

yihuang commented Jul 1, 2020

What was the cause of non-determinism?

  • begin block n
  • consensus connection closed
  • tendermint restarted
  • begin block n again

begin block n event might be processed more than once before commit.

Why tendermint’s consensus connection got closed? Was there any error on tendermint side that caused tendermint to terminate connection?

There are logs in slack discussions

@devashishdxt
Copy link
Collaborator

Can we figure out the reason for tendermint’s consensus connection disconnection from logs?

@yihuang
Copy link
Collaborator Author

yihuang commented Jul 1, 2020

Can we figure out the reason for tendermint’s consensus connection disconnection from logs?

The abci logged an empty begin block response for some reason:

Jun 26 20:19:17 sgx-testnet-uksouth-fullnode-202005040919-1 chain-abci[3801]: [2020-06-26T20:19:17Z DEBUG abci::server] Return Response! begin_block {}

@tomtau
Copy link
Contributor

tomtau commented Jul 2, 2020

Can we figure out the reason for tendermint’s consensus connection disconnection from logs?

"funky" deployment where processes are restarted individually

Copy link
Contributor

@tomtau tomtau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this sort of assertion should be in rust-abci instead of in each application.
upgrading to abci = "0.7.1" doesn't work?

@yihuang
Copy link
Collaborator Author

yihuang commented Jul 2, 2020

I think this sort of assertion should be in rust-abci instead of in each application.
upgrading to abci = "0.7.1" doesn't work?

I think current abci don't assert the internals of messages, add assertion on block height could solve the issue more thoroughly. maybe abci in the future can add more assertions on state machine transition or internals of messages.

@tomtau
Copy link
Contributor

tomtau commented Jul 2, 2020

I think this sort of assertion should be in rust-abci instead of in each application.
upgrading to abci = "0.7.1" doesn't work?

I think current abci don't assert the internals of messages, add assertion on block height could solve the issue more thoroughly. maybe abci in the future can add more assertions on state machine transition or internals of messages.

yes tendermint/rust-abci#49

there are also a few ignored tests: https://github.com/crypto-com/chain/blob/master/chain-abci/tests/abci_app.rs#L633

current abci doesn't assert them, but a new version doesn't throw away the errors like before, so it'd be good to check whether this fixes the connection drop / it'll fail with the connection drop (so one restarts from the committed state)

@yihuang yihuang closed this Jul 3, 2020
@devashishdxt
Copy link
Collaborator

@tomtau What should ideally be the behaviour in such scenarios? Should ABCI app panic or should it discard current uncommitted state and start processing new BeginBlock?

@tomtau
Copy link
Contributor

tomtau commented Jul 3, 2020

@devashishdxt it's not specified anywhere at the moment.
maybe the easiest is if it gracefully shutdowns on connection drops, so that it restarts altogether with the TM process and reloads the committed state -- I tried with "0.7.1", and it doesn't shutdown/panic

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants