Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ledger: fix possible dbRound unsynchronization for trackers and registry #3910

Merged
merged 4 commits into from
Apr 26, 2022

Conversation

algorandskiy
Copy link
Contributor

@algorandskiy algorandskiy commented Apr 23, 2022

Summary

There is an issue with trackerRegistry.dbRound value and cached dbRound values stored in trackers.
Although they are updated under the same lock, produceCommittingTask might use non-updated dbRound and give it to trackers with updated state.
The fix is simple: have dbRound usage and produceCommittingTask invocation under the same lock.

Test Plan

Added a new test that dances with locks and simulates the race. Unfortunately after the wider locking it looks like impossible to recreate it.

@codecov-commenter
Copy link

codecov-commenter commented Apr 23, 2022

Codecov Report

Merging #3910 (09cc2c6) into master (163ad18) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3910      +/-   ##
==========================================
- Coverage   50.04%   50.04%   -0.01%     
==========================================
  Files         394      394              
  Lines       68465    68465              
==========================================
- Hits        34266    34264       -2     
- Misses      30494    30498       +4     
+ Partials     3705     3703       -2     
Impacted Files Coverage Δ
ledger/tracker.go 74.67% <100.00%> (ø)
ledger/blockqueue.go 82.18% <0.00%> (-2.88%) ⬇️
cmd/tealdbg/debugger.go 72.69% <0.00%> (-0.81%) ⬇️
ledger/acctupdates.go 68.51% <0.00%> (-0.66%) ⬇️
network/wsNetwork.go 62.99% <0.00%> (ø)
catchup/service.go 68.88% <0.00%> (+0.24%) ⬆️
catchup/peerSelector.go 100.00% <0.00%> (+1.04%) ⬆️
crypto/merkletrie/node.go 93.48% <0.00%> (+1.86%) ⬆️
crypto/merkletrie/trie.go 68.61% <0.00%> (+2.18%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 163ad18...09cc2c6. Read the comment docs.

Copy link
Contributor

@tolikzinovyev tolikzinovyev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what is the bug exactly? I couldn't follow your explanation.

ledger/tracker.go Outdated Show resolved Hide resolved
Copy link
Contributor

@jannotti jannotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense, though I know very little about our locking discipline for trackers. But I think I'm seeing that the registry has a dbround, and the intent is to pass down the dbround to all the registered trackers. Previously, the round could change between when it was taken from the registry and passed down to produceCommittingTask, now it can't. That sounds good, but I can't say much more than that. Does produceCommittingRound fail if given an out of date round? I would have guessed it just did less work.

Edit: I think I get it. The tracker's can't always commit an old round. So if they've gone too far ahead, they fail. With the code change, they can't get ahead. It does worry me that they were getting so far ahead - does that mean the new code is holding the lock for a very long time? Should we be looking to understand why that happens? If we don't, when the same condition occurs, this lock will be held, so everything stalls?

ledger/tracker.go Show resolved Hide resolved
@brianolson
Copy link
Contributor

catchpointtracker.go and acctupdates.go have non-trivial implementations of produceCommittingTask(), but I'm not seeing what in them cares if dbRound has moved on - or if anything refers to another moving piece of data

@algorandskiy
Copy link
Contributor Author

algorandskiy commented Apr 26, 2022

anything refers to another moving piece of data

the dbRound from outside gets compared with au.deltas that might be already sliced to have data after au.cachedDbRound (it gets out of sync with the registry's dbRound)

@algorandskiy algorandskiy merged commit bd1129e into algorand:master Apr 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants