New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
index: race fix, lock cs_main while 'm_synced' is subject to change #29867
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code CoverageFor detailed information about the code coverage, see the test coverage report. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsNo conflicts as of last run. |
Is this related to bbe82c1? |
Is it worth it to partially revert? Why not just |
I didn't do it because the revert is not clean. It conflicts with bbe82c1 and it would require an extra commit for the added doc (which, based on the regression, is a must have for me). |
c395b26
to
cbcb2c8
Compare
utACK cbcb2c8
Not a blocker, but I think it would be cleaner if this PR was the following 3 commits:
This would be a cleaner git history, and bbe82c1 turned out to be a red herring so would make more sense to revert that then have people be confused thinking that actually solved anything. |
utACK cbcb2c8 Holding on to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review ACK cbcb2c8. I think I have the opposite opinion as andrewtoth, and that original bugfix c395b26 is more direct and simple than current cbcb2c8 since the original fix leaves the Rewind code alone.
re: #29867 (comment)
Not a blocker, but I think it would be cleaner if this PR was the following 3 commits:
revert bbe82c116e72ca0638751e063bf564cd1fe5c4d5 revert 0faafb57f8298547949cbc0044ee9e925ed887ba doc commit
I don't think this will work because don't actually want to revert bbe82c1. It is still important to set m_synced = true;
after calling Commit to avoid the bug reported #29767. We can't just go back to the state before 0faafb5 because the bug in #29767 was still present before that commit, just less likely to happen. So better to make a direct change fixing the current bug, than trying to fix it by reverting previous changes.
re: #29867 (comment)
Holding on to
cs_main
in this particular spot, when index catches up to the tip, is not a big deal performance wise.
Agree just locking cs_main while calling NextSyncBlock
and setting m_synced = true
like the code before 0faafb5 is probably the easiest fix. I think the following fix which avoids locking cs_main while calling Commit would work too, however:
--- a/src/index/base.cpp
+++ b/src/index/base.cpp
@@ -160,12 +160,24 @@ void BaseIndex::Sync()
}
const CBlockIndex* pindex_next = WITH_LOCK(cs_main, return NextSyncBlock(pindex, m_chainstate->m_chain));
+ // If pindex_next is null, it means pindex is the chain tip, so
+ // commit data indexed so far.
if (!pindex_next) {
SetBestBlockIndex(pindex);
// No need to handle errors in Commit. See rationale above.
Commit();
- m_synced = true;
- break;
+
+ // If pindex is still the chain tip after committing, exit the
+ // sync loop. It is important for cs_main to locked while
+ // setting m_synced = true, otherwise a new block could be
+ // attached while m_synced is still false, and it would not be
+ // indexed.
+ LOCK(::cs_main);
+ pindex_next = NextSyncBlock(pindex, m_chainstate->m_chain);
+ if (!pindex_next) {
+ m_synced = true;
+ break;
+ }
}
if (pindex_next->pprev != pindex && !Rewind(pindex, pindex_next->pprev)) {
FatalErrorf("%s: Failed to rewind index %s to a previous chain tip", __func__, GetName());
Any of the fixes seem ok to me.
@ryanofsky's suggestion looks good to me as well. Though maybe if the timing is really unlucky you'd get lots of |
Ahh, I see yes |
That's a good point. If Commit() is slow and it has to be called more than once, it could make index sync less efficient, even if it blocks validation code less. There's a choice about what to do when the Commit() call at the end of the sync is slow and a new block is connected while it is executing.
I think behaviors 2 or 3 both should be ok in practice, and 3 may be more appealing conceptually because ideally indexes should not hold cs_main and block validation code while writing their own data. Another alternative could be to use approach 3 but only call Commit() exactly once at the end of the sync, skipping subsequent Commit() calls if more blocks are connected until the sync is completely finished and there is a ChainStateFlushed() notification. This would be similar to what happens with normal BlockConnected() notifications for new blocks. I think any approach that fixes the bug is fine though, and the current approach does LGTM. |
Little delayed but here again. I'm also okay with both options. I'm feeling a bit more inclined towards option 3, primarily because it allows us to explain what the code is doing more clearly. Also, just to share some extra thoughts around this issue, another -bad- option to hold a minimal |
This ensures that the index does not miss any 'new block' signals occurring in-between reading the 'next block' and setting 'm_synced'. Because, if this were to happen, the ignored blocks would never be indexed, thus stalling the index forever.
cbcb2c8
to
65951e0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review ACK 65951e0
@@ -160,12 +160,24 @@ void BaseIndex::Sync() | |||
} | |||
|
|||
const CBlockIndex* pindex_next = WITH_LOCK(cs_main, return NextSyncBlock(pindex, m_chainstate->m_chain)); | |||
// If pindex_next is null, it means pindex is the chain tip, so | |||
// commit data indexed so far. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I had to read this three times to get it right, but feel free to ignore unless you have to retouch.
I would change it to [...], so commit the data that was indexed so far.
Code review ACK 65951e0 |
ACK 65951e0 |
Note just to clarify issue history: The race condition fixed here was introduced by 0faafb5 from #28955, which stopped locking This race condition is different from the other race condition recently fixed in #29776, which is a much older bug going back to 4368384 from #14121, caused by incorrectly setting |
Seems like there's at least something which could/should be backported here? Maybe only the other change, given the newly broken code has only existed in master? |
Yes in terms of backports, it could make sense to backport the #29776 fix to releases before #28955, since the race fixed by #29776 where two threads are trying to update index state at the same time at the end of a sync could happen before #28955, although it was probably much less likely to happen before #28955. It would only make sense to backport this PR to releases after #28955, though, since this bug, which could cause blocks to be missing from the index, was introduced in #28955. |
This ensures that the index does not miss any 'new block' signals occurring in-between reading the 'next block' and setting 'm_synced'. Because, if this were to happen, the ignored blocks would never be indexed, thus stalling the index forever. Github-Pull: bitcoin#29867 Rebased-From: 65951e0
This ensures that the index does not miss any 'new block' signals occurring in-between reading the 'next block' and setting 'm_synced'. Because, if this were to happen, the ignored blocks would never be indexed, thus stalling the index forever. Github-Pull: bitcoin#29867 Rebased-From: 65951e0
Fixes #29831 and #29863. Thanks to Marko for the detailed description of the issue.
The race occurs because a block could be connected and its event signaled in-between reading the 'next block' and setting
m_synced
during the index initial synchronization. This is becausecs_main
is not locked through the process of determining the final index sync state.To address the issue, the
m_synced
flag set has been moved undercs_main
guard.