Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

index: race fix, lock cs_main while 'm_synced' is subject to change #29867

Merged
merged 1 commit into from Apr 25, 2024

Conversation

furszy
Copy link
Member

@furszy furszy commented Apr 13, 2024

Fixes #29831 and #29863. Thanks to Marko for the detailed description of the issue.

The race occurs because a block could be connected and its event signaled in-between reading the 'next block' and setting m_synced during the index initial synchronization. This is because cs_main is not locked through the process of determining the final index sync state.
To address the issue, the m_synced flag set has been moved under cs_main guard.

@DrahtBot
Copy link
Contributor

DrahtBot commented Apr 13, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK ryanofsky, fjahr, achow101
Stale ACK andrewtoth, Sjors

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

No conflicts as of last run.

@maflcko
Copy link
Member

maflcko commented Apr 14, 2024

Is this related to bbe82c1?

@furszy
Copy link
Member Author

furszy commented Apr 14, 2024

Is this related to bbe82c1?

No, it is related to 0faafb5. This PR partially reverts it and documents the flow so the regression does not happen again.

@andrewtoth
Copy link
Contributor

andrewtoth commented Apr 15, 2024

Is this related to bbe82c1?

No, it is related to 0faafb5. This PR partially reverts it and documents the flow so the regression does not happen again.

Is it worth it to partially revert? Why not just git revert 0faafb57f8298547949cbc0044ee9e925ed887ba? Compared to this partial revert, the happy path of the full revert is only holding the lock for an extra pointer comparison check to see if there is a reorg, which doesn't seem like it is worth moving the lock for. Handling Rewind is more complex, but it will happen very rarely and shouldn't really happen much in Sync unless we are very close to tip.

@furszy
Copy link
Member Author

furszy commented Apr 16, 2024

Is it worth it to partially revert? Why not just git revert 0faafb5?

I didn't do it because the revert is not clean. It conflicts with bbe82c1 and it would require an extra commit for the added doc (which, based on the regression, is a must have for me).
But np on doing it and squashing the commits down to one. One sec.

@andrewtoth
Copy link
Contributor

utACK cbcb2c8

I didn't do it because the revert is not clean. It conflicts with bbe82c1 and it would require an extra commit for the added doc (which, based on the regression, is a must have for me).

Not a blocker, but I think it would be cleaner if this PR was the following 3 commits:

revert bbe82c116e72ca0638751e063bf564cd1fe5c4d5
revert 0faafb57f8298547949cbc0044ee9e925ed887ba
doc commit

This would be a cleaner git history, and bbe82c1 turned out to be a red herring so would make more sense to revert that then have people be confused thinking that actually solved anything.

@Sjors
Copy link
Member

Sjors commented Apr 16, 2024

utACK cbcb2c8

Holding on to cs_main in this particular spot, when index catches up to the tip, is not a big deal performance wise. The other, much more frequently called Commit() in Sync() was already done without holding cs_main.

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK cbcb2c8. I think I have the opposite opinion as andrewtoth, and that original bugfix c395b26 is more direct and simple than current cbcb2c8 since the original fix leaves the Rewind code alone.

re: #29867 (comment)

Not a blocker, but I think it would be cleaner if this PR was the following 3 commits:

revert bbe82c116e72ca0638751e063bf564cd1fe5c4d5
revert 0faafb57f8298547949cbc0044ee9e925ed887ba
doc commit

I don't think this will work because don't actually want to revert bbe82c1. It is still important to set m_synced = true; after calling Commit to avoid the bug reported #29767. We can't just go back to the state before 0faafb5 because the bug in #29767 was still present before that commit, just less likely to happen. So better to make a direct change fixing the current bug, than trying to fix it by reverting previous changes.

re: #29867 (comment)

Holding on to cs_main in this particular spot, when index catches up to the tip, is not a big deal performance wise.

Agree just locking cs_main while calling NextSyncBlock and setting m_synced = true like the code before 0faafb5 is probably the easiest fix. I think the following fix which avoids locking cs_main while calling Commit would work too, however:

--- a/src/index/base.cpp
+++ b/src/index/base.cpp
@@ -160,12 +160,24 @@ void BaseIndex::Sync()
             }
 
             const CBlockIndex* pindex_next = WITH_LOCK(cs_main, return NextSyncBlock(pindex, m_chainstate->m_chain));
+            // If pindex_next is null, it means pindex is the chain tip, so
+            // commit data indexed so far.
             if (!pindex_next) {
                 SetBestBlockIndex(pindex);
                 // No need to handle errors in Commit. See rationale above.
                 Commit();
-                m_synced = true;
-                break;
+
+                // If pindex is still the chain tip after committing, exit the
+                // sync loop. It is important for cs_main to locked while
+                // setting m_synced = true, otherwise a new block could be
+                // attached while m_synced is still false, and it would not be
+                // indexed.
+                LOCK(::cs_main);
+                pindex_next = NextSyncBlock(pindex, m_chainstate->m_chain);
+                if (!pindex_next) {
+                    m_synced = true;
+                    break;
+                }
             }
             if (pindex_next->pprev != pindex && !Rewind(pindex, pindex_next->pprev)) {
                 FatalErrorf("%s: Failed to rewind index %s to a previous chain tip", __func__, GetName());

Any of the fixes seem ok to me.

@DrahtBot DrahtBot requested a review from ryanofsky April 16, 2024 15:28
@Sjors
Copy link
Member

Sjors commented Apr 16, 2024

@ryanofsky's suggestion looks good to me as well. Though maybe if the timing is really unlucky you'd get lots of Commit() calls?

@andrewtoth
Copy link
Contributor

andrewtoth commented Apr 16, 2024

It is still important to set m_synced = true; after calling Commit to avoid the bug reported #29767.

Ahh, I see yes BlockConnected is not guarded by cs_main, so that could still be called even if we hold cs_main inside Sync. Thank you for clarifying this for me.

@ryanofsky
Copy link
Contributor

ryanofsky commented Apr 16, 2024

Though maybe if the timing is really unlucky you'd get lots of Commit() calls?

That's a good point. If Commit() is slow and it has to be called more than once, it could make index sync less efficient, even if it blocks validation code less.

There's a choice about what to do when the Commit() call at the end of the sync is slow and a new block is connected while it is executing.

  1. Currently, the new block may get ignored and the index could be corrupted.
  2. Before 0faafb5 from #28955, and after cbcb2c8 from this PR, cs_main is locked during Commit() so a new block can't be connected.
  3. The suggested diff would call Commit() again and potentially keep calling it until no new blocks were connected.

I think behaviors 2 or 3 both should be ok in practice, and 3 may be more appealing conceptually because ideally indexes should not hold cs_main and block validation code while writing their own data.

Another alternative could be to use approach 3 but only call Commit() exactly once at the end of the sync, skipping subsequent Commit() calls if more blocks are connected until the sync is completely finished and there is a ChainStateFlushed() notification. This would be similar to what happens with normal BlockConnected() notifications for new blocks.

I think any approach that fixes the bug is fine though, and the current approach does LGTM.

@furszy
Copy link
Member Author

furszy commented Apr 17, 2024

Little delayed but here again. I'm also okay with both options. I'm feeling a bit more inclined towards option 3, primarily because it allows us to explain what the code is doing more clearly.

Also, just to share some extra thoughts around this issue, another -bad- option to hold a minimal cs_main lock could be to keep the current flow in master and connect missing intermediate blocks inside BlockConnected only the first time it is called after setting m_synced. This does not require chain access nor any extra cs_main lock, as we could go backwards from the newly connected tip down to the index tip, reading and connecting the missing block(s).
The obvious downside of this option is that the index could finish the initial sync process missing one or two blocks, only updating its tip on the next BlockConnected signal, which could take a while. So, this would change the meaning of m_synced.

This ensures that the index does not miss any 'new block' signals
occurring in-between reading the 'next block' and setting 'm_synced'.
Because, if this were to happen, the ignored blocks would never be
indexed, thus stalling the index forever.
Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK 65951e0

@DrahtBot DrahtBot requested a review from Sjors April 18, 2024 12:28
@@ -160,12 +160,24 @@ void BaseIndex::Sync()
}

const CBlockIndex* pindex_next = WITH_LOCK(cs_main, return NextSyncBlock(pindex, m_chainstate->m_chain));
// If pindex_next is null, it means pindex is the chain tip, so
// commit data indexed so far.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I had to read this three times to get it right, but feel free to ignore unless you have to retouch.

I would change it to [...], so commit the data that was indexed so far.

@fjahr
Copy link
Contributor

fjahr commented Apr 18, 2024

Code review ACK 65951e0

@achow101
Copy link
Member

ACK 65951e0

@achow101 achow101 merged commit 0e2e7d1 into bitcoin:master Apr 25, 2024
16 checks passed
@furszy furszy deleted the 2024_index_fix_race branch April 25, 2024 18:27
@ryanofsky
Copy link
Contributor

ryanofsky commented Apr 25, 2024

Note just to clarify issue history: The race condition fixed here was introduced by 0faafb5 from #28955, which stopped locking cs_main while setting m_synced = true. It could cause indexes to incorrectly discard block-connected notifications.

This race condition is different from the other race condition recently fixed in #29776, which is a much older bug going back to 4368384 from #14121, caused by incorrectly setting m_synced = true before the Commit() call, instead of after. That race condition could cause the sync and notification threads to both try to write index data at the same time, and possibly corrupt the state. That race condition was probably also more likely to happen after #28955, though, due to cs_main being released.

@fanquake
Copy link
Member

Seems like there's at least something which could/should be backported here? Maybe only the other change, given the newly broken code has only existed in master?

@ryanofsky
Copy link
Contributor

Yes in terms of backports, it could make sense to backport the #29776 fix to releases before #28955, since the race fixed by #29776 where two threads are trying to update index state at the same time at the end of a sync could happen before #28955, although it was probably much less likely to happen before #28955.

It would only make sense to backport this PR to releases after #28955, though, since this bug, which could cause blocks to be missing from the index, was introduced in #28955.

@fanquake
Copy link
Member

I've pulled #29776 in for 27.x. It could go into 26.x as well cc @glozow.

glozow pushed a commit to glozow/bitcoin that referenced this pull request Apr 29, 2024
This ensures that the index does not miss any 'new block' signals
occurring in-between reading the 'next block' and setting 'm_synced'.
Because, if this were to happen, the ignored blocks would never be
indexed, thus stalling the index forever.

Github-Pull: bitcoin#29867
Rebased-From: 65951e0
glozow pushed a commit to glozow/bitcoin that referenced this pull request Apr 29, 2024
This ensures that the index does not miss any 'new block' signals
occurring in-between reading the 'next block' and setting 'm_synced'.
Because, if this were to happen, the ignored blocks would never be
indexed, thus stalling the index forever.

Github-Pull: bitcoin#29867
Rebased-From: 65951e0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ci: failure in rpc_scanblocks.py
9 participants