index: race fix, lock cs_main while 'm_synced' is subject to change #29867

furszy · 2024-04-13T15:38:39Z

Fixes #29831 and #29863. Thanks to Marko for the detailed description of the issue.

The race occurs because a block could be connected and its event signaled in-between reading the 'next block' and setting m_synced during the index initial synchronization. This is because cs_main is not locked through the process of determining the final index sync state.
To address the issue, the m_synced flag set has been moved under cs_main guard.

DrahtBot · 2024-04-13T15:38:42Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	ryanofsky, fjahr, achow101
Stale ACK	andrewtoth, Sjors

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

No conflicts as of last run.

maflcko · 2024-04-14T10:31:25Z

Is this related to bbe82c1?

furszy · 2024-04-14T13:36:15Z

Is this related to bbe82c1?

No, it is related to 0faafb5. This PR partially reverts it and documents the flow so the regression does not happen again.

andrewtoth · 2024-04-15T23:02:48Z

Is this related to bbe82c1?

No, it is related to 0faafb5. This PR partially reverts it and documents the flow so the regression does not happen again.

Is it worth it to partially revert? Why not just git revert 0faafb57f8298547949cbc0044ee9e925ed887ba? Compared to this partial revert, the happy path of the full revert is only holding the lock for an extra pointer comparison check to see if there is a reorg, which doesn't seem like it is worth moving the lock for. Handling Rewind is more complex, but it will happen very rarely and shouldn't really happen much in Sync unless we are very close to tip.

furszy · 2024-04-16T10:04:17Z

Is it worth it to partially revert? Why not just git revert 0faafb5?

I didn't do it because the revert is not clean. It conflicts with bbe82c1 and it would require an extra commit for the added doc (which, based on the regression, is a must have for me).
But np on doing it and squashing the commits down to one. One sec.

andrewtoth · 2024-04-16T13:37:10Z

utACK cbcb2c8

I didn't do it because the revert is not clean. It conflicts with bbe82c1 and it would require an extra commit for the added doc (which, based on the regression, is a must have for me).

Not a blocker, but I think it would be cleaner if this PR was the following 3 commits:

revert bbe82c116e72ca0638751e063bf564cd1fe5c4d5
revert 0faafb57f8298547949cbc0044ee9e925ed887ba
doc commit

This would be a cleaner git history, and bbe82c1 turned out to be a red herring so would make more sense to revert that then have people be confused thinking that actually solved anything.

Sjors · 2024-04-16T14:48:13Z

utACK cbcb2c8

Holding on to cs_main in this particular spot, when index catches up to the tip, is not a big deal performance wise. The other, much more frequently called Commit() in Sync() was already done without holding cs_main.

ryanofsky

Code review ACK cbcb2c8. I think I have the opposite opinion as andrewtoth, and that original bugfix c395b26 is more direct and simple than current cbcb2c8 since the original fix leaves the Rewind code alone.

re: #29867 (comment)

Not a blocker, but I think it would be cleaner if this PR was the following 3 commits:
revert bbe82c116e72ca0638751e063bf564cd1fe5c4d5
revert 0faafb57f8298547949cbc0044ee9e925ed887ba
doc commit

I don't think this will work because don't actually want to revert bbe82c1. It is still important to set m_synced = true; after calling Commit to avoid the bug reported #29767. We can't just go back to the state before 0faafb5 because the bug in #29767 was still present before that commit, just less likely to happen. So better to make a direct change fixing the current bug, than trying to fix it by reverting previous changes.

re: #29867 (comment)

Holding on to cs_main in this particular spot, when index catches up to the tip, is not a big deal performance wise.

Agree just locking cs_main while calling NextSyncBlock and setting m_synced = true like the code before 0faafb5 is probably the easiest fix. I think the following fix which avoids locking cs_main while calling Commit would work too, however:

--- a/src/index/base.cpp
+++ b/src/index/base.cpp
@@ -160,12 +160,24 @@ void BaseIndex::Sync()
             }
 
             const CBlockIndex* pindex_next = WITH_LOCK(cs_main, return NextSyncBlock(pindex, m_chainstate->m_chain));
+            // If pindex_next is null, it means pindex is the chain tip, so
+            // commit data indexed so far.
             if (!pindex_next) {
                 SetBestBlockIndex(pindex);
                 // No need to handle errors in Commit. See rationale above.
                 Commit();
-                m_synced = true;
-                break;
+
+                // If pindex is still the chain tip after committing, exit the
+                // sync loop. It is important for cs_main to locked while
+                // setting m_synced = true, otherwise a new block could be
+                // attached while m_synced is still false, and it would not be
+                // indexed.
+                LOCK(::cs_main);
+                pindex_next = NextSyncBlock(pindex, m_chainstate->m_chain);
+                if (!pindex_next) {
+                    m_synced = true;
+                    break;
+                }
             }
             if (pindex_next->pprev != pindex && !Rewind(pindex, pindex_next->pprev)) {
                 FatalErrorf("%s: Failed to rewind index %s to a previous chain tip", __func__, GetName());

Any of the fixes seem ok to me.

Sjors · 2024-04-16T15:43:37Z

@ryanofsky's suggestion looks good to me as well. Though maybe if the timing is really unlucky you'd get lots of Commit() calls?

andrewtoth · 2024-04-16T16:08:49Z

It is still important to set m_synced = true; after calling Commit to avoid the bug reported #29767.

Ahh, I see yes BlockConnected is not guarded by cs_main, so that could still be called even if we hold cs_main inside Sync. Thank you for clarifying this for me.

ryanofsky · 2024-04-16T18:27:34Z

Though maybe if the timing is really unlucky you'd get lots of Commit() calls?

That's a good point. If Commit() is slow and it has to be called more than once, it could make index sync less efficient, even if it blocks validation code less.

There's a choice about what to do when the Commit() call at the end of the sync is slow and a new block is connected while it is executing.

Currently, the new block may get ignored and the index could be corrupted.
Before 0faafb5 from #28955, and after cbcb2c8 from this PR, cs_main is locked during Commit() so a new block can't be connected.
The suggested diff would call Commit() again and potentially keep calling it until no new blocks were connected.

I think behaviors 2 or 3 both should be ok in practice, and 3 may be more appealing conceptually because ideally indexes should not hold cs_main and block validation code while writing their own data.

Another alternative could be to use approach 3 but only call Commit() exactly once at the end of the sync, skipping subsequent Commit() calls if more blocks are connected until the sync is completely finished and there is a ChainStateFlushed() notification. This would be similar to what happens with normal BlockConnected() notifications for new blocks.

I think any approach that fixes the bug is fine though, and the current approach does LGTM.

furszy · 2024-04-17T20:14:41Z

Little delayed but here again. I'm also okay with both options. I'm feeling a bit more inclined towards option 3, primarily because it allows us to explain what the code is doing more clearly.

Also, just to share some extra thoughts around this issue, another -bad- option to hold a minimal cs_main lock could be to keep the current flow in master and connect missing intermediate blocks inside BlockConnected only the first time it is called after setting m_synced. This does not require chain access nor any extra cs_main lock, as we could go backwards from the newly connected tip down to the index tip, reading and connecting the missing block(s).
The obvious downside of this option is that the index could finish the initial sync process missing one or two blocks, only updating its tip on the next BlockConnected signal, which could take a while. So, this would change the meaning of m_synced.

This ensures that the index does not miss any 'new block' signals occurring in-between reading the 'next block' and setting 'm_synced'. Because, if this were to happen, the ignored blocks would never be indexed, thus stalling the index forever.

ryanofsky

Code review ACK 65951e0

fjahr · 2024-04-18T20:39:08Z

src/index/base.cpp

@@ -160,12 +160,24 @@ void BaseIndex::Sync()
            }

            const CBlockIndex* pindex_next = WITH_LOCK(cs_main, return NextSyncBlock(pindex, m_chainstate->m_chain));
+            // If pindex_next is null, it means pindex is the chain tip, so
+            // commit data indexed so far.


nit: I had to read this three times to get it right, but feel free to ignore unless you have to retouch.

I would change it to [...], so commit the data that was indexed so far.

fjahr · 2024-04-18T20:41:10Z

Code review ACK 65951e0

achow101 · 2024-04-25T18:05:04Z

ACK 65951e0

ryanofsky · 2024-04-25T19:02:28Z

Note just to clarify issue history: The race condition fixed here was introduced by 0faafb5 from #28955, which stopped locking cs_main while setting m_synced = true. It could cause indexes to incorrectly discard block-connected notifications.

This race condition is different from the other race condition recently fixed in #29776, which is a much older bug going back to 4368384 from #14121, caused by incorrectly setting m_synced = true before the Commit() call, instead of after. That race condition could cause the sync and notification threads to both try to write index data at the same time, and possibly corrupt the state. That race condition was probably also more likely to happen after #28955, though, due to cs_main being released.

fanquake · 2024-04-26T03:30:28Z

Seems like there's at least something which could/should be backported here? Maybe only the other change, given the newly broken code has only existed in master?

ryanofsky · 2024-04-26T12:36:04Z

Yes in terms of backports, it could make sense to backport the #29776 fix to releases before #28955, since the race fixed by #29776 where two threads are trying to update index state at the same time at the end of a sync could happen before #28955, although it was probably much less likely to happen before #28955.

It would only make sense to backport this PR to releases after #28955, though, since this bug, which could cause blocks to be missing from the index, was introduced in #28955.

fanquake · 2024-04-26T13:07:57Z

I've pulled #29776 in for 27.x. It could go into 26.x as well cc @glozow.

This ensures that the index does not miss any 'new block' signals occurring in-between reading the 'next block' and setting 'm_synced'. Because, if this were to happen, the ignored blocks would never be indexed, thus stalling the index forever. Github-Pull: bitcoin#29867 Rebased-From: 65951e0

DrahtBot added the UTXO Db and Indexes label Apr 13, 2024

furszy mentioned this pull request Apr 14, 2024

test: intermittent issue in interface_rest.py - AssertionError: self.wait_until(lambda: self.nodes[0].getindexinfo() == expected_filter) #29863

Closed

maflcko mentioned this pull request Apr 15, 2024

index: block filters sync, reduce disk read operations by caching last header #28955

Merged

furszy force-pushed the 2024_index_fix_race branch from c395b26 to cbcb2c8 Compare April 16, 2024 10:04

ryanofsky approved these changes Apr 16, 2024

View reviewed changes

DrahtBot requested a review from ryanofsky April 16, 2024 15:28

DrahtBot mentioned this pull request Apr 17, 2024

index: blockfilter initial sync speedup, parallelize process #26966

Draft

furszy force-pushed the 2024_index_fix_race branch from cbcb2c8 to 65951e0 Compare April 17, 2024 20:24

ryanofsky approved these changes Apr 18, 2024

View reviewed changes

DrahtBot requested a review from Sjors April 18, 2024 12:28

fjahr reviewed Apr 18, 2024

View reviewed changes

luke-jr mentioned this pull request Apr 21, 2024

ThreadSanitizer: Fix #29767 #29776

Merged

achow101 merged commit 0e2e7d1 into bitcoin:master Apr 25, 2024
16 checks passed

furszy deleted the 2024_index_fix_race branch April 25, 2024 18:27

glozow mentioned this pull request Apr 29, 2024

[26.x] archive 26.1 release notes + backports #29899

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

index: race fix, lock cs_main while 'm_synced' is subject to change #29867

index: race fix, lock cs_main while 'm_synced' is subject to change #29867

furszy commented Apr 13, 2024 •

edited

DrahtBot commented Apr 13, 2024 •

edited

maflcko commented Apr 14, 2024

furszy commented Apr 14, 2024 •

edited

andrewtoth commented Apr 15, 2024 •

edited

furszy commented Apr 16, 2024 •

edited

andrewtoth commented Apr 16, 2024

Sjors commented Apr 16, 2024

ryanofsky left a comment •

edited

Sjors commented Apr 16, 2024 •

edited

andrewtoth commented Apr 16, 2024 •

edited

ryanofsky commented Apr 16, 2024 •

edited

furszy commented Apr 17, 2024 •

edited

ryanofsky left a comment

fjahr Apr 18, 2024

fjahr commented Apr 18, 2024

achow101 commented Apr 25, 2024

ryanofsky commented Apr 25, 2024 •

edited

fanquake commented Apr 26, 2024

ryanofsky commented Apr 26, 2024

fanquake commented Apr 26, 2024

index: race fix, lock cs_main while 'm_synced' is subject to change #29867

index: race fix, lock cs_main while 'm_synced' is subject to change #29867

Conversation

furszy commented Apr 13, 2024 • edited

DrahtBot commented Apr 13, 2024 • edited

Code Coverage

Reviews

Conflicts

maflcko commented Apr 14, 2024

furszy commented Apr 14, 2024 • edited

andrewtoth commented Apr 15, 2024 • edited

furszy commented Apr 16, 2024 • edited

andrewtoth commented Apr 16, 2024

Sjors commented Apr 16, 2024

ryanofsky left a comment • edited

Choose a reason for hiding this comment

Sjors commented Apr 16, 2024 • edited

andrewtoth commented Apr 16, 2024 • edited

ryanofsky commented Apr 16, 2024 • edited

furszy commented Apr 17, 2024 • edited

ryanofsky left a comment

Choose a reason for hiding this comment

fjahr Apr 18, 2024

Choose a reason for hiding this comment

fjahr commented Apr 18, 2024

achow101 commented Apr 25, 2024

ryanofsky commented Apr 25, 2024 • edited

fanquake commented Apr 26, 2024

ryanofsky commented Apr 26, 2024

fanquake commented Apr 26, 2024

furszy commented Apr 13, 2024 •

edited

DrahtBot commented Apr 13, 2024 •

edited

furszy commented Apr 14, 2024 •

edited

andrewtoth commented Apr 15, 2024 •

edited

furszy commented Apr 16, 2024 •

edited

ryanofsky left a comment •

edited

Sjors commented Apr 16, 2024 •

edited

andrewtoth commented Apr 16, 2024 •

edited

ryanofsky commented Apr 16, 2024 •

edited

furszy commented Apr 17, 2024 •

edited

ryanofsky commented Apr 25, 2024 •

edited