Improve runtime performance of --reindex #16981

Open · LarryRuane wants to merge 1 commit into master from LarryRuane:reindex-speedup

Conversation


LarryRuane commented Sep 29, 2019

Reduce the real time required to reindex by about 48 minutes by modifying LoadExternalBlockFile() to take advantage of the caching inherent in the CBufferedFile blkdat.

@LarryRuane LarryRuane changed the title improve runtime performance of --reindex [WIP] improve runtime performance of --reindex Sep 29, 2019
LarryRuane commented Sep 29, 2019

Running bitcoind --reindex causes the client to reread the .bitcoin/blocks/blknnnnn.dat files and re-derive all the index and chainstate information. It's similar to IBD, except that the client uses these local files rather than downloading blocks from the network. As LoadExternalBlockFile() reads each block from disk, it checks whether the block's parent has already been seen (and processed). If so, the block can be processed immediately (by giving it to AcceptBlock()). If the block's parent hasn't yet been seen, the block cannot be processed until its parent is; its prev-hash (the parent's hash) is put into a map of "pending" blocks called mapBlocksUnknownParent. Later, when the parent is encountered and processed, this map is consulted to see whether it has an already-seen child that can now be processed. If so, LoadExternalBlockFile() has to go back, re-read the child block from disk (ReadBlockFromDisk()), deserialize it again, and then process it (pass it to AcceptBlock()).
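
To make that flow concrete, here is a minimal self-contained C++ sketch of the pre-PR logic described above. This is a model, not the actual validation.cpp code: all types and helpers are stand-ins, and only mapBlocksUnknownParent, AcceptBlock(), and ReadBlockFromDisk() echo the real names.

    #include <cstdint>
    #include <map>
    #include <set>

    using Hash = uint64_t;  // stand-in for a 256-bit block hash
    struct FlatFilePos { int nFile = 0; uint32_t nPos = 0; };  // disk location
    struct Block { Hash hash = 0; Hash hashPrevBlock = 0; };

    std::set<Hash> g_seen;  // blocks already processed
    std::multimap<Hash, FlatFilePos> mapBlocksUnknownParent;  // parent hash -> child's disk position

    Block ReadBlockFromDisk(const FlatFilePos&) { return {}; }  // stub: second disk read + deserialization
    void AcceptBlock(const Block& b) { g_seen.insert(b.hash); }

    // Called for each block as it is read sequentially from a blk file.
    void HandleBlockFromFile(const Block& block, const FlatFilePos& pos)
    {
        if (!g_seen.count(block.hashPrevBlock)) {
            // Parent not seen yet: remember where this block lives on disk.
            mapBlocksUnknownParent.emplace(block.hashPrevBlock, pos);
            return;
        }
        AcceptBlock(block);
        // This block may be the missing parent of earlier out-of-order
        // blocks; each such child must be re-read from disk and
        // re-deserialized (recursion into grandchildren omitted).
        auto range = mapBlocksUnknownParent.equal_range(block.hash);
        for (auto it = range.first; it != range.second; ++it) {
            AcceptBlock(ReadBlockFromDisk(it->second));
        }
        mapBlocksUnknownParent.erase(range.first, range.second);
    }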

Performance isn't very good because about 90% of blocks are read before their parents are read. (I think this is a result of the headers-first IBD performance enhancement of a few years ago.) This can be seen by running bitcoind --reindex --debug=reindex, causing many messages like:

2019-09-29T00:29:00Z LoadExternalBlockFile: Out of order block 00000000a2268fb2d4ddc0408fae961a96668db58de415d3c9e0a7694eeb1657, parent 000000005fb53da5c3cda78aef28cbc44fde3e435f1dc2ac04df53387049171f not known

then, a short time later:

2019-09-29T00:29:00Z LoadExternalBlockFile: Processing out of order child 00000000a2268fb2d4ddc0408fae961a96668db58de415d3c9e0a7694eeb1657 of 000000005fb53da5c3cda78aef28cbc44fde3e435f1dc2ac04df53387049171f

So about 90% of blocks end up being read from disk twice (not literally twice from the physical disk, since the operating system's buffer cache likely still holds the data, but there is still a memory-to-memory copy) and deserialized twice.

This PR mitigates this problem by taking advantage of the fact that recent blocks are still in the buffer (blkdat) that is used to read from the blk files. Since rewinding to an earlier offset in the buffer has been fixed by the recently merged #16577, we can see if the child block is still in memory (the probability is increased by making the buffer somewhat larger), and if so, avoid having to read the child block data from disk. It turns out that the child block is still in the buffer 60% of the time. This depends on some randomness that's introduced during IBD, so it may be different for you. (My IBD was done using all the default settings, nothing special.)

The second part of this change initially deserializes only the block header, rather than the entire block, since about 90% of the time we're going to have to process this block later; only the hashPrevBlock field of the header is needed to determine whether we can process this block immediately (that is, whether this block's parent has been seen). This does mean that 10% of the time we deserialize the header and then immediately also the full block (where it would have been more efficient to simply deserialize the full block), but overall this is a big performance win.
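
As a sketch in code (a simplified fragment, not the exact patch): blkdat, nBlockPos, nSize, dbp, and mapBlocksUnknownParent are the names that appear in the diffs quoted later in this thread, and LookupBlockIndex() stands in for the "have we seen the parent?" check.

    CBlockHeader header;
    blkdat >> header;  // deserialize only the 80-byte header
    if (!LookupBlockIndex(header.hashPrevBlock)) {
        // ~90% case: parent unknown. Remember this block's disk position
        // and skip over its body without deserializing it.
        mapBlocksUnknownParent.insert(std::make_pair(header.hashPrevBlock, *dbp));
        blkdat.Skip(nBlockPos + nSize - blkdat.GetPos());
    } else {
        // ~10% case: parent known. Rewind to the start of the block and
        // deserialize all of it (header included, a second time).
        blkdat.SetPos(nBlockPos);
        CBlock block;
        blkdat >> block;
        // ... pass the block to AcceptBlock() as before ...
    }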

On my system, this PR reduces the time to --reindex on mainnet by 48 minutes. I encourage reviewers to attempt to replicate my findings. Reindexing has two parts, reading the blocks/blknnnnn.dat files (which this PR improves), and then connecting the blocks to the blockchain (the UpdateTip messages), which this PR doesn't affect. For me, this PR reduces the first part from 123 to 75 minutes. After you start seeing UpdateTip messages, you can stop the client, and you will have the time measurement you need from debug.log.

blkdat.SetLimit(pos.nPos + nSize);
blkdat >> *pblockrecursive;
LogPrint(BCLog::REINDEX, "%s: Processing in-memory out of order child %s of %s\n", __func__, pblockrecursive->GetHash().ToString(),
head.ToString());

LarryRuane (Author, Contributor) Sep 29, 2019

This LogPrint probably should be combined with the (existing) one a few lines below before merging, but I wanted reviewers to be aware of when the new code is being run.


LarryRuane (Author, Contributor) Sep 29, 2019

I don't know how to write a unit test for this PR (there are no existing unit tests for any of this area of the code), but test/functional/wallet_basic.py does run through the new code -- you can add an assert(false) here and that test fails, so that's at least something.

@LarryRuane LarryRuane changed the title [WIP] improve runtime performance of --reindex Improve runtime performance of --reindex Sep 29, 2019
@LarryRuane LarryRuane marked this pull request as ready for review Sep 29, 2019
promag (Member) left a comment

Concept ACK.

Do you think it's possible to split this into 2 commits?

  • first commit only refactors CBufferedFile - no behavior change
  • second improves LoadExternalBlockFile.
@LarryRuane LarryRuane force-pushed the LarryRuane:reindex-speedup branch from 48c0cc7 to 6d8644d Sep 30, 2019
LarryRuane commented Sep 30, 2019

Thanks for the suggestion, @promag, done (force-pushed).

@LarryRuane LarryRuane force-pushed the LarryRuane:reindex-speedup branch from 6d8644d to c598893 Sep 30, 2019
LarryRuane commented Sep 30, 2019

I forgot to update the CBufferedFile unit tests, just did that and another force-push.

@LarryRuane LarryRuane changed the title Improve runtime performance of --reindex [WIP] Improve runtime performance of --reindex Oct 1, 2019
LarryRuane commented Oct 1, 2019

I changed this PR to WIP because I thought of a further improvement, which I'll try to push by the end of the week. It will build on the existing two commits, so reviewing those wouldn't be a waste of time. Thanks in advance for any reviews and suggestions.

DrahtBot commented Oct 3, 2019

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

No conflicts as of last run.

@LarryRuane LarryRuane force-pushed the LarryRuane:reindex-speedup branch from c598893 to 7af0587 Oct 9, 2019
LarryRuane (Author, Contributor) left a comment

The main change is that the CBufferedFile memory buffer object can now span multiple files. It's very common that a block is read from disk before its parent, and that parent is found in a later blocks/blk00nnn.dat file. In that case, it's usually possible to find the earlier (child) block in the memory buffer, obviating the need to read it again from disk.

@@ -668,6 +668,7 @@ static void ThreadImport(std::vector<fs::path> vImportFiles)

// -reindex
if (fReindex) {
CBufferedFile blkdat(160*MAX_BLOCK_SERIALIZED_SIZE, 140*MAX_BLOCK_SERIALIZED_SIZE, SER_DISK, CLIENT_VERSION);

LarryRuane (Author, Contributor) Oct 9, 2019

This is a much larger buffer (640 MB), but it only exists during the first phase of reindexing (when you see Reindexing block... in the debug log); after that, bitcoind memory usage increases for other reasons anyway. During the first phase, memory (VSZ in the ps output) is about 2.4 GB on my system; afterwards (when it's logging UpdateTip: new best...), it's about 3.5 GB. The benefit of a large buffer is that a larger fraction of the out-of-order blocks (those that must be processed later, when their parents are seen) can be found in memory, rather than having to be read again from disk. The "hit ratio" with this buffer size is just under 99%. The difference between 160 (the overall buffer size, in units of maximum block size) and 140 (the rewind limit) determines the size of disk reads (the larger the difference, the larger and more efficient the reads).
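
Spelling out the arithmetic behind those two constructor arguments (MAX_BLOCK_SERIALIZED_SIZE is 4,000,000 bytes; the reading of the 160/140 split follows the comment above):

    // total buffer:  160 * 4,000,000 B = 640 MB
    // rewind window: 140 * 4,000,000 B = 560 MB -- how far back an
    //                out-of-order child can still be found in memory
    // refill size:   (160 - 140) * 4,000,000 B = 80 MB -- how much fresh
    //                data each disk read can bring in without overwriting
    //                the rewind window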

bool Seek(uint64_t nPos) {
long nLongPos = nPos;
if (nPos != (uint64_t)nLongPos)
return false;
if (fseek(src, nLongPos, SEEK_SET))
return false;
nLongPos = ftell(src);
nSrcPos = nLongPos;
nReadPos = nLongPos;
return true;
}

LarryRuane (Author, Contributor) Oct 9, 2019

This method is unused, and really doesn't make sense: the buffer contents should match the file (or, now, files) contents, and this messes that up. If it's needed in the future, it can be re-thought.

LogPrintf("Loaded %i blocks from external file in %dms\n", nLoaded, GetTimeMillis() - nStart);
LogPrintf("Loaded %i blocks from external file in %dms pim %i pom %i\n",
nLoaded, GetTimeMillis() - nStart, nProcessedInMem, nProcessedOutMem);

LarryRuane (Author, Contributor) Oct 9, 2019

These pim (out-of-order block found in memory) and pom (out-of-order block not in memory) counters let reviewers see the effectiveness of the read caching during testing; they can be removed before merging. Here's a typical instance:

2019-10-09T15:28:17Z Loaded 839 blocks from external file in 11729ms pim 726 pom 95

This means 821 (726+95) blocks were processed out-of-order, so 18 (839-821) blocks were processed in-order (their parents had already been seen). This is typical, and shows that almost all blocks are processed out-of-order. With this PR, 726 of these out-of-order blocks were recent enough that they were still in the buffer, and 95 had already been pushed out and had to be read again from disk. Overall, 99% of the out-of-order blocks are still in the buffer.

mzumsande (Contributor) Oct 17, 2019

I think the pim/pom statistics are useful, but maybe too special for the main log. You could use a separate BCLog::REINDEX message for that information.

@LarryRuane LarryRuane changed the title [WIP] Improve runtime performance of --reindex Improve runtime performance of --reindex Oct 9, 2019
@LarryRuane LarryRuane force-pushed the LarryRuane:reindex-speedup branch from dc256d8 to 14cb27d Oct 9, 2019
@jonatack jonatack mentioned this pull request Oct 10, 2019
LarryRuane commented Oct 11, 2019

Testnet performance isn't important, but as a sanity check, I verified that the new code runs faster there too on my laptop. The time between Reindexing block file blk00001.dat... and Reindexing finished reduced from 10m21s to 7m13s.

mzumsande (Contributor) left a comment

This looks very interesting, will review the code more closely soon.
I am concept ACK for the first iteration, but would like to understand better if the additional benefits of the second iteration (multiple-files) outweigh the costs (memory):

  • Do you have data from mainnet for the multiple-file buffer implementation? I'm interested in whether there are diminishing returns or another huge speedup compared to the first revision of this PR.
  • While other objects that use a lot of memory can be reduced (see reduce-memory.md) in order to run a full node in low-memory environments, the large size of CBufferedFile is hardcoded, which does not seem ideal. The minimum requirements listed at https://bitcoin.org/en/bitcoin-core/features/requirements are also considerably lower than the 2.4 GB used here.
LarryRuane commented Oct 13, 2019

@mzumsande, those are very good thoughts, thank you. I will study this further (and invite others to do the same). One nice thing about this design is that any amount of memory will work, in that any amount is an improvement -- it is one of those classic space-time tradeoffs. The idea of not giving the buffer a hard-coded size is intriguing and promising. I think we would not go back to the pre-second-iteration (single-file) design, because even with the buffer reduced to the original size, the multi-file version is an improvement. (That is to say, there is no downside to the latest version here.) That's because all the multi-file commit does is avoid discarding cached data when switching files.

One thing I hope I made clear enough in an earlier comment is that I tried to keep the memory usage during reindexing strictly below what is needed during normal operation, so that there would be no overall increase in memory demand. We certainly would not want this design to create a spike (high-water mark) in memory usage. But I was very likely mistaken, since I was not aware that memory could be "dialed down" for small environments.

@LarryRuane LarryRuane force-pushed the LarryRuane:reindex-speedup branch from 14cb27d to ed9b569 Oct 15, 2019
LarryRuane commented Oct 15, 2019

I force-pushed a reorganization of the commits to make reviewing easier. I split the largest commit, speed up reindex by reading out-of-order blocks from memory, into two separate commits:

  1. Commit message: initially deserialize only CBlockHeader, not CBlock. Deserialize only the block header (just 80 bytes) initially, instead of the full block. The block header is all that's needed to tell us if this block's parent has been seen yet; if not, the block will have to be deserialized later, after its parent is found. Nearly all blocks (89% on my system) are seen before their parents, so this is a big improvement.
  2. Commit message: speed up reindex by reading out-of-order blocks from memory. If we've just processed a block that's the parent of an earlier-seen block, we must now process that earlier block. Instead of always reading that block from disk, see if it happens to still be in the circular disk-read memory buffer (CBufferedFile blkdat). This saves a disk read.

We could merge the first commit without the second. The first commit is pretty simple; the second commit is a little more involved.

These two commits leave the buffer size unchanged from master, 8 MB (2*MAX_BLOCK_SERIALIZED_SIZE), so there is no change in memory usage. But with this buffer size, only 29% of the out-of-order blocks are found in the buffer. If we increase the buffer moderately, to 32 MB (an earlier commit, since erased by force-push, used 640 MB), the "hit rate" increases to 50%. This is probably a good space-time balance.

Here are the block-load times on my system (with all default settings), which is just the first part of reindexing (each row builds on the previous row):

version                            real time (minutes)
master                             130
deserialize only header            90
look for blocks in buffer, 8 MB    84
look for blocks in buffer, 32 MB   78
look for blocks in buffer, 640 MB  72

(Take with a grain of salt, because the times varied even on identical runs, and a lot probably depends on the specifics of my system.) Keep in mind that this buffer exists only during the block-load part of reindex.

mzumsande (Contributor) left a comment

Concept ACK.
Thanks for the statistics, I agree that 32MB is a reasonable size. 

@@ -762,12 +763,12 @@ class CBufferedFile
}

public:
CBufferedFile(FILE *fileIn, uint64_t nBufSize, uint64_t nRewindIn, int nTypeIn, int nVersionIn) :
nType(nTypeIn), nVersion(nVersionIn), nSrcPos(0), nReadPos(0), nReadLimit(std::numeric_limits<uint64_t>::max()), nRewind(nRewindIn), vchBuf(nBufSize, 0)
CBufferedFile(uint64_t nBufSize, uint64_t nRewindIn, int nTypeIn, int nVersionIn) :

mzumsande (Contributor) Oct 17, 2019

Is it necessary to remove the FILE from the constructor? Couldn't you just initialize it with nullptr in the multi-file case and call SetFile() later, so that there need to be no syntax changes for the single-file case? It is also easy to create a CBufferedFile object and forget about SetFile(), leading to a memory access violation.

LarryRuane (Author, Contributor) Oct 20, 2019

@mzumsande, good idea, I restored that argument to the constructor.

@@ -689,11 +691,13 @@ static void ThreadImport(std::vector<fs::path> vImportFiles)
// hardcoded $DATADIR/bootstrap.dat
fs::path pathBootstrap = GetDataDir() / "bootstrap.dat";
if (fs::exists(pathBootstrap)) {
CBufferedFile blkdat(2*MAX_BLOCK_SERIALIZED_SIZE, MAX_BLOCK_SERIALIZED_SIZE+8, SER_DISK, CLIENT_VERSION);

mzumsande (Contributor) Oct 17, 2019

nit: This could be moved into the if block (same for -loadblock below)

LogPrintf("Loaded %i blocks from external file in %dms\n", nLoaded, GetTimeMillis() - nStart);
LogPrintf("Loaded %i blocks from external file in %dms pim %i pom %i\n",
nLoaded, GetTimeMillis() - nStart, nProcessedInMem, nProcessedOutMem);

This comment has been minimized.

Copy link
@mzumsande

mzumsande Oct 17, 2019

Contributor

I think the pim/pom statistics is useful, but maybe too special for the main log. You could use a separate BCLog::REINDEX message for that information.

MarcoFalke commented Oct 18, 2019

Under what circumstances is a -reindex needed these days? Only when disabling -prune? Though, then you'll download from the network anyway.

leishman commented Oct 21, 2019

@MarcoFalke Building new indices is required if upgrading from pre-0.17 to post-0.17 (https://bitcoin.org/en/release/v0.17.0) with a txindex.

If your node has a txindex, the txindex db will be migrated the first time you run 0.17.0 or newer, which may take up to a few hours. Your node will not be functional until this migration completes.

I haven't looked into whether this PR speeds up that process, but it is something to consider.

@LarryRuane LarryRuane force-pushed the LarryRuane:reindex-speedup branch from 869aed0 to 856be70 Oct 30, 2019
LarryRuane commented Oct 30, 2019

Rebased. I think running --reindex is also good when, for whatever reason, you don't trust your index files; for example, you think files may be missing or may have been corrupted. It's true that in that case you could just re-IBD, but reindexing is faster and less disruptive to peers.

I wasn't sure how much variation there might be in the runtimes I posted here earlier, so to get a few more data points, I added assert(0) immediately after printing Reindexing finished, and ran bitcoind --reindex several times in a shell loop, timing each run (using the standard time bash builtin). Here are the real times without this PR (which I reported as 130 minutes in the earlier comment):

real	131m52.120s
real	130m55.904s
real	133m20.911s
real	134m37.052s
real	146m17.564s
real	135m17.301s

and here are the times for runs with this PR (I ran this a few more times); this is what I reported as 84 minutes above:

real	79m55.050s
real	92m30.341s
real	82m44.119s
real	80m50.067s
real	89m31.793s
real	87m36.802s
real	97m42.247s
real	86m31.999s
real	87m13.304s
real	82m21.497s
@DrahtBot DrahtBot added Needs rebase and removed Needs rebase labels Oct 30, 2019
@LarryRuane LarryRuane force-pushed the LarryRuane:reindex-speedup branch from 856be70 to dc6964c Nov 7, 2019
LarryRuane commented Nov 7, 2019

Rebased to fix merge conflicts, force-pushed.

@DrahtBot DrahtBot removed the Needs rebase label Nov 7, 2019
@LarryRuane LarryRuane force-pushed the LarryRuane:reindex-speedup branch from dc6964c to 8ac8d16 Nov 21, 2019
LarryRuane commented Nov 21, 2019

Rebased, force-pushed.

LarryRuane commented Feb 21, 2020

Reviewers, please take a fresh look at the latest force-push; this PR is much better than before. I'll explain from scratch, so it's not necessary to read any of the previous comments in this PR.

The previous version of this PR included two separate changes (both affecting -reindex):

  • initially deserializing only the header instead of the entire block
  • when we find a block whose child has already been read, attempt to "rewind" in the stream memory buffer to access and process the child (instead of always reading it again from disk)

These should always have been separate PRs. This PR now contains only the first change, which is the more effective and more localized of the two. I'll make a separate PR for the second change if it turns out to be a significant improvement. This PR no longer has any effect on memory usage, which reviewers had concerns about.

To go over some basics again, for background: During -reindex, blocks are read from disk (the blocks/blk00???.dat files), deserialized, and processed. But a block can't be processed if its parent hasn't yet been seen (we know this only after deserializing the block). In that case, its disk location is saved in a map and we go on to the next block.

Whenever we do successfully process a block (we've got its parent), we check to see if its child has already been seen (is present in the map, looked up by block hash), and, if so, we read the child block from disk (again), deserialize it, and process it (and remove it from the map).

It turns out that, due to the "headers first" initial block download (IBD) algorithm, over 92% of blocks (on my system) are out of order, in the sense that 92% of the time when we first read a block, we haven't yet seen its parent. You can see these by specifying the -debug=reindex option (see the second comment in this PR).

What this PR does -- and now all it does -- is to initially deserialize only the header (80 bytes), not the entire block. From the header, we can tell if this block's parent has already been seen (by its prev-hash). If not (92% of the time), we skip over the rest of the block. So the savings are to not deserialize the entire block in the most common case where that's unnecessary.

It's true that if all blocks were in order, this would be slightly slower because we'd deserialize the header, discover that the block's parent is already known, then deserialize the full block. (There's no easy way to deserialize the rest of the block.)

To test this change, you may want to apply this patch:

--- a/src/init.cpp
+++ b/src/init.cpp
@@ -692,6 +692,7 @@ static void ThreadImport(std::vector<fs::path> vImportFiles)
         pblocktree->WriteReindexing(false);
         fReindex = false;
         LogPrintf("Reindexing finished\n");
+        StartShutdown();
         // To avoid ending up in a situation without genesis block, re-try initializing (no-op if reindexing worked):
         LoadGenesisBlock(chainparams);
     }

Then you can time the reindexing in isolation (without having to look at timestamps in the log file):

$ time src/bitcoind --reindex --maxconnections=0

Setting maxconnections to zero makes the timing less variable, since otherwise you can get very different peers on each run.

When I run this, I consistently get 83 minutes with this PR's branch, and 125 minutes without, an improvement of 42 minutes.

LarryRuane commented Mar 3, 2020

Just to clarify, the last bullet point in the review club writeup:

Before this PR, we would always read entire blocks from disk into our buffer. If the predecessor of a block was not available, the blocks would have to be read again at a later point. This PR changes behavior such that initially only the 80 byte block header is read into the buffer, and if we can’t accept the block yet, the rest of the block is skipped for now

is slightly incorrect. Even with this PR, we always read entire blocks from disk (most of them twice); the disk read behavior before and after this PR is exactly the same. The difference is that with this PR, only the header is (initially) deserialized, rather than the entire block, which saves CPU.

The CBufferedFile object does somewhat large reads, around 4 MB (MAX_BLOCK_SERIALIZED_SIZE), each of which often includes multiple entire blocks and a partial block (because we don't know where the boundaries are; we don't know how much to read to get exactly a block or blocks). Recent blocks are close to full, so each disk read probably reads in only one or two blocks (plus a partial block).

But now I'm wondering if this PR should work as described in the review club comment. We could read only the 80-byte header from disk, and then, if we've seen its parent, read the rest of the block. Two disk reads are less efficient than one large read (a disk read has a fixed overhead and a length-proportional overhead), but that happens only about 8% of the time. If we haven't seen this block's parent (probability 92%), we could seek ahead in the file, instead of reading it into memory. But as explained earlier, there could be more than one block per large read, so we may end up exchanging one large read for 2 or 3 80-byte reads -- hard to say which is better. I'd recommend leaving this PR as is, because it's an (almost) unambiguous improvement on the status quo, and changing it to do 80-byte reads and seeks is more complex than this PR is currently. (The reason I say almost is because, if the blocks are already sorted, this ends up deserializing the header one extra time.)

I'm just now thinking, we could make reindex (part 1) screaming fast by storing the headers in separate "parallel" files in the blocks/ directory: blk01234.dat would be the same as now, but there would be a hdr01234.dat (similar to the way there's a rev01234.dat today) containing that file's blocks' headers in the same order. Actually, that file would be a sequence of (header, file-offset, size) tuples, where the file-offset is the offset of the corresponding block within the same-named blk file. That's only about 100 MB of additional disk space (about 100 bytes times about 1M blocks, far less than 1% of the size of the blocks/ directory).

Then -reindex could do 4 MB reads, same as today except reading the hdr01234.dat files, bringing in 50k headers in a single read -- super-efficient. Then it would loop through each of those, and if the block can be accepted now (its parent has been seen, the 8% case), seek and read from the blk01234.dat file (we would know the exact seek offset and size); that part would be slower than today. But the 92% case would be much faster because we wouldn't be reading entire blocks into memory unnecessarily. This would be a significant reduction in the number of disk reads. (Definitely this would be a separate PR.) Opinions?
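
For concreteness, the per-block record in such a hdr file might look like the following (purely hypothetical; the layout is illustrative, not a worked-out on-disk format):

    // One fixed-size record per block, stored in hdr01234.dat in the same
    // order as the blocks in the same-named blk01234.dat file.
    struct HeaderRecord {
        CBlockHeader header; // 80 bytes serialized: enough to check hashPrevBlock
        uint32_t nOffset;    // byte offset of the full block within blk01234.dat
        uint32_t nSize;      // serialized size of the block, for the follow-up read
    };                       // 88 bytes, in line with the ~100 bytes/block estimate above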

kallewoof (Member) left a comment

Concept and code review ACK a50f956

BOOST_AUTO_TEST_CASE(streams_buffered_file_skip)
{
FILE* file = fsbridge::fopen("streams_test_tmp", "w+b");
// The value at each offset is the offset.

kallewoof (Member) Mar 4, 2020

I only understood this comment after I got to L#359.

Suggested change
// The value at each offset is the offset.
// The value at each offset is the byte offset (e.g. byte 1 in the file has the value 0x01, etc).
} catch (const std::exception& e) {
BOOST_CHECK(strstr(e.what(),
"Read attempted past buffer limit") != nullptr);
}

kallewoof (Member) Mar 4, 2020

You could do

BOOST_CHECK_THROW(bf.Skip(2), std::exception);

LarryRuane (Author, Contributor) Mar 5, 2020

This kind of check occurs four places in this file (and was being done before this PR, for example line 256). I like the conciseness of your suggestion, but it loses the check of the exception string. (The test wouldn't detect if a completely different exception occurred.) Another advantage of checking the exception string is that it documents why the exception is thrown (could be in a comment, but code is better).

It would be nice if there was a variant of BOOST_CHECK_THROW that took a string argument. What do you think about implementing this in this file, or in a more general place:

#define BOOST_CHECK_THROW_WHAT(statement, exception, str) \
    do { \
        try { \
            statement; \
            BOOST_CHECK(false); \
        } catch (const exception& e) { \
            BOOST_CHECK(strstr(e.what(), str) != nullptr); \
        } \
    } while (false)

and then replacing the try/catch (here and in 3 other places) with:

    BOOST_CHECK_THROW_WHAT(bf.Skip(2),
        std::exception, "Read attempted past buffer limit");

(I tested this.) It's weird to name this macro BOOST_* because it's not part of the Boost library, but what it does is so close to BOOST_CHECK_THROW() that it may make sense. If the casual reader thinks this is part of Boost, is that so bad?

kallewoof commented Mar 4, 2020

Btw https://bitcoincore.reviews/16981.html notes that this fixes a bug as well. I would really like if this bug-fix was its own separate commit (followed by the commit doing the actual runtime performance improvement).

mzumsande commented Mar 4, 2020

This seems to be a misunderstanding. I don't think that this PR fixes a bug; the sentence in question referred to the older PR #16577, which is already merged.

kallewoof commented Mar 4, 2020

OH. Yeah, that's what it says. I wasn't reading carefully enough.

emilengler (Contributor) left a comment

Concept ACK
Thanks! I like this more low-level oriented approach.
I will test this a bit once my chain has fully synced.
The code looks good to me so far, but I will wait until today's PR review club meeting before giving more feedback.

nRewind = blkdat.GetPos();

uint256 hash = block.GetHash();
blkdat.SetLimit(nBlockPos + 80);

emilengler (Contributor) Mar 4, 2020

nit: Could this 80 become a const? The block header might get bigger or smaller in future and this 80 could get missed.

sipa (Member) Mar 4, 2020

I think this call is unnecessary. It's there in the old code to make sure there is consistency between the block length in the file, and the actually serialized data. Headers are always just 80 bytes, though, so this check isn't necessary here.

LarryRuane (Author, Contributor) Mar 5, 2020

I think this call is unnecessary.

Thanks, I'll remove it. (@emilengler's comment is no longer relevant.) It's similar to the way we don't call SetLimit() before deserializing into the 4-byte nSize just above.


mapBlocksUnknownParent.insert(std::make_pair(header.hashPrevBlock, *dbp));

// Position to the start of the next block.
nRewind = nBlockPos + nSize;

sipa (Member) Mar 4, 2020

I think this is incorrect. If the size field of the block was corrupted this would cause us to potentially jump into nowhere. Do things still work (and still give a performance benefit...) if you change this to nBlockPos + 1?

ajtowns (Contributor) Mar 5, 2020

I think if nSize as read from disk is wrong, then

(a) if it's smaller than the actual size of the block, blkdat >> block will currently throw which will spit out a Deserialize or I/O error and start trying again at the end of the nSize bytes -- so this behaviour should be near enough to the same, and this is fine

and (b) if it's larger than the actual size of the block, the nSize will be effectively ignored, and we'll pick up at a different, possibly incorrect point, and try to find the next chainparams.MessageStart() marker, so we could miss up to, I guess, 8MB worth of block data (4MB due to a bad nSize which is just skipped over, then just under 4MB if the next block was as large as possible and we just missed its start which is read byte-by-byte) due to 4 bytes of corrupted size data, rather than just coping with it. This is worse, but doesn't seem terrible -- if the four bytes that were corrupted were the last bytes of one block and the first bytes of another you'd have lost up to 8MB of data that way too, without any more precise targeting.

LarryRuane (Author, Contributor) Mar 5, 2020

These comments create a good opportunity for me to ask some questions about the existing design while doing this PR, hope that's okay. Please don't feel an obligation to answer these.

  • What's the purpose of the MessageStart() marker? It seems to allow skipping over arbitrary junk that follows a block before the start of the next [marker-nSize-block] sequence of bytes. What could cause that? In other projects, I've seen these markers used when a file gets overwritten (such as a journal file that implements a circular log), because after restart, the previous write offset is unknown. Are these blk files overwritten?
  • Are we trying to be robust against arbitrary file-overwrite corruption? If the file contains 100 blocks, and data corruption overwrote part of the second block, then this design (searching for the marker) will find the start of the third block, pick up from there, and correctly parse the remaining 97 blocks. This sounds good, but the second block's child (or children) will never be "accepted" (have AcceptBlock() called on it), recursively (and not just for this file but for all remaining files), so we'll end up with a very truncated blockchain. What happens then? I think we may download blocks from peers (IBD) starting from where the corruption occurred. That's great if true (why don't I actually try this?), although a minor problem is that the many entries in mapBlocksUnknownParent are never released, since that map is static -- which is kind of weird. But since these (following) blocks are never accepted, the rest of this part 1 of reindex is wasted effort. It may be better to just abort the rest of part 1 of reindexing if we encounter any file corruption.

If the size field of the block was corrupted

It's true that this PR increases dependence on this field, but if we're assuming the size can be corrupted, I'm not sure what benefit the field provides, either with or without this PR. Yes, setting nRewind to nBlockPos+1 would work (I haven't tried it, but I could); it would be significantly slower, though, because we'd have to scan for the next marker (a megabyte or two away) one byte at a time.

It seems to me that if the size field is corrupted, all bets are off, the train's going off the tracks in a hurry.

ajtowns (Contributor) Mar 6, 2020

The main point of -reindex now is recovery after on-disk corruption, so I think it makes sense to be as robust to that as we can reasonably be?

As well as just random disk corruption, I think you can get problems if (a) you're downloading blocks and your power goes out before the data gets synced; I'd expect you might end up with something like "...[marker][nSize=1MB][250kB of block data][64kB of zeroes]" in that case; at which point when you restart you might start writing the next block straight after the zeroes. When running reindex currently, I think your parsing would fail during the 64kB of zeroes, and you'd then scan to the next marker and continue. If so, I guess that means that with this patch, if you ran -reindex you'd not see that new block and have to redownload it. That's all speculation though, so don't take my word for it.

Hmm, as an alternative approach, what if you just still read/parse the entire block off the disk, but if it's not one you can immediately accept, you store it into a map for later acceptance? Limit the map to 32MB worth of data, or maybe somehow tie it in to the mempool limit? Keeps the same logic, but reduces the amount of double-reading and double-parsing needed.

LarryRuane (Author, Contributor) Mar 7, 2020

I did an experiment yesterday in which I simulated corruption in one of the blk??.dat files, and as I suspected, all the blocks beyond the corrupted block were never accepted (because their parents were missing). But the node then seemed to get all the missed blocks from peers, and in the end all seemed good.

Anthony, your suggestion is similar to what this PR originally did before I reduced its scope. You can see the discussion above, but to summarize here, the idea was to increase the size of the circular memory buffer (CBufferedFile object) so it could act as a cache, so that we could sometimes access blocks read earlier without having to re-read from disk; the concern was about increasing memory requirements. I think something along the lines I proposed or what you're proposing is likely to help further (especially tying it to mempool limit; that's a good idea), but I suggest keeping this PR limited (no change in memory, a definite improvement) if you don't mind.

LarryRuane (Author, Contributor) Mar 7, 2020

On the topic of the size being corrupted: if the size field is set to a uniformly distributed random value (or zero, probably the most likely), that is caught (see the check nSize < 80 || nSize > MAX_BLOCK_SERIALIZED_SIZE above). Only about 4 million of the 2^32 possible values pass that check, so the probability that a corrupted value goes undetected is less than 0.1%; most data corruption there will be detected.

// Position to the start of the next block.
nRewind = nBlockPos + nSize;
blkdat.SetLimit(nRewind);
blkdat.Skip(nRewind - blkdat.GetPos());

narula (Contributor) Mar 4, 2020

Skip() calls read(), which means (IIUC) that this change makes it so we are now doing a disk read while holding cs_main. Given this is reindexing this might not matter but worth noting. Is this a problem?

fjahr (Contributor) Mar 4, 2020

Also, I think this part could be more concise; adding the position and the size only to subtract the position just below seems a bit redundant.

LarryRuane (Author, Contributor) Mar 5, 2020

Skip() calls read()

Good catch, and it's actually even worse, because if we process the block immediately (don't take the Skip() path), then we deserialize the entire block (blkdat >> block, just below), and that calls read() which can read the disk. But that's the less common path (only about 8%).

As someone said at review club, the node isn't fully up and running during reindexing, so there probably isn't much contention for this lock. But I think we can avoid this problem; I'll push a fix if it doesn't complicate the code too much.

this part could be more concise; adding the position and the size only to subtract the position just below seems a bit redundant

I don't think it's redundant: note that nBlockPos is the start of the current block (whose header we just read), while blkdat.GetPos() returns the current position, which is just past the header. We don't want to skip nSize bytes (the length of the block); we want to skip the remainder of the current block, so the distance to move forward is the start of the next block minus the current position. This could have been written blkdat.Skip(nSize - 80), but I think that's not as clear.

I just noticed that the SetLimit() here can be removed; it's not needed when doing a Skip() (it's only needed when deserializing something of variable length, such as a block). I'll remove it.

fjahr commented Mar 4, 2020

Concept ACK; I also reviewed the code and am mostly satisfied. I think there is good potential for performance improvement here. I will hold off on testing until the current comments are addressed.

LarryRuane (Author, Contributor) left a comment

Thanks so much to all the reviewers -- these are great comments and suggestions! I'll force-push fixes for most of the suggestions (where it's clear what to do) probably tomorrow.

Another thing I said during review club that I'd look into is seeing if we can test the changes to the reindex code in validation.cpp in a functional test, so I'll do that too.

} catch (const std::exception& e) {
BOOST_CHECK(strstr(e.what(),
"Read attempted past buffer limit") != nullptr);
}

This comment has been minimized.

Copy link
@LarryRuane

LarryRuane Mar 5, 2020

Author Contributor

This kind of check occurs four places in this file (and was being done before this PR, for example line 256). I like the conciseness of your suggestion, but it loses the check of the exception string. (The test wouldn't detect if a completely different exception occurred.) Another advantage of checking the exception string is that it documents why the exception is thrown (could be in a comment, but code is better).

It would be nice if there was a variant of BOOST_CHECK_THROW that took a string argument. What do you think about implementing this in this file, or in a more general place:

#define BOOST_CHECK_THROW_WHAT(statement, exception, str) \
    do { \
        try { \
            statement; \
            BOOST_CHECK(false); \
        } catch (const exception& e) { \
            BOOST_CHECK(strstr(e.what(), str) != nullptr); \
        } \
    } while (false)

and then replacing the try/catch (here and in 3 other places) with:

    BOOST_CHECK_THROW_WHAT(bf.Skip(2),
        std::exception, "Read attempted past buffer limit");

(I tested this.) It's weird to name this macro BOOST_* because it's not part of the Boost library, but what it does is so close to BOOST_CHECK_THROW() that it may make sense. If the casual reader thinks this is part of Boost, is that so bad?




@LarryRuane LarryRuane force-pushed the LarryRuane:reindex-speedup branch 2 times, most recently from 8db0113 to f714ad1 Mar 6, 2020
@LarryRuane LarryRuane force-pushed the LarryRuane:reindex-speedup branch from f714ad1 to 0e27461 Mar 7, 2020
LarryRuane (Author, Contributor) left a comment

I think I've addressed all the review comments, thanks again!

BOOST_CHECK(strstr(e.what(), str) != nullptr); \
} \
} while (false)

LarryRuane (Author, Contributor) Mar 7, 2020

No functional change here, just refactoring a check that's repeated 4 times. (It's a macro rather than a function so that the line numbers report correctly.) This macro is misnamed (since it's not part of Boost), but it's so close to BOOST_CHECK_THROW() that this is probably the least confusing name. (But I can change it.)




// Read the rest of the block (from the end of its header to the end
// of the block) to ensure it's in memory while holding cs_main (below).
blkdat.Skip(nBlockPos + nSize - blkdat.GetPos());

LarryRuane (Author, Contributor) Mar 7, 2020

@narula, this addresses your concern about holding cs_main while reading disk.

# Check that blocks can be processed out of order
def out_of_order(self):
    # The previous test created 12 blocks
    assert_equal(self.nodes[0].getblockcount(), 12)
    self.stop_nodes()

    # In this test environment, blocks will always be in order (since
    # we're generating them rather than getting them from peers), so to
    # test out-of-order handling, swap blocks 1 and 2 on disk.
    blk0 = os.path.join(self.nodes[0].datadir, self.nodes[0].chain, 'blocks', 'blk00000.dat')
    with open(blk0, 'r+b') as bf:
        b = bf.read(814)

        # This is really just checking the test; these are the first
        # "marker" bytes of the four blocks: genesis and 3 more
        # (these offsets are extremely unlikely ever to change)
        assert_equal(b[0], 0xfa)
        assert_equal(b[293], 0xfa)
        assert_equal(b[553], 0xfa)
        assert_equal(b[813], 0xfa)

        # Swap the same-size second and third blocks (don't change genesis).
        bf.seek(293)
        bf.write(b[553:813])
        bf.write(b[293:553])

    extra_args = [["-reindex", "-debug=reindex"]]
    self.start_nodes(extra_args)
    wait_until(lambda: self.nodes[0].getblockcount() > 0)

    # All blocks should be accepted and processed.
    assert_equal(self.nodes[0].getblockcount(), 12)

LarryRuane (Author, Contributor) Mar 7, 2020

I did verify that if I break this PR's code, this test doesn't pass. I know hardcoded constants are generally bad, but I think they're okay here, because the size of an empty regtest block is very unlikely to change, and if it does, this test will fail, and it should be pretty easy for someone to figure out why and fix the test. (That is, the test will not falsely succeed.)

kallewoof commented Mar 7, 2020

Edit: I forgot to mention; this node has txindex enabled.

  • With SCHED_BATCH on:
25.11s user 163.37s system 7% cpu 39:53.51 total
25.59s user 162.06s system 7% cpu 39:43.73 total
25.35s user 161.31s system 7% cpu 39:16.82 total
24.98s user 162.59s system 7% cpu 39:06.45 total
25.36s user 162.80s system 8% cpu 39:01.81 total
  • With SCHED_BATCH off:
24.64s user 155.84s system 8% cpu 35:31.20 total
25.22s user 156.17s system 8% cpu 36:08.64 total
25.31s user 156.53s system 8% cpu 36:44.35 total
25.08s user 156.75s system 8% cpu 37:30.11 total
24.97s user 156.26s system 8% cpu 37:15.24 total
  • Master (too slow; only ran one time, on 3d28c88):
4261.85s user 151.57s system 66% cpu 1:49:57.76 total