Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: Serve blocks directly from disk when possible #13151

Merged
merged 1 commit into from May 23, 2018

Conversation

Projects
None yet
8 participants
@laanwj
Copy link
Member

laanwj commented May 2, 2018

In ProcessGetBlockData, send the block data directly from disk if type MSG_WITNESS_BLOCK is requested. This is a valid shortcut as the on-disk format matches the network format.

This is expected to increase performance because a deserialization and subsequent serialization roundtrip is avoided.

@laanwj laanwj force-pushed the laanwj:2018_05_direct_from_disk branch May 2, 2018

@MarcoFalke

This comment has been minimized.

Copy link
Member

MarcoFalke commented May 2, 2018

Could you please add a benchmark to ./src/bench/checkblock.cpp, so it is easier to see how much this improves?

@laanwj

This comment has been minimized.

Copy link
Member Author

laanwj commented May 2, 2018

Sure, though I'm not sure how to do that; none of the benches actually uses ReadBlockFromDisk, I would have to set up a fake block index or such.

(I also don't think it will work on block413567 as-is because it has no magic/size header, and is not a file on disk, though it's easy enough to write a temporary file of course). [When doing this from memory there's effectively nothing to benchmark, too.]

@promag
Copy link
Member

promag left a comment

Concept ACK.

Couldn't we serve corrupted blocks?

src/validation.cpp Outdated

try {
block.resize(nSize);
filein.read((char*)block.data(), nSize);

This comment has been minimized.

@promag

promag May 2, 2018

Member

Just to throw out the idea, mmap wouldn't pay off right?

This comment has been minimized.

@laanwj

laanwj May 2, 2018

Author Member

I don't think that's a win here, as the entire block is read consecutively - could even be slower as it'd have to create and destroy the mapping. Also it's not portable.

@@ -1142,60 +1143,71 @@ void static ProcessGetBlockData(CNode* pfrom, const Consensus::Params& consensus
std::shared_ptr<const CBlock> pblock;
if (a_recent_block && a_recent_block->GetHash() == pindex->GetBlockHash()) {
pblock = a_recent_block;
} else if (inv.type == MSG_WITNESS_BLOCK) {

This comment has been minimized.

@MarcoFalke

MarcoFalke May 2, 2018

Member

Shouldn't this compare against the serialization flags of the block on disk? Currently you are assuming that all blocks are serialized as witness blocks on disk, but this is not true for all "early" blocks.

This comment has been minimized.

@laanwj

laanwj May 2, 2018

Author Member

Yes I'm not convinced this logic is correct. It seems to work, though, even for the initial blocks.

Edit: What is the operation to convert from a non-witness block to witness block with no witnesses? I suppose this could still be done without a full round-trip?

This comment has been minimized.

@sipa

sipa May 2, 2018

Member

@laanwj It's always correct to give the raw blocks we store to peers that ask for witnesses (even if the block does not have a witness).

Converting extended format to basic format is a lot more complicated. You could have a special CTransaction which skips the witness fields instead of reading/deserializing them, but I don't see how to do it without going through some form of serialization code.

This comment has been minimized.

@laanwj

laanwj May 2, 2018

Author Member

@sipa Thanks.

Converting extended format to basic format is a lot more complicated. You could have a special CTransaction which skips the witness fields instead of reading/deserializing them, but I don't see how to do it without going through some form of serialization code.

Right, that case should fall back to deserialization->serialization right now. I don't think we can do much better there. Not sure it's even worth optimizing, there won't be many new clients being synced with pre-segwit versions.

This comment has been minimized.

@MarcoFalke

MarcoFalke May 2, 2018

Member

Leaving out the less common edge case (non-witness peers) seems fine for now.

This comment has been minimized.

@sipa

sipa May 13, 2018

Member

Unsure if we care, but we could also check whether SegWit was active for the requested block, and if not, we can serve without deserialization even when witnesses are not requested.

This comment has been minimized.

@laanwj

laanwj May 14, 2018

Author Member

@sipa Yes, that would be something that could be done here.

@MarcoFalke

This comment has been minimized.

Copy link
Member

MarcoFalke commented May 2, 2018

Concept ACK. Would be nice to see how much the additional savings are on top of #13098.

@laanwj

This comment has been minimized.

Copy link
Member Author

laanwj commented May 2, 2018

Concept ACK. Would be nice to see how much the additional savings are on top of #13098.

At least it's a lot simpler.

@laanwj

This comment has been minimized.

Copy link
Member Author

laanwj commented May 2, 2018

Couldn't we serve corrupted blocks?

Yes, that's a possibility, though only if the underlying storage is corrupted. I've posited the idea to add a CRC32C to the on-disk blocks at some point (which is quick to verify, especially with specialized instructions, and should protect against accidental corruptions), but that's quite an invasive change. It's something that could be done later.

The only option to verify with the current information would be to do a Merkle tree check - which could be done without deserialization but it's not pretty... (and a serious overhead SHA256-hashing)

@MarcoFalke

This comment has been minimized.

Copy link
Member

MarcoFalke commented May 2, 2018

The only option to verify with the current information would be to do a Merkle tree check

I don't see that we currently do this, so it wouldn't make anything worse here.

@gmaxwell

This comment has been minimized.

Copy link
Member

gmaxwell commented May 2, 2018

I think serving a corrupted block if our state is corrupted is fine, the peer will just disconnect us and go get the block from someone else, seems pretty harmless!

This is a much smaller change than I was expecting-- in particular I forgot there was a size, light review ACK.

@TheBlueMatt
Copy link
Contributor

TheBlueMatt left a comment

utACK except for the below:

src/validation.cpp Outdated
}

try {
block.resize(nSize);

This comment has been minimized.

@TheBlueMatt

TheBlueMatt May 2, 2018

Contributor

Probably want to check the size is sane before we do this.

This comment has been minimized.

@laanwj

laanwj May 3, 2018

Author Member

Good point. What constant would be appropriate here?
Edit: I'll go with MAX_SIZE from serialize.h.

This comment has been minimized.

@laanwj

laanwj May 3, 2018

Author Member

Another thing I wondered here: what is the C++11 proper way to allocate a vector (or a RAII memory area) without zeroing it? I think that's unnecessary here.
That was a bad idea: even though we handle errors while reading, as this data is sent directly over P2P, zeroing is defense-in-depth against heartbleed-style issues here

@MarcoFalke
Copy link
Member

MarcoFalke left a comment

Will measure some round trips tomorrow.

src/net_processing.cpp Outdated
} else if (inv.type == MSG_WITNESS_BLOCK) {
// Fast-path: in this case it is possible to serve the block directly from disk,
// as the network format matches the format on disk
LogPrintf("debug: Serving raw block directly from disk: %s\n", pindex->ToString());

This comment has been minimized.

@MarcoFalke

MarcoFalke May 2, 2018

Member

Should probably remove this debug logging

This comment has been minimized.

@laanwj

laanwj May 3, 2018

Author Member

Yes, definitely. I added it while WIP so that people testing this can be sure that the code actually triggers and they're testing the right thing.

@@ -1142,60 +1143,71 @@ void static ProcessGetBlockData(CNode* pfrom, const Consensus::Params& consensus
std::shared_ptr<const CBlock> pblock;
if (a_recent_block && a_recent_block->GetHash() == pindex->GetBlockHash()) {
pblock = a_recent_block;
} else if (inv.type == MSG_WITNESS_BLOCK) {

This comment has been minimized.

@MarcoFalke

MarcoFalke May 2, 2018

Member

Leaving out the less common edge case (non-witness peers) seems fine for now.

@jonasschnelli

This comment has been minimized.

Copy link
Member

jonasschnelli commented May 3, 2018

utACK 4c790dff7481d1464a906ad6b17a3179a7da3431

This would probably also speedup an external indexing daemon via p2p (see experiment in https://github.com/jonasschnelli/bitcoincore-indexd [very WIP])

Here a flamegraph of serving the first 200k blocks via p2p localhost (though the real deser/ser starts probably at higher up in the chain).

src/validation.cpp Outdated

CMessageHeader::MessageStartChars msg_start_in;
unsigned int nSize;
filein >> msg_start_in >> nSize;

This comment has been minimized.

@laanwj

laanwj May 3, 2018

Author Member

I guess this deserialization logic should be within the try {}.

@laanwj laanwj changed the title WIP: net: Serve blocks directly from disk when possible net: Serve blocks directly from disk when possible May 3, 2018

@laanwj

This comment has been minimized.

Copy link
Member Author

laanwj commented May 3, 2018

OK, thanks for review everyone, removed WIP tag and pushed commits with the following changes:

  • Remove extraneous debug log message
  • Check nSize against MAX_SIZE
  • Move deserialization of msg_start_in, and size into exception try
  • Improved variable and parameter naming

Will squash if no further issues.

This would probably also speedup an external indexing daemon via p2p (see experiment in https://github.com/jonasschnelli/bitcoincore-indexd [very WIP])

Yeah - I saw in #13098 that @MarcoFalke had also optimized the zmq full-block notification part. I'm not sure that is any critical path performance-wise (maybe for your indexer?) but if so I'll leave that for a later PR.

Here a flamegraph of serving the first 200k blocks via p2p localhost (though the real deser/ser starts probably at higher up in the chain).

Did you forget to post the link?

@sipa
Copy link
Member

sipa left a comment

utACK after squash.

src/net_processing.cpp Outdated
// as the network format matches the format on disk
std::vector<uint8_t> block_data;
if (!ReadRawBlockFromDisk(block_data, pindex, chainparams.MessageStart()))
assert(!"cannot load block from disk");

This comment has been minimized.

@sipa

sipa May 3, 2018

Member

Nit: braces around then-branch if on a separate line.

@MarcoFalke

This comment has been minimized.

Copy link
Member

MarcoFalke commented May 4, 2018

Running with e0223ebf0c58f7beedea91df48e9586154cd4436 and just looking at the wall clock time for reading+optional deserialization shows for me on an ssd:

ssd

Edit: Note that this was done with full fake blocks and not real blocks from the network.

@laanwj

This comment has been minimized.

Copy link
Member Author

laanwj commented May 4, 2018

Thanks for benchmarking @MarcoFalke.

I used a patched version of @jonasschnelli's bitcoincore-indexd to benchmark the time for fetching block 0..473600 through P2P, with no processing client-side. The result is:

With patch:
real    63m51.273s

Without patch:
real    70m28.956s

10% speedup. And in my case is the blocks are on a slow harddisk. I expect the gains to be more significant in case of a faster storage medium, or slower CPU.


block.resize(blk_size); // Zeroing of memory is intentional here
filein.read((char*)block.data(), blk_size);
} catch(const std::exception& e) {

This comment has been minimized.

@promag

promag May 7, 2018

Member

nit, space after catch } catch (....

blk_size, MAX_SIZE);
}

block.resize(blk_size); // Zeroing of memory is intentional here

This comment has been minimized.

@promag

promag May 7, 2018

Member

Zeroing of memory is intentional

Why?

This comment has been minimized.

@laanwj

laanwj May 7, 2018

Author Member

To avoid heartbleed-type leaks as this data goes directly over the network.

@jonasschnelli

This comment has been minimized.

Copy link
Member

jonasschnelli commented May 7, 2018

Did 10 rounds of requesting blocks in range 490'000 up to 500'000 on master and got.
Setup:

  • Non VM machine
  • SSD 1400MB/s
  • Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz
  • txindex was enabled
  • connect=0 --whitebind=127.0.0.1:8333
  • no other resource intense applications where running on that system
  • used a modified version of https://github.com/laanwj/bitcoincore-indexd

Master (-g -O2):

95211ms in avg. (all rounds where very similar and there was no need to exclude the first round)
exemption

This PR (-g O2):

101051ms in avg.

I can't figure out why this PR performs ~6% slower and I double checked the comparison by manually rolling back from the head of this PR to 598db38 and back to head. Also added the LogPrintfs back to ensure I'm using the "fast track" (removed again during benchmarking).

sidenode: found out that -debug=net (which was disabled during the rounds reported above) account for a 3% slowdown for the above test-scenario.

@laanwj

This comment has been minimized.

Copy link
Member Author

laanwj commented May 9, 2018

@jonasschnelli That's really strange. As reported, I did see some actual speed-ups where using this.
Maybe someone else can try some measurements, or we should just close, I don't know.

@jonasschnelli

This comment has been minimized.

Copy link
Member

jonasschnelli commented May 9, 2018

If someone wants to compare master against this PR built in the same environment:

PR: https://bitcoin.jonasschnelli.ch/build/600
master: https://bitcoin.jonasschnelli.ch/build/599

@MarcoFalke

This comment has been minimized.

Copy link
Member

MarcoFalke commented May 9, 2018

I updated my benchmark to also include the time it takes to Make (serialize) the net message:

net message serialization times

@MarcoFalke

This comment has been minimized.

Copy link
Member

MarcoFalke commented May 9, 2018

I think we should definitively look into why it is slower to sync, since that indicates a problem (potentially in our code) exists elsewhere.

@laanwj

This comment has been minimized.

Copy link
Member Author

laanwj commented May 13, 2018

I did the same experiment as @jonasschnelli, a modified bitcoincore-indexd that requests block 490000..500000 (https://github.com/laanwj/bitcoincore-indexd/tree/bench). Tried both cases 5 times;

with patch:
real    0m55.928s
real    0m55.986s
real    0m55.913s
real    0m55.844s
real    0m55.790s

without patch (using the commit before):
real    2m47.673s
real    2m46.329s
real    2m46.634s
real    2m46.413s
real    2m46.458s

~66% speedup. This is with a spinning rust disk not a SSD. This was done on a different computer than my previous test.

@jonasschnelli

This comment has been minimized.

Copy link
Member

jonasschnelli commented May 14, 2018

I think this is a clear benefit for spinning disk and probably also for ssd in non absurd localhost cases.

utACK

@MarcoFalke

This comment has been minimized.

Copy link
Member

MarcoFalke commented May 14, 2018

@jonasschnelli Did you have a chance to look into why your result was unexpected?

@jonasschnelli

This comment has been minimized.

Copy link
Member

jonasschnelli commented May 14, 2018

@MarcoFalke: no. I haven't but I'm willing to do as soon as someone could confirm my results (SSD test).

@MarcoFalke

This comment has been minimized.

Copy link
Member

MarcoFalke commented May 14, 2018

@MarcoFalke

This comment has been minimized.

Copy link
Member

MarcoFalke commented May 14, 2018

For clarity: 598db means master@598db and 9893e means this pull request. I used @laanwj's branch of bitcoincore-indexd.

10k block fetch times 490k-500k

@MarcoFalke

This comment has been minimized.

Copy link
Member

MarcoFalke commented May 14, 2018

@jonasschnelli I coulnd't find the branch you were using. Mind to share, otherwise I can't reproduce?

@laanwj

This comment has been minimized.

Copy link
Member Author

laanwj commented May 15, 2018

Same experiment as #13151 (comment) on i.MX6Q ARM board w/ USB2 spinning disk:

with patch:
real    10m18.368s
real    11m14.600s
real    10m12.006s
real    10m21.668s
real    10m11.070s

without patch (using the commit before):
real    27m30.574s
real    26m27.591s
real    25m38.311s
real    25m36.661s
real    25m40.902s

Seems to help even with a slow CPU and slow I/O.

net: Serve blocks directly from disk when possible
In `ProcessGetBlockData`, send the block data directly from disk if
type MSG_WITNESS_BLOCK is requested. This is a valid shortcut as the
on-disk format matches the network format.

This is expected to increase performance because a deserialization and
subsequent serialization roundtrip is avoided.

@laanwj laanwj force-pushed the laanwj:2018_05_direct_from_disk branch to 0bf4318 May 15, 2018

@laanwj

This comment has been minimized.

Copy link
Member Author

laanwj commented May 15, 2018

squashed, no other changes --
9893e712e9e04e8b9478e36e0b5d843899540bd2 → 0bf4318

@jonasschnelli

This comment has been minimized.

Copy link
Member

jonasschnelli commented May 15, 2018

I guess my setup was either faulty or there is a performance loss with that particular setup (>1000MB/s IO r&w on very fast CPUs).
However, this PR is a clear and significant win!

@MarcoFalke

This comment has been minimized.

Copy link
Member

MarcoFalke commented May 15, 2018

@jonasschnelli I can't explain why, but you might want to try to drop the files cached in memory with sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'. For me this was speeding up the sync on an ssd.

@laanwj laanwj merged commit 0bf4318 into bitcoin:master May 23, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

laanwj added a commit that referenced this pull request May 23, 2018

Merge #13151: net: Serve blocks directly from disk when possible
0bf4318 net: Serve blocks directly from disk when possible (Wladimir J. van der Laan)

Pull request description:

  In `ProcessGetBlockData`, send the block data directly from disk if type MSG_WITNESS_BLOCK is requested. This is a valid shortcut as the on-disk format matches the network format.

  This is expected to increase performance because a deserialization and subsequent serialization roundtrip is avoided.

Tree-SHA512: 9a9500b4c1354eaae1a6f1c6ef2416c1c1985029852589266f3a70e808f6c7482c135e9ab251a527566935378ab7c32dba4ed43ba5451e802d8e72b77d1ba472
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.