
rpc: Optimize serialization disk space of dumptxoutset #26045

Closed

Conversation

@aureleoules (Member) commented Sep 8, 2022

This is an attempt to implement #25675.

I was able to reduce the serialized utxo set from 5GB to 4.1GB on mainnet.

Closes #25675.

@aureleoules marked this pull request as draft September 8, 2022 08:37
@aureleoules force-pushed the 2022-09-dumputxoset-compact branch 4 times, most recently from a00b91c to 80f0b0f on September 8, 2022 10:17
@aureleoules marked this pull request as ready for review September 8, 2022 10:47
@aureleoules force-pushed the 2022-09-dumputxoset-compact branch 3 times, most recently from 3822c2f to 8b0e9e9 on September 8, 2022 13:08
@DrahtBot (Contributor) commented Sep 8, 2022

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
Concept ACK jamesob, pablomartin4btc, jaonoctus, Sjors
Stale ACK TheCharlatan

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #29307 (util: check for errors after close and read in AutoFile by vasild)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@amovfx left a comment

I don't have the skill set to approve this code but I can test and verify. I ran this on testnet and can confirm a smaller output file.

without:
{ "coins_written": 27908372, "base_hash": "0000000000000008f07ed39d6d03c19ee7346bc15b6a516cdda8402b6244b828", "base_height": 2345886, "path": "/Users/{$User}/Library/Application Support/Bitcoin/testnet3/./txoutset.txt", "txoutset_hash": "026617308d218e57fb43f02baa644134f5000594a1eea06b02cc9d02959d4d9b", "nchaintx": 63601314 }
Size of 3 473 536

With this optimization:
{ "coins_written": 27908372, "base_hash": "0000000000000008f07ed39d6d03c19ee7346bc15b6a516cdda8402b6244b828", "base_height": 2345886, "path": "/Users/${User}/Library/Application Support/Bitcoin/testnet3/txoutset.optimized.txt", "txoutset_hash": "026617308d218e57fb43f02baa644134f5000594a1eea06b02cc9d02959d4d9b", "nchaintx": 63601314 }
Size of 2 457 728

Looks like a win to me.

@luke-jr (Member) commented Sep 10, 2022

Not sure it's worth it to save 20%. Presumably generic compressors could do better anyway?

@aureleoules (Member, Author):

Not sure it's worth it to save 20%. Presumably generic compressors could do better anyway?

Yes, I agree, but at least with this implementation the UTXO set is still readable as-is and doesn't need decompression.

@TheCharlatan (Contributor):

The main downside of this implementation is that the entire UTXO set must be loaded into RAM (in mapCoins) before being written to disk, so running dumptxoutset will consume a lot of RAM on mainnet since the UTXO set is large.
Not sure how to improve this.

The keys in leveldb are guaranteed to be lexicographically sorted. You just need to cache the last txid and flush the txid and outpoint tuples once the next one is reached in the database iterator. I pushed a dirty proof of concept here and it produced the same output file as your current implementation.

Otherwise Concept ACK. For a file that might be lugged over a network or other data carrier this space improvement is nice. Users can still apply their own compression on top of it. I also like that this provides additional structure to the data.
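
For illustration, a minimal sketch of the txid-grouping approach described above (this is not the PR's or the linked proof of concept's exact code; cursor and afile stand for the chainstate cursor and output file used by dumptxoutset, and WriteCoinsForTxid is a hypothetical helper playing the role of the write lambda):

std::vector<std::pair<uint32_t, Coin>> coins; // vout index and coin for the txid being buffered
COutPoint key;
COutPoint last; // previous key; its hash identifies the group currently buffered
Coin coin;
// leveldb iterates keys in lexicographic order, so all outputs of a txid
// arrive consecutively: buffer them and flush whenever the txid changes.
for (; cursor->Valid(); cursor->Next()) {
    if (!cursor->GetKey(key) || !cursor->GetValue(coin)) continue;
    if (!coins.empty() && key.hash != last.hash) {
        WriteCoinsForTxid(afile, last.hash, coins); // hypothetical flush helper
        coins.clear();
    }
    last = key;
    coins.emplace_back(key.n, coin);
}
if (!coins.empty()) WriteCoinsForTxid(afile, last.hash, coins); // flush the final group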

@RCasatta:

Yeah, as I specified in the issue and as @TheCharlatan wrote, more RAM shouldn't be needed because txids are iterated in sorted order from leveldb.

I was also wondering why the bitcoin-cli savetxoutset API (same goes for savemempool) requires specifying a filename instead of writing to stdout, which would offer better composition while keeping the possibility to write to a file with >utxo.bin.

@TheCharlatan (Contributor):

Re #26045 (comment)

I was also wondering why the bitcoin-cli savetxoutset API (same goes for savemempool) requires specifying a filename instead of writing to stdout, which would offer better composition while keeping the possibility to write to a file with >utxo.bin.

Writing to stdout would mean that the data would have to be carried over the rpc connection, right?

@TheCharlatan (Contributor) left a comment

Thank you for picking this up again.

These are just patches fixing up my initial code; I also put the entire diff here. The rest looks good to me; I tested dumptxoutset extensively on mainnet. The runtime performance of this branch is very similar to the base.

The following can be ignored in the context of this PR, but I still want to take note:

During review I noticed that the execution time of the command depends heavily on getting the UTXO stats (getting the stats and writing the file each take about 450s on my m.2 machine). I don't follow why these stats need to be collected by iterating through the entire set again; is there no better way to prepend the metadata? A count of the written entries as well as a hash of them can be provided by the same iteration loop that is used to write the file. Further, the m_coins_count in the SnapshotMetadata is populated with tip->nChainTx, which I am not sure always corresponds to the actual number of coins written, but is used to read that exact number of coins from the file when loading the snapshot.
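
A rough sketch of the single-pass idea mentioned above (not code from this branch; HashWriter from hash.h is only a stand-in for whatever hasher the metadata actually needs, and cursor and afile mirror the loop that writes the dump):

HashWriter hasher{};
uint64_t coins_written{0};
COutPoint key;
Coin coin;
for (; cursor->Valid(); cursor->Next()) {
    if (cursor->GetKey(key) && cursor->GetValue(coin)) {
        afile << key << coin;  // write the dump entry as before
        hasher << key << coin; // fold the same bytes into a running hash
        ++coins_written;
    }
}
// coins_written and hasher.GetHash() are now available for the metadata
// without a second pass over the UTXO set.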

src/rpc/blockchain.cpp: 3 outdated review comments (resolved)
@RCasatta:

Writing to stdout would mean that the data would have to be carried over the rpc connection, right?

Yes, would that be a problem?

@sipa (Member) commented Apr 30, 2023

There is no way we can currently send gigabytes of data as an RPC response; both the server and client likely buffer the result and would OOM.

@aureleoules (Member, Author):

Thanks for the patch @TheCharlatan, I applied it.

@TheCharlatan (Contributor):

I ran some more numbers, dumping a mainnet snapshot at height 787535:

Filename Size (bytes)
this_branch.dat 4'697'836'541
this_branch.tar.xz 3'899'452'268
master.dat 5'718'308'597
master.tar.xz 3'925'608'176

So even in the compressed form the encoding saves some megabytes.

@TheCharlatan (Contributor):

ACK 7acfc2a

Since we are changing the format of dumptxoutset here anyway in a non-backwards-compatible fashion, I'd like to suggest moving the metadata to the end of the file. This would take care of the double iteration described in #26045 (review). In my eyes, this does not significantly hurt the integrity of the file. If an exception occurs during writing, only the temporary file remains. Other common file formats also embed metadata at the end.

@aureleoules (Member, Author):

@aureleoules wen rebase? :-)

🫡

@aureleoules marked this pull request as draft December 18, 2023 10:45
@aureleoules (Member, Author):

Rebased

I had to slightly change the tests in feature_assumeutxo.py because I changed the encoding format of the dump. I added 2 bytes to the offsets because of the new 2-byte size field.

@aureleoules marked this pull request as ready for review December 18, 2023 11:01
@Sjors (Member) commented Dec 22, 2023

Did some testing with f5d2014.

Using contrib/devtools/utxo_snapshot.sh on signet I get the same txoutset_hash. The resulting file is also identical to what I generated earlier, see #26045 (comment). I was then able to load the snapshot, sync the chain (to the tip in about a minute, wonderful) and complete the background sync.

I also generated a mainnet snapshot, which was identical to the one from my last test, so I have not tried loading it again.

Will review the code later.

Co-authored-by: TheCharlatan <seb.kung@gmail.com>
@aureleoules (Member, Author):

Rebased. I had to change the offset again in feature_assumeutxo.py.

@@ -76,6 +76,7 @@ def test_invalid_snapshot_scenarios(self, valid_snapshot_path):
         bad_snapshot_path = valid_snapshot_path + '.mod'
 
         def expected_error(log_msg="", rpc_details=""):
+            print(log_msg)
Member:

Don't forget to drop this.

Contributor:

Should be addressed in #29612

@Sjors (Member) left a comment

Concept ACK. Found a few issues during code review, see inline.

@TheCharlatan wrote:

The keys in leveldb are guaranteed to be lexicographically sorted. You just need to cache the last txid and flush the txid and outpoint tuples once the next one is reached in the database iterator.

It would be good to document in the code that we rely on this behavior (maybe in the header file, so someone who wants to replace leveldb is reminded).

@luke-jr & @aureleoules wrote:

Not sure it's worth it to save 20%. Presumably generic compressors could do better anyway?

Yes, I agree, but at least with this implementation the UTXO set is still readable as-is and doesn't need decompression.

We might also at some point want to xor the file - for the same reason as #28207. That would make compression with external tools impossible.


auto write_coins_to_file = [&](AutoFile& afile, const uint256& last_hash, const std::vector<std::pair<uint32_t, Coin>>& coins) {
afile << last_hash;
afile << static_cast<uint16_t>(coins.size());
Member:

4e19464: In primitives/transaction.h we serialize std::vector<CTxOut> vout with a simple s << tx.vout; without any reference to the size of the vector. Not sure if any magic is happening there under the hood that I missed.

Also, uint16_t implies a limit of 65,536 outputs per transaction. I don't think that's a consensus rule?

Contributor:

Yeah, I think the magic happens here: https://github.com/bitcoin/bitcoin/blob/master/src/serialize.h#L674

However, we can't use that because we are not looking at a full transaction but rather the outpoints that are still left in the UTXO set. But we basically mimic that behavior here.

Contributor:

On the 65,536: I guess the block size solves this for us for now, but I think it makes sense to use VARINT/CompactSize here.

Member:

A transaction with 65,537 OP_RETURN outputs should fit in a block.

If I start with P2TR outputs with this calculator, that's 2,818,159 vbyte. https://bitcoinops.org/en/tools/calc-size/

And then subtract 32 bytes per output: 2,818,159 - 65537 * 32 = 720,975 vbyte

cc @murchandamus can you add OP_RETURN to the dropdown? :-)

In any case it seems unsafe to rely on the block size here.

Contributor:

Hm, but OP_RETURNs are not included in the UTXO set, and we are serializing the UTXO set here, so I don't think this exact case can happen. But you are right that there are non-standard cases imaginable that make this possible, like simply sending to OP_TRUE for example. So we should still make this robust.

Anyway, I am using CompactSize now in #29612 :)
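
For reference, a sketch of how the CompactSize variant could look (the actual change lives in #29612 and may differ; WriteCompactSize is the existing serialize.h helper, and the lambda shape just mirrors the snippet quoted above):

auto write_coins_to_file = [&](AutoFile& afile, const uint256& last_hash, const std::vector<std::pair<uint32_t, Coin>>& coins) {
    afile << last_hash;
    WriteCompactSize(afile, coins.size()); // variable-length count instead of a fixed uint16_t
    for (const auto& [vout, coin] : coins) {
        afile << vout;
        afile << coin;
    }
};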

coins_file >> size;

if(size > coins_left) {
LogPrintf("[snapshot] mismatch in coins count in snapshot metadata and actual snapshot data\n");
Member:

In the context of my remark above about maybe not serializing size: in that case this check would have to happen after the batch of coins is loaded, which seems fine.

Contributor:

@Sjors I couldn't follow this comment here. If this is still relevant, could you please clarify in #29612? Thanks!

Member:

Nothing to do here, since we do serialize the size, as you explained: #26045 (comment)

Coin coin;
coins_file >> outpoint.n;
coins_file >> coin;
outpoint.hash = txid;
Member:

4e19464 nit: maybe move this above where you set outpoint.n

Contributor:

Should be addressed in #29612

}
} catch (const std::ios_base::failure&) {
LogPrintf("[snapshot] bad snapshot format or truncated snapshot after deserializing %d coins\n",
Member:

Unrelated, but why doesn't this use coins_processed?

Contributor:

Should be addressed in #29612

@@ -5468,6 +5480,7 @@ bool ChainstateManager::PopulateAndValidateSnapshot(

bool out_of_coins{false};
try {
COutPoint outpoint;
Member:

I think this should be:

Txid txid;
coins_file >> txid;

The current code might accidentally work because a COutPoint is a Txid followed by a single number (albeit uint32_t instead of uint16_t).

Contributor:

Should be addressed in #29612

Coin coin;
unsigned int iter{0};
Contributor:

This move doesn't seem necessary.

Contributor:

Should be addressed in #29612



auto write_coins_to_file = [&](AutoFile& afile, const uint256& last_hash, const std::vector<std::pair<uint32_t, Coin>>& coins) {
afile << last_hash;
afile << static_cast<uint16_t>(coins.size());
for (auto [vout, coin] : coins) {
Contributor:

I think you should call vout here either n or vout_index; otherwise it might be confusing.

Contributor:

Should be addressed in #29612


@aureleoules (Member, Author):

Thanks for the reviews, but I don't have time to address the comments, so please mark this pull as Up for grabs.

@Sjors (Member) commented Mar 7, 2024

Thanks for the great start @aureleoules!

@fjahr can you take this one? Otherwise I can, but not sure how soon.

@fjahr (Contributor) commented Mar 7, 2024

Thanks for the great start @aureleoules!

@fjahr can you take this one? Otherwise I can, but not sure how soon.

Yeah, I will reopen this shortly.


Successfully merging this pull request may close these issues.

Possible savings in dumptxoutset serialization format (~20%)