Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible savings in dumptxoutset serialization format (~20%) #25675

Open
RCasatta opened this issue Jul 22, 2022 · 2 comments · May be fixed by #29612
Open

Possible savings in dumptxoutset serialization format (~20%) #25675

RCasatta opened this issue Jul 22, 2022 · 2 comments · May be fixed by #29612
Labels

Comments

@RCasatta
Copy link

Is your feature request related to a problem? Please describe.

Size of the serialized UTXO set

Describe the solution you'd like

The serialization format of the serialized UTXO set from dumptxoutset is a list of (COutPoint, Coin).

However, many out points refer to the same transaction so we can group by Txid with:
list(Txid, list((vout,Coin))

Since the cursor is already iterating through sorted Txid this doesn't add complexity to the serialization code.

Considering the UTXO at height 745995:

serialized_size: ~5.3Gb
total_elements: 83_082_178
uniques_txids: 49_517_483
bytes_lost: total_elements // due to the additional byte for the length of the inner list
bytes_savings: (total_elements-unique_txids)*32 - bytes_lost ~= 1Gb

Describe alternatives you've considered

Additional bytes could be saved by leveraging the duplications in scripts (address reuse). However, this is not considered worthy because the format would lose the streaming property and also because we don't want to optimize on something which is not recommended.

The byte lost for expressing the length of the inner list could be optimized with a special byte containing both the length of the list and the first vout, since usually both vout and this value are very small they should fit on a single byte most of the time having a fallback for edge cases.

Additional context

Tagging @jamesob author of #16899
assumeutxo summary #15606

@luke-jr
Copy link
Member

luke-jr commented Mar 14, 2024

Outputs are also created sequentially, so it seems likely list(Txid, list(Coin,...) might improve things too?

@RCasatta
Copy link
Author

Shouldn't be list(Txid, list(Option<Coin>,...) to recompute the vout?
In this case I don't think so because it needs a byte for every None?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants
@luke-jr @RCasatta and others