
reduce memory usage of bittorrent v2 merkle trees #4373

Closed
arvidn opened this issue Mar 2, 2020 · 14 comments

@arvidn
Owner

arvidn commented Mar 2, 2020

libtorrent version (or branch): master

Currently, m_merkle_trees accounts for more than 85% of the total memory usage of client_test when loading many torrents. The more torrents, the larger its share.

I would like to come up with a more memory efficient representation. Some ideas:

  1. consolidate the allocations of all the sha256_hash nodes in the trees into a single vector, and make the m_merkle_trees reference that storage. The hope is that the overhead of many small memory allocations would be saved.

  2. change the data structure for the merkle trees so that not every layer is pre-computed, and asking for a hash would sometimes require some computation. Maybe it would even be acceptable to always recompute nodes (say, all nodes but the root and the piece layer).

@ssiloti do you have any ideas?

@arvidn arvidn added this to the 2.0 milestone Mar 2, 2020
@arvidn
Owner Author

arvidn commented Mar 2, 2020

I'm leaning towards an experiment with (2). Also, change the torrent_info object to not actually build the tree, but just hold the piece layers, as found in the .torrent file (and probably validate them, but not store the whole tree).

This would also clean up the ownership of the merkle trees. Currently they are owned by the torrent_info object (which I would really prefer to be immutable), but they are used and mutated by the hash_picker. So it's a bit smelly.

@ssiloti
Collaborator

ssiloti commented Mar 3, 2020

Typically the vast majority of hash requests will be for the piece layer, so I agree option 2 is a good idea.

I'm skeptical about moving the merkle trees out of torrent_info though. Whatever data structure the trees get moved into would need to be passed around alongside the torrent_info in a lot of places. The shift from immutable to mutable is a reflection of the fact that with v2 the torrent metadata is no longer all-or-nothing.

@xnoreq

xnoreq commented Mar 9, 2020

Binary trees can be stored quite easily in breadth-first order in an array. Since the padding required in merkle trees is at the rightmost leaf nodes, those padding nodes don't need to be stored.

Consider a worst-case example of 8+1 leaf nodes. Assuming 20 bytes/hash, that gives a minimum of 180 bytes for just the leaves/pieces.

8 nodes would result in a perfect tree of depth 3 with N=15 or 300 bytes (+66% over minimum).

The 9th node adds the same number of nodes plus a new root increasing the depth to 4. But since the remaining 7 leaf nodes are just padding we don't need to store them. That results in N=24 or 480 bytes (+166% over minimum).

This goes towards +200% worst case as the number of pieces goes towards infinity.

Would that really be that big of a problem?

@arvidn
Owner Author

arvidn commented Mar 9, 2020

The benchmark is the memory usage of v1 torrents. A few things make this problem worse than with v1 torrents.

In v2 the hashes are larger than in v1 (SHA-256 vs SHA-1), so 32 bytes vs 20 bytes.

The info-dictionary is always kept in RAM to support magnet links (sending the metadata to other peers). In v1, the piece hashes are part of the info-dictionary, so they are "free": the hash checker can just look into the info-dictionary blob for them.

In v2 the piece hashes (technically, the piece layers) are not part of the info-dictionary, but part of the outer torrent file. This means they have to be allocated separately. That's not the end of the world; that memory would have been allocated as part of the info-dict blob anyway. The root hashes for the files are part of the info-dict though, so they can be referenced directly from the blob. In the current implementation, the root hashes are copied into the "filled" tree, wasting 32 bytes per (non-pad) file.

Each file has its own tree, which means each file has its own padding. Empty files and pad files don't have a tree, but they still have an empty std::vector, which wastes 24 bytes each.

My ideas so far are:

  • introduce an abstraction representing a merkle tree for a single file.
  • don't store the root hash, but refer to the one in the info-dict. We always have this.
  • don't store any hashes except the piece layer and the leaf layer; compute all other hashes on demand (I hope this will work in the hash_picker as well).
  • don't store empty merkle_trees for pad files and small files

@verdy-p

verdy-p commented Mar 12, 2020

A binary tree in which all levels are full and only the deepest level is partially filled (with all empty leaves grouped together at the end) is very simple to represent as a vector, where item [0] is the root and items [1] and [2] are its immediate children. The index in the vector maps directly to the ordering of the nodes, and the starting index of each level is always a power of 2 minus 1.
The deepest level is stored at the end of the vector and is simply the equivalent of the bittorrent v1 list of hashes per fixed-size block. The total size of all the levels above it (up to the root) is always 2^n-1 hashes.
So

  • the root level starts at index [2^0-1=0] and contains (2^0=1) hash,
  • the 1st child level starts at index [2^1-1=1] and contains up to (2^1=2) hashes stored sequentially,
  • the 2nd child level starts at index [2^2-1=3] and contains up to (2^2=4) hashes stored sequentially,
  • the 3rd child level starts at index [2^3-1=7] and contains up to (2^3=8) hashes stored sequentially,
  • and so on.
  • NO padding is ever necessary (independently of the size of the file to hash)!!

So you don't need any tree structure to store the tree; you can even keep the tree on disk instead of in memory, and direct access is possible. For checking a file and locating an error, you can read just the minimum from the start of the vector of hashes with minimal I/O, easily locate which part has a wrong signature, then recurse down using sequential I/O on this vector until you reach the leaf level. All this is easily and efficiently cached. So all you need to keep in memory is the root hash; when there are many torrents, this will just require a few bytes per torrent, just the merkle tree root hash for each file. The cost will be of the same order as with bittorrent v1, which just uses a single hash (20 bytes with SHA-1) for the whole file.
This vector containing the whole tree can be integrated into the .torrent file. When sharing very large files (e.g. a database dump of several tens of gigabytes), the simple SHA-1 index is very slow to compute and check, and for downloads you cannot easily start from a random position without having downloaded the full vector of SHA-1 hashes. With the merkle tree, on the other hand, you can very quickly focus on the part of the file of interest by downloading just the necessary hashes. This is also great for improving security and checking correctness after possible data corruption in the local store (you can repair it easily: you'll only download the file fragments that interest you).
The total size of the merkle tree vector of hashes will be less than twice the size of the hashes for the individual blocks.
And with merkle trees, your individual blocks can easily be smaller than with v1: if your connection is fast and the remote peer has a large fragment to offer, you can just check the large fragment as you receive it by comparing only one or two hashes from the parent node covering the whole fragment, and this can be done even if you've still not downloaded the full detail of the individual blocks for that large downloaded fragment.
So with a merkle tree the individual blocks could be as small as 4KB.
Note that for such small blocks, using a full SHA-2 hash may not even be needed, and it could be truncated in the hash store, while the complete hash value of the block would be used for computing the hash of the parent. You could as well keep SHA-1 for leaf nodes (they will be smaller at 4KB than the current typical size of 64KB or much more for large shared files), while keeping strong security with a full SHA-2 hash for parents. So the hash store is compactable: all nodes at the same level in the tree just need the same hash size and hashing algorithm, but you can freely adjust the hash size and algorithm depending only on the number of levels up to the leaves; this is easy to predict.

@arvidn
Owner Author

arvidn commented Mar 12, 2020

@verdy-p I don't understand what your point is. You are describing the data structure for a binary tree. Why?

@verdy-p

verdy-p commented Mar 12, 2020

No, I'm describing the storage as a single indexed vector, without any links between nodes; the vector has a predefined size directly determined by the total file size in blocks and the size of each hash. This just creates a single integer-indexed blob.
The same is true if the tree is not binary but uses another branching degree (as long as the branching degree is constant within each level): a merkle tree is not necessarily binary, it could as well be quaternary, and you would still save a lot of storage space.
Also, you don't need to keep the whole blob in memory or load it entirely for large files (I'm speaking about sharing files that could be tens of gigabytes per file, where the blob of hashes could be tens of megabytes for such a file).

@arvidn
Owner Author

arvidn commented Mar 12, 2020

I still don’t see your point. You are describing the current implementation. My point with this issue is that it’s using too much RAM

@verdy-p

verdy-p commented Mar 12, 2020

Your padding algorithm is incorrect: for a tree with 9 file blocks, you need 9 leaf nodes; OK, there's no padding for that level, which could store up to 16 nodes.
But the parent level does not require 8 nodes, it just requires ceil(9/2)=5 nodes.
Its parent requires ceil(9/4)=3 nodes.
Its parent requires ceil(9/8)=2 nodes.
Its (root) parent requires ceil(9/16)=1 node.

You can save ALL paddings at every level !

This is a "worst case" (the worst cases are when there are (2^N+1) leaf nodes, the best cases when there are 2^N leaf nodes), but still the total is 20 nodes, NOT 24 as stated incorrectly above by xnoreq. When N grows to infinity, this worst case goes towards +100% for storing the parent nodes. You then have (for a tree of degree 2):

  • +0% for the 0 parent nodes of 1 leaf node, which is also the root (total nodes=1).
  • +50% for the 1 parent node (the root) of 2 leaf nodes (total nodes=3).
  • +100% for the 3 parent nodes (including the root) of 3 leaf nodes (total nodes=6).
  • +75% for the 3 parent nodes (including the root) of 4 leaf nodes (total nodes=7).
  • +120% for the 6 parent nodes (including the root) of 5 leaf nodes (total nodes=11).
  • +100% for the 6 parent nodes (including the root) of 6 leaf nodes (total nodes=12).
  • +100% for the 7 parent nodes (including the root) of 7 leaf nodes (total nodes=14).
  • +87.5% for the 7 parent nodes (including the root) of 8 leaf nodes (total nodes=15).
  • +122.2% for the 11 parent nodes (including the root) of 9 leaf nodes (total nodes=20).
  • etc.

And as I stated, leaf nodes (because they represent smaller sections of files) do not require the same hash length as parent nodes. If you want to further reduce the overhead of parent nodes, you can reduce the hash size of leaf nodes (SHA-1, or truncated SHA-2, is safe for small 4KB blocks for all files that are larger than 4KB, given that you still use SHA-2 for their parent nodes in the merkle tree). This just requires that, when indexing, file blocks are hashed in parallel with SHA-1 and SHA-2, but you'll store only SHA-1 in the leaf nodes, and use SHA-2 for all other parent nodes, or for the root node, which is also a leaf node when the file size is lower than 4KB.

An additional benefit is that having two different hashing algorithms also dramatically increases the resistance to collisions (better than using SHA-2 alone) if there are known patterns that are attackable with a single algorithm, so you also gain in terms of security (this is like when files are published with two hashes, SHA-1 and MD5: MD5 is known to have low resistance, SHA-1 as well, but not their combination). In a merkle tree you're ready to use two hashing algorithms in parallel; you just don't use the same algorithm for the different levels. You just have to use the same algorithm inside the same level of the tree, and you don't need to store the extra parallel hashes at higher levels, because they are replaced by the hashes at lower levels.

Now if you have two different algorithms (SHA-1 for leaves only, and SHA-2 for all other parents), it is very safe to increase the degree of the tree, and you can save more on the total size for parent nodes. If leaf nodes cover 4KB and parent nodes cover at least 64KB, the degree of the subtrees at the level above the leaves is 16: a single SHA-2 hash is used in the blob for each range of 16 leaf blocks hashed with SHA-1. For more levels (if they are needed) I suggest keeping the degree at 2 (and still using SHA-2 for combining the hashes of nodes at the next level), to preserve the minimum size to download from sources if a corruption is detected; and the combination function to use should be an HMAC, possibly keyed by the starting position in the file, based on the appropriate degree and the hashing function (SHA-2).

@arvidn
Owner Author

arvidn commented Mar 12, 2020

Your padding algorithm is incorrect

Now we're getting somewhere. It sounds like what you mean to say is "you can save some memory by not storing zero-hashes at the leaf layer".

There is no padding at any of the internal levels. What do you mean by saving padding at every level?
Are you proposing a different algorithm than the one that's defined in BEP 52?

In either case, I still don't see how this addresses the problem. You seem to assume the RAM usage is dominated by padding nodes/hashes. I don't think that's the case. Yes, it doesn't make sense to store zeroes, but I don't think that's going to make a big difference.

Do you see any reasons to not implement my ideas I outlined above?

@arvidn
Owner Author

arvidn commented Mar 12, 2020

And as I stated, leaf nodes (because they represent smaller sections of files) do not require the same hash length than parent nodes.

You mean in a world where BEP52 says something else, right?

If you want to discuss BEP52, please do so here.

@verdy-p

verdy-p commented Mar 12, 2020

And yes, I mean not storing any zero padding at ANY level! It is NEVER needed.

@xnoreq

xnoreq commented Mar 29, 2020

@verdy-p It's funny that you wrote pages upon pages to explain something trivial which I had already mentioned before, very succinctly, in two sentences:

Binary trees can be stored quite easily in breadth-first order in an array. Since the padding required in merkle trees is at the rightmost leaf nodes they don't need to be stored.

Also, yeah, intermediate nodes, if they were zero, would not need to be stored (which, by the way, directly contradicts the suggested storage; compare with your own indexing examples), but this is not the case in bittorrent anyway, as explained in BEP 30.

@arvidn arvidn mentioned this issue Apr 14, 2020
@arvidn
Owner Author

arvidn commented May 24, 2020

This is the main piece of addressing this: #4626

@arvidn arvidn closed this as completed Jun 2, 2020