Skip to content

Latest commit

 

History

History
113 lines (72 loc) · 10.5 KB

frc-0069.md

File metadata and controls

113 lines (72 loc) · 10.5 KB
fip title author discussions-to status type created
0069
Piece Multihash
Adin Schmahmann (@aschmahmann), Peter Rabbitson (@ribasushi)
Draft
FRC
2023-07-26

FRC-0069: Piece Multihash and v2 Piece CID

Simple Summary

Introduces an alternative CID representation for the FR32 padded sha256-trunc254-padded binary merkle trees used in Filecoin Piece Commitments (i.e. CommP). In it we use the Raw codec and a new sha2-256-trunc254-padded-binary-tree multihash (or piece multihash) rather than the Fil-commitment-unsealed codec and the sha2-256-trunc254-padded multihash.

Abstract

This specification describes a new way to reference Pieces using CIDs.

While the current way of describing Piece CIDs (i.e. with the fil-commitment-unsealed codec and sha2-256-trunc254-padded multihash) has been useful for some purposes, in practice it turns out to be much more useful to be able to express the tuple of (Root hash, size of tree) rather than just the root hash.

For example, in much of the relevant portions of the Filecoin spec and lotus' Go code, Piece CIDs are not used on their own but rather are combined with the piece size in the Piece Info type. One such example of this is in the Storage Proving Subsystem.

This makes it much more natural to work with new concepts like Deal Aggregates, as proposed in FRC-0058.

To resolve this we introduce a new multihash type fr32-sha2-256-trunc254-padded-binary-tree multihash which combines the root hash with the tree height and the amount of padding of the data with zeros such that the result is a full and balanced binary tree after the fr32 padding is applied. If this multihash were to be used to reference the full piece as understood by the Filecoin on-chain consensus mechanism, this would involve the special case where a padding of zero is used.

Specification

A CIDv1 requires a:

  • codec
  • multihash

fr32-sha2-256-trunc254-padded-binary-tree multihash

The core component introduce in this specification is a new multihash type fr32-sha2-256-trunc254-padded-binary-tree multihash.

The multihash code for this type is 0x1011 as identified in the multicodec code table.

The digest for the multihash is a variable number of bytes.

It can be roughly described as uvarint padding | uint8 height | 32 byte root data where | means concatenation.

  • The first bytes are a uvarint of the number of bytes needed to pad the underlying data such that after FR32 padding it will be a full binary tree
    • If the data is < 127 bytes then it is padded to 127 bytes. For example, if the data is of size 12 the padding will be 127-12 = 115
    • For example, if the data is of size 254 (i.e. 2^i * 127/128, where i is a positive integer >= 7) then this will be 0
    • For example, if the data is of size 256 then the padding is 508-256=252
    • Note: because the unsigned-varint spec currently has a maximum representable size of 2^63-1 (and 9 bytes to represent the varint) this puts a cap on the maximum size of data representable by this multihash as well
    • Note: because this is padding data it must be less than the size of the underlying data (with the exception of data less than 127 bytes which is always padded up to 127 bytes)
  • The next byte defines the height of the tree
    • For example if the first byte is 0 the tree is a single 32-byte node

    • Similarly, if the first byte is 30 then the tree is 30 levels deep which, since the leaves must be 32 bytes, represents a piece of size 32*2^30 bytes = 32GiB.

  • The last 32 bytes are the value at the root of the binary tree

Note: This structure is such that the last 33 bytes of the multihash are uint 8 height | 32 byte root data which is the data that the proofs underlying Filecoin consensus can attest to (i.e. the proofs don't know how much of the padding is padding vs zeros that are part of the user data).

Raw codec

There is currently no specific semantics associated with what data must go into a Piece. Therefore, we use the Raw codec.

Design Rationale

The overall design rationale was guided by the desire to:

  1. have a CID (Content IDentifier) that contains enough information to work with a Piece
  2. be an easy conversion with major uses of Piece CIDs today (which tend to move along with sizes)
  3. be simple to understand and implement
  4. have a Piece CID that reasonably fits the IPLD Data Model

The first criteria is met by embedding size information in the CID (via the multihash digest). The second is met by being able to convert between v1 and v2 Piece CIDs. The third is met by the spec implementation being tiny. The fourth is met by using the codec to note that the bytes in the Piece are of an unknown type, but can be verified with the given multihash.

Backwards Compatibility

This FRC describes an alternative CID format for referencing Pieces. However, this is not a FIP for changing how CommPs are stored or used on chain, but rather a more user-friendly way of referencing Pieces which can be used in various non-chain interactions (e.g. piece-based retrieval, smart contract interactions, etc.).

A tuple of (v1 Piece CID, Piece size) can be converted into a valid v2 Piece CID and similarly a v2 Piece CID can be converted into the tuple of (v1 Piece CID, Piece size).

Test Cases

Take data of size 127*4 bytes where the first 127 bytes are 0, the next 127 are 1, the next 127 are 2, and the last 127 are 3. The v1 piece CID of this data would be baga6ea4seaqes3nobte6ezpp4wqan2age2s5yxcatzotcvobhgcmv5wi2xh5mbi. The multihash-based piece CID would be bafkzcibcaaces3nobte6ezpp4wqan2age2s5yxcatzotcvobhgcmv5wi2xh5mbi. With a base16 multibase this would be f01559120220004496dae0cc9e265efe5a006e80626a5dc5c409e5d3155c13984caf6c8d5cfd605 or equivalently ((multibase = f) | (CIDv1 prefix = 0x01) | (Raw codec = 0x55) | (fr32-sha2-256-trunc254-padded-binary-tree multihash encoded varint = 0x9120) | (length of digest = 0x22) | (amount of data padding = 0x00) | (tree height = 0x04) | (underlying hash digest = 496...605)) Take payload of size 0 bytes. There MUST be 127 bytes of padding. The v2 piece CID MUST be bafkzcibcp4bdomn3tgwgrh3g532zopskstnbrd2n3sxfqbze7rxt7vqn7veigmy.

Take payload of 127 bytes where all bytes are 0. There MUST be 0 bytes of padding. The hight MUST be 2. The v2 piece CID MUST be bafkzcibcaabdomn3tgwgrh3g532zopskstnbrd2n3sxfqbze7rxt7vqn7veigmy.

Take payload of 128 bytes where all bytes are 0. There MUST be 126 bytes of padding. The height MUST be 3. The v2 piece CID MUST be bafkzcibcpybwiktap34inmaex4wbs6cghlq5i2j2yd2bb2zndn5ep7ralzphkdy Given the piece CID v1 of the empty 32 GiB piece (i.e. 32 * 2^30 * 127/128 bytes of zeros)baga6ea4seaqao7s73y24kcutaosvacpdjgfe5pw76ooefnyqw4ynr3d2y6x2mpq the corresponding mutlihash piece CID would be bafkzcibcaapao7s73y24kcutaosvacpdjgfe5pw76ooefnyqw4ynr3d2y6x2mpq

Given the piece CID v1 of the empty 64 GiB piece (i.e. 64 * 2^30 * 127/128 bytes of zeros)baga6ea4seaqomqafu276g53zko4k23xzh4h4uecjwicbmvhsuqi7o4bhthhm4aq the corresponding piece mutlihash piece CID would be bafkzcibcaap6mqafu276g53zko4k23xzh4h4uecjwicbmvhsuqi7o4bhthhm4aq

Take data of size 127*8 bytes where the first where the first 127 bytes are 0, the next 127 are 1, the next 127 are 2, the next 127 are 3 and the remaining 127*4 bytes are 0. There is no padding so the v1 piece CID would be baga6ea4seaqn42av3szurbbscwuu3zjssvfwbpsvbjf6y3tukvlgl2nf5rha6pa and the v2 would be bafkzcibcauan42av3szurbbscwuu3zjssvfwbpsvbjf6y3tukvlgl2nf5rha6pa.

Take data of size 128*4 bytes where the first where the first 127 bytes are 0, the next 127 are 1, the next 127 are 2, the next 127 are 3 and the remaining 4 bytes are 0. There is 504 bytes of padding needed and so the v1 piece CID is baga6ea4seaqn42av3szurbbscwuu3zjssvfwbpsvbjf6y3tukvlgl2nf5rha6pa (notice it's the same as above) and the v2 is bafkzcibd7abqlxticxolgseegik2stpfgkkuwyf6kufex3doorkvmzpjuxwe4dz4.

Take the data above and append one more zero (i.e. 127 0s, 127 1s, 127 2s, 127 3s, 5 0s). There are 503 bytes of padding needed and so the v1 piece CID is baga6ea4seaqn42av3szurbbscwuu3zjssvfwbpsvbjf6y3tukvlgl2nf5rha6pa (notice it's the same as above) and the v2 is bafkzcibd64bqlxticxolgseegik2stpfgkkuwyf6kufex3doorkvmzpjuxwe4dz4.

Security Considerations

Does not impact core Filecoin security.

Incentive Considerations

Does not impact current incentive systems.

Product Considerations

Using Piece Multihash CIDs as defined in this FRC would be used when working with Piece CIDs including in the Piece Retrieval FRC.

This also enables the use of IPFS-based tooling for moving around Pieces, with the same guarantees around content-addressing, verifiability, etc. In this way the new Piece multihash operates in effectively the same way as Blake3 (another merkle-tree based hash) does.

Implementation

Copyright

Copyright and related rights waived via CC0