diff --git a/content/developers/developer-patterns/massif-blob-offset-tables/index.md b/content/developers/developer-patterns/massif-blob-offset-tables/index.md index 6220bf1d9..420965ae8 100644 --- a/content/developers/developer-patterns/massif-blob-offset-tables/index.md +++ b/content/developers/developer-patterns/massif-blob-offset-tables/index.md @@ -11,7 +11,10 @@ toc: true --- -This page provides lookup tables for navigating the dynamic, but computable, offsets into the Merkle log binary format. The algorithms to reproduce this are relatively simple, we provide open source implementations, but in many contexts it is simpler to use these pre-calculations. These tables can be made for any log configuration at any time, in part or in whole, without access to any specific log. +This page provides lookup tables for navigating the dynamic, but computable, offsets into the Merkle log binary format. +The algorithms to reproduce this are relatively simple. +We provide open-source implementations, but in many contexts, it is simpler to use these pre-calculations. +These tables can be made for any log configuration at any time, in part or in whole, without access to any specific log. This is a fast review of the log format. We explain this in more detail at [Navigating the Merkle Log](/developers/developer-patterns/navigating-merklelogs) @@ -54,7 +57,8 @@ Using the `veracity` tool with the following command line we can reproduce our c -In the following table *Stack Start* and *mmr Start* are byte offsets from the start of the file. The leaf values are indices into the trie fields (not considered further in this page) and the node values are indices into the array of 32 byte nodes starting at *mmr Start* +In the following table *Stack Start* and *mmr Start* are byte offsets from the start of the file. 
+The leaf values are indices into the trie fields (not considered further in this page) and the node values are indices into the array of 32-byte nodes starting at *mmr Start*.
 | Massif | Stack Start| mmr Start | First leaf | Last Leaf | First Node | Last Node | Peak Stack |
 | -------| ---------- | --------- | ---------- | ---------- | ----------- | --------- | --------- |
@@ -138,8 +142,8 @@ Stack Start needs details from [Navigating the Merkle Log](/developers/developer
 ## The algorithms backing the table generation
-In combination with the format inforation at [Navigating the Merkle Log](/developers/developer-patterns/navigating-merklelogs)
-the pre-computed tables above can be generated using these examples. We provide open source, go-lang based, tooling to do this.
+In combination with the format information at [Navigating the Merkle Log](/developers/developer-patterns/navigating-merklelogs) the pre-computed tables above can be generated using these examples.
+DataTrails provides open-source, Go-based tooling at [URL] (_[__]__ )
 {{< tabs name="convert idtimestamp" >}}
 {{< tab name="Leaf Count and Massif Index" >}}
diff --git a/content/developers/developer-patterns/navigating-merklelogs/index.md b/content/developers/developer-patterns/navigating-merklelogs/index.md
index b90fad302..25a57c9df 100644
--- a/content/developers/developer-patterns/navigating-merklelogs/index.md
+++ b/content/developers/developer-patterns/navigating-merklelogs/index.md
@@ -13,11 +13,13 @@
 - /docs/beyond-the-basics/navigating-merklelogs/
 ---
-Data Trails publishes the data necessary for verifying your events immediately to publicly readable and highly available commodity cloud storage. We call this verifiable data your *log* or *transparency log*.
+DataTrails publishes the data necessary for verifying events immediately to publicly readable and highly available commodity cloud storage.
+We call this verifiable data your *log* or *transparency log*.
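The *First leaf*, *Last Leaf*, *First Node* and *Last Node* columns of the pre-calculated table above can be reproduced from the massif height alone. A minimal Python sketch, assuming the standard MMR identity that a log of `n` leaves contains `2n - popcount(n)` nodes (function names here are illustrative, not part of the DataTrails tooling):

```python
# Illustrative sketch: reproduce the leaf/node index columns of the
# pre-calculated offsets table for the standard massif height.
MASSIF_HEIGHT = 14
LEAVES_PER_MASSIF = 1 << (MASSIF_HEIGHT - 1)  # 8192 leaves per massif

def mmr_node_count(leaf_count: int) -> int:
    # An MMR accumulating n leaves contains 2n - popcount(n) nodes.
    return 2 * leaf_count - bin(leaf_count).count("1")

def massif_row(massif_index: int) -> dict:
    first_leaf = massif_index * LEAVES_PER_MASSIF
    last_leaf = first_leaf + LEAVES_PER_MASSIF - 1
    return {
        "first_leaf": first_leaf,
        "last_leaf": last_leaf,
        # all nodes of earlier massifs come first, so this massif's first
        # node index equals the node count of the log before this massif
        "first_node": mmr_node_count(first_leaf),
        "last_node": mmr_node_count(last_leaf + 1) - 1,
    }
```

For massif 0 this gives leaves 0 through 8191 and nodes 0 through 16382, consistent with the 16383 nodes described for the first massif later in this page.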
-Once verifiable data is written to this log we never change it. The log only grows, it never shrinks and data in it never moves.
+Once verifiable data is written to this log, we never change it.
+The log only grows; it never shrinks, and data in it never moves.
-We provide extensive open source tooling to work with this format in an off line setting.
+[DataTrails provides extensive open-source tooling]() to work with this format in an offline setting.
 To take advantage of this you will need:
@@ -36,24 +38,34 @@ If you already know the basics, and want a straight forward way to deal with the
      | massif 0 |      | massif 1 |     | massif n
 +----------------+ +----------------+ .. +-----------+
-What is a massif ? In this context it means a [group of mountains that form a large mass](https://www.oxfordlearnersdictionaries.com/definition/american_english/massif). This term is due to the name of the verifiable data structure we use for the log: An MMR or "Merkle Mountain Range" [^1].
+What is a massif?
+In this context, it means a [group of mountains that form a large mass](https://www.oxfordlearnersdictionaries.com/definition/american_english/massif).
+This term is due to the name of the verifiable data structure used for the log: An MMR or "Merkle Mountain Range" [^1].
-[^1]: Merkle Mountain Ranges have seen extensive use in systems that need long term tamper evident storage. Notably [zcash](https://zips.z.cash/zip-0221), [mimblewimble](), and [many others](https://zips.z.cash/zip-0221#additional-reading). The name is
- due originally to [Peter Todd](https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2016-May/012715.html), though much parallel invention has occurred. They have been independently analysed in the context of [cryptographic asynchronous accumulators](https://eprint.iacr.org/2015/718.pdf), Generalised multi proofs for [Binary Numeral Trees](https://eprint.iacr.org/2021/038.pdf).
And also by the [ethereum research community](https://ethresear.ch/t/batching-and-cyclic-partitioning-of-logs/536).
+[^1]: Merkle Mountain Ranges have seen extensive use in systems that need long term tamper evident storage, notably [zcash](https://zips.z.cash/zip-0221), [mimblewimble](), and [many others](https://zips.z.cash/zip-0221#additional-reading).
+Merkle Mountain Ranges are attributed to [Peter Todd](https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2016-May/012715.html), though much parallel invention has occurred.
+They have been independently analyzed in the context of [cryptographic asynchronous accumulators](https://eprint.iacr.org/2015/718.pdf), of generalised multi-proofs for [Binary Numeral Trees](https://eprint.iacr.org/2021/038.pdf), and by the [ethereum research community](https://ethresear.ch/t/batching-and-cyclic-partitioning-of-logs/536).
-Each massif contains the verifiable data for a fixed number, and sequential range, of your events. The number of events is determined by your log configuration parameter `massif height`. Currently all logs have a massif height of `14` And the number of event *leaf* log entries in each massif is 2<sup>height-1</sup>, which is 2<sup>14-1</sup> leaves, which is `8192` leaves [^2].
+Each massif contains the verifiable data for a fixed number, and sequential range, of your events.
+The number of events is determined by your log configuration parameter `massif height`.
+Currently, all logs have a massif height of `14`, and the number of event *leaf* log entries in each massif is 2<sup>height-1</sup> = 2<sup>14-1</sup>, which is `8192` leaves [^2].
-[^2]: From time to time, we may re-size your massifs. We are able to do this without impacting the verifiability of the contained data and without invalidating your previously cached copies taken from the earlier massif size configuration.
Simple binary file compare operations can show that the verifiable data for the new configuration is identical to that in the original should you wish to assure your self of this fact. +[^2]: Sometimes, DataTrails may re-size your massifs. +We can do this without impacting the verifiability of the contained data and without invalidating your previously cached copies taken from the earlier massif size configuration. +Simple binary file compare operations can show that the verifiable data for the new configuration is the same as in the original should you wish to assure yourself of this fact. Here, we have drawn `massif n` as open ended to illustrate that the last massif is always in the process of being *appended* to. -[Massif Blob Pre-Calculated Offsets](/developers/developer-patterns/massif-blob-offset-tables) gives you a shortcut for picking the right massif. It can also be fairly easily computed from only the `merklelog_entry.commit.index` *mmrIndex* on your event using the example javascript on that page. +[Massif Blob Pre-Calculated Offsets](/developers/developer-patterns/massif-blob-offset-tables) gives you a shortcut for picking the right massif. +It can also be fairly easily computed from only the `merklelog_entry.commit.index` *mmrIndex* on your event using the example javascript on that page. Here we deal with the format of a single massif. ## Every massif blob is a series of 32 byte aligned fields -Every massif in a log is structured as a series of `32` byte fields. All individual entries in the log are either exactly 32 bytes or a small multiple of `32` +Every massif in a log is structured as a series of `32` byte fields. +All individual entries in the log are either exactly 32 bytes or a small multiple of `32`. ``` 0 32 @@ -77,7 +89,8 @@ This is a simple reverse proxy to the native azure blob store where your logs ar **https://jitavidfd1103b1099ab3aa.blob.core.windows.net/merklelogs/v1/mmrs/tenant/** -Each massif is stored in a numbered file. 
The filename is the 16 character, zero padded, massif index.
+Each massif is stored in a numbered file.
+The filename is the 16-character, zero-padded, massif index.
 ## When re-creating inclusion proofs, you are guaranteed to only need a single massif
@@ -102,7 +115,6 @@ We provide convenience look up tables for these [Massif Blob Pre-Calculated Offs
 As mentioned above, we provide implementations of the algorithms needed to produce those tables in many languages under an MIT license.
-
 ## The first 32 byte field in every massif is the sequencing header
 Using the following curl command, you can read the version and format information from the header field 0
@@ -135,7 +147,8 @@ You can see from the hex data above, that the idtimestamp of the last entry in t
 ### Decoding an idtimestamp
-The idtimestamp is 40 bits of time at millisecond precision. The idtimestamp in the header field is always set to the idtimestamp of the most recently added leaf.
+The idtimestamp is 40 bits of time at millisecond precision.
+The idtimestamp in the header field is always set to the idtimestamp of the most recently added leaf.
 {{< tabs name="convert idtimestamp" >}}
 {{< tab name="Python" >}}
@@ -158,8 +171,10 @@ In this example, the last entry in the log (at that time) was 2024/03/28, a litt
 ## The trieData entries are 512 bytes each and are formed from two fields
-The trieData section is 2 * 32 * 2<sup>height</sup> bytes long. (Which is actually exactly double what we need). For the
-standard massif height of 14, it has 8192 entries in the first 524288 bytes. The subsequent 524288 which will always be zero. The format of each entry is then, for a massif height of 14:
+The trieData section is 2 * 32 * 2<sup>height</sup> bytes long (which is exactly double what we need).
+For the standard massif height of 14, it has 8192 entries in the first 524288 bytes.
+The subsequent 524288 bytes will always be zero.
+The format of each entry is then, for a massif height of 14: ``` +----------------+ @@ -191,9 +206,11 @@ SHA256(BYTE(0x00) || BYTES(idTimestamp) || event.identity) Note that the idtimestamp is unique to your tenant and the wider system, so even when sharing events with other tenants, this will not correlate directly with activity in their logs. -If you have the event record from our Events API, the idtimestamp is found at `merklelog_entry.commit.idtimestamp`. It is a hex string and prefixed with `01` which is the epoch from the header. +If you have the event record from the Events API, the idtimestamp is found at `merklelog_entry.commit.idtimestamp`. +It is a hex string and prefixed with `01` which is the epoch from the header. -To condition the string value, strip the leading `01` and convert the remaining hex to binary. Then substitute those bytes, in presentation order, for idTimestamp above. +To condition the string value, strip the leading `01` and convert the remaining hex to binary. +Then substitute those bytes, in presentation order, for idTimestamp above. Reworking our python example above to deal with the epoch prefix would look like this: @@ -223,25 +240,33 @@ h.update(bytes.fromhex("018e84dbbb6513a6"[2:])) h.update("assets/31de2eb6-de4f-4e5a-9635-38f7cd5a0fc8/events/21d55b73-b4bc-4098-baf7-336ddee4f2f2".encode()) h.hexdigest() ``` -The variable portion *for the first massif* contains exactly *16383* MMR *nodes*. Of those nodes, *8192* are the leaf entries in the Merkle tree corresponding to your events. +The variable portion *for the first massif* contains exactly *16383* MMR *nodes*. +Of those nodes, *8192* are the leaf entries in the Merkle tree corresponding to your events. -When a massif is initialized the trieData is pre populated for all leaves and set to all zero bytes. As events are recorded in the log, the zero padded index is filled in. A sub range of field 0 will change when we save the last idtimestamp in it. 
The mmr node values are strictly only ever appended to the blob. Once appended they will never change and they will never move. +When a massif is initialized the trieData is pre-populated for all leaves and set to all zero bytes. +As events are recorded in the log, the zero-padded index is filled in. +A sub-range of field 0 will change when saving the last idtimestamp in it. +The mmr node values are strictly only ever appended to the blob. +Once appended they will never change and they will never move. If you know the byte offset in the blob for the start of the mmr data then you can check the number of mmr nodes currently in it by doing `(blobSize - mmrDataStart)/32`. ## The peak stack and mmr data sizes are computable. -Please see [Massif Blob Pre-Calculated Offsets](/developers/developer-patterns/massif-blob-offset-tables) if you want to avoid needing to calculate these. Implementations of the O(log base 2 n) algorithms are provided in various languages. They all have very hardware sympathetic implementations. +See [Massif Blob Pre-Calculated Offsets](/developers/developer-patterns/massif-blob-offset-tables) to avoid needing to calculate these. +Implementations of the O(log base 2 n) algorithms are provided in various languages. +They all have very hardware-sympathetic implementations. ## The massif height is constant for all blobs in a log *configuration* For massif height 14, the fixed size portion is `1048864` bytes. -All massifs in a log are guaranteed to be the same *height*. If your log is re-configured having first been available at +All massifs in a log are guaranteed to be the same *height*. 
+If your log is re-configured having first been available at `https://app.datatrails.ai/verifiabledata/merklelogs/v1/mmrs/tenant/72dc8b10-dde5-43fe-99bc-16a37fd98c6a/0/` -Then on re-configuration it will become available (without downtime) at +Then on re-configuration it will become available (without downtime) at: `https://app.datatrails.ai/verifiabledata/merklelogs/v1/mmrs/tenant/72dc8b10-dde5-43fe-99bc-16a37fd98c6a/1/` @@ -251,10 +276,8 @@ And the previous path will no longer receive any additions. For the forseeable future (at least months) we don't anticipate needing to do this. {{< /note >}} - ## How to read a specific mmr node by its *mmrIndex* - Find the smallest "Last Node" in [Massif Blob Pre-Calculated Offsets](/developers/developer-patterns/massif-blob-offset-tables) that is greater than your *mmrIndex* and use that row as your massif index Then taking massif index 0 (row 0) for example, and using the first mmrIndex for ease of example @@ -282,11 +305,13 @@ go run veracity/cmd/veracity/main.go -s jitavidfd1103b1099ab3aa \ a45e21c14ee5a0d12d4544524582b5feb074650e6bb2b31ed9a3aeffe4883099 ``` -The example javascript routines bellow the [Massif Blob Pre-Calculated Offsets](/developers/developer-patterns/massif-blob-offset-tables) can be used if you want to accomplish this computationally +The example javascript routines below the [Massif Blob Pre-Calculated Offsets](/developers/developer-patterns/massif-blob-offset-tables) can be used to accomplish this computationally. ## But which nodes would I want ? -Typically, you would be verifying the inclusion of an event in the log. This inclusion is verified by selecting the sibling path needed to recreate the root hash starting from your leaf hash. You create your leaf hash using the original pre-image data of your event and the *commit* values assigned to it when it was included in the log. +Typically, you would be verifying the inclusion of an event in the log. 
+This inclusion is verified by selecting the sibling path needed to recreate the root hash starting from your leaf hash.
+You create your leaf hash using the original pre-image data of your event and the *commit* values assigned to it when it was included in the log.
 You would have:
@@ -296,15 +321,19 @@ You would have:
 We are going to give the subject of determining the sibling path its own article. Here we are going to set the scene by covering how our logical tree nodes map to storage.
-So what is a sibling path ? To understand this we need to dig into how we organize the nodes in your Merkle log in storage and memory.
+So what is a sibling path?
+To understand this we need to dig into how we organize the nodes in your Merkle log in storage and memory.
 ## The tree maps to storage like this
 Merkle trees, at there heart, *prove* things by providing paths of hashes that lead to a single *common root* for all nodes in the tree.
-All entries in a Merkle log each have a unique and *short* path of hashes, which when hashed together according to the data structures rules, will re-create the same root. If such a path does not exist, then by definition the leaf is not included - it is not in the log.
+Every entry in a Merkle log has a unique and *short* path of hashes which, when hashed together according to the data structure's rules, will re-create the same root.
+If such a path does not exist, then by definition the leaf is not included: it is not in the log.
-Where do those paths come from ? They come from adjacent and ancestor nodes in the hierarchical tree. And this means that when producing the path we need to access nodes throughout the tree to produce the proof.
+Where do those paths come from?
+They come from adjacent and ancestor nodes in the hierarchical tree.
+This means that producing a proof requires access to nodes throughout the tree.
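Once a sibling path is in hand, the verification step itself is mechanical: hash upward from the leaf, combining with each sibling in turn, and compare the result with the root. The sketch below shows only this generic pattern; DataTrails' exact node-hashing scheme (ordering and domain separation) is not reproduced here:

```python
import hashlib

def verify_path(leaf_hash: bytes, path: list, root: bytes) -> bool:
    """Hash up a sibling path; each entry is (sibling_hash, sibling_is_left).

    Generic Merkle path verification only; the concrete hashing rules of any
    particular log format must be substituted for the sha256 concatenation
    used here.
    """
    acc = leaf_hash
    for sibling, sibling_is_left in path:
        if sibling_is_left:
            acc = hashlib.sha256(sibling + acc).digest()
        else:
            acc = hashlib.sha256(acc + sibling).digest()
    return acc == root
```

If any sibling on the path is wrong, or sides are swapped, the recomputed root differs and verification fails.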
 Using our "canonical" mmr log for illustration, we get this diagram
@@ -329,14 +358,19 @@ A very nice visualization of how the peaks combine is available in this paper on
 A specific challenge for log implementations with very large data sets is answering "how far back" or "how far forward" may I need to look?
-MMRs differ from classic binary Merkle trees in how the incomplete sub trees are combined into a common root. For an MMR, the common root is defined by an algorithm for combining the, adjacent, incomplete sub trees. Rather than by the more traditional, temporary, assignment of un-balanced siblings. Such un-balanced siblings would later have to be re-assigned (balanced) when there
-were sufficient leaves to merge the sub trees.
+MMRs differ from classic binary Merkle trees in how the incomplete sub-trees are combined into a common root.
+For an MMR, the common root is defined by an algorithm for combining the adjacent, incomplete sub-trees, rather than by the more traditional, temporary assignment of un-balanced siblings.
+Such un-balanced siblings would later have to be re-assigned (balanced) when there were sufficient leaves to merge the sub-trees.
 This detail is what permits us to publish the log data immediately that your events are added.
-So the specific properties of Merkle Mountain Ranges lead to an efficiently computable and stable answer to the questions of "which other nodes do I need". Such that we *know* categorically we do not need to look *forward* of the current massif and further we know precisely which nodes we need from the previous massifs.
+So the specific properties of Merkle Mountain Ranges lead to an efficiently computable and stable answer to the question of "which other nodes do I need".
+We *know* categorically that we do not need to look *forward* of the current massif, and we know precisely which nodes we need from the previous massifs.
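The nodes needed from previous massifs are the peaks of the earlier sub-trees, and their positions depend only on the accumulated leaf count. A sketch of the peak computation (illustrative, not the production implementation): decompose the leaf count into perfect sub-trees, largest first; a perfect tree of `s` leaves occupies `2s - 1` nodes and its peak is its last node.

```python
def peak_indices(leaf_count: int) -> list:
    """Return the mmr indices of the peaks for a log of leaf_count leaves."""
    peaks, node_offset, remaining = [], 0, leaf_count
    while remaining:
        # largest power of two <= remaining leaves: a perfect sub-tree
        size = 1 << (remaining.bit_length() - 1)
        node_offset += 2 * size - 1  # nodes occupied by that perfect sub-tree
        peaks.append(node_offset - 1)  # its peak is its last node
        remaining -= size
    return peaks
```

For example, 6 accumulated leaves give peaks `[6, 9]` and 8 give `[14]`, matching the peak stacks shown for massifs 3 and 4 in the worked example later on this page.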
-The "free nodes" in the alpine zone always require "ancestors" from previous nodes when producing inclusion proofs that pass through them, and when adding new nodes to the end of the log. Here we can see they are very predictable (can be calculated without reference to the tree data). We accumulate these peaks in a stack because the pop order is the order we need them when adding leaves at the end of the log.
+The "free nodes" in the alpine zone always require "ancestors" from previous nodes when producing inclusion proofs that pass through them, and when adding new nodes to the end of the log.
+Here we can see they are very predictable (can be calculated without reference to the tree data).
+We accumulate these peaks in a stack because the pop order is the order we need them when adding leaves at the end of the log.
 The result can be visualized like this
@@ -350,7 +384,8 @@ h=2 1 | 2 | 5 | 9 | 12 | 17 | 21 | <-- massif 'tree line'
 0 |0 1 3 4| 7 8|10 11|15 16|18 19|
 ```
-We call the look back nodes the *peak stack*, because it always corresponds to the peaks of the earlier sub trees. We don't actually pop things off it ever, we just happen to reference it in reverse order of addition when adding new leaves.
+We call the look-back nodes the *peak stack*, because it always corresponds to the peaks of the earlier sub-trees.
+We never actually pop entries off it; we just reference it in reverse order of addition when adding new leaves.
 The stability of the MMR data comes from the fact that the sub trees are not merged until a right sibling tree of equal height has been produced.
@@ -442,7 +477,8 @@
 ### massif 4
-Note that this case is particularly interesting because it completes a full cycle from one perfect power sized tree to the next. It is a fact of the MMR construction that the look back is never further than the most recent 'perfect' tree completing massif.
+Note that this case is particularly interesting because it completes a full cycle from one perfect power-sized tree to the next.
+It is a fact of the MMR construction that the look-back never reaches further than the most recent massif completing a 'perfect' power-sized tree.
 The peak stack is [14]
@@ -468,4 +504,4 @@ The peak stack is [14]
 * The "look back" nodes needed to make each massif self contained are deterministic and are filled in when a new massif is started.
 * The dynamically sized portions of the format are all computable, but we offer pre-calculated tables for convenience.
 * Opensource tooling exists in multiple languages for navigating the format.
-* Once you have a signed "root", all entries in any copies of your log are irrefutably attested by Data Trails
+* Once you have a signed "root", all entries in any copies of your log are irrefutably attested by DataTrails
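The massif-selection rule from "How to read a specific mmr node by its *mmrIndex*" reduces to a scan of the *Last Node* column of the pre-calculated table. A sketch; the `LAST_NODES` values below are derived from the `2n - popcount(n)` node-count identity for massif height 14 and are illustrative, and the comparison used is "greater than or equal" so that the final node of a massif resolves to that massif:

```python
# "Last Node" column for the first few massifs at height 14 (8192 leaves per
# massif): an MMR of n leaves has 2n - popcount(n) nodes, so massif m ends at
# the node count for (m+1)*8192 leaves, minus 1. Extend with more rows as needed.
LAST_NODES = [16382, 32766, 49149, 65534]

def massif_for_mmr_index(mmr_index: int) -> int:
    """Return the index of the massif blob containing mmr_index."""
    for massif_index, last_node in enumerate(LAST_NODES):
        if mmr_index <= last_node:
            return massif_index
    raise ValueError("mmrIndex beyond the table; extend LAST_NODES")
```

Node `16382` resolves to massif 0 and node `16383` to massif 1, consistent with the first massif containing nodes 0 through 16382.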