-
Notifications
You must be signed in to change notification settings - Fork 5.8k
BIP93: Generalize codex32 format for any hrp and fix typos #2040
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Clarify codex32 format for different hrp values, specify master seed encoding standard, add new test vectors and enhance readability.
bip-0093.mediawiki
Outdated
| errors. The human-readable part is processed by first | ||
| feeding the higher bits of each character's US-ASCII value into the | ||
| checksum calculation followed by a zero and then the lower bits of each<ref>'''Why are the high bits of the human-readable part processed first?''' | ||
| This results in the actually checksummed data being ''[high hrp] 0 [low hrp] [data]''. This means that under the assumption that errors to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lengths limitations of the codex32 strings are working under the assumption that the HRP is not subject to error correction. We more or less cannot do that anyways as all sorts of various bech32 formats have appeared all with different checksums and characteristics. In order to run the checksum algorithm you have to know the prefix first in order to know which checksum algorithm to try.
This isn't really a problem in practice since there are only a small finite number of prefixes, and from context only a few are going to be applicable anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was copied over from BIP-00173. Delete it?
Bech32 attempts to decode two checksums, a universal bech32 decoder could try decoding the string with the bech32, bech32m and codex32 checksums to discover the format.
Unless covering the HRP exceeds the max length at HD=9, 2 subsitutions in the HRP will always be detected by every format.
If HRP is swapped between formats the chances of false verification is:
-
1 in 2^65 for a "codex32 checksum" validating when the encoding was Bech32/Bech32m
-
~1 in 2^30 for "Bech32 checksum" validating when the encoding was Codex32.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This text was certainly the design goal of BIP-173, but we are not using their checksum, and we haven't realized this part of their design in codex32 in part because our 13 character checksum unfortunately works only on relatively short strings.
Instead we process this HRP in this way because that is what BIP-173 does, and we still want the HRP to change the residue to catch random errors, so me might as well do it in the standard way.
Unless covering the HRP exceeds the max length at HD=9, 2 subsitutions in the HRP will always be detected by every format.
The problem is that our particular 13 character checksum's max length for its error detection and correction properties is limited to 93 bech32 characters. That's why our payload is limited to 74 characters add in 13 character checksum and 6 characters for the header and we get 93 bech32 characters, with nothing left over to detect or correct errors in the HRP. Yes, in cases where the payload is 72 characters or less, our error correction / detection properties extend to the low 5 bits of the ascii characters of a 2 character prefix, but that doesn't apply to 73 or 74 character payloads.
I don't know if we really want to get into these subtleties. I'm not even sure correcting and detecting errors in the HRP is useful to begin. If you are a hardware wallet expecting a master seed and someone gives you a "cl" codex32 string, you don't need a fancy error correction algorithm to detect the "cl" prefix is wrong; if it is a expecting a master seed then the "cl" prefix must be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will delete the footmark and say:
"The human-readable part is processed as per BIP-0173."
I updated the rationale accordingly:
At this length, the human-readable part is not covered by the checksum. This is acceptable because the checksum scheme itself requires you to know that a valid human-readable part is being used in the first place. If the prefix is damaged and a user is guessing that the data might be using this scheme, then the user can enter the available data explicitly using the suspected prefix.
I'm not even sure correcting and detecting errors in the HRP is useful to begin
wallets.md import guidance prefills the prefix in applications expecting only one to prevent mistakes.
A future application for extended keys (Is long codex32 HD=9 for 74 bytes?) has a situation where decoders need to accept both "xprv" and "xpub" HRPs in the same descriptor. So here it is absolutely useful to detect and correct errors in the HRP.
However we should not go into details about that until such an application needing to disambiguate different HRP actually exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does HD=9 mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hamming distance 9, able to detect 8 substitution errors. (and correct 4)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the 15 character checksum has HD=9 up to a maximum length 1023, supports data up to 1008 = (1023 - 15) characters and a payload of 1002 (=1008 - 6 header characters) bech32 characters. If you want to cover the HRP you lose payload capacity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plenty of room to long codex32 encode extended keys with covered human-readable parts "xpub", "xprv", "tprv" and "tpub" someday. They're horrible to handle and type in base58. Is this something we should think about in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran the numbers here and we must further restrict HRP lengths to 72 or 74 (same maximum as the payload). At len(hrp) = 74 two header characters two header characters aren't checked but what does a threshold or share index even mean for a 0 byte payload?
I also explicitly stated the maximum length for a codex32 string is 96. Although perhaps at length = 96 the HRP we assume "ms" while the max length for all other prefixes is 94?
bip-0093.mediawiki
Outdated
| As with bech32 strings, a codex32 string MUST be entirely uppercase or entirely lowercase. | ||
| For presentation, lowercase is usually preferable, but uppercase SHOULD be used for handwritten codex32 strings. | ||
| If a codex32 string is encoded in a QR code, it SHOULD use the uppercase form, as this is encoded more compactly. | ||
| The lowercase form is used when determining a character's value for checksum purposes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't make sense. The lowercase form and uppercase form of Bech32 characters have the same value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for HRP which needs to be lower cased during decoding or bech32_hrp_expand(hrp) would return a different result.
This line is repeated from the test vectors, why explain the rules about case in the vectors instead of up here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we should reword this to make it more clear that the relevance is for the HRP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"When constructing or verifying a checksum, the human-readable part MUST be interpreted in lowercase, as specified in BIP-0173."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might say "MUST be converted to lowercase" instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems to imply mutating the string when verifying a checksum.
Something BIP-93 omitted was that encoders should always emit lower-case strings. Did we relax that requirement?
Currently I have the sentence as:
"Encoders MUST emit lowercase; decoders MUST reject mixed-case and MUST lowercase the human-readable part during checksum verification."
And I am adding a section with a codex32_encode and codex32_decode definitions as I think it's easier to see these rules in code than english.
Uppercase/lowercase
The lowercase form is used when determining a character's value for checksum purposes.
Encoders MUST always output an all lowercase Bech32 string. If an uppercase version of the encoding result is desired, (e.g.- for presentation purposes, or QR code use), then an uppercasing procedure can be performed external to the encoding process.
Decoders MUST NOT accept strings where some characters are uppercase and some are lowercase (such strings are referred to as mixed case strings).
For presentation, lowercase is usually preferable, but inside QR codes uppercase SHOULD be used, as those permit the use of alphanumeric mode, which is 45% more compact than the normal byte mode.
bip-0093.mediawiki
Outdated
|
|
||
| * Secret share with index <code>S</code>: <code>MS100C8VSM32ZXFGUHPCHTLUPZRY9X8GF2TVDW0S3JN54KHCE6MUA7LQPZYGSFJD6AN074RXVCEMLH8WU3TK925ACDEFGHJKLMNPQRSTUVWXY06FHPV80UNDVARHRAK</code> | ||
| * Master secret (hex): <code>dc5423251cb87175ff8110c8531d0952d8d73e1194e95b5f19d6f9df7c01111104c9baecdfea8cccc677fb9ddc8aec5553b86e528bcadfdcc201c17c638c47e9</code> | ||
| unchecksummed string (bech32): <code>MS10C8VSM32ZXFGUHPCHTLUPZRY9X8GF2TVDW0S3JN54KHCE6MUA7LQPZYGSFJD6AN074RXVCEMLH8WU3TK925ACDEFGHJKLMNPQRSTUVWXY06F</code> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be included to remove this uncheckedsummed string. I'm really nervous displaying strings without a checksum anywhere. They are very problematic.
If you insist on going into this much detail in this test vector I'd say use the following bullets
- Master seed (hex):
- master node xprv
- Payload
- HRP
- Identifier
- Checksum
- Secret seed
That's the order I'd use, but maybe some other permutations are also good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about since the text said:
This example shows generating a new 512-bit master seed using "random" codex32 characters and appending a checksum.
human-readable part: MS
k value: 0
identifier: 0C8V
share index: S
payload: M32ZXFGUHPCHTLUPZRY9X8GF2TVDW0S3JN54KHCE6MUA7LQPZYGSFJD6AN074RXVCEMLH8WU3TK925ACDEFGHJKLMNPQRSTUVWXY06F
- checksum:
HPV80UNDVARHRAK - secret seed:
MS100C8VSM32ZXFGUHPCHTLUPZRY9X8GF2TVDW0S3JN54KHCE6MUA7LQPZYGSFJD6AN074RXVCEMLH8WU3TK925ACDEFGHJKLMNPQRSTUVWXY06FHPV80UNDVARHRAK - Master seed (hex):
dc5423251cb87175ff8110c8531d0952d8d73e1194e95b5f19d6f9df7c01111104c9baecdfea8cccc677fb9ddc8aec5553b86e528bcadfdcc201c17c638c47e9 - master node xprv:
xprv9s21ZrQH143K4UYT4rP3TZVKKbmRVmfRqTx9mG2xCy2JYipZbkLV8rwvBXsUbEv9KQiUD7oED1Wyi9evZzUn2rqK9skRgPkNaAzyw3YrpJN
No information is displayed we did not already in Vector 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you rather just change V5 text to match master's vectors?
From:
This example shows generating a new 512-bit master seed using "random" codex32 characters and appending a checksum.
To:
This example shows the long codex32 format, when used without splitting the secret into any shares.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I somewhat prefer the current text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opps, I accidentally committed a fix that changed the text.
I'll revert it to the original text and change the vector data to my earlier comment in my next commit.
human-readable part: MS
k value: 0
identifier: 0C8V
share index: S
This info should be in sentence format not as a list because it gets double spaced in mediawiki and is hard to read.
Other vectors use data given in sentence format or on the main line and calculated data in bullets.
… case in checksum
| * A human-readable part, which is intended to convey the type of data, or anything else that is relevant to the reader. This part MUST contain 1 to 83 US-ASCII characters, with each character having a value in the range [33-126]. HRP validity may be further restricted by specific applications. | ||
| * A separator, which is always "1". In case "1" is allowed inside the human-readable part, the last one in the string is the separator<ref>'''Why include a separator in codex32 strings?''' That way the human-readable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to be this wordy or should we say:
- A human-readable part, as specified in BIP-0173, which is intended to convey the type of data, or anything else that is relevant to the reader.
- A separator, as specified in BIP-0173, which is always "1".
| * All strings have the same human-readable part, the same threshold value ''k'', the same identifier, and the same length. | ||
| * All of the share index values are distinct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the periods I inadvertently added to these bullets?
| # Choose a threshold value ''k'' between 2 and 9, inclusive | ||
| # Choose a 4 bech32 character identifier | ||
| #* We do not define how to choose the identifier, beyond noting that it SHOULD be distinct for every master seed the user may need to disambiguate. | ||
| #* We do not define how to choose the identifier, beyond noting that it SHOULD be distinct for every secret the user may need to disambiguate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we say "secret or set of shares" because this is the reshare case you mentioned that SHOULD have a unique identifier?
Here we make it sound like it's OK to reuse an identifier if the secret is the same which is false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, set of shares.
| # Set the share index to <code>s</code> | ||
| # Set the payload to a bech32 encoding of the master seed, padded with arbitrary bits | ||
| # Generating a valid checksum in accordance with the Checksum section | ||
| # Set the payload to a bech32 encoding of the secret, padded with arbitrary bits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change this to "secret data" or "secret bytes" to specify its encoding bytes or leave it? There may be some confusion between "secret" meaning the string with share index "s" and the decoded payload bytes of that [codex32] secret.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"secret data" sounds good.
Summary of Changes:
Describe codex32 format for arbitrary human-readable parts not just "ms", specify master seed encoding standard, add new test vectors and enhance readability. This makes the document more like BIP-0173: proposing an encoding "codex32", then defining a standard for something using it.
See discussion on #2023 (comment).
Spec:
hrpTest Vectors:
ms32_verify_checksumfunction