bip-0173: test vectors, HRP, and casing requirements updates #587

mruddy · 2017-09-20T01:20:44Z

a12uel5l: vector added to demonstrate that it and its uppercase version, A12UEL5L are both valid and treated as equal. segwit_addr.bech32_encode('a', [])
10a06t8: vector added to demonstrate that a combined empty HRP and empty data encode to a valid Bech32 string. segwit_addr.bech32_encode('', [])
1qzzfhee: vector added to demonstrate that an empty HRP encodes to a valid Bech32 string. segwit_addr.bech32_encode('', [0])
pzry9x0s0muk: vector moved to a new section as it is misleading in the current section because "No separator character" can not affect checksum validity because the separator char is not used to compute the checksum.

>>> segwit_addr.bech32_create_checksum('', [1,2,3,4,5,6])
[15, 16, 15, 27, 28, 22]
>>> segwit_addr.bech32_verify_checksum('', [1,2,3,4,5,6]+[15, 16, 15, 27, 28, 22])
True

1pzry9x0s0muk: vector removed as incorrect because "Empty HRP" is Bech32 valid -- HRP validity is application specific. segwit_addr.bech32_encode('', [1,2,3,4,5,6])
A1g7sgd8 and A1G7SGD8: vectors added to verify that implementations adhere to BIP requirement: "The lowercase form is used when determining a character's value for checksum purposes.". Note:

>>> segwit_addr.bech32_encode('a', [])
'a12uel5l'
>>> segwit_addr.bech32_encode('A', [])
'A1g7sgd8'  # BUG: violates "The lowercase form is used when determining a character's value for checksum purposes."

10a06t8: test vector added to demonstrate that while a valid Bech32 string, it is segwit application specific invalid.

luke-jr · 2017-09-20T03:27:40Z

@sipa @gmaxwell

mruddy · 2017-09-22T14:46:26Z

In addition to the test vector updates, I made some updates, in a second commit, regarding casing requirements.
Here's my rationale for these additional changes:

Encoding:
HRP should always be lowercased. This is because:

From the existing BIP: "The lowercase form is used when determining a character's value for checksum purposes."
Encoders should always output a non-mixed case Bech32 string. Lowercase is canonical due the previous item, and the fact that the encoding charset is all lowercase. Otherwise, the HRP would need to be scanned to make sure it is consistent case and then upper/lower case according to that finding.

For presentation, or QR code use, a simple uppercasing can be run externally on the encoding result.
This is consistent with the "A12UEL5L" test vector currently in the BIP, e.g.:

>>> segwit_addr.bech32_encode('a', []).upper()
'A12UEL5L'

Decoding:
Because of the BIP requirement that, "The lowercase form is used when determining a character's value for checksum purposes.":

Decoding treats uppercase and lowercase inputs as equal. The hrp still needs to be lowercased before hrp expansion during checksum calculation.
The transcription problems associated with mixed-case are avoided without restriction on acceptance of mixed-case input for decoding.
Not accepting mixed-case input for decoding is seemingly punitive.

Tangentially related: sipa/bech32#25

sipa · 2017-09-22T19:07:44Z

bip-0173.mediawiki

+If an uppercase version of the encoding result is desired, (e.g.- for presentation purposes, or QR code use),
+then an uppercasing procedure can be performed external to the encoding process.
+
+Decoders MUST accept uppercase, lowercase, and mixed-case strings.


Decoders must not accept mixed-case strings?

Not a typo. I think decoders should (and thus "must") accept mixed-case. Because of the existing requirement to use the lowercase form to determine a character's value for checksum purposes, there is no difference between upper and lower case characters in any position of the bech32 string. That applies to the HRP because it is also used to compute the checksum. So, on a micro level, I don't see a reason why a decoder should fail the input. On a macro level, I can see where you might be trying to discourage mixed case. But, the other changes I made also push in that direction (in fact fixing this: segwit_addr.bech32_encode('A', [])).

sipa

Rejecting mixed case is very intentional. Doing so massively reduces the chance of accepting arbitrary text as valid Bech32.

sipa · 2017-09-22T19:09:12Z

bip-0173.mediawiki

+* <tt>A1G7SGD8</tt>: The lowercase form is used when determining a character's value for checksum purposes.
+
+The following are invalid Bech32 strings (with reason for invalidity):
+* <tt>pzry9x0s0muk</tt>: No separator character


I think this section can be merged with the 'invalid Bech32 checksum' section.

That's how it was to start. I moved it to a new section because I thought it was misleading in the current section because "No separator character" can not affect checksum validity because the separator char is not used to compute the checksum.

Right, but that's to for several of the existing examples in that section (like HRP character out of range, or length exceeded.

I think they can be combined into a single section with examples of invalid Bech32 strings.

mruddy · 2017-09-22T22:18:21Z

Rejecting mixed case is very intentional. Doing so massively reduces the chance of accepting arbitrary text as valid Bech32.

I don't see it that way. The input is always being canonicalized to lower anyways. Am I misunderstanding you somehow? I was working on a java reference impl encoder/decoder and it works. I'll put it up shortly.

sipa · 2017-09-22T22:22:35Z

@mruddy Just because it's canonicalized to lowercase for the purpose of checksum computation doesn't mean that it's valid. There is are extra requirements for validity besides checksum correctness (including maximum length, range of hrp characters, and no mixed case).

sipa · 2017-09-22T22:23:02Z

There is a PR for a Java implementation already: sipa/bech32#19

I welcome review!

mruddy · 2017-09-22T22:23:10Z

Here's some java code. Maybe this will clarify:

https://github.com/mruddy/bech32/blob/master/src/org/bitcoin/bips/bip173/bech32/Bech32.java#L110

https://github.com/mruddy/bech32/blob/master/src/org/bitcoin/bips/bip173/bech32/Bech32.java#L177

mruddy · 2017-09-22T22:24:57Z

Yes, I saw that PR and thought that I could do better. This is my attempt https://github.com/mruddy/bech32/ :)

mruddy · 2017-09-22T22:29:39Z

Just because it's canonicalized to lowercase for the purpose of checksum computation doesn't mean that it's valid. There is are extra requirements for validity besides checksum correctness (including maximum length, range of hrp characters, and no mixed case).

Agreed, but the case of a-Z and A-Z is interchangeable due to the lowercase canonicalization. That's what I mean by mixed-case.

sipa · 2017-09-22T22:30:38Z

I still don't understand why you want to add "Decoder must accept mixed case". I absolutely think it should not.

mruddy · 2017-09-22T22:31:57Z

Because from a user standpoint, if they happen to transcribe and get the case wrong somehow, it should still be accepted since the result will be the same.

sipa · 2017-09-22T22:33:18Z

If they transcribe it wrong, the software should complain and they should go back and fix it.

Trying to correct it for them is a footgun. If they can accidentally misread and transcribe as the wrong case, they could equally have transcribed as just the wrong character - and you wouldn't want to correct that for them either (even though error correction algorithms exist).

Effectively, the consistent case is part of the consistency check of the address, together with the checksum.

mruddy · 2017-09-22T22:40:45Z

Well, "wrong" was not a great word to use. I should have said "different" because that's all it is -- not wrong, because they are equal. It's less code to be more user friendly and get the same result.

Maybe I won't be able to persuade you on this point. But, how about the rest of the changes? Once we have general agreement, I'll update accordingly.

sipa · 2017-09-22T22:44:34Z

Big concept ACK on the rest of the changes. I've had other reports that the language was unclear wrt the behaviour of the encoder and uppercase input, and wanted to improve it. This seems like a good solution.

I do want to verify the extra test cases, though.

mruddy · 2017-09-22T22:49:20Z

Cool, thanks.

Also, I was thinking of accepting mixed case not so much as trying to correct users as much as it was unifying code paths.

BTW, the test cases should be testable with the examples in my first comment. I also have those test cases in my java.

mruddy · 2017-09-22T23:17:11Z

FYI, I just added a 3rd commit to capture what we discussed. I'll squash it all when we're all set.
I left it separate right now so you can easily see what I incrementally changed.

sipa · 2017-09-23T18:44:11Z

bip-0173.mediawiki

@@ -264,6 +267,9 @@ P2PKH addresses can be used.

 The following strings have a valid Bech32 checksum.
 * <tt>A12UEL5L</tt>
+* <tt>a12uel5l</tt>
+* <tt>10a06t8</tt>


There is a requirement that the HRP is non-empty, so while you're technically right that this has a valid checksum, it's not testable... any correct decoder should fail with that as input.

So I think it's maybe better to rename the section to "The following strings are valid Bech32" (and the next section to "The following string are not valid Bech32 (with reason for invalidity)", and have these two example with empty HRP in that section (with a note that technically their checksum is valid).

I read the BIP contrary to that. My interpretation was that no such requirement that the HRP is non-empty is in the BIP. The BIP even says that the HRP's validity is application specific (which builds on top of generic Bech32). That's why this works:

>>> segwit_addr.bech32_create_checksum('', []) [15, 29, 15, 26, 11, 7] >>> segwit_addr.bech32_verify_checksum('', [] + [15, 29, 15, 26, 11, 7]) True

It seems odd that we'd encode something that can't be decoded. I think the reference decoder is implementing a phantom requirement right now. If we add a requirement that HRP be non-empty, I think that's OK, just other things need to be updated also. Am I making sense?

You're right - it seems the non-empty HRP isn't mentioned in the BIP anywhere, except in the test vectors.

I prefer modifying it so that it does require that, and is uniformly implemented in decoders and encoders.

ok, i'll update to that effect.

btw, the non-checksum data portion can be empty right now. do you want to keep that as it is now? i think so, just checking.

Yes, sounds good. Thanks!

mruddy · 2017-09-24T00:17:21Z

I've updated and squashed everything back into one commit. Besides this BIP update, the reference code will need some updates.

Just wanted to point out that a couple of the things to fix, 1) mixed case encoding output and 2) using the upper case form of the HRP for checksum creation, are identified by this test:

>>> segwit_addr.bech32_encode('a', [])
'a12uel5l'
>>> segwit_addr.bech32_encode('A', [])
'A1g7sgd8'

mruddy · 2017-09-24T03:31:51Z

Added some info about conversion to US-ASCII and unmappable characters.

sipa · 2017-09-24T03:45:09Z

ACK 96d39e8, I've verified all the test vectors.

Thank you!

mruddy · 2017-09-24T12:08:59Z

@sipa thanks for reviewing! One more thing. I think we should definitely use UTF-8 instead of US-ASCII.

I've added a commit and will squash it upon approval.

Doing this will avoid the unmappable character issue and make using this easier for many people using higher level languages that use unicode strings. UTF-8 is compatible with ASCII because the mapping for ASCII uncompatible code points goes to multibyte, with each byte being out of ASCII range (which we'll already reject). This is easy to see in the table at the top of the description at https://en.wikipedia.org/wiki/UTF-8.

>>> '\U00000080'.encode('utf-8')
b'\xc2\x80'
>>> '\U00000800'.encode('utf-8')
b'\xe0\xa0\x80'
>>> '\U00010000'.encode('utf-8')
b'\xf0\x90\x80\x80'

sipa · 2017-09-24T17:17:00Z

@mruddy Yes, that's why it's restricted to the range of printable characters in the lower 7 bits, to avoid issues with character sets.

You also can't say "UTF-8 code points" (as UTF-8 is a way of encoding code points as bytes). You could say Unicode code points (or Unicode characters), though.

I think it's unnecessary though. The text does not say "The HRP must consist of Unicode characters, converted to US-ASCII" - that would be prone to having character conversion issues. It says "must consist of ASCII characters". If you're using it with other characters, regardless of how you encoding them, it's already used incorrectly.

mruddy · 2017-09-24T17:43:50Z

You also can't say "UTF-8 code points" (as UTF-8 is a way of encoding code points as bytes). You could say Unicode code points (or Unicode characters), though.

Agree, good catch.

The reason I bring this up is because of languages like Java (but not just Java) where the String type is UTF-16. So, that's what a user of the library is going to have and then it's going to be sent in and the library is going to have to encode to what Bech32 needs.
For example:
https://github.com/mruddy/bech32/blob/master/src/com/github/mruddy/bip173/bech32/Bech32.java#L150-L152
If Bech32 needs UTF-8, then there is no mapping issue. If Bech32 needs US-ASCII, then there can be problems introduced or exceptions generated. Otherwise the library user needs to know to convert to UTF-8 and then pass in a byte array, instead of the native String type, or do a code point check before sending the String in. Since we have the 7-bit restriction, this is safe. I think it makes the BIP more developer friendly.

sipa · 2017-09-24T17:45:52Z

The API should just take in a String, and then check whether all characters are in the ASCII range. UTF-8 shouldn't be involved at any point.

mruddy · 2017-09-24T18:25:39Z

Oh, I suppose :)

I dropped the UTF-8 commit and it's back to the one that you ACK'd, 96d39e8.

Thanks!

luke-jr added the Proposed BIP modification label Sep 20, 2017

mruddy changed the title ~~bip-0173 test vector updates, hrp lowercase~~ bip-0173 test vector updates Sep 21, 2017

mruddy changed the title ~~bip-0173 test vector updates~~ bip-0173 test vector updates and casing requirements Sep 22, 2017

mruddy mentioned this pull request Sep 22, 2017

Full BIP173 (Bech32) support bitcoin/bitcoin#11167

Merged

sipa reviewed Sep 22, 2017

View reviewed changes

sipa reviewed Sep 23, 2017

View reviewed changes

mruddy changed the title ~~bip-0173 test vector updates and casing requirements~~ bip-0173: test vectors, HRP, and casing requirements updates Sep 24, 2017

bip-0173: test vectors; HRP and casing requirements

96d39e8

luke-jr merged commit 249f1e4 into bitcoin:master Sep 24, 2017

mruddy deleted the bip173 branch September 24, 2017 19:34

mruddy mentioned this pull request Oct 11, 2017

Bech32 (BIP173) address support trezor/trezor-crypto#115

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bip-0173: test vectors, HRP, and casing requirements updates #587

bip-0173: test vectors, HRP, and casing requirements updates #587

mruddy commented Sep 20, 2017 •

edited

Loading

luke-jr commented Sep 20, 2017

mruddy commented Sep 22, 2017 •

edited

Loading

sipa Sep 22, 2017 •

edited

Loading

mruddy Sep 22, 2017

sipa left a comment

sipa Sep 22, 2017

mruddy Sep 22, 2017

sipa Sep 22, 2017

mruddy commented Sep 22, 2017

sipa commented Sep 22, 2017

sipa commented Sep 22, 2017

mruddy commented Sep 22, 2017

mruddy commented Sep 22, 2017

mruddy commented Sep 22, 2017

sipa commented Sep 22, 2017

mruddy commented Sep 22, 2017 •

edited

Loading

sipa commented Sep 22, 2017 •

edited

Loading

mruddy commented Sep 22, 2017

sipa commented Sep 22, 2017

mruddy commented Sep 22, 2017

mruddy commented Sep 22, 2017

sipa Sep 23, 2017 •

edited

Loading

mruddy Sep 23, 2017

sipa Sep 23, 2017

mruddy Sep 23, 2017

sipa Sep 23, 2017

mruddy commented Sep 24, 2017

mruddy commented Sep 24, 2017

sipa commented Sep 24, 2017 •

edited

Loading

mruddy commented Sep 24, 2017

sipa commented Sep 24, 2017 •

edited

Loading

mruddy commented Sep 24, 2017

sipa commented Sep 24, 2017

mruddy commented Sep 24, 2017

bip-0173: test vectors, HRP, and casing requirements updates #587

bip-0173: test vectors, HRP, and casing requirements updates #587

Conversation

mruddy commented Sep 20, 2017 • edited Loading

luke-jr commented Sep 20, 2017

mruddy commented Sep 22, 2017 • edited Loading

sipa Sep 22, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sipa left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mruddy commented Sep 22, 2017

sipa commented Sep 22, 2017

sipa commented Sep 22, 2017

mruddy commented Sep 22, 2017

mruddy commented Sep 22, 2017

mruddy commented Sep 22, 2017

sipa commented Sep 22, 2017

mruddy commented Sep 22, 2017 • edited Loading

sipa commented Sep 22, 2017 • edited Loading

mruddy commented Sep 22, 2017

sipa commented Sep 22, 2017

mruddy commented Sep 22, 2017

mruddy commented Sep 22, 2017

sipa Sep 23, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mruddy commented Sep 24, 2017

mruddy commented Sep 24, 2017

sipa commented Sep 24, 2017 • edited Loading

mruddy commented Sep 24, 2017

sipa commented Sep 24, 2017 • edited Loading

mruddy commented Sep 24, 2017

sipa commented Sep 24, 2017

mruddy commented Sep 24, 2017

mruddy commented Sep 20, 2017 •

edited

Loading

mruddy commented Sep 22, 2017 •

edited

Loading

sipa Sep 22, 2017 •

edited

Loading

mruddy commented Sep 22, 2017 •

edited

Loading

sipa commented Sep 22, 2017 •

edited

Loading

sipa Sep 23, 2017 •

edited

Loading

sipa commented Sep 24, 2017 •

edited

Loading

sipa commented Sep 24, 2017 •

edited

Loading