Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bip-0173: test vectors, HRP, and casing requirements updates #587

Merged
merged 1 commit into from
Sep 24, 2017
Merged

bip-0173: test vectors, HRP, and casing requirements updates #587

merged 1 commit into from
Sep 24, 2017

Conversation

mruddy
Copy link
Contributor

@mruddy mruddy commented Sep 20, 2017

  • a12uel5l: vector added to demonstrate that it and its uppercase version, A12UEL5L are both valid and treated as equal. segwit_addr.bech32_encode('a', [])

  • 10a06t8: vector added to demonstrate that a combined empty HRP and empty data encode to a valid Bech32 string. segwit_addr.bech32_encode('', [])

  • 1qzzfhee: vector added to demonstrate that an empty HRP encodes to a valid Bech32 string. segwit_addr.bech32_encode('', [0])

  • pzry9x0s0muk: vector moved to a new section as it is misleading in the current section because "No separator character" can not affect checksum validity because the separator char is not used to compute the checksum.

>>> segwit_addr.bech32_create_checksum('', [1,2,3,4,5,6])
[15, 16, 15, 27, 28, 22]
>>> segwit_addr.bech32_verify_checksum('', [1,2,3,4,5,6]+[15, 16, 15, 27, 28, 22])
True
  • 1pzry9x0s0muk: vector removed as incorrect because "Empty HRP" is Bech32 valid -- HRP validity is application specific. segwit_addr.bech32_encode('', [1,2,3,4,5,6])

  • A1g7sgd8 and A1G7SGD8: vectors added to verify that implementations adhere to BIP requirement: "The lowercase form is used when determining a character's value for checksum purposes.". Note:

>>> segwit_addr.bech32_encode('a', [])
'a12uel5l'
>>> segwit_addr.bech32_encode('A', [])
'A1g7sgd8'  # BUG: violates "The lowercase form is used when determining a character's value for checksum purposes."
  • 10a06t8: test vector added to demonstrate that while a valid Bech32 string, it is segwit application specific invalid.

@luke-jr
Copy link
Member

luke-jr commented Sep 20, 2017

@sipa @gmaxwell

@mruddy mruddy changed the title bip-0173 test vector updates, hrp lowercase bip-0173 test vector updates Sep 21, 2017
@mruddy mruddy changed the title bip-0173 test vector updates bip-0173 test vector updates and casing requirements Sep 22, 2017
@mruddy
Copy link
Contributor Author

mruddy commented Sep 22, 2017

In addition to the test vector updates, I made some updates, in a second commit, regarding casing requirements.
Here's my rationale for these additional changes:

Encoding:
HRP should always be lowercased. This is because:

  1. From the existing BIP: "The lowercase form is used when determining a character's value for checksum purposes."
  2. Encoders should always output a non-mixed case Bech32 string. Lowercase is canonical due the previous item, and the fact that the encoding charset is all lowercase. Otherwise, the HRP would need to be scanned to make sure it is consistent case and then upper/lower case according to that finding.

For presentation, or QR code use, a simple uppercasing can be run externally on the encoding result.
This is consistent with the "A12UEL5L" test vector currently in the BIP, e.g.:

>>> segwit_addr.bech32_encode('a', []).upper()
'A12UEL5L'

Decoding:
Because of the BIP requirement that, "The lowercase form is used when determining a character's value for checksum purposes.":

  1. Decoding treats uppercase and lowercase inputs as equal. The hrp still needs to be lowercased before hrp expansion during checksum calculation.
  2. The transcription problems associated with mixed-case are avoided without restriction on acceptance of mixed-case input for decoding.
  3. Not accepting mixed-case input for decoding is seemingly punitive.

Tangentially related: sipa/bech32#25

If an uppercase version of the encoding result is desired, (e.g.- for presentation purposes, or QR code use),
then an uppercasing procedure can be performed external to the encoding process.

Decoders MUST accept uppercase, lowercase, and mixed-case strings.
Copy link
Member

@sipa sipa Sep 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decoders must not accept mixed-case strings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a typo. I think decoders should (and thus "must") accept mixed-case. Because of the existing requirement to use the lowercase form to determine a character's value for checksum purposes, there is no difference between upper and lower case characters in any position of the bech32 string. That applies to the HRP because it is also used to compute the checksum. So, on a micro level, I don't see a reason why a decoder should fail the input. On a macro level, I can see where you might be trying to discourage mixed case. But, the other changes I made also push in that direction (in fact fixing this: segwit_addr.bech32_encode('A', [])).

Copy link
Member

@sipa sipa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rejecting mixed case is very intentional. Doing so massively reduces the chance of accepting arbitrary text as valid Bech32.

* <tt>A1G7SGD8</tt>: The lowercase form is used when determining a character's value for checksum purposes.

The following are invalid Bech32 strings (with reason for invalidity):
* <tt>pzry9x0s0muk</tt>: No separator character
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section can be merged with the 'invalid Bech32 checksum' section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's how it was to start. I moved it to a new section because I thought it was misleading in the current section because "No separator character" can not affect checksum validity because the separator char is not used to compute the checksum.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but that's to for several of the existing examples in that section (like HRP character out of range, or length exceeded.

I think they can be combined into a single section with examples of invalid Bech32 strings.

@mruddy
Copy link
Contributor Author

mruddy commented Sep 22, 2017

Rejecting mixed case is very intentional. Doing so massively reduces the chance of accepting arbitrary text as valid Bech32.

I don't see it that way. The input is always being canonicalized to lower anyways. Am I misunderstanding you somehow? I was working on a java reference impl encoder/decoder and it works. I'll put it up shortly.

@sipa
Copy link
Member

sipa commented Sep 22, 2017

@mruddy Just because it's canonicalized to lowercase for the purpose of checksum computation doesn't mean that it's valid. There is are extra requirements for validity besides checksum correctness (including maximum length, range of hrp characters, and no mixed case).

@sipa
Copy link
Member

sipa commented Sep 22, 2017

There is a PR for a Java implementation already: sipa/bech32#19

I welcome review!

@mruddy
Copy link
Contributor Author

mruddy commented Sep 22, 2017

Yes, I saw that PR and thought that I could do better. This is my attempt https://github.com/mruddy/bech32/ :)

@mruddy
Copy link
Contributor Author

mruddy commented Sep 22, 2017

Just because it's canonicalized to lowercase for the purpose of checksum computation doesn't mean that it's valid. There is are extra requirements for validity besides checksum correctness (including maximum length, range of hrp characters, and no mixed case).

Agreed, but the case of a-Z and A-Z is interchangeable due to the lowercase canonicalization. That's what I mean by mixed-case.

@sipa
Copy link
Member

sipa commented Sep 22, 2017

I still don't understand why you want to add "Decoder must accept mixed case". I absolutely think it should not.

@mruddy
Copy link
Contributor Author

mruddy commented Sep 22, 2017

Because from a user standpoint, if they happen to transcribe and get the case wrong somehow, it should still be accepted since the result will be the same.

@sipa
Copy link
Member

sipa commented Sep 22, 2017

If they transcribe it wrong, the software should complain and they should go back and fix it.

Trying to correct it for them is a footgun. If they can accidentally misread and transcribe as the wrong case, they could equally have transcribed as just the wrong character - and you wouldn't want to correct that for them either (even though error correction algorithms exist).

Effectively, the consistent case is part of the consistency check of the address, together with the checksum.

@mruddy
Copy link
Contributor Author

mruddy commented Sep 22, 2017

Well, "wrong" was not a great word to use. I should have said "different" because that's all it is -- not wrong, because they are equal. It's less code to be more user friendly and get the same result.

Maybe I won't be able to persuade you on this point. But, how about the rest of the changes? Once we have general agreement, I'll update accordingly.

@sipa
Copy link
Member

sipa commented Sep 22, 2017

Big concept ACK on the rest of the changes. I've had other reports that the language was unclear wrt the behaviour of the encoder and uppercase input, and wanted to improve it. This seems like a good solution.

I do want to verify the extra test cases, though.

@mruddy
Copy link
Contributor Author

mruddy commented Sep 22, 2017

Cool, thanks.

Also, I was thinking of accepting mixed case not so much as trying to correct users as much as it was unifying code paths.

BTW, the test cases should be testable with the examples in my first comment. I also have those test cases in my java.

@mruddy
Copy link
Contributor Author

mruddy commented Sep 22, 2017

FYI, I just added a 3rd commit to capture what we discussed. I'll squash it all when we're all set.
I left it separate right now so you can easily see what I incrementally changed.

@@ -264,6 +267,9 @@ P2PKH addresses can be used.

The following strings have a valid Bech32 checksum.
* <tt>A12UEL5L</tt>
* <tt>a12uel5l</tt>
* <tt>10a06t8</tt>
Copy link
Member

@sipa sipa Sep 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a requirement that the HRP is non-empty, so while you're technically right that this has a valid checksum, it's not testable... any correct decoder should fail with that as input.

So I think it's maybe better to rename the section to "The following strings are valid Bech32" (and the next section to "The following string are not valid Bech32 (with reason for invalidity)", and have these two example with empty HRP in that section (with a note that technically their checksum is valid).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read the BIP contrary to that. My interpretation was that no such requirement that the HRP is non-empty is in the BIP. The BIP even says that the HRP's validity is application specific (which builds on top of generic Bech32). That's why this works:

>>> segwit_addr.bech32_create_checksum('', [])
[15, 29, 15, 26, 11, 7]
>>> segwit_addr.bech32_verify_checksum('', [] + [15, 29, 15, 26, 11, 7])
True

It seems odd that we'd encode something that can't be decoded. I think the reference decoder is implementing a phantom requirement right now. If we add a requirement that HRP be non-empty, I think that's OK, just other things need to be updated also. Am I making sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right - it seems the non-empty HRP isn't mentioned in the BIP anywhere, except in the test vectors.

I prefer modifying it so that it does require that, and is uniformly implemented in decoders and encoders.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i'll update to that effect.

btw, the non-checksum data portion can be empty right now. do you want to keep that as it is now? i think so, just checking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sounds good. Thanks!

@mruddy mruddy changed the title bip-0173 test vector updates and casing requirements bip-0173: test vectors, HRP, and casing requirements updates Sep 24, 2017
@mruddy
Copy link
Contributor Author

mruddy commented Sep 24, 2017

I've updated and squashed everything back into one commit. Besides this BIP update, the reference code will need some updates.

Just wanted to point out that a couple of the things to fix, 1) mixed case encoding output and 2) using the upper case form of the HRP for checksum creation, are identified by this test:

>>> segwit_addr.bech32_encode('a', [])
'a12uel5l'
>>> segwit_addr.bech32_encode('A', [])
'A1g7sgd8'

@mruddy
Copy link
Contributor Author

mruddy commented Sep 24, 2017

Added some info about conversion to US-ASCII and unmappable characters.

@sipa
Copy link
Member

sipa commented Sep 24, 2017

ACK 96d39e8, I've verified all the test vectors.

Thank you!

@mruddy
Copy link
Contributor Author

mruddy commented Sep 24, 2017

@sipa thanks for reviewing! One more thing. I think we should definitely use UTF-8 instead of US-ASCII.

I've added a commit and will squash it upon approval.

Doing this will avoid the unmappable character issue and make using this easier for many people using higher level languages that use unicode strings. UTF-8 is compatible with ASCII because the mapping for ASCII uncompatible code points goes to multibyte, with each byte being out of ASCII range (which we'll already reject). This is easy to see in the table at the top of the description at https://en.wikipedia.org/wiki/UTF-8.

>>> '\U00000080'.encode('utf-8')
b'\xc2\x80'
>>> '\U00000800'.encode('utf-8')
b'\xe0\xa0\x80'
>>> '\U00010000'.encode('utf-8')
b'\xf0\x90\x80\x80'

@sipa
Copy link
Member

sipa commented Sep 24, 2017

@mruddy Yes, that's why it's restricted to the range of printable characters in the lower 7 bits, to avoid issues with character sets.

You also can't say "UTF-8 code points" (as UTF-8 is a way of encoding code points as bytes). You could say Unicode code points (or Unicode characters), though.

I think it's unnecessary though. The text does not say "The HRP must consist of Unicode characters, converted to US-ASCII" - that would be prone to having character conversion issues. It says "must consist of ASCII characters". If you're using it with other characters, regardless of how you encoding them, it's already used incorrectly.

@mruddy
Copy link
Contributor Author

mruddy commented Sep 24, 2017

You also can't say "UTF-8 code points" (as UTF-8 is a way of encoding code points as bytes). You could say Unicode code points (or Unicode characters), though.

Agree, good catch.

The reason I bring this up is because of languages like Java (but not just Java) where the String type is UTF-16. So, that's what a user of the library is going to have and then it's going to be sent in and the library is going to have to encode to what Bech32 needs.
For example:
https://github.com/mruddy/bech32/blob/master/src/com/github/mruddy/bip173/bech32/Bech32.java#L150-L152
If Bech32 needs UTF-8, then there is no mapping issue. If Bech32 needs US-ASCII, then there can be problems introduced or exceptions generated. Otherwise the library user needs to know to convert to UTF-8 and then pass in a byte array, instead of the native String type, or do a code point check before sending the String in. Since we have the 7-bit restriction, this is safe. I think it makes the BIP more developer friendly.

@sipa
Copy link
Member

sipa commented Sep 24, 2017

The API should just take in a String, and then check whether all characters are in the ASCII range. UTF-8 shouldn't be involved at any point.

@mruddy
Copy link
Contributor Author

mruddy commented Sep 24, 2017

Oh, I suppose :)

I dropped the UTF-8 commit and it's back to the one that you ACK'd, 96d39e8.

Thanks!

@luke-jr luke-jr merged commit 249f1e4 into bitcoin:master Sep 24, 2017
@mruddy mruddy deleted the bip173 branch September 24, 2017 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants