Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix two errors in the BIP 39 French wordlist #622

Merged
merged 1 commit into from
Aug 9, 2018

Conversation

nym-zone
Copy link
Contributor

@nym-zone nym-zone commented Jan 1, 2018

The BIP 39 French wordlist contains two significant technical errors:

  • Byte Order Marker (BOM) U+FEFF at the beginning of the first line, preceding the word “abaisser”.

  • No newline '\n' char terminating the last line, after “zoologie”.

The former may cause user loss of funds. An implementation which generates a mnemonic phrase and also turns it into a BIP 39 seed value may feed the string "<U+FEFF>abaisser" to the KDF, while displaying the word “abaisser” to the user. Of course, it cannot be expected that the user would enter "<U+FEFF>abaisser" upon attempt to restore a wallet. In the face of a buggy wordlist, whitespace handling and normalization cannot be absolutely relied on to remove a notoriously mischievous character. Those who provide technical support may be well advised to ask French users with unrestorable wallets, “Did your mnemonic phrase contain the word ‘abaisser’?”

The latter broke the shell script I use to massage wordlists into C sources when building easyseed.

I know of only one commonplace platform where software regularly prepends UTF-8 files with a spurious U+FEFF, and oftentimes omits a line terminator on the last line even when asked to create a Unix ('\n') text file. It is RECOMMENDED that new wordlists be examined for correctness using standard shell tools on a sane platform.

The BIP 39 wordlist contained two significant technical errors:

 - Byte Order Marker (BOM) U+FEFF at the beginning of the first line,
   preceding the word "abaisser".

 - No newline '\n' char terminating the last line, after "zoologie".

The former may cause user loss of funds.  An implementation which
generates a mnemonic phrase and also turns it into a BIP 39 seed value
may feed the string "<U+FEFF>abaisser" to the KDF, while displaying the
word "abaisser" to the user.  Of course, it cannot be expected that the
user would enter "<U+FEFF>abaisser" upon attempt to restore a wallet.
In the face of a buggy wordlist, whitespace handling and normalization
cannot be absolutely relied on to remove a notoriously mischievous
character.  Those who provide technical support may be well advised to
ask French users with unrestorable wallets, "Did your mnemonic phrase
contain the word 'abaisser'?"

The latter broke the shell script I use to massage wordlists into C
sources when building https://github.com/nym-zone/easyseed .

I know of only one commonplace platform where software regularly
prepends UTF-8 files with a spurious U+FEFF, and oftentimes omits a line
terminator on the last line even when asked to create a Unix ('\n') text
file.  It is RECOMMENDED that new wordlists be examined for correctness
using standard shell tools on a sane platform.
@dabura667
Copy link

ACK 50c4f12

I checked all other current wordlists and they all:

  1. Did not have the BOM
  2. Did have a trailing LF line break

I pulled @nym-zone 's branch locally and inspected the file using a hex editor to double check the new french.txt is fixed.

As for @nym-zone's concerns:

The only wallet I know of that shows users French phrases is Copay, and they don't use the files as-is from the BIP.
https://github.com/bitpay/bitcore-mnemonic/blob/master/lib/words/french.js

If there is a wallet that uses the exact file from the BIP that shows french users phrases in French. That warning applies... but I don't think any wallet does so.

@nym-zone
Copy link
Contributor Author

nym-zone commented Jan 1, 2018

Thanks, @dabura667. I should have clarified, I discovered this when writing a mnemonic phrase generator tool which embeds all eight wordlists currently in the BIP repository. I reported French, because that’s what was broken. Insofar as I could when dealing with multiple languages I do not know, I exercised reasonable care to assure the integrity of all the wordlists.

(Then I wrote an automated battery of runtime tests against the compile-time SHA-256 hashes of the files, in case somebody out there may have buggy tools which mangle UTF-8 when building...)

I just hexdumped Copay’s french.js to double-check; and its string representation is fine as for the ironically self-abasing “abaisser”.

@nym-zone
Copy link
Contributor Author

nym-zone commented Jan 5, 2018

Checking additional implementations in the wild, I have not (yet?) found any which carry the spurious U+FEFF. But it is not only a matter of Copay.

From a popular implementation, widely used because it runs in a web browser: iancoleman/bip39@3a8dbe9, checking wordlist_french.js:

00000030  20 3a 20 57 4f 52 44 4c  49 53 54 53 3b 0a 57 4f  | : WORDLISTS;.WO|
00000040  52 44 4c 49 53 54 53 5b  22 66 72 65 6e 63 68 22  |RDLISTS["french"|
00000050  5d 20 3d 20 5b 0a 22 61  62 61 69 73 73 65 72 22  |] = [."abaisser"|

From NovaCrypto/BIP39@5ecf568 (see also), checking French.java:

00000520  20 6e 65 77 20 53 74 72  69 6e 67 5b 5d 7b 0a 20  | new String[]{. |
00000530  20 20 20 20 20 20 20 20  20 20 20 22 61 62 61 69  |           "abai|
00000540  73 73 65 72 22 2c 0a 20  20 20 20 20 20 20 20 20  |sser",.         |

I will keep my eye out for other French-supporting BIP 39 implementations.

nym-zone added a commit to nym-zone/easyseed that referenced this pull request Jan 7, 2018
Notice:  This is hidden behind the -W flag; see 8aaa6f3.

This is not exactly the wordlist proposed in the pull request.  It is
the russian.txt from farazdagi/bips@a59cc3e, as modified by
approximately the following command:

	uconv -f utf-8 -t utf-8 -x '::nfkd;' | sort -s

The *result* has been confirmed to not have any leading BOM, and to have
a final line terminated with '\n' (bitcoin/bips#622).  I did not yet
examine the source for these issues.

The *result* russian.txt SHA-256 hash:
a8d7b9d8bdd3816eddd2aeb98718ad586d8e7dd8c364a944c072cdf3cd6bcb05
nym-zone added a commit to nym-zone/easyseed that referenced this pull request Jan 7, 2018
Notice:  This is hidden behind the -W flag; see 8aaa6f3.

This is not exactly the wordlist proposed in the pull request.  It is
the czech.txt from zizelevak/bips@b7f682f, as modified by approximately
the following command:

	uconv -f utf-8 -t utf-8 -x '::nfkd;' < czech.txt | \
		LC_ALL=C LANG=C sort -s > normalized/czech.txt

The *result* has been confirmed to not have any leading BOM, and to have
a final line terminated with '\n' (bitcoin/bips#622).  I did not yet
examine the source for these issues.  I did examine the source to
confirm that no lines had any trailing whitespace (see 08a05b4).

SHA-256 hash for the *resulting* czech.txt:
195136b3ba0f3099a9df625e0963f4efb56625b91c3a76bc5b4a9466a26880f7
@nym-zone
Copy link
Contributor Author

nym-zone commented Jan 9, 2018

ping @Kirvx @NicolasDorier @vosine @luke-jr

Bugfix on #152, fbe7196.

This is a simple technical fix which changes no functionality in correct implementations, but can help prevent implementation errors.

@Kirvx
Copy link
Contributor

Kirvx commented Jan 9, 2018

ACK.
Thanks @nym-zone and @dabura667.

@voisine Are breadwallet french users affected?

nym-zone referenced this pull request in breadwallet/breadwallet-legacy Jan 9, 2018
@nym-zone
Copy link
Contributor Author

nym-zone commented Jan 9, 2018

Thanks, @Kirvx.

I didn’t mean to imply anything about Breadwallet; I was simply trying to get the attention of persons with significant involvement in #152 and/or maintenance of BIP 39 wordlists and/or BIP maintenance. But, good idea! Now, I checked the only Breadwallet French file I could find on a brief search, BreadWallet/fr.lproj/BIP39Words.plist from breadwallet/breadwallet-legacy@fd14d42; and it is fine:

$ hd -s 173 -n 16 BIP39Words.plist
000000ad  3c 73 74 72 69 6e 67 3e  61 62 61 69 73 73 65 72  |<string>abaisser|
000000bd

(That file is delimited by XML, not newlines; so as long as “zoologie” is there, that’s fine also.)

By the way, thank you for your work creating this list. I strongly urge that BIP 39 should have broad language support; and French is an important language. I can see in #152 how much work was put in to make and refine the list. Too bad, it seems abaisser wanted to abase itself.

@Kirvx
Copy link
Contributor

Kirvx commented Jan 9, 2018

😃

Thank you for checking breadwallet @nym-zone 👍

@luke-jr
Copy link
Member

luke-jr commented Jan 10, 2018

@slush0 @prusnak @voisine @ebfull

@NicolasDorier
Copy link
Contributor

would be nice to add a test vector. NBitcoin also has french hard coded... I could not find a proper tool on windows (sublime, and vscode don't work for some reason) showing me hidden characters though.

@evoskuil
Copy link
Contributor

Try notepad++

@nym-zone
Copy link
Contributor Author

@NicolasDorier:

would be nice to add a test vector.

From the nym-zone/easyseed@c7d698a set of a dozen languages’ test vectors, I give you French test vectors in a convenient JSON format. Please run them with your implementation.

As stated in the preceding commit log from two days ago (nym-zone/easyseed@5f35cd0, q.v.), these vectors are specifically designed to flunk implementations which do not perform proper Unicode NFKD normalization—even with words not containing diacritics (including the English wordlist, too).

(I may rename or modify things; but the versioned link will obviously remain stable.)

@nym-zone
Copy link
Contributor Author

Pertinent to the below, but also as a general point: U+FEFF is not removed or otherwise affected by Unicode NFKD normalization. To shortcut discussion of Unicode character properties, here is an empirical object demonstration with an ICU utility:

$ echo -n $'a\ufeffb' | uconv -x '::nfkd;' | hd
00000000  61 ef bb bf 62                                    |a...b|
00000005

Of course, due to its double-personality as “ZERO WIDTH NO-BREAK SPACE” (ZWNBSP), it’s also zero-width—and therefore invisible on any display which supports Unicode:

$ echo $'a\ufeffb'
ab

(Check, there is an extra character in between there.)

Thus if it gets fed to PBKDF2 for BIP 39 seed generation, and the user is shown the corresponding mnemonic, then the user will write down a phrase which will not restore the wallet. That’s why I panicked when I saw this:

$ head -c16 french.txt | hd
00000000  ef bb bf 61 62 61 69 73  73 65 72 0a 61 62 61 6e  |...abaisser.aban|
00000010

@NicolasDorier, IWordlistSource.cs from MetacoSA/NBitcoin@45a0ad9 does not contain this bug:

$ hd -s 80771 -n 19 IWordlistSource.cs
00013b83  22 66 72 65 6e 63 68 22  2c 20 22 61 62 61 69 73  |"french", "abais|
00013b93  73 65 72                                          |ser|
00013b96

If you used a Windows text editor to somehow copypaste that, I would not expect it to have picked up the U+FEFF. AFAIK, the Windows tools which (unnecessarily) produce a Byte Order Marker for octet-stream UTF-8 tend to not consider it part of the text. But then, I also would not rely on this.

My greater concern is automatically processed text. I know for a fact that the U+FEFF can be slurped up from french.txt, because that’s what happened to me. In my Makefile, I use Unix shell tools to turn the wordlist .txt files into C const char * arrays. The resulting French C array did contain the U+FEFF before I fixed it.

Had I foisted that on users, somebody would have lost funds sooner or later.

(Apologies for the double post. My bad.)

nym-zone referenced this pull request in MetacoSA/NBitcoin Jan 11, 2018
@NicolasDorier
Copy link
Contributor

@nym-zone I dodged a bullet.

In .NET, the default encoding is UTF8 BOM: when reading a file it checks BOM, and if present, remove it from the string.

When writing to the file, the default UTF8 BOM bytes is added.

However, UTF8Encoding.GetBytes() does not emit it, even if the UTF8Encoding objects has BOM setting set, for that a call to UTF8Encoding.GetPreamble() is needed.

TL;DR: My implementation is conform... by chance. (Tested against your vector)
Hopefully the test vectors of BIP39 (which I added since beginning) should have caught the bug though.

NicolasDorier added a commit to MetacoSA/NBitcoin that referenced this pull request Jan 13, 2018
@dabura667
Copy link

@NicolasDorier the test vector generator included with Trezor-mnemonic library has all 0x00 as a vector generator for edge cases which results in all first words except the last.

@cornfeedhobo
Copy link

cornfeedhobo commented Jul 4, 2018

Can we get a resolution here? My understanding of these wordlist definitions negates the need for the BOM, and it is not recommended for use.

@NicolasDorier
Copy link
Contributor

why this has not been merged?

@voisine
Copy link
Contributor

voisine commented Jul 5, 2018

ACK

@cornfeedhobo
Copy link

@Kirvx Could we get your ACK on this?

@Kirvx
Copy link
Contributor

Kirvx commented Aug 1, 2018

ACK

@cornfeedhobo
Copy link

@luke-jr Anything else needed to get this merged?

@luke-jr luke-jr merged commit 29a8781 into bitcoin:master Aug 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants