Fix two errors in the BIP 39 French wordlist #622
The BIP 39 wordlist contained two significant technical errors:

- Byte Order Marker (BOM) U+FEFF at the beginning of the first line, preceding the word "abaisser".
- No newline '\n' char terminating the last line, after "zoologie".

The former may cause user loss of funds. An implementation which generates a mnemonic phrase and also turns it into a BIP 39 seed value may feed the string "<U+FEFF>abaisser" to the KDF, while displaying the word "abaisser" to the user. Of course, it cannot be expected that the user would enter "<U+FEFF>abaisser" upon attempt to restore a wallet. In the face of a buggy wordlist, whitespace handling and normalization cannot be absolutely relied on to remove a notoriously mischievous character. Those who provide technical support may be well advised to ask French users with unrestorable wallets, "Did your mnemonic phrase contain the word 'abaisser'?"

The latter broke the shell script I use to massage wordlists into C sources when building https://github.com/nym-zone/easyseed.

I know of only one commonplace platform where software regularly prepends UTF-8 files with a spurious U+FEFF, and oftentimes omits a line terminator on the last line even when asked to create a Unix ('\n') text file. It is RECOMMENDED that new wordlists be examined for correctness using standard shell tools on a sane platform.
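For anyone who wants to script the two checks described above, here is a minimal sketch in Python. It is not part of this pull request. The function name `check_wordlist` and the wording of its messages are made up for illustration, and it inspects raw bytes so that no decoder gets a chance to hide the BOM.

```python
BOM_UTF8 = b"\xef\xbb\xbf"  # the UTF-8 encoding of U+FEFF


def check_wordlist(data: bytes) -> list:
    """Return a list of problems found in the raw bytes of a wordlist file.

    Hypothetical helper: checks only the two defects named in this PR,
    plus U+FEFF appearing anywhere else in the file.
    """
    problems = []
    if data.startswith(BOM_UTF8):
        problems.append("leading BOM (U+FEFF)")
    elif BOM_UTF8 in data:
        problems.append("U+FEFF inside the file")
    if not data.endswith(b"\n"):
        problems.append("last line not terminated with newline")
    return problems


if __name__ == "__main__":
    # Simulated file contents; a real check would read french.txt in "rb" mode.
    broken = BOM_UTF8 + b"abaisser\nzoologie"
    fixed = b"abaisser\nzoologie\n"
    print(check_wordlist(broken))
    print(check_wordlist(fixed))
```

Reading the file in binary mode matters here: a BOM-aware text decoder may silently strip the very character being looked for.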
ACK 50c4f12 I checked all other current wordlists and they all:
I pulled @nym-zone's branch locally and inspected the file using a hex editor to double check the new french.txt is fixed. As for @nym-zone's concerns: The only wallet I know of that shows users French phrases is Copay, and they don't use the files as-is from the BIP. If there is a wallet that uses the exact file from the BIP and shows French users phrases in French, that warning applies... but I don't think any wallet does so.
Thanks, @dabura667. I should have clarified, I discovered this when writing a mnemonic phrase generator tool which embeds all eight wordlists currently in the BIP repository. I reported French, because that’s what was broken. Insofar as I could when dealing with multiple languages I do not know, I exercised reasonable care to assure the integrity of all the wordlists. (Then I wrote an automated battery of runtime tests against the compile-time SHA-256 hashes of the files, in case somebody out there may have buggy tools which mangle UTF-8 when building...) I just hexdumped Copay’s french.js to double-check; and its string representation is fine as for the ironically self-abasing “abaisser”.
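The hash-pinning idea mentioned above can be sketched roughly as follows. This is not easyseed's actual test code (that project is written in C); the names `sha256_hex` and `verify_wordlist` are hypothetical, and the pinned value in the usage lines is the well-known SHA-256 of empty input, used only as a stand-in.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_wordlist(data: bytes, expected_hex: str) -> bool:
    """Compare the digest of embedded wordlist bytes to a pinned value."""
    return sha256_hex(data) == expected_hex.lower()


if __name__ == "__main__":
    # Stand-in pin: the SHA-256 of zero bytes of input.
    pinned = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    print(verify_wordlist(b"", pinned))              # matches the pin
    print(verify_wordlist(b"\xef\xbb\xbf", pinned))  # a stray BOM changes the digest
```

Because the digest covers every byte, a build tool that adds a BOM or drops the final newline is caught without any knowledge of the language in the list.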
Checking additional implementations in the wild, I have not (yet?) found any which carry the spurious U+FEFF. But it is not only a matter of Copay. From a popular implementation, widely used because it runs in a web browser: iancoleman/bip39@3a8dbe9, checking wordlist_french.js:
From NovaCrypto/BIP39@5ecf568 (see also), checking French.java:
I will keep my eye out for other French-supporting BIP 39 implementations.
Notice: This is hidden behind the -W flag; see 8aaa6f3.

This is not exactly the wordlist proposed in the pull request. It is the russian.txt from farazdagi/bips@a59cc3e, as modified by approximately the following command:

    uconv -f utf-8 -t utf-8 -x '::nfkd;' | sort -s

The *result* has been confirmed to not have any leading BOM, and to have a final line terminated with '\n' (bitcoin/bips#622). I did not yet examine the source for these issues.

The *result* russian.txt SHA-256 hash: a8d7b9d8bdd3816eddd2aeb98718ad586d8e7dd8c364a944c072cdf3cd6bcb05

Notice: This is hidden behind the -W flag; see 8aaa6f3.

This is not exactly the wordlist proposed in the pull request. It is the czech.txt from zizelevak/bips@b7f682f, as modified by approximately the following command:

    uconv -f utf-8 -t utf-8 -x '::nfkd;' < czech.txt | \
        LC_ALL=C LANG=C sort -s > normalized/czech.txt

The *result* has been confirmed to not have any leading BOM, and to have a final line terminated with '\n' (bitcoin/bips#622). I did not yet examine the source for these issues. I did examine the source to confirm that no lines had any trailing whitespace (see 08a05b4).

SHA-256 hash for the *resulting* czech.txt: 195136b3ba0f3099a9df625e0963f4efb56625b91c3a76bc5b4a9466a26880f7
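For readers without `uconv` at hand, the normalize-then-sort step quoted above can be approximated in Python. This is only a sketch of the same idea, not the command the commits used: `normalize_wordlist` is a hypothetical name, and sorting by UTF-8 bytes is assumed here to stand in for `sort` under `LC_ALL=C`.

```python
import unicodedata


def normalize_wordlist(text: str) -> str:
    """NFKD-normalize each line, then sort the lines by their UTF-8 bytes."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline
    words = [unicodedata.normalize("NFKD", w) for w in lines]
    # Python's sort is stable; the byte-wise key approximates the C locale.
    words.sort(key=lambda w: w.encode("utf-8"))
    # Always terminate every line, including the last one, with '\n'.
    return "".join(w + "\n" for w in words)


if __name__ == "__main__":
    out = normalize_wordlist("\u00e9cole\nbanane\nabaisser\n")
    print(out.encode("utf-8"))
```

Note that NFKD decomposes a precomposed "é" (U+00E9) into "e" followed by U+0301, so the bytes and therefore the file hash change even though the rendered words look identical.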
ACK. @voisine Are Breadwallet French users affected?
Thanks, @Kirvx. I didn’t mean to imply anything about Breadwallet; I was simply trying to get the attention of persons with significant involvement in #152 and/or maintenance of BIP 39 wordlists and/or BIP maintenance. But, good idea! Now, I checked the only Breadwallet French file I could find on a brief search,
(That file is delimited by XML, not newlines; so as long as “zoologie” is there, that’s fine also.) By the way, thank you for your work creating this list. I strongly urge that BIP 39 should have broad language support; and French is an important language. I can see in #152 how much work was put in to make and refine the list. Too bad, it seems abaisser wanted to abase itself.
😃 Thank you for checking Breadwallet @nym-zone 👍
It would be nice to add a test vector. NBitcoin also has French hard-coded... I could not find a proper tool on Windows that shows me hidden characters, though (Sublime and VS Code don't work for some reason).
Try Notepad++
From the nym-zone/easyseed@c7d698a set of a dozen languages’ test vectors, I give you French test vectors in a convenient JSON format. Please run them with your implementation. As stated in the preceding commit log from two days ago (nym-zone/easyseed@5f35cd0, q.v.), these vectors are specifically designed to flunk implementations which do not perform proper Unicode NFKD normalization, even with words not containing diacritics (including the English wordlist, too). (I may rename or modify things; but the versioned link will obviously remain stable.)
Pertinent to the below, but also as a general point:
Of course, due to its double-personality as “ZERO WIDTH NO-BREAK SPACE” (ZWNBSP), it’s also zero-width—and therefore invisible on any display which supports Unicode:
(Check, there is an extra character in between there.) Thus if it gets fed to PBKDF2 for BIP 39 seed generation, and the user is shown the corresponding mnemonic, then the user will write down a phrase which will not restore the wallet. That’s why I panicked when I saw this:
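To make the failure mode concrete, here is a small sketch of BIP 39 seed derivation (PBKDF2-HMAC-SHA512, 2048 iterations, salt "mnemonic" plus passphrase, NFKD-normalized inputs) using only the Python standard library. The function name `bip39_seed` is mine, and the two-word phrase is not a valid mnemonic; it only illustrates that an invisible U+FEFF changes the derived seed.

```python
import hashlib
import unicodedata


def bip39_seed(mnemonic: str, passphrase: str = "") -> bytes:
    """Derive the 64-byte BIP 39 seed from a mnemonic sentence."""
    password = unicodedata.normalize("NFKD", mnemonic).encode("utf-8")
    salt = unicodedata.normalize("NFKD", "mnemonic" + passphrase).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha512", password, salt, 2048, dklen=64)


# Illustrative phrase only; it does not carry a valid BIP 39 checksum.
shown_to_user = "abaisser zoologie"
fed_to_kdf = "\ufeff" + shown_to_user  # what a BOM-tainted wordlist would yield

if __name__ == "__main__":
    # NFKD leaves U+FEFF in place, so the two seeds differ.
    print(bip39_seed(shown_to_user) == bip39_seed(fed_to_kdf))
```

Since U+FEFF has no decomposition, NFKD normalization passes it through unchanged, which is why normalization alone cannot be counted on to rescue a tainted word.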
@NicolasDorier, IWordlistSource.cs from MetacoSA/NBitcoin@45a0ad9 does not contain this bug:
If you used a Windows text editor to somehow copypaste that, I would not expect it to have picked up the My greater concern is automatically processed text. I know for a fact that the Had I foisted that on users, somebody would have lost funds sooner or later. (Apologies for the double post. My bad.)
@nym-zone I dodged a bullet. In .NET, the default encoding is UTF-8 with BOM: when reading a file, it checks for a BOM and, if one is present, removes it from the string. When writing to a file, the default UTF-8 BOM bytes are added. However, TL;DR: My implementation is conformant... by chance. (Tested against your vector)
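A rough Python analogue of this "saved by the decoder" effect, offered only as an illustration and not as a description of .NET internals: the `utf-8-sig` codec strips one leading BOM on decode, while plain `utf-8` keeps it as U+FEFF. The helper name `first_word` is hypothetical.

```python
def first_word(raw: bytes, encoding: str) -> str:
    """Decode raw wordlist bytes and return the first line."""
    return raw.decode(encoding).split("\n")[0]


raw = b"\xef\xbb\xbfabaisser\nzoologie\n"  # simulated BOM-prefixed file

if __name__ == "__main__":
    # repr() makes the otherwise invisible U+FEFF show up as '\ufeff'.
    print(repr(first_word(raw, "utf-8")))      # BOM survives decoding
    print(repr(first_word(raw, "utf-8-sig")))  # BOM is stripped
    # str.strip() does not treat U+FEFF as whitespace, so it is no safety net.
    print(repr(first_word(raw, "utf-8").strip()))
```

Whether an implementation is affected therefore depends on which decoder happens to read the file, which matches the "by chance" observation above.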
@NicolasDorier The test vector generator included with the Trezor mnemonic library uses all-0x00 entropy as an edge-case vector, which results in every word being the first word of the list except the last.
Can we get a resolution here? My understanding of these wordlist definitions negates the need for the BOM, and it is not recommended for use.
Why has this not been merged?
ACK
@Kirvx Could we get your ACK on this?
ACK
@luke-jr Anything else needed to get this merged?