BIP39: Added ukrainian wordlist #442

Closed
wants to merge 5 commits into from

@Bohdat Bohdat commented Sep 5, 2016

No description provided.

@luke-jr
Member

luke-jr commented Sep 5, 2016

@voisine
Contributor

voisine commented Sep 13, 2016

This needs to be NFKD-normalized, which you can do with the following Perl script:

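#!/usr/bin/perl
# Filter: writes an NFKD-normalized copy of its input to stdout.

use strict;
use warnings;
use Unicode::Normalize;       # provides NFKD()
use open qw(:std :utf8);      # treat the standard streams as UTF-8

while (<>) {
    print NFKD($_);
}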
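For readers who prefer Python, the same normalization step can be sketched as follows. This is only an illustration, not code from the pull request; the helper names `nfkd_lines` and `is_nfkd` are hypothetical. It also shows why the step matters for Ukrainian: the letter ї (U+0457) decomposes under NFKD into і (U+0456) plus a combining diaeresis (U+0308).

```python
import unicodedata


def nfkd_lines(lines):
    """Return every line NFKD-normalized (hypothetical helper)."""
    return [unicodedata.normalize("NFKD", line) for line in lines]


def is_nfkd(text):
    """True if the text is already in NFKD form (hypothetical helper)."""
    return unicodedata.normalize("NFKD", text) == text


# "\u0457" is the precomposed letter ї; the second word has nothing to decompose.
sample = ["\u0457жак\n", "пісня\n"]
normalized = nfkd_lines(sample)

for before, after in zip(sample, normalized):
    print(len(before), len(after), is_nfkd(before))
```

The first word grows by one code point after normalization, while the second is unchanged.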

@voisine
Contributor

voisine commented Sep 13, 2016

Looks good to me, but I'd like a second Ukrainian speaker to go over the list and verify it meets the word list criteria before ACKing.

@greenaddress
Contributor

greenaddress commented Sep 13, 2016

We reviewed the words (Ukrainian speaker) and they look OK. However, the list doesn't seem to be sorted (run sort on it; export LANG=C first if you don't have it set).

If the list is sorted, it allows faster processing (binary search), and we think it is worthwhile doing.
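A minimal sketch of what byte-order sorting buys, assuming the list is held in memory as Python strings. The function names `is_byte_sorted` and `word_index` are hypothetical. Python compares `str` values by code point, which gives the same order as comparing their UTF-8 bytes, so a list sorted the way `LANG=C sort` sorts it can be searched with `bisect`.

```python
from bisect import bisect_left


def is_byte_sorted(words):
    """True if the words are strictly ascending by their UTF-8 bytes."""
    encoded = [w.encode("utf-8") for w in words]
    return all(a < b for a, b in zip(encoded, encoded[1:]))


def word_index(sorted_words, word):
    """Binary search; returns the index of word, or -1 if it is absent."""
    i = bisect_left(sorted_words, word)
    if i < len(sorted_words) and sorted_words[i] == word:
        return i
    return -1


# Byte order is not alphabetical order: і (U+0456) sorts after я (U+044F).
words = sorted(["яблуко", "вода", "пісня", "ідея"], key=lambda w: w.encode("utf-8"))
print(words)
print(word_index(words, "пісня"), word_index(words, "кіт"))
```

Note that the resulting order is by code point, not by the Ukrainian alphabet, so a word starting with і ends up after every word starting with я.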

@zerko

zerko commented Sep 23, 2016

It doesn't look like the words are identifiable by their first four letters.

@slush0
Contributor

slush0 commented Sep 23, 2016

A script for validating all BIP39-defined rules (such as uniqueness of the first four letters) is here:
https://github.com/trezor/python-mnemonic/blob/master/test_mnemonic.py

It may need fixes for UTF-8 (or possibly a slight rewrite for Python 3, which handles Unicode much better), but passing these tests is required before a wordlist is added to the BIP.
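The first-four-letters rule can be checked in a few lines. The sketch below is not the code from test_mnemonic.py; `prefix_collisions` is a hypothetical helper that groups words by their first `n` code points and reports any group with more than one member. One assumption to keep in mind: Python slicing counts code points, so on NFKD-normalized text a letter such as й occupies two positions.

```python
from collections import defaultdict


def prefix_collisions(words, n=4):
    """Map each shared n-character prefix to the words that collide on it."""
    groups = defaultdict(list)
    for word in words:
        groups[word[:n]].append(word)
    return {prefix: ws for prefix, ws in groups.items() if len(ws) > 1}


# Illustrative input only; these words are not taken from the proposed list.
collisions = prefix_collisions(["робота", "робочий", "вода"])
print(collisions)
```

An empty result means every word is identifiable by its first four characters.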

@slush0
Contributor

slush0 commented Sep 23, 2016

Okay, I ran test_mnemonic.py (with Python 3, with no problems) and it gave me this list of duplicates: http://pastebin.com/ztBqDT9q

There were some other minor errors, but this needs some work.

nym-zone added a commit to nym-zone/easyseed that referenced this pull request Jan 7, 2018
Access to proposed wordlists not (yet) accepted into the BIP repository
is now controlled by the -W option.  This option is and will remain
undocumented in the manpage and other user documentation.

The purpose of this undocumented flag is to permit testing proposed
wordlists and generating test vectors, without inducing users to rely on
wordlists which may be changed or removed at any time without notice.
Users MUST NOT generate actual wallets based on proposed wordlists; so
doing could result in unrecoverable wallets and permanent funds loss.

I am now ready to import three more proposed wordlists I had not hereto
seen amongst BIP pull requests:

 - Russian, bitcoin/bips#432
 - Ukrainian, bitcoin/bips#442
 - Czech, bitcoin/bips#493

Wordlist already imported for testing, now hidden behind -W option:

 - Indonesian, bitcoin/bips#621 (b2f66ba, special-cased d03ddae)

Further changes hereby made, with due apologies for a non-atomic commit:

 - Integrating the -W option necessitated a general cleanup and overhaul
   of the options-handling code.

 - Whilst overhauling the options, I noticed that the documented -P
   option functionality was broken/nonexistent.  Fixed.
nym-zone added a commit to nym-zone/easyseed that referenced this pull request Jan 7, 2018
Notice:  This is hidden behind the -W flag; see 8aaa6f3.

This is not exactly the wordlist proposed in the pull request.  The file
ukrainian.txt from Bohdat/bips@152fc59 has a bug, in addition to the
usual normalization and sorting concerns:  A trailing space (0x20) and
tab (0x09, '\t') after the word at original index 1393, 1-based line
number 1394, and before the newline '\n'.  The problem was first
identified by failure of easyseed's extensive internal self-tests,
followed by examination of the problem with cmp(1) and hex dumps to
diagnose the difference between the wordlist in my source tree, and the
wordlist printed on stdout by `easyseed -W -P -l uk`.

The following is edited for line length limits in the git log, but it
adequately shows the problem:

$ grep -E '[[:space:]]$' ukrainian.txt | hd
00000000  d0 bf d1 96 d1 81 d0 bd  d1 8f 20 09 0a
$ grep -En '[[:space:]]$' ukrainian.txt
1394:пісня 	<*end of line is here*>

It is fixed with the following command:

$ sed -E -e 's/[[:space:]]+$//' < ukrainian.txt > ukfix1/uk_fixed0.txt

After verification that this command made no other changes, it is
normalized and sorted:

$ ls -l ukrainian.txt ukfix1/uk_fixed0.txt
-rw-r--r-- 1 user user 24550 Jan  7 21:26 ukfix1/uk_fixed0.txt
-rw-r--r-- 1 user user 24552 Jan  7 20:31 ukrainian.txt
$ diff -u3 ukrainian.txt ukfix1/uk_fixed0.txt
[...showing only the desired line changed...]
$ uconv -f utf-8 -t utf-8 -x '::nfkd;' < uk_fixed0.txt | \
	LC_ALL=C LANG=C sort -s > uk_fixed1.txt
$ mv -i uk_fixed1.txt ../../easyseed/wordlist/ukrainian.txt
mv: overwrite '../../easyseed/wordlist/ukrainian.txt'? y

(Note with ref to 234c66c:  When normalizing and sorting the russian.txt
list, I forgot to force the locale for `sort(1)`.  I verified that this
makes no difference, and the 234c66c russian.txt is correct.  It *does*
make a very large difference for the Ukrainian wordlist.)

SHA-256 hash for the resulting ukrainian.txt:
612ee29e1fa13dc38c9e1b31c7ef980db8f3c8dd30f1c9377170d1b10e895dc9
@nym-zone
Contributor

nym-zone commented Jan 8, 2018

At nym-zone/easyseed@08a05b4, I have created a bugfixed ukrainian.txt which is NFKD-normalized and binary-sorted, and fixes one technical bug.

The ukrainian.txt from Bohdat/bips@152fc59 contains a trailing space (0x20) then tab (0x09, '\t') after the word at original index 1393 (1-based line number 1394), before the newline '\n'. The problem was first identified by failure of easyseed’s extensive internal self-tests, followed by examination with cmp(1) and hex dumps to diagnose the difference between the wordlist in my source tree, and the wordlist printed on stdout by easyseed -W -P -l uk.

The following commands pinpoint the problem:

$ grep -E '[[:space:]]$' ukrainian.txt | hd
00000000  d0 bf d1 96 d1 81 d0 bd  d1 8f 20 09 0a           |.......... ..|
0000000d
$ echo "\"`grep -En '[[:space:]]$' ukrainian.txt`\""
"1394:пісня 	"

(@dabura667, perhaps you may want to add that to your punch-list of technical checks.)

It is fixed with the following command:

$ sed -E -e 's/[[:space:]]+$//' < ukrainian.txt > ukfix1/uk_fixed0.txt

After verification that this command made no other changes, the list is normalized and sorted:

$ ls -l ukrainian.txt ukfix1/uk_fixed0.txt
-rw-r--r-- 1 user user 24550 Jan  7 21:26 ukfix1/uk_fixed0.txt
-rw-r--r-- 1 user user 24552 Jan  7 20:31 ukrainian.txt
$ diff -u3 ukrainian.txt ukfix1/uk_fixed0.txt
[...showing only the desired line changed...]
$ uconv -f utf-8 -t utf-8 -x '::nfkd;' < uk_fixed0.txt | \
	LC_ALL=C LANG=C sort -s > uk_fixed1.txt
$ mv -i uk_fixed1.txt ../../easyseed/wordlist/ukrainian.txt
mv: overwrite '../../easyseed/wordlist/ukrainian.txt'? y

SHA-256 hash for the resulting ukrainian.txt:

612ee29e1fa13dc38c9e1b31c7ef980db8f3c8dd30f1c9377170d1b10e895dc9
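The repair steps described above (find trailing whitespace, strip it, NFKD-normalize, sort by bytes, hash) can be sketched in Python. This is an approximation of the grep/sed/uconv/sort pipeline, not a replacement for it: the function names are hypothetical, `str.rstrip()` strips a somewhat wider set of whitespace than `[[:space:]]` in the C locale, and nothing here has been checked against the SHA-256 value quoted above.

```python
import hashlib
import unicodedata


def split_lines(raw_text):
    """Split on '\n' only, dropping the empty piece after a final newline."""
    lines = raw_text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()
    return lines


def trailing_whitespace(raw_text):
    """Return (1-based line number, line) pairs that end in whitespace."""
    return [(n, line) for n, line in enumerate(split_lines(raw_text), start=1)
            if line != line.rstrip()]


def clean_wordlist(raw_text):
    """Strip trailing whitespace, NFKD-normalize, then sort by UTF-8 bytes."""
    words = [unicodedata.normalize("NFKD", line.rstrip())
             for line in split_lines(raw_text)]
    words.sort(key=lambda w: w.encode("utf-8"))  # list.sort is stable, like sort -s
    return "\n".join(words) + "\n"


def sha256_hex(text):
    """SHA-256 of the UTF-8 encoding of the text, as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Two-line stand-in for the real file, reproducing the space-then-tab bug.
raw = "пісня \t\nвода\n"
print(trailing_whitespace(raw))
cleaned = clean_wordlist(raw)
print(repr(cleaned), sha256_hex(cleaned))
```

Running `trailing_whitespace` before and after cleaning gives a quick confirmation that the only flagged line was the intended one.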

nym-zone added a commit to nym-zone/easyseed that referenced this pull request Jan 11, 2018
These are generated with easyseed and the bip39_vectorgen.sh script from
5f35cd0.  There are vectors in twelve languages:  The eight in the BIP
repository, and four more with pending proposals.

Three of the vectors for proposed languages are for wordlists which I
have modified:

 - Russian (234c66c, bitcoin/bips#432)
 - Ukrainian (08a05b4, bitcoin/bips#442)
 - Czech (ba25dfa, bitcoin/bips#493)

The wordlist for Indonesian (bitcoin/bips#621) is unmodified from the
proposal.

Ironically, easyseed does not yet self-test itself with these.  That
will be added in a future release, to verify consistency between builds.
For now, I publish these to aid in interoperability testing between
implementations.
@DonaldTsang DonaldTsang mentioned this pull request Dec 24, 2018
22 tasks
@kittyandrew

kittyandrew commented Jun 14, 2021

Hello, I had almost started working on my own list for this when I found this PR. Can someone tell me what exactly here needs to be fixed, updated, or reviewed?

Follow-up question: are there any preferences among nouns, verbs, and adjectives? There are at least a few special words that translate to "and", "or", "there", etc.

Edit: In addition, there are many closely related words and different forms of the same word. I'm working on those.

Edit 2: I will probably have to create a new PR in the end, because the current contributor is inactive.

@luke-jr luke-jr closed this Jul 2, 2021
8 participants