Add bip39 Indonesian wordlist #621

Open: wants to merge 2 commits into base: master
7 participants
@perlancar
commented Jan 1, 2018

How the wordlist is produced:

  1. Download and uncompress https://dumps.wikimedia.org/idwiki/latest/idwiki-latest-pages-articles.xml.bz2 (the dump used to produce the wordlist is dated 2017-12-30).
  2. Count the words in all the articles inside articles.xml using this script https://github.com/perlancar/perl-WordLists-ID-Common/blob/master/devscripts/count-words-in-mediawiki-articles . The result is
    https://raw.githubusercontent.com/perlancar/perl-WordLists-ID-Common/master/devdata/words.txt .
  3. Curate the words manually (mostly removing non-Indonesian words). The result is https://raw.githubusercontent.com/perlancar/perl-WordLists-ID-Common/master/devdata/words-curated.txt . You can diff these two wordlist text files to see the differences.
  4. Generate the BIP39 Indonesian wordlist using this script https://github.com/perlancar/perl-WordList-ID-BIP39/blob/master/devscripts/gen-wordlist . This script basically picks the most frequent words in words-curated.txt that are not already in the English, Spanish, French, and Italian BIP39 wordlists.
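
The selection step described above can be sketched roughly as follows. This is only an illustration in Python, not the linked Perl scripts: the function names `count_words` and `pick_wordlist`, the tokenization rule (runs of ASCII letters), and the in-memory inputs are all assumptions made for the example.

```python
import re
from collections import Counter


def count_words(texts):
    """Count lowercase ASCII-letter words across an iterable of article texts."""
    counts = Counter()
    for text in texts:
        # Assumption: a "word" is a run of the letters a-z after lowercasing.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def pick_wordlist(counts, curated, excluded, size=2048):
    """Pick the `size` most frequent curated words that are not in `excluded`.

    `curated` stands in for the manually curated word set, and `excluded`
    for the union of the existing BIP39 wordlists to avoid.
    """
    chosen = []
    for word, _ in counts.most_common():
        if word in curated and word not in excluded:
            chosen.append(word)
            if len(chosen) == size:
                break
    # BIP39 wordlist files are sorted; plain string sorting suffices for ASCII.
    return sorted(chosen)
```

With real data, `texts` would come from the Wikipedia dump and `excluded` from the existing wordlist files; the sketch does not parse MediaWiki XML.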
@dabura667

commented Jan 1, 2018

Technical Checklist ACK 2b35f48

Checked:

  1. The list is NFKD-normalized
  2. No BOM
  3. The file has a single trailing LF line break and words are separated by LF line breaks
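
For readers who want to reproduce these checks, here is a minimal sketch in Python. It is not the tooling the reviewer used; the function name `check_wordlist_bytes` and the wording of the messages are assumptions made for illustration.

```python
import unicodedata

UTF8_BOM = b"\xef\xbb\xbf"


def check_wordlist_bytes(data, expected_count=2048):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if data.startswith(UTF8_BOM):
        problems.append("starts with a UTF-8 BOM")
    if b"\r" in data:
        problems.append("contains CR characters; expected LF-only line breaks")
    if not data.endswith(b"\n") or data.endswith(b"\n\n"):
        problems.append("must end with exactly one LF")
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["not valid UTF-8"]
    # NFKD normalization must leave an already-normalized file unchanged.
    if unicodedata.normalize("NFKD", text) != text:
        problems.append("not NFKD-normalized")
    words = [w for w in text.split("\n") if w]
    if len(words) != expected_count:
        problems.append(f"expected {expected_count} words, found {len(words)}")
    return problems
```

A typical call would read the file in binary mode, for example `check_wordlist_bytes(open("indonesian.txt", "rb").read())`.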
@perlancar

Author

commented Jan 1, 2018

@dabura667 Like English, Indonesian uses only the 26 Latin letters and is encodable in ASCII, so there is no need for a BOM, and asciibetical sorting suffices. Please advise about the trailing LF and the LF line breaks; I used them because that is how most of the other wordlist files are formatted too.

@dabura667

commented Jan 1, 2018

ACK means "ok to merge" and "Technical Checklist" means I checked for formatting errors and everything is OK.

@perlancar

Author

commented Jan 1, 2018

Ah ok, thanks.

nym-zone added a commit to nym-zone/easyseed that referenced this pull request Jan 5, 2018

Add proposed Indonesian wordlist (undocumented)
As an undocumented test feature, this adds the proposed Indonesian
wordlist from bitcoin/bips#621 .  The wordlist
is subject to change or removal, pending resolution of the BIP pull
request.  Do NOT use this for any purpose other than testing, unless or
until the wordlist be finalized and accepted in final form.

SHA-256 for indonesian.txt in its current version:
a88f63a3e2387453d0c6de09ffe06d318968fe76052c6324169f6a41fa2247e0

The Wikileaks wlupld3ptjvsgwqw.onion, encoded in proposed Indonesian:
perjalanan fokus jujur suaminya otak masuk kamu roket
@nym-zone

Contributor

commented Jan 5, 2018

Has this any independent review from other native speakers or experts in the language? The earlier BIP wordlist addition pull requests witnessed much lively discussion and examination. One was superseded by a new proposal after significant problems were found.

Since this is said to be ASCII, it is easy to check some basic characteristics in addition to those checked by @dabura667:

$ grep '^[^a-z]' indonesian.txt
$ grep -Eo '^[a-z]{0,3}$' indonesian.txt
$ grep -Eo '^[a-z]{4}' indonesian.txt | sort -s | uniq | wc -l
2048
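
The three `grep` checks above (no word starts with a non-lowercase character, no word is shorter than four letters, and the 4-letter prefixes are all distinct) can also be expressed in Python. This is a sketch for illustration only; the function name `basic_checks` is an assumption and it is not part of easyseed.

```python
import re


def basic_checks(words):
    """Mirror the three grep checks on a list of words.

    Returns (bad_start, too_short, n_prefixes):
      bad_start  - words whose first character is not a-z
      too_short  - words made of at most three a-z letters
      n_prefixes - number of distinct 4-letter a-z prefixes
    """
    bad_start = [w for w in words if re.match(r"[^a-z]", w)]
    too_short = [w for w in words if re.fullmatch(r"[a-z]{0,3}", w)]
    prefixes = set()
    for w in words:
        m = re.match(r"[a-z]{4}", w)
        if m:
            prefixes.add(m.group(0))
    return bad_start, too_short, len(prefixes)
```

For a list of 2048 words, the first two results should be empty and the third should equal 2048, matching the output shown above.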

Sample 24-word mnemonic (outside of code-tagging to permit line breaks):

$ easyseed -b 256 -l id
keuskupan utuh kegunaan serta pesisir mungkin reguler cermin langsung enam parkir lari gaib bensin babak dinilai meluncurkan mandiri bijaksana keamanan domestik bercerita prefektur legislatif
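
As background on why 256 bits of entropy yield 24 words: BIP39 appends a checksum of ENT/32 bits (the leading bits of the SHA-256 of the entropy) and splits the result into 11-bit indices into the 2048-word list. The following Python sketch shows that derivation. It is not easyseed's code, and the function name `entropy_to_indices` is an assumption for the example; mapping each index to a word is then just `wordlist[i]`.

```python
import hashlib


def entropy_to_indices(entropy):
    """Convert BIP39 entropy bytes into a list of 11-bit wordlist indices."""
    if len(entropy) not in (16, 20, 24, 28, 32):
        raise ValueError("entropy must be 128, 160, 192, 224 or 256 bits")
    ent_bits = len(entropy) * 8
    cs_bits = ent_bits // 32  # checksum length in bits (4 to 8)
    digest = hashlib.sha256(entropy).digest()
    # Append the first cs_bits bits of the digest to the entropy.
    value = (int.from_bytes(entropy, "big") << cs_bits) | (digest[0] >> (8 - cs_bits))
    n_words = (ent_bits + cs_bits) // 11
    # Read off 11-bit groups, most significant group first.
    return [(value >> (11 * (n_words - 1 - i))) & 0x7FF for i in range(n_words)]
```

For 256 bits, 256 + 8 = 264 bits, which is exactly 24 groups of 11 bits.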

@perlancar, a question from an implementer: Are these strings for identifying this wordlist sensible and appropriate for Indonesian-speaking users?

+	LANG(indonesian,	u8"Bahasa Indonesia",	"id",	ascii_space ),

I have proposed on bitcoin-dev that native language strings and short ASCII codes should be standardized:
https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2018-January/015498.html

@nym-zone

Contributor

commented Jan 5, 2018

@perlancar, I recommend that you split 4fcfed7, “Add Perl bip39 implementation: Bitcoin::BIP39”, into a separate pull request. That should be reviewed separately from the wordlist; and it’s not even mentioned in the title or description of this pull request #621. People who glance through the list of pull requests will not even realize that this exists.

It is good that you did atomic commits, so that 2b35f48 (the proposed Indonesian wordlist) and 4fcfed7 (listing of the Perl module in bip-0039.mediawiki itself) can be handled separately.

I am not a maintainer here; therefore, I can only make a “recommendation”.

@perlancar

Author

commented Jan 5, 2018

@nym-zone I did separate them into two pull requests, but admittedly I didn't create a branch for the first PR. And then when I created the second PR, GitHub merged the two.

@perlancar

Author

commented Jan 5, 2018

@nym-zone Thanks for the input. I did publish the proposed Indonesian wordlist as a Perl module (mentioned above) in the hope of gathering some input, but of course the number of Indonesian Perl users is negligible. I will try to solicit input from more communities. For your information, I am a native Indonesian speaker. The words were selected from the most common Indonesian words, which I curated manually, so I would say they are sensible. But that is my opinion.

nym-zone added a commit to nym-zone/easyseed that referenced this pull request Jan 6, 2018

Unlist proposed Indonesian from -L option
Buglet:  A wordlist not yet accepted into the BIP repository should not
be visible in the user interface.  It is included only for purposes of
testing, since it may be changed or deleted at any time.  The -L option
automatically listed *all* available wordlist languages; now it has a
special-case check to not list Indonesian.  Change this back if/when
BIP PR bitcoin/bips#621 be accepted.

nym-zone added a commit to nym-zone/easyseed that referenced this pull request Jan 7, 2018

Add undocumented -W option for tests; fix options
Access to proposed wordlists not (yet) accepted into the BIP repository
is now controlled by the -W option.  This option is and will remain
undocumented in the manpage and other user documentation.

The purpose of this undocumented flag is to permit testing proposed
wordlists and generating test vectors, without inducing users to rely on
wordlists which may be changed or removed at any time without notice.
Users MUST NOT generate actual wallets based on proposed wordlists; so
doing could result in unrecoverable wallets and permanent funds loss.

I am now ready to import three more proposed wordlists I had not hereto
seen amongst BIP pull requests:

 - Russian, bitcoin/bips#432
 - Ukrainian, bitcoin/bips#442
 - Czech, bitcoin/bips#493

Wordlist already imported for testing, now hidden behind -W option:

 - Indonesian, bitcoin/bips#621 (b2f66ba, special-cased d03ddae)

Further changes hereby made, with due apologies for a non-atomic commit:

 - Integrating the -W option necessitated a general cleanup and overhaul
   of the options-handling code.

 - Whilst overhauling the options, I noticed that the documented -P
   option functionality was broken/nonexistent.  Fixed.

nym-zone added a commit to nym-zone/easyseed that referenced this pull request Jan 11, 2018

Add BIP 39 test vectors for 12 languages
These are generated with easyseed and the bip39_vectorgen.sh script from
5f35cd0.  There are vectors in twelve languages:  The eight in the BIP
repository, and four more with pending proposals.

Three of the vectors for proposed languages are for wordlists which I
have modified:

 - Russian (234c66c, bitcoin/bips#432)
 - Ukrainian (08a05b4, bitcoin/bips#442)
 - Czech (ba25dfa, bitcoin/bips#493)

The wordlist for Indonesian (bitcoin/bips#621) is unmodified from the
proposal.

Ironically, easyseed does not yet self-test itself with these.  That
will be added in a future release, to verify consistency between builds.
For now, I publish these to aid in interoperability testing between
implementations.
@ubunteroz

commented May 27, 2018

Hello, native Indonesian speaker here. Thanks @perlancar for your effort!
A quick review of your wordlist:

  1. It's better to remove conjunctions (such as atau, tetapi, yaitu, and yakni) to avoid confusion.
  2. Non-root words can also cause confusion with their roots (peperangan -> perang, pedesaan -> desa, menyebabkan -> sebab).
  3. Limiting the wordlist to words of at most 8 or 9 letters would make them easier to remember.
@perlancar

Author

commented Jun 2, 2018

Hi Surya (@ubunteroz), thanks for the comments. All are good. I will manually construct a new, edited wordlist when I have some free time, but PRs or patches are welcome :)

@luke-jr

Member

commented Jul 5, 2018

@dabura667 Only authors are supposed to ACK changes to BIPs.

In this case, these people: @slush0 @prusnak @voisine @ebfull

@sabran125

commented on 2b35f48 Jul 11, 2018

@DonaldTsang DonaldTsang referenced this pull request Dec 24, 2018

Open

Binary Lists #44

0 of 22 tasks complete
upacara
upah
upaya
upeti

@heri16

heri16 Jul 3, 2019

This word is unfamiliar even to native speakers.

@perlancar

perlancar Jul 3, 2019

Author

Sorry, by "this word" which one are you referring to? The GitHub web UI is showing 4 words: upacara, upah, upaya, upeti. I would argue that they are familiar to native speakers. What is your basis for saying that they are not? Let's take "upeti" for example: this word is found 792 times in the Wikipedia articles (see the PR description, which I have just updated to describe the process of producing the wordlist).

@perlancar

perlancar Jul 3, 2019

Author

> Native Indonesian speaker here. Instead of the Wikipedia corpus (which contains too much technical terminology), have we considered the "Sari Kata Bahasa Indonesia" book?
>
> This book is used nationwide in junior/primary schools. It would ensure anyone with a basic education would be able to understand our wordlist.
>
> This same book is also used as a reference to test if a foreigner can be permitted to work in Indonesian companies.

This is a good idea, but what about the license? And is there an online version of it?

tunduk
tunggal
tuntutan
turbin

@heri16

heri16 Jul 3, 2019

This word is unfamiliar even to native speakers.

@perlancar

perlancar Jul 3, 2019

Author

Which word are you referring to? tunduk, tunggal, tuntutan, turbin? All of them? What is the basis? (See my comment on the upeti word above).

tragedi
trailer
transportasi
trek

@heri16

heri16 Jul 3, 2019

This word is unfamiliar even to native speakers.

@perlancar

perlancar Jul 3, 2019

Author

Which word are you referring to? trek? I tend to agree.

tikus
timah
timbul
timnya

@heri16

heri16 Jul 3, 2019

This word is a conjugation.

tersebut
tertentu
terutama
terwujud

@heri16

heri16 Jul 3, 2019

This word is a conjugation.

@perlancar

perlancar Jul 3, 2019

Author

I agree that conjugated words should be avoided.

simbol
sinar
sinetron
singel

@heri16

heri16 Jul 3, 2019

This imported word has multiple spellings that are debated by native speakers.

@perlancar

perlancar Jul 3, 2019

Author

Agreed.

sikap
siklus
silat
silinder

@heri16

heri16 Jul 3, 2019

Technical term for most native speakers.

@perlancar

perlancar Jul 3, 2019

Author

I can agree to this.

senjata
sensus
sentral
senyawa

@heri16

heri16 Jul 3, 2019

This is a conjugation or a technical term for most native speakers.

@perlancar

perlancar Jul 3, 2019

Author

Agreed.

semakin
sembilan
sementara
seminggu

@heri16

heri16 Jul 3, 2019

Conjugation

@perlancar

perlancar Jul 3, 2019

Author

Agreed.

sama
sambil
sampai
samurai

@heri16

heri16 Jul 3, 2019

Imported word not familiar to most native speakers.

@perlancar

perlancar Jul 3, 2019

Author

Agreed on imported words, though I don't share your opinion that "samurai" is unfamiliar to most native speakers.

@heri16

commented Jul 3, 2019

Native Indonesian speaker here. Instead of the Wikipedia corpus (which contains too much technical terminology), have we considered the "Sari Kata Bahasa Indonesia" book?

This book is used nationwide in junior/primary schools. It would ensure anyone with a basic education would be able to understand our wordlist.

This same book is also used as a reference to test if a foreigner can be permitted to work in Indonesian companies.

@perlancar

Author

commented Jul 3, 2019

I have created another wordlist. The source is still Wikipedia Indonesia, but this time I manually curated the BIP wordlist using these criteria:

  1. Words from 4 to 10 letters long.
  2. Avoid conjugated words, choose only root words.
  3. Avoid prepositions, conjunctions, pronouns (e.g. dan, atau, jika, kamu, saya, ...).
  4. Avoid technical words when possible.
  5. Avoid loan words when possible.
  6. Avoid words that have multiple competing spellings, if possible.
  7. The above criteria are balanced against this one: I want every word in the BIP wordlist to be uniquely identified by its first 4 letters, so that even if the user omits or mistypes some letters at the end of a word, it can still be corrected.
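
The mechanical parts of these criteria (length limits, an exclusion list, and unique 4-letter prefixes) can be sketched as a filter in Python. This is an illustration only, not the process actually used: the function name `select_words` and its parameters are assumptions, and criteria such as "root words only" or "avoid technical words" still require manual judgment.

```python
def select_words(candidates, stopwords=frozenset(), min_len=4, max_len=10,
                 prefix_len=4, size=2048):
    """Greedily select words that pass the mechanical criteria.

    `candidates` is assumed to be ordered from most to least frequent, so
    when two words share a prefix the more frequent one is kept.
    `stopwords` stands in for prepositions, conjunctions and pronouns.
    """
    chosen = []
    seen_prefixes = set()
    for word in candidates:
        if not (min_len <= len(word) <= max_len):
            continue
        if not (word.isascii() and word.isalpha() and word.islower()):
            continue
        if word in stopwords:
            continue
        prefix = word[:prefix_len]
        if prefix in seen_prefixes:
            continue  # would break uniqueness of the first 4 letters
        seen_prefixes.add(prefix)
        chosen.append(word)
        if len(chosen) == size:
            break
    return sorted(chosen)
```

Because the prefix check is greedy, a rarer word is dropped whenever a more frequent word with the same first four letters was already selected, which is one way the familiarity of the final list can suffer.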

The resulting BIP wordlist is here: https://github.com/perlancar/perl-WordList-ID-BIP39/blob/master/devdata/words-bip.txt. I haven't updated this PR yet. Input and comments are welcome.

As you might see, some of the words are indeed not very familiar to native speakers, because I also want to satisfy criterion number 7.

Also useful is the larger wordlist from which I curated this BIP wordlist: https://github.com/perlancar/perl-WordList-ID-BIP39/blob/master/devdata/words.txt . I indented the words which I want to use in the BIP wordlist. You can suggest corrections or changes by submitting a PR which changes this file.
