Add bip39 Indonesian wordlist #621

Closed
wants to merge 2 commits into from

@perlancar perlancar commented Jan 1, 2018

How the wordlist is produced:

  1. Download and uncompress https://dumps.wikimedia.org/idwiki/latest/idwiki-latest-pages-articles.xml.bz2 (the version used when producing the wordlist is 2017-12-30).
  2. Count the words in all the articles inside articles.xml using this script https://github.com/perlancar/perl-WordLists-ID-Common/blob/master/devscripts/count-words-in-mediawiki-articles . The result is
    https://raw.githubusercontent.com/perlancar/perl-WordLists-ID-Common/master/devdata/words.txt .
  3. Curate the words manually (mostly removing non-Indonesian words). The result is https://raw.githubusercontent.com/perlancar/perl-WordLists-ID-Common/master/devdata/words-curated.txt . You can diff these two wordlist text files to see the difference.
  4. Generate the BIP39 Indonesian wordlist using this script https://github.com/perlancar/perl-WordList-ID-BIP39/blob/master/devscripts/gen-wordlist . This script basically picks the most frequent words in words-curated.txt that are not already in the English, Spanish, French, and Italian BIP39 wordlists.
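Steps 2 to 4 can be illustrated roughly as follows. This is only a minimal Python sketch, not the linked Perl scripts: the function names `count_words` and `pick_wordlist`, the tokenizing regex, and the final alphabetical sort are all assumptions made for illustration.

```python
import re
from collections import Counter
from typing import List, Set


def count_words(text: str) -> Counter:
    """Count lowercase ASCII-letter tokens in a chunk of article text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def pick_wordlist(freqs: Counter, curated: Set[str], excluded: Set[str],
                  size: int = 2048) -> List[str]:
    """Pick the `size` most frequent curated words that are not in `excluded`.

    `curated` stands in for the manually curated word set, and `excluded`
    for the union of the existing BIP39 wordlists (both hypothetical inputs).
    """
    candidates = [w for w, _ in freqs.most_common()
                  if w in curated and w not in excluded]
    # Sorting the result is an assumption; plain byte-order sorting works
    # because every word is ASCII.
    return sorted(candidates[:size])
```

For example, `pick_wordlist(count_words(article_text), curated, excluded)` would return at most 2048 words, where `article_text`, `curated`, and `excluded` are supplied by the caller.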

@dabura667

Technical Checklist ACK 2b35f48

Checked:

  1. Is NFKD normalized list
  2. No BOM
  3. Has single trailing LF line break and is separated by LF line breaks
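The three checks above could be automated along these lines. This is a hedged sketch only; the function name `check_wordlist_format` and the wording of the messages are assumptions, and it is not the procedure the reviewer actually used.

```python
import unicodedata
from typing import List


def check_wordlist_format(data: bytes) -> List[str]:
    """Return a list of formatting problems found in raw wordlist bytes."""
    problems = []
    # A UTF-8 byte order mark is the byte sequence EF BB BF.
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("starts with a UTF-8 BOM")
    # LF-only line breaks means no carriage returns anywhere.
    if b"\r" in data:
        problems.append("contains CR characters")
    # Exactly one trailing LF: must end with LF, but not with two of them.
    if not data.endswith(b"\n") or data.endswith(b"\n\n"):
        problems.append("does not end with exactly one LF")
    # NFKD-normalized text is unchanged by normalizing it again.
    text = data.decode("utf-8")
    if unicodedata.normalize("NFKD", text) != text:
        problems.append("is not NFKD-normalized")
    return problems
```

An empty result from `check_wordlist_format(open("indonesian.txt", "rb").read())` would correspond to all three checklist items passing.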

@perlancar
Author

@dabura667 Like English, Indonesian uses only the 26 Latin letters and can be encoded in ASCII, so there is no need for a BOM, and asciibetical sorting suffices. Please advise about the trailing LF and the LF line breaks; I used them because that is what most of the other wordlist files use too.

@dabura667

ACK means "ok to merge" and "Technical Checklist" means I checked for formatting errors and everything is OK.

@perlancar
Author

Ah ok, thanks.

nym-zone added a commit to nym-zone/easyseed that referenced this pull request Jan 5, 2018
As an undocumented test feature, this adds the proposed Indonesian
wordlist from bitcoin/bips#621 .  The wordlist
is subject to change or removal, pending resolution of the BIP pull
request.  Do NOT use this for any purpose other than testing, unless or
until the wordlist be finalized and accepted in final form.

SHA-256 for indonesian.txt in its current version:
a88f63a3e2387453d0c6de09ffe06d318968fe76052c6324169f6a41fa2247e0

The Wikileaks wlupld3ptjvsgwqw.onion, encoded in proposed Indonesian:
perjalanan fokus jujur suaminya otak masuk kamu roket
@nym-zone
Contributor

nym-zone commented Jan 5, 2018

Has this any independent review from other native speakers or experts in the language? The earlier BIP wordlist addition pull requests witnessed much lively discussion and examination. One was superseded by a new proposal after significant problems were found.

Since this is said to be ASCII, it is easy to check some basic characteristics in addition to those checked by @dabura667:

$ grep '^[^a-z]' indonesian.txt
$ grep -Eo '^[a-z]{0,3}$' indonesian.txt
$ grep -Eo '^[a-z]{4}' indonesian.txt | sort -s | uniq | wc -l
2048
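For readers without grep at hand, roughly the same three checks can be sketched in Python. The function name `check_ascii_wordlist` and the result keys are assumptions; note also that this version tests every character of each word, whereas the first grep command above only tests the first character of each line.

```python
import re
from typing import Dict, List


def check_ascii_wordlist(words: List[str]) -> Dict[str, object]:
    """Report basic characteristics of an ASCII wordlist."""
    return {
        # Words containing anything other than lowercase a-z.
        "non_lowercase_ascii": [w for w in words if not re.fullmatch(r"[a-z]+", w)],
        # Words shorter than four letters.
        "shorter_than_4": [w for w in words if len(w) < 4],
        # Number of distinct four-letter prefixes; for a 2048-word list with
        # unique prefixes this should be 2048.
        "distinct_4_letter_prefixes": len({w[:4] for w in words}),
    }
```

With the list loaded as `words = open("indonesian.txt").read().split()`, the first two entries should be empty lists and the third should equal `len(words)`.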

Sample 24-word mnemonic (outside of code-tagging to permit line breaks):

$ easyseed -b 256 -l id
keuskupan utuh kegunaan serta pesisir mungkin reguler cermin langsung enam parkir lari gaib bensin babak dinilai meluncurkan mandiri bijaksana keamanan domestik bercerita prefektur legislatif
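For context, the way BIP39 turns entropy into word indices can be sketched as below. This follows the mnemonic-generation rule in BIP39 (a checksum of ENT/32 bits taken from the SHA-256 of the entropy, then 11-bit groups); it is not easyseed's code, and the function names and the placeholder wordlist in the usage note are assumptions.

```python
import hashlib
from typing import List


def entropy_to_indices(entropy: bytes) -> List[int]:
    """Map BIP39 entropy to a list of 11-bit wordlist indices."""
    ent_bits = len(entropy) * 8
    if ent_bits not in (128, 160, 192, 224, 256):
        raise ValueError("entropy must be 128, 160, 192, 224 or 256 bits")
    cs_bits = ent_bits // 32  # checksum length in bits
    # The checksum is the first cs_bits bits of SHA-256(entropy).
    checksum = hashlib.sha256(entropy).digest()[0] >> (8 - cs_bits)
    combined = (int.from_bytes(entropy, "big") << cs_bits) | checksum
    n_words = (ent_bits + cs_bits) // 11
    # Read 11-bit groups from the most significant end.
    return [(combined >> (11 * (n_words - 1 - i))) & 0x7FF for i in range(n_words)]


def indices_to_mnemonic(indices: List[int], wordlist: List[str]) -> str:
    """Join the selected words with an ASCII space."""
    return " ".join(wordlist[i] for i in indices)
```

With 256 bits of entropy this yields 24 indices, matching the 24-word sample above. A hypothetical placeholder such as `wordlist = [f"w{i:04d}" for i in range(2048)]` is enough to try it; a real run would load the 2048 lines of the wordlist file instead.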

@perlancar, a question from an implementer: Are these strings for identifying this wordlist sensible and appropriate for Indonesian-speaking users?

+	LANG(indonesian,	u8"Bahasa Indonesia",	"id",	ascii_space ),

I have proposed on bitcoin-dev that native language strings and short ASCII codes should be standardized:
https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2018-January/015498.html

@nym-zone
Contributor

nym-zone commented Jan 5, 2018

@perlancar, I recommend that you split 4fcfed7, “Add Perl bip39 implementation: Bitcoin::BIP39”, into a separate pull request. That should be reviewed separately from the wordlist; and it’s not even mentioned in the title or description of this pull request #621. People who glance through the list of pull requests will not even realize that this exists.

It is good that you did atomic commits, so that 2b35f48 (the proposed Indonesian wordlist) and 4fcfed7 (listing of the Perl module in bip-0039.mediawiki itself) can be handled separately.

I am not a maintainer here; therefore, I can only make a “recommendation”.

@perlancar
Author

perlancar commented Jan 5, 2018

@nym-zone I did separate into two pull requests, but admittedly I didn't create a branch for the first PR. And then when I created the second PR, Github merged the two.

@perlancar
Author

@nym-zone Thanks for the input. I did publish the proposed Indonesian wordlist as a Perl module (mentioned above) in the hope of gathering some input, but of course the number of Indonesian Perl users is negligible. I will try to solicit input from more communities. For your information, I am a native Indonesian speaker. The words were selected from the most common Indonesian words, which I curated manually, so I would say they are sensible. But that is my opinion.

nym-zone added a commit to nym-zone/easyseed that referenced this pull request Jan 6, 2018
Buglet:  A wordlist not yet accepted into the BIP repository should not
be visible in the user interface.  It is included only for purposes of
testing, since it may be changed or deleted at any time.  The -L option
automatically listed *all* available wordlist languages; now it has a
special-case check to not list Indonesian.  Change this back if/when
BIP PR bitcoin/bips#621 be accepted.
nym-zone added a commit to nym-zone/easyseed that referenced this pull request Jan 7, 2018
Access to proposed wordlists not (yet) accepted into the BIP repository
is now controlled by the -W option.  This option is and will remain
undocumented in the manpage and other user documentation.

The purpose of this undocumented flag is to permit testing proposed
wordlists and generating test vectors, without inducing users to rely on
wordlists which may be changed or removed at any time without notice.
Users MUST NOT generate actual wallets based on proposed wordlists; so
doing could result in unrecoverable wallets and permanent funds loss.

I am now ready to import three more proposed wordlists I had not hereto
seen amongst BIP pull requests:

 - Russian, bitcoin/bips#432
 - Ukrainian, bitcoin/bips#442
 - Czech, bitcoin/bips#493

Wordlist already imported for testing, now hidden behind -W option:

 - Indonesian, bitcoin/bips#621 (b2f66ba, special-cased d03ddae)

Further changes hereby made, with due apologies for a non-atomic commit:

 - Integrating the -W option necessitated a general cleanup and overhaul
   of the options-handling code.

 - Whilst overhauling the options, I noticed that the documented -P
   option functionality was broken/nonexistent.  Fixed.
nym-zone added a commit to nym-zone/easyseed that referenced this pull request Jan 11, 2018
These are generated with easyseed and the bip39_vectorgen.sh script from
5f35cd0.  There are vectors in twelve languages:  The eight in the BIP
repository, and four more with pending proposals.

Three of the vectors for proposed languages are for wordlists which I
have modified:

 - Russian (234c66c, bitcoin/bips#432)
 - Ukrainian (08a05b4, bitcoin/bips#442)
 - Czech (ba25dfa, bitcoin/bips#493)

The wordlist for Indonesian (bitcoin/bips#621) is unmodified from the
proposal.

Ironically, easyseed does not yet self-test itself with these.  That
will be added in a future release, to verify consistency between builds.
For now, I publish these to aid in interoperability testing between
implementations.
@ubunteroz

Hello, native Indonesian speaker here. Thanks @perlancar for your effort!
Quick review from your wordlist:

  1. It's better to remove conjunctions (such as atau, tetapi, yaitu, and yakni) to avoid confusion.
  2. Also, non-root words can cause further confusion (peperangan -> perang, pedesaan -> desa, menyebabkan -> sebab).
  3. Limiting the wordlist to words of at most 8 or 9 letters would make them easier to remember.
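Points 1 and 3 are mechanical enough to sketch in code. The snippet below is only an illustration: the name `filter_candidates` is hypothetical, the stoplist holds just the four conjunctions named above, and point 2 (reducing words to their roots) is left out because it would need a proper Indonesian stemmer.

```python
from typing import Iterable, List, Set

# Only the conjunctions named in the review; a real stoplist would be longer.
CONJUNCTIONS: Set[str] = {"atau", "tetapi", "yaitu", "yakni"}


def filter_candidates(words: Iterable[str],
                      stoplist: Set[str] = CONJUNCTIONS,
                      max_len: int = 9) -> List[str]:
    """Drop stoplisted words and words longer than `max_len` letters."""
    return [w for w in words if w not in stoplist and len(w) <= max_len]
```

For instance, `filter_candidates(["atau", "desa", "transportasi", "perang"])` keeps only `desa` and `perang`.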

@perlancar
Author

Hi Surya (@ubunteroz), thanks for the comments. All are good. I will manually construct an edited new wordlist when I have some free time, but PR's or patches are welcome :)

@luke-jr
Member

luke-jr commented Jul 5, 2018

@dabura667 Only authors are supposed to ACK changes to BIPs.

In this case, these people: @slush0 @prusnak @voisine @ebfull

upacara
upah
upaya
upeti

This word is unfamiliar even to native speakers.

Author

Sorry, by "this word" which one are you referring to? The GitHub web UI is showing 4 words: upacara, upah, upaya, upeti. I would argue that they are familiar to native speakers. What is your basis for saying that they are not? Take "upeti" for example: this word is found 792 times in the Wikipedia articles (see the PR description, which I have just updated to describe the process of producing the wordlist).

Author

Native Indonesian speaker here. Instead of the Wikipedia corpus (which contains too much technical terminology), have we considered the "Sari Kata Bahasa Indonesia" book?

This book is used nationwide in junior/primary schools. It would ensure anyone with a basic education would be able to understand our wordlist.

This same book is also used as a reference to test if a foreigner can be permitted to work in Indonesian companies.

This is a good idea, but what about license? And is there an online version for it?

tunduk
tunggal
tuntutan
turbin

This word is unfamiliar even to native speakers.

Author

Which word are you referring to? tunduk, tunggal, tuntutan, turbin? All of them? What is the basis? (See my comment on the upeti word above).

tragedi
trailer
transportasi
trek

This word is unfamiliar even to native speakers.

Author

Which word are you referring to? trek? I tend to agree.

tikus
timah
timbul
timnya

This word is a conjugation.

tersebut
tertentu
terutama
terwujud

This word is a conjugation.

Author

I agree that conjugated words should be avoided.

simbol
sinar
sinetron
singel

This imported word has multiple spellings debated by native speakers.

Author

Agreed.

sikap
siklus
silat
silinder

Technical term for most native speakers.

Author

I can agree to this.

senjata
sensus
sentral
senyawa

Is a conjugation or technical term for most native speakers.

Author

Agreed.

semakin
sembilan
sementara
seminggu

Conjugation

Author

Agreed.

sama
sambil
sampai
samurai

Imported word not familiar to most native speakers.

Author

Agreed on imported words, though I don't share your opinion that "samurai" is unfamiliar to most native speakers.

@heri16

heri16 commented Jul 3, 2019

Native Indonesian speaker here. Instead of the Wikipedia corpus (which contains too much technical terminology), have we considered the "Sari Kata Bahasa Indonesia" book?

This book is used nationwide in junior/primary schools. It would ensure anyone with a basic education would be able to understand our wordlist.

This same book is also used as a reference to test if a foreigner can be permitted to work in Indonesian companies.

@perlancar
Author

perlancar commented Jul 3, 2019

I have created another wordlist. The source is still Wikipedia Indonesia, but this time I manually curated the BIP wordlist using these criteria:

  1. Words from 4 to 10 letters long.
  2. Avoid conjugated words, choose only root words.
  3. Avoid prepositions, conjunctions, pronouns (e.g. dan, atau, jika, kamu, saya, ...).
  4. Avoid technical words when possible.
  5. Avoid loan words when possible.
  6. Avoid words that have multiple competing spellings, if possible.
  7. The above criteria are balanced against this one: I want each word in the BIP wordlist to be unique in its first 4 letters, so that even if the user omits or mistypes letters at the end of a word, it can still be corrected.
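To make criterion 7 concrete, here is a small sketch of how a wallet could use the unique 4-letter prefixes to recover a word from incomplete or partly mistyped input. The function names `build_prefix_index` and `resolve` are hypothetical and are not taken from any of the linked repositories.

```python
from typing import Dict, Iterable, Optional


def build_prefix_index(words: Iterable[str], prefix_len: int = 4) -> Dict[str, str]:
    """Map each word's first `prefix_len` letters to the word itself.

    Raises ValueError if two words share a prefix, since the list would
    then violate the uniqueness criterion.
    """
    index: Dict[str, str] = {}
    for w in words:
        key = w[:prefix_len]
        if key in index:
            raise ValueError(f"prefix {key!r} is shared by {index[key]!r} and {w!r}")
        index[key] = w
    return index


def resolve(typed: str, index: Dict[str, str], prefix_len: int = 4) -> Optional[str]:
    """Return the wordlist entry matching the first letters typed, if any."""
    return index.get(typed[:prefix_len])
```

With an index built from `["upacara", "upah", "upaya", "upeti"]`, the input `upacra` (letters dropped after the fourth) still resolves to `upacara`, while an unknown prefix returns `None`.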

The resulting BIP wordlist is put here: https://github.com/perlancar/perl-WordList-ID-BIP39/blob/master/devdata/words-bip.txt. I haven't made a change to this PR. Inputs/comments welcome.

As you might see, some of the words are indeed not very common among native speakers, because I also want to satisfy criterion number 7.

Also useful is the larger wordlist from which I curated this BIP wordlist: https://github.com/perlancar/perl-WordList-ID-BIP39/blob/master/devdata/words.txt . I indented the words which I want to use in the BIP wordlist. You can suggest corrections or changes by submitting a PR which changes this file.

@luke-jr
Member

luke-jr commented Jul 2, 2021

@luke-jr luke-jr closed this Jul 2, 2021