Add bip39 Indonesian wordlist #621
Technical Checklist ACK 2b35f48 Checked:
@dabura667 Like English, Indonesian uses only the 26 Latin letters and is encodable in ASCII, so there is no need for a BOM, and asciibetical sorting suffices. Please advise about the trailing LF and LF line breaks; I used them because most of the other wordlist files do too.
ACK means "ok to merge" and "Technical Checklist" means I checked for formatting errors and everything is OK.
Ah ok, thanks.
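The formatting points raised above (no BOM, pure ASCII, LF line breaks with a trailing LF, asciibetical sorting) can be checked mechanically. The following is a minimal sketch, not the checklist actually used in this review. The function name `check_wordlist` is hypothetical, the 2048-word count comes from BIP39, and the unique-four-letter-prefix check reflects a BIP39 wordlist recommendation rather than anything stated in this thread.

```python
import string


def check_wordlist(data: bytes, expected_count: int = 2048) -> list[str]:
    """Return a list of problems found in a raw wordlist file; empty means OK."""
    problems = []
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
    if b"\r" in data:
        problems.append("file contains CR characters (expected LF-only line breaks)")
    if not data.endswith(b"\n"):
        problems.append("file does not end with a trailing LF")
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        problems.append("file is not pure ASCII")
        return problems

    words = text.split("\n")
    if words and words[-1] == "":
        words.pop()  # drop the empty string produced by the trailing LF

    if len(words) != expected_count:
        problems.append(f"expected {expected_count} words, found {len(words)}")
    allowed = set(string.ascii_lowercase)
    if any(not w or set(w) - allowed for w in words):
        problems.append("some lines are empty or contain characters outside a-z")
    if words != sorted(words):
        # Python compares str by code point, which is asciibetical for ASCII.
        problems.append("words are not in asciibetical order")
    if len(set(words)) != len(words):
        problems.append("duplicate words present")
    if len({w[:4] for w in words}) != len(words):
        problems.append("first four letters do not uniquely identify each word")
    return problems
```

Calling `check_wordlist(open("indonesian.txt", "rb").read())` on a candidate file would return an empty list if every check passes.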
As an undocumented test feature, this adds the proposed Indonesian wordlist from bitcoin/bips#621 . The wordlist is subject to change or removal, pending resolution of the BIP pull request. Do NOT use this for any purpose other than testing, unless or until the wordlist be finalized and accepted in final form.

SHA-256 for indonesian.txt in its current version:
a88f63a3e2387453d0c6de09ffe06d318968fe76052c6324169f6a41fa2247e0

The Wikileaks wlupld3ptjvsgwqw.onion, encoded in proposed Indonesian:
perjalanan fokus jujur suaminya otak masuk kamu roket
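To confirm that a local copy of the wordlist is the same version referred to in this commit message, its SHA-256 can be recomputed and compared. This is a minimal sketch using Python's standard `hashlib`; the helper names `sha256_hex` and `matches_expected` are hypothetical, and the path passed in is assumed to point at your own copy of `indonesian.txt`.

```python
import hashlib

# Digest quoted in the commit message above for indonesian.txt.
EXPECTED_SHA256 = "a88f63a3e2387453d0c6de09ffe06d318968fe76052c6324169f6a41fa2247e0"


def sha256_hex(path: str) -> str:
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True if the file's digest equals the expected hex digest."""
    return sha256_hex(path) == expected.lower()
```

A mismatch only shows that the file differs from the version hashed above, for example because the proposal was later revised.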
Has this any independent review from other native speakers or experts in the language? The earlier BIP wordlist addition pull requests witnessed much lively discussion and examination. One was superseded by a new proposal after significant problems were found. Since this is said to be ASCII, it is easy to check some basic characteristics in addition to those checked by @dabura667:
Sample 24-word mnemonic (outside of code-tagging to permit line breaks):
@perlancar, a question from an implementer: Are these strings for identifying this wordlist sensible and appropriate for Indonesian-speaking users?
I have proposed on bitcoin-dev that native language strings and short ASCII codes should be standardized:
@perlancar, I recommend that you split 4fcfed7, “Add Perl bip39 implementation: Bitcoin::BIP39”, into a separate pull request. That should be reviewed separately from the wordlist; and it’s not even mentioned in the title or description of this pull request #621. People who glance through the list of pull requests will not even realize that this exists. It is good that you did atomic commits, so that 2b35f48 (the proposed Indonesian wordlist) and 4fcfed7 (listing of the Perl module in bip-0039.mediawiki itself) can be handled separately. I am not a maintainer here; therefore, I can only make a “recommendation”.
@nym-zone I did separate them into two pull requests, but admittedly I didn't create a branch for the first PR. Then, when I created the second PR, GitHub merged the two.
@nym-zone Thanks for the input. I did publish the proposed Indonesian wordlist as a Perl module (mentioned above) in the hope of gathering some input, but of course the number of Indonesian Perl users is negligible. I will try to solicit input from more communities. For your information, I am a native Indonesian speaker. The words were selected from the most common Indonesian words, which I curated manually, so I would say they are sensible. But that is my opinion.
Buglet: A wordlist not yet accepted into the BIP repository should not be visible in the user interface. It is included only for purposes of testing, since it may be changed or deleted at any time. The -L option automatically listed *all* available wordlist languages; now it has a special-case check to not list Indonesian. Change this back if/when BIP PR bitcoin/bips#621 be accepted.
Access to proposed wordlists not (yet) accepted into the BIP repository is now controlled by the -W option. This option is and will remain undocumented in the manpage and other user documentation. The purpose of this undocumented flag is to permit testing proposed wordlists and generating test vectors, without inducing users to rely on wordlists which may be changed or removed at any time without notice. Users MUST NOT generate actual wallets based on proposed wordlists; so doing could result in unrecoverable wallets and permanent funds loss.

I am now ready to import three more proposed wordlists I had not hereto seen amongst BIP pull requests:
- Russian, bitcoin/bips#432
- Ukrainian, bitcoin/bips#442
- Czech, bitcoin/bips#493

Wordlist already imported for testing, now hidden behind -W option:
- Indonesian, bitcoin/bips#621 (b2f66ba, special-cased d03ddae)

Further changes hereby made, with due apologies for a non-atomic commit:
- Integrating the -W option necessitated a general cleanup and overhaul of the options-handling code.
- Whilst overhauling the options, I noticed that the documented -P option functionality was broken/nonexistent. Fixed.
These are generated with easyseed and the bip39_vectorgen.sh script from 5f35cd0. There are vectors in twelve languages: The eight in the BIP repository, and four more with pending proposals.

Three of the vectors for proposed languages are for wordlists which I have modified:
- Russian (234c66c, bitcoin/bips#432)
- Ukrainian (08a05b4, bitcoin/bips#442)
- Czech (ba25dfa, bitcoin/bips#493)

The wordlist for Indonesian (bitcoin/bips#621) is unmodified from the proposal.

Ironically, easyseed does not yet self-test itself with these. That will be added in a future release, to verify consistency between builds. For now, I publish these to aid in interoperability testing between implementations.
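For implementers comparing their output against such vectors, the language-independent part of BIP39 is the mapping from entropy to 11-bit word indices; only the final lookup depends on the wordlist. The following is a sketch of that mapping as described in BIP39, not code taken from easyseed. The function names are hypothetical.

```python
import hashlib


def entropy_to_indices(entropy: bytes) -> list[int]:
    """Map BIP39 entropy (128 to 256 bits) to a list of 11-bit word indices."""
    ent_bits = len(entropy) * 8
    if ent_bits not in (128, 160, 192, 224, 256):
        raise ValueError("entropy must be 128, 160, 192, 224 or 256 bits")
    cs_bits = ent_bits // 32  # checksum length: 4 to 8 bits
    # The checksum is the first cs_bits bits of SHA-256(entropy).
    checksum = hashlib.sha256(entropy).digest()[0] >> (8 - cs_bits)
    combined = (int.from_bytes(entropy, "big") << cs_bits) | checksum
    n_words = (ent_bits + cs_bits) // 11
    return [(combined >> (11 * (n_words - 1 - i))) & 0x7FF for i in range(n_words)]


def indices_to_words(indices: list[int], wordlist: list[str]) -> list[str]:
    """Look up each index in a 2048-entry wordlist (any language)."""
    if len(wordlist) != 2048:
        raise ValueError("a BIP39 wordlist must contain exactly 2048 words")
    return [wordlist[i] for i in indices]
```

Two implementations that agree on the indices for a given entropy but disagree on the mnemonic are using different versions of the wordlist, which is the failure mode that matters while a proposed list is still being revised.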
Hello, native Indonesian speaker here. Thanks @perlancar for your effort!
Hi Surya (@ubunteroz), thanks for the comments. All are good. I will manually construct an edited new wordlist when I have some free time, but PRs or patches are welcome :)
@dabura667 Only authors are supposed to ACK changes to BIPs. In this case, these people: @slush0 @prusnak @voisine @ebfull
upacara
upah
upaya
upeti
This word is unfamiliar even to native speakers.
Sorry, which one are you referring to by "this word"? The GitHub web UI is showing 4 words: upacara, upah, upaya, upeti. I would argue that they are familiar to native speakers. What is your basis for saying that they are not? Take "upeti", for example: this word is found 792 times in the Wikipedia article (see the PR description, which I have just updated to describe the process of producing the wordlist).
Native Indonesian speaker here. Instead of the Wikipedia corpus (which contains too much technical terminology), have we considered the "Sari Kata Bahasa Indonesia" book?
This book is used nationwide in junior/primary schools. It would ensure anyone with a basic education would be able to understand our wordlist.
This same book is also used as a reference to test if a foreigner can be permitted to work in Indonesian companies.
This is a good idea, but what about the license? And is there an online version of it?
tunduk
tunggal
tuntutan
turbin
This word is unfamiliar even to native speakers.
Which word are you referring to? tunduk, tunggal, tuntutan, turbin? All of them? What is the basis? (See my comment on the upeti word above).
tragedi
trailer
transportasi
trek
This word is unfamiliar even to native speakers.
Which word are you referring to? trek? I tend to agree.
tikus
timah
timbul
timnya
This word is a conjugation.
tersebut
tertentu
terutama
terwujud
This word is a conjugation.
I agree that conjugated words should be avoided.
simbol
sinar
sinetron
singel
This imported word has multiple spellings, which are debated by native speakers.
Agreed.
sikap
siklus
silat
silinder
Technical term for most native speakers.
I can agree to this.
senjata
sensus
sentral
senyawa
This is a conjugation or a technical term for most native speakers.
Agreed.
semakin
sembilan
sementara
seminggu
Conjugation
Agreed.
sama
sambil
sampai
samurai
Imported word not familiar to most native speakers.
Agreed on imported words, though I don't share your opinion that "samurai" is unfamiliar to most native speakers.
I have created another wordlist. The source is still Wikipedia Indonesia, but this time I manually curated the BIP wordlist using these criteria:
The resulting BIP wordlist is here: https://github.com/perlancar/perl-WordList-ID-BIP39/blob/master/devdata/words-bip.txt. I haven't changed this PR yet. Input and comments are welcome. As you might see, some of the words are indeed not very popular among native speakers, because I also want to satisfy criterion number 7. Also useful is the larger wordlist from which I curated this BIP wordlist: https://github.com/perlancar/perl-WordList-ID-BIP39/blob/master/devdata/words.txt . I indented the words which I want to use in the BIP wordlist. You can suggest corrections or changes by submitting a PR which changes this file.
How the wordlist is produced:
https://raw.githubusercontent.com/perlancar/perl-WordLists-ID-Common/master/devdata/words.txt .
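The frequency argument made earlier in this thread (for example, counting how often "upeti" occurs in a Wikipedia-derived corpus) can be reproduced with a simple word count. This is only an illustrative sketch under the assumption that the corpus is available as plain text. It is not the procedure actually used to produce the wordlist, and the function names are hypothetical.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count occurrences of each lowercase a-z word in a plain-text corpus."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def most_common_words(freq: Counter, limit: int = 2048) -> list[str]:
    """Return the `limit` most frequent words, sorted asciibetically."""
    return sorted(word for word, _ in freq.most_common(limit))
```

A raw frequency ranking like this is only a starting point: as the review comments show, frequent words can still be conjugations, technical terms, or loanwords with disputed spellings, so manual curation by native speakers remains necessary.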