Czech wordlist for BIP0039 #493
Words are sorted according to the English alphabet (Czech sorting differs in how it treats “ch”).
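For non-Czech readers, the difference can be illustrated with a minimal sketch. It assumes lowercase words without diacritics and a simplified Czech alphabet in which "ch" is a single letter placed after "h". The helper `czech_key` and the sample words are hypothetical and are not taken from the proposed wordlist.

```python
# Simplified Czech alphabet for unaccented lowercase words:
# the digraph "ch" counts as one letter and comes right after "h".
CZECH_ORDER = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {letter: i for i, letter in enumerate(CZECH_ORDER)}


def czech_key(word):
    """Build a sort key that treats "ch" as a single letter after "h"."""
    key = []
    i = 0
    while i < len(word):
        if word.startswith("ch", i):
            key.append(RANK["ch"])
            i += 2
        else:
            key.append(RANK[word[i]])
            i += 1
    return tuple(key)


# Hypothetical sample words, chosen only to show the difference.
words = ["cihla", "chata", "hrad", "cesta", "ikona"]

english_order = sorted(words)               # plain code point order
czech_order = sorted(words, key=czech_key)  # "ch" sorts after "h"

print(english_order)
print(czech_order)
```

With plain code point ordering, "chata" lands between "cesta" and "cihla"; with the Czech key it moves after "hrad". Wallet software that compares plain strings gets the first ordering with no extra work.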
The wordlist passed our unit test in python-mnemonic. I especially like that the words don't use diacritics. ACK from me. |
Is "nynfa" a Czech word? orchest? peleton? |
@paveljanik Thanks, you found two mistakes. The correct forms are "nymfa" and "orchestr". I fixed both. Peleton is OK according to the Institute of the Czech Language: http://ssjc.ujc.cas.cz/ |
@paveljanik We both use databases from Institute of the Czech Language. You use only handbook, I use Dictionary of written languages, my database is more complex. Meaning of peleton: The peloton (from French, originally meaning 'platoon') is the main group or pack of riders in a road bicycle race. |
@zizelevak More complex may not be always correct, see yourself: http://prirucka.ujc.cas.cz/?slovo=peloton – even referenced SSJČ states that "peleton" is incorrect form. By the way, Prirucka is currently the best normative reference available. |
@xHire @paveljanik Peleton was removed, Piksla was added |
ch after c? |
What about limitace -> limit? |
mincovna -> mince? |
motivace -> motiv, motorka -> motor? |
moudrost -> moudro |
nikl? [nykl]: not found ;-) Is this a good word? |
normativ -> norma |
novotvar -> novota? |
tankista -> tank? |
- Changed bariera --> smaragd.
- ch after c? The wordlist is sorted by the English alphabet, not by the Czech alphabet. That makes it simpler to implement this wordlist in wallet software.
- limit: cannot be used, it is included in the English wordlist.
- "motivace": I think it is more suitable than "motiv". Both words have a foreign origin, but "motivace" has the Czech suffix -ace.
- motor: cannot be used, it is included in the English wordlist.
- "moudro": I don't agree. "Moudrost" is 50 times more frequent than "moudro" according to the Czech corpus SYN2005.
- nikl: it is OK, it is a metal, the chemical element with atomic number 28.
- norma: cannot be used, it is included in the Spanish wordlist.
- Changed novotvar --> novota.
- "odolat": not possible, it is similar to another word in the wordlist, "odvolat".
- "okov": it is part of a well; "okovy" means shackles. I prefer "okovy", I think it is more common.
- Changed otrhanec --> otrhat.
- "popis": not possible, it is similar to another word in the wordlist, "dopis". |
@zizelevak Thanks for updates! Can you now please squash? I'm fine with the list now. Great work, BTW! 👍 |
Ah, thanks for reminder (via notification), I also had some comments/questions, will post them later today (probably in the evening)! :c) |
I have divided my comment into several groups to make it easier to work with. One thing to note first: while I comment on some words, at the end I also provide a buffer of alternatives in case it is unclear which words could replace the problematic ones. (By the way, the comments at the beginning of each section are mostly for non-Czech-speaking followers.)
Infrequently used/known words
In this list I also include (but not only) words that I have never heard of. :c) That doesn’t mean I can’t look them up, but I suppose they are so rare that they might not be a good fit for this dictionary.

Words with different diacritics
There are some words in Czech that have dual spelling: sometimes without diacritics, sometimes with diacritics. Those I list below sound quite forced to me.

Plural forms
As stated in the README, words should be in their base forms, which means they should be in the singular.

Words with (optional) spacing
So-called „příslovečné spřežky“: words whose preposition may be turned into a prefix. Below are those that I suppose are not much more common in the single-word form, or that might be confusing, or that don’t sound right to me.

Minor modification proposals
So as to make some words sound more natural or clearer.

Forced forms of words
Technically correct, but (mostly) uncommon word forms (forced into these just to make them lose their diacritics).

More like informal forms and/or not sounding neutral
(I know some are (probably) correct, they just don’t sound ideal to me.)

Words I simply don’t like here ;c)
(Or I couldn’t decide which other group to put them in.)

Alternatives
I counted 64 words without a suggestion in the lists above. The words below are already checked not to collide at the prefix level, and many also at the single-letter-difference level. I have 58 of them, which is 6 words short… So I put 6 or so words above into parentheses to indicate their lower priority. ;c)
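The two collision checks mentioned above can be sketched roughly as follows. This is an illustration only, not the script that was actually used in this thread. The prefix length of 4 is an assumption borrowed from the BIP39 convention for the English list, and `prefix_collisions`, `within_one_edit` and `similar_pairs` are hypothetical helper names.

```python
from collections import defaultdict
from itertools import combinations


def prefix_collisions(words, prefix_len=4):
    """Group words sharing the same first `prefix_len` letters (assumed: 4)."""
    groups = defaultdict(list)
    for word in words:
        groups[word[:prefix_len]].append(word)
    return {p: ws for p, ws in groups.items() if len(ws) > 1}


def within_one_edit(a, b):
    """True if a and b differ by exactly one substitution, insertion or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make `a` the shorter (or equal-length) word
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substituted letter
    return a[i:] == b[i + 1:]          # one extra letter in `b`


def similar_pairs(words):
    """All pairs of words that are a single edit apart (quadratic scan)."""
    return [(a, b) for a, b in combinations(words, 2) if within_one_edit(a, b)]


# Pairs discussed earlier in this thread, plus one unrelated word.
sample = ["odolat", "odvolat", "popis", "dopis", "nikl"]
print(similar_pairs(sample))
print(prefix_collisions(["motiv", "motivace", "motor"]))
```

A quadratic scan is fine for a list of 2048 words. The same `similar_pairs` check could also be run against the union of the candidate words and the existing wordlist.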
I’m looking forward to hearing your opinion! :c) |
@xHire
- falzum: removed
- chlor: removed
- holinky → holinka: OK, changed
- nadlouho: removed
- dominant → dominanta: no, dominanta is too long
- drtivost: removed drtivost, zdrtit; added drtit
- fabrika: removed
- euro: removed
- borka: it is very uncommon
- Added my new words |
What's the status of this? |
@luke-jr |
@zizelevak what about aspirace and kapybara? |
@paveljanik In the Czech frequency corpus, aspirace has a frequency of 433 points and kapybara has 16 points. Aspirace is quite a common Czech word. Kapybara is not very frequent, but the list contains tens of words with a lower frequency than kapybara. I will keep both words in the list. |
@luke-jr |
@slush0 Can you re-ACK the changed version? |
Friendly ping @slush0 |
Access to proposed wordlists not (yet) accepted into the BIP repository is now controlled by the -W option. This option is and will remain undocumented in the manpage and other user documentation.

The purpose of this undocumented flag is to permit testing proposed wordlists and generating test vectors, without inducing users to rely on wordlists which may be changed or removed at any time without notice. Users MUST NOT generate actual wallets based on proposed wordlists; so doing could result in unrecoverable wallets and permanent funds loss.

I am now ready to import three more proposed wordlists I had not hereto seen amongst BIP pull requests:
- Russian, bitcoin/bips#432
- Ukrainian, bitcoin/bips#442
- Czech, bitcoin/bips#493

Wordlist already imported for testing, now hidden behind -W option:
- Indonesian, bitcoin/bips#621 (b2f66ba, special-cased d03ddae)

Further changes hereby made, with due apologies for a non-atomic commit:
- Integrating the -W option necessitated a general cleanup and overhaul of the options-handling code.
- Whilst overhauling the options, I noticed that the documented -P option functionality was broken/nonexistent. Fixed.
Notice: This is hidden behind the -W flag; see 8aaa6f3.

This is not exactly the wordlist proposed in the pull request. It is the czech.txt from zizelevak/bips@b7f682f, as modified by approximately the following command:

    uconv -f utf-8 -t utf-8 -x '::nfkd;' < czech.txt | \
        LC_ALL=C LANG=C sort -s > normalized/czech.txt

The *result* has been confirmed to not have any leading BOM, and to have a final line terminated with '\n' (bitcoin/bips#622). I did not yet examine the source for these issues. I did examine the source to confirm that no lines had any trailing whitespace (see 08a05b4).

SHA-256 hash for the *resulting* czech.txt:
195136b3ba0f3099a9df625e0963f4efb56625b91c3a76bc5b4a9466a26880f7
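For implementers without uconv at hand, the same kind of processing can be approximated in Python. This is a sketch under assumptions, not the procedure used for the commit above: `normalize_wordlist` and `check_wordlist_bytes` are hypothetical helpers, the sample words are made up, and it has not been verified that this reproduces the published hash.

```python
import hashlib
import unicodedata


def normalize_wordlist(text):
    """NFKD-normalize each word, then sort by raw UTF-8 bytes (like LC_ALL=C sort)."""
    words = [unicodedata.normalize("NFKD", line)
             for line in text.split("\n") if line]
    words.sort(key=lambda w: w.encode("utf-8"))  # list.sort is stable, like sort -s
    return "\n".join(words) + "\n"


def check_wordlist_bytes(data):
    """Report the file-level properties mentioned above for an encoded wordlist."""
    lines = data.split(b"\n")
    return {
        "no_bom": not data.startswith(b"\xef\xbb\xbf"),
        "final_newline": data.endswith(b"\n"),
        "no_trailing_whitespace": all(line == line.rstrip() for line in lines),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Made-up sample with precomposed characters; not the real czech.txt.
sample = "\u017e\u00e1ba\nchata\ncesta\n"
normalized = normalize_wordlist(sample)
report = check_wordlist_bytes(normalized.encode("utf-8"))
print(repr(normalized))
print(report)
```

After NFKD, accented letters are stored as a base letter followed by combining marks, so a bytewise sort orders them by their base letter first.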
I have created a Unicode NFKD-normalized and sorted
The result has been confirmed to not have any leading BOM, and to have a final line terminated with SHA-256 hash for the resulting
|
These are generated with easyseed and the bip39_vectorgen.sh script from 5f35cd0.

There are vectors in twelve languages: the eight in the BIP repository, and four more with pending proposals.

Three of the vectors for proposed languages are for wordlists which I have modified:
- Russian (234c66c, bitcoin/bips#432)
- Ukrainian (08a05b4, bitcoin/bips#442)
- Czech (ba25dfa, bitcoin/bips#493)

The wordlist for Indonesian (bitcoin/bips#621) is unmodified from the proposal.

Ironically, easyseed does not yet self-test itself with these. That will be added in a future release, to verify consistency between builds. For now, I publish these to aid in interoperability testing between implementations.
Sorry guys, I completely missed this. If the wordlist still passes bip39 tests, I'm fine with that (I didn't test it myself). |
Cross-reference #753 |
I'm not sure how this was missed, but the order of words in the wordlist is not alphabetical. The word "svetr" should come after "svazek". The fact that the list looks sorted but is not may cause subtle bugs in applications using binary search to look up code words. |
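To make the failure mode concrete, here is a minimal sketch of a sortedness check and a binary-search lookup. Only "svetr" and "svazek" come from the wordlist; the surrounding words and the helper names `first_unsorted_pair` and `binary_lookup` are hypothetical.

```python
from bisect import bisect_left


def first_unsorted_pair(words):
    """Return (index, word, next_word) for the first out-of-order pair, else None."""
    for i in range(len(words) - 1):
        if words[i] > words[i + 1]:
            return (i, words[i], words[i + 1])
    return None


def binary_lookup(words, target):
    """Binary search that silently assumes `words` is sorted."""
    i = bisect_left(words, target)
    if i < len(words) and words[i] == target:
        return i
    return None


# Hypothetical excerpt reproducing the reported misordering.
excerpt = ["suma", "svetr", "svazek", "svitek", "synek"]

print(first_unsorted_pair(excerpt))
print(binary_lookup(excerpt, "svazek"))          # present, but not found
print(binary_lookup(sorted(excerpt), "svazek"))  # found once the list is sorted
```

In this excerpt a linear search would still find both words, which is why the problem is easy to miss. An implementation could either run a check like `first_unsorted_pair` when loading the list or avoid binary search altogether.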
Consensus set of words after discussion in the Czech community's Facebook group