Czech wordlist for BIP0039 #493

Merged
merged 15 commits into from
Sep 19, 2019

Conversation

zizelevak
Contributor

Consensus set of words after discussion in the Czech community in a Facebook group.

@luke-jr
Member

luke-jr commented Jan 30, 2017

@slush0
Contributor

slush0 commented Jan 31, 2017

The wordlist passed our unit tests in python-mnemonic. I especially like that the words don't use diacritics. ACK from me.

@paveljanik
Contributor

paveljanik commented Jan 31, 2017

Is "nynfa" a Czech word? orchest? peleton?

@zizelevak
Contributor Author

@paveljanik Thanks, you found two mistakes. The correct forms are "nymfa" and "orchestr"; I fixed both. "Peleton" is OK according to the Institute of the Czech Language: http://ssjc.ujc.cas.cz/

@paveljanik
Contributor

@zizelevak
Contributor Author

zizelevak commented Jan 31, 2017

@paveljanik We both use databases from the Institute of the Czech Language. You use only the handbook; I use the Dictionary of the Standard Czech Language (SSJČ), and my database is more complex. Meaning of peleton: the peloton (from French, originally meaning 'platoon') is the main group or pack of riders in a road bicycle race.
Edit: This post now explains peleton (an older version explained nymfa).

@xHire

xHire commented Jan 31, 2017

@zizelevak More complex is not always more correct. See for yourself at http://prirucka.ujc.cas.cz/?slovo=peloton (even the referenced SSJČ states that "peleton" is an incorrect form). By the way, Prirucka is currently the best normative reference available.

@zizelevak
Contributor Author

@xHire @paveljanik Peleton was removed, Piksla was added

@paveljanik
Contributor

@paveljanik
Contributor

ch after c?

@paveljanik
Contributor

What about limitace -> limit?

@paveljanik
Contributor

mincovna -> mince?

@paveljanik
Contributor

motivace -> motiv, motorka -> motor?

@paveljanik
Contributor

moudrost -> moudro

@paveljanik
Contributor

nikl? [nykl]: not found ;-) Is this a good word?

@paveljanik
Contributor

normativ -> norma

@paveljanik
Contributor

novotvar -> novota?

@paveljanik
Contributor

tankista -> tank?

@zizelevak
Contributor Author

@paveljanik

changed bariera --> smaragd

ch after c? - The wordlist is sorted by the English alphabet, not the Czech alphabet. That makes it simpler to implement this wordlist in wallet software.

limit - cannot be used; it is in the English wordlist
mince - cannot be used; it is in the French wordlist

"motivace" - I think it is more suitable than "motiv". Both words are of foreign origin, but "motivace" has the Czech suffix -ace.

motor - cannot be used; it is in the English wordlist

"moudro" - I don't agree. "Moudrost" is 50 times more frequent than "moudro" according to the Czech corpus SYN2005.

nikl - it's OK; it is a metal, the chemical element with atomic number 28

norma - cannot be used; it is in the Spanish wordlist

changed novotvar --> novota

"odolat" - not possible; it is similar to another word in the wordlist, "odvolat"

"okov" - it is part of a well; "okovy" means shackles. I prefer "okovy", I think it is more common.

changed otrhanec --> otrhat

"popis" - not possible; it is similar to another word in the wordlist, "dopis"
"robot" - cannot be used; it is in the English wordlist
"tank" - cannot be used; it is in the English wordlist

@zizelevak zizelevak closed this Feb 10, 2017
@zizelevak zizelevak reopened this Feb 10, 2017
@paveljanik
Contributor

@zizelevak Thanks for updates! Can you now please squash?

I'm fine with the list now. Great work, BTW! 👍

@xHire

xHire commented Feb 11, 2017

Ah, thanks for the reminder (via notification). I also had some comments/questions and will post them later today (probably in the evening)! :c)

@xHire

xHire commented Feb 11, 2017

I have divided my comment into several groups to make it easier to work with. One thing first: while I comment on some words, at the end I also provide a buffer of alternatives, in case it is unclear which words could replace the problematic ones. (By the way, the comments at the beginning of each section are mostly for followers who don’t speak Czech.)

  • Lhota (spelled as a name)

Infrequently used/known words

In this list I also include (among others) words that I have never heard of. :c) That doesn’t mean I can’t look them up, but I suppose they are so rare that they might not belong in this dictionary.

  • falzum
  • gondola
  • karfiol
  • luneta
  • nefrit
  • nestor
  • normativ
  • opuka
  • pagoda
  • ponton
  • rytec
  • sahel
  • sutana

Words with different diacritics

Some Czech words have dual spellings, one without diacritics and one with. The ones I list below sound quite forced to me.

  • chlor
  • chrom
  • folklor
  • globus
  • kasino
  • kastrol
  • lahev
  • mixer → mixovat
  • naftalen → nafta (but it’s probably already taken in another dictionary I suppose)
  • ozon
  • tampon
  • vitamin → vitalita

Plural forms

As stated in README, words should be in their base forms which means they should be in singular.

  • holinky → holinka
  • lentilky → lentilka
  • piliny → pilina

Words with (optional) spacing

So-called „příslovečné spřežky“ (adverbial compounds): words whose preposition may be fused into a prefix. Below are those whose single-word form I suppose is not much more common, or that might be confusing, or that don’t sound so good to me.

  • nadlouho
  • nadrobno
  • natrvalo
  • natvrdo
  • navenek
  • zdaleka

Minor modification proposals

So as to make some words sound more natural or clearer.

  • dominant → dominanta
  • hltan → hltat
  • kmit → kmitat
  • logicky → logika
  • mulat → mula
  • muzika → muzikant
  • naposled → naposledy
  • pasivum → pasivita
  • roup → roupice
  • vespod → vespodu

Forced forms of words

Technically correct, but (mostly) uncommon word forms (forced into these just to make them lose their diacritics).

  • drtivost
  • kalnost
  • (kluzkost)
  • (ladnost)
  • levnost
  • mokrost → mokro
  • movitost
  • suknice → sukno (suknice is really so old ;c))

More like informal forms and/or not sounding neutral

(I know some are (probably) correct, they just don’t sound ideal to me.)

  • fabrika
  • fanda
  • fara → farnost
  • fiflena
  • fixa → fixace
  • glejt
  • hafan
  • hezoun
  • kafe
  • machr
  • marodka
  • mejdan
  • nimrod
  • piksla
  • smola
  • spratek
  • (tatarka)
  • vatra
  • vloni → vloha
  • (zrzek)

Words I simply don’t like here ;c)

(Or I couldn’t decide in which other group to put them.)

  • euro (not a Czech word and might also become deprecated soon)
  • flirt → flamendr
  • lepra
  • (limitace)
  • minibar → miniatura/minimalista (just wow, why choose the word "minibar" among all those words prefixed with mini- :-D)
  • (mocensky)
  • nahota → nahodile/nahoru
  • nevina
  • (oktet)
  • onkolog
  • podle (more often is a preposition IMO)
  • sekvoje (because it’s commonly spelled a tiny bit differently)
  • sklivec
  • tankista → tanker (if not already used elsewhere)
  • tavenina → tavidlo
  • tenor → teror/terorista
  • tunika
  • varan

Alternatives

I counted 64 words without a suggestion in the lists above. The words below have already been checked not to collide at the prefix level, and many also at the single-letter-difference level. I have 58 of them, which is 6 words short… So I put 6 or so words above in parentheses to indicate their lower priority. ;c)

  • borka
  • celistvost
  • deflace
  • destilace
  • dioda
  • displej
  • epopej
  • firma
  • fukar
  • holokaust
  • horda
  • kabriolet
  • kapybara
  • klima/klimatizace
  • kosmonaut
  • kotoul
  • kotrmelec
  • kropit/tropit
  • lamela
  • litr
  • lodivod
  • lokomotiva
  • loterie
  • mela/melasa
  • mydlinky
  • nanometr
  • nektarinka
  • nora
  • nutrie
  • orangutan
  • parabola
  • peloton
  • periskop
  • pikolitr
  • ponynka
  • prahora (prahory? in this case, I’m tending to call it a plurale tantum, although it’s (strictly speaking) not the case)
  • pranostika
  • pruh/prut
  • rakovina
  • relevance
  • rotoped
  • rydlo
  • seschnout
  • sinusoida
  • spokojenost
  • stranou
  • surikata
  • tempo
  • tiskopis
  • titrace
  • tranzistor
  • traverza
  • trend
  • utiskovat
  • vodivost
  • vyrvat
  • ziskovost
  • zkontrolovat/zkonfiskovat
  • zmutovat

I’m looking forward to hearing your opinion! :c)
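
A minimal sketch of the single-letter-difference check mentioned above, assuming "similar" means the two words are at most one substitution, insertion, or deletion apart. The thread does not state the exact rule or tool that was used, so `within_one_edit` and `similar_pairs` are hypothetical names; the sample pairs come from objections raised earlier in this conversation.

```python
from itertools import combinations


def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one substitution, insertion, or deletion."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make `a` the shorter (or equal-length) word
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Same length: allow one substituted letter at position i.
        return a[i + 1:] == b[i + 1:]
    # Lengths differ by one: allow one extra letter in `b` at position i.
    return a[i:] == b[i + 1:]


def similar_pairs(words):
    """Return every pair of words that is within one edit of each other."""
    return [(a, b) for a, b in combinations(words, 2) if within_one_edit(a, b)]


# Illustrative input only; not the full wordlist.
print(similar_pairs(["dopis", "popis", "odolat", "odvolat", "nikl"]))
```

The pairwise scan is quadratic, which should be acceptable for a 2048-word list.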

@zizelevak
Contributor Author

@xHire
Lhota - removed

falzum - removed
gondola - removed
karfiol - removed
luneta - removed
nefrit - removed
nestor - removed
normativ - removed
opuka - removed
pagoda - removed
ponton - removed
rytec - removed
sahel - removed
sutana - removed

chlor - removed
chrom - removed
folklor - removed
globus - removed
kasino - removed
kastrol - removed
lahev - removed
mixer → mixovat - OK, changed
naftalen → nafta - removed
ozon - removed
tampon - removed
vitamin → vitalita - OK changed

holinky → holinka - OK changed
lentilky → lentilka - OK changed
piliny → pilina - OK changed

nadlouho - removed
nadrobno - removed
natrvalo - removed
natvrdo - removed
navenek - keep it
zdaleka - keep it

dominant → dominanta (no, dominanta is too long)
hltan → hltat - OK changed
kmit → kmitat - OK changed
logicky → logika - OK changed
mulat → mula (no, mula is in the Spanish wordlist)
muzika → muzikant - OK changed
naposled → naposledy (no, naposledy is too long)
pasivum → pasivita - OK changed
roup → roupice (no, I can't find roupice in the dictionary and I have never heard it)
vespod → vespodu - OK changed

drtivost - removed drtivost, zdrtit, added drtit
kalnost - removed
(kluzkost) - keep it
(ladnost) - keep it
levnost - removed
mokrost → mokro - OK changed
movitost - removed
suknice → sukno - OK changed

fabrika - removed
fanda - removed
fara → farnost - removed (farnost collides with marnost)
fiflena - removed
fixa → fixace - OK changed
glejt - keep it, it's an older word, not informal
hafan - removed
hezoun - removed
kafe - removed
machr - removed
marodka - removed
mejdan - removed
nimrod - removed
piksla - removed
smola - removed
spratek - removed
(tatarka) - removed
vatra - removed
vloni → vloha - removed
(zrzek) - changed to zrzavost

euro - removed
flirt → flamendr - keep it, flamendr is 6 times less frequent
lepra - removed
(limitace) - removed
minibar → miniatura/minimalista - keep it, your words are too long
(mocensky) - removed
nahota → nahodile/nahoru - changed
nevina - keep it
(oktet) - removed
onkolog - removed
podle - removed
sekvoje (because it’s commonly spelled a tiny bit differently)
sklivec - removed
tankista → tanker - changed
tavenina → tavidlo - keep it, I never heard "tavidlo"
tenor → teror/terorista - keep it, I try to avoid words related to terrorism
tunika - removed
varan - keep it

borka - it's very uncommon
celistvost - no, too many letters, 8 is max limit
deflace - OK added
destilace - no, too many letters, 8 is max limit
dioda - OK added
displej - OK added
epopej - OK added
firma - no, in the Spanish wordlist
fukar - OK added
holokaust - no, too many letters, 8 is max limit
horda - OK added
kabriolet - no, too many letters, 8 is max limit
kapybara - OK, added
klima - OK, added
kosmonaut - no, too many letters, 8 is max limit
kotoul - OK, added
kotrmelec - no, too many letters, 8 is max limit
kropit - OK, added
lamela - OK, added
litr - no, similar to "lotr"
lodivod - OK, added
lokomotiva - no, too many letters, 8 is max limit
loterie - no, in french wordlist
melasa - OK, added
mydlinky - no, it's plural, and the singular "mydlinka" is uncommon
nanometr - OK, added
nektarinka - no, too many letters, 8 is max limit
nora - OK, added
nutrie - OK, added
orangutan - no, too many letters, 8 is max limit
parabola - no, in italian wordlist
peloton - OK, added
periskop - OK, added
pikolitr - no, it's very uncommon
ponynka - no, I have never heard this word and did not find it in the dictionary
prahory - OK, added. The plural is OK; it's the name of a geological epoch
pranostika - no, too many letters, 8 is max limit
prut - OK, added
rakovina - OK, added
relevance - no, too many letters, 8 is max limit
rotoped - OK, added
rydlo - OK, added
seschnout - no, too many letters, 8 is max limit
sinusoida - no, too many letters, 8 is max limit
spokojenost - no, too many letters, 8 is max limit
stranou - no, first 4 letters same as "strach"
surikata - OK, added
tempo - no, in italian wordlist
tiskopis - OK, added
titrace - no, it's very uncommon
tranzistor - no, too many letters, 8 is max limit
traverza - OK, added
trend - no, in english wordlist
utiskovat - no, too many letters, 8 is max limit
vodivost - OK, added
vyrvat - OK, added
ziskovost - no, too many letters, 8 is max limit
zkontrolovat/zkonfiskovat - no, too many letters, 8 is max limit
zmutovat - OK, added

Added my new words:
dynamit
reflex
arogance
abdikace
linoleum
zhatit
svodidlo
hematom
manko
lomcovat
termoska
glazura
amputace
anulovat
hluchota
aspirace
sediment
rarita
skafandr
reputace
unavit
sponzor
mahagon
rubrika
veranda
koalice
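
The rejection reasons given in the replies above (more than 8 letters, already present in another language's wordlist, same first 4 letters as an existing word) can be expressed as a small check. This is only a sketch under those stated rules; `check_candidate` and its parameters are hypothetical names, not the author's actual test program.

```python
def check_candidate(word, wordlist, foreign_words, max_len=8, prefix_len=4):
    """Return the reasons to reject `word`; an empty list means it passes.

    wordlist      -- words already accepted into the list being built
    foreign_words -- set of words used by wordlists of other languages
    """
    reasons = []
    if len(word) > max_len:
        reasons.append("too long")
    if word in foreign_words:
        reasons.append("in another wordlist")
    # Words must stay distinguishable by their first `prefix_len` letters.
    clash = [w for w in wordlist
             if w != word and w[:prefix_len] == word[:prefix_len]]
    if clash:
        reasons.append("prefix clash with " + clash[0])
    return reasons


# Illustrative inputs taken from the verdicts above.
print(check_candidate("stranou", ["strach"], set()))    # ['prefix clash with strach']
print(check_candidate("kosmonaut", ["strach"], set()))  # ['too long']
print(check_candidate("tempo", [], {"tempo"}))          # ['in another wordlist']
print(check_candidate("dioda", ["strach"], {"tempo"}))  # []
```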

@luke-jr
Member

luke-jr commented Mar 6, 2017

What's the status of this?

@zizelevak
Contributor Author

@luke-jr
The original wordlist was checked by @slush0 and passed all tests. Some members advised changing some words that were problematic for linguistic reasons. I dealt with all requests. I think the current wordlist should be checked by @slush0 again (I am sure it will pass his tests, because I ran my own tests), and then it is ready to publish.

@paveljanik
Contributor

@zizelevak what about aspirace and kapybara?

@zizelevak
Contributor Author

@paveljanik In the Czech frequency corpus, aspirace has a frequency of 433 points and kapybara has 16 points. Aspirace is quite a common Czech word. Kapybara is not very frequent, but the list has tens of words with a lower frequency than kapybara. I will keep both words in the list.

@zizelevak
Contributor Author

@luke-jr
I think that this version is final.

@luke-jr
Member

luke-jr commented Jul 26, 2017

@slush0 Can you re-ACK the changed version?

@jonathancross
Contributor

Friendly ping @slush0

nym-zone added a commit to nym-zone/easyseed that referenced this pull request Jan 7, 2018
Access to proposed wordlists not (yet) accepted into the BIP repository
is now controlled by the -W option.  This option is and will remain
undocumented in the manpage and other user documentation.

The purpose of this undocumented flag is to permit testing proposed
wordlists and generating test vectors, without inducing users to rely on
wordlists which may be changed or removed at any time without notice.
Users MUST NOT generate actual wallets based on proposed wordlists; so
doing could result in unrecoverable wallets and permanent funds loss.

I am now ready to import three more proposed wordlists I had not hereto
seen amongst BIP pull requests:

 - Russian, bitcoin/bips#432
 - Ukrainian, bitcoin/bips#442
 - Czech, bitcoin/bips#493

Wordlist already imported for testing, now hidden behind -W option:

 - Indonesian, bitcoin/bips#621 (b2f66ba, special-cased d03ddae)

Further changes hereby made, with due apologies for a non-atomic commit:

 - Integrating the -W option necessitated a general cleanup and overhaul
   of the options-handling code.

 - Whilst overhauling the options, I noticed that the documented -P
   option functionality was broken/nonexistent.  Fixed.
nym-zone added a commit to nym-zone/easyseed that referenced this pull request Jan 7, 2018
Notice:  This is hidden behind the -W flag; see 8aaa6f3.

This is not exactly the wordlist proposed in the pull request.  It is
the czech.txt from zizelevak/bips@b7f682f, as modified by approximately
the following command:

	uconv -f utf-8 -t utf-8 -x '::nfkd;' < czech.txt | \
		LC_ALL=C LANG=C sort -s > normalized/czech.txt

The *result* has been confirmed to not have any leading BOM, and to have
a final line terminated with '\n' (bitcoin/bips#622).  I did not yet
examine the source for these issues.  I did examine the source to
confirm that no lines had any trailing whitespace (see 08a05b4).

SHA-256 hash for the *resulting* czech.txt:
195136b3ba0f3099a9df625e0963f4efb56625b91c3a76bc5b4a9466a26880f7
@nym-zone
Contributor

nym-zone commented Jan 8, 2018

I have created a Unicode NFKD-normalized and sorted czech.txt from zizelevak/bips@b7f682f, as modified by approximately the following command:

uconv -f utf-8 -t utf-8 -x '::nfkd;' < czech.txt | \
	LC_ALL=C LANG=C sort -s > normalized/czech.txt

The result has been confirmed to not have any leading BOM, and to have a final line terminated with '\n' (#622). I did not yet examine the source for these issues. I did examine the source to confirm that no lines had any trailing whitespace (see nym-zone/easyseed@08a05b4, #442).

SHA-256 hash for the resulting czech.txt:

195136b3ba0f3099a9df625e0963f4efb56625b91c3a76bc5b4a9466a26880f7
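
For readers without ICU's `uconv`, roughly the same transformation can be sketched in Python. This is an approximation, not the command actually used: it applies NFKD per line with `unicodedata.normalize`, sorts by UTF-8 bytes to mimic `LC_ALL=C sort -s`, and has not been checked against the hash above. The names `normalize_wordlist` and `sha256_hex` are hypothetical, and the two sample words are made up for illustration.

```python
import hashlib
import unicodedata


def normalize_wordlist(text: str) -> str:
    """NFKD-normalize each line, sort by raw UTF-8 bytes, end with a newline."""
    words = [unicodedata.normalize("NFKD", line) for line in text.splitlines()]
    # Byte-wise ordering approximates the C locale; Python's sort is stable.
    words.sort(key=lambda w: w.encode("utf-8"))
    return "\n".join(words) + "\n"


def sha256_hex(text: str) -> str:
    """SHA-256 of the UTF-8 encoding of `text`, as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Made-up two-word sample; the first word carries diacritics on purpose.
sample = "\u017e\u00e1ba\nabeceda\n"
result = normalize_wordlist(sample)
print(repr(result))
print(sha256_hex(result))
```

For a list that contains no diacritics, the normalization step leaves the words unchanged and only the byte-wise sort has an effect.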

nym-zone added a commit to nym-zone/easyseed that referenced this pull request Jan 11, 2018
These are generated with easyseed and the bip39_vectorgen.sh script from
5f35cd0.  There are vectors in twelve languages:  The eight in the BIP
repository, and four more with pending proposals.

Three of the vectors for proposed languages are for wordlists which I
have modified:

 - Russian (234c66c, bitcoin/bips#432)
 - Ukrainian (08a05b4, bitcoin/bips#442)
 - Czech (ba25dfa, bitcoin/bips#493)

The wordlist for Indonesian (bitcoin/bips#621) is unmodified from the
proposal.

Ironically, easyseed does not yet self-test itself with these.  That
will be added in a future release, to verify consistency between builds.
For now, I publish these to aid in interoperability testing between
implementations.
@slush0
Contributor

slush0 commented Aug 2, 2018

Sorry guys, I completely missed this. If the wordlist still passes bip39 tests, I'm fine with that (I didn't test it myself).

@zizelevak
Contributor Author

@slush0 @luke-jr Every version (including the last one) was checked by my program and satisfies the BIP 0039 rules, including no collisions with other BIP 0039 dictionaries.

@DonaldTsang DonaldTsang mentioned this pull request Dec 24, 2018
@DonaldTsang

Cross-reference #753

@luke-jr luke-jr merged commit 51a8d83 into bitcoin:master Sep 19, 2019
@tevador

tevador commented Nov 16, 2021

I'm not sure how this was missed, but the order of words in the wordlist is not alphabetical. The word "svetr" should come after "svazek".

The fact that the list looks sorted but is not may cause subtle bugs in applications using binary search to look up code words.
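
A small sketch of the failure mode described above, using Python's `bisect` module. The four-word excerpt is illustrative only (it is not copied from the merged file), and `first_unsorted_index` and `binary_lookup` are hypothetical helper names.

```python
from bisect import bisect_left


def first_unsorted_index(words):
    """Return the index of the first word that sorts before its predecessor, or None."""
    for i in range(1, len(words)):
        if words[i - 1] > words[i]:
            return i
    return None


def binary_lookup(words, word):
    """Binary search; only reliable if `words` is sorted in code-point order."""
    i = bisect_left(words, word)
    return i if i < len(words) and words[i] == word else None


# Illustrative excerpt with "svetr" placed before "svazek".
excerpt = ["sval", "svetr", "svazek", "svit"]

print(first_unsorted_index(excerpt))             # 2
print(excerpt.index("svazek"))                   # 2 (linear search still works)
print(binary_lookup(excerpt, "svazek"))          # None (binary search misses it)
print(binary_lookup(sorted(excerpt), "svazek"))  # 1
```

An application that must keep the published word order, because the index of each word encodes data, could use a linear scan or a word-to-index dictionary instead of relying on the list being sorted.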
