
Adding BIP-39 wordlist in German (2nd try) #942

Closed

Conversation

@cr commented Jul 8, 2020

Our project is under significant pressure to offer a German translation of the BIP-39 wordlist in our frontends. Since #721 seems to be stalled due to quality issues, we have undertaken a concerted effort to devise a word list based on the most frequent words of the German language across all word classes. The list has undergone several three-person review cycles (two of us are trained editors) and was carefully filtered for names and everything else that you wouldn't want in such a list.

Here's a rundown of our main criteria:

  1. Only words with four or more letters
  2. Only the letters a-z (no umlauts etc.)
  3. The first four characters are unique to each word
  4. Each word is unique across the existing BIP-39 wordlists
  5. Strong focus on short and common words
  6. No names
  7. Avoiding low edit distance between words
  8. Avoiding homophones
  9. No obscenities or negativity
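
For reference, the technical criteria (1-4) can be checked mechanically. Here is a minimal sketch of such a check; `german.txt` and the `wordlists/` directory are placeholder paths, not files in this PR:

```python
import glob
import string

# Placeholder inputs: the candidate list (one word per line) plus the
# already-published BIP-39 lists for the cross-list uniqueness check.
candidate = [w.strip() for w in open("german.txt", encoding="utf-8") if w.strip()]
existing = set()
for path in glob.glob("wordlists/*.txt"):
    existing.update(w.strip() for w in open(path, encoding="utf-8"))

for w in candidate:
    assert len(w) >= 4, f"{w}: fewer than four letters"            # criterion 1
    assert set(w) <= set(string.ascii_lowercase), f"{w}: not a-z"  # criterion 2
    assert w not in existing, f"{w}: collides with another list"   # criterion 4

# Criterion 3: the first four characters must identify each word uniquely.
prefixes = [w[:4] for w in candidate]
assert len(prefixes) == len(set(prefixes)), "4-letter prefix clash"
```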

@cr (Author) commented Jul 10, 2020

A few 110-bit random samples (ten words at 11 bits each):

abnehmen darf ertrag statt klang irre echt spontan obwohl auch

enigmatisch expansion redner stets hinunter viren beide proaktiv daher korrekt

bezirk trendig ampel kummer ankunft gewagt herz winzigen infrastruktur ausland

hose durfte sache plaudern antrag gefunden entrinnen stinken verhalten vehement

furios phase kreis vergabe anstalt sozial gesicht sprache aufnahme herbst

gefolgt omen blau sagen mantel zensur verbot reiten treibt auskunft

beenden biegen bude fruchtbar brunnen erhielt taten fenster anfechten wird

bohne abfall anblick trug aufmerksam zwischen sammeln zueinander existenz ahnung

klage farbe anzahl initiative stieg banane seide holt gesagt ahnen

abstand tadellos gitarre bleibe dessen deswegen sitz duftend tisch radfahrer

@SebastianFloKa

Hi,
for some reason I don't get notifications even though I subscribed, so sorry for the late response!
Your new list of criteria is quite similar to mine from the other thread:
BIP0039.German.Wordlist.-.SOR-2019-01-15.pdf

But there are also some differences:

  • You don't mention a maximum number of letters. I highly recommend limiting this to 8 letters, to be aligned with the common practice of the other wordlists (there are only a few exceptions) and for compatibility with most wallets.
  • From my experience, having no word identical with any other wordlist will be quite challenging, but if it's intended and possible, OK. Mandatory is only to respect the Spanish wordlist (see Y75QMO), which doesn't accept words from its list being used in others. What is your reason behind this criterion?
  • What you describe as "avoiding low edit distance" is the work-intensive part, and also the tricky part. There are 3 types of Levenshtein operations (see the sketch after this list):
    -- substitution (should be 2 minimum)
    -- addition (to be decided: 1 or 2)
    -- permutation (to be decided: min. 2?)
    Main question: is a Levenshtein addition distance of 1 (Hut & HAUT) OK, or do we require 2?
  • I see you didn't follow my recommendation of using capital letters either. I'm still convinced that for the German wordlist this would be the better choice. Don't you see an advantage in this?
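
To make the three types concrete, here is a minimal Python sketch of the optimal-string-alignment variant of Levenshtein distance, where substitution, addition/deletion, and adjacent permutation each cost 1 (the word pairs below are just illustrative):

```python
def osa_distance(a: str, b: str) -> int:
    """Distance counting substitutions, insertions/deletions ("additions"),
    and adjacent transpositions ("permutations"), each at cost 1."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # addition
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # permutation
    return d[len(a)][len(b)]

print(osa_distance("hut", "haut"))   # 1 -- one addition
print(osa_distance("kurs", "kurz"))  # 1 -- one substitution
```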

Having just a quick look at your list:

  • In your list you allow inflected verbs like "sagt" (= "says") or "spreche" (an inflected form of "speak") next to the word "sprache" (= "language"). Besides the issue of a Levenshtein substitution distance of only 1, I highly recommend not to include such words. The same goes for Präteritum forms like "sank" or "litt", and for declined adjectives like "teure". Plurals like "tricks" could be discussed, but I recommend using the singular where it exists.
  • There's an inconsistency regarding ph & f. E.g. the Duden recommends "Biografie" instead of "Biographie". I would recommend using the new spelling, or at least consistency with either the old or the new spelling.
  • And I generally have the impression that the list pays little regard to Levenshtein distance, but I haven't checked in detail.

Sorry to say that I'm not yet convinced by this list, but at the same time I'm open to supporting it.

@cr (Author) commented Jul 21, 2020

Thanks for the feedback, @SebastianFloKa! It sounds like the list can't be saved with a few updates and needs to be redone, but I'll try nonetheless. The easiest fix will be to replace a few questionable choices with uncommon nouns. You obviously found some candidates that we missed.

I generally worked off an edit distance that counts deletions and insertions only.
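
In code, that metric looks roughly like this (a sketch, not the actual tool we used):

```python
def lcs_length(a: str, b: str) -> int:
    # Standard dynamic-programming longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def indel_distance(a: str, b: str) -> int:
    # With only deletions and insertions allowed, the distance is the
    # number of characters outside the longest common subsequence.
    return len(a) + len(b) - 2 * lcs_length(a, b)

print(indel_distance("hut", "haut"))   # 1 -- one insertion
print(indel_distance("kurs", "kurz"))  # 2 -- a substitution costs double
```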

A few comments on some of the decisions we made:

  • Inflections increase memorability due to more natural-sounding phrases.
  • German inflections are often shorter than the base form.
  • A number of verbs are more common in an inflected form, and some are used exclusively in one.
  • Most plurals or longer inflections are in the list because their singular form is taken by another language or is too similar to another word in the list.
  • The German language is quite dense in short words, so short words vs. dissimilar words is a compromise you have to make.
  • We allow word pairs with low edit distance when they're phonetically and grammatically sufficiently distinct.
  • German (non-compound) words are comparatively long. A limit of 8 characters will shift the set greatly towards uncommon words, niche terminology, and loanwords (all of which we actively avoided).
  • The guaranteed uniqueness of the first four letters lets the IME take care of longer words.

Do you think it makes sense to proceed?

@cr (Author) commented Jul 21, 2020

I see you didn't follow my recommendation of using capital letters either. I'm still convinced that for the German wordlist this would be the better choice. Don't you see an advantage in this?

The advantage of falling in line with the other languages and using all lowercase is that it eliminates capitalization as an unnecessary criterion. It lowers the mental overhead of reflecting on proper capitalization, as well as the number of key presses a typical user performs on entry. If you start capitalizing the nouns, you open yourself up to unfruitful (mental) debates like "Does capitalization make a difference as it does in other passphrases?", "Why isn't the first word in the phrase capitalized?" and even "Why isn't there punctuation?" That's why we strongly encourage the simplicity of all lowercase, as it sends a strong message of "You don't need to care".

@SebastianFloKa commented Jul 21, 2020

Hi @cr

I generally worked off an edit distance that counts deletions and insertions only.

That means the Levenshtein addition type. That's fine, but I wouldn't leave the substitution and permutation types aside.

Inflections increase memorability due to more natural-sounding phrases.

I'm not sure if there's a misunderstanding. The mnemonic seed will almost never (it's extremely difficult and uncommon) be a normal phrase. The expression "phrase", which is typically used for the sequence of words of a seed, is misleading, and it's explicitly recommended not to create a sentence by picking words from the list, because the checksum at the end is important.

So since you can't create grammatically correct sentences as a seed anyway, I think inflections would rather be a risk of misunderstanding (spreche - sprechen etc.) than an advantage.

German (non-compound) words are comparatively long. A limit of 8 characters will shift the set greatly towards uncommon words, niche terminology, and loanwords (all of which we actively avoided).

Compared to the extreme disadvantage of creating a German wordlist that is incompatible with standard cold wallets (which limit words to 8 letters), it's absolutely worth accepting slightly more uncommon words (however "uncommon" is defined). If the word "enigmatisch" is seen as common, I'm sure we'll find sufficient words within that range.

It lowers the mental overhead of reflecting on proper capitalization, as well as the number of key presses a typical user performs on entry.

I think there's a misunderstanding. I meant to say "all caps" instead of "capital letters", sorry for that. I think our intentions are very close together, as I also don't want people to mix lower case and upper case - agree 100%. But writing nouns in lower case is quite uncommon for German speakers, whereas filling out forms in upper case (all caps) is much more common (official documents etc.). From your example:
klage farbe anzahl initiative stieg banane seide holt gesagt ahnen
KLAGE FARBE ANZAHL INITIATIVE STIEG BANANE SEIDE HOLT GESAGT AHNEN

  • Words in upper-case letters are much easier to reconstruct if some part of a word is difficult to read on a physical wallet.
  • Even the above example gives a different impression (on the computer), but lower-case letters require much more height because you have to account for the "g", "p", "q", "y" etc., which use the space below the baseline, so you end up with smaller letters in the same space. The main disadvantage is that some physical cold wallets with pre-stamped letters might require a supplementary edition (but they actually need fewer pre-stamped letters in total compared to mixed upper and lower case).
    What's your opinion on all caps after this explanation?

What is the reason why you don't want to allow words from other wordlists to be used in the German one? There's no technical or security risk, and only a minor advantage for the user. The probability of creating for yourself a list of words that fits completely into two language lists is extremely low:
7.5x10^-13 (!) for a 12-word seed even if 200 words were identical between 2 wordlists, and 3.4x10^-32 for a 24-word seed with 100 identical words. Having "all caps" might make the list much more unmistakable than avoiding identical words.
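
Those orders of magnitude are easy to verify with a few lines of Python (assuming each of the seed's words is drawn uniformly from the 2048-word list):

```python
def p_seed_fits_both_lists(shared_words: int, seed_length: int) -> float:
    # Probability that every word of a random seed also appears in a
    # second wordlist sharing `shared_words` of the 2048 entries.
    return (shared_words / 2048) ** seed_length

print(p_seed_fits_both_lists(200, 12))  # ~7.5e-13
print(p_seed_fits_both_lists(100, 24))  # ~3.4e-32
```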

But generally it looks like you have quite reasonable intentions, and our criteria are not as far from each other as my writing makes it look. We can have an offline conversation if you like.

Notifications are working now after changing the email in my settings, sorry again.

@cr (Author) commented Jul 21, 2020

I'm not sure if there's a misunderstanding. The mnemonic seed will almost never (it's extremely difficult and uncommon) be a normal phrase.

There is no misunderstanding. I am fully aware of the bijective properties of the BIP-39 representation and its implications for the grammatical quality of BIP-39 encodings, especially when it comes to randomized key material.

I agree that we need to eliminate the ambiguous cases we missed (and thanks for pointing them out, please keep them coming!). However, since German inflections rarely happen within the unique first four letters, there's ample room for variation to make words more distinguishable at the end.

Variable inflection can make a set of words harder to misread or misunderstand. A few examples:

  • bunt -> bunter (for increasing the edit distance to bund)
  • ganz -> ganze (distancing from gans)
  • kurs -> kurse (distancing from kurz)

In German, infinitive verb forms have the unfortunate property of being among the longest. Increasing variability is also in line with perception models: the more variability in the distinguishable feature space among a set of items, the easier they become to distinguish and memorize.

Compared to the extreme disadvantage of creating a German wordlist that is incompatible with standard cold wallets (which limit words to 8 letters), it's absolutely worth accepting slightly more uncommon words

I wasn't aware of this criterion for cold wallets (and neither were the Italians, it seems), so this must undoubtedly be fixed. As I said, an eight character limit is especially tough on German, but for our current proposal it means replacing just about 200 words, which I'd rate as salvageable. I'll propose a fix here by the end of next week.

I think there's a misunderstanding. I meant to say "all caps"

Thanks for clarifying. I don't see why the German wordlist shouldn't align with the other languages, but if there's a general preference here to go this way, I'll gladly comply – and leave it to the frontend to choose not to SCREAM AT THE USER ALL THE TIME. :)

Words in upper-case letters are much easier to reconstruct if some part of a word is difficult to read on a physical wallet.

This is really good thinking, though my tendency is to leave presentation to the frontend designers and not needlessly encode such assumptions into the standard. The next frontend may need lowercase kerning to fit more letters on screen, or CamelCase may look even better, who knows?

What is the reason why you don't want to allow words from other wordlists to be used in the German one

Language detection moves from statistical to trivial: ideally you know from the first word alone which wordlist to check the rest against (for improved autocompletion performance after entering the first four letters). Other wordlists adhere to this standard too, so I'd advise not to diverge from it. There were just one or two dozen collisions filtered out of our list here, but it's probably a good idea to try and stay away from loanwords and shared words anyway.
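
A toy sketch of what that buys; the wordlists shown are tiny hypothetical samples, not real data:

```python
# With mutually disjoint wordlists the first word pins down the language,
# and 4-letter prefix uniqueness makes autocompletion a single lookup.
WORDLISTS = {
    "english": {"abandon", "ability", "junior"},  # hypothetical samples
    "german":  {"abfall", "ahnung", "bezirk"},
}

def detect_language(first_word: str) -> str:
    matches = [lang for lang, words in WORDLISTS.items() if first_word in words]
    if len(matches) != 1:
        raise ValueError("word unknown or lists overlap")
    return matches[0]

def autocomplete(lang: str, prefix: str) -> str:
    # The unique-first-four-letters rule guarantees at most one hit.
    hits = [w for w in WORDLISTS[lang] if w.startswith(prefix[:4])]
    assert len(hits) == 1
    return hits[0]

lang = detect_language("abfall")  # -> "german"
print(autocomplete(lang, "ahnu")) # -> "ahnung"
```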

Admittedly, the French and English wordlists share junior, but that's obviously just an oversight.

@SebastianFloKa

I will think about your arguments for not restricting to infinitive forms. Interesting way of thinking ...

I wasn't aware of this criterion for cold wallets (and neither were the Italians, it seems)

Right, the Italians are the exception I was talking about, with 159 words that are 9 letters long. Great to hear that you agree on this very critical point of not exceeding 8 letters.

This is really good thinking, though my tendency is to leave representation to the frontend designers and not needlessly encode such assumptions into the standard

It's just that I don't want to forget the many people writing down their seed manually as a backup. It's not an urgent decision; let's keep sharing pros and cons - it's a fruitful discourse ...

Admittedly, the French and English wordlists share junior, but that's obviously just an oversight.

"junior" and 99 other words, beginning with abandon and ending with wagon - there are 100.
Anyway I fully agree it would be good to have no collission and your argument of language detection is actually the reason why I tried to succeed to fulfill this in the beginning as well. Eventhough I think for a computer it might be easy to check several words as it has to do be done anyway because there already exists wordlists with shared words. But let's give it a try - D'accord.

@p2w34 commented Jul 23, 2020

Hi. As I am interested in the creation of all word lists (to a reasonable extent), not only the German one, let me express my thoughts here as well. I am glad to see that there are contributors willing to work on word lists. However, what bothers me is that whenever a person (or a group of people) shows up, they take care of just one list. To be exact, what bothers me is the fact that for each new list very similar problems need to be tackled. For example, requirements: for languages with a Latin alphabet the maximum word length should be 8, due to the limitations of the displays of hardware wallets. Or the requirement that the first 4 letters should uniquely define a word. Not to mention requirements like the one related to Levenshtein distance. Can't such requirements be shared across many languages? Especially as once-developed tools (to ease work with Levenshtein distance) could be reused.

That is why I launched a separate repository just for the creation of word lists: https://github.com/p2w34/wlips. I have launched it with a vision of tackling the creation of all word lists, not just one. Please do not get me wrong - I am not saying the work in this PR needs to be somehow stopped/abandoned/whatever - I am not the one to judge which approach is better. Let me also mention that I am the author of the PR with the Polish word list, and I know how much time is needed to create such a list from scratch. I just wanted to mention that there is also another approach possible. Thank you.

@cr (Author) commented Jul 23, 2020

Can't such requirements be shared across many languages?

They are, although some authors and reviewers seem to have slipped up. If by "shared" you mean provided as a turnkey toolset for checking a (proposed) list of words, detecting collisions, analysing mutual edit distance, and so on, that would be truly helpful. I had to write my own set of tools for doing this, but that really was just the simplest and most straightforward part of the task (for me at least). Still, having had access to such tools would have saved me at least half a day of work.

If I had to break down our work on the German list so far:

  • 10% compiling word lists
  • 10% writing tools for pre-processing and BIP-39 rule enforcement
  • 80% editing and quality checking

If I read your proposal correctly, you're aiming to generate new wordlists automatically, sparing everyone the work of creating one – more or less – manually. I'd expect this to be very hard to achieve. First, where do you get the base data from? Crawl the Internet? Try to find precompiled lists, hopefully ordered by frequency?

Languages are quite varied. Some rules that make sense in one language don't make much sense in others. German infinitive verb forms are not the shortest, for example. Edit distance breaks down for different character sets. This points towards requiring good judgement in flexibly applying the rules, which are somewhat futile to cast in stone.

And then how do you implement quality checking?

  • Can you definitively weed out all names?
  • Can you spot the imported loan word, or is the word sufficiently assimilated?
  • Can you tell common from uncommon words?
  • Can you find all the words with negative or sexual connotations, or a questionable political history, or a recent unfavorable public discourse?
  • Can you trust your spellchecker? Does it require manual interaction?
  • Is the spelling perhaps not wrong but obsolete?
  • Do inflections add benefits, or do they just make things more contrived?
  • Do certain words just sound too similar, and how to modify them such that they don't?
  • Which words are easily confused, or regularly misspelled?

In summary, I think there's great benefit in a toolset for pre-processing and continuous rule checking. Especially a great UI could cut down editing time during quality control a good deal. However, I am not convinced that the process of generating a word list can be automated beyond compiling and pre-processing an initial set for a human editor. Most of the work is in the (likely manual) process of reaching the final quality level, which I don't think a non-native or not highly skilled speaker of the target language can achieve.

@SebastianFloKa

Hi @p2w34,
appreciated, and chapeau for the work you did and do related to the wordlists: both your own language's list and your intention of harmonization.
In the first attempt at a German wordlist I did exactly what you are talking about: I read all the other languages' wordlist threads to understand the reasons behind certain criteria and also their importance (nice-to-have, must-have). The outcome is the PDF attached above. But it's a living document. For example, in the first attempt the length of words was limited to only 6 letters, and it took me far more than a hundred hours to accept that this is not possible, particularly in combination with the Levenshtein distance. However, I checked your requirement list, and it A) is quite similar to my personal understanding of requirements on wordlists in general, which you can check in the above PDF, and
B) is mainly aligned with the first as well as this second attempt at a German wordlist,

except for two points:

  • the "no plural" one, which is under discussion in this PR with no consensus found so far.
  • "only nouns and adjectives": I think we would have had consensus in this group that we don't want to limit the list to these two types. Why not verbs, for example?
    But there's no consensus yet about which types we actually would allow or disallow, whether inflections are allowed, etc.

I would only recommend that you mark the criteria in your list with different importances, i.e. how strict each rule is.
Also, you haven't mentioned that words from the Spanish list are not allowed in others anymore. By the way, there's also a discussion about this:
A) Is it OK to use words from other wordlists (except Spanish), and if yes, how many?
B) Is it OK to continue the Spanish way of refusing other, future lists the use of words from your own one?

It's a good approach to harmonize. At the same time I agree with @cr that, based on different backgrounds (systematics of language, culture, etc.), it is important and good that this work is done by humans. These simple words will for many people be the link, connection, and interface to a quite complex sphere. Therefore: the wordlist is probably the most human component in blockchain technology, so it's good that it's made by humans.

@p2w34 commented Jul 23, 2020

Hi @cr, Hi @SebastianFloKa,

thank you for your extensive comments. I think we all see that there is a lot of manual work with regard to choosing proper words. I am strongly convinced the words should be chosen manually. By that, I mean choosing a common 'preliminary word list'. Then, based on such a list, the work could be automated to some extent. The translation part could be done either manually or automatically, but extracting a kind of draft for a final list: for sure automated. I have already prepared a script for that (most likely not in its final version, and only for European languages). It is able to extract only words fulfilling the requirements, including the Levenshtein requirement. And then, to get the final version of the list from the draft - for sure this can be achieved only manually.

@cr you express a lot of concerns with regard to how to make sure the words are 'the proper ones' - these are valid concerns. My personal opinion is that one should be pragmatic here: the lists should be good enough; they for sure will not be perfect. Issues related to spelling may be caught automatically, some of them for sure need eyeballing, and some will be irrelevant (like whether a particular word is known well enough - I do not think it is possible to fulfill such requirements).

I myself would use only nouns and adjectives. I do not consider verbs, as most nouns derive from them. That should suffice for European languages. For other groups of languages this will have to be clarified.
Regarding plurals: I do not have an answer for that now. Maybe the way to go is to have stricter requirements (in this case, to forbid plurals), try to create a list or two to see how it goes, and then maybe draw some conclusions?

However, I checked your requirement list, and it A) is quite similar to my personal understanding of requirements on wordlists in general, which you can check in the above PDF, and
B) is mainly aligned with the first as well as this second attempt at a German wordlist

That makes me happy. You mention reading all the threads, and I read them too. I am also aware of the problem of words shared across multiple lists. My personal feeling is that forbidding them might cause more problems than benefits. Why? It is because:

  • which words will most likely be duplicated? The ones which are the most common,
  • with each new list, the task of choosing words will get harder and harder,
  • I am in favor of recognizing the lists not by a single word, but by an industry-standard solution: using hashes (I put it into one of the wlips proposals); thanks to this, even multiple lists for a single language may coexist - for example, the current English list and hopefully a newer one fulfilling new requirements,
  • I would use sharing to my advantage - the more words shared across the lists, the easier it is to create them. I wouldn't put any limit here - the more the better.

I am not able to come up with any nice summary of this post, so let me repost yours, as I find it well put:

It's a good approach to harmonize. At the same time I agree with @cr that, based on different backgrounds (systematics of language, culture, etc.), it is important and good that this work is done by humans. These simple words will for many people be the link, connection, and interface to a quite complex sphere. Therefore: the wordlist is probably the most human component in blockchain technology, so it's good that it's made by humans.

@cr (Author) commented Jul 24, 2020

@p2w34, I finally understand your approach of manually compiling categorized wordlists upfront, then filtering them down with the tool chain. While this method can definitely work, here's one data point that may be worth contemplating when considering this approach.

For the German word list, we started with a set of 10k most frequent words from a random internet crawl. Just enforcing a-z only, 4-13 letters, 4-uniqueness, and BIP-39 wordlist uniqueness brought that list down to just about 1500 words. We then manually removed about 100 words related to names, countries, etc, and added another ~750 words more or less manually before beginning the process of final quality control. Cutting the final list down to just 8 letters will now cost another ~300 words. In other words, we observed an overall shrinkage of almost 90% just by enforcing technical BIP-39 rules.
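
For anyone who wants to reproduce this funnel, here is a rough sketch of the filtering step (`corpus_10k.txt` and `wordlists/` are placeholder inputs, not our actual pipeline):

```python
import glob
import re

# Placeholder inputs: a frequency-ordered corpus dump plus the
# already-published BIP-39 lists.
words = [w.strip().lower() for w in open("corpus_10k.txt", encoding="utf-8") if w.strip()]
taken = set()
for path in glob.glob("wordlists/*.txt"):
    taken.update(w.strip() for w in open(path, encoding="utf-8"))

survivors, prefixes = [], set()
for w in words:
    if not re.fullmatch(r"[a-z]{4,13}", w):  # a-z only, 4-13 letters
        continue
    if w[:4] in prefixes or w in taken:      # 4-uniqueness, list uniqueness
        continue
    prefixes.add(w[:4])
    survivors.append(w)

print(f"{len(words)} -> {len(survivors)} words "
      f"({100 * (1 - len(survivors) / len(words)):.0f}% shrinkage)")
```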

This shrinkage will vary considerably across languages, and it's hard for me to predict, but what I am worried about is that the initial word list may have to be several times larger than the 2k we're aiming for, perhaps even tenfold. That may be quite a task when compiling an upfront word list.

the lists should be good enough, they for sure will not be perfect.

Yes, any word list can be improved, and none can be objectively perfect. There are many judgement calls involved – as discussed before – and for languages like German you will have to find a good balance between edit distance and linguistic distinctiveness within that tiny 4-to-8-letter window, so any word list will also end up with an individual style, and the more you look, the more you'll find to improve.

I am not sure what you meant by "good enough", but taken at face value I don't agree with the sentiment. The word lists we're producing will be part of a global standard, and they're nearly impossible to repair without major impact on the software ecosystem that supports them. In my opinion, under these conditions, any word list deserves as much care and attention as sensibly possible.

@p2w34 commented Jul 25, 2020

For the German word list, we started with a set of 10k most frequent words from a random internet crawl. Just enforcing a-z only, 4-13 letters, 4-uniqueness, and BIP-39 wordlist uniqueness brought that list down to just about 1500 words. We then manually removed about 100 words related to names, countries, etc, and added another ~750 words more or less manually before beginning the process of final quality control. Cutting the final list down to just 8 letters will now cost another ~300 words. In other words, we observed an overall shrinkage of almost 90% just by enforcing technical BIP-39 rules.

When I was compiling the Polish word list, I had a different approach. I downloaded a Polish dictionary, and whenever I had time I would go through it and pick words manually. I recall there was a moment when I had already gone through the whole dictionary and still needed ~100 words. I did manage to complete the list, but I could have chosen the requirements differently. Especially one of them: I decided to avoid words with Polish diacritics (the Polish alphabet consists of aąbcćdeęfghijklłmnńoóprsśtuwyzźż, and I decided not to use words containing ąćęłńóśźż). That limited the number of eligible words by a lot. That is why in wlips I recommend allowing words with diacritics as well. This is also a reason why I am against avoiding duplicates among the lists - it unnecessarily makes the work harder.
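
To illustrate how hard that rule bites, the filter amounts to a couple of lines (`polish_dictionary.txt` is a placeholder path):

```python
DIACRITICS = set("ąćęłńóśźż")

# Placeholder input: one dictionary word per line.
words = [w.strip() for w in open("polish_dictionary.txt", encoding="utf-8") if w.strip()]
ascii_only = [w for w in words if not set(w) & DIACRITICS]
print(f"{len(ascii_only)} of {len(words)} words survive the no-diacritics rule")
```
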
Another thing: in my opinion, words should be manually selected without using lists like 'the most frequent words'. IMHO it is not about how frequent a particular word is, but whether it is known to (almost) everyone. When having just a glimpse at such lists, I could quickly see that many eligible words are missing from them. That is why I started to craft the preliminary word list manually.

For the German word list, we started with a set of 10k most frequent words from a random internet crawl. Just enforcing a-z only, 4-13 letters, 4-uniqueness, and BIP-39 wordlist uniqueness brought that list down to just about 1500 words. We then manually removed about 100 words related to names, countries, etc, and added another ~750 words more or less manually before beginning the process of final quality control. Cutting the final list down to just 8 letters will now cost another ~300 words. In other words, we observed an overall shrinkage of almost 90% just by enforcing technical BIP-39 rules.

I hope that with the manually selected words + allowing duplicates + allowing diacritics, the shrinkage will be lower.
I just did a quick check. At the moment of writing, the preliminary word list consists of 1394 words. I ran the script to extract a set of words fulfilling the requirements (with simply: /usr/bin/python3.6 /home/pawel/priv/wlips/scripts/create_word_list.py -l english_us) and as a result I got a list with 859 words.
I am aware that the longer the list, the higher the shrinkage. I also agree this will vary across languages. There is no guarantee that a preliminary_word_list of length ~5k or even 100k will suffice to provide 2048 or more words. However, one can add missing words manually. Adding missing words is easier than creating a list from scratch.

I am not sure what you meant by "good enough", but taken at face value I don't agree with the sentiment. The word lists we're producing will be part of a global standard, and they're nearly impossible to repair without major impact on the software ecosystem that supports them. In my opinion, under these conditions, any word list deserves as much care and attention as sensibly possible.

I am also of this opinion. This is my "good enough".

@cr (Author) commented Jul 25, 2020

@p2w34, overall I think your work is going in a great direction with great potential, and this is truly a fruitful discussion to have, but I think it is digressing further out of the scope of this pull request. I would like to ask that we focus the discussion in this PR on the list that's been proposed, which we're currently reworking to meet the 8-letter maximum spec, before putting it through some more quality checking.

If I digest our wlips discussion down to the portions that are relevant to this PR, you were thinking out loud about not continuing work on this list, and instead regrouping and channelling it through your process, which essentially means starting over.

I mentioned that we're under time pressure to provide a German translation in our frontends, and at the moment we're not too far from actually fulfilling that. Considering a switch to your process begs the question of timeline: how much time would you estimate it takes to get your toolset to a point where it's polished enough to invite contributors for new languages, and then for them to get to the point that we're already at right now?

@p2w34 commented Jul 25, 2020

@cr, thank you for the discussion too. When I wrote the first post here I didn't expect the discussion to evolve this much.

If I digest our wlips discussion down to the portions that are relevant to this PR, you were thinking out loud about not continuing work on this list, and instead regrouping and channelling it through your process, which essentially means starting over.

My intention, let me quote myself, was:

https://github.com/p2w34/wlips. I have launched it with a vision of tackling the creation of all word lists, not just one. Please do not get me wrong - I am not saying the work in this PR needs to be somehow stopped/abandoned/whatever - I am not the one to judge which approach is better. Let me also mention that I am the author of the PR with the Polish word list, and I know how much time is needed to create such a list from scratch. I just wanted to mention that there is also another approach possible. Thank you.

So again, I am not the one who decides.

I mentioned that we're under time pressure to provide a German translation in our frontends, and at the moment we're not too far from actually fulfilling that. Considering a switch to your process begs the question of timeline: how much time would you estimate it takes to get your toolset to a point where it's polished enough to invite contributors for new languages, and then for them to get to the point that we're already at right now?

The work on wlips is not meant to be done under any time pressure. The toolset is more or less ready; the most time-consuming part left is to finish the mentioned preliminary_word_list. I do not put any time frame on it, as I would first like to see that the approach with wlips is correct (so translations of at least a couple of languages should be created). It might happen that along the way one has to draw some conclusions, take a step back, and modify the approach. All of that is absolutely needed to make the word lists good enough.
If you are under time pressure I do not really have advice for you ;/.

@cr (Author) commented Jul 25, 2020

@p2w34, I know what you wrote, and any time someone writes "I'm not saying ..., but ...", they're still saying it, and I was only considering what you were not really not saying. :) Please don't worry, I am taking this lightly, and I do see the advantages of a toolchain like yours, but the way things are, we will keep focusing on finalizing our proposal here and hopefully get a German translation into the standard as soon as possible.

However, since you're here: running our current proposal as a preliminary list through your toolchain and learning about the various metrics it collects could be really helpful. Would you be able to help out there?

@p2w34 commented Jul 25, 2020

@cr,
one can find useful scripts in the https://github.com/p2w34/wlips/blob/master/scripts/ directory.
I ran the list of words from this PR through create_word_list.py, and (mainly due to the 8-character requirement and the Levenshtein distance) it produced a list of 1448 words.
One can also just validate a list with validate_word_list.py. As the hardest requirement to meet is the Levenshtein one, this script also spits out an output which can be fed into a Neo4j database to graphically visualise the distances between words. The result is quite nice (example: https://github.com/p2w34/blog/blob/master/2019-03-25-polish-word-list-bip0039/graph_english.png), although it requires further manual work. That is why it is IMHO better to use tools like create_word_list.py to automate choosing words fulfilling the requirements. As of now they do not provide any advanced statistics, though.

@cr (Author) commented Jul 31, 2020

As promised, there's a new list that replaces all words longer than 8 characters. However, I refrained from shortening it to 2048, as now begins the process of optimizing for Levenshtein edit distance, which will be rather long. We need to edit about 600 words.

@luke-jr (Member) commented Aug 1, 2020

Note #941

@SebastianFloKa

@cr
Is your idea behind inflections to "allow" them in order to get sufficient words, or do you "require" them for some reason?
In other words: if it were possible to create a list of words with a maximum of 8 letters that fulfills the other requirements while avoiding inflections at the same time, would this be acceptable to you?

@cr (Author) commented Aug 6, 2020

@SebastianFloKa At the moment we're still operating our list just within "fixable" space when it comes to beating the final end boss of the edit distance requirements. However, asking to limit ourselves to infinitive verbs and uninflected nouns (and what about the other word classes?) puts us safely inside "start over from scratch" territory. It's a tough call, essentially asking us to admit defeat and commit the work of many days to the trash can.

In a nutshell, allowing inflections and various word classes has the major advantages of creating more natural and hence more memorable phrases and more distinguishable words, and the increased variance allows for a denser packing of common words into the narrow gap permissible for BIP-39 wordlists. What, on the other hand, are the advantages of limiting the list to infinitive verbs and uninflected nouns?

@SebastianFloKa

@cr Well, if we talk about "days to the trash can": in my case it was literally many, many full-time weeks of "useless" work when trying to fulfill the initial requirement of limiting words to 6 letters (even 5 letters was in play). And taking such a tough restriction into consideration, the outcome was actually not too bad (see the PDF in the first attempt). Anyway, the list was not completely satisfying, which is why I mentioned continuing in August (so now) with a revised wordlist extended to 8 letters. I will have almost no Internet access for the next four to five weeks, but I have prepared some dictionary extracts and can check whether an infinitive-only 8-letter wordlist is generally feasible. Even though I understand your point, I'm not yet completely convinced by including those inflections, but it's also not my decision.

and what about the other word classes?

I'm generally with you that it could make sense to extend to more word classes (I think you are referring to a statement by p2w34), even though for the standard user (not brain wallets etc.) the expectation of common words would point to nouns, verbs, and adjectives.

If "creating memorable phrases" (brain wallet) or the idea of "marking words in a book" as a wallet (all of them actually not really recommended) would be the main (or only) intention behind the wordlist it might have some advantages to include interflections, I agree. On the other hand the infinitive form of regular verbs are identical with the 1. pers. singular and 3. pers. plural (infinitive = gehen, 1. pers. plural = wir gehen, 3. pers. plural = sie gehen, etc.). Means having an infinitive form does not limit you from creating memorable phrases using interflexions.

What, on the other hand, are the advantages of limiting to infinitive verbs and proper nouns?

One of the advantages of not having inflections would be to reduce the risk of the "standard user" accidentally misinterpreting a word. Example: your seed starts with "FUCHS, GELB, LAUFEN" and then comes "ANGLE" (1st person singular). Some people might either overlook this and assume the noun "ANGEL" was intended, or at least be confused about whether there is a typo (no matter if all caps or not). Another constellation could be when two expected words are one Levenshtein addition or subtraction apart. Example: "FANGE" (1st person singular) next to the expected "FANG" (noun) or "FANGEN" (verb). This is the advantage I see in reducing the number of word classes (and limiting to infinitives): it reduces the risk of misinterpretation when copying, writing down, communicating, or reading a seed.

@DavidMStraub

I just found this PR by accident. Great work by @cr, but I find your statement

Since #721 seems to be stalled due to quality issues,

almost insulting. The reason it stalled was stated by me in that thread:

As far as I'm concerned, I've abandoned this. It's really a weird PR. First, I get a few useful comments, make a few improvements. Then, I get a list of requests from @thomasklemm, some of which I proved even with a Jupyter notebook are not realistic/reasonable. No reply. Half a year later, @SebastianFloKa comes with a totally different suggestion, but does not actually submit a PR.

Looks like your PR is stalling for the same reason.

@cr (Author) commented Oct 26, 2020

Yes, it has stalled indeed, as at the moment our proposal requires too much rework to fit the edit distance requirement, and it looks like we'll simply run with our own proprietary list to offer something suitable for our German users in the near future.

I find your statement [...] almost insulting.

@DavidMStraub, I'm sorry for having what I thought was an objective statement come across as an insult. I took the time to review your PR before starting a new one, and after spotting among the first 130 words or so things like Aktuar, Antifa, Antje, Apsis, Attika and about a dozen other proper names and questionable words that should be avoided in a standardized list, my conclusion was that it hadn't undergone even a single editorial review, which would easily have caught such obvious issues. What's your assessment of words like the ones I just pointed out?

@DavidMStraub

No worries, I seriously think you did a much better job! I just find it weird that such enhancements never get merged but instead get stuck in endless discussions.

@cr (Author) commented Jan 28, 2021

We have dropped BIP39 support from our clients, so unfortunately we don't plan to invest the time it would take to finalize this proposal to also conform to the mutual edit distance requirement.

@cr closed this Jan 28, 2021
@SebastianFloKa

I'm working on it right now, but it's extremely time-consuming to achieve the Levenshtein distance and the other requirements. As my own tool becomes very slow with lengths up to 8 letters, I will try to switch to the tools mentioned by @p2w34 in order to speed this up.

@SebastianFloKa

A new (third attempt) PR was created - see #1071
