Adding BIP-39 wordlist in German (2nd try) #942
Conversation
A few 121-bit random samples:
|
Hi. But there are also some differences:
Having just a quick look at your list:
Sorry to say that I'm not yet convinced by this list, and at the same time I'm open to supporting it. |
Thanks for the feedback, @SebastianFloKa! It sounds like the list can't be saved with a few updates and needs to be redone, but I'll try nonetheless. The easiest fix will be to replace a few questionable choices with uncommon nouns. You obviously found some candidates that we missed. I generally worked off edit distance, counting deletions and insertions only. A few comments on some of the decisions we made:
Do you think it makes sense to proceed? |
The advantage of falling in line with the other languages and using all lowercase is that it eliminates capitalization as an unnecessary criterion. It lowers the mental overhead of reflecting on proper capitalization, as well as the number of key presses a typical user would perform on entry. If you start capitalizing the nouns, you open the door to unfruitful (mental) debates like "Does capitalization make a difference as it does in other passphrases?", "Why isn't the first word in the phrase capitalized?" and even "Why isn't there punctuation?" That's why we strongly encourage the simplicity of all lowercase, as it sends a strong message of "You don't need to care". |
Hi @cr
That means Levenshtein distance of the addition type. That's fine, but I wouldn't leave the substitution and transposition types aside.
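To make the two notions being contrasted here concrete, a small sketch (mine, not from either party's tooling): insertion/deletion-only edit distance versus full Levenshtein distance, which also counts substitutions.

```python
def indel_distance(a: str, b: str) -> int:
    """Edit distance counting insertions and deletions only."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def levenshtein(a: str, b: str) -> int:
    """Edit distance also allowing substitutions."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    return dp[m][n]

# "fang" vs "fange" differ by one insertion under both measures,
# but "haus" vs "maus" is distance 1 with substitutions allowed
# and distance 2 when only insertions/deletions are counted.
```

The point of the distinction: a word pair can look safely far apart under the insertion/deletion-only measure while being a single substituted letter away.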
I'm not sure if there's a misunderstanding. The mnemonic seed will almost never (it's extremely difficult and uncommon) be a normal phrase. The expression "phrase", which is typically used for the words of a seed taken together, is misleading, and it's explicitly recommended not to create a sentence by picking words from the list, as the checksum at the end is important:
Compared to the extreme disadvantage of creating a German wordlist that is incompatible with standard cold wallets (limit of 8 letters per word), it's absolutely worth accepting slightly more uncommon words (however "uncommon" is defined). If the word "enigmatisch" is seen as common, I'm sure we'll find sufficient ones that are within that range.
I think there's a misunderstanding. I meant to say "all caps" instead of "capital letters", sorry for this. I think our intentions are very close, as I also don't want people to mix lower case and upper case - agree 100%. But writing nouns in lower case is quite uncommon for German speakers, whereas filling out templates in upper case (all caps) is much more common (official documents, etc.). From your example:
What is the reason why you don't want to allow words from other wordlists to be used in the German one? There's no technical or security risk, and only a minor advantage for the user. The probability of creating a seed whose words all fit completely into two language lists is extremely low. But generally it looks like you have a quite reasonable intention, and our criteria are not as far from each other as it may look from my writing. We can have an offline conversation if you like. Notification is now working after changing the email in my settings, sorry again. |
There is no misunderstanding. I am fully aware of the bijective properties of the BIP-39 representation and its implications for the grammatical quality of BIP-39 encodings, especially when it comes to randomized key material. I agree that we need to eliminate the ambiguous cases we missed (and thanks for pointing them out, please keep them coming!). However, since German inflections rarely happen within the unique first four letters, there's ample room for variation at the end to make words more distinguishable. Variable inflection can make a set of words harder to misread or misunderstand. A few examples:
In German, infinitive verb forms have the unfortunate property of being among the longest. Increasing variability is also in line with perception models: the more variability in the distinguishable feature space among a set of items, the easier they become to distinguish and memorize.
I wasn't aware of this criterion for cold wallets (and neither were the Italians, it seems), so this must undoubtedly be fixed. As I said, an eight character limit is especially tough on German, but for our current proposal it means replacing just about 200 words, which I'd rate as salvageable. I'll propose a fix here by the end of next week.
Thanks for clarifying. I don't see why the German wordlist shouldn't align with the other languages, but if there's a general preference here to go this way, I'll gladly comply – and leave it to the frontend to choose not to SCREAM AT THE USER ALL THE TIME. :)
This is really good thinking, though my tendency is to leave representation to the frontend designers and not needlessly encode such assumptions into the standard. The next frontend may need lowercase kerning to fit more letters on screen, or CamelCase will look even better, who knows?
Language detection moves from statistical to trivial, as ideally you know by the first word alone which wordlist to check the rest against (for improved autocompletion performance after entering the first four letters). Other wordlists adhere to this standard, too, so I'd advise not to divert from it. There were just one or two dozen collisions filtered out of our list here, but it's probably a good idea to try and stay away from loanwords and shared words, anyway. Admittedly, the French and English wordlists share junior, but that's obviously just an oversight. |
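As an illustration of the two list-level checks discussed above (4-letter prefix uniqueness within a list, and no words shared across language lists), a small Python sketch. The German words here are hypothetical examples, not entries from the actual proposal; "junior", "abandon", and "wagon" are real BIP-39 English entries mentioned in this thread.

```python
def prefix_collisions(words, n=4):
    """Return groups of words whose first n letters coincide."""
    seen = {}
    for w in words:
        seen.setdefault(w[:n], []).append(w)
    return {p: ws for p, ws in seen.items() if len(ws) > 1}

def shared_words(list_a, list_b):
    """Words appearing in both lists (ideally empty across languages,
    so the first word of a phrase already identifies the wordlist)."""
    return sorted(set(list_a) & set(list_b))

german_draft = ["fuchs", "fuchsig", "gelb", "laufen"]  # hypothetical
english = ["abandon", "ability", "junior", "wagon"]

print(prefix_collisions(german_draft))  # "fuchs"/"fuchsig" share "fuch"
print(shared_words(german_draft, english))  # no cross-list duplicates
```

With a collision-free list, autocompletion can safely expand any entry from its first four letters, and the first recognized word pins down the language.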
I will think about your arguments against limiting the list to infinitive forms. Interesting way of thinking ...
Right, the Italians are the exception I was talking about, with 159 words being 9 letters long. Great to hear that you agree on this very critical point of not exceeding 8 letters.
It's just that I don't want to forget the many people writing down their seed manually as a backup. No urgent decision needed; let's keep sharing pros and cons - it's a fruitful discourse ...
"junior" and 99 other words, beginning with abandon and ending with wagon - there are 100. |
Hi. As I am interested in the creation of all word lists (to a reasonable extent), not only the German one, let me express my thoughts here as well. I am glad to see that there are contributors willing to work on word lists. However, what bothers me is that whenever a person (or a group of people) shows up, they take care of just one list. To be exact, what bothers me is the fact that for each new list very similar problems need to be tackled. For example, the requirements: for languages with a Latin alphabet the maximum word length should be 8, due to the limitations of the displays of hardware wallets. Or the requirement that the first 4 letters should uniquely define a word. Not to mention requirements like the one related to Levenshtein distance. Can't such requirements be shared across many languages? Especially as once-developed tools (to ease work with Levenshtein distance) could be reused. That is why I launched a separate repository just for the creation of word lists: https://github.com/p2w34/wlips. I have launched it with a vision of tackling the creation of all word lists, not just one. Please do not get me wrong - I am not saying the work in this PR needs to be somehow stopped/abandoned/whatever - I am not the one to judge which approach is better. Let me also mention that I am the author of the PR with the Polish word list, and I know how much time is needed to create such a list from scratch. I just wanted to mention that there is also another approach possible. Thank you. |
They are, although some authors and reviewers seem to have slipped up. If by "shared" you mean provided as a turnkey toolset for checking a (proposed) list of words, detecting collisions, analysing mutual edit distance, and so on, that would be truly helpful. I had to write my own set of tools for doing this, but that really was just the simplest and most straightforward part of the task (for me at least). Still, having access to such tools would have saved me at least half a day of work. If I had to break down our work on the German list so far:
If I read your proposal correctly, you're aiming at generating new wordlists automatically, sparing everyone the work of creating one – more or less – manually. I'd expect that this will be very hard to achieve. First, where do you get the base data from? Crawl the Internet? Try to find precompiled lists, hopefully ordered by frequency? Languages are quite varied. Some rules may make sense in one language but not much in others. German infinitive verb forms are not the shortest, for example. Edit distance breaks down for different character sets. This points towards requiring good judgement in flexibly applying the rules, which are somewhat futile to cast in stone. And then how do you implement quality checking?
In summary, I think there's great benefit in a toolset for pre-processing and continuous rule checking. Especially a great UI can cut editing time during quality control down a good deal. However, I am not convinced that the process of generating a word list can be automated beyond compiling and pre-processing an initial set for a human editor. Most of the work is in the (likely manual) process of reaching the final quality level, which I don't think a non-native or less than highly skilled speaker of the target language can achieve. |
Hi @p2w34, I agree except for two points:
I would only recommend that you mark the criteria in your list with different levels of "importance", i.e. how strict a rule is. It's a good approach to harmonize. At the same time, I agree with @cr that, based on different backgrounds (systematics of language, culture, etc.), it is important and good that this work is done by humans. These simple words will be, for many people, the link, connection, and interface to a quite complex sphere. Therefore: the wordlist is probably the most human component in blockchain technology, so it's good it's made by humans. |
Hi @cr, hi @SebastianFloKa, thank you for your extensive comments. I think we all see that there is a lot of manual work involved in choosing proper words. I am strongly convinced the words should be chosen manually. By that, I mean choosing a common 'preliminary word list'. Then, based on such a list, the work could be automated to some extent. The translation part could be done either manually or automated, but extracting a kind of draft for a final list - for sure automated. I have already prepared a script for that (most likely not in its final version, and only for European languages). It is able to extract only words fulfilling the requirements, including the Levenshtein requirement. And then, in order to get the final version of the list from the draft - for sure this may be achieved only manually. @cr, you express a lot of concerns about how to make sure the words are 'the proper ones' - these are valid concerns. My personal opinion is that one should be pragmatic here - the lists should be good enough; they for sure will not be perfect. Issues related to spelling may be caught automatically, some of them for sure need eyeballing, and some will be irrelevant (like whether a particular word is well known enough - I do not think it is possible to fulfill such requirements). I myself would use only nouns and adjectives. I do not consider verbs, as most nouns come from them. That should suffice for European languages. For other groups of languages, this has yet to be clarified.
That makes me happy. You mention reading all the threads and I read them too. I am also aware of the problem of words shared across multiple lists. My personal feeling is this might cause more problems than benefits. Why? It is because:
I am not able to come up with any nice summary of this post, so let me repost yours as I find it well put:
|
@p2w34, I finally understand your approach of manually compiling categorized wordlists upfront, then filtering them down with the tool chain. While this method can definitely work, here's one data point that may be worth contemplating when considering this approach. For the German word list, we started with a set of the 10k most frequent words from a random internet crawl. Just enforcing a-z only, 4-13 letters, 4-uniqueness, and BIP-39 wordlist uniqueness brought that list down to just about 1500 words. We then manually removed about 100 words related to names, countries, etc., and added another ~750 words more or less manually before beginning the process of final quality control. Cutting the final list down to just 8 letters will now cost another ~300 words. In other words, we observed an overall shrinkage of almost 90% just from enforcing technical BIP-39 rules. This shrinkage will vary quite a bit across languages, and it's hard for me to predict, but what I am worried about is that the initial word list may have to be multiple times larger than the 2k we're aiming for, perhaps even tenfold. That may be quite a task for compiling an upfront word list.
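A minimal sketch of such a filtering pass, as I reconstruct it from the criteria named above (this is not the authors' actual tooling; the function and the example words are my own, and the length cap is set to the 8-letter hardware-wallet limit discussed in this thread):

```python
import re

def filter_candidates(candidates, existing_bip39, min_len=4, max_len=8):
    """Keep candidates that pass the technical BIP-39-style rules:
    a-z only, length window, unique 4-letter prefix, and no collision
    with words already used by other BIP-39 wordlists."""
    result, prefixes = [], set()
    taken = set(existing_bip39)
    for word in candidates:
        if not re.fullmatch(r"[a-z]+", word):
            continue  # a-z only: rejects umlauts, ß, hyphens, etc.
        if not (min_len <= len(word) <= max_len):
            continue  # hardware-wallet display limit
        if word[:4] in prefixes:
            continue  # first 4 letters must be unique within the list
        if word in taken:
            continue  # must not clash with existing wordlists
        prefixes.add(word[:4])
        result.append(word)
    return result

# Illustrative run: "hausboot" collides on the "haus" prefix,
# "straße" fails a-z, "wagon" is already taken by the English list.
print(filter_candidates(
    ["haus", "hausboot", "straße", "wagen", "wagon", "ab"],
    existing_bip39=["wagon"]))
```

Note that the result depends on input order: the first word to claim a 4-letter prefix wins, which is one reason frequency-sorted input is useful.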
Yes, any word list can be improved and it can't be objectively perfect. There are many judgement calls involved – as discussed before –, and for languages like German, you will have to find a good balance between edit distance and linguistic distinctiveness within that tiny 4 to 8 letter window, so any word list will also end up with an individual style, and the more you look, the more you'll find to improve. I am not sure what you meant by "good enough", but taken at face value I don't agree with the sentiment. The word lists we're producing will be part of a global standard, and they're nearly impossible to repair without major impact on the software ecosystem that supports them. In my opinion, under these conditions, any word list deserves as much care and attention as sensibly possible. |
When I was compiling the Polish word list, I had a different approach. I downloaded a Polish dictionary, and whenever I had time I would go through it and pick words manually. I recall that there was a moment when I had already gone through the whole dictionary and I still needed ~100 words. I did manage to complete the list, but I could have chosen the requirements differently. Especially one of them - I decided to avoid words with Polish diacritics (the Polish alphabet consists of aąbcćdeęfghijklłmnńoóprsśtuwyzźż, and I decided not to use words containing ąćęłńóśźż). That limited the number of eligible words by a lot. That is why in wlips I recommend allowing words with diacritics as well. This is also a reason why I am against avoiding duplicates among the lists - it unnecessarily makes the work harder.
I hope that with the manually selected words + allowing duplicates + allowing diacritics the shrinkage will be lower.
I am also of this opinion. This is my "good enough". |
@p2w34, overall I think that your work is going in a great direction with great potential, and this is truly a fruitful discussion to have, but I think it is digressing further out of scope of this pull request. I would like to ask that we focus discussion in this PR on the list that's been proposed and which we're currently working to bring within the 8-letter maximum spec, only to then put it through some more quality checking. If I digest our wlips discussion down to the portions that are relevant to this PR, you were thinking out loud about not continuing work on this list, and instead regrouping and channeling it through your process, which essentially means starting over. I mentioned that we're under time pressure to provide a German translation in our frontends, and at the moment we're not too far from actually fulfilling that. Considering a switch to your process begs the question of timeline: how much time would you estimate it takes to get your toolset to a point where it's polished enough to invite contributors for new languages, and then for them to get to the point that we're already at right now? |
@cr I thank you for the discussion too. When I wrote the first post here I didn't expect the discussion would evolve so much.
My intention, let me quote myself, was:
So again, I am not the one who decides.
The work on wlips is not meant to be done under any time pressure. The toolset is more or less ready; the most time-consuming part left is to finish the mentioned preliminary word list. I do not put any time frame on it, as I would like to first see that the approach with wlips is correct (so translations of at least a couple of languages should be created). It might happen that along the way one has to draw some conclusions, take a step back, and modify the approach. All that is absolutely needed to make the word lists good enough. |
@p2w34, I know what you wrote, and any time someone writes "I'm not saying ..., but ...", they're still saying it, and I was only considering what you were not really not saying. :) Please don't worry, I am taking this lightly, and I do see the advantages of a toolchain like yours, but with the way things are, we will keep focusing on finalizing our proposal here and hopefully get a German translation into the standard as soon as possible. However, since you're here, running our current proposal as a preliminary list through your toolchain and learning about the various metrics it collects could be really helpful. Would you be able to help out there? |
@cr, |
As promised there's a new list that replaces all words longer than 8 characters. However, I refrained from shortening it to 2048 as now begins the process of optimizing for Levenshtein edit distance which will be rather long. We need to edit about 600 words. |
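For a sense of where such an edit count comes from, here is a sketch (mine, with illustrative words) of a pairwise audit that flags every word sitting closer than a minimum mutual edit distance to some other word. The actual distance target used for this list isn't stated here, so a minimum of 2 is assumed:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (insertions, deletions, substitutions),
    using a rolling single-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[-1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def offenders(words, min_dist=2):
    """Words closer than min_dist to at least one other word;
    each of these needs replacing or editing."""
    bad = set()
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if levenshtein(a, b) < min_dist:
                bad.update((a, b))
    return sorted(bad)

# "fang"/"fange"/"fangen" form a chain of distance-1 neighbours,
# so all three are flagged.
print(offenders(["fang", "fange", "fangen", "fuchs", "gelb"]))
```

The O(n²) pairwise scan is slow for a 2048-word list with pure Python, which matches the performance complaints voiced later in this thread; precomputed length buckets or a C-backed distance library would speed it up considerably.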
Note #941 |
@cr |
@SebastianFloKa At the moment we're still operating our list just within "fixable" space when it comes to beating the final end boss of edit distance requirements. However, asking us to limit the list to infinitive verbs and proper nouns (and what about the other word classes?) puts us safely inside "start over from scratch" territory. It's a tough call, essentially asking us to admit defeat and commit the work of many days to the trash can. In a nutshell, allowing inflections and various word classes has the major advantages of creating more natural and hence more memorable phrases and more distinguishable words, and the increased variance allows for a denser packing of more common words into the narrow gap permissible for BIP-39 wordlists. What, on the other hand, are the advantages of limiting to infinitive verbs and proper nouns? |
@cr Well, if we talk about "days to the trash can", in my case it was literally many, many full-time weeks of "useless" work when trying to fulfill the initial requirement of limiting to 6 letters (even 5 letters were in play). And taking such a tough restriction into consideration, the outcome was actually not too bad (see PDF in the first attempt). Anyway, the list was not completely satisfying; that's why I mentioned continuing in August (so now) on a revised wordlist extended to 8 letters. I will have almost no Internet access for the next four to five weeks, but I prepared some dictionary extracts and can check whether an infinitive-only 8-letter wordlist might generally be feasible. Even though I understand your point, I'm not completely convinced yet about including those inflections, but it's also not my decision.
If "creating memorable phrases" (brain wallet) or the idea of "marking words in a book" as a wallet (all of them actually not really recommended) were the main (or only) intention behind the wordlist, it might have some advantages to include inflections, I agree. On the other hand, the infinitive form of regular verbs is identical with the 1st person plural and the 3rd person plural (infinitive = gehen, 1st person plural = wir gehen, 3rd person plural = sie gehen, etc.). This means that having only the infinitive form does not prevent you from creating memorable phrases using inflections.
One of the advantages of not having inflections would be, for the "standard user", a reduced risk of accidentally misinterpreting a word. Example: your seed starts with "FUCHS, GELB, LAUFEN" and then follows an "ANGLE" (1st person singular). Some people might either overlook this and assume the noun "ANGEL" is intended, or at least be confused about whether there might be a typo (no matter if all caps or not). Another constellation could be when two more expected words are one Levenshtein addition or subtraction away, e.g. "FANGE" (1st person singular) with the expected "FANG" (noun) or "FANGEN" (verb) beside it. This is the advantage I see in reducing the number of word classes (and limiting to infinitives): it reduces the risk of misinterpretation when copying, writing down, communicating, or reading a seed. |
Just found this PR by accident. Great work by @cr, but I find your statement
almost insulting. The reason it stalled was stated by me in that thread:
Looks like your PR is stalling for the same reason. |
Yes, it has stalled indeed as ATM our proposal requires too much re-work to fit the edit distance requirement, and it looks like we'll simply run with our own proprietary list to offer something suitable for our German users in the near future.
@DavidMStraub, I'm sorry for having what I thought was an objective statement come across as an insult. I took the time to review your PR before starting a new one, and after spotting, among the first 130 words or so, things like |
No worries, I seriously think you did a much better job! I just find it weird that such enhancements never get merged but instead get stuck in endless discussions. |
We have dropped BIP39 support from our clients, so unfortunately we don't plan to invest the time it takes to finalize this proposal to also conform to the mutual edit distance requirement. |
I'm working on it right now, but it's extremely time-consuming to achieve the Levenshtein distance and other requirements. As my own tool becomes very slow with lengths up to 8 letters, I will try to switch to the tools mentioned by @p2w34 in order to shorten this.
A new (third attempt) PR was created - see #1071 |
Our project is under significant pressure to offer a German translation of the BIP-39 wordlist in our frontends. Since #721 seems to have stalled due to quality issues, we have undertaken a concerted effort to devise a word list based on the most frequent words of the German language across all word classes. The list has undergone several three-person review cycles (two of us are trained editors) and was carefully filtered for names and everything else that you wouldn't want in such a list.
Here's a rundown of our main criteria: