Add German word list for BIP0039 #721
What are the chances for this to get merged soon? |
Are 'Abfall' and 'Apfel' too phonetically similar? Would a blind user listening to a spoken seed be potentially confused by having both those words in the list? |
@rodasmith, no, those two actually sound quite distinct (Upfall vs. Upfle would be a rough English transcription). |
The words 'Graf' and 'Graph' are homophones, so one should be replaced. |
Another pair of homophones: 'Miene' and 'Mine'. |
Here are a few words I've thrown out when shortening to 2048 that can be put back in if more problems appear:
|
I'm also not sure whether 'Faser' and 'Phase' are distinct enough. |
Are 'Spray' and 'Spree' homophones? |
How about 'Staat' and 'Stadt'? The vowel is longer in the former, but I don't know whether that is a clear enough distinction. |
Spray and Spree are totally different; the latter is pronounced with 'Sh'. Staat and Stadt is a very long vs. a very short vowel, hardly mistakable. Faser vs. Phase might be problematic, so I'm throwing the latter out. |
Are 'Uran' and 'Urahn' homophones? |
No, emphasis on 2nd vs 1st syllable. |
Ack (now that those pairs of homophones are removed) |
Is there a possibility that the user will forget that all words start with a capital letter? Does capitalizing the first letter have a grammatical meaning in German? I would say go with whatever is in the dictionary. If nouns in the dictionary all start capitalized, then I think this is OK. If a user has to remember case (I know some people who write in all upper case no matter what), then I think this is not a good idea. |
German nouns are always capitalized, so that won't be a problem. |
Abart <-> Abort |
Some more comments from reading the list (thanx for compiling it!) |
Thanks @pbengert, you have some good points there, especially the offending words I overlooked. I'll try to see how I can replace them. Forbidding words that differ by a single letter but are pronounced very differently (Abart/Abort, Pose/Posse) maybe seems a bit too restrictive to me... |
@pbengert, I think I've removed all the problematic words and replaced them with innocuous ones. |
@DavidMStraub Thanks for putting this list together, it feels quite good already. Since this feels like a major decision for the ecosystem and probably can't be changed easily, I want to point out a few things:
I'll make some concrete suggestions tomorrow, feel like we could bring in some words like city names (Berlin, Kiel, Halle) or common first names instead of the mentioned ones above. What's the reasoning behind limiting to 6 characters? Couldn't we diversify the words quite a bit if 7 or 8 characters were allowed like with the English list? |
@thomasklemm, you find the words "Obrist", "Odem" and "Sept" (https://de.wikipedia.org/wiki/Sept) too uncommon, want to exclude scientific terms, but ask for a larger Levenshtein distance. And in the end we should still have 2048 words without any Umlaute or ß (this is a significant constraint). I don't think this is realistic, but in any case it would mean starting from scratch. If you are serious about this, I guess it's better to close this PR and make a new one. By the way, I don't have the impression the English wordlist is very different qualitatively.
and so on. |
To make this more quantitative, this notebook proves that the distribution of pairwise Levenshtein distances is not worse than for the English word list. |
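The notebook itself is not reproduced in this thread. As a rough illustration of the kind of comparison described, here is a minimal Python sketch that builds a histogram of pairwise Levenshtein distances for a word list. The function names `levenshtein` and `distance_histogram` and the five-word sample are assumptions for illustration only, not the notebook's actual code.

```python
from collections import Counter
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter word in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_histogram(words):
    """Count how many unordered word pairs fall at each edit distance."""
    return Counter(levenshtein(a, b) for a, b in combinations(words, 2))


# Hypothetical sample; a real comparison would load the full 2048-word lists.
sample = ["Abart", "Abort", "Pose", "Posse", "Zweck"]
histogram = distance_histogram(sample)
for distance in sorted(histogram):
    print(distance, histogram[distance])
```

Running the same function over the German and the English list and comparing the two histograms (in particular the counts at distance 1 and 2) would give the kind of quantitative comparison mentioned above. For 2048 words this is about two million pairs, which plain Python handles, if slowly.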
@DavidMStraub Thanks for creating this PR. Unfortunately I can't really contribute to the discussion, but I was wondering whether there are any indications that this might get merged any time soon? |
@majuric that's a good question. At this point I find it frustrating that nothing happens even though I addressed many suggestions. |
Many thanks to @DavidMStraub for initiating this. I fully agree with his intention to achieve a BIP0039 German wordlist with a low number of letters per word. The advantage becomes clear when engraving the mnemonic into stone, steel, etc.: it would save many people quite a lot of time. |
Greetings. As @DavidMStraub already mentioned, it is indeed not possible to create a reasonable BIP0039 German wordlist out of nouns with a maximum of 6 letters and a minimum Levenshtein distance of 2 (TWO) while meeting all the other criteria at the same time.
Other BIP0039 wordlists solved this quite differently: most allow verbs and adjectives, and some allow a Levenshtein distance of ONE --> see attached overview. |
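To make the combination of criteria concrete, the following Python sketch checks a candidate list against the constraints named in this thread: 2048 entries, at most 6 letters, no umlauts or ß, and no pair of words within a single edit of each other. The names `within_one_edit` and `check_wordlist` are hypothetical, and the check covers only these criteria, not the full set from the attached overview.

```python
import string
from itertools import combinations

ALLOWED = set(string.ascii_letters)  # plain a-z/A-Z, so no ä, ö, ü or ß


def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one insertion, deletion or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) word
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substituted letter
    return a[i:] == b[i + 1:]          # b has exactly one extra letter


def check_wordlist(words, max_len=6, expected_size=2048):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if len(words) != expected_size:
        problems.append(f"expected {expected_size} words, got {len(words)}")
    for word in words:
        if len(word) > max_len:
            problems.append(f"{word}: longer than {max_len} letters")
        if not set(word) <= ALLOWED:
            problems.append(f"{word}: contains characters outside a-z/A-Z")
    for a, b in combinations(words, 2):
        if within_one_edit(a, b):
            problems.append(f"{a}/{b}: Levenshtein distance below 2")
    return problems


# Hypothetical five-word sample, so expected_size is overridden.
for problem in check_wordlist(["Abart", "Abort", "Straße", "Zucker", "Zeitung"],
                              expected_size=5):
    print(problem)
```

Relaxing `max_len` to 7 or 8, or dropping the pairwise check, corresponds to the compromises other wordlists are said to have made.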
In order to proceed, I have attached a wordlist for the above-mentioned Scenario 1 (keep max. 6 letters per word), based on the following compromises:
I assume it's easier for everybody to look at a PDF file (at least for the moment), so here are the Proposal-2019-01-15 and the related overview of criteria (SOR). This is (at least close to) the best compromise between the maximum length of 6 letters per word and the dissimilarity of words (better than the BIP0039 English WL, not as good as the BIP0039 French WL), plus all the other criteria. By the way: the average word length would currently be only 5.1 letters; the English WL averages 5.4 and the French WL 6.8, so not bad.
|
Thanks @SebastianFloKa for this impressive proposal. I will have a detailed look at the wordlist and let you know in case I notice any problems (I don't expect to). I agree that scenario 1 makes a lot of sense. On a purely aesthetic note, I don't quite understand why you want to have it all-caps. IMO that makes it much less readable (and quite ugly). I don't see any ambiguities when using the proper capitalization, as there are no words that differ only by capitalization. Good work! |
So it looks like we all agree that writing nouns completely in lowercase is a no-go for Germans, so this should be out of scope.
Majuscule font (= all caps, = uppercase) in running text is uncomfortable to read; I fully agree with @DavidMStraub. But for standalone information that is important and needs to be recognized as such, it's quite common to use all-caps in German. For example, on the ID card, passport and driving license of Germany and Austria, all relevant information (names, street, colour of eyes, authority, nationality, etc.) is written in uppercase. Many official documents (tax documents, opening a bank account, contracts, etc.) must be filled out in all-caps as well. This is mainly to avoid confusion.
If somebody stores their seed on an electronic device (wallet), the letter case doesn't matter. But for people creating a physical backup of their seed, for example by stamping letters (= Schlagbuchstaben in German) into steel plates or by using prefabricated letters, it matters a lot. Allowing a mixture of lowercase AND uppercase would require a separate set of letter stamps (or prefabricated letters) and make creating the backup by hand more confusing and time-consuming. As the intention of a BIP0039 wordlist is to provide an easy-to-handle solution to users, the number of possible characters should be kept as low as possible, which using all-caps only would achieve.
Then there is the issue of words that exist both as a noun and as an adjective or verb, and there are quite a few of them: Husten/husten, Klasse/klasse, Nutzen/nutzen, Rennen/rennen, Wissen/wissen, etc. There will always be uncertainty about the correct spelling if a mixture of lowercase and uppercase is allowed, particularly if words are transmitted verbally. If only uppercase characters are used, it doesn't matter which meaning somebody connects to a word (such as homonyms); the orthography is always predefined and therefore unambiguous.
It could also be helpful when a partly unreadable seed has to be restored. If many different wordlists exist in the future (a tendency we can already see today), all-caps would help to distinguish the BIP0039 German wordlist from others. Software tools (wallets) could take advantage of this difference, provide more relevant word proposals to the user, and request uppercase letters (or convert automatically) if German is pre-set as the language. |
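As a minimal sketch of the automatic conversion and word proposals suggested above, a wallet could normalize whatever the user types to uppercase before looking it up. The helper names `build_index`, `lookup` and `complete` are hypothetical, and the four-word list is only a sample: its positions are not real BIP0039 indices.

```python
def build_index(wordlist):
    """Map the upper-case form of each word to its position in the list."""
    return {word.upper(): position for position, word in enumerate(wordlist)}


def lookup(index, user_input):
    """Return the list position for user_input, ignoring case and surrounding spaces."""
    return index.get(user_input.strip().upper())


def complete(index, fragment):
    """Suggest words starting with fragment, e.g. for a partly unreadable backup."""
    prefix = fragment.strip().upper()
    return sorted(word for word in index if word.startswith(prefix))


# Hypothetical sample list; a wallet would load the full 2048-word file.
index = build_index(["Zucker", "Zufall", "Zufuhr", "Zweck"])
print(lookup(index, " zufall "))   # position of ZUFALL in the sample list
print(complete(index, "zuf"))      # candidates for a partly readable word
```

Because the proposed list contains no ß or umlauts, Python's `str.upper()` maps each word letter for letter; with ß in the list it would expand to "SS", which is one more reason the character restriction matters for this approach.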
Thinking about the pros and cons of limiting words to 6 letters, it might make sense to extend this to at least 7 or maybe even 8 letters per word. I think it's worth it. |
What's the current state of this issue? I would offer my help |
As far as I'm concerned, I've abandoned this. It's really a weird PR. First, I get a few useful comments and make a few improvements. Then I get a list of requests from @thomasklemm, some of which I showed, even with a Jupyter notebook, are not realistic or reasonable. No reply. Half a year later, @SebastianFloKa comes with a totally different suggestion, but does not actually submit a PR. @FaustmannChr, thanks for offering your help, but if I were you I wouldn't invest time unless one of the maintainers confirms that this actually has a chance of getting merged. |
Yeah @DavidMStraub, I think you are right... And many thanks for the summary and the fast response! |
Start 3rd approach: Levenshtein distances & other main requirements included
bip-0039/german.txt
Outdated
Weib
Weiher
Weiler
Weimar
Weizen
Welpe
Welt
Werber
Werfer
Werk
Werner
Wesen
Wespe
Wicke
Widder
Wien
Wille
Wimpel
Wind
Winter
Winzer
Wipfel
Wirt
Witwe
Witz
Woche
Woge
Wolke
Wonne
Wort
Wrack
Wucht
Wulst
Wunsch
Wurf
Wurst
Wust
Zacken
Zahn
Zander
Zange
Zaster
Zaum
Zaun
Zebra
Zeche
Zecke
Zehe
Zehner
Zeiger
Zeile
Zeiten
Zelle
Zement
Zenit
Zensor
Zepter
Zicke
Ziege
Zieher
Zierde
Ziffer
Zikade
Zimt
Zink
Zinn
Zins
Zipfel
Zirkel
Zitat
Zoff
Zoll
Zombie
Zone
Zopf
Zuber
Zubrot
Zucht
Zucker
Zufall
Zufuhr
Zugabe
Zukauf
Zulage
Zuname
Zunder
Zunft
Zunge
Zuruf
Zusatz
Zuse
Zutat
Zutun
Zuzug
Zweck
Zweig
Zwerg
Zwist
Zyklen
Zypern
@cr initiated a second attempt to create a BIP-0039 German Wordlist at #942 which was closed recently, so let’s continue here with a third attempt.
A standard computer spell checker was used, so prior to release (if this ever happens) we should find somebody competent to double-check. The same goes for the base criteria check (Levenshtein etc.): we should ask somebody such as @bitmover-studio to make sure our tools work properly. There are some discussions ongoing on external platforms; feel free to join here. |
... sorry, will adjust the PR properly later ... Here are the adjusted "special considerations", which need to be added later to the main BIP-0039:
|
Some of these words seem to be homophones according to my understanding of German pronunciation. I think that may cause problems for blind Germans who use an audio interface to work with their seed mnemonic.
Gong
Gosse
Gote
Gotha
Is this pronounced the same as "Gote"?
Novum
Nuance
Nudel
Nugat
Is this pronounced the same as "Nougat"?
Pensum
Pest
Pfad
Pfahl
In northern Germany, is this pronounced the same as "Fall"?
Pest
Pfad
Pfahl
Pfalz
In northern Germany, is this pronounced the same as "Falz"?
Pfeil
Pferd
Pflock
Pflug
In northern Germany, is this pronounced the same as "Flug"?
Pforte
Pfote
Pfuhl
Pfund
In northern Germany, is this pronounced the same as "Fund"?
Sohn
Soja
Sold
Sole
Is this pronounced the same as "Sohle"?
Unzahl
Unze
Urahn
Uran
Is this pronounced the same as "Urahn"?
@rodasmith you are referring to the initial proposal. The new proposal is inside the comment section because I’m not the initiator. Will check for a way to make new proposal more visible - maybe with a separate fork. |
The restriction is against having two or more words that sound the same, regardless of gender. The list could include 'Graph' or 'Graf' but not both. |
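One way to pre-screen a list for pairs like these is to reduce each word to a rough "sound key" and flag words that share a key. The Python sketch below uses a handful of spelling rules chosen only to cover the pairs raised in this thread (ph/f, ou/u, ie/i, and a silent lengthening h). It is a crude heuristic under those assumptions, not an established phonetic algorithm, and `rough_sound_key` and `homophone_candidates` are hypothetical names.

```python
import re
from collections import defaultdict


def rough_sound_key(word: str) -> str:
    """Collapse a few German spelling variants that are pronounced alike."""
    key = word.lower()
    key = key.replace("ph", "f")             # Graph -> graf
    key = key.replace("ou", "u")             # Nougat -> nugat
    key = key.replace("ie", "i")             # Miene -> mine
    key = re.sub(r"([aeiou])h", r"\1", key)  # lengthening h: Sohle -> sole
    return key


def homophone_candidates(words):
    """Group words that share a sound key; only groups of two or more are returned."""
    groups = defaultdict(list)
    for word in words:
        groups[rough_sound_key(word)].append(word)
    return [group for group in groups.values() if len(group) > 1]


sample = ["Graf", "Graph", "Miene", "Mine", "Sole", "Sohle", "Staat", "Stadt"]
for group in homophone_candidates(sample):
    print(" / ".join(group))
```

The output is only a list of candidates for a native speaker to review. For instance, the key ignores stress, so it would also flag 'Uran' and 'Urahn', which were judged distinct earlier in this thread, and it knows nothing about regional pronunciation such as the northern German 'Pf' cases raised in the review comments.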
Because of the confusion caused by the above improvement proposal in the comment section, and for better traceability, a new PR for this 3rd attempt was created; see #1071. |
@SebastianFloKa I will be happy to help. Just mention me when you have your list ready! |
@bitmover-studio excellent, thank you. Other spell checkers found one more error (Avocado), which will be replaced during the next improvement round, which is in preparation. You will be informed. Thanks again. |
@DavidMStraub Consider closing this in favor of #1071 unless you believe the two proposals are competing with each other. (It appears that #1071 builds on this one, but I could be mistaken.) |
This adds a German word list with the following properties: