BIP39: Clarify necessity for ideographic spaces. #130
Conversation
Dude, you're making it WAY more complex than it needs to be. Make the test vectors with ASCII space as is expected by the output, and simply leave the comment about displaying the mnemonic with ideographic space there. I understood what you meant about displaying with ideographic spaces before I even got around to implementing the test vectors, and was displaying the wider spaces in my UI. By providing non-normalised test vectors and all these comments, you are deliberately making things harder than they need to be! Code written to spec will output ASCII spaces; it has to. All the developer needs to know is that when displaying they should use U+3000, and you already had that covered initially! No need to provide test data that needs manipulation and then write more comments etc; the fact remains that test data should always be constant. What if whatever they use to normalise the test data does it incorrectly? There are too many ways for error when you require the test data to be altered for tests to pass! To be honest, I have not known one person who would allow test data that requires manipulation before use... and I'm usually the one who gets busted for not doing the right thing, but I'd never ever do that with test data. |
Don't call me dude.
I am making it in the way it needs to be to protect user funds.
If you are generating a Japanese mnemonic phrase with ASCII spaces, it is a failure to follow spec. As of this pull request, Japanese phrases that are separated by ASCII spaces are considered a failure.
If you understood, then you would understand the need for the generated phrase output not to be NFKD normalized. If I placed the normalized phrase (with ASCII spaces) in there to begin with, then how would you know if your normalization function is working properly?
Writing an if-then statement saying "if (output should be Japanese) then (split with ideographic space) else (split with ASCII space)" is not exactly what I would consider "hard". If more languages require these types of considerations, and you don't want to deal with them, then don't support those languages. |
"Don't call me dude." Ok man :p

"If you are generating a Japanese mnemonic phrase with ASCII spaces, it is a failure to follow spec. As of this pull request, Japanese phrases that are separated by ASCII spaces is considered a failure." No, it is not considered a failure. It would be considered a failure to display the words in the UI without ideographic spaces, but I'm assuming that is known by someone who speaks Japanese, is it not? A Japanese developer would be used to taking ASCII space and replacing it with ideographic; I'm sure this would not be the first place they do it.

"If you understood, then you would understand the need for the generated phrase output to be not NFKD normalized. If I placed the normalized phrase (with ASCII spaces) in there to begin with, then how would you know if your normalization function is working properly?" Does the reference code output non-NFKD-normalized output? NO IT DOES NOT, so regardless of how you feel it should look in the UI... this is independent of EXPECTED OUTPUT.

"Writing an if-then statement saying "if (output should be Japanese) then (split with ideographic space) else (split with ASCII space)" is not exactly what I would consider "hard". If more languages require these types of considerations, and you don't want to deal with them, then don't support those languages." No, it is really easy... I am in fact doing it, and that's what I'm saying... devs will know to do that; the test data should not be deliberately made to fail to get a point across... THE TEST DATA NEEDS TO MATCH EXPECTED OUTPUT FROM THE CODE, NOT HOW YOU THINK THE OUTPUT SHOULD LOOK IN A UI. How is it even remotely logical to have test data that will knowingly fail when compared against output from the reference code? It is not logical. If you want to make sure people display Japanese mnemonics with U+3000, I completely agree and understand; however, providing dud test data is not the way to do it!!!!
Your comment about the UI considerations is enough to get the message across! Again, as I have said, provide test data that works with the reference code, or else I will. |
If the code does not output ideographic spaces for Japanese, it is incorrect code and should be changed immediately. Generating a Japanese phrase using ASCII spaces will risk user funds, and should not be done. The reference code you speak of was made for the English wordlist, and the creators of the reference code were not aware of this problem. That was the whole reason the "special considerations" section was added. So that devs would know what to do for each wordlist. If your code does use ideographic spaces, then your code is correct and will pass my tests. If someone else wants to copy/paste the reference code and slap Japanese.txt into it without reading the special considerations, then they will be generating incorrect phrases (with ASCII spaces) and that is not a PASS for the Japanese word list. |
"If the code does not output ideographic spaces for Japanese, it is incorrect code and should be changed immediately." Why? Why would a specification that explicitly states a normalised wordlist with ASCII space between words have to output differently for a different culture? Globalization, fonts, etc. are the responsibility of the UI designer. Japanese people know to use ideographic spaces in their own language, I'm sure...

"Generating a Japanese phrase using ASCII spaces will risk user funds, and should not be done." Care to explain how?

"The reference code you speak of was made for the English wordlist, and the creators of the reference code were not aware of this problem." It's not a problem; if a localization requires wide spaces, the person displaying the mnemonic will add wide spaces. Not a problem at all.

"That was the whole reason the "special considerations" section was added. So that devs would know what to do for each wordlist." I agree. You listed in the special considerations that when displaying the Japanese mnemonic, if the words were not spaced enough, we use wide space. I literally learned that from your special considerations and put it in my code accordingly. That is the correct place for it: the special considerations, not the test data.

"If someone else wants to copy/paste the reference code and slap Japanese.txt into it without reading the special considerations, then they will be generating incorrect phrases (with ASCII spaces) and that is not a PASS for the Japanese word list." No... that is the whole point of test data: someone SHOULD be able to copy and paste it into the reference code and it should pass without modification. You do not use the test data to make a point; the test data should ALWAYS represent EXPECTED OUTPUT. |
@Thashiznets ok, this conversation needs to continue somewhere else, and preferably in one place and not 3. This pull request is about clarifying that phrase generation should use ideographic spaces. Your claim is dealing with the test vectors and is not directly related to this pull request. The clarification stands. Generate phrases with ideographic spaces, whether that be an add-on on top of the reference code (as in your case) or a modification of the reference code to accommodate Japanese. If you would like to discuss the test vectors being split with ideographic spaces, and you would like to discuss it in a more visible place (bitcoin/bips), then make a pull request changing the link to my repo to a link to your repo. If the discussion there seems to give consensus that the test vectors should "pass" phrases that are dangerous to the Japanese user, then I will change my repo. I am not unreasonable, I am just trying to keep the interests of an under-represented population on github in mind. P.S.
なのか かえる was reported as being mistaken for なの かかえる when using ASCII spaces. ひめじし すあし was reported as being mistaken for ひめじ しす あし. I think this happens less in English... but this is why I like that Mycelium's iPhone app just came out with showing one word at a time separately. |
Ok, I still think the phrase that is used to derive seed bytes is what would be stored and compared against in test results, and I am very surprised that you would prefer test data that fails the reference code rather than modify the reference code. But I guess it is out of my hands. I won't submit new test vectors because I feel it is a bit rude of me to steal your thunder, as it were, and my motivation was just to help other devs in the future who might scratch their heads wondering why seemingly similar-looking phrases fail their unit tests. I have got the tests modified and working in my code, so at least someone can look at my code and grab the phrases from there if they go looking.

Yeah, the thing with the spaces example you show is that the phrase has to be a minimum of 12 words and in increments of three words. If the user gets one or two spaces wrong, the sentence will no longer be a multiple of 3 and will fail code that checks for this! (Note I am unsure if the reference code does this checking; I am doing it in my code.) It is however feasible with a big phrase: say 24 words and they miss 3 spaces, so they turn 6 words into 3 and end up with 21 words instead of 24, and I agree that would cause key loss.

I would strongly suggest you fork and change the reference code to detect Japanese words in use and force ideographic space on phrase output; especially as input is normalised by the reference code anyway, I think it makes sense and see no issue with doing that. And I don't see it breaking current implementations, because ideographic or ASCII in, they all get normalised regardless in the reference code. (I have to normalise input from the GUI layer because PCL doesn't do normalisation, unfortunately.) Is there a reason you don't want to alter the code? I mean, e.g. in my case I don't really know Python as a language. |
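The multiple-of-three word-count check described above can be sketched as follows (a minimal illustration, not the reference implementation's actual validation; NFKD normalization folds ideographic spaces into ASCII spaces before splitting):

```python
import unicodedata

def split_mnemonic(phrase):
    """NFKD-normalize a phrase (folding ideographic spaces, U+3000,
    into ASCII spaces) and split it into words."""
    return unicodedata.normalize("NFKD", phrase).split(" ")

def has_valid_word_count(phrase):
    """BIP39 phrases encode entropy in 32-bit multiples, so the word
    count must be a multiple of 3 between 12 and 24."""
    n = len(split_mnemonic(phrase))
    return n % 3 == 0 and 12 <= n <= 24
```

A phrase where a missed space merges two words would drop the count off a multiple of 3 and fail this check.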
I don't know if it is of any help in your debate. For normalization, I do nothing different between ideographic or ASCII space; I just pass what the user passes as-is. (It works fine because the ideographic space is transformed into an ASCII space during normalization.) The only problem is when I need to split the words to extract the indices of the mnemonic. With my solution, the user is not forced to use ideographic space for Japanese, and even if he does, the generated seed or list of indices will be the same anyway. When I generate the mnemonic, I use the wordlist's space, since it should be the "convention" of the language. I have a basic level of Japanese, and I agree that using ASCII space is harder to read for this language. |
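The point that NFKD normalization folds the two space characters together can be verified directly with Python's standard library (a small illustration):

```python
import unicodedata

ascii_spaced = "あいこくしん あおぞら"      # U+0020 between words
ideo_spaced = "あいこくしん\u3000あおぞら"  # U+3000 (ideographic space)

# NFKD compatibility decomposition maps U+3000 to U+0020, so the two
# byte-for-byte different inputs normalize to the identical string.
assert ascii_spaced != ideo_spaced
assert unicodedata.normalize("NFKD", ideo_spaced) == \
       unicodedata.normalize("NFKD", ascii_spaced)
```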
@NicolasDorier You should normalise the mnemonic when it is input by the user! The reason for this, I believe, was to allow for re-inputting the mnemonic with a keyboard that may not have accented character sets etc. What you are doing, I think, is dangerous. Any user input needs to be normalised!!!!! |
@Thashiznets I don't think @NicolasDorier is doing anything bad or dangerous.
So he says for seed generation (which is what matters, really) he just normalizes, and due to ideographic turns into ASCII after NFKD normalization, this is correct.
I think he is saying that:
@NicolasDorier Am I right? Also, subsequent generations of the seed will take the mnemonic words, join them with wordlist.space, then normalize and PBKDF2? It doesn't seem dangerous, but I could see where someone trying to mimic that method might be more prone to coding mistakes. |
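For reference, the "normalize and PBKDF2" step is what BIP39 specifies: PBKDF2-HMAC-SHA512 over the NFKD-normalized mnemonic, with the salt "mnemonic" + passphrase and 2048 iterations. A minimal sketch:

```python
import hashlib
import unicodedata

def bip39_seed(mnemonic, passphrase=""):
    """Derive the 64-byte BIP39 seed. Both the mnemonic and the
    passphrase are NFKD-normalized first, which is why ideographic
    and ASCII spaces yield identical seed bytes."""
    mnemonic = unicodedata.normalize("NFKD", mnemonic)
    salt = "mnemonic" + unicodedata.normalize("NFKD", passphrase)
    return hashlib.pbkdf2_hmac(
        "sha512", mnemonic.encode("utf-8"), salt.encode("utf-8"), 2048)
```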
@dabura667 @NicolasDorier ah yes, apologies guys, I missed the part where it states that Nicolas is actually normalising before seed generation; that is fine. My error. |
@dabura667 you are right. I don't use the array of words internally after splitting (only for getting indices), I expose the array in Mnemonic just for UI developers, so they can show it nicely to the user. |
@andre-amorim I think this would be better to talk about that in NBitcoin page, I do not know what a Ring Signature is. I am not a cryptographer, so if you have time to explain why it would help bitcoin and point out some link so I understand the why, what and how, I may implement it. |
@NicolasDorier @andre-amorim agreed, tag me in any discussions as well as I have not heard of such thing either but keen to learn. |
I don't see any reason why bip39 couldn't be used to encode a seed for deriving ring signature private keys. |
@voisine Sorry they got off topic. The topic of the pull request is specifying that Japanese phrases should be split with ideographic spaces to prevent users from mistaking words that might appear to run together and make different words. |
I think it's useful to explain ideographic space handling to help non-japanese developers. I only learned of it when testing to see how they are handled by NFKD normalization. If it's the case that the reference implementation fails to handle japanese test cases, then we should correct the reference implementation concurrently with including the test cases. There are no spaces between characters in the same word in the japanese word list, so you can normalize the phrase first, then separate words on ASCII space, and use the words to index into the wordlist to calculate the checksum. Is this correct? |
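The normalize-then-split approach described above can be sketched like this (illustrative only; the three-entry `WORDLIST` is a hypothetical stand-in for the real 2048-word Japanese list):

```python
import unicodedata

# Hypothetical three-word list standing in for the 2048-word Japanese
# wordlist; stored NFKD-normalized, as BIP39 requires.
WORDLIST = [unicodedata.normalize("NFKD", w)
            for w in ["あいこくしん", "あいさつ", "あいだ"]]

def phrase_to_indices(phrase):
    """Normalize first (ideographic spaces become ASCII spaces and
    dakuten characters decompose), then split on ASCII space and look
    each word up in the normalized wordlist."""
    words = unicodedata.normalize("NFKD", phrase).split(" ")
    return [WORDLIST.index(w) for w in words]
```

Since there are no spaces inside a single Japanese word, splitting the normalized phrase on ASCII space recovers the word boundaries correctly.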
@voisine Correct, so programmatically the reference code would just generate a mnemonic with an ASCII space between each Japanese word. And this is why I had an issue with the test vectors containing ideographic spaces instead of ASCII: because the mnemonic generated by the reference code was ASCII-spaced, I was failing against the Japanese test data unless I explicitly added ideographic spaces after building the mnemonic. I didn't like this; however, it has been decided (and I believe added to the reference code) that any Japanese mnemonic comes out with ideographic spaces, and I have followed suit in my .NET implementation. |
Trezor Python Mnemonic supports Japanese and ideographic spaces now.
This is incorrect, as most interpreters will view them as different strings and will not be able to find the word in the list. (non-normalized string != normalized string on a per-byte basis... though depending on the interpreter perhaps it will accommodate for normalized strings when comparing... but I would rather not rely on that.) Example: |
@dabura667 All words on the wordlist are already normalised as per the spec though, right? So unless the normalisation process created a space between characters of words, there should be no spaces between characters, and I believe that is what @voisine is asking. So basically my interpretation of what @voisine is asking is: can you take each word from the word list, add an ASCII space between them, and generate a mnemonic? Correct me if I'm wrong, but I think the answer to that is yes. The ideographic space is purely to account for user input and GUI display, right? |
If I understand @dabura667 correctly, then the japanese word list itself contains non-NFKD normalized words? If so, that should be corrected. Since key derivation involves normalization, then we can normalize the word list without affecting already generated wallets. |
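The check voisine suggests, that a submitted wordlist is already NFKD-normalized, is easy to automate (a sketch):

```python
import unicodedata

def is_nfkd_normalized(words):
    """True if every word is identical to its NFKD form, as BIP39
    requires of wordlists."""
    return all(unicodedata.normalize("NFKD", w) == w for w in words)

# "ぞ" (U+305E) is a precomposed character whose NFKD form is
# "そ" (U+305D) + combining dakuten (U+3099), so a list containing
# the precomposed form would fail this check.
```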
I left it unclear / open to interpretation on whether to use ideographic spaces, but realized that without being specific on its necessity, developers may implement something that would cause trouble for the Japanese user. (two words looking like one word, or phrase verification failing because it can't handle ideographic spaces, etc.)
Misunderstood specification.
@bip39JP excellent, BIP39 actually specifies that wordlists must be NFKD normalized to begin with. I will try to remember to verify this is the case when future wordlists are submitted. Thanks for including a fix. |
@bip39JP I'm assuming the test vectors remain the same, as we should be normalising before generating the seed anyway, correct? |
@voisine @Thashiznets I have normalized the actual words on the wordlist now (as well as the test vectors on my other repo). All I need to do now is make a pull request with python-mnemonic. This should not cause problems, though, as all implementations should normalize the phrases before hashing, not to mention the only places using the Japanese list currently are my bip32jp site and Electrum (which is no longer BIP39 anyway, and has decided to normalize, then remove all spaces for CJK languages... so I think I'll leave Electrum as is.) |
@dabura667 I fear that GitHub displays the wordlist the same whether normalised or not; they certainly look the same. I mean, I don't know if copying and pasting from either results in different actual bytes, so it may be worthwhile forcing devs to normalise the list themselves rather than copy and paste? I believe I did this when putting the list in my implementation. As you say though, given we normalise before seed generation anyway, I don't think anything is going to break as a result. |
@Thashiznets normalization should result in identical looking results in most cases (ideographic spaces appear to be the odd exception) ACK, this is important to get in. |
https://github.com/trezor/python-mnemonic/pull/25/files Yes. If you view it in a browser, it shows normalized text the same as its non-normalized partner (or at least in the case of hiragana + dakuten it does). I believe if you copy it, though, it should copy the normalized bytes... Try opening the JavaScript console in your browser, pasting the snippet below, then overwriting "pastehere" with a normalized (post-fix) word and checking the results; then try again, but this time copy-paste the old word, and see if the bytes that come out differ.

```javascript
var nstring = "pastehere";
var bytes = [];
for (var i = 0; i < nstring.length; ++i) {
  bytes.push(nstring.charCodeAt(i));
}
bytes;
```
|
@dabura667 Right you are. I see Array [ 12354, 12362, 12382, 12425 ], and then the extra code point apparent in the normalised one, [ 12354, 12362, 12381, 12441, 12425 ], using あおぞら. |
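The same inspection can be done in Python: ぞ (U+305E) decomposes under NFKD into そ (U+305D) plus the combining dakuten (U+3099), which matches the code points quoted above.

```python
import unicodedata

word = "あおぞら"
normalized = unicodedata.normalize("NFKD", word)

# The precomposed ぞ (12382) becomes そ (12381) + dakuten (12441).
print([ord(c) for c in word])        # [12354, 12362, 12382, 12425]
print([ord(c) for c in normalized])  # [12354, 12362, 12381, 12441, 12425]
```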
Although I don't have a strong opinion, I vote ACK. I don't think that using non-English wordlists is a good idea in general, but if there is a demand for this, let's do it the correct way. |
Although MultiBit HD does not support Japanese wordlists yet we will very likely introduce them once they arrive in bitcoinj. We may offer a pull request for this at an undefined point in the future. Speaking personally I feel that people are more likely to use a word list in their native language so this will help with Bitcoin adoption globally. ACK on this clarification from MultiBit HD. |
The ideographic space issue confused me for test vectors. Is it correct to say only Japanese uses this? |
@simcity4242 in my code, I have the type WordList with all the words, but I also pass the "space" character to it, so it can adapt to future languages. |
@simcity4242 so far JP is the only word list that has to factor in ideographic spaces, yes, but if in doubt just use ASCII spaces. It's only really an issue if you want to display the mnemonic in a GUI, so output the mnemonic for display with wide spaces. |
The issue is ideographic spaces being used by default when a user inputs the phrase. The wallet has to anticipate and correctly handle it. |
Negative @voisine, the normalization process converts ideographic to ASCII space, so it doesn't matter to the wallet which one is input. Only display of the mnemonic to the user in a GUI is an issue, so again I'd recommend OS localization to detect the correct spaces: if the user is on a JP device, you know to make the spaces ideographic. |
as a non-Japanese developer, I was initially unaware of ideographic spaces and had to make code changes to account for them during input, prior to normalization, so I found the notes useful. |
What code changes did you have to make? Were you splitting the string by spaces or something? |
Initially |
@Thashiznets The trouble is, the JP test vectors use ideographic spaces, so if you're unit testing the |
@simcity4242 what specifically is the problem? You do realise a mnemonic with ideographic or normal spaces produces identical seed bytes, right? |
@simcity4242 https://github.com/trezor/python-mnemonic/blob/master/mnemonic/mnemonic.py#L135 Reference implementation joins using ideographic space. Works with test vectors fine. |
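A minimal sketch of that generation step (illustrative, not the reference implementation itself): the join delimiter is chosen per wordlist, and since NFKD folds U+3000 into U+0020, either spacing normalizes to the same string and so derives the same seed.

```python
import unicodedata

def join_words(words, japanese=False):
    """Join mnemonic words with the delimiter appropriate to the
    wordlist: ideographic space (U+3000) for Japanese, ASCII space
    otherwise."""
    delimiter = "\u3000" if japanese else " "
    return delimiter.join(words)

# Either spacing normalizes to the same string, so seeds match.
phrase = join_words(["あいこくしん", "あいさつ"], japanese=True)
assert unicodedata.normalize("NFKD", phrase) == \
       unicodedata.normalize("NFKD", "あいこくしん あいさつ")
```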
@Thashiznets mate, I do realise that. It's the @dabura667 If the unit test uses code which was written prior to this 3 month old commit, as I did, it's based on very ambiguous recommendations, eg:
I'm not saying that the test vectors fail when using Anyway, thanks everyone because I got it working :D |
What is the process to include a new wordlist for a different language? Create it and make a pull request? Discuss it on a list first? If that's the case, which list? |
Does this help? https://www.youtube.com/watch?v=qT_Zqe_E_QE |
I'm reading about SHA-3 (https://www.youtube.com/watch?v=opT6pIfyGUs) |
Hopefully, this will clear up any misunderstandings.