Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BIP39: Clarify necessity for ideographic spaces. #130

Merged
merged 2 commits into from
May 26, 2015

Conversation

bip39JP
Copy link
Contributor

@bip39JP bip39JP commented Dec 20, 2014

I left it unclear / open to interpretation on whether to use ideographic
spaces, but realized that without being specific on its necessity,
developers may implement something that would cause trouble with the
Japanese user. (two words looking like one word, or phrase verification
failing because it can't handle ideographic spaces, etc.)

Hopefully, this will clear up any mis-understandings.

@bip39JP bip39JP changed the title Clarify necessity for ideographic spaces. BIP39: Clarify necessity for ideographic spaces. Dec 20, 2014
@realindiahotel
Copy link

Dude, you're making it WAY more complex then it needs to be. Make the test vectors with ASCII space as is expected by the output and simply leave the comment about displaying the mnemonic with ideographic space there. I understood what you meant about displaying with ideographic before I even got around to implementing the test vectors and was displaying the wider spaces in my UI. By providing not normalised test vectors and all these comment you are deliberately making things harder than they need be! Code to spec will output ASCII spaces, it has too. All the developer needs to know is that when displaying they should use u3000 and you already had that covered initially! No need to provide test data that needs manipulation then write more comments etc, the fact remains that test data should always be constant. What if whatever they use to normalise the test data does it incorrectly? There is too many ways for error when you require the test data to be altered for tests to pass! To be honest, I have not known one person who would allow test data that requires manipulation before use....and I'm usually the one who gets busted for not doing the right thing but I'd never ever do that with test data.

@bip39JP
Copy link
Contributor Author

bip39JP commented Dec 20, 2014

Dude

Don't call me dude.

you're making it WAY more complex then it needs to be

I am making it in the way it needs to be to protect user funds.

Make the test vectors with ASCII space as is expected by the output and simply leave the comment about displaying the mnemonic with ideographic space there.

If you are generating a Japanese mnemonic phrase with ASCII spaces, it is a failure to follow spec. As of this pull request, Japanese phrases that are separated by ASCII spaces is considered a failure.

I understood what you meant about displaying with ideographic before I even got to the test vectors.

If you understood, then you would understand the need for the generated phrase output to be not NFKD normalized. If I placed the normalized phrase (with ASCII spaces) in there to begin with, then how would you know if your normalization function is working properly?

By providing not normalised test vectors and all these comment you are deliberately making things harder than they need be!

Writing an if-then statement saying "if (output should be Japanese) then (split with ideographic space) else (split with ASCII space)" is not exactly what i would consider "hard". If more languages require these types of considerations, and you don't want to deal with them, then don't support those languages.

@realindiahotel
Copy link

"Don't call me dude."

Ok man :p

"If you are generating a Japanese mnemonic phrase with ASCII spaces, it is a failure to follow spec. As of this pull request, Japanese phrases that are separated by ASCII spaces is considered a failure."

No, it is not considered a failure. It would be considered a failure to display the words in the UI without ideographic spaces but I'm assuming that is known by someone who speaks Japanese is it not? A Japanese developer would be used to taking ASCII space and replacing with ideographic I'm sure this would not be the first place they do it?

"If you understood, then you would understand the need for the generated phrase output to be not NFKD normalized. If I placed the normalized phrase (with ASCII spaces) in there to begin with, then how would you know if your normalization function is working properly?"

Does the reference code output not NFKD normalized output? NO IT DOES NOT so regardless of how you feel it should look in the UI.....this is independent of EXPECTED OUTPUT.

"Writing an if-then statement saying "if (output should be Japanese) then (split with ideographic space) else (split with ASCII space)" is not exactly what i would consider "hard". If more languages require these types of considerations, and you don't want to deal with them, then don't support those languages."

No it is really easy.....I am in fact doing it and that's what I'm saying.....devs will know to do that, the test data should not be deliberately made to fail to get a point across.....THE TEST DATA NEEDS TO MATCH EXPECTED OUTPUT FROM THE CODE, NOT HOW YOU THINK THE OUTPUT SHOULD LOOK IN A UI. How is it even remotely logical to have test data that will knowingly fail when compared against output from the reference code? it is not logical. If you want to make sure people display Japanese mnemonics with u3000 I completely agree and understand, however providing dud test data is not the way to do it!!!! Your comment about the UI considerations is enough to get the message across! Again as I have said, provide test data that works to the reference code else I will.

@bip39JP
Copy link
Contributor Author

bip39JP commented Dec 20, 2014

If the code does not output ideographic spaces for Japanese, it is incorrect code and should be changed immediately.

Generating a Japanese phrase using ASCII spaces will risk user funds, and should not be done.

The reference code you speak of was made for the English wordlist, and the creators of the reference code were not aware of this problem.

That was the whole reason the "special considerations" section was added. So that devs would know what to do for each wordlist.

If your code does use ideographic spaces, then your code is correct and will pass my tests. If someone else wants to copy/paste the reference code and slap Japanese.txt into it without reading the special considerations, then they will be generating incorrect phrases (with ASCII spaces) and that is not a PASS for the Japanese word list.

@realindiahotel
Copy link

"If the code does not output ideographic spaces for Japanese, it is incorrect code and should be changed immediately."

Why? Why would a specification that explicitly states a normalised wordlist with ASCII space between words have to output different for a different culture? Globalization, fonts, etc are the responsibility of the UI designer. Japanese people know to use ideographic spaces in their own language I'm sure.....

"Generating a Japanese phrase using ASCII spaces will risk user funds, and should not be done."

Care to explain how so?

"The reference code you speak of was made for the English wordlist, and the creators of the reference code were not aware of this problem."

It's not a problem, if a localization requires wide spaces, the person displaying the mnemonic will add wide spaces? Not a problem at all.

"That was the whole reason the "special considerations" section was added. So that devs would know what to do for each wordlist."

I agree, you listed in the special considerations that when displaying the Japanese mnemonic, if the words were not spaced enough we use wide space. I literally learned that from your special consideration and put this in my code accordingly, that is the correct place for it, the special considerations, not the test data.

"If someone else wants to copy/paste the reference code and slap Japanese.txt into it without reading the special considerations, then they will be generating incorrect phrases (with ASCII spaces) and that is not a PASS for the Japanese word list."

No....that is the whole point of test data....someone SHOULD be able to copy and paste it into the reference code and it should pass without modification. You do not use the test data to make a point, the test data should ALWAYS represent EXPECTED OUTPUT.

@bip32JP
Copy link

bip32JP commented Dec 20, 2014

@Thashiznets ok, this conversation needs to continue somewhere else, and preferrably in one place and not 3.

This pull request is about clarifying that phrase generation should use ideographic spaces.

Your claim is dealing with the test vectors and is not directly related to this pull request.

The clarification stands. Generate phrases with ideographic spaces. Whether that be an add-on on top of the reference code (like in your case) or a modification of the reference code to accommodate Japanese.

If you would like to discuss the test vectors being split with ideographic spaces, and you would like to discuss it in a more visible place (bitcoin/bips) then make a pull request changing the link to my repo to a link to your repo.

If the discussion there seems to give consensus that the test vectors should "pass" phrases that are dangerous to the Japanese user, then I will change my repo. I am not unreasonable, I am just trying to keep the interests of an under-represented population on github in mind.

P.S.

Care to explain how so?

なのか かえる was reported as being mistaken for なの かかえる when using ASCII
なのか かえる As you can compare here, it would be hard to miss the ideographic space.

ひめじし すあし was reported as being mistaken for ひめじ しす あし
ひめじし すあし again, ideographic space is much less confusing.

I think this happens less so in English... but this is why I like Mycelium's iPhone app just came out with showing one word at a time separately.

@realindiahotel
Copy link

Ok, I still think the phrase that is used to derive seed bytes is what would be stored and compared against in test results, and I am very surprised that you would prefer test data that fails the reference code rather than to modify the reference code. But I guess it is out of my hands.

I won't submit new test vectors because I feel it is a bit rude of me to steal your thunder as it were, and my motivation was just to help other devs in the future who maybe scratch their head wondering why seemingly similar looking phrases fail their unit tests. I have got the tests modified and working in my code so I guess at least someone can look to my code and grab the phrases from there if they go looking.

Yea the thing with the spaces example you show is that the phrase has to be a minimum of 12 words and in increments of three words. If the user gets one or two spaces wrong, the sentence will no longer be a multiple of 3 and will fail code that checks for this! (Note I am unsure if the reference code does this checking, I am doing it in my code). It is however feasible if it is a big phrase so say 24 words and they miss 3 spaces so they turn 6 words into 3 and end up with 21 words instead of 24 and I agree that would cause key loss.

I would strongly suggest you fork and change the reference code to detect Japanese words in use and force ideographic space on phrase output, especially as it is normalised by the reference code on the way in I think it makes sense and see no issue to do that. And I don't see it breaking current implementations because ideographic or ascii in they all get normalised regardless in the reference code. (I have to normalise in from GUI layer because PCL doesn't do normalisation unfortunately).

Is there a reason you don't want to alter the code? I mean e.g. in my case I don't really know Python as a language.

@NicolasDorier
Copy link
Contributor

I don't know if it is of any help in your debate.
But what I ended up doing in NBitcoin is the following :

For normalization, I do nothing different between ideographic or ascii space. I just pass what the user pass as-is. (It works fine because ideographic is transformed into ascii during the normalization)
Thanks to that I am confident that it will not expose to any lost of fund for the user.

The only problem is when I need to split the words to extract the indices of the mnemonic.
What I ended up doing is adding a property called Wordlist.Space where the space is ideographic for Japanese wordlist and ASCII for other.
When I split the list, I split by using both : the ASCII space + the wordlist space, and remove empty strings.
Also, I split BEFORE the normalization. In fact my Mnemonic class do not store the normalized version of the mnemonic at all, I store exactly what the user passed. I normalize only for the seed generation process.

With my solution, the user is not forced to use ideographic space for Japanese, and even if he does, the generated seed or list of indices will be the same anyway.

When I generate the mnemonic, I use the worldlist space, since it should be the "convention" of the language. I have a basic level in Japanese, and I agree that using ASCII space is harder to read for this language.

@realindiahotel
Copy link

@NicolasDorier You should normalise the the Mnemonic when it is input by the user!!! The reason for this I believe was to allow for reinputting the mnemonic with a keyboard that may not have accented character sets etc. What you are doing I think is dangerous. Any user input needs to be normalised!!!!!

@dabura667
Copy link

@Thashiznets I don't think @NicolasDorier is doing anything bad or dangerous.

For normalization, I do nothing different between ideographic or ascii space. I just pass what the user pass as-is. (It works fine because ideographic is transformed into ascii during the normalization)
Thanks to that I am confident that it will not expose to any lost of fund for the user.

So he says for seed generation (which is what matters, really) he just normalizes, and due to ideographic turns into ASCII after NFKD normalization, this is correct.

When I split the list, I split by using both : the ASCII space + the wordlist space, and remove empty strings.
Also, I split BEFORE the normalization. In fact my Mnemonic class do not store the normalized version of the mnemonic at all, I store exactly what the user passed. I normalize only for the seed generation process.

I think he is saying that:

  1. User inputs string
  2. He tries to split the string using both ASCII and the wordlist space
  3. He figures out which one worked (actually results in an array of proper length)
  4. He stores the array in the Mnemonic class as an array of words.
  5. However, the string that gets normalized and sent to PBFKD2 is the original string that the user input as is.

@NicolasDorier Am I right? Also, subsequent generations of seed will draw the mnemonic phrase and join them with wordlist.space, then normalize and PBFKD2?

It doesn't seem dangerous, but I could see where someone trying to mimic that method might be more prone to coding mistakes.

@realindiahotel
Copy link

@dabura667 @NicolasDorier ah yes appologies guys, I missed the part where it states that nicolas is actually normalising before seed generation, that is fine. My error.

@NicolasDorier
Copy link
Contributor

@dabura667 you are right. I don't use the array of words internally after splitting (only for getting indices), I expose the array in Mnemonic just for UI developers, so they can show it nicely to the user.

@andre-amorim
Copy link

HI Nicolas,
Please I don't want to cause any cognitive breakdown, and to be honest I
don't know if is the right time to ask about it.
But,
Do you intent, in the near future ... to refactoring crypto primitive "Ring
Signature" code ( like Bytecoin schema ?)
Thanks
Kind Regards,
AA

:. Andre Amorim .:

Key-ID:

A70B444E

Fingerprint:

FAAD3B6556A8877B938EAB2FC05E2CB0A70B444E

On 31 January 2015 at 13:58, Nicolas Dorier notifications@github.com
wrote:

@dabura667 https://github.com/dabura667 you are right. I don't use the
array of words internally after splitting, I expose that just so UI
developers can show it nicely to the user.


Reply to this email directly or view it on GitHub
#130 (comment).

@NicolasDorier
Copy link
Contributor

@andre-amorim I think this would be better to talk about that in NBitcoin page, I do not know what a Ring Signature is. I am not a cryptographer, so if you have time to explain why it would help bitcoin and point out some link so I understand the why, what and how, I may implement it.

@realindiahotel
Copy link

@NicolasDorier @andre-amorim agreed, tag me in any discussions as well as I have not heard of such thing either but keen to learn.

@dabura667
Copy link

Ping authors @voisine @prusnak

@voisine
Copy link
Contributor

voisine commented Mar 12, 2015

I don't see any reason why bip39 couldn't be used to encode a seed for deriving ring signature private keys.

@dabura667
Copy link

@voisine Sorry they got off topic.

The topic of the pull request is specifying that Japanese phrases should be split with ideographic spaces to prevent users from mistaking words that might appear to run together and make different words.

@voisine
Copy link
Contributor

voisine commented Mar 12, 2015

I think it's useful to explain ideographic space handling to help non-japanese developers. I only learned of it when testing to see how they are handled by NFKD normalization.

If it's the case that the reference implementation fails to handle japanese test cases, then we should correct the reference implementation concurrently with including the test cases.

There are no spaces between characters in the same word in the japanese word list, so you can normalize the phrase first, then separate words on ASCII space, and use the words to index into the wordlist to calculate the checksum. Is this correct?

@realindiahotel
Copy link

@voisine Correct, so programmatically the reference code would just generate a mnemonic with an ASCII space between each Japanese word. And this is why I had an issue with the test vectors containing ideographic spaces instead of ASCII, because the mnemonic generated by the reference code was ASCII spaced, so I was failing against the Jap test data unless I explicitly added ideographic space after building the mnemonic. I didn't like this, however it has been decided (and I believe popped into the reference code) that any mnemonic that comes out that is Japanese comes out with ideographic spaces and I have followed suit in my .NET implementation.

@dabura667
Copy link

I think it's useful to explain ideographic space handling to help non-japanese developers. I only learned of it when testing to see how they are handled by NFKD normalization.

If it's the case that the reference implementation fails to handle japanese test cases, then we should correct the reference implementation concurrently with including the test cases.

Trezor Python Mnemonic supports Japanese and ideographic spaces now.

There are no spaces between characters in the same word in the japanese word list, so you can normalize the phrase first, then separate words on ASCII space, and use the words to index into the wordlist to calculate the checksum. Is this correct?

This is incorrect, as most interpreters will view them as different strings and will not be able to find the word in the list. (non-normalized string != normalized string on a per-byte basis... though depending on the interpreter perhaps it will accommodate for normalized strings when comparing... but I would rather not rely on that.)

Example:
れきだい
UTF8: e3828c e3818d e381a0 e38184 (twelve bytes for 4 characters)
Normalized: e3828c e3818d e3819f e38299 e38184 (notice how byte-wise it adds a new character, this is because it separates the た and the ゛ but when next to each other it would be shown as だ

@realindiahotel
Copy link

@dabura667 All words on the wordlist are already Normalised as per the spec tho right? So unless the normalisation process created a space between characters of words there should be no spaces between characters and I believe that is what @voisine is asking. So basically what my interpretation of what @voisine is asking is can you take each word from the word list, add an ASCII space between them and generate a mnemonic. Correct me if I'm wrong but I think the answer to that is yes. The Ideographic is purely trying to account for user input and GUI display right?

@voisine
Copy link
Contributor

voisine commented Mar 12, 2015

If I understand @dabura667 correctly, then the japanese word list itself contains non-NFKD normalized words? If so, that should be corrected. Since key derivation involves normalization, then we can normalize the word list without affecting already generated wallets.

I left it unclear / open to interpretation on whether to use ideograpic
spaces, but realized that without being specific on its necessity,
developers may implement something that would cause trouble with the
Japanese user. (two words looking like one word, or phrase verification
failing because it can't handle ideographic spaces, etc.)
Misunderstood specification.
@voisine
Copy link
Contributor

voisine commented Mar 12, 2015

@bip39JP excellent, BIP39 actually specifies that wordlists must be NFKD normalized to begin with. I will try to remember to verify this is the case when future wordlists are submitted. Thanks for including a fix.

@realindiahotel
Copy link

@bip39JP I'm assuming the test vectors remain the same as we should be normalising before generating the seed anyway correct?

@dabura667
Copy link

@voisine @Thashiznets I have normalized the actual words on the wordlist now. (as well as the test vectors on my other repo) All I need to do now is make a pull request with python-mnemonic.

This should not cause problems, though, as all implementations should normalize the phrases before hashing, not to mention the only places using the Japanese list currently are my bip32jp site and Electrum (which is no longer BIP39 anyways, and has decided to normalize >> then remove all spaces for CJK languages... to I think I'll leave Electrum as is.)

@realindiahotel
Copy link

@dabura667 I fear that github is displaying the wordlist the same either normalised or not, they certainly look the same I mean I don't know if copy and pasting from either/or results in different actual bytes between them, so it may be worth while forcing devs to normalise the list themselves rather than copy and paste? I believe I did this when putting the list in my implementation. As you say tho, given we normalise pre seed generation anyway I don't think anything is going to break as a result.

@voisine
Copy link
Contributor

voisine commented Mar 12, 2015

@Thashiznets normalization should result in identical looking results in most cases (ideographic spaces appear to be the odd exception)

ACK, this is important to get in.

@dabura667
Copy link

https://github.com/trezor/python-mnemonic/pull/25/files

Yes. If you view it in a browser it shows normalized text as the same as it's non-normalized partner. (or at least in the case of hiragana + dakuten it does)

I believe if you copy it though... it should copy the normalized bytes...

Try opening the javascript console in your browser, pasting the below bytes, then overwriting the "pastehere" with the normalized (post-fix) then see the results, then try again but this time copy paste the old words. see if the bytes that come out differ.

var nstring = "pastehere";
var bytes = [];
for (var i = 0; i < nstring.length; ++i) {
  bytes.push(nstring.charCodeAt(i));
}
bytes;

@dabura667
Copy link

@laanwj @prusnak Notice: Aaron gave an ACK.

@realindiahotel
Copy link

@dabura667 Right you are, I see Array [ 12354, 12362, 12382, 12425 ] and then the extra byte apparent in the normalised one [ 12354, 12362, 12381, 12441, 12425 ] using あおぞら

@prusnak
Copy link
Contributor

prusnak commented Mar 12, 2015

Although I don't have a strong opinion I vote ACK. I don't think that using non-English wordlist is a good idea in general, but if there is a demand for this, let's do it the correct way.

@gary-rowe
Copy link

Although MultiBit HD does not support Japanese wordlists yet we will very likely introduce them once they arrive in bitcoinj. We may offer a pull request for this at an undefined point in the future. Speaking personally I feel that people are more likely to use a word list in their native language so this will help with Bitcoin adoption globally.

ACK on this clarification from MultiBit HD.

laanwj added a commit that referenced this pull request May 26, 2015
BIP39: Clarify necessity for ideographic spaces.
@laanwj laanwj merged commit deec9f1 into bitcoin:master May 26, 2015
@wizardofozzie
Copy link

The ideographic space issue confused me for test vectors. Is it correct to say only Japanese uses this?

@NicolasDorier
Copy link
Contributor

@simcity4242 in my code, I have the type WordList with all the words, but I also pass the "space" character to it, so it can adapt to future language.

@realindiahotel
Copy link

@simcity4242 so far JP is the only word list that has to factor in ideographic spaces yes, but if in doubt just use ASCII spaces it's only really an issue if you want to display the mnemonic in a GUI, so output the mnemonic to display with wide spaces.

@voisine
Copy link
Contributor

voisine commented Jun 11, 2015

The issue is ideographic spaces being used by default when a user inputs the phrase. The wallet has to anticipate and correctly handle it.

@realindiahotel
Copy link

Negative @voisine the normalization process converts ideographic to ASCII space so it doesn't matter to the wallet which one is input. Only display of mnemonic to user in GUI is issue so again I'd recommend OS localization to detect correct spaces, user using JP device, you know to make spaces ideographic.

@voisine
Copy link
Contributor

voisine commented Jun 12, 2015

as a non-Japanese developer, I was initially unaware of ideographic spaces and had to make code changes to account for them during input, prior to normalization, so I found the notes useful.

@realindiahotel
Copy link

What code changes did you have to make? Were you splitting the string by spaces or something?

@voisine
Copy link
Contributor

voisine commented Jun 12, 2015

Initially

@wizardofozzie
Copy link

@Thashiznets The trouble is, the JP test vectors use ideographic spaces, so if you're unit testing the entropy_to_mnemonic function, you will get a fail otherwise. Perhaps this is specific to Python, but FWIW I'm importing the JSON file

@realindiahotel
Copy link

@simcity4242 what specifically is the problem? You do realise a mnemonic with ideographic or normal space produces identical seed bytes right?

@dabura667
Copy link

@simcity4242 https://github.com/trezor/python-mnemonic/blob/master/mnemonic/mnemonic.py#L135

Reference implementation joins using ideographic space. Works with test vectors fine.

@wizardofozzie
Copy link

@Thashiznets mate, I do realise that. It's the entropy_to_MNEMONIC I am referring to. Seed generation is not what confused me.

@dabura667 If the unit test uses code which was written prior to this 3 month old commit, as I did, it's based on very ambiguous recommendations, eg:

  1. I left it unclear / open to interpretation on whether to use ideograpic spaces, but realized that without being specific on its necessity, developers may implement something that would cause trouble with the
    Japanese user. (two words looking like one word, or phrase verification failing because it can't handle ideographic spaces, etc.)
  2. [Note the change here too, as it suggests I'm not alone in the confusion]f0dd2d5#diff-77888683310fe50a2218bbb62ec5cd68R13

I'm not saying that the test vectors fail when using \u3000 or \xe3\x80\8x0 as seperators; I'm just saying the it was confusing to me. You're right, re reference implementation code; I just checked and my local copy was 2-3 versions old.

Anyway, thanks everyone because I got it working :D

@fczuardi
Copy link

fczuardi commented Sep 2, 2015

What is the process to include a new wordlist for a different language? Create it and make a pull request? Discuss it on a list first? If that's the case, which list?

@manobra
Copy link

manobra commented Apr 12, 2016

isso ajuda? https://www.youtube.com/watch?v=qT_Zqe_E_QE

@andre-amorim
Copy link

eu to lendo sobre o sha3 ( https://www.youtube.com/watch?v=opT6pIfyGUs)
... nao importa o que eu diga vao dizer que e' OFF TOPIC que coisa mais
previsivel que isso ?

On 12 April 2016 at 17:11, Manobfra notifications@github.com wrote:

isso ajuda? https://www.youtube.com/watch?v=qT_Zqe_E_QE


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#130 (comment)

jonatack pushed a commit to jonatack/bips that referenced this pull request Nov 12, 2019
Fxied typo in taproot_sign_script()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet