Base64 decode of Japanese UTF-8 Encoded text causes URI malformed Error #1

Closed
arwyn opened this issue Dec 19, 2017 · 7 comments

@arwyn

arwyn commented Dec 19, 2017

A quick review of the code shows the error occurs here:

(outputEncoding === OUTPUT_STRING) ? decodeURIComponent(escape(arr2str(decode(base64Str)))) : decode(base64Str)

Why are you escaping, then decoding as a URI? The body is not guaranteed to be URI formatted.

The bug seems to have been introduced here:
101c606

The following JavaScript reproduces the issue:
decodeURIComponent(escape("日本語"))
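
A minimal sketch of why that one-liner throws, next to the case the escape/decodeURIComponent idiom was designed for. It assumes a runtime that still provides the legacy global `escape` (Node.js and browsers do); the helper name `binaryToUtf8` is made up for illustration.

```javascript
// Legacy idiom: treat each character of a "binary string" as one UTF-8 byte.
function binaryToUtf8(binaryStr) {
  // escape() turns char codes 0x80-0xFF into %XX, which decodeURIComponent()
  // then reads as UTF-8 bytes.
  return decodeURIComponent(escape(binaryStr));
}

// UTF-8 bytes of "日" (E6 97 A5), one byte per character.
const asBytes = "\xE6\x97\xA5";
console.log(binaryToUtf8(asBytes)); // 日

// An already-decoded string contains code units above 0xFF, which escape()
// writes as %uXXXX. decodeURIComponent() does not understand that form.
console.log(escape("日本語")); // %u65E5%u672C%u8A9E

let errorName = null;
try {
  binaryToUtf8("日本語");
} catch (err) {
  errorName = err.name;
}
console.log(errorName); // URIError
```

So the idiom is only safe when its input is a byte-per-character string that holds complete, valid UTF-8.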

@felixhammerl
Contributor

Here's an explanation of what's going on:
http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html
http://ecmanaut.blogspot.de/2006/07/encoding-decoding-utf8-in-javascript.html

Is it valid base64?
Can you add a test for the breaking base64?

@arwyn
Author

arwyn commented Dec 20, 2017

The Base64 decode itself works correctly: the decode function returns the correct data. The issue is the escape -> decodeURIComponent round trip, which causes a malformed URI error. If you run the code I gave above, you will get the error.

In Base64 it would be something like this: decodeURIComponent(escape(arr2str(decode("5pel5pys6KqeCg=="))))
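
For reference, a sketch of that pipeline using Node.js built-ins instead of the library's own helpers. `Buffer.from(..., "base64")` stands in for `decode`, and the local `arr2str` below is an assumed equivalent (one character per byte), not the library's code. Under those assumptions this particular input is complete UTF-8 and round-trips; the idiom would only throw when the bytes are not valid UTF-8 on their own.

```javascript
// Stand-in for the library's decode(): Base64 text to raw bytes.
const bytes = Buffer.from("5pel5pys6KqeCg==", "base64");

// Assumed behaviour of arr2str(): one character per byte.
function arr2str(arr) {
  return String.fromCharCode(...arr);
}

// The escape/decodeURIComponent idiom applied to the byte string.
const viaEscape = decodeURIComponent(escape(arr2str(bytes)));

// The same bytes decoded with an explicit UTF-8 text decoder.
const viaTextDecoder = new TextDecoder("utf-8").decode(bytes);

console.log(JSON.stringify(viaEscape)); // "日本語\n"
console.log(viaEscape === viaTextDecoder); // true
```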

@arwyn
Author

arwyn commented Dec 20, 2017

The following line is from an actual email I'm parsing. I added a console.log in a try/catch in the decode function to get the base64str. I would prefer if the test case is not used as-is, since it is from an actual email, even though it does not contain any private information.

expect(decode('4pSB4pSB4pSB4pSB4pSB4pSB4pSB4pSB4pSBCuacrOODoeODvOODq+OBr+OAgeODnuOCpOODiuOD')).to.deep.equal("━━━━━━━━━\n 本メールは、マイナ")

@felixhammerl
Contributor

felixhammerl commented Dec 20, 2017

Yes, you're of course right. This has no business being there. It only makes sense if the encoded data is in charset UTF-8 ... which it probably wasn't. Otherwise it'll fail. Can you tell me what the charset was?
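
To illustrate the charset point: bytes in a single-byte charset are usually not valid UTF-8, so the UTF-8-only idiom throws on them. The sketch below uses Node.js `Buffer` and ISO-8859-1 as the example charset; `bytesToText` is a hypothetical helper, not part of the library.

```javascript
// Hypothetical helper: pick the text decoding from the declared charset.
function bytesToText(bytes, charset) {
  const name = charset.toLowerCase();
  if (name === "utf-8" || name === "utf8") {
    return Buffer.from(bytes).toString("utf8");
  }
  if (name === "iso-8859-1" || name === "latin1") {
    return Buffer.from(bytes).toString("latin1");
  }
  throw new Error("charset not handled in this sketch: " + charset);
}

// "café" in ISO-8859-1: the final byte 0xE9 is not valid UTF-8 by itself.
const latin1Bytes = Uint8Array.from([0x63, 0x61, 0x66, 0xe9]);

let legacyError = null;
try {
  decodeURIComponent(escape(String.fromCharCode(...latin1Bytes)));
} catch (err) {
  legacyError = err.name;
}

console.log(legacyError); // URIError
console.log(bytesToText(latin1Bytes, "ISO-8859-1")); // café
```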

@felixhammerl
Contributor

P.S. I had to use your example as test data; I couldn't come across non-UTF-8 Base64-encoded data in the meantime. If you want that changed, we can do that. I figured the fix is more important though :)

@arwyn
Author

arwyn commented Dec 21, 2017

Thank you for your quick response.
There is no private or identifying information in that snippet, so no big issue, I guess.

The mime node has the following header block. It is part of a multipart/mixed message, which in turn is part of a message/rfc822 section of a multipart/report message sent by a remote Postfix server.

Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: base64

The text seems to decode fine in other systems, but they might be doing error correction. Not all lines cause the issue. The Base64 block contains multiple lines. One thing I can think of: since UTF-8 is a multi-byte encoding (1-4 bytes per character), maybe some bytes of a character end up on the next or previous line?

Next week I will be running my code through a large subset of actual mails, with lots of Japanese, Chinese and Korean text and encodings. So far your parser has worked quite well, and I'm very happy with it. I will raise any other issues I find.
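
That hypothesis can be checked in isolation. The 76-character sample line quoted earlier appears to end with the bytes E3 83, the first two bytes of a three-byte character, which is consistent with it. The sketch below uses Node.js `Buffer` and `TextDecoder`; the split position and variable names are made up for illustration. It cuts the UTF-8 bytes of a short string in the middle of a character, encodes each part as its own Base64 line, and compares per-line text decoding with decoding after the bytes are joined.

```javascript
const source = Buffer.from("日本語", "utf8"); // 9 bytes, 3 per character

// Split after 4 bytes, i.e. inside the second character.
const lines = [
  source.subarray(0, 4).toString("base64"),
  source.subarray(4).toString("base64"),
];

// Decoding each line to text on its own fails on the partial sequence.
let perLineError = null;
try {
  const strict = new TextDecoder("utf-8", { fatal: true });
  lines.map((line) => strict.decode(Buffer.from(line, "base64")));
} catch (err) {
  perLineError = err;
}
console.log(perLineError !== null); // true

// Joining the bytes first, then decoding once, recovers the text.
const joined = Buffer.concat(lines.map((line) => Buffer.from(line, "base64")));
const text = new TextDecoder("utf-8", { fatal: true }).decode(joined);
console.log(text); // 日本語
```

When joining everything up front is not practical, `TextDecoder.decode()` also accepts `{ stream: true }`, which carries an incomplete sequence over to the next chunk.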

@felixhammerl
Contributor

Yes, please raise a ticket if anything comes up.

Thanks for the feedback!
