-
Notifications
You must be signed in to change notification settings - Fork 266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve and fix header encoding #294
Improve and fix header encoding #294
Conversation
Wow, @Maria-12648430 another impressive piece of work! Thank you! |
Thanks for the praise, and you're welcome 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I like the changes, but added some relatively minor suggestions/questions.
src/mimemail.erl
Outdated
Value; | ||
false -> | ||
Size = byte_size(Value), | ||
FilteredSize = byte_size(<< <<X>> || <<X>> <= Value, ?is_rfc2047_q_allowed(X) >>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this use || <<X/utf8>> <- ..
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, this is the 2nd time we are using this pattern to calculate the ratio of characters. Maybe define some high-order function for that (that takes predicate fun). And this function could be optimized to use tail recursion in order to not build throw-away binary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this use
|| <<X/utf8>> <- ..?
It doesn't matter here, all bytes we are filtering on are <127. UTF-8 multi-byte codepoints are all >127, and UTF-8 bytes in a multi-byte sequence are also all >127. Anyway, see below ;)
also, this is the 2nd time we are using this pattern to calculate the ratio of characters. Maybe define some high-order function for that (that takes predicate fun). And this function could be optimized to use tail recursion in order to not build throw-away binary.
Yes, we only need counts actually. I'll create a function that takes a predicate and returns {NTrue, NFalse}
, that can be used in both places.
src/mimemail.erl
Outdated
@@ -227,7 +227,7 @@ tokenize_header(Value, Acc) -> | |||
case Type of | |||
<<"q">> -> | |||
%% RFC 2047 #5. (3) | |||
decode_quoted_printable(re:replace(Data, "_", " ", [{return, binary}, global])); | |||
decode_quoted_printable(re:replace(Data, "_", "=20", [{return, binary}, global])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: (since we are already touching this) I wonder if binary:replace
wil lbe slightly faster, since it does not have to parse and compile the regexp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I'll change that.
case is_ascii_printable(Value) of | ||
true -> | ||
% don't encode if all characters are printable ASCII | ||
Value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that it is the same in old and new implementation, but should we do something with this value if it is longer than 1000 bytes?.
I see in proper tests we generate "printable ascii" headers, but maybe we don't generate ones which are long enough?
gen_smtp/test/prop_mimemail.erl
Line 253 in fe4f164
printable_ascii()])}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe... But I'm not sure this is the right place to check this, as what passes through here may be only a part of the header value. Like, in a From
header, what is passed through this function is only the phrase preceding the address, so even if we add a <998 check here, the phrase may just pass, but when the address gets tacked on it could exceed the length limit.
src/mimemail.erl
Outdated
|
||
rfc2047_utf8_encode(Enc, <<>>, Acc, WordAcc, _Left) -> | ||
rfc2047_append_word(Acc, WordAcc, Enc); | ||
rfc2047_utf8_encode(Enc, All = <<2#11110:5, Rest:27, More/binary>>, Acc, WordAcc, Left) -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I'm not an utf-8 expert, but could we use <<C/utf8, More/binary>>
here? To be honest, I don't really understand what's going on in this function..
If we want to later find out how much bytes is needed to encode C
, we probably can check the range in which C
falls, eg https://github.com/proper-testing/proper/blob/9f6a6501430479bed66d08cd795cd34d36ec83aa/src/proper_unicode.erl#L91-L100
UPD: I even see we had utf_char_bytes
function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👩🏫 It's pretty simple: An UTF-8 multi-byte sequence begins with a byte where there are as many ones in the upper bits as there are bytes making up the sequence, followed by a zero bit. So, 11110xxx
starts a 4-byte sequence, 1110xxxx
starts a 3-byte sequence, 110xxxxx
starts a 2-byte one. The subsequent bytes of the sequence all look like 10xxxxxx
, but I assumed valid UTF-8 so I didn't check if they really look like that. The x
are the payload bits, in which I'm also not interested here. 👩🏫
Anyway, your question put me up to an idea how this could all be implemented in a much more simple way than I have it here (and I even noticed a bug on the way). Wait for me to update the PR.
src/mimemail.erl
Outdated
<<Acc/binary, $\r, $\n, $\s, "=?UTF-8?Q?", (rfc2047_q_encode(Word))/binary, "?=">>; | ||
rfc2047_append_word(Acc, Word, b) -> | ||
% subsequent word in Acc, append LWSP and word | ||
<<Acc/binary, $\r, $\n, $\s, "=?UTF-8?B?", (base64:encode(Word))/binary, "?=">>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: do you think this function can be DRY-ed to not have that much repetitive code for b
and q
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure 😅
src/mimemail.erl
Outdated
@@ -1012,67 +1012,151 @@ fix_encoding(Encoding) -> | |||
|
|||
%% @doc Encode a binary or list according to RFC 2047. Input is | |||
%% assumed to be in UTF-8 encoding bytes; not codepoints. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: this coment is for rfc2047_utf8_encode
. And I doubt it would be picked up by edoc/erlang_ls when there is a macro definition between this comment and the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, you're right, will fix.
rfc2047_utf8_encode(Enc, Value, <<>>) | ||
end; | ||
rfc2047_utf8_encode(Value) -> | ||
rfc2047_utf8_encode(list_to_binary(Value)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here we assume that the list is, as before, the list of bytes, not codepoints?
Do you think it might make sense to change this to accpet codepoints instead (so use unicode:characters_to_binary
)? Or it would break backwards-compatibility in some way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I didn't touch the calls. Before, rfc2047_utf8_encode
would accept either a list, or a binary which it would convert to a list via binary_to_list
. It follows that the list must have been assumed to be list of bytes. If we change that to codepoints, we have to be sure the calls use lists of codepoints, if they do use lists. I wouldn't want to go into that just now, TBH, maybe in another PR ;)
a7e7d96
to
6651448
Compare
Updated the PR considering the changes suggested by @seriyps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! 👍
src/mimemail.erl
Outdated
rfc2047_append_word(Acc, WordAcc, Enc); | ||
rfc2047_utf8_encode(Enc, All = <<C/utf8, More/binary>>, Acc, WordAcc, Left) -> | ||
% convert codepoint back to UTF-8 encoded bytes | ||
<<Bytes/binary>> = <<C/utf8>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: I guess it could be just Bytes = <<C/utf8>>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh... what was I thinking, I wonder... 🙄 I'll fix it tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Maria-12648430 you have been obsessing over binary matching too much lately, I guess =^^=
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@juhlig <<"I am NOT!">> 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@seriyps Done
* Choose Q- or B-Encoding depending on percentage of printable characters * Encode all characters not listed in RFC2047 Section 5 (3) in Q-Encoding * Encode Spaces as _ in Q-Encoding * Fix header decoding for trailing _ in Q-Encoded text Co-Authored-By: Jan Uhlig <juhlig@hnc-agency.org>
6651448
to
ed3a777
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks! @mworrell I think it's good to merge.
I agree. Proceeding. Thanks again @Maria-12648430 ! |
You're welcome 😄 |
Having done #292 before, I expected this to be smooth sailing, but, boy... 🤯 Thanks @juhlig for helping 🤗
Ok, here goes.
The current implementation uses list operations, a lot of list reversals and expensive concatenations (
++
).This PR uses binary operations and works without reversing.
The RFC2047 header encoding is incorrect in the current implementation. As it is used, among others, to encode phrases preceding an address (like in
From
-headers), the restrictions of RFC2047 Section 5 (3) should apply, and so only lower- and uppercase ASCII characters, decimal digits, and the characters!
,*
,+
,-
and-
are allowed to appear (unescaped) in a Q-Encoded word text there.This PR encodes characters according to the mentioned RFC2047 Section 5 (3) when Q-Encoding is used.
The current implementation encodes spaces as
=20
in Q-Encoding, which is allowed, but for reasons of readability, it is also allowed (and more common) to encode spaces as_
(underscore).This PR encodes spaces as
_
when Q-Encoding is used.The current implementation encodes everything with Q-Encoding, which is good for readability. However, if most of the value to be encoded is not printable ASCII, the B-Encoding is a better option.
This PR selects Q- or B-Encoding, for the entire header value. Though it is possible to decide encoding for each encoded word individually, it is more confusing than it is worth.
The current implementation has a bug in Q-Decoding headers. It replacses
_
with spaces before passing it to the Quoted-Printable decoder. If the_
was in the last position of an Q-Encoded word text, the Quoted-Printable decoder will drop the space(s), as it is supposed to do.This PR replaces
_
with=20
before decoding.