Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve and fix header encoding #294

Merged

Conversation

Maria-12648430
Copy link
Contributor

Having done #292 before, I expected this to be smooth sailing, but, boy... 🤯 Thanks @juhlig for helping 🤗
Ok, here goes.

  • The current implementation uses list operations, a lot of list reversals and expensive concatenations (++).
    This PR uses binary operations and works without reversing.

  • The RFC2047 header encoding is incorrect in the current implementation. As it is used, among others, to encode phrases preceding an address (like in From-headers), the restrictions of RFC2047 Section 5 (3) should apply, and so only lower- and uppercase ASCII characters, decimal digits, and the characters !, *, +, - and - are allowed to appear (unescaped) in a Q-Encoded word text there.
    This PR encodes characters according to the mentioned RFC2047 Section 5 (3) when Q-Encoding is used.

  • The current implementation encodes spaces as =20 in Q-Encoding, which is allowed, but for reasons of readability, it is also allowed (and more common) to encode spaces as _ (underscore).
    This PR encodes spaces as _ when Q-Encoding is used.

  • The current implementation encodes everything with Q-Encoding, which is good for readability. However, if most of the value to be encoded is not printable ASCII, the B-Encoding is a better option.
    This PR selects Q- or B-Encoding, for the entire header value. Though it is possible to decide encoding for each encoded word individually, it is more confusing than it is worth.

  • The current implementation has a bug in Q-Decoding headers. It replacses _ with spaces before passing it to the Quoted-Printable decoder. If the _ was in the last position of an Q-Encoded word text, the Quoted-Printable decoder will drop the space(s), as it is supposed to do.
    This PR replaces _ with =20 before decoding.

@mworrell mworrell requested a review from seriyps October 16, 2021 21:51
@mworrell
Copy link
Collaborator

Wow, @Maria-12648430 another impressive piece of work! Thank you!

@Maria-12648430
Copy link
Contributor Author

Thanks for the praise, and you're welcome 😄

Copy link
Collaborator

@seriyps seriyps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I like the changes, but added some relatively minor suggestions/questions.

src/mimemail.erl Outdated
Value;
false ->
Size = byte_size(Value),
FilteredSize = byte_size(<< <<X>> || <<X>> <= Value, ?is_rfc2047_q_allowed(X) >>),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use || <<X/utf8>> <- ..?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, this is the 2nd time we are using this pattern to calculate the ratio of characters. Maybe define some high-order function for that (that takes predicate fun). And this function could be optimized to use tail recursion in order to not build throw-away binary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use || <<X/utf8>> <- ..?

It doesn't matter here, all bytes we are filtering on are <127. UTF-8 multi-byte codepoints are all >127, and UTF-8 bytes in a multi-byte sequence are also all >127. Anyway, see below ;)

also, this is the 2nd time we are using this pattern to calculate the ratio of characters. Maybe define some high-order function for that (that takes predicate fun). And this function could be optimized to use tail recursion in order to not build throw-away binary.

Yes, we only need counts actually. I'll create a function that takes a predicate and returns {NTrue, NFalse}, that can be used in both places.

src/mimemail.erl Outdated
@@ -227,7 +227,7 @@ tokenize_header(Value, Acc) ->
case Type of
<<"q">> ->
%% RFC 2047 #5. (3)
decode_quoted_printable(re:replace(Data, "_", " ", [{return, binary}, global]));
decode_quoted_printable(re:replace(Data, "_", "=20", [{return, binary}, global]));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: (since we are already touching this) I wonder if binary:replace wil lbe slightly faster, since it does not have to parse and compile the regexp?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I'll change that.

case is_ascii_printable(Value) of
true ->
% don't encode if all characters are printable ASCII
Value;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that it is the same in old and new implementation, but should we do something with this value if it is longer than 1000 bytes?.
I see in proper tests we generate "printable ascii" headers, but maybe we don't generate ones which are long enough?

printable_ascii()])}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe... But I'm not sure this is the right place to check this, as what passes through here may be only a part of the header value. Like, in a From header, what is passed through this function is only the phrase preceding the address, so even if we add a <998 check here, the phrase may just pass, but when the address gets tacked on it could exceed the length limit.

src/mimemail.erl Outdated

rfc2047_utf8_encode(Enc, <<>>, Acc, WordAcc, _Left) ->
rfc2047_append_word(Acc, WordAcc, Enc);
rfc2047_utf8_encode(Enc, All = <<2#11110:5, Rest:27, More/binary>>, Acc, WordAcc, Left) ->
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I'm not an utf-8 expert, but could we use <<C/utf8, More/binary>> here? To be honest, I don't really understand what's going on in this function..
If we want to later find out how much bytes is needed to encode C, we probably can check the range in which C falls, eg https://github.com/proper-testing/proper/blob/9f6a6501430479bed66d08cd795cd34d36ec83aa/src/proper_unicode.erl#L91-L100
UPD: I even see we had utf_char_bytes function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👩‍🏫 It's pretty simple: An UTF-8 multi-byte sequence begins with a byte where there are as many ones in the upper bits as there are bytes making up the sequence, followed by a zero bit. So, 11110xxx starts a 4-byte sequence, 1110xxxx starts a 3-byte sequence, 110xxxxx starts a 2-byte one. The subsequent bytes of the sequence all look like 10xxxxxx, but I assumed valid UTF-8 so I didn't check if they really look like that. The x are the payload bits, in which I'm also not interested here. 👩‍🏫

Anyway, your question put me up to an idea how this could all be implemented in a much more simple way than I have it here (and I even noticed a bug on the way). Wait for me to update the PR.

src/mimemail.erl Outdated
<<Acc/binary, $\r, $\n, $\s, "=?UTF-8?Q?", (rfc2047_q_encode(Word))/binary, "?=">>;
rfc2047_append_word(Acc, Word, b) ->
% subsequent word in Acc, append LWSP and word
<<Acc/binary, $\r, $\n, $\s, "=?UTF-8?B?", (base64:encode(Word))/binary, "?=">>.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: do you think this function can be DRY-ed to not have that much repetitive code for b and q?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure 😅

src/mimemail.erl Outdated
@@ -1012,67 +1012,151 @@ fix_encoding(Encoding) ->

%% @doc Encode a binary or list according to RFC 2047. Input is
%% assumed to be in UTF-8 encoding bytes; not codepoints.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: this coment is for rfc2047_utf8_encode. And I doubt it would be picked up by edoc/erlang_ls when there is a macro definition between this comment and the function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, you're right, will fix.

rfc2047_utf8_encode(Enc, Value, <<>>)
end;
rfc2047_utf8_encode(Value) ->
rfc2047_utf8_encode(list_to_binary(Value)).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here we assume that the list is, as before, the list of bytes, not codepoints?
Do you think it might make sense to change this to accpet codepoints instead (so use unicode:characters_to_binary)? Or it would break backwards-compatibility in some way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I didn't touch the calls. Before, rfc2047_utf8_encode would accept either a list, or a binary which it would convert to a list via binary_to_list. It follows that the list must have been assumed to be list of bytes. If we change that to codepoints, we have to be sure the calls use lists of codepoints, if they do use lists. I wouldn't want to go into that just now, TBH, maybe in another PR ;)

@Maria-12648430
Copy link
Contributor Author

Updated the PR considering the changes suggested by @seriyps.

Copy link
Collaborator

@seriyps seriyps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! 👍

src/mimemail.erl Outdated
rfc2047_append_word(Acc, WordAcc, Enc);
rfc2047_utf8_encode(Enc, All = <<C/utf8, More/binary>>, Acc, WordAcc, Left) ->
% convert codepoint back to UTF-8 encoded bytes
<<Bytes/binary>> = <<C/utf8>>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: I guess it could be just Bytes = <<C/utf8>>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh... what was I thinking, I wonder... 🙄 I'll fix it tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Maria-12648430 you have been obsessing over binary matching too much lately, I guess =^^=

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juhlig <<"I am NOT!">> 😁

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seriyps Done

* Choose Q- or B-Encoding depending on percentage of printable characters
* Encode all characters not listed in RFC2047 Section 5 (3) in Q-Encoding
* Encode Spaces as _ in Q-Encoding
* Fix header decoding for trailing _ in Q-Encoded text

Co-Authored-By: Jan Uhlig <juhlig@hnc-agency.org>
Copy link
Collaborator

@seriyps seriyps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks! @mworrell I think it's good to merge.

@mworrell
Copy link
Collaborator

I agree. Proceeding.

Thanks again @Maria-12648430 !

@mworrell mworrell merged commit a896938 into gen-smtp:master Oct 19, 2021
@Maria-12648430
Copy link
Contributor Author

You're welcome 😄

@seriyps seriyps mentioned this pull request Jan 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants