Improve and fix header encoding #294

Maria-12648430 · 2021-10-16T18:59:35Z

Having done #292 before, I expected this to be smooth sailing, but, boy... 🤯 Thanks @juhlig for helping 🤗
Ok, here goes.

The current implementation uses list operations, a lot of list reversals and expensive concatenations (++).
This PR uses binary operations and works without reversing.
The RFC2047 header encoding is incorrect in the current implementation. As it is used, among other things, to encode phrases preceding an address (like in From headers), the restrictions of RFC2047 Section 5 (3) should apply, so only lower- and uppercase ASCII letters, decimal digits, and the characters !, *, +, - and / are allowed to appear (unescaped) in a Q-Encoded word text there.
This PR encodes characters according to the mentioned RFC2047 Section 5 (3) when Q-Encoding is used.
The current implementation encodes spaces as =20 in Q-Encoding, which is allowed, but for reasons of readability, it is also allowed (and more common) to encode spaces as _ (underscore).
This PR encodes spaces as _ when Q-Encoding is used.
The current implementation encodes everything with Q-Encoding, which is good for readability. However, if most of the value to be encoded is not printable ASCII, the B-Encoding is a better option.
This PR selects Q- or B-Encoding, for the entire header value. Though it is possible to decide encoding for each encoded word individually, it is more confusing than it is worth.
The current implementation has a bug in Q-Decoding headers. It replaces _ with spaces before passing the text to the Quoted-Printable decoder. If the _ was in the last position of a Q-Encoded word text, the Quoted-Printable decoder will drop the resulting trailing space(s), as it is supposed to do.
This PR replaces _ with =20 before decoding.
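
The Q-Encoding rules from the description (only the characters permitted by RFC2047 Section 5 (3) stay unescaped, spaces become _) could be sketched roughly like this. This is a minimal illustration, not the code from this PR: the module name q_sketch, the macro is_allowed and the functions q_encode/1, q_encode_byte/1 and hex/1 are hypothetical, and the sketch ignores line length limits and splitting into encoded words.

```erlang
-module(q_sketch).
-export([q_encode/1]).

%% Hypothetical macro: characters that may appear unescaped in a
%% Q-Encoded phrase according to RFC 2047 Section 5 (3).
-define(is_allowed(C),
    ((C >= $a andalso C =< $z) orelse
     (C >= $A andalso C =< $Z) orelse
     (C >= $0 andalso C =< $9) orelse
     C =:= $! orelse C =:= $* orelse C =:= $+ orelse
     C =:= $- orelse C =:= $/)).

%% Q-encode the text of one word, byte by byte.
q_encode(Bin) when is_binary(Bin) ->
    << <<(q_encode_byte(B))/binary>> || <<B>> <= Bin >>.

%% A space becomes "_", allowed characters pass through,
%% every other byte becomes "=" followed by two uppercase hex digits.
q_encode_byte($\s) -> <<"_">>;
q_encode_byte(B) when ?is_allowed(B) -> <<B>>;
q_encode_byte(B) -> <<$=, (hex(B bsr 4)), (hex(B band 15))>>.

hex(N) when N < 10 -> $0 + N;
hex(N) -> $A + N - 10.
```

With this sketch, `q_sketch:q_encode(<<"a b=c">>)` gives `<<"a_b=3Dc">>`: the space turns into _ and the = sign, which has a special meaning in Q-Encoding, is escaped.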

mworrell · 2021-10-16T21:52:18Z

Wow, @Maria-12648430 another impressive piece of work! Thank you!

Maria-12648430 · 2021-10-17T09:17:02Z

Thanks for the praise, and you're welcome 😄

seriyps

Thanks! I like the changes, but added some relatively minor suggestions/questions.

seriyps · 2021-10-17T15:07:03Z

src/mimemail.erl

+			Value;
+		false ->
+			Size = byte_size(Value),
+			FilteredSize = byte_size(<< <<X>> || <<X>> <= Value, ?is_rfc2047_q_allowed(X) >>),


Should this use || <<X/utf8>> <- ..?

also, this is the 2nd time we are using this pattern to calculate the ratio of characters. Maybe define some high-order function for that (that takes predicate fun). And this function could be optimized to use tail recursion in order to not build throw-away binary.

> Should this use || <<X/utf8>> <- ..?

It doesn't matter here, all bytes we are filtering on are <127. UTF-8 multi-byte codepoints are all >127, and UTF-8 bytes in a multi-byte sequence are also all >127. Anyway, see below ;)

> also, this is the 2nd time we are using this pattern to calculate the ratio of characters. Maybe define some high-order function for that (that takes predicate fun). And this function could be optimized to use tail recursion in order to not build throw-away binary.

Yes, we only need counts actually. I'll create a function that takes a predicate and returns {NTrue, NFalse}, that can be used in both places.
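
A predicate-based counter of the kind discussed here might look like the following. It is only a sketch under assumptions: the module name count_sketch and the function count/2 are hypothetical and not necessarily what ended up in the PR. It walks the binary byte by byte with tail recursion, so no throw-away binary is built.

```erlang
-module(count_sketch).
-export([count/2]).

%% Returns {NTrue, NFalse}: the number of bytes for which Pred
%% returns true and the number for which it returns false.
count(Pred, Bin) when is_function(Pred, 1), is_binary(Bin) ->
    count(Pred, Bin, 0, 0).

count(_Pred, <<>>, NTrue, NFalse) ->
    {NTrue, NFalse};
count(Pred, <<B, Rest/binary>>, NTrue, NFalse) ->
    case Pred(B) of
        true -> count(Pred, Rest, NTrue + 1, NFalse);
        false -> count(Pred, Rest, NTrue, NFalse + 1)
    end.
```

For example, `count_sketch:count(fun(C) -> C >= $a andalso C =< $z end, <<"aBc">>)` returns `{2, 1}`. The caller can then compare the two counts to decide between Q- and B-Encoding.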

seriyps · 2021-10-17T15:10:55Z

src/mimemail.erl

@@ -227,7 +227,7 @@ tokenize_header(Value, Acc) ->
 				case Type of
 					<<"q">> ->
 						%% RFC 2047 #5. (3)
-						decode_quoted_printable(re:replace(Data, "_", " ", [{return, binary}, global]));
+						decode_quoted_printable(re:replace(Data, "_", "=20", [{return, binary}, global]));


minor: (since we are already touching this) I wonder if binary:replace will be slightly faster, since it does not have to parse and compile the regexp?

You're right, I'll change that.
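
For illustration, the replacement step with binary:replace/4 could be written as below. The module name underscore_sketch and the function prepare/1 are hypothetical; the result would still have to be passed to the Quoted-Printable decoder afterwards.

```erlang
-module(underscore_sketch).
-export([prepare/1]).

%% Replace every "_" in Q-Encoded word text with "=20" without a regexp,
%% so that a trailing "_" is not turned into a droppable trailing space.
prepare(Data) when is_binary(Data) ->
    binary:replace(Data, <<"_">>, <<"=20">>, [global]).
```

Here `underscore_sketch:prepare(<<"foo_bar_">>)` yields `<<"foo=20bar=20">>`, so the final space is protected from the decoder's trailing-whitespace removal.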

seriyps · 2021-10-17T15:11:56Z

src/mimemail.erl

+	case is_ascii_printable(Value) of
+		true ->
+			% don't encode if all characters are printable ASCII
+			Value;


I see that it is the same in the old and new implementation, but should we do something with this value if it is longer than 1000 bytes?
I see that in the PropEr tests we generate "printable ascii" headers, but maybe we don't generate ones which are long enough?

gen_smtp/test/prop_mimemail.erl

Line 253 in fe4f164

printable_ascii()])}.

Maybe... But I'm not sure this is the right place to check this, as what passes through here may be only a part of the header value. Like, in a From header, what is passed through this function is only the phrase preceding the address, so even if we add a <998 check here, the phrase may just pass, but when the address gets tacked on it could exceed the length limit.

seriyps · 2021-10-17T15:22:09Z

src/mimemail.erl

+
+rfc2047_utf8_encode(Enc, <<>>, Acc, WordAcc, _Left) ->
+	rfc2047_append_word(Acc, WordAcc, Enc);
+rfc2047_utf8_encode(Enc, All = <<2#11110:5, Rest:27, More/binary>>, Acc, WordAcc, Left) ->


Sorry, I'm not a UTF-8 expert, but could we use <<C/utf8, More/binary>> here? To be honest, I don't really understand what's going on in this function.
If we later want to find out how many bytes are needed to encode C, we can probably check the range in which C falls, e.g. https://github.com/proper-testing/proper/blob/9f6a6501430479bed66d08cd795cd34d36ec83aa/src/proper_unicode.erl#L91-L100
UPD: I even see we had utf_char_bytes function

👩‍🏫 It's pretty simple: A UTF-8 multi-byte sequence begins with a byte where there are as many ones in the upper bits as there are bytes making up the sequence, followed by a zero bit. So, 11110xxx starts a 4-byte sequence, 1110xxxx starts a 3-byte sequence, 110xxxxx starts a 2-byte one. The subsequent bytes of the sequence all look like 10xxxxxx, but I assumed valid UTF-8 so I didn't check if they really look like that. The x are the payload bits, in which I'm also not interested here. 👩‍🏫

Anyway, your question gave me an idea of how this could all be implemented much more simply than I have it here (and I even noticed a bug along the way). Wait for me to update the PR.
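
The lead-byte rule explained above can be sketched as a small function. This is an illustration only: the module name utf8_len_sketch and the function seq_len/1 are hypothetical, and, like the explanation, the sketch assumes valid UTF-8 and a binary that starts at a sequence boundary.

```erlang
-module(utf8_len_sketch).
-export([seq_len/1]).

%% Length in bytes of the UTF-8 sequence at the start of the binary,
%% judged only by the upper bits of its first byte.
seq_len(<<2#0:1, _:7, _/binary>>) -> 1;      % 0xxxxxxx: ASCII
seq_len(<<2#110:3, _:5, _/binary>>) -> 2;    % 110xxxxx
seq_len(<<2#1110:4, _:4, _/binary>>) -> 3;   % 1110xxxx
seq_len(<<2#11110:5, _:3, _/binary>>) -> 4.  % 11110xxx
```

For valid input, the same number can also be obtained by matching `<<C/utf8, _/binary>>` and taking `byte_size(<<C/utf8>>)`, which is the approach the reviewer suggests.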

seriyps · 2021-10-17T15:34:51Z

src/mimemail.erl

+	<<Acc/binary, $\r, $\n, $\s, "=?UTF-8?Q?", (rfc2047_q_encode(Word))/binary, "?=">>;
+rfc2047_append_word(Acc, Word, b) ->
+	% subsequent word in Acc, append LWSP and word
+	<<Acc/binary, $\r, $\n, $\s, "=?UTF-8?B?", (base64:encode(Word))/binary, "?=">>.


minor: do you think this function can be DRY-ed to not have that much repetitive code for b and q?

seriyps · 2021-10-17T15:57:55Z

src/mimemail.erl

@@ -1012,67 +1012,151 @@ fix_encoding(Encoding) ->

 %% @doc Encode a binary or list according to RFC 2047. Input is
 %% assumed to be in UTF-8 encoding bytes; not codepoints.


minor: this comment is for rfc2047_utf8_encode. And I doubt it would be picked up by edoc/erlang_ls when there is a macro definition between this comment and the function.

Oops, you're right, will fix.

seriyps · 2021-10-17T16:06:26Z

src/mimemail.erl

+			rfc2047_utf8_encode(Enc, Value, <<>>)
+	end;
+rfc2047_utf8_encode(Value) ->
+	rfc2047_utf8_encode(list_to_binary(Value)).


And here we assume that the list is, as before, the list of bytes, not codepoints?
Do you think it might make sense to change this to accept codepoints instead (so use unicode:characters_to_binary)? Or would it break backwards-compatibility in some way?

Well, I didn't touch the calls. Before, rfc2047_utf8_encode would accept either a list, or a binary which it would convert to a list via binary_to_list. It follows that the list must have been assumed to be list of bytes. If we change that to codepoints, we have to be sure the calls use lists of codepoints, if they do use lists. I wouldn't want to go into that just now, TBH, maybe in another PR ;)
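
To make the bytes versus codepoints distinction concrete, here is a small self-checking sketch. The module name list_input_sketch and the function demo/0 are hypothetical; the sketch only shows how the two standard conversion functions treat the same list, which is why switching between them could change behavior for existing callers.

```erlang
-module(list_input_sketch).
-export([demo/0]).

demo() ->
    Bytes = [208, 159],        % UTF-8 bytes of U+041F as a list of bytes
    Codepoints = [16#041F],    % the same character as a list of codepoints
    %% A list of bytes converts unchanged with list_to_binary/1 ...
    <<208, 159>> = list_to_binary(Bytes),
    %% ... while a list of codepoints needs unicode:characters_to_binary/1.
    <<208, 159>> = unicode:characters_to_binary(Codepoints),
    %% Treating the byte list as codepoints encodes each byte separately.
    <<195, 144, 194, 159>> = unicode:characters_to_binary(Bytes),
    %% list_to_binary/1 rejects codepoints above 255 with badarg.
    {'EXIT', {badarg, _}} = (catch list_to_binary(Codepoints)),
    ok.
```

Calling `list_input_sketch:demo()` returns `ok` if all of the matches above hold.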

Maria-12648430 · 2021-10-18T09:00:26Z

Updated the PR considering the changes suggested by @seriyps.

seriyps

Thanks! 👍

seriyps · 2021-10-18T14:58:28Z

src/mimemail.erl

+	rfc2047_append_word(Acc, WordAcc, Enc);
+rfc2047_utf8_encode(Enc, All = <<C/utf8, More/binary>>, Acc, WordAcc, Left) ->
+	% convert codepoint back to UTF-8 encoded bytes
+	<<Bytes/binary>> = <<C/utf8>>,


minor: I guess it could be just Bytes = <<C/utf8>>?

Huh... what was I thinking, I wonder... 🙄 I'll fix it tomorrow.

@Maria-12648430 you have been obsessing over binary matching too much lately, I guess =^^=

@juhlig <<"I am NOT!">> 😁

@seriyps Done

* Choose Q- or B-Encoding depending on percentage of printable characters
* Encode all characters not listed in RFC2047 Section 5 (3) in Q-Encoding
* Encode Spaces as _ in Q-Encoding
* Fix header decoding for trailing _ in Q-Encoded text

Co-Authored-By: Jan Uhlig <juhlig@hnc-agency.org>

seriyps

Cool, thanks! @mworrell I think it's good to merge.

mworrell · 2021-10-19T08:09:41Z

I agree. Proceeding.

Thanks again @Maria-12648430 !

Maria-12648430 · 2021-10-19T10:35:51Z

You're welcome 😄

mworrell requested a review from seriyps October 16, 2021 21:51

seriyps reviewed Oct 17, 2021

Maria-12648430 force-pushed the improve_fix_header_encoding branch from a7e7d96 to 6651448 Compare October 18, 2021 08:59

seriyps approved these changes Oct 18, 2021

Maria-12648430 force-pushed the improve_fix_header_encoding branch from 6651448 to ed3a777 Compare October 19, 2021 06:41

seriyps approved these changes Oct 19, 2021

mworrell merged commit a896938 into gen-smtp:master Oct 19, 2021

seriyps mentioned this pull request Jan 27, 2022

Content type params utf8 #235

Merged
