Improve Quoted-Printable encoding #292

Maria-12648430 · 2021-10-11T13:14:36Z

The current implementation of encode_quoted_printable/3 is a bit crude (no offense, really 😜), especially when it comes to lines reaching the limit of 76 characters, resorting to string functions to retrieve the last line and finding whitespaces in there.

The improved version encode_quoted_printable/6 which I'm proposing here works by remembering the occurences of linear whitespaces (including HTABs), and otherwise only look-aheads in the original data to see if a CRLF follows the current character. I also refactored the encoding of characters itself to work via integer operations only, thereby getting rid of all the io:format and lists:flatten calls.

I admit that my implementation isn't exactly easy on the eye, but that is because QP-encoding itself is tricky. And IMO, the current implementation isn't that easy to figure out, either 😉

I left the current tests untouched in what they test, only replaced the calls to the helper function encode_quoted_printable/3 (of which I don't see the point) with calls to the API function encode_quoted_printable/1. I also removed one test, "newline craziness", of which I also fail to see the point 🤔

Finally, I put in a micro-optimization in choose_transformation/1 which avoids division and rounding and only uses multiplication to figure out if the percentage of printable characters in the sample is above 80.

[EDIT]: Just noticed an edge case in which a line could exceed the length limit 😆 Will fix and update the PR tomorrow.

seriyps

The current quoted_printable has some kind of bug where it strips some trailing(?) whitespaces and tabs. Do you think the new implementation fixes that (so Trim calls could be removed)?

gen_smtp/test/prop_mimemail.erl

Line 121 in 1a4f389

?assertEqual(Trim(Body), Trim(mimemail:decode_quoted_printable(QPEncoded))),

Maria-12648430 · 2021-10-12T07:41:10Z

The current quoted_printable has some kind of bug where it strips some trailing(?) whitespaces and tabs. Do you think the new implementation fixes that (so Trim calls could be removed)?

Yes, indeed 😄 The problem with the current encoder is that it does not fulfill the requirements of (3) of RFC 2045 Section 6.7 fully: while it does encode WSPs that are followed by CRLF, it does not encode them when they are the last character in the body. The new implementation handles that properly.

mworrell · 2021-10-12T08:45:15Z

Nice work!

@seriyps do you want to address the Trim in that one test or shall we accept this as-is?

Maria-12648430 · 2021-10-12T08:47:53Z

@mworrell working on an update for the PR, almost done. I'll remove the Trim along with that, it's not needed any more, encoding and decoding will match up 😄

Maria-12648430 · 2021-10-12T10:37:55Z

Last commit fixes the line length bug I had in my first implementation, adds some tests for a few edge cases, augments another test to ensure that whitespaces are encoded when last character, and removes the trimming from the proptest.

Maria-12648430 · 2021-10-12T11:13:46Z

One last micro-optimization before I leave this in your hands, in choose_transformation: The current implementation determines the number of allowed ASCII characters by calling length/1 on the result of a filtering list comprehension. I changed this to call byte_size/1 on the result of an equivalent binary comprehension. byte_size is a constant-time operation, while length is proportional to the number of list elements.

seriyps

Great, thanks!

mworrell · 2021-10-12T15:48:04Z

I am very happy with this.
Tests are also looking good.

@seriyps shall we merge?

seriyps · 2021-10-12T15:53:37Z

Yes, looks good!

mworrell · 2021-10-12T16:51:13Z

Merged, thank you @Maria-12648430 !

juhlig · 2021-10-12T18:02:58Z

Impressive 😮

seriyps approved these changes Oct 11, 2021

View reviewed changes

Maria-12648430 force-pushed the improve_qp_encoding branch 2 times, most recently from 6264766 to cde436f Compare October 12, 2021 10:27

Maria-12648430 requested a review from seriyps October 12, 2021 10:38

Maria-12648430 force-pushed the improve_qp_encoding branch from cde436f to 73c27ca Compare October 12, 2021 11:07

seriyps approved these changes Oct 12, 2021

View reviewed changes

Improve Quoted-Printable encoding

c84712e

Maria-12648430 force-pushed the improve_qp_encoding branch from 73c27ca to c84712e Compare October 12, 2021 12:41

mworrell merged commit a39c02e into gen-smtp:master Oct 12, 2021

Maria-12648430 mentioned this pull request Oct 16, 2021

Improve and fix header encoding #294

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve Quoted-Printable encoding #292

Improve Quoted-Printable encoding #292

Maria-12648430 commented Oct 11, 2021 •

edited

Loading

seriyps left a comment

Maria-12648430 commented Oct 12, 2021

mworrell commented Oct 12, 2021

Maria-12648430 commented Oct 12, 2021

Maria-12648430 commented Oct 12, 2021

Maria-12648430 commented Oct 12, 2021

seriyps left a comment

mworrell commented Oct 12, 2021

seriyps commented Oct 12, 2021

mworrell commented Oct 12, 2021

juhlig commented Oct 12, 2021

Improve Quoted-Printable encoding #292

Improve Quoted-Printable encoding #292

Conversation

Maria-12648430 commented Oct 11, 2021 • edited Loading

seriyps left a comment

Choose a reason for hiding this comment

Maria-12648430 commented Oct 12, 2021

mworrell commented Oct 12, 2021

Maria-12648430 commented Oct 12, 2021

Maria-12648430 commented Oct 12, 2021

Maria-12648430 commented Oct 12, 2021

seriyps left a comment

Choose a reason for hiding this comment

mworrell commented Oct 12, 2021

seriyps commented Oct 12, 2021

mworrell commented Oct 12, 2021

juhlig commented Oct 12, 2021

Maria-12648430 commented Oct 11, 2021 •

edited

Loading