Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Quoted-Printable encoding #292

Merged
merged 1 commit into from
Oct 12, 2021

Conversation

Maria-12648430
Copy link
Contributor

@Maria-12648430 Maria-12648430 commented Oct 11, 2021

The current implementation of encode_quoted_printable/3 is a bit crude (no offense, really 😜), especially when it comes to lines reaching the limit of 76 characters, resorting to string functions to retrieve the last line and finding whitespaces in there.

The improved version encode_quoted_printable/6 which I'm proposing here works by remembering the occurences of linear whitespaces (including HTABs), and otherwise only look-aheads in the original data to see if a CRLF follows the current character. I also refactored the encoding of characters itself to work via integer operations only, thereby getting rid of all the io:format and lists:flatten calls.

I admit that my implementation isn't exactly easy on the eye, but that is because QP-encoding itself is tricky. And IMO, the current implementation isn't that easy to figure out, either 😉

I left the current tests untouched in what they test, only replaced the calls to the helper function encode_quoted_printable/3 (of which I don't see the point) with calls to the API function encode_quoted_printable/1. I also removed one test, "newline craziness", of which I also fail to see the point 🤔

Finally, I put in a micro-optimization in choose_transformation/1 which avoids division and rounding and only uses multiplication to figure out if the percentage of printable characters in the sample is above 80.

[EDIT]: Just noticed an edge case in which a line could exceed the length limit 😆 Will fix and update the PR tomorrow.

Copy link
Collaborator

@seriyps seriyps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current quoted_printable has some kind of bug where it strips some trailing(?) whitespaces and tabs. Do you think the new implementation fixes that (so Trim calls could be removed)?

?assertEqual(Trim(Body), Trim(mimemail:decode_quoted_printable(QPEncoded))),

@Maria-12648430
Copy link
Contributor Author

The current quoted_printable has some kind of bug where it strips some trailing(?) whitespaces and tabs. Do you think the new implementation fixes that (so Trim calls could be removed)?

Yes, indeed 😄 The problem with the current encoder is that it does not fulfill the requirements of (3) of RFC 2045 Section 6.7 fully: while it does encode WSPs that are followed by CRLF, it does not encode them when they are the last character in the body. The new implementation handles that properly.

@mworrell
Copy link
Collaborator

Nice work!

@seriyps do you want to address the Trim in that one test or shall we accept this as-is?

@Maria-12648430
Copy link
Contributor Author

@mworrell working on an update for the PR, almost done. I'll remove the Trim along with that, it's not needed any more, encoding and decoding will match up 😄

@Maria-12648430 Maria-12648430 force-pushed the improve_qp_encoding branch 2 times, most recently from 6264766 to cde436f Compare October 12, 2021 10:27
@Maria-12648430
Copy link
Contributor Author

Last commit fixes the line length bug I had in my first implementation, adds some tests for a few edge cases, augments another test to ensure that whitespaces are encoded when last character, and removes the trimming from the proptest.

@Maria-12648430
Copy link
Contributor Author

One last micro-optimization before I leave this in your hands, in choose_transformation: The current implementation determines the number of allowed ASCII characters by calling length/1 on the result of a filtering list comprehension. I changed this to call byte_size/1 on the result of an equivalent binary comprehension. byte_size is a constant-time operation, while length is proportional to the number of list elements.

Copy link
Collaborator

@seriyps seriyps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks!

@mworrell
Copy link
Collaborator

I am very happy with this.
Tests are also looking good.

@seriyps shall we merge?

@seriyps
Copy link
Collaborator

seriyps commented Oct 12, 2021

Yes, looks good!

@mworrell mworrell merged commit a39c02e into gen-smtp:master Oct 12, 2021
@mworrell
Copy link
Collaborator

Merged, thank you @Maria-12648430 !

@juhlig
Copy link
Contributor

juhlig commented Oct 12, 2021

Impressive 😮

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants