Add RFC2047-compliant MIME text decoder #9313
Draft
dmsnell wants to merge 1 commit into WordPress:trunk
Questions arise around unspecified failure behaviors.
- What if the syntax is obviously supposed to be an encoding but technically isn't? For example, it's missing a closing '?'. It may be computationally heavy to _guess_ whether something is broken syntax, so for some failures it is ambiguous whether the decoder should copy the input as plaintext or return null.
- What do other high-quality libraries do with errors?
Trac ticket: Core-63864
Status
Please feel free to ignore this for now.
Description
The existing `wp_iso_descrambler()` was added in 2004 because certain email subjects were appearing with funny-looking string spans. The following note was left as a comment:

But even so, it's only likely to truly work with `US-ASCII`, which is rare to find in such a MIME-encoded string. In 2004 it might have been more common for PHP systems to operate on ISO-8859-1 (`latin1`) as their default, but today UTF-8 is the predominant encoding, and because the function returns the bytes exactly as they were encoded, it fails to perform its main function, which is to translate non-ASCII encodings.

The above image illustrates how the bytes print as an invalid UTF-8 sequence in `trunk` after decoding. The 0x80 byte was chosen for this demonstration because in `latin1` it's a control character, in `cp1252` and in HTML it's remapped to the Euro sign, and in UTF-8 it's an invalid sequence.

Without additional conversion, calling code has to know what the encoding of the running PHP system is and what other code will perform re-encoding. It's likely to mess up. Worse, if the encoding is not `ISO-8859-1` (`latin1`) then the decoding is wrong for all character sets.

This patch implements a compliant RFC2047 MIME text decoder and decodes the text into UTF-8. Decoding into a single encoding normalizes the output and gives calling code the freedom to change the encoding if it wants, without needing to make any assumptions or inquire about what it gets.

With the same input as above we can see that the default output is now converted from the indicated input encoding. In this example, that decodes to a control character in UTF-8, but that is authentic to the given input. The re-encodings are now invalid because the returned data is already in UTF-8.
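To make the 0x80 demonstration concrete, here is a minimal sketch, assuming an encoded word such as `=?ISO-8859-1?Q?=80?=`. The helpers `sketch_latin1_to_utf8()` and `sketch_is_utf8()` are hypothetical names used only for illustration and are not part of the patch; they avoid extensions so the example runs on any PHP install.

```php
<?php
// Hypothetical helper: converts ISO-8859-1 bytes to UTF-8 without any extension.
function sketch_latin1_to_utf8( string $bytes ): string {
	$out = '';
	for ( $i = 0, $n = strlen( $bytes ); $i < $n; $i++ ) {
		$b = ord( $bytes[ $i ] );
		// Bytes below 0x80 map to themselves; the rest become two-byte sequences.
		$out .= $b < 0x80
			? chr( $b )
			: chr( 0xC0 | ( $b >> 6 ) ) . chr( 0x80 | ( $b & 0x3F ) );
	}
	return $out;
}

// Hypothetical helper: true when the bytes form a valid UTF-8 string.
function sketch_is_utf8( string $bytes ): bool {
	return 1 === preg_match( '//u', $bytes );
}

// Payload of an encoded word such as "=?ISO-8859-1?Q?=80?=".
$raw = quoted_printable_decode( '=80' ); // The single byte 0x80.

// Returning the raw bytes untouched yields an invalid UTF-8 string.
$raw_is_valid = sketch_is_utf8( $raw );

// Converting from the declared charset yields U+0080 (bytes 0xC2 0x80).
$converted          = sketch_latin1_to_utf8( $raw );
$converted_is_valid = sketch_is_utf8( $converted );
```

Re-encoding `$converted` again as if it were `latin1` would double-encode it, which is why callers should treat the decoder's output as UTF-8 already.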
Supported encodings
This implementation attempts to support as many encodings as are practical based on the availability of decoding logic on the running server.
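The backend preference described in this section (first `mb_convert_encoding()`, then `iconv()`, then direct handling of US-ASCII or UTF-8 bytes) could be sketched roughly as follows. The function name `sketch_to_utf8()` and its `null` return for unconvertible input are assumptions for illustration, not the patch's actual code.

```php
<?php
/**
 * Hypothetical helper, not the function added by the patch.
 * Returns UTF-8 text, or null when no available backend can convert the charset.
 */
function sketch_to_utf8( string $bytes, string $charset ): ?string {
	$charset = strtoupper( $charset );

	// 1. Prefer mbstring when it is loaded and recognizes the charset.
	if ( function_exists( 'mb_convert_encoding' ) ) {
		try {
			$converted = @mb_convert_encoding( $bytes, 'UTF-8', $charset );
			if ( is_string( $converted ) ) {
				return $converted;
			}
		} catch ( \ValueError $e ) {
			// PHP 8 throws for unknown charsets; fall through to the next backend.
		}
	}

	// 2. Fall back to iconv when it is loaded.
	if ( function_exists( 'iconv' ) ) {
		$converted = @iconv( $charset, 'UTF-8', $bytes );
		if ( false !== $converted ) {
			return $converted;
		}
	}

	// 3. Last resort: US-ASCII and UTF-8 input need no conversion, only validation.
	if ( 'US-ASCII' === $charset ) {
		return 1 === preg_match( '/[\x80-\xFF]/', $bytes ) ? null : $bytes;
	}
	if ( 'UTF-8' === $charset ) {
		return 1 === preg_match( '//u', $bytes ) ? $bytes : null;
	}

	return null;
}
```

For example, `sketch_to_utf8( 'abc', 'us-ascii' )` returns `'abc'` regardless of which extensions are loaded, while the result for other charsets depends on the backend that ends up handling them.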
If `mb_convert_encoding()` is available it will be preferred, followed by `iconv()`, followed by direct conversion from US-ASCII or UTF-8 byte streams. Nuances and peculiarities of the PHP text-encoding functions are left as artifacts of PHP and are not addressed in this function.

Error handling
Unfortunately, even where `iconv_mime_decode()` is available, its error-handling options are limited and unclear. By implementing the decoder in user-space the error cases can be explicitly handled, and this implementation provides configurable error handling:

- `preserve-errors` flag. The input text will appear in the output and look jumbled, but perhaps a human can make sense of the data in it. This is how most decoders handle errors.
- `replace-errors` will remove the entire encoded word and replace it with the replacement character U+FFFD `�`. This discards information from the input, but leaves a placemarker indicating that it was there before.
- `bail-on-error` will cause the function to return early and return `null`, effectively the same as the `strict` mode in other decoders.

There are multiple classes of potential errors, and error behavior is not defined in the RFC. This implementation treats all classes in the same way, except for the rule that encoded words must be 75 characters or shorter (this rule was clearly intended for encoders, to make the job of decoding simpler, and does not otherwise speak to the well-formedness of the encoding).
- … `B` and `Q` are supported).
- … `=.` or `=6f` (only upper-case hex digits are allowed).

Of note, the RFC implies no possible syntax errors. Instead, anything which appears as a syntax error indicates that the span of text which looks like an encoded word is actually just plain text, and the parser will skip over it to look for the next well-formed encoded word.
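As a rough illustration of how the three error modes and the "malformed syntax is plain text" rule might interact, consider the sketch below. The function names, the signature, and the way the mode is passed are assumptions, not the patch's API. To stay short it only accepts the `UTF-8` charset and ignores details such as whitespace between adjacent encoded words and the 75-character rule.

```php
<?php
// Hypothetical helper: decodes a B or Q payload, or returns null on error.
function sketch_decode_payload( string $encoding, string $payload ): ?string {
	if ( 'B' === $encoding ) {
		$bytes = base64_decode( $payload, true );
		return false === $bytes ? null : $bytes;
	}
	if ( 'Q' === $encoding ) {
		// Every "=" must be followed by two upper-case hex digits.
		if ( 1 === preg_match( '/=(?![0-9A-F]{2})/', $payload ) ) {
			return null;
		}
		return quoted_printable_decode( str_replace( '_', ' ', $payload ) );
	}
	return null; // Only B and Q are supported.
}

// Hypothetical sketch of the decoder; $on_error uses the mode names from this description.
function sketch_decode_mime( string $text, string $on_error = 'preserve-errors' ): ?string {
	$failed  = false;
	$decoded = preg_replace_callback(
		// Text that does not fit this shape is never matched, so it stays as plain text.
		'/=\?([^?\s]+)\?([^?\s]+)\?([^?\s]*)\?=/',
		static function ( array $m ) use ( $on_error, &$failed ): string {
			$bytes = sketch_decode_payload( strtoupper( $m[2] ), $m[3] );
			$known = 'UTF-8' === strtoupper( $m[1] );
			if ( null !== $bytes && $known && 1 === preg_match( '//u', $bytes ) ) {
				return $bytes;
			}
			$failed = true;
			// "replace-errors" swaps the whole encoded word for U+FFFD;
			// the other modes keep the original text at this point.
			return 'replace-errors' === $on_error ? "\u{FFFD}" : $m[0];
		},
		$text
	);

	// "bail-on-error" discards everything once any encoded word has failed.
	return ( $failed && 'bail-on-error' === $on_error ) ? null : $decoded;
}
```

With this sketch, `'Hi =?UTF-8?Q?=6f?= there'` comes back unchanged under `preserve-errors`, becomes `"Hi \u{FFFD} there"` under `replace-errors`, and yields `null` under `bail-on-error`, while `'=?UTF-8?Q?abc'` (missing its closing `?=`) is returned as-is in every mode because it never parses as an encoded word.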
Notes