Extra cleanup on cleanUTF8.
Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>
ezyang committed Mar 7, 2017
1 parent 9195cb7 commit 4047a62
Showing 2 changed files with 13 additions and 4 deletions.
3 changes: 3 additions & 0 deletions NEWS
@@ -19,6 +19,9 @@ NEWS ( CHANGELOG and HISTORY ) HTMLPurifier
  - Deleted some asserts to avoid linters from choking (#97)
  - Rework Serializer cache behavior to avoid chmod'ing if possible (#32)
  - Embedded semicolons in strings in CSS are now handled correctly!
+ - We accidentally dropped certain Unicode characters if there was
+ one or more invalid characters. This has been fixed, thanks
+ to mpyw <ryosuke_i_628@yahoo.co.jp>
  # By default, when a link has a target attribute associated
  with it, we now also add rel="noopener" in order to
  prevent the new window from being able to overwrite
14 changes: 10 additions & 4 deletions library/HTMLPurifier/Encoder.php
@@ -101,6 +101,14 @@ public static function iconv($in, $out, $text, $max_chunk_size = 8000)
  * It will parse according to UTF-8 and return a valid UTF8 string, with
  * non-SGML codepoints excluded.
  *
+ * Specifically, it will permit:
+ * \x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}
+ * Source: https://www.w3.org/TR/REC-xml/#NT-Char
+ * Arguably this function should be modernized to the HTML5 set
+ * of allowed characters:
+ * https://www.w3.org/TR/html5/syntax.html#preprocessing-the-input-stream
+ * which simultaneously expand and restrict the set of allowed characters.
+ *
  * @param string $str The string to clean
  * @param bool $force_php
  * @return string
@@ -122,15 +130,12 @@ public static function iconv($in, $out, $text, $max_chunk_size = 8000)
  * function that needs to be able to understand UTF-8 characters.
  * As of right now, only smart lossless character encoding converters
  * would need that, and I'm probably not going to implement them.
- * Once again, PHP 6 should solve all our problems.
  */
  public static function cleanUTF8($str, $force_php = false)
  {
  // UTF-8 validity is checked since PHP 4.3.5
  // This is an optimization: if the string is already valid UTF-8, no
  // need to do PHP stuff. 99% of the time, this will be the case.
- // The regexp matches the XML char production, as well as well as excluding
- // non-SGML codepoints U+007F to U+009F
  if (preg_match(
  '/^[\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*$/Du',
  $str
@@ -255,7 +260,8 @@ public static function cleanUTF8($str, $force_php = false)
  // 7F-9F is not strictly prohibited by XML,
  // but it is non-SGML, and thus we don't allow it
  (0xA0 <= $mUcs4 && 0xD7FF >= $mUcs4) ||
- (0xE000 <= $mUcs4 && 0x10FFFF >= $mUcs4)
+ (0xE000 <= $mUcs4 && 0xFFFD >= $mUcs4) ||
+ (0x10000 <= $mUcs4 && 0x10FFFF >= $mUcs4)
  )
  ) {
  $out .= $char;
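The last hunk appears to bring the pure-PHP fallback in line with the fast-path regex: the old single range 0xE000-0x10FFFF also accepted U+FFFE and U+FFFF, which the regex character class leaves out. The following is a minimal sketch of that difference, not code from HTML Purifier. The names oldRangeCheck, newRangeCheck, regexAllows and ALLOWED_PATTERN are hypothetical; only the ranges and the pattern are copied from the diff. It covers just the upper ranges visible in the hunk and assumes PHP 7 or later (scalar type hints and "\u{...}" escapes).

```php
<?php
// Illustrative sketch only; these helpers are hypothetical, not HTML Purifier API.

// Character class copied from the fast-path preg_match() call in the diff.
const ALLOWED_PATTERN =
    '/^[\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*$/Du';

// Upper ranges as written before the commit: one span from U+E000 to U+10FFFF.
function oldRangeCheck(int $cp): bool
{
    return (0xA0 <= $cp && 0xD7FF >= $cp)
        || (0xE000 <= $cp && 0x10FFFF >= $cp);
}

// Upper ranges after the commit: U+FFFE and U+FFFF now fall into a gap.
function newRangeCheck(int $cp): bool
{
    return (0xA0 <= $cp && 0xD7FF >= $cp)
        || (0xE000 <= $cp && 0xFFFD >= $cp)
        || (0x10000 <= $cp && 0x10FFFF >= $cp);
}

// True when the regex accepts a string made of this single character.
function regexAllows(string $char): bool
{
    return preg_match(ALLOWED_PATTERN, $char) === 1;
}

// Code points around the boundary touched by the commit.
$samples = [
    0xFFFD  => "\u{FFFD}",
    0xFFFE  => "\u{FFFE}",
    0xFFFF  => "\u{FFFF}",
    0x10000 => "\u{10000}",
];

foreach ($samples as $cp => $char) {
    printf(
        "U+%04X old=%s new=%s regex=%s\n",
        $cp,
        var_export(oldRangeCheck($cp), true),
        var_export(newRangeCheck($cp), true),
        var_export(regexAllows($char), true)
    );
}
```

In this sketch the old check should disagree with the regex only for U+FFFE and U+FFFF, while the new check should agree with it for all four samples.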
