Item5990, Item9170, Item761, Item2231: Fix character encoding issues with WysiwygPlugin

 * As far as I can tell, Unicode characters and entities are now converted correctly.
 * Numeric entities in ordinary text are converted to characters in the site charset (if the site charset can represent the character) or to named entities (if there is a named entity for the character), which should improve readability of TML. The same conversion is also applied to UTF-8 characters not represented in the site charset, and numeric entities are used where necessary (for characters for which there are no named entities).
 * Entities are now preserved (i.e. not modified at all) inside sticky and verbatim blocks.

There are several changes here, but I cannot do this in small steps without breaking things in between. Each time I fixed one problem, another (lurking) problem popped up somewhere else.

HTML::Entities::_decode_entities converts numeric entities to characters. The numbers always correspond to Unicode codepoints (see http://en.wikipedia.org/wiki/Html_entities#HTML_character_references). Foswiki uses HTML::Entities::_decode_entities to convert named entities to characters. I changed the named-entity conversion to convert to Unicode codepoints, too (it was converting to the site charset, which can cause data corruption for numeric entities in the range 127 to 255 for charsets other than UTF-8 and ISO-8859-1).

This meant that I had to change the text to Unicode characters (not encoded as UTF-8) before decoding entities, which meant extra conversions, including a step to convert characters that cannot be represented in the site charset to entities. There was code to do that in RESTParameter2SiteCharSet, but it used PERLQQ encoding, which corrupted text (converting to perl escape sequences, e.g. \x{2460}, which surprises everyone who encounters this behaviour). That was fixed, too.

Many browsers (including Firefox) interpret pages identified as ISO-8859-1 as if they were encoded with Windows-1252. When posting (e.g. saving) in response to such pages, they also encode data in the same way. This is why mapUnicode2HighBit (and its opposite, mapHighBit2Unicode) were needed. However, those functions complicate the conversion to entities of characters that cannot be represented in the site charset. Perl's standards-compliant Encode to the rescue! If you tell Encode to use the Windows-1252 encoding instead of ISO-8859-1, then it does exactly what we want, and those mapping functions are not necessary.

The WysiwygPluginTests test the conversions for various site charsets using ranges of character codes. I could not determine what charset(s) those character codes referred to, so I changed the tests to be explicit: either Unicode codepoints or codes in the site charset (given as a parameter to the test function). I removed the tests for Unicode codepoints 127 to 159 because they are control characters, which (as far as I am aware) Foswiki does not use. Instead, I added tests for the Unicode codepoints for the Windows-1252 characters with codes 127 to 159.

Foswiki::Plugins::WysiwygPlugin::Constants stores computed data that is derived from %Foswiki::cfg. Some of the WysiwygPlugin unit tests that depend on that data change %Foswiki::cfg temporarily, so the stored data in Foswiki::Plugins::WysiwygPlugin::Constants must be reset before running unit tests that depend on that data.

I tested this with the following site charsets: '' (default value), 'ISO-8859-1', 'ISO-8859-15', 'utf-8'

git-svn-id: http://svn.foswiki.org/trunk@7854 0b4bb1d4-4e5a-0410-9cc4-b2b747904278
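The Windows-1252 substitution described in the commit message can be illustrated with a minimal sketch that uses only the core Encode module. The octet value and the variable names are chosen for illustration and do not come from the commit.

```perl
use strict;
use warnings;
use Encode ();

# Octet 0x80, as a browser might post it for a page labelled ISO-8859-1
# (illustrative value, not taken from the commit).
my $octet = "\x80";

# Strict ISO-8859-1 maps 0x80 to the C1 control character U+0080.
my $latin1 = Encode::decode( 'iso-8859-1', $octet );

# Windows-1252 maps the same octet to the euro sign, U+20AC.
my $cp1252 = Encode::decode( 'windows-1252', $octet );

printf "iso-8859-1:   U+%04X\n", ord $latin1;
printf "windows-1252: U+%04X\n", ord $cp1252;

# Encoding with Windows-1252 turns the euro sign back into octet 0x80,
# which is the job the removed %unicode2HighBit table used to do by hand.
my $round_trip = Encode::encode( 'windows-1252', "\x{20AC}" );
printf "round trip:   0x%02X\n", ord $round_trip;
```

This appears to be why `encoding()` can simply return 'windows-1252' whenever the site charset resolves to ISO-8859-1: Encode then performs the high-bit mapping in both directions.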
foswiki · Jun 19, 2010 · a820f29
1 parent 6fc1ff2
Showing 8 changed files with 553 additions and 169 deletions.
diff --git a/WysiwygPlugin/lib/Foswiki/Plugins/WysiwygPlugin/Constants.pm b/WysiwygPlugin/lib/Foswiki/Plugins/WysiwygPlugin/Constants.pm
@@ -4,6 +4,8 @@ package Foswiki::Plugins::WysiwygPlugin::Constants;
 use strict;
 use warnings;
 
+use Encode;
+
 # HTML elements that are strictly block type, as defined by
 # http://www.htmlhelp.com/reference/html40/block.html.
 # Block type elements do not require
@@ -202,72 +204,51 @@ our %HTML2TML_COLOURMAP = (
 
 ############ Encodings ###############
 
-# Mapping high-bit characters from unicode back to iso-8859-1
-# (a.k.a Windows 1252 a.k.a "ANSI") - http://www.alanwood.net/demos/ansi.html
-our %unicode2HighBit = (
-    chr(8364) => chr(128),
-    chr(8218) => chr(130),
-    chr(402)  => chr(131),
-    chr(8222) => chr(132),
-    chr(8230) => chr(133),
-    chr(8224) => chr(134),
-    chr(8225) => chr(135),
-    chr(710)  => chr(136),
-    chr(8240) => chr(137),
-    chr(352)  => chr(138),
-    chr(8249) => chr(139),
-    chr(338)  => chr(140),
-    chr(381)  => chr(142),
-    chr(8216) => chr(145),
-    chr(8217) => chr(146),
-    chr(8220) => chr(147),
-    chr(8221) => chr(148),
-    chr(8226) => chr(149),
-    chr(8211) => chr(150),
-    chr(8212) => chr(151),
-    chr(732)  => chr(152),
-    chr(8482) => chr(153),
-    chr(353)  => chr(154),
-    chr(8250) => chr(155),
-    chr(339)  => chr(156),
-    chr(382)  => chr(158),
-    chr(376)  => chr(159),
-);
-
-# Reverse mapping
-our %highBit2Unicode = map { $unicode2HighBit{$_} => $_ } keys %unicode2HighBit;
-
-our $unicode2HighBitChars = join( '', keys %unicode2HighBit );
-our $highBit2UnicodeChars = join( '', keys %highBit2Unicode );
 our $encoding;
 
 sub encoding {
     unless ($encoding) {
         $encoding =
           Encode::resolve_alias( $Foswiki::cfg{Site}{CharSet} || 'iso-8859-1' );
+
+        $encoding = 'windows-1252' if $encoding =~ /^iso-8859-1$/i;
     }
     return $encoding;
 }
 
-# Map selected unicode characters back to high-bit chars if
-# iso-8859-1 is selected. This is required because the same characters
-# have different code points in unicode and iso-8859-1. For example,
-# &euro; is 128 in iso-8859-1 and 8364 in unicode.
-sub mapUnicode2HighBit {
-    if ( encoding() eq 'iso-8859-1' ) {
+my $siteCharsetRepresentable;
 
-        # Map unicode back to iso-8859 high-bit chars
-        $_[0] =~ s/([$unicode2HighBitChars])/$unicode2HighBit{$1}/ge;
+# Convert characters (unicode codepoints) that cannot be represented in
+# the site charset to entities. Prefer named entities to numeric entities.
+sub convertNotRepresentabletoEntity {
+    if ( encoding() =~ /^utf-?8/ ) {
+        # UTF-8 can represent all characters, so no entities needed
     }
-}
+    else {
+        unless ($siteCharsetRepresentable) {
+            # Produce a string of unicode characters that contains all of the
+            # characters representable in the site charset
+            $siteCharsetRepresentable = '';
+            for my $code (0 .. 255) {
+                my $unicodeChar = Encode::decode(encoding(), chr($code), Encode::FB_PERLQQ);
+                if ($unicodeChar =~ /^\\x/) {
+                    # code is not valid, so skip it
+                }
+                else {
+                    # Escape codes in the standard ASCII range, as necessary,
+                    # to avoid special interpretation by perl
+                    $unicodeChar = quotemeta($unicodeChar) if ord($unicodeChar) <= 127;
 
-# Map selected high-bit chars to unicode if
-# iso-8859-1 is selected.
-sub mapHighBit2Unicode {
-    if ( encoding() eq 'iso-8859-1' ) {
+                    $siteCharsetRepresentable .= $unicodeChar;
+                }
+            }
+        }
 
-        # Map unicode back to iso-8859 high-bit chars
-        $_[0] =~ s/([$highBit2UnicodeChars])/$highBit2Unicode{$1}/ge;
+        require HTML::Entities;
+        $_[0] = HTML::Entities::encode_entities($_[0], "^$siteCharsetRepresentable");
+        # All characters that cannot be represented in the site charset are now encoded as entities
+        # Named entities are used if available, otherwise numeric entities,
+        # because named entities produce more readable TML
     }
 }
 
@@ -283,26 +264,23 @@ our @safeEntities = qw(
   ETH    Ntilde Ograve Oacute Ocirc  Otilde Ouml   times
   Oslash Ugrave Uacute Ucirc  Uuml   Yacute THORN  szlig
   agrave aacute acirc  atilde auml   aring  aelig  ccedil
-  egrave eacute ecirc  uml    igrave iacute icirc  iuml
+  egrave eacute ecirc  euml   igrave iacute icirc  iuml
   eth    ntilde ograve oacute ocirc  otilde ouml   divide
   oslash ugrave uacute ucirc  uuml   yacute thorn  yuml
 );
 
 # Mapping from entity names to characters
 our $safe_entities;
 
-# Get a hash that maps the safe entities values to characters
-# in the site charset.
+# Get a hash that maps the safe entities values to unicode characters
 sub safeEntities {
     unless ($safe_entities) {
         foreach my $entity (@safeEntities) {
 
             # Decode the entity name to unicode
             my $unicode = HTML::Entities::decode_entities("&$entity;");
 
-            # Map unicode back to iso-8859 high-bit chars if required
-            mapUnicode2HighBit($unicode);
-            $safe_entities->{$entity} = Encode::encode( encoding(), $unicode );
+            $safe_entities->{"$entity"} = $unicode;
         }
     }
     return $safe_entities;
@@ -324,6 +302,13 @@ sub chCodes {
     return $s;
 }
 
+# Allow the unit tests to force re-initialisation of 
+# %Foswiki::cfg-dependent cached data
+sub reinitialiseForTesting {
+    undef $encoding;
+    undef $siteCharsetRepresentable;
+}
+
 # Create shorter alias for other modules
 no strict 'refs';
 *{'WC::'} = \*{'Foswiki::Plugins::WysiwygPlugin::Constants::'};
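The idea behind `convertNotRepresentabletoEntity` can be sketched in isolation. This is not the commit's implementation: the commit precomputes a character class from codes 0 to 255 and passes its negation to `HTML::Entities::encode_entities`, whereas the sketch below trial-encodes each non-ASCII character with `Encode::FB_CROAK`. The helper names `entity_or_char` and `unrepresentable_to_entities` are hypothetical.

```perl
use strict;
use warnings;
use Encode ();
use HTML::Entities ();

# Hypothetical helper: return $char unchanged if $charset can encode it,
# otherwise return an HTML entity for it.
sub entity_or_char {
    my ( $charset, $char ) = @_;
    my $copy = $char;    # encode() may modify its input when CHECK is set
    my $ok = eval { Encode::encode( $charset, $copy, Encode::FB_CROAK ); 1 };

    # encode_entities prefers a named entity and falls back to a numeric one
    return $ok ? $char : HTML::Entities::encode_entities($char);
}

# Hypothetical helper: $text holds Unicode characters, not octets.
sub unrepresentable_to_entities {
    my ( $charset, $text ) = @_;

    # UTF-8 can represent every character, so nothing needs converting
    return $text if $charset =~ /^utf-?8/i;

    $text =~ s/([^\x00-\x7F])/entity_or_char( $charset, $1 )/ge;
    return $text;
}

# The em dash (U+2014) exists in Windows-1252 but not in ISO-8859-15.
print unrepresentable_to_entities( 'iso-8859-15', "a\x{2014}b" ), "\n";
```

Trial-encoding per character is slower than a precomputed class, but it works for any single-byte or multi-byte charset that Encode supports, which is why it is used here for illustration.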

diff --git a/WysiwygPlugin/lib/Foswiki/Plugins/WysiwygPlugin/HTML2TML.pm b/WysiwygPlugin/lib/Foswiki/Plugins/WysiwygPlugin/HTML2TML.pm
@@ -91,6 +91,13 @@ Convert a block of HTML text into TML.
 
 =cut
 
+sub debugEncode {
+    my $text = shift;
+    $text = WC::debugEncode($text);
+    $text =~ s/([^\x20-\x7E])/sprintf '\\x{%X}', ord($1)/ge;
+    return $text;
+}
+
 sub convert {
     my ( $this, $text, $options ) = @_;
 
@@ -100,11 +107,62 @@ sub convert {
     $opts = $WC::VERY_CLEAN
       if ( $options->{very_clean} );
 
-    # If the text is UTF8-encoded we have to decode it first, otherwise
-    # the HTML parser will barf.
+    # $text is octets, encoded as per the $Foswiki::cfg{Site}{CharSet}
+    #print STDERR "input     [". debugEncode($text). "]\n\n";
+
+    # Convert (safe) named entities back to the
+    # site charset. Numeric entities are mapped straight to the
+    # corresponding code point unless their value overflow.
+    # HTML::Entities::_decode_entities converts numeric entities 
+    # to Unicode codepoints, so first convert the text to Unicode
+    # characters
     if ( WC::encoding() =~ /^utf-?8/ ) {
+        # text is already UTF-8, so just decode
         $text = Encode::decode_utf8($text);
     }
+    else {
+        # convert to unicode codepoints
+        $text = Encode::decode(WC::encoding(), $text);
+    }
+    # $text is now Unicode characters
+    #print STDERR "unicoded  [". debugEncode($text). "]\n\n";
+
+    # Make sure that & < > ' and " remain encoded, because the parser depends
+    # on it. The safe-entities does not include the corresponding named
+    # entities, so convert numeric entities for these characters to the named 
+    # entity.
+    $text =~ s/\&\#38;/\&amp;/go;
+    $text =~ s/\&\#x26;/\&amp;/goi;
+    $text =~ s/\&\#60;/\&lt;/go;
+    $text =~ s/\&\#x3c;/\&lt;/goi;
+    $text =~ s/\&\#62;/\&gt;/go;
+    $text =~ s/\&\#x3e;/\&gt;/goi;
+    $text =~ s/\&\#39;/\&apos;/go;
+    $text =~ s/\&\#x27;/\&apos;/goi;
+    $text =~ s/\&\#34;/\&quot;/go;
+    $text =~ s/\&\#x22;/\&quot;/goi;
+
+    require HTML::Entities;
+    HTML::Entities::_decode_entities( $text, WC::safeEntities() );
+    #print STDERR "decodedent[". debugEncode($text). "]\n\n";
+
+    # HTML::Entities::_decode_entities is NOT aware of the site charset
+    # so it converts numeric entities to characters willy-nilly.
+    # Some of those were entities in the first place because the
+    # site character set cannot represent them.
+    # Convert them back to entities:
+    WC::convertNotRepresentabletoEntity($text);
+    #print STDERR "notrep2ent[". debugEncode($text). "]\n\n";
+
+    # $text is now Unicode characters that are representable
+    # in the site charset. Convert to the site charset:
+    if ( WC::encoding() =~ /^utf-?8/ ) {
+        # nothing to do, already in unicode
+    }
+    else {
+        $text = Encode::encode(WC::encoding(), $text);
+    }
+    #print STDERR "sitechrset[". debugEncode($text). "]\n\n";
 
     # get rid of nasties
     $text =~ s/\r//g;
@@ -119,21 +177,15 @@ sub convert {
     $this->_apply(undef);
     $text = $this->{stackTop}->rootGenerate($opts);
 
+    #print STDERR "parsed    [". debugEncode($text). "]\n\n";
+
     # If the site charset is UTF8, we need to recode
     if ( WC::encoding() =~ /^utf-?8/ ) {
         $text = Encode::encode_utf8($text);
+        #print STDERR "re-encoded[". debugEncode($text). "]\n\n";
     }
 
-    # Convert (safe) named entities back to the
-    # site charset. Numeric entities are mapped straight to the
-    # corresponding code point unless their value overflow.
-    require HTML::Entities;
-    HTML::Entities::_decode_entities( $text, WC::safeEntities() );
-
-    # After decoding entities, we have to map unicode characters
-    # back to high bit
-    WC::mapUnicode2HighBit($text);
-
+    # $text is octets, encoded as per the $Foswiki::cfg{Site}{CharSet}
     return $text;
 }
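The reason `convert` rewrites numeric references such as `&#38;` to named entities before decoding can be shown with a small sketch. It assumes the behaviour documented for `HTML::Entities::_decode_entities`: named entities missing from the supplied hash are left alone, while numeric entities are always expanded. The one-entry `%safe` table and the sample strings are hypothetical; the plugin builds its table from `@safeEntities`.

```perl
use strict;
use warnings;
use HTML::Entities ();

# Hypothetical one-entry table of "safe" entities (name => Unicode character).
my %safe = ( eacute => "\x{E9}" );

# Without protection, a numeric ampersand is expanded to a bare '&'.
my $unprotected = 'a &#38; b';
HTML::Entities::_decode_entities( $unprotected, \%safe );

# With the numeric form rewritten to the named form first, it survives,
# because 'amp' is deliberately absent from the safe table.
my $protected = 'caf&eacute; &#38; &#8364;5 &lt;b&gt;';
$protected =~ s/&#38;/&amp;/g;
HTML::Entities::_decode_entities( $protected, \%safe );

print "$unprotected\n";
printf "%vX\n", $protected;    # dump codepoints to avoid wide-character output
```

In the second string, `&eacute;` and `&#8364;` become characters while `&amp;`, `&lt;` and `&gt;` stay encoded, which is what the parser relies on.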
 

diff --git a/WysiwygPlugin/lib/Foswiki/Plugins/WysiwygPlugin/HTML2TML/Node.pm b/WysiwygPlugin/lib/Foswiki/Plugins/WysiwygPlugin/HTML2TML/Node.pm
@@ -226,7 +226,7 @@ generate TML)
 sub rootGenerate {
     my ( $this, $opts ) = @_;
 
-    #print STDERR "Raw       [", WC::debugEncode($this->stringify()), "\n\n";
+    #print STDERR "Raw       [", WC::debugEncode($this->stringify()), "]\n\n";
     $this->cleanParseTree();
 
     #print STDERR "Cleaned   [", WC::debugEncode($this->stringify()), "]\n\n";