Item5990, Item9170, Item761, Item2231: Fix character encoding issues with WysiwygPlugin

   * As far as I can tell, Unicode characters and entities are now converted correctly.
   * Numeric entities in ordinary text are converted to characters in the site charset (if the site charset can represent the character) or named entities (if there is a named entity for the character) which should improve readability of TML. The same conversion is also applied to UTF-8 characters not represented in the site charset, and numeric entities are used where necessary (for characters for which there are no named entities).
   * Entities are now preserved (i.e. not modified at all) inside sticky and verbatim blocks.

There are several changes here, but I could not do this in small steps without breaking things in between. Each time I fixed one problem, another lurking problem popped up somewhere else.

HTML::Entities::_decode_entities converts numeric entities to characters. The numbers always correspond to Unicode codepoints (see http://en.wikipedia.org/wiki/Html_entities#HTML_character_references). Foswiki uses HTML::Entities::_decode_entities to convert named entities to characters. I changed the named-entity conversion to convert to Unicode codepoints, too. It was converting to the site charset, which can cause data corruption for numeric entities in the range 127 to 255 for charsets other than UTF-8 and ISO-8859-1.
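
To illustrate the kind of corruption described above, here is a minimal sketch that uses only the core Encode module. It is not code from this commit. The choice of ISO-8859-15 as the site charset and of code 164 as the entity value are assumptions made purely for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Illustrative assumption: the site charset is ISO-8859-15.
my $charset = 'iso-8859-15';

# The numeric entity &#164; names Unicode codepoint U+00A4 (currency sign).
my $codepoint = 164;

# Treating 164 as a byte in the site charset yields a different character:
# in ISO-8859-15, byte 0xA4 is the euro sign, U+20AC.
my $as_site_byte = Encode::decode( $charset, chr($codepoint) );
printf "byte 0x%02X in %s decodes to U+%04X\n",
  $codepoint, $charset, ord($as_site_byte);

# U+00A4 itself has no representation in ISO-8859-15, so encoding it
# has to fall back to an entity (a numeric one here, via FB_HTMLCREF).
my $char    = chr($codepoint);
my $encoded = Encode::encode( $charset, $char, Encode::FB_HTMLCREF );
print "U+00A4 encoded for $charset: $encoded\n";
```

The same number therefore means two different characters depending on whether it is read as a Unicode codepoint or as a code in the site charset, which is why the conversion has to go through Unicode.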

This meant that I had to change the text to Unicode characters (not encoded as UTF-8) before decoding entities, which meant extra conversions, including a step to convert characters that cannot be represented in the site charset to entities.

There was code to do that in RESTParameter2SiteCharSet, but it used PERLQQ encoding, which corrupted text (converting to perl escape sequences, e.g. \x{2460}, which surprises everyone who encounters this behaviour). That was fixed, too.
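
The difference between the two fallback behaviours can be sketched with the core Encode module alone. This is an illustration, not the code in this commit: the commit uses HTML::Entities::encode_entities so that named entities are preferred, whereas the sketch swaps in Encode's FB_HTMLCREF fallback (numeric entities only) to stay self-contained. The sample string is an assumption.

```perl
use strict;
use warnings;
use Encode ();

# U+2460 (circled digit one) cannot be represented in ISO-8859-1.
my $text = "A\x{2460}";

# FB_PERLQQ substitutes a Perl escape sequence for the character,
# which ends up as literal text in the topic.
my $perlqq = Encode::encode( 'iso-8859-1', $text, Encode::FB_PERLQQ );

# FB_HTMLCREF substitutes a numeric character reference instead,
# which a browser renders as the intended character.
my $htmlcref = Encode::encode( 'iso-8859-1', $text, Encode::FB_HTMLCREF );

print "PERLQQ:   $perlqq\n";
print "HTMLCREF: $htmlcref\n";
```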

Many browsers (including Firefox) interpret pages identified as ISO-8859-1 as if they were encoded with Windows-1252. When posting (e.g. saving) in response to such pages, they also encode data in the same way. This is why mapUnicode2HighBit (and its opposite, mapHighBit2Unicode) were needed. However, those functions complicate the conversion to entities of characters that cannot be represented in the site charset. Perl's standards-compliant Encode to the rescue! If you tell Encode to use the Windows-1252 encoding instead of ISO-8859-1, then it does exactly what we want, and those mapping functions are not necessary.
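
A minimal sketch of why switching the encoding name is enough, using only the core Encode module. The byte 0x80 and the euro sign are chosen as an example; the old mapping table listed the same pair as chr(8364) and chr(128).

```perl
use strict;
use warnings;
use Encode ();

# Byte 0x80 as a browser would send it for a euro sign on a page
# declared as ISO-8859-1 but handled as Windows-1252.
my $byte = "\x80";

# Strict ISO-8859-1 maps the byte to the C1 control character U+0080.
my $latin1 = Encode::decode( 'iso-8859-1', $byte );

# Windows-1252 maps it to the euro sign, U+20AC.
my $cp1252 = Encode::decode( 'windows-1252', $byte );

printf "iso-8859-1:   U+%04X\n", ord($latin1);
printf "windows-1252: U+%04X\n", ord($cp1252);

# The reverse direction also works without a hand-written table.
my $octets = Encode::encode( 'windows-1252', "\x{20AC}" );
printf "U+20AC encodes to byte 0x%02X\n", ord($octets);
```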

The WysiwygPluginTests test the conversions for various site charsets using ranges of character codes. I could not determine what charset(s) those character codes referred to, so I changed the tests to be explicit: either Unicode codepoints or codes in the site charset (given as a parameter to the test function). I removed the tests for Unicode codepoints 127 to 159 because they are control characters, which (as far as I am aware) Foswiki does not use. Instead, I added tests for the Unicode codepoints for the Windows-1252 characters with codes 127 to 159.

Foswiki::Plugins::WysiwygPlugin::Constants stores computed data that is derived from %Foswiki::cfg. Some of the WysiwygPlugin unit tests that depend on that data change %Foswiki::cfg temporarily, so the stored data in Foswiki::Plugins::WysiwygPlugin::Constants must be reset before running unit tests that depend on that data.

I tested this with the following site charsets: '' (default value), 'ISO-8859-1', 'ISO-8859-15', 'utf-8'


git-svn-id: http://svn.foswiki.org/trunk@7854 0b4bb1d4-4e5a-0410-9cc4-b2b747904278
MichaelTempest authored and MichaelTempest committed Jun 19, 2010
1 parent 6fc1ff2 commit a820f29
Showing 8 changed files with 553 additions and 169 deletions.
101 changes: 43 additions & 58 deletions WysiwygPlugin/lib/Foswiki/Plugins/WysiwygPlugin/Constants.pm
@@ -4,6 +4,8 @@ package Foswiki::Plugins::WysiwygPlugin::Constants;
 use strict;
 use warnings;

+use Encode;
+
 # HTML elements that are strictly block type, as defined by
 # http://www.htmlhelp.com/reference/html40/block.html.
 # Block type elements do not require
@@ -202,72 +204,51 @@ our %HTML2TML_COLOURMAP = (

 ############ Encodings ###############

-# Mapping high-bit characters from unicode back to iso-8859-1
-# (a.k.a Windows 1252 a.k.a "ANSI") - http://www.alanwood.net/demos/ansi.html
-our %unicode2HighBit = (
-    chr(8364) => chr(128),
-    chr(8218) => chr(130),
-    chr(402) => chr(131),
-    chr(8222) => chr(132),
-    chr(8230) => chr(133),
-    chr(8224) => chr(134),
-    chr(8225) => chr(135),
-    chr(710) => chr(136),
-    chr(8240) => chr(137),
-    chr(352) => chr(138),
-    chr(8249) => chr(139),
-    chr(338) => chr(140),
-    chr(381) => chr(142),
-    chr(8216) => chr(145),
-    chr(8217) => chr(146),
-    chr(8220) => chr(147),
-    chr(8221) => chr(148),
-    chr(8226) => chr(149),
-    chr(8211) => chr(150),
-    chr(8212) => chr(151),
-    chr(732) => chr(152),
-    chr(8482) => chr(153),
-    chr(353) => chr(154),
-    chr(8250) => chr(155),
-    chr(339) => chr(156),
-    chr(382) => chr(158),
-    chr(376) => chr(159),
-);
-
-# Reverse mapping
-our %highBit2Unicode = map { $unicode2HighBit{$_} => $_ } keys %unicode2HighBit;
-
-our $unicode2HighBitChars = join( '', keys %unicode2HighBit );
-our $highBit2UnicodeChars = join( '', keys %highBit2Unicode );
 our $encoding;

 sub encoding {
     unless ($encoding) {
         $encoding =
           Encode::resolve_alias( $Foswiki::cfg{Site}{CharSet} || 'iso-8859-1' );
+
+        $encoding = 'windows-1252' if $encoding =~ /^iso-8859-1$/i;
     }
     return $encoding;
 }

-# Map selected unicode characters back to high-bit chars if
-# iso-8859-1 is selected. This is required because the same characters
-# have different code points in unicode and iso-8859-1. For example,
-# € is 128 in iso-8859-1 and 8364 in unicode.
-sub mapUnicode2HighBit {
-    if ( encoding() eq 'iso-8859-1' ) {
+my $siteCharsetRepresentable;

-        # Map unicode back to iso-8859 high-bit chars
-        $_[0] =~ s/([$unicode2HighBitChars])/$unicode2HighBit{$1}/ge;
+# Convert characters (unicode codepoints) that cannot be represented in
+# the site charset to entities. Prefer named entities to numeric entities.
+sub convertNotRepresentabletoEntity {
+    if ( encoding() =~ /^utf-?8/ ) {
+        # UTF-8 can represent all characters, so no entities needed
     }
-}
+    else {
+        unless ($siteCharsetRepresentable) {
+            # Produce a string of unicode characters that contains all of the
+            # characters representable in the site charset
+            $siteCharsetRepresentable = '';
+            for my $code (0 .. 255) {
+                my $unicodeChar = Encode::decode(encoding(), chr($code), Encode::FB_PERLQQ);
+                if ($unicodeChar =~ /^\\x/) {
+                    # code is not valid, so skip it
+                }
+                else {
+                    # Escape codes in the standard ASCII range, as necessary,
+                    # to avoid special interpretation by perl
+                    $unicodeChar = quotemeta($unicodeChar) if ord($unicodeChar) <= 127;

-# Map selected high-bit chars to unicode if
-# iso-8859-1 is selected.
-sub mapHighBit2Unicode {
-    if ( encoding() eq 'iso-8859-1' ) {
+                    $siteCharsetRepresentable .= $unicodeChar;
+                }
+            }
+        }

-        # Map unicode back to iso-8859 high-bit chars
-        $_[0] =~ s/([$highBit2UnicodeChars])/$highBit2Unicode{$1}/ge;
+        require HTML::Entities;
+        $_[0] = HTML::Entities::encode_entities($_[0], "^$siteCharsetRepresentable");
+        # All characters that cannot be represented in the site charset are now encoded as entities
+        # Named entities are used if available, otherwise numeric entities,
+        # because named entities produce more readable TML
     }
 }

@@ -283,26 +264,23 @@ our @safeEntities = qw(
   ETH Ntilde Ograve Oacute Ocirc Otilde Ouml times
   Oslash Ugrave Uacute Ucirc Uuml Yacute THORN szlig
   agrave aacute acirc atilde auml aring aelig ccedil
-  egrave eacute ecirc uml igrave iacute icirc iuml
+  egrave eacute ecirc euml igrave iacute icirc iuml
   eth ntilde ograve oacute ocirc otilde ouml divide
   oslash ugrave uacute ucirc uuml yacute thorn yuml
 );

 # Mapping from entity names to characters
 our $safe_entities;

-# Get a hash that maps the safe entities values to characters
-# in the site charset.
+# Get a hash that maps the safe entities values to unicode characters
 sub safeEntities {
     unless ($safe_entities) {
         foreach my $entity (@safeEntities) {

             # Decode the entity name to unicode
             my $unicode = HTML::Entities::decode_entities("&$entity;");

-            # Map unicode back to iso-8859 high-bit chars if required
-            mapUnicode2HighBit($unicode);
-            $safe_entities->{$entity} = Encode::encode( encoding(), $unicode );
+            $safe_entities->{"$entity"} = $unicode;
         }
     }
     return $safe_entities;
@@ -324,6 +302,13 @@ sub chCodes {
     return $s;
 }

+# Allow the unit tests to force re-initialisation of
+# %Foswiki::cfg-dependent cached data
+sub reinitialiseForTesting {
+    undef $encoding;
+    undef $siteCharsetRepresentable;
+}
+
 # Create shorter alias for other modules
 no strict 'refs';
 *{'WC::'} = \*{'Foswiki::Plugins::WysiwygPlugin::Constants::'};
76 changes: 64 additions & 12 deletions WysiwygPlugin/lib/Foswiki/Plugins/WysiwygPlugin/HTML2TML.pm
@@ -91,6 +91,13 @@ Convert a block of HTML text into TML.
 =cut

+sub debugEncode {
+    my $text = shift;
+    $text = WC::debugEncode($text);
+    $text =~ s/([^\x20-\x7E])/sprintf '\\x{%X}', ord($1)/ge;
+    return $text;
+}
+
 sub convert {
     my ( $this, $text, $options ) = @_;

@@ -100,11 +107,62 @@ sub convert {
     $opts = $WC::VERY_CLEAN
       if ( $options->{very_clean} );

-    # If the text is UTF8-encoded we have to decode it first, otherwise
-    # the HTML parser will barf.
+    # $text is octets, encoded as per the $Foswiki::cfg{Site}{CharSet}
+    #print STDERR "input [". debugEncode($text). "]\n\n";
+
+    # Convert (safe) named entities back to the
+    # site charset. Numeric entities are mapped straight to the
+    # corresponding code point unless their value overflow.
+    # HTML::Entities::_decode_entities converts numeric entities
+    # to Unicode codepoints, so first convert the text to Unicode
+    # characters
     if ( WC::encoding() =~ /^utf-?8/ ) {
+        # text is already UTF-8, so just decode
         $text = Encode::decode_utf8($text);
     }
+    else {
+        # convert to unicode codepoints
+        $text = Encode::decode(WC::encoding(), $text);
+    }
+    # $text is now Unicode characters
+    #print STDERR "unicoded [". debugEncode($text). "]\n\n";
+
+    # Make sure that & < > ' and " remain encoded, because the parser depends
+    # on it. The safe-entities does not include the corresponding named
+    # entities, so convert numeric entities for these characters to the named
+    # entity.
+    $text =~ s/\&\#38;/\&amp;/go;
+    $text =~ s/\&\#x26;/\&amp;/goi;
+    $text =~ s/\&\#60;/\&lt;/go;
+    $text =~ s/\&\#x3c;/\&lt;/goi;
+    $text =~ s/\&\#62;/\&gt;/go;
+    $text =~ s/\&\#x3e;/\&gt;/goi;
+    $text =~ s/\&\#39;/\&apos;/go;
+    $text =~ s/\&\#x27;/\&apos;/goi;
+    $text =~ s/\&\#34;/\&quot;/go;
+    $text =~ s/\&\#x22;/\&quot;/goi;
+
+    require HTML::Entities;
+    HTML::Entities::_decode_entities( $text, WC::safeEntities() );
+    #print STDERR "decodedent[". debugEncode($text). "]\n\n";
+
+    # HTML::Entities::_decode_entities is NOT aware of the site charset
+    # so it converts numeric entities to characters willy-nilly.
+    # Some of those were entities in the first place because the
+    # site character set cannot represent them.
+    # Convert them back to entities:
+    WC::convertNotRepresentabletoEntity($text);
+    #print STDERR "notrep2ent[". debugEncode($text). "]\n\n";
+
+    # $text is now Unicode characters that are representable
+    # in the site charset. Convert to the site charset:
+    if ( WC::encoding() =~ /^utf-?8/ ) {
+        # nothing to do, already in unicode
+    }
+    else {
+        $text = Encode::encode(WC::encoding(), $text);
+    }
+    #print STDERR "sitechrset[". debugEncode($text). "]\n\n";

     # get rid of nasties
     $text =~ s/\r//g;
@@ -119,21 +177,15 @@ sub convert {
     $this->_apply(undef);
     $text = $this->{stackTop}->rootGenerate($opts);

+    #print STDERR "parsed [". debugEncode($text). "]\n\n";
+
     # If the site charset is UTF8, we need to recode
     if ( WC::encoding() =~ /^utf-?8/ ) {
         $text = Encode::encode_utf8($text);
+        #print STDERR "re-encoded[". debugEncode($text). "]\n\n";
     }

-    # Convert (safe) named entities back to the
-    # site charset. Numeric entities are mapped straight to the
-    # corresponding code point unless their value overflow.
-    require HTML::Entities;
-    HTML::Entities::_decode_entities( $text, WC::safeEntities() );
-
-    # After decoding entities, we have to map unicode characters
-    # back to high bit
-    WC::mapUnicode2HighBit($text);
-
+    # $text is octets, encoded as per the $Foswiki::cfg{Site}{CharSet}
     return $text;
 }

@@ -226,7 +226,7 @@ generate TML)
 sub rootGenerate {
     my ( $this, $opts ) = @_;

-    #print STDERR "Raw [", WC::debugEncode($this->stringify()), "\n\n";
+    #print STDERR "Raw [", WC::debugEncode($this->stringify()), "]\n\n";
     $this->cleanParseTree();

     #print STDERR "Cleaned [", WC::debugEncode($this->stringify()), "]\n\n";