Item5990, Item9170, Item761, Item2231: Fix character encoding issues with WysiwygPlugin

* As far as I can tell, Unicode characters and entities are now converted correctly.
* Numeric entities in ordinary text are converted to characters in the site charset (if the site charset can represent the character) or to named entities (if a named entity exists for the character), which should improve the readability of the TML. The same conversion is applied to UTF-8 characters that are not representable in the site charset; numeric entities are used only where necessary (for characters that have no named entity).
* Entities are now preserved (i.e. not modified at all) inside sticky and verbatim blocks.

There are several changes here, but I could not make them in small steps without breaking things in between: each time I fixed one problem, another (lurking) problem popped up somewhere else.

HTML::Entities::_decode_entities converts numeric entities to characters. The numbers always correspond to Unicode codepoints (see http://en.wikipedia.org/wiki/Html_entities#HTML_character_references). Foswiki also uses HTML::Entities::_decode_entities to convert named entities to characters. I changed the named-entity conversion to produce Unicode codepoints as well (it was converting to the site charset, which can corrupt data for numeric entities in the range 127 to 255 on charsets other than UTF-8 and ISO-8859-1). This meant the text had to be converted to Unicode characters (not encoded as UTF-8) before decoding entities, which required extra conversions, including a step to convert characters that cannot be represented in the site charset to entities. RESTParameter2SiteCharSet already had code for that, but it used PERLQQ encoding, which corrupted the text by converting such characters to Perl escape sequences (e.g. \x{2460}), surprising everyone who encountered that behaviour. That is fixed too.

Many browsers (including Firefox) interpret pages identified as ISO-8859-1 as if they were encoded with Windows-1252. When posting (e.g. saving) in response to such pages, they also encode data the same way. This is why mapUnicode2HighBit (and its inverse, mapHighBit2Unicode) were needed. However, those functions complicate the conversion to entities of characters that cannot be represented in the site charset. Perl's standards-compliant Encode to the rescue: if you tell Encode to use the Windows-1252 encoding instead of ISO-8859-1, it does exactly what we want, and those mapping functions are unnecessary.

The WysiwygPluginTests exercise the conversions for various site charsets using ranges of character codes. I could not determine which charset(s) those character codes referred to, so I changed the tests to be explicit: either Unicode codepoints, or codes in the site charset (given as a parameter to the test function). I removed the tests for Unicode codepoints 127 to 159 because they are control characters, which (as far as I am aware) Foswiki does not use. Instead, I added tests for the Unicode codepoints corresponding to the Windows-1252 characters with codes 127 to 159.

Foswiki::Plugins::WysiwygPlugin::Constants stores computed data derived from %Foswiki::cfg. Some of the WysiwygPlugin unit tests temporarily change %Foswiki::cfg, so the data stored in Foswiki::Plugins::WysiwygPlugin::Constants must be reset before running unit tests that depend on it.

I tested this with the following site charsets: '' (default value), 'ISO-8859-1', 'ISO-8859-15', 'utf-8'.

git-svn-id: http://svn.foswiki.org/trunk@7854 0b4bb1d4-4e5a-0410-9cc4-b2b747904278
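The entity-conversion pipeline described above can be sketched as follows. The actual code is Perl (HTML::Entities plus Encode with an FB_HTMLCREF-style fallback); this Python illustration shows the same codepoint and charset behaviour, with Python's `errors="backslashreplace"` standing in for Perl's PERLQQ fallback:

```python
# Illustration (Python) of the pipeline: entities decode to Unicode
# codepoints first, then the text is re-encoded into the site charset,
# falling back to numeric entities for unrepresentable characters.
import html

# Numeric and named entities always denote Unicode codepoints,
# regardless of the site charset.
assert html.unescape("&#9312;") == "\u2460"   # CIRCLED DIGIT ONE
assert html.unescape("&eacute;") == "\u00e9"  # e with acute accent

site_charset = "iso-8859-1"  # example site charset

# Characters the site charset can represent become plain characters...
assert "\u00e9".encode(site_charset) == b"\xe9"

# ...while characters it cannot represent must become entities again
# (analogous to Perl's Encode FB_HTMLCREF fallback):
assert "\u2460".encode(site_charset, errors="xmlcharrefreplace") == b"&#9312;"

# By contrast, a PERLQQ-style fallback corrupts the text with
# language-level escape sequences instead of entities:
assert "\u2460".encode(site_charset, errors="backslashreplace") == b"\\u2460"
```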
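The Windows-1252-instead-of-ISO-8859-1 trick can also be sketched. Again this is a Python illustration of behaviour the commit gets from Perl's Encode; the byte values and codepoints are the same in both languages:

```python
# Bytes 0x80-0x9F are C1 control characters in ISO-8859-1 but printable
# characters in Windows-1252; browsers use the latter interpretation,
# so the server should decode posted data the same way.
posted = b"\x93quoted\x94"  # what a browser posts for curly quotes

# Decoding as ISO-8859-1 yields unusable control characters:
assert posted.decode("iso-8859-1") == "\x93quoted\x94"

# Decoding as Windows-1252 yields the characters the user typed:
assert posted.decode("cp1252") == "\u201cquoted\u201d"

# The mapping round-trips cleanly, so no mapUnicode2HighBit-style
# codepoint shuffling is needed:
assert "\u201cquoted\u201d".encode("cp1252") == posted
```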