@podarcis podarcis commented Sep 18, 2025

Previously, TableDiff::createDocumentWithHtml() attempted to normalize text for DOMDocument::loadHTML() by passing it through htmlentities() and iconv('UTF-8', 'ISO-8859-1//IGNORE', ...). As a result, any non-ASCII character (e.g. emoji, Cyrillic, CJK) that has neither an HTML named entity nor a representation in ISO-8859-1 was dropped entirely.

This patch replaces that logic with a call to mb_encode_numericentity(), which converts all non-ASCII Unicode codepoints (U+0080 – U+10FFFF) into decimal HTML numeric entities. The resulting markup is pure ASCII, so the full Unicode range is preserved and parsed correctly by DOMDocument, which otherwise assumes ISO-8859-1 when the input declares no charset.
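For illustration, a minimal sketch of the approach described above, not the actual patch: `encodeNonAscii()` is a hypothetical helper name and the sample string is made up. It assumes the `mbstring` and `dom` extensions are available.

```php
<?php
// Hypothetical helper (not the library's code): turn every non-ASCII
// code point into a decimal numeric entity so the markup is pure ASCII.
function encodeNonAscii(string $html): string
{
    // [start, end, offset, mask] as expected by mb_encode_numericentity()
    $convmap = [0x80, 0x10FFFF, 0, 0xFFFFFF];

    return mb_encode_numericentity($html, $convmap, 'UTF-8');
}

$input   = '<p>Привет 😀</p>';
$encoded = encodeNonAscii($input);

// The encoded markup no longer depends on the charset DOMDocument assumes.
$dom = new DOMDocument();
$dom->loadHTML($encoded);

// Read the text back from the parsed document (returned as UTF-8).
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
```

Markup characters such as `<` and `>` are ASCII and pass through untouched, so the tag structure survives while the Cyrillic text and the emoji travel as entities.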

Resolves #136


@Copilot Copilot AI left a comment

Pull Request Overview

This PR fixes Unicode handling in table diff functionality by replacing the problematic character encoding approach with numeric HTML entities. The change ensures that non-ASCII characters (emoji, Cyrillic, CJK, etc.) are preserved instead of being dropped during HTML processing.

  • Replaces htmlentities() and iconv() chain with mb_encode_numericentity() for better Unicode support
  • Converts all non-ASCII Unicode codepoints (U+0080 – U+10FFFF) to decimal HTML numeric entities
  • Simplifies the HTML loading process by removing the complex encoding/decoding chain

Comment on lines +631 to +636
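$convmap = [
    0x80,     // Start of range: leave ASCII (U+0000 – U+007F) intact
    0x10FFFF, // End of range: convert the rest of the Unicode range
    0,        // Offset added to each code point (none)
    0xFFFFFF, // Bit mask applied after the offset; wide enough to keep every bit
];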

Copilot AI Sep 18, 2025

The convmap array uses magic numbers without clear documentation. Consider defining named constants for these Unicode range boundaries (0x80 for ASCII_END, 0x10FFFF for UNICODE_MAX) to improve code readability and maintainability.
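One way the suggestion might look, as a sketch only: the constant names below are hypothetical (they appear neither in the patch nor in the library), and in the real class they would more likely be private class constants.

```php
<?php
// Hypothetical names; the patch under review uses inline literals.
const FIRST_NON_ASCII_CODEPOINT = 0x80;     // first code point after the ASCII range
const UNICODE_MAX_CODEPOINT     = 0x10FFFF; // last valid Unicode code point
const ENTITY_OFFSET             = 0;        // nothing is added to the code point
const ENTITY_MASK               = 0xFFFFFF; // keeps every bit of the code point

$convmap = [
    FIRST_NON_ASCII_CODEPOINT,
    UNICODE_MAX_CODEPOINT,
    ENTITY_OFFSET,
    ENTITY_MASK,
];

// Same call as with the literal array; only the readability changes.
$encoded = mb_encode_numericentity('naïve café', $convmap, 'UTF-8');
```

Note that 0x80 is the first code point outside ASCII rather than the end of ASCII, so a name along the lines of `FIRST_NON_ASCII_CODEPOINT` describes the boundary more precisely than `ASCII_END`.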

@jschroed91
Member

jschroed91 commented Sep 18, 2025

Looks like our tests are failing not because of the changes here but because we're using an outdated Composer version, which no longer works as of Sept 1st 2025. There could be other reasons as well; I only glanced at the error logs in Scrutinizer. Will need to come back to it.

Will make an issue to migrate to GitHub Actions for CI.

Successfully merging this pull request may close these issues.

TableDiff: Unicode characters are silently dropped from diff