Fix Unicode handling in table diff by encoding non-ASCII chars as numeric entities #137

podarcis · 2025-09-18T09:47:57Z

Previously, `TableDiff::createDocumentWithHtml()` attempted to normalize text for `DOMDocument::loadHTML()` by passing it through `htmlentities()` and `iconv('UTF-8', 'ISO-8859-1//IGNORE', ...)`. As a result, non-ASCII characters (e.g. emoji, Cyrillic, CJK) that have no HTML named entity and are not representable in ISO-8859-1 were dropped entirely.

This patch replaces that logic with a call to `mb_encode_numericentity()`, which converts all non-ASCII Unicode codepoints (U+0080 – U+10FFFF) into decimal HTML numeric entities. The full Unicode range is then preserved and parsed correctly by `DOMDocument`, which otherwise defaults to ISO-8859-1.
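To illustrate the approach described above, here is a minimal sketch. The helper name `encodeNonAscii()` and the sample HTML are assumptions made for illustration and are not taken from the patch; only `mb_encode_numericentity()` and `DOMDocument::loadHTML()` are the calls named in this PR.

```php
<?php
// Hypothetical helper for illustration; not the actual TableDiff method.
function encodeNonAscii(string $html): string
{
    // A convmap is a flat list of groups of four integers:
    // [start code point, end code point, offset, mask].
    $convmap = [0x80, 0x10FFFF, 0, 0xFFFFFF];

    // Code points below 0x80 (plain ASCII, including < > &) pass through
    // unchanged; everything else becomes a decimal entity such as &#233;.
    return mb_encode_numericentity($html, $convmap, 'UTF-8');
}

// Sample input (assumed): Cyrillic, CJK and an emoji inside a table cell.
$html = '<table><tr><td>Привет 日本語 😀</td></tr></table>';
$encoded = encodeNonAscii($html);

$dom = new DOMDocument();
// Without a declared charset, loadHTML() assumes ISO-8859-1. The encoded
// string is pure ASCII, so that default no longer affects the content.
$dom->loadHTML($encoded);

// The parser resolves the numeric entities back into the original text.
$cell = $dom->getElementsByTagName('td')->item(0);
echo $cell->textContent, PHP_EOL;
```

Because the markup characters themselves are ASCII, the table structure reaches the parser untouched and only the text content is rewritten as entities.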

Resolves #136


Copilot

Pull Request Overview

This PR fixes Unicode handling in table diff functionality by replacing the problematic character encoding approach with numeric HTML entities. The change ensures that non-ASCII characters (emoji, Cyrillic, CJK, etc.) are preserved instead of being dropped during HTML processing.

- Replaces the `htmlentities()` and `iconv()` chain with `mb_encode_numericentity()` for better Unicode support
- Converts all non-ASCII Unicode codepoints (U+0080 – U+10FFFF) to decimal HTML numeric entities
- Simplifies the HTML loading process by removing the complex encoding/decoding chain


Copilot · 2025-09-18T15:39:48Z

lib/Caxy/HtmlDiff/Table/TableDiff.php

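+        $convmap = [
+            0x80,           // Start of range to convert (ASCII below this stays intact)
+            0x10FFFF,       // End of range: the highest Unicode code point
+            0,              // Offset added to each code point (none)
+            0xFFFFFF,       // Bitmask applied to each code point
+        ];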


The convmap array uses magic numbers without clear documentation. Consider defining named constants for these Unicode range boundaries (0x80 for ASCII_END, 0x10FFFF for UNICODE_MAX) to improve code readability and maintainability.
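One way the suggestion above could look is sketched below. The class name `Utf8EntityRange` and all constant names are hypothetical and do not appear in the patch. Note that `0x80` is the first non-ASCII code point rather than the end of the ASCII range (which is `0x7F`), so the sketch names it accordingly instead of using `ASCII_END`.

```php
<?php
// Hypothetical holder for the convmap boundaries; names are illustrative only.
final class Utf8EntityRange
{
    public const FIRST_NON_ASCII = 0x80;     // first code point to convert
    public const UNICODE_MAX     = 0x10FFFF; // highest Unicode code point
    public const NO_OFFSET       = 0;        // leave code points unshifted
    public const FULL_MASK       = 0xFFFFFF; // keep every bit of the code point

    /** @return int[] convmap in the [start, end, offset, mask] order mbstring expects */
    public static function convmap(): array
    {
        return [
            self::FIRST_NON_ASCII,
            self::UNICODE_MAX,
            self::NO_OFFSET,
            self::FULL_MASK,
        ];
    }
}

// Usage: only the non-ASCII "ï" (U+00EF) is converted.
echo mb_encode_numericentity('naïve', Utf8EntityRange::convmap(), 'UTF-8'), PHP_EOL;
// prints "na&#239;ve"
```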

jschroed91 · 2025-09-18T15:44:50Z

It looks like our tests are failing not because of the changes here, but because we're using an outdated Composer version that no longer works as of Sept 1st, 2025. There could be other reasons as well; I only glanced at the error logs in Scrutinizer. I will need to come back to it.

I'll open an issue to migrate CI to GitHub Actions.

podarcis mentioned this pull request Sep 18, 2025

TableDiff: Unicode characters are silently dropped from diff #136

Open

jschroed91 requested a review from Copilot September 18, 2025 15:39

Copilot AI reviewed Sep 18, 2025
