@adamziel adamziel commented Oct 31, 2025

Replaces two instances of the old UTF-8 decoding utilities with the new utf8.php toolkit by @dmsnell.

This PR only touches two tactical usages of the old tools:

  • Blueprint validation now uses wp_is_valid_utf8
  • CSSProcessor now uses wp_scrub_utf8 instead of _wp_scrub_utf8_fallback
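
As a rough illustration of the difference between validating and scrubbing, here is a standalone sketch. It does not call wp_is_valid_utf8() or wp_scrub_utf8(): the helpers sketch_is_valid_utf8() and sketch_scrub_utf8() are hypothetical stand-ins built on PHP's mbstring extension, and the WordPress functions may emit a different number of replacement characters for a given invalid sequence.

```php
<?php
// Hypothetical stand-ins for illustration, NOT the WordPress implementations.
// Assumes the mbstring extension is loaded.

// Validation only answers yes or no; the input is left untouched.
function sketch_is_valid_utf8( string $bytes ): bool {
    return mb_check_encoding( $bytes, 'UTF-8' );
}

// Scrubbing returns a string where invalid byte sequences are replaced
// with U+FFFD (the replacement character).
function sketch_scrub_utf8( string $bytes ): string {
    $previous = mb_substitute_character();
    mb_substitute_character( 0xFFFD );
    $scrubbed = mb_scrub( $bytes, 'UTF-8' );
    mb_substitute_character( $previous ); // restore the global setting
    return $scrubbed;
}

// "\xC3" starts a two-byte sequence, but a space follows instead of a
// continuation byte, so the string is not valid UTF-8.
$broken = "caf\xC3 au lait";

var_dump( sketch_is_valid_utf8( $broken ) );                      // bool(false)
var_dump( sketch_is_valid_utf8( sketch_scrub_utf8( $broken ) ) );
```

Validation fits the Blueprint case, where bad input should be rejected outright, while scrubbing fits the CSS case, where processing should continue on a repaired string.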

More refactoring is coming once there's a faster alternative to _wp_scan_utf8; see https://core.trac.wordpress.org/ticket/63863#comment:51

Related to #196.

Follows up on #199 and #197.

Testing instructions

If the CI passes, we're good. Unicode-related scenarios are covered by tests.

@adamziel adamziel added the enhancement New feature or request label Oct 31, 2025
} else {
    $is_valid_utf8 = ! _wp_has_noncharacters_fallback( $blueprint_string );
}
$is_valid_utf8 = ! wp_has_noncharacters( $blueprint_string );
@dmsnell dmsnell Nov 1, 2025

I left a note in the other PR making this change, but it’s possible there is a misunderstanding here between valid UTF-8 and noncharacters, as noncharacters are valid UTF-8, and invalid UTF-8 cannot form noncharacters.

There is, however, wp_is_valid_utf8().
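
To make the distinction concrete, here is a small standalone sketch. It does not use the WordPress functions: sketch_is_well_formed_utf8() and sketch_has_noncharacters() are hypothetical helpers built on PCRE's /u mode, which rejects ill-formed UTF-8 subjects but accepts noncharacters.

```php
<?php
// Hypothetical helpers for illustration only, not the WordPress functions.

// Well-formedness: PCRE's /u modifier refuses ill-formed UTF-8 subjects.
function sketch_is_well_formed_utf8( string $bytes ): bool {
    return 1 === preg_match( '//u', $bytes );
}

// Noncharacters are U+FDD0..U+FDEF plus the last two code points of each
// of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
function sketch_has_noncharacters( string $text ): bool {
    $class = '\x{FDD0}-\x{FDEF}';
    for ( $plane = 0; $plane <= 0x10; $plane++ ) {
        $class .= sprintf( '\x{%X}\x{%X}', ( $plane << 16 ) | 0xFFFE, ( $plane << 16 ) | 0xFFFF );
    }
    // preg_match() returns false for ill-formed input, so invalid bytes
    // can never be reported as containing a noncharacter.
    return 1 === preg_match( '/[' . $class . ']/u', $text );
}

$noncharacter = "\xEF\xBF\xBE"; // U+FFFE: well-formed UTF-8, but a noncharacter
$ill_formed   = "\xC0\x80";     // overlong encoding: not valid UTF-8 at all

var_dump( sketch_is_well_formed_utf8( $noncharacter ), sketch_has_noncharacters( $noncharacter ) );
var_dump( sketch_is_well_formed_utf8( $ill_formed ), sketch_has_noncharacters( $ill_formed ) );
```

Negating a noncharacter check therefore says nothing about validity: the first string passes validation while containing a noncharacter, and the second fails validation while containing none.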

@adamziel (Collaborator, Author)

Ah good point, I mixed it all up. Thank you for catching this.

adamziel commented Nov 1, 2025

_wp_scan_utf8 became a major performance bottleneck in CI, causing a 20x slowdown; the jobs were all timing out. I've explained the details in the UTF-8 Trac ticket: https://core.trac.wordpress.org/ticket/63863#comment:51. A more involved migration to the new wp_ functions is blocked until there's a way to keep the UTF-8 decoding speed closer to that of utf8_codepoint_at. I'll focus this PR on bringing over the regular utf8.php file and using the public API instead of the private fallbacks.

@adamziel adamziel changed the title from "Migrate to the new utf8.php decoder" to "Use wp_is_valid_utf8() and wp_scrub_utf8() from the new utf8.php decoder" Nov 1, 2025
@adamziel adamziel merged commit e434a3a into trunk Nov 1, 2025
22 checks passed
adamziel added a commit that referenced this pull request Nov 1, 2025
Replaces the `utf8_codepoint_at()` call in `CSSProcessor` with
`utf8_ord()`. This removes the last reference to `utf8_codepoint_at()`
from the codebase so the next PR will remove that function, see
#200 for prior context.

cc @dmsnell
adamziel added a commit that referenced this pull request Nov 2, 2025
Speeds up XMLProcessor by consuming any ASCII bytes with `strspn` and
avoiding calls to the utf8 decoder for most tags out there. In the
PHPUnit test suite for WXR files, this speeds up parsing the 10MB WXR
file in the test set from ~1.7s on average to ~0.6s on average.
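
The ASCII fast path can be sketched roughly as follows. This is not the
XMLProcessor code: `sketch_skip_ascii_name()` and the ASCII name-character
set are assumptions chosen for illustration. `strspn()` measures the
leading run of ASCII name bytes in a single call, and a UTF-8 decoder is
only needed when that run stops at a byte >= 0x80.

```php
<?php
// Illustrative sketch only; not the XMLProcessor implementation.

// ASCII subset of XML name characters (assumption: non-ASCII name
// characters are left to a slower UTF-8 decoding path).
const SKETCH_ASCII_NAME_CHARS = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_-.';

/**
 * Returns the number of leading ASCII name bytes at $offset and reports,
 * via $needs_decoder, whether the run ended at a non-ASCII byte.
 */
function sketch_skip_ascii_name( string $xml, int $offset, ?bool &$needs_decoder = null ): int {
    $length        = strspn( $xml, SKETCH_ASCII_NAME_CHARS, $offset );
    $end           = $offset + $length;
    $needs_decoder = $end < strlen( $xml ) && ord( $xml[ $end ] ) >= 0x80;
    return $length;
}

$xml = '<wp:post_title>Hello</wp:post_title>';
$len = sketch_skip_ascii_name( $xml, 1, $needs_decoder );

echo substr( $xml, 1, $len ), "\n"; // wp:post_title
var_dump( $needs_decoder );         // bool(false): the run ended at ">"
```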

This PR also moves from `utf8_codepoint_at` to `_wp_scan_utf8` for UTF-8
decoding without any speed penalty – see
#200 for prior context.

cc @dmsnell
adamziel added a commit that referenced this pull request Nov 2, 2025
Removes the `utf8_codepoint_at()` function. It is no longer used as all
the php-toolkit classes are migrated to the `_wp_scan_utf8` function
shipped with WordPress 6.9.

Follows up on #200
Solves #196 

#201 and #202 must be merged before this PR.
adamziel added a commit to WordPress/wordpress-importer that referenced this pull request Nov 4, 2025
Adds support for rewriting URLs inside CSS syntax, e.g. here:

```html
<div style="background-image:url(/wp-content/uploads/2025/09/image-2-766x1024.jpeg)">
```

Before this PR, the `style` attributes in, e.g., the cover block were skipped by the URL rewriter and continued pointing to the old site.
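
For intuition, the kind of rewrite involved can be sketched with a deliberately naive regular expression. This is not how `CSSURLProcessor` works: the helper `sketch_rewrite_css_urls()` and the `new-site.example` host are hypothetical, and a regex like this mishandles the comments, escape sequences, and strings that the processors in this PR are tested against.

```php
<?php
// Naive illustration only; NOT the CSSURLProcessor approach.

/**
 * Passes every url(...) target in $css through $rewrite, keeping the
 * original quote style. Hypothetical helper for illustration.
 */
function sketch_rewrite_css_urls( string $css, callable $rewrite ): string {
    return (string) preg_replace_callback(
        // url( + optional quote + target + same quote + )
        '/url\(\s*(["\']?)([^"\')\s]+)\1\s*\)/i',
        static function ( array $m ) use ( $rewrite ): string {
            return 'url(' . $m[1] . $rewrite( $m[2] ) . $m[1] . ')';
        },
        $css
    );
}

$style = 'background-image:url(/wp-content/uploads/2025/09/image-2-766x1024.jpeg)';

echo sketch_rewrite_css_urls(
    $style,
    static function ( string $url ): string {
        // Hypothetical target site, used only for this example.
        return 'https://new-site.example' . $url;
    }
), "\n";
// background-image:url(https://new-site.example/wp-content/uploads/2025/09/image-2-766x1024.jpeg)
```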

Fixes #223

## Implementation details

This PR backports `CSSProcessor`, `CSSURLProcessor`, and a few related PRs around Unicode handling from the WordPress/php-toolkit repo:

* WordPress/php-toolkit#197
* WordPress/php-toolkit#195
* WordPress/php-toolkit#199
* WordPress/php-toolkit#200
* WordPress/php-toolkit#201
* WordPress/php-toolkit#202

Note that the CSSProcessor and CSSURLProcessor are tested against 300 test cases containing various tricky inputs: quoted and unquoted URLs, strings, comments, Unicode escape sequences, and more.

## Testing instructions

This PR comes with a new test case specifically for various tricky CSS inputs. You're also welcome to try and import a WXR file that contains an inline background-image reference and confirm the URL is correctly rewritten.