@adamziel adamziel commented Nov 1, 2025

Speeds up XMLProcessor by consuming runs of ASCII bytes with `strspn()` and avoiding calls to the UTF-8 decoder for most tags out there. In the PHPUnit test suite for WXR files, parsing the 10MB WXR file from the test set goes from ~1.7s on average to ~0.6s on average.
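
To illustrate the idea, here is a minimal sketch of an ASCII fast path built on `strspn()`. It is not the XMLProcessor code from this PR: the function names and the `ASCII_NAME_CHARS` constant are hypothetical, and the character set is a simplification (real XML names, for example, cannot start with a digit, `-`, or `.`).

```php
<?php
// Hypothetical illustration, not the actual XMLProcessor implementation.
// ASCII bytes that may appear in an XML name (simplified: the stricter
// rules for the first character of a name are ignored here).
const ASCII_NAME_CHARS = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_-.';

// Returns the byte length of the run of ASCII name characters starting at $at.
// strspn() scans in native code, so no per-character PHP work is needed.
function ascii_name_run_length( string $xml, int $at ): int {
	return strspn( $xml, ASCII_NAME_CHARS, $at );
}

// Reports whether the ASCII run stopped at a non-ASCII byte (>= 0x80).
// Only in that case would a caller need to invoke a UTF-8 decoder.
function needs_utf8_decoder( string $xml, int $at ): bool {
	$end = $at + ascii_name_run_length( $xml, $at );
	return $end < strlen( $xml ) && ord( $xml[ $end ] ) >= 0x80;
}
```

For a tag such as `<wp:post_id>`, the whole name is consumed in a single `strspn()` call and the decoder is never reached. A name with non-ASCII bytes, such as `<café>`, stops the run early and falls through to the slower path.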

This PR also moves from `utf8_codepoint_at()` to `_wp_scan_utf8` for UTF-8 decoding without any speed penalty – see #200 for prior context.

cc @dmsnell

@adamziel adamziel added the enhancement New feature or request label Nov 1, 2025
Added a blank line for better readability in comments.
@adamziel adamziel merged commit 7241121 into trunk Nov 2, 2025
22 checks passed
adamziel added a commit that referenced this pull request Nov 2, 2025
Removes the `utf8_codepoint_at()` function. It is no longer used as all
the php-toolkit classes are migrated to the `_wp_scan_utf8` function
shipped with WordPress 6.9.

Follows up on #200
Solves #196 

#201 and #202 must be merged before this PR.
adamziel added a commit to WordPress/wordpress-importer that referenced this pull request Nov 4, 2025
Adds support for rewriting URLs inside CSS syntax, e.g. here:

```html
<div style="background-image:url(/wp-content/uploads/2025/09/image-2-766x1024.jpeg)">
```

Before this PR, the `style` attributes in, e.g., the cover block were skipped by the URL rewriter and continued pointing to the old site.
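
As a rough illustration of the problem, the sketch below rewrites `url()` references in a `style` value with a regular expression. This is deliberately naive and is not how `CSSURLProcessor` works: the function name, the base URLs, and the regex are assumptions made for this example, and a regex cannot correctly handle CSS comments, escape sequences, or `url(`-like text inside strings.

```php
<?php
// Hypothetical, naive illustration only. A real tokenizer is needed for
// CSS comments, escapes, and url-like text inside quoted strings.
function rewrite_css_urls_naive( string $css, string $old_base, string $new_base ): string {
	$result = preg_replace_callback(
		// Matches url(...), with an optional matching pair of quotes.
		'/url\(\s*([\'"]?)([^\'")]*)\1\s*\)/i',
		function ( array $m ) use ( $old_base, $new_base ): string {
			$url = $m[2];
			// Only rewrite URLs that start with the old site's base.
			if ( 0 === strpos( $url, $old_base ) ) {
				$url = $new_base . substr( $url, strlen( $old_base ) );
			}
			return 'url(' . $m[1] . $url . $m[1] . ')';
		},
		$css
	);

	// preg_replace_callback() returns null on error; keep the input then.
	return $result ?? $css;
}
```

Calling `rewrite_css_urls_naive( $style, 'https://old.example', 'https://new.example' )` would update matching `url()` values and leave URLs from other hosts untouched.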

Fixes #223

## Implementation details

This PR backports `CSSProcessor`, `CSSURLProcessor`, and a few related PRs around Unicode handling from the WordPress/php-toolkit repo:

* WordPress/php-toolkit#197
* WordPress/php-toolkit#195
* WordPress/php-toolkit#199
* WordPress/php-toolkit#200
* WordPress/php-toolkit#201
* WordPress/php-toolkit#202

Note the CSSProcessor and CSSURLProcessor are tested against 300 test cases containing various tricky inputs, quoted and unquoted URLs, strings, comments, unicode escape sequences, and more.

## Testing instructions

This PR comes with a new test case specifically for various tricky CSS inputs. You're also welcome to import a WXR file that contains an inline `background-image` reference and confirm the URL is correctly rewritten.