CSSTokenizer #197

adamziel · 2025-10-29T15:18:04Z

Overview

Ships a CSSTokenizer class that tokenizes CSS syntax. It follows the CSS Syntax Level 3 spec, giving Data Liberation tooling a reliable foundation for discovering and rewriting URLs without leaning on brittle regular expressions or partial parsers.

This branch is tested using the @rmenke/css-tokenizer-tests test corpus and an extra set of tests designed to cover nuances.

Design choices:

On-the-fly normalization

The CSS Spec requires the following normalization step:

Replace any U+000D CARRIAGE RETURN (CR) code points, U+000C FORM FEED (FF)
code points, or pairs of U+000D CARRIAGE RETURN (CR) followed by U+000A LINE
FEED (LF) in input by a single U+000A LINE FEED (LF) code point.
Replace any U+0000 NULL or surrogate code points in input with U+FFFD REPLACEMENT
CHARACTER (�).

This processor delays normalization as much as possible rather than preprocessing
the entire input upfront. This avoids the upfront allocation cost for clean CSS
and preserves original byte positions for accurate raw token extraction. Part of
the normalization is performed on the fly as tokens are consumed; the rest is
done once the token value is requested.
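For contrast, here is a minimal sketch of the eager preprocessing step that the spec describes and that this processor avoids. The function name `css_preprocess_eagerly` is hypothetical and is not part of this PR; it only illustrates why a whole-input pass costs an allocation and shifts byte offsets.

```php
<?php
/**
 * Hypothetical helper, NOT part of this PR: applies the spec's
 * input preprocessing to the whole string at once.
 */
function css_preprocess_eagerly( string $css ): string {
	// CRLF pairs are listed first so they collapse into a single LF
	// before the lone CR rule can see them. FF also becomes LF.
	$css = str_replace( array( "\r\n", "\r", "\f" ), "\n", $css );

	// U+0000 NULL becomes U+FFFD (three bytes in UTF-8).
	// Surrogates cannot appear in well-formed UTF-8, so they would be
	// handled by the invalid-UTF-8 replacement rather than here.
	return str_replace( "\0", "\u{FFFD}", $css );
}
```

Even on this tiny scale the offsets drift: `"a\r\nb"` is four bytes going in and three coming out, and every NULL byte grows by two bytes. A tokenizer working on the preprocessed copy could no longer report positions in the original input without extra bookkeeping.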

No EOF token

The EOF token is a CSS parsing concept, not a CSS tokenization concept. Therefore,
this processor does not produce it.

UTF-8 handling

Only UTF-8 strings are supported. In case an invalid UTF-8 sequence is encountered,
it is replaced with a U+FFFD REPLACEMENT CHARACTER (�) using the maximal subpart
approach described in https://www.unicode.org/versions/Unicode9.0.0/ch03.pdf,
section 3.9 Best Practices for Using U+FFFD.
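The maximal subpart rule can be illustrated with a small standalone scrubber. This is a sketch under stated assumptions, not the code used in this PR (which relies on the toolkit's UTF-8 utilities); the function name `scrub_utf8_maximal_subpart` is made up for the example. Each ill-formed run that is a prefix of a valid sequence, or else a single stray byte, becomes exactly one U+FFFD.

```php
<?php
/**
 * Hypothetical illustration, NOT the PR's implementation: replaces each
 * maximal subpart of an ill-formed UTF-8 sequence with one U+FFFD.
 */
function scrub_utf8_maximal_subpart( string $bytes ): string {
	$out = '';
	$len = strlen( $bytes );
	$i   = 0;
	while ( $i < $len ) {
		$lead = ord( $bytes[ $i ] );
		if ( $lead < 0x80 ) {
			// ASCII passes through unchanged.
			$out .= $bytes[ $i++ ];
			continue;
		}

		// Allowed range of the FIRST continuation byte; some lead bytes
		// narrow it to exclude overlongs, surrogates, and > U+10FFFF.
		$lo = 0x80;
		$hi = 0xBF;
		if ( $lead >= 0xC2 && $lead <= 0xDF ) {
			$need = 1;
		} elseif ( $lead >= 0xE0 && $lead <= 0xEF ) {
			$need = 2;
			if ( 0xE0 === $lead ) { $lo = 0xA0; }
			if ( 0xED === $lead ) { $hi = 0x9F; }
		} elseif ( $lead >= 0xF0 && $lead <= 0xF4 ) {
			$need = 3;
			if ( 0xF0 === $lead ) { $lo = 0x90; }
			if ( 0xF4 === $lead ) { $hi = 0x8F; }
		} else {
			// Stray continuation byte or a byte that can never be a lead.
			$out .= "\u{FFFD}";
			$i++;
			continue;
		}

		$j     = $i + 1;
		$valid = true;
		for ( $k = 0; $k < $need; $k++, $j++ ) {
			$c = $j < $len ? ord( $bytes[ $j ] ) : -1;
			if ( $c < $lo || $c > $hi ) {
				// The offending byte is NOT consumed; it starts the next unit.
				$valid = false;
				break;
			}
			// Later continuation bytes always use the full 80..BF range.
			$lo = 0x80;
			$hi = 0xBF;
		}

		$out .= $valid ? substr( $bytes, $i, $need + 1 ) : "\u{FFFD}";
		$i    = $j;
	}
	return $out;
}
```

For the byte sequence `61 F1 80 80 E1 80 C2 62 80 63 80 BF 64`, this yields `a`, three U+FFFD, `b`, one U+FFFD, `c`, two U+FFFD, `d`: the truncated `F1 80 80`, `E1 80`, and `C2` prefixes each collapse into a single replacement character, while every stray continuation byte gets its own.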

Usage

The next_token() method is the main entry point for tokenizing a CSS string.
It will consume the next token from the input stream and return true if a token
was found. Otherwise, it will return false:

$css = 'width: 10px;';
$processor = CSSProcessor::create( $css );
while ( $processor->next_token() ) {
    echo $processor->get_normalized_token();
}
// Outputs:
// width: 10px;


adamziel · 2025-10-30T18:38:54Z

I've applied your feedback @dmsnell, thank you! This PR seems in a pretty good place now. I'll draft a second one with set_token_value() before merging this, just to make sure we have the right APIs.

adamziel · 2025-10-30T20:19:46Z

I think we're good API-wise. Even if not, we can always adjust since it's the php-toolkit repo.

adamziel · 2025-10-30T20:21:52Z

CC @sirreal – you may like this

Follows up on #197 by adding a `set_token_value( $new_value )` method to allow rewriting CSS. At the moment, it only supports updating the URL token value.

```php
$css = 'background: url(old.jpg);';
$processor = CSSProcessor::create( $css );
while ( $processor->next_token() ) {
    if ( CSSProcessor::TOKEN_URL === $processor->get_token_type() ) {
        // The new URL contains an emoji, spaces, parentheses, and double quotes.
        $processor->set_token_value( 'image😀.jpg ("special")' );
    }
}
echo $processor->get_updated_css();
// background: url("image😀.jpg (\22 special\22 )");
```

## Implementation details

`set_token_value( $new_value )` always uses the quoted URL syntax to encode the new URL. It only escapes newline characters, backslashes, and double quotes. All other bytes are preserved as-is.

PR #198 was supposed to merge this into trunk, but I never updated the base.

cc @dmsnell
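The escaping rule described for `set_token_value()` can be sketched as a standalone function. This is only an illustration: `css_quote_url_sketch` is a hypothetical name, and the hex escape forms chosen for backslashes and newlines (and treating CR and FF as newlines) are assumptions. The only escape confirmed by the PR's example output is `\22 ` for a double quote.

```php
<?php
/**
 * Hypothetical sketch, NOT the PR's code: wraps a URL in the quoted
 * url("...") syntax, escaping only backslashes, double quotes, and
 * newline characters. All other bytes are copied through untouched.
 */
function css_quote_url_sketch( string $url ): string {
	// strtr() with an array makes a single pass over the input, so the
	// backslashes it emits are never escaped a second time.
	$escaped = strtr(
		$url,
		array(
			'\\' => '\\5C ', // Assumed escape form for a backslash.
			'"'  => '\\22 ', // Matches the PR's example output.
			"\n" => '\\A ',  // Assumed escape form for LF.
			"\r" => '\\D ',  // Assumption: CR is treated as a newline.
			"\f" => '\\C ',  // Assumption: FF is treated as a newline.
		)
	);
	return 'url("' . $escaped . '")';
}

echo css_quote_url_sketch( 'image😀.jpg ("special")' ), "\n";
// url("image😀.jpg (\22 special\22 )")
```

The trailing space after each hex escape terminates it, so the hex digits cannot merge with a following character such as the `s` in `special`.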

…der (#200)

Replaces two instances of the old UTF-8 decoding utilities with the new utf-8.php toolkit by @dmsnell:

* https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/compat-utf8.php
* https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/utf8.php

This PR only touches two tactical usages of the old tools:

* Blueprint validation now uses `wp_is_valid_utf8`
* CSSProcessor now uses `wp_scrub_utf8` instead of `_wp_scrub_utf8_fallback`

More refactoring is coming once there's a faster alternative to `_wp_scan_utf8`, see https://core.trac.wordpress.org/ticket/63863#comment:51

Related to #196. Follows up on #199 and #197.

## Testing instructions

If the CI passes, we're good. Unicode-related scenarios are covered by tests.

Adds support for rewriting URLs inside CSS syntax, e.g. here:

```html
<div style="background-image:url(/wp-content/uploads/2025/09/image-2-766x1024.jpeg)">
```

Before this PR, the `style` attributes in, e.g., the cover block were skipped by the URL rewriter and continued pointing to the old site.

Fixes #223

## Implementation details

This PR backports `CSSProcessor`, `CSSURLProcessor`, and a few related PRs around Unicode handling from the WordPress/php-toolkit repo:

* WordPress/php-toolkit#197
* WordPress/php-toolkit#195
* WordPress/php-toolkit#199
* WordPress/php-toolkit#200
* WordPress/php-toolkit#201
* WordPress/php-toolkit#202

Note the CSSProcessor and CSSURLProcessor are tested against 300 test cases containing various tricky inputs, quoted and unquoted URLs, strings, comments, unicode escape sequences, and more.

## Testing instructions

This PR comes with a new test case specifically for various tricky CSS inputs. You're also welcome to try and import a WXR file that contains an inline background-image reference and confirm the URL is correctly rewritten.

adamziel added 30 commits October 21, 2025 21:38

Kickoff migrating URLs in CSS

d332205

Support Unicode escapes

adb07a9

Simplify the replacements, format the code

40380e5

Improve clarity of the CSSUrlProcessor

f6710aa

Test CSS unicode escapes decoder

ff59ffd

Ditch regexp

0813667

PHPCS

3b69730

Do not allocate memory for every match optimistically

95a1302

Test for data URI

e98c3ba

Skip data URIs in the replacement logic

3bdbda6

Optimize get_parsed_url() for data uris

8a5e734

Simplify the CSS URL Processor

3739a95

Move URL parsing from CSS processor to BlockMarkupURLProcessor

0d5d95f

Use wp.org as a test domain

5feafb5

Simplify the css processor integration

c387bd5

Add a generic CSS Processor

2b2170b

Simplify consume_string()

ee3ed64

Pass most CSS tokenizer test cases

cd32ab2

Less failures

4b75739

1 last failure

d3d1b07

Remove the offending fuzzer test

0245453

Adjust details

8996fd4

Use codepoints instead of bytes for decoding idents

38f89af

Use the bundled unicode decoder

2382057

Do not concat to repr when consuming numeric values

663db21

Comments, renaming for clarity

c647699

Simplify consume_ident_sequence()

8227327

Fix inconsistencies in CSSProcessor

2182023

Simplify is_valid_escape

20947cb

Simplify would_next_3_code_points_start_an_ident

43301e6

adamziel added 6 commits October 30, 2025 12:48

Remove this->token_value_needs_decoding

b2da468

Simplify ident sequence consumption – use indexes, do not optimistically compose values

82c32a5

PHP 7.2 syntax adjustment

d7ef9b1

Brush up the test set

5cbf7f2

code style

2f008a8

PHP 7.2 syntax adjustment

c474de1

adamziel marked this pull request as ready for review October 30, 2025 13:32

adamziel added 9 commits October 30, 2025 14:32

remove the generator script

394a789

Add attribution

8a72b92

Use _wp_scan_utf8

bc2efc9

Adjust tests

bf2140a

Test invalid utf-8 sequences

b1e3d77

Test invalid utf-8 sequences

e06f08d

Brush up tests

b1a9283

phpcs

f0c0ec7

more usage examples

1b7b65b

adamziel mentioned this pull request Oct 30, 2025

CSSProcessor::set_token_value( $new_value ) #198

Merged

adamziel merged commit ef1c5e3 into trunk Oct 30, 2025
22 checks passed

adamziel mentioned this pull request Oct 31, 2025

CSSProcessor::set_token_value( $new_value ) #199

Merged

adamziel mentioned this pull request Oct 31, 2025

Use wp_is_valid_utf8() and wp_scrub_utf8() from the new utf8.php decoder #200

Merged

dmsnell reviewed Nov 1, 2025

components/Blueprints/class-runner.php

This was referenced Nov 2, 2025

[BlockMarkupURLProcessor] Support URLs in CSS #195

Merged

Rewrite CSS URLs (e.g. cover block) WordPress/wordpress-importer#243

Merged
