@dmsnell (Member) commented Sep 10, 2025

Trac ticket: Core-63863
See: #9825, #9830, #9498, #9826, #9827, #9798, (#9828), #9829

Update the polyfill of mb_strlen() to rely on the new UTF-8 pipeline.

@dmsnell dmsnell force-pushed the utf8/update-mb-strlen branch 9 times, most recently from 780e7db to e8a698a Compare September 16, 2025 13:27
@dmsnell dmsnell force-pushed the utf8/update-mb-strlen branch 7 times, most recently from aafad31 to 63c1de7 Compare September 23, 2025 07:44
@dmsnell dmsnell marked this pull request as ready for review September 23, 2025 07:45
@dmsnell dmsnell force-pushed the utf8/update-mb-strlen branch 3 times, most recently from 8e3f1ea to 9af3d0f Compare September 24, 2025 21:40
@dmsnell dmsnell force-pushed the utf8/update-mb-strlen branch 4 times, most recently from 1c9fd2c to de25439 Compare October 1, 2025 23:14
@dmsnell dmsnell force-pushed the utf8/update-mb-strlen branch 6 times, most recently from 205a4a0 to 3ffa129 Compare October 9, 2025 00:48
@dmsnell dmsnell force-pushed the utf8/update-mb-strlen branch from 3ffa129 to a9a9561 Compare October 9, 2025 20:53
@dmsnell dmsnell force-pushed the utf8/update-mb-strlen branch from a9a9561 to 7beab43 Compare October 9, 2025 23:39
pento pushed a commit that referenced this pull request Oct 16, 2025
The existing polyfill for `mb_strlen()` has several deficiencies: it relies on Unicode PCRE support, assumes input strings are valid UTF-8, splits input strings into arrays of characters to count them (1,000 at a time, iterating until complete), and gives up entirely when Unicode support is missing.

This patch provides an updated polyfill that reliably counts code points in a UTF-8 string, even in the presence of sequences of invalid bytes. It scans through the input with zero allocations. Additionally, the underlying fallback extends the behavior of `mb_strlen()` to provide character counts for substrings within a larger input without extracting the substring: it can count characters within a given byte offset and length of a larger string.

This change improves the reliability of UTF-8 string length calculations and removes behavioral variability based on the runtime system.

Developed in #9828
Discussed in https://core.trac.wordpress.org/ticket/63863

See #63863.


git-svn-id: https://develop.svn.wordpress.org/trunk@60949 602fd350-edb4-49c9-b593-d223f7449a82
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Oct 16, 2025
Developed in WordPress/wordpress-develop#9828
Discussed in https://core.trac.wordpress.org/ticket/63863

See #63863.

Built from https://develop.svn.wordpress.org/trunk@60949


git-svn-id: http://core.svn.wordpress.org/trunk@60285 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request Oct 16, 2025
Developed in WordPress/wordpress-develop#9828
Discussed in https://core.trac.wordpress.org/ticket/63863

See #63863.

Built from https://develop.svn.wordpress.org/trunk@60949


git-svn-id: https://core.svn.wordpress.org/trunk@60285 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@dmsnell (Member, Author) commented Oct 16, 2025

Merged in 8508427
[60949]

@dmsnell dmsnell closed this Oct 16, 2025
@dmsnell dmsnell deleted the utf8/update-mb-strlen branch October 16, 2025 21:03