Fix: Make seems_utf8 function RFC 3629 compliant #7463

Closed
Debarghya-Banerjee wants to merge 4 commits into WordPress:trunk from
Debarghya-Banerjee:fix/make-seems_utf8-function-rfc-3629-compliant

Conversation

@Debarghya-Banerjee

Trac Ticket: Core-38044

Overview

  • This pull request updates the seems_utf8 function so that it validates UTF-8 encoded strings according to RFC 3629. The revised implementation accepts only well-formed UTF-8 sequences, which protects callers against malformed input.

Key Features:

  • UTF-8 Encoding Compliance:

    • The function adheres strictly to the UTF-8 encoding rules defined in RFC 3629, which allows for a maximum of 4 bytes per character.

    • Handling of Single and Multi-byte Sequences:

      • It correctly identifies and processes single-byte (0x00 to 0x7F) and multi-byte sequences (2 to 4 bytes), ensuring that every continuation byte in a multi-byte sequence carries the 10xxxxxx prefix (0x80 to 0xBF).
    • Validation of Leading Bytes:

      • The function checks the high bits of the leading byte to determine the number of continuation bytes required:

        • prefix 0xC0 (bit pattern 110xxxxx) for 2-byte sequences
        • prefix 0xE0 (bit pattern 1110xxxx) for 3-byte sequences
        • prefix 0xF0 (bit pattern 11110xxx) for 4-byte sequences
      • It explicitly rejects any leading byte with the prefix 0xF8 or 0xFC, since these would start the 5- and 6-byte forms that RFC 3629 no longer permits.

    • Control Over Overlong Sequences:

      • The function rejects overlong sequences, that is, encodings that use more bytes than necessary to represent a character. This matters for security because an overlong form (for example 0xC0 0xAF for "/") can slip past filters that only look for the shortest encoding.
    • Surrogate Handling:

      • It rejects encoded surrogate code points (U+D800 to U+DFFF), which RFC 3629 prohibits in UTF-8.
    • Zero Byte Validation:

      • The function specifically rejects the overlong encoding of U+0000 (0xC0 0x80); the only valid encoding of U+0000 is the single byte 0x00.
    • Comprehensive Error Handling:

      • Every failed check makes the function return false, so any non-compliant string is reported as not being UTF-8.
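
The checks listed above can be sketched as a single pass over the bytes. The following is a minimal illustration in Python, not the PHP code from this pull request; the name is_valid_utf8 is hypothetical, and the byte ranges are taken from the well-formed byte sequence syntax in RFC 3629, section 4. Overlong forms, surrogates, and code points above U+10FFFF are all excluded by restricting the range of the first continuation byte.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 per RFC 3629 (illustrative sketch)."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 <= 0x7F:  # single-byte sequence (ASCII, including 0x00)
            i += 1
            continue
        # need = number of continuation bytes;
        # lo..hi = allowed range for the FIRST continuation byte.
        if 0xC2 <= b0 <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b0 == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF  # excludes overlong 3-byte forms
        elif b0 == 0xED:
            need, lo, hi = 2, 0x80, 0x9F  # excludes surrogates U+D800..U+DFFF
        elif 0xE1 <= b0 <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF  # excludes overlong 4-byte forms
        elif 0xF1 <= b0 <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F  # excludes code points above U+10FFFF
        else:
            # 0x80..0xBF: stray continuation byte
            # 0xC0, 0xC1: always overlong (covers 0xC0 0x80 for U+0000)
            # 0xF5..0xFF: includes the old 5- and 6-byte lead bytes
            return False
        if i + need >= n:  # sequence is truncated
            return False
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):  # remaining continuation bytes
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True


# Usage: inputs named in the feature list above.
print(is_valid_utf8("héllo €".encode("utf-8")))  # well-formed text
print(is_valid_utf8(b"\xc0\x80"))                # overlong U+0000
print(is_valid_utf8(b"\xed\xa0\x80"))            # encoded surrogate U+D800
print(is_valid_utf8(b"\xf8\x88\x80\x80\x80"))    # 5-byte lead byte
```

Encoding the restrictions as per-lead-byte ranges avoids decoding each code point and comparing it against limits afterwards; a decode-then-check approach would be an equally valid design.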

Conclusion

  • The seems_utf8 function is a comprehensive implementation that ensures full compliance with RFC 3629 standards. By validating UTF-8 strings effectively, it enhances the integrity and security of applications that rely on proper character encoding. This pull request aims to integrate this functionality, providing developers with a reliable tool for UTF-8 validation.

@github-actions

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props debarghyabanerjee.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@github-actions

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@Debarghya-Banerjee
Author

Hi @desrosj, could you please take a look at this PR? Thanks.

@github-actions

A commit was made that fixes the Trac ticket referenced in the description of this pull request.

SVN changeset: 60630
GitHub commit: bb6ed3b

This PR will be closed, but please confirm the accuracy of this and reopen if there is more work to be done.

github-actions bot closed this on Aug 12, 2025