
Commit 85fb018

Charset: Add wp_utf8_chunks() to iterate through strings.
This new generator function iterates through valid and invalid spans of bytes in a UTF-8 string. It is a convenience wrapper around the new `_wp_scan_utf8()` which aids a number of operations which should work on strings containing invalid spans of UTF-8, such as:

 - Identifying where the invalid spans are.
 - Operating on the valid portions while preserving the invalid spans.
 - Displaying broken strings.

These operations can be useful during validation, sanitization, debugging, and processing uncontrolled inputs which may contain malformed sequences.
1 parent d1e7f56 commit 85fb018

1 file changed: +59 -0 lines changed

src/wp-includes/utf8.php

Lines changed: 59 additions & 0 deletions
@@ -133,3 +133,62 @@ function wp_scrub_utf8( $text ) {
 		return _wp_scrub_utf8_fallback( $text );
 	}
 endif;
+
+/**
+ * Iterate through a string yielding spans of valid or invalid bytes.
+ *
+ * This is not likely an often-needed function, but it can be used to build
+ * interesting views of a string containing invalid bytes, or to operate on
+ * the valid portions of a UTF-8 string while preserving the existing spans
+ * of invalid bytes.
+ *
+ * This is a convenience wrapper around {@see _wp_scan_utf8()}. For non-allocating
+ * and non-yielding functionality consider calling that function directly.
+ *
+ * Example:
+ *
+ *     $text = "test\x90wp\xE2\x80\xC0test";
+ *
+ *     $chunks = iterator_to_array( wp_utf8_chunks( $text ) );
+ *     array( 'test', "\x90", 'wp', "\xE2\x80", "\xC0", 'test' ) === $chunks;
+ *
+ *     $is_valid = false;
+ *     foreach ( wp_utf8_chunks( $text, $is_valid ) as $chunk ) {
+ *         if ( $is_valid ) {
+ *             echo $chunk;
+ *         } else {
+ *             $bytes = implode( ' ', array_map( 'bin2hex', str_split( $chunk ) ) );
+ *             echo "({$bytes})";
+ *         }
+ *     }
+ *     // test(90)wp(e2 80)(c0)test
+ *
+ * @param string    $text     Iterate through this string.
+ * @param bool|null $is_valid Optional. If passed, set to whether the currently yielded
+ *                            chunk is a valid span of UTF-8 bytes.
+ * @return Generator Spans of valid or invalid UTF-8 text; check `$is_valid` to determine
+ *                   whether the yielded span is valid.
+ */
+function wp_utf8_chunks( string $text, ?bool &$is_valid = null ): Generator {
+	$at             = 0;
+	$was_at         = 0;
+	$end            = strlen( $text );
+	$invalid_length = 0;
+
+	while ( $at < $end ) {
+		_wp_scan_utf8( $text, $at, $invalid_length );
+
+		if ( $at > $was_at ) {
+			$is_valid = true;
+			yield substr( $text, $was_at, $at - $was_at );
+		}
+
+		if ( $invalid_length > 0 ) {
+			$is_valid = false;
+			yield substr( $text, $at, $invalid_length );
+		}
+
+		$at    += $invalid_length;
+		$was_at = $at;
+	}
+}
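
The commit message names two uses for this generator: identifying where the invalid spans are, and operating on the valid portions while preserving the invalid spans. The following is a minimal sketch of both, not part of the commit. The helper names `example_find_invalid_spans()` and `example_upper_preserving_invalid()` are hypothetical. So that the sketch can run outside WordPress, it also defines a guarded stand-in for `wp_utf8_chunks()` built on byte-level regular expressions; that stand-in is an assumption that approximates the behavior shown in the docblock and is not the implementation from this commit.

```php
<?php
// ASSUMPTION: stand-in used only when WordPress is not loaded. It mimics the
// docblock's behavior with regular expressions; the real function relies on
// _wp_scan_utf8() instead.
if ( ! function_exists( 'wp_utf8_chunks' ) ) {
	function wp_utf8_chunks( string $text, ?bool &$is_valid = null ): Generator {
		// A run of one or more well-formed UTF-8 sequences.
		$valid = '/\G(?:[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]'
			. '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]'
			. '|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}'
			. '|\xF4[\x80-\x8F][\x80-\xBF]{2})+/';
		// A truncated prefix of a well-formed sequence, otherwise a single byte.
		$invalid = '/\G(?:\xE0[\xA0-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]|\xED[\x80-\x9F]'
			. '|\xF0[\x90-\xBF][\x80-\xBF]?|[\xF1-\xF3][\x80-\xBF]{1,2}'
			. '|\xF4[\x80-\x8F][\x80-\xBF]?|.)/s';

		$at  = 0;
		$end = strlen( $text );
		while ( $at < $end ) {
			$is_valid = 1 === preg_match( $valid, $text, $match, 0, $at );
			if ( ! $is_valid ) {
				// Always matches at least one byte because of the final `.`.
				preg_match( $invalid, $text, $match, 0, $at );
			}
			yield $match[0];
			$at += strlen( $match[0] );
		}
	}
}

/**
 * Hypothetical helper: returns array( byte offset, byte length ) pairs,
 * one for each invalid span in the string.
 */
function example_find_invalid_spans( string $text ): array {
	$spans    = array();
	$offset   = 0;
	$is_valid = false;

	foreach ( wp_utf8_chunks( $text, $is_valid ) as $chunk ) {
		if ( ! $is_valid ) {
			$spans[] = array( $offset, strlen( $chunk ) );
		}
		// Chunks are contiguous, so their lengths add up to the byte offset.
		$offset += strlen( $chunk );
	}

	return $spans;
}

/**
 * Hypothetical helper: uppercases ASCII letters in the valid spans and
 * copies every invalid span through byte for byte.
 */
function example_upper_preserving_invalid( string $text ): string {
	$output   = '';
	$is_valid = false;

	foreach ( wp_utf8_chunks( $text, $is_valid ) as $chunk ) {
		$output .= $is_valid
			? strtr( $chunk, 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' )
			: $chunk;
	}

	return $output;
}
```

For the docblock's sample string `"test\x90wp\xE2\x80\xC0test"`, the chunking shown in the docblock implies that `example_find_invalid_spans()` returns the pairs `(4, 1)`, `(7, 2)`, and `(9, 1)`, and that `example_upper_preserving_invalid()` returns `"TEST\x90WP\xE2\x80\xC0TEST"`. Both helpers read `$is_valid` inside the loop body, because the generator updates that reference immediately before each `yield`.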
