A header-only UTF-8 library in C implementing validation, decoding, and transcoding conforming to the Unicode and ISO/IEC 10646 specifications.
#include "utf8_dfa32.h" // or utf8_dfa64.h for the 64-bit variant
#include "utf8_valid.h"
// Check if a string is valid UTF-8
bool ok = utf8_valid(src, len);
// Check and find the error position
size_t cursor;
if (!utf8_check(src, len, &cursor)) {
// cursor is the byte offset of the first ill-formed sequence
}
// Length of the maximal subpart of an ill-formed sequence
size_t n = utf8_maximal_subpart(src, len);

- C99 or later
All routines in this library are driven by a Deterministic Finite Automaton (DFA) that consumes one byte at a time and transitions between a small set of states. Four backend headers are provided; you choose one or two based on what the consumer headers you include actually need.
The forward DFA scans bytes left-to-right starting from the ACCEPT state.
Each byte indexes into a 256-entry lookup table and produces the next state.
Returning to ACCEPT marks the end of a complete valid sequence. Reaching
REJECT means an ill-formed byte was encountered; that state is a permanent
trap and no further input can leave it.
The 9 forward DFA states are:
| State | Meaning |
|---|---|
| ACCEPT | Start / valid sequence boundary |
| REJECT | Ill-formed input (absorbing trap) |
| TAIL1 | Awaiting 1 more continuation byte |
| TAIL2 | Awaiting 2 more continuation bytes |
| TAIL3 | Awaiting 3 more continuation bytes |
| E0 | After 0xE0; next must be A0–BF (reject non-shortest form) |
| ED | After 0xED; next must be 80–9F (reject surrogates) |
| F0 | After 0xF0; next must be 90–BF (reject non-shortest form) |
| F4 | After 0xF4; next must be 80–8F (reject above U+10FFFF) |
The four special states (E0, ED, F0, F4) enforce Unicode constraints
that a naive continuation-byte count cannot express: non-shortest-form
encodings, surrogate halves (U+D800–U+DFFF), and codepoints above U+10FFFF
are all structurally rejected.
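The state table above can be read as an ordinary state machine. The following is a minimal sketch written as a `switch` rather than a lookup table, so it is not the library's implementation; the names `fwd_state`, `fwd_step`, and `fwd_validate` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State names follow the table above; the numeric values are arbitrary. */
typedef enum {
    S_ACCEPT, S_REJECT, S_TAIL1, S_TAIL2, S_TAIL3,
    S_E0, S_ED, S_F0, S_F4
} fwd_state;

/* True if b lies in the inclusive range [lo, hi]. */
static bool in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* One forward transition: current state + input byte -> next state. */
fwd_state fwd_step(fwd_state s, unsigned char b)
{
    switch (s) {
    case S_ACCEPT:
        if (b <= 0x7F)               return S_ACCEPT; /* ASCII */
        if (in_range(b, 0xC2, 0xDF)) return S_TAIL1;  /* 2-byte lead */
        if (b == 0xE0)               return S_E0;
        if (b == 0xED)               return S_ED;
        if (in_range(b, 0xE1, 0xEF)) return S_TAIL2;  /* other 3-byte leads */
        if (b == 0xF0)               return S_F0;
        if (in_range(b, 0xF1, 0xF3)) return S_TAIL3;
        if (b == 0xF4)               return S_F4;
        return S_REJECT;             /* 80..C1 and F5..FF */
    case S_TAIL1: return in_range(b, 0x80, 0xBF) ? S_ACCEPT : S_REJECT;
    case S_TAIL2: return in_range(b, 0x80, 0xBF) ? S_TAIL1  : S_REJECT;
    case S_TAIL3: return in_range(b, 0x80, 0xBF) ? S_TAIL2  : S_REJECT;
    case S_E0:    return in_range(b, 0xA0, 0xBF) ? S_TAIL1  : S_REJECT;
    case S_ED:    return in_range(b, 0x80, 0x9F) ? S_TAIL1  : S_REJECT;
    case S_F0:    return in_range(b, 0x90, 0xBF) ? S_TAIL2  : S_REJECT;
    case S_F4:    return in_range(b, 0x80, 0x8F) ? S_TAIL2  : S_REJECT;
    default:      return S_REJECT;   /* REJECT is absorbing */
    }
}

/* Valid only if the scan ends on a sequence boundary. */
bool fwd_validate(const unsigned char *src, size_t len)
{
    assert(src != NULL || len == 0);
    fwd_state s = S_ACCEPT;
    for (size_t i = 0; i < len; i++)
        s = fwd_step(s, src[i]);
    return s == S_ACCEPT;
}
```

The sketch branches on every byte, which is exactly what the table-driven backends avoid; it is meant only to make the transitions concrete.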
The reverse DFA scans bytes right-to-left and is used for backward navigation and backward decoding, for example moving a cursor back by N codepoints.
Going forward you see the lead byte first and know immediately how many continuation bytes follow and what constraints apply to them. Going in reverse you see continuation bytes before the lead, so the DFA tracks how many have been seen and remembers the range of the outermost continuation byte (the one adjacent to the lead), because that range determines which lead bytes are valid.
The reverse DFA uses 7 states, 2 fewer than the forward DFA:
| State | Meaning |
|---|---|
| ACCEPT | Start / valid sequence boundary |
| REJECT | Ill-formed input (absorbing trap) |
| TL1 | Seen 1 continuation byte |
| TL2LO | Seen 2 continuations; outermost was 80–9F |
| TL2HI | Seen 2 continuations; outermost was A0–BF |
| TL3LO | Seen 3 continuations; outermost was 80–8F |
| TL3MH | Seen 3 continuations; outermost was 90–BF |
The LO/HI and LO/MH split on the outermost continuation range enforces the
same constraints as the forward DFA's E0, ED, F0, and F4 states,
applied when the lead byte finally arrives rather than when it is first seen.
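A sketch of the reverse transitions, derived here from the state table and the lead-byte constraints rather than taken from the library's tables. The names `rev_state`, `rev_step`, and `rev_validate` are hypothetical, and the exact transition set is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    R_ACCEPT, R_REJECT, R_TL1, R_TL2LO, R_TL2HI, R_TL3LO, R_TL3MH
} rev_state;

static bool in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* One reverse transition; bytes arrive last-to-first. */
rev_state rev_step(rev_state s, unsigned char b)
{
    bool cont = in_range(b, 0x80, 0xBF);
    switch (s) {
    case R_ACCEPT:
        if (b <= 0x7F) return R_ACCEPT;
        return cont ? R_TL1 : R_REJECT;      /* a bare lead byte is ill-formed */
    case R_TL1:
        if (in_range(b, 0xC2, 0xDF)) return R_ACCEPT;
        if (cont) return b <= 0x9F ? R_TL2LO : R_TL2HI;
        return R_REJECT;
    case R_TL2LO:                             /* outermost 80..9F: E0 not allowed */
        if (in_range(b, 0xE1, 0xEF)) return R_ACCEPT;
        if (cont) return b <= 0x8F ? R_TL3LO : R_TL3MH;
        return R_REJECT;
    case R_TL2HI:                             /* outermost A0..BF: ED not allowed */
        if (in_range(b, 0xE0, 0xEF) && b != 0xED) return R_ACCEPT;
        if (cont) return b <= 0x8F ? R_TL3LO : R_TL3MH;
        return R_REJECT;
    case R_TL3LO:                             /* outermost 80..8F: F0 not allowed */
        return in_range(b, 0xF1, 0xF4) ? R_ACCEPT : R_REJECT;
    case R_TL3MH:                             /* outermost 90..BF: F4 not allowed */
        return in_range(b, 0xF0, 0xF3) ? R_ACCEPT : R_REJECT;
    default:
        return R_REJECT;                      /* REJECT is absorbing */
    }
}

/* Scans right-to-left; valid only if the scan ends on a boundary. */
bool rev_validate(const unsigned char *src, size_t len)
{
    assert(src != NULL || len == 0);
    rev_state s = R_ACCEPT;
    for (size_t i = len; i > 0; i--)
        s = rev_step(s, src[i - 1]);
    return s == R_ACCEPT;
}
```

Note how the range of the outermost continuation byte is recorded in the state name, so the lead byte can be checked against it when it finally arrives.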
All four backends implement the shift-based DFA approach originally described by Per Vognsen. Each byte maps to a row in a 256-entry table. The row is a packed integer where each state's transition target is stored at a fixed bit offset. The current state value is used directly as the shift amount to extract the next state, avoiding a second lookup. The inner loop is a table load, a shift, and a mask:
state = (table[byte] >> state) & mask;

The error state is fixed at offset 0. A transition to error from a state at
bit offset k contributes (0 << k) = 0 to the row value, and the field at
offset 0 (the error state's own transitions) is likewise 0 in every row.
Once entered, no byte can leave it.
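To make the packing concrete, here is a sketch that builds a 64-bit shift table at run time and validates with the load, shift, and mask loop. It is not the library's generated table: the assignment of states to offsets (error at 0, the others at multiples of 6) and the names `build_shift_table` and `shift_dfa_valid` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Each state is identified by its bit offset inside a row. */
enum { ST_ERR = 0, ST_ACC = 6, ST_T1 = 12, ST_T2 = 18, ST_T3 = 24,
       ST_E0 = 30, ST_ED = 36, ST_F0 = 42, ST_F4 = 48 };

/* Reference transition function used only to fill the table. */
static unsigned next_state(unsigned s, unsigned b)
{
    unsigned lo = 0x80, hi = 0xBF, to;
    switch (s) {
    case ST_ACC:
        if (b <= 0x7F)              return ST_ACC;
        if (b >= 0xC2 && b <= 0xDF) return ST_T1;
        if (b == 0xE0)              return ST_E0;
        if (b == 0xED)              return ST_ED;
        if (b >= 0xE1 && b <= 0xEF) return ST_T2;
        if (b == 0xF0)              return ST_F0;
        if (b >= 0xF1 && b <= 0xF3) return ST_T3;
        if (b == 0xF4)              return ST_F4;
        return ST_ERR;
    case ST_T1: to = ST_ACC; break;
    case ST_T2: to = ST_T1;  break;
    case ST_T3: to = ST_T2;  break;
    case ST_E0: to = ST_T1; lo = 0xA0; break;
    case ST_ED: to = ST_T1; hi = 0x9F; break;
    case ST_F0: to = ST_T2; lo = 0x90; break;
    case ST_F4: to = ST_T2; hi = 0x8F; break;
    default:    return ST_ERR;       /* error never leaves error */
    }
    return (b >= lo && b <= hi) ? to : ST_ERR;
}

static uint64_t shift_table[256];

/* Pack, for every byte, the target of each state at that state's offset. */
void build_shift_table(void)
{
    for (unsigned b = 0; b < 256; b++) {
        uint64_t row = 0;
        for (unsigned s = ST_ERR; s <= ST_F4; s += 6)
            row |= (uint64_t)next_state(s, b) << s;
        shift_table[b] = row;
    }
}

/* Inner loop: one table load, one shift, one mask per byte. */
bool shift_dfa_valid(const unsigned char *src, size_t len)
{
    assert(src != NULL || len == 0);
    uint64_t state = ST_ACC;
    for (size_t i = 0; i < len; i++)
        state = (shift_table[src[i]] >> state) & 63;
    return state == ST_ACC;
}
```

Call `build_shift_table()` once before `shift_dfa_valid`. A production table would normally be precomputed as a constant array; building it at run time here only keeps the sketch short.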
The 32-bit row packing technique was derived by
Dougall Johnson;
state offsets are chosen by an SMT solver (tool/smt_solver.py) to pack all
transitions without collision inside a uint32_t.
The 64-bit forward backend uses state offsets at fixed multiples of 6
(6, 12, 18, ..., 48), packing all transitions into 54 bits of a uint64_t.
Bits 56–62 store a per-byte payload mask that the decode step uses to
accumulate codepoint bits in the same pass:
*codepoint = (*codepoint << 6) | (byte & (row >> 56));
state = (row >> state) & 63;

ASCII rows carry mask 0x7F, 2/3/4-byte lead rows carry 0x1F/0x0F/0x07,
continuation rows carry 0x3F, and error rows carry 0x00.
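The effect of the payload masks can be seen in isolation. In this sketch the mask is computed by a hypothetical helper `payload_mask` instead of being read from bits 56–62 of a table row, and `accumulate` (also hypothetical) assumes the bytes it is given form one well-formed sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Payload mask per byte class, matching the values listed above. */
uint32_t payload_mask(unsigned char b)
{
    if (b <= 0x7F) return 0x7F;   /* ASCII */
    if (b <= 0xBF) return 0x3F;   /* continuation 80..BF */
    if (b <= 0xC1) return 0x00;   /* C0, C1: never valid */
    if (b <= 0xDF) return 0x1F;   /* 2-byte lead */
    if (b <= 0xEF) return 0x0F;   /* 3-byte lead */
    if (b <= 0xF4) return 0x07;   /* 4-byte lead */
    return 0x00;                  /* F5..FF: never valid */
}

/* Accumulates the codepoint of one well-formed sequence of n bytes. */
uint32_t accumulate(const unsigned char *seq, size_t n)
{
    assert(seq != NULL && n >= 1 && n <= 4);
    uint32_t cp = 0;
    for (size_t i = 0; i < n; i++)
        cp = (cp << 6) | (seq[i] & payload_mask(seq[i]));
    return cp;
}
```

For the three bytes E2 82 AC the accumulator goes 0x02, 0x82, 0x20AC, which is U+20AC.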
The 64-bit reverse backend uses the same payload mask scheme. The sequence depth is encoded in the state value itself, so the bit shift for codepoint accumulation can be derived without a separate lookup:
shift = ((state >> 2) & 3) * 6;
*codepoint = *codepoint | (byte & (row >> 56)) << shift;
state = (row >> state) & 63;

| Header | Row type | Table size | Decoding |
|---|---|---|---|
| utf8_dfa32.h | uint32_t | 1 KB | No |
| utf8_dfa64.h | uint64_t | 2 KB | Yes |
Use utf8_dfa32.h for validation-only work: utf8_valid.h,
utf8_valid_stream.h, utf8_distance.h, utf8_advance_forward.h.
Use utf8_dfa64.h when codepoint decoding or transcoding is needed:
utf8_decode_next.h, utf8_transcode.h.
| Header | Row type | Table size | Decoding |
|---|---|---|---|
| utf8_rdfa32.h | uint32_t | 1 KB | No |
| utf8_rdfa64.h | uint64_t | 2 KB | Yes |
Use utf8_rdfa32.h for backward navigation: utf8_advance_backward.h.
Use utf8_rdfa64.h for backward decoding: utf8_decode_prev.h.
The 64-bit variants are supersets of the 32-bit variants, so include exactly one forward backend and, if you need backward traversal, exactly one reverse backend.
| You need | Forward | Reverse |
|---|---|---|
| Validate | utf8_dfa{32,64}.h | - |
| Count codepoints | utf8_dfa{32,64}.h | - |
| Streaming validation | utf8_dfa{32,64}.h | - |
| Skip forward N codepoints | utf8_dfa{32,64}.h | - |
| Decode forward | utf8_dfa64.h | - |
| Transcode to UTF-16/32 | utf8_dfa64.h | - |
| Skip backward N codepoints | - | utf8_rdfa{32,64}.h |
| Decode backward | - | utf8_rdfa64.h |
The library is split into independent header files. The safe API headers each require exactly one DFA backend to be included first. The unsafe variants require no DFA backend and operate on pre-validated input only.
Requires utf8_dfa32.h or utf8_dfa64.h.
bool utf8_valid(const char *src, size_t len);
bool utf8_valid_ascii(const char *src, size_t len);
bool utf8_check(const char *src, size_t len, size_t *cursor);
bool utf8_check_ascii(const char *src, size_t len, size_t *cursor);
size_t utf8_maximal_subpart(const char *src, size_t len);

utf8_valid returns true if src[0..len) is valid UTF-8. Internally it
uses dual-stream validation: the input is split at a sequence boundary near
the midpoint and two independent DFA chains run in a single interleaved loop,
exploiting instruction-level parallelism on wide-issue cores.
utf8_check returns true if src[0..len) is valid UTF-8. On failure,
if cursor is non-NULL, sets *cursor to the byte offset of the first
ill-formed sequence (the length of the maximal valid prefix). Uses the same
dual-stream strategy as utf8_valid.
utf8_valid_ascii and utf8_check_ascii are drop-in replacements
that use a single DFA stream with a 16-byte ASCII fast path. On each
iteration the fast path checks whether the next 16 bytes are all ASCII; if
so, it skips them without entering the DFA. When a non-ASCII byte is
encountered the DFA processes bytes until it returns to the ACCEPT state,
at which point the fast path is re-entered. Behaviour is identical to
utf8_valid and utf8_check. Throughput advantage depends on content mix
and microarchitecture; see the Performance section.
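One common way to test a 16-byte block for ASCII without SIMD is to OR two 64-bit words and check the high bit of every byte. The sketch below shows that idea only; `is_ascii16` and `skip_ascii_blocks` are hypothetical names, and the library's actual fast path may be implemented differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True if all 16 bytes at p have the high bit clear. */
bool is_ascii16(const unsigned char *p)
{
    uint64_t a, b;
    memcpy(&a, p, 8);        /* memcpy avoids unaligned-access issues */
    memcpy(&b, p + 8, 8);
    return ((a | b) & UINT64_C(0x8080808080808080)) == 0;
}

/* Number of leading bytes that can be skipped in whole 16-byte ASCII
   blocks; the caller would hand the remainder to the DFA. */
size_t skip_ascii_blocks(const unsigned char *src, size_t len)
{
    assert(src != NULL || len == 0);
    size_t i = 0;
    while (len - i >= 16 && is_ascii16(src + i))
        i += 16;
    return i;
}
```

Skipping is only safe while the DFA is in the ACCEPT state, which is why the fast path is re-entered only on a sequence boundary.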
utf8_maximal_subpart returns the length of the maximal subpart of the
ill-formed sequence starting at src[0] within src[0..len), as defined by
Unicode (see Error handling and U+FFFD replacement).
The return value is always >= 1. Call this after utf8_check reports failure,
with src advanced to the cursor position.
Requires utf8_dfa32.h or utf8_dfa64.h.
typedef enum {
UTF8_VALID_STREAM_OK, // src fully consumed, no errors
UTF8_VALID_STREAM_PARTIAL, // src fully consumed, ends mid-sequence
UTF8_VALID_STREAM_ILLFORMED, // stopped at an ill-formed subsequence
UTF8_VALID_STREAM_TRUNCATED, // eof is true and src ends mid-sequence
} utf8_valid_stream_status_t;
typedef struct {
utf8_valid_stream_status_t status;
size_t consumed; // bytes read from src
size_t pending; // bytes in an incomplete trailing sequence, else 0
size_t advance; // bytes to skip on ILLFORMED or TRUNCATED, else 0
size_t carried; // bytes from a previous chunk that belong to the same subpart
} utf8_valid_stream_result_t;
typedef struct {
utf8_dfa_state_t state;
size_t pending;
} utf8_valid_stream_t;
void utf8_valid_stream_init(utf8_valid_stream_t *s);
utf8_valid_stream_result_t utf8_valid_stream_check(utf8_valid_stream_t *s,
const char *src,
size_t len,
bool eof);

utf8_valid_stream_init initialises a stream validator. Call this before
the first utf8_valid_stream_check.
utf8_valid_stream_check validates src[0..len) as the next chunk of a
UTF-8 byte stream. eof should be true only for the final chunk. The DFA
state is carried in utf8_valid_stream_t across calls.
If the chunk is well-formed and ends on a sequence boundary, status is OK.
If the chunk ends in the middle of a sequence and eof is false, status is
PARTIAL; pending is the number of trailing bytes in that incomplete
sequence.
If validation stops at an ill-formed sequence, status is ILLFORMED. If
eof is true and the chunk ends in the middle of a sequence, status is
TRUNCATED.
On ILLFORMED or TRUNCATED, the stream state resets to ACCEPT
automatically so the caller can continue without reinitialising.
On ILLFORMED or TRUNCATED, pending is 0. advance is the number of
bytes in the current chunk that belong to the subpart, and carried is the
number of bytes from previous chunks that belong to the same subpart. Resume
validation at src[consumed + advance].
If the current chunk starts at absolute offset stream_offset, then
stream_offset + consumed - carried is the error position,
carried + advance is the subpart length, and
stream_offset + consumed + advance is the resume position.
utf8_valid_stream_t s;
utf8_valid_stream_init(&s);
size_t stream_offset = 0; // absolute offset of current chunk start
bool valid = true;
while (valid && (len = read_chunk(buf, sizeof buf)) > 0) {
bool eof = len < sizeof buf;
utf8_valid_stream_result_t r = utf8_valid_stream_check(&s, buf, len, eof);
switch (r.status) {
case UTF8_VALID_STREAM_OK:
case UTF8_VALID_STREAM_PARTIAL:
stream_offset += len;
break;
case UTF8_VALID_STREAM_ILLFORMED:
case UTF8_VALID_STREAM_TRUNCATED: {
size_t error_pos = stream_offset + r.consumed - r.carried;
size_t subpart_len = r.carried + r.advance;
size_t resume_pos = stream_offset + r.consumed + r.advance;
handle_error(error_pos, subpart_len);
stream_offset = resume_pos;
valid = false;
break;
}
}
}

Requires utf8_dfa32.h or utf8_dfa64.h.
size_t utf8_distance(const char *src, size_t len);
size_t utf8_distance_ascii(const char *src, size_t len);

utf8_distance returns the number of Unicode codepoints in
src[0..len), or (size_t)-1 if the input contains ill-formed UTF-8.
utf8_distance_ascii is a drop-in replacement with an 8-byte ASCII
fast path that skips the DFA for chunks containing only ASCII bytes.
Behaviour is identical to utf8_distance.
Requires utf8_dfa32.h or utf8_dfa64.h.
size_t utf8_advance_forward(const char *src,
size_t len,
size_t distance,
size_t *advanced);
size_t utf8_advance_forward_ascii(const char *src,
size_t len,
size_t distance,
size_t *advanced);

utf8_advance_forward returns the byte offset of the codepoint
distance positions ahead within src[0..len). Returns len if distance
exceeds the number of codepoints in the buffer. Returns (size_t)-1 if the
input contains ill-formed UTF-8.
If advanced is non-NULL, sets *advanced to the number of codepoints
actually skipped before stopping.
utf8_advance_forward_ascii is a drop-in replacement with an 8-byte
ASCII fast path that skips the DFA for chunks containing only ASCII bytes.
Behaviour is identical to utf8_advance_forward.
Requires utf8_rdfa32.h or utf8_rdfa64.h.
size_t utf8_advance_backward(const char *src,
size_t len,
size_t distance,
size_t *advanced);
size_t utf8_advance_backward_ascii(const char *src,
size_t len,
size_t distance,
size_t *advanced);

utf8_advance_backward returns the byte offset of the codepoint
distance positions before the end of src[0..len). Returns 0 if
distance exceeds the number of codepoints in the buffer. Returns
(size_t)-1 if the input contains ill-formed UTF-8.
If advanced is non-NULL, sets *advanced to the number of codepoints
actually skipped before stopping.
utf8_advance_backward_ascii is a drop-in replacement with an 8-byte
ASCII fast path that skips the DFA for chunks containing only ASCII bytes.
Behaviour is identical to utf8_advance_backward.
Requires utf8_dfa64.h.
int utf8_decode_next(const char *src, size_t len, uint32_t *codepoint);
int utf8_decode_next_replace(const char *src, size_t len, uint32_t *codepoint);

utf8_decode_next decodes the codepoint starting at src[0].

- Success: returns bytes consumed (1–4) and writes the codepoint to
  *codepoint.
- End of input (len == 0): returns 0; *codepoint is unchanged.
- Ill-formed sequence: returns the negated length of the maximal subpart
  (see Error handling and U+FFFD replacement), in the range -1..-3.
  Advance by -return_value bytes and call again. *codepoint is unchanged.
utf8_decode_next_replace is identical but on error writes U+FFFD to
*codepoint and returns the maximal subpart length as a positive value.
Never returns a negative value; returns 0 only when len is 0.
uint32_t cp;
while (len > 0) {
int n = utf8_decode_next_replace(src, len, &cp);
process(cp); // U+FFFD for any ill-formed sequence
src += n;
len -= n;
}

Requires utf8_rdfa64.h.
int utf8_decode_prev(const char *src, size_t len, uint32_t *codepoint);
int utf8_decode_prev_replace(const char *src, size_t len, uint32_t *codepoint);

utf8_decode_prev decodes the codepoint ending at src[len-1].

- Success: returns bytes consumed (1–4) and writes the codepoint to
  *codepoint. Step back by the return value: the next call uses
  src[0..len-return_value).
- End of input (len == 0): returns 0; *codepoint is unchanged.
- Ill-formed sequence: returns the negated number of bytes to step back,
  in the range -1..-3. Step back by -return_value bytes and call again.
  *codepoint is unchanged.
utf8_decode_prev_replace is identical but on error writes U+FFFD to
*codepoint and returns the step-back distance as a positive value.
Never returns a negative value; returns 0 only when len is 0.
The reverse DFA sees continuation bytes before the lead byte. For some
ill-formed inputs (e.g. two lone continuation bytes \x80\x80),
utf8_decode_prev may produce fewer U+FFFD substitutions than
utf8_decode_next over the same bytes, because all bytes consumed before
rejection are reported as one ill-formed unit rather than individually.
Requires utf8_dfa64.h.
typedef enum {
UTF8_TRANSCODE_OK, // src fully consumed, no errors
UTF8_TRANSCODE_EXHAUSTED, // dst full before src was consumed
UTF8_TRANSCODE_ILLFORMED, // stopped at an ill-formed sequence
UTF8_TRANSCODE_TRUNCATED, // src ends mid-sequence
} utf8_transcode_status_t;
typedef struct {
utf8_transcode_status_t status;
size_t consumed; // bytes read from src
size_t decoded; // codepoints decoded from src
size_t written; // code units written to dst
size_t advance; // bytes to skip on ILLFORMED or TRUNCATED, else 0
} utf8_transcode_result_t;
utf8_transcode_result_t utf8_transcode_utf16(const char *src, size_t src_len,
uint16_t *dst, size_t dst_len);
utf8_transcode_result_t utf8_transcode_utf16_replace(const char *src, size_t src_len,
uint16_t *dst, size_t dst_len);
utf8_transcode_result_t utf8_transcode_utf32(const char *src, size_t src_len,
uint32_t *dst, size_t dst_len);
utf8_transcode_result_t utf8_transcode_utf32_replace(const char *src, size_t src_len,
uint32_t *dst, size_t dst_len);

utf8_transcode_utf16 and utf8_transcode_utf32 transcode
src[0..src_len) into dst[0..dst_len), stopping at the first ill-formed
or truncated sequence. Codepoints above U+FFFF are encoded as surrogate
pairs in UTF-16 and consume two code units.
utf8_transcode_utf16_replace and utf8_transcode_utf32_replace
replace each ill-formed sequence with U+FFFD and continue. Status is always
OK or EXHAUSTED; advance is always 0.
On ILLFORMED or TRUNCATED, consumed is the byte offset of the
ill-formed sequence and advance is the length of its maximal subpart
(see Error handling and U+FFFD replacement).
Resume transcoding at src[consumed + advance].
// Replace ill-formed sequences with U+FFFD
uint16_t dst[256];
utf8_transcode_result_t r;
do {
r = utf8_transcode_utf16_replace(src, src_len, dst, sizeof dst / sizeof *dst);
flush(dst, r.written);
src += r.consumed;
src_len -= r.consumed;
} while (r.status == UTF8_TRANSCODE_EXHAUSTED);
// Strict: report each ill-formed sequence to the caller
while (src_len > 0) {
r = utf8_transcode_utf16(src, src_len, dst, sizeof dst / sizeof *dst);
flush(dst, r.written);
src += r.consumed;
src_len -= r.consumed;
if (r.status == UTF8_TRANSCODE_ILLFORMED || r.status == UTF8_TRANSCODE_TRUNCATED) {
handle_error(src, r.advance);
src += r.advance;
src_len -= r.advance;
if (r.status == UTF8_TRANSCODE_TRUNCATED)
break;
}
else if (r.status == UTF8_TRANSCODE_OK)
break;
}

For UTF-32, result.written always equals result.decoded since each
codepoint maps to exactly one code unit. For UTF-16, result.written may
exceed result.decoded due to surrogate pairs for codepoints above U+FFFF.
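The surrogate-pair arithmetic behind the extra code unit is small enough to show directly. This is a generic sketch of the UTF-16 encoding form; `utf16_encode` is a hypothetical helper, not a function exported by the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes one Unicode scalar value as UTF-16.
   Returns the number of code units written to out (1 or 2). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* Scalar values only: no surrogates, nothing above U+10FFFF. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                    /* BMP: one code unit */
        return 1;
    }
    cp -= 0x10000;                                /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));     /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));   /* low surrogate: bottom 10 bits */
    return 2;
}
```

For U+1F600 this yields the pair D83D DE00, so one decoded codepoint accounts for two written code units.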
The following headers provide validation-free implementations for use on
input that has already been validated. They skip the DFA entirely and decode
or count by inspecting lead/continuation byte structure directly. The caller
must guarantee that src contains well-formed UTF-8; behaviour is
undefined otherwise.
These are not drop-in replacements for the safe API — they never return error
indicators. Use them on the output side of a validation boundary (e.g. after
utf8_valid or utf8_check has accepted the input).
No DFA backend required.
size_t utf8_distance_unsafe(const char *src, size_t len);

utf8_distance_unsafe returns the number of codepoints in src[0..len).
Uses SIMD (when available) or SWAR to process blocks in bulk. Cannot fail.
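On well-formed input the number of codepoints equals the number of bytes that are not continuation bytes, which is what makes bulk counting possible. The following SWAR sketch counts continuation bytes eight at a time; `count_codepoints_swar` is a hypothetical name and the library's SIMD/SWAR kernels may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Counts codepoints in src[0..len), assuming well-formed UTF-8. */
size_t count_codepoints_swar(const unsigned char *src, size_t len)
{
    assert(src != NULL || len == 0);
    const uint64_t ones = UINT64_C(0x0101010101010101);
    size_t cont = 0, i = 0;

    for (; len - i >= 8; i += 8) {
        uint64_t w;
        memcpy(&w, src + i, 8);
        /* Lane bit 0 becomes 1 when the byte is 10xxxxxx:
           bit 7 set and bit 6 clear. */
        uint64_t m = (w >> 7) & ~(w >> 6) & ones;
        /* Horizontal sum of the eight lanes lands in the top byte. */
        cont += (size_t)((m * ones) >> 56);
    }
    for (; i < len; i++)                  /* scalar tail */
        cont += (src[i] & 0xC0) == 0x80;

    return len - cont;
}
```

Because only a count is produced, the result does not depend on byte order within the loaded word.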
No DFA backend required.
size_t utf8_advance_forward_unsafe(const char *src,
size_t len,
size_t distance,
size_t *advanced);

utf8_advance_forward_unsafe returns the byte offset of the codepoint
distance positions ahead in src[0..len), or len if distance exceeds
the number of codepoints in the buffer. If advanced is non-NULL, writes the
number of codepoints actually skipped. Uses SIMD/SWAR bulk counting internally.
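The contract can be illustrated with a plain scalar loop. This sketch mirrors the documented return values but not the bulk-counting implementation; `advance_forward_valid` is a hypothetical name, and the input is assumed to be well-formed.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the byte offset reached after skipping up to `distance`
   codepoints in well-formed UTF-8; returns len if the buffer runs out. */
size_t advance_forward_valid(const unsigned char *src, size_t len,
                             size_t distance, size_t *advanced)
{
    assert(src != NULL || len == 0);
    size_t i = 0, n = 0;
    while (i < len && n < distance) {
        i++;                                          /* the lead byte */
        while (i < len && (src[i] & 0xC0) == 0x80)
            i++;                                      /* its continuation bytes */
        n++;
    }
    if (advanced != NULL)
        *advanced = n;                                /* codepoints actually skipped */
    return i;
}
```

Comparing `*advanced` with `distance` tells the caller whether the buffer ended before the requested position was reached.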
No DFA backend required.
size_t utf8_advance_backward_unsafe(const char *src,
size_t len,
size_t distance,
size_t *advanced);

utf8_advance_backward_unsafe returns the byte offset of the codepoint
distance positions before the end of src[0..len), or 0 if distance
exceeds the number of codepoints in the buffer. If advanced is non-NULL,
writes the number of codepoints actually skipped. Uses SIMD/SWAR bulk
counting internally.
No DFA backend required.
int utf8_decode_next_unsafe(const char *src, size_t len, uint32_t *codepoint);

utf8_decode_next_unsafe decodes the codepoint starting at src[0].
Returns bytes consumed (1–4) and writes the codepoint to *codepoint.
Returns 0 when len is 0.
No DFA backend required.
int utf8_decode_prev_unsafe(const char *src, size_t len, uint32_t *codepoint);

utf8_decode_prev_unsafe decodes the codepoint ending at src[len-1].
Returns bytes consumed (1–4) and writes the codepoint to *codepoint.
Returns 0 when len is 0.
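Decoding backward means the low-order bits arrive first, so each byte is shifted into place by 6 bits per continuation byte already seen. A minimal sketch of that accumulation on well-formed input follows; `decode_prev_valid` is a hypothetical name and the code is not taken from the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes the codepoint ending at src[len-1], assuming well-formed UTF-8.
   Returns bytes consumed (1..4), or 0 when len is 0. */
int decode_prev_valid(const unsigned char *src, size_t len, uint32_t *codepoint)
{
    assert(codepoint != NULL);
    if (len == 0)
        return 0;

    uint32_t cp = 0;
    unsigned shift = 0;
    size_t i = len;

    /* Continuation bytes come first; each carries 6 payload bits. */
    while (i > 1 && (src[i - 1] & 0xC0) == 0x80 && shift < 18) {
        cp |= (uint32_t)(src[i - 1] & 0x3F) << shift;
        shift += 6;
        i--;
    }

    /* src[i-1] is the lead byte; its payload mask depends on the depth. */
    static const unsigned char lead_mask[4] = { 0x7F, 0x1F, 0x0F, 0x07 };
    cp |= (uint32_t)(src[i - 1] & lead_mask[shift / 6]) << shift;

    *codepoint = cp;
    return (int)(len - i + 1);
}
```

The `i > 1` and `shift < 18` guards only keep the loop inside the buffer; they do not make the function safe for unvalidated input.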
No DFA backend required.
utf8_transcode_result_t utf8_transcode_utf32_unsafe(const char *src, size_t src_len,
uint32_t *dst, size_t dst_len);
utf8_transcode_result_t utf8_transcode_utf16_unsafe(const char *src, size_t src_len,
uint16_t *dst, size_t dst_len);

utf8_transcode_utf32_unsafe and utf8_transcode_utf16_unsafe
transcode src[0..src_len) into dst[0..dst_len). Status is always OK or
EXHAUSTED; advance is always 0. The result struct is identical to the
safe variants for easy interchangeability.
Both functions process 8 bytes at a time when all 8 are ASCII (widening without decoding) and batch consecutive sequences of the same length class to avoid re-entering the lead-byte classifier. For UTF-16, codepoints above U+FFFF are encoded as surrogate pairs.
When processing untrusted input, any ill-formed sequence should be replaced with U+FFFD (REPLACEMENT CHARACTER) or the input rejected outright. Silently skipping ill-formed sequences can have security implications; see Unicode Technical Report #36, Unicode Security Considerations.
The number of U+FFFD characters to emit per ill-formed sequence is determined by the maximal subpart rule, defined in Unicode 17.0 Table 3-8. A maximal subpart is the longest prefix of an ill-formed sequence that is either the start of an otherwise well-formed sequence, or a single byte. Each maximal subpart produces exactly one U+FFFD.
For example, the byte sequence \xF0\x90\x80 is a truncated 4-byte sequence:
\xF0 is a valid 4-byte lead followed by two valid continuation bytes (the
byte after \xF0 must be in 90–BF), but a third continuation byte is missing.
All three bytes form one maximal subpart and produce one U+FFFD. Two lone
continuation bytes \x80\x80, on the other hand, each form their own maximal
subpart of length 1 and produce two U+FFFD.
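The rule can be sketched as a scan that accepts bytes for as long as they could still begin a well-formed sequence. This is an illustration written against the Unicode byte-range constraints, not the library's code; `scan_one` and `count_replacements` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns the length of the well-formed sequence or of the maximal
   subpart at src[0]; *ok reports which of the two it was. len must be > 0. */
size_t scan_one(const unsigned char *src, size_t len, bool *ok)
{
    assert(src != NULL && ok != NULL && len > 0);
    unsigned char b0 = src[0], lo = 0x80, hi = 0xBF;
    size_t need, n = 1;

    if (b0 <= 0x7F) { *ok = true; return 1; }          /* ASCII */
    if (b0 >= 0xC2 && b0 <= 0xDF)      need = 1;
    else if (b0 >= 0xE0 && b0 <= 0xEF) need = 2;
    else if (b0 >= 0xF0 && b0 <= 0xF4) need = 3;
    else { *ok = false; return 1; }                    /* lone continuation or invalid lead */

    /* Only the second byte has a restricted range. */
    if (b0 == 0xE0) lo = 0xA0;
    if (b0 == 0xED) hi = 0x9F;
    if (b0 == 0xF0) lo = 0x90;
    if (b0 == 0xF4) hi = 0x8F;

    while (n <= need && n < len && src[n] >= lo && src[n] <= hi) {
        n++;
        lo = 0x80;                                     /* later bytes: any continuation */
        hi = 0xBF;
    }
    *ok = (n == need + 1);
    return n;
}

/* Number of U+FFFD a replacing decoder would emit for src[0..len). */
size_t count_replacements(const unsigned char *src, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; ) {
        bool ok;
        i += scan_one(src + i, len - i, &ok);
        count += !ok;                                  /* one U+FFFD per maximal subpart */
    }
    return count;
}
```

With this sketch, the truncated 3-byte sequence E1 80 counts as one replacement, while the two lone continuation bytes 80 80 count as two.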
Unicode 17.0 §3.9 recommends,
and the WHATWG Encoding Standard
requires, that decoders replace each maximal subpart of an ill-formed
sequence with exactly one U+FFFD. This is the behaviour implemented by
utf8_decode_next_replace,
utf8_transcode_utf16_replace, and utf8_transcode_utf32_replace.
The non-replace variants (utf8_decode_next, utf8_transcode_utf16,
utf8_transcode_utf32) stop and report the error position instead. These are
intended for applications that need to handle ill-formed input explicitly,
such as validating input at a trust boundary, logging the error location, or
applying a custom substitution policy. If you have no such requirement, prefer
the _replace variants.
The guarantees below apply to the safe (DFA-based) API. The _unsafe
variants perform no validation and assume the caller has already verified
the input; passing ill-formed UTF-8 to an unsafe function is undefined
behaviour.
All decoded codepoints are within the Unicode scalar value range. The DFA structurally rejects non-shortest-form encodings, surrogate halves (U+D800–U+DFFF), and codepoints above U+10FFFF. These cannot appear in the output of any safe decoding or transcoding function.
No dynamic allocation. All functions (safe and unsafe) operate on
caller-supplied buffers with no heap allocation, no global mutable state, and
no use of errno.
No data-dependent branches on byte value. The DFA step is a table lookup
and bitwise shift with no conditional branches that depend on the input byte.
Execution time scales with the number of bytes processed, not their values.
Note that the _ascii variants are exceptions; they branch on whether a
chunk contains only ASCII bytes, making their execution time
content-dependent.
The _unsafe variants also use content-dependent branching (lead-byte
classification) since they are intended for performance-critical paths where
the input is already trusted.
Documents are retrieved from Wikipedia and converted to plain text, available in benchmark/corpus/.
| File | Size | Code points | Distribution | Best utf8_valid | Best utf8_valid_ascii |
|---|---|---|---|---|---|
| ar.txt | 25 KB | 14K | 19% ASCII, 81% 2-byte | 6725 MB/s | 5349 MB/s |
| el.txt | 102 KB | 59K | 23% ASCII, 77% 2-byte | 6637 MB/s | 5254 MB/s |
| en.txt | 80 KB | 82K | 99.9% ASCII | 6582 MB/s | 41071 MB/s |
| ja.txt | 176 KB | 65K | 11% ASCII, 89% 3-byte | 6584 MB/s | 5478 MB/s |
| lv.txt | 135 KB | 127K | 92% ASCII, 7% 2-byte | 6600 MB/s | 6445 MB/s |
| ru.txt | 148 KB | 85K | 23% ASCII, 77% 2-byte | 6601 MB/s | 4154 MB/s |
| sv.txt | 94 KB | 93K | 96% ASCII, 4% 2-byte | 6646 MB/s | 9199 MB/s |
Best numbers from -O2 -march=x86-64-v3 (Raptor Lake). utf8_valid uses
dual-stream validation; see Observations.
| Flags | utf8_valid | utf8_valid_ascii | Notes |
|---|---|---|---|
| -O2 | 4107 MB/s | 2986 MB/s | dual-stream ILP effective without BMI2 |
| -O2 -march=x86-64-v3 | 6394 MB/s | 4175 MB/s | BMI2 SHRX on two ports |
| -O3 -march=x86-64-v3 | 6471 MB/s | 5349 MB/s | fast path not profitable on multibyte |
Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii
reaches 29–35 GB/s at all optimization levels.
| Flags | utf8_valid | utf8_valid_ascii | Notes |
|---|---|---|---|
| -O2 | 3714 MB/s | 2774 MB/s | dual-stream ILP effective without BMI2 |
| -O2 -march=x86-64-v3 | 6725 MB/s | 4125 MB/s | BMI2 SHRX on two ports |
| -O3 -march=x86-64-v3 | 6489 MB/s | 4164 MB/s | fast path not profitable on multibyte |
Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii
reaches 36–41 GB/s at all optimization levels.
| Flags | utf8_valid | utf8_valid_ascii | Notes |
|---|---|---|---|
| -O2 | 2370 MB/s | 1756 MB/s | narrow backend limits ILP gain |
| -O2 -march=x86-64-v3 | 3674 MB/s | 2626 MB/s | BMI2 SHRX |
| -O3 -march=x86-64-v3 | 3682 MB/s | 3328 MB/s | gap narrows at -O3 |
Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii
reaches 16–20 GB/s at all optimization levels.
| Flags | utf8_valid | utf8_valid_ascii | Notes |
|---|---|---|---|
| -O2 | 4498 MB/s | 2877 MB/s | dual-stream ILP on wide Firestorm |
| -O2 -mtune=apple-m1 | 4356 MB/s | 2867 MB/s | mtune negligible |
| -O3 | 4445 MB/s | 5231 MB/s | -O3 unlocks NEON fast path |
| -O3 -mtune=apple-m1 | 4228 MB/s | 4866 MB/s | fast path profitable on all content |
Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii
reaches ~21 GB/s at -O3.
| Flags | utf8_valid | utf8_valid_ascii | Notes |
|---|---|---|---|
| -O2 | 4206 MB/s | 2742 MB/s | dual-stream ILP on wide Firestorm |
| -O2 -mtune=apple-m1 | 4214 MB/s | 2738 MB/s | mtune no effect |
| -O3 | 4506 MB/s | 2757 MB/s | -O3 improves over -O2 |
| -O3 -mtune=apple-m1 | 4453 MB/s | 2881 MB/s | fast path not profitable on multibyte |
Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii
reaches ~36 GB/s at -O3.
- utf8_valid uses a dual-stream validation strategy: the input is split at a UTF-8 sequence boundary near the midpoint and two independent DFA chains run in a single interleaved loop. Since the chains have no data dependency, the CPU's out-of-order engine can overlap both shift operations on cores with multiple shift-capable execution ports.
- On wide-issue cores (Raptor Lake P-cores with BMI2, Apple M1 Firestorm), dual-stream reaches approximately 1.4 bytes per clock cycle at peak throughput.
- On x86, -march=x86-64-v3 or -march=native enables BMI2 SHRX, which removes the variable-shift dependency on CL.
- utf8_valid_ascii is profitable on high-ASCII content across all tested platforms. On multibyte-heavy content it is generally slower than utf8_valid, with the exception of Apple M1 with Clang -O3, where the NEON fast path keeps it competitive.
The benchmark is in benchmark/bench.c. Compile and run with:
make bench
make bench BENCH_OPTFLAGS="-O3 -march=x86-64-v3"
./bench # benchmarks benchmark/corpus/
./bench -d <directory> # benchmarks all .txt files in directory
./bench -f <file> # benchmarks single file
./bench -s <MB> # resize input to <MB> before benchmarking

Benchmark mode (mutually exclusive):

| Flag | Description |
|---|---|
| -t <secs> | run each implementation for <secs> seconds (default: 20) |
| -n <reps> | run each implementation for <reps> repetitions |
| -b <MB> | run each implementation until <MB> total data processed |
Warmup (-w <n>): before timing, each implementation is run for n
iterations to warm up caches and branch predictors. By default the warmup
count is derived from input size, targeting approximately 256 MB of warmup
data, capped between 1 and 100 iterations. Use -w 0 to disable warmup
entirely.
Output format: For each file, the header line shows the filename,
byte size, code point count, and average bytes per code point
(units/point). The code point distribution breaks down the input by
Unicode range. Results are sorted slowest to fastest; the multiplier
shows throughput relative to the slowest implementation.
sv.txt: 94 KB; 93K code points; 1.04 units/point
U+0000..U+007F 90K 96.4%
U+0080..U+07FF 3K 3.5%
U+0800..U+FFFF 171 0.2%
hoehrmann 432 MB/s
utf8_valid_old 1780 MB/s (4.12x)
utf8_valid 4449 MB/s (10.29x)
utf8_valid_ascii 9968 MB/s (23.05x)
units/point is a rough content-mix indicator: 1.00 is near-pure ASCII,
~1.7–1.9 is common for 2-byte-heavy text, and ~2.7–3.0 for CJK-heavy text.
The benchmark includes two reference implementations: hoehrmann, a widely used
table-driven DFA implementation, and utf8_valid_old, the previous scalar
decoder, to track regressions and quantify gains from the current DFA approach.
MIT License. Copyright (c) 2017–2026 Christian Hansen.
- Flexible and Economical UTF-8 Decoder by Björn Höhrmann
- Branchless UTF-8 Decoder by Chris Wellons
- simdutf Unicode validation and transcoding using SIMD