c-utf8

A header-only UTF-8 library in C implementing validation, decoding, and transcoding conforming to the Unicode and ISO/IEC 10646 specifications.

Usage

#include "utf8_dfa32.h"   // or utf8_dfa64.h for the 64-bit variant
#include "utf8_valid.h"

// Check if a string is valid UTF-8
bool ok = utf8_valid(src, len);

// Check and find the error position
size_t cursor;
if (!utf8_check(src, len, &cursor)) {
  // cursor is the byte offset of the first ill-formed sequence
}

// Length of the maximal subpart of the ill-formed sequence at src[0]
// (after a failed utf8_check, pass src + cursor and len - cursor)
size_t n = utf8_maximal_subpart(src, len);

Requirements

C99 or later

DFA backends

All routines in this library are driven by a Deterministic Finite Automaton (DFA) that consumes one byte at a time and transitions between a small set of states. Four backend headers are provided; you choose one or two based on what the consumer headers you include actually need.

Forward DFA

The forward DFA scans bytes left-to-right starting from the ACCEPT state. Each byte indexes into a 256-entry lookup table and produces the next state. Returning to ACCEPT marks the end of a complete valid sequence. Reaching REJECT means an ill-formed byte was encountered; that state is a permanent trap and no further input can leave it.

The 9 forward DFA states are:

State	Meaning
`ACCEPT`	Start / valid sequence boundary
`REJECT`	Ill-formed input (absorbing trap)
`TAIL1`	Awaiting 1 more continuation byte
`TAIL2`	Awaiting 2 more continuation bytes
`TAIL3`	Awaiting 3 more continuation bytes
`E0`	After `0xE0`; next must be `A0–BF` (reject non-shortest form)
`ED`	After `0xED`; next must be `80–9F` (reject surrogates)
`F0`	After `0xF0`; next must be `90–BF` (reject non-shortest form)
`F4`	After `0xF4`; next must be `80–8F` (reject above U+10FFFF)

The four special states (E0, ED, F0, F4) enforce Unicode constraints that a naive continuation-byte count cannot express: non-shortest-form encodings, surrogate halves (U+D800–U+DFFF), and codepoints above U+10FFFF are all structurally rejected.
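As a rough illustration of the state table above, the transitions can be written out as plain range checks. This is a sketch only: the names `sketch_fwd_step` and `sketch_valid_fwd` and the enum numbering are hypothetical, and the library itself uses packed lookup tables rather than a switch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// The nine forward states. The numbering here is arbitrary (an assumption
// of this sketch); the library encodes states as bit offsets instead.
typedef enum {
  S_REJECT, S_ACCEPT, S_TAIL1, S_TAIL2, S_TAIL3, S_E0, S_ED, S_F0, S_F4
} sketch_fwd_state;

// True if b lies in the inclusive range [lo, hi].
static bool in_range(unsigned char b, unsigned char lo, unsigned char hi) {
  return b >= lo && b <= hi;
}

// One forward transition, written as explicit range checks.
static sketch_fwd_state sketch_fwd_step(sketch_fwd_state s, unsigned char b) {
  switch (s) {
  case S_ACCEPT:
    if (b <= 0x7F)               return S_ACCEPT;  // ASCII
    if (in_range(b, 0xC2, 0xDF)) return S_TAIL1;   // 2-byte lead
    if (b == 0xE0)               return S_E0;
    if (b == 0xED)               return S_ED;
    if (in_range(b, 0xE1, 0xEF)) return S_TAIL2;   // E1..EC, EE..EF
    if (b == 0xF0)               return S_F0;
    if (in_range(b, 0xF1, 0xF3)) return S_TAIL3;
    if (b == 0xF4)               return S_F4;
    return S_REJECT;                               // 80..C1, F5..FF
  case S_TAIL1: return in_range(b, 0x80, 0xBF) ? S_ACCEPT : S_REJECT;
  case S_TAIL2: return in_range(b, 0x80, 0xBF) ? S_TAIL1  : S_REJECT;
  case S_TAIL3: return in_range(b, 0x80, 0xBF) ? S_TAIL2  : S_REJECT;
  case S_E0:    return in_range(b, 0xA0, 0xBF) ? S_TAIL1  : S_REJECT;
  case S_ED:    return in_range(b, 0x80, 0x9F) ? S_TAIL1  : S_REJECT;
  case S_F0:    return in_range(b, 0x90, 0xBF) ? S_TAIL2  : S_REJECT;
  case S_F4:    return in_range(b, 0x80, 0x8F) ? S_TAIL2  : S_REJECT;
  default:      return S_REJECT;                   // absorbing trap
  }
}

// Valid if the scan ends back in ACCEPT.
static bool sketch_valid_fwd(const char *src, size_t len) {
  assert(src != NULL || len == 0);
  sketch_fwd_state s = S_ACCEPT;
  for (size_t i = 0; i < len; i++)
    s = sketch_fwd_step(s, (unsigned char)src[i]);
  return s == S_ACCEPT;
}
```

The four special states appear as the only cases whose accepted range is narrower than 80..BF.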

Reverse DFA

The reverse DFA scans bytes right-to-left and is used for backward navigation (moving a cursor back by N codepoints) and backward decoding.

Going forward you see the lead byte first and know immediately how many continuation bytes follow and what constraints apply to them. Going in reverse you see continuation bytes before the lead, so the DFA tracks how many have been seen and remembers the range of the outermost continuation byte (the one adjacent to the lead), because that range determines which lead bytes are valid.

The reverse DFA uses 7 states, 2 fewer than the forward DFA:

State	Meaning
`ACCEPT`	Start / valid sequence boundary
`REJECT`	Ill-formed input (absorbing trap)
`TL1`	Seen 1 continuation byte
`TL2LO`	Seen 2 continuations; outermost was `80–9F`
`TL2HI`	Seen 2 continuations; outermost was `A0–BF`
`TL3LO`	Seen 3 continuations; outermost was `80–8F`
`TL3MH`	Seen 3 continuations; outermost was `90–BF`

The LO/HI and LO/MH split on the outermost continuation range enforces the same constraints as the forward DFA's E0, ED, F0, and F4 states, applied when the lead byte finally arrives rather than when it is first seen.
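The reverse transitions can be sketched the same way. The transition rules below are derived from the state descriptions in the table, not copied from the library's tables, and the names `sketch_rev_step` and `sketch_valid_rev` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// The seven reverse states; numbering is arbitrary in this sketch.
typedef enum {
  R_REJECT, R_ACCEPT, R_TL1, R_TL2LO, R_TL2HI, R_TL3LO, R_TL3MH
} sketch_rev_state;

static bool in_range(unsigned char b, unsigned char lo, unsigned char hi) {
  return b >= lo && b <= hi;
}

// One reverse transition: b is the byte to the left of those already seen.
static sketch_rev_state sketch_rev_step(sketch_rev_state s, unsigned char b) {
  bool cont = in_range(b, 0x80, 0xBF);
  switch (s) {
  case R_ACCEPT:
    if (b <= 0x7F) return R_ACCEPT;
    return cont ? R_TL1 : R_REJECT;    // a lead with nothing after it is truncated
  case R_TL1:
    if (in_range(b, 0xC2, 0xDF)) return R_ACCEPT;
    if (cont) return b <= 0x9F ? R_TL2LO : R_TL2HI;
    return R_REJECT;
  case R_TL2LO:                        // outermost 80..9F: E0 would be non-shortest form
    if (in_range(b, 0xE1, 0xEF)) return R_ACCEPT;
    if (cont) return b <= 0x8F ? R_TL3LO : R_TL3MH;
    return R_REJECT;
  case R_TL2HI:                        // outermost A0..BF: ED would be a surrogate
    if (in_range(b, 0xE0, 0xEF) && b != 0xED) return R_ACCEPT;
    if (cont) return b <= 0x8F ? R_TL3LO : R_TL3MH;
    return R_REJECT;
  case R_TL3LO:                        // outermost 80..8F: F0 would be non-shortest form
    return in_range(b, 0xF1, 0xF4) ? R_ACCEPT : R_REJECT;
  case R_TL3MH:                        // outermost 90..BF: F4 would exceed U+10FFFF
    return in_range(b, 0xF0, 0xF3) ? R_ACCEPT : R_REJECT;
  default:
    return R_REJECT;                   // absorbing trap
  }
}

// Scan right-to-left; valid if the scan ends back in ACCEPT.
static bool sketch_valid_rev(const char *src, size_t len) {
  assert(src != NULL || len == 0);
  sketch_rev_state s = R_ACCEPT;
  while (len > 0)
    s = sketch_rev_step(s, (unsigned char)src[--len]);
  return s == R_ACCEPT;
}
```

Note how each constraint is checked in the state where the lead byte arrives, which is why the outermost continuation range has to be remembered.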

Shift-based DFA encoding

All four backends implement the shift-based DFA approach originally described by Per Vognsen. Each byte maps to a row in a 256-entry table. The row is a packed integer where each state's transition target is stored at a fixed bit offset. The current state value is used directly as the shift amount to extract the next state, avoiding a second lookup. The inner loop is a table load, a shift, and a mask:

state = (table[byte] >> state) & mask;

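The error state is fixed at offset 0. Because its value is 0, any transition to error contributes (0 << offset) = 0 to the row value, so the bits for every transition that leads to error are simply left clear. The bits at offset 0 are themselves always 0, so once the error state is entered, no byte can leave it.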

The 32-bit row packing technique was derived by Dougall Johnson; state offsets are chosen by an SMT solver (tool/smt_solver.py) to pack all transitions without collision inside a uint32_t.

The 64-bit forward backend uses state offsets at fixed multiples of 6 (6, 12, 18, ..., 48), packing all transitions into 54 bits of a uint64_t. Bits 56–62 store a per-byte payload mask that the decode step uses to accumulate codepoint bits in the same pass:

*codepoint = (*codepoint << 6) | (byte & (row >> 56));
state      = (row >> state) & 63;

ASCII rows carry mask 0x7F, 2/3/4-byte lead rows carry 0x1F/0x0F/0x07, continuation rows carry 0x3F, and error rows carry 0x00.
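To make the packing concrete, here is a minimal sketch that builds such a 64-bit table at run time and decodes one codepoint with it. The assignment of states to offsets (ACCEPT at 6, TAIL1 at 12, and so on) is an assumption of this sketch, as are the names `sketch_build_table` and `sketch_decode`; the library ships precomputed tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// State values double as bit offsets into a row (assumed assignment).
enum { REJECT = 0, ACCEPT = 6, TAIL1 = 12, TAIL2 = 18, TAIL3 = 24,
       E0 = 30, ED = 36, F0 = 42, F4 = 48 };

static uint64_t sketch_table[256];

// Transition from -> to for one byte: `to` is stored at bit offset `from`.
static uint64_t edge(int from, int to) { return (uint64_t)to << from; }

static void sketch_build_table(void) {
  for (int b = 0; b < 256; b++) {
    uint64_t row = 0, mask = 0;
    if (b <= 0x7F) {
      row = edge(ACCEPT, ACCEPT); mask = 0x7F;
    } else if (b <= 0xBF) {                    // continuation byte
      mask = 0x3F;
      row = edge(TAIL1, ACCEPT) | edge(TAIL2, TAIL1) | edge(TAIL3, TAIL2);
      if (b >= 0xA0) row |= edge(E0, TAIL1);
      if (b <= 0x9F) row |= edge(ED, TAIL1);
      if (b >= 0x90) row |= edge(F0, TAIL2);
      if (b <= 0x8F) row |= edge(F4, TAIL2);
    } else if (b >= 0xC2 && b <= 0xDF) {
      row = edge(ACCEPT, TAIL1); mask = 0x1F;
    } else if (b >= 0xE0 && b <= 0xEF) {
      row = edge(ACCEPT, b == 0xE0 ? E0 : b == 0xED ? ED : TAIL2); mask = 0x0F;
    } else if (b >= 0xF0 && b <= 0xF4) {
      row = edge(ACCEPT, b == 0xF0 ? F0 : b == 0xF4 ? F4 : TAIL3); mask = 0x07;
    }
    // C0, C1, F5..FF: row stays 0, so every state moves to REJECT.
    sketch_table[b] = row | (mask << 56);      // payload mask in bits 56..62
  }
}

// Decode one codepoint; returns bytes consumed, or 0 if the input is
// ill-formed or ends mid-sequence.
static int sketch_decode(const char *src, size_t len, uint32_t *cp) {
  assert(cp != NULL);
  const unsigned char *p = (const unsigned char *)src;
  uint64_t state = ACCEPT;
  uint32_t acc = 0;
  for (size_t i = 0; i < len; i++) {
    uint64_t row = sketch_table[p[i]];
    acc   = (acc << 6) | (p[i] & (uint32_t)(row >> 56));
    state = (row >> state) & 63;
    if (state == ACCEPT) { *cp = acc; return (int)(i + 1); }
    if (state == REJECT) return 0;
  }
  return 0;
}
```

Because no edge is ever stored at offset 0, the low six bits of every row are zero, which is what makes REJECT a trap.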

The 64-bit reverse backend uses the same payload mask scheme. The sequence depth is encoded in the state value itself, so the bit shift for codepoint accumulation can be derived without a separate lookup:

shift      = ((state >> 2) & 3) * 6;
*codepoint = *codepoint | (byte & (row >> 56)) << shift;
state      = (row >> state) & 63;
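The effect of the shift is easier to see with an explicit depth counter. The sketch below is not the library's code: it assumes pre-validated input, tracks depth in a variable instead of reading it from the state value, and uses the hypothetical name `sketch_decode_prev`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Decode the codepoint ending at src[len-1] by walking right to left.
// Each byte's payload lands at bit position depth * 6, where depth is the
// number of continuation bytes already seen. Trusted input only.
static int sketch_decode_prev(const char *src, size_t len, uint32_t *cp) {
  static const unsigned char lead_mask[4] = {0x7F, 0x1F, 0x0F, 0x07};
  assert(cp != NULL);
  const unsigned char *p = (const unsigned char *)src;
  uint32_t acc = 0;
  int depth = 0;
  while (len > 0 && depth < 4) {
    unsigned char byte = p[--len];
    unsigned shift = (unsigned)depth * 6;
    if ((byte & 0xC0) == 0x80) {               // continuation: 6 payload bits
      acc |= (uint32_t)(byte & 0x3F) << shift;
      depth++;
    } else {                                   // lead or ASCII: mask depends on depth
      acc |= (uint32_t)(byte & lead_mask[depth]) << shift;
      *cp = acc;
      return depth + 1;
    }
  }
  return 0;                                    // empty input or no lead found
}
```

For E2 82 AC the bytes arrive as AC (shift 0), 82 (shift 6), E2 (shift 12), giving 0x2C | 0x80 | 0x2000 = U+20AC.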

Choosing a backend

Forward backends

Header	Row type	Table size	Decoding
`utf8_dfa32.h`	`uint32_t`	1 KB	No
`utf8_dfa64.h`	`uint64_t`	2 KB	Yes

Use utf8_dfa32.h for validation-only work: utf8_valid.h, utf8_valid_stream.h, utf8_distance.h, utf8_advance_forward.h.

Use utf8_dfa64.h when codepoint decoding or transcoding is needed: utf8_decode_next.h, utf8_transcode.h.

Reverse backends

Header	Row type	Table size	Decoding
`utf8_rdfa32.h`	`uint32_t`	1 KB	No
`utf8_rdfa64.h`	`uint64_t`	2 KB	Yes

Use utf8_rdfa32.h for backward navigation: utf8_advance_backward.h.

Use utf8_rdfa64.h for backward decoding: utf8_decode_prev.h.

Summary

The 64-bit variants are supersets of the 32-bit variants, so include exactly one forward backend and, if you need backward traversal, exactly one reverse backend.

You need	Forward	Reverse
Validate	`utf8_dfa{32,64}.h`	—
Count codepoints	`utf8_dfa{32,64}.h`	—
Streaming validation	`utf8_dfa{32,64}.h`	—
Skip forward N codepoints	`utf8_dfa{32,64}.h`	—
Decode forward	`utf8_dfa64.h`	—
Transcode to UTF-16/32	`utf8_dfa64.h`	—
Skip backward N codepoints	—	`utf8_rdfa{32,64}.h`
Decode backward	—	`utf8_rdfa64.h`

API

The library is split into independent header files. The safe API headers each require exactly one DFA backend to be included first. The unsafe variants require no DFA backend and operate on pre-validated input only.

Validation — `utf8_valid.h`

Requires utf8_dfa32.h or utf8_dfa64.h.

bool   utf8_valid(const char *src, size_t len);
bool   utf8_valid_ascii(const char *src, size_t len);
bool   utf8_check(const char *src, size_t len, size_t *cursor);
bool   utf8_check_ascii(const char *src, size_t len, size_t *cursor);
size_t utf8_maximal_subpart(const char *src, size_t len);

utf8_valid returns true if src[0..len) is valid UTF-8. Internally uses dual-stream validation: the input is split at a sequence boundary near the midpoint and two independent DFA chains run in a single interleaved loop, exploiting instruction-level parallelism on wide-issue cores.
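The structure of the dual-stream loop can be sketched as follows. This is an illustration under assumptions: it uses a branchy switch-based step (hypothetical `sketch_step`) in place of the library's table lookup, so it shows the split and interleaving but will not show the same speedup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { REJ, ACC, T1, T2, T3, E0, ED, F0, F4 };

// Forward DFA step written as range checks (stand-in for the table lookup).
static int sketch_step(int s, unsigned char b) {
  bool cont = (b & 0xC0) == 0x80;
  switch (s) {
  case ACC:
    if (b < 0x80) return ACC;
    if (b >= 0xC2 && b <= 0xDF) return T1;
    if (b == 0xE0) return E0;
    if (b == 0xED) return ED;
    if (b >= 0xE1 && b <= 0xEF) return T2;
    if (b == 0xF0) return F0;
    if (b >= 0xF1 && b <= 0xF3) return T3;
    return b == 0xF4 ? F4 : REJ;
  case T1: return cont ? ACC : REJ;
  case T2: return cont ? T1 : REJ;
  case T3: return cont ? T2 : REJ;
  case E0: return (b >= 0xA0 && b <= 0xBF) ? T1 : REJ;
  case ED: return (b >= 0x80 && b <= 0x9F) ? T1 : REJ;
  case F0: return (b >= 0x90 && b <= 0xBF) ? T2 : REJ;
  case F4: return (b >= 0x80 && b <= 0x8F) ? T2 : REJ;
  default: return REJ;
  }
}

static bool sketch_valid_dual(const char *src, size_t len) {
  assert(src != NULL || len == 0);
  const unsigned char *p = (const unsigned char *)src;
  size_t mid = len / 2;
  // Back up at most 3 bytes to a non-continuation byte so that the second
  // chain starts on a sequence boundary. Four continuation bytes in a row
  // are ill-formed anyway, and the second chain will then reject.
  for (int k = 0; k < 3 && mid > 0 && (p[mid] & 0xC0) == 0x80; k++)
    mid--;
  size_t i = 0, j = mid;
  int a = ACC, b = ACC;
  // Interleaved loop: the chains share no data, so their steps can overlap.
  while (i < mid && j < len) {
    a = sketch_step(a, p[i++]);
    b = sketch_step(b, p[j++]);
  }
  while (i < mid) a = sketch_step(a, p[i++]);   // finish the longer half
  while (j < len) b = sketch_step(b, p[j++]);
  return a == ACC && b == ACC;
}
```

Splitting on a boundary is what keeps the result exact: the whole input is valid if and only if both halves are.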

utf8_check returns true if src[0..len) is valid UTF-8. On failure, if cursor is non-NULL, sets *cursor to the byte offset of the first ill-formed sequence (the length of the maximal valid prefix). Uses the same dual-stream strategy as utf8_valid.

utf8_valid_ascii and utf8_check_ascii are drop-in replacements that use a single DFA stream with a 16-byte ASCII fast path. On each iteration the fast path checks whether the next 16 bytes are all ASCII; if so, it skips them without entering the DFA. When a non-ASCII byte is encountered the DFA processes bytes until it returns to the ACCEPT state, at which point the fast path is re-entered. Behaviour is identical to utf8_valid and utf8_check. Throughput advantage depends on content mix and microarchitecture; see the Performance section.
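One way such a 16-byte check can be written in portable C is with two 64-bit loads and a high-bit mask. This is a sketch with hypothetical names (`sketch_ascii16`, `sketch_skip_ascii`); the library's own fast path may use SIMD or a different formulation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// True if all 16 bytes at p are ASCII (high bit clear in every byte).
static bool sketch_ascii16(const char *p) {
  uint64_t lo, hi;
  memcpy(&lo, p, 8);          // memcpy sidesteps alignment and aliasing issues
  memcpy(&hi, p + 8, 8);
  return ((lo | hi) & UINT64_C(0x8080808080808080)) == 0;
}

// Skip leading ASCII in 16-byte steps; returns the number of bytes skipped.
// The caller would hand the remainder to the DFA.
static size_t sketch_skip_ascii(const char *src, size_t len) {
  assert(src != NULL || len == 0);
  size_t i = 0;
  while (len - i >= 16 && sketch_ascii16(src + i))
    i += 16;
  return i;
}
```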

utf8_maximal_subpart returns the length of the maximal subpart of the ill-formed sequence at the start of src[0..len), as defined by Unicode (see Error handling and U+FFFD replacement). The return value is always >= 1. Call this after utf8_check reports failure, with src advanced to the cursor position.

Streaming validation — `utf8_valid_stream.h`

Requires utf8_dfa32.h or utf8_dfa64.h.

typedef enum {
  UTF8_VALID_STREAM_OK,         // src fully consumed, no errors
  UTF8_VALID_STREAM_PARTIAL,    // src fully consumed, ends mid-sequence
  UTF8_VALID_STREAM_ILLFORMED,  // stopped at an ill-formed subsequence
  UTF8_VALID_STREAM_TRUNCATED,  // eof is true and src ends mid-sequence
} utf8_valid_stream_status_t;

typedef struct {
  utf8_valid_stream_status_t status;
  size_t consumed;            // bytes read from src
  size_t pending;             // bytes in an incomplete trailing sequence, else 0
  size_t advance;             // bytes to skip on ILLFORMED or TRUNCATED, else 0
  size_t carried;             // bytes from a previous chunk that belong to the same subpart
} utf8_valid_stream_result_t;

typedef struct {
  utf8_dfa_state_t state;
  size_t pending;
} utf8_valid_stream_t;

void   utf8_valid_stream_init(utf8_valid_stream_t *s);
utf8_valid_stream_result_t utf8_valid_stream_check(utf8_valid_stream_t *s,
                                                   const char *src, 
                                                   size_t len,
                                                   bool eof);

utf8_valid_stream_init initialises a stream validator. Call this before the first utf8_valid_stream_check.

utf8_valid_stream_check validates src[0..len) as the next chunk of a UTF-8 byte stream. eof should be true only for the final chunk. The DFA state is carried in utf8_valid_stream_t across calls.

If the chunk is well-formed and ends on a sequence boundary, status is OK. If the chunk ends in the middle of a sequence and eof is false, status is PARTIAL; pending is the number of trailing bytes in that incomplete sequence.

If validation stops at an ill-formed sequence, status is ILLFORMED. If eof is true and the chunk ends in the middle of a sequence, status is TRUNCATED.

On ILLFORMED or TRUNCATED, the stream state resets to ACCEPT automatically so the caller can continue without reinitialising.

On ILLFORMED or TRUNCATED, pending is 0. advance is the number of bytes in the current chunk that belong to the subpart, and carried is the number of bytes from previous chunks that belong to the same subpart. Resume validation at src[consumed + advance].

If the current chunk starts at absolute offset stream_offset, then stream_offset + consumed - carried is the error position, carried + advance is the subpart length, and stream_offset + consumed + advance is the resume position.

utf8_valid_stream_t s;
utf8_valid_stream_init(&s);

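char buf[4096];            // chunk size is illustrative
size_t len;
size_t stream_offset = 0;  // absolute offset of current chunk start
bool valid = true;

// read_chunk and handle_error are supplied by the application
while (valid && (len = read_chunk(buf, sizeof buf)) > 0) {
  bool eof = len < sizeof buf;  // assumes a short read marks the final chunk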
  utf8_valid_stream_result_t r = utf8_valid_stream_check(&s, buf, len, eof);

  switch (r.status) {
  case UTF8_VALID_STREAM_OK:
  case UTF8_VALID_STREAM_PARTIAL:
    stream_offset += len;
    break;
  case UTF8_VALID_STREAM_ILLFORMED:
  case UTF8_VALID_STREAM_TRUNCATED: {
    size_t error_pos   = stream_offset + r.consumed - r.carried;
    size_t subpart_len = r.carried + r.advance;
    size_t resume_pos  = stream_offset + r.consumed + r.advance;

    handle_error(error_pos, subpart_len);
    stream_offset = resume_pos;
    valid = false;
    break;
  }
  }
}

Codepoint count — `utf8_distance.h`

Requires utf8_dfa32.h or utf8_dfa64.h.

size_t utf8_distance(const char *src, size_t len);
size_t utf8_distance_ascii(const char *src, size_t len);

utf8_distance returns the number of Unicode codepoints in src[0..len), or (size_t)-1 if the input contains ill-formed UTF-8.

utf8_distance_ascii is a drop-in replacement with an 8-byte ASCII fast path that skips the DFA for chunks containing only ASCII bytes. Behaviour is identical to utf8_distance.

Forward navigation — `utf8_advance_forward.h`

Requires utf8_dfa32.h or utf8_dfa64.h.

size_t utf8_advance_forward(const char *src,
                            size_t len,
                            size_t distance,
                            size_t *advanced);
size_t utf8_advance_forward_ascii(const char *src,
                                  size_t len,
                                  size_t distance,
                                  size_t *advanced);

utf8_advance_forward returns the byte offset of the codepoint distance positions ahead within src[0..len). Returns len if distance exceeds the number of codepoints in the buffer. Returns (size_t)-1 if the input contains ill-formed UTF-8.

If advanced is non-NULL, sets *advanced to the number of codepoints actually skipped before stopping.

utf8_advance_forward_ascii is a drop-in replacement with an 8-byte ASCII fast path that skips the DFA for chunks containing only ASCII bytes. Behaviour is identical to utf8_advance_forward.

Backward navigation — `utf8_advance_backward.h`

Requires utf8_rdfa32.h or utf8_rdfa64.h.

size_t utf8_advance_backward(const char *src, 
                             size_t len,
                             size_t distance, 
                             size_t *advanced);
size_t utf8_advance_backward_ascii(const char *src,
                                   size_t len,
                                   size_t distance,
                                   size_t *advanced);

utf8_advance_backward returns the byte offset of the codepoint distance positions before the end of src[0..len). Returns 0 if distance exceeds the number of codepoints in the buffer. Returns (size_t)-1 if the input contains ill-formed UTF-8.

If advanced is non-NULL, sets *advanced to the number of codepoints actually skipped before stopping.

utf8_advance_backward_ascii is a drop-in replacement with an 8-byte ASCII fast path that skips the DFA for chunks containing only ASCII bytes. Behaviour is identical to utf8_advance_backward.

Forward decoding — `utf8_decode_next.h`

Requires utf8_dfa64.h.

int utf8_decode_next(const char *src, size_t len, uint32_t *codepoint);
int utf8_decode_next_replace(const char *src, size_t len, uint32_t *codepoint);

utf8_decode_next decodes the codepoint starting at src[0].

Success: returns bytes consumed (1–4) and writes the codepoint to *codepoint.
End of input (len == 0): returns 0; *codepoint is unchanged.
Ill-formed sequence: returns the negated length of the maximal subpart (see Error handling and U+FFFD replacement), in the range -1..-3. Advance by -return_value bytes and call again. *codepoint is unchanged.

utf8_decode_next_replace is identical but on error writes U+FFFD to *codepoint and returns the maximal subpart length as a positive value. Never returns a negative value; returns 0 only when len is 0.

uint32_t cp;
while (len > 0) {
  int n = utf8_decode_next_replace(src, len, &cp);
  process(cp); // U+FFFD for any ill-formed sequence
  src += n;
  len -= n;
}

Backward decoding — `utf8_decode_prev.h`

Requires utf8_rdfa64.h.

int utf8_decode_prev(const char *src, size_t len, uint32_t *codepoint);
int utf8_decode_prev_replace(const char *src, size_t len, uint32_t *codepoint);

utf8_decode_prev decodes the codepoint ending at src[len-1].

Success: returns bytes consumed (1–4) and writes the codepoint to *codepoint. Step back by the return value: next call uses src[0..len-return_value).
End of input (len == 0): returns 0; *codepoint is unchanged.
Ill-formed sequence: returns the negated number of bytes to step back, in the range -1..-3. Step back by -return_value bytes and call again. *codepoint is unchanged.

utf8_decode_prev_replace is identical but on error writes U+FFFD to *codepoint and returns the step-back distance as a positive value. Never returns a negative value; returns 0 only when len is 0.

The reverse DFA sees continuation bytes before the lead byte. For some ill-formed inputs (e.g. two lone continuation bytes \x80\x80), utf8_decode_prev may produce fewer U+FFFD substitutions than utf8_decode_next over the same bytes, because all bytes consumed before rejection are reported as one ill-formed unit rather than individually.

Transcoding — `utf8_transcode.h`

Requires utf8_dfa64.h.

typedef enum {
    UTF8_TRANSCODE_OK,         // src fully consumed, no errors
    UTF8_TRANSCODE_EXHAUSTED,  // dst full before src was consumed
    UTF8_TRANSCODE_ILLFORMED,  // stopped at an ill-formed sequence
    UTF8_TRANSCODE_TRUNCATED,  // src ends mid-sequence
} utf8_transcode_status_t;

typedef struct {
    utf8_transcode_status_t status;
    size_t consumed;   // bytes read from src
    size_t decoded;    // codepoints decoded from src
    size_t written;    // code units written to dst
    size_t advance;    // bytes to skip on ILLFORMED or TRUNCATED, else 0
} utf8_transcode_result_t;

utf8_transcode_result_t utf8_transcode_utf16(const char *src, size_t src_len,
                                             uint16_t *dst, size_t dst_len);
utf8_transcode_result_t utf8_transcode_utf16_replace(const char *src, size_t src_len,
                                                     uint16_t *dst, size_t dst_len);
utf8_transcode_result_t utf8_transcode_utf32(const char *src, size_t src_len,
                                             uint32_t *dst, size_t dst_len);
utf8_transcode_result_t utf8_transcode_utf32_replace(const char *src, size_t src_len,
                                                     uint32_t *dst, size_t dst_len);

utf8_transcode_utf16 and utf8_transcode_utf32 transcode src[0..src_len) into dst[0..dst_len), stopping at the first ill-formed or truncated sequence. Codepoints above U+FFFF are encoded as surrogate pairs in UTF-16 and consume two code units.

utf8_transcode_utf16_replace and utf8_transcode_utf32_replace replace each ill-formed sequence with U+FFFD and continue. Status is always OK or EXHAUSTED; advance is always 0.

On ILLFORMED or TRUNCATED, consumed is the byte offset of the ill-formed sequence and advance is the length of its maximal subpart (see Error handling and U+FFFD replacement). Resume transcoding at src[consumed + advance].

// Replace ill-formed sequences with U+FFFD
uint16_t dst[256];
utf8_transcode_result_t r;
do {
  r = utf8_transcode_utf16_replace(src, src_len, dst, sizeof dst / sizeof *dst);
  flush(dst, r.written);
  src     += r.consumed;
  src_len -= r.consumed;
} while (r.status == UTF8_TRANSCODE_EXHAUSTED);

// Strict: report each ill-formed sequence to the caller
while (src_len > 0) {
  r = utf8_transcode_utf16(src, src_len, dst, sizeof dst / sizeof *dst);
  flush(dst, r.written);
  src     += r.consumed;
  src_len -= r.consumed;

  if (r.status == UTF8_TRANSCODE_ILLFORMED || r.status == UTF8_TRANSCODE_TRUNCATED) {
    handle_error(src, r.advance);
    src     += r.advance;
    src_len -= r.advance;
    if (r.status == UTF8_TRANSCODE_TRUNCATED)
      break;
  }
  else if (r.status == UTF8_TRANSCODE_OK)
    break;
}

For UTF-32, result.written always equals result.decoded since each codepoint maps to exactly one code unit. For UTF-16, result.written may exceed result.decoded due to surrogate pairs for codepoints above U+FFFF.
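The surrogate-pair arithmetic behind that difference is standard UTF-16 encoding. The helper below, `sketch_encode_utf16`, is a hypothetical name for illustration and is not part of the library's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Encode one Unicode scalar value as UTF-16.
// Returns the number of code units written (1 or 2).
static size_t sketch_encode_utf16(uint32_t cp, uint16_t out[2]) {
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
  if (cp < 0x10000) {
    out[0] = (uint16_t)cp;                       // BMP: one code unit
    return 1;
  }
  cp -= 0x10000;                                 // 20 bits remain
  out[0] = (uint16_t)(0xD800 | (cp >> 10));      // high surrogate: top 10 bits
  out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));    // low surrogate: bottom 10 bits
  return 2;
}
```

For example, U+1F600 becomes the pair D83D DE00, so one decoded codepoint accounts for two written code units.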

Unsafe variants

The following headers provide validation-free implementations for use on input that has already been validated. They skip the DFA entirely and decode or count by inspecting lead/continuation byte structure directly. The caller must guarantee that src contains well-formed UTF-8; behaviour is undefined otherwise.

These are not drop-in replacements for the safe API — they never return error indicators. Use them on the output side of a validation boundary (e.g. after utf8_valid or utf8_check has accepted the input).

Unsafe codepoint count — `utf8_distance_unsafe.h`

No DFA backend required.

size_t utf8_distance_unsafe(const char *src, size_t len);

utf8_distance_unsafe returns the number of codepoints in src[0..len). Uses SIMD (when available) or SWAR to process blocks in bulk. Cannot fail.
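On well-formed input the codepoint count equals the number of bytes that are not continuation bytes, which lends itself to SWAR. The following is a sketch of that idea with a hypothetical name, `sketch_count_swar`; the library's actual block size and instruction selection may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Count codepoints in well-formed UTF-8 by counting continuation bytes
// (10xxxxxx) eight at a time and subtracting them from len.
static size_t sketch_count_swar(const char *src, size_t len) {
  const uint64_t HI = UINT64_C(0x8080808080808080);
  assert(src != NULL || len == 0);
  size_t i = 0, cont = 0;
  for (; len - i >= 8; i += 8) {
    uint64_t w;
    memcpy(&w, src + i, 8);
    // Bit 7 of each byte survives only if bit 7 is set and bit 6 is clear.
    uint64_t m = w & (~w << 1) & HI;
    // Move each flag to bit 0 of its byte, then sum the bytes into the top byte.
    cont += (size_t)(((m >> 7) * UINT64_C(0x0101010101010101)) >> 56);
  }
  for (; i < len; i++)                           // scalar tail
    cont += ((unsigned char)src[i] & 0xC0) == 0x80;
  return len - cont;
}
```

The per-byte operations make the result independent of endianness.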

Unsafe forward navigation — `utf8_advance_forward_unsafe.h`

No DFA backend required.

size_t utf8_advance_forward_unsafe(const char *src,
                                   size_t len,
                                   size_t distance,
                                   size_t *advanced);

utf8_advance_forward_unsafe returns the byte offset of the codepoint distance positions ahead in src[0..len), or len if distance exceeds the number of codepoints in the buffer. If advanced is non-NULL, writes the number of codepoints actually skipped. Uses SIMD/SWAR bulk counting internally.
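The documented contract can be illustrated with a simple scalar loop. This sketch (hypothetical name `sketch_advance_forward`) assumes well-formed input and does not attempt the SIMD/SWAR bulk counting mentioned above.

```c
#include <assert.h>
#include <stddef.h>

// Return the byte offset reached after skipping up to `distance` codepoints.
static size_t sketch_advance_forward(const char *src, size_t len,
                                     size_t distance, size_t *advanced) {
  assert(src != NULL || len == 0);
  size_t i = 0, n = 0;
  while (i < len && n < distance) {
    i++;                                         // step over the lead byte
    while (i < len && ((unsigned char)src[i] & 0xC0) == 0x80)
      i++;                                       // and its continuation bytes
    n++;
  }
  if (advanced != NULL)
    *advanced = n;                               // codepoints actually skipped
  return i;                                      // equals len if distance was too large
}
```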

Unsafe backward navigation — `utf8_advance_backward_unsafe.h`

No DFA backend required.

size_t utf8_advance_backward_unsafe(const char *src,
                                    size_t len,
                                    size_t distance,
                                    size_t *advanced);

utf8_advance_backward_unsafe returns the byte offset of the codepoint distance positions before the end of src[0..len), or 0 if distance exceeds the number of codepoints in the buffer. If advanced is non-NULL, writes the number of codepoints actually skipped. Uses SIMD/SWAR bulk counting internally.
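The backward counterpart follows the same pattern, again as a scalar sketch on well-formed input with a hypothetical name, `sketch_advance_backward`.

```c
#include <assert.h>
#include <stddef.h>

// Return the byte offset reached after stepping back up to `distance`
// codepoints from the end of src[0..len).
static size_t sketch_advance_backward(const char *src, size_t len,
                                      size_t distance, size_t *advanced) {
  assert(src != NULL || len == 0);
  size_t i = len, n = 0;
  while (i > 0 && n < distance) {
    i--;                                         // last byte of the sequence
    while (i > 0 && ((unsigned char)src[i] & 0xC0) == 0x80)
      i--;                                       // back up to its lead byte
    n++;
  }
  if (advanced != NULL)
    *advanced = n;                               // codepoints actually skipped
  return i;                                      // 0 if distance was too large
}
```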

Unsafe forward decoding — `utf8_decode_next_unsafe.h`

No DFA backend required.

int utf8_decode_next_unsafe(const char *src, size_t len, uint32_t *codepoint);

utf8_decode_next_unsafe decodes the codepoint starting at src[0]. Returns bytes consumed (1–4) and writes the codepoint to *codepoint. Returns 0 when len is 0.
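Decoding by lead-byte class alone looks roughly like this. The sketch uses a hypothetical name, `sketch_decode_next`, and like the real unsafe variant it trusts that the continuation bytes are present and well-formed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Decode one codepoint from well-formed UTF-8. The lead byte alone decides
// the sequence length; no range or truncation checks are made.
static int sketch_decode_next(const char *src, size_t len, uint32_t *cp) {
  assert(cp != NULL);
  if (len == 0) return 0;
  const unsigned char *p = (const unsigned char *)src;
  unsigned char b = p[0];
  if (b < 0x80) {                                // 0xxxxxxx
    *cp = b;
    return 1;
  }
  if (b < 0xE0) {                                // 110xxxxx 10xxxxxx
    *cp = ((uint32_t)(b & 0x1F) << 6) | (p[1] & 0x3F);
    return 2;
  }
  if (b < 0xF0) {                                // 1110xxxx + 2 continuations
    *cp = ((uint32_t)(b & 0x0F) << 12) | ((uint32_t)(p[1] & 0x3F) << 6)
        | (p[2] & 0x3F);
    return 3;
  }
  *cp = ((uint32_t)(b & 0x07) << 18) | ((uint32_t)(p[1] & 0x3F) << 12)
      | ((uint32_t)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);   // 11110xxx + 3
  return 4;
}
```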

Unsafe backward decoding — `utf8_decode_prev_unsafe.h`

No DFA backend required.

int utf8_decode_prev_unsafe(const char *src, size_t len, uint32_t *codepoint);

utf8_decode_prev_unsafe decodes the codepoint ending at src[len-1]. Returns bytes consumed (1–4) and writes the codepoint to *codepoint. Returns 0 when len is 0.

Unsafe transcoding — `utf8_transcode_unsafe.h`

No DFA backend required.

utf8_transcode_result_t utf8_transcode_utf32_unsafe(const char *src, size_t src_len,
                                                    uint32_t *dst, size_t dst_len);
utf8_transcode_result_t utf8_transcode_utf16_unsafe(const char *src, size_t src_len,
                                                    uint16_t *dst, size_t dst_len);

utf8_transcode_utf32_unsafe and utf8_transcode_utf16_unsafe transcode src[0..src_len) into dst[0..dst_len). Status is always OK or EXHAUSTED; advance is always 0. The result struct is identical to the safe variants for easy interchangeability.

Both functions process 8 bytes at a time when all 8 are ASCII (widening without decoding) and batch consecutive sequences of the same length class to avoid re-entering the lead-byte classifier. For UTF-16, codepoints above U+FFFF are encoded as surrogate pairs.

Error handling and U+FFFD replacement

When processing untrusted input, either replace every ill-formed sequence with U+FFFD (REPLACEMENT CHARACTER) or reject the input outright. Silently skipping ill-formed sequences can have security implications; see Unicode Technical Report #36, Unicode Security Considerations.

Maximal subpart

The number of U+FFFD characters to emit per ill-formed sequence is determined by the maximal subpart rule, defined in Unicode 17.0 Table 3-8. A maximal subpart is the longest prefix of an ill-formed sequence that is either the start of an otherwise well-formed sequence, or a single byte. Each maximal subpart produces exactly one U+FFFD.

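For example, the byte sequence \xF0\x90\x80 is a truncated 4-byte sequence: \xF0 is a valid 4-byte lead followed by two valid continuation bytes (the first within the 90–BF range that \xF0 requires), but a third continuation byte is missing. All three bytes form one maximal subpart and produce one U+FFFD. Two lone continuation bytes \x80\x80, on the other hand, each form their own maximal subpart of length 1 and produce two U+FFFD. Likewise, \xF0\x80\x80 produces three U+FFFD: \x80 cannot follow \xF0 (it would be a non-shortest form), so the lead byte alone is a maximal subpart and each \x80 is another.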
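The rule can be sketched directly from the well-formed byte ranges. The function below, `sketch_maximal_subpart`, is a hypothetical standalone illustration and not the library's implementation; it assumes src points at the start of an ill-formed sequence.

```c
#include <assert.h>
#include <stddef.h>

// Length of the maximal subpart of the ill-formed sequence at src[0].
// Precondition: len > 0 and src[0..len) begins with an ill-formed sequence.
static size_t sketch_maximal_subpart(const char *src, size_t len) {
  assert(src != NULL && len > 0);
  const unsigned char *p = (const unsigned char *)src;
  unsigned char b0 = p[0], lo = 0x80, hi = 0xBF;
  size_t need;                                   // continuation bytes the lead expects
  if (b0 >= 0xC2 && b0 <= 0xDF) {
    need = 1;
  } else if (b0 >= 0xE0 && b0 <= 0xEF) {
    need = 2;
    if (b0 == 0xE0) lo = 0xA0;                   // no non-shortest forms
    if (b0 == 0xED) hi = 0x9F;                   // no surrogates
  } else if (b0 >= 0xF0 && b0 <= 0xF4) {
    need = 3;
    if (b0 == 0xF0) lo = 0x90;                   // no non-shortest forms
    if (b0 == 0xF4) hi = 0x8F;                   // nothing above U+10FFFF
  } else {
    return 1;                                    // stray continuation or invalid lead
  }
  size_t n = 1;
  while (n <= need && n < len && p[n] >= lo && p[n] <= hi) {
    n++;
    lo = 0x80;                                   // only the second byte is restricted
    hi = 0xBF;
  }
  return n;
}
```

A caller emitting replacements would write one U+FFFD and advance by the returned length.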

Standards requirements

Unicode 17.0 §3.9 recommends, and the WHATWG Encoding Standard requires, that decoders replace each maximal subpart of an ill-formed sequence with exactly one U+FFFD. This is the behaviour implemented by utf8_decode_next_replace, utf8_transcode_utf16_replace, and utf8_transcode_utf32_replace.

The non-replace variants (utf8_decode_next, utf8_transcode_utf16, utf8_transcode_utf32) stop and report the error position instead. These are intended for applications that need to handle ill-formed input explicitly, such as validating input at a trust boundary, logging the error location, or applying a custom substitution policy. If you have no such requirement, prefer the _replace variants.

Security

The guarantees below apply to the safe (DFA-based) API. The _unsafe variants perform no validation and assume the caller has already verified the input; passing ill-formed UTF-8 to an unsafe function is undefined behaviour.

All decoded codepoints are within the Unicode scalar value range. The DFA structurally rejects non-shortest-form encodings, surrogate halves (U+D800–U+DFFF), and codepoints above U+10FFFF. These cannot appear in the output of any safe decoding or transcoding function.

No dynamic allocation. All functions (safe and unsafe) operate on caller-supplied buffers with no heap allocation, no global mutable state, and no use of errno.

No data-dependent branches on byte value. The DFA step is a table lookup and bitwise shift with no conditional branches that depend on the input byte. Execution time scales with the number of bytes processed, not their values. Note that the _ascii variants are exceptions; they branch on whether a chunk contains only ASCII bytes, making their execution time content-dependent. The _unsafe variants also use content-dependent branching (lead-byte classification) since they are intended for performance-critical paths where the input is already trusted.

Performance

Corpus

Documents are retrieved from Wikipedia and converted to plain text, available in benchmark/corpus/.

File	Size	Code points	Distribution	Best `utf8_valid`	Best `utf8_valid_ascii`
ar.txt	25 KB	14K	19% ASCII, 81% 2-byte	6725 MB/s	5349 MB/s
el.txt	102 KB	59K	23% ASCII, 77% 2-byte	6637 MB/s	5254 MB/s
en.txt	80 KB	82K	99.9% ASCII	6582 MB/s	41071 MB/s
ja.txt	176 KB	65K	11% ASCII, 89% 3-byte	6584 MB/s	5478 MB/s
lv.txt	135 KB	127K	92% ASCII, 7% 2-byte	6600 MB/s	6445 MB/s
ru.txt	148 KB	85K	23% ASCII, 77% 2-byte	6601 MB/s	4154 MB/s
sv.txt	94 KB	93K	96% ASCII, 4% 2-byte	6646 MB/s	9199 MB/s

Best numbers from -O2 -march=x86-64-v3 (Raptor Lake). utf8_valid uses dual-stream validation; see Observations.

Raptor Lake (Clang 20, x86-64)

Flags	`utf8_valid`	`utf8_valid_ascii`	Notes
`-O2`	4107 MB/s	2986 MB/s	dual-stream ILP effective without BMI2
`-O2 -march=x86-64-v3`	6394 MB/s	4175 MB/s	BMI2 `SHRX` on two ports
`-O3 -march=x86-64-v3`	6471 MB/s	5349 MB/s	fast path not profitable on multibyte

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches 29–35 GB/s at all optimization levels.

Raptor Lake (GCC 14, x86-64)

Flags	`utf8_valid`	`utf8_valid_ascii`	Notes
`-O2`	3714 MB/s	2774 MB/s	dual-stream ILP effective without BMI2
`-O2 -march=x86-64-v3`	6725 MB/s	4125 MB/s	BMI2 `SHRX` on two ports
`-O3 -march=x86-64-v3`	6489 MB/s	4164 MB/s	fast path not profitable on multibyte

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches 36–41 GB/s at all optimization levels.

Haswell (Clang 22, x86-64)

Flags	`utf8_valid`	`utf8_valid_ascii`	Notes
`-O2`	2370 MB/s	1756 MB/s	narrow backend limits ILP gain
`-O2 -march=x86-64-v3`	3674 MB/s	2626 MB/s	BMI2 `SHRX`
`-O3 -march=x86-64-v3`	3682 MB/s	3328 MB/s	gap narrows at `-O3`

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches 16–20 GB/s at all optimization levels.

Apple M1 Pro (Clang 21, AArch64)

Flags	`utf8_valid`	`utf8_valid_ascii`	Notes
`-O2`	4498 MB/s	2877 MB/s	dual-stream ILP on wide Firestorm
`-O2 -mtune=apple-m1`	4356 MB/s	2867 MB/s	mtune negligible
`-O3`	4445 MB/s	5231 MB/s	`-O3` unlocks NEON fast path
`-O3 -mtune=apple-m1`	4228 MB/s	4866 MB/s	fast path profitable on all content

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches ~21 GB/s at -O3.

Apple M1 Pro (GCC 15, AArch64)

Flags	`utf8_valid`	`utf8_valid_ascii`	Notes
`-O2`	4206 MB/s	2742 MB/s	dual-stream ILP on wide Firestorm
`-O2 -mtune=apple-m1`	4214 MB/s	2738 MB/s	mtune no effect
`-O3`	4506 MB/s	2757 MB/s	`-O3` improves over `-O2`
`-O3 -mtune=apple-m1`	4453 MB/s	2881 MB/s	fast path not profitable on multibyte

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches ~36 GB/s at -O3.

Observations

utf8_valid uses a dual-stream validation strategy: the input is split at a UTF-8 sequence boundary near the midpoint and two independent DFA chains run in a single interleaved loop. Since the chains have no data dependency, the CPU's out-of-order engine can overlap both shift operations on cores with multiple shift-capable execution ports.
On wide-issue cores (Raptor Lake P-cores with BMI2, Apple M1 Firestorm), dual-stream reaches approximately 1.4 bytes per clock cycle at peak throughput.
On x86, -march=x86-64-v3 or -march=native enables BMI2 SHRX, which removes the variable-shift dependency on CL.
utf8_valid_ascii is profitable on high-ASCII content across all tested platforms. On multibyte-heavy content it is generally slower than utf8_valid, with the exception of Apple M1 with Clang -O3 where the NEON fast path keeps it competitive.

Benchmark

The benchmark is in benchmark/bench.c. Compile and run with:

make bench
make bench BENCH_OPTFLAGS="-O3 -march=x86-64-v3"
./bench                        # benchmarks benchmark/corpus/
./bench -d <directory>         # benchmarks all .txt files in directory
./bench -f <file>              # benchmarks single file
./bench -s <MB>                # resize input to <MB> before benchmarking

Benchmark mode (mutually exclusive):

Flag	Description
`-t <secs>`	run each implementation for `<secs>` seconds (default: 20)
`-n <reps>`	run each implementation for `<reps>` repetitions
`-b <MB>`	run each implementation until `<MB>` total data processed

Warmup (-w <n>): before timing, each implementation is run for n iterations to warm up caches and branch predictors. By default the warmup count is derived from input size, targeting approximately 256 MB of warmup data, capped between 1 and 100 iterations. Use -w 0 to disable warmup entirely.

Output format: For each file, the header line shows the filename, byte size, code point count, and average bytes per code point (units/point). The code point distribution breaks down the input by Unicode range. Results are sorted slowest to fastest; the multiplier shows throughput relative to the slowest implementation.

sv.txt: 94 KB; 93K code points; 1.04 units/point
  U+0000..U+007F          90K  96.4%
  U+0080..U+07FF           3K   3.5%
  U+0800..U+FFFF          171   0.2%
  hoehrmann                    432 MB/s
  utf8_valid_old              1780 MB/s  (4.12x)
  utf8_valid                  4449 MB/s  (10.29x)
  utf8_valid_ascii            9968 MB/s  (23.05x)

units/point is a rough content-mix indicator: 1.00 is near-pure ASCII, ~1.7–1.9 is common for 2-byte-heavy text, and ~2.7–3.0 for CJK-heavy text.

The benchmark includes two reference implementations, hoehrmann (a widely used table-driven DFA implementation) and utf8_valid_old (the previous scalar decoder), to track regressions and quantify gains from the current DFA approach.
