chansen/c-utf8

c-utf8

A header-only UTF-8 library in C implementing validation, decoding, and transcoding conforming to the Unicode and ISO/IEC 10646 specifications.

Usage

#include "utf8_dfa32.h"   // or utf8_dfa64.h for the 64-bit variant
#include "utf8_valid.h"

// Check if a string is valid UTF-8
bool ok = utf8_valid(src, len);

// Check and find the error position
size_t cursor;
if (!utf8_check(src, len, &cursor)) {
  // cursor is the byte offset of the first ill-formed sequence
}

// Length of the maximal subpart of an ill-formed sequence
size_t n = utf8_maximal_subpart(src, len);

Requirements

  • C99 or later

DFA backends

All routines in this library are driven by a Deterministic Finite Automaton (DFA) that consumes one byte at a time and transitions between a small set of states. Four backend headers are provided; you choose one or two based on what the consumer headers you include actually need.

Forward DFA

The forward DFA scans bytes left-to-right starting from the ACCEPT state. Each byte indexes into a 256-entry lookup table and produces the next state. Returning to ACCEPT marks the end of a complete valid sequence. Reaching REJECT means an ill-formed byte was encountered; that state is a permanent trap and no further input can leave it.

The 9 forward DFA states are:

| State | Meaning |
| --- | --- |
| ACCEPT | Start / valid sequence boundary |
| REJECT | Ill-formed input (absorbing trap) |
| TAIL1 | Awaiting 1 more continuation byte |
| TAIL2 | Awaiting 2 more continuation bytes |
| TAIL3 | Awaiting 3 more continuation bytes |
| E0 | After 0xE0; next must be A0–BF (reject non-shortest form) |
| ED | After 0xED; next must be 80–9F (reject surrogates) |
| F0 | After 0xF0; next must be 90–BF (reject non-shortest form) |
| F4 | After 0xF4; next must be 80–8F (reject above U+10FFFF) |

The four special states (E0, ED, F0, F4) enforce Unicode constraints that a naive continuation-byte count cannot express: non-shortest-form encodings, surrogate halves (U+D800–U+DFFF), and codepoints above U+10FFFF are all structurally rejected.
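The transitions in the table can be sketched as a plain state machine. This is a minimal illustration rather than the library's table-driven code: the sketch_ and ST_ names and the switch-based step are assumptions made for readability.

```c
#include <stdbool.h>
#include <stddef.h>

// State names follow the table above; the numeric values are arbitrary.
enum { ST_ACCEPT, ST_REJECT, ST_TAIL1, ST_TAIL2, ST_TAIL3,
       ST_E0, ST_ED, ST_F0, ST_F4 };

// True if lo <= b <= hi.
static bool in_range(unsigned char b, unsigned char lo, unsigned char hi) {
  return b >= lo && b <= hi;
}

// One forward transition for byte b.
static int sketch_step(int state, unsigned char b) {
  switch (state) {
  case ST_ACCEPT:
    if (b <= 0x7F) return ST_ACCEPT;
    if (in_range(b, 0xC2, 0xDF)) return ST_TAIL1;
    if (b == 0xE0) return ST_E0;
    if (b == 0xED) return ST_ED;
    if (in_range(b, 0xE1, 0xEF)) return ST_TAIL2;  // E0 and ED handled above
    if (b == 0xF0) return ST_F0;
    if (in_range(b, 0xF1, 0xF3)) return ST_TAIL3;
    if (b == 0xF4) return ST_F4;
    return ST_REJECT;
  case ST_TAIL1: return in_range(b, 0x80, 0xBF) ? ST_ACCEPT : ST_REJECT;
  case ST_TAIL2: return in_range(b, 0x80, 0xBF) ? ST_TAIL1 : ST_REJECT;
  case ST_TAIL3: return in_range(b, 0x80, 0xBF) ? ST_TAIL2 : ST_REJECT;
  case ST_E0:    return in_range(b, 0xA0, 0xBF) ? ST_TAIL1 : ST_REJECT;
  case ST_ED:    return in_range(b, 0x80, 0x9F) ? ST_TAIL1 : ST_REJECT;
  case ST_F0:    return in_range(b, 0x90, 0xBF) ? ST_TAIL2 : ST_REJECT;
  case ST_F4:    return in_range(b, 0x80, 0x8F) ? ST_TAIL2 : ST_REJECT;
  default:       return ST_REJECT;  // REJECT is a trap
  }
}

// Valid only if the scan ends on a sequence boundary.
bool sketch_valid(const char *src, size_t len) {
  int state = ST_ACCEPT;
  for (size_t i = 0; i < len; i++)
    state = sketch_step(state, (unsigned char)src[i]);
  return state == ST_ACCEPT;
}
```

With this sketch, \xED\xA0\x80 (a surrogate) and \xF4\x90\x80\x80 (above U+10FFFF) are rejected at the second byte, without decoding the codepoint.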

Reverse DFA

The reverse DFA scans bytes right-to-left. It is used for backward navigation (moving a cursor back by N codepoints) and for backward decoding.

Going forward you see the lead byte first and know immediately how many continuation bytes follow and what constraints apply to them. Going in reverse you see continuation bytes before the lead, so the DFA tracks how many have been seen and remembers the range of the outermost continuation byte (the one adjacent to the lead), because that range determines which lead bytes are valid.

The reverse DFA uses 7 states, 2 fewer than the forward DFA:

| State | Meaning |
| --- | --- |
| ACCEPT | Start / valid sequence boundary |
| REJECT | Ill-formed input (absorbing trap) |
| TL1 | Seen 1 continuation byte |
| TL2LO | Seen 2 continuations; outermost was 80–9F |
| TL2HI | Seen 2 continuations; outermost was A0–BF |
| TL3LO | Seen 3 continuations; outermost was 80–8F |
| TL3MH | Seen 3 continuations; outermost was 90–BF |

The LO/HI and LO/MH split on the outermost continuation range enforces the same constraints as the forward DFA's E0, ED, F0, and F4 states, applied when the lead byte finally arrives rather than when it is first seen.
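A rough sketch of the reverse transitions, derived from the state table and the forward constraints, is shown below. The sketch_ and R_ names are hypothetical, and the switch-based form is an assumption for illustration; the library itself uses packed lookup tables.

```c
#include <stddef.h>

enum { R_ACCEPT, R_REJECT, R_TL1, R_TL2LO, R_TL2HI, R_TL3LO, R_TL3MH };

static int is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

// One reverse transition: b is the byte to the left of those already seen.
static int sketch_rstep(int state, unsigned char b) {
  switch (state) {
  case R_ACCEPT:
    if (b <= 0x7F) return R_ACCEPT;
    return is_cont(b) ? R_TL1 : R_REJECT;  // a lead with no continuation is ill-formed
  case R_TL1:
    if (b >= 0xC2 && b <= 0xDF) return R_ACCEPT;  // 2-byte lead
    if (is_cont(b)) return b <= 0x9F ? R_TL2LO : R_TL2HI;
    return R_REJECT;
  case R_TL2LO:  // outermost 80-9F: E0 would be non-shortest form
  case R_TL2HI:  // outermost A0-BF: ED would be a surrogate
    if (b == 0xE0) return state == R_TL2HI ? R_ACCEPT : R_REJECT;
    if (b == 0xED) return state == R_TL2LO ? R_ACCEPT : R_REJECT;
    if (b >= 0xE1 && b <= 0xEF) return R_ACCEPT;
    if (is_cont(b)) return b <= 0x8F ? R_TL3LO : R_TL3MH;
    return R_REJECT;
  case R_TL3LO:  // outermost 80-8F: F0 would be non-shortest form
  case R_TL3MH:  // outermost 90-BF: F4 would exceed U+10FFFF
    if (b == 0xF0) return state == R_TL3MH ? R_ACCEPT : R_REJECT;
    if (b == 0xF4) return state == R_TL3LO ? R_ACCEPT : R_REJECT;
    if (b >= 0xF1 && b <= 0xF3) return R_ACCEPT;
    return R_REJECT;
  default:
    return R_REJECT;
  }
}

// Byte length (1-4) of the well-formed sequence ending at src[len-1],
// or 0 if len is 0 or the trailing bytes are ill-formed.
size_t sketch_prev_len(const char *src, size_t len) {
  int state = R_ACCEPT;
  for (size_t n = 1; n <= len; n++) {
    state = sketch_rstep(state, (unsigned char)src[len - n]);
    if (state == R_ACCEPT) return n;
    if (state == R_REJECT) return 0;
  }
  return 0;  // ran out of input before reaching a lead byte
}
```

Note that once a third continuation byte is seen, the LO/HI distinction of the second one is dropped: only the byte adjacent to the lead constrains which lead is acceptable.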

Shift-based DFA encoding

All four backends implement the shift-based DFA approach originally described by Per Vognsen. Each byte maps to a row in a 256-entry table. The row is a packed integer where each state's transition target is stored at a fixed bit offset. The current state value is used directly as the shift amount to extract the next state, avoiding a second lookup. The inner loop is a table load, a shift, and a mask:

state = (table[byte] >> state) & mask;

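The error state is fixed at offset 0. A transition to the error state from a state at bit offset s contributes (0 << s) = 0 to the row value, so such transitions set no bits in the row. The error state's own slot at offset 0 likewise extracts 0, which is the error state itself, so once entered, no byte can leave it.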

The 32-bit row packing technique was derived by Dougall Johnson; state offsets are chosen by an SMT solver (tool/smt_solver.py) to pack all transitions without collision inside a uint32_t.

The 64-bit forward backend uses state offsets at fixed multiples of 6 (6, 12, 18, ..., 48), packing all transitions into 54 bits of a uint64_t. Bits 56–62 store a per-byte payload mask that the decode step uses to accumulate codepoint bits in the same pass:

*codepoint = (*codepoint << 6) | (byte & (row >> 56));
state      = (row >> state) & 63;

ASCII rows carry mask 0x7F, 2/3/4-byte lead rows carry 0x1F/0x0F/0x07, continuation rows carry 0x3F, and error rows carry 0x00.
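The 64-bit layout can be sketched by building the table at run time. This is an illustration under stated assumptions, not the shipped table: the assignment of states to particular offsets (other than error at 0), the sketch_ names, and building the rows in a loop are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

// Assumed layout: error at bit offset 0, the other eight states at 6..48.
enum { S_ERR = 0, S_ACC = 6, S_T1 = 12, S_T2 = 18, S_T3 = 24,
       S_E0 = 30, S_ED = 36, S_F0 = 42, S_F4 = 48 };

static uint64_t sketch_table[256];

// Next state for (state, byte), written as plain range checks.
static uint64_t next_state(int s, unsigned b) {
  int cont = b >= 0x80 && b <= 0xBF;
  switch (s) {
  case S_ACC:
    if (b <= 0x7F) return S_ACC;
    if (b >= 0xC2 && b <= 0xDF) return S_T1;
    if (b == 0xE0) return S_E0;
    if (b == 0xED) return S_ED;
    if (b >= 0xE1 && b <= 0xEF) return S_T2;
    if (b == 0xF0) return S_F0;
    if (b >= 0xF1 && b <= 0xF3) return S_T3;
    if (b == 0xF4) return S_F4;
    return S_ERR;
  case S_T1: return cont ? S_ACC : S_ERR;
  case S_T2: return cont ? S_T1 : S_ERR;
  case S_T3: return cont ? S_T2 : S_ERR;
  case S_E0: return b >= 0xA0 && b <= 0xBF ? S_T1 : S_ERR;
  case S_ED: return b >= 0x80 && b <= 0x9F ? S_T1 : S_ERR;
  case S_F0: return b >= 0x90 && b <= 0xBF ? S_T2 : S_ERR;
  case S_F4: return b >= 0x80 && b <= 0x8F ? S_T2 : S_ERR;
  default:   return S_ERR;
  }
}

// Payload mask stored in the top byte of each row.
static uint64_t payload_mask(unsigned b) {
  if (b <= 0x7F) return 0x7F;                // ASCII
  if (b <= 0xBF) return 0x3F;                // continuation
  if (b >= 0xC2 && b <= 0xDF) return 0x1F;   // 2-byte lead
  if (b >= 0xE0 && b <= 0xEF) return 0x0F;   // 3-byte lead
  if (b >= 0xF0 && b <= 0xF4) return 0x07;   // 4-byte lead
  return 0x00;                               // never valid
}

// Packs every state's transition target at that state's own bit offset.
void sketch_build_table(void) {
  for (unsigned b = 0; b < 256; b++) {
    uint64_t row = payload_mask(b) << 56;
    for (int s = S_ERR; s <= S_F4; s += 6)
      row |= next_state(s, b) << s;
    sketch_table[b] = row;
  }
}

// Decodes one codepoint from src[0..len); returns bytes consumed, or 0 on
// ill-formed or truncated input.
size_t sketch_decode(const char *src, size_t len, uint32_t *codepoint) {
  uint64_t state = S_ACC;
  uint32_t cp = 0;
  for (size_t i = 0; i < len; i++) {
    unsigned char byte = (unsigned char)src[i];
    uint64_t row = sketch_table[byte];
    cp = (cp << 6) | (uint32_t)(byte & (row >> 56));
    state = (row >> state) & 63;
    if (state == S_ACC) { *codepoint = cp; return i + 1; }
    if (state == S_ERR) return 0;
  }
  return 0;
}
```

Call sketch_build_table() once before decoding. Because every error transition stores 0, the rows for bytes that are never valid (such as 0xC0 or 0xFF) are entirely zero in this layout.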

The 64-bit reverse backend uses the same payload mask scheme. The sequence depth is encoded in the state value itself, so the bit shift for codepoint accumulation can be derived without a separate lookup:

shift      = ((state >> 2) & 3) * 6;
*codepoint = *codepoint | (byte & (row >> 56)) << shift;
state      = (row >> state) & 63;

Choosing a backend

Forward backends

| Header | Row type | Table size | Decoding |
| --- | --- | --- | --- |
| utf8_dfa32.h | uint32_t | 1 KB | No |
| utf8_dfa64.h | uint64_t | 2 KB | Yes |

Use utf8_dfa32.h for validation-only work: utf8_valid.h, utf8_valid_stream.h, utf8_distance.h, utf8_advance_forward.h.

Use utf8_dfa64.h when codepoint decoding or transcoding is needed: utf8_decode_next.h, utf8_transcode.h.

Reverse backends

| Header | Row type | Table size | Decoding |
| --- | --- | --- | --- |
| utf8_rdfa32.h | uint32_t | 1 KB | No |
| utf8_rdfa64.h | uint64_t | 2 KB | Yes |

Use utf8_rdfa32.h for backward navigation: utf8_advance_backward.h.

Use utf8_rdfa64.h for backward decoding: utf8_decode_prev.h.

Summary

The 64-bit variants are supersets of the 32-bit variants. Include exactly one forward backend and, if you need backward traversal, exactly one reverse backend.

| You need | Forward | Reverse |
| --- | --- | --- |
| Validate | utf8_dfa{32,64}.h | |
| Count codepoints | utf8_dfa{32,64}.h | |
| Streaming validation | utf8_dfa{32,64}.h | |
| Skip forward N codepoints | utf8_dfa{32,64}.h | |
| Decode forward | utf8_dfa64.h | |
| Transcode to UTF-16/32 | utf8_dfa64.h | |
| Skip backward N codepoints | | utf8_rdfa{32,64}.h |
| Decode backward | | utf8_rdfa64.h |

API

The library is split into independent header files. The safe API headers each require exactly one DFA backend to be included first. The unsafe variants require no DFA backend and operate on pre-validated input only.


Validation — utf8_valid.h

Requires utf8_dfa32.h or utf8_dfa64.h.

bool   utf8_valid(const char *src, size_t len);
bool   utf8_valid_ascii(const char *src, size_t len);
bool   utf8_check(const char *src, size_t len, size_t *cursor);
bool   utf8_check_ascii(const char *src, size_t len, size_t *cursor);
size_t utf8_maximal_subpart(const char *src, size_t len);

utf8_valid returns true if src[0..len) is valid UTF-8. Internally uses dual-stream validation: the input is split at a sequence boundary near the midpoint and two independent DFA chains run in a single interleaved loop, exploiting instruction-level parallelism on wide-issue cores.

utf8_check returns true if src[0..len) is valid UTF-8. On failure, if cursor is non-NULL, sets *cursor to the byte offset of the first ill-formed sequence (the length of the maximal valid prefix). Uses the same dual-stream strategy as utf8_valid.
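The dual-stream idea can be sketched in a self-contained form as follows. This is an assumption-laden illustration, not the library's code: the sketch_ and D_ names are hypothetical, the step function uses range checks instead of a packed table, and the split heuristic (back up over at most three continuation bytes) is one plausible way to land on a sequence boundary.

```c
#include <stdbool.h>
#include <stddef.h>

enum { D_ACCEPT, D_REJECT, D_TAIL1, D_TAIL2, D_TAIL3, D_E0, D_ED, D_F0, D_F4 };

// One forward transition, using a per-state range for the expected byte.
static int dual_step(int s, unsigned char b) {
  // Indexed by state; the REJECT entry has an empty range (lo > hi).
  static const unsigned char lo[] = {0, 1, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80};
  static const unsigned char hi[] = {0, 0, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F};
  static const unsigned char to[] = {0, D_REJECT, D_ACCEPT, D_TAIL1, D_TAIL2,
                                     D_TAIL1, D_TAIL1, D_TAIL2, D_TAIL2};
  if (s != D_ACCEPT)
    return (b >= lo[s] && b <= hi[s]) ? to[s] : D_REJECT;
  if (b <= 0x7F) return D_ACCEPT;
  if (b >= 0xC2 && b <= 0xDF) return D_TAIL1;
  if (b == 0xE0) return D_E0;
  if (b == 0xED) return D_ED;
  if (b >= 0xE1 && b <= 0xEF) return D_TAIL2;
  if (b == 0xF0) return D_F0;
  if (b >= 0xF1 && b <= 0xF3) return D_TAIL3;
  if (b == 0xF4) return D_F4;
  return D_REJECT;
}

bool sketch_valid_dual(const char *src, size_t len) {
  const unsigned char *p = (const unsigned char *)src;
  // Split near the midpoint, backing up over at most 3 continuation bytes
  // so that, for valid input, the second stream starts on a sequence boundary.
  size_t mid = len / 2;
  for (int k = 0; k < 3 && mid > 0 && (p[mid] & 0xC0) == 0x80; k++)
    mid--;

  int s1 = D_ACCEPT, s2 = D_ACCEPT;
  size_t n1 = mid, n2 = len - mid, i = 0;
  size_t common = n1 < n2 ? n1 : n2;
  // Interleaved loop: the two chains share no data dependency.
  for (; i < common; i++) {
    s1 = dual_step(s1, p[i]);
    s2 = dual_step(s2, p[mid + i]);
  }
  // Finish whichever half is longer.
  for (size_t j = i; j < n1; j++) s1 = dual_step(s1, p[j]);
  for (size_t j = i; j < n2; j++) s2 = dual_step(s2, p[mid + j]);
  return s1 == D_ACCEPT && s2 == D_ACCEPT;
}
```

The split is safe for ill-formed input too: the concatenation of two valid UTF-8 strings is valid, so if both halves are accepted the whole buffer is well-formed.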

utf8_valid_ascii and utf8_check_ascii are drop-in replacements that use a single DFA stream with a 16-byte ASCII fast path. On each iteration the fast path checks whether the next 16 bytes are all ASCII; if so, it skips them without entering the DFA. When a non-ASCII byte is encountered the DFA processes bytes until it returns to the ACCEPT state, at which point the fast path is re-entered. Behaviour is identical to utf8_valid and utf8_check. Throughput advantage depends on content mix and microarchitecture; see the Performance section.
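The 16-byte ASCII test can be sketched with two 64-bit loads. This is a portable illustration only: the sketch_ names are hypothetical, and the library may use a different block test (for example SIMD) internally.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// True if all 16 bytes at p are ASCII (high bit clear in every byte).
static bool block16_is_ascii(const char *p) {
  uint64_t a, b;
  memcpy(&a, p, 8);      // memcpy sidesteps alignment and aliasing issues
  memcpy(&b, p + 8, 8);
  return ((a | b) & UINT64_C(0x8080808080808080)) == 0;
}

// Number of leading bytes, counted in whole 16-byte blocks, that are ASCII.
// A validator would hand the remainder to the DFA and retry the fast path
// once the DFA is back in the ACCEPT state.
size_t sketch_skip_ascii(const char *src, size_t len) {
  size_t i = 0;
  while (len - i >= 16 && block16_is_ascii(src + i))
    i += 16;
  return i;
}
```

The branch on the block test is what makes the _ascii variants content-dependent: a buffer with one non-ASCII byte in every block never takes the fast path.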

utf8_maximal_subpart returns the length of the maximal subpart of the ill-formed sequence that starts at src[0], looking no further than src[0..len), as defined by Unicode (see Error handling and U+FFFD replacement). The return value is always >= 1. Call this after utf8_check reports failure, with src advanced to the cursor position and len reduced by the same amount.


Streaming validation — utf8_valid_stream.h

Requires utf8_dfa32.h or utf8_dfa64.h.

typedef enum {
  UTF8_VALID_STREAM_OK,         // src fully consumed, no errors
  UTF8_VALID_STREAM_PARTIAL,    // src fully consumed, ends mid-sequence
  UTF8_VALID_STREAM_ILLFORMED,  // stopped at an ill-formed subsequence
  UTF8_VALID_STREAM_TRUNCATED,  // eof is true and src ends mid-sequence
} utf8_valid_stream_status_t;

typedef struct {
  utf8_valid_stream_status_t status;
  size_t consumed;            // bytes read from src
  size_t pending;             // bytes in an incomplete trailing sequence, else 0
  size_t advance;             // bytes to skip on ILLFORMED or TRUNCATED, else 0
  size_t carried;             // bytes from a previous chunk that belong to the same subpart
} utf8_valid_stream_result_t;

typedef struct {
  utf8_dfa_state_t state;
  size_t pending;
} utf8_valid_stream_t;

void   utf8_valid_stream_init(utf8_valid_stream_t *s);
utf8_valid_stream_result_t utf8_valid_stream_check(utf8_valid_stream_t *s,
                                                   const char *src, 
                                                   size_t len,
                                                   bool eof);

utf8_valid_stream_init initialises a stream validator. Call this before the first utf8_valid_stream_check.

utf8_valid_stream_check validates src[0..len) as the next chunk of a UTF-8 byte stream. eof should be true only for the final chunk. The DFA state is carried in utf8_valid_stream_t across calls.

If the chunk is well-formed and ends on a sequence boundary, status is OK. If the chunk ends in the middle of a sequence and eof is false, status is PARTIAL; pending is the number of trailing bytes in that incomplete sequence.

If validation stops at an ill-formed sequence, status is ILLFORMED. If eof is true and the chunk ends in the middle of a sequence, status is TRUNCATED.

On ILLFORMED or TRUNCATED, the stream state resets to ACCEPT automatically so the caller can continue without reinitialising.

On ILLFORMED or TRUNCATED, pending is 0. advance is the number of bytes in the current chunk that belong to the subpart, and carried is the number of bytes from previous chunks that belong to the same subpart. Resume validation at src[consumed + advance].

If the current chunk starts at absolute offset stream_offset, then stream_offset + consumed - carried is the error position, carried + advance is the subpart length, and stream_offset + consumed + advance is the resume position.

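utf8_valid_stream_t s;
utf8_valid_stream_init(&s);

char   buf[4096];          // chunk buffer; the size is illustrative
size_t len;
size_t stream_offset = 0;  // absolute offset of current chunk start
bool valid = true;

// read_chunk and handle_error are supplied by the caller
while (valid && (len = read_chunk(buf, sizeof buf)) > 0) {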
  bool eof = len < sizeof buf;
  utf8_valid_stream_result_t r = utf8_valid_stream_check(&s, buf, len, eof);

  switch (r.status) {
  case UTF8_VALID_STREAM_OK:
  case UTF8_VALID_STREAM_PARTIAL:
    stream_offset += len;
    break;
  case UTF8_VALID_STREAM_ILLFORMED:
  case UTF8_VALID_STREAM_TRUNCATED: {
    size_t error_pos   = stream_offset + r.consumed - r.carried;
    size_t subpart_len = r.carried + r.advance;
    size_t resume_pos  = stream_offset + r.consumed + r.advance;

    handle_error(error_pos, subpart_len);
    stream_offset = resume_pos;
    valid = false;
    break;
  }
  }
}

Codepoint count — utf8_distance.h

Requires utf8_dfa32.h or utf8_dfa64.h.

size_t utf8_distance(const char *src, size_t len);
size_t utf8_distance_ascii(const char *src, size_t len);

utf8_distance returns the number of Unicode codepoints in src[0..len), or (size_t)-1 if the input contains ill-formed UTF-8.

utf8_distance_ascii is a drop-in replacement with an 8-byte ASCII fast path that skips the DFA for chunks containing only ASCII bytes. Behaviour is identical to utf8_distance.


Forward navigation — utf8_advance_forward.h

Requires utf8_dfa32.h or utf8_dfa64.h.

size_t utf8_advance_forward(const char *src,
                            size_t len,
                            size_t distance,
                            size_t *advanced);
size_t utf8_advance_forward_ascii(const char *src,
                                  size_t len,
                                  size_t distance,
                                  size_t *advanced);

utf8_advance_forward returns the byte offset of the codepoint distance positions ahead within src[0..len). Returns len if distance exceeds the number of codepoints in the buffer. Returns (size_t)-1 if the input contains ill-formed UTF-8.

If advanced is non-NULL, sets *advanced to the number of codepoints actually skipped before stopping.

utf8_advance_forward_ascii is a drop-in replacement with an 8-byte ASCII fast path that skips the DFA for chunks containing only ASCII bytes. Behaviour is identical to utf8_advance_forward.


Backward navigation — utf8_advance_backward.h

Requires utf8_rdfa32.h or utf8_rdfa64.h.

size_t utf8_advance_backward(const char *src, 
                             size_t len,
                             size_t distance, 
                             size_t *advanced);
size_t utf8_advance_backward_ascii(const char *src,
                                   size_t len,
                                   size_t distance,
                                   size_t *advanced);

utf8_advance_backward returns the byte offset of the codepoint distance positions before the end of src[0..len). Returns 0 if distance exceeds the number of codepoints in the buffer. Returns (size_t)-1 if the input contains ill-formed UTF-8.

If advanced is non-NULL, sets *advanced to the number of codepoints actually skipped before stopping.

utf8_advance_backward_ascii is a drop-in replacement with an 8-byte ASCII fast path that skips the DFA for chunks containing only ASCII bytes. Behaviour is identical to utf8_advance_backward.


Forward decoding — utf8_decode_next.h

Requires utf8_dfa64.h.

int utf8_decode_next(const char *src, size_t len, uint32_t *codepoint);
int utf8_decode_next_replace(const char *src, size_t len, uint32_t *codepoint);

utf8_decode_next decodes the codepoint starting at src[0].

  • Success: returns bytes consumed (1–4) and writes the codepoint to *codepoint.
  • End of input (len == 0): returns 0; *codepoint is unchanged.
  • Ill-formed sequence: returns the negated length of the maximal subpart (see Error handling and U+FFFD replacement), in the range -1..-3. Advance by -return_value bytes and call again. *codepoint is unchanged.

utf8_decode_next_replace is identical but on error writes U+FFFD to *codepoint and returns the maximal subpart length as a positive value. Never returns a negative value; returns 0 only when len is 0.

uint32_t cp;
while (len > 0) {
  int n = utf8_decode_next_replace(src, len, &cp);
  process(cp); // U+FFFD for any ill-formed sequence
  src += n;
  len -= n;
}

Backward decoding — utf8_decode_prev.h

Requires utf8_rdfa64.h.

int utf8_decode_prev(const char *src, size_t len, uint32_t *codepoint);
int utf8_decode_prev_replace(const char *src, size_t len, uint32_t *codepoint);

utf8_decode_prev decodes the codepoint ending at src[len-1].

  • Success: returns bytes consumed (1–4) and writes the codepoint to *codepoint. Step back by the return value: next call uses src[0..len-return_value).
  • End of input (len == 0): returns 0; *codepoint is unchanged.
  • Ill-formed sequence: returns the negated number of bytes to step back, in the range -1..-3. Step back by -return_value bytes and call again. *codepoint is unchanged.

utf8_decode_prev_replace is identical but on error writes U+FFFD to *codepoint and returns the step-back distance as a positive value. Never returns a negative value; returns 0 only when len is 0.

The reverse DFA sees continuation bytes before the lead byte. For some ill-formed inputs (e.g. two lone continuation bytes \x80\x80), utf8_decode_prev may produce fewer U+FFFD substitutions than utf8_decode_next over the same bytes, because all bytes consumed before rejection are reported as one ill-formed unit rather than individually.


Transcoding — utf8_transcode.h

Requires utf8_dfa64.h.

typedef enum {
    UTF8_TRANSCODE_OK,         // src fully consumed, no errors
    UTF8_TRANSCODE_EXHAUSTED,  // dst full before src was consumed
    UTF8_TRANSCODE_ILLFORMED,  // stopped at an ill-formed sequence
    UTF8_TRANSCODE_TRUNCATED,  // src ends mid-sequence
} utf8_transcode_status_t;

typedef struct {
    utf8_transcode_status_t status;
    size_t consumed;   // bytes read from src
    size_t decoded;    // codepoints decoded from src
    size_t written;    // code units written to dst
    size_t advance;    // bytes to skip on ILLFORMED or TRUNCATED, else 0
} utf8_transcode_result_t;

utf8_transcode_result_t utf8_transcode_utf16(const char *src, size_t src_len,
                                             uint16_t *dst, size_t dst_len);
utf8_transcode_result_t utf8_transcode_utf16_replace(const char *src, size_t src_len,
                                                     uint16_t *dst, size_t dst_len);
utf8_transcode_result_t utf8_transcode_utf32(const char *src, size_t src_len,
                                             uint32_t *dst, size_t dst_len);
utf8_transcode_result_t utf8_transcode_utf32_replace(const char *src, size_t src_len,
                                                     uint32_t *dst, size_t dst_len);

utf8_transcode_utf16 and utf8_transcode_utf32 transcode src[0..src_len) into dst[0..dst_len), stopping at the first ill-formed or truncated sequence. Codepoints above U+FFFF are encoded as surrogate pairs in UTF-16 and consume two code units.

utf8_transcode_utf16_replace and utf8_transcode_utf32_replace replace each ill-formed sequence with U+FFFD and continue. Status is always OK or EXHAUSTED; advance is always 0.

On ILLFORMED or TRUNCATED, consumed is the byte offset of the ill-formed sequence and advance is the length of its maximal subpart (see Error handling and U+FFFD replacement). Resume transcoding at src[consumed + advance].

// Replace ill-formed sequences with U+FFFD
uint16_t dst[256];
utf8_transcode_result_t r;
do {
  r = utf8_transcode_utf16_replace(src, src_len, dst, sizeof dst / sizeof *dst);
  flush(dst, r.written);
  src     += r.consumed;
  src_len -= r.consumed;
} while (r.status == UTF8_TRANSCODE_EXHAUSTED);

// Strict: report each ill-formed sequence to the caller
while (src_len > 0) {
  r = utf8_transcode_utf16(src, src_len, dst, sizeof dst / sizeof *dst);
  flush(dst, r.written);
  src     += r.consumed;
  src_len -= r.consumed;

  if (r.status == UTF8_TRANSCODE_ILLFORMED || r.status == UTF8_TRANSCODE_TRUNCATED) {
    handle_error(src, r.advance);
    src     += r.advance;
    src_len -= r.advance;
    if (r.status == UTF8_TRANSCODE_TRUNCATED)
      break;
  }
  else if (r.status == UTF8_TRANSCODE_OK)
    break;
}

For UTF-32, result.written always equals result.decoded since each codepoint maps to exactly one code unit. For UTF-16, result.written may exceed result.decoded due to surrogate pairs for codepoints above U+FFFF.
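The surrogate-pair arithmetic behind the difference between written and decoded is the standard UTF-16 encoding form. A minimal sketch, with a hypothetical function name and no bounds checking on dst:

```c
#include <stddef.h>
#include <stdint.h>

// Writes cp (assumed to be a Unicode scalar value) to dst as UTF-16 and
// returns the number of code units written: 1 for the BMP, 2 otherwise.
size_t sketch_put_utf16(uint32_t cp, uint16_t *dst) {
  if (cp < 0x10000) {
    dst[0] = (uint16_t)cp;
    return 1;
  }
  cp -= 0x10000;                               // 20 bits remain
  dst[0] = (uint16_t)(0xD800 + (cp >> 10));    // high surrogate: top 10 bits
  dst[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  // low surrogate: bottom 10 bits
  return 2;
}
```

A transcoder that checks for space before each write needs room for two code units whenever the decoded codepoint is above U+FFFF, which is one reason an EXHAUSTED status can leave dst with an unused slot.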

Unsafe variants

The following headers provide validation-free implementations for use on input that has already been validated. They skip the DFA entirely and decode or count by inspecting lead/continuation byte structure directly. The caller must guarantee that src contains well-formed UTF-8; behaviour is undefined otherwise.

These are not drop-in replacements for the safe API — they never return error indicators. Use them on the output side of a validation boundary (e.g. after utf8_valid or utf8_check has accepted the input).


Unsafe codepoint count — utf8_distance_unsafe.h

No DFA backend required.

size_t utf8_distance_unsafe(const char *src, size_t len);

utf8_distance_unsafe returns the number of codepoints in src[0..len). Uses SIMD (when available) or SWAR to process blocks in bulk. Cannot fail.
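One common SWAR formulation counts continuation bytes eight at a time and subtracts them from the byte length. The sketch below shows that technique under the same precondition (well-formed input); it is not necessarily how utf8_distance_unsafe is written, and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Codepoints = bytes that are not continuation bytes (10xxxxxx).
size_t sketch_count_swar(const char *src, size_t len) {
  size_t continuation = 0, i = 0;
  for (; i + 8 <= len; i += 8) {
    uint64_t w;
    memcpy(&w, src + i, 8);
    // Per byte: keep bit 7 only where bit 7 is set and bit 6 is clear.
    uint64_t m = (w & ~(w << 1)) & UINT64_C(0x8080808080808080);
    // Move each flag to bit 0 of its byte, then sum the 8 flags into the top byte.
    continuation += (size_t)(((m >> 7) * UINT64_C(0x0101010101010101)) >> 56);
  }
  for (; i < len; i++)  // scalar tail
    continuation += ((unsigned char)src[i] & 0xC0) == 0x80;
  return len - continuation;
}
```

On ill-formed input this still returns a number, but it is not a meaningful codepoint count, which is why the safe utf8_distance exists.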


Unsafe forward navigation — utf8_advance_forward_unsafe.h

No DFA backend required.

size_t utf8_advance_forward_unsafe(const char *src,
                                   size_t len,
                                   size_t distance,
                                   size_t *advanced);

utf8_advance_forward_unsafe returns the byte offset of the codepoint distance positions ahead in src[0..len), or len if distance exceeds the number of codepoints in the buffer. If advanced is non-NULL, writes the number of codepoints actually skipped. Uses SIMD/SWAR bulk counting internally.


Unsafe backward navigation — utf8_advance_backward_unsafe.h

No DFA backend required.

size_t utf8_advance_backward_unsafe(const char *src,
                                    size_t len,
                                    size_t distance,
                                    size_t *advanced);

utf8_advance_backward_unsafe returns the byte offset of the codepoint distance positions before the end of src[0..len), or 0 if distance exceeds the number of codepoints in the buffer. If advanced is non-NULL, writes the number of codepoints actually skipped. Uses SIMD/SWAR bulk counting internally.


Unsafe forward decoding — utf8_decode_next_unsafe.h

No DFA backend required.

int utf8_decode_next_unsafe(const char *src, size_t len, uint32_t *codepoint);

utf8_decode_next_unsafe decodes the codepoint starting at src[0]. Returns bytes consumed (1–4) and writes the codepoint to *codepoint. Returns 0 when len is 0.
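Decoding by lead-byte structure alone might look like the following sketch. It is an assumption about the general approach, not the library's source; the function name is hypothetical, and as with any unsafe variant the continuation bytes are read without checks.

```c
#include <stddef.h>
#include <stdint.h>

// Decodes the codepoint at src[0], trusting the lead byte. Assumes src holds
// well-formed UTF-8, so the bytes after the lead are read unchecked.
int sketch_decode_next_unsafe(const char *src, size_t len, uint32_t *codepoint) {
  const unsigned char *p = (const unsigned char *)src;
  if (len == 0)
    return 0;
  if (p[0] < 0x80) {  // 0xxxxxxx
    *codepoint = p[0];
    return 1;
  }
  if (p[0] < 0xE0) {  // 110xxxxx 10xxxxxx
    *codepoint = (uint32_t)(p[0] & 0x1F) << 6 | (p[1] & 0x3F);
    return 2;
  }
  if (p[0] < 0xF0) {  // 1110xxxx 10xxxxxx 10xxxxxx
    *codepoint = (uint32_t)(p[0] & 0x0F) << 12 |
                 (uint32_t)(p[1] & 0x3F) << 6 | (p[2] & 0x3F);
    return 3;
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  *codepoint = (uint32_t)(p[0] & 0x07) << 18 | (uint32_t)(p[1] & 0x3F) << 12 |
               (uint32_t)(p[2] & 0x3F) << 6 | (p[3] & 0x3F);
  return 4;
}
```

The branches on the lead byte are the content-dependent classification mentioned in the Security section.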


Unsafe backward decoding — utf8_decode_prev_unsafe.h

No DFA backend required.

int utf8_decode_prev_unsafe(const char *src, size_t len, uint32_t *codepoint);

utf8_decode_prev_unsafe decodes the codepoint ending at src[len-1]. Returns bytes consumed (1–4) and writes the codepoint to *codepoint. Returns 0 when len is 0.


Unsafe transcoding — utf8_transcode_unsafe.h

No DFA backend required.

utf8_transcode_result_t utf8_transcode_utf32_unsafe(const char *src, size_t src_len,
                                                    uint32_t *dst, size_t dst_len);
utf8_transcode_result_t utf8_transcode_utf16_unsafe(const char *src, size_t src_len,
                                                    uint16_t *dst, size_t dst_len);

utf8_transcode_utf32_unsafe and utf8_transcode_utf16_unsafe transcode src[0..src_len) into dst[0..dst_len). Status is always OK or EXHAUSTED; advance is always 0. The result struct is identical to the safe variants for easy interchangeability.

Both functions process 8 bytes at a time when all 8 are ASCII (widening without decoding) and batch consecutive sequences of the same length class to avoid re-entering the lead-byte classifier. For UTF-16, codepoints above U+FFFF are encoded as surrogate pairs.


Error handling and U+FFFD replacement

When processing untrusted input, replace each ill-formed sequence with U+FFFD (REPLACEMENT CHARACTER) or reject the input outright. Silently skipping ill-formed sequences can have security implications; see Unicode Technical Report #36, Unicode Security Considerations.

Maximal subpart

The number of U+FFFD characters to emit per ill-formed sequence is determined by the maximal subpart rule, defined in Unicode 17.0 Table 3-8. A maximal subpart is the longest prefix of an ill-formed sequence that is either the start of an otherwise well-formed sequence, or a single byte. Each maximal subpart produces exactly one U+FFFD.

For example, the byte sequence \xF0\x90\x80 is a truncated 4-byte sequence: \xF0 is a valid 4-byte lead followed by two continuation bytes in their permitted ranges, but the third continuation byte is missing. All three bytes form one maximal subpart and produce one U+FFFD. (By contrast, in \xF0\x80\x80 the byte after \xF0 falls outside the required 90–BF range, so \xF0 alone is a maximal subpart and each following \x80 is another.) Two lone continuation bytes \x80\x80 each form their own maximal subpart of length 1 and produce two U+FFFD.
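The rule can be sketched as a short function that extends the prefix while each byte is still permitted at its position. This is an illustration with a hypothetical name, not the library's utf8_maximal_subpart; unlike that function it returns 0 for empty input, and for a well-formed sequence it simply returns the sequence length.

```c
#include <stddef.h>

// Length of the longest prefix of src[0..len) that could begin a well-formed
// sequence, with a minimum of 1 for non-empty input.
size_t sketch_maximal_subpart(const char *src, size_t len) {
  const unsigned char *p = (const unsigned char *)src;
  if (len == 0)
    return 0;

  size_t need;                         // total bytes the lead asks for
  unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte
  unsigned char b0 = p[0];
  if (b0 >= 0xC2 && b0 <= 0xDF) {
    need = 2;
  } else if (b0 >= 0xE0 && b0 <= 0xEF) {
    need = 3;
    if (b0 == 0xE0) lo = 0xA0;  // non-shortest form below A0
    if (b0 == 0xED) hi = 0x9F;  // surrogates above 9F
  } else if (b0 >= 0xF0 && b0 <= 0xF4) {
    need = 4;
    if (b0 == 0xF0) lo = 0x90;  // non-shortest form below 90
    if (b0 == 0xF4) hi = 0x8F;  // above U+10FFFF beyond 8F
  } else {
    return 1;  // ASCII, lone continuation, or a byte that is never valid
  }

  size_t n = 1;
  while (n < need && n < len) {
    if (p[n] < lo || p[n] > hi)
      break;
    lo = 0x80;  // only the second byte has a restricted range
    hi = 0xBF;
    n++;
  }
  return n;
}
```

For \xE2\x82\x41 the sketch returns 2 (the lead and one acceptable continuation, then 'A' breaks the sequence), and for a lone \x80 it returns 1.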

Standards requirements

Unicode 17.0 §3.9 recommends, and the WHATWG Encoding Standard requires, that decoders replace each maximal subpart of an ill-formed sequence with exactly one U+FFFD. This is the behaviour implemented by utf8_decode_next_replace, utf8_transcode_utf16_replace, and utf8_transcode_utf32_replace.

The non-replace variants (utf8_decode_next, utf8_transcode_utf16, utf8_transcode_utf32) stop and report the error position instead. These are intended for applications that need to handle ill-formed input explicitly, such as validating input at a trust boundary, logging the error location, or applying a custom substitution policy. If you have no such requirement, prefer the _replace variants.

Security

The guarantees below apply to the safe (DFA-based) API. The _unsafe variants perform no validation and assume the caller has already verified the input; passing ill-formed UTF-8 to an unsafe function is undefined behaviour.

All decoded codepoints are within the Unicode scalar value range. The DFA structurally rejects non-shortest-form encodings, surrogate halves (U+D800–U+DFFF), and codepoints above U+10FFFF. These cannot appear in the output of any safe decoding or transcoding function.

No dynamic allocation. All functions (safe and unsafe) operate on caller-supplied buffers with no heap allocation, no global mutable state, and no use of errno.

No data-dependent branches on byte value. The DFA step is a table lookup and bitwise shift with no conditional branches that depend on the input byte. Execution time scales with the number of bytes processed, not their values. Note that the _ascii variants are exceptions; they branch on whether a chunk contains only ASCII bytes, making their execution time content-dependent. The _unsafe variants also use content-dependent branching (lead-byte classification) since they are intended for performance-critical paths where the input is already trusted.

Performance

Corpus

Documents are retrieved from Wikipedia and converted to plain text, available in benchmark/corpus/.

| File | Size | Code points | Distribution | Best utf8_valid | Best utf8_valid_ascii |
| --- | --- | --- | --- | --- | --- |
| ar.txt | 25 KB | 14K | 19% ASCII, 81% 2-byte | 6725 MB/s | 5349 MB/s |
| el.txt | 102 KB | 59K | 23% ASCII, 77% 2-byte | 6637 MB/s | 5254 MB/s |
| en.txt | 80 KB | 82K | 99.9% ASCII | 6582 MB/s | 41071 MB/s |
| ja.txt | 176 KB | 65K | 11% ASCII, 89% 3-byte | 6584 MB/s | 5478 MB/s |
| lv.txt | 135 KB | 127K | 92% ASCII, 7% 2-byte | 6600 MB/s | 6445 MB/s |
| ru.txt | 148 KB | 85K | 23% ASCII, 77% 2-byte | 6601 MB/s | 4154 MB/s |
| sv.txt | 94 KB | 93K | 96% ASCII, 4% 2-byte | 6646 MB/s | 9199 MB/s |

Best numbers from -O2 -march=x86-64-v3 (Raptor Lake). utf8_valid uses dual-stream validation; see Observations.


Raptor Lake (Clang 20, x86-64)

| Flags | utf8_valid | utf8_valid_ascii | Notes |
| --- | --- | --- | --- |
| -O2 | 4107 MB/s | 2986 MB/s | dual-stream ILP effective without BMI2 |
| -O2 -march=x86-64-v3 | 6394 MB/s | 4175 MB/s | BMI2 SHRX on two ports |
| -O3 -march=x86-64-v3 | 6471 MB/s | 5349 MB/s | fast path not profitable on multibyte |

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches 29–35 GB/s at all optimization levels.

Raptor Lake (GCC 14, x86-64)

| Flags | utf8_valid | utf8_valid_ascii | Notes |
| --- | --- | --- | --- |
| -O2 | 3714 MB/s | 2774 MB/s | dual-stream ILP effective without BMI2 |
| -O2 -march=x86-64-v3 | 6725 MB/s | 4125 MB/s | BMI2 SHRX on two ports |
| -O3 -march=x86-64-v3 | 6489 MB/s | 4164 MB/s | fast path not profitable on multibyte |

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches 36–41 GB/s at all optimization levels.

Haswell (Clang 22, x86-64)

| Flags | utf8_valid | utf8_valid_ascii | Notes |
| --- | --- | --- | --- |
| -O2 | 2370 MB/s | 1756 MB/s | narrow backend limits ILP gain |
| -O2 -march=x86-64-v3 | 3674 MB/s | 2626 MB/s | BMI2 SHRX |
| -O3 -march=x86-64-v3 | 3682 MB/s | 3328 MB/s | gap narrows at -O3 |

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches 16–20 GB/s at all optimization levels.

Apple M1 Pro (Clang 21, AArch64)

| Flags | utf8_valid | utf8_valid_ascii | Notes |
| --- | --- | --- | --- |
| -O2 | 4498 MB/s | 2877 MB/s | dual-stream ILP on wide Firestorm |
| -O2 -mtune=apple-m1 | 4356 MB/s | 2867 MB/s | mtune negligible |
| -O3 | 4445 MB/s | 5231 MB/s | -O3 unlocks NEON fast path |
| -O3 -mtune=apple-m1 | 4228 MB/s | 4866 MB/s | fast path profitable on all content |

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches ~21 GB/s at -O3.

Apple M1 Pro (GCC 15, AArch64)

| Flags | utf8_valid | utf8_valid_ascii | Notes |
| --- | --- | --- | --- |
| -O2 | 4206 MB/s | 2742 MB/s | dual-stream ILP on wide Firestorm |
| -O2 -mtune=apple-m1 | 4214 MB/s | 2738 MB/s | mtune no effect |
| -O3 | 4506 MB/s | 2757 MB/s | -O3 improves over -O2 |
| -O3 -mtune=apple-m1 | 4453 MB/s | 2881 MB/s | fast path not profitable on multibyte |

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches ~36 GB/s at -O3.

Observations

  • utf8_valid uses a dual-stream validation strategy: the input is split at a UTF-8 sequence boundary near the midpoint and two independent DFA chains run in a single interleaved loop. Since the chains have no data dependency, the CPU's out-of-order engine can overlap both shift operations on cores with multiple shift-capable execution ports.
  • On wide-issue cores (Raptor Lake P-cores with BMI2, Apple M1 Firestorm), dual-stream reaches approximately 1.4 bytes per clock cycle at peak throughput.
  • On x86, -march=x86-64-v3 or -march=native enables BMI2 SHRX, which removes the variable-shift dependency on CL.
  • utf8_valid_ascii is profitable on high-ASCII content across all tested platforms. On multibyte-heavy content it is generally slower than utf8_valid, with the exception of Apple M1 with Clang -O3 where the NEON fast path keeps it competitive.

Benchmark

The benchmark is in benchmark/bench.c. Compile and run with:

make bench
make bench BENCH_OPTFLAGS="-O3 -march=x86-64-v3"
./bench                        # benchmarks benchmark/corpus/
./bench -d <directory>         # benchmarks all .txt files in directory
./bench -f <file>              # benchmarks single file
./bench -s <MB>                # resize input to <MB> before benchmarking

Benchmark mode (mutually exclusive):

| Flag | Description |
| --- | --- |
| -t <secs> | run each implementation for <secs> seconds (default: 20) |
| -n <reps> | run each implementation for <reps> repetitions |
| -b <MB> | run each implementation until <MB> total data processed |

Warmup (-w <n>): before timing, each implementation is run for n iterations to warm up caches and branch predictors. By default the warmup count is derived from input size, targeting approximately 256 MB of warmup data, capped between 1 and 100 iterations. Use -w 0 to disable warmup entirely.

Output format: For each file, the header line shows the filename, byte size, code point count, and average bytes per code point (units/point). The code point distribution breaks down the input by Unicode range. Results are sorted slowest to fastest; the multiplier shows throughput relative to the slowest implementation.

sv.txt: 94 KB; 93K code points; 1.04 units/point
  U+0000..U+007F          90K  96.4%
  U+0080..U+07FF           3K   3.5%
  U+0800..U+FFFF          171   0.2%
  hoehrmann                    432 MB/s
  utf8_valid_old              1780 MB/s  (4.12x)
  utf8_valid                  4449 MB/s  (10.29x)
  utf8_valid_ascii            9968 MB/s  (23.05x)

units/point is a rough content-mix indicator: 1.00 is near-pure ASCII, ~1.7–1.9 is common for 2-byte-heavy text, and ~2.7–3.0 for CJK-heavy text.

The benchmark includes two reference implementations, hoehrmann (a widely used table-driven DFA implementation) and utf8_valid_old (the previous scalar decoder), to track regressions and quantify the gains from the current DFA approach.

License

MIT License. Copyright (c) 2017–2026 Christian Hansen.
