floodyberry/utf8dfadecoder: A DFA-based UTF-8 decoder

This is a UTF-8 DFA decoder, based on Björn Höhrmann's Flexible and Economical UTF-8 Decoder. I really liked how simple it was, but I wanted a few changes:

- It allowed some wiggly non-characters through (0xfdd0-0xfdef, 0x??fffe/0x??ffff).
- It didn't handle 5- or 6-byte UTF-8 sequences, i.e. it didn't fully consume them and emit a replacement character.
- It didn't distinguish between an invalid byte in a UTF-8 stream and a validly encoded yet illegal value, e.g. an overlong encoding. This meant the decoder had no way to know whether it should back up because it had encountered an illegal byte.
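
To make the first point concrete, the non-character ranges named above can be tested with a small predicate. This is only a sketch and not code from this repository; the name `is_noncharacter` is made up for the example.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if cp is a Unicode non-character.
 * Covers 0xFDD0-0xFDEF and the last two code points of every plane
 * (0x??FFFE / 0x??FFFF) up to 0x10FFFF. */
int is_noncharacter(uint32_t cp) {
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return 1;
    if (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE)
        return 1;
    return 0;
}
```

A decoder that has finished assembling a code point could call a check like this before emitting it, and substitute a replacement character when it returns 1.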

The somewhat arbitrary goal was a decoder that emitted 1 replacement character for each unexpected byte, 1 replacement character for each unfinished UTF-8 sequence up to the point where the sequence was still legal, and 1 replacement character for each valid UTF-8 sequence that represents a non-valid codepoint or overlong encoding.
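
The policy above can be sketched as a plain loop, without any state tables. This is one possible reading of the goal and not the code in `utf8decoder.c`: the function names are invented for the example, and treating 0xC0/0xC1 as leads of overlong 2-byte sequences is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu

/* Sequence length implied by a lead byte; 0 means the byte can never
 * start a sequence (stray continuation byte, 0xFE, 0xFF). */
static int lead_length(uint8_t b) {
    if (b < 0x80) return 1;
    if (b < 0xC0) return 0;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;   /* obsolete 5-byte form */
    if (b < 0xFE) return 6;   /* obsolete 6-byte form */
    return 0;
}

/* A structurally complete sequence can still carry an illegal value. */
static int is_illegal_value(uint32_t cp, int len) {
    static const uint32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    if (len > 4) return 1;                       /* 5/6-byte sequence */
    if (cp < min_for_len[len]) return 1;         /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 1;  /* surrogate */
    if (cp > 0x10FFFF) return 1;                 /* out of range */
    if (cp >= 0xFDD0 && cp <= 0xFDEF) return 1;  /* non-character */
    if ((cp & 0xFFFE) == 0xFFFE) return 1;       /* non-character */
    return 0;
}

/* Decodes n bytes into out (capacity must be at least n code points)
 * and returns the number of code points written. */
size_t decode_with_replacement(const uint8_t *in, size_t n, uint32_t *out) {
    size_t i = 0, count = 0;
    while (i < n) {
        int len = lead_length(in[i]);
        if (len == 0) {                 /* unexpected byte: 1 replacement */
            out[count++] = REPLACEMENT;
            i++;
            continue;
        }
        uint32_t cp = (len == 1) ? in[i] : (in[i] & (0x7Fu >> len));
        size_t got = 1;
        i++;
        /* Consume continuation bytes; stop before a byte that does not fit
         * so that it is re-examined as the start of the next sequence. */
        while (got < (size_t)len && i < n && (in[i] & 0xC0) == 0x80) {
            cp = (cp << 6) | (in[i] & 0x3Fu);
            i++;
            got++;
        }
        if (got < (size_t)len || is_illegal_value(cp, len))
            cp = REPLACEMENT;           /* unfinished or illegal: 1 replacement */
        out[count++] = cp;
    }
    return count;
}
```

Under this reading, the truncated sequence `E2 82` followed by `41` produces U+FFFD and then `A`, while the overlong pair `C0 80` produces a single U+FFFD.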

I still use the basic form he presented, but generated expanded state tables to handle the new requirements, and slightly modified the inner loop to require only one state lookup, at the expense of larger tables: (256 + 5376) bytes versus the original (256 + 108) bytes.
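
The size trade-off can be illustrated with a toy DFA. In the original decoder, 108 is 9 states times 12 byte classes. 5376 happens to equal 21 times 256, which would be consistent with a transition table indexed directly by state and byte, though that is an inference from the numbers and not something stated here. The sketch below is not the repository's tables: it checks only the byte structure of 1 to 4 byte sequences, and all names in it are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy DFA states and byte classes (illustrative only). */
enum { ACCEPT, NEED1, NEED2, NEED3, REJECT, NSTATES };
enum { C_ASCII, C_CONT, C_LEAD2, C_LEAD3, C_LEAD4, C_BAD, NCLASSES };

/* Compact layout: byte -> class, then (state, class) -> state. */
static uint8_t class_of[256];
static const uint8_t by_class[NSTATES][NCLASSES] = {
    /*            ascii   cont    lead2   lead3   lead4   bad    */
    /* ACCEPT */ {ACCEPT, REJECT, NEED1,  NEED2,  NEED3,  REJECT},
    /* NEED1  */ {REJECT, ACCEPT, REJECT, REJECT, REJECT, REJECT},
    /* NEED2  */ {REJECT, NEED1,  REJECT, REJECT, REJECT, REJECT},
    /* NEED3  */ {REJECT, NEED2,  REJECT, REJECT, REJECT, REJECT},
    /* REJECT */ {REJECT, REJECT, REJECT, REJECT, REJECT, REJECT},
};

/* Expanded layout: (state, byte) -> state, one lookup per byte. */
static uint8_t by_byte[NSTATES][256];

void build_tables(void) {
    for (int b = 0; b < 256; b++) {
        class_of[b] = (uint8_t)(b < 0x80 ? C_ASCII : b < 0xC0 ? C_CONT :
                                b < 0xE0 ? C_LEAD2 : b < 0xF0 ? C_LEAD3 :
                                b < 0xF8 ? C_LEAD4 : C_BAD);
    }
    /* Expand the compact table so the class lookup is no longer needed. */
    for (int s = 0; s < NSTATES; s++)
        for (int b = 0; b < 256; b++)
            by_byte[s][b] = by_class[s][class_of[b]];
}

/* Two dependent lookups per byte, small tables (256 + 30 bytes here). */
uint8_t step_compact(uint8_t state, uint8_t byte) {
    return by_class[state][class_of[byte]];
}

/* One lookup per byte, larger table (5 * 256 = 1280 bytes here). */
uint8_t step_direct(uint8_t state, uint8_t byte) {
    return by_byte[state][byte];
}

/* Feeds n bytes through the expanded table; call build_tables() first. */
uint8_t run_direct(const uint8_t *s, size_t n) {
    uint8_t state = ACCEPT;
    for (size_t i = 0; i < n; i++)
        state = step_direct(state, s[i]);
    return state;
}
```

Both step functions compute the same transition; the expanded table trades memory for a shorter dependency chain in the inner loop.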

LICENSE

MIT, or Public Domain

