Skip to content

floodyberry/utf8dfadecoder

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 

Repository files navigation

This is a UTF-8 DFA decoder, based on Björn Höhrmann's Flexible and Economical UTF-8 Decoder. I really liked how simple it was, but I wanted a few changes:

  1. It allowed some wiggly non-characters through (0xfdd0-0xfdef, 0x??fffe/0x??ffff)
  2. It didn't handle 5 or 6 byte UTF-8 sequences, i.e. fully consume them and emit a replacement character
  3. It didn't distinguish between an invalid byte in a UTF-8 stream, and a validly encoded yet illegal value, e.g. overlong encodings. This meant the decoder had no way to know if it should back up because it encountered an illegal byte.

The somewhat arbitrary goal was a decoder that emitted 1 replacement character for each unexpected byte, 1 replacement character for each unfinished UTF-8 sequence up to the point where the sequence was still legal, and 1 replacement character for each valid UTF-8 sequence that represents a non-valid codepoint or overlong encoding.

I still use the basic form he presented, but generated expanded state tables to handle the new requirements, and slightly modified the innerloop to only require one state lookup at the expense of a larger tables (256 + 108) bytes vs (256 + 5376) bytes.

LICENSE

MIT, or Public Domain

About

A DFA based UTF-8 decoder

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages