cs127/unicorn

unicorn

unicorn is a lightweight implementation of most of the standard C wide character functions, for platforms that don't support them but still have a wide character type in the form of wchar_t. wchar_t must be at least 16 bits wide if it is unsigned, or at least 17 bits wide if it is signed.

Note

this is just a hobby project. as much as I try to fix issues, you should still probably not expect it to always work properly. also, the code isn't exactly the most optimized. you have my warning.

unlike the standard functions which are locale-dependent, unicorn does not support locales, and always uses the same text encodings:

  • wide characters (wchar_t) are assumed to be encoded in UTF-32 if WCHAR_MAX is at least 0x10FFFF (e.g. Linux), or UTF-16 otherwise (e.g. Windows).
    • surrogates (U+D800-U+DFFF) are considered invalid in UTF-32.
    • a new function (mbstowc) has been implemented as an alternative to mbtowc to allow converting individual non-BMP characters in UTF-16.
  • multibyte strings (used in mbstowcs and the like) are assumed to be encoded in UTF-8.
    • surrogates (U+D800-U+DFFF) are considered invalid in multibyte strings.
    • characters of length 5-8 are considered invalid, and so are 4-byte characters that exceed U+10FFFF.
    • overlong characters (characters encoded in a larger number of bytes than necessary) are considered invalid.
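the UTF-8 validity rules above can be illustrated with a small decoder. this is only a sketch of the rules as stated in this README, not unicorn's actual source; the name utf8_decode and its signature are made up for the example.

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of unicorn's API.
 * Decodes one UTF-8 character from s (at most n bytes) into *out.
 * Returns the number of bytes consumed (1-4), or 0 if the sequence
 * is invalid or truncated. */
static size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *out)
{
    /* min_cp[len] is the smallest code point that needs len bytes */
    static const unsigned long min_cp[5] = { 0, 0x0UL, 0x80UL, 0x800UL, 0x10000UL };
    unsigned long cp;
    size_t len, i;

    if (n == 0) return 0;

    if (s[0] < 0x80)                { len = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return 0; /* stray continuation byte, or a 5-8 byte lead byte */

    if (n < len) return 0; /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0; /* not a continuation byte */
        cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
    }

    if (cp < min_cp[len]) return 0;                 /* overlong encoding */
    if (cp >= 0xD800UL && cp <= 0xDFFFUL) return 0; /* surrogate */
    if (cp > 0x10FFFFUL) return 0;                  /* beyond U+10FFFF */

    *out = cp;
    return len;
}
```

with this sketch, the bytes F0 9F A6 84 decode to U+1F984 in 4 bytes, while C0 80 (overlong), ED A0 80 (a surrogate) and F4 90 80 80 (above U+10FFFF) are all rejected.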

everything that unicorn implements uses the same name as its counterpart in standard C, except with a UC_ prefix (for example, mbstowcs becomes UC_mbstowcs). the only exception is the wchar_t type: unicorn uses the standard wchar_t.

compatibility

unicorn is almost C89-compatible, except that it needs to know the maximum possible value of the wchar_t type. if your build environment does not support C99 or newer, then unless your compiler itself predefines WCHAR_MAX, __WCHAR_MAX, or __WCHAR_MAX__, you need to manually define one of them at compile time (make sure to give it the correct value!).
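as a rough illustration of what this detection could look like, here is a preprocessor sketch. it is an assumption about the general approach, not a copy of unicorn's code; the DEMO_ names and demo_wchar_encoding_bits are invented for the example. on a pre-C99 compiler that predefines none of the three macros, you would pass something like -DWCHAR_MAX=0xFFFF on the command line (using the real value for your platform).

```c
/* On C99 and newer, <stdint.h> is required to define WCHAR_MAX. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#include <stdint.h>
#endif

/* DEMO_WCHAR_MAX is a hypothetical name used only in this sketch.
 * Try the three macro names mentioned above, in order. */
#if defined(WCHAR_MAX)
#define DEMO_WCHAR_MAX WCHAR_MAX
#elif defined(__WCHAR_MAX__)
#define DEMO_WCHAR_MAX __WCHAR_MAX__
#elif defined(__WCHAR_MAX)
#define DEMO_WCHAR_MAX __WCHAR_MAX
#else
#error "WCHAR_MAX is unknown; define it at compile time, e.g. -DWCHAR_MAX=0xFFFF"
#endif

/* wchar_t can hold every code point: treat wide strings as UTF-32.
 * otherwise fall back to UTF-16. */
#if DEMO_WCHAR_MAX >= 0x10FFFF
#define DEMO_WCHAR_IS_UTF32 1
#else
#define DEMO_WCHAR_IS_UTF32 0
#endif

/* Returns 32 if wide strings would be treated as UTF-32, 16 for UTF-16. */
static int demo_wchar_encoding_bits(void)
{
    return DEMO_WCHAR_IS_UTF32 ? 32 : 16;
}
```

the decision is made entirely in the preprocessor, which is why the macro has to be a plain integer constant with the correct value: a wrong value silently selects the wrong wide encoding.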

what's not implemented

  • the following will be implemented in a later update:

    • wcstok function.
  • the following do not need to be implemented, because UTF-8 is stateless:

    • mbstate_t type.
    • mbsinit function.
    • thread-safe versions of encoding conversion functions.
  • the following are not planned to be implemented any time soon (or maybe ever):

    • wctype_t type.
    • character type functions (towlower, towupper, wcscasecmp, wcscasecmp_l, wcsncasecmp, wcsncasecmp_l, wctype, and the isw family, including iswctype).
    • string to number conversion functions (wcstol, wcstoul, wcstoll, wcstoull, wcstof, wcstod, and wcstold).
    • functions that interact with file streams (e.g. fgetws, fputws, wprintf).
    • wcscoll and wcscoll_l functions.
    • wcsftime function.
    • wcsdup function.
    • wcwidth and wcswidth functions.
    • wcsxfrm and wcsxfrm_l functions.

what is implemented

Important

you need to prepend a UC_ prefix to the names of these functions, types, and macros!

  • every wchar.h function not mentioned above, including a few POSIX functions that are not part of standard C, like wcpcpy.
  • wint_t type (equivalent to signed long int), with range macros WINT_MIN and WINT_MAX.
  • WEOF macro (evaluates to -1).
  • MB_LEN_MAX and MB_CUR_MAX macros (both evaluate to 4, because the multibyte encoding is always UTF-8).
  • wide character related stdlib.h functions (e.g. wcstombs, mbstowcs, mblen).
  • nonstandard mbstowc function, which is an alternative to mbtowc, but expects a wchar_t* instead of a wchar_t, to be able to read surrogate pairs in UTF-16.
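to see why a single 16-bit wchar_t is not enough for non-BMP characters, here is a sketch of how one code point maps to UTF-16. it is an illustration of the encoding only; utf16_encode is a made-up helper, and it says nothing about the real signature of mbstowc.

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of unicorn's API.
 * Encodes code point cp as UTF-16 into out[0..1].
 * Returns the number of wchar_t units written (1 or 2),
 * or 0 if cp is a surrogate or exceeds U+10FFFF. */
static size_t utf16_encode(unsigned long cp, wchar_t out[2])
{
    if (cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL))
        return 0;

    if (cp < 0x10000UL) {
        /* BMP character: fits in one unit */
        out[0] = (wchar_t)cp;
        return 1;
    }

    /* non-BMP character: split the 20-bit offset into a surrogate pair */
    cp -= 0x10000UL;
    out[0] = (wchar_t)(0xD800UL + (cp >> 10));    /* high surrogate */
    out[1] = (wchar_t)(0xDC00UL + (cp & 0x3FFUL)); /* low surrogate */
    return 2;
}
```

for example, U+20AC fits in one unit, while U+1F984 becomes the pair D83E DD84, so a conversion function that handles it needs room for two wchar_t values.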