lightweight implementation of wide character functions for C

cs127/unicorn

unicorn

unicorn is a lightweight implementation of most of the standard C wide character functions, for platforms that don't support them, but still have a wide character type in the form of wchar_t (mainly DJGPP).

functions for converting extended ASCII codepages to Unicode are also provided as an addon (see ascx at the end of the document).

Important

to be able to use unicorn, the wchar_t type in your compiling environment must be at least 16 bits (if unsigned) or 17 bits (if signed).
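the requirement above boils down to wchar_t being able to hold every value up to 0xFFFF: an unsigned 16-bit type can, but a signed 16-bit type stops at 0x7FFF, hence the extra bit. a minimal sketch of how you could check this in your own project, assuming a C99 environment where stdint.h defines WCHAR_MAX (the names `wchar_t_is_wide_enough` and `wchar_is_wide_enough` are made up for this example and are not part of unicorn):

```c
#include <stdint.h> /* C99: defines WCHAR_MAX (so does <wchar.h>, where available) */

/* C89-style static assertion: the array size becomes -1, and compilation
   fails, if wchar_t cannot represent every value up to 0xFFFF. */
typedef char wchar_t_is_wide_enough[(WCHAR_MAX >= 0xFFFF) ? 1 : -1];

/* the same condition as a run-time value, e.g. for a startup self-check */
static int wchar_is_wide_enough(void)
{
    return WCHAR_MAX >= 0xFFFF;
}
```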

Note

this is just a hobby project. as much as I try to fix issues, you should still probably not expect it to always work properly. also, the code isn't exactly the most optimized. you have my warning.

unlike the standard functions which are locale-dependent, unicorn does not support locales, and always uses the same text encodings:

  • wide characters (wchar_t) are assumed to be encoded in UTF-32 if WCHAR_MAX is at least 0x10FFFF (e.g. Linux), or UTF-16 otherwise (e.g. Windows).
    • surrogates (U+D800-U+DFFF) are considered invalid in UTF-32.
    • two new functions (wcstomb and mbtowcs) have been implemented as alternatives to wctomb and mbtowc respectively, to allow converting individual non-BMP characters in UTF-16.
  • multibyte strings (used in mbstowcs and the like) are assumed to be encoded in UTF-8.
    • surrogates (U+D800-U+DFFF) are considered invalid in multibyte strings.
    • sequences of 5-8 bytes are considered invalid, and so are 4-byte characters that encode a value above U+10FFFF.
    • overlong characters (characters encoded in a larger number of bytes than necessary) are considered invalid.
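to make the UTF-8 validity rules above concrete, here is a rough sketch of a strict single-character decoder. this is not unicorn's actual code: `utf8_decode` is a hypothetical helper, and it stores the result in an unsigned long so that the example does not depend on the size of wchar_t.

```c
#include <stddef.h>

/* hypothetical helper, not part of unicorn's API.
   decodes one UTF-8 character from s (at most n bytes) into *cp.
   returns the number of bytes consumed, or 0 if the sequence is
   invalid or truncated. */
static size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    /* smallest code point that legitimately needs 1, 2, 3, or 4 bytes */
    static const unsigned long min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long c;
    size_t len, i;

    if (n == 0) return 0;

    if (s[0] < 0x80)                { len = 1; c = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0; /* stray continuation byte, or lead byte of a 5-8 byte form */

    if (n < len) return 0; /* truncated */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0; /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    if (c < min_cp[len]) return 0;            /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF) return 0; /* surrogate */
    if (c > 0x10FFFF) return 0;               /* beyond the Unicode range */

    *cp = c;
    return len;
}
```

for example, the two-byte sequence C0 80 (an overlong encoding of U+0000) and the three-byte sequence ED A0 80 (the surrogate U+D800) are both rejected by this sketch.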

everything that unicorn implements uses the same name as its counterpart in standard C, except with a UC_ prefix. the only exception is the wchar_t type: unicorn uses the standard wchar_t.

compatibility

unicorn is almost C89-compatible, except that it needs to know the maximum possible value of the wchar_t type. if your compiling environment does not support C99 or newer, and your compiler does not itself predefine WCHAR_MAX, __WCHAR_MAX, or __WCHAR_MAX__, you need to define one of them manually at compile time. make sure to give it the correct value, and remember: if the type is not large enough, none of this will work!
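the lookup described above could be pictured roughly like this. this is only a sketch of the idea, not unicorn's actual source: the `MY_` macro names are invented for the example, and the order in which the three macros are tried is an assumption. with a GCC-style compiler, a manual definition would typically be passed as something like `-DWCHAR_MAX=0xFFFF` (the value shown is just an example for a 16-bit unsigned wchar_t).

```c
/* C99 and newer define WCHAR_MAX in <stdint.h> */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#include <stdint.h>
#endif

/* fall back through the names a compiler (or the user) might provide;
   MY_WCHAR_MAX is a hypothetical name used only in this example */
#if defined(WCHAR_MAX)
#define MY_WCHAR_MAX WCHAR_MAX
#elif defined(__WCHAR_MAX)
#define MY_WCHAR_MAX __WCHAR_MAX
#elif defined(__WCHAR_MAX__)
#define MY_WCHAR_MAX __WCHAR_MAX__
#else
#error "WCHAR_MAX is unknown; define it at compile time"
#endif

/* 1 if wide characters can be treated as UTF-32, 0 if they must be UTF-16 */
#if MY_WCHAR_MAX >= 0x10FFFF
#define MY_WCHAR_IS_UTF32 1
#else
#define MY_WCHAR_IS_UTF32 0
#endif
```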

what's not implemented

  • the following do not need to be implemented, because UTF-8 is stateless:

    • mbstate_t type.
    • mbsinit function.
    • thread-safe versions of encoding conversion functions.
  • the following are not planned to be implemented any time soon (or maybe ever):

    • wctype_t type.
    • character type functions (towlower, towupper, wcscasecmp, wcscasecmp_l, wcsncasecmp, wcsncasecmp_l, wctype, and the isw family, including iswctype).
    • string to number conversion functions (wcstol, wcstoul, wcstoll, wcstoull, wcstof, wcstod, and wcstold).
    • functions that interact with file streams (e.g. fgetws, fputws, wprintf).
    • wcscoll and wcscoll_l functions.
    • wcsftime function.
    • wcsdup function.
    • wcwidth and wcswidth functions.
    • wcsxfrm and wcsxfrm_l functions.

what is implemented

Important

you need to prepend a UC_ prefix to the names of these functions, types, and macros!

  • every wchar.h function not mentioned above, including a few nonstandard POSIX-only functions, like wcpcpy.
  • wint_t type (equivalent to signed long int), with range macros WINT_MIN and WINT_MAX.
  • WEOF macro (evaluates to -1).
  • MB_LEN_MAX and MB_CUR_MAX macros (both evaluate to 4, because the multibyte encoding is always UTF-8).
  • wide character related stdlib.h functions (e.g. wcstombs, mbstowcs, mblen).
  • nonstandard wcstomb function, which is an alternative to wctomb, but expects a wide character string instead of a single wide character, to be able to read surrogate pairs in UTF-16.
  • nonstandard mbtowcs function, which is an alternative to mbtowc, but treats the wide character pointer as a string instead of a pointer to a single wide character, to be able to write surrogate pairs in UTF-16.
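the reason the two nonstandard functions exist is that a non-BMP character does not fit in a single UTF-16 code unit, so a single wchar_t cannot carry it when wide characters are UTF-16. the following sketch shows the standard surrogate pair arithmetic. the helper names are hypothetical and are not unicorn functions, and code units are stored in unsigned long to keep the example independent of wchar_t.

```c
/* hypothetical helper: encodes code point cp as UTF-16 into out[].
   returns the number of code units written (1 or 2),
   or 0 if cp is not a valid Unicode scalar value. */
static int utf16_encode(unsigned long cp, unsigned long out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;

    if (cp < 0x10000) { /* BMP character: a single code unit */
        out[0] = cp;
        return 1;
    }

    cp -= 0x10000;                  /* leaves a 20-bit value */
    out[0] = 0xD800 | (cp >> 10);   /* high surrogate: top 10 bits */
    out[1] = 0xDC00 | (cp & 0x3FF); /* low surrogate: bottom 10 bits */
    return 2;
}

/* hypothetical helper: combines a valid high/low surrogate pair
   back into a code point. the caller must check the ranges first. */
static unsigned long utf16_decode_pair(unsigned long hi, unsigned long lo)
{
    return 0x10000 + (((hi - 0xD800) << 10) | (lo - 0xDC00));
}
```

for example, U+1F984 becomes the pair D83E DD84, which is why a function taking a single wide character cannot convert it.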

ascx

unicorn includes an addon called ascx, which you can find in the ascx subdirectory of the sources. it includes functions for converting strings from various extended ASCII codepages to Unicode.

the functions are:

  • ascxtowc: for converting a single character.
  • ascxstowcs: for converting an entire string.

the behaviour of the two functions is identical to that of mbtowc and mbstowcs respectively, except that they take an extra parameter to specify which codepage the extended ASCII string is encoded in.

the currently supported codepages are:

  • IBM437
  • IBM850
  • IBM858
  • Windows-1252
  • ISO-8859-1 with C0 and C1 control characters

the IBM codepages each have two variants:

  • C0: containing C0 control characters.
  • C0_REP: containing IBM dingbats in place of C0 control characters.

unassigned codepoints (e.g. $81 in Windows-1252) are converted to the replacement character (U+FFFD).
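to illustrate the kind of mapping involved, here is a small table-driven sketch for Windows-1252. it is not the ascx implementation and does not use the ascx function signatures; `cp1252_to_unicode` is a hypothetical name. only the range $80-$9F needs a table, because the rest of Windows-1252 coincides with ISO-8859-1, whose byte values equal their Unicode code points.

```c
/* Unicode code points for Windows-1252 bytes 0x80-0x9F.
   the five unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D)
   map to the replacement character U+FFFD. */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

/* hypothetical helper: converts one Windows-1252 byte to a code point */
static unsigned long cp1252_to_unicode(unsigned char byte)
{
    if (byte >= 0x80 && byte <= 0x9F)
        return cp1252_high[byte - 0x80];

    /* 0x00-0x7F and 0xA0-0xFF have the same values as in Unicode */
    return byte;
}
```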