unicorn is a lightweight implementation of most of the standard C wide character functions, for platforms that don't support them but still have a wide character type in the form of `wchar_t` (mainly DJGPP).
functions for converting extended ASCII codepages to Unicode are also provided as an addon (see ascx at the end of the document).
Important
to be able to use unicorn, the `wchar_t` type in your compiling environment must be at least 16 bits wide (if unsigned) or 17 bits wide (if signed), so that it can hold every value up to `0xFFFF`.
Note
this is just a hobby project. as much as I try to fix issues, you should still probably not expect it to always work properly. also, the code isn't exactly the most optimized. you have my warning.
unlike the standard functions which are locale-dependent, unicorn does not support locales, and always uses the same text encodings:
- wide characters (`wchar_t`) are assumed to be encoded in UTF-32 if `WCHAR_MAX` is at least `0x10FFFF` (e.g. Linux), or UTF-16 otherwise (e.g. Windows).
  - surrogates (`U+D800`-`U+DFFF`) are considered invalid in UTF-32.
  - two new functions (`wcstomb` and `mbtowcs`) have been implemented as alternatives to `wctomb` and `mbtowc` respectively, to allow converting individual non-BMP characters in UTF-16.
- multibyte strings (used in `mbstowcs` and the like) are assumed to be encoded in UTF-8.
  - surrogates (`U+D800`-`U+DFFF`) are considered invalid in multibyte strings.
  - characters of length 5-8 are considered invalid, and so are 4-byte characters that exceed `U+10FFFF`.
  - overlong characters (characters encoded in a larger number of bytes than necessary) are considered invalid.
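
to make the UTF-8 rules above concrete, here is a minimal sketch of a decoder that rejects the same inputs. it is not unicorn's actual code: `demo_utf8_decode` and its signature are hypothetical names made up for this example.

```c
#include <stddef.h>

/* Decode one UTF-8 character from s (at most n bytes) into *out.
   Returns the number of bytes consumed, or 0 if the sequence is
   invalid or truncated. Hypothetical helper, not part of unicorn. */
size_t demo_utf8_decode(const unsigned char *s, size_t n, unsigned long *out)
{
    unsigned long cp, min;
    size_t len, i;

    if (n == 0) return 0;

    if (s[0] < 0x80)      { len = 1; cp = s[0];        min = 0x0UL; }
    else if (s[0] < 0xC0) { return 0; } /* stray continuation byte */
    else if (s[0] < 0xE0) { len = 2; cp = s[0] & 0x1F; min = 0x80UL; }
    else if (s[0] < 0xF0) { len = 3; cp = s[0] & 0x0F; min = 0x800UL; }
    else if (s[0] < 0xF8) { len = 4; cp = s[0] & 0x07; min = 0x10000UL; }
    else                  { return 0; } /* lead byte of a 5-8 byte form */

    if (n < len) return 0;              /* truncated input */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0; /* not a continuation byte */
        cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
    }

    if (cp < min) return 0;                       /* overlong encoding */
    if (cp >= 0xD800UL && cp <= 0xDFFFUL) return 0; /* surrogate */
    if (cp > 0x10FFFFUL) return 0;                /* beyond Unicode range */

    *out = cp;
    return len;
}
```

the `min` value is the smallest code point that genuinely needs `len` bytes, so anything below it is an overlong form.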
everything that unicorn implements uses the same name as its counterpart in standard C, except with a `UC_` prefix.
the only exception is the `wchar_t` type: unicorn uses the standard `wchar_t`.
unicorn is almost C89-compatible, except that it needs to know the maximum possible value of the `wchar_t` type.
if your compiling environment does not support C99 or newer, then unless your compiler itself predefines `WCHAR_MAX`, `__WCHAR_MAX`, or `__WCHAR_MAX__`, you need to manually define one of them at compile time (make sure to give it the correct value! and remember, if the type is not large enough, none of this will work!).
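
as a rough illustration of the macro lookup described above, the sketch below picks whichever of the three macros is available and fails the build if none is defined or if the value is too small. `DEMO_WCHAR_MAX` and `demo_wide_is_utf32` are hypothetical names used only here; how unicorn itself performs this check is an assumption, not something this snippet documents.

```c
/* C99 and newer define WCHAR_MAX in <stdint.h>; C89 has no such header. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#include <stdint.h>
#endif

/* Use the first of the three macros that is defined (hypothetical alias). */
#if defined(WCHAR_MAX)
#define DEMO_WCHAR_MAX WCHAR_MAX
#elif defined(__WCHAR_MAX)
#define DEMO_WCHAR_MAX __WCHAR_MAX
#elif defined(__WCHAR_MAX__)
#define DEMO_WCHAR_MAX __WCHAR_MAX__
#else
#error "define WCHAR_MAX, __WCHAR_MAX, or __WCHAR_MAX__ at compile time"
#endif

/* wchar_t must be able to hold every value up to 0xFFFF. */
#if DEMO_WCHAR_MAX < 0xFFFF
#error "wchar_t is too small to hold UTF-16 code units"
#endif

/* Returns 1 if wide strings would be treated as UTF-32, 0 for UTF-16. */
int demo_wide_is_utf32(void)
{
#if DEMO_WCHAR_MAX >= 0x10FFFF
    return 1;
#else
    return 0;
#endif
}
```

on a C89 compiler that predefines none of the macros, the value would have to come from the command line or the build system, using whatever mechanism your compiler offers for defining macros.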
- the following do not need to be implemented, because UTF-8 is stateless:
  - `mbstate_t` type.
  - `mbsinit` function.
  - thread-safe versions of encoding conversion functions.
- the following are not planned to be implemented any time soon (or maybe ever):
  - `wctype_t` type.
  - character type functions (`towlower`, `towupper`, `wcscasecmp`, `wcscasecmp_l`, `wcsncasecmp`, `wcsncasecmp_l`, `wctype`, and the `isw` family, including `iswctype`).
  - string to number conversion functions (`wcstol`, `wcstoul`, `wcstoll`, `wcstoull`, `wcstof`, `wcstod`, and `wcstold`).
  - functions that interact with file streams (e.g. `fgetws`, `fputws`, `wprintf`).
  - `wcscoll` and `wcscoll_l` functions.
  - `wcsftime` function.
  - `wcsdup` function.
  - `wcwidth` and `wcswidth` functions.
  - `wcsxfrm` and `wcsxfrm_l` functions.
Important
you need to prepend a `UC_` prefix to the names of these functions, types, and macros!
- every `wchar.h` function not mentioned above, including a few nonstandard POSIX-only functions, like `wcpcpy`.
- `wint_t` type (equivalent to `signed long int`), with range macros `WINT_MIN` and `WINT_MAX`.
- `WEOF` macro (evaluates to `-1`).
- `MB_LEN_MAX` and `MB_CUR_MAX` macros (both evaluate to `4`, because the multibyte encoding is always UTF-8).
- wide character related `stdlib.h` functions (e.g. `wcstombs`, `mbstowcs`, `mblen`).
- nonstandard `wcstomb` function, which is an alternative to `wctomb`, but expects a wide character string instead of a single wide character, to be able to read surrogate pairs in UTF-16.
- nonstandard `mbtowcs` function, which is an alternative to `mbtowc`, but treats the wide character pointer as a string instead of a pointer to a single wide character, to be able to write surrogate pairs in UTF-16.
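
the reason the two nonstandard functions need a string rather than a single wide character is that a non-BMP code point occupies two UTF-16 code units. the sketch below shows the standard surrogate pair arithmetic; `demo_to_surrogates` and `demo_from_surrogates` are hypothetical helpers written for this example, not unicorn functions.

```c
/* Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair.
   Returns 2 (code units written) on success, 0 if cp is out of range. */
int demo_to_surrogates(unsigned long cp, unsigned int *hi, unsigned int *lo)
{
    if (cp < 0x10000UL || cp > 0x10FFFFUL) return 0;
    cp -= 0x10000UL;                              /* 20 bits remain */
    *hi = (unsigned int)(0xD800UL + (cp >> 10));    /* top 10 bits */
    *lo = (unsigned int)(0xDC00UL + (cp & 0x3FFUL)); /* bottom 10 bits */
    return 2;
}

/* Join a high and a low surrogate back into a code point.
   Returns 0 if the two units do not form a valid pair. */
unsigned long demo_from_surrogates(unsigned int hi, unsigned int lo)
{
    if (hi < 0xD800 || hi > 0xDBFF) return 0;
    if (lo < 0xDC00 || lo > 0xDFFF) return 0;
    return 0x10000UL
         + (((unsigned long)(hi - 0xD800) << 10)
         | (unsigned long)(lo - 0xDC00));
}
```

for example, `U+1F600` splits into `0xD83D` followed by `0xDE00`, which is why a single `wchar_t` cannot carry it when `wchar_t` is 16 bits wide.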
unicorn includes an addon called ascx, which you can find in the `ascx` subdirectory of the sources. it includes functions for converting strings from various extended ASCII codepages to Unicode.
the functions are:
- `ascxtowc`: for converting a single character.
- `ascxstowcs`: for converting an entire string.
the behaviour of the two functions is identical to that of `mbtowc` and `mbstowcs` respectively, except that they take an extra parameter to specify which codepage the extended ASCII string is encoded in.
the currently supported codepages are:
- IBM437
- IBM850
- IBM858
- Windows-1252
- ISO-8859-1 with C0 and C1 control characters
the IBM codepages each have two variants:
- `C0`: containing C0 control characters.
- `C0_REP`: containing IBM dingbats in place of C0 control characters.
unassigned codepoints (e.g. `0x81` in Windows-1252) are converted to the replacement character (`U+FFFD`).
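
to show what the replacement rule means in practice, here is a small standalone sketch of a Windows-1252 lookup. it does not use the ascx API (whose exact signatures are not shown in this document): `demo_cp1252_high` and `demo_cp1252_to_unicode` are hypothetical names, and the table follows the commonly published Windows-1252 mapping.

```c
/* Unicode code points for Windows-1252 bytes 0x80..0x9F.
   0xFFFD (replacement character) marks the five unassigned bytes:
   0x81, 0x8D, 0x8F, 0x90, and 0x9D. */
static const unsigned int demo_cp1252_high[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

/* Convert one Windows-1252 byte to a Unicode code point. */
unsigned int demo_cp1252_to_unicode(unsigned char byte)
{
    if (byte >= 0x80 && byte <= 0x9F)
        return demo_cp1252_high[byte - 0x80];
    /* 0x00..0x7F and 0xA0..0xFF have the same values in Unicode. */
    return byte;
}
```

a table like this only needs to cover the bytes that differ from ISO-8859-1, because everything else maps to the code point with the same numeric value.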