After achieving full integration of my dynamic Unicode string type (DWSTRING), which works with all the intrinsic FreeBasic string functions, I am attempting to solve the problem of Unicode surrogate pairs. I already have code to do it when using my additional string functions (more than 30). The problem is that, because a surrogate pair occupies two UTF-16 code units, manipulating a string can break a pair by keeping only half of it. The solution could be to check whether the string to manipulate contains surrogate pairs. If it does, they are replaced with Unicode code points; once the string has been processed, those code points are converted back to surrogate pairs. To do this last conversion I needed to use CHR, but it is unusable for the purpose because it is ANSI only. So I had to write my own:
FUNCTION ChrW (BYVAL codepoint AS UInteger) AS DWSTRING
   ' Assumes codepoint is a valid scalar value (<= &H10FFFF and not itself a surrogate)
   If codepoint <= &HFFFF Then Return WString(1, codepoint)
   ' Convert to UTF-16 surrogate pair for higher codepoints
   Dim As UShort highSurrogate = &HD800 Or ((codepoint - &H10000) Shr 10)
   Dim As UShort lowSurrogate = &HDC00 Or ((codepoint - &H10000) And &H3FF)
   Return WString(1, highSurrogate) + WString(1, lowSurrogate)
END FUNCTION
Other functions that I have written to deal with the surrogates are:
' ========================================================================================
' Converts surrogate pair to unicode code point
' Extracts the actual Unicode code point from a valid surrogate pair.
' ========================================================================================
FUNCTION SurrogatePairToCodePoint (BYVAL high AS USHORT, BYVAL low AS USHORT) AS ULONG
   IF IsValidSurrogatePair(high, low) THEN
      RETURN ((high - &HD800) * &H400) + (low - &HDC00) + &H10000
   END IF
   RETURN 0   ' Invalid surrogate pair
END FUNCTION
' ========================================================================================
' ========================================================================================
' Encode unicode code point as surrogate pair
' Converts a Unicode code point (above U+FFFF) back into its high and low surrogate pair.
' ========================================================================================
SUB CodePointToSurrogatePair (BYVAL codePoint AS ULONG, BYREF high AS USHORT, BYREF low AS USHORT)
   IF codePoint >= &H10000 AND codePoint <= &H10FFFF THEN
      high = &HD800 + ((codePoint - &H10000) \ &H400)
      low = &HDC00 + ((codePoint - &H10000) MOD &H400)
   ELSE
      high = 0
      low = 0
   END IF
END SUB
' ========================================================================================
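As a quick sanity check of the arithmetic used by the two conversion routines above, here is a small standalone sketch that works through one code point by hand. The variable names (`cp`, `offset`, `hiUnit`, `loUnit`, `back`) are only illustrative and do not come from the DWSTRING code.

```freebasic
' Worked example: U+1F600 <-> surrogate pair D83D DE00.
Dim As ULong cp = &H1F600
Dim As ULong offset = cp - &H10000                 ' &HF600, a 20-bit value

' Forward: split the 20-bit offset into two 10-bit halves.
Dim As UShort hiUnit = &HD800 + (offset Shr 10)    ' top 10 bits    -> &HD83D
Dim As UShort loUnit = &HDC00 + (offset And &H3FF) ' bottom 10 bits -> &HDE00
Print Hex(hiUnit); " "; Hex(loUnit)                ' prints "D83D DE00"

' Reverse: recombine the halves and add the &H10000 bias back.
Dim As ULong back = ((hiUnit - &HD800) Shl 10) + (loUnit - &HDC00) + &H10000
Print Hex(back)                                    ' prints "1F600"
```

Shifting by 10 and masking with &H3FF is equivalent to the `\ &H400` and `MOD &H400` forms used above, since all the values involved are non-negative.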
And to check whether the string has surrogates:
' ========================================================================================
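FUNCTION HasSurrogates (BYREF text AS WSTRING) AS BOOLEAN
   ' Only high surrogates are tested: each one marks the start of a pair.
   ' A lone low surrogate is therefore not reported.
   FOR i AS LONG = 1 TO LEN(text)
      IF ASC(text, i) >= &HD800 AND ASC(text, i) <= &HDBFF THEN RETURN TRUE
   NEXT
   RETURN FALSE
END FUNCTION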
' ========================================================================================
' ========================================================================================
' Checks whether a UTF-16 character is the high part of a surrogate pair.
' ========================================================================================
FUNCTION IsHighSurrogate (BYVAL ch AS USHORT) AS BOOLEAN
   RETURN (ch >= &HD800 AND ch <= &HDBFF)
END FUNCTION
' ========================================================================================
' ========================================================================================
' Checks whether a UTF-16 character is the low part of a surrogate pair.
' ========================================================================================
FUNCTION IsLowSurrogate (BYVAL ch AS USHORT) AS BOOLEAN
   RETURN (ch >= &HDC00 AND ch <= &HDFFF)
END FUNCTION
' ========================================================================================
' ========================================================================================
' Checks whether two UTF-16 code units form a valid high-low surrogate pair.
' ========================================================================================
FUNCTION IsValidSurrogatePair (BYVAL high AS USHORT, BYVAL low AS USHORT) AS BOOLEAN
   RETURN (high >= &HD800 AND high <= &HDBFF) AND (low >= &HDC00 AND low <= &HDFFF)
END FUNCTION
' ========================================================================================
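To illustrate the first step of the approach described at the top (replace surrogate pairs with code points before processing), here is a minimal standalone sketch. `DecodeUtf16` is a hypothetical helper, not part of DWSTRING; it uses a plain `ULong` array and the intrinsic `WSTRING` so it can be compiled on its own, and it assumes unpaired surrogates should simply be copied through unchanged.

```freebasic
' Hypothetical helper: decode UTF-16 code units into an array of code points.
' Unpaired surrogates are copied through unchanged (an assumption of this sketch).
Sub DecodeUtf16 (ByRef text As WString, cps() As ULong)
   Dim As Long n = Len(text), count = 0
   If n = 0 Then Erase cps : Exit Sub
   ReDim cps(0 To n - 1)            ' upper bound: one code point per code unit
   Dim As Long i = 1
   Do While i <= n
      Dim As ULong u = Asc(text, i)
      If u >= &HD800 AndAlso u <= &HDBFF AndAlso i < n Then
         Dim As ULong v = Asc(text, i + 1)
         If v >= &HDC00 AndAlso v <= &HDFFF Then
            ' Valid pair: combine both units and skip the low surrogate.
            u = ((u - &HD800) Shl 10) + (v - &HDC00) + &H10000
            i += 1
         End If
      End If
      cps(count) = u
      count += 1
      i += 1
   Loop
   ReDim Preserve cps(0 To count - 1)   ' shrink to the real number of code points
End Sub

' Usage: "A", U+1F600 (as the pair D83D DE00), "B"
Dim s As WString * 8
s = WChr(&H41, &HD83D, &HDE00, &H42)

Dim cps() As ULong
DecodeUtf16(s, cps())

' 4 code units decode to 3 code points: U+41, U+1F600, U+42
Print "code units:"; Len(s)
Print "code points:"; UBound(cps) + 1
For i As Long = 0 To UBound(cps)
   Print "U+"; Hex(cps(i))
Next
```

Once the text is held as one array element per code point, operations such as taking the first N characters or reversing cannot split a pair; the result can then be turned back into UTF-16 with a routine like ChrW.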