Handling of encodings of Dao strings #174
I think the proposed change is a must - I've been thinking about "UTF-8 rules them all", while we already have built-in WCS handling, more than once, but wasn't convinced it was the right time to open this issue :)
String representation is exactly what I have been thinking over recently. You started the issue, so now you'll have to bear with me trying to push in my grand ideas -- you brought it on yourself :) I have a simple idea. Kick out wide strings, entirely. The problems of
Ruby and Go have byte strings only. Rust even guarantees that any properly created string is a valid UTF-8 sequence. Not counting C and C++, I know of only Python to have a mess with Unicode support similar to (and probably caused by) wide strings. It seems uncomfortable not to be able to treat a single string element as a character. But in practice there are very few cases where it can be a problem. And it is quite easy to provide some means to work with a byte string on a character-wise basis:
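For illustration only -- a minimal C sketch (hypothetical helper names, not part of Dao's API) of stepping through a UTF-8 byte string character by character using nothing but the lead-byte patterns:

#include <stddef.h>

/* Byte width of the UTF-8 character starting at the given lead byte;
   returns 1 for ASCII and for invalid bytes so iteration always advances. */
static size_t u8_width( unsigned char lead )
{
	if( lead < 0x80 ) return 1;            /* 0xxxxxxx */
	if( (lead & 0xE0) == 0xC0 ) return 2;  /* 110xxxxx */
	if( (lead & 0xF0) == 0xE0 ) return 3;  /* 1110xxxx */
	if( (lead & 0xF8) == 0xF0 ) return 4;  /* 11110xxx */
	return 1;                              /* stray continuation byte */
}

/* Count the characters in a UTF-8 byte string of known byte length. */
size_t u8_char_count( const char *str, size_t size )
{
	size_t i = 0, count = 0;
	while( i < size ){
		i += u8_width( (unsigned char)str[i] );
		count += 1;
	}
	return count;
}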
Finally, the internal handling of strings in the core and modules will be much simpler and easier to maintain. And the strings will be simpler for the user to reason about as well. P.S.
Just don't forget to park it back in my garage when you're done :)
OK, it seems everyone has an issue with string. So let's do something about it:) Removing WCS completely? This option had been on my radar before, you just reminded me to reconsider it again:) I had thought
Not. It may also make sense to convert the local encoding to UTF-8 automatically when reading from a file. Ideally, the encoding should be bound to the file stream upon its creation, so that all the strings you obtain from it are already in UTF-8. By the way, any local encoding conversion in pure C will require critical sections, since the standard conversion functions depend on the process-wide locale.
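As a side note, POSIX iconv avoids the process-wide locale entirely, since each conversion carries its own handle. A rough sketch, assuming iconv is available, with error handling trimmed:

#include <iconv.h>
#include <stddef.h>

/* Convert a buffer from a named local encoding (e.g. the value reported by
   nl_langinfo(CODESET)) to UTF-8. Returns bytes written, or (size_t)-1 on error. */
size_t local_to_utf8( const char *codeset, const char *in, size_t insize,
                      char *out, size_t outsize )
{
	iconv_t cd = iconv_open( "UTF-8", codeset );
	char *inp = (char*) in;  /* iconv() advances the pointers, not the data */
	char *outp = out;
	size_t inleft = insize, outleft = outsize;
	size_t res;

	if( cd == (iconv_t)-1 ) return (size_t)-1;
	res = iconv( cd, &inp, &inleft, &outp, &outleft );
	iconv_close( cd );
	return res == (size_t)-1 ? (size_t)-1 : outsize - outleft;
}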
The original encoder:

inline char FormU8Trail( uint_t cp, int shift )
{
	return ( ( cp >> 6*shift ) & 0x3F ) + ( 0x2 << 6 ); // 10xxxxxx trail byte
}
static void DaoEnc_EncodeUTF8( DaoProcess *proc, DaoValue *p[], int N )
{
	DString *str = p[0]->xString.data;
	DString *out = DaoProcess_PutMBString( proc, "" );
	if ( str->mbs ){
		DaoProcess_RaiseException( proc, DAO_ERROR, "String already encoded" );
		return;
	}
	for ( daoint i = 0; i < str->size; i++ ){
		wchar_t ch = str->wcs[i];
		uint_t cp;
		if ( sizeof(wchar_t) == 4 ) // utf-32
			cp = (uint_t)ch;
		else { // utf-16
			if ( ch >= 0xD800 && ch <= 0xDBFF ){ // lead surrogate
				if ( i < str->size - 1 && str->wcs[i + 1] >= 0xDC00 && str->wcs[i + 1] <= 0xDFFF ) // trail surrogate
					cp = 0x10000 + ( ( (uint_t)ch - 0xD800 ) << 10 ) + ( (uint_t)str->wcs[i + 1] - 0xDC00 );
				else
					goto Error;
				i++; // the trail surrogate was consumed as well
			}
			else // bmp
				cp = (uint_t)ch;
		}
		if ( cp < 0x80 ) // 0xxxxxxx
			DString_AppendChar( out, (char)cp );
		else if ( cp < 0x800 ){ // 110xxxxx 10xxxxxx
			DString_AppendChar( out, (char)( ( cp >> 6 ) + ( 0x6 << 5 ) ) );
			DString_AppendChar( out, FormU8Trail( cp, 0 ) );
		}
		else if ( cp < 0x10000 ){ // 1110xxxx 10xxxxxx 10xxxxxx
			DString_AppendChar( out, (char)( ( cp >> 12 ) + ( 0xE << 4 ) ) );
			DString_AppendChar( out, FormU8Trail( cp, 1 ) );
			DString_AppendChar( out, FormU8Trail( cp, 0 ) );
		}
		else if ( cp < 0x200000 ){ // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
			DString_AppendChar( out, (char)( ( cp >> 18 ) + ( 0x1E << 3 ) ) );
			DString_AppendChar( out, FormU8Trail( cp, 2 ) );
			DString_AppendChar( out, FormU8Trail( cp, 1 ) );
			DString_AppendChar( out, FormU8Trail( cp, 0 ) );
		}
		else
			goto Error;
	}
	return;
Error:
	DaoProcess_RaiseException( proc, DAO_ERROR, "Invalid code unit found" );
}
A remark: since double-quoted
What I thought about was separation of character strings and byte arrays.
MBS string, which is going to be the only string representation, is actually just a byte array. Nothing will have to be changed in the existing stuff like indexing a string or getting its size, so it will continue to be just an array of bytes. There is nothing to separate then. Means can be added to convert a string to an array of Unicode characters and vice versa, but it is not exactly about encoding.
@Night-walker, thanks for the encoder!
I was actually considering to use
Of course, you can work with strings as easily as with arrays of bytes. A string is essentially an array of bytes, and supports operations that allow it to be used like an array. I don't see why you believe it is not convenient enough.
So for example, if all strings are UTF-8 by default, I suppose
It will be byte indexing, and slicing will still work byte-wise, and string size will be returned in bytes, etc. For working with characters, there will be additional means. However, it's questionable how
Well, if almost everything will stay byte-wise, then the I really like the simplicity of indexing, slicing, for-in etc. and would like to use it also for characters, but it would imply introducing another "codepoint string" type, or switching the default string manipulation from byte-wise to character-wise and introducing a type for byte arrays. Assignment between the "codepoint string" and "byte array" wouldn't be allowed, but casting would behave like
Look at Ruby or Rust strings. Something like There are really very few cases when you have to index multi-byte characters or iterate over them one by one. In fact, I also had doubts about how safe and convenient it is to work with UTF-8 strings. But after inspecting some use cases I concluded that it's a rather exceptional case when a multi-byte character can compromise code correctness or require special handling.
I must disagree as a citizen of a non-ASCII country. I usually use regexps (and any other searching and string handling) with character strings and I'm really disappointed when that's not available (e.g. GNU coreutils, grep, sed, awk...). These days more and more data gets internationalized and the need for out-of-the-box support for such data grows enormously. Therefore I'm not fully satisfied with the OOP-like approach (i.e. having methods for character-wise handling) for treating the majority of our input data (for example, take random data from the web - most of it will be in UTF-8 or another internationalized encoding, not in ASCII any more :( ).
I am just considering another alternative: instead of removing wide character strings, why not use Just tested, removing WCS does not cut down the binary size much, just about 20K (2.4%). Considering that there is still a need to implement character-wise methods, the reduction will be even less. So removing WCS is not as big a win as I hoped. But keeping the option of WCS will certainly make it more convenient to handle non-ASCII texts. (Currently, the Dao help module also relies on WCS, by the way.)
When I talked about ASCII-based parsing, I didn't mean that it is only suitable for ASCII input. I pointed out that in most cases only ASCII characters are essential as "anchors" in string handling. Other characters, regardless of their size, are simply passed over. For instance, an XML parser doesn't care about any byte that is not part of the ASCII markup. Overall, I can hardly imagine a typical case in which one can make an erroneous assumption about character size. Only something like using hard-coded non-ASCII string literals in the source together with their hard-coded sizes, etc.
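What makes this safe in UTF-8 specifically is that every byte of a multi-byte character has its high bit set, so a plain byte comparison against an ASCII delimiter can never match inside one. A tiny illustrative scan (hypothetical helper, not Dao code):

#include <stddef.h>

/* Find the next ASCII delimiter (e.g. '<') in a UTF-8 buffer by byte offset.
   Lead and continuation bytes of multi-byte characters are all >= 0x80,
   so they are simply passed over. Returns -1 if not found. */
long find_ascii_anchor( const char *str, size_t size, char delim )
{
	size_t i;
	for( i = 0; i < size; ++i )
		if( str[i] == delim ) return (long)i;
	return -1;
}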
Maybe I'm missing something, but how does it solve the fact that
If the input for such an XML parser is in UTF-8, then this statement holds. In other encodings it depends (i.e. it's false).
Slicing? Indexing? I'm becoming more and more convinced of the need for
I doubt it will. It is almost always possible to process text in MBS in the same way as in WCS. If you have realistic counter-examples, then just dispel my delusion, for I didn't find any such cases that I would call typical.
I thought about that too -- UTF-16 strings, as in Qt, Java and .Net. But that would only add extra complexity for the internal use of such non-C/C++-compatible strings, while they may turn out to be not as useful as they seem. Note that having wide strings doesn't prevent character-related errors. One can easily make a false assumption about the form of some string at some point, e.g. simply forget to convert it to WCS. Having only Unicode strings would solve this, but that is doubtfully an option for Dao. Relatively new independent languages seem to avoid wide strings, including Ruby, Go and Rust, which are all aimed at various web uses. There is even an apparent tendency towards UTF-8, which can't exist without a reason.
.Net supports wchar_t in all variants (namely 8b, 16b and 32b) IIRC.
Yep and I think my proposal with
It always holds regardless of the encoding, because only ASCII characters constitute XML markup -- and they have identical codes in any sane encoding including UTFs and local 8-bit encodings.
These are not use cases. Those are operations based on one or two indexes which are obtained from somewhere else -- usually from searching or matching. Only hard-coded values could lead to errors, but such a case is rather unlikely even if you search/match a multi-byte string.
But internally all strings there are always 16-bit.
A UTF-8 string is a byte array. It is not only redundant, but also quite inefficient to make index-based string operations character-wise, if that is what you imply. It is simply not a viable solution, as all indexing, slicing, index-based searching and matching would have O(n) worst-case performance.
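To make the cost concrete, here is roughly what a character-aware index lookup would have to do on a UTF-8 byte array -- a linear walk over the prefix (sketch, hypothetical names):

#include <stddef.h>

/* Byte offset of the n-th character of a UTF-8 buffer, or (size_t)-1 if the
   string has fewer than n+1 characters. O(n) in the length of the prefix. */
size_t u8_char_offset( const unsigned char *str, size_t size, size_t n )
{
	size_t i = 0;
	while( i < size ){
		unsigned char b = str[i];
		if( n == 0 ) return i;
		if( b < 0x80 ) i += 1;                 /* ASCII */
		else if( (b & 0xE0) == 0xC0 ) i += 2;
		else if( (b & 0xF0) == 0xE0 ) i += 3;
		else if( (b & 0xF8) == 0xF0 ) i += 4;
		else i += 1;                           /* invalid byte: step over it */
		n -= 1;
	}
	return (size_t)-1;
}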
Character access by index cannot be done efficiently with MBS. But this is probably not a very typical use case.
Actually not much extra complexity.
This may be an issue. Having two forms of strings demands extra attention when dealing with strings passed from somewhere else. At a certain point, a programmer may simply forget to check and handle them properly. After more consideration, it seems the advantage of removing WCS should outweigh the advantage of keeping it.
Exactly: considering e.g. the 10 most used encodings, only the first few (ASCII, UTFs and maybe one or two others) are sane, which is what I wanted to emphasize -- i.e. I have no idea how many encodings are used in China (daokoder, any specifics?), Japan, Arabic countries etc., which play an inherent role in IT. So sanity shouldn't be relied upon, as it's not a factual measure, but rather a hope :)
What are those use-cases? I couldn't come up with anything other than operations with characters or methods (search, match...).
Compared to the solution with methods for each such operation it's even worse - O(n + overhead_of_calling_methods). If we openly say to the programmer "hey, there is a bytearray with all its efficiency" and "there is also a UTF-8 string, but with O(n) operations", he'll decide what to use where and when to convert between them. If a UTF-8 string is treated like a byte array, we provide him with "half-integrated" support for character strings (part of the usage would be with operators and the rest with methods). We'll allow him to modify UTF-8 anywhere on a byte-wise basis, which will lead to flawed UTF-8 strings, etc.
I can clarify it for you :) There is ASCII, there are local 8-bit encodings, there is UTF-7/8/16(BE/LE)/32, and there are a few local variable-width encodings. But I don't know of anything which is not backward-compatible with ASCII. By saying "sane" I mainly wanted to exclude some 50+ year-old standards which could predate or compete with ASCII.
"Operations with characters" is not a task. What's the goal? What should be accomplished and why this particular way? That's what I would call a use-case.
There is no practical reason why a string should use character-aware indexes for basic operations. It can just use byte indexes, retaining both the simplicity and efficiency.
You can put garbage into a string in virtually any language. And here UTF-8 is actually beneficial in that such an act can be detected in time. So it's another point added to the UTF-8 score :)
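A sketch of such detection (structural validation only; overlong forms and surrogate code points are not rejected here, to keep it short):

#include <stddef.h>

/* Returns 1 if the buffer is structurally valid UTF-8: every lead byte is
   followed by the required number of 10xxxxxx continuation bytes. */
int utf8_valid( const unsigned char *str, size_t size )
{
	size_t i = 0;
	while( i < size ){
		unsigned char b = str[i];
		size_t k, extra;
		if( b < 0x80 ) extra = 0;
		else if( (b & 0xE0) == 0xC0 ) extra = 1;
		else if( (b & 0xF0) == 0xE0 ) extra = 2;
		else if( (b & 0xF8) == 0xF0 ) extra = 3;
		else return 0;                           /* stray continuation or 0xF8..0xFF */
		if( i + 1 + extra > size ) return 0;     /* truncated sequence at the end */
		for( k = 1; k <= extra; ++k )
			if( (str[i + k] & 0xC0) != 0x80 ) return 0;
		i += 1 + extra;
	}
	return 1;
}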
Ok, if you're so sure, I'll call you when I get into trouble with encodings in Dao sometime in the future :) Btw looking at http://en.wikipedia.org/wiki/GB_18030 (and the corresponding "See also" paragraph) makes me sure I'll contact you very soon :)
Find all words containing ď starting from the fiftieth character. Count characters in a word. Find all names starting with Žď. Print all non-ASCII characters from a string. Etc. Basically one really needs character-wise handling nearly everywhere. Conversely, I can't think of any use case where I'm interested in the underlying representation rather than its meaning.
For efficiency there is a bytearray or any other similar vector structure with fixed-size elements (e.g. DataFrame). I really don't want to use a UTF-8 string neither like
Such detection is inherently O(n) => it is not feasible to do it before (or during) each string operation.
Starting from the 50th character? Not realistic. Maybe starting from some position, but not from the Nth character.
OK, that's fair. But it's a matter of calling something like
If you don't make the assumption that 'Žď' occupies 2 bytes, i.e. you don't use hard-coded literals with their hard-coded sizes, there is no problem. Even with a low-level approach:

pat = 'Žď'
str = '<some text>'
pos = 0
while (pos = str.find(pat, pos), pos > 0){
io.writeln(str[pos : pos + %pat - 1])
pos += %pat
}
Virtually the same as above:

pat = '[^ABCDEF...]'
str = '<some text>'
pos = 0
while (match = str.match(pat, pos), match != none){
io.writeln(str[match.start : match.end])
pos = match.end + 1
}
But it doesn't mean that one has to use some special, explicitly character-wise handling everywhere. It works fine without it.
You won't have to do that because it's absolutely meaningless.
But there is no need to do it before every string operation. A character-wise operation will inevitably detect it, obviously with zero overhead. For a byte-wise operation it doesn't make sense to care about characters. Let me be a smart-ass a bit, if I may. I spent hours considering all possible variants of revising strings in Dao. I inspected other languages with regard to string handling (particularly UTF-8), read various discussions, manifestos, cries of pain, documentation and historical notes regarding encodings, text representations, multilingual support and whatever else. UTF-8 is not my favorite string representation -- I tend to like UTF-16 more. But having only UTF-8/byte strings in Dao is:
If you think one can't live without wide strings, look at Ruby. It has only byte strings. Default operations work byte-wise. There are additional methods to (rarely) work with characters in an explicit way. Ruby has existed for two decades, it works, a lot of people are using it, and it's used extensively on the Web. There is no real problem in handling non-ASCII text via byte strings. At least I don't see anything proving the opposite.
I have been thinking about an approach to support fast access to characters by index; it is O(1) in the best cases and O(n) in the worst. This approach attaches an auxiliary array to the string when it is accessed by char index, and the array stays there as long as the string is not modified. This array stores pairs of numbers (maybe as short ints), where the first indicates the width (in bytes) of a char, and the second stores the number of continuous occurrences of chars of that width. In the best cases, there are only a few such pairs, so the byte location of a char at a certain index can be computed efficiently. In typical cases, there may be more, but there shouldn't be many, so computing the indices should still be a lot faster than O(n).

The downside is that, even if a string is never accessed by char indices, each string will still require at least 12 bytes more space (on a 64-bit machine, 8 bytes on 32-bit): two short int fields and one pointer field. Another downside is that, in order to use short int fields, it may only be able to support strings of a couple of tens of thousands of characters in the worst cases. But worst-case scenarios are usually not a big concern.

Of course, another obvious approach is to pre-index all the chars before char access by index. This way a single access is guaranteed to be O(1). But it clearly takes a lot more space than the above approach, and it is not feasible to store it along with the string. And if the pre-indexing were to be done each time, it would be just too expensive. When accessing all the chars in a single loop this would be the preferred approach, but the common scenario for that is to access all chars from the first to the last sequentially, in which case there is no need for any kind of auxiliary array. So my approach should be much more preferable for general cases. My only concern is the extra 12 bytes of space for each string, actually only 4 bytes more with respect to the current string data structure, maybe not a big deal.
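A rough sketch of the auxiliary array described above, with hypothetical names: pairs of (character byte width, count of consecutive characters of that width), built in one pass and consulted to turn a character index into a byte offset:

#include <stddef.h>

/* One run of consecutive characters that share the same byte width. */
typedef struct {
	unsigned short width;  /* bytes per character in this run */
	unsigned short count;  /* how many consecutive characters have that width */
} CharRun;

/* Build the index in one pass over a UTF-8 buffer, widths taken from lead bytes.
   Returns the number of runs written (at most maxruns). */
size_t build_char_runs( const unsigned char *str, size_t size,
                        CharRun *runs, size_t maxruns )
{
	size_t i = 0, nruns = 0;
	while( i < size && nruns < maxruns ){
		unsigned char b = str[i];
		unsigned short w = b < 0x80 ? 1 : (b & 0xE0) == 0xC0 ? 2
		                 : (b & 0xF0) == 0xE0 ? 3 : (b & 0xF8) == 0xF0 ? 4 : 1;
		if( nruns > 0 && runs[nruns-1].width == w && runs[nruns-1].count < 0xFFFF ){
			runs[nruns-1].count += 1;
		}else{
			runs[nruns].width = w;
			runs[nruns].count = 1;
			nruns += 1;
		}
		i += w;
	}
	return nruns;
}

/* Byte offset of the n-th character: O(number of runs), which is just a few
   entries for text where character widths rarely change. */
size_t char_index_to_byte( const CharRun *runs, size_t nruns, size_t n )
{
	size_t offset = 0, r;
	for( r = 0; r < nruns; ++r ){
		if( n < runs[r].count ) return offset + n * runs[r].width;
		offset += (size_t) runs[r].count * runs[r].width;
		n -= runs[r].count;
	}
	return offset;  /* index past the end: offset of the string end */
}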
I wouldn't worry that much about accessing individual characters. It just doesn't happen as "I want the Nth char", which is what I have been trying to show. I specifically spent time trying to find a case where one may actually need to jump over N characters forward or backward, in order to prove that byte strings are inconvenient and dangerous, which is what I believed. I didn't find any realistic, typical case. It is surprising, but almost all real-world tasks on byte strings cannot be compromised by multi-byte characters. You still shouldn't be careless about such strings, but that applies to dual (MBS/WCS) strings as well. Maybe even more so, as the only way to ensure that you have a wide string in Dao is to check it manually for every routine parameter and returned value -- which is considerably more cumbersome and error-prone than simply knowing that all strings are always MBS and not making any assumptions about character size.
Whether to worry or not is not the issue, I am just trying to consider possible options. BTW, the base overhead in my approach is actually 8 bytes, so the string structure would have the same size as now.
Well, if you intend to leave all basic string operations byte-wise, then there is nothing to argue about. It is of course feasible to provide a character cache, as long as it is used only when the user explicitly calls something like
On Linux, I guess. But not on Mac OSX. I do seem to remember it worked before, but I don't remember if it was only on Linux.
You forget that terminals do not always support UTF-8. So when printing out a string, conversion may be necessary; otherwise you won't be able to display some strings properly with a simple
Naturally, there is
Printing to a terminal is an edge use case, although it's still the same as writing to a file. It's better to know exactly what encoding will be assumed rather than trying to guess how it will behave depending on what you have in a string. All these possible implicit changes are quite error-prone and hard to diagnose.
For
Which APIs are you talking about?
Those functions aren't exported, so it doesn't actually matter. I meant
Right, they are just internal functions, and are not meant to be exported.
OK, let me sum it up regarding conversion into the local encoding when writing to a stream.
Isn't putting
Don't worry about this, I already changed it. Now conversion to the local encoding (assuming UTF-8, without guessing) is done only in the following cases:
In all other situations, there is no implicit conversion. If one wants to avoid conversion even in these cases, there is the option of using formatted writing with Regarding locale, it is better not to rely too much on it, as it is not consistently supported across different platforms.
I would rather flip the meanings; it's so easy to write
I think providing
Exactly. That's why it's better to avoid its implicit use for converting strings to local encoding.
int dao_character( uint_t ch )
{
#ifdef BSD
return (ch == '_' || iswalnum(ch) || iswideogram(ch) || iswphonogram(ch) );
#elif defined(LINUX)
return (ch == '_' || iswalnum(ch));
#else
uint_t ch2 = ch;
if( sizeof(wint_t) == 2 && ch > 0xFFFF ) ch = 0; /* for isw*(); */
return (ch == '_' || iswalnum(ch) || dao_cjk(ch2));
#endif
}
const uint_t dao_cjk_charts[][2] =
{
{0x3400, 0x4DBF}, /* Extension A; */
{0x4E00, 0x9FFF}, /* Basic Block; */
{0xF900, 0xFAFF}, /* Extension F; */
{0x20000, 0x2A6DF}, /* Extension B; */
{0x29100, 0x2A6DF}, /* Extension B; */
{0x2A700, 0x2B73F}, /* Extension C; */
{0x2B740, 0x2B81F}, /* Extension D; */
};

Extension B has two overlapping ranges? Smells suspicious. Also it's strange that you included CJK Compatibility Ideographs (0xF900-0xFAFF) as Extension F (which is not implemented yet), omitting CJK Compatibility (0x3300-0x33FF), CJK Compatibility Forms (0xFE30-0xFE4F) and CJK Compatibility Ideographs Supplement (0x2F800-0x2FA1F).
It makes sense to flip the meaning. Or maybe we can use
But it is certain that native terminals always support the local encoding. So when writing to
The size of
My mistake here.
But they may not use that encoding by default. And there may be two local encodings -- one for console, another for everything else, which is how it is on Windows in my case.
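For reference, this is how one might query both of them -- the console code page versus the "ANSI" code page used elsewhere -- falling back to the locale name on POSIX systems (a sketch, assuming the usual platform APIs):

#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <locale.h>
#include <langinfo.h>
#endif

int main( void )
{
#ifdef _WIN32
	/* These two frequently differ, e.g. 866 for the console vs 1251 for the rest. */
	printf( "console output code page: %u\n", GetConsoleOutputCP() );
	printf( "ANSI code page:           %u\n", GetACP() );
#else
	setlocale( LC_ALL, "" );                /* pick up the environment locale */
	printf( "locale codeset: %s\n", nl_langinfo( CODESET ) );
#endif
	return 0;
}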
OK, but that doesn't mean it can't be 4 while
Seriously?
What are those things?
The parameter type of
You think I'm trying to trick you? :) The "normal" local encoding for Cyrillic languages is windows-1251, while the cursed
Of course not:), it was just too surprising to me.
What happens when you write a windows-1251 string to Anyway, now I agree with you that no implicit conversion should be done.
I read about this before. But I simply made a rational assumption to do it the way I did. But given the above example, it seems that rational assumptions cannot be made for certain platforms. So I will make the change you suggested.
I am also considering adding a method in
Yup, garbage. It's always an issue to reckon with, for instance, when making queries via console to a database. It is possible to change the default encoding, but it's nontrivial and I suppose few ever care to do it. And it's the same with PowerShell. Linux terminals and Bash are a blessing compared to what Windows provides.
That would be better. As well as adding such a mode for reading from a stream.
I agree with @Night-walker on the removal of implicit conversion and also on flipping the meaning of The idea of adding a method in
That would be too much. Maybe it could be supported by modules.
That is too basic a task to take out of the core, while leaving an ASCII-aware-only variant may be error-prone. The implementation is trivial anyway, and reserving the same amount of space in the new string ensures that no overhead will appear for ASCII data.
Yes, the implementation is trivial, but I was just not sure if it is worth the effort, as I think case conversion is not important enough to be included in the core. Anyway, it may be better to keep it and improve it in the same way as it is handled in the regex.
I'm not aware of many cases where only ASCII
A lot of text processing tasks (including trivial ones) require normalization of a string to a certain character case. Here it is better to overdo than to underdo.
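For what it's worth, a character-aware lowercasing pass over a byte string can be sketched with the standard C functions, assuming the process locale (set via setlocale) uses UTF-8; no Dao API is implied here:

#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Lowercase src (UTF-8 under a UTF-8 locale) into dst, which must have room
   for at least 4*srclen + 1 bytes. Returns the number of bytes written. */
size_t utf8_tolower( const char *src, size_t srclen, char *dst )
{
	mbstate_t in_st, out_st;
	size_t i = 0, o = 0;
	memset( &in_st, 0, sizeof(in_st) );
	memset( &out_st, 0, sizeof(out_st) );
	while( i < srclen ){
		wchar_t wc;
		size_t m, n = mbrtowc( &wc, src + i, srclen - i, &in_st );
		if( n == (size_t)-1 || n == (size_t)-2 ){  /* invalid or truncated sequence */
			dst[o++] = src[i++];                   /* copy the byte through unchanged */
			memset( &in_st, 0, sizeof(in_st) );
			continue;
		}
		if( n == 0 ) n = 1;  /* embedded '\0' counts as one byte */
		m = wcrtomb( dst + o, (wchar_t) towlower( (wint_t) wc ), &out_st );
		if( m == (size_t)-1 ){                     /* cannot re-encode: keep original */
			memcpy( dst + o, src + i, n );
			m = n;
		}
		i += n;
		o += m;
	}
	dst[o] = '\0';
	return o;
}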
Now
Then I suppose we're done here.
Then let's close it.
I have been considering changing a few things about Dao strings. You know, Dao can store a string either as a Multi-Byte String (MBS) or a Wide Character String (WCS), and supports automatic conversions between them. Currently, when converting MBS to and from WCS, an MBS string is assumed to be encoded with the system encoding.
I am considering changing the assumed encoding to UTF-8; this will not only make conversion between MBS and WCS potentially a lot more efficient, but probably also make Dao programs more portable.
Basically the following will be done: when a string is printed with %s, it will be converted to system-encoding MBS first; when a string is printed with %S (planned), MBS will be printed as it is, and WCS will be converted to UTF-8 MBS for printing.

These should have covered most scenarios that require handling of string encoding. If I missed something, please add.
@Night-walker
I may need your UTF-8 encoder for this:)