Handling of encodings of Dao strings #174
I think the proposed change is a must - I've been thinking about "UTF-8 rules them all", while we already have built-in WCS handling, more than once, but wasn't convinced it was the right time to open this issue :)
String representation is exactly what I have been thinking over recently. You started the issue, so now you'll have to bear with me trying to push in my grand ideas -- you brought it on yourself :) I have a simple idea. Kick out wide strings, entirely. The problems of
Ruby and Go have byte strings only. Rust even guarantees that any properly created string is a valid UTF-8 sequence. Not counting C and C++, I know of only Python to have a mess with Unicode support similar to (and probably caused by) wide strings. It seems uncomfortable not to be able to treat a single string element as a character. But in practice there are very few cases where it can be a problem. And it is quite easy to provide some means to work with a byte string on a character-wise basis:
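For illustration only -- a minimal C sketch (hypothetical helper names, not part of Dao's API) of stepping through a UTF-8 byte string character by character using nothing but the lead-byte patterns:

#include <stddef.h>

/* Byte width of the UTF-8 character starting at the given lead byte;
   returns 1 for ASCII and for invalid bytes so iteration always advances. */
static size_t u8_width( unsigned char lead )
{
	if( lead < 0x80 ) return 1;            /* 0xxxxxxx */
	if( (lead & 0xE0) == 0xC0 ) return 2;  /* 110xxxxx */
	if( (lead & 0xF0) == 0xE0 ) return 3;  /* 1110xxxx */
	if( (lead & 0xF8) == 0xF0 ) return 4;  /* 11110xxx */
	return 1;                              /* stray continuation byte */
}

/* Count the characters in a UTF-8 byte string of known byte length. */
size_t u8_char_count( const char *str, size_t size )
{
	size_t i = 0, count = 0;
	while( i < size ){
		i += u8_width( (unsigned char)str[i] );
		count += 1;
	}
	return count;
}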
Finally, the internal handling of strings in the core and modules will be much simpler and easier to maintain. And the strings will be simpler for the user to reason about as well. P.S.
Just don't forget to park it back in my garage when you're done :)
OK, it seems everyone has an issue with string. So let's do something about it:) Removing WCS completely? This option had been on my radar before, you just reminded me to reconsider it again:) I had thought
Not. It may also make sense to convert the local encoding to UTF-8 automatically when reading from a file. Ideally, the encoding should be bound to the file stream upon its creation, so that all the strings you obtain from it are already in UTF-8. By the way, any local encoding conversion in pure C will require critical sections, since the standard conversion functions depend on the process-wide locale.
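As a side note, POSIX iconv avoids the process-wide locale entirely, since each conversion carries its own handle. A rough sketch, assuming iconv is available, with error handling trimmed:

#include <iconv.h>
#include <stddef.h>

/* Convert a buffer from a named local encoding (e.g. the value reported by
   nl_langinfo(CODESET)) to UTF-8. Returns bytes written, or (size_t)-1 on error. */
size_t local_to_utf8( const char *codeset, const char *in, size_t insize,
                      char *out, size_t outsize )
{
	iconv_t cd = iconv_open( "UTF-8", codeset );
	char *inp = (char*) in;  /* iconv() advances the pointers, not the data */
	char *outp = out;
	size_t inleft = insize, outleft = outsize;
	size_t res;

	if( cd == (iconv_t)-1 ) return (size_t)-1;
	res = iconv( cd, &inp, &inleft, &outp, &outleft );
	iconv_close( cd );
	return res == (size_t)-1 ? (size_t)-1 : outsize - outleft;
}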
The original encoder:

inline char FormU8Trail( uint_t cp, int shift )
{
	return ( ( cp >> 6*shift ) & 0x3F ) + ( 0x2 << 6 ); // 10xxxxxx trail byte
}
static void DaoEnc_EncodeUTF8( DaoProcess *proc, DaoValue *p[], int N )
{
	DString *str = p[0]->xString.data;
	DString *out = DaoProcess_PutMBString( proc, "" );
	if ( str->mbs ){
		DaoProcess_RaiseException( proc, DAO_ERROR, "String already encoded" );
		return;
	}
	for ( daoint i = 0; i < str->size; i++ ){
		wchar_t ch = str->wcs[i];
		uint_t cp;
		if ( sizeof(wchar_t) == 4 ) // utf-32
			cp = (uint_t)ch;
		else { // utf-16
			if ( ch >= 0xD800 && ch <= 0xDBFF ){ // lead surrogate
				if ( i < str->size - 1 && str->wcs[i + 1] >= 0xDC00 && str->wcs[i + 1] <= 0xDFFF ) // trail surrogate
					cp = 0x10000 + ( ( (uint_t)ch - 0xD800 ) << 10 ) + ( (uint_t)str->wcs[i + 1] - 0xDC00 );
				else
					goto Error;
				i++; // the trail surrogate was consumed as well
			}
			else // bmp
				cp = (uint_t)ch;
		}
		if ( cp < 0x80 ) // 0xxxxxxx
			DString_AppendChar( out, (char)cp );
		else if ( cp < 0x800 ){ // 110xxxxx 10xxxxxx
			DString_AppendChar( out, (char)( ( cp >> 6 ) + ( 0x6 << 5 ) ) );
			DString_AppendChar( out, FormU8Trail( cp, 0 ) );
		}
		else if ( cp < 0x10000 ){ // 1110xxxx 10xxxxxx 10xxxxxx
			DString_AppendChar( out, (char)( ( cp >> 12 ) + ( 0xE << 4 ) ) );
			DString_AppendChar( out, FormU8Trail( cp, 1 ) );
			DString_AppendChar( out, FormU8Trail( cp, 0 ) );
		}
		else if ( cp < 0x200000 ){ // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
			DString_AppendChar( out, (char)( ( cp >> 18 ) + ( 0x1E << 3 ) ) );
			DString_AppendChar( out, FormU8Trail( cp, 2 ) );
			DString_AppendChar( out, FormU8Trail( cp, 1 ) );
			DString_AppendChar( out, FormU8Trail( cp, 0 ) );
		}
		else
			goto Error;
	}
	return;
Error:
	DaoProcess_RaiseException( proc, DAO_ERROR, "Invalid code unit found" );
}
A remark: since double-quoted
What I thought about was separation of character strings and byte arrays.
MBS string, which is going to be the only string representation, is actually just a byte array. Nothing will have to be changed in the existing stuff like indexing a string or getting its size, so it will continue to be just an array of bytes. There is nothing to separate then. Means can be added to convert a string to an array of Unicode characters and vice versa, but it is not exactly about encoding.
@Night-walker, thanks for the encoder!
I was actually considering to use
Of course, you can work with strings as easily as with arrays of bytes. A string is essentially an array of bytes, and supports operations that allow it to be used like an array. I don't see why you believe it is not convenient enough.
So for example, if all strings are UTF-8 by default, I suppose
It will be byte indexing, and slicing will still work byte-wise, and string size will be returned in bytes, etc. For working with characters, there will be additional means. However, it's questionable how
Well, if almost everything will stay byte-wise, then the I really like the simplicity of indexing, slicing, for-in etc. and would like to use it also for characters, but it would imply introducing another "codepoint string" type, or switching the default string manipulation from byte-wise to character-wise and introducing a type for byte arrays. Assignment between the "codepoint string" and "byte array" wouldn't be allowed, but casting would behave like
Look at Ruby or Rust strings. Something like There are really very few cases when you have to index multi-byte characters or iterate over them one by one. In fact, I also had doubts about how safe and convenient it is to work with UTF-8 strings. But after inspecting some use cases I concluded that it's a rather exceptional case when a multi-byte character can compromise code correctness or require special handling.
I must disagree as a citizen of a non-ASCII country. I usually use regexps (and any other searching and string handling) with character strings and I'm really disappointed when that's not available (e.g. GNU coreutils, grep, sed, awk...). These days more and more data gets internationalized and the need for out-of-the-box support for such data grows enormously. Therefore I'm not fully satisfied with the OOP-like approach (i.e. having methods for character-wise handling) for treating the majority of our input data (for example, take random data from the web - most of it will be in UTF-8 or another internationalized encoding, not in ASCII any more :( ).
I am just considering another alternative: instead of removing wide character strings, why not use Just tested, removing WCS does not cut down the binary size much, just about 20K (2.4%). Considering that there is still a need to implement character-wise methods, the reduction will be even less. So removing WCS is not as big a win as I hoped. But keeping the option of WCS will certainly make it more convenient to handle non-ASCII texts. (Currently, the Dao help module also relies on WCS, by the way.)
When I talked about ASCII-based parsing, I didn't mean that it is only suitable for ASCII input. I pointed out that in most cases only ASCII characters are essential as "anchors" in string handling. Other characters, regardless of their size, are simply passed over. For instance, an XML parser doesn't care about any byte that is not part of the ASCII markup. Overall, I can hardly imagine a typical case in which one can make an erroneous assumption about character size. Only something like using hard-coded non-ASCII string literals in the source together with their hard-coded sizes, etc.
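What makes this safe in UTF-8 specifically is that every byte of a multi-byte character has its high bit set, so a plain byte comparison against an ASCII delimiter can never match inside one. A tiny illustrative scan (hypothetical helper, not Dao code):

#include <stddef.h>

/* Find the next ASCII delimiter (e.g. '<') in a UTF-8 buffer by byte offset.
   Lead and continuation bytes of multi-byte characters are all >= 0x80,
   so they are simply passed over. Returns -1 if not found. */
long find_ascii_anchor( const char *str, size_t size, char delim )
{
	size_t i;
	for( i = 0; i < size; ++i )
		if( str[i] == delim ) return (long)i;
	return -1;
}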
Maybe I'm missing something, but how does it solve the fact that
If the input for such an XML parser is in UTF-8, then this statement holds. In other encodings it depends (i.e. it's false).
Slicing? Indexing? I'm becoming more and more convinced of the need for
I doubt it will. It is almost always possible to process text in MBS in the same way as in WCS. If you have realistic counter-examples, then just dispel my delusion, for I didn't find any such cases that I would call typical.
I thought about that too -- UTF-16 strings, as in Qt, Java and .Net. But that would only add extra complexity for the internal use of such non-C/C++-compatible strings, while they may turn out to be not as useful as they seem. Note that having wide strings doesn't prevent character-related errors. One can easily make a false assumption about the form of some string at some point, e.g. simply forget to convert it to WCS. Having only Unicode strings would solve this, but that is doubtfully an option for Dao. Relatively new independent languages seem to avoid wide strings, including Ruby, Go and Rust, which are all aimed at various web uses. There is even an apparent tendency towards UTF-8, which can't exist without a reason.
.Net supports wchar_t in all variants (namely 8b, 16b and 32b) IIRC.
Yep and I think my proposal with
It always holds regardless of the encoding, because only ASCII characters constitute XML markup -- and they have identical codes in any sane encoding including UTFs and local 8-bit encodings.
These are not use cases. Those are operations based on one or two indexes which are obtained from somewhere else -- usually from searching or matching. Only hard-coded values could lead to errors, but such a case is rather unlikely even if you search/match a multi-byte string.
But internally all strings there are always 16-bit.
A UTF-8 string is a byte array. It is not only redundant, but also quite inefficient to make index-based string operations character-wise, if that is what you imply. It is simply not a viable solution, as all indexing, slicing, index-based searching and matching would have O(n) worst-case performance.
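To make the cost concrete, here is roughly what a character-aware index lookup would have to do on a UTF-8 byte array -- a linear walk over the prefix (sketch, hypothetical names):

#include <stddef.h>

/* Byte offset of the n-th character of a UTF-8 buffer, or (size_t)-1 if the
   string has fewer than n+1 characters. O(n) in the length of the prefix. */
size_t u8_char_offset( const unsigned char *str, size_t size, size_t n )
{
	size_t i = 0;
	while( i < size ){
		unsigned char b = str[i];
		if( n == 0 ) return i;
		if( b < 0x80 ) i += 1;                 /* ASCII */
		else if( (b & 0xE0) == 0xC0 ) i += 2;
		else if( (b & 0xF0) == 0xE0 ) i += 3;
		else if( (b & 0xF8) == 0xF0 ) i += 4;
		else i += 1;                           /* invalid byte: step over it */
		n -= 1;
	}
	return (size_t)-1;
}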
Character access by index cannot be done efficiently with MBS. But this is probably not a very typical use case.
Actually not much extra complexity.
This may be an issue. Having two forms of strings demands extra attention when dealing with strings passed from somewhere else. At a certain point, a programmer may simply forget to check and handle them properly. After more consideration, it seems the advantage of removing WCS should outweigh the advantage of keeping it.
Exactly: considering e.g. the 10 most used encodings, only the first few (ASCII, UTFs and maybe one or two others) are sane, which is what I wanted to emphasize -- i.e. I have no idea how many encodings are used in China (daokoder, any specifics?), Japan, Arabic countries etc., which play an inherent role in IT. So sanity shouldn't be relied upon, as it's not a factual measure, but rather a hope :)
What are those use-cases? I couldn't come up with anything other than operations with characters or methods (search, match...).
Compared to the solution with methods for each such operation it's even worse - O(n + overhead_of_calling_methods). If we openly say to the programmer "hey, there is a bytearray with all its efficiency" and "there is also a UTF-8 string, but with O(n) operations", he'll decide what to use where and when to convert between them. If a UTF-8 string is treated like a byte array, we provide him with "half-integrated" support for character strings (part of the usage would be with operators and the rest with methods). We'll allow him to modify UTF-8 anywhere on a byte-wise basis, which will lead to flawed UTF-8 strings, etc.
I can clarify it for you :) There is ASCII, there are local 8-bit encodings, there is UTF-7/8/16(BE/LE)/32, and there are a few local variable-width encodings. But I don't know of anything which is not backward-compatible with ASCII. By saying "sane" I mainly wanted to exclude some 50+ year-old standards which could predate or compete with ASCII.
"Operations with characters" is not a task. What's the goal? What should be accomplished and why this particular way? That's what I would call a use-case.
There is no practical reason why a string should use character-aware indexes for basic operations. It can just use byte indexes, retaining both the simplicity and efficiency.
You can put garbage into a string in virtually any language. And here UTF-8 is actually beneficial in that such an act can be detected in time. So it's another point added to the UTF-8 score :)
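A sketch of such detection (structural validation only; overlong forms and surrogate code points are not rejected here, to keep it short):

#include <stddef.h>

/* Returns 1 if the buffer is structurally valid UTF-8: every lead byte is
   followed by the required number of 10xxxxxx continuation bytes. */
int utf8_valid( const unsigned char *str, size_t size )
{
	size_t i = 0;
	while( i < size ){
		unsigned char b = str[i];
		size_t k, extra;
		if( b < 0x80 ) extra = 0;
		else if( (b & 0xE0) == 0xC0 ) extra = 1;
		else if( (b & 0xF0) == 0xE0 ) extra = 2;
		else if( (b & 0xF8) == 0xF0 ) extra = 3;
		else return 0;                           /* stray continuation or 0xF8..0xFF */
		if( i + 1 + extra > size ) return 0;     /* truncated sequence at the end */
		for( k = 1; k <= extra; ++k )
			if( (str[i + k] & 0xC0) != 0x80 ) return 0;
		i += 1 + extra;
	}
	return 1;
}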
Ok, if you're so sure, I'll call you when I get into trouble with encodings in Dao sometime in the future :) Btw looking at http://en.wikipedia.org/wiki/GB_18030 (and the corresponding "See also" paragraph) makes me sure I'll contact you very soon :)
Find all words containing ď starting from the fiftieth character. Count characters in a word. Find all names starting with Žď. Print all non-ASCII characters from a string. Etc. Basically one really needs character-wise handling nearly everywhere. Conversely, I can't think of any use case where I'm interested in the underlying representation rather than its meaning.
For efficiency there is a bytearray or any other similar vector structure with fixed-size elements (e.g. DataFrame). I really don't want to use a UTF-8 string neither like
Such detection is inherently O(n) => it is not feasible to do it before (or during) each string operation.
Starting from the 50th character? Not realistic. Maybe starting from some position, but not from the Nth character.
OK, that's fair. But it's a matter of calling something like
If you don't make the assumption that 'Žď' occupies 2 bytes, i.e. you don't use hard-coded literals with their hard-coded sizes, there is no problem. Even with a low-level approach:

pat = 'Žď'
str = '<some text>'
pos = 0
while (pos = str.find(pat, pos), pos > 0){
io.writeln(str[pos : pos + %pat - 1])
pos += %pat
}
Virtually the same as above:

pat = '[^ABCDEF...]'
str = '<some text>'
pos = 0
while (match = str.match(pat, pos), match != none){
io.writeln(str[match.start : match.end])
pos = match.end + 1
}
But it doesn't mean that one has to use some special, explicitly character-wise handling everywhere. It works fine without it.
You won't have to do that because it's absolutely meaningless.
But there is no need to do it before every string operation. A character-wise operation will inevitably detect it, obviously with zero overhead. For a byte-wise operation it doesn't make sense to care about characters. Let me be a smart-ass a bit, if I may. I spent hours considering all possible variants of revising strings in Dao. I inspected other languages with regard to string handling (particularly UTF-8), read various discussions, manifestos, cries of pain, documentation and historical notes regarding encodings, text representations, multilingual support and whatever else. UTF-8 is not my favorite string representation -- I tend to like UTF-16 more. But having only UTF-8/byte strings in Dao is:
If you think one can't live without wide strings, look at Ruby. It has only byte strings. Default operations work byte-wise. There are additional methods to (rarely) work with characters in an explicit way. Ruby has existed for two decades, it works, a lot of people are using it, and it's used extensively on the Web. There is no real problem in handling non-ASCII text via byte strings. At least I don't see anything proving the opposite.
I have been thinking about an approach to support fast access to characters by index; it is O(1) in the best cases and O(n) in the worst. This approach attaches an auxiliary array to the string when it is accessed by char index, and the array stays there as long as the string is not modified. This array stores pairs of numbers (maybe as short ints), where the first indicates the width (in bytes) of a char, and the second stores the number of continuous occurrences of chars of that width. In the best cases, there are only a few such pairs, so the byte location of a char at a certain index can be computed efficiently. In typical cases, there may be more, but there shouldn't be many, so computing the indices should still be a lot faster than O(n).

The downside is that, even if a string is never accessed by char indices, each string will still require at least 12 bytes more space (on a 64-bit machine, 8 bytes on 32-bit): two short int fields and one pointer field. Another downside is that, in order to use short int fields, it may only be able to support strings of a couple of tens of thousands of characters in the worst cases. But worst-case scenarios are usually not a big concern.

Of course, another obvious approach is to pre-index all the chars before char access by index. This way a single access is guaranteed to be O(1). But it clearly takes a lot more space than the above approach, and it is not feasible to store it along with the string. And if the pre-indexing were to be done each time, it would be just too expensive. When accessing all the chars in a single loop this would be the preferred approach, but the common scenario for that is to access all chars from the first to the last sequentially, in which case there is no need for any kind of auxiliary array. So my approach should be much more preferable for general cases. My only concern is the extra 12 bytes of space for each string, actually only 4 bytes more with respect to the current string data structure, maybe not a big deal.
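A rough sketch of the auxiliary array described above, with hypothetical names: pairs of (character byte width, count of consecutive characters of that width), built in one pass and consulted to turn a character index into a byte offset:

#include <stddef.h>

/* One run of consecutive characters that share the same byte width. */
typedef struct {
	unsigned short width;  /* bytes per character in this run */
	unsigned short count;  /* how many consecutive characters have that width */
} CharRun;

/* Build the index in one pass over a UTF-8 buffer, widths taken from lead bytes.
   Returns the number of runs written (at most maxruns). */
size_t build_char_runs( const unsigned char *str, size_t size,
                        CharRun *runs, size_t maxruns )
{
	size_t i = 0, nruns = 0;
	while( i < size && nruns < maxruns ){
		unsigned char b = str[i];
		unsigned short w = b < 0x80 ? 1 : (b & 0xE0) == 0xC0 ? 2
		                 : (b & 0xF0) == 0xE0 ? 3 : (b & 0xF8) == 0xF0 ? 4 : 1;
		if( nruns > 0 && runs[nruns-1].width == w && runs[nruns-1].count < 0xFFFF ){
			runs[nruns-1].count += 1;
		}else{
			runs[nruns].width = w;
			runs[nruns].count = 1;
			nruns += 1;
		}
		i += w;
	}
	return nruns;
}

/* Byte offset of the n-th character: O(number of runs), which is just a few
   entries for text where character widths rarely change. */
size_t char_index_to_byte( const CharRun *runs, size_t nruns, size_t n )
{
	size_t offset = 0, r;
	for( r = 0; r < nruns; ++r ){
		if( n < runs[r].count ) return offset + n * runs[r].width;
		offset += (size_t) runs[r].count * runs[r].width;
		n -= runs[r].count;
	}
	return offset;  /* index past the end: offset of the string end */
}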
I wouldn't worry that much about accessing individual characters. It just doesn't happen as "I want the Nth char", which is what I have been trying to show. I specifically spent time trying to find a case where one may actually need to jump over N characters forward or backward, in order to prove that byte strings are inconvenient and dangerous, which is what I believed. I didn't find any realistic, typical case. It is surprising, but almost all real-world tasks on byte strings cannot be compromised by multi-byte characters. You still shouldn't be careless about such strings, but that applies to dual (MBS/WCS) strings as well. Maybe even more so, as the only way to ensure that you have a wide string in Dao is to check it manually for every routine parameter and returned value -- which is considerably more cumbersome and error-prone than simply knowing that all strings are always MBS and not making any assumptions about character size.
Whether to worry or not is not the issue, I am just trying to consider possible options. BTW, the base overhead in my approach is actually 8 bytes, so the string structure would have the same size as now.
Well, if you intend to leave all basic string operations byte-wise, then there is nothing to argue about. It is of course feasible to provide a character cache, as long as it is used only when the user explicitly calls something like
On Linux, I guess. But not on Mac OSX. I do seem to remember it worked before, but I don't remember if it was only on Linux.
You forget that terminals do not always support UTF-8. So when printing out a string, conversion may be necessary; otherwise you won't be able to display some strings properly with a simple
Naturally, there is
Printing to a terminal is an edge use case, although it's still the same as writing to a file. It's better to know exactly what encoding will be assumed rather than trying to guess how it will behave depending on what you have in a string. All these possible implicit changes are quite error-prone and hard to diagnose.
For
Which APIs are you talking about?
Those functions aren't exported, so it doesn't actually matter. I meant
Right, they are just internal functions, and are not meant to be exported.
OK, let me sum it up regarding conversion into the local encoding when writing to a stream.
Isn't putting
Don't worry about this, I already changed it. Now conversion to the local encoding (assuming UTF-8, without guessing) is done only in the following cases:
In all other situations, there is no implicit conversion. If one wants to avoid conversion even in these cases, there is the option of using formatted writing with Regarding locale, it is better not to rely too much on it, as it is not consistently supported across different platforms.
I would rather flip the meanings; it's so easy to write
I think providing
Exactly. That's why it's better to avoid its implicit use for converting strings to local encoding.
int dao_character( uint_t ch )
{
#ifdef BSD
return (ch == '_' || iswalnum(ch) || iswideogram(ch) || iswphonogram(ch) );
#elif defined(LINUX)
return (ch == '_' || iswalnum(ch));
#else
uint_t ch2 = ch;
if( sizeof(wint_t) == 2 && ch > 0xFFFF ) ch = 0; /* for isw*(); */
return (ch == '_' || iswalnum(ch) || dao_cjk(ch2));
#endif
}
const uint_t dao_cjk_charts[][2] =
{
{0x3400, 0x4DBF}, /* Extension A; */
{0x4E00, 0x9FFF}, /* Basic Block; */
{0xF900, 0xFAFF}, /* Extension F; */
{0x20000, 0x2A6DF}, /* Extension B; */
{0x29100, 0x2A6DF}, /* Extension B; */
{0x2A700, 0x2B73F}, /* Extension C; */
{0x2B740, 0x2B81F}, /* Extension D; */
};

Extension B has two overlapping ranges? Smells suspicious. Also it's strange that you included CJK Compatibility Ideographs (0xF900-0xFAFF) as Extension F (which is not implemented yet), omitting CJK Compatibility (0x3300-0x33FF), CJK Compatibility Forms (0xFE30-0xFE4F) and CJK Compatibility Ideographs Supplement (0x2F800-0x2FA1F).
It makes sense to flip the meaning. Or maybe we can use
But it is certain that native terminals always support the local encoding. So when writing to
The size of
My mistake here.
But they may not use that encoding by default. And there may be two local encodings -- one for console, another for everything else, which is how it is on Windows in my case.
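For reference, this is how one might query both of them -- the console code page versus the "ANSI" code page used elsewhere -- falling back to the locale name on POSIX systems (a sketch, assuming the usual platform APIs):

#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <locale.h>
#include <langinfo.h>
#endif

int main( void )
{
#ifdef _WIN32
	/* These two frequently differ, e.g. 866 for the console vs 1251 for the rest. */
	printf( "console output code page: %u\n", GetConsoleOutputCP() );
	printf( "ANSI code page:           %u\n", GetACP() );
#else
	setlocale( LC_ALL, "" );                /* pick up the environment locale */
	printf( "locale codeset: %s\n", nl_langinfo( CODESET ) );
#endif
	return 0;
}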
OK, but that doesn't mean it can't be 4 while
Seriously?
What are those things?
The parameter type of
You think I'm trying to trick you? :) The "normal" local encoding for Cyrillic languages is windows-1251, while the cursed
Of course not:), it was just too surprising to me.
What happens when you write a windows-1251 string to Anyway, now I agree with you that no implicit conversion should be done.
I read about this before. But I simply made a rational assumption to do it the way I did. But given the above example, it seems that rational assumptions cannot be made for certain platforms. So I will make the change you suggested.
I am also considering adding a method in
Yup, garbage. It's always an issue to reckon with, for instance, when making queries via console to a database. It is possible to change the default encoding, but it's nontrivial and I suppose few ever care to do it. And it's the same with PowerShell. Linux terminals and Bash are a blessing compared to what Windows provides.
That would be better. As well as adding such a mode for reading from a stream.
I agree with @Night-walker on the removal of implicit conversion and also on flipping the meaning of The idea of adding a method in
That would be too much. Maybe it could be supported by modules.
That is too basic a task to take out of the core, while leaving an ASCII-aware-only variant may be error-prone. The implementation is trivial anyway, and reserving the same amount of space in the new string ensures that no overhead will appear for ASCII data.
Yes, the implementation is trivial, but I was just not sure if it is worth the effort, as I think case conversion is not important enough to be included in the core. Anyway, it may be better to keep it and improve it in the same way as it is handled in the regex.
I'm not aware of many cases where only ASCII
A lot of text processing tasks (including trivial ones) require normalization of a string to a certain character case. Here it is better to overdo than to underdo.
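For what it's worth, a character-aware lowercasing pass over a byte string can be sketched with the standard C functions, assuming the process locale (set via setlocale) uses UTF-8; no Dao API is implied here:

#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Lowercase src (UTF-8 under a UTF-8 locale) into dst, which must have room
   for at least 4*srclen + 1 bytes. Returns the number of bytes written. */
size_t utf8_tolower( const char *src, size_t srclen, char *dst )
{
	mbstate_t in_st, out_st;
	size_t i = 0, o = 0;
	memset( &in_st, 0, sizeof(in_st) );
	memset( &out_st, 0, sizeof(out_st) );
	while( i < srclen ){
		wchar_t wc;
		size_t m, n = mbrtowc( &wc, src + i, srclen - i, &in_st );
		if( n == (size_t)-1 || n == (size_t)-2 ){  /* invalid or truncated sequence */
			dst[o++] = src[i++];                   /* copy the byte through unchanged */
			memset( &in_st, 0, sizeof(in_st) );
			continue;
		}
		if( n == 0 ) n = 1;  /* embedded '\0' counts as one byte */
		m = wcrtomb( dst + o, (wchar_t) towlower( (wint_t) wc ), &out_st );
		if( m == (size_t)-1 ){                     /* cannot re-encode: keep original */
			memcpy( dst + o, src + i, n );
			m = n;
		}
		i += n;
		o += m;
	}
	dst[o] = '\0';
	return o;
}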
Now
Then I suppose we're done here.
Then let's close it.
I have been considering changing a few things about Dao strings. You know, Dao can store a string either as a Multi-Byte String (MBS) or a Wide Character String (WCS), and supports automatic conversions between them. Currently, when converting MBS to and from WCS, an MBS string is assumed to be encoded with the system encoding.
I am considering changing the assumed encoding to UTF-8; this will not only make conversion between MBS and WCS potentially a lot more efficient, but probably also make Dao programs more portable.
Basically the following will be done: when a string is printed with %s, it will be converted to system-encoding MBS first; when a string is printed with %S (planned), MBS will be printed as it is, and WCS will be converted to UTF-8 MBS for printing.

These should have covered most scenarios that require handling of string encoding. If I missed something, please add.
@Night-walker
I may need your UTF-8 encoder for this:)