Compact String implementation #7083

Open
mattwarren opened this Issue Sep 7, 2016 · 66 comments

@mattwarren
Contributor
mattwarren commented Sep 7, 2016 edited

Background

I came across this idea after seeing that Java has implemented it, see JEP 254 for the full details, plus this InfoQ article for some extra info.

I'm aware of the CoreFX Labs effort to implement a UTF-8 string library, but this proposal doesn't require using a different library; it changes the implementation of all strings in the CoreCLR (a bit ambitious, I know!!)

Note: this optimisation works because many strings in web-based applications can actually be encoded as ASCII or ISO-8859-1 (Latin-1), for instance URLs, HTTP headers, etc. So even internationalised applications can still see some savings.
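As a rough illustration of the idea (Python; `narrowest_encoding` is a hypothetical helper for this sketch, not a proposed API), classifying a string by the narrowest encoding its characters fit in:

```python
def narrowest_encoding(s: str) -> str:
    """Return the narrowest encoding this string's characters fit in."""
    if all(ord(c) < 0x80 for c in s):
        return "ASCII"      # 1 byte/char; the bytes are also valid UTF-8 as-is
    if all(ord(c) < 0x100 for c in s):
        return "Latin-1"    # 1 byte/char (ISO-8859-1)
    return "UTF-16"         # needs the current 2-byte representation

# Typical web-facing strings compress; CJK free text does not.
print(narrowest_encoding("GET /index.html HTTP/1.1"))  # ASCII
print(narrowest_encoding("naïve"))                     # Latin-1
print(narrowest_encoding("日本語"))                     # UTF-16
```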

Implementation

One way of implementing this is to change the internal string layout to the following, with an extra m_type field:

public sealed class String : IComparable, ICloneable, IConvertible, IEnumerable
    , IComparable<String>, IEnumerable<char>, IEquatable<String>
{
    [NonSerialized]private int  m_stringLength;
    [NonSerialized]private byte m_type;
    [NonSerialized]private char m_firstChar;
    ....

(Assuming that adding a field like this has no padding implications?)

Another approach is to have an abstract String base class and then use either a UTF16 or Latin1 implementation via virtual dispatch, but this may have performance implications.

Then the (pseudo) code for the equals method becomes:

public bool Equals(string other) 
{
    if (this.type != other.type)
       return false;
    if (type == LATIN1)
        return StringLatin1.Equals(this, other);
    else 
        return StringUTF16.Equals(this, other);
}

Obviously I know this is not a small undertaking: it would require changes to the String and StringBuilder classes, plus several others. The JIT or intrinsics would also have to be involved to ensure that all the methods that require an extra if (type == LATIN1) check remain as fast as possible (Length, IndexOf, etc.).

Advantages

  1. Less memory usage overall (as per @davidfowl: "At the top of every ASP.NET profile… strings!")
  2. Strings become more cache-friendly, which may give better performance

Disadvantages

  1. Breaks COM or P/Invoke interop that relies on the current string format
  2. If very few strings in the application are ASCII or ISO-8859-1 (Latin-1) this will have an overhead for no gain.
  3. Probably something else I haven't thought about ;-)

Research

This feature is not worth doing if it doesn't actually save space for common types of .NET applications. So that this can be measured I wrote HeapStringAnalyser, which makes use of CLR MD (see Analysing .NET Memory Dumps with CLR MD for more info). @nickcraver from Stack Overflow was kind enough to run my tool on one of their memory dumps, and it gave the following output:

...

Overall 30,703,367 "System.String" objects take up 4,320,235,704 bytes (4,120.10 MB)
Of this underlying byte arrays (as Unicode) take up 3,521,948,162 bytes (3,358.79 MB)
Remaining data (object headers, other fields, etc) is 798,287,542 bytes (761.31 MB), at 26 bytes per object

Actual Encoding that the "System.String" could be stored as (with corresponding data size)
    3,347,868,352 bytes are ASCII
        5,078,902 bytes are ISO-8859-1 (Latin-1)
      169,000,908 bytes are Unicode (UTF-16)
Total: 3,521,948,162 bytes (expected: 3,521,948,162)

Compression Summary:
    1,676,473,627 bytes Compressed (to ISO-8859-1 (Latin-1))
      169,000,908 bytes Uncompressed (as Unicode/UTF-16)
       30,703,367 bytes EXTRA to enable compression (one byte field, per "System.String" object)
Total: 1,876,177,902 bytes, compared to 3,521,948,162 before compression

In this case the String data could be compressed from ~3,358 MB down to ~1,789 MB, a pretty impressive saving! But this test would have to be repeated on a far wider range of apps to see if the memory savings are repeated.
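As a sanity check, the summary's numbers can be reproduced from the encoding breakdown above: halve the ASCII and Latin-1 byte counts (2 bytes/char down to 1) and add one type byte per string:

```python
# Figures taken from the heap-dump output above.
ascii_bytes  = 3_347_868_352   # UTF-16 bytes whose content fits ASCII
latin1_bytes = 5_078_902       # UTF-16 bytes whose content fits Latin-1
utf16_bytes  = 169_000_908     # bytes that genuinely need UTF-16
num_strings  = 30_703_367      # one extra type byte per String object

compressed  = (ascii_bytes + latin1_bytes) // 2
total_after = compressed + utf16_bytes + num_strings
print(compressed)    # 1676473627
print(total_after)   # 1876177902, vs 3521948162 before compression
```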

Update: there's a really thorough write-up by @evincarofautumn, who has actually implemented this suggestion against the Mono code-base.

@benaadams
Contributor
benaadams commented Sep 7, 2016 edited

HTTP header field parsing was updated to ASCII rather than ISO-8859-1 in RFC 7230, 3.2.4 Field Parsing.

Latin-1 conflicts with UTF-8, whereas ASCII is compatible (so you can cross-detect between the two).

@benaadams
Contributor

Interestingly the current string format does already have an IsAscii flag on it

Though it considers ' and - to not be ASCII 😦

However, changing the string to be 8-bit rather than 16-bit based on this flag would likely break fixed statements on strings without special handling (and passing to P/Invoke, etc.)

@akoeplinger
Member
akoeplinger commented Sep 7, 2016 edited

ASCII Strings is something that @evincarofautumn experimented with a lot for Mono: the proposal is here http://www.mono-project.com/docs/advanced/runtime/docs/ascii-strings/ (and a few more details in the mono-devel list thread about it)

I'm sure he can go into great details about the pros/cons 😄

@mattwarren
Contributor

However changing the string to be 8-bit rather than 16-bit based on this flag would likely break fixed statements on strings without special handling (and passing to pinvoke etc)

Hmm, I didn't think about compatibility with code that uses fixed against strings; yeah, that's a problem. Looks like the Mono experiment solved it by disabling fixed on strings.

@benaadams
Contributor
benaadams commented Sep 7, 2016 edited

Looks like the mono experiment solved it by disabling fixed on strings.

Could always stackalloc length * 2 bytes at fixed start, then copy and widen with _mm_unpacklo_epi8 & _mm_unpackhi_epi8, then compress back at fixed close. Would be slow though...

Or allow both fixed byte* and fixed char*

@fanoI
fanoI commented Sep 8, 2016 edited

Why use ASCII instead of Latin-1 for the 1-byte version? Latin-1 would permit using one byte for a lot of the accented characters of Western Europe, such as 'è', 'ò', 'ù'...

Could it make sense to have a third type at this point, UTF-32, for strings containing characters that cannot be represented in UCS-2 (Chinese characters, Klingon, emoticons, ...), rather than resorting to the kludge of surrogate pairs (which are not really supported in .NET [1])? It probably would not achieve much compression in this case, but the implementation could be more correct.

[1] De facto, .NET strings are really UCS-2, not UTF-16.

@benaadams
Contributor
benaadams commented Sep 8, 2016 edited

Why use ASCII instead of Latin 1

ASCII and UTF-8 are compatible; Latin-1 and UTF-8 are not, so they require conversions between the two.

UTF-8 is just as compact for the non-accented characters and is currently the general interchange format, so using Latin-1 would add conversion load, and allocation, to and from UTF-8; whereas ASCII requires no conversion.

Also, using Latin-1 in places where only ASCII is allowed (HTTP headers etc.) adds a validation step to and from. There are fast ASCII algorithms that don't apply to Latin-1, e.g. case-insensitive compare (ignore a single bit), upper and lower casing (flip a bit), etc.

e.g. if a string is known to be ASCII (1 byte per char) it requires no validation steps and its bytes can be used directly; conversion to UTF-8 is instant, as its bytes are already also valid UTF-8.
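Those bit tricks can be sketched scalar-style (Python for illustration; real implementations apply the mask a word or vector at a time, and the letter-range check is what keeps digits and punctuation untouched):

```python
def ascii_upper(data: bytes) -> bytes:
    # ASCII upper/lower case differ only in bit 0x20; clear it for a-z.
    return bytes(b & ~0x20 if 0x61 <= b <= 0x7A else b for b in data)

def ascii_iequal(a: bytes, b: bytes) -> bool:
    # Case-insensitive compare by normalising the case bit first.
    return ascii_upper(a) == ascii_upper(b)

print(ascii_upper(b"content-length"))   # b'CONTENT-LENGTH'
print(ascii_iequal(b"Host", b"hOST"))   # True
```

None of this works for Latin-1's accented range, which needs a translation table instead.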

@migueldeicaza
Member

The reason Mono has disabled fixed on strings is purely to help us locate the code paths that use fixed in as many places as possible and upgrade the code.

The idea is to perform a copy of the data on demand if necessary when a "fixed" that has not been upgraded is found, preserving compatibility at the expense of a string copy.

@migueldeicaza
Member

I once also tried to make the case that .NET was UCS-2 and not UTF-16 in front of the whole ECMA committee, only to find out that I was wrong :-)

http://www.unicode.org/faq/utf_bom.html#utf16-11

@HansOlavS

@benaadams, I live in Europe and we use Latin-1. Consider the cultural impact of using Latin-1 from my European perspective. For the US, ASCII might seem like a slam dunk, but if you think about it with European eyes it's not.

Could a compiler switch be a good middle ground? Use ASCII as the default, but open up for Latin-1...

@benaadams
Contributor
benaadams commented Sep 8, 2016 edited

@HansOlavS I don't think Latin-1 would be great; I'm not thinking of it from a US perspective (I live in Europe also); more from high throughput interchange formats.

  • String is the correct format for Windows p/invoke, platform internationalisation and UI display.
  • ASCII is the correct format for a lot of protocols with their limited command set and dislike of getting involved with network byte order and endianness (and a straight subset of String with char narrowing).
  • UTF8 is now the general standard for most text; both a direct superset of ASCII and endianness free, as well as having a straight forward vector route to UTF16.

Latin-1 (ISO-8859-1) is also a superset of ASCII, but requires conversion to and from String, UTF-8 and Windows native. Only 6% of webpages use ISO-8859-1. It also doesn't actually cover all European languages; fuller coverage needs the 16 ISO-8859 variants.

Relatedly, you need to be using ISO-8859-7 (Latin/Greek), ISO-8859-15 (Latin-9) or ISO-8859-16 (fantasy mix) to include the euro symbol €, as it's not included in Latin-1.
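This is easy to verify with any codec library, e.g. Python's built-in codecs:

```python
# € (U+20AC) has no code point at all in Latin-1 (ISO-8859-1):
try:
    "€".encode("latin-1")
    print("representable")
except UnicodeEncodeError:
    print("not representable in Latin-1")

# ISO-8859-15 (Latin-9) repurposes 0xA4 (Latin-1's ¤) for the euro:
print("€".encode("iso8859-15"))  # b'\xa4'
```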

The Latin formats don't have the quick processing properties of ASCII (e.g. case-insensitive compare, uppercasing, lowercasing) and require character-translation lookups; so at best they become a slightly more compressed format, but with translation penalties.

It might be worth having a string8byte opaque intermediate format that supports string operations such as IndexOf, Split, etc., and then requires translation to String, UTF-8 and ASCII (for subset validation).

However, I'm not sure that provides a great advantage over using Span<byte> or byte[] and the Encoding classes?

@HansOlavS

Disclaimer: this is way out of my league... ;)

I get that ASCII aligns neatly with a lot of low-level protocols (like HTTP), but my concern was memory footprint and its reduction when loading up my .NET app, which contains a lot of Latin-1-compatible strings. It would be super-nice to have those stored as 8-bit chars in memory instead of the 16 bits .NET now uses.

If you're familiar with Delphi (Anders Hejlsberg's predecessor to C#), it had 'string' (which was an ANSI string, depending on your computer's localization settings or whatever) and 'widestring', which was Unicode (not sure of the actual encoding).

Actually, my very first reaction to C# and .NET (reading a C# book back in 2002) was that all strings were "Unicode" and 16-bit. I thought that was a strange choice regarding memory consumption.

@HansOlavS

And regarding special cases like €, I don't care as that is an edge-case and I'm happy to have it stored in UTF16.

@migueldeicaza
Member

ASCII has other benefits: when marshalling out to UTF-8 it requires no processing. It is a direct copy.

@migueldeicaza migueldeicaza reopened this Sep 8, 2016
@migueldeicaza
Member

Anything that uses the top bit would prevent this optimization from working. That was one of the reasons for Mono's work to use ASCII.

@HansOlavS

I get the ASCII-to-UTF-8 compatibility and marshalling and P/Invoke and such. I guess I'm still a bit uneasy with .NET's decision to have all strings in memory as 16-bit UTF-16, and I miss the old days in Delphi where I could have an ANSI (Latin-1) string or a widestring and make a conscious decision on where to use what. But maybe this concern is totally beside the point in this thread and I misunderstood.

@fanoI
fanoI commented Sep 8, 2016

For Latin1 I intended the "proper" implementation done by Microsoft: https://en.wikipedia.org/wiki/Windows-1252.

This issue is about compressing strings, and wasting half of a byte's range to be compatible with ASCII does not make sense to me. To check whether a Latin-1 character is valid ASCII, could you not simply check whether its byte is < 128?

Apart from Java, Python has done a similar thing to the one I was thinking of:
http://legacy.python.org/dev/peps/pep-0393/

characters that are not representable in UCS-2 would use UCS-4 (4 bytes).

If we were interested in being compatible with UTF-8 we could use UTF-8 directly, but it is not the most compact representation: most CJK characters become 3 bytes (accented characters such as 'è' become 2), and the most complex become 4 bytes long.

I was pretty sure that characters between 128 and 255 were the same between Latin-1 and UTF-16. Am I wrong?

@zabulus
zabulus commented Sep 8, 2016 edited

Why don't use separate type?

@evincarofautumn

@HansOlavS It would be possible to use Latin-1 as the compact encoding. Scanning would be just as fast because the second Unicode block (U+0080–U+00FF) corresponds to the upper range of Latin-1, as @fanol rightly pointed out. We could also do some clever hacks with the range U+0080–U+009F, which contains rarely used control characters; for example, encoding € as U+0091 (“Private Use 1”).

Some ASCII-only optimisations would no longer work. Marshalling would require an extra scan per string. And since Latin-1 has incomplete coverage for many languages, it might not make much of a difference; we should measure real-world apps.

I don’t like that this optimisation privileges some languages over others, but frankly it’s because the character encodings are already biased.

@zabulus All the existing APIs are using String, so if we change the representation of String then existing code will benefit without modification, as long as it’s not using unsafe/fixed. If we add a new type, only new code will be able to take advantage of the optimisation.

@mattwarren
Contributor
mattwarren commented Sep 8, 2016 edited

@zabulus

Wh(y) don't use separate type?

Sure you can use a separate type, that's what the UTF-8 string library is trying to achieve.

However this proposal was to see if something more wide-ranging and more deeply embedded in the CLR would be possible. With the side-effect that you don't have to make any code-changes to use it.

@zabulus
zabulus commented Sep 9, 2016 edited

@zabulus All the existing APIs are using String, so if we change the representation of String then existing code will benefit without modification, as long as it’s not using unsafe/fixed. If we add a new type, only new code will be able to take advantage of the optimisation.

I see one more concern about such strings, besides unsafe/fixed: marshalling them with P/Invoke, especially when passing the new strings to the W(ide) versions of Win32 APIs.

@fanoI
fanoI commented Sep 9, 2016 edited

As Latin-1 is a subset of UTF-16, the marshalling code, if it has to convert to a C/C++ wstring, would simply widen each byte by adding a 0 byte after it (for little-endian UTF-16); one could probably make this faster using SSE intrinsics.
If I'm not wrong, the algorithm that works for the ASCII part of a UTF-8 string would also work to convert a Latin-1 string to UTF-16:
https://woboq.com/blog/utf-8-processing-using-simd.html

It is obvious that in some cases this implementation will have a performance cost; it depends on whether it is more important to use less memory or to be faster...
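The byte-doubling can be modelled directly (Python as a sketch; a punpcklbw/punpckhbw pair against a zero register produces the same interleaving 16 bytes at a time):

```python
def widen_latin1_to_utf16le(data: bytes) -> bytes:
    # Follow each Latin-1 byte with a zero high byte (little-endian UTF-16).
    out = bytearray(len(data) * 2)
    out[0::2] = data          # even positions: the Latin-1 bytes
    return bytes(out)         # odd positions stay 0x00

s = "déjà vu"
assert widen_latin1_to_utf16le(s.encode("latin-1")) == s.encode("utf-16-le")
```

This works because every Latin-1 code point equals its byte value, so zero-extension to 16 bits is exactly UTF-16.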

@benaadams
Contributor
benaadams commented Sep 9, 2016 edited

For Latin1 I intended the "proper" implementation done by Microsoft: https://en.wikipedia.org/wiki/Windows-1252.

I was pretty sure that characters between 128 and 255 were the same between Latin1 and UTF16 I am wrong?

Characters 128 - 159 differ between the Windows-1252 variant and UTF16

To check if a Latin1 string is a valid ASCII character you could not simple check if its byte is < 127

Then additionally, in the fast-path algorithms you need to do a validation check per character, which makes them less of a fast path... (you can't use a straight int, long, or vector bitwise OR)
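Both points are quickly verified with Python's codecs (cp1252 is the Windows-1252 variant):

```python
# ISO-8859-1 maps bytes 0-255 one-to-one onto code points U+0000-U+00FF,
# so zero-extension is a valid conversion for every byte:
assert bytes(range(256)).decode("latin-1") == "".join(map(chr, range(256)))

# Windows-1252 instead remaps most of 0x80-0x9F to printable characters:
print(b"\x80".decode("cp1252"))   # '€', not U+0080
```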

@mattwarren
Contributor

Sorry, I guess I didn't actually ask any questions when I proposed this. I guess I'm wondering:

  1. Has something like this been considered before, are there fundamental reasons why it can't be or won't be done in the CLR?
  2. If not, is this something that would be considered in the future or are things like System.Text.Utf8 the recommended solution?
@evincarofautumn

@fanol Right, marshalling to UTF-16 would still be cheap. You could do it with ~2 instructions for each 8 bytes of input (movq + punpcklbw). UTF-8 would be slightly more involved.

@mattwarren Around the time we had a discussion on mono-devel-list about my proposal, some MS people (@vancem) indicated that they'd thought about doing this in the past. The objections were mainly that it's a large, invasive change (we don't want to break stability & backward compatibility) and that the get_OffsetToStringData API used internally by fixed is less than ideal.

@mattwarren
Contributor
mattwarren commented Sep 9, 2016 edited

Around the time we had a discussion on mono-devel-list about my proposal, some MS people (@vancem) indicated that they'd thought about doing this in the past. The objections were mainly that it's a large, invasive change (we don't want to break stability & backward compatibility) and that the get_OffsetToStringData API used internally by fixed is less than ideal.

@evincarofautumn that's interesting to know, thanks

@benaadams
Contributor

Right, marshalling to UTF-16 would still be cheap. You could do it with ~2 instructions for each 8 bytes of input (movq + punpcklbw).

Though not for the Windows-1252 variant due to the conflict in characters 128 - 159

@fanoI
fanoI commented Sep 10, 2016

OK, the Microsoft version of Latin-1, while more clever in its use of the only 256 available characters, is not suitable because it is not a subset of UTF-16, while the real Latin-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-1) is, so that is what should be used. Too bad for '€' and the other characters in the 128-159 block; strings containing them will not be compressible.

@vancem
Member
vancem commented Sep 12, 2016

To answer @mattwarren's question on whether changing the internal representation of a string has been considered before, the short answer is YES. In fact it has been a pet desire of mine for probably over a decade now. What was clear then, and has held true for quite some time, is that:

  • Typical apps have 20% of their GC heap as strings.
  • Most of the 16-bit characters have 0 in their upper byte. Thus you can save 10% of typical heaps by encoding in various ways that eliminate these pointless upper bytes.

Now the first issue is that while this saves GC heap size, it is not guaranteed to be a SPEED improvement. Let's just assume that the new representation is equally good at everything, but half the size. For things that fit in cache even at the bigger size, you probably will get very little speed improvement. We know from work comparing 32- vs 64-bit applications that the GC heap expands by 1.6X when you go to 64 bit (pointers are bigger), and that results in a 10-15% performance loss. From that we can estimate that changing the representation of strings is likely to save 10% of the GC heap and thus about 1.5 to 2.5% throughput. This is significant, but not dramatic. Keep in mind this assumes that all operations are equal.

Now it turns out that all operations are NOT equal, and thankfully the advantage is with the NEW layout (if done well) most of the time. We know that the most common operations on strings include:

  • Comparison
  • Copy/Substring
  • IndexOf
  • Hashing

The good news is that all of these operations are SCANS, and thus if the encoding is SHORTER they are FASTER (fewer iterations). This bodes well.

What are the disadvantages? As you might suspect, compatibility is the big issue. There are two basic problems:

  1. Interop - on the boundary from managed to native code (where native code needs UTF-16), a copy will need to be made. In general this is not a huge deal, as interop has its own overheads and we already have logic for doing this copy if needed. However we do expect that there will be some scenarios where interop with string marshalling is important. We are effectively trading off this scenario (which I believe is OK for the general case, but there WILL be some set of less-than-happy customers who lose out).
  2. The C# fixed operator on strings. This is a thornier issue. There are two aspects: first correctness, and second performance. You can try to 'solve' the correctness issue by making a copy of the string when you do a 'fixed'. Obviously this makes perf bad (and typically the whole point of fixed is perf) but at least it is correct. It turns out that this is not as easy as you would like, because we don't actually have an 'operation' that takes a string and returns its first character. Instead, we have a declared offset that IL compilers use to do the arithmetic themselves. Because IL code generators may generate this code in many ways, this makes it potentially hard (in the limit, impossible) to figure out when to make the copy.

So the real thorny issue is 'fixed'. I believe the semantics of it are solvable (going forward you introduce a real operator, and for compat you simply recognise certain patterns that are emitted by our existing compilers and fail on the others).

However, this only solves the correctness issue. The perf issue is still there. Most likely you solve this by introducing a string iterator and asking people to move to it, and/or by using JIT optimisation to make some patterns simply work well (e.g. foreach over the string using integer indexes).

Because this is not perfect, we are likely to allow some compat flag where you can ask to keep things the old way (UTF16) for those who have performance issues because of it.

This probably requires a bit of an audit of where 'fixed' is being used and determining just how bad the proposed solution above is, so we can determine what the guidance will be for replacing 'fixed' in the important cases at least.

But you can see the broad outline above: changing strings will shrink most GC heaps by about 10% and is likely to yield a 1.5-2.5% throughput improvement, maybe more (where the code is heavy in string operations that benefit from the new format), but there will definitely be people who are impacted (those using fixed, and those with hot interop paths passing strings). The main challenge is dealing with fixed, but there is also frankly at least a few man-months of simply dealing with the places where we took a dependency on the layout of string (in the runtime, interop, things like StringBuilder, and all the uses of 'fixed' in CoreFX).

Thus it IS doable, but it is at least moderately expensive (man-months), and the payoff is non-trivial but not huge.

Given that, I think the next steps would be to do a good inventory of where 'fixed' is used and design the mitigations (will we make a string iterator class? exactly what compiler magic will we have? ...). Having microbenchmarks for the important cases of 'fixed' (before and after optimisation), as well as a microbenchmark for interop, would also be useful.

Finally, I note that all of this just lets us CHOOSE what representation we want. We actually have a LOT of flexibility on exactly what the representation is (it does not have to be UTF-8 or ASCII with a type code; I am actually partial to a fixed Huffman compression chosen at runtime, which would also work well on Chinese). However, let's set that aside for now. We can start with one of the suggestions above and change to something else later with basically no impact.

@whoisj
whoisj commented Sep 12, 2016 edited

I would love, love, love it if string were synonymous with IString<T> where T : byte or char or uint or similar, and not String; that would allow custom implementations of IString such as StringUtf8, StringAscii, StringUtf16, etc. Of course, use of P/Invoke would require the signature to use the correct implementation of IString.

I for one would love to see utf 8 everywhere.

@whoisj
whoisj commented Sep 12, 2016

The C# Fixed operator on strings

Um, developers should not assume that characters are the size of a char anyway, as UTF-16 supports characters made of multiple chars (surrogate pairs). In fact, I'd say changing to the suggested model would help developers avoid a whole class of bugs due to very poor assumptions.

As for unsafe operations, the underlying array should remain accessible regardless of the impl, otherwise we will be throwing an immense amount of performance optimization out with the bath water.

@benaadams
Contributor

Given that, I think the next steps would be to do a good inventory of where 'fixed' is used and design the mitigations

Using wider types (long* on x64, int* on x86) and Vector<ushort> over strings; these would also perform faster if the string were a more compact type.

Um, developers should not assume that characters are the size of a char anyways, as UTF-16 supports multiple char characters.

Normally, validating to ASCII is one step, with a fallback to using the Encoding namespace (which accepts char*) when not ASCII. Equally, there is pressure to keep ASCII as byte[] for as long as possible for performance; but then there is no interop with any function that accepts strings, so the type must be widened at some point and converted to string.

An example of widening with validation is the unrolled AsciiUtilities.TryGetAsciiString in Kestrel's internals.
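The shape of such a validate-and-widen routine, sketched in Python (the real Kestrel code is unrolled over wider integer types rather than checking per byte):

```python
def try_get_ascii_string(data: bytes):
    """Return the widened string only if every byte is ASCII, else None."""
    # Fast implementations OR whole words together and test the high bits
    # once per word; a per-byte check keeps this sketch simple.
    if any(b & 0x80 for b in data):
        return None
    return data.decode("ascii")

assert try_get_ascii_string(b"GET / HTTP/1.1") == "GET / HTTP/1.1"
assert try_get_ascii_string("héllo".encode("utf-8")) is None
```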

@mattwarren
Contributor

@vancem thanks for taking the time to answer my query so comprehensively! There's a lot of trade-offs I hadn't even considered, particularly around the impact on fixed code.

It's nice to know that someone has been thinking about this and that it may, someday, make its way into the CLR.

but there is also frankly at least a few man-months of simply dealing with the places in the runtime where we took a dependency on the layout of string (in the runtime, interop, and things like stringbuilder, and all the uses of 'fixed' in corefx).

Yeah, I can imagine. For fun I modified a version of the CoreCLR source so that String had the following layout:

public sealed class String 
{
    private int  m_stringLength;
    private byte m_type;
    private char m_firstChar;
    ....

I struggled to get a "hello world" app to run; I got stuck fixing StringBuilder. The main problem is that when String is broken, nothing works: you don't even get meaningful error messages, because they rely on String and/or StringBuilder working!

@evincarofautumn

A systemic change like this has to be done very incrementally to retain any ability to debug the inevitable failures. It’s a difficult and disruptive feature to implement. But, if I might be so bold, it’s our responsibility as systems programmers and programming language implementors to do exactly this kind of hard work for the sake of others. I can’t name an industry in which an improvement of 10% on a highly valued metric would be considered marginal.

@jkotas
Member
jkotas commented Sep 14, 2016

A systemic change like this has to be done very incrementally

Absolutely agree. A very good first step would be to design and implement an alternative expansion of the fixed statement for strings that is compatible with this scheme.

@vancem
Member
vancem commented Sep 14, 2016

Yes, the first steps are to fix the string abstraction so that it is 'complete' (that is, only a modest amount of code depends on its layout). That includes a scheme for dealing with 'fixed' on strings and fixing up runtime/class-library code to use the new abstractions. Note that this work does NOT break anything and can be done incrementally. Once that abstraction is in place, we are in a much better position to actually change the internal format.

@benaadams
Contributor

A slight wrinkle with the fixed expansion: if an 8-bit string was changed to a 16-bit string while it was fixed (by writing to the high bytes), it would need reallocating as a 16-bit string when leaving the fixed statement.

@vancem
Member
vancem commented Sep 14, 2016

My expectation of how 'fixed' would work is that it unconditionally copies the string data into an array of char and points at that. The original data is never modified. You are not allowed to modify string data using the pointer a 'fixed' statement gives you (thankfully).

@benaadams
Contributor
benaadams commented Sep 14, 2016 edited

You are not allowed to modify string data in using the pointer a 'fixed' statement gives you (thankfully).

You can currently; it's a pointer to the data with no copy, else fixed would be very slow (doesn't mean you should though).

Obviously trouble lies that way: if the string has ever been used, its flags, such as the lazily evaluated isAscii, would go out of sync; and if it has been passed to anything, or is an embedded/interned string, it will all go to hell...

@vancem
Member
vancem commented Sep 15, 2016

While, as you say, preventing the mutation of strings through 'fixed' is not enforced, it is illegal (large parts of the system, including the security system, assume that strings are read-only). If we found anyone doing this (within or outside Microsoft), we would tell them to fix their code.

There is also a limit to what we will do for compatibility's sake. Fixing legitimate uses of a relatively rare but useful feature is one thing; making broken code work is another.

We should not spend effort trying to make the string-mutation case work. The problem is bad enough without it, and we will always have the 'compat hammer' of allowing users with evil code to use the old string layout (globally).

@benaadams
Contributor

we will always have the 'compat hammer' of allowing users with evil code to use the old string layout (globally).

Seems fair 😄 And if you are doing it for speed you will likely change your code to work with the new way. Probably would want a string with a byte* ctor for stackalloc.

@mattwarren mattwarren added a commit to mattwarren/mattwarren.github.io that referenced this issue Sep 19, 2016
@mattwarren mattwarren Initial version of "Compact String in the CLR" (i.e. dotnet/coreclr#7083
)
c67ad3c
@paulomorgado

Will there be explicit conversions between types in the new string? string.AsAscii()? string.AsUtf8()? string.AsUtf16()?

Is there a way to know what the type the string is? string.IsAscii()? string.IsUtf8()? string.IsUtf16()?

I don't think char can be one or two bytes depending on where it came from. Will there be a string.ToByteArray()?

@svick
Contributor
svick commented Sep 21, 2016

@paulomorgado I think the idea here is to change the implementation details of string, not the public interface. I don't see why you would need any of those methods (except maybe for debugging, to ensure some specific string is compact).

And char is always two bytes, so for example, the indexer would be implemented something like this pseudocode:

public char this[int index]
{
    get
    {
        if (type == LATIN1)
            return (char)bytePointer[index];
        else 
            return charPointer[index];
    }
}

Converting string to a byte array can already be done using Encoding.ASCII.GetBytes(). I imagine that method would be optimized to just create a copy of the underlying byte array, when applicable.

@paulomorgado

@svick, I understood that and I think that's a good idea because it, as usual, will benefit already built applications.

However, if the change is made, future code might rely on knowing the difference.

Web APIs will almost exclusively use UTF-8 (and ASCII for some parts). On the other hand, Windows APIs will use UTF-16. What about mixed applications like Windows applications that use web APIs?

Let's take a feed reader Windows application as an example. All communications with the feed providers will use UTF-8, but all presentation will use UTF-16. Instead of relying on the runtime/framework to do the conversion, the Windows application developer might decide to handle all strings as UTF-16 (because that's what Windows uses) while the developer of the web API might choose to use UTF-8 strings (because that's what the web uses).

Is this such an odd or unusual scenario?

@vancem
Member
vancem commented Sep 21, 2016

At least initially we would completely hide the internal details of the representation from the client of string (thus it will not know how the string is encoded). This proposal, at least initially, would be to NOT have multiple representations but to uniformly use UTF8 internally. Now maybe we would take that next step (of having multiple possible representations), but first things first: first we make it so that we don't HAVE to be UTF16.

@ufcpp referenced this issue in ufcpp/UfcppSample Sep 22, 2016 (open): "Character encoding issues" (文字コードがらみ) #91

@fanoI
fanoI commented Sep 23, 2016 edited

No, the thread proposal was not to use UTF-8, which would need 3 bytes for a lot of characters that UTF-16 represents with 2, but to use ASCII or Latin-1 (1 byte) when possible, and UTF-16 (2 bytes) for everything else.

If not, what do we achieve by using UTF-8?

  1. A lot of trivial strings will occupy more memory than they do now
  2. Operator [] is never a simple operation
  3. Conversion to Char (which will remain UTF-16) will never be trivial
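
The size argument is easy to check with the existing Encoding APIs; for example, CJK characters take 2 bytes in UTF-16 but 3 in UTF-8, while Latin-1 text fits in 1 byte per character:

```csharp
using System;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        // "日本語": three CJK characters, each U+0800..U+FFFF
        Console.WriteLine(Encoding.Unicode.GetByteCount("日本語")); // 6 (2 bytes each in UTF-16)
        Console.WriteLine(Encoding.UTF8.GetByteCount("日本語"));    // 9 (3 bytes each in UTF-8)

        // "café": Latin-1 text is 1 byte per char in ISO-8859-1,
        // but 'é' (U+00E9) needs 2 bytes in UTF-8
        Console.WriteLine(Encoding.GetEncoding("ISO-8859-1").GetByteCount("café")); // 4
        Console.WriteLine(Encoding.UTF8.GetByteCount("café")); // 5
    }
}
```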
@jamesqo
Contributor
jamesqo commented Sep 24, 2016

@vancem You mention:

My expectation of how 'fixed' works is that it unconditionally copies the string data into an array of char and points it at that.

I thought the whole point of the fixed statement was better performance, e.g. eliding range checks in hot paths. Copying the entire string to a new char array each time would seem to go against that.

@jamesqo
Contributor
jamesqo commented Sep 24, 2016

I know I'm a little bit late to the party, but here are my thoughts as someone who has written code for String:

  1. This would slow down existing codepaths that convert from other UTF-16 types to String. For example, the string(char[]) constructor would get a lot slower.
  2. It would require reworking most or all of the Encoding implementations, and double the amount of code for String. Subtle bugs could show up in the UTF-16 implementation of methods, and they would be harder to spot.
  3. Even if there are memory savings from introducing this change, it may still slow down CPU-bound code that processes strings because of an additional switch on m_type. This may also prevent String methods from being inlined, or if they are it will introduce more jitted code bloat.

I generally agree, though, that the memory savings we can gain from UTF-8 strings are desirable. But perhaps we should take a look in the other direction:

  • Make UTF-8 strings easier to use. We can add built-in compiler support so people can just write
var utf8 = u8"Hello, world!";

to get the memory benefits.

  • Make Utf8String a built-in type in mscorlib, so people don't have to reference an external library.

Apps that benefit significantly from using UTF-8 strings in string-processing code can do so; the rest of the time, they can stick to UTF-16.

@ig-sinicyn

@jamesqo

+1. And one more point. Corefxlab has a low-memory solution for dynamic string construction: System.Text.Formatting.
Note that this one covers not only memory consumption for the final strings but also allocations and overall performance of operations over strings (which is much more important for a typical web site/service app).

I don't think there's a reason to keep both.

@svick
Contributor
svick commented Sep 24, 2016

@jamesqo I'm not sure using Utf8String would help much:

Regarding the number of conversions:

  • If you have char[], then you have to convert whether you're using compact string or Utf8String.
  • If you're using a library that takes a string, then you have to convert if you're using Utf8String, but not if you're using compact string.

As for the amount of code, a lot of that additional code would still be required even with Utf8String, just maybe placed somewhere else.

@jamesqo
Contributor
jamesqo commented Sep 24, 2016

If you have char[], then you have to convert whether you're using compact string or Utf8String.

Yes; I was saying that if you have a char[] today that you want to convert to a string, or vice versa, you don't have to perform any conversions between encodings. The proposed change would make existing code that calls string(char[]), ToCharArray, CopyTo, etc. much slower, since we could no longer just memcpy the characters.
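
To make that cost concrete, here is a hypothetical sketch (IsLatin1 is illustrative, not an actual CoreCLR method) of the extra pass a compact-aware string(char[]) constructor would need before it could even pick a representation:

```csharp
// Illustrative only: the representation check a compact-string
// constructor would have to run on every char[] input.
static bool IsLatin1(char[] chars)
{
    // A full pass over the input just to choose an encoding;
    // today the constructor is essentially a single memcpy.
    foreach (char c in chars)
        if (c > 0xFF)
            return false;
    return true;
}
```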

@bbowyersmyth
Contributor

If new string('\0', length) continued to return a UTF-16 structured string then that would handle most of the fixed modifications in the system, and probably elsewhere. That doesn't help with ptr increments needing to be byte instead of ushort on compact strings though.

Compact strings are not a clear win, due to the expense of determining whether you can create one or not. If dollars are being spent on string, I would vote to put them all on UTF-8 rather than the pursuit of better ASCII; it solves the memory issue as well as serialization.

@benaadams
Contributor

The proposed change would make existing code that calls string(char[]) ...

Just create a UTF-16 string for that and use current paths

@whoisj
whoisj commented Sep 26, 2016

I thought the whole point of the fixed statement was better performance, e.g. eliding range checks in hot paths. Copying the entire string to a new char array each time would seem to go against that.

Yes please. If C# started performing a memcpy every time I needed a pointer to a string, I'd be forced back into the "use native for strings" model; and while I do love my C, I find strings to be something C handles rather poorly (compared to a memory-managed language like C#).

As somebody who is forced into reading and writing UTF8 constantly for compat with Linux binaries, I use UTF8 A LOT: I really love 😍 the idea that C# could treat arbitrarily encoded strings as first-class citizens. I'm also a huge believer in UTF8 everywhere, if for no other reason than that a frighteningly large proportion of developers believe that indexing a char[] is akin to indexing the characters of a UTF16-encoded string (hint: it isn't).

Lastly, there are several important scenarios where having a fixed (char* ptr = mystring) is necessary, for both read and write because performance is critical. Let's just consider directory diving for a second:

Allocating a rather largish string up front (say string buffer = new string('\0', 4096); yes, yes, the Windows shell is limited to 255 useful chars), passing the pointer up and down the stack, modifying the value as necessary, and then just using it as a buffer to pass to p/invoke is far preferable to allocating thousands of temporary string objects and absolutely abusing the heap, the garbage collector, and my sense of right vs. wrong.
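
The pattern being described looks roughly like this (GetCurrentDirectoryW is just a convenient Win32 example; writing through the pinned pointer relies on owning the string, and is exactly the kind of unsafe code that a representation change would break):

```csharp
using System;
using System.Runtime.InteropServices;

static unsafe class PinnedBuffer
{
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
    static extern uint GetCurrentDirectoryW(uint nBufferLength, char* lpBuffer);

    static void Main()
    {
        // One up-front allocation, reused as a scratch buffer, instead of
        // thousands of temporary string objects.
        string buffer = new string('\0', 4096);
        fixed (char* p = buffer)
        {
            // The native call writes UTF-16 chars directly into the string's
            // storage; this only works while fixed hands out the real backing
            // memory rather than a copy.
            uint len = GetCurrentDirectoryW(4096, p);
            Console.WriteLine(new string(p, 0, (int)len));
        }
    }
}
```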

Is there someplace I can officially register my opinion that "fixed should never copy"?

@whoisj
whoisj commented Sep 26, 2016 edited

@ig-sinicyn

I don't think there's a reason to keep both.

Well, there's the whole Windows compatibility thing. NetFx, on Windows, calls into the Win32 API a lot and having to convert to and from wchar_t values at every call site would get prohibitively expensive. Ideally, the developer would know which strings are expected to interact with externals which require wchar_t and have access to the necessary formats internally, thus minimizing the number of encoding operations performed.

@ig-sinicyn

@whoisj sorry, my previous comment was definitely unclear.

My point is, if there are good old utf16 strings and there is a specialized library for high-throughput utf8 string construction, why do we need ascii strings at all?

@jamesqo
Contributor
jamesqo commented Sep 26, 2016

@benaadams CopyTo, ToCharArray would still suffer for ASCII data. Also I don't know how many strings are created from char* or char[], but I'm guessing that just using UTF-16 for all of those would significantly diminish the memory savings from this change.

@whoisj
whoisj commented Sep 27, 2016

why do we need ascii strings at all?

Because ASCII is not encoded beyond "byte value equals character". While UTF8 can, and does, correctly encode the compatible ASCII characters, it also encodes beyond them, and there are occasions when you need to know that the ASCII value is simply that: the ASCII value.

Also, many consoles still rely on ANSI VT100 strings for colorized output, and having an Encoding.ASCII that performs no value checking or assertions would allow for efficient VT100 usage.
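
For example, VT100 escape sequences are pure 7-bit ASCII, so an unchecked ASCII path would make emitting them a straight 1:1 narrowing (sketch using the existing Encoding.ASCII):

```csharp
using System;
using System.Text;

class Vt100
{
    static void Main()
    {
        // "\x1b[31m" / "\x1b[0m" are the VT100 codes for "red" and "reset";
        // every char is below 0x80, so there is nothing to validate or expand.
        string colored = "\x1b[31merror\x1b[0m";
        byte[] bytes = Encoding.ASCII.GetBytes(colored);
        Console.WriteLine(bytes.Length == colored.Length); // True: one byte per char
    }
}
```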

@whoisj
whoisj commented Sep 27, 2016

@jamesqo I completely agree. However, string.ToCharArray isn't very safe, as there are many code points outside the 16-bit range. You should never assume that a char is a character. Honestly, char should likely be a 24-bit type (Unicode is a 20-bit standard, after all).

@evincarofautumn

To clarify some points here:

  • Strings are immutable; if you’re writing into a string, you’re invoking undefined behavior.
  • The Unicode range needs (almost) 21 bits (0–10FFFF), but even a fixed-width encoding like UTF-32 (where 1 code point = 1 code unit) doesn’t make indexing magically correct, because code points aren’t characters. “UTF-24” would be deathly slow due to misalignment.
  • VT100 codes are representable in regular ASCII. (I’m not sure if you meant to imply they’re not.)
  • This is not a very expensive optimisation; you pay a small CPU cost to compress and decompress strings as needed, and you get a fairly large memory savings. And presumably it would be opt-in.
  • My implementation in Mono doesn’t cause fixed to perform a copy—but I had to go and manually patch a bunch of code to either be encoding-aware or use a new encoding-agnostic API. Copying is the only safe option for managing the transition. This is charity, basically; we could just say “unsafe code is unsafe, end of story” and break people’s code freely, but we try not to.
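
The "code points aren't characters" point can be demonstrated with today's string, no compact strings involved:

```csharp
using System;

class CodePoints
{
    static void Main()
    {
        string s = "😍";               // U+1F60D, outside the 16-bit range
        Console.WriteLine(s.Length);   // 2: two UTF-16 code units (a surrogate pair)
        Console.WriteLine(char.IsSurrogatePair(s[0], s[1]));              // True
        Console.WriteLine(char.ConvertToUtf32(s[0], s[1]).ToString("X")); // 1F60D
    }
}
```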
@whoisj
whoisj commented Sep 28, 2016

Strings are immutable; if you’re writing into a string, you’re invoking undefined behavior.

If you're writing to a string you "do not own" then I agree; otherwise it should be considered an unsafe necessary evil. I regularly have to do byte* utf8str = stackalloc byte[strlen]; or similar currently.

“UTF-24” would be deathly slow due to misalignment.

Yes, yes it would. Hence the existence of UTF32, which is a significant memory waster. 😉

VT100 codes are representable in regular ASCII. (I’m not sure if you meant to imply they’re not.)

I did not.

This is not a very expensive optimisation; you pay a small CPU cost to compress and decompress strings as needed, and you get a fairly large memory savings. And presumably it would be opt-in.

I do not disagree; as I've stated, I'm all for UTF8 everywhere. However, I'm still of the belief that 16 bits is insufficient for a char, and that having constant-time index operators on string practically invites errors from many developers.

Copying is the only safe option for managing the transition. This is charity, basically; we could just say “unsafe code is unsafe, end of story” and break people’s code freely, but we try not to.

Sure, under the assumption that C# is for part-time developers or developers who care nothing about performance; but then why bother even considering UTF8 strings? Personally, I believe C# needs to get more focused on performance, or at least offer better avenues for developers to write performance-critical code in C#.

Making an operator like fixed suddenly perform a memcpy of an arbitrarily large allocation seems like a trap, not a feature. We should always consider the edge cases here (the 2 GiB string and the single-char string) when deciding language features.

@benaadams
Contributor
benaadams commented Sep 28, 2016 edited

Currently strings are \0-terminated. If that were the same for byte[] strings, then you could do fixed byte* or fixed char* for both: the terminator would round the size up to the nearest char if the byte length was odd, and would be an ignored overflow (as now) if it was even, so both would be safe (though you would have to know the type when indexing on the pointer, to know what Length means).

@richardtallent
richardtallent commented Oct 13, 2016 edited

I'm not such a fan of mucking around in the internals of String, and would be more inclined to support a separate type where compact strings are desirable, with easy conversion back and forth and a compatible interface (e.g., create a common IString interface).

For the internal implementation of this new type, I would suggest breaking up the string value into runs of 8-bit and 16-bit substrings and storing them in an array of structs, where each struct contains a flag (to determine the number of bytes per character) and a byte array. E.g.:

struct CompactSubstring {
    bool IsTwoByte;   // false: 1 byte per character; true: 2 bytes per character
    byte[] Value;     // raw character data for this run
}

public class CompactString : IString {
    CompactSubstring[] substrings;  // consecutive runs that make up the string
}

(Twists on the implementation could involve using a linked list of substrings, or using a String for the 2-byte substrings--lots of pros and cons, I think the above approach would be fastest to copy in memory.)

This would be nearly as compact as UTF-8 for 7-bit ASCII, and potentially more compact than UTF-8 for strings with characters > 127. It would avoid issues that UTF-8 presents with common methods like Substring, while not being limited strictly to ASCII values.

Operations could be optimized for the 1-byte substrings, improving performance over String for the same values. Some operations could even be faster: for example, if you call IndexOf() for a high-order character, it could skip past all 1-byte substrings during the search.

The implementation could establish a minimum number of characters before it creates a 1-byte substring, which would avoid potential higher overhead when a string flips back and forth between low and high characters.
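
A minimal sketch of the run-splitting this describes (all names are illustrative, not part of any proposed API):

```csharp
using System;
using System.Collections.Generic;

static class RunSplitter
{
    // Split a string into maximal runs of 1-byte (<= 0xFF) and 2-byte
    // characters: the building blocks of the CompactString sketched above.
    public static List<(bool IsTwoByte, string Run)> SplitRuns(string s)
    {
        var runs = new List<(bool, string)>();
        int start = 0;
        for (int i = 1; i <= s.Length; i++)
        {
            // A run ends at the end of the string, or where the
            // 1-byte/2-byte classification changes.
            if (i == s.Length || (s[i] > 0xFF) != (s[start] > 0xFF))
            {
                runs.Add((s[start] > 0xFF, s.Substring(start, i - start)));
                start = i;
            }
        }
        return runs;
    }
}
```

For example, SplitRuns("abc日本def") yields three runs: a 1-byte "abc", a 2-byte "日本", and a 1-byte "def".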

@whoisj
whoisj commented Oct 13, 2016

breaking up the string value into runs of 8-bit and 16-bit substrings

And breaking with every standard convention on text encoding established pretty much everywhere? How about interop with native stacks, which expect null-terminated runs of encoded bits?

Personally, I'd love to see C# support 4 kinds of strings: ascii, utf-8, utf-16, and utf-32 strings for maximum compatibility with pretty much every native stack that exists. Everything else is just having fun with structures to be different.

@richardtallent

@whoisj, as I saw it, the original question was related to how strings are stored in memory (especially related to caching), not so much about how they are passed around.

Whether we use a traditional encoding (as you're suggesting) or a data structure (as I'm suggesting), translation to and from a normal UCS-2 representation would be needed for most native calls, would it not? (I don't deal with a lot of interop, so I plead ignorance on how string values are usually passed to native libraries on various platforms.)

I think the structured approach could result in faster conversions than converting traditional encodings, because the converter would be able to work with regions of characters (i.e., each of a known width) rather than having to convert one character at a time (as would be required with, say, UTF-8).

What I'm suggesting would also be a new, fully-optional type that wouldn't require rewiring the dark innards of System.String.

@svick
Contributor
svick commented Oct 14, 2016

@richardtallent

What I'm suggesting would also be a new, fully-optional type that wouldn't require rewiring the dark innards of System.String.

See #7083 (comment) for why changing string instead of creating a new type might be preferable.
