Proposal: Slicing #120

Closed
stephentoub opened this Issue Jan 28, 2015 · 98 comments

Member

stephentoub commented Jan 28, 2015

(Note: this proposal was briefly discussed in #98, the C# design notes for Jan 21, 2015. It has not been updated based on the discussion that's already occurred on that thread.)

Background

Arrays are extremely prevalent in C# code, as they are in most programming languages, and it’s very common to hand arrays around from one method to another.

Problem

However, it’s also very common to want to share only a portion of an array. This is typically achieved either by copying that portion out into its own array, or by passing around the array along with range indicators for which portion of the array is intended to be used. The former can lead to inefficiencies due to unnecessary copies of non-trivial amounts of data, and the latter can lead both to more complicated code and to a lack of trust that the intended subset is the only subset that’s actually going to be used.

Solution: Slice<T>

To address this common need, .NET and C# should support "slices." A slice, represented by the Slice<T> value type, is a subset of an array or other contiguous region of memory, including both unmanaged memory and other slices. The act of creating such a slice is referred to as "slicing," and beyond the support on the Slice<T> type itself, C# would include syntax for declaring slices, slicing off pieces of arrays or other slices, and reading from and writing to them.

An array is represented using array brackets:

int[] array = …;

Similarly, a slice would be represented using square brackets that contain a colon between them:

int[:] slice = …; // same as "Slice<int> slice = ..."

The presence of the colon maps to the syntax for creating slices, which would use an inclusive 'from' index before the colon and an exclusive 'to' index after the colon to indicate the range that should be sliced (omission of either index would simply imply the start of the array or the end of the array, respectively, and omission of both would mean the entire array):

int[] primes = new int[] { 2, 3, 5, 7, 11, 13, 17 };
int item = primes[1];   // Regular array access, producing the value 3
int[:] a = primes[0:3]; // A slice with elements {2, 3, 5}
int[:] b = primes[1:2]; // A slice with elements {3}
int[:] c = primes[:5];  // A slice with elements {2, 3, 5, 7, 11}
int[:] d = primes[2:];  // A slice with elements {5, 7, 11, 13, 17}
int[:] e = primes[:];   // A slice with elements {2, 3, 5, 7, 11, 13, 17}
int[:] f = a[1:2];      // A slice with elements {3}

Arrays could also be implicitly converted to slices (via an implicit conversion operator on the slice type), with the resulting slice representing the entire array, as if both 'from' and 'to' indices had been omitted from the slicing operation:

int[:] g = primes[:];   // A slice with elements {2, 3, 5, 7, 11, 13, 17}
int[:] h = primes;      // A slice with elements {2, 3, 5, 7, 11, 13, 17}
int[:] i = h[:];        // A slice with elements {2, 3, 5, 7, 11, 13, 17}
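
For comparison, the inclusive 'from' / exclusive 'to' arithmetic can be approximated today with the existing ArraySegment<T> type, which stores an offset and a count. The sketch below is only an illustration: the Slice helper and its parameter names are hypothetical and not part of this proposal, and the sample data is made up.

```csharp
using System;

// Hypothetical helper: maps the half-open range [fromIndex, toIndex)
// onto ArraySegment<T>'s (offset, count) pair.
ArraySegment<T> Slice<T>(T[] array, int fromIndex, int toIndex)
{
    if (fromIndex < 0 || toIndex < fromIndex || toIndex > array.Length)
        throw new ArgumentOutOfRangeException();
    return new ArraySegment<T>(array, fromIndex, toIndex - fromIndex);
}

int[] values = { 10, 20, 30, 40, 50 };
ArraySegment<int> a = Slice(values, 0, 3);             // like values[0:3]
ArraySegment<int> d = Slice(values, 2, values.Length); // like values[2:]
ArraySegment<int> e = Slice(values, 0, values.Length); // like values[:]
Console.WriteLine($"{a.Offset},{a.Count} {d.Offset},{d.Count} {e.Offset},{e.Count}");
```

An omitted 'from' corresponds to 0 and an omitted 'to' corresponds to the array length, so the element count is always 'to' minus 'from'.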

Slices could also be used in a manner similar to arrays, reading from and writing to them via indexing:

int[:] somePrimes = primes[1:3];  // A slice with elements { 3, 5 }

Debug.Assert(primes is Array);            // true
Debug.Assert(somePrimes is Slice<int>);   // true

Debug.Assert(somePrimes.Length == 2);     // true
Debug.Assert(somePrimes[0] == primes[1]); // true
Debug.Assert(somePrimes[1] == primes[2]); // true
somePrimes[0] = 17;
Debug.Assert(primes[1] == 17);            // true

As demonstrated in this code example, slicing wouldn’t make a copy of the original data; rather, it would simply create an alias for a particular region of the larger range. This allows for efficient referencing and handing around of a sub-portion of an array without necessitating inefficient copying of data. However, if a copy is required, the ToArray method of Slice<T> could be used to forcibly introduce such a copy, which could then be stored as either an array or as a slice (since arrays implicitly convert to slices):

int[:] aliased = primes[1:3];        // Alias of a portion of the original array
int[:] copied  = primes[1:3].Copy(); // Copy  of a portion of the original array
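
The difference between aliasing and copying can be observed with types that already exist. This is a rough sketch using ArraySegment<T> as a stand-in for the proposed slice and Array.Copy as the explicit copy; it is not how Slice<T> would be implemented.

```csharp
using System;
using System.Collections.Generic;

int[] source = { 1, 2, 3, 4, 5 };

// Alias: the segment records (array, offset, count); no elements are copied.
var aliased = new ArraySegment<int>(source, 1, 2);

// Copy: a fresh array holding the same two elements.
int[] copied = new int[2];
Array.Copy(source, 1, copied, 0, 2);

source[1] = 99; // write through the original array

int seenViaAlias = ((IList<int>)aliased)[0]; // the alias observes the write
int seenViaCopy = copied[0];                 // the copy keeps the old value
Console.WriteLine($"{seenViaAlias} {seenViaCopy}");
```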

This gives developers the flexibility as to whether they want the recipient of the slice to be working with the original array or not, minimizing unnecessary copies and ensuring that only the appropriate areas of the larger region are used (by design, there would be no way through the public surface area of Slice<T> nor through the C# language syntax to get back from a slice to the larger entity from which it was sliced).

As creating slices would be very efficient, methods that would otherwise be defined to take an array, an offset, and a count can then be defined to just take a slice.
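
As an illustration of the signature change, here is a sketch that contrasts the two shapes, again with ArraySegment<int> standing in for the proposed int[:]. The method names SumRange and SumSegment are made up for this example.

```csharp
using System;

// Classic shape: the whole array plus a range the callee is trusted to respect.
int SumRange(int[] array, int offset, int count)
{
    int total = 0;
    for (int i = offset; i < offset + count; i++) total += array[i];
    return total;
}

// Slice-like shape: the range travels together with the data.
int SumSegment(ArraySegment<int> segment)
{
    int total = 0;
    foreach (int value in segment) total += value;
    return total;
}

int[] data = { 4, 8, 15, 16, 23, 42 };
int viaRange = SumRange(data, 2, 3);                            // 15 + 16 + 23
int viaSegment = SumSegment(new ArraySegment<int>(data, 2, 3)); // same elements
Console.WriteLine($"{viaRange} {viaSegment}");
```

Note that ArraySegment<T> exposes its Array property, so unlike the proposed Slice<T> it does not prevent the callee from reaching the rest of the array.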

Solution: ReadOnlySlice<T>

In addition to Slice<T>, the .NET Framework could also include a ReadOnlySlice<T> type, which would be almost identical to Slice<T> except that it would not provide any way of writing to the slice. A Slice<T> would be implicitly convertible to a ReadOnlySlice<T>, but not the other way around.

As with slicing an array, creation of a ReadOnlySlice<T> wouldn’t copy data, but rather would create a read-only alias to the original data; this means that while you couldn’t change the contents of a ReadOnlySlice<T> through it, if you had a writable reference to the underlying data, you could still manipulate it:

int[] primes = new int[] { 2, 3, 5, 7, 11, 13, 17 };
int[:] a = primes[1:3];     // A slice with elements {3, 5}
ReadOnlySlice<int> b = a;   // A read-only slice with elements {3, 5}
Debug.Assert(a[0] == 3);    // true
Debug.Assert(b[0] == 3);    // true
b[0] = 42;                  // Error: no set accessor available
a[0] = 42;                  // Ok
Debug.Assert(b[0] == 42);   // true
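
The same "read-only alias, not a snapshot" behavior can be seen with existing interfaces. The sketch below assumes IReadOnlyList<int> over an ArraySegment<int> as a rough analogue of ReadOnlySlice<int>; it is only an approximation, since the view could still be cast back to a writable type.

```csharp
using System;
using System.Collections.Generic;

int[] backing = { 2, 3, 5, 7 };

// ArraySegment<int> implements IReadOnlyList<int>, whose indexer has no setter.
IReadOnlyList<int> readOnlyView = new ArraySegment<int>(backing, 1, 2);

int before = readOnlyView[0];
// readOnlyView[0] = 42;     // would not compile: the indexer is read-only
backing[1] = 42;             // writing through the underlying array still works
int after = readOnlyView[0]; // the view reflects the new value
Console.WriteLine($"{before} -> {after}");
```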

While C# would not have special syntax to represent a ReadOnlySlice<T>, it could still have knowledge of the type. In particular, there is a very commonly-used type in C# that behaves like an array but that’s immutable: string. It’s very common for developers to want to slice off substrings from strings, and historically this has been a relatively expensive operation, as it involves allocating a new string object and copying the string data to it. With ReadOnlySlice<T>, the compiler could provide built-in support for slicing off substrings represented as ReadOnlySlice<char>. This could be done using the same slicing syntax as exists for arrays.

string helloWorld = "hello, world";
ReadOnlySlice<char> hello = helloWorld[0:5];

This would allow for substrings to be taken and handed around in a very efficient manner. In addition to new methods on String like Slice (a call to which is what the slicing syntax on strings would compile down to), String would also support an explicit conversion from a ReadOnlySlice<char> back to a string. This would enable developers to work with substrings efficiently, and then only create a copy as a string when actually needed.

Further, just as the C# compiler today has support for concatenating strings and switching on strings, it could also have support for concatenating ReadOnlySlice<char> and switching on ReadOnlySlice<char>:

string helloWorld = "hello, world";
ReadOnlySlice<char> hello = helloWorld[:5];
ReadOnlySlice<char> world = helloWorld[7:];
switch(hello) { // no allocation necessary to switch on a ReadOnlySlice<T>
    case "hello": Hello(); break;
    case "world": World(); break;
}
Debug.Assert(hello + world == "helloworld"); // only a single allocation needed for the concatenation

mikedn commented Jan 28, 2015

How do you make ReadOnlySlice work with both arrays and strings? Access the array/string via IList<T>?

Member

stephentoub commented Jan 28, 2015

@mikedn, in this proposal, slices would support operating over any region of memory, whether it was from an array or a native pointer or the char* to data in a string. Its implementation would require interacting with internals in the runtime, rather than operating over a publicly-exposed abstraction like IList<T>... you could of course do the latter, but at a non-trivial performance cost for certain scenarios.

theoy added the Language-C# label Jan 28, 2015

omariom commented Jan 28, 2015

Wow! Roslyn is starting to yield its fruits!
It is a very welcome feature.
I can see it being used in APIs for batched processing, and in many other places of course.

Porges commented Jan 28, 2015

I'd much prefer that existing BCL classes that 'only take a T[]' were extended to support IList<T> (or IReadOnlyList<T> as the case may be), so that we don't need additional CLR magic. Under this model {ReadOnly}Slice<T> are just wrappers around I{ReadOnly}List<T> with constrained offset/length (much like ArraySegment<T>). Copy() etc. can be supported too.

There's a sort-of tangential issue around being able to treat unmanaged memory as T[] which the proposal mentions. I'd like to be able to do this for (e.g.) passing byte* to Streams (without first copying it into a byte[]), but this could probably be handled as (another!) Stream method.

omariom commented Jan 28, 2015

@Porges, I don't think it would provide the efficiency of raw arrays.

HaloFour commented Jan 29, 2015

This is one of those things that I'd really prefer could be handled by the runtime itself (with C# support in conjunction, of course). By that I mean having the ability directly in the runtime to define an array that is a range within another array, where the runtime would manage the appropriate offset and bounds checking. I know that ArraySegment<T> exists and can be used as an IList<T>, but if you have a method that accepts only arrays, that doesn't help much.

To keep within the same syntax:

byte[] b1 = new byte[500];
byte[] b2 = b1[10:20]; // elements 10 through 19 of b1, so b2.Length == 10
b2[0] = 123;
Debug.Assert(b1[10] == 123);
b1[11] = 234;
Debug.Assert(b2[1] == 234);
b2[-1] = 123; // throws IndexOutOfRangeException
b2[10] = 123; // throws IndexOutOfRangeException

A similar mechanism would be useful for substrings, where instead of actually copying the portion of the original string into a new string the substring would retain a reference to the original string with an offset and length:

string s1 = "Hello World!";
string s2 = s1[6:11];
Debug.Assert(s2 == "World");

The one disadvantage to both is that the slice keeps a root reference to the original array or string alive for the lifetime of the slice.
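
A small sketch of that retention concern, using ArraySegment<byte> as a stand-in for a runtime-level slice (an assumption for illustration only): even a four-byte view holds a reference to the entire backing array, so the whole array stays reachable for as long as the view does.

```csharp
using System;

byte[] big = new byte[1_000_000];

// A four-byte view still stores a reference to the entire backing array.
var tiny = new ArraySegment<byte>(big, 0, 4);

bool referencesWholeArray = ReferenceEquals(tiny.Array, big);
int retainedLength = tiny.Array?.Length ?? 0;
Console.WriteLine($"{tiny.Count} visible, {retainedLength} retained");
```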

redknightlois commented Jan 29, 2015

Having done some work already on Array Slices (https://github.com/Codealike/arrayslice) I will share some of the gotchas I had to deal with...

If slices are implemented in C# as a native construct understood by the compiler, you will make math-oriented programmers like myself pretty happy. They are probably the ones who are seriously interested in having such a construct for performance reasons. IEnumerable, IList, etc. have such a big performance impact that they are provided for convenience and/or interop with application code only (see the arrayslice link for the performance impact).

While the implementation details (structs or classes, read-only or not) are very important at the language design level, the biggest issue lies beneath the language surface.

As it stands today, I know of 3 ways to handle this:

  • Implement a Slice class which just "overrides" the indexer. (Our implementation does that using IL rewriting for performance; Roslyn would just generate the proper IL.)
  • Implement it as some kind of IEnumerable (skip + take).
  • Do it properly, where slices are actually arrays at the runtime level.

The first has a very important drawback: if your code doesn't support Slice, you are screwed. Implicitly converting to an array would not work... you have to pass the whole array (defeating the purpose of the Slice) or copy the array, a no-no for the audience that would really use it...

The problem with the second is clear: performance... again a no-no for the intended audience.

As for the third, AFAIK there is no support at the runtime level to actually create an array with "shared" memory. If there is IL that can do that, I am more than interested to know how :) ... Therefore, unless we can allow that at the runtime level, slices will be useless or will clash with code already written.

Needless to say, where this is going is that there is also a serious need for generic numeric constraints (both fixed and floating point) to really make C# shine for performance math code. Traits, if implemented properly, would work for that.

I look forward to having an experience akin to what you have in Matlab or Python in terms of flexibility (not in syntax :D)

Federico

Miista commented Feb 12, 2015

In what way is a slice different from an array? Couldn't the type simply be int[] slice = …?
It's still just an array.

Also, I believe it would be clearer if you used the range operator (..) to mark a slice:

int[] array = …;
int[] slice = array[0..2];
int[] head = array[..2];
int[] tail = array[2..];

redknightlois commented Feb 12, 2015

@Miista AFAIK arrays in the CLR are not just a bunch of memory; the GC has to track them, so there should be a descriptor somewhere, etc. Therefore, a slice of that bunch of memory is not exactly an array. The CLR/Roslyn guys could surely give a more detailed answer, as I am interested in knowing that too. :)

prasannavl commented Apr 10, 2015

Awesome feature. Been waiting for this since the inception of C# itself 🎱

@stephentoub - While its uses are numerous, I'm curious how this is going to be used in practice in the case of strings.

In your example, you used the simplest case of switch,

string helloWorld = "hello, world";
ReadOnlySlice<char> hello = helloWorld[:5];
ReadOnlySlice<char> world = helloWorld[7:];
switch(hello) { // no allocation necessary to switch on a ReadOnlySlice<T>
    case "hello": Hello(); break;
    case "world": World(); break;
}
Debug.Assert(hello + world == "helloworld");

I'm curious here as to how a string is compared to a ReadOnlySlice<char>, since switch cases require compile-time constant values of the same type (absent additional compiler support).

That aside, for any practical efficiency gain when dealing with strings, the compiler needs to support an allocation-free representation of strings, since you almost always have to recreate a string from the ReadOnlySlice<char> (which triggers an allocation) to do anything useful with it, other than the switch case (and even that I still don't see working without compiler tweaks).

Unless a String.FromSlice or something of that nature is provided, which internally creates a string that represents the same area of memory, I see string slicing as quite pointless.

Now, assuming a String.FromSlice, a Slice.ToSourceFormat, or anything of that nature is provided, can this not be simplified to providing the source type itself, with controlled mutability, rather than a new type called Slice?

Example,

string helloWorld = "hello, world";

 // Internally built from the memory representation 
 // i.e, only the 'string' type is allocated (which acts a wrapper itself to the chars),
 // but simply representing the same area of memory
 // Conceptual pseudo: (String.FromSlice(String.Slice(helloWorld, 0, 5))
 // But can be efficiently done without the middle conversions directly.
string hello = helloWorld[:5];

string world = helloWorld[7:]; // Internally built from the memory representation again
switch(hello) { 
    case "hello": Hello(); break;
    case "world": World(); break;
}

Now, my point is: instead of creating a new type called Slice or ReadOnlySlice, which will require a reverse conversion at some point anyway and so reduce the potential efficiency gain, why not return arrays directly, and simply provide a way to create an array that maps onto the memory underlying another array of the same type?

int[] x = {1, 2, 3, 4, 5};

// It returns a new array that internally maps directly to the array of x.
// Again, the returned int array type implicitly represents the same area of memory.
int[] slice = x[1:4];

// Alternatively, 
int[] slice = Array.Slice(x, 1, 4);

// Readonly version: (Reuse existing types)
ImmutableArray<int> slice = Array.ImmutableSlice(x, 1, 4);

This ensures compatibility with all existing APIs, with no requirement for an API to deal with slices differently. IMO, an API shouldn't have to think about whether it's a slice or an array. As far as it is concerned, it is getting a unit of data to operate on. The sender can decide whether it's a slice that operates directly on the original (conceptually similar to refs) or a copy.

Does this not make sense? I really see no practical benefit or use case in separating it out as a brand new type, only more decisions to be dealt with and APIs polluted with another set of overloads.

jdh30 commented Apr 16, 2015

F# already has slices for both arrays and strings. Sadly, they deep copy which makes them too slow for many applications (I only use them in code golf). Aliasing is definitely the way to go. Provided the slice supports stride it could also help when hoisting bounds checks.

I wish .NET provided overloads for functions like System.Double.Parse that accepted string, start index and length rather than just string. I often find my parsing code is much slower than necessary because this API design incurs huge allocation rates from unnecessary objects.
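
To make the allocation concrete, here is a sketch of the (string, start index, length) shape described above. ParseDouble is a hypothetical helper, not an existing .NET API; without slices it has to go through Substring, which allocates a new string and copies the characters on every call.

```csharp
using System;
using System.Globalization;

// Hypothetical helper with the wished-for (string, start, length) shape.
double ParseDouble(string text, int start, int length)
{
    // Substring allocates a temporary string for each parsed field;
    // this is the per-call garbage a slice-based overload would avoid.
    string piece = text.Substring(start, length);
    return double.Parse(piece, CultureInfo.InvariantCulture);
}

string line = "1.5,2.25,10";
double first = ParseDouble(line, 0, 3);  // "1.5"
double second = ParseDouble(line, 4, 4); // "2.25"
double third = ParseDouble(line, 9, 2);  // "10"
Console.WriteLine(first + second + third);
```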

Contributor

jnm2 commented Apr 18, 2015

Is there any chance we could get string slices at the same time? Not just ReadOnlySlice<char>?

Member

gafter commented Apr 18, 2015

@jnm2 Yes, that would be part of the point. string.Substring would be more efficient than today.

jdh30 commented Apr 18, 2015

@gafter: I don't think you would want to break backward compatibility as Java has had some trouble with string slices keeping large strings reachable too long, i.e. memory leaks.

HaloFour commented Apr 18, 2015

@gafter Sounds like there is possibly movement on allowing a string to represent a range within another string? That would be awesome, as it would make slicing both performant and usable within all existing APIs. I kind of agree with @jdh30, though, that maybe it should be supported through a new member of string rather than string.Substring, as some code might not expect the slice to keep a root reference to the much larger original string.

Would the same be possible with arrays?

Przemyslaw-W commented Apr 18, 2015

If this comes with proper GC integration, then there will be no need for a new API. The GC just needs to deeply understand slices: when the original string (or array) is no longer referenced other than via slices, the parts not referenced by any slice can be collected.

@HaloFour

This comment has been minimized.

Show comment
Hide comment
@HaloFour

HaloFour Apr 18, 2015

@Przemyslaw-W

If the GC could pull off collecting a large string from which at least one slice was taken, then that would be great and would allay our concerns. My concern is that the slices would be treated as having references back to the parent string and thus keep it from being eligible for collection.

Another (tiny) reason to have a separate method is that we could establish a convention through which any type can be sliced. If a slice operation could function against any type that had a resolvable Slice(int,int) method (instance or extension) then the functionality could be provided to additional types. Off of the top of my head I could see slicing benefiting strings, arrays, any form of indexable collection, IEnumerable (via Skip+Take) and tuples.
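A minimal sketch of what such a convention could look like, using extension methods. Everything here is an assumption made for illustration: the class name is hypothetical, `Slice(int, int)` is read as (start, length), and the string version simply copies via `Substring` rather than sharing memory.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical extension methods; the BCL does not define these under this class.
public static class SliceConventionExtensions
{
    // Strings: this sketch copies the characters via Substring.
    public static string Slice(this string source, int start, int length)
        => source.Substring(start, length);

    // Arrays: ArraySegment<T> is a view over the original array, not a copy.
    public static ArraySegment<T> Slice<T>(this T[] source, int start, int length)
        => new ArraySegment<T>(source, start, length);

    // Any other sequence: lazily skip and take, as suggested in the comment.
    public static IEnumerable<T> Slice<T>(this IEnumerable<T> source, int start, int length)
        => source.Skip(start).Take(length);
}
```

With these in scope, a compiler-level slicing syntax could bind to whichever `Slice(int, int)` overload resolves for the receiver, the same way `foreach` binds to `GetEnumerator`.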


Przemyslaw-W Apr 18, 2015

Yeah, such an open convention would be really great. And I think such an API needs to be introduced anyway, as arrays do not have a "SubArray" method now. But still, we can have our cake and eat it too. If a proper GC update comes with it, then Substring can be rewritten to use slicing internally. However, if the GC won't cooperate, then I agree it would be better to leave the current Substring implementation as is.


JamesNK Apr 20, 2015

Member

If slice is added, will it work with IList? IList is much more commonly used than raw arrays. A nice syntax for getting ranges of data should work with the most commonly used data structure.


xen2 Apr 20, 2015

Ideally it would be great if such slice would not require allocation (i.e. be encoded in a struct that can be passed by ref/copy).

If not, it would result in two allocations and two indirections most of the time (and more objects for the GC to track).

Rust is doing something similar already: https://doc.rust-lang.org/std/slice/

Of course, if runtime can be modified, other options might be possible too.
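As a rough illustration of the allocation-free idea, here is a sketch of a slice encoded as a struct. The type name `ArraySlice<T>` and its members are hypothetical and are not the `Slice<T>` from the proposal; creating or sub-slicing it only copies three fields and never allocates on the heap.

```csharp
using System;

// Hypothetical type for illustration only.
public struct ArraySlice<T>
{
    private readonly T[] _array;
    private readonly int _offset;
    private readonly int _length;

    public ArraySlice(T[] array, int offset, int length)
    {
        if (array == null) throw new ArgumentNullException(nameof(array));
        if (offset < 0 || length < 0 || offset > array.Length - length)
            throw new ArgumentOutOfRangeException();
        _array = array;
        _offset = offset;
        _length = length;
    }

    public int Length => _length;

    // Reads and writes go straight to the underlying array.
    public T this[int index]
    {
        get
        {
            if ((uint)index >= (uint)_length) throw new IndexOutOfRangeException();
            return _array[_offset + index];
        }
        set
        {
            if ((uint)index >= (uint)_length) throw new IndexOutOfRangeException();
            _array[_offset + index] = value;
        }
    }

    // Sub-slicing produces another struct over the same array.
    public ArraySlice<T> Slice(int start, int length)
    {
        if (start < 0 || length < 0 || start > _length - length)
            throw new ArgumentOutOfRangeException();
        return new ArraySlice<T>(_array, _offset + start, length);
    }
}
```

The cost is one extra addition and a bounds check per access; removing that overhead is where runtime support would come in.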


prasannavl Apr 23, 2015

I don't understand why there are so many comments about slicing IEnumerable or IList. It simply doesn't make sense, since they aren't a contiguous representation of memory. They aren't even a direct representation of memory; they are very high-level structures. Conceptual slicing of them is already possible, and is no different from using Skip, Take, and their relatives. We're talking about efficiently referencing existing memory, which really only applies to arrays, or could be extended to objects in general, in which case the garbage collector itself has to be tweaked, which changes a lot more dynamics and brings the whole language closer to C/C++. If that is indeed being proposed, it seems completely out of scope for this thread.

I think the focus here should be only on arrays. If arrays are done the right way, ILists can easily be extended, perhaps through another interface that allows access to the IList's source array, which in turn can be sliced.


xen2 Apr 23, 2015

@weitzhandler Probably don't need syntactic sugar for a simple Skip+Take. I agree with @prasannavl that arrays should be the priority here.


ghost May 18, 2015

I'm from this CoreCLR issue and as @jkotas noted, the issue seems to be related to this one and could be implemented as an addition to the proposed slices.

In short: It's about creating a "view" of an array in order to wrap the same block of data in different element types in order to satisfy the needs of different APIs or libraries. Feel free to join the discussion or move it here altogether, in case this would be a suitable extension to slices.
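For readers on later .NET versions, this kind of reinterpreting view can be sketched with `Span<T>` and `MemoryMarshal.Cast`, which shipped well after this thread and are not part of the proposal discussed here. The helper name below is hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;

public static class ReinterpretDemo
{
    // Views the byte buffer as floats, writes one float through the view,
    // and returns how many floats the view exposes. No data is copied.
    public static int WriteFloatThroughView(byte[] buffer, int index, float value)
    {
        Span<float> floats = MemoryMarshal.Cast<byte, float>(buffer.AsSpan());
        floats[index] = value;
        return floats.Length;
    }
}
```

The write lands directly in the original `byte[]`, so an API that expects bytes and one that expects floats can share the same block of data.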


JeffreySax May 19, 2015

There are really two separate issues here:

  1. Adding slicing syntax to the language.
  2. Implementing slices for strings, arrays, and other objects.

For many scenarios, especially in technical computing, just allowing the syntax solves the most painful issue of readability. Libraries can take care of the implementation for specific types. This way, the language side of the feature can be fully designed without having to handle all the complications of an efficient implementation. Significantly, and unlike what @MadsTorgersen writes about @stephentoub 's proposal in #2136, it does not require any changes in the CLR.

In fact, this can all be rather simple if you define a slice to be a built-in type with a specific construction syntax: from:to = new Slice(from, to) and, hopefully, from:stride:to = new Slice(from, to, stride). So a 'slice' here is the set of integer indexes, not the actual view of the larger object.

'Open' slices (where the from and/or to part is omitted) are only allowed as arguments in an indexer property. In this case, the missing value is replaced with a suitable bound:

  • For the lower bound, if the instance has a method or extension method GetLowerBound(dimension), the lower bound is set to a call to this method; otherwise, the lower bound is 0.
  • For the upper bound, if the instance has a method or extension method GetUpperBound(dimension), the upper bound is set to a call to this method; otherwise, if the instance has a Length or Count property, the value of this property is used; otherwise, if the instance implements ICollection or IList<T>, the corresponding Count property is used; otherwise a compile-time error is generated.

Indexing with a slice then just maps to an indexer call. Alternatively, it can map to an extension method (GetSlice or SetSlice) with the same signature. This allows libraries to add slicing support to types like strings and arrays after the fact.

This solution works for 1D objects like arrays, lists, and strings, but also for 2D and higher-dimensional arrays. It doesn't fix the type of the slice/view. It doesn't make any assumptions about how slicing is implemented, or how slices would be used.
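The lowering described above could be sketched roughly as follows. This is only one reading of the comment: the `Slice` struct, the inclusive upper bound (chosen to match `GetUpperBound`), and the `GetSliceFrom` helper for an open upper bound are all assumptions, and this `GetSlice` copies elements rather than creating a view.

```csharp
using System;

// Hypothetical index-range type: a set of integer indexes, not a view.
// 'To' is treated as inclusive in this sketch.
public struct Slice
{
    public readonly int From;
    public readonly int To;
    public readonly int Stride;

    public Slice(int from, int to, int stride = 1)
    {
        if (stride <= 0) throw new ArgumentOutOfRangeException(nameof(stride));
        From = from;
        To = to;
        Stride = stride;
    }
}

public static class SliceExtensions
{
    // What arr[from:stride:to] might be lowered to for arrays.
    public static T[] GetSlice<T>(this T[] source, Slice slice)
    {
        if (slice.From < 0 || slice.To >= source.Length)
            throw new ArgumentOutOfRangeException(nameof(slice));
        int count = slice.To < slice.From ? 0 : (slice.To - slice.From) / slice.Stride + 1;
        var result = new T[count];
        for (int i = 0; i < count; i++)
            result[i] = source[slice.From + i * slice.Stride];
        return result;
    }

    // An open slice such as arr[from:]: the missing upper bound is filled in
    // from GetUpperBound(0), following the rule in the comment.
    public static T[] GetSliceFrom<T>(this T[] source, int from)
        => source.GetSlice(new Slice(from, source.GetUpperBound(0)));
}
```

Because the index set is separate from the storage, a library could supply a different `GetSlice` for strings, lists, or matrices without any change to this type.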


prasannavl May 19, 2015

@JeffreySax, I'm guessing you completely misunderstood what slicing really means here. Please read the last few comments, and dotnet/coreclr#1015.

PS: It does require language support. Syntax and API are completely useless without a way to get a consistent view of a certain memory area from the CLR.


redknightlois May 19, 2015

@JeffreySax Take a look at the link I posted in my comment at the start of the thread. An implementation using common constructs, with no CLR support for views of memory, as @prasannavl explains, defeats the purpose of Slices.

The difference in performance is embarrassing. To give a hint with a microbenchmark (which may be totally flawed, but it's the best we have), we get something like this:

Access via delimited array: 258ms. (this uses indexers)
Access via array segment: 68ms.
Access via inline no checks delimited array: 45ms. (this uses forced inlining)
Access without offset: 38ms. (this uses explicit code offsets)
Access via array slice: 38ms. (this uses IL manipulation)

Source: https://github.com/Codealike/arrayslice

As you can see, even paying 5% for Slices defeats the purpose of using them. Slices are no different from a custom IList if you don't care about performance; it is performance what makes Slices attractive and useful as a feature.


benaadams May 19, 2015

Contributor

As an aside, it's a bit like TypedArrays in JS?

Specifically the new TypedArray(buffer [, byteOffset [, length]]); constructor.


JeffreySax May 19, 2015

@redknightlois I strongly disagree that "it is performance what makes Slices attractive and useful as a feature." The principal motivation for adding slicing syntax is code clarity. Yes, it would be nice to have CLR-backed slices of strings and arrays, but it is not necessary to derive lots of benefit from the feature. It is common for numerical code to have 3-4 slicing operations in a single expression. Such code would benefit immensely from having slicing syntax.

F# has had slicing syntax from the beginning. Row and column slices of 2D collections were added in version 3.1. Clearly there was a demand for it. A similar demand exists for C#/VB.

How slices are implemented is not a C# language issue. If a specific implementation is particularly challenging, that should not keep the C# team from moving forward with such a useful feature.


prasannavl May 19, 2015

@JeffreySax,

In a hypothetical case, where I agree with you - C# already has slices. x.Skip(2).Take(10); If you want a refined way - Create a small wrapper class. Still not enough? Then, what you're looking for is sugar for the iterator syntax. Now, I don't think this is the best thread to ask for that.

Why, you ask?

  • Here's the shamelessly lifted problem statement from the first post with the large title "Problem":

However, it’s also very common to only want to share a portion of an array. This is typically achieved either by copying that portion out into its own array, or by passing around the array along with range indicators for which portion of the array is intended to be used. The former can lead to inefficiencies due to unnecessary copies of non-trivial amounts of data, and the latter can lead both to more complicated code as well as to lack of trust that the intended subset is the only subset that’s actually going to being used.
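For concreteness, the two workarounds named in the quoted problem statement look roughly like this in today's C#. The class and method names are made up for the example.

```csharp
using System;

public static class SharingWorkarounds
{
    // Workaround 1: copy the portion into its own array.
    // Safe to hand out, but costs an allocation and a copy.
    public static int[] CopyPortion(int[] source, int offset, int count)
    {
        var copy = new int[count];
        Array.Copy(source, offset, copy, 0, count);
        return copy;
    }

    // Workaround 2: pass the array along with range indicators.
    // No copy, but nothing stops the callee from touching elements
    // outside [offset, offset + count).
    public static int Sum(int[] source, int offset, int count)
    {
        int total = 0;
        for (int i = offset; i < offset + count; i++)
            total += source[i];
        return total;
    }
}
```

A slice in the sense of the proposal is meant to combine the two: no copy, as in the second method, with the callee restricted to the intended range, as in the first.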


omariom Jan 3, 2016

Since 1.0 we have had a lot of APIs accepting and returning arrays instead of enumerables and read-only wrappers. We have to live with it.

IMO, a good trade-off would be if slices implemented all the relevant methods of arrays and strings (with the same efficiency) and BCL classes gradually added overloads of popular methods like TryParse, etc.


HaloFour Jan 3, 2016

@omariom

Do you want another allocation when creating a slice?

Want? No, but it's probably unavoidable to at least require an object allocation if the result were to be a string or array, short of some severe CLR/GC shenanigans. But I'd prefer a slightly-more-expensive version rather than a largely-useless-due-to-lack-of-support-throughout-the-ecosystem version.


omariom Jan 3, 2016

Just checked how it is in other langs.
In Go, Rust and D, slices are value types and have separate syntax from arrays and strings.


omariom Jan 3, 2016

I would prefer to have slices now for my own work rather than waiting until the runtime is ready to implement them natively, uniformly and most efficiently; that could be done in the next version.


omariom Jan 3, 2016

@stephentoub
What will the Slice's indexer be returning? Reference or value?


stephentoub Jan 4, 2016

Member

@stephentoub What will the Slice's indexer be returning?

My hope would be a ref return, but we'll see. You can see current experimentation at https://github.com/dotnet/corefxlab/tree/master/src/System.Slices.
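To show what the reference-versus-value question changes for callers, here is a small sketch using C# 7.0 ref returns. `RefSlice<T>` and `Point` are hypothetical types written for this example and are not taken from the linked experimentation.

```csharp
using System;

public struct Point
{
    public int X;
    public int Y;
}

// Hypothetical slice type, written only to contrast the two indexer styles.
public struct RefSlice<T>
{
    private readonly T[] _array;
    private readonly int _offset;
    private readonly int _length;

    public RefSlice(T[] array, int offset, int length)
    {
        if (array == null) throw new ArgumentNullException(nameof(array));
        if (offset < 0 || length < 0 || offset > array.Length - length)
            throw new ArgumentOutOfRangeException();
        _array = array;
        _offset = offset;
        _length = length;
    }

    // Ref return: the caller gets an alias to the element itself.
    public ref T this[int index]
    {
        get
        {
            if ((uint)index >= (uint)_length) throw new IndexOutOfRangeException();
            return ref _array[_offset + index];
        }
    }

    // Value return: the caller gets a copy of the element.
    public T GetCopy(int index) => this[index];
}
```

With the ref-returning indexer, `slice[0].X = 7` updates the underlying array in place, just as it would on a plain array of structs. With a value-returning indexer the same statement would not compile, because it would be modifying a temporary copy.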


omariom Jan 4, 2016

@stephentoub I've already sent a couple of pull requests there )


jods4 Jan 5, 2016

@omariom

TL;DR sorry that comment ended up very long. Most interesting idea is probably point 3 at the end.

You raised some good points that got me thinking.
Making slices behave like strings / array might be harder than I first imagined.

Let's talk about strings (arbitrarily, I think everything applies equally to arrays).

Why I still believe that efficient interop with string is required
Playing the devil's advocate: if a slice is a struct that is not compatible with strings, then you already have it today. If you are concerned with perf in your string handling, most of the core string functions today accept a string + an offset + a length. Having a struct wrapping this info is more convenient than having 3 variables, but it's much the same in the end.

In particular it creates two worlds: the (few) functions that have optimized versions for people who need perf, and the rest of the world, which uses string. The problem with that vision is that sometimes people who need perf also need more advanced functions, and re-implementing everything is just plain wrong. The other problem is that at some point the two worlds often collide.

One example and then I move on to implementation ideas:
Say you want to write an efficient XML parser. It does not take long to understand that calling Substring() on every syntactical piece is going to allocate lots of objects. So say you use the new slices for that.
When the user processes the file and wants to know each tag name, what do you return? A string probably. So a copy has to be made.
But say that you are able to read an attribute value as a slice (yay for perf). If you know this attribute is a number and want to parse it, do you have an int.Parse overload that takes a slice? Probably not, so you need to make a copy at that point (and your parser is not as efficient as it could have been), or you implement your own int.Parse accepting slices (very wrong).
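For what it's worth, later .NET releases did add span-based parsing overloads, so the attribute scenario above can be sketched without an intermediate string. This uses `ReadOnlySpan<char>` rather than the `Slice<char>` discussed in this thread, and the helper name and its quote-finding logic are assumptions made for the example.

```csharp
using System;
using System.Globalization;

public static class AttributeParsing
{
    // Parses the integer between the first pair of double quotes
    // without allocating a substring for the attribute value.
    public static int ParseQuotedInt(string xmlAttribute)
    {
        int open = xmlAttribute.IndexOf('"');
        int close = xmlAttribute.IndexOf('"', open + 1);
        if (open < 0 || close < 0)
            throw new FormatException("Expected a quoted value.");

        // A view over the characters between the quotes; no copy is made.
        ReadOnlySpan<char> value = xmlAttribute.AsSpan(open + 1, close - open - 1);
        return int.Parse(value, NumberStyles.Integer, CultureInfo.InvariantCulture);
    }
}
```

This only helps where an overload exists, which is the two-worlds concern in a nutshell: any API that still takes only `string` forces a copy.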

Can string/slice compatibility even work?
That problem sure is hard. You are right that a true string is more than a length and a pointer to a char buffer. It also starts with the syncblock and the vtable pointer. These can't be removed from a reference type, so to be 100% compatible with a string, a slice should have them as well... maybe?

Here are all the solutions that I can think of:

  1. Let's do the opposite! Implement everything as Slice<char> and make string implicitly convertible to Slice<char> (which is easy to do and very cheap, copy the length and pointer in a struct).
    Great idea if we started today. Maybe MS can pull this off in the BCL, but there are 14 years' worth of existing libraries out there :(
    Crazy idea: could the JIT do that? Convert any method that takes a string into a method that takes a Slice<char> and modify the call sites accordingly. That seems a bit crazy to me but hey...
    Of course the main issue is that some methods can't be converted, e.g. if they lock the string, or use it as an object or interface, etc. In those cases the JIT should convert Slice<char> parameters to proper strings instead...
  2. Let's make Slice<char> a reference type. I think that easily solves most issues, as we could adopt exactly the same layout as string.
    It would mean heap allocation, something we strive to avoid in perf critical code. This is truly a huge drawback, probably not acceptable.
    But what if... .NET could allocate references on the stack? This would mostly solve the issue here and boost performance of many other cases. Doing this is a hard problem and requires careful escape analysis, but if it could work even just in basic cases, there could be lots of benefits in terms of perf.
  3. Make Slice<char> implicitly and efficiently (no copy) convertible to string.
    That's rather easy: allocate one new string ref and make its char buffer point in the middle of the existing slice target.
    That's a bit of a compromise: we incur one allocation, but we don't copy the buffer and we can use any existing method or return to any user code with a plain string.
    In the end, because of the "reference type" baggage of string, 0 allocation is probably not achievable anyway.

One big issue that remains is that if you can return a "slice" string, it can live longer than the underlying buffer, which might cause trouble to GC (keeping a huge buffer alive for a tiny substring).

jods4 commented Jan 5, 2016

@omariom

TL;DR sorry that comment ended up very long. Most interesting idea is probably point 3 at the end.

You raised some good points that got me into thinking.
Making slices behave like strings / array might be harder than I first imagined.

Let's talk about strings (arbitrarily, I think everything applies equally to arrays).

Why I still believe that efficient interop with string is required
Playing the devil's advocate: if a splice is a struct that is not compatible with strings, than you already have it today. If you are concerned with perfs in your string handling, most of the core string functions today accept a string + an offset + a length. Having a struct wrapping this info is more convenient than having 3 variables, but it's quite the same in the end.

In particular it creates two worlds: the (few) functions that have optimized versions for people who needs perf, and the rest of the world who uses string. The problem with that vision is that sometimes people who need perf also need more advanced functions and re-implementing everything is just plain wrong. The other problem is that often at some points both worlds collide.

One example and then I move on to implementation ideas:
Say you want to write an efficient XML parser. It does not take long to understand that calling Substring() on every syntactical piece is going to allocate lots of objects. So say you use the new slices for that.
When the user processes the file and wants to know each tag name, what do you return? A string probably. So a copy has to be made.
But say that you are able to read an attribute value as a slice (yay for perf). If you know this attribute is a number and want to parse it, do you have a int.Parse overload that takes a slice? Probably not so you need to make a copy at that point (and your parser is not as efficient as it could have been), or you implement your own int.Parse accepting slices (very wrong).

Can string/slice compatibility even work?
That problem sure is hard. You are right that a true string is more than a length and a pointer to a char buffer. It also starts with the syncblock and the vtable pointer. These can't be removed from a reference type, so to be 100% compatible with a string, a slice should have them as well... maybe?

Here are all the solutions that I can think of:

  1. Let's do the opposite! Implement everything as Slice<char> and make string implicitly convertible to Slice<char> (which is easy to do and very cheap, copy the length and pointer in a struct).
    Great idea if we started today. Maybe MS can pull this off in the BCL, but there are 14 years' worth of existing libraries out there :(
    Crazy idea: could the JIT do that? Convert any method that takes a string into a method that takes a Slice<char>, and modify the call sites accordingly. That seems a bit crazy to me but hey...
    Of course the main issue is that some methods can't be converted, e.g. if they lock the string, or use it as an object or interface, etc. In those cases the JIT should convert Slice<char> parameters to proper strings instead...
  2. Let's make Slice<char> a reference type. I think that easily solves most issues, as we could adopt exactly the same layout as string.
    It would mean heap allocation, something we strive to avoid in perf critical code. This is truly a huge drawback, probably not acceptable.
    But what if... .NET could allocate references on the stack? This would mostly solve the issue here and boost performance of many other cases. Doing this is a hard problem and requires careful escape analysis, but if it could work even just in basic cases, there could be lots of benefits in terms of perf.
  3. Make Slice<char> implicitly and efficiently (no copy) convertible to string.
    That's rather easy: allocate one new string ref and make its char buffer point in the middle of the existing slice target.
    That's a bit of a compromise: we incur one allocation, but we don't copy the buffer and we can use any existing method or return to any user code with a plain string.
    In the end, because of the "reference type" baggage of string, 0 allocation is probably not achievable anyway.

One big issue that remains is that if you can return a "slice" string, it can live longer than the underlying buffer, which might cause trouble to GC (keeping a huge buffer alive for a tiny substring).
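
To make option 1 concrete, here is a rough sketch of a slice struct that a string converts to implicitly. `CharSlice` is a hypothetical stand-in written for this example, not the proposed `Slice<char>`; it only shows that the string-to-slice direction is cheap, while going back to a string still copies.

```csharp
using System;

// Hypothetical slice type, for illustration only; not the proposed Slice<char>.
readonly struct CharSlice
{
    private readonly string _source;   // keeps the whole source string alive
    public int Offset { get; }
    public int Length { get; }

    public CharSlice(string source, int offset, int length)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (offset < 0 || length < 0 || offset > source.Length - length)
            throw new ArgumentOutOfRangeException();
        _source = source;
        Offset = offset;
        Length = length;
    }

    public char this[int index]
    {
        get
        {
            if ((uint)index >= (uint)Length) throw new IndexOutOfRangeException();
            return _source[Offset + index];
        }
    }

    // Cheap direction: copies one reference and two ints, no character data.
    public static implicit operator CharSlice(string s) => new CharSlice(s, 0, s?.Length ?? 0);

    // Sub-slicing is also allocation-free.
    public CharSlice Slice(int start, int length)
    {
        if (start < 0 || length < 0 || start > Length - length)
            throw new ArgumentOutOfRangeException();
        return new CharSlice(_source, Offset + start, length);
    }

    // Expensive direction: going back to string allocates and copies.
    public override string ToString() => _source.Substring(Offset, Length);
}

static class CharSliceDemo
{
    static void Main()
    {
        CharSlice whole = "hello world";      // implicit conversion, no copy
        CharSlice word = whole.Slice(6, 5);
        Console.WriteLine(word.ToString());   // prints world
    }
}
```

Note that the `_source` field also illustrates the GC concern: a five-character slice keeps the entire original string reachable.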

Thaina Jan 16, 2016

I am against this proposal because we already have ArraySegment, ReadOnlyCollection, and IEnumerable. What we really lack is the ability to work with them like an actual array.

That, instead, is what the return-by-ref feature would provide. So we should put your slicing back into ArraySegment instead.

alrz Jan 22, 2016

Contributor

It would be nice if we could capture slices within array patterns,

switch (array) {
  case { var first, int[:] slice , var last }: ...
  // or perhaps
  case { var first, var slice.. , var last }: ...
}

or something like that.

jods4 Jan 22, 2016

@alrz I think I would prefer a third syntax, similar to your second: { var first, var ...slice, var last }
I think it's more consistent with other languages (e.g. destructuring in ES6) and it doesn't preclude var usage (unlike your first suggestion). In languages that support this, the syntax is usually consistent with the spread operator (should C# ever get that?) and with params arguments (although C# has a different take on that one).

alrz Jan 22, 2016

Contributor

@jods4 I agree that the first one is ambiguous, but what do you mean by "more consistent with other languages"?

jods4 Jan 22, 2016

@alrz I was thinking about destructuring arrays in other languages, which is quite similar to pattern matching (albeit unconditional).
But to be honest, my impression about that wasn't correct. After doing some actual research, there are as many variations as there are languages. A few examples:
ES6 puts the dots before the name, like I suggested: let [first, ...rest] = array (although there the rest element must come last).
CoffeeScript does the opposite (your way): [first, middle..., last] = array.
Ruby uses a star (splat): first, *rest = [1, 2, 3]
Clojure uses an ampersand: (let [[first & rest] vector] ...)

So... forget about that comment! Although I still like the ES6/TS syntax ;)

DerpMcDerp Feb 22, 2016

Having a count is far more common than having the index of the end element so it would be more convenient to programmers if the syntax foo[i:n] meant [i, i+n) rather than [i, n).

Alternatively, the current proposal could be kept intact, but an implicit variable $i could be created within the scope of the rhs, representing the value computed on the lhs:

foo[a + b:$i + n]; // $i == a + b
foo[a:/* $i's scope begins here */ $i + 2 /* $i's scope ends here */];
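
The difference between the two readings can be sketched with the existing `ArraySegment<T>` type, which is count-based. `SliceBounds`, `ByEnd`, and `ByCount` are hypothetical helper names used only for this comparison.

```csharp
using System;

static class SliceBounds
{
    // [start, end): end-exclusive reading, as foo[i:n] works in the current proposal.
    public static ArraySegment<T> ByEnd<T>(T[] array, int start, int end)
        => new ArraySegment<T>(array, start, end - start);

    // [start, start + count): the count-based reading suggested above.
    public static ArraySegment<T> ByCount<T>(T[] array, int start, int count)
        => new ArraySegment<T>(array, start, count);

    static void Main()
    {
        int[] data = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
        Console.WriteLine(ByEnd(data, 2, 5).Count);   // prints 3 (elements 2, 3, 4)
        Console.WriteLine(ByCount(data, 2, 5).Count); // prints 5 (elements 2 through 6)
    }
}
```

The same pair of arguments selects different ranges under each convention, so whichever one the syntax adopts, the other needs an explicit `end - start` or `start + count` at the call site.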

ilexp Apr 4, 2016

I haven't followed through the entire discussion, but what's the current idea on having "array views" / slices that have a different type than the original array they share data with, i.e. CoreClr issue 1015? Will this be possible within reasonable constraints?

General idea:

byte[] someRawData = /*...*/;

// Create a different view on the same data without copying 
// (for performance and library communication reasons)
MyStruct[] interpretedBlittableData = Array.CreateView<MyStruct>(someRawData, /*...*/);
// (The above throws an exception if MyStruct isn't considered safe for this)

Could probably be considered a slicing addon.

omariom Apr 12, 2016

@ilexp
In the prototype, this is possible with the Cast method.
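
As a hedged sketch of what such a reinterpreting view looks like, the example below uses `MemoryMarshal.Cast` from the span APIs that shipped later (.NET Core 2.1+, or the System.Memory package). That is an assumption about the reader's runtime, not the prototype's exact API. `MyStruct` and `ViewDemo` are hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical blittable struct: two 4-byte fields, 8 bytes in total.
struct MyStruct
{
    public int Id;
    public float Value;
}

static class ViewDemo
{
    // Reinterprets the bytes as MyStruct values without copying.
    // MemoryMarshal.Cast rejects types that contain references.
    public static Span<MyStruct> AsStructs(byte[] raw)
        => MemoryMarshal.Cast<byte, MyStruct>(raw.AsSpan());

    static void Main()
    {
        byte[] raw = new byte[16];
        Span<MyStruct> view = AsStructs(raw);
        view[1].Id = 7;                                   // writes through to raw[8..12)
        Console.WriteLine(view.Length);                   // prints 2
        Console.WriteLine(BitConverter.ToInt32(raw, 8));  // prints 7
    }
}
```

The result is a `Span<MyStruct>` rather than a `MyStruct[]`, so it cannot be passed to methods that expect a real array, which is the limitation the question above is probing.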

gminorcoles Apr 29, 2016

I see a slice as an array of indexes, not as a subset of the original array-like object. Regardless of how you support this concept, I believe that this is key to usability.

juliusfriedman Apr 30, 2016

In my experience, every method that takes a plain array should also be able to take an IList.

It seems easy to imagine offset and length parameters being automatically added as an overload of the method if the parameter is an IList.

Furthermore, I think you can skip the boxing that Slice requires by creating an 'Adapter' method on IList combined with the above.

Read-only semantics are already enforced by IList, which also solves the majority of the other issues with immutable data.

choikwa Jun 18, 2016

A couple of things that bothered me:

  • Slice as its own type as opposed to Slice returning child array type
  • Slice as mutable reference to original array by default
  • Lack of mentioning aliasing analysis work for overlapping references

The easiest solution is to deep-copy everywhere, make everything mutable, and let the CLR/RyuJIT deal with it, but that is essentially what kills performance. What gives performance is constant/immutable references and copy-as-needed per mutation. String is already immutable, which should make it easy to slice. I'm neither an aliasing nor a GC expert, so I can't comment on the problem of 'small ref holding onto large superset'. I'm hoping this is a problem experts have previously tackled and have mitigation strategies for.

ilexp Jun 18, 2016

I see a slice as an array of indexes, not as a subset of the original array-like object. Regardless of how you support this concept, I believe that this is key to usability.

That sounds like it wouldn't provide the same performance as accessing a regular array, which would defeat the purpose of one of the use cases for slicing: Providing high-performance view access to an array / memory block without copying.

What gives performance is constant/immutable references and copy-as-needed per mutation.

If you're going to mention performance, I believe that argument is in favor of mutability, rather than immutability. How is it more performant if you have to copy all the data on every mutation? Slices being mutable allows lightweight, high-performance array / memory views for both read access and mutation. If they're immutable, you'll get only half the use cases and have to use costly copy operations for the other half. You can still decide to copy a mutable array / slice if you want to - or pass it around as an IReadOnlyList<T> if you don't want others to mutate it.

cesarsouza Jun 18, 2016

+1 against having a slice be an array of indices. It would be better to learn from frameworks/libraries that got it right, such as Python's NumPy. Slices should represent views of the original array and be interpretable by ordinary functions just like ordinary arrays. It should be totally transparent to called functions whether they are processing an int[] or an int[5:10], or however a slice ends up being defined.

To be honest, I couldn't completely understand from the above discussion why touching the compiler is sometimes seen as something to be avoided. In my view, this is a critical feature for #10378 that cannot be left half-baked (for example, by having only a pure BCL solution). Also, deprecating ArraySegment and re-implementing it in terms of array slices should be considered as an option. It is not as if there haven't been any breaking changes since .NET 1.0.

Array slices (or, more generally, safe memory views) are absolutely necessary for the success of C# as a language for high-performance computing. Right now, Python is taken far more seriously for high-performance computing than C#, and this really shouldn't be the case. Python is a fine language, but it was C# that initially proposed the no-compromise solution of unsafe contexts for more performant code, for example. The fact that we are not able to fulfill one of the first premises of the language might be a sign that even large or possibly breaking changes should be considered at this point.

@ilexp ilexp referenced this issue Jun 21, 2016

Closed

Span<T> #5851

25 of 28 tasks complete

juliusfriedman Jun 21, 2016

Why do you need a 'Slice' or ArrayView to have high performance?

Any method that takes an array should have offset and length parameters; if the method takes an IList, then you can easily mock something up, as can be seen here: Array

The key thing to take away from that example is the use of ref combined with System.Runtime.InteropServices.Marshal.UnsafeAddrOfPinnedArrayElement.

Finally, and in closing: the GC and JIT changes on their own will definitely be enough to seriously consider C# for a high-performance solution (if for some reason it's not already), Mono or otherwise. I don't see this Span or ArrayView issue being anything more than fluff for people who are coming from other languages. If MS does accommodate those programmers, I feel the best way would be to enable [=] as an implicit cast to [] and ensure that no further allocation occurs.

If you look at the Reference Source, you will see that there is an ArrayContracts which can be expanded upon. That makes such a concept quite easy to achieve and gives the possibility of a native 'Span' as an array with any Length and starting Offset it wants.

Furthermore, in later versions of the framework it could well be possible to allow methods that take a plain [] to take [=] / [:], and access to member [0] would simply be adjusted to add the Offset of the Contract instance, just as in other languages. But there is obviously no way to achieve this without modifying the compiler and possibly the runtime. Yes, that would make C# or .NET in general comparable to some features found in other languages, but it will provide no other tangible performance benefit whatsoever, and it will definitely not enhance the API of any .NET library that already takes an array (IList), offset, and length as parameters.

There are plenty of other things besides 'slicing' that can help performance, such as reference-type stack allocation, SIMD, inter alia, which should be looked into well before time is wasted on this.

choikwa Jun 27, 2016

@cesarsouza Python's Achilles heel is the GIL, and Python devs have a penchant for single-threaded performance over asynchronous, multithreaded operations. It is fine as a scripting language for synchronous tasks.

If you're going to mention performance, I believe that argument is in favor of mutability, rather than immutability.

@ilexp I'm going to go ahead and say "It depends on the context" and whichever choice roslyn gets means backend will have to adjust their heuristics to catch the other case. Immutability is a very important property for compiler optimizations.

GeirGrusom Jun 28, 2016

@ilexp I'm going to go ahead and say "It depends on the context" and whichever choice roslyn gets means backend will have to adjust their heuristics to catch the other case. Immutability is a very important property for compiler optimizations.

Arrays are mutable... making an immutable slice doesn't magically change that.
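
A small sketch of that point, using the existing `IReadOnlyList<T>` interface as the read-only view (arrays implement it since .NET 4.5). `ReadOnlyViewDemo` and `ReadAfterWrite` are hypothetical names for this example.

```csharp
using System;
using System.Collections.Generic;

static class ReadOnlyViewDemo
{
    // Returns what a read-only view observes after the underlying array is changed.
    public static int ReadAfterWrite()
    {
        int[] data = { 1, 2, 3 };
        IReadOnlyList<int> view = data;   // read-only view, no copy
        data[0] = 42;                     // mutate through the original array
        return view[0];                   // the view sees the new value
    }

    static void Main()
    {
        Console.WriteLine(ReadAfterWrite()); // prints 42
    }
}
```

The view only prevents writes through itself. Anyone holding the array can still change what the view observes, so an optimizer cannot treat the viewed data as constant.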

@gafter gafter referenced this issue Feb 26, 2017

Open

Champion "slicing" / Range #185

1 of 5 tasks complete

gafter Mar 22, 2017

Member

This proposal is now tracked at dotnet/csharplang#185

@gafter gafter closed this Mar 22, 2017

@vors vors referenced this issue Dec 22, 2017

Merged

Fix range operator #5732

6 of 7 tasks complete