Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slice can be simplified does not produce the same code #47629

Open
msedi opened this issue Sep 11, 2020 · 9 comments
Open

Slice can be simplified does not produce the same code #47629

msedi opened this issue Sep 11, 2020 · 9 comments
Labels
Area-Compilers Code Gen Quality Room for improvement in the quality of the compiler's generated code help wanted The issue is "up for grabs" - add a comment if you are interested in working on it New Language Feature - Range Range
Milestone

Comments

@msedi
Copy link

msedi commented Sep 11, 2020

Currently the IDE suggest with Span not to use slice but instead use the Range indexer (IDE0057: Slice can be simplified to this). So the original code:

new Vector<T>(B.Slice(Idx))

is then translated to

new Vector<T>(B[Idx..])

the resulting generated code is different (which is OK)

    public void WithSlice() {
        ReadOnlySpan<float> A = new float[100];
        
        for (int i=0; i<A.Length; i+= Vector<float>.Count)
            new Vector<float>(A.Slice(i));   
    }
    
       public void WithRange() {
        ReadOnlySpan<float> A = new float[100];
        
        for (int i=0; i<A.Length; i+= Vector<float>.Count)
            new Vector<float>(A[i..]);
    }

and the result taken from sharplab:

    public void WithSlice()
    {
        ReadOnlySpan<float> readOnlySpan = new float[100];
        for (int i = 0; i < readOnlySpan.Length; i += Vector<float>.Count)
        {
            new Vector<float>(readOnlySpan.Slice(i));
        }
    }

    public void WithRange()
    {
        ReadOnlySpan<float> readOnlySpan = new float[100];
        for (int i = 0; i < readOnlySpan.Length; i += Vector<float>.Count)
        {
            ReadOnlySpan<float> readOnlySpan2 = readOnlySpan;
            int length = readOnlySpan2.Length;
            int num = i;
            int length2 = length - num;
            new Vector<float>(readOnlySpan2.Slice(num, length2));
        }
    }

but the JIT Asm seems to be different. I didn't make any performance comparisons, but for my environment it is crucial that the suggested rewrite does not alter the performance of the code.

@stephentoub stephentoub transferred this issue from dotnet/runtime Sep 11, 2020
@svick
Copy link
Contributor

svick commented Sep 11, 2020

@stephentoub This was intentionally opened in dotnet/runtime. The assumption here is that while the IL is different, the JIT should be able to optimize both versions equally. See dotnet/csharplang#3879 and #47626 for previous discussions of this.

@stephentoub
Copy link
Member

stephentoub commented Sep 11, 2020

@svick The C# compiler and the types it's generating use of are resulting in significantly more complex code for the JIT to then need to unwind. Such issues are covered by existing issues like dotnet/runtime#11848 and dotnet/runtime#11870. If this issue is about the JIT, it should just be closed.

That said, the C# compiler is doing a disservice here by using Slice(start, count) rather than just Slice(start). The latter has less code in it, less checks to be performed, and less code required to invoke it. I don't know that the JIT would ever be able to make them identical; in theory it could, in practice that's a whole lot of analysis. This was all raised when the indexing feature was introduced, and it was decided by the language team that such level of optimization wasn't important.

If the analyzer from dotnet/roslyn is encouraging this transformation and folks expect the resulting code to be identical, then the dotnet/roslyn compiler should be updated to abide. There's no guarantee the JIT will produce the same asm, otherwise.

@CyrusNajmabadi
Copy link
Member

(IDE0057: Slice can be simplified to this).

Note that suggestions do not, and have never, assured that you will get the exact same performance (or necessarily even the exact same behavior). The suggestion here is merely that, making you aware that you could write the above in a different fashion if you would prefer to do so. This is why it's only a suggestion, and not a warning (or an error).

In this case, we are suggesting this because the type abides by the Slice-pattern that the language/compiler recognizes, and for most people the code may be clearer using the new dedicated syntax for this. If you don't want to use the new language form due to the potential for increased performance, you can suppress the issue here, or disable that feature entirely :)

--

Now, as to the compiler. I believe @agocke worked on this. Andy, what are your thoughts on this:

That said, the C# compiler is doing a disservice here by using Slice(start, count) rather than just Slice(start). The latter has less code in it, less checks to be performed, and less code required to invoke it.

Seems like we could likely take a reasonably safe change to allow that given that the risk of breakage would be quite low. Given the benefits @stephentoub mentions, it may be worth it.

@agocke
Copy link
Member

agocke commented Sep 11, 2020

That said, the C# compiler is doing a disservice here by using Slice(start, count) rather than just Slice(start). The latter has less code in it, less checks to be performed, and less code required to invoke it.

I don't think it's necessarily a bad idea, just wasn't strictly necessary to do as part of the feature. Certainly the produced IL could be improved, regardless of which method is being called, That it wasn't was purely a question of implementation budget, not technical possibility.

Slice(start) could also be recognized as an acceptable pattern -- although it is a slight breaking change in the case that Slice(start, count) has different behavior. I'm not particularly worried about that scenario, though.

@agocke
Copy link
Member

agocke commented Sep 11, 2020

Also, there are comments around about the JIT being unable to "see through" the Range type. That's true, but that was a comment on the old design where the syntax required a Range indexer. The example above, and most code generally, use the "pattern" indexer, which is just sugar for calling the Slice method. At best we are probably saving a few int comparisons by changing the code here, and I'm not sure the JIT output is appreciably different.

@stephentoub
Copy link
Member

At best we are probably saving a few int comparisons by changing the code here, and I'm not sure the JIT output is appreciably different.

To be clear, we're talking about a very micro-optimization that will have little bearing on the majority of code developers write; for most code bases, span[lower..] is perfectly fine. It's just the kind of thing that impacts lower-level APIs like Int32.Parse(span) such that we avoid using span[lower..] in dotnet/runtime and need to avoid such offered transformations by the IDE, and it'd be nice if we didn't have to / didn't have to think about it.

Here's the current asm difference...
For private Span<byte> Slice(Span<byte> span) => span.Slice(50);:

; Program.Slice(System.Span`1<Byte>)
       sub       rsp,28
       mov       rax,[r8]
       mov       ecx,[r8+8]
       cmp       ecx,32
       jb        short M01_L00
       add       ecx,0FFFFFFCE
       add       rax,32
       mov       [rdx],rax
       mov       [rdx+8],ecx
       mov       rax,rdx
       add       rsp,28
       ret
M01_L00:
       call      System.ThrowHelper.ThrowArgumentOutOfRangeException()
       int       3
; Total bytes of code 43

and for private Span<byte> Range(Span<byte> span) => span[50..];

; Program.Range(System.Span`1<Byte>)
       sub       rsp,28
       mov       rax,[r8]
       mov       ecx,[r8+8]
       lea       r8d,[rcx+0FFCE]
       mov       r9d,r8d
       add       r9,32
       mov       ecx,ecx
       cmp       r9,rcx
       ja        short M01_L00
       add       rax,32
       mov       [rdx],rax
       mov       [rdx+8],r8d
       mov       rax,rdx
       add       rsp,28
       ret
M01_L00:
       call      System.ThrowHelper.ThrowArgumentOutOfRangeException()
       int       3
; Total bytes of code 54

@msedi
Copy link
Author

msedi commented Sep 14, 2020

will have little bearing on the majority of code developers

In the end for me it's OK, because I know now that it's different to use one or the other. If would be nice if we talk about numbers. I can try to create a benchmark with the situation I currently have.

Today we benchmarked our vectorized math operations when we switched from using regular arrays to Span.
We saw a performance drop of around a factor of 2-3 (netcoreapp3.1). We need to check this also with (net5.0).
But this is another story and maybe not due to this problem.

@msedi
Copy link
Author

msedi commented Sep 14, 2020

OK here's the benchmark. I have provided it for both, net5.0 and netcoreapp3.1. At a first look it doesn't seem to be much (~0.5ms) but we are doing this at least 1.000.000 times which would be 500.000ms, we is not negligible. As I said I can live with Slice() without using Range . And @stephentoub is right, that in most cases it wouldn't be a problem, because you have both options. You only need to be aware that it has a bit difference in performance,.

I'm also happy that the array-based method seems to be worse than the Span<T> method.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.17134.1667 (1803/April2018Update/Redstone4)
Intel Xeon Silver 4116 CPU 2.10GHz, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.100-preview.8.20417.9
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.40711, CoreFX 5.0.20.40711), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.40711, CoreFX 5.0.20.40711), X64 RyuJIT


|            Method |     Mean |     Error |    StdDev |
|------------------ |---------:|----------:|----------:|
|  AddSpanWithSlice | 2.556 ms | 0.0343 ms | 0.0321 ms |
|  AddSpanWithRange | 2.930 ms | 0.0567 ms | 0.0776 ms |
| AddArrayWithSlice | 3.771 ms | 0.0257 ms | 0.0240 ms |
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.17134.1667 (1803/April2018Update/Redstone4)
Intel Xeon Silver 4116 CPU 2.10GHz, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.100-preview.8.20417.9
  [Host]     : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
  DefaultJob : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT


|            Method |     Mean |     Error |    StdDev |
|------------------ |---------:|----------:|----------:|
|  AddSpanWithSlice | 2.622 ms | 0.0519 ms | 0.0950 ms |
|  AddSpanWithRange | 2.850 ms | 0.0303 ms | 0.0283 ms |
| AddArrayWithSlice | 3.915 ms | 0.0779 ms | 0.2185 ms |

Benchmark code attached in a situation we are currently using for image processing.

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

using System;
using System.Numerics;

namespace ConsoleApp8
{
    class Program
    {
        static void Main(string[] args)
        {
            BenchmarkRunner.Run<VectorizedBenchmark>();
        }
    }

    public class VectorizedBenchmark
    {
        int N = 2048 * 2048;
        float[] DataA;
        float[] DataB;
        float[] Result;

        [GlobalSetup]
        public void GlobalSetup()
        {
            DataA = new float[N];
            DataB = new float[N];
            Result = new float[N];
            for (int i = 0; i < N; i++)
            {
                DataA[i] = i;
                DataB[i] = i;
            }
        }

        [Benchmark]
        public float[] AddSpanWithSlice()
        {
            int StepSize = Vector<float>.Count;
            int L = N - (N % StepSize);
            int i = 0;

            ReadOnlySpan<float> A = DataA;
            ReadOnlySpan<float> B = DataA;
            Span<float> C = Result;

            for (; i < L; i += StepSize)
            {
                (new Vector<float>(A.Slice(i)) + new Vector<float>(B.Slice(i))).CopyTo(C.Slice(i));
            }

            for (; i < N; i++)
            {
                Result[i] = DataA[i] + DataB[i];
            }

            return Result;
        }

        [Benchmark]
        public float[] AddSpanWithRange()
        {
            int StepSize = Vector<float>.Count;
            int L = N - (N % StepSize);
            int i = 0;

            ReadOnlySpan<float> A = DataA;
            ReadOnlySpan<float> B = DataA;
            Span<float> C = Result;

            for (; i < L; i += StepSize)
            {
                (new Vector<float>(A[i..]) + new Vector<float>(B[i..])).CopyTo(C[i..]);
            }

            for (; i < N; i++)
            {
                Result[i] = DataA[i] + DataB[i];
            }

            return Result;
        }

        [Benchmark]
        public float[] AddArrayWithSlice()
        {
            int StepSize = Vector<float>.Count;
            int L = N - (N % StepSize);
            int i = 0;

            for (; i < L; i += StepSize)
            {
                (new Vector<float>(DataA, i) + new Vector<float>(DataB, i)).CopyTo(Result, i);
            }

            for (; i < N; i++)
            {
                Result[i] = DataA[i] + DataB[i];
            }

            return Result;
        }

    }
}

@msedi
Copy link
Author

msedi commented Sep 14, 2020

I just wanted to point another benchmark on another system. I just wanted to point that there was a long way from coming from the "old" without-Span<T> world, where we did everything either in regular C#, going to unsafe code with pointer and then using Span<T> and finally using Vector<T> - Honestly thanks for the improvement in speed . The last method was a suggestion from @benaadams (thanks for that). One interesting finding is though that the AddVectorizedSpanWithRange is definitely slower on this example computer than on the first.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.1082 (1909/November2018Update/19H2)
Intel Xeon CPU E5-2670 0 2.60GHz, 2 CPU, 32 logical and 16 physical cores
.NET Core SDK=5.0.100-preview.6.20318.15
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.30506, CoreFX 5.0.20.30506), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.30506, CoreFX 5.0.20.30506), X64 RyuJIT


|                     Method |     Mean |     Error |    StdDev |
|--------------------------- |---------:|----------:|----------:|
| AddVectorizedSpanWithSlice | 2.953 ms | 0.0143 ms | 0.0111 ms |
| AddVectorizedSpanWithRange | 4.143 ms | 0.0163 ms | 0.0128 ms |
|         AddVectorizedArray | 4.043 ms | 0.0166 ms | 0.0139 ms |
|            AddArrayClassic | 8.396 ms | 0.0050 ms | 0.0045 ms |
|             AddArrayUnsafe | 4.180 ms | 0.0036 ms | 0.0030 ms |
|   AddVectorizedArrayUnsafe | 2.512 ms | 0.0149 ms | 0.0132 ms |

Just for reference, here's is the code:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

using System;
using System.Numerics;
using System.Runtime.CompilerServices;

namespace ConsoleApp8
{
    class Program
    {
        static void Main(string[] args)
        {
            BenchmarkRunner.Run<VectorizedBenchmark>();
        }
    }

    public class VectorizedBenchmark
    {
        int N = 2048 * 2048;
        float[] DataA;
        float[] DataB;
        float[] Result;

        [GlobalSetup]
        public void GlobalSetup()
        {
            DataA = new float[N];
            DataB = new float[N];
            Result = new float[N];
            for (int i = 0; i < N; i++)
            {
                DataA[i] = i;
                DataB[i] = i;
            }
        }

        [Benchmark]
        public float[] AddVectorizedSpanWithSlice()
        {
            int StepSize = Vector<float>.Count;
            int L = N - (N % StepSize);
            int i = 0;

            ReadOnlySpan<float> A = DataA;
            ReadOnlySpan<float> B = DataA;
            Span<float> C = Result;

            for (; i < L; i += StepSize)
            {
                (new Vector<float>(A.Slice(i)) + new Vector<float>(B.Slice(i))).CopyTo(C.Slice(i));
            }

            for (; i < N; i++)
            {
                Result[i] = DataA[i] + DataB[i];
            }

            return Result;
        }

        [Benchmark]
        public float[] AddVectorizedSpanWithRange()
        {
            int StepSize = Vector<float>.Count;
            int L = N - (N % StepSize);
            int i = 0;

            ReadOnlySpan<float> A = DataA;
            ReadOnlySpan<float> B = DataA;
            Span<float> C = Result;

            for (; i < L; i += StepSize)
            {
                (new Vector<float>(A[i..]) + new Vector<float>(B[i..])).CopyTo(C[i..]);
            }

            for (; i < N; i++)
            {
                Result[i] = DataA[i] + DataB[i];
            }

            return Result;
        }

        [Benchmark]
        public float[] AddVectorizedArray()
        {
            int StepSize = Vector<float>.Count;
            int L = N - (N % StepSize);
            int i = 0;

            for (; i < L; i += StepSize)
            {
                (new Vector<float>(DataA, i) + new Vector<float>(DataB, i)).CopyTo(Result, i);
            }

            for (; i < N; i++)
            {
                Result[i] = DataA[i] + DataB[i];
            }

            return Result;
        }

        [Benchmark]
        public float[] AddArrayClassic()
        {
            for (int i = 0; i < N; i++)
            {
                Result[i] = DataA[i] + DataB[i];
            }

            return Result;
        }

        [Benchmark]
        public float[] AddArrayUnsafe()
        {
            unsafe
            {
                fixed (float* dataA = DataA)
                fixed (float* dataB = DataB)
                fixed (float* result = Result)
                {
                    float* dataAptr = dataA, dataBptr = dataB, resultptr = result;
                    for (int i = 0; i < N; i++, dataAptr++, dataBptr++, resultptr++)
                    {
                        *resultptr = *dataAptr + *dataBptr;
                    }
                }
            }

            return Result;
        }

        [Benchmark]
        public float[] AddVectorizedArrayUnsafe()
        {
            int StepSize = Vector<float>.Count;
            int L = N - (N % StepSize);
            int i = 0;

            ref float A = ref DataA[0];
            ref float B = ref DataB[0];
            ref float C = ref Result[0];

            for (; i < L; i += StepSize)
            {
                var vectorA = Unsafe.ReadUnaligned<Vector<float>>(ref Unsafe.As<float, byte>(ref Unsafe.Add(ref A, i)));
                var vectorB = Unsafe.ReadUnaligned<Vector<float>>(ref Unsafe.As<float, byte>(ref Unsafe.Add(ref A, i)));
                var vectorC = vectorA + vectorB;

                Unsafe.WriteUnaligned<Vector<float>>(ref Unsafe.As<float, byte>(ref Unsafe.Add(ref C, i)), vectorC);
            }

            for (; i < N; i++)
            {
                Result[i] = DataA[i] + DataB[i];
            }
            return Result;
        }
    }
}

@sharwell sharwell changed the title IDE0057: Slice can be simplified does not produce the same code Slice can be simplified does not produce the same code Oct 14, 2020
@sharwell sharwell added Code Gen Quality Room for improvement in the quality of the compiler's generated code New Language Feature - Range Range labels Oct 14, 2020
@jcouv jcouv added this to the Compiler.Next milestone Mar 17, 2022
@agocke agocke added the help wanted The issue is "up for grabs" - add a comment if you are interested in working on it label Jun 9, 2023
@jaredpar jaredpar modified the milestones: Compiler.Next, Backlog Sep 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Compilers Code Gen Quality Room for improvement in the quality of the compiler's generated code help wanted The issue is "up for grabs" - add a comment if you are interested in working on it New Language Feature - Range Range
Projects
None yet
Development

No branches or pull requests

9 participants