Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up for … in xs -> … in computed collections #16948

Merged
merged 9 commits into from Apr 8, 2024

Conversation

brianrourkeboll
Copy link
Contributor

@brianrourkeboll brianrourkeboll commented Mar 25, 2024

Description

Semi-related followup to #16650 and #16832.

  • Use Array.map for [|for … in xs -> …|] when xs is an array. We see speedups up to 10×, and allocate between ¼ and ⅓ as much.
  • Use List.map for [for … in xs -> …] when xs is a list. We see more moderate speedups of ~1.2×.
  • Add language feature (1d49027).

Examples

let xs : _ array = [||]
let ys = [|for x in xs -> f x|] // Now = Array.map f xs.
let xs : _ list = []
let ys = [for x in xs -> f x] // Now = List.map f xs.

Benchmarks

Arrays

Source
module MicroPerf.ComputedCollections.Arrays.For

open BenchmarkDotNet.Attributes

[<MemoryDiagnoser>]
[<CategoriesColumn>]
[<HideColumns("Method")>]
[<BenchmarkCategory("ComputedCollections", "Arrays", "For")>]
type ComputedCollections_Arrays_For () =
    let ``[|1..1_000|]`` = [|1..1_000|]
    let ``[|1..10_000|]`` = [|1..10_000|]
    let ``[|1..100_000|]`` = [|1..100_000|]
    let ``[|1..1_000_000|]`` = [|1..1_000_000|]

    [<Benchmark; BenchmarkCategory("[|1..1_000|]")>]
    member _.M1 () = [|for x in ``[|1..1_000|]`` -> x|]

    [<Benchmark; BenchmarkCategory("[|1..10_000|]")>]
    member _.M2 () = [|for x in ``[|1..10_000|]`` -> x|]

    [<Benchmark; BenchmarkCategory("[|1..100_000|]")>]
    member _.M3 () = [|for x in ``[|1..100_000|]`` -> x|]

    [<Benchmark; BenchmarkCategory("[|1..1_000_000|]")>]
    member _.M4 () = [|for x in ``[|1..1_000_000|]`` -> x|]
| Job     | Categories                                      | Mean           | Error         | StdDev        | Ratio | Gen0     | Gen1     | Gen2     | Allocated   | Alloc Ratio |
|-------- |------------------------------------------------ |---------------:|--------------:|--------------:|------:|---------:|---------:|---------:|------------:|------------:|
| Current | [|1..100_000|],ComputedCollections,Arrays,For   | 1,063,832.9 ns |   7,594.63 ns |   7,104.02 ns |  1.00 |  35.1563 |  35.1563 |  35.1563 |  1415.27 KB |        1.00 |
| Preview | [|1..100_000|],ComputedCollections,Arrays,For   |   257,712.2 ns |  14,550.04 ns |  42,901.11 ns |  0.28 |   5.3711 |   5.3711 |   5.3711 |   390.64 KB |        0.28 |
|         |                                                 |                |               |               |       |          |          |          |             |             |
| Current | [|1..10_000|],ComputedCollections,Arrays,For    |    42,213.9 ns |     209.39 ns |     163.48 ns |  1.00 |   0.6714 |   0.0610 |        - |   167.44 KB |        1.00 |
| Preview | [|1..10_000|],ComputedCollections,Arrays,For    |     4,076.9 ns |      37.29 ns |      34.88 ns |  0.10 |   0.1602 |        - |        - |    39.09 KB |        0.23 |
|         |                                                 |                |               |               |       |          |          |          |             |             |
| Current | [|1..1_000_000|],ComputedCollections,Arrays,For | 7,616,795.5 ns | 139,302.01 ns | 123,487.61 ns |  1.00 | 156.2500 | 156.2500 | 156.2500 | 12098.81 KB |        1.00 |
| Preview | [|1..1_000_000|],ComputedCollections,Arrays,For | 1,898,152.2 ns |  37,795.17 ns |  49,144.40 ns |  0.25 |  56.6406 |  56.6406 |  56.6406 |  3906.29 KB |        0.32 |
|         |                                                 |                |               |               |       |          |          |          |             |             |
| Current | [|1..1_000|],ComputedCollections,Arrays,For     |     4,104.8 ns |      24.64 ns |      20.58 ns |  1.00 |   0.0458 |        - |        - |    12.19 KB |        1.00 |
| Preview | [|1..1_000|],ComputedCollections,Arrays,For     |       619.2 ns |       4.87 ns |       4.56 ns |  0.15 |   0.0162 |        - |        - |     3.93 KB |        0.32 |

Lists

Source
module MicroPerf.ComputedCollections.Lists.For

open BenchmarkDotNet.Attributes

[<MemoryDiagnoser>]
[<CategoriesColumn>]
[<HideColumns("Method")>]
[<BenchmarkCategory("ComputedCollections", "Lists", "For")>]
type ComputedCollections_Lists_For () =
    let ``[1..1_000]`` = [1..1_000]
    let ``[1..10_000]`` = [1..10_000]
    let ``[1..100_000]`` = [1..100_000]
    let ``[1..1_000_000]`` = [1..1_000_000]

    [<Benchmark; BenchmarkCategory("[1..1_000]")>]
    member _.M1 () = [for x in ``[1..1_000]`` -> x]

    [<Benchmark; BenchmarkCategory("[1..10_000]")>]
    member _.M2 () = [for x in ``[1..10_000]`` -> x]

    [<Benchmark; BenchmarkCategory("[1..100_000]")>]
    member _.M3 () = [for x in ``[1..100_000]`` -> x]

    [<Benchmark; BenchmarkCategory("[1..1_000_000]")>]
    member _.M4 () = [for x in ``[1..1_000_000]`` -> x]
| Job     | Categories                                   | Mean         | Error       | StdDev      | Ratio | RatioSD | Gen0    | Gen1    | Allocated   | Alloc Ratio |
|-------- |--------------------------------------------- |-------------:|------------:|------------:|------:|--------:|--------:|--------:|------------:|------------:|
| Current | [1..100_000],ComputedCollections,Lists,For   |   715.781 us |   9.2431 us |   8.6460 us |  1.00 |    0.00 | 12.6953 |  5.8594 |  3125.04 KB |        1.00 |
| Preview | [1..100_000],ComputedCollections,Lists,For   |   595.263 us |   3.6642 us |   3.2482 us |  0.83 |    0.01 | 12.6953 |  5.8594 |     3125 KB |        1.00 |
|         |                                              |              |             |             |       |         |         |         |             |             |
| Current | [1..10_000],ComputedCollections,Lists,For    |    60.064 us |   0.4032 us |   0.3772 us |  1.00 |    0.00 |  1.2817 |  0.3052 |   312.54 KB |        1.00 |
| Preview | [1..10_000],ComputedCollections,Lists,For    |    51.431 us |   0.3833 us |   0.3585 us |  0.86 |    0.01 |  1.2817 |  0.3052 |    312.5 KB |        1.00 |
|         |                                              |              |             |             |       |         |         |         |             |             |
| Current | [1..1_000],ComputedCollections,Lists,For     |     5.825 us |   0.0588 us |   0.0550 us |  1.00 |    0.00 |  0.1297 |       - |    31.29 KB |        1.00 |
| Preview | [1..1_000],ComputedCollections,Lists,For     |     4.838 us |   0.0420 us |   0.0393 us |  0.83 |    0.01 |  0.1297 |       - |    31.25 KB |        1.00 |
|         |                                              |              |             |             |       |         |         |         |             |             |
| Current | [1..1_000_000],ComputedCollections,Lists,For | 8,292.331 us | 146.5085 us | 190.5025 us |  1.00 |    0.00 | 62.5000 | 46.8750 | 31250.04 KB |        1.00 |
| Preview | [1..1_000_000],ComputedCollections,Lists,For | 7,595.720 us | 146.7525 us | 299.7764 us |  0.92 |    0.03 | 46.8750 | 31.2500 |    31250 KB |        1.00 |

Questions

  • This doesn't actually have anything to do with integral ranges. Should it have its own language feature flag?
    • Gave it its own language feature in 1d49027.

Notes/followups

  • This literally just calls Array.map or List.map, which results in the emission and invocation of an extra closure type. Since this optimization is being called at the IlxGen stage, the closure won't be inlined, since we're already past optimization. It would probably be even faster in most cases if we just emitted while-loops instead, at the expense of adding more code to the compiler.
  • This does not take a comprehensive approach to improving iteration of all collections used with for in list or array comprehensions. That could be done — use a fast integer iteration for arrays even when inside of a list comprehension; use the fast list iteration for lists even when inside of an array comprehension; use a fast integer iteration for ResizeArray<_>, or IList, or ICollection; etc.

Checklist

  • Test cases added.
  • Performance benchmarks added in case of performance changes.
  • Release notes entry updated.

* Use `Array.map` for [|for … in xs -> …|]` when `xs` is an array.

* Use `List.map` for [for … in xs -> …]` when `xs` is a list.
Copy link
Contributor

github-actions bot commented Mar 25, 2024

❗ Release notes required


✅ Found changes and release notes in following paths:

Change path Release notes path Description
src/Compiler docs/release-notes/.FSharp.Compiler.Service/8.0.300.md
LanguageFeatures.fsi docs/release-notes/.Language/preview.md

@brianrourkeboll brianrourkeboll marked this pull request as ready for review March 26, 2024 17:47
@brianrourkeboll brianrourkeboll requested a review from a team as a code owner March 26, 2024 17:47
@T-Gro
Copy link
Member

T-Gro commented Mar 26, 2024

The gains for arrays achieved by precise allocation (as part of Array.map) are very nice, this is a good contribution.

What I am fearing is that the lack of inlining might give us a degradation, e.g. for small arrays which previously relied on mapping being inlined - for inlining-crucial statically resolved Fsharp.Core operations such as equality.
https://sharplab.io/#v2:EYLgtghglgdgNAGxAMwM4B8wHsAmBXBAUwAIBZATwBVDUAXVAWAChmjbiA3CBPQyrAMJYwABwgAnEgF5iAJlaF2sEXnYyA2ugCMAblnoAugvaTUBNcU3Is44gA9isRzBXsAtAD5Hye8QA8Xlw8fILCYpLEtAAWhDC+hAioJHYAVLLEhkA===

For that to happen, I guess this would need to move to TcArrayOrListComputedExpression in CheckComputationExpressions, and not be part of Optimizer ;; even though this is semantically an optimization.

Or is there another way to achieve inlining of the mapping?

@brianrourkeboll
Copy link
Contributor Author

brianrourkeboll commented Mar 26, 2024

@T-Gro

What I am fearing is that the lack of inlining might give us a degradation, e.g. for small arrays which previously relied on mapping being inlined - for inlining-crucial statically resolved Fsharp.Core operations such as equality.

That's a good callout — my guess, without looking, is that this isn't a problem, since this transformation is happening after the main optimization phases, so I think that the lambda representing the mapping should already have any specializations in place (if that were going to happen).

Or is there another way to achieve inlining of the mapping?

If my guess above is wrong, then yes, we could simply emit while-loops here directly (i.e., by hand), with the body inlined. But again, since LowerComputedListOrArrayExpr is called in IlxGen.fs, I think any specialization of the body that was going to happen would need to have already happened anyway.

I went ahead and threw your example code into the emitted IL tests, and this is what I see:

Full IL
.assembly extern runtime { }
.assembly extern FSharp.Core { }
.assembly assembly
{
  .custom instance void [FSharp.Core]Microsoft.FSharp.Core.FSharpInterfaceDataVersionAttribute::.ctor(int32,
                                                                                                      int32,
                                                                                                      int32) = ( 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 ) 

  
  

  .hash algorithm 0x00008004
  .ver 0:0:0:0
}
.mresource public FSharpSignatureData.assembly
{
  
  
}
.mresource public FSharpOptimizationData.assembly
{
  
  
}
.module assembly.exe

.imagebase {value}
.file alignment 0x00000200
.stackreserve 0x00100000
.subsystem 0x0003       
.corflags 0x00000001    





.class public abstract auto ansi sealed assembly
       extends [runtime]System.Object
{
  .custom instance void [FSharp.Core]Microsoft.FSharp.Core.CompilationMappingAttribute::.ctor(valuetype [FSharp.Core]Microsoft.FSharp.Core.SourceConstructFlags) = ( 01 00 07 00 00 00 00 00 ) 
  .class auto ansi serializable sealed nested assembly beforefieldinit result@11
         extends class [FSharp.Core]Microsoft.FSharp.Core.FSharpFunc`2<int32,int32>
  {
    .field static assembly initonly class assembly/result@11 @_instance
    .method assembly specialname rtspecialname instance void  .ctor() cil managed
    {
      .custom instance void [runtime]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = ( 01 00 00 00 ) 
      .custom instance void [runtime]System.Diagnostics.DebuggerNonUserCodeAttribute::.ctor() = ( 01 00 00 00 ) 
      
      .maxstack  8
      IL_0000:  ldarg.0
      IL_0001:  call       instance void class [FSharp.Core]Microsoft.FSharp.Core.FSharpFunc`2<int32,int32>::.ctor()
      IL_0006:  ret
    } 

    .method public strict virtual instance int32 Invoke(int32 x) cil managed
    {
      
      .maxstack  8
      IL_0000:  nop
      IL_0001:  ldarg.1
      IL_0002:  ldc.i4.2
      IL_0003:  beq.s      IL_0007

      IL_0005:  ldarg.1
      IL_0006:  ret

      IL_0007:  ldarg.1
      IL_0008:  ldc.i4.2
      IL_0009:  mul
      IL_000a:  ret
    } 

    .method private specialname rtspecialname static void  .cctor() cil managed
    {
      
      .maxstack  10
      IL_0000:  newobj     instance void assembly/result@11::.ctor()
      IL_0005:  stsfld     class assembly/result@11 assembly/result@11::@_instance
      IL_000a:  ret
    } 

  } 

  .field static assembly int32[] input@10
  .custom instance void [runtime]System.Diagnostics.DebuggerBrowsableAttribute::.ctor(valuetype [runtime]System.Diagnostics.DebuggerBrowsableState) = ( 01 00 00 00 00 00 00 00 ) 
  .field static assembly int32[] result@11
  .custom instance void [runtime]System.Diagnostics.DebuggerBrowsableAttribute::.ctor(valuetype [runtime]System.Diagnostics.DebuggerBrowsableState) = ( 01 00 00 00 00 00 00 00 ) 
  .method public specialname static int32 get_valueToCompare() cil managed
  {
    .custom instance void [runtime]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = ( 01 00 00 00 ) 
    .custom instance void [runtime]System.Diagnostics.DebuggerNonUserCodeAttribute::.ctor() = ( 01 00 00 00 ) 
    
    .maxstack  8
    IL_0000:  ldc.i4.2
    IL_0001:  ret
  } 

  .method public specialname static int32[] get_input() cil managed
  {
    
    .maxstack  8
    IL_0000:  ldsfld     int32[] assembly::input@10
    IL_0005:  ret
  } 

  .method public specialname static int32[] get_result() cil managed
  {
    
    .maxstack  8
    IL_0000:  ldsfld     int32[] assembly::result@11
    IL_0005:  ret
  } 

  .method private specialname rtspecialname static void  .cctor() cil managed
  {
    
    .maxstack  8
    IL_0000:  ldc.i4.0
    IL_0001:  stsfld     int32 '<StartupCode$assembly>'.$assembly::init@
    IL_0006:  ldsfld     int32 '<StartupCode$assembly>'.$assembly::init@
    IL_000b:  pop
    IL_000c:  ret
  } 

  .method assembly specialname static void staticInitialization@() cil managed
  {
    
    .maxstack  8
    IL_0000:  nop
    IL_0001:  ldc.i4.2
    IL_0002:  newarr     [runtime]System.Int32
    IL_0007:  dup
    IL_0008:  ldc.i4.0
    IL_0009:  ldc.i4.1
    IL_000a:  stelem     [runtime]System.Int32
    IL_000f:  dup
    IL_0010:  ldc.i4.1
    IL_0011:  ldc.i4.2
    IL_0012:  stelem     [runtime]System.Int32
    IL_0017:  stsfld     int32[] assembly::input@10
    IL_001c:  ldsfld     class assembly/result@11 assembly/result@11::@_instance
    IL_0021:  call       int32[] assembly::get_input()
    IL_0026:  call       !!1[] [FSharp.Core]Microsoft.FSharp.Collections.ArrayModule::Map<int32,int32>(class [FSharp.Core]Microsoft.FSharp.Core.FSharpFunc`2<!!0,!!1>,
                                                                                                       !!0[])
    IL_002b:  stsfld     int32[] assembly::result@11
    IL_0030:  ret
  } 

  .property int32 valueToCompare()
  {
    .get int32 assembly::get_valueToCompare()
  } 
  .property int32[] input()
  {
    .custom instance void [FSharp.Core]Microsoft.FSharp.Core.CompilationMappingAttribute::.ctor(valuetype [FSharp.Core]Microsoft.FSharp.Core.SourceConstructFlags) = ( 01 00 09 00 00 00 00 00 ) 
    .get int32[] assembly::get_input()
  } 
  .property int32[] result()
  {
    .custom instance void [FSharp.Core]Microsoft.FSharp.Core.CompilationMappingAttribute::.ctor(valuetype [FSharp.Core]Microsoft.FSharp.Core.SourceConstructFlags) = ( 01 00 09 00 00 00 00 00 ) 
    .get int32[] assembly::get_result()
  } 
} 

.class private abstract auto ansi sealed '<StartupCode$assembly>'.$assembly
       extends [runtime]System.Object
{
  .field static assembly int32 init@
  .custom instance void [runtime]System.Diagnostics.DebuggerBrowsableAttribute::.ctor(valuetype [runtime]System.Diagnostics.DebuggerBrowsableState) = ( 01 00 00 00 00 00 00 00 ) 
  .custom instance void [runtime]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = ( 01 00 00 00 ) 
  .custom instance void [runtime]System.Diagnostics.DebuggerNonUserCodeAttribute::.ctor() = ( 01 00 00 00 ) 
  .method public static void  main@() cil managed
  {
    .entrypoint
    
    .maxstack  8
    IL_0000:  call       void assembly::staticInitialization@()
    IL_0005:  ret
  } 

} 

The specialized equality check (beq.s) is preserved in the Invoke method of the generated closure, at least in this case:

.method public strict virtual instance int32 Invoke(int32 x) cil managed
{
  
  .maxstack  8
  IL_0000:  nop
  IL_0001:  ldarg.1
  IL_0002:  ldc.i4.2
  IL_0003:  beq.s      IL_0007

  IL_0005:  ldarg.1
  IL_0006:  ret

  IL_0007:  ldarg.1
  IL_0008:  ldc.i4.2
  IL_0009:  mul
  IL_000a:  ret
} 

If I take your example code and change it to make it generic at the module boundary, then the compiler already will not specialize the equality at compile-time anyway. Basically, I think that the lack of inlining of the closure here will not affect this particular thing.

If you want me to run more benchmarks of various scenarios, though, or to add more emitted IL tests, or to look into ways of getting rid of the closure, let me know.

Copy link
Member

@vzarytovskii vzarytovskii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@vzarytovskii vzarytovskii enabled auto-merge (squash) April 5, 2024 12:42
@brianrourkeboll
Copy link
Contributor Author

There is some more work that could be done to enable this optimization in more scenarios, e.g., when the for … in xs -> … expression is the last expression in the comprehension, but one or more let-bindings or sequential expressions precede it.

For example, I don't love that this gets optimized —

let y = f ()
[
    for x in xs -> x + y
]

— but this does not —

[
    let y = f ()
    for x in xs -> x + y
]

I started solving this, but I have not had a chance to finish it yet. I think that this PR is probably worthwhile on its own — in fact, I believe that the same phenomenon applies to the optimization introduced in #16832 anyway — so I can just open a new PR to address this issue once I have time.

@vzarytovskii vzarytovskii merged commit 468813d into dotnet:main Apr 8, 2024
32 checks passed
@psfinaki
Copy link
Member

psfinaki commented Apr 8, 2024

Good stuff Brian. This can go as is, follow ups are appreciated :)

As for language flags, I think similar to Edgar's approach with attributes, you can merge these flags into one umbrella flag given you come up with a good name. But this is optional.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

4 participants