Speed up `for … in xs -> …` in computed collections #16948

brianrourkeboll · 2024-03-25T23:30:03Z

Description

Semi-related followup to #16650 and #16832.

Use Array.map for [|for … in xs -> …|] when xs is an array. We see speedups up to 10×, and allocate between ¼ and ⅓ as much.
Use List.map for [for … in xs -> …] when xs is a list. We see more moderate speedups of ~1.2×.
Add language feature (1d49027).

Examples

let xs : _ array = [|…|]
let ys = [|for x in xs -> f x|] // Now = Array.map f xs.

let xs : _ list = […]
let ys = [for x in xs -> f x] // Now = List.map f xs.

Benchmarks

Arrays

Source

module MicroPerf.ComputedCollections.Arrays.For

open BenchmarkDotNet.Attributes

[<MemoryDiagnoser>]
[<CategoriesColumn>]
[<HideColumns("Method")>]
[<BenchmarkCategory("ComputedCollections", "Arrays", "For")>]
type ComputedCollections_Arrays_For () =
    let ``[|1..1_000|]`` = [|1..1_000|]
    let ``[|1..10_000|]`` = [|1..10_000|]
    let ``[|1..100_000|]`` = [|1..100_000|]
    let ``[|1..1_000_000|]`` = [|1..1_000_000|]

    [<Benchmark; BenchmarkCategory("[|1..1_000|]")>]
    member _.M1 () = [|for x in ``[|1..1_000|]`` -> x|]

    [<Benchmark; BenchmarkCategory("[|1..10_000|]")>]
    member _.M2 () = [|for x in ``[|1..10_000|]`` -> x|]

    [<Benchmark; BenchmarkCategory("[|1..100_000|]")>]
    member _.M3 () = [|for x in ``[|1..100_000|]`` -> x|]

    [<Benchmark; BenchmarkCategory("[|1..1_000_000|]")>]
    member _.M4 () = [|for x in ``[|1..1_000_000|]`` -> x|]

| Job     | Categories                                      | Mean           | Error         | StdDev        | Ratio | Gen0     | Gen1     | Gen2     | Allocated   | Alloc Ratio |
|-------- |------------------------------------------------ |---------------:|--------------:|--------------:|------:|---------:|---------:|---------:|------------:|------------:|
| Current | [|1..100_000|],ComputedCollections,Arrays,For   | 1,063,832.9 ns |   7,594.63 ns |   7,104.02 ns |  1.00 |  35.1563 |  35.1563 |  35.1563 |  1415.27 KB |        1.00 |
| Preview | [|1..100_000|],ComputedCollections,Arrays,For   |   257,712.2 ns |  14,550.04 ns |  42,901.11 ns |  0.28 |   5.3711 |   5.3711 |   5.3711 |   390.64 KB |        0.28 |
|         |                                                 |                |               |               |       |          |          |          |             |             |
| Current | [|1..10_000|],ComputedCollections,Arrays,For    |    42,213.9 ns |     209.39 ns |     163.48 ns |  1.00 |   0.6714 |   0.0610 |        - |   167.44 KB |        1.00 |
| Preview | [|1..10_000|],ComputedCollections,Arrays,For    |     4,076.9 ns |      37.29 ns |      34.88 ns |  0.10 |   0.1602 |        - |        - |    39.09 KB |        0.23 |
|         |                                                 |                |               |               |       |          |          |          |             |             |
| Current | [|1..1_000_000|],ComputedCollections,Arrays,For | 7,616,795.5 ns | 139,302.01 ns | 123,487.61 ns |  1.00 | 156.2500 | 156.2500 | 156.2500 | 12098.81 KB |        1.00 |
| Preview | [|1..1_000_000|],ComputedCollections,Arrays,For | 1,898,152.2 ns |  37,795.17 ns |  49,144.40 ns |  0.25 |  56.6406 |  56.6406 |  56.6406 |  3906.29 KB |        0.32 |
|         |                                                 |                |               |               |       |          |          |          |             |             |
| Current | [|1..1_000|],ComputedCollections,Arrays,For     |     4,104.8 ns |      24.64 ns |      20.58 ns |  1.00 |   0.0458 |        - |        - |    12.19 KB |        1.00 |
| Preview | [|1..1_000|],ComputedCollections,Arrays,For     |       619.2 ns |       4.87 ns |       4.56 ns |  0.15 |   0.0162 |        - |        - |     3.93 KB |        0.32 |

Lists

Source

module MicroPerf.ComputedCollections.Lists.For

open BenchmarkDotNet.Attributes

[<MemoryDiagnoser>]
[<CategoriesColumn>]
[<HideColumns("Method")>]
[<BenchmarkCategory("ComputedCollections", "Lists", "For")>]
type ComputedCollections_Lists_For () =
    let ``[1..1_000]`` = [1..1_000]
    let ``[1..10_000]`` = [1..10_000]
    let ``[1..100_000]`` = [1..100_000]
    let ``[1..1_000_000]`` = [1..1_000_000]

    [<Benchmark; BenchmarkCategory("[1..1_000]")>]
    member _.M1 () = [for x in ``[1..1_000]`` -> x]

    [<Benchmark; BenchmarkCategory("[1..10_000]")>]
    member _.M2 () = [for x in ``[1..10_000]`` -> x]

    [<Benchmark; BenchmarkCategory("[1..100_000]")>]
    member _.M3 () = [for x in ``[1..100_000]`` -> x]

    [<Benchmark; BenchmarkCategory("[1..1_000_000]")>]
    member _.M4 () = [for x in ``[1..1_000_000]`` -> x]

| Job     | Categories                                   | Mean         | Error       | StdDev      | Ratio | RatioSD | Gen0    | Gen1    | Allocated   | Alloc Ratio |
|-------- |--------------------------------------------- |-------------:|------------:|------------:|------:|--------:|--------:|--------:|------------:|------------:|
| Current | [1..100_000],ComputedCollections,Lists,For   |   715.781 us |   9.2431 us |   8.6460 us |  1.00 |    0.00 | 12.6953 |  5.8594 |  3125.04 KB |        1.00 |
| Preview | [1..100_000],ComputedCollections,Lists,For   |   595.263 us |   3.6642 us |   3.2482 us |  0.83 |    0.01 | 12.6953 |  5.8594 |     3125 KB |        1.00 |
|         |                                              |              |             |             |       |         |         |         |             |             |
| Current | [1..10_000],ComputedCollections,Lists,For    |    60.064 us |   0.4032 us |   0.3772 us |  1.00 |    0.00 |  1.2817 |  0.3052 |   312.54 KB |        1.00 |
| Preview | [1..10_000],ComputedCollections,Lists,For    |    51.431 us |   0.3833 us |   0.3585 us |  0.86 |    0.01 |  1.2817 |  0.3052 |    312.5 KB |        1.00 |
|         |                                              |              |             |             |       |         |         |         |             |             |
| Current | [1..1_000],ComputedCollections,Lists,For     |     5.825 us |   0.0588 us |   0.0550 us |  1.00 |    0.00 |  0.1297 |       - |    31.29 KB |        1.00 |
| Preview | [1..1_000],ComputedCollections,Lists,For     |     4.838 us |   0.0420 us |   0.0393 us |  0.83 |    0.01 |  0.1297 |       - |    31.25 KB |        1.00 |
|         |                                              |              |             |             |       |         |         |         |             |             |
| Current | [1..1_000_000],ComputedCollections,Lists,For | 8,292.331 us | 146.5085 us | 190.5025 us |  1.00 |    0.00 | 62.5000 | 46.8750 | 31250.04 KB |        1.00 |
| Preview | [1..1_000_000],ComputedCollections,Lists,For | 7,595.720 us | 146.7525 us | 299.7764 us |  0.92 |    0.03 | 46.8750 | 31.2500 |    31250 KB |        1.00 |

Questions

This doesn't actually have anything to do with integral ranges. Should it have its own language feature flag?
- Gave it its own language feature in 1d49027.

Notes/followups

This literally just calls Array.map or List.map, which results in the emission and invocation of an extra closure type. Since this optimization is being called at the IlxGen stage, the closure won't be inlined, since we're already past optimization. It would probably be even faster in most cases if we just emitted while-loops instead, at the expense of adding more code to the compiler.
This does not take a comprehensive approach to improving iteration of all collections used with for in list or array comprehensions. That could be done — use a fast integer iteration for arrays even when inside of a list comprehension; use the fast list iteration for lists even when inside of an array comprehension; use a fast integer iteration for ResizeArray<_>, or IList, or ICollection; etc.

Checklist

Test cases added.
Performance benchmarks added in case of performance changes.
Release notes entry updated.

* Use `Array.map` for [|for … in xs -> …|]` when `xs` is an array. * Use `List.map` for [for … in xs -> …]` when `xs` is a list.

github-actions · 2024-03-25T23:30:49Z

❗ Release notes required

✅ Found changes and release notes in following paths:

Change path Release notes path Description

src/Compiler docs/release-notes/.FSharp.Compiler.Service/8.0.300.md

LanguageFeatures.fsi docs/release-notes/.Language/preview.md

…-collections-for

T-Gro · 2024-03-26T19:22:00Z

The gains for arrays achieved by precise allocation (as part of Array.map) are very nice, this is a good contribution.

What I am fearing is that the lack of inlining might give us a degradation, e.g. for small arrays which previously relied on mapping being inlined - for inlining-crucial statically resolved Fsharp.Core operations such as equality.
https://sharplab.io/#v2:EYLgtghglgdgNAGxAMwM4B8wHsAmBXBAUwAIBZATwBVDUAXVAWAChmjbiA3CBPQyrAMJYwABwgAnEgF5iAJlaF2sEXnYyA2ugCMAblnoAugvaTUBNcU3Is44gA9isRzBXsAtAD5Hye8QA8Xlw8fILCYpLEtAAWhDC+hAioJHYAVLLEhkA===

For that to happen, I guess this would need to move to TcArrayOrListComputedExpression in CheckComputationExpressions, and not be part of Optimizer ;; even though this is semantically an optimization.

Or is there another way to achieve inlining of the mapping?

brianrourkeboll · 2024-03-26T20:45:37Z

@T-Gro

What I am fearing is that the lack of inlining might give us a degradation, e.g. for small arrays which previously relied on mapping being inlined - for inlining-crucial statically resolved Fsharp.Core operations such as equality.

That's a good callout — my guess, without looking, is that this isn't a problem, since this transformation is happening after the main optimization phases, so I think that the lambda representing the mapping should already have any specializations in place (if that were going to happen).

Or is there another way to achieve inlining of the mapping?

If my guess above is wrong, then yes, we could simply emit while-loops here directly (i.e., by hand), with the body inlined. But again, since LowerComputedListOrArrayExpr is called in IlxGen.fs, I think any specialization of the body that was going to happen would need to have already happened anyway.

I went ahead and threw your example code into the emitted IL tests, and this is what I see:

Full IL

.assembly extern runtime { }
.assembly extern FSharp.Core { }
.assembly assembly
{
  .custom instance void [FSharp.Core]Microsoft.FSharp.Core.FSharpInterfaceDataVersionAttribute::.ctor(int32,
                                                                                                      int32,
                                                                                                      int32) = ( 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 ) 

  
  

  .hash algorithm 0x00008004
  .ver 0:0:0:0
}
.mresource public FSharpSignatureData.assembly
{
  
  
}
.mresource public FSharpOptimizationData.assembly
{
  
  
}
.module assembly.exe

.imagebase {value}
.file alignment 0x00000200
.stackreserve 0x00100000
.subsystem 0x0003       
.corflags 0x00000001    





.class public abstract auto ansi sealed assembly
       extends [runtime]System.Object
{
  .custom instance void [FSharp.Core]Microsoft.FSharp.Core.CompilationMappingAttribute::.ctor(valuetype [FSharp.Core]Microsoft.FSharp.Core.SourceConstructFlags) = ( 01 00 07 00 00 00 00 00 ) 
  .class auto ansi serializable sealed nested assembly beforefieldinit result@11
         extends class [FSharp.Core]Microsoft.FSharp.Core.FSharpFunc`2<int32,int32>
  {
    .field static assembly initonly class assembly/result@11 @_instance
    .method assembly specialname rtspecialname instance void  .ctor() cil managed
    {
      .custom instance void [runtime]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = ( 01 00 00 00 ) 
      .custom instance void [runtime]System.Diagnostics.DebuggerNonUserCodeAttribute::.ctor() = ( 01 00 00 00 ) 
      
      .maxstack  8
      IL_0000:  ldarg.0
      IL_0001:  call       instance void class [FSharp.Core]Microsoft.FSharp.Core.FSharpFunc`2<int32,int32>::.ctor()
      IL_0006:  ret
    } 

    .method public strict virtual instance int32 Invoke(int32 x) cil managed
    {
      
      .maxstack  8
      IL_0000:  nop
      IL_0001:  ldarg.1
      IL_0002:  ldc.i4.2
      IL_0003:  beq.s      IL_0007

      IL_0005:  ldarg.1
      IL_0006:  ret

      IL_0007:  ldarg.1
      IL_0008:  ldc.i4.2
      IL_0009:  mul
      IL_000a:  ret
    } 

    .method private specialname rtspecialname static void  .cctor() cil managed
    {
      
      .maxstack  10
      IL_0000:  newobj     instance void assembly/result@11::.ctor()
      IL_0005:  stsfld     class assembly/result@11 assembly/result@11::@_instance
      IL_000a:  ret
    } 

  } 

  .field static assembly int32[] input@10
  .custom instance void [runtime]System.Diagnostics.DebuggerBrowsableAttribute::.ctor(valuetype [runtime]System.Diagnostics.DebuggerBrowsableState) = ( 01 00 00 00 00 00 00 00 ) 
  .field static assembly int32[] result@11
  .custom instance void [runtime]System.Diagnostics.DebuggerBrowsableAttribute::.ctor(valuetype [runtime]System.Diagnostics.DebuggerBrowsableState) = ( 01 00 00 00 00 00 00 00 ) 
  .method public specialname static int32 get_valueToCompare() cil managed
  {
    .custom instance void [runtime]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = ( 01 00 00 00 ) 
    .custom instance void [runtime]System.Diagnostics.DebuggerNonUserCodeAttribute::.ctor() = ( 01 00 00 00 ) 
    
    .maxstack  8
    IL_0000:  ldc.i4.2
    IL_0001:  ret
  } 

  .method public specialname static int32[] get_input() cil managed
  {
    
    .maxstack  8
    IL_0000:  ldsfld     int32[] assembly::input@10
    IL_0005:  ret
  } 

  .method public specialname static int32[] get_result() cil managed
  {
    
    .maxstack  8
    IL_0000:  ldsfld     int32[] assembly::result@11
    IL_0005:  ret
  } 

  .method private specialname rtspecialname static void  .cctor() cil managed
  {
    
    .maxstack  8
    IL_0000:  ldc.i4.0
    IL_0001:  stsfld     int32 '<StartupCode$assembly>'.$assembly::init@
    IL_0006:  ldsfld     int32 '<StartupCode$assembly>'.$assembly::init@
    IL_000b:  pop
    IL_000c:  ret
  } 

  .method assembly specialname static void staticInitialization@() cil managed
  {
    
    .maxstack  8
    IL_0000:  nop
    IL_0001:  ldc.i4.2
    IL_0002:  newarr     [runtime]System.Int32
    IL_0007:  dup
    IL_0008:  ldc.i4.0
    IL_0009:  ldc.i4.1
    IL_000a:  stelem     [runtime]System.Int32
    IL_000f:  dup
    IL_0010:  ldc.i4.1
    IL_0011:  ldc.i4.2
    IL_0012:  stelem     [runtime]System.Int32
    IL_0017:  stsfld     int32[] assembly::input@10
    IL_001c:  ldsfld     class assembly/result@11 assembly/result@11::@_instance
    IL_0021:  call       int32[] assembly::get_input()
    IL_0026:  call       !!1[] [FSharp.Core]Microsoft.FSharp.Collections.ArrayModule::Map<int32,int32>(class [FSharp.Core]Microsoft.FSharp.Core.FSharpFunc`2<!!0,!!1>,
                                                                                                       !!0[])
    IL_002b:  stsfld     int32[] assembly::result@11
    IL_0030:  ret
  } 

  .property int32 valueToCompare()
  {
    .get int32 assembly::get_valueToCompare()
  } 
  .property int32[] input()
  {
    .custom instance void [FSharp.Core]Microsoft.FSharp.Core.CompilationMappingAttribute::.ctor(valuetype [FSharp.Core]Microsoft.FSharp.Core.SourceConstructFlags) = ( 01 00 09 00 00 00 00 00 ) 
    .get int32[] assembly::get_input()
  } 
  .property int32[] result()
  {
    .custom instance void [FSharp.Core]Microsoft.FSharp.Core.CompilationMappingAttribute::.ctor(valuetype [FSharp.Core]Microsoft.FSharp.Core.SourceConstructFlags) = ( 01 00 09 00 00 00 00 00 ) 
    .get int32[] assembly::get_result()
  } 
} 

.class private abstract auto ansi sealed '<StartupCode$assembly>'.$assembly
       extends [runtime]System.Object
{
  .field static assembly int32 init@
  .custom instance void [runtime]System.Diagnostics.DebuggerBrowsableAttribute::.ctor(valuetype [runtime]System.Diagnostics.DebuggerBrowsableState) = ( 01 00 00 00 00 00 00 00 ) 
  .custom instance void [runtime]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = ( 01 00 00 00 ) 
  .custom instance void [runtime]System.Diagnostics.DebuggerNonUserCodeAttribute::.ctor() = ( 01 00 00 00 ) 
  .method public static void  main@() cil managed
  {
    .entrypoint
    
    .maxstack  8
    IL_0000:  call       void assembly::staticInitialization@()
    IL_0005:  ret
  } 

}

The specialized equality check (beq.s) is preserved in the Invoke method of the generated closure, at least in this case:

.method public strict virtual instance int32 Invoke(int32 x) cil managed
{
  
  .maxstack  8
  IL_0000:  nop
  IL_0001:  ldarg.1
  IL_0002:  ldc.i4.2
  IL_0003:  beq.s      IL_0007

  IL_0005:  ldarg.1
  IL_0006:  ret

  IL_0007:  ldarg.1
  IL_0008:  ldc.i4.2
  IL_0009:  mul
  IL_000a:  ret
}

If I take your example code and change it to make it generic at the module boundary, then the compiler already will not specialize the equality at compile-time anyway. Basically, I think that the lack of inlining of the closure here will not affect this particular thing.

If you want me to run more benchmarks of various scenarios, though, or to add more emitted IL tests, or to look into ways of getting rid of the closure, let me know.

vzarytovskii

Thanks!

brianrourkeboll · 2024-04-05T15:21:58Z

There is some more work that could be done to enable this optimization in more scenarios, e.g., when the for … in xs -> … expression is the last expression in the comprehension, but one or more let-bindings or sequential expressions precede it.

For example, I don't love that this gets optimized —

let y = f ()
[
    for x in xs -> x + y
]

— but this does not —

[
    let y = f ()
    for x in xs -> x + y
]

I started solving this, but I have not had a chance to finish it yet. I think that this PR is probably worthwhile on its own — in fact, I believe that the same phenomenon applies to the optimization introduced in #16832 anyway — so I can just open a new PR to address this issue once I have time.

psfinaki · 2024-04-08T10:31:58Z

Good stuff Brian. This can go as is, follow ups are appreciated :)

As for language flags, I think similar to Edgar's approach with attributes, you can merge these flags into one umbrella flag given you come up with a good name. But this is optional.

Speed up for … in xs -> … in computed colls

aa1fb82

* Use `Array.map` for [|for … in xs -> …|]` when `xs` is an array. * Use `List.map` for [for … in xs -> …]` when `xs` is a list.

brianrourkeboll added 6 commits March 26, 2024 11:55

Add emitted IL tests

b0c0057

Merge branch 'main' of https://github.com/dotnet/fsharp into computed…

8dcda4b

…-collections-for

Update release notes

cc18d34

Give it its own language feature

1d49027

Update release notes

e23df5a

Fantomas

d01b0a0

brianrourkeboll marked this pull request as ready for review March 26, 2024 17:47

brianrourkeboll requested a review from a team as a code owner March 26, 2024 17:47

vzarytovskii approved these changes Apr 5, 2024

View reviewed changes

Merge branch 'main' into computed-collections-for

9000b94

vzarytovskii enabled auto-merge (squash) April 5, 2024 12:42

Merge branch 'main' into computed-collections-for

bb7c642

psfinaki approved these changes Apr 8, 2024

View reviewed changes

vzarytovskii merged commit 468813d into dotnet:main Apr 8, 2024
32 checks passed

brianrourkeboll deleted the computed-collections-for branch April 8, 2024 13:02

This was referenced Apr 9, 2024

[Automated] PRs inserted in VS build main-34809.16 #17010

Closed

[Automated] PRs inserted in VS build feature.debugger.main-34815.163 #17050

Closed

brianrourkeboll mentioned this pull request Apr 18, 2024

Computed collections: optimize simple mappings with preludes #17067

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up `for … in xs -> …` in computed collections #16948

Speed up `for … in xs -> …` in computed collections #16948

brianrourkeboll commented Mar 25, 2024 •

edited

github-actions bot commented Mar 25, 2024 •

edited

T-Gro commented Mar 26, 2024

brianrourkeboll commented Mar 26, 2024 •

edited

vzarytovskii left a comment

brianrourkeboll commented Apr 5, 2024

psfinaki commented Apr 8, 2024

Speed up for … in xs -> … in computed collections #16948

Speed up for … in xs -> … in computed collections #16948

Conversation

brianrourkeboll commented Mar 25, 2024 • edited

Description

Examples

Benchmarks

Arrays

Lists

Questions

Notes/followups

Checklist

github-actions bot commented Mar 25, 2024 • edited

❗ Release notes required

T-Gro commented Mar 26, 2024

brianrourkeboll commented Mar 26, 2024 • edited

vzarytovskii left a comment

Choose a reason for hiding this comment

brianrourkeboll commented Apr 5, 2024

psfinaki commented Apr 8, 2024

Speed up `for … in xs -> …` in computed collections #16948

Speed up `for … in xs -> …` in computed collections #16948

brianrourkeboll commented Mar 25, 2024 •

edited

github-actions bot commented Mar 25, 2024 •

edited

brianrourkeboll commented Mar 26, 2024 •

edited