Faster list/array computation expressions #11592

Merged
merged 15 commits into from
Jun 2, 2021

@dsyme dsyme commented May 21, 2021

This implements RFC FS-1099 (library support for faster computed list and array expressions) and improves the performance of computed list and array expressions.

This is drawing from the lessons of RFC FS-1087 and FS-1097 that we should (somewhat obviously) have a struct collector and just generate the synchronous code. This is what the implementation does - it looks for the compiled internal form of [ ... ] and [| ... |] and transforms to synchronous code (which is the original code with yield and yield! replaced by calls to the corresponding collector method).

For historical reasons back to F# 1.0 the compiled internal form of [ ... ] and [| ... |] is like this

`Seq.toList (seq { ...Seq.append/Seq.singleton/Seq.empty/...  })`

`Seq.toArray (seq { ...Seq.append/Seq.singleton/Seq.empty/...  })`

Previously we compiled the seq { ... } into a state machine, but still called Seq.toList and Seq.toArray. This was based on thinking overly influenced by LINQ and its focus on IEnumerable for everything. However, IEnumerable is a needless inversion of control and computationally expensive, with extra allocations, MoveNext calls and so on, so much so that LINQ is routinely avoided by some teams. Instead, when ultimately producing lists and arrays, we can produce near-optimal synchronous code directly. We should always have compiled these things in this way.
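
As a rough illustration of the transformation described above, the sketch below puts a computed list expression next to a hand-written collector version. It assumes an FSharp.Core that includes `ListCollector<'T>` from RFC FS-1099. The function names `computed` and `collected` are hypothetical, and the code the compiler actually generates may differ in detail.

```fsharp
open Microsoft.FSharp.Core.CompilerServices

// Source form: a computed list expression with conditional yields.
let computed (n: int) =
    [ yield 0
      for i in 1 .. n do
          if i % 2 = 0 then
              yield i ]

// Hand-written approximation of the synchronous form: the same control flow,
// with each `yield` replaced by a call on a mutable struct collector.
let collected (n: int) =
    let mutable collector = ListCollector<int>()
    collector.Add 0
    for i in 1 .. n do
        if i % 2 = 0 then
            collector.Add i
    collector.Close()

printfn "%A" (computed 6)   // [0; 2; 4; 6]
printfn "%A" (collected 6)  // [0; 2; 4; 6]
```

The collector is declared `let mutable` because it is a struct whose methods update its own fields in place.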

Notes:

  • This is an optimization that preserves the same semantics and execution order, so should work for all existing code. It involves a small addition to FSharp.Core documented in the RFC above. The optimization kicks in if the collectors are present in the referenced FSharp.Core.

  • One particular optimization is that, for lists, a yield! of a list in tailcall position simply stitches that list into the result without copying (AddManyAndClose on ListCollector<T>). This is valid because lists are immutable - and we already do this for List.append for example. In theory this could reduce some O(n) operations to O(1) though I doubt we'll see that in practice.

  • There is also an optimization in ArrayCollector to avoid creating a ResizeArray for arrays of size 0, 1 or 2. This is obviously a good optimization based on any reasonable model of allocation costs. However it may not be optimal and we can adjust this in the future - the relevant struct fields are internal and can be changed. It would be good to further measure the stack-space/allocation/data-copying tradeoffs and decide if it's worth extending this further.

  • I went through tests\walkthroughs\DebugStepping\TheBigFileOfDebugStepping.fsx and made some improvements to debugging of list, array and sequence expressions and checked that all the sample list/sequence expressions in that file debug OK. Specifically the location of the debug points associated with try and with and while and finally keywords is now correctly recovered from the internal form.
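
The tailcall `yield!` case from the notes above can be sketched by hand as follows. This assumes `ListCollector<'T>.AddManyAndClose` from RFC FS-1099 is available in the referenced FSharp.Core. The names `prepend` and `prependCollected` are hypothetical, and the sketch only checks the resulting values, not whether the tail is physically shared.

```fsharp
open Microsoft.FSharp.Core.CompilerServices

// Source form: `yield! rest` is the last thing the expression does.
let prepend (x: int) (rest: int list) =
    [ yield x
      yield! rest ]

// Hand-written approximation: the final `yield!` becomes AddManyAndClose,
// which per the PR description can reuse `rest` as the tail of the result
// instead of copying it element by element.
let prependCollected (x: int) (rest: int list) =
    let mutable collector = ListCollector<int>()
    collector.Add x
    collector.AddManyAndClose(rest)

let rest = [ 2; 3; 4 ]
printfn "%A" (prependCollected 1 rest)  // [1; 2; 3; 4]
```

Reusing the tail is safe only because F# lists are immutable, which is the same reasoning the notes give for List.append.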

The perf results on micro samples are good and pretty much as expected from previous experiments with using state machines for list { ... } and array {... } comprehensions. Note that some other people have experimented with faster builders using reflection emit codegen too.

  • Raw perf of generating 0 or 1 element lists: ~4x faster
  • Raw perf of generating 6-10 element lists: ~4x faster
  • Raw perf of generating 0 or 1 element arrays: ~4x faster
  • Raw perf of generating 6-10 element arrays: ~2x faster

We don't expect any change in fixed size arrays or lists.

I don't expect any cases where this will either be slower or use more stack in a significant way compared to our old way of doing these (which is to create a sequence expression state machine and iterate).

C:\GitHub\dsyme\fsharp>artifacts\bin\fsc\Release\net472\fsc.exe --optimize a.fs && a
PERF: tinyVariableSizeBuiltin : 89
PERF: variableSizeBuiltin : 504
PERF: fixedSizeBase : 504
PERF: tinyVariableSizeBuiltin (array) : 194
PERF: variableSizeBuiltin (array) : 1251
PERF: fixedSizeBase (array) : 260

C:\GitHub\dsyme\fsharp>fsc.exe --optimize a.fs && a
PERF: tinyVariableSizeBuiltin : 356
PERF: variableSizeBuiltin : 1949
PERF: fixedSizeBase : 497
PERF: tinyVariableSizeBuiltin (array) : 717
PERF: variableSizeBuiltin (array) : 2511
PERF: fixedSizeBase (array) : 244
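
The speedup figures quoted earlier can be recomputed from these two runs. The sketch below assumes the first run (the `artifacts\bin\fsc` build) contains this change and the second run (`fsc.exe`) is the previously installed compiler. The names `timings` and `speedups` are hypothetical.

```fsharp
// Elapsed milliseconds copied from the two runs above.
// Assumption: the first run (artifacts\bin\fsc) is the build with this change
// and the second run (fsc.exe) is the previously installed compiler.
let timings =
    [ "tinyVariableSizeBuiltin", 89.0, 356.0
      "variableSizeBuiltin", 504.0, 1949.0
      "fixedSizeBase", 504.0, 497.0
      "tinyVariableSizeBuiltin (array)", 194.0, 717.0
      "variableSizeBuiltin (array)", 1251.0, 2511.0
      "fixedSizeBase (array)", 260.0, 244.0 ]

// Speedup = baseline time / time with the change; values near 1.0 mean no real change.
let speedups =
    timings
    |> List.map (fun (label, withChange, baseline) -> label, baseline / withChange)

for (label, ratio) in speedups do
    printfn "%-32s %.2fx" label ratio
```

Under that assumption the variable-size cases come out at roughly 4x for lists and between 2x and 4x for arrays, while the fixed-size cases stay close to 1x.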
  • For correctness testing the existing tests we have are ok I think - we have zillions of computed list and array expressions in the test suites and compiler that return results of many different sizes.

  • Some IL code generation tests will likely fail, we'll need to update those

  • We need to check debug stepping (it should be possible to make this much improved if it's not already)

@dsyme dsyme changed the base branch from main to feature/tasks May 21, 2021 16:53
@dsyme dsyme changed the base branch from feature/tasks to main May 21, 2021 17:22

dsyme commented May 21, 2021

Here's the code I used for performance testing:

module Lists =

    let tinyVariableSizeBuiltin () = 
        for i in 1 .. 1000000 do
            [
               if i % 3 = 0 then 
                   yield "b"
            ] |> List.length |> ignore

    let variableSizeBuiltin () = 
        for i in 1 .. 1000000 do
            [
               yield "a"
               yield "b"
               yield "b"
               yield "b"
               yield "b"
               if i % 3 = 0 then 
                   yield "b"
                   yield "b"
                   yield "b"
                   yield "b"
               yield "c"
            ] |> List.length |> ignore

    let fixedSizeBase () = 
        for i in 1 .. 1000000 do
            [
               "a"
               "b"
               "b"
               "b"
               "b"
               "b"
               "b"
               "b"
               "b"
               "c"
            ] |> List.length |> ignore

    let perf s f = 
        let t = System.Diagnostics.Stopwatch()
        t.Start()
        for i in 0 .. 5 do 
            f()
        t.Stop()
        printfn "PERF: %s : %d" s t.ElapsedMilliseconds

    perf "tinyVariableSizeBuiltin" tinyVariableSizeBuiltin

    perf "variableSizeBuiltin" variableSizeBuiltin

    perf "fixedSizeBase" fixedSizeBase
module Arrays =

    let tinyVariableSizeBuiltin () = 
        for i in 1 .. 1000000 do
            [|
               if i % 3 = 0 then 
                   yield "b"
            |] |> Array.length |> ignore

    let variableSizeBuiltin () = 
        for i in 1 .. 1000000 do
            [|
               yield "a"
               yield "b"
               yield "b"
               yield "b"
               yield "b"
               if i % 3 = 0 then 
                   yield "b"
                   yield "b"
                   yield "b"
                   yield "b"
               yield "c"
            |] |> Array.length |> ignore

    let fixedSizeBase () = 
        for i in 1 .. 1000000 do
            [|
               "a"
               "b"
               "b"
               "b"
               "b"
               "b"
               "b"
               "b"
               "b"
               "c"
            |] |> Array.length |> ignore

    let perf s f = 
        let t = System.Diagnostics.Stopwatch()
        t.Start()
        for i in 0 .. 5 do 
            f()
        t.Stop()
        printfn "PERF: %s : %d" s t.ElapsedMilliseconds

    perf "tinyVariableSizeBuiltin (array)" tinyVariableSizeBuiltin

    perf "variableSizeBuiltin (array)" variableSizeBuiltin

    perf "fixedSizeBase (array)" fixedSizeBase

match values with
| :? ('T[]) as valuesAsArray ->
    for v in valuesAsArray do
        this.Add v

It's possible this could be a bit faster and avoid so many writes into the fields of ListCollector. However, ListCollector is ultimately a mutable struct on the stack, so writes will be fast and this may not end up any faster.

// cook a faster iterator for lists and arrays
match values with
| :? ('T[]) as valuesAsArray ->
    for v in valuesAsArray do

Iterating over arrays is considerably faster than iterating over sequences.
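
The type-test pattern under review can be tried in isolation with a sketch like the one below. It is not the collector's code: `addMany` is a hypothetical standalone helper, and a `ResizeArray` stands in for the collector's internal state, which is not shown in this conversation.

```fsharp
// Hypothetical helper: adds every element of `values` to `target`, using a
// direct loop over arrays and lists before falling back to IEnumerable.
let addMany (target: ResizeArray<'T>) (values: seq<'T>) =
    match values with
    | :? ('T[]) as valuesAsArray ->
        // Array case: compiled as an indexed loop, no enumerator object.
        for v in valuesAsArray do
            target.Add v
    | :? ('T list) as valuesAsList ->
        // List case: walks the cons cells directly.
        for v in valuesAsList do
            target.Add v
    | _ ->
        // General case: goes through GetEnumerator/MoveNext.
        for v in values do
            target.Add v

let target = ResizeArray<int>()
addMany target [| 1; 2 |]
addMany target [ 3; 4 ]
addMany target (seq { yield 5 })
printfn "%A" (List.ofSeq target)  // [1; 2; 3; 4; 5]
```

All three branches add the same elements in the same order; only the iteration mechanism differs.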


dsyme commented May 21, 2021

Random test failure:

2021-05-21T22:37:04.9212825Z   Failed TypeCheckOutOfMemory [96 ms]
2021-05-21T22:37:04.9213392Z   Error Message:
2021-05-21T22:37:04.9214133Z    System.UnauthorizedAccessException : Access to the path 'D:\workspace\_work\1\s\vsintegration\tests\UnitTests\watson-test.fs' is denied.
2021-05-21T22:37:04.9214949Z   Stack Trace:
2021-05-21T22:37:04.9215732Z      at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
2021-05-21T22:37:04.9216431Z    at System.IO.File.InternalDelete(String path, Boolean checkHost)
2021-05-21T22:37:04.9217044Z    at System.IO.File.Delete(String path)
2021-05-21T22:37:04.9218022Z    at Tests.Compiler.Watson.Check.FscLevelException[TException](String simulationCode) in D:\workspace\_work\1\s\vsintegration\tests\UnitTests\Tests.Watson.fs:line 50
2021-05-21T22:37:04.9219314Z    at Tests.Compiler.Watson.WatsonTests.TypeCheckOutOfMemory() in D:\workspace\_work\1\s\vsintegration\tests\UnitTests\Tests.Watson.fs:line 115


dsyme commented May 21, 2021

I went through tests\walkthroughs\DebugStepping\TheBigFileOfDebugStepping.fsx and made some general improvements to debugging of list, array and sequence expressions and checked that all the sample list/sequence expressions in that file debug OK


dsyme commented May 21, 2021

This is now ready (some baselines may still need updating, but nearly all are done)

Note that a lot of sequence expression state machine generation is removed from the test baselines in favour of simpler code, hence about -2000 lines net in this PR


dsyme commented May 22, 2021

OK, everything green, this is ready. I've updated the RFC and notes in the PR


dsyme commented May 25, 2021

@TIHan @cartermp @KevinRansom @vzarytovskii This is ready for your review


dsyme commented May 26, 2021

On review with @TIHan:

  1. Mutables are still being promoted to ref in collecting list and array code.
  2. For loops are always using IEnumerable, not "fast integer for loops" or "array loops"

eg.

let f () =
   [| let mutable x = 1  
      if today() then yield x
      if f() then yield 1 
      for i in 0.. 5 do
         if g() then yield 1  
      |]

becomes roughly:

let f () =
   let x = ref 1  
   let mutable collector = ArrayCollector<int>()
   if today() then collector.Add(x.Value)
   if f() then collector.Add(1)
   let enum = (0..5).GetEnumerator()
   try 
     while enum.MoveNext() do
       if g() then collector.Add(1)
   finally
     enum.Dispose()
   collector.Close()

This is still always faster than the corresponding sequence expression code.

In a separate PR we could improve (2). Improving (1) is more difficult because the mutable-to-ref promotion happens well before the generation of collecting code.
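
For comparison, a hand-written version with both points addressed might look like the sketch below. This is not what the compiler emitted at the time of this PR. It assumes `ArrayCollector<'T>` from RFC FS-1099 is available, and `build`, `today`, `check` and `g` are hypothetical stand-ins for the functions in the example above.

```fsharp
open Microsoft.FSharp.Core.CompilerServices

// `today`, `check` and `g` are hypothetical predicates passed in as parameters
// so that the sketch is self-contained.
let build (today: unit -> bool) (check: unit -> bool) (g: unit -> bool) =
    let mutable collector = ArrayCollector<int>()
    // Kept as a local mutable rather than promoted to a ref cell.
    let mutable x = 1
    if today () then collector.Add x
    if check () then collector.Add 1
    // A plain integer loop: no enumerator and no try/finally.
    for i in 0 .. 5 do
        if g () then collector.Add 1
    collector.Close()

let result = build (fun () -> true) (fun () -> false) (fun () -> true)
printfn "%d" result.Length  // 7: one element from `today`, six from the loop
```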

Separately we could also imagine lifting the language restriction that prevents the use of span, byref capturing etc. in list and array expressions - at least for compiled code. Quotations may still have problems with these.


kerams commented May 26, 2021

Would it be hard to statically analyze (in the optimization/IL gen phase) the sequence expression and learn the minimum number of elements that will be yielded? In the case of ArrayCollector, that information could be used to preallocate a better sized resize array (or indeed the final array when we can be sure of the exact number of elements) and also skip assigning to First and Second at first.


dsyme commented May 26, 2021

@kerams Yes in theory, though I think the cases where that would matter would be cases where the size projection was a formula of the input sizes being iterated.
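
To make the suggestion concrete, the sketch below shows the simplest case, where the element count is known exactly. It is purely illustrative: this PR does not perform such an analysis, and `viaExpression` and `viaPreallocation` are hypothetical names.

```fsharp
// Source form: the expression always yields exactly three elements.
let viaExpression (a: int) (b: int) =
    [| yield a
       yield b
       yield a + b |]

// Hypothetical result of the suggested analysis: the element count is known
// to be 3, so the final array is allocated once and filled in place.
let viaPreallocation (a: int) (b: int) =
    let result = Array.zeroCreate<int> 3
    result.[0] <- a
    result.[1] <- b
    result.[2] <- a + b
    result

printfn "%A" (viaPreallocation 1 2)  // [|1; 2; 3|]
```

The harder cases are the ones mentioned in the reply, where the count depends on the sizes of the inputs being iterated.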

@dsyme dsyme closed this May 26, 2021
@dsyme dsyme reopened this May 26, 2021
@dsyme dsyme changed the title feature/fastlist - faster list/array computation expressions Faster list/array computation expressions Jun 1, 2021

@TIHan TIHan left a comment

@dsyme and I spoke on this last week. I quickly went over it today - most of the code changes are test changes. We will have new public APIs in FSharp.Core it looks like; they are meant for the codegen.

Looking at the core of the change, it will be great to have this and might be able to extend it further for other collections, such as ImmutableArray.


dsyme commented Jun 2, 2021

> @dsyme and I spoke on this last week. I quickly went over it today - most of the code changes are test changes. We will have new public APIs in FSharp.Core it looks like; they are meant for the codegen.

Thanks - I'll merge this.

> Looking at the core of the change, it will be great to have this and might be able to extend it further for other collections, such as ImmutableArray.

Yes. I'm not yet sure of the right generalization. In principle any synchronous consumption of a seq { ... } or Seq.map/filter/... pipeline can be given this treatment, e.g. Seq.iter or a for x in seq { ... } do ..., though neither occurs that commonly in combination.

For other collections, e.g. immutable array/block, it may depend on whether we allow block [ ... ] as a special construct, or the proposal to allow [ ... ] to be used to initialize a block in the presence of known type information.


dsyme commented Jun 2, 2021

Note also there are many other places inside FSharp.Core where we might be able to use ArrayCollector to avoid creating a ResizeArray - e.g. even just for Seq.toArray
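
A minimal sketch of that idea, written as user code rather than as a change to FSharp.Core: `toArrayViaCollector` is a hypothetical name, and the sketch assumes `ArrayCollector<'T>` with `AddMany` and `Close` as described in RFC FS-1099.

```fsharp
open Microsoft.FSharp.Core.CompilerServices

// Hypothetical alternative to Seq.toArray built on the struct collector.
// Per the PR description, results of size 0, 1 or 2 avoid allocating a ResizeArray.
let toArrayViaCollector (source: seq<'T>) : 'T[] =
    let mutable collector = ArrayCollector<'T>()
    collector.AddMany(source)
    collector.Close()

printfn "%A" (toArrayViaCollector (seq { 1 .. 5 }))  // [|1; 2; 3; 4; 5|]
```

Whether this beats the existing Seq.toArray would need measuring, since that function may already have its own fast paths for common input types.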

This pull request was closed.