Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perf optimization on diffContentMultiple #317

Conversation

mathias-brandewinder
Copy link
Contributor

Convert lastList to array to speed up indexed access for large lists.

Convert lastList to array to speed up indexed access for large lists.
@mathias-brandewinder
Copy link
Contributor Author

When dealing with large lists, the call Differ.diff(lastList.[index], nextList.[index]) ends up being slow.
I tried an alternate version, using recursion, but it seems to cause stack overflows occasionally in debug mode.

  let private diffContentMultiple (lastList: IView list, nextList: IView list) : ViewDelta list =

      let rec f acc (lastList, nextList) =
          match nextList with
          | [] -> acc |> List.rev
          | hdNext :: tlNext ->
              match lastList with
              | [] -> f (ViewDelta.From hdNext :: acc) ([], tlNext)
              | hdLast :: tlLast -> f (Differ.diff (hdLast, hdNext) :: acc) (tlLast, tlNext)

      f [] (lastList, nextList)

@JaggerJo
Copy link
Member

Interesting, how many attributes does it take to make it "slow' in your case?

I'm a bit reluctant to merge this as it allocates in a perf critical path.

Really looking forward to https://github.com/dotnet/fsharp/pull/12859/files/5d4a2d1b35e9f4afbe3c2e3d4cf71d198dbbebfc..7acde1b39f139bdacc105ce43ca3c2f7a8b06f75 so we could use immutable arrays in the DSL from the start.

@Numpsy
Copy link
Collaborator

Numpsy commented Jun 22, 2023

Is the Differ.diff(lastList.[index], nextList.[index]) -> Differ.diff (lastList.[index], next) change useful on its own?

Comment on lines 67 to 73
let lastList = lastList |> List.toArray
nextList
|> List.mapi (fun index next ->
if index < lastList.Length
then Differ.diff (lastList.[index], next)
else ViewDelta.From next
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let lastList = lastList |> List.toArray
nextList
|> List.mapi (fun index next ->
if index < lastList.Length
then Differ.diff (lastList.[index], next)
else ViewDelta.From next
)
nextList |> List.mapi (fun index next ->
if index + 1 <= lastList.Length then
Differ.diff(lastList.[index], next)
else
ViewDelta.From next
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an improvement we can merge immediately.

@mathias-brandewinder
Copy link
Contributor Author

Interesting, how many attributes does it take to make it "slow' in your case?
I'm a bit reluctant to merge this as it allocates in a perf critical path.

The difference becomes notable over 10,000 elements. This is probably not common for the typical business apps, but I have been working on a Game of Life toy implementation. At 200x200 cells, that is 40,000 elements. With the change above, this runs decently smoothly.

Rough benchmark:

type Change =
    | Difference of int
    | New of int

let newList = [ 1 .. 100_000 ]
let oldList = [ 2 .. 2 .. 100_000 ]

let original (newList: list<int>) (oldList: list<int>) =
    newList
    |> List.mapi (fun index next ->
        if index + 1 <= oldList.Length
        then Difference(oldList.[index] - newList.[index])
        else New next
        )

let modified (newList: list<int>) (oldList: list<int>) =
    let oldList = oldList |> List.toArray
    newList
    |> List.mapi (fun i next ->
        if i < oldList.Length
        then Difference (oldList.[i] - next)
        else New next
        )

#time "on"
original newList oldList |> ignore
// Real: 00:00:07.884, CPU: 00:00:07.218, GC gen0: 2, gen1: 2, gen2: 1
modified newList oldList |> ignore
// Real: 00:00:00.006, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0

The Game of Life sample is a fun benchmark, I'll keep digging for hot spots. I'd like to avoid allocating a new array here, but haven't found a clean way to do it yet :)

@JaggerJo
Copy link
Member

Interesting, how many attributes does it take to make it "slow' in your case?
I'm a bit reluctant to merge this as it allocates in a perf critical path.

The difference becomes notable over 10,000 elements. This is probably not common for the typical business apps, but I have been working on a Game of Life toy implementation. At 200x200 cells, that is 40,000 elements. With the change above, this runs decently smoothly.

Rough benchmark:

type Change =
    | Difference of int
    | New of int

let newList = [ 1 .. 100_000 ]
let oldList = [ 2 .. 2 .. 100_000 ]

let original (newList: list<int>) (oldList: list<int>) =
    newList
    |> List.mapi (fun index next ->
        if index + 1 <= oldList.Length
        then Difference(oldList.[index] - newList.[index])
        else New next
        )

let modified (newList: list<int>) (oldList: list<int>) =
    let oldList = oldList |> List.toArray
    newList
    |> List.mapi (fun i next ->
        if i < oldList.Length
        then Difference (oldList.[i] - next)
        else New next
        )

#time "on"
original newList oldList |> ignore
// Real: 00:00:07.884, CPU: 00:00:07.218, GC gen0: 2, gen1: 2, gen2: 1
modified newList oldList |> ignore
// Real: 00:00:00.006, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0

The Game of Life sample is a fun benchmark, I'll keep digging for hot spots. I'd like to avoid allocating a new array here, but haven't found a clean way to do it yet :)

Would be interesting how the change suggested above (no array creation) already improves perf.

@mathias-brandewinder
Copy link
Contributor Author

I will make the formatting changes you suggested (sorry for not following the code base standard!), with one small change, keeping if index < lastList.Length, unless you have objections. This change is purely a micro optimization, we might as well avoid an operation.

@mathias-brandewinder
Copy link
Contributor Author

Would be interesting how the change suggested above (no array creation) already improves perf.

The one @Numpsy mentioned, that is Differ.diff (lastList.[index], next)? It helps, but only marginally, because we are still incurring a significant cost in accessing the other list by index. Rough benchmark:

let alternate (newList: list<int>) (oldList: list<int>) =
    newList
    |> List.mapi (fun index next ->
        if index + 1 <= oldList.Length
        then Difference(oldList.[index] - next)
        else New next
        )
Original: Real: 00:00:07.919, CPU: 00:00:07.203, GC gen0: 0, gen1: 0, gen2: 0
Alternate: Real: 00:00:06.489, CPU: 00:00:06.515, GC gen0: 2, gen1: 2, gen2: 1
Modified: Real: 00:00:00.009, CPU: 00:00:00.015, GC gen0: 0, gen1: 0, gen2: 0

@Numpsy
Copy link
Collaborator

Numpsy commented Jun 22, 2023

Original: Real: 00:00:07.919, CPU: 00:00:07.203, GC gen0: 0, gen1: 0, gen2: 0
Alternate: Real: 00:00:06.489, CPU: 00:00:06.515, GC gen0: 2, gen1: 2, gen2: 1
Modified: Real: 00:00:00.009, CPU: 00:00:00.015, GC gen0: 0, gen1: 0, gen2: 0

Hmm, thought there would be a bigger different if indexed collection access was the bottleneck and half of those are removed,
though are those figures suggesting there are GC calls in the version that doesn't have the toArrayCalls and none in the one that does?
Maybe this will depend on which list is longer though

@mathias-brandewinder
Copy link
Contributor Author

@Numpsy I was also expecting a bigger speedup, roughly 50%. I would take the measurement with a grain of salt, running this from the scripting environment with #time "on" is not super reliable. I could add a proper benchmark, if useful - but I think what the script shows is correct overall. And running my game of life code with that change did run visibly faster.

@Numpsy
Copy link
Collaborator

Numpsy commented Jun 22, 2023

Just in case, does it make a difference in the alternate case if the call to oldList.Length is outside of the loop?

@mathias-brandewinder
Copy link
Contributor Author

Good call, I tried it out quick:

let alternate (newList: list<int>) (oldList: list<int>) =
    let len = oldList.Length
    newList
    |> List.mapi (fun index next ->
        if index + 1 <= len
        then Difference(oldList.[index] - next)
        else New next
        )

Real: 00:00:01.297, CPU: 00:00:01.296, GC gen0: 0, gen1: 0, gen2: 0

@JaggerJo
Copy link
Member

JaggerJo commented Jun 23, 2023

Good call, I tried it out quick:

let alternate (newList: list<int>) (oldList: list<int>) =
    let len = oldList.Length
    newList
    |> List.mapi (fun index next ->
        if index + 1 <= len
        then Difference(oldList.[index] - next)
        else New next
        )

Real: 00:00:01.297, CPU: 00:00:01.296, GC gen0: 0, gen1: 0, gen2: 0

I think we can merge this as a first improvement.

I'm also happy to merge any other improvements once were sure they are safe. I don't want to merge the List.toArray improvements just yet. I'd want to test it in the benchmark project against a real set of views with different attribute counts.

Just to make sure this does not have any negative effects on small lists.

@mathias-brandewinder
Copy link
Contributor Author

@JaggerJo caution is understandable, it is a key function in the loop!
I have written a benchmark using a real view, testing the original Differ.diff against my modification. The view is essentially a uniform grid of 250 x 250 rectangles, from my game of life test.
I will expand the benchmark, to test:

  • Original code as baseline, minor modification (no arrays), array modification
  • Perhaps for sizes of 10, 100, 1000, 100,000, to compare perf for various scenarios
    Would you have other views in mind that could be a good test case?

@JaggerJo
Copy link
Member

@JaggerJo caution is understandable, it is a key function in the loop! I have written a benchmark using a real view, testing the original Differ.diff against my modification. The view is essentially a uniform grid of 250 x 250 rectangles, from my game of life test. I will expand the benchmark, to test:

  • Original code as baseline, minor modification (no arrays), array modification
  • Perhaps for sizes of 10, 100, 1000, 100,000, to compare perf for various scenarios
    Would you have other views in mind that could be a good test case?

I'm doing the same right now, let's collaborate 😄 I've pushed my stuff to main a15dc81

@JaggerJo
Copy link
Member

@mathias-brandewinder came up with this:

    let private diffContentMultiple (lastList: IView list, nextList: IView list) : ViewDelta list =
        let lastListLength = lastList.Length

        let mutable lastTail: IView list = lastList

        nextList |> List.mapi (fun index next ->
            if index + 1 <= lastListLength then
                let result = diff(lastTail.Head, next)
                lastTail <- lastTail.Tail
                result
            else
                ViewDelta.From next
        )

@Numpsy
Copy link
Collaborator

Numpsy commented Jun 23, 2023

I'd wondered if you could do something along those lines with GetEnumerator/MoveNext, but haven't tried it (perhaps less of a 'functional' approach though)

@mathias-brandewinder
Copy link
Contributor Author

mathias-brandewinder commented Jun 23, 2023

@JaggerJo nice - I remember I tried this as well, but the results were poor, I think because I forgot to move the list length out of the loop.
Below is what BenchmarkDotNet gives me, for 3 different settings (25 items, 2,500 items, 90,000 items in the grid). Overall the results are pretty similar, with a slight edge to the mutable list version. I did not include the original version, which is slower in every single case.
Want me to push your modification instead of the array based one?
I pushed your modification, tested it on my Game of Life which runs like a champ :)

5 x 5

|      Method |     Mean |    Error |   StdDev |   Gen0 | Allocated |
|------------ |---------:|---------:|---------:|-------:|----------:|
|  ArrayBased | 35.20 us | 0.658 us | 1.063 us | 0.4883 |  33.42 KB |
| MutableList | 33.95 us | 0.456 us | 0.381 us | 0.4883 |  33.21 KB |

50 x 50

|      Method |     Mean |     Error |    StdDev |    Gen0 |    Gen1 | Allocated |
|------------ |---------:|----------:|----------:|--------:|--------:|----------:|
|  ArrayBased | 2.357 ms | 0.0447 ms | 0.0772 ms | 27.3438 | 11.7188 |   1.85 MB |
| MutableList | 2.249 ms | 0.0343 ms | 0.0286 ms | 27.3438 | 11.7188 |   1.83 MB |

300 x 300

|      Method |      Mean |    Error |   StdDev |      Gen0 |     Gen1 |     Gen2 | Allocated |
|------------ |----------:|---------:|---------:|----------:|---------:|---------:|----------:|
|  ArrayBased | 108.78 ms | 2.134 ms | 2.621 ms | 1000.0000 | 400.0000 | 400.0000 |  65.85 MB |
| MutableList |  94.92 ms | 1.851 ms | 2.132 ms | 1000.0000 | 666.6667 | 333.3333 |  65.17 MB |

@JaggerJo
Copy link
Member

Nice! let's merge this. Hope we'll find more opportunities to improve perf like this one.

@mathias-brandewinder
Copy link
Contributor Author

Apologies for the build failure, I missed that changes had happened before my last commit.

@mathias-brandewinder
Copy link
Contributor Author

@Numpsy I tried an enumerator based approach, as far as I can tell it is no better than the current best, and the code is pretty gross, basically this:

let enumeratorBased (newList: list<int>) (oldList: list<int>) =

    let enumeratorOld = (oldList :> seq<int>).GetEnumerator()
    let enumeratorNew = (newList :> seq<int>).GetEnumerator()
    let mutable notFinished = true

    [
        while (enumeratorNew.MoveNext()) do
            notFinished <-
                if notFinished
                then enumeratorOld.MoveNext()
                else false

            if notFinished
            then
                Difference (enumeratorOld.Current - enumeratorNew.Current)
            else
                New (enumeratorNew.Current)
    ]

@Numpsy
Copy link
Collaborator

Numpsy commented Jun 24, 2023

I'd been thinking more a hybrid approach like

    let private diffContentMultiple2 (lastList: IView list, nextList: IView list) : ViewDelta list =
        let lastListLength = lastList.Length
        use iter = (lastList :> IView seq).GetEnumerator()

        nextList |> List.mapi (fun index next ->
            if index + 1 <= lastListLength then
                iter.MoveNext() |> ignore
                let result = diff(iter.Current , next)
                result
            else
                ViewDelta.From next
        )

but it looks like the list enumerator is implemented with head/tail itself, so it ends up being a more round about version of the suggested mutable list approach

@JaggerJo
Copy link
Member

I've switched over to the fast implementation in main with this commit!

I've added both of you as co-authors. Hope that's fine.

BenchmarkDotNet=v0.13.5, OS=macOS Ventura 13.3 (22E252) [Darwin 22.4.0]
Apple M2 Pro, 1 CPU, 10 logical and 10 physical cores
.NET SDK=7.0.400-preview.23226.4
[Host] : .NET 6.0.14 (6.0.1423.7309), Arm64 RyuJIT AdvSIMD DEBUG
Job-XQLMZA : .NET 6.0.14 (6.0.1423.7309), Arm64 RyuJIT AdvSIMD

IterationCount=5 RunStrategy=Throughput WarmupCount=5

Method Cells ExtendedAttrs Mean Error StdDev Ratio Gen0 Gen1 Gen2 Allocated Alloc Ratio
Baseline 10 False 129.1 us 0.36 us 0.09 us 1.00 53.2227 - - 108.72 KB 1.00
ChampionA 10 False 118.8 us 1.51 us 0.23 us 0.92 53.2227 - - 108.72 KB 1.00
ChampionB 10 False 116.6 us 4.25 us 1.10 us 0.90 53.2227 - - 108.74 KB 1.00
ChampionC 10 False 116.0 us 4.07 us 0.63 us 0.90 53.2227 - - 108.74 KB 1.00
Baseline 10 True 367.2 us 1.39 us 0.36 us 1.00 101.5625 - - 207.8 KB 1.00
ChampionA 10 True 354.6 us 1.55 us 0.40 us 0.97 101.5625 - - 207.8 KB 1.00
ChampionB 10 True 350.6 us 7.83 us 2.03 us 0.95 101.5625 - - 207.82 KB 1.00
ChampionC 10 True 355.9 us 11.37 us 2.95 us 0.97 101.5625 - - 207.82 KB 1.00
Baseline 100 False 1,161,724.2 us 44,786.40 us 6,930.74 us 1.00 1000.0000 - - 10884.64 KB 1.00
ChampionA 100 False 297,266.5 us 3,449.09 us 895.72 us 0.26 2500.0000 500.0000 - 10884.16 KB 1.00
ChampionB 100 False 15,594.2 us 1,418.83 us 368.46 us 0.01 3031.2500 421.8750 - 10883.73 KB 1.00
ChampionC 100 False 15,027.5 us 815.86 us 211.88 us 0.01 3234.3750 343.7500 15.6250 10883.74 KB 1.00
Baseline 100 True 1,224,936.2 us 6,171.25 us 955.01 us 1.00 7000.0000 1000.0000 - 20862.18 KB 1.00
ChampionA 100 True 330,595.9 us 5,783.97 us 1,502.08 us 0.27 6000.0000 500.0000 - 20862.45 KB 1.00
ChampionB 100 True 39,654.2 us 462.12 us 71.51 us 0.03 5615.3846 384.6154 - 20861.32 KB 1.00
ChampionC 100 True 39,635.0 us 659.52 us 171.28 us 0.03 5846.1538 384.6154 - 20861.32 KB 1.00

@JaggerJo JaggerJo closed this Jun 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants