## Benchmark of routines for generation of weak integer compositions (ordered integer partitions with lower bound 0)

In [1]:
using BenchmarkTools

include("weak_integer_compositions.jl");
import .IntegerCompositions: 
    weak_integer_compositions, 
    weak_integer_compositions_v2, 
    weak_integer_compositions_kun

In [2]:
loop_num = 2;
n_expandables = 5;

In [3]:
sort(collect(weak_integer_compositions(2, 4)))

10-element Vector{Vector{Int64}}:
 [0, 0, 0, 2]
 [0, 0, 1, 1]
 [0, 0, 2, 0]
 [0, 1, 0, 1]
 [0, 1, 1, 0]
 [0, 2, 0, 0]
 [1, 0, 0, 1]
 [1, 0, 1, 0]
 [1, 1, 0, 0]
 [2, 0, 0, 0]
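
As a quick sanity check on the listing above, the number of weak compositions of $n$ into $k$ parts is given by the standard stars-and-bars count $\binom{n+k-1}{k-1}$. The sketch below evaluates it for the sizes used in this notebook; the helper name `n_weak_compositions` is hypothetical and not part of the benchmarked module.

```julia
# Hypothetical helper (not part of IntegerCompositions): stars-and-bars count
# of the weak compositions of n into k non-negative parts.
n_weak_compositions(n::Integer, k::Integer) = binomial(n + k - 1, k - 1)

n_weak_compositions(2, 4)   # 10, matching the listing above
n_weak_compositions(2, 5)   # 15, the size benchmarked below
n_weak_compositions(5, 10)  # 2002
```
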

In [4]:
# Verify that all three implementations give equivalent results
@assert allequal([sort(collect(v)) for v in [weak_integer_compositions(loop_num, n_expandables), 
                                             weak_integer_compositions_v2(loop_num, n_expandables), 
                                             weak_integer_compositions_kun(loop_num, n_expandables)]])

### Case 1: Iterator carrying a `Stateful` instance of `Combinatorics.WithReplacementCombinations`

In [5]:
b1 = @benchmarkable collect(weak_integer_compositions(loop_num, n_expandables)) samples=100000 evals=10
run(b1)

BenchmarkTools.Trial: 100000 samples with 10 evaluations.
 Range (min … max):  1.594 μs … 737.375 μs  ┊ GC (min … max):  0.00% … 99.64%
 Time  (median):     1.742 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.213 μs ±  12.491 μs  ┊ GC (mean ± σ):  18.76% ±  3.34%

### Case 2: Direct iterator implementation (using modified `Combinatorics.jl` source code for `WithReplacementCombinations`)

In [6]:
b2 = @benchmarkable collect(weak_integer_compositions_v2(loop_num, n_expandables)) samples=100000 evals=10
run(b2)

BenchmarkTools.Trial: 100000 samples with 10 evaluations.
 Range (min … max):  1.394 μs … 558.592 μs  ┊ GC (min … max):  0.00% … 99.41%
 Time  (median):     1.533 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   1.940 μs ±  11.368 μs  ┊ GC (mean ± σ):  18.42% ±  3.16%

### Case 3: Existing implementation, i.e., `Parquet.orderedPartition` stripped of assertions and specialized to weak compositions (`lowerbound=0`)

In [7]:
b3 = @benchmarkable collect(weak_integer_compositions_kun(loop_num, n_expandables)) samples=100000 evals=10
run(b3)

BenchmarkTools.Trial: 16819 samples with 10 evaluations.
 Range (min … max):  24.529 μs … 685.130 μs  ┊ GC (min … max): 0.00% … 93.88%
 Time  (median):     25.791 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   28.820 μs ±  25.759 μs  ┊ GC (mean ± σ):  9.94% ±  9.96%

We observe that both iterators outperform the existing implementation in runtime and memory allocations (median times of about 1.7 μs and 1.5 μs versus 25.8 μs for $n_l = 2$, $n_e = 5$), especially in the regime $n_l < n_e$, where $n_l$ is the loop number and $n_e$ is the number of propagators to be expanded. This is because the existing implementation first generates all permutations for each cycle, then identifies and discards duplicates. In other words, it does not take advantage of the multiset structure.

In contrast, the `Combinatorics.jl` iterator `with_replacement_combinations` directly generates multicombinations of fixed size. For instance, consider the following two cases: (1) $n_l = 2, n_e = 5$, and (2) $n_l = 5, n_e = 10$.
The existing approach is an order of magnitude slower in case (1), roughly a factor of 15 in median runtime, and completely intractable in case (2). Even at high orders as in case (2), the multicombination iterator uses less than 1 MB of memory and retains a runtime of under a millisecond.
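
To illustrate why multicombinations are enough, here is a minimal sketch of the correspondence, not the benchmarked implementation. It assumes `Combinatorics.jl` is installed; the function name `compositions_from_multicombinations` is hypothetical. Each multicombination of size $n$ drawn from the labels `1:k` is turned into a weak composition by counting how often each label occurs, so no duplicates are ever generated.

```julia
using Combinatorics: with_replacement_combinations

# Hypothetical helper, for illustration only: map every size-n multicombination
# of the labels 1:k to the weak composition of n that counts label occurrences.
function compositions_from_multicombinations(n::Integer, k::Integer)
    result = Vector{Vector{Int}}()
    for labels in with_replacement_combinations(1:k, n)
        counts = zeros(Int, k)      # one slot per part of the composition
        for label in labels
            counts[label] += 1      # a label chosen m times gives a part equal to m
        end
        push!(result, counts)
    end
    return result
end

comps = compositions_from_multicombinations(2, 4)
```

Under these assumptions, `sort(comps)` should reproduce the ten compositions of 2 into 4 parts listed at the top of the notebook.
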

We propose to use the first of the two iterator implementations. It is slightly less efficient than the direct implementation (median of about 1.7 μs versus 1.5 μs in the benchmark above), but it is more idiomatic Julia and avoids any licensing headaches associated with modifying the `Combinatorics.jl` sources. Note that this would currently only replace the functionality of `Parquet.orderedPartition` for the generation of *weak* compositions, i.e., when `lowerbound=0`.

Finally, note that many sophisticated optimizations exist for (multi)permutation/combination generation (Gray code, co-lex and cool-lex orders, etc.). I have not reverse-engineered the actual implementation in `Combinatorics.jl`, but it would be interesting to do so.