
Variable length load meandering #595

Merged
LebedevRI merged 12 commits into darktable-org:develop from LebedevRI:variable-length-load
Jan 7, 2024

@LebedevRI

build-Clang17-release$ bench/librawspeed/adt/VariableLengthLoadBenchmark --benchmark_filter="uint[3][2]_t" --benchmark_repetitions=1000 --benchmark_min_time=1x --benchmark_display_aggregates_only=true
2024-01-07T03:11:09+03:00
Running bench/librawspeed/adt/VariableLengthLoadBenchmark
Run on (32 X 3400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 1.45, 1.13, 1.00
--------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------------------------
BM_Impl<fixedLengthLoad, uint32_t>/524288_mean                                                      58.0 us         58.0 us         1000 Latency=121.66ps Throughput=8.41783Gi/s
BM_Impl<fixedLengthLoad, uint32_t>/524288_median                                                    58.0 us         58.0 us         1000 Latency=121.53ps Throughput=8.42591Gi/s
BM_Impl<fixedLengthLoad, uint32_t>/524288_stddev                                                   0.732 us        0.652 us         1000 Latency=1.36709ps Throughput=83.6357Mi/s
BM_Impl<fixedLengthLoad, uint32_t>/524288_cv                                                        1.26 %          1.12 %          1000 Latency=1.12% Throughput=0.97%
BM_Impl<variableLengthLoad, uint32_t>/524288_mean                                                    116 us          116 us         1000 Latency=243.646ps Throughput=4.20289Gi/s
BM_Impl<variableLengthLoad, uint32_t>/524288_median                                                  116 us          116 us         1000 Latency=243.479ps Throughput=4.2057Gi/s
BM_Impl<variableLengthLoad, uint32_t>/524288_stddev                                                0.486 us        0.484 us         1000 Latency=1.01467ps Throughput=17.4973Mi/s
BM_Impl<variableLengthLoad, uint32_t>/524288_cv                                                     0.42 %          0.42 %          1000 Latency=0.42% Throughput=0.41%
BM_Impl<variableLengthLoadNaiveViaConditionalLoad, uint32_t>/524288_mean                             168 us          168 us         1000 Latency=351.597ps Throughput=2.91248Gi/s
BM_Impl<variableLengthLoadNaiveViaConditionalLoad, uint32_t>/524288_median                           168 us          168 us         1000 Latency=351.294ps Throughput=2.91494Gi/s
BM_Impl<variableLengthLoadNaiveViaConditionalLoad, uint32_t>/524288_stddev                         0.834 us        0.731 us         1000 Latency=1.5323ps Throughput=12.5236Mi/s
BM_Impl<variableLengthLoadNaiveViaConditionalLoad, uint32_t>/524288_cv                              0.50 %          0.44 %          1000 Latency=0.44% Throughput=0.42%
BM_Impl<variableLengthLoadNaiveViaStdCopy, uint32_t>/524288_mean                                     540 us          540 us         1000 Latency=1.10682ns Throughput=925.179Mi/s
BM_Impl<variableLengthLoadNaiveViaStdCopy, uint32_t>/524288_median                                   540 us          540 us         1000 Latency=1.10606ns Throughput=925.806Mi/s
BM_Impl<variableLengthLoadNaiveViaStdCopy, uint32_t>/524288_stddev                                  1.24 us         1.11 us         1000 Latency=2.31991ps Throughput=1.86664Mi/s
BM_Impl<variableLengthLoadNaiveViaStdCopy, uint32_t>/524288_cv                                      0.23 %          0.20 %          1000 Latency=0.20% Throughput=0.20%
BM_Impl<variableLengthLoadNaiveViaMemcpy, uint32_t>/524288_mean                                      540 us          540 us         1000 Latency=1.10669ns Throughput=925.28Mi/s
BM_Impl<variableLengthLoadNaiveViaMemcpy, uint32_t>/524288_median                                    540 us          540 us         1000 Latency=1.10608ns Throughput=925.789Mi/s
BM_Impl<variableLengthLoadNaiveViaMemcpy, uint32_t>/524288_stddev                                  0.928 us        0.790 us         1000 Latency=1.65743ps Throughput=1.34762Mi/s
BM_Impl<variableLengthLoadNaiveViaMemcpy, uint32_t>/524288_cv                                       0.17 %          0.15 %          1000 Latency=0.15% Throughput=0.15%
BM_Impl<fixedLengthLoadOr<fixedLengthLoad>, uint32_t>/524288_mean                                   58.0 us         58.0 us         1000 Latency=121.602ps Throughput=8.42126Gi/s
BM_Impl<fixedLengthLoadOr<fixedLengthLoad>, uint32_t>/524288_median                                 58.0 us         57.9 us         1000 Latency=121.509ps Throughput=8.42736Gi/s
BM_Impl<fixedLengthLoadOr<fixedLengthLoad>, uint32_t>/524288_stddev                                0.368 us        0.367 us         1000 Latency=787.756fs Throughput=51.6112Mi/s
BM_Impl<fixedLengthLoadOr<fixedLengthLoad>, uint32_t>/524288_cv                                     0.63 %          0.63 %          1000 Latency=0.63% Throughput=0.60%
BM_Impl<fixedLengthLoadOr<variableLengthLoad>, uint32_t>/524288_mean                                77.5 us         77.5 us         1000 Latency=162.545ps Throughput=6.29993Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoad>, uint32_t>/524288_median                              77.5 us         77.4 us         1000 Latency=162.403ps Throughput=6.30528Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoad>, uint32_t>/524288_stddev                             0.378 us        0.378 us         1000 Latency=811.311fs Throughput=30.3316Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoad>, uint32_t>/524288_cv                                  0.49 %          0.49 %          1000 Latency=0.49% Throughput=0.47%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaConditionalLoad>, uint32_t>/524288_mean         77.7 us         77.7 us         1000 Latency=162.858ps Throughput=6.28787Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaConditionalLoad>, uint32_t>/524288_median       77.6 us         77.6 us         1000 Latency=162.739ps Throughput=6.29228Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaConditionalLoad>, uint32_t>/524288_stddev      0.517 us        0.446 us         1000 Latency=958.27fs Throughput=35.4747Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaConditionalLoad>, uint32_t>/524288_cv           0.67 %          0.57 %          1000 Latency=0.57% Throughput=0.55%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaStdCopy>, uint32_t>/524288_mean                 77.9 us         77.9 us         1000 Latency=163.389ps Throughput=6.2674Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaStdCopy>, uint32_t>/524288_median               77.8 us         77.8 us         1000 Latency=163.221ps Throughput=6.27369Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaStdCopy>, uint32_t>/524288_stddev              0.380 us        0.379 us         1000 Latency=813.353fs Throughput=30.1957Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaStdCopy>, uint32_t>/524288_cv                   0.49 %          0.49 %          1000 Latency=0.49% Throughput=0.47%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_mean                  78.2 us         78.1 us         1000 Latency=163.884ps Throughput=6.24847Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_median                78.1 us         78.1 us         1000 Latency=163.788ps Throughput=6.252Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_stddev               0.367 us        0.367 us         1000 Latency=787.45fs Throughput=28.9555Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_cv                    0.47 %          0.47 %          1000 Latency=0.47% Throughput=0.45%

build-Clang17-release$ bench/librawspeed/adt/VariableLengthLoadBenchmark --benchmark_filter="uint[6][4]_t" --benchmark_repetitions=1000 --benchmark_min_time=1x --benchmark_display_aggregates_only=true
2024-01-07T03:11:22+03:00
Running bench/librawspeed/adt/VariableLengthLoadBenchmark
Run on (32 X 3400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 1.42, 1.14, 1.00
--------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------------------------
BM_Impl<fixedLengthLoad, uint64_t>/524288_mean                                                      40.2 us         40.2 us         1000 Latency=84.229ps Throughput=12.1596Gi/s
BM_Impl<fixedLengthLoad, uint64_t>/524288_median                                                    40.2 us         40.1 us         1000 Latency=84.1797ps Throughput=12.1645Gi/s
BM_Impl<fixedLengthLoad, uint64_t>/524288_stddev                                                   0.639 us        0.602 us         1000 Latency=1.26238ps Throughput=157.635Mi/s
BM_Impl<fixedLengthLoad, uint64_t>/524288_cv                                                        1.59 %          1.50 %          1000 Latency=1.50% Throughput=1.27%
BM_Impl<variableLengthLoad, uint64_t>/524288_mean                                                   58.4 us         58.4 us         1000 Latency=122.476ps Throughput=8.36111Gi/s
BM_Impl<variableLengthLoad, uint64_t>/524288_median                                                 58.4 us         58.4 us         1000 Latency=122.39ps Throughput=8.36671Gi/s
BM_Impl<variableLengthLoad, uint64_t>/524288_stddev                                                0.360 us        0.358 us         1000 Latency=769.389fs Throughput=49.8219Mi/s
BM_Impl<variableLengthLoad, uint64_t>/524288_cv                                                     0.62 %          0.61 %          1000 Latency=0.61% Throughput=0.58%
BM_Impl<variableLengthLoadNaiveViaConditionalLoad, uint64_t>/524288_mean                             137 us          137 us         1000 Latency=286.67ps Throughput=3.57213Gi/s
BM_Impl<variableLengthLoadNaiveViaConditionalLoad, uint64_t>/524288_median                           137 us          137 us         1000 Latency=286.607ps Throughput=3.57283Gi/s
BM_Impl<variableLengthLoadNaiveViaConditionalLoad, uint64_t>/524288_stddev                         0.650 us        0.649 us         1000 Latency=1.36204ps Throughput=16.3394Mi/s
BM_Impl<variableLengthLoadNaiveViaConditionalLoad, uint64_t>/524288_cv                              0.48 %          0.48 %          1000 Latency=0.48% Throughput=0.45%
BM_Impl<variableLengthLoadNaiveViaStdCopy, uint64_t>/524288_mean                                     251 us          251 us         1000 Latency=526.61ps Throughput=1.94453Gi/s
BM_Impl<variableLengthLoadNaiveViaStdCopy, uint64_t>/524288_median                                   251 us          251 us         1000 Latency=526.28ps Throughput=1.94573Gi/s
BM_Impl<variableLengthLoadNaiveViaStdCopy, uint64_t>/524288_stddev                                 0.756 us        0.677 us         1000 Latency=1.41881ps Throughput=5.28031Mi/s
BM_Impl<variableLengthLoadNaiveViaStdCopy, uint64_t>/524288_cv                                      0.30 %          0.27 %          1000 Latency=0.27% Throughput=0.27%
BM_Impl<variableLengthLoadNaiveViaMemcpy, uint64_t>/524288_mean                                      251 us          251 us         1000 Latency=526.693ps Throughput=1.94422Gi/s
BM_Impl<variableLengthLoadNaiveViaMemcpy, uint64_t>/524288_median                                    251 us          251 us         1000 Latency=526.364ps Throughput=1.94542Gi/s
BM_Impl<variableLengthLoadNaiveViaMemcpy, uint64_t>/524288_stddev                                  0.845 us        0.700 us         1000 Latency=1.46732ps Throughput=5.45387Mi/s
BM_Impl<variableLengthLoadNaiveViaMemcpy, uint64_t>/524288_cv                                       0.34 %          0.28 %          1000 Latency=0.28% Throughput=0.27%
BM_Impl<fixedLengthLoadOr<fixedLengthLoad>, uint64_t>/524288_mean                                   40.2 us         40.2 us         1000 Latency=84.2051ps Throughput=12.1615Gi/s
BM_Impl<fixedLengthLoadOr<fixedLengthLoad>, uint64_t>/524288_median                                 40.2 us         40.2 us         1000 Latency=84.2426ps Throughput=12.1554Gi/s
BM_Impl<fixedLengthLoadOr<fixedLengthLoad>, uint64_t>/524288_stddev                                0.310 us        0.309 us         1000 Latency=663.128fs Throughput=90.6958Mi/s
BM_Impl<fixedLengthLoadOr<fixedLengthLoad>, uint64_t>/524288_cv                                     0.77 %          0.77 %          1000 Latency=0.77% Throughput=0.73%
BM_Impl<fixedLengthLoadOr<variableLengthLoad>, uint64_t>/524288_mean                                58.1 us         58.1 us         1000 Latency=121.842ps Throughput=8.41092Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoad>, uint64_t>/524288_median                              58.2 us         58.1 us         1000 Latency=121.949ps Throughput=8.39693Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoad>, uint64_t>/524288_stddev                              1.35 us         1.35 us         1000 Latency=2.84149ps Throughput=289.884Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoad>, uint64_t>/524288_cv                                  2.33 %          2.33 %          1000 Latency=2.33% Throughput=3.37%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaConditionalLoad>, uint64_t>/524288_mean         39.3 us         39.3 us         1000 Latency=82.388ps Throughput=12.4299Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaConditionalLoad>, uint64_t>/524288_median       39.2 us         39.2 us         1000 Latency=82.2503ps Throughput=12.4498Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaConditionalLoad>, uint64_t>/524288_stddev      0.357 us        0.356 us         1000 Latency=763.755fs Throughput=105.899Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaConditionalLoad>, uint64_t>/524288_cv           0.91 %          0.91 %          1000 Latency=0.91% Throughput=0.83%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaStdCopy>, uint64_t>/524288_mean                 39.3 us         39.3 us         1000 Latency=82.3689ps Throughput=12.4334Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaStdCopy>, uint64_t>/524288_median               39.2 us         39.2 us         1000 Latency=82.3132ps Throughput=12.4403Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaStdCopy>, uint64_t>/524288_stddev              0.486 us        0.484 us         1000 Latency=1.01573ps Throughput=125.609Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaStdCopy>, uint64_t>/524288_cv                   1.24 %          1.23 %          1000 Latency=1.23% Throughput=0.99%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_mean                  39.7 us         39.7 us         1000 Latency=83.2796ps Throughput=12.2968Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_median                39.7 us         39.7 us         1000 Latency=83.215ps Throughput=12.3055Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_stddev               0.434 us        0.346 us         1000 Latency=742.698fs Throughput=101.825Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_cv                    1.09 %          0.87 %          1000 Latency=0.87% Throughput=0.81%

@codecov

codecov Bot commented Jan 7, 2024

Codecov Report

Attention: 36 lines in your changes are missing coverage. Please review.

Comparison is base (c2b2976) 58.90% compared to head (1286399) 59.11%.

| Files | Patch % | Lines |
|---|---|---|
| src/librawspeed/adt/VariableLengthLoad.h | 67.12% | 24 Missing ⚠️ |
| test/librawspeed/adt/VariableLengthLoadTest.cpp | 80.48% | 8 Missing ⚠️ |
| src/librawspeed/io/BitStream.h | 25.00% | 3 Missing ⚠️ |
| ...ch/librawspeed/adt/VariableLengthLoadBenchmark.cpp | 97.95% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #595      +/-   ##
===========================================
+ Coverage    58.90%   59.11%   +0.20%     
===========================================
  Files          247      250       +3     
  Lines        14609    14770     +161     
  Branches      1991     2000       +9     
===========================================
+ Hits          8606     8731     +125     
- Misses        5882     5918      +36     
  Partials       121      121              

| Flag | Coverage Δ |
|---|---|
| benchmarks | 10.23% <56.88%> (+0.51%) ⬆️ |
| integration | 46.41% <5.47%> (-0.48%) ⬇️ |
| linux | 56.77% <66.88%> (+0.09%) ⬆️ |
| macOS | 20.69% <73.83%> (+0.42%) ⬆️ |
| rpu_u | 46.41% <5.47%> (-0.48%) ⬇️ |
| unittests | 18.40% <49.70%> (+0.34%) ⬆️ |
| windows | ∅ <ø> (∅) |

Flags with carried forward coverage won't be shown.

@LebedevRI LebedevRI force-pushed the variable-length-load branch 2 times, most recently from 1fe0615 to f55a1f5 Compare January 7, 2024 01:28
…SequentialReplenisher::getInput()`

It is not obvious whether this snippet is the best solution to the problem,
but even if it is, it hides UB:
`Base::data + Base::pos` may create an out-of-bounds pointer,
even when the size (`bytesRemaining`) ends up being zero.
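
One way to avoid forming such a pointer is to clamp the position as an integer before doing any pointer arithmetic. The following is only a minimal sketch of that idea; `InputView`, `Span` and `remainingInput` are hypothetical names used for illustration and are not taken from the rawspeed sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the replenisher's view of its input buffer.
struct InputView {
  const std::uint8_t* data; // start of the buffer
  std::size_t size;         // total number of bytes in the buffer
  std::size_t pos;          // current read position, may exceed `size`
};

// Hypothetical result type: where the remaining input starts and how long it is.
struct Span {
  const std::uint8_t* begin;
  std::size_t length;
};

// Clamp the position on integers first. The pointer that is then formed is
// at most one-past-the-end of the buffer, which is always valid to create.
inline Span remainingInput(const InputView& in) {
  const std::size_t clampedPos = std::min(in.pos, in.size);
  return {in.data + clampedPos, in.size - clampedPos};
}
```

With a 4-byte buffer and `pos == 9`, this yields a zero-length span that starts one past the end, instead of computing `data + 9`.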
@LebedevRI LebedevRI force-pushed the variable-length-load branch 2 times, most recently from c1e753b to 003f519 Compare January 7, 2024 03:39
Basically identical to the original `variableLengthLoadNaiveViaMemcpy()`,
but without any UB, and, at least currently, when the access is known to be in bounds,
it gets optimized to a simple load.
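
A UB-free `memcpy()`-based load of this kind could look roughly like the sketch below. It is an illustration under assumptions, not the code from this PR: `variableLengthLoadSketch` and its signature are hypothetical, and bytes past the end of the buffer are assumed to read as zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: copy up to sizeof(T) bytes starting at `pos`.
// Bytes that would lie past the end of the buffer stay zero.
template <typename T>
T variableLengthLoadSketch(const std::uint8_t* data, std::size_t size,
                           std::size_t pos) {
  T value = 0;
  // Clamp as integers, so no out-of-bounds pointer is ever formed.
  const std::size_t clampedPos = std::min(pos, size);
  const std::size_t count = std::min(sizeof(T), size - clampedPos);
  if (count != 0) // also avoids calling memcpy() with a null pointer
    std::memcpy(&value, data + clampedPos, count);
  return value;
}
```

Whether a compiler folds this into a single load when the access is provably in bounds depends on the compiler and its version.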

While it is interesting to know how fast each implementation is on its own,
in practice we almost always deal with the fully-in-bounds case,
so it is good to know the theoretical maximum performance.

This is the actual work loop in `BitStreamForwardSequentialReplenisher`,
and it is what we should optimize.
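
Judging only by the benchmark name `fixedLengthLoadOr<...>`, that work loop presumably takes a fast fixed-length path when the whole load is in bounds and falls back to a variable-length load otherwise. The sketch below illustrates that shape; `fixedLengthLoadOrSketch` and the fallback callable's signature are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: fixed-length load when fully in bounds,
// otherwise defer to a caller-supplied variable-length fallback.
template <typename T, typename Fallback>
T fixedLengthLoadOrSketch(const std::uint8_t* data, std::size_t size,
                          std::size_t pos, Fallback fallback) {
  // Written so that neither the comparison nor the pointer can overflow.
  if (pos <= size && size - pos >= sizeof(T)) {
    T value;
    std::memcpy(&value, data + pos, sizeof(T)); // the common, fast path
    return value;
  }
  return fallback(data, size, pos); // the rare, near-the-end path
}
```

In this arrangement the cost of the fallback matters only near the end of the buffer, which would be consistent with the `fixedLengthLoadOr<...>` rows being much closer to each other than the standalone implementations.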

If we can make the load unconditional (after clamping the position),
then we can fix up the loaded value afterwards by bit-shifting it.

However, the resulting assembly is rather bad for loads wider than 8 bytes.

`inPos` may be past the end, and that is UB.
Funnily enough, this happens to be faster
(one less register is used, which previously had to be spilled and reloaded):

```
build-Clang17-release$ /repositories/googlebenchmark/tools/compare.py -a benchmarks bench/librawspeed/adt/VariableLengthLoadBenchmark{-old,} --benchmark_repetitions=100000 --benchmark_filter="fixedLengthLoadOr.*variableLengthLoadNaiveViaMemcpy.*uint[36][24]_t" --benchmark_min_time=1x --benchmark_min_warmup_time=0.5
RUNNING: bench/librawspeed/adt/VariableLengthLoadBenchmark-old --benchmark_repetitions=100000 --benchmark_filter=fixedLengthLoadOr.*variableLengthLoadNaiveViaMemcpy.*uint[36][24]_t --benchmark_min_time=1x --benchmark_min_warmup_time=0.5 --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpg4dbf8dr
2024-01-07T20:35:27+03:00
Running bench/librawspeed/adt/VariableLengthLoadBenchmark-old
Run on (32 X 3400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.82, 1.15, 1.32
-----------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_mean          116 us          116 us       100000 Latency=220.58ps Throughput=4.23082Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_median        116 us          116 us       100000 Latency=221.329ps Throughput=4.20787Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_stddev       4.36 us         4.35 us       100000 Latency=8.28896ps Throughput=236.877Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_cv           3.77 %          3.76 %        100000 Latency=3.76% Throughput=5.47%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_mean         39.0 us         39.0 us       100000 Latency=74.3106ps Throughput=12.534Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_median       38.9 us         38.9 us       100000 Latency=74.2531ps Throughput=12.5425Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_stddev      0.449 us        0.409 us       100000 Latency=780.46fs Throughput=109.895Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_cv           1.15 %          1.05 %        100000 Latency=1.05% Throughput=0.86%
RUNNING: bench/librawspeed/adt/VariableLengthLoadBenchmark --benchmark_repetitions=100000 --benchmark_filter=fixedLengthLoadOr.*variableLengthLoadNaiveViaMemcpy.*uint[36][24]_t --benchmark_min_time=1x --benchmark_min_warmup_time=0.5 --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmplk2hmw4g
2024-01-07T20:35:48+03:00
Running bench/librawspeed/adt/VariableLengthLoadBenchmark
Run on (32 X 3400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.87, 1.14, 1.31
-----------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_mean         78.3 us         78.3 us       100000 Latency=149.272ps Throughput=6.23925Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_median       78.2 us         78.2 us       100000 Latency=149.174ps Throughput=6.24321Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_stddev      0.430 us        0.398 us       100000 Latency=759.527fs Throughput=30.7237Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_cv           0.55 %          0.51 %        100000 Latency=0.51% Throughput=0.48%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_mean         39.7 us         39.7 us       100000 Latency=75.6285ps Throughput=12.3154Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_median       39.6 us         39.6 us       100000 Latency=75.6073ps Throughput=12.3179Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_stddev      0.417 us        0.397 us       100000 Latency=757.501fs Throughput=102.284Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_cv           1.05 %          1.00 %        100000 Latency=1.00% Throughput=0.81%
Comparing bench/librawspeed/adt/VariableLengthLoadBenchmark-old to bench/librawspeed/adt/VariableLengthLoadBenchmark
Benchmark                                                                                              Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_pvalue                 0.0000          0.0000      U Test, Repetitions: 100000 vs 100000
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_mean                  -0.3233         -0.3233           116            78           116            78
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_median                -0.3260         -0.3260           116            78           116            78
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_stddev                -0.9014         -0.9084             4             0             4             0
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_cv                    -0.8543         -0.8646             0             0             0             0
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_pvalue                 0.0000          0.0000      U Test, Repetitions: 100000 vs 100000
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_mean                  +0.0177         +0.0177            39            40            39            40
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_median                +0.0182         +0.0182            39            40            39            40
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_stddev                -0.0726         -0.0294             0             0             0             0
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_cv                    -0.0887         -0.0463             0             0             0             0
OVERALL_GEOMEAN                                                                                     -0.1698         -0.1697             0             0             0             0
```
@LebedevRI LebedevRI force-pushed the variable-length-load branch from 003f519 to 4734563 Compare January 7, 2024 18:23
`std::copy()` needs a pointer difference to compute the size,
and while `std::copy_n()` avoids that, it still does not
get optimized into a simple `memcpy()`
and ends up being conditional.
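
For reference, the three spellings being compared look roughly like this. The function names are hypothetical and the sketch only shows that the variants are interchangeable in result; how well each one optimizes is a compiler-specific observation from the commit message, not something this code demonstrates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// std::copy(): the length is implied by a pointer pair.
inline std::uint32_t loadViaStdCopy(const std::uint8_t* first,
                                    const std::uint8_t* last) {
  std::uint32_t value = 0;
  assert(last - first >= 0 &&
         static_cast<std::size_t>(last - first) <= sizeof(value));
  std::copy(first, last, reinterpret_cast<unsigned char*>(&value));
  return value;
}

// std::copy_n(): the length is passed explicitly.
inline std::uint32_t loadViaStdCopyN(const std::uint8_t* first,
                                     std::size_t count) {
  std::uint32_t value = 0;
  assert(count <= sizeof(value));
  std::copy_n(first, count, reinterpret_cast<unsigned char*>(&value));
  return value;
}

// std::memcpy(): the length is passed explicitly, as a byte count.
inline std::uint32_t loadViaMemcpy(const std::uint8_t* first,
                                   std::size_t count) {
  std::uint32_t value = 0;
  assert(count <= sizeof(value));
  std::memcpy(&value, first, count);
  return value;
}
```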

Three different implementations are plenty.
@LebedevRI LebedevRI force-pushed the variable-length-load branch from 4734563 to 1286399 Compare January 7, 2024 19:04
@LebedevRI LebedevRI merged commit 88cccd9 into darktable-org:develop Jan 7, 2024
@LebedevRI LebedevRI deleted the variable-length-load branch January 7, 2024 19:49