Variable length load meandering by LebedevRI · Pull Request #595 · darktable-org/rawspeed

LebedevRI · 2024-01-07T00:13:24Z

build-Clang17-release$ bench/librawspeed/adt/VariableLengthLoadBenchmark --benchmark_filter="uint[3][2]_t" --benchmark_repetitions=1000 --benchmark_min_time=1x --benchmark_display_aggregates_only=true
2024-01-07T03:11:09+03:00
Running bench/librawspeed/adt/VariableLengthLoadBenchmark
Run on (32 X 3400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 1.45, 1.13, 1.00
--------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------------------------
BM_Impl<fixedLengthLoad, uint32_t>/524288_mean                                                      58.0 us         58.0 us         1000 Latency=121.66ps Throughput=8.41783Gi/s
BM_Impl<fixedLengthLoad, uint32_t>/524288_median                                                    58.0 us         58.0 us         1000 Latency=121.53ps Throughput=8.42591Gi/s
BM_Impl<fixedLengthLoad, uint32_t>/524288_stddev                                                   0.732 us        0.652 us         1000 Latency=1.36709ps Throughput=83.6357Mi/s
BM_Impl<fixedLengthLoad, uint32_t>/524288_cv                                                        1.26 %          1.12 %          1000 Latency=1.12% Throughput=0.97%
BM_Impl<variableLengthLoad, uint32_t>/524288_mean                                                    116 us          116 us         1000 Latency=243.646ps Throughput=4.20289Gi/s
BM_Impl<variableLengthLoad, uint32_t>/524288_median                                                  116 us          116 us         1000 Latency=243.479ps Throughput=4.2057Gi/s
BM_Impl<variableLengthLoad, uint32_t>/524288_stddev                                                0.486 us        0.484 us         1000 Latency=1.01467ps Throughput=17.4973Mi/s
BM_Impl<variableLengthLoad, uint32_t>/524288_cv                                                     0.42 %          0.42 %          1000 Latency=0.42% Throughput=0.41%
BM_Impl<variableLengthLoadNaiveViaConditionalLoad, uint32_t>/524288_mean                             168 us          168 us         1000 Latency=351.597ps Throughput=2.91248Gi/s
BM_Impl<variableLengthLoadNaiveViaConditionalLoad, uint32_t>/524288_median                           168 us          168 us         1000 Latency=351.294ps Throughput=2.91494Gi/s
BM_Impl<variableLengthLoadNaiveViaConditionalLoad, uint32_t>/524288_stddev                         0.834 us        0.731 us         1000 Latency=1.5323ps Throughput=12.5236Mi/s
BM_Impl<variableLengthLoadNaiveViaConditionalLoad, uint32_t>/524288_cv                              0.50 %          0.44 %          1000 Latency=0.44% Throughput=0.42%
BM_Impl<variableLengthLoadNaiveViaStdCopy, uint32_t>/524288_mean                                     540 us          540 us         1000 Latency=1.10682ns Throughput=925.179Mi/s
BM_Impl<variableLengthLoadNaiveViaStdCopy, uint32_t>/524288_median                                   540 us          540 us         1000 Latency=1.10606ns Throughput=925.806Mi/s
BM_Impl<variableLengthLoadNaiveViaStdCopy, uint32_t>/524288_stddev                                  1.24 us         1.11 us         1000 Latency=2.31991ps Throughput=1.86664Mi/s
BM_Impl<variableLengthLoadNaiveViaStdCopy, uint32_t>/524288_cv                                      0.23 %          0.20 %          1000 Latency=0.20% Throughput=0.20%
BM_Impl<variableLengthLoadNaiveViaMemcpy, uint32_t>/524288_mean                                      540 us          540 us         1000 Latency=1.10669ns Throughput=925.28Mi/s
BM_Impl<variableLengthLoadNaiveViaMemcpy, uint32_t>/524288_median                                    540 us          540 us         1000 Latency=1.10608ns Throughput=925.789Mi/s
BM_Impl<variableLengthLoadNaiveViaMemcpy, uint32_t>/524288_stddev                                  0.928 us        0.790 us         1000 Latency=1.65743ps Throughput=1.34762Mi/s
BM_Impl<variableLengthLoadNaiveViaMemcpy, uint32_t>/524288_cv                                       0.17 %          0.15 %          1000 Latency=0.15% Throughput=0.15%
BM_Impl<fixedLengthLoadOr<fixedLengthLoad>, uint32_t>/524288_mean                                   58.0 us         58.0 us         1000 Latency=121.602ps Throughput=8.42126Gi/s
BM_Impl<fixedLengthLoadOr<fixedLengthLoad>, uint32_t>/524288_median                                 58.0 us         57.9 us         1000 Latency=121.509ps Throughput=8.42736Gi/s
BM_Impl<fixedLengthLoadOr<fixedLengthLoad>, uint32_t>/524288_stddev                                0.368 us        0.367 us         1000 Latency=787.756fs Throughput=51.6112Mi/s
BM_Impl<fixedLengthLoadOr<fixedLengthLoad>, uint32_t>/524288_cv                                     0.63 %          0.63 %          1000 Latency=0.63% Throughput=0.60%
BM_Impl<fixedLengthLoadOr<variableLengthLoad>, uint32_t>/524288_mean                                77.5 us         77.5 us         1000 Latency=162.545ps Throughput=6.29993Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoad>, uint32_t>/524288_median                              77.5 us         77.4 us         1000 Latency=162.403ps Throughput=6.30528Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoad>, uint32_t>/524288_stddev                             0.378 us        0.378 us         1000 Latency=811.311fs Throughput=30.3316Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoad>, uint32_t>/524288_cv                                  0.49 %          0.49 %          1000 Latency=0.49% Throughput=0.47%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaConditionalLoad>, uint32_t>/524288_mean         77.7 us         77.7 us         1000 Latency=162.858ps Throughput=6.28787Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaConditionalLoad>, uint32_t>/524288_median       77.6 us         77.6 us         1000 Latency=162.739ps Throughput=6.29228Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaConditionalLoad>, uint32_t>/524288_stddev      0.517 us        0.446 us         1000 Latency=958.27fs Throughput=35.4747Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaConditionalLoad>, uint32_t>/524288_cv           0.67 %          0.57 %          1000 Latency=0.57% Throughput=0.55%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaStdCopy>, uint32_t>/524288_mean                 77.9 us         77.9 us         1000 Latency=163.389ps Throughput=6.2674Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaStdCopy>, uint32_t>/524288_median               77.8 us         77.8 us         1000 Latency=163.221ps Throughput=6.27369Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaStdCopy>, uint32_t>/524288_stddev              0.380 us        0.379 us         1000 Latency=813.353fs Throughput=30.1957Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaStdCopy>, uint32_t>/524288_cv                   0.49 %          0.49 %          1000 Latency=0.49% Throughput=0.47%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_mean                  78.2 us         78.1 us         1000 Latency=163.884ps Throughput=6.24847Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_median                78.1 us         78.1 us         1000 Latency=163.788ps Throughput=6.252Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_stddev               0.367 us        0.367 us         1000 Latency=787.45fs Throughput=28.9555Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_cv                    0.47 %          0.47 %          1000 Latency=0.47% Throughput=0.45%

build-Clang17-release$ bench/librawspeed/adt/VariableLengthLoadBenchmark --benchmark_filter="uint[6][4]_t" --benchmark_repetitions=1000 --benchmark_min_time=1x --benchmark_display_aggregates_only=true
2024-01-07T03:11:22+03:00
Running bench/librawspeed/adt/VariableLengthLoadBenchmark
Run on (32 X 3400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 1.42, 1.14, 1.00
--------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------------------------
BM_Impl<fixedLengthLoad, uint64_t>/524288_mean                                                      40.2 us         40.2 us         1000 Latency=84.229ps Throughput=12.1596Gi/s
BM_Impl<fixedLengthLoad, uint64_t>/524288_median                                                    40.2 us         40.1 us         1000 Latency=84.1797ps Throughput=12.1645Gi/s
BM_Impl<fixedLengthLoad, uint64_t>/524288_stddev                                                   0.639 us        0.602 us         1000 Latency=1.26238ps Throughput=157.635Mi/s
BM_Impl<fixedLengthLoad, uint64_t>/524288_cv                                                        1.59 %          1.50 %          1000 Latency=1.50% Throughput=1.27%
BM_Impl<variableLengthLoad, uint64_t>/524288_mean                                                   58.4 us         58.4 us         1000 Latency=122.476ps Throughput=8.36111Gi/s
BM_Impl<variableLengthLoad, uint64_t>/524288_median                                                 58.4 us         58.4 us         1000 Latency=122.39ps Throughput=8.36671Gi/s
BM_Impl<variableLengthLoad, uint64_t>/524288_stddev                                                0.360 us        0.358 us         1000 Latency=769.389fs Throughput=49.8219Mi/s
BM_Impl<variableLengthLoad, uint64_t>/524288_cv                                                     0.62 %          0.61 %          1000 Latency=0.61% Throughput=0.58%
BM_Impl<variableLengthLoadNaiveViaConditionalLoad, uint64_t>/524288_mean                             137 us          137 us         1000 Latency=286.67ps Throughput=3.57213Gi/s
BM_Impl<variableLengthLoadNaiveViaConditionalLoad, uint64_t>/524288_median                           137 us          137 us         1000 Latency=286.607ps Throughput=3.57283Gi/s
BM_Impl<variableLengthLoadNaiveViaConditionalLoad, uint64_t>/524288_stddev                         0.650 us        0.649 us         1000 Latency=1.36204ps Throughput=16.3394Mi/s
BM_Impl<variableLengthLoadNaiveViaConditionalLoad, uint64_t>/524288_cv                              0.48 %          0.48 %          1000 Latency=0.48% Throughput=0.45%
BM_Impl<variableLengthLoadNaiveViaStdCopy, uint64_t>/524288_mean                                     251 us          251 us         1000 Latency=526.61ps Throughput=1.94453Gi/s
BM_Impl<variableLengthLoadNaiveViaStdCopy, uint64_t>/524288_median                                   251 us          251 us         1000 Latency=526.28ps Throughput=1.94573Gi/s
BM_Impl<variableLengthLoadNaiveViaStdCopy, uint64_t>/524288_stddev                                 0.756 us        0.677 us         1000 Latency=1.41881ps Throughput=5.28031Mi/s
BM_Impl<variableLengthLoadNaiveViaStdCopy, uint64_t>/524288_cv                                      0.30 %          0.27 %          1000 Latency=0.27% Throughput=0.27%
BM_Impl<variableLengthLoadNaiveViaMemcpy, uint64_t>/524288_mean                                      251 us          251 us         1000 Latency=526.693ps Throughput=1.94422Gi/s
BM_Impl<variableLengthLoadNaiveViaMemcpy, uint64_t>/524288_median                                    251 us          251 us         1000 Latency=526.364ps Throughput=1.94542Gi/s
BM_Impl<variableLengthLoadNaiveViaMemcpy, uint64_t>/524288_stddev                                  0.845 us        0.700 us         1000 Latency=1.46732ps Throughput=5.45387Mi/s
BM_Impl<variableLengthLoadNaiveViaMemcpy, uint64_t>/524288_cv                                       0.34 %          0.28 %          1000 Latency=0.28% Throughput=0.27%
BM_Impl<fixedLengthLoadOr<fixedLengthLoad>, uint64_t>/524288_mean                                   40.2 us         40.2 us         1000 Latency=84.2051ps Throughput=12.1615Gi/s
BM_Impl<fixedLengthLoadOr<fixedLengthLoad>, uint64_t>/524288_median                                 40.2 us         40.2 us         1000 Latency=84.2426ps Throughput=12.1554Gi/s
BM_Impl<fixedLengthLoadOr<fixedLengthLoad>, uint64_t>/524288_stddev                                0.310 us        0.309 us         1000 Latency=663.128fs Throughput=90.6958Mi/s
BM_Impl<fixedLengthLoadOr<fixedLengthLoad>, uint64_t>/524288_cv                                     0.77 %          0.77 %          1000 Latency=0.77% Throughput=0.73%
BM_Impl<fixedLengthLoadOr<variableLengthLoad>, uint64_t>/524288_mean                                58.1 us         58.1 us         1000 Latency=121.842ps Throughput=8.41092Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoad>, uint64_t>/524288_median                              58.2 us         58.1 us         1000 Latency=121.949ps Throughput=8.39693Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoad>, uint64_t>/524288_stddev                              1.35 us         1.35 us         1000 Latency=2.84149ps Throughput=289.884Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoad>, uint64_t>/524288_cv                                  2.33 %          2.33 %          1000 Latency=2.33% Throughput=3.37%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaConditionalLoad>, uint64_t>/524288_mean         39.3 us         39.3 us         1000 Latency=82.388ps Throughput=12.4299Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaConditionalLoad>, uint64_t>/524288_median       39.2 us         39.2 us         1000 Latency=82.2503ps Throughput=12.4498Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaConditionalLoad>, uint64_t>/524288_stddev      0.357 us        0.356 us         1000 Latency=763.755fs Throughput=105.899Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaConditionalLoad>, uint64_t>/524288_cv           0.91 %          0.91 %          1000 Latency=0.91% Throughput=0.83%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaStdCopy>, uint64_t>/524288_mean                 39.3 us         39.3 us         1000 Latency=82.3689ps Throughput=12.4334Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaStdCopy>, uint64_t>/524288_median               39.2 us         39.2 us         1000 Latency=82.3132ps Throughput=12.4403Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaStdCopy>, uint64_t>/524288_stddev              0.486 us        0.484 us         1000 Latency=1.01573ps Throughput=125.609Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaStdCopy>, uint64_t>/524288_cv                   1.24 %          1.23 %          1000 Latency=1.23% Throughput=0.99%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_mean                  39.7 us         39.7 us         1000 Latency=83.2796ps Throughput=12.2968Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_median                39.7 us         39.7 us         1000 Latency=83.215ps Throughput=12.3055Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_stddev               0.434 us        0.346 us         1000 Latency=742.698fs Throughput=101.825Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_cv                    1.09 %          0.87 %          1000 Latency=0.87% Throughput=0.81%

codecov · 2024-01-07T00:34:58Z

Codecov Report

Attention: 36 lines in your changes are missing coverage. Please review.

Comparison is base (c2b2976) 58.90% compared to head (1286399) 59.11%.

Files	Patch %	Lines
src/librawspeed/adt/VariableLengthLoad.h	67.12%	24 Missing ⚠️
test/librawspeed/adt/VariableLengthLoadTest.cpp	80.48%	8 Missing ⚠️
src/librawspeed/io/BitStream.h	25.00%	3 Missing ⚠️
...ch/librawspeed/adt/VariableLengthLoadBenchmark.cpp	97.95%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #595      +/-   ##
===========================================
+ Coverage    58.90%   59.11%   +0.20%     
===========================================
  Files          247      250       +3     
  Lines        14609    14770     +161     
  Branches      1991     2000       +9     
===========================================
+ Hits          8606     8731     +125     
- Misses        5882     5918      +36     
  Partials       121      121

Flag	Coverage Δ
benchmarks	`10.23% <56.88%> (+0.51%)`	⬆️
integration	`46.41% <5.47%> (-0.48%)`	⬇️
linux	`56.77% <66.88%> (+0.09%)`	⬆️
macOS	`20.69% <73.83%> (+0.42%)`	⬆️
rpu_u	`46.41% <5.47%> (-0.48%)`	⬇️
unittests	`18.40% <49.70%> (+0.34%)`	⬆️
windows	`∅ <ø> (∅)`


…SequentialReplenisher::getInput()`

It is not obvious whether this snippet is the best solution to the problem, but even if it is, there is hidden UB in it: `Base::data + Base::pos` may create an out-of-bounds pointer, even though the size (`bytesRemaining`) will be zero in that case.
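The hazard described above can be illustrated with a minimal sketch. It is not the code from this PR: `InputView` and `remainingInput()` are hypothetical names, and the only assumption is that the position may run past the end of the buffer. Clamping the position before doing pointer arithmetic keeps the resulting pointer at most one-past-the-end, which is valid to form.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a (pointer, length) view of the unread input.
struct InputView {
  const std::uint8_t* begin;
  std::size_t size;
};

// `data`/`size` describe the whole buffer; `pos` may exceed `size`.
// Clamping `pos` first means `data + clamped` is at most one-past-the-end,
// which is a valid pointer to form (but not to dereference).
inline InputView remainingInput(const std::uint8_t* data, std::size_t size,
                                std::size_t pos) {
  const std::size_t clamped = std::min(pos, size);
  return {data + clamped, size - clamped};
}
```

With an unclamped `data + pos`, the pointer arithmetic itself would be undefined as soon as `pos > size`, regardless of whether the pointer is ever dereferenced.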


`inPos` may be past the end, and that is UB. Funnily enough, this happens to be faster (one fewer register is used, which otherwise had to be spilled and reloaded):

```
build-Clang17-release$ /repositories/googlebenchmark/tools/compare.py -a benchmarks bench/librawspeed/adt/VariableLengthLoadBenchmark{-old,} --benchmark_repetitions=100000 --benchmark_filter="fixedLengthLoadOr.*variableLengthLoadNaiveViaMemcpy.*uint[36][24]_t" --benchmark_min_time=1x --benchmark_min_warmup_time=0.5
RUNNING: bench/librawspeed/adt/VariableLengthLoadBenchmark-old --benchmark_repetitions=100000 --benchmark_filter=fixedLengthLoadOr.*variableLengthLoadNaiveViaMemcpy.*uint[36][24]_t --benchmark_min_time=1x --benchmark_min_warmup_time=0.5 --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpg4dbf8dr
2024-01-07T20:35:27+03:00
Running bench/librawspeed/adt/VariableLengthLoadBenchmark-old
Run on (32 X 3400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.82, 1.15, 1.32
-------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                       Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------------------
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_mean            116 us          116 us       100000 Latency=220.58ps Throughput=4.23082Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_median          116 us          116 us       100000 Latency=221.329ps Throughput=4.20787Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_stddev         4.36 us         4.35 us       100000 Latency=8.28896ps Throughput=236.877Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_cv              3.77 %          3.76 %       100000 Latency=3.76% Throughput=5.47%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_mean           39.0 us         39.0 us       100000 Latency=74.3106ps Throughput=12.534Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_median         38.9 us         38.9 us       100000 Latency=74.2531ps Throughput=12.5425Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_stddev        0.449 us        0.409 us       100000 Latency=780.46fs Throughput=109.895Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_cv              1.15 %          1.05 %       100000 Latency=1.05% Throughput=0.86%
RUNNING: bench/librawspeed/adt/VariableLengthLoadBenchmark --benchmark_repetitions=100000 --benchmark_filter=fixedLengthLoadOr.*variableLengthLoadNaiveViaMemcpy.*uint[36][24]_t --benchmark_min_time=1x --benchmark_min_warmup_time=0.5 --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmplk2hmw4g
2024-01-07T20:35:48+03:00
Running bench/librawspeed/adt/VariableLengthLoadBenchmark
Run on (32 X 3400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.87, 1.14, 1.31
-------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                       Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------------------
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_mean           78.3 us         78.3 us       100000 Latency=149.272ps Throughput=6.23925Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_median         78.2 us         78.2 us       100000 Latency=149.174ps Throughput=6.24321Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_stddev        0.430 us        0.398 us       100000 Latency=759.527fs Throughput=30.7237Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_cv              0.55 %          0.51 %       100000 Latency=0.51% Throughput=0.48%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_mean           39.7 us         39.7 us       100000 Latency=75.6285ps Throughput=12.3154Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_median         39.6 us         39.6 us       100000 Latency=75.6073ps Throughput=12.3179Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_stddev        0.417 us        0.397 us       100000 Latency=757.501fs Throughput=102.284Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_cv              1.05 %          1.00 %       100000 Latency=1.00% Throughput=0.81%
Comparing bench/librawspeed/adt/VariableLengthLoadBenchmark-old to bench/librawspeed/adt/VariableLengthLoadBenchmark
Benchmark                                                                                       Time             CPU        Time Old        Time New         CPU Old         CPU New
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_pvalue          0.0000          0.0000      U Test, Repetitions: 100000 vs 100000
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_mean           -0.3233         -0.3233             116              78             116              78
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_median         -0.3260         -0.3260             116              78             116              78
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_stddev         -0.9014         -0.9084               4               0               4               0
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_cv             -0.8543         -0.8646               0               0               0               0
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_pvalue          0.0000          0.0000      U Test, Repetitions: 100000 vs 100000
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_mean           +0.0177         +0.0177              39              40              39              40
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_median         +0.0182         +0.0182              39              40              39              40
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_stddev         -0.0726         -0.0294               0               0               0               0
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_cv             -0.0887         -0.0463               0               0               0               0
OVERALL_GEOMEAN                                                                              -0.1698         -0.1697               0               0               0               0
```


LebedevRI force-pushed the variable-length-load branch 2 times, most recently from 1fe0615 to f55a1f5 · January 7, 2024 01:28

LebedevRI force-pushed the variable-length-load branch 2 times, most recently from c1e753b to 003f519 · January 7, 2024 03:39

LebedevRI added 10 commits January 7, 2024 20:03

Add basic benchmark infra for the VariableLengthLoad

25f9add

Add test infra for VariableLengthLoad

75c1741

Add alternative, simpler variableLengthLoadNaiveViaConditionalLoad implementation

3ac6beb

Add alternative, variableLengthLoadNaiveViaStdCopy

03e9aea

Basically identical to the original `variableLengthLoadNaiveViaMemcpy()`, but without any UB. At least currently, when the access is known to be in bounds, it does get optimized to a simple load.
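For readers unfamiliar with the idea, here is a minimal sketch of what a naive variable-length load could look like. It is an illustration only, not the rawspeed implementation: the name `naiveVariableLengthLoad` and its signature are assumptions. The output is zero-filled, and only the bytes that actually lie inside the buffer are copied.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: read up to N bytes starting at `pos`; any byte that
// would fall past the end of the buffer is left as zero.
template <std::size_t N>
std::array<std::uint8_t, N> naiveVariableLengthLoad(const std::uint8_t* data,
                                                    std::size_t size,
                                                    std::size_t pos) {
  std::array<std::uint8_t, N> out{};  // value-initialized, i.e. all zeros
  const std::size_t start = std::min(pos, size);        // never past the end
  const std::size_t count = std::min(N, size - start);  // bytes available
  std::copy_n(data + start, count, out.begin());
  return out;
}
```

Because `count` is a run-time value, the copy is inherently variable-length; whether a compiler turns it into a single load when the bounds are known depends on the optimizer.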

GCC: disable -Wstringop-overflow=/-Warray-bounds=, produces bogus warnings

c508028

Also benchmark the perf of the best-case path

4429eb8

While it is interesting to know how fast each implementation is on its own, in practice we almost always deal with the fully in-bounds case, so it is good to know the theoretical maximum performance.

Also benchmark the perf of the actual case

1f5d4f0

This is the actual work loop in `BitStreamForwardSequentialReplenisher`, and it is what we should optimize.
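Judging only from the benchmark name `fixedLengthLoadOr<...>`, the measured pattern appears to be "fixed-length load when fully in bounds, otherwise fall back to a variable-length load". The sketch below shows that shape under this assumption; `load32FastOrSlow` and `slowLoad32` are hypothetical names, and the result is in host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical slow path: byte by byte, leaving bytes past the end as zero.
inline std::uint32_t slowLoad32(const std::uint8_t* data, std::size_t size,
                                std::size_t pos) {
  std::uint8_t tmp[4] = {0, 0, 0, 0};
  for (std::size_t i = 0; i < 4 && pos < size && i < size - pos; ++i)
    tmp[i] = data[pos + i];
  std::uint32_t v;
  std::memcpy(&v, tmp, sizeof(v));
  return v;
}

// Fast path first: when all four bytes are in bounds, a fixed-size memcpy
// is used, which compilers typically lower to a single load.
inline std::uint32_t load32FastOrSlow(const std::uint8_t* data,
                                      std::size_t size, std::size_t pos) {
  if (pos <= size && size - pos >= sizeof(std::uint32_t)) {
    std::uint32_t v;
    std::memcpy(&v, data + pos, sizeof(v));
    return v;
  }
  return slowLoad32(data, size, pos);
}
```

Since the in-bounds case dominates in practice, the cost of the fallback only matters near the end of the input.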

Implement proper (best?) variableLengthLoad

d4b932c

If we can make the load unconditional (after clamping the position), then we can later fix up the loaded value by bit-shifting it. The resulting assembly is rather bad for loads wider than 8 bytes, though.
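The clamp-then-shift idea can be sketched as follows for a 4-byte load. This is an illustration under stated assumptions, not the code from the commit: `clampedLoadLE32` and `fromLittleEndian` are hypothetical names, the buffer is assumed to hold at least 4 bytes, and the result is defined as the little-endian interpretation of the bytes so that the sketch behaves the same on any host.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Interpret 4 bytes as a little-endian integer, independent of the host.
inline std::uint32_t fromLittleEndian(const std::uint8_t* p) {
  return std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
         (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);
}

// Assumption of this sketch: size >= 4.
inline std::uint32_t clampedLoadLE32(const std::uint8_t* data,
                                     std::size_t size, std::size_t pos) {
  assert(size >= 4);
  // Clamp so that the 4-byte read always ends inside the buffer.
  const std::size_t clampedPos = std::min(pos, size - 4);
  std::uint8_t bytes[4];
  std::memcpy(bytes, data + clampedPos, 4);  // unconditional, in-bounds load
  // Number of loaded bytes that precede `pos` and must be discarded.
  const std::size_t skipped = std::min<std::size_t>(pos - clampedPos, 4);
  // Widen to 64 bits so that a shift by 32 (everything skipped) is defined.
  const std::uint64_t wide = fromLittleEndian(bytes);
  return static_cast<std::uint32_t>(wide >> (8 * skipped));
}
```

Shifting right discards the bytes before `pos` and shifts in zeros for the bytes past the end, so no branch on the position is needed after the clamp.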

Make variableLengthLoad() actually work on big-endian machines

e26043e

LebedevRI force-pushed the variable-length-load branch from 003f519 to 4734563 · January 7, 2024 18:23

Drop variableLengthLoadNaiveViaStdCopy()

1286399

`std::copy()` needs a pointer difference to compute the size, and while `std::copy_n()` helps with that, it still does not get optimized into a simple `memcpy()`, but ends up being conditional. Three different implementations are plenty.

LebedevRI force-pushed the variable-length-load branch from 4734563 to 1286399 · January 7, 2024 19:04

LebedevRI merged commit 88cccd9 into darktable-org:develop Jan 7, 2024

LebedevRI deleted the variable-length-load branch January 7, 2024 19:49

LebedevRI merged 12 commits into darktable-org:develop from LebedevRI:variable-length-load
