Variable length load meandering#595
Merged
Merged
Conversation
Member
LebedevRI
commented
Jan 7, 2024
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## develop #595 +/- ##
===========================================
+ Coverage 58.90% 59.11% +0.20%
===========================================
Files 247 250 +3
Lines 14609 14770 +161
Branches 1991 2000 +9
===========================================
+ Hits 8606 8731 +125
- Misses 5882 5918 +36
Partials 121 121
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
1fe0615 to
f55a1f5
Compare
…SequentialReplenisher::getInput()` It is not obvious if this snippet is the best solution to the problem, but even if it is, there's hidden UB there: `Base::data + Base::pos` may be creating OOB pointer, even if the size (`bytesRemaining`) will be zero then.
c1e753b to
003f519
Compare
Basically identical to the original `variableLengthLoadNaiveViaMemcpy()`, but without any UB, and at least currently, if we are known-inbounds, it does get optimized to a simple load.
While it is rather interesting to know how fast the whatever implementation is on itself, practically always we deal with the fully-in-bounds case, so knowing what's the theoretical maximum perf is good.
And that's the actual workloop in `BitStreamForwardSequentialReplenisher`, that is what we should optimize.
If we can make the load unconditional (after clamping the position), then we can later fix-up the loaded value, by bit-shifting it. Though the resulting assembly is rather bad for >8 byte loads.
`inPos` may be past-the end, and that's UB.
Funnily-enough, this happens to be faster:
(one less register used, that had to be spill-reloaded)
```
build-Clang17-release$ /repositories/googlebenchmark/tools/compare.py -a benchmarks bench/librawspeed/adt/VariableLengthLoadBenchmark{-old,} --benchmark_repetitions=100000 --benchmark_filter="fixedLengthLoadOr.*variableLengthLoadNaiveViaMemcpy.*uint[36][24]_t" --benchmark_min_time=1x --benchmark_min_warmup_time=0.5
RUNNING: bench/librawspeed/adt/VariableLengthLoadBenchmark-old --benchmark_repetitions=100000 --benchmark_filter=fixedLengthLoadOr.*variableLengthLoadNaiveViaMemcpy.*uint[36][24]_t --benchmark_min_time=1x --benchmark_min_warmup_time=0.5 --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpg4dbf8dr
2024-01-07T20:35:27+03:00
Running bench/librawspeed/adt/VariableLengthLoadBenchmark-old
Run on (32 X 3400 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x16)
L1 Instruction 32 KiB (x16)
L2 Unified 512 KiB (x16)
L3 Unified 32768 KiB (x2)
Load Average: 0.82, 1.15, 1.32
-----------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_mean 116 us 116 us 100000 Latency=220.58ps Throughput=4.23082Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_median 116 us 116 us 100000 Latency=221.329ps Throughput=4.20787Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_stddev 4.36 us 4.35 us 100000 Latency=8.28896ps Throughput=236.877Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_cv 3.77 % 3.76 % 100000 Latency=3.76% Throughput=5.47%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_mean 39.0 us 39.0 us 100000 Latency=74.3106ps Throughput=12.534Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_median 38.9 us 38.9 us 100000 Latency=74.2531ps Throughput=12.5425Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_stddev 0.449 us 0.409 us 100000 Latency=780.46fs Throughput=109.895Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_cv 1.15 % 1.05 % 100000 Latency=1.05% Throughput=0.86%
RUNNING: bench/librawspeed/adt/VariableLengthLoadBenchmark --benchmark_repetitions=100000 --benchmark_filter=fixedLengthLoadOr.*variableLengthLoadNaiveViaMemcpy.*uint[36][24]_t --benchmark_min_time=1x --benchmark_min_warmup_time=0.5 --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmplk2hmw4g
2024-01-07T20:35:48+03:00
Running bench/librawspeed/adt/VariableLengthLoadBenchmark
Run on (32 X 3400 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x16)
L1 Instruction 32 KiB (x16)
L2 Unified 512 KiB (x16)
L3 Unified 32768 KiB (x2)
Load Average: 0.87, 1.14, 1.31
-----------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_mean 78.3 us 78.3 us 100000 Latency=149.272ps Throughput=6.23925Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_median 78.2 us 78.2 us 100000 Latency=149.174ps Throughput=6.24321Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_stddev 0.430 us 0.398 us 100000 Latency=759.527fs Throughput=30.7237Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_cv 0.55 % 0.51 % 100000 Latency=0.51% Throughput=0.48%
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_mean 39.7 us 39.7 us 100000 Latency=75.6285ps Throughput=12.3154Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_median 39.6 us 39.6 us 100000 Latency=75.6073ps Throughput=12.3179Gi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_stddev 0.417 us 0.397 us 100000 Latency=757.501fs Throughput=102.284Mi/s
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_cv 1.05 % 1.00 % 100000 Latency=1.00% Throughput=0.81%
Comparing bench/librawspeed/adt/VariableLengthLoadBenchmark-old to bench/librawspeed/adt/VariableLengthLoadBenchmark
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_pvalue 0.0000 0.0000 U Test, Repetitions: 100000 vs 100000
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_mean -0.3233 -0.3233 116 78 116 78
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_median -0.3260 -0.3260 116 78 116 78
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_stddev -0.9014 -0.9084 4 0 4 0
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint32_t>/524288_cv -0.8543 -0.8646 0 0 0 0
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_pvalue 0.0000 0.0000 U Test, Repetitions: 100000 vs 100000
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_mean +0.0177 +0.0177 39 40 39 40
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_median +0.0182 +0.0182 39 40 39 40
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_stddev -0.0726 -0.0294 0 0 0 0
BM_Impl<fixedLengthLoadOr<variableLengthLoadNaiveViaMemcpy>, uint64_t>/524288_cv -0.0887 -0.0463 0 0 0 0
OVERALL_GEOMEAN -0.1698 -0.1697 0 0 0 0
```
003f519 to
4734563
Compare
`std::copy()` needs a pointer difference to compute the size, and while `std::copy_n()` helps with that, it still does not get optimized into a simple `memcpy()`, but ends up being conditional. 3 different implementations is plenty enough.
4734563 to
1286399
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.