This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Prefetch instructions #352

Closed
Maratyszcza wants to merge 1 commit
Conversation

@Maratyszcza
Contributor

Maratyszcza commented Sep 19, 2020

Introduction

Most modern instruction sets include prefetch instructions. These instructions have no architecturally visible effects, but provide a hint to the processor to pre-load soon-to-be-used data from memory into cache. Since these instructions have only side effects, they don't directly affect SIMD registers. However, their usage is closely associated with SIMD processing (e.g. on x86 they were added in SSE, and on ARM -- in ARMv7, together with NEON), thus I suggest they should be part of the specification.

Applications
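
The examples in this section interleave SIMD operations with prefetches. As a minimal illustration of the pattern, here is a C sketch (not from the proposal itself), assuming clang's wasm_simd128.h intrinsics; __builtin_prefetch is the generic clang builtin standing in for the proposed prefetch.t, and the look-ahead distance is arbitrary:

#include <stddef.h>
#include <stdint.h>
#include <wasm_simd128.h>

/* Sum two int32 streams with 128-bit SIMD, hinting the prefetcher a few
   cache lines ahead of the current position. Prefetching past the end of
   an array is harmless: prefetch never faults. */
void add_arrays(int32_t *dst, const int32_t *a, const int32_t *b, size_t n) {
  for (size_t i = 0; i + 4 <= n; i += 4) {
    __builtin_prefetch(&a[i + 64], 0, 3);  /* 256 B (4 cache lines) ahead */
    __builtin_prefetch(&b[i + 64], 0, 3);
    v128_t va = wasm_v128_load(&a[i]);
    v128_t vb = wasm_v128_load(&b[i]);
    wasm_v128_store(&dst[i], wasm_i32x4_add(va, vb));
  }
}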

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations are not required to follow the same code generation patterns.

x86/x86-64 processors with SSE instruction set

  • prefetch.t
    • prefetch.t(mem) is lowered to PREFETCHT0 [mem]
  • prefetch.nt
    • prefetch.nt(mem) is lowered to PREFETCHNTA [mem]

ARM64 processors

  • prefetch.t
    • prefetch.t(mem) is lowered to PRFM PLDL1KEEP, [Xmem]
  • prefetch.nt
    • prefetch.nt(mem) is lowered to PRFM PLDL1STRM, [Xmem]

ARMv7 processors

  • prefetch.t
    • prefetch.t(mem) is lowered to PLD [Rmem]
  • prefetch.nt
    • prefetch.nt(mem) is lowered to PLD [Rmem]

@tlively
Member

tlively commented Sep 19, 2020

I would be interested to hear others' thoughts, but IMO this seems to fall outside the scope of this proposal. It's definitely something we should consider, but I think we should continue the discussion of this as a separate proposal at WebAssembly/design#1364.

@penzn
Contributor

penzn commented Sep 19, 2020

I agree, this would make much more sense "upstream".

@Maratyszcza
Contributor Author

IMO it makes the most sense to add Prefetch instructions as part of the SIMD specification, for two reasons:

  1. Like SIMD, Prefetch is a performance feature. Prefetch is typically used in the same codebases that use SIMD instructions (note that the examples in the Applications section of the proposal interleave SIMD operations and Prefetch).
  2. Prefetch solves a problem inherent to SIMD computations: the memory subsystem not keeping up with data processing. Scalar computations almost never face this issue, because the compute operations themselves become the bottleneck, and the out-of-order engines in modern processors are sufficient to keep the scalar units fed with the smaller volume of in-memory data. Thus, there is little incentive to introduce Prefetch instructions outside of SIMD, as evidenced by both x86 and ARM introducing their Prefetch and SIMD instructions simultaneously.

@tlively
Member

tlively commented Sep 20, 2020

I'm sympathetic to those arguments, but prefetch opens up a can of semantic worms. Since it's semantically a nop, it's unclear how the specification can say anything about how it should be ordered with respect to other instructions. We would have to involve the wider CG in a discussion about the proper way to specify that, and I expect that those discussions would delay shipping this proposal by multiple months. It would be much better to split these instructions and all the questions they raise into a follow-up proposal so we can get this one out the door.

@lars-t-hansen
Contributor

This request is usually phrased in terms of a prefetch instruction that the producer knows where to place (through magic, more or less). This makes sense both because the CPUs have prefetch instructions and because the front end can help out with where to place the prefetch. But would an alternative be a "hot load" instruction, that bundles the prefetch with the load (as a prefix now), that the JIT can then try to place meaningfully?

@nfrechette

Wouldn't it be a 'cold load' since the data isn't in the cache for that load?
In practice, it isn't always possible to keep the instruction close to the load, and a compiler might not be able to figure out where it should go. For example, my animation decompression works by processing one cache line worth of data at a time. I prefetch the next cache line ahead, and many other things happen before we finally start processing it, to make sure the latency is fully hidden even in case of a TLB miss. Other times, you want to bundle several prefetches/loads that miss together to avoid bubbles in the execution pipelines and allow the processor to prepare as much work as possible in the cache miss's shadow. It is not uncommon for the prefetch to live in one function while the load ends up in a different function.
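
A rough sketch of that cache-line-ahead pattern in C (hypothetical shapes; process_line stands in for the real per-line decompression work, and __builtin_prefetch for the prefetch hint):

#include <stddef.h>

enum { CACHE_LINE = 64 };

void process_line(const unsigned char *line);  /* hypothetical per-line work */

void process_stream(const unsigned char *data, size_t size) {
  for (size_t offset = 0; offset < size; offset += CACHE_LINE) {
    /* Start pulling in the next line while we work on the current one. */
    __builtin_prefetch(data + offset + CACHE_LINE, 0, 3);
    process_line(data + offset);
  }
}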

Generally speaking, I think it would be best to leave the instruction where it is in the stream; the compiler shouldn't attempt to move it much. Prefetching is often added once many other optimizations have taken place, and by that point you'll have to carefully measure the benefits (if any) of adding prefetching, and where. With out-of-order processors, prefetching shouldn't be casually added just anywhere (unlike with in-order processors, where it was often copy/pasted everywhere). Its usage definitely isn't as common, but where it is used, it can provide significant benefits.

This circles back somewhat to the general pain of writing SIMD in a language that can compile to multiple ASM SIMD flavors. Where a prefetch is best placed on ARM might not be the same on x64 (although it might not make a huge difference then). I wonder what ISPC is doing here, since it faces a similar problem.

Prefetching getting added alongside previous SIMD instruction sets is a coincidence, IMO. New SSE/NEON standards don't come out very often, and prefetch is simply one more way to help better utilize modern processors. Scalar code can benefit from prefetching just as much; it really depends on what the code is doing.

@lars-t-hansen
Contributor

Wouldn't it be a 'cold load' since the data isn't in the cache for that load?

This is potato-potahto, but the instruction is hot (i.e. should not be expensive), hence the nomenclature. But we move on...

Generally speaking, I think it would be best to leave the instruction where it is in the stream. The compiler shouldn't attempt to move it much.

This will tend to inhibit optimizations in a JIT and "much" is not very precise. Currently we (Firefox) pretty much have instructions that are fully moveable and instructions that don't move at all.

Can we come up with rules that are better than "treat the prefetch as a reordering barrier"?

This circles back somewhat to the general pain of writing SIMD in a language that can compile to multiple ASM SIMD flavors. Where a prefetch is best placed on ARM might not be the same on x64 (although it might not make a huge difference then). I wonder what ISPC is doing here, since it faces a similar problem.

And different JITs for the same architecture may also affect the outcome significantly. You test your code on JIT A and place your prefetches carefully, then the web exposes your code to JIT B which has some hoisting or checking optimization that shrinks the path from prefetch to load and makes the prefetch placement suboptimal / wrong.

@kmiller68

And different JITs for the same architecture may also affect the outcome significantly. You test your code on JIT A and place your prefetches carefully, then the web exposes your code to JIT B which has some hoisting or checking optimization that shrinks the path from prefetch to load and makes the prefetch placement suboptimal / wrong.

It's not even JIT A vs JIT B; it could just as well be JIT A.1 vs JIT A.2. I'm not convinced that we want prefetch instructions to be a barrier any more than they are a barrier for the CPU itself, which as far as I know they are not. I could be convinced otherwise, but I'd like to see hard data before committing to making the instructions barriers. Beyond that, it's perfectly valid to insert a bunch of random unobservable loads/stores between the prefetch and the subsequent load, which from a performance PoV is likely just as bad as moving the prefetch.

@fbarchard

I would like to see prefetch in the SIMD specification. I've used it often, but only in SIMD.
The placement isn't critical, but it can be co-issued for free after a math instruction, with the same characteristics as a load.

@lars-t-hansen
Contributor

@kmiller68, the use of "barrier" was just meant to illustrate that if there's a prefetch instruction and the desire is for it to not move "too much" in the code, then a reasonably precise meaning for that would have to be found, or it's pointless to have it. The most precise I have come up with so far is that it acts as a reordering barrier in the semantics - bytecodes preceding it have to be executed before it, bytecodes succeeding it have to be executed after it, much in the way of a store to memory - not that it is expressed as an actual reordering barrier in the hardware. Improvements on that are obviously welcome, but it has to be something the compiler can relate to.

@lemaitre

I would like to highlight that hardware prefetchers are getting better and better, to the point that Agner Fog says the following about the Zen prefetcher (architecture.pdf, 20.16):

Automatic hardware prefetching is more efficient than explicit software prefetching in most cases.

I am really curious to see whether there are any benefits to prefetch instructions on recent hardware, especially for 128-bit SIMD code (the target of current WASM SIMD).

@nfrechette

I don't think prefetches should be like barriers; some re-ordering is fine provided it doesn't end up 100 instructions away or on the other side of an important loop. The key is for the behavior to be predictable. Even if we lose a few cycles doing a sub-optimal prefetch, we can still save hundreds of cycles from the cache miss.

@lemaitre In my code, software prefetching gave a 10-20% boost (even though all my reads are linear, contiguous, and hardware-prefetcher friendly) on my Ryzen 2950X and my Pixel 3 phone. Agner is right that with the advent of hardware prefetching and newer chips supporting more parallel streams, its usage isn't as important anymore and can hurt performance in some scenarios. But it remains an important tool.

The hardware prefetcher doesn't kick in until you've done 2 cache misses in a pattern it recognizes (and only if it can accommodate the stream at the L1/L2/L3 levels). Depending on your code, that might be significant. In my animation decompression, some streams are very densely packed and might fit in 2-3 cache lines, meaning the hardware prefetcher often doesn't have time to kick in. I also know the memory layout and where TLB misses are likely to happen, and software prefetching allows me to hide that latency as well. Code that does random access but performs at least 100+ instructions per access can benefit too: for example, querying an R-tree or B-tree with multiple children per node, where you can easily prefetch the next node to process while processing the current one. Hardware prefetching won't help you there, and each node is susceptible to TLB misses as well. In code like this, software prefetching will always help, and the hardware is unlikely to ever be able to.
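
A sketch of that node-traversal pattern in C (hypothetical node layout and helpers; pick_child is the cheap key search, do_work the 100+ instructions of real per-node processing):

struct Node {
  int keys[8];
  struct Node *children[8];
};

int pick_child(const struct Node *node, int key);  /* hypothetical */
void do_work(const struct Node *node);             /* hypothetical */

void traverse(const struct Node *node, int key) {
  while (node) {
    const struct Node *next = node->children[pick_child(node, key)];
    __builtin_prefetch(next, 0, 3);  /* start the likely cache/TLB miss early */
    do_work(node);                   /* runs in the miss's shadow */
    node = next;
  }
}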

@tlively
Member

tlively commented Sep 22, 2020

I don't think prefetches should be like barriers; some re-ordering is fine provided it doesn't end up 100 instructions away or on the other side of an important loop. The key is for the behavior to be predictable.

That's a good high-level goal, but we still need to formalize it in such a way that it can be implemented and guaranteed by compilers (both in the engine and in the producing toolchain). That's what @lars-t-hansen was talking about in his previous comment.

@Maratyszcza
Contributor Author

IMO prefetch instructions should NOT be considered barriers. Hardware prefetch instructions are not barriers at the architecture level, i.e. the CPU can execute them in a different order than they appear in the native instruction stream. Thus, it would be strange for WebAssembly engines to enforce stronger memory ordering than the hardware, and practically impossible on out-of-order processors.

I suggest we experiment with implementing Prefetch in V8 and/or SpiderMonkey and evaluate its impact on real-world tasks.

@lars-t-hansen
Contributor

@Maratyszcza, that issue, which was an issue of wording only, has already been addressed (see my latest comment above); the substantive issue here is not how to "experiment" with a prefetch but how to express its meaning to the compiler.

@zeux
Contributor

zeux commented Oct 1, 2020

In my experience, on Intel ISAs prefetch is most useful outside of tight SIMD kernels - the hardware prefetcher works well when you're processing streams of data, and it doesn't work well when the access pattern is unpredictable. Historically, for in-order architectures and/or ones without hardware prefetchers, it could make sense to use it for stream processing, but those times have mostly passed. In fact, on Intel processors it's almost easier to introduce perf regressions by adding prefetch for stream processing than to improve performance... This feels like it's outside of the scope.

@fbarchard

I had some code that did a prefetch, but it was at the end of the loop. That's fine in the case where memory is the bottleneck, but if you run the code on data that is already in the L1 cache, the prefetches can slow down the code. So instead of 2 prfm just before the branch, I moved them just after some math, e.g.

1:
ld1        {v0.16b, v1.16b}, [%0], #32  // load row 1 and post-increment
ld1        {v2.16b, v3.16b}, [%1], #32  // load row 2 and post-increment
subs       %w3, %w3, #16                // 16 processed per loop
uaddlp     v0.8h, v0.16b                // row 1 add adjacent
prfm       pldl1keep, [%0, 448]         // prefetch 7 lines ahead
uaddlp     v1.8h, v1.16b
prfm       pldl1keep, [%1, 448]
uadalp     v0.8h, v2.16b                // += row 2 add adjacent
uadalp     v1.8h, v3.16b
rshrn      v0.8b, v0.8h, #2             // round and pack
rshrn2     v0.16b, v1.8h, #2
st1        {v0.16b}, [%2], #16
b.gt       1b

Running 120 benchmarks at 128x72 resolution, 10000 times each, the differences are subtle. 3 runs of each benchmark:
no prfm
(82448 ms total)
(82524 ms total)
(82123 ms total)

prfm middle (code above)
(82320 ms total)
(81095 ms total)
(81827 ms total)

prfm end
(84484 ms total)
(82876 ms total)
(83001 ms total)

@fbarchard

Increasing the resolution to 1280x720 and running 12 tests, 1000 times each:

no prfm
(90115 ms total)
(90661 ms total)
(90812 ms total)

prfm middle
(81816 ms total)
(82394 ms total)
(82509 ms total)

prfm end
(82424 ms total)
(82733 ms total)
(83089 ms total)

Prefetching improves this example (scaling with a bilinear filter to 1280x720) by 10%.
When prefetch is effective, the location of the prefetch instruction doesn't matter much.
When prefetch is ineffective (data is in the L1 cache), scheduling the instruction to be free makes a small difference (1.4%).

@ngzhian
Member

ngzhian commented Oct 16, 2020

Will PREFETCHT0 [mem] be considered a memory access? We need to do bounds checks for prefetch as well, don't we?

@Maratyszcza
Contributor Author

@ngzhian Prefetch of unallocated memory does not cause a SEGFAULT, so no bounds check is needed.

@ngzhian
Member

ngzhian commented Nov 24, 2020

Prototyped on arm64 https://crrev.com/c/2543167

tlively added a commit to llvm/llvm-project that referenced this pull request Jan 5, 2021
As proposed in WebAssembly/simd#352 and using the
opcodes used in the V8 prototype:
https://chromium-review.googlesource.com/c/v8/v8/+/2543167. These instructions
are only usable via intrinsics and clang builtins to make them opt-in while they
are being benchmarked.

Differential Revision: https://reviews.llvm.org/D93883
tlively added a commit to tlively/binaryen that referenced this pull request Jan 6, 2021
As proposed in WebAssembly/simd#352, using the opcodes
used in the LLVM and V8 implementations.
tlively added a commit to WebAssembly/binaryen that referenced this pull request Jan 6, 2021
As proposed in WebAssembly/simd#352, using the opcodes
used in the LLVM and V8 implementations.
@Maratyszcza
Contributor Author

I evaluated the performance impact on end-to-end sparse inference in convolutional neural networks by modifying the SpMM microkernels in the XNNPACK library. I used three sparse neural network models: MobileNet v2, a hand tracking model, and a segmentation model.

Performance results on ARM64 are presented below:

Processor (Device)                 | Speedup on MobileNet v2 | Speedup on Hand Tracking | Speedup on Segmentation
Qualcomm Snapdragon 670 (Pixel 3a) | 2%                      | 5%                       | -1%
Samsung Exynos 8895 (Galaxy S8)    | 3%                      | 8%                       | 11%

@Maratyszcza
Contributor Author

Performance results on x86-64 systems:

Processor           | Speedup on MobileNet v2 | Speedup on Hand Tracking | Speedup on Segmentation
AMD PRO A10-8700B   | -5%                     | -6%                      | -3%
AMD A4-7210         | -3%                     | -6%                      | -3%
Intel Xeon W-2135   | -2%                     | -2%                      | -1%
Intel Celeron N3060 | -6%                     | -8%                      | -5%

@penzn
Contributor

penzn commented Jan 22, 2021

IMO, consistent performance losses on one architecture are a bit of a disqualifying factor. However, the issues with prefetch run deeper than this: it is not really portable in terms of performance. The effects are usually specific to a particular model/family/etc., not to x86 or Arm broadly. There might be a workload where you would get negative speedups on Arm and positive on x86, or negative on one x86 chip and positive on another.

As an example, here is a link to a paper describing challenges with prefetch.

I am provisionally against this PR.

@Maratyszcza
Contributor Author

Maratyszcza commented Jan 22, 2021

However, the issues with prefetch run deeper than this - it is not really portable in terms of performance.

I agree, portability is rather disappointing. The better performance on ARM vs x86 might be due to the prefetch code having been ported from an ARM implementation.

Likewise, provisionally against this proposal.

@jan-wassenberg

FWIW JPEG XL uses prefetch for ANS alias tables and filtering, and did see some modest gains, IIRC also on x86.

On balance, though, I agree prefetch is problematic due to the lack of performance portability, and we've also seen some perf penalty, so I would be OK with leaving it out.

@dtig
Member

dtig commented Jan 25, 2021

Adding a preliminary vote on the inclusion of prefetch operations in the SIMD proposal below. Please vote with:

👍 For including prefetch operations
👎 Against including prefetch operations

@Maratyszcza
Contributor Author

The community group decided against including these instructions in #429 due to performance portability concerns.

@Maratyszcza Maratyszcza closed this Feb 4, 2021
lazyparser pushed a commit to riscv-collab/v8 that referenced this pull request Mar 22, 2021
Removing prefetch operations as per the vote in the github issue:
WebAssembly/simd#352

Bug:v8:11168

Change-Id: Ia72684e68ce886f8f26a7d3b5bea601be416dfab
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2771758
Reviewed-by: Jakob Kummerow <jkummerow@chromium.org>
Reviewed-by: Maya Lekova <mslekova@chromium.org>
Reviewed-by: Zhi An Ng <zhin@chromium.org>
Commit-Queue: Deepti Gandluri <gdeepti@chromium.org>
Cr-Commit-Position: refs/heads/master@{#73578}