
[WIP] SVE-like flexible vectors #27

Open
wants to merge 6 commits into base: master

Conversation


@lemaitre commented Feb 19, 2021

Hello everyone, this is an attempt to formalize flexible vectors in a model that mimics SVE, as an alternative to #21. The main goal of this PR is to discuss the model, leaving the bikeshedding aside (at least for now).

The main idea is the following: each target architecture has a runtime constant SIMD width (ie: it cannot change during execution of the process), but this width is not known at compile time because we don't know the target architecture at this point.
All vectors have a width that is equal to this target SIMD width.

Smaller vectors are handled with masked operations. Each operation defines how it handles inactive elements. There are usually three policies: zero the inactive elements, forward the inactive elements from the first input, or leave them undefined. This third policy might be controversial and could be removed, but it gives more flexibility to WASM engines in cases where the user actually does not care.
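As a rough illustration, here is a scalar sketch of these three policies for a masked 32-bit add (names like `mask_policy` and `vlen` are mine, not part of the proposal):

```c
#include <stddef.h>
#include <stdint.h>

enum mask_policy { POLICY_ZERO, POLICY_MERGE, POLICY_UNDEF };

/* Reference-semantics sketch: how inactive lanes could be handled for a
 * masked add under the three policies described above. */
static void masked_add_i32(int32_t *out, const int32_t *a, const int32_t *b,
                           const uint8_t *mask, size_t vlen,
                           enum mask_policy policy) {
  for (size_t i = 0; i < vlen; ++i) {
    if (mask[i])
      out[i] = a[i] + b[i];
    else if (policy == POLICY_ZERO)
      out[i] = 0;            /* zero inactive elements */
    else if (policy == POLICY_MERGE)
      out[i] = a[i];         /* forward inactive elements from the first input */
    /* POLICY_UNDEF: leave out[i] unspecified */
  }
}
```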

There are 10 types: 5 vectors (vec.v8, vec.v16, vec.v32, vec.v64, vec.v128) and 5 masks (vec.m8, vec.m16, vec.m32, vec.m64, vec.m128). This is needed because of the different ways target architectures handle masks. Some have type-width agnostic masks (eg: AVX512), while some have type-width aware masks (eg: SSE, SVE).
Technically, it might not be necessary to have multiple vector types (keeping only multiple mask types), but I think it simplifies the mental model to have a direct correspondence between vectors and masks (you cannot use a vec.m8 mask with a vec.v16 vector).

128-bit elements might not be necessary, but they would provide users a simpler way to deal with flexible vectors as they could have all the SIMD operations on 128-bit blocks they are used to (including compile-time shuffles).

There are many operations missing, eg: gathers, scatters, compact, expand, narrowing arithmetic, widening arithmetic, some mask operations (equivalent to break control flow), first-fault loads.
Currently, mask sizes are not defined and thus no way to load/store masks from/to memory exists. It is always possible to convert a mask into a vector and have this vector in memory, but this looks contrived.
I see 2 orthogonal ways to solve this issue: let mask types have sizes and have a way to load/store them in their natural, architectural representation, or have a way to load/store them in a compact format (1 bit per element).
The second would make the most sense as an abstraction layer, but could be slow on some architectures (Neon for sure, maybe SVE).

More details in the README.


(edit)
The points I wish to be discussed:

  • The semantics of masks (eg: do we need more/less operations?)
  • The semantics of narrowing/widening operations
  • The semantics of swizzling
  • Is this model enough for most use cases?
  • Is this model efficiently implementable on "legacy" architectures (namely SSE, and Neon)?


@penzn left a comment


Please limit the PR to the instructions you want to add or change - and provide some background on why you want the change.


penzn commented Feb 20, 2021

Hello everyone, this is an attempt to formalize flexible vectors in a model that mimics SVE, as an alternative to #21. The main goal of this PR is to discuss the model, leaving the bikeshedding aside (at least for now).

My main issue is that the PR replaces the entire proposal :) Also, there are some instructions other than just masks that are added - do you want to discuss those?

The main idea is the following: each target architecture has a runtime constant SIMD width (ie: it cannot change during execution of the process), but this width is not known at compile time because we don't know the target architecture at this point.
All vectors have a width that is equal to this target SIMD width.

This is already true. Sorry, it wasn't very clear before, I've updated the readme to make this a bit more explicit.

The masks change is still welcome - let's just fit it inside what is already set up if possible. If you feel like it can't be reconciled with the current state of the proposal I will be happy to hear why; maybe we can do something about it.

@lemaitre

My main issue is that the PR replaces the entire proposal :)

Yeah, sorry about that. This PR is not intended to be a true PR: I don't expect it to be merged. The goal is to have a solid base on which we can discuss, and I believe that an issue would not be as convenient as a PR. As soon as we are confident enough this model can work and covers most use-cases, I will retro-fit this concept into your actual proposal.

Also, there are some instructions other than just masks that are added - do you want to discuss those?

Basically, I think we should discuss anything related to the flexible vector concept. Lane-wise operations are no issue, so I mostly skipped them, but shuffles and interleaving (for instance) need a proper definition.

So basically, the points I wish to be discussed:

  • The semantics of masks (eg: do we need more/less operations?)
  • The semantics of narrowing/widening operations
  • The semantics of swizzling
  • Is this model enough for most use cases?
  • Is this model efficiently implementable on "legacy" architectures (namely SSE, and Neon)?

This is already true. Sorry, it wasn't very clear before, I've updated the readme to make this a bit more explicit.

Nice ;)
However, I don't know how it will play with your optional set_length.
That's why I made it very explicit that length is actually a runtime constant (I don't want a true set_length).


@penzn left a comment


Yeah, sorry about that. This PR is not intended to be a true PR: I don't expect it to be merged. The goal is to have a solid base on which we can discuss, and I believe that an issue would not be as convenient as a PR. As soon as we are confident enough this model can work and covers most use-cases, I will retro-fit this concept into your actual proposal.

There are a few instructions in the PR we should probably merge :)

I am not sure I have all the answers, and some of the answers might be trivial (sorry about that):

So basically, the points I wish to be discussed:

  • The semantics of masks (eg: do we need more/less operations?)

I think this is tied with the "use cases" question - if we have use cases for more, we can add more, and we should not add operations we don't have use cases for.

  • The semantics of narrowing/widening operations

What would be the concern here? With values encoding lane types we can go between lane sizes relatively easily.

  • The semantics of swizzling

That is the most "interesting" aspect of flexible length vector operations - it is very hard to do general purpose swizzle and shuffle if you don't know the length (especially shuffle with compile-time masks). However it is definitely possible to do things like concat, unpack, shift lane, etc.

  • Is this model enough for most use cases?

The devil is in the details - very basic algorithms would work with even less, but if we want to port code that already works with existing vector instruction sets we need to come up with something compatible. What sort of algorithms do you have in mind? Also, there are some examples in #5.

  • Is this model efficiently implementable on "legacy" architectures (namely SSE, and Neon)?

Masks are tricky on those architectures, however the "set length" approach is tricky as well (that's why it is now optional). Originally I thought of implementing set length by zeroing out the left-out portion of the vector, but that would still be quite expensive.

However, I don't know how it will play with your optional set_length.
That's why I made it very explicit that length is actually a runtime constant (I don't want a true set_length).

Sure, I see your point - setting length might mean setting mask length, or something like that, but that is not very practical.

- `vec.m32.sign(a: vec.v32) -> vec.m32`
- `vec.m64.sign(a: vec.v64) -> vec.m64`

## Inter-lane operations

@penzn Feb 23, 2021


I think we need at least some of those operations anyway. Do you mind if I pull this into a separate PR?

Author


If you're confident enough about the usefulness of those, go on.
Out of curiosity, which ones do you want to pull?

Contributor


The only operation currently in is lane shift, so most of them, at least in terms of categories.

Author


Be warned, the vec.vX.lane_shift is not equivalent to your shift operations. This one takes 2 input vectors. This makes it general enough to implement all shift related operations, including rotation.
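To illustrate, here is a scalar sketch of one plausible semantics for that two-input form (names are mine; `shift` is assumed to be already reduced modulo the vector length):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: a two-input lane shift reads from the concatenation a:b, so passing
 * the same vector twice gives a rotation. Scalar reference, illustrative names. */
static void lane_shift_i32(int32_t *out, const int32_t *a, const int32_t *b,
                           size_t vlen, size_t shift /* < vlen */) {
  for (size_t i = 0; i < vlen; ++i) {
    size_t j = i + shift;
    out[i] = (j < vlen) ? a[j] : b[j - vlen];
  }
}
```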

Contributor


Yeah, there are two flavors of this - one taking two inputs and the other taking only one. And as usual, the two major architecture families have made two different choices.


### LUT1 zero

Elements whose index is out of bounds are set to `0`.
Contributor


What would this be used for - to prepare input for swizzle? In Wasm SIMD, swizzle has built-in zeroing of elements referenced by out-of-range indices.

Author


My bad, this description is not a full description. It is an actual swizzle operation (look at the python pseudo code).
The description only shows the out-of-bounds policy.


@penzn Feb 24, 2021


Sorry, thought it was an instruction preparing LUT access. And I thought it wasn't a bad idea, given the pains of defining overflow behavior in simd128 :)

What is the difference between LUT and swizzle?

Author


LUT and swizzle/shuffle are basically the same, but the concept of LUT can be extended to more inputs. For instance, Neon and SVE2 have LUT1, LUT2, LUT3 and LUT4. The naming is really easy to understand: LUTn is a lookup in a table of n registers.
I should probably add LUT2 to the spec. I don't think larger LUTs are that useful, so LUT1 and LUT2 should be enough.

Also, I don't mind using the terms swizzle and shuffle for LUT1 and LUT2, but I don't really like them either.


A small note about your example - if TBL/TBX is what you have in mind for LUTn, then SVE2 supports only LUT1 and LUT2 (unlike Neon), but, of course, lookup into arbitrary large tables can be synthesized with TBX (and probably vector DEC*).

Author


Yes, LUTn would be the WASM equivalent of TBL/TBX.
Currently, only LUT1 and LUT2 are part of this PR. The reason is: only Neon provides the operation with more registers.

@lemaitre

  • The semantics of masks (eg: do we need more/less operations?)

I think this is tied with the "use cases" question - if we have use cases for more, we can add more, and we should not add operations we don't have use cases for.

I agree it is tied to use cases.
Here are some operations that I am confident are useful:

  • vec.mX.index_CMP: the most used will be index_lt, to implement most loops
  • all masks bitwise operations
  • vec.mX.all|none: creating basic masks
  • vec.vX.CMP
  • vec.vX.test_none|any|all: used for branching in SIMD

There are others I'm not sure are useful, and some that I incorrectly ported from SVE:

  • vec.mX.first|last: they are interesting to iterate over lanes while staying in SIMD. The SVE version takes 2 masks to implement this use case more efficiently. SVE also provides the equivalent of previous and next.
  • vec.mX.index_first|last: transform a mask into a lane index, might be useful to chain with lane index operations (like insert or extract), but potentially hard to implement on SVE/Risc-V V.
  • vec.mX.count: count the number of active elements. I think it is mainly useful in conjunction with compact. The SVE version also takes 2 masks.

Also, for now, lane-wise operations have 4 mask policies:

  • nothing: the operation is not masked (all lanes are active).
  • _mz: The inactive lanes are set to 0.
  • _mm: The inactive lanes are copied from the first input (non-commutative operations like sub and div might need a reverse version that swaps the meaning of the inputs).
  • _mx: The inactive lanes have an undefined value. This would be useful for letting the WASM engine choose the best policy to actually translate to.

I don't know if we actually need all 4 of them, especially the _mx one.
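To make these policies concrete, here is one way an engine might lower a masked 32-bit add on SSE4.1, assuming the mask is materialized as a vector with all-ones or all-zeros per 32-bit lane (a sketch, not a required lowering; function names are mine):

```c
#include <smmintrin.h>  /* SSE4.1 */

/* _mz: inactive lanes are zeroed. */
static inline __m128i add_i32_mz(__m128i a, __m128i b, __m128i m) {
  return _mm_and_si128(_mm_add_epi32(a, b), m);
}

/* _mm: inactive lanes are forwarded from the first input. */
static inline __m128i add_i32_mm(__m128i a, __m128i b, __m128i m) {
  return _mm_blendv_epi8(a, _mm_add_epi32(a, b), m);
}

/* _mx: the engine would be free to pick whichever of the two forms is cheaper. */
```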

There is a last question about masks. In a masked ISA, almost all operations take a mask.
As WASM is a stack VM, you would naively need to "copy" the mask on top of the stack over and over for each operation.
How do we deal with this at the WASM level to avoid too many local.get?

  • The semantics of narrowing/widening operations

What would be the concern here? With values encoding lane types we can go between lane sizes relatively easily.

The question here is what lanes are combined together when narrowing 2 vectors together, and what lanes are selected when widening half of a vector.
Basically, does narrowing concat inputs, or interleave inputs? Does widening select the low/high half or the even/odd half?
This might not be an issue as there seems to be a consensus on this subject across architectures (ie: narrowing concats and widening selects low/high).

Also, do narrowing operations take multiple inputs to form a full vector, or do we just fill half of the output?
I think the best way would be to take multiple vectors as input, just because it will be a bit harder to implement the former behavior from the latter.
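For clarity, a scalar sketch of the "concatenating" narrowing this consensus points to (illustrative names, truncating conversion):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: narrowing two 32-bit vectors into one 16-bit vector by concatenation;
 * the first input fills the low half of the result, the second the high half. */
static void narrow_i32_to_i16(int16_t *out, const int32_t *lo,
                              const int32_t *hi, size_t n32) {
  for (size_t i = 0; i < n32; ++i) {
    out[i]       = (int16_t)lo[i];
    out[i + n32] = (int16_t)hi[i];
  }
}
```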

  • The semantics of swizzling

That is the most "interesting" aspect of flexible length vector operations - it is very hard to do general purpose swizzle and shuffle if you don't know the length (especially shuffle with compile-time masks). However it is definitely possible to do things like concat, unpack, shift lane, etc.

Full-width shuffles with compile-time masks are out of the question. The best we could propose is full-width runtime shuffles, where the index vector comes from an eventual vec.vX.const.

The common swizzling operations are however already defined here:

  • vec.vX.interleave_even|odd
  • vec.vX.interleave_low|high
  • vec.vX.concat_even|odd
  • vec.vX.lane_shift (and the mask only equivalent vec.vX.concat)

Full-width runtime shuffles are handled with vec.vX.LUT1. For now there are 2 out-of-bounds policies:

  • out-of-bounds are set to 0
  • out-of-bounds are copied from a fallback vector

The latter makes it simpler to implement larger LUTs (ie: from more inputs), but might not be super useful as it is already possible with the former policy.
We could also envision a third policy where indices are interpreted modulo the vector length (ie: no out-of-bounds), but I'm not sure it would be super useful.
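As a sketch of why the zero policy helps chaining, here is a lookup into a two-vector table built from two LUT1-zero passes (scalar reference; names are mine):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: a lookup into the two-vector table {t0, t1} built from two LUT1
 * passes using the out-of-bounds-are-zero policy; at most one of the two
 * partial lookups selects a table element, so OR-ing them combines the chain. */
static void lut2_from_lut1_zero(uint8_t *out, const uint8_t *t0,
                                const uint8_t *t1, const uint8_t *idx,
                                size_t vlen) {
  for (size_t i = 0; i < vlen; ++i) {
    uint8_t r0 = (idx[i] < vlen) ? t0[idx[i]] : 0;
    uint8_t r1 = (idx[i] >= vlen && idx[i] < 2 * vlen) ? t1[idx[i] - vlen] : 0;
    out[i] = (uint8_t)(r0 | r1);
  }
}
```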

Also, SVE and Neon provide LUT2, LUT3 and LUT4 (respectively taking from 2, 3, and 4 input vectors).
The equivalent of the intel permutexvar would be LUT1, and permutex2var would be LUT2 (indices are wrapped).

There is a potential issue with those LUT operations with vec.v8 if the width can be larger than 2048: a single byte would not be enough to address lanes from the whole vector (number of lanes > 256).

  • Is this model enough for most use cases?

The devil is in the details - very basic algorithms would work with even less, but if we want to port code that already works with existing vector instruction sets we need to come up with something compatible. What sort of algorithms do you have in mind? Also, there are some examples in #5.

I know that's the tricky bit... Here is a small list of algorithms that definitely need to be implementable:

  • gemm (matrix multiplication)
  • reductions (not necessarily additions)
  • Stencils
  • prefix sums (1D)
  • Integral images (like prefix sums but in 2D)
  • interpolations
  • On-The-Fly narrowing or widening (ie: chaining conversions with computation instead of having a separate conversion step)

I am also personally interested in more algorithms:

  • Run-Length Encoding
  • Union-Find (graph processing)
  • Background subtraction (like Sigma-Delta)

  • Is this model efficiently implementable on "legacy" architectures (namely SSE, and Neon)?

Masks are tricky on those architectures, however the "set length" approach is tricky as well (that's why it is now optional). Originally I thought of implementing set length by zeroing out the left-out portion of the vector, but that would still be quite expensive.

I think the best way would be to have a way for the engine to optimize away the mask when it is known to be full (or empty).
A vec_loop construct that would be semantically equivalent to a regular loop incrementing by vec.vX.length and with a vec.mX.index_lt could help, as the engine can see that only the last iteration will have a non-full mask.
The engine could monomorphize all the functions taking a mask to specialize the case where the mask is full.
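In scalar terms, the loop shape I have in mind looks roughly like this (`VLEN` stands in for the runtime constant `vec.vX.length` and the per-lane `active` test stands in for the `vec.mX.index_lt` mask; every iteration except the last sees a full mask):

```c
#include <stddef.h>
#include <stdint.h>

#define VLEN 8  /* placeholder for the runtime constant vector length */

static void add_arrays(int32_t *dst, const int32_t *a, const int32_t *b,
                       size_t n) {
  for (size_t i = 0; i < n; i += VLEN) {
    for (size_t lane = 0; lane < VLEN; ++lane) {
      int active = (i + lane) < n;     /* the vec.mX.index_lt mask */
      if (active)
        dst[i + lane] = a[i + lane] + b[i + lane];
    }
  }
}
```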

However, I don't know how it will play with your optional set_length.
That's why I made it very explicit that length is actually a runtime constant (I don't want a true set_length).

Sure, I see your point - setting length might mean setting mask length, or something like that, but that is not very practical.

I think set_length would not be necessary as vec.mX.index_lt generates a mask that has the same semantics as set_length.

@jan-wassenberg

There is a potential issue with those LUT operations with vec.v8 if the width can be larger than 2048: a single byte would not be enough to address lanes from the whole vector (number of lanes > 256).

Right. SVE is limited to 2048 bit anyway but this could bite on RVV. FYI there seems to be a tendency there to impose at least a 64K element upper bound (previously there was none). I suspect that just wrapping around is the best we can do.

@lemaitre

@jan-wassenberg

I suspect that just wrapping around is the best we can do.

Out-of-bound policy is completely orthogonal to maximal vector length. I think for LUTs (ie: shuffles), having out-of-bounds produce 0 (or a fallback) is really useful to chain LUTs when the actual LUT is larger than a single vector.

@jan-wassenberg

For iota(), wraparound seems much more useful than 0 - it would still allow differentiating odd and even lanes.
For LUT2, could that not be implemented independently of overflow using compare+select/blend?

@lemaitre

For iota(), wraparound seems much more useful than 0 - it would still allow differentiating odd and even lanes.

Most operations on lanes use the wraparound policy, namely: extract_lane, replace_lane, splat_lane, load_lane, store_lane, lane_shift.
In fact, only LUT uses a different policy to make it simpler to chain them. Maybe we want both for LUTs.

For LUT2, could that not be implemented independently of overflow using compare+select/blend?

Yes, you can chain them with compare+select, but that could be less efficient on some architectures, especially Neon and SVE.
And I think that LUT will often be used in chains because you usually don't know how many registers you need for the LUT.
Also, emulating the wrap around is simpler (just a binary and) than emulating the out-of-bound zeros (compare+select).
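For instance, an SSSE3 sketch of a 16-lane byte LUT1 (function names are mine, illustrative only): wraparound is a single AND before `pshufb`, while the zero policy needs an extra compare (here folded into `pshufb`'s built-in zeroing rather than a separate select):

```c
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* Indices are taken modulo 16: clear everything above the low 4 bits. */
static inline __m128i lut1_wrap(__m128i table, __m128i idx) {
  return _mm_shuffle_epi8(table, _mm_and_si128(idx, _mm_set1_epi8(15)));
}

/* Out-of-bounds indices (>= 16) produce 0: force their high bit so that
 * pshufb zeroes them; indices >= 128 already have the high bit set. */
static inline __m128i lut1_zero(__m128i table, __m128i idx) {
  __m128i oob = _mm_cmpgt_epi8(idx, _mm_set1_epi8(15));
  return _mm_shuffle_epi8(table, _mm_or_si128(idx, oob));
}
```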


penzn commented Feb 24, 2021

Basically, does narrowing concat inputs, or interleave inputs? Does widening select the low/high half or the even/odd half?
This might not be an issue as there seems to be a consensus on this subject across architectures (ie: narrowing concats and widening selects low/high).

I think there is consensus, at least for some part, that narrowing concats the inputs and widening uses low/high.

@penzn mentioned this pull request Feb 26, 2021
penzn pushed a commit to penzn/flexible-vectors that referenced this pull request Feb 27, 2021
Change by Florian Lemaitre, originally made in PR WebAssembly#27:

WebAssembly#27
This was referenced Feb 27, 2021
penzn pushed a commit that referenced this pull request Mar 1, 2021
Change by Florian Lemaitre, originally made in PR #27:

#27
@lemaitre mentioned this pull request May 17, 2021

@akirilov-arm left a comment


Hello,

I also had a chance to look at the proposal.

Several other comments:

... let mask types have sizes and have a way to load/store them in their natural, architectural representation, or have a way to load/store them in a compact format (1 bit per element).
The second would make the most sense as an abstraction layer, but could be slow on some architectures (Neon for sure, maybe SVE).

Not great for SVE either.

* `vec.mX.index_first|last`: transform a mask into a lane index, might be useful to chain with lane index operations (like insert or extract), but potentially hard to implement on SVE/Risc-V V.

IMHO it would not be hard to implement with SVE, but it would be very inefficient because it would involve a round-trip from the predicate registers to the general-purpose ones and back. Taking the combination of vec.mX.index_first and extract_lane as an example, my expectation would be that compilers would pattern-match the combination and just generate a LASTB instruction (simplifications are possible in specific cases). From SVE's perspective, the ideal solution would be to have an operation that extracts an element based on a mask value, but I am not sure how well that maps to other architectures.

- `vec.v16.UNOP(a: vec.v16) -> vec.v16`
- `vec.v32.UNOP(a: vec.v32) -> vec.v32`
- `vec.v64.UNOP(a: vec.v64) -> vec.v64`
- `vec.v128.UNOP(a: vec.v128) -> vec.v128`


Your proposal specifies 4 different vec.v128 interpretations - how do we determine what the expected interpretation is (same for the binary arithmetic operators)?

This might have implications for the representation of 128-bit element masks (which are not supported natively by SVE).

Author


Actually, there are many more ways to interpret a vec.v128. Basically, a vec.v128 is a vector of v128 (from fixed-size SIMD), so there are as many ways to interpret a vec.v128 as there are to interpret a v128.

This section is a simplification of what the actual section will look like, and the interpretation of the content of the vector will be part of the instruction. eg:

  • vec.i32.add(a: vec.v32, b: vec.v32) -> vec.v32
  • vec.f32.add(a: vec.v32, b: vec.v32) -> vec.v32

As element-wise unary and binary operation semantics are trivial, they have not been detailed in this proposal yet.


OK, they do look kind of trivial, but they still raise some questions - for example, aren't the unmasked vec.v128 operations completely redundant?

That's why I think it could be useful to state at least some of the masked operations concretely.

Author


I don't expect vec.i8x16.add to be a valid operation. The vec.v128.UNOP operations are for operations that deal with a v128 as a block, like reverse bytes. I'm not sure how many of those operations there will be.


Oh, that's completely different; I was using the examples discussed in the description, i.e. neg and not. If you have such block use cases in mind, it's probably worth fleshing out the proposal a bit.


### LUT1 zero

Elements whose index is out of bounds are set to `0`.


A small note about your example - if TBL/TBX is what you have in mind for LUTn, then SVE2 supports only LUT1 and LUT2 (unlike Neon), but, of course, lookup into arbitrary large tables can be synthesized with TBX (and probably vector DEC*).


Applies swizzle to each v128 of the vector.

- `vec.i8x16.swizzle(a: vec.v128, s: vec.v128) -> vec.v128`


Another small remark: The LUT* operations take the index vector as the first parameter, while in this case it is the second one - is that for consistency with the WebAssembly SIMD specification?

Author


Yes, it is for consistency with the i8x16.swizzle operation.

@lemaitre

I also had a chance to look at the proposal.

Thanks!

... let mask types have sizes and have a way to load/store them in their natural, architectural representation, or have a way to load/store them in a compact format (1 bit per element).
The second would make the most sense as an abstraction layer, but could be slow on some architectures (Neon for sure, maybe SVE).

Not great for SVE either.

After thinking about it, I think we should provide multiple instructions to let the user decide how they want to store the masks:

  • 1 bit per element,
  • 1 bit per byte,
  • as a vector

IMHO it would not be hard to implement with SVE, but it would be very inefficient because it would involve a round-trip from the predicate registers to the general-purpose ones and back. Taking the combination of vec.mX.index_first and extract_lane as an example, my expectation would be that compilers would pattern-match the combination and just generate a LASTB instruction (simplifications are possible in specific cases). From SVE's perspective, the ideal solution would be to have an operation that extracts an element based on a mask value, but I am not sure how well that maps to other architectures.

Extracting a value based on a mask value would be a nightmare on x86, as you would have to first extract the mask into a general purpose register, then find the index of the first set bit (with tzcnt), then move the result back to a SIMD register, shuffle the input vector, and finally extract the first lane value (that would be ~15 cycles of latency, if I'm not mistaken).
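Roughly the sequence I have in mind, as an AVX + BMI1 sketch for 32-bit lanes (assumes at least one active lane; the function name is mine, illustrative only):

```c
#include <immintrin.h>

/* Mask -> GPR, find the first set bit, move it back to the SIMD domain,
 * variable-shuffle the wanted lane into slot 0, then read slot 0. */
static float extract_first_active_f32(__m128 v, __m128 mask) {
  unsigned bits = (unsigned)_mm_movemask_ps(mask);  /* 1 bit per lane       */
  unsigned idx  = _tzcnt_u32(bits);                 /* index of first lane  */
  __m128i ctrl  = _mm_cvtsi32_si128((int)idx);      /* back to a vector reg */
  return _mm_cvtss_f32(_mm_permutevar_ps(v, ctrl)); /* lane idx -> slot 0   */
}
```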

So what I proposed should somehow minimize the overhead across architectures. And yes, some pattern matching could make it more efficient.


@akirilov-arm left a comment


@penzn

I think there is consensus, at least for some part, that narrowing concats the inputs and widening uses low/high.

While SVE makes it possible to implement different approaches, there is hardly a consensus. For example, of all the narrowing and widening conversion operations defined in this proposal, only the widening integer conversions map to a single instruction, SUNPK* or UUNPK*, e.g. SUNPKHI (these instructions happen to be the building block for converting between the native SVE approach, i.e. overlapping even/odd elements or bottom/top in SVE parlance, and other representations).

@lemaitre

And yes, some pattern matching could make it more efficient.

Keep in mind that may require pattern-matching across basic blocks, which might prove a challenge, especially for simpler JIT compilers.


`idx` is interpreted modulo the length of the vector.

- `vec.s8.extract_lane(v: vec.v8, idx: i32) -> i32`


While you do provide a motivation for avoiding immediate parameters, I think it is worth mentioning the trade-off - immediates make it easy to specialize the operation (and this is one advantage of the original proposal), which is particularly important for SVE, since explicit indices are not its natural approach, as discussed. E.g. extracting lane 0 (which I suspect is going to be very common, if not the most) can be implemented just with a Neon move. TBH I don't think Neon fares much better in this case (explicit indices are fine, but they have to be compile-time constants).

Is your expectation that most indices are going to be compile-time constants, in which case pattern-matching operations would help?

Author


I do expect indices to be almost always constants. And I do expect WASM engines to detect such constants with a constant folding pass (which should be both simpler and more robust than pattern matching, I assume).

We could have two instructions: one with runtime indices, and one with immediates. But I see two problems:

  • it doubles the number of opcodes,
  • except for 0, the index will usually not be an immediate as it will most likely depend on the vector length.

The vector length is a "platform constant", so can be detected by the engine as being constant, but is not an immediate. So constant detection looks more useful than immediates.


* except for 0, the index will usually not be an immediate as it will most likely depend on the vector length.

BTW that might apply to more values, depending on how the discussion about the permissible vector lengths goes (given that you seem to be leaning towards powers of 2). E.g. in the case of SVE everything that fits within the lowest 16 bytes is fair game.

@lemaitre

Keep in mind that may require pattern-matching across basic blocks, which might prove a challenge, especially for simpler JIT compilers.

I know that pattern matching has limitations, and that's why the overhead without pattern matching should be minimal. Pattern matching should be the cherry on top of the cake.

Most of the time, it would be possible to duplicate the instructions to overcome the genericity issue (in that case, one instruction taking a scalar index, and one taking a mask), but I think the number of opcodes is rather limited, so I chose the one that I thought had less overhead overall.

@akirilov-arm

The motivation is simply hardware availability (mostly) on the consumer market. On the other hand, we obviously don't want to close the door to SVE and RISC-V, though the case for those is not strong yet.

Just to add a bit to the SVE market availability angle - Arm has already announced processor cores with SVE support that cover almost the full range of the A architecture profile; the latest announcement was 3 weeks ago and concerned the next generation mobile cores. While it is true that it is not possible to buy anything featuring them on the market right now, if the past provides any indication about the future, then we should expect actual products in at most a year.

In order to cut down the attention spread, we can dedicate some time in the SIMD conference call (#37), if that works for both of you.

It should work for me; however, I think that @lemaitre and I are mostly in agreement, and I have just been asking clarifications on specific points.


penzn commented Jun 15, 2021

Arm has already announced processor cores with SVE support that cover almost the full range of the A architecture profile; the latest announcement was 3 weeks ago and concerned the next generation mobile cores.

I have seen the SVE2 announcement, but missed this one :) Sure, this makes things different.

@sparker-arm

Hi,

I've done a little experiment to see what the effects are of using masked memory operations vs using a vector loop with a scalar remainder. I've used the most basic loop I could think of:

void test_loop(int32_t * __restrict__ a, int32_t *b, int32_t *c, unsigned N) {
  for (uint32_t i = 0; i < N; ++i) {
    a[i] = b[i] + c[i];
  }
}

And this is compiled, with clang-13 on Linux, using various pragmas to disable loop unrolling, set the vector width, and enable/disable predication. The table below shows the relative execution speed of the loop with different pragmas.

| Loop | i7-10610U | Cortex-A72 (Raspberry Pi) |
| --- | --- | --- |
| scalar | 1.00 | 1.00 |
| 4xi32 vector | 7.72 | 2.63 |
| 8xi32 vector | 7.26 | 2.74 |
| 16xi32 vector | 3.99 | 2.80 |
| 4xi32 predicated vector | 2.20 | 0.74 |
| 8xi32 predicated vector | 3.41 | 0.67 |
| 16xi32 predicated vector | 3.52 | 0.68 |

I have no idea about AVX-2 and Skylake, but the super-linear results for 4xi32 and 8xi32 surprised me. What wasn't so surprising was the negative effect that predicated memory operations have on performance. It would appear that using the native masking operations roughly halves the throughput compared to the non-predicated version.

On the NEON based Cortex-A72, we don't have any native masking support, so performance is significantly worse than the scalar loop.

So I can't see how introducing masked memory operations into wasm is currently viable, when they're not supported well (or at all) on the majority of currently available hardware.

@lemaitre

lemaitre commented Nov 11, 2021

@sparker-arm I don't know what you measured, but I think you have put too many constraints on the compiler for your benchmark to be relevant. I don't get where you expect masks to be used in your code. The compiler should not generate masks instructions for the loop body because you have no conditionals. The only place it would make sense in your code would be in the epilog, but it is hard to measure the time of the epilog. If your loop is too long, the epilog time would be negligible; if it is too short, you will measure the time of your timer, not your code.

So I gave a try at measuring the performance of masked stores on x86_64, but in the most complex case: ie, masked stores at every single iteration, and with random masks (not the special-case ones from an epilog).

The kernel measured looks like this:

const int N = 1024;
int A[N], B[N];
void scalar() {
  for (int i = 0; i < N; ++i) {
    int b = B[i];
    if (b) A[i] = b;
  }
}

The complete list of versions and their descriptions is available in this gist: https://gist.github.com/lemaitre/581f3416dc3abc9944d576d08c2a444b

Here are the results on an i9-7900X (time in cycles per point):

| version | random | always | never |
| --- | --- | --- | --- |
| scalar_v0* | 2.04 | 2.04 | 2.05 |
| scalar_v1 | 12.7 | 2.03 | 3.02 |
| scalar_v2 | 2.04 | 2.04 | 2.06 |
| SSE4_v0* | 0.539 | 0.531 | 0.539 |
| SSE4_v1 | 12.5 | 1.54 | 1.95 |
| SSE4_v2 | 2.13 | 2.11 | 2.13 |
| SSE4_v3 | 7.84 | 2.29 | 2.29 |
| SSE4_v4 | 7.99 | 1.27 | 1.03 |
| AVX2_v0* | 0.309 | 0.281 | 0.289 |
| AVX2_v1 | 0.299 | 0.281 | 0.291 |
| AVX512_v0* | 0.156 | 0.148 | 0.139 |
| AVX512_v1 | 0.139 | 0.145 | 0.117 |

Some caveats: the v0s (*) are only valid in a single-threaded context, and SSE4_v3 uses the maskmov instruction, which bypasses the cache and thus has a huge latency, but I only measured the throughput here. SSE4_v1 and v2 are plain old scalar emulation (so directly usable on ARM).

So basically, scalar emulation is on par with plain scalar (SSE4_v2), and even possibly faster if we know there is a predictable pattern (SSE4_v4).
This shows that we can propose masked instructions, as they are as fast as scalar even when they are emulated with scalar code.
With hardware support, it is incredibly fast.

For prologs, this can be made even faster because the masks have a special shape and can be further optimized.

@sparker-arm

Hi @lemaitre Each loop was run over 100,000 elements and a minimum time was taken from 10 runs. I'm using predication to remove the epilogue, so all the memory operations in the loop are masked. I see this is one of the main reasons for having flexible vectors: so that wider vector units aren't penalised for smaller loops and also so we don't have to distribute multiple versions of a loop. Or do you envision a compiler producing different versions of each loop?

And pardon my ignorance, but why are you only looking at stores?

@lemaitre

Each loop was run over 100,000 elements and a minimum time was taken from 10 runs.

Over such a large array on a simple computation like yours, you should have been limited by L3 cache bandwidth, not by masking.

I'm using predication to remove the epilogue, so all the memory operations in the loop are masked.

I don't think that is a viable way to generate efficient loops, as even on AVX512 (with native masking), the overhead of computing the mask every single loop iteration can become non-negligible (even though the masking itself is negligible).
What I envision is to duplicate the loop body for the epilog, and to make only the epilog use masks. This produces larger bytecode, but keeps its efficiency on every architecture. In such a situation, masking is more efficient than a scalar remainder, even with the scalar emulation for the masking operation.

So basically, the compiler would generate exactly 2 versions of the loop: without masks for provably full iteration, and with masks for the remainder. If we want, we could have specialized masks for more optimized epilog handling, but it is not even mandatory.
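In scalar terms, the two versions would look roughly like this (`VLEN` stands in for the runtime vector length and the remainder test stands in for an `index_lt` mask; a sketch with names of my own, not generated code):

```c
#include <stddef.h>
#include <stdint.h>

#define VLEN 8  /* placeholder for the runtime vector length */

static void add_arrays_split(int32_t *dst, const int32_t *a, const int32_t *b,
                             size_t n) {
  size_t i = 0;
  /* Version 1: provably full iterations, no mask at all. */
  for (; i + VLEN <= n; i += VLEN)
    for (size_t l = 0; l < VLEN; ++l)
      dst[i + l] = a[i + l] + b[i + l];
  /* Version 2: a single masked remainder iteration. */
  for (size_t l = 0; l < VLEN; ++l)
    if (i + l < n)                      /* the remainder mask */
      dst[i + l] = a[i + l] + b[i + l];
}
```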

And pardon my ignorance, but why are you only looking at stores?

Because that's the hardest part of masking. Masked loads could be done with a full load and in-register masking. For unaligned loads, it might be necessary to add a check to see if the full load might cross a page boundary and fall back to a slower load in that case (because inactive lanes should not trigger a segfault). We could even skip the check, and just trap the segfault.
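A sketch of that check for a 128-bit load (SSE2, assuming 4 KiB pages; the function name and `PAGE_SIZE` are mine, and the trap-based variant would skip the check entirely):

```c
#include <emmintrin.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* If the 16-byte window cannot cross a page boundary, a full unaligned load
 * is safe even for inactive lanes; otherwise fall back to a per-lane load. */
static __m128i masked_load_i32(const int32_t *src, const uint8_t *mask) {
  uintptr_t off = (uintptr_t)src & (PAGE_SIZE - 1);
  if (off <= PAGE_SIZE - 16)
    return _mm_loadu_si128((const __m128i *)src);
  int32_t tmp[4] = {0, 0, 0, 0};
  for (int i = 0; i < 4; ++i)
    if (mask[i]) tmp[i] = src[i];
  return _mm_loadu_si128((const __m128i *)tmp);
}
```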

The problem with stores is you should not write to inactive lanes, because another thread might write to them: the full-width store can introduce race-conditions. That's why my v0s are wrong.

If I make the masked store emulation as fast as scalar, I consider it a win, because I can write the rest of my kernel with SIMD without any extra overhead.

@akirilov-arm

For unaligned loads, it might be necessary to add a check to see if the full load might cross a page boundary and fall back to a slower load in that case (because inactive lanes should not trigger a segfault).

Just a quick note - in practice aren't almost all loads effectively unaligned ones? The alignment specified by the memarg is just a hint and I doubt that the Wasm compiler would be able to prove that the access is aligned that often.

@lemaitre

in practice aren't almost all loads effectively unaligned ones?

I also think so. At least if we keep the hint semantic. Personally, I would much prefer to make it a constraint.
But even so, the check that a load does not cross a page boundary should be cheap, especially because page boundaries will not be crossed often, and the branching pattern should be mostly regular.

@akirilov-arm

akirilov-arm commented Nov 11, 2021

Sure, considering each branch in isolation, the branching pattern should be mostly regular, but the branch density might have negative interactions with branch prediction (which wouldn't be an issue with the scalar version). That could be compensated for by inserting no-op instructions, at the expense of code size.

@lemaitre

Sure, considering each branch in isolation, the branching pattern should be mostly regular, but the branch density might have negative interactions with branch prediction (which won't be an issue with the scalar version).

First, the naive (branchy) scalar version will eat up the same amount of "branch predictor space" as my hypothetical branch on page boundary crossing (if the scalar is unrolled, then scalar would be worse in this regard). Then, I wrote a branchless masked store that is on par with the branchless conditional scalar store. We could have the exact same thing for loads.

Finally, I already mentioned it multiple times, but we could trap segfaults and check if they come from a legitimate masked load or not. The probability that such a masked load actually segfaults on inactive lanes is so low that the overhead from the signal is not relevant.

@sparker-arm

Over such a large array on a simple computation like yours, you should have been limited by L3 cache bandwidth, not by masking.

Okay.... to make you happy, 1000 elements:

| Loop | i7-10610U | Cortex-A72 (Raspberry Pi) |
| --- | --- | --- |
| scalar | 1.00 | 1.00 |
| 4xi32 vector | 5.49 | 3.04 |
| 8xi32 vector | 4.97 | 3.27 |
| 16xi32 vector | 5.49 | 3.27 |
| 4xi32 predicated vector | 2.13 | 0.74 |
| 8xi32 predicated vector | 3.24 | 0.67 |
| 16xi32 predicated vector | 3.33 | 0.68 |

What I envision is to duplicate the loop body for the epilog, and to make only the epilog using masks.

This is reasonable, but you'll still penalize SSE and Neon for kernels with small loop iteration counts, creating a performance cliff on certain architectures.

For unaligned loads, it might be necessary to add a check to see if the full-load might cross page boundary and fallback to slower load

Right... and as mentioned, that's all of them?

the check that load does not cross page boundary should be cheap

Do you know of any compilers that do this currently?

we could trap segfaults, and check if it comes from a legitimate masked load or not.

This sounds like a bad design choice, a compiler shouldn't take valid code, make it invalid and then rely on all the runtimes to handle the fault. We (runtimes) would most likely just have to scalarize and the branch density will not be good.

@lemaitre

Okay.... to make you happy, 1000 elements:

No, I did not make myself clear enough. My point was that such a simple kernel should be limited by nothing but memory bandwidth (or at least L3 bandwidth) for 12 MB of data, so your benchmark may be flawed somewhere (or you have impossibly fast memory). But as you don't provide the code, I cannot tell you more.

This is reasonable, but you'll still penalize SSE and Neon for kernels with small loop iterations, creating a performance cliff on certain architectures.

How is it penalizing SSE and Neon? Can you write an epilog that is faster than using masked loads/stores? Scalar will be slower. And the larger the vector, the more expensive a scalar remainder will be. A vector remainder has no such problems, but is only possible using masked instructions.
Also, flexible vectors are all about "creating" a performance cliff: we want to leverage all the power from our fast archs.

The real point is: not penalizing old archs just for the sake of making recent ones faster. But introducing masked instructions does not cripple SSE/Neon, because the alternative would be scalar, which is not faster. In actual code, scalar will even be slower because the whole loop would be scalar, while here, only the masked memory accesses would be.

For unaligned loads, it might be necessary to add a check to see if the full-load might cross page boundary and fallback to slower load

Right... and as mentioned, that's all of them?

You don't get it. If your scalar code has a branch for the conditional access, my solution does not introduce any extra ones (one branch in both cases). If your scalar code does not have a branch, fine, we can also implement masked load/store without any branch (SSE4_v2); it would be just a bit slower than the branchy one on predictable patterns, but still as fast as the scalar one (or even faster on more complex code).

the check that load does not cross page boundary should be cheap

Do you know of any compilers that do this currently?

we could trap segfaults, and check if it comes from a legitimate masked load or not.

This sounds like a bad design choice, a compiler shouldn't take valid code, make it invalid and then rely on all the runtimes to handle the fault.

The compiler has nothing to do with all of that! The compiler will just generate the WASM bytecode that will contain masked memory instructions. Then, the WASM engine (ie: runtime) will convert this bytecode into actual machine code, and this is where the page boundary crossing check would be integrated and/or the signal trap. As it is done by the runtime, the runtime itself can be sure the trap will be there.

We (runtimes) would most likely just have to scalarize and the branch density will not be good.

I don't get your point. First, it is possible to implement masked load/store without any branch (have you checked my code?), even if it is a "scalar emulation". Then, the signal approach does not introduce any branch either. Only when a load is at the edge of a system memory allocation would it be possible to trigger the fault. This is very unlikely to happen in real code, and thus this case can be slow and we would not care.

@sparker-arm

The only paper I've seen on this approach is this, and it certainly shows potential. If you have some more links I'd be interested in having a look. But for all your assertions about this approach being faster, I'd like to see some examples of a compiler choosing to vectorize, using masked loads and stores, for an architecture that doesn't support them. Having an existing production system able to make sensible choices would appease me completely; otherwise it seems like a leap of faith in what we think a wasm compiler will be able to do. And if a wasm compiler can't use this stuff effectively, then it will just be disabled - and nobody wants that!

How is it penalizing SSE and Neon? Can you write an epilog that is faster than using masked loads/stores? Scalar will be slower.

Treat my example as an epilogue: scalar is faster on my Raspberry Pi using the current magic in LLVM. I think the problem here is that we're thinking about masking from two different angles: I think you're mainly considering removing control flow and I'm considering how to represent a vector loop in the 'flexible' manner. I'd suggest that your primary use case is a separate issue from enabling wider vector widths and is, instead, the next step in the Wasm SIMD saga.

@lemaitre

lemaitre commented Nov 12, 2021

But for all your assertions about this approach being faster, I'd like to see some examples of a compiler choosing to vectorize, using masked load and stores, for an architecture that doesn't support them

I don't care about what compilers are doing now. I care about what we can do. And we can do much better in that regard.
You want me to prove that masked remainder is faster than scalar, fine, here it is: https://gist.github.com/lemaitre/a3510fa5d1d3cace46adc56be73e07ab

The kernel looks like this:

void func() {
  for (int i = 0; i < N; ++i) {
    int a = A[i];
    for (int j = 0; j < a; ++j) {
      B[j] += 1;
    }
  }
}

I have 5 versions:

  • scalar_v0: scalar code with vectorization disabled
  • scalar_v1: scalar code and vectorization forced
  • sse4_v0: simd code with scalar remainder
  • sse4_v1: simd code with branchless masked remainder (optimized masks)
  • sse4_v2: simd code without remainder (lower bound of the processing time)

The number of iterations in the inner loop is small and random. The range is given in the header of the table:

| version | [0, 4] | [0, 8] | [0, 16] | [0, 32] |
| --- | --- | --- | --- | --- |
| scalar_v0 | 20.8 | 26.6 | 35.8 | 49.8 |
| scalar_v1 | 22.0 | 31.4 | 39.6 | 45.3 |
| sse4_v0 | 22.6 | 33.5 | 42.2 | 48.0 |
| sse4_v1 | 9.57 | 18.6 | 30.2 | 34.5 |
| sse4_v2* | 7.97 | 14.3 | 23.3 | 27.5 |

We can clearly see that the scalar remainder is slow on tiny loops, while my branchless masked remainder does a great job and is quite close to the no-remainder code. The most impressive aspect of this is that the loop body is ridiculously tiny (a load, a plus, and a store), so this is where the cost of the masking will be the most visible. But it still outperforms the scalar remainder nonetheless.
I think a branchy masked remainder can be even more effective as long as there is not much pressure on the branch predictor.

And if a wasm compiler can't use this stuff effectively, then it will just be disabled - and nobody wants that!

I agree with that. And my results show that it can be effective. Ok, for now, it is hand-coded, but if you look at the code, you will see that automatic generation of such masked memory accesses will be easy.

Treat my example as an epilogue, scalar is faster on my raspberry pi using the current magic in LLVM.

I don't consider your example at all, as your results show there is something weird with your implementation, yet you do not provide it for me to check why.

I think the problem here is that we're thinking about masking from two different angles, I think you're mainly considering removing control-flow and I'm considering how to represent a vector loop in the 'flexible' manner.

My stance is that masks are needed anyway, so we can just use them to implement remainder. I showed that, performance wise, it is nice, even though the masks are emulated.

@penzn closed this Nov 12, 2021

penzn commented Nov 12, 2021

Oops, closed by accident.

@penzn reopened this Nov 12, 2021

penzn commented Nov 12, 2021

So I can't see how introducing masked memory operations into wasm is currently viable, when they're not supported well (or at all) on the majority of currently available hardware.

@sparker-arm I am not sure what point you are trying to make. Is it that AVX2 (which your CPU has) doesn't have predicates? Or that there are challenges with trying to emit them in autovectorized programs?

There are obvious differences in predicate support between different ISA generations and different architectures, but I am confused about how your example is supposed to demonstrate those.

void test_loop(int32_t * __restrict__ a, int32_t *b, int32_t *c, unsigned N) {
  for (uint32_t i = 0; i < N; ++i) {
    a[i] = b[i] + c[i];
  }
}

And this is compiled, with clang-13 on Linux, using various pragmas to disable loop unrolling, set the vector width, and enable/disable predication. The table below shows the relative execution speed of the loop with different pragmas.

Specifically, I am not quite sure what you are observing when you are adding loop pragmas to a scalar A + B loop (in addition to what @lemaitre already said about the remainder vs the rest and the memory hierarchy). I think the pragmas you are using would potentially interact with one another. The fact that simple vectorization runs much slower than scalar is a bit of a giveaway:

| Loop | i7-10610U | Cortex-A72 (Raspberry Pi) |
| --- | --- | --- |
| scalar | 1.00 | 1.00 |
| 4xi32 vector | 7.72 | 2.63 |
| 8xi32 vector | 7.26 | 2.74 |
| 16xi32 vector | 3.99 | 2.80 |
| 4xi32 predicated vector | 2.20 | 0.74 |
| 8xi32 predicated vector | 3.41 | 0.67 |
| 16xi32 predicated vector | 3.52 | 0.68 |

(this is a copy of the original table from #27 (comment), before the reduction in size, to avoid scrolling up and down)

If the effect of vectorization were a 3 or 8 times drop in performance, as in this data, then nobody would really be using it :) Autovectorization is a multi-stage process, where first the compiler has to find enough parallelism, then select the right operations, and so on, all of it interacting with other transformations. By adding extra pragmas you are likely disrupting it somewhere. Two more symptoms are that supposedly predicated code is actually running on a non-predicated instruction set, and that increasing the vector size beyond what the platform supports yields better performance on x86. In a way, the comparison between predicated and non-predicated vector code running times runs counter to the point you seem to be making about mask support, but I have doubts about this methodology in general.

If you want this to be a bit more thorough, it would be good to list the assembly for each kernel, paired with the clang commands and the full sources with the pragmas (in a gist or a separate repo).

@sparker-arm

Sorry @penzn, the table shows the relative performance / speedup so hopefully that makes a bit more sense now! This is how I'm using the pragmas, to force the vectorizer into the various settings, and these work when targeting aarch64 or x64.

__attribute__((noinline))
void test_predicated_vector_256(int32_t * __restrict__ a, int32_t *b, int32_t *c, unsigned N) {
  #pragma clang loop vectorize(enable)
  #pragma clang loop vectorize_width(8)
  #pragma clang loop vectorize_predicate(enable)
  #pragma clang loop unroll(disable)
  for (uint32_t i = 0; i < N; ++i) {
    a[i] = b[i] + c[i];
  }
}

@sparker-arm

@lemaitre I'm still trying to get my head around your benchmarks, it's a lot of code for someone who doesn't know x86! I don't think I understand how your examples are valid in the wider context of wasm though, you can write branchless remainder code for x86, whereas we cannot for aarch64. In aarch32/thumb-2 this would be okay, but I'm not sure what other architectures support conditional load/stores.

I'm also thinking about the feasibility of using traps to handle 'bad' vector loads. For an OoB load, how would we (runtimes) determine that the load is actually 'okay'?

@lemaitre

lemaitre commented Nov 15, 2021

@sparker-arm

you can write branchless remainder code for x86, whereas we cannot for aarch64

Even though cmov on x86 can be used for a conditional load, it is actually kind of useless as it does not prevent faults if the address is invalid and the condition is not met. The trick I used is to use cmov to select either the address in the array or a dummy address on the stack, and then use an unconditional load/store. This can also be implemented on ARM using the CSEL instruction. In fact, we could even compute the addresses (a bit) faster using SIMD.
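In C, the trick looks roughly like this (the ternary typically compiles to cmov on x86 or csel on AArch64; the function name is mine, illustrative sketch):

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless masked store: inactive lanes are written to a dummy stack slot
 * instead of memory, so the store itself is unconditional. */
static void masked_store_i32(int32_t *dst, const int32_t *val,
                             const uint8_t *mask, size_t n) {
  int32_t dummy;
  for (size_t i = 0; i < n; ++i) {
    int32_t *p = mask[i] ? &dst[i] : &dummy;
    *p = val[i];
  }
}
```

Because inactive lanes are never written to `dst`, this also avoids the race condition mentioned earlier for full-width stores.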

Now, the code you generated with clang does work, but is horrendous: https://gcc.godbolt.org/z/zs38eMveK
No wonder why it is slow.

EDIT: The x86 version looks better, but it falls back to scalar operations, not masked vectors: https://gcc.godbolt.org/z/79brKWoxz

The thing you seem to have missed is that the loop body does not need predication, only the remainder. This is a very simple way to make the predication cost negligible.

In fact, that's what my code tries to benchmark. The inner loop is small and has a remainder (implemented either in scalar, or with predicated load/store), and the outer loop just repeats the inner one with a random number of iterations. That way, I can measure the time taken by the loop while the remainder is not negligible. Note that my code is explicit about how to do the remainder.

Of course, the method implies that the loop body is duplicated, but I think that is a "necessary evil" for efficient loops.
And just to put things in perspective, the code generated from your example contains four copies of the loop body (if I counted correctly).
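
For reference, a rough C-level sketch of the structure I am describing: an unpredicated main loop over full vectors, plus a separate remainder that is the only part needing masking (written here as a scalar loop for simplicity; the function and names are illustrative only):

#include <stdint.h>
#include <stddef.h>

void add_arrays(int32_t *restrict a, const int32_t *restrict b,
                const int32_t *restrict c, size_t n) {
  size_t i = 0;
  /* Main loop: full 8-element blocks, no predication needed. */
  for (; i + 8 <= n; i += 8) {
    for (size_t j = 0; j < 8; ++j)   /* stands in for one 8-wide SIMD add */
      a[i + j] = b[i + j] + c[i + j];
  }
  /* Remainder: fewer than 8 elements left. Only this part needs masked
   * loads/stores (or, as here, a scalar fallback). */
  for (; i < n; ++i)
    a[i] = b[i] + c[i];
}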

I'm also thinking about the feasibility of using traps to handle 'bad' vector loads. For an OoB load, how would we (runtimes) determine that the load is actually 'okay'?

We would need a list of masked memory loads. If the faulting instruction address is in the list, it was a masked load, and a more robust masked load should be used instead. In order to keep the list small, we could make this list per function. Or we could add some special noops before or after the load that the signal handler could check to see if the load was a masked one (and where the mask is). Finally, because WASM has only a single memory region, we could reserve an extra page at both ends of the region to ensure a masked load with at least one valid lane would never fault.
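
As an illustration only (not a proposal for how any particular engine should implement it), here is a sketch of the "list of masked memory loads" idea as a SIGSEGV handler. masked_table, fallback, and get_fault_pc are hypothetical names; extracting the program counter from the ucontext and resuming after the access are platform-specific details glossed over here:

#include <signal.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical table filled in by the engine's JIT: one entry per emitted
 * masked load/store, mapping its code address to a slow lane-by-lane
 * fallback. The fallback also advances the saved PC in the ucontext past
 * the faulting instruction. */
typedef struct {
  uintptr_t code_addr;                 /* address of the emitted access   */
  void (*fallback)(void *ucontext);    /* re-does the access lane by lane */
} masked_access_entry;

extern masked_access_entry masked_table[];
extern size_t masked_table_len;

/* Placeholder: reading the faulting PC out of the ucontext differs between
 * platforms (the uc_mcontext layout is not portable). */
extern uintptr_t get_fault_pc(void *ucontext);

static void segv_handler(int sig, siginfo_t *info, void *ucontext) {
  (void)info;
  uintptr_t pc = get_fault_pc(ucontext);
  for (size_t i = 0; i < masked_table_len; ++i) {
    if (masked_table[i].code_addr == pc) {
      masked_table[i].fallback(ucontext); /* registered masked access */
      return;                             /* resume with fixed-up context */
    }
  }
  /* Not a registered masked access: a genuine OOB trap, forward it. */
  signal(sig, SIG_DFL);
  raise(sig);
}

static void install_handler(void) {
  struct sigaction sa = {0};
  sa.sa_sigaction = segv_handler;
  sa.sa_flags = SA_SIGINFO;
  sigaction(SIGSEGV, &sa, NULL);
}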

@lars-t-hansen

Finally, because WASM has only a single memory region, we could reserve an extra page at both ends of the region to ensure a masked load with at least one valid lane would never fault.

By this I hope you mean that the source program will allocate extra memory in the wasm heap and not that the implementation will do this for you, since most practical Wasm implementations will insist on controlling the memory beyond the heap and use it for bounds check elimination / address calculation optimization purposes (among other things).

A couple of other drive-by comments about dealing with unaligned accesses that overlap the end of the heap in practical implementations:

It is true that one can use a signal handler to fixup a partially-out-of-bounds or unaligned access, and Firefox did that for a while for in-bounds unaligned accesses on ARM, but it's a mess in practice - kernel data structures necessary to do so are not always documented and system headers are frequently not shipped in a usable form, requiring difficult workarounds and sometimes a reading of licenses that makes some people nervous.

For partial-OOB stores the CPU specs have some tricky wording. Unaligned stores on ARMv8 will revert to byte-at-a-time semantics and it appears to be implementation-defined (since the Pixel2 Snapdragon and Apple M1 do this differently in my experience, though I need to investigate details here further) whether an address violation / accessibility check is performed for the entire store before the first byte is stored or whether each byte store is checked individually. As a consequence, an unaligned store that overlaps the end of the heap may store partial data at the end of the heap, or not. This is a bit of a mess and may require a Wasm implementation that uses page fault tricks for bounds checking to insert its own guard before each store on some CPUs if it can't prove that the store is completely in-bounds.

(The alignment hint in wasm memory accesses is generally ignored by implementations I believe - certainly we ignore these hints in all cases. Unaligned accesses have to be handled anyway, CPUs are increasingly able to deal with unaligned accesses transparently, toolchains don't produce the hints, and there's no benefit to generating special code for accesses hinted as unaligned.)

@sparker-arm

The trick I used is to use cmov to select either the address in the array or a dummy address on the stack, and then use an unconditional load/store.

Ah, thanks, that makes sense now. I'll try this on my rpi.

We would need a list of masked memory loads.
we could add some special noops before or after the load that the signal handler could check to see if the load was a masked one (and where the mask is).

Yes, I was thinking more about the complications of finding the mask value. So more like a map of a load to a compare, instead of a list, but then I'm still not sure what I'd do with that. But that may be because I've never written a handler before... For the case where we have a 'masked' load, which will be merged in-register, the compare for the select may not even have been executed yet for that iteration, so how do I calculate my mask and recover gracefully? If I had source-level exceptions, I think I could just put the scalar loop body in the 'catch' block, but I'm really not sure at the runtime level.

Firefox did that for a while for in-bounds unaligned accesses on ARM, but it's a mess in practice - kernel data structures necessary to do so are not always documented and system headers are frequently not shipped in a usable form, requiring difficult workarounds and sometimes a reading of licenses that makes some people nervous.
As a consequence, an unaligned store that overlaps the end of the heap may store partial data at the end of the heap, or not. This is a bit of a mess and may require a Wasm implementation that uses page fault tricks for bounds checking to insert its own guard before each store on some CPUs if it can't prove that the store is completely in-bounds.

Interesting! And just to point out, V8 on aarch64 (android/linux) currently doesn't use trap handlers for OoB.

I'm definitely not a fan of the idea of relying on trap handlers, which is why I'm more inclined to think that the compiler would insert some runtime checks instead... Then I think we should be able to have a vectorized loop that wouldn't fault and a remainder to handle the rest. I need to think some more about this.

@penzn

penzn commented Nov 16, 2021

@sparker-arm thank you for clarification on pragmas.

Sorry @penzn, the table shows the relative performance / speedup so hopefully that makes a bit more sense now!

OK, that makes more sense; I thought it was relative execution time. Raspberry Pi vector performance looks a bit low though.

@lemaitre

lemaitre commented Nov 16, 2021

@lars-t-hansen

By this I hope you mean that the source program will allocate extra memory in the wasm heap and not that the implementation will do this for you, since most practical Wasm implementations will insist on controlling the memory beyond the heap and use it for bounds check elimination / address calculation optimization purposes (among other things).

I do mean that the runtime adds the extra page around the heap: the runtime cannot be sure that the user did it, so it has to assume they have not, and it therefore needs its own mechanism to ensure this works in all circumstances.
But if a runtime wants to trade off speed for a bit of safety, fine, it can do so. I am just exploring multiple ways to efficiently implement masked loads/stores, and these readable guard pages are a very easy and very efficient way to implement masked loads.
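
A minimal sketch of what I mean by readable guard pages, assuming a POSIX engine that reserves its linear memory with mmap (sizes, flags, and the helper name are illustrative, not how any existing engine works; heap_bytes is assumed to be a multiple of the page size):

#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

/* Reserve one extra readable page on each side of the heap, so a masked
 * load whose first active lane is in bounds can never fault, even if the
 * vector overhangs the edge of the heap. */
static uint8_t *alloc_heap_with_guards(size_t heap_bytes, size_t page) {
  uint8_t *base = mmap(NULL, heap_bytes + 2 * page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED)
    return NULL;
  /* Guards stay readable (wide loads succeed) but not writable, so stray
   * stores into them still trap. */
  mprotect(base, page, PROT_READ);
  mprotect(base + page + heap_bytes, page, PROT_READ);
  return base + page; /* the wasm heap starts after the leading guard */
}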

Also, I doubt that many oob accesses can be caught this way because WASM has a single memory region so most software oob accesses would be within the region, and thus not catchable that way. This is very different from native processes where you can easily detect stack overflows using this method.

It is true that one can use a signal handler to fixup a partially-out-of-bounds or unaligned access, and Firefox did that for a while for in-bounds unaligned accesses on ARM, but it's a mess in practice - kernel data structures necessary to do so are not always documented and system headers are frequently not shipped in a usable form, requiring difficult workarounds and sometimes a reading of licenses that makes some people nervous.

You don't really need to access kernel data structures to do the fallback. You could just make the signal handler set the context to resume inside a function where all registers are callee-saved, and that returns just after the "not-so-faulty" load. Of course, the signal handler should set up the stack frame correctly, and the function should return the loaded value in the same register as the one used by the actual load, but it is definitely possible. And if this special function faults, that would be fine, because the memory accesses inside it are not registered as masked ones, and the signal handler will thus forward the signal.

Maybe it would be possible to just execute the fallback inside the signal handler, but I don't know how current platforms handle nested faults.

For partial-OOB stores the CPU specs have some tricky wording. ...

I don't get your point here. To me, it should be fine to say oob (like in: outside the heap) is undefined behavior, instead of saying it traps. Also, this has nothing to do with masks or even flexible vectors, because this problem can already arise today with 128-bit SIMD.

Also, SIMD memory accesses must be allowed to slice (tear), because no architecture makes any guarantee about it. So the fact that some architectures fall back to byte accesses is actually fine.

(The alignment hint in wasm memory accesses is generally ignored by implementations I believe...)

I am disappointed about alignment handling in WASM SIMD. A hint alone is of no use; it would have been much better to make alignment a proper constraint, with undefined behavior if it is not met at runtime. But I bet that's not the case because people don't want UB...


@sparker-arm

Yes, I was thinking more about the complications of finding the mask value. So more like a map of a load to a compare, instead of a list, but then I'm still not sure what I'd do with that.

You got it. Basically, the map would tell you which register the mask is in, and the signal handler would just make the call to the fallback implementation (see above).

I'm definitely not a fan of the idea of relying on trap handlers, which is why I'm more inclined to think that the compiler would insert some runtime checks instead...

That is not the role of the compiler. The compiler would just generate WASM masked loads/stores (for the remainder only, or for the whole loop while skipping the remainder, as it pleases). The runtime checks (if any) would be generated by the runtime (ie: the WASM engine) in case there are no masked load/store instructions on the target platform. And just to be clear, I do not object to that; I just want to explore multiple ways to do it. Maybe some runtimes will choose different strategies, and that's fine.

Then I think we should be able to have a vectorized loop that wouldn't fault and a remainder to handle the rest.

That would definitely be the case if the loop body is not predicated, and only the remainder actually uses masks.

@lars-t-hansen

@lars-t-hansen

By this I hope you mean that the source program will allocate extra memory in the wasm heap and not that the implementation will do this for you, since most practical Wasm implementations will insist on controlling the memory beyond the heap and use it for bounds check elimination / address calculation optimization purposes (among other things).

I do mean that the runtime adds the extra page around the heap: the runtime cannot be sure that the user did it, so it has to assume they have not, and it therefore needs its own mechanism to ensure this works in all circumstances. But if a runtime wants to trade off speed for a bit of safety, fine, it can do so. I am just exploring multiple ways to efficiently implement masked loads/stores, and these readable guard pages are a very easy and very efficient way to implement masked loads.

Sure. My point is merely that the runtime has to balance needs, and that the mechanism suggested in this thread to allow a wide read to succeed even when technically OOB is (so far as I can see) at odds with the mechanism runtimes already use to perform bounds checking. Many runtimes will probably prioritize the latter over the former. An argument that the former technique can be used to improve the performance of a computation has to come with the caveat that the technique may not be available in many runtimes.

Also, I doubt that many oob accesses can be caught this way because WASM has a single memory region so most software oob accesses would be within the region, and thus not catchable that way.

I don't know what you mean by that. Firefox uses the page protection mechanism to catch almost all OOB accesses on 64-bit systems; normally we emit virtually no explicit bounds checks at all.

It is true that one can use a signal handler to fixup a partially-out-of-bounds or unaligned access, and Firefox did that for a while for in-bounds unaligned accesses on ARM, but it's a mess in practice - kernel data structures necessary to do so are not always documented and system headers are frequently not shipped in a usable form, requiring difficult workarounds and sometimes a reading of licenses that makes some people nervous.

You don't really need to access kernel data structures to do the fallback. You could just make the signal handler set the context to resume inside a function where all registers are callee-saved, and that returns just after the "not-so-faulty" load.

And for that you need access to "kernel" data structures. While integer registers are in the well-defined, user-exposed part of the sigcontext and can be updated easily, floating point registers are not always so - at least not on ARM. For that you need to know the layout of the hidden bits of the sigcontext.

For partial-OOB stores the CPU specs have some tricky wording. ...

I don't get your point here. To me, it should be fine to say oob (like in: outside the heap) is undefined behavior, instead of saying it traps.

Undefined behavior is not acceptable on the web (and even implementation-defined behavior is undesirable), and yeah, this is a web context. In wasm an OOB access causes a trap which terminates the program, but the surrounding JS code can catch that trap and has access to the program's memory and can in principle (and frequently will) run more code in it. You can argue about whether that's completely sensible, but it will happen. In that case, the rogue writes near the end of the heap need to be avoided.

@sparker-arm

So from a runtime implementer perspective, it sounds like a big headache to try to emulate the masking, so I somehow doubt it will be done - there's already enough to do! But my realisation yesterday was that, from a compiler implementer perspective, we will treat WebAssembly the same as any other backend and only use features when they are likely to make sense for the majority of users. This doesn't have to be as coarse-grained as a wasm proposal; it can be at the opcode level too. Depending on how fast this proposal moves, the majority of hardware and runtime support might not be ready to take advantage of everything here, and that's fine. So I think this would be a really great use case for feature detection - the compiler can be less pessimistic while producing performance-portable code for everyone.

@lemaitre

lemaitre commented Nov 20, 2021

@lars-t-hansen

Sure. My point is merely that the runtime has to balance needs, and that the mechanism suggested in this thread to allow a wide read to succeed even when technically OOB is (so far as I can see) at odds with the mechanism runtimes already use to perform bounds checking. Many runtimes will probably prioritize the latter over the former. An argument that the former technique can be used to improve the performance of a computation has to come with the caveat that the technique may not be available in many runtimes.

Agreed.

Also, I doubt that many oob accesses can be caught this way because WASM has a single memory region so most software oob accesses would be within the region, and thus not catchable that way.

I don't know what you mean by that. Firefox uses the page protection mechanism to catch almost all OOB accesses on 64-bit systems; normally we emit virtually no explicit bounds checks at all.

Sorry, I mixed up "C oob" and "runtime oob". A realization I just had: if you use signals to catch oob accesses, you can also use them to handle masked accesses. So there would be no need for an extra page.

You don't really need to access kernel data structures to do the fallback. You could just make the signal handler set the context to resume inside a function where all registers are callee-saved, and that returns just after the "not-so-faulty" load.

And for that you need access to "kernel" data structures. While integer registers are in the well-defined, user-exposed part of the sigcontext and can be updated easily, floating point registers are not always so - at least not on ARM. For that you need to know the layout of the hidden bits of the sigcontext.

I thought that FP registers were neither saved nor restored automatically by a signal handler. If that's indeed the case, that means you can just access FP registers directly.

In that case, the rogue writes near the end of the heap need to be avoided.

I believe you cannot ensure that, for an unaligned access, the CPU will trap before any write happens. But I think that's not an issue. If the code "segfaults", the dev should consider their code buggy, period. We should not make it harder for ourselves just because people are allowed to write buggy code.

@sparker-arm

So from a runtime implementer perspective, it sounds like a big headache to try to emulate the masking

I think you underestimate the effort that goes into a runtime in general. I don't see emulating masking as harder than most of the features already implemented in a runtime. But it's true that it's a complex beast.
