Lane types #6

Closed
penzn opened this issue May 20, 2020 · 18 comments

penzn commented May 20, 2020

With #1 open, there is an interesting question: how to define the types. Since we want to slice operations by lane type, there are a few ways to approach the 'register' type:

Edited, thanks to @ngzhian for clarifying questions:

  • One size fits all - define a single type, let's say vec or fvec, which would be used by all the operations, regardless of the lane type
    • Instructions would encode lane type, but the Wasm storage type would be the same for all
    • Examples: vec.f32.add :: vec -> vec -> vec, vec.i32.mul :: vec -> vec -> vec, vec.load :: i32 -> vec
    • The closest approach to simd proposal
  • Types broken by lane size - let's say vec.v8, vec.v16, vec.v32, vec.v64 - integer and floating point operations working with the same lane size would take the same type
    • Instructions would still encode lane type, operations working on the same size would share operand types
    • Examples: vec.f32.add :: vec.v32 -> vec.v32 -> vec.v32, vec.i32.mul :: vec.v32 -> vec.v32 -> vec.v32, vec.f64.add :: vec.v64 -> vec.v64 -> vec.v64, vec.i64.add :: vec.v64 -> vec.v64 -> vec.v64, vec.v8.load :: i32 -> vec.v8
    • More type safe than above, but less than below
  • Types broken by data type (and size) - vec.i8, vec.i32, vec.f32, etc. - everything specific to a particular data type
    • Instructions encode lane type and types also encode lane type
    • Examples: vec.f32.add :: vec.f32 -> vec.f32 -> vec.f32, vec.i32.mul :: vec.i32 -> vec.i32 -> vec.i32, vec.i16.sub :: vec.i16 -> vec.i16 -> vec.i16, vec.i8.load :: i32 -> vec.i8
    • Type handling closer to scalar Wasm - different types of the same size would require conversion

I am leaning towards the first solution, with the single type completely interchangeable between various operations, mainly because it is simpler and better aligns with hardware.


abrown commented May 20, 2020

With "one size fits all," how do we know what instructions to lower to? We have to get the lane size from somewhere, right?
[edit: do you mean that the instruction already has this present, e.g. add.v8?]


penzn commented May 20, 2020

The instructions would still have the lane size, just the input and output types would be the same.

For example, if the type is called vec, then all arithmetic ops would use the same type for input and output, but would have the lane type built into the instruction:

vec.i8.add :: vec -> vec -> vec
vec.f32.mul :: vec -> vec -> vec
vec.load :: i32 -> vec
...


ngzhian commented May 20, 2020

One size fits all is more consistent with v128.

But having vec.i8 (fine-grained breakdown by underlying type and size) has some benefits too:

  • easier (citation needed) to avoid performance penalties, e.g. performing a vector float op on the result of an int vector op can incur a performance penalty; by having vec.add :: vec.T -> vec.T -> vec.T where T must be the same, we can help avoid that, since mixing types would require an explicit cast like vec.int_to_float (see the sketch after this list)
  • you have more types but fewer instructions: a single vec.add would do, with the vec type depending on the data type on the stack; perhaps you end up adding more instructions to convert between each type, but overall I think we will still end up with fewer instructions
  • easier for debugging: we have been working on SIMD support in the debugger, and there is an outstanding issue on how we want to display a simd128 value. An array of 16 bytes? 4 uint32_t? In the source code, the user has a clear idea of what this simd128 holds, maybe an f64x2, but by the time it gets to V8 we no longer immediately know what it is, and may need an analysis pass to figure this out
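
To make the first bullet concrete, here is a minimal sketch in plain SSE intrinsics (illustrative only; nothing here is part of this proposal). The first function reinterprets an integer result as floats, the implicit domain crossing an untyped design allows, which can cost a bypass delay on some x86 cores; the second spells the crossing out as a numeric conversion, the rough analogue of a hypothetical vec.int_to_float:

#include <emmintrin.h>

/* Untyped style: the float add consumes an integer result directly
 * (bits reinterpreted), crossing the int/float execution domains. */
__m128 untyped_style(__m128i a, __m128i b, __m128 c) {
    __m128i sum = _mm_add_epi32(a, b);            /* integer domain */
    return _mm_add_ps(_mm_castsi128_ps(sum), c);  /* float domain, possible bypass delay */
}

/* Typed style: the crossing is an explicit int->float conversion
 * (note this computes a different value - the point is visibility). */
__m128 typed_style(__m128i a, __m128i b, __m128 c) {
    __m128i sum = _mm_add_epi32(a, b);
    return _mm_add_ps(_mm_cvtepi32_ps(sum), c);
}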

There is a spectrum here: on one end we have the user source code, where we know what types they want and are dealing with; on the other end is the hardware implementation, where it's all bits. I'm leaning towards the source-code end because I have recently been working on debugging :)

For codegen, I don't think there will be a big difference with any of these types.


abrown commented May 20, 2020

and there is an outstanding issue on how we want to display a simd128 value. An array of 16 bytes? 4 uint32_t?

I've dealt with this a lot in Cranelift. And it is a pain.


ngzhian commented May 20, 2020

and there is an outstanding issue on how we want to display a simd128 value. An array of 16 bytes? 4 uint32_t?

I've dealt with this a lot in Cranelift. And it is a pain.

:) I'd be interested to hear specifics; could you give a high-level description of what you've done?


abrown commented May 20, 2020

Well, one part is the display of values but that's not the biggest pain (e.g. bytecodealliance/wasmtime#1650). Cranelift does have a type system with types like i8x16, f32x4, etc., so as long as we have those types around we can represent the values correctly with a bit of work. The bigger pain is that, e.g., across function boundaries we lose type information (everything becomes a v128; in Cranelift we are using i8x16 to represent this). I described this in more detail in bytecodealliance/wasmtime#1147 if you ignore the Cranelift-specific raw_bitcast stuff; I don't see any good way to recover the original type of a v128.

To try to bring it back to flexible vectors: if we added types to vectors here for the reasons you describe, should we add them to the 128-bit proposal as well?


penzn commented May 20, 2020

Sorry for the confusion, I did not mean to propose storing the type with the value. What I had in mind for typed lanes is that the return type is specific to the lane type and the instruction is also specific to the lane type.

The SIMD proposal is agnostic to lane types, while Wasm in general requires a match between the operation type and the operand type. So in the list above, the latter two approaches would still require a match between the operation and its inputs. This might necessitate conversion instructions for casting to a different type; otherwise it would force the use of loads and stores for conversion.

I do think more type safety is better than less type safety, but wasn't sure whether conversions would be an issue 😄

  • easier (citation needed) to avoid performance penalties, e.g. performing a vector float op on the result of an int vector op can incur a performance penalty; by having vec.add :: vec.T -> vec.T -> vec.T where T must be the same, we can help avoid that, since mixing types would require an explicit cast like vec.int_to_float

From the point of view of separating types this definitely works, but checking the values we have loaded from the stack would mean a branch (and would also mean that the actual hardware instruction is picked when such an instruction executes, which can be an issue too).

  • easier for debugging: we have been working on SIMD support in the debugger, and there is an outstanding issue on how we want to display a simd128 value. An array of 16 bytes? 4 uint32_t? In the source code, the user has a clear idea of what this simd128 holds, maybe an f64x2, but by the time it gets to V8 we no longer immediately know what it is, and may need an analysis pass to figure this out

There are two things which make backtracking the value type very hard, if not impossible: memory is untyped (we don't know what the intended type of a value in memory was), and control flow can affect the value type. I don't think there has been a good solution to this yet.

Debugging might benefit from 'views' into the value: the user can flip between the possible interpretations, trying to match the type they think they are going to see.
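
As a rough illustration of the 'views' idea (hypothetical names, not a proposal for any debugger API), a debugger could keep the raw 128 bits and simply let the user flip between interpretations:

#include <stdint.h>
#include <stdio.h>

/* One 128-bit value, several possible views of it. */
typedef union {
    uint8_t  as_u8[16];
    uint32_t as_u32[4];
    uint64_t as_u64[2];
    float    as_f32[4];
    double   as_f64[2];
} v128_view;

static void print_as_f32x4(const v128_view *v) {
    printf("f32x4(%g, %g, %g, %g)\n",
           v->as_f32[0], v->as_f32[1], v->as_f32[2], v->as_f32[3]);
}

static void print_as_u32x4(const v128_view *v) {
    printf("u32x4(%u, %u, %u, %u)\n",
           v->as_u32[0], v->as_u32[1], v->as_u32[2], v->as_u32[3]);
}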


ngzhian commented May 20, 2020

Sorry for the confusion, I did not mean to propose storing the type with the value. What I had in mind for typed lanes is that the return type is specific to the lane type and the instruction is also specific to the lane type.

Ah, got it, so you're choosing between these 3 signatures? The actual value type is just vec:

vec.add     :: vec -> vec -> vec
vec.v32.add :: vec -> vec -> vec
vec.i32.add :: vec -> vec -> vec

The rest of my comment builds on my misunderstanding of having different vec value types, vec_i8, vec_f32, etc.

From the point of view of separating types this definitely works, but checking the values we have loaded from the stack would mean a branch (and would also mean that the actual hardware instruction is picked when such an instruction executes, which can be an issue too).

Yeah, good point on the branch; so probably we don't want vec.add, but vec.i8.add like you suggested originally.

vec.i8.add :: vec_i8 -> vec_i8 -> vec_i8

There are two things which make backtracking the value type very hard, if not impossible: memory is untyped (we don't know what the intended type of a value in memory was), and control flow can affect the value type. I don't think there has been a good solution to this yet.

Debugging might benefit from 'views' into the value: the user can flip between the possible interpretations, trying to match the type they think they are going to see.

This sounds like you're agreeing that having types associated with the values (vec_i8) is something desirable, yes? :)

If we had typed loads, vec.i8.load :: i32 -> vec_i8, then we can deal with the first problem.

Control flow is more manageable; it might require ad-hoc control-flow-stack validation like we already do, but since control flow is structured anyway, I think we can always figure it out. And if we had typed vectors, this also wouldn't be an issue, since validation would ensure that merges always have the same types (sorry, this part is a bit fuzzy; I am not super familiar with the details, but I think that's how it works).

@lemaitre

There is something crucial for flexible vectors that has been overlooked here: masks (or lane predicates).

If you don't have masks, support for AVX512 and SVE will be strongly crippled.
For instance, comparisons return masks and not vectors (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=cmp&techs=AVX_512).
The problem is, AVX512 masks are 1-bit-per-lane masks, whereas on previous architectures (like SSE or AVX), masks are regular vectors where all bits of a lane are the same (either all 0 or all 1).

The only way to have efficient masks both on legacy architectures and on new architectures with native masks is to have separate mask types per lane width.
There cannot be one size fits all here.

But we can still have the same mask type for different data types as long as they have the same lane width.
So I strongly recommend the following scheme:

  • vec.v8 : vector of 8-bit elements
  • vec.m8 : mask for vector of 8-bit elements
  • vec.v16 : vector of 16-bit elements
  • vec.m16 : mask for vector of 16-bit elements
  • vec.v32 : vector of 32-bit elements
  • vec.m32 : mask for vector of 32-bit elements
  • vec.v64 : vector of 64-bit elements
  • vec.m64 : mask for vector of 64-bit elements

vec alone would be an invalid type.

I don't see much benefit in having strongly typed vectors otherwise, but I think it would not hurt either.


Let's give some examples for unsigned integer types (just for the sake of writing C types instead of registers):

type      SSE       AVX       AVX512     Neon         SVE
vec.v8    __m128i   __m256i   __m512i    uint8x16_t   svuint8_t
vec.m8    __m128i   __m256i   __mmask64  uint8x16_t   svbool_t
vec.v16   __m128i   __m256i   __m512i    uint16x8_t   svuint16_t
vec.m16   __m128i   __m256i   __mmask32  uint16x8_t   svbool_t
vec.v32   __m128i   __m256i   __m512i    uint32x4_t   svuint32_t
vec.m32   __m128i   __m256i   __mmask16  uint32x4_t   svbool_t
vec.v64   __m128i   __m256i   __m512i    uint64x2_t   svuint64_t
vec.m64   __m128i   __m256i   __mmask8   uint64x2_t   svbool_t

Please note how in AVX512, the mask type is suffixed with the number of lanes and not the width of the lanes or the width of the vector.
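
To make the table concrete, here is a minimal compare-and-select sketch (plain intrinsics, illustrative only): on SSE2 the mask is an ordinary vector combined with bitwise ops, while on AVX-512 the comparison produces a 1-bit-per-lane __mmask16 consumed by a blend:

#include <immintrin.h>

/* SSE2: lanes of the mask are all-ones or all-zeros. */
__m128i select_gt_sse(__m128i a, __m128i b, __m128i x, __m128i y) {
    __m128i m = _mm_cmpgt_epi32(a, b);            /* vector mask */
    return _mm_or_si128(_mm_and_si128(m, x),      /* x where m is set   */
                        _mm_andnot_si128(m, y));  /* y where m is clear */
}

#ifdef __AVX512F__
/* AVX-512: the mask lives in a separate k-register, one bit per lane. */
__m512i select_gt_avx512(__m512i a, __m512i b, __m512i x, __m512i y) {
    __mmask16 m = _mm512_cmpgt_epi32_mask(a, b);  /* 1 bit per 32-bit lane */
    return _mm512_mask_blend_epi32(m, y, x);      /* x where the bit is set */
}
#endif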


penzn commented May 21, 2020

@ngzhian, thank you for the feedback, I updated the list in the issue description.

This sounds like you're agreeing that having types associated with the values (vec_i8) is something desirable, yes? :)

I think that is desirable. We can start with per-lane-type types and see if that can be efficiently supported. Hardware SIMD registers don't distinguish lane types, but I don't think adding this would cost anything.

Having lane types encoded into the types of local variables would help with debugging - it wouldn't be necessary to track down operations to find out how to display the vector.

@lemaitre we are proposing set_length (lifted from RISC-V). That might not be as expressive as true predicates, but it would still map to AVX and SVE for the cases it can support. See #1, close to the bottom of the new file; the exact signatures of those instructions might change as a result of this issue.

@lemaitre

@penzn Yes, I have seen that you propose set_length, and I know you can emulate it with masks to some extent.
My point is: it is far from enough to efficiently target AVX512/SVE, as they do not have comparisons that output vectors, and legacy ISAs can easily emulate mask instructions as soon as they support masked stores natively (which has been the case for SSE since the beginning, for instance).

set_length imposes a global state that is avoidable and does not give much benefit compared to masks.
That's why I strongly suggest dropping set_length and having masks instead.

As you said in #7, masks can emulate set_length, but not the opposite.
Why not go with the more versatile alternative from the very start, then?


I think the mask question is more related to this issue than to #7, so I will continue here, but we can move back if you prefer.

Flex length might be easier to implement with ISAs that don't support any masking at all.

I think that is actually not true.
Masking is easily implementable with bitwise operations on legacy architectures.
You can even use blendv from SSE4.1, which is a masked select.

Of course, it requires having a "select" for each and every masked instruction, but not all instructions actually need to be masked, even within a branch.
Only the assignments to final registers (the ones that escape the if, i.e., the phi-nodes in SSA) and the final stores need a real "select".
That way, we can drastically limit the number of selects that need to be inserted for legacy architectures.
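
A minimal sketch of this with SSE4.1 intrinsics (illustrative only): the whole branch body runs unmasked, and a single blendv at the merge point plays the role of the phi-node select:

#include <smmintrin.h>

/* out[i] = (a[i] > 0) ? a[i] * 2 + 1 : a[i], with no per-operation masking */
__m128i branchy(__m128i a) {
    __m128i mask   = _mm_cmpgt_epi32(a, _mm_setzero_si128());  /* condition */
    __m128i then_v = _mm_add_epi32(_mm_slli_epi32(a, 1),       /* branch body, */
                                   _mm_set1_epi32(1));         /* unmasked     */
    /* single "select" at the phi node: take then_v where the mask is set */
    return _mm_blendv_epi8(a, then_v, mask);
}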

You could even envision having multiple masking policies, one of them being "undefined" or "don't care".
That would put undefined values in inactive lanes.
A valid implementation of such a "don't care" policy would be to completely ignore the mask and perform the operation on all lanes unconditionally.
This policy can be used within branches and is easily and efficiently implementable on all architectures, even the legacy ones.

Such a policy might not seem useful at first glance, but it could be used to save power on mask-aware architectures, and it can also be used by the Wasm virtual machine to fuse registers of separate branches (with disjoint masks) if the target architecture supports masks natively.

Actually, this very policy already exists in the C bindings for SVE.
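
For reference, a small sketch of the SVE ACLE suffixes being referred to (the function itself is only illustrative): _m merges inactive lanes from the first operand, _z zeroes them, and _x is the "don't care" policy, where inactive lanes are unspecified and the compiler may run the operation unpredicated:

#include <arm_sve.h>

svfloat32_t policies(svbool_t pg, svfloat32_t a, svfloat32_t b) {
    svfloat32_t merged    = svadd_f32_m(pg, a, b);  /* inactive lanes keep a[i]   */
    svfloat32_t zeroed    = svadd_f32_z(pg, a, b);  /* inactive lanes become 0    */
    svfloat32_t dont_care = svadd_f32_x(pg, a, b);  /* inactive lanes unspecified */
    (void)merged; (void)zeroed;
    return dont_care;
}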

@jan-wassenberg

set_length imposes a global state that is avoidable and does not give much benefit compared to masks.
That's why I strongly suggest dropping set_length and having masks instead.

I agree set_length is difficult outside of RISC-V, and that separate mask types are desirable
(FYI discussion at WebAssembly/simd#192).
But are masks the only answer? I've found them to be rather expensive, and not available for all types.

Some code wants to only read/write e.g. 2 int32, for which sub-128-bit (but power of two) vector types can be useful. Their load/store only touch e.g. 8 bytes (e.g. _mm_loadl_epi64). This works even on SVE, which still provides the 128-bit NEON, and RISC-V can mask.

So that's efficient for <128 bits, but what about apps that only want to read/write 123 ints?
Due to the uncertainty of the vector length and cost of masks on AVX2, wouldn't it be simpler to require apps to either pad their data (vastly preferable) or have single-lane loops for any remainders? The latter can typically be implemented with the same source code/template.
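
For illustration, a minimal sketch (SSE2 intrinsics, hypothetical function) of the single-lane-remainder pattern described above: a full-width loop followed by a scalar tail:

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

void add_arrays(int32_t *dst, const int32_t *a, const int32_t *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {                    /* full vectors */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
    }
    for (; i < n; ++i)                              /* single-lane remainder */
        dst[i] = a[i] + b[i];
}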

Of course, it requires having a "select" for each and every masked instruction, but not all instructions actually need to be masked

Thumbs up. Highway takes this approach, and masks are mainly passed to the 'ternary operator' aka IfThenElse(m, yes, no).

@lemaitre

I've found them to be rather expensive, and not available for all types.

Memory accesses aside, masks are available for all types, as a mask can always be implemented as a bit vector of the same size, operated on with bitwise operations.
As for speed, it is a problem on legacy architectures if many instruction results are masked, but in practice that is not often the case (only at the end of branches, the phi nodes in SSA to be more accurate).

In my previous comment, I also described a "don't care" masking policy that would be efficient on all architectures (even legacy ones, as it could just be ignored). It allows the complete branch to be masked while paying for it on legacy architectures only at the end of the branch, where the masking policy will not be "don't care".

Some code wants to only read/write e.g. 2 int32, for which sub-128-bit (but power of two) vector types can be useful.

Those already exist (to some extent): they are called int64, int32 and int16.
It could be nice to have some Sub-Word Parallelism (SWP) instructions, but they don't really fit in here, I think.

Now, I would argue that either you only need 2 int32, then you could do kind-of fine with int64, or you actually want to process pairs of int32.
I have the impression that this 2nd case is much more common than the first one.
In that case, why would you want to restrict yourself to only one pair where you could actually process many of them if you don't limit the vector size?

For this to work, you would need some sub-SIMD operations like independent shuffles on sub-elements of 64-bit lanes, or some kind of load low/high.
You would also probably need some kind of (masked?) gather/scatter.
More info on #7.

wouldn't it be simpler to require apps to either pad their data (vastly preferable)

Yes, but not always possible. So we still need a way to handle remainders when they do exist.

or have single-lane loops for any remainders?

Why would you want to go scalar when you can easily stay in vector with masks for the remainder only, and have a huge gain on small loops?
(single lane vectors, besides their type, are actually scalar)
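
As an illustration of staying in vector form for the remainder, a minimal sketch with AVX-512 intrinsics (hypothetical function, assuming AVX-512F is available): the tail becomes one masked iteration instead of a scalar loop:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __AVX512F__
void add_arrays_masked(int32_t *dst, const int32_t *a, const int32_t *b, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {                   /* full vectors */
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        _mm512_storeu_si512(dst + i, _mm512_add_epi32(va, vb));
    }
    if (i < n) {                                     /* masked remainder, still vector */
        __mmask16 m  = (__mmask16)((1u << (n - i)) - 1);
        __m512i   va = _mm512_maskz_loadu_epi32(m, a + i);
        __m512i   vb = _mm512_maskz_loadu_epi32(m, b + i);
        _mm512_mask_storeu_epi32(dst + i, m, _mm512_add_epi32(va, vb));
    }
}
#endif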

@jan-wassenberg

Memory accesses aside, masks are available for all types, as a mask can always be implemented as a bit vector of the same size, operated on with bitwise operations.

I agree masks are helpful and should be provided as separate types; my only concern is a programming model that claims all hardware can do StoreOnlyN(vector, count, ptr) for all types efficiently.

Those already exist (to some extent): they are called int64, int32 and int16.
Now, I would argue that either you only need 2 int32, then you could do kind-of fine with int64, or you actually want to process pairs of int32.

Yes, that use case processed pairs at a time and I'm not sure it would work with SWAR.

In that case, why would you want to restrict yourself to only one pair where you could actually process many of them if you don't limit the vector size?

I agree length-agnostic code is always better when possible. In this case, each pair depended on the previous pair.

For this to work, you would need some sub-SIMD operations like independent shuffles

Yes, it would be harder if we wanted to do shuffles on those pairs, but often simple independent-lane ops are enough.
When we need shuffles, we could temporarily cast from a sub-128bit type to the full vector and back afterwards.

Why would you want to go scalar when you can easily stay in vector with masks for the remainder only, and have a huge gain on small loops?

I'm worried about "moral hazard" - if StoreOnlyN is provided, it will be used, and probably much more frequently than really required (for remainder handling), which would be counterproductive. The more painful remainder handling is, the likelier that apps would find a way to actually pad :) Do we have some examples where apps really can't on principle, as opposed to implementation convenience for legacy code that didn't foresee this?

(single lane vectors, besides their type, are actually scalar)

An interesting point of definition. Mathematically yes, but doesn't the vector type behave very differently? int overflow is allowed, floats have extra ops such as rsqrt/AND, etc.

@lemaitre

I agree masks are helpful and should be provided as separate types; my only concern is a programming model that claims all hardware can do StoreOnlyN(vector, count, ptr) for all types efficiently.

It depends on your threshold for "efficient".
For me, as soon as it is at least as fast as the alternative, that's enough.
And in this case, the only faster alternative that I can think of is actually padding the data.
This is a good solution, but it is not universal and cannot be done by compilers, as they are not allowed to do it by themselves.
So the problem remains and we still need another solution.

I have no problem recommending that users use a special function to pad their data, but a fallback solution is still required.
And masking looks like a pretty good candidate.

Also, providing an instruction does not mean that this instruction is necessarily fast, just that you cannot do the same operation faster (bugs aside).
For instance, there is a SQRT instruction: you cannot compute a square root faster than this, even though it is slow in practice.
Padding would change the operations performed, so it would not fall into this category.

I agree length-agnostic code is always better when possible. In this case, each pair depended on the previous pair.

It sounds like a scan (or prefix-"sum").
This is an example of an algorithm that seems impossible to do in parallel, and yet it is possible.
If that's the case, the difference here would be that your elements are pairs of int32 instead of scalars.
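
For illustration, a minimal sketch (SSE2 intrinsics) of such an in-register scan over 4 int32 lanes, done in log2(4) = 2 shift-and-add steps:

#include <emmintrin.h>

/* [a, b, c, d] -> [a, a+b, a+b+c, a+b+c+d] */
__m128i prefix_sum_i32x4(__m128i x) {
    x = _mm_add_epi32(x, _mm_slli_si128(x, 4));  /* shift in by one lane, add  */
    x = _mm_add_epi32(x, _mm_slli_si128(x, 8));  /* shift in by two lanes, add */
    return x;
}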

If it's really not that and cannot be adapted in a similar way, then you're probably out of luck, and short (128-bit) SIMD or SWP might be the answer.

But I have the impression that it is very rare in practice.
Also, I don't know how set_length would help you there as you would still need to use the length-agnostic ISA.

I'm worried about "moral hazard" - if StoreOnlyN is provided, it will be used, and probably much more frequently than really required (for remainder handling), which would be counterproductive. The more painful remainder handling is, the likelier that apps would find a way to actually pad :)

I understand your concern. But I think StoreOnlyN or maskedStore can be efficient enough on most architectures that it would not be a problem in practice.
And only code that aims to be super fast (and not merely fast) would need to pad.

There are two axes: making most apps fast, and making it possible for apps to be as fast as possible.
I think padding falls on the second, whereas masking (or StoreOnlyN) would fall on the first.
So my goal is to improve the first axis without impacting the second, whereas it seems you want to move apps from the first to the second.
Also, standard libraries are an easy fit for the second axis, which will improve the first (as they are used by most apps).

Do we have some examples where apps really can't on principle, as opposed to implementation convenience for legacy code that didn't foresee this?

I would say any code working on packed structures in a concurrent context (for the store problem).
The load problem should be much less frequent.

But to be fair, I have no concrete example.

(single lane vectors, besides their type, are actually scalar)

An interesting point of definition. Mathematically yes, but doesn't the vector type behave very differently? int overflow is allowed, floats have extra ops such as rsqrt/AND, etc.

My point here is: if you process elements one by one, you've lost the speed battle, whatever the actual operations you apply to them.

penzn changed the title from "Types" to "Lane types" on May 27, 2020
penzn added a commit to penzn/flexible-vectors that referenced this issue May 27, 2020

penzn commented May 27, 2020

This issue was originally about whether lanes should be typed or not (which I have now reflected in the title, apologies for not doing that earlier). So far the hope is that we can make lanes typed, not just in terms of size but also in terms of what can go in them; that would be more in line with what Wasm does and would be easier to debug.

@lemaitre @jan-wassenberg let's move the discussion on masks to #9 if you don't mind.

@lemaitre

@penzn Sorry for the off-topic.

My initial point was: if you have masks (and I think we have no choice here, even with set_length), then you cannot have a single mask type because of AVX512/SVE, and then having a single vector type would not make sense.

So you would need at least per-lane-width types (v8, v16, v32 and v64).

Now, there is an argument for fully typed vectors that I just remembered: in AVX1, there are no integer operations on 256-bit registers.
With fully typed vectors, you could simplify the code generation for integers on AVX1.
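
For illustration, a minimal sketch of what an AVX1-only lowering of a 256-bit integer add has to do, splitting into two 128-bit halves; with fully typed vectors the code generator would know up front which values need this treatment:

#include <immintrin.h>

/* AVX1 has 256-bit registers but no 256-bit integer ALU (that came with AVX2). */
__m256i add_epi32_avx1(__m256i a, __m256i b) {
    __m128i lo = _mm_add_epi32(_mm256_castsi256_si128(a),
                               _mm256_castsi256_si128(b));
    __m128i hi = _mm_add_epi32(_mm256_extractf128_si256(a, 1),
                               _mm256_extractf128_si256(b, 1));
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}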

@jan-wassenberg

@penzn Thanks for bringing us back to the original topic. It sounded like people were viewing lane-size-and-type (vec.i32) favorably, and I'd also agree type-safety is useful for vectors. For the masks themselves, maybe the size is enough?
It's a bit onerous to also have to cast masks when using both signed/unsigned vectors.
