NDArrayMatrixMultiplyA16 does not contain SIMD async copy instructions, although the kernel for A14 does. Starting with AGX3 (A15), there are some new instructions used for GEMM and Conv; their declarations are listed below. I haven't checked whether they're accessible from __asm (SIMD futures are not).
    ; Function Attrs: nounwind memory(write)
    declare void @llvm.agx3.store.with.emask.global.i16.v2i16(ptr addrspace(1), <2 x i16>, i16, i16, i16) #7

    ; Function Attrs: nounwind memory(write)
    declare void @llvm.agx3.store.with.emask.global.i32.v2i32(ptr addrspace(1), <2 x i32>, i16, i16, i16) #7

    ; Function Attrs: nounwind speculatable memory(none)
    declare i16 @llvm.agx3.edgecheck(i32, i32, i32) #8

    ; Function Attrs: nounwind memory(read)
    declare <2 x i16> @llvm.agx3.load.with.emask.global.v2i16.i16(ptr addrspace(1), i16, i16, i16) #10

    ; Function Attrs: nounwind memory(read)
    declare <2 x i32> @llvm.agx3.load.with.emask.global.v2i32.i32(ptr addrspace(1), i16, i16, i16) #10

    ; Function Attrs: nounwind memory(read)
    declare <1 x i16> @llvm.agx3.load.with.emask.global.v1i16.i16(ptr addrspace(1), i16, i16, i16) #10

    ; Function Attrs: nounwind memory(read)
    declare <1 x i32> @llvm.agx3.load.with.emask.global.v1i32.i32(ptr addrspace(1), i16, i16, i16) #10

    ; Function Attrs: nounwind memory(read)
    declare <4 x i16> @llvm.agx3.load.with.emask.global.v4i16.i16(ptr addrspace(1), i16, i16, i16) #10

    ; Function Attrs: nounwind memory(read)
    declare <4 x i32> @llvm.agx3.load.with.emask.global.v4i32.i32(ptr addrspace(1), i16, i16, i16) #10
Furthermore, unlike A14/M1, at least A16 can access 65536 bytes of registers from a single SIMD-group. That is more than physically possible.
    %31 = alloca [16 x [16 x %"struct.metal::simdgroup_matrix"]], align 256
    call void @llvm.lifetime.end.p0(i64 65536, ptr nonnull %292) #14
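For anyone checking where the 65536 figure comes from, here is a quick arithmetic sketch in Python. The array dimensions and the total size are taken from the IR above. The 8x8 matrix shape and 4-byte element size are my assumptions (the IR does not show the layout of `struct.metal::simdgroup_matrix`); they are only there to show that 256 bytes per matrix would be consistent with that layout.

```python
# Array dimensions from the alloca type: [16 x [16 x simdgroup_matrix]].
ROWS, COLS = 16, 16
# Total size from the llvm.lifetime.end call (i64 65536).
TOTAL_BYTES = 65536

matrices = ROWS * COLS                      # number of simdgroup matrices
bytes_per_matrix = TOTAL_BYTES // matrices  # size implied for each matrix

# ASSUMPTION (not visible in the IR): each matrix is 8x8 with 4-byte elements.
ASSUMED_DIM = 8
ASSUMED_ELEMENT_BYTES = 4
assumed_matrix_bytes = ASSUMED_DIM * ASSUMED_DIM * ASSUMED_ELEMENT_BYTES

print(f"{matrices} matrices, {bytes_per_matrix} bytes each")
print(f"matches assumed 8x8 x 4-byte layout: {assumed_matrix_bytes == bytes_per_matrix}")
```

So the allocation is 256 matrices at 256 bytes apiece, which is the 65536 bytes discussed above.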
Luckily, SIMD futures run correctly and performantly on A15/A16. I do worry that this MPS kernel is referencing Apple's unreleased A16 ray tracing GPU (or the in-development M3), which might remove support for SIMD futures.
Source: https://gist.github.com/philipturner/939d4ffda26e66f10a142c82d8d498e9
Results (A15)
Results (A16)