Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIMD futures on A16/M3 GPU #44

Open
philipturner opened this issue Jun 9, 2023 · 0 comments
Open

SIMD futures on A16/M3 GPU #44

philipturner opened this issue Jun 9, 2023 · 0 comments

Comments

@philipturner
Copy link

NDArrayMatrixMultiplyA16 does not contain simd async copy instructions, although the kernel for A14 does. Starting with AGX3 (A15), there are some new instructions used for GEMM and Conv. I haven't checked whether they're accessible from __asm (SIMD futures are not).

; Function Attrs: nounwind memory(write)
declare void @llvm.agx3.store.with.emask.global.i16.v2i16(ptr addrspace(1), <2 x i16>, i16, i16, i16) #7

; Function Attrs: nounwind memory(write)
declare void @llvm.agx3.store.with.emask.global.i32.v2i32(ptr addrspace(1), <2 x i32>, i16, i16, i16) #7

; Function Attrs: nounwind speculatable memory(none)
declare i16 @llvm.agx3.edgecheck(i32, i32, i32) #8

; Function Attrs: nounwind memory(read)
declare <2 x i16> @llvm.agx3.load.with.emask.global.v2i16.i16(ptr addrspace(1), i16, i16, i16) #10

; Function Attrs: nounwind memory(read)
declare <2 x i32> @llvm.agx3.load.with.emask.global.v2i32.i32(ptr addrspace(1), i16, i16, i16) #10

; Function Attrs: nounwind memory(read)
declare <1 x i16> @llvm.agx3.load.with.emask.global.v1i16.i16(ptr addrspace(1), i16, i16, i16) #10

; Function Attrs: nounwind memory(read)
declare <1 x i32> @llvm.agx3.load.with.emask.global.v1i32.i32(ptr addrspace(1), i16, i16, i16) #10

; Function Attrs: nounwind memory(read)
declare <4 x i16> @llvm.agx3.load.with.emask.global.v4i16.i16(ptr addrspace(1), i16, i16, i16) #10

; Function Attrs: nounwind memory(read)
declare <4 x i32> @llvm.agx3.load.with.emask.global.v4i32.i32(ptr addrspace(1), i16, i16, i16) #10

Furthermore, unlike A14/M1, at least A16 can access 65536 bytes of registers from a single SIMD-group. That is more than physically possible.

  %31 = alloca [16 x [16 x %"struct.metal::simdgroup_matrix"]], align 256
  call void @llvm.lifetime.end.p0(i64 65536, ptr nonnull %292) #14

Luckily, SIMD futures run correctly and performantly on A15/A16. I do worry that this MPS kernel is referencing their unreleased A16 ray tracing GPU (or the in-development M3), which might remove support for SIMD futures.

Source: https://gist.github.com/philipturner/939d4ffda26e66f10a142c82d8d498e9

Results (A15)

GEMM dimensions: 256x256x256
2023-06-09 12:16:16.984966-0400 SIMDFuturesA15[32193:1233147] Metal GPU Frame Capture Enabled
2023-06-09 12:16:16.985621-0400 SIMDFuturesA15[32193:1233147] Metal API Validation Enabled

Metal FlashAttention: 'f16'
GFLOPS: 269
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 48
MFA vs MPS Euclidean Distance: 30.969585

GEMM dimensions: 512x512x512

Metal FlashAttention: 'f16'
GFLOPS: 465
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 249
MFA vs MPS Euclidean Distance: 209.93332

GEMM dimensions: 768x768x768

Metal FlashAttention: 'f16'
GFLOPS: 948
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 749
MFA vs MPS Euclidean Distance: 894.5818

GEMM dimensions: 1024x1024x1024

Metal FlashAttention: 'f32'
GFLOPS: 1215
Metal Performance Shaders: 'f32'
GFLOPS: 1353
MFA vs MPS Euclidean Distance: 0.0

Metal FlashAttention: 'f16'
GFLOPS: 1184
Metal Performance Shaders: 'f16'
GFLOPS: 1265
MFA vs MPS Euclidean Distance: 2008.9558

Metal FlashAttention: 'f16'
GFLOPS: 1262
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 1194
MFA vs MPS Euclidean Distance: 2009.3207

GEMM dimensions: 1280x1280x1280

Metal FlashAttention: 'f16'
GFLOPS: 1618
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 1972
MFA vs MPS Euclidean Distance: 5536.6294

GEMM dimensions: 1536x1536x1536

Metal FlashAttention: 'f16'
GFLOPS: 1611
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 2177
MFA vs MPS Euclidean Distance: 10459.943

GEMM dimensions: 1792x1792x1792

Metal FlashAttention: 'f16'
GFLOPS: 1622
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 2355
MFA vs MPS Euclidean Distance: 16553.246

GEMM dimensions: 2048x2048x2048

Metal FlashAttention: 'f32'
GFLOPS: 1397
Metal Performance Shaders: 'f32'
GFLOPS: 1326
MFA vs MPS Euclidean Distance: 0.0

Metal FlashAttention: 'f16'
GFLOPS: 1624
Metal Performance Shaders: 'f16'
GFLOPS: 1303
MFA vs MPS Euclidean Distance: 24126.244

Metal FlashAttention: 'f16'
GFLOPS: 1624
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 2405
MFA vs MPS Euclidean Distance: 24127.777

GEMM dimensions: 4096x4096x4096

Metal FlashAttention: 'f16'
GFLOPS: 1594
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 2941
MFA vs MPS Euclidean Distance: 305298.94
Results (A16)
GEMM dimensions: 256x256x256
2023-06-09 08:53:08.328381-0700 TestUI[48119:10608193] Metal GPU Frame Capture Enabled
2023-06-09 08:53:08.328473-0700 TestUI[48119:10608193] Metal API Validation Enabled
2023-06-09 08:53:09.303047-0700 TestUI[48119:10608193] fopen failed for data file: errno = 2 (No such file or directory)
2023-06-09 08:53:09.303155-0700 TestUI[48119:10608193] Errors found! Invalidating cache...
2023-06-09 08:53:09.346694-0700 TestUI[48119:10608193] fopen failed for data file: errno = 2 (No such file or directory)
2023-06-09 08:53:09.346763-0700 TestUI[48119:10608193] Errors found! Invalidating cache...

Metal FlashAttention: 'f16'
GFLOPS: 632
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 68
MFA vs MPS Euclidean Distance: 30.969585

GEMM dimensions: 512x512x512

Metal FlashAttention: 'f16'
GFLOPS: 378
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 387
MFA vs MPS Euclidean Distance: 209.93332

GEMM dimensions: 768x768x768

Metal FlashAttention: 'f16'
GFLOPS: 957
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 1087
MFA vs MPS Euclidean Distance: 894.5818

GEMM dimensions: 1024x1024x1024

Metal FlashAttention: 'f32'
GFLOPS: 1163
Metal Performance Shaders: 'f32'
GFLOPS: 1070
MFA vs MPS Euclidean Distance: 0.0

Metal FlashAttention: 'f16'
GFLOPS: 1130
Metal Performance Shaders: 'f16'
GFLOPS: 1273
MFA vs MPS Euclidean Distance: 2008.9558

Metal FlashAttention: 'f16'
GFLOPS: 1401
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 1949
MFA vs MPS Euclidean Distance: 2009.3207

GEMM dimensions: 1280x1280x1280

Metal FlashAttention: 'f16'
GFLOPS: 1606
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 2799
MFA vs MPS Euclidean Distance: 5536.6294

GEMM dimensions: 1536x1536x1536

Metal FlashAttention: 'f16'
GFLOPS: 1610
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 3306
MFA vs MPS Euclidean Distance: 10459.943

GEMM dimensions: 1792x1792x1792

Metal FlashAttention: 'f16'
GFLOPS: 1622
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 3159
MFA vs MPS Euclidean Distance: 16553.246

GEMM dimensions: 2048x2048x2048

Metal FlashAttention: 'f32'
GFLOPS: 1474
Metal Performance Shaders: 'f32'
GFLOPS: 1379
MFA vs MPS Euclidean Distance: 0.0

Metal FlashAttention: 'f16'
GFLOPS: 1623
Metal Performance Shaders: 'f16'
GFLOPS: 1305
MFA vs MPS Euclidean Distance: 24126.244

Metal FlashAttention: 'f16'
GFLOPS: 1622
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 3484
MFA vs MPS Euclidean Distance: 24127.777

GEMM dimensions: 4096x4096x4096
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant