# wgpu-mm

How many FLOPs can we squeeze out of wgpu? The test harness is inspired by Bram Wasti's work here.

## GEMM

The M1's 8-core GPU can supposedly hit 2.6 TFLOPS of FP32.

A custom Metal shader from Tinygrad can hit 2000 GFLOPS, or ~75% of theoretical peak. This shader uses SIMD groups, which WebGPU doesn't support yet, though they've been proposed a few times, e.g. here.

The best shader we have is an altered version of the TensorFlow.js one, which reaches ~900 GFLOPS on my M1.
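For context on the throughput figures above, here is a minimal sketch of the arithmetic: a GEMM on an [m, k] x [k, n] problem performs 2·m·n·k FLOPs (one multiply and one add per inner-product term), and FLOPs per nanosecond equals GFLOP/s. The function name and the example runtime are illustrative, not from this repo.

```rust
// Achieved throughput for an [m, k] x [k, n] matmul.
// A GEMM does 2*m*n*k FLOPs (one multiply + one add per inner-product term).
// FLOPs / ns == GFLOP/s, so no unit conversion is needed.
fn gemm_gflops(m: u64, k: u64, n: u64, runtime_ns: f64) -> f64 {
    (2 * m * n * k) as f64 / runtime_ns
}

fn main() {
    // A hypothetical 1024^3 matmul finishing in ~1.07 ms lands at ~2000 GFLOP/s,
    // i.e. the ballpark of the Tinygrad shader (~75% of the M1's 2.6 TFLOPS peak).
    println!("{:.0} GFLOP/s", gemm_gflops(1024, 1024, 1024, 1_073_742.0));
}
```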

## GEMV

GEMV is a different problem, since it is entirely memory-bound.

We estimate bandwidth as `BW (GB/s) = 10^-9 * (m*n + m + n) * sizeof(scalar type) / T`, where T is the runtime in seconds.

For the problem size [1,384] @ [384, 51868] (Whisper logits GEMV), we can calculate the minimum possible runtime to be 1,198,266.33 ns. The best kernel in here, gemv_2, hits ~1,300,000 ns.

Since the operation is memory-bound, lower precision is extremely important: our HGEMV performs the same [1,384] @ [384, 51868] in ~694,500 ns, roughly 2x faster.
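The lower bounds above can be sketched as follows. Every element of the matrix, input vector, and output vector must cross the memory bus at least once, and since 1 GB/s equals 1 byte/ns, dividing bytes moved by bandwidth in GB/s yields nanoseconds directly. The ~68.25 GB/s figure is Apple's quoted bandwidth for the base M1, an assumption here; the README's own numbers imply a slightly lower effective bandwidth.

```rust
// Memory-bound lower bound for y = x * A, with x: [1, k] and A: [k, n].
// Bytes moved: the k*n matrix, the k-element input, the n-element output.
// 1 GB/s = 1 byte/ns, so bytes / (GB/s) gives nanoseconds directly.
fn gemv_min_runtime_ns(k: u64, n: u64, bytes_per_scalar: u64, bandwidth_gb_s: f64) -> f64 {
    let bytes_moved = (k * n + k + n) * bytes_per_scalar;
    bytes_moved as f64 / bandwidth_gb_s
}

fn main() {
    let (k, n) = (384u64, 51868u64); // Whisper logits GEMV
    // ~68.25 GB/s: Apple's quoted bandwidth for the base M1 (an assumption).
    println!("fp32 lower bound: {:.0} ns", gemv_min_runtime_ns(k, n, 4, 68.25));
    println!("fp16 lower bound: {:.0} ns", gemv_min_runtime_ns(k, n, 2, 68.25));
}
```

Halving the scalar width halves the bytes moved, which is why the fp16 bound is exactly half the fp32 one and why HGEMV lands near 2x faster in practice.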

## Read More

- NVIDIA Performance Guide

## TODO

- [ ] 128-bit loads for half & quarter precision tiled matmul
- [ ] Flash Attention
