
Add tensor core abstractions #1346

Open
j-stephan opened this issue Jun 23, 2021 · 4 comments

Comments

@j-stephan
Member

In the meeting on 25 May 2021 we discussed having an alpaka abstraction for the various tensor core APIs found in recent versions of CUDA and ROCm. Opening this issue for broader discussion (and so we don't forget this wish).

@bernhardmgruber
Member

A quick Google search turned up this: https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/

So it looks like programmatic access to tensor cores is given via special API calls, and they can essentially only do FMA on 4x4 matrices in single and half precision. That sounds very limited to me, but hey, that's what special-purpose hardware is all about! I see potential use for linear algebra. 4x4 matrices are also heavily used in 3D graphics and computational geometry. Still, although these fields were the prime target of GPUs, the need for tensor cores only appeared much later, with deep learning.

I have not found the corresponding APIs in HIP, nor in OpenCL or SYCL, so I don't know how AMD exposes them. For CPU targets I guess you would have to model these 4x4 matrix FMAs with just normal floats. There is also a new BF16 float type, but that is super new: https://stackoverflow.com/a/49997863/2406044

I think access to tensor cores and reduced-precision FP operations are too vendor-specific for the moment to design a meaningful API. But please prove me wrong! :)
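For illustration, the CPU fallback mentioned above could be sketched with plain floats as below. The `Mat4` alias and `mma4x4` helper are hypothetical names for this sketch, not part of alpaka or any vendor API:

```cpp
#include <array>
#include <cstddef>

// Hypothetical 4x4 single-precision matrix type.
using Mat4 = std::array<std::array<float, 4>, 4>;

// Emulates the tensor-core primitive D = A * B + C with scalar floats:
// a plain triple loop, starting from the accumulator C.
Mat4 mma4x4(const Mat4& a, const Mat4& b, const Mat4& c) {
    Mat4 d = c;
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t k = 0; k < 4; ++k)
            for (std::size_t j = 0; j < 4; ++j)
                d[i][j] += a[i][k] * b[k][j];
    return d;
}
```

A CPU backend could lower a portable "matrix FMA" primitive to something like this, while GPU backends map it to the native intrinsics.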

@j-stephan
Member Author

AMD calls them Matrix Cores, and at least one GPU (the MI100) already has them. I haven't found the accompanying API in HIP yet, though.

@bernhardmgruber
Member

So, I found out today that the public-facing API for accessing tensor cores from CUDA is via cutlass. Specifically mma.h, which essentially sets up some blocks of floats and calls a PTX mnemonic.

@fwyzard
Contributor

fwyzard commented Feb 14, 2024

Isn't it also documented in the CUDA Programming Guide under 7.24. Warp Matrix Functions?
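For reference, a minimal sketch of that `nvcuda::wmma` API is below. It assumes the 16x16x16 half-precision tile shape the Programming Guide exposes (the 4x4 shape mentioned earlier in this thread is the per-cycle hardware granularity, not the API granularity); launch configuration and host-side setup are omitted:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16x16 matrix multiply-accumulate on tensor cores:
// C (float) += A (half) * B (half). Pointers must address 16x16 tiles.
__global__ void wmma_16x16x16(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, 16);           // load A tile, leading dim 16
    wmma::load_matrix_sync(b_frag, b, 16);           // load B tile
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C = A * B + C
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

An alpaka abstraction would presumably need to wrap this fragment/load/mma/store lifecycle, since the fragment layouts are opaque and warp-distributed.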
