arch-vega: Template MFMA instructions #1128
Conversation
Force-pushed from 9a67157 to c0f8c67
Added new unit test for the new data types.
Force-pushed from 612ffe8 to e487f9f
This is ready from my side.
@mjkpolo feel free to add comments as appropriate.
Thanks for the thorough review! Fixed most of your suggestions or explained why I didn't.
bias = 15,
inf = 0x7c000000,
nan = 0x7c001000,
I fixed this btw. Last 4 hex digits should be 0 for this type. Changed to 0x7c10'0000
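The layout constraint discussed here, an FP16-style bit pattern held in the upper half of a 32-bit word so the low 16 bits must be zero, can be checked at compile time. A minimal sketch with hypothetical constant names mirroring the quoted snippet:

```cpp
#include <cstdint>

// Hypothetical constants mirroring the quoted snippet: the FP16 bit
// pattern lives in the upper 16 bits of a 32-bit word, so the low 16
// bits (the last 4 hex digits) must be zero.
constexpr std::uint32_t kInf = 0x7c00'0000;
constexpr std::uint32_t kNan = 0x7c10'0000;  // corrected from 0x7c001000

static_assert((kInf & 0xffffu) == 0, "low half of inf must be zero");
static_assert((kNan & 0xffffu) == 0, "low half of nan must be zero");

// Upper halves match the IEEE-754 binary16 patterns: exponent all ones
// with a zero mantissa is inf; a nonzero mantissa is NaN.
static_assert((kInf >> 16) == 0x7c00u, "inf pattern");
static_assert(((kNan >> 16) & 0x03ffu) != 0, "NaN needs nonzero mantissa");
```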
Force-pushed from e487f9f to 588a3d9
Just the one comment seems to be missing, or am I missing it?
Hmm, I must have messed up the rebase. I won't be able to update until Monday, unfortunately.
templated:
- v_mfma_f64_16x16x4f64

added support for:
- v_mfma_f32_32x32x2f32
- v_mfma_f32_4x4x1_16b_f32
- v_mfma_f32_16x16x4f32

[formula for gprs needed](https://github.com/ROCm/amd_matrix_instruction_calculator)
[formulas for register layouts and lanes used in computation](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf)

Change-Id: I15d6c0a5865d58323ae8dbcb3f6dcb701a9ab3c7
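The MFMA names encode the tile shape MxNxK, so v_mfma_f32_16x16x4f32 computes D = A*B + C on 16x16 tiles with K = 4. A scalar reference sketch of that shape (hypothetical helper, not the gem5 implementation; the real instruction spreads operands across VGPRs and lanes per the ISA layout tables linked above):

```cpp
#include <cassert>

// Scalar reference for D = A*B + C with M = N = 16 and K = 4, matching
// the v_mfma_f32_16x16x4f32 tile shape. Hypothetical reference model only.
constexpr int M = 16, N = 16, K = 4;

void mfmaRef(const float (&a)[M][K], const float (&b)[K][N],
             const float (&c)[M][N], float (&d)[M][N])
{
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = c[m][n];          // accumulate on top of C
            for (int k = 0; k < K; ++k)
                acc += a[m][k] * b[k][n]; // K-deep dot product
            d[m][n] = acc;
        }
    }
}
```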
The Open Compute Project (OCP) microscaling formats (MX) are used in the GPU model. The specification is available at [1]. This implements a C++ version of the MXFP formats with many constraints that conform to the specification. Arithmetic is not performed directly on the MXFP types; they are instead converted to fp32 and the computation is performed there. For most of these types this is acceptable for the GPU model, as there are no instructions which directly perform arithmetic on them. For example, the DOT/MFMA instructions may first convert to FP32 and then perform the arithmetic.

Change-Id: I7235722627f7f66c291792b5dbf9e3ea2f67883e
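The convert-then-compute pattern the commit describes can be sketched with bf16, whose fp32 conversion is simply the upper 16 bits of the float. The helper names below are hypothetical, not the gem5 API:

```cpp
#include <cstdint>
#include <cstring>

// Minimal bf16 container; conversions are hypothetical helpers.
struct Bf16 { std::uint16_t bits; };

float toFloat(Bf16 v)
{
    std::uint32_t w = static_cast<std::uint32_t>(v.bits) << 16;
    float f;
    std::memcpy(&f, &w, sizeof(f));
    return f;
}

Bf16 fromFloat(float f)  // truncating conversion, no rounding
{
    std::uint32_t w;
    std::memcpy(&w, &f, sizeof(w));
    return Bf16{static_cast<std::uint16_t>(w >> 16)};
}

// Arithmetic is never performed on the bf16 bits directly: convert both
// operands to fp32, multiply-accumulate there, convert the result back.
Bf16 bf16Fma(Bf16 a, Bf16 b, Bf16 c)
{
    return fromFloat(toFloat(a) * toFloat(b) + toFloat(c));
}
```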
This instruction serves as a test for the MXFP8 type.

Change-Id: I2ce30bf7f3a3ecc850a445aebdf971c37c39a79e
This class can be used to load multiple operand dwords into an array and then select bits from the span of that array. It handles cases where the bits span two dwords (e.g., you have four dwords for a 128-bit value and want to select bits 35:30) and cases where multiple values smaller than 32 bits are packed into a single dword (e.g., two bf16 values). This is most useful for packed arrays and instructions whose operands are wider than two dwords. Beyond two dwords, the operator[] overload of VectorOperand is not available, which requires additional logic to select from an operand. This helper class handles that additional logic itself.

Change-Id: I74856d0f312f7549b3b6c405ab71eb2b174c70ac
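The bit selection the helper performs can be sketched as follows; selectBits is a hypothetical stand-in for the class, shown for ranges that fit in at most two dwords:

```cpp
#include <cstdint>

// Select bits hi:lo from an array of dwords. Handles the case where the
// range straddles a dword boundary (e.g., bits 35:30 of a 128-bit value)
// by widening to 64 bits before shifting and masking.
std::uint32_t selectBits(const std::uint32_t *dwords, int hi, int lo)
{
    std::uint64_t wide = dwords[lo / 32];
    if (hi / 32 != lo / 32)  // range spans two dwords
        wide |= static_cast<std::uint64_t>(dwords[hi / 32]) << 32;
    int width = hi - lo + 1;
    return static_cast<std::uint32_t>((wide >> (lo % 32)) &
                                      ((1ull << width) - 1));
}
```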
The microscaling formats (MXFP) and INT8 types require additional size checks which are not needed for the current MFMA template. The size check is done using a constexpr method exclusive to the MXFP type, therefore create a special class for MXFP types. This is preferable to attempting to shoehorn them into the existing template, as it helps with readability. Similarly, INT8 requires a size check to determine the number of elements per VGPR, but it is not an MXFP type. Create a special template for that as well. This additionally implements all of the MFMA types which have test cases in the amd-lab-notes repository (https://github.com/amd/amd-lab-notes/). The implementations were tested using the applications in the matrix-cores subfolder and achieve L2 norms equivalent to or better than MI200 hardware.

Change-Id: Ia5ae89387149928905e7bcd25302ed3d1df6af38
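The dispatch described here, detecting the MXFP-exclusive constexpr size method and taking a separate path for it, can be sketched with the standard detection idiom. Type and function names are hypothetical stand-ins, not the gem5 templates:

```cpp
#include <type_traits>

// Hypothetical MXFP-style type exposing a constexpr size method.
struct MxFp8 { static constexpr int sizeBits() { return 8; } };

// Detection idiom: true only for types that provide T::sizeBits().
template <typename T, typename = void>
struct IsMxfp : std::false_type {};

template <typename T>
struct IsMxfp<T, std::void_t<decltype(T::sizeBits())>> : std::true_type {};

// Generic path assumes one element per 32-bit VGPR; the MXFP path packs
// by element width instead of forcing the type into the generic template.
template <typename T>
constexpr int elemsPerVgpr()
{
    if constexpr (IsMxfp<T>::value)
        return 32 / T::sizeBits();
    else
        return 1;
}

static_assert(elemsPerVgpr<MxFp8>() == 4);
static_assert(elemsPerVgpr<float>() == 1);
```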
This adds the decodings for all of the matrix fused multiply add (MFMA) and sparse matrix fused multiply accumulate (SMFMAC) instructions up to and including MI300. This does not yet provide the implementation for these instructions; however, it is easier and less tedious to add them in bulk rather than one at a time.

Change-Id: I5acd23ca8a26bdec843bead545d1f8820ad95b41
This instruction is new in MI300 and is used in some of the example applications used to test MFMAs.

Change-Id: I739f8ab2be6a93ee3b6bdc4120d0117724edb0d4
These instructions are used in some of the F16 MFMA example applications to convert to/from floating point types.

Change-Id: I7426ea663ce11a39fe8c60c8006d8cca11cfaf07
Add a unit test for the MXFP types (bf16, fp16, fp8, bf8). These types are not currently operated on directly. Instead they are cast to float values and then arithmetic is performed. As a result, the unit test simply checks that when we convert a value from an MXFP type to float and back, the values of the MXFP type match. Exact values are used to avoid discrepancies with rounding. Can be run using scons build/VEGA_X86/unittests.opt.

Change-Id: I596e9368eb929d239dd2d917e3abd7927b15b71e
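The round-trip property the unit test checks can be sketched for bf16 with hypothetical conversion helpers; the real test covers bf16, fp16, fp8, and bf8:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical bf16 conversion helpers: bf16 is the upper 16 bits of fp32.
std::uint16_t bf16FromFloat(float f)
{
    std::uint32_t w;
    std::memcpy(&w, &f, sizeof(w));
    return static_cast<std::uint16_t>(w >> 16);  // truncate, no rounding
}

float bf16ToFloat(std::uint16_t bits)
{
    std::uint32_t w = static_cast<std::uint32_t>(bits) << 16;
    float f;
    std::memcpy(&f, &w, sizeof(f));
    return f;
}

void roundTripCheck()
{
    // Exact values only, to avoid discrepancies from rounding.
    const float exact[] = {0.0f, 1.0f, -2.5f, 0.125f, 256.0f};
    for (float f : exact) {
        std::uint16_t bits = bf16FromFloat(f);
        // Converting to float and back must reproduce the same bits.
        assert(bf16FromFloat(bf16ToFloat(bits)) == bits);
        assert(bf16ToFloat(bits) == f);  // these values are exact in bf16
    }
}
```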
Implement a bfloat16 MFMA. This was tested with PyTorch using dtype=torch.bfloat16.

Change-Id: I35b4e60e71477553a93020ef0ee31d1bcae9ca5d
This implements some missing loads and stores that are commonly used in applications with MFMA instructions to load 16-bit data types into specific register locations: DS_READ_U16_D16, DS_READ_U16_D16_HI, BUFFER_LOAD_SHORT_D16, BUFFER_LOAD_SHORT_D16_HI.

Change-Id: Ie22d81ef010328f4541553a9a674764dc16a9f4d
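The D16 variants write only one half of the destination VGPR and preserve the other half. A sketch of that update semantics, with hypothetical helper names:

```cpp
#include <cstdint>

// D16 write: the loaded 16 bits replace the low half of the VGPR and the
// high half is preserved (e.g. DS_READ_U16_D16, BUFFER_LOAD_SHORT_D16).
std::uint32_t writeD16Lo(std::uint32_t vgpr, std::uint16_t data)
{
    return (vgpr & 0xffff0000u) | data;
}

// D16_HI write: the loaded 16 bits replace the high half and the low half
// is preserved (e.g. DS_READ_U16_D16_HI, BUFFER_LOAD_SHORT_D16_HI).
std::uint32_t writeD16Hi(std::uint32_t vgpr, std::uint16_t data)
{
    return (vgpr & 0x0000ffffu) |
           (static_cast<std::uint32_t>(data) << 16);
}
```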
Force-pushed from 588a3d9 to b91c9be
@abmerop, could you let us know if this is complete and ready for a re-review? Thanks.
This is complete. I think @mattsinc just wanted one additional comment, which I have added.