CUDA prompt processing performance is gimped by ~5% on Ampere or newer with GGML_NATIVE=OFF

Currently, with `GGML_NATIVE=OFF`, we compile the CUDA code for architectures up to Turing by default. However, in `mma.cuh` there is code like this:

```CUDA
#if __CUDA_ARCH__ >= GGML_CUDA_CC_AMPERE
        asm("mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32 {%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%0, %1, %2, %3};"
            : "+r"(x[0]), "+r"(x[1]), "+r"(x[2]), "+r"(x[3])
            : "r"(mma_A.x[0]), "r"(mma_A.x[1]), "r"(mma_A.x[2]), "r"(mma_A.x[3]), "r"(mma_B.x[0]), "r"(mma_B.x[1]));
#else
        // On Turing m16n8k32 mma is not available, use 4x m8n8k16 mma instead:
        asm("mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 {%0, %1}, {%2}, {%3}, {%0, %1};"
            : "+r"(x[0]), "+r"(x[1])
            : "r"(mma_A.x[0]), "r"(mma_B.x[0]));
        asm("mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 {%0, %1}, {%2}, {%3}, {%0, %1};"
            : "+r"(x[2]), "+r"(x[3])
            : "r"(mma_A.x[1]), "r"(mma_B.x[0]));
        asm("mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 {%0, %1}, {%2}, {%3}, {%0, %1};"
            : "+r"(x[0]), "+r"(x[1])
            : "r"(mma_A.x[2]), "r"(mma_B.x[1]));
        asm("mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 {%0, %1}, {%2}, {%3}, {%0, %1};"
            : "+r"(x[2]), "+r"(x[3])
            : "r"(mma_A.x[3]), "r"(mma_B.x[1]));
#endif // __CUDA_ARCH__ >= GGML_CUDA_CC_AMPERE
```

As a consequence, the code uses suboptimal PTX instructions on Ampere or newer. This reduces prompt processing performance by ~5% with LLaMA 3 8b q4_0 on an RTX 3090. The issue could be fixed by adding Ampere to the list of compute capabilities, but this would also increase compilation time and binary size. I think CMake lets you set compute capabilities per `.cu` file, but since MMQ compilation is the heaviest part we would likely not save much that way.
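To check which branch of such an `#if` actually ends up running on a given GPU, a small standalone test along the following lines may help. This is only a sketch: `CC_AMPERE` is a hypothetical stand-in for `GGML_CUDA_CC_AMPERE` (assumed here to be 800), and the kernel and file names are made up for illustration. It relies on the fact that `__CUDA_ARCH__` reflects the architecture the device code was compiled for, not the GPU it runs on.

```CUDA
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for GGML_CUDA_CC_AMPERE; the value 800 is an assumption.
#define CC_AMPERE 800

// Writes the architecture the device code was compiled for and which branch was selected.
__global__ void report_arch(int * compiled_arch, int * uses_ampere_path) {
#ifdef __CUDA_ARCH__ // only defined during the device compilation pass
    *compiled_arch = __CUDA_ARCH__;
#if __CUDA_ARCH__ >= CC_AMPERE
    *uses_ampere_path = 1; // the m16n8k32 path would be compiled here
#else
    *uses_ampere_path = 0; // the 4x m8n8k16 fallback would be compiled here
#endif
#endif
}

int main() {
    int * d_out = nullptr;
    if (cudaMalloc(&d_out, 2*sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    report_arch<<<1, 1>>>(d_out, d_out + 1);

    int h_out[2] = {0, 0};
    // cudaMemcpy waits for the kernel to finish and surfaces launch errors.
    const cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("device compute capability: %d.%d\n", prop.major, prop.minor);
    printf("compiled __CUDA_ARCH__:    %d\n", h_out[0]);
    printf("Ampere mma path selected:  %s\n", h_out[1] ? "yes" : "no");

    cudaFree(d_out);
    return 0;
}
```

Built with something like `nvcc -gencode arch=compute_75,code=compute_75 arch_check.cu`, the binary carries only PTX for compute capability 7.5. On an Ampere GPU the driver JIT-compiles that PTX, so the kernel should still report 750 and the fallback branch, which is the situation described above.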
