
JohannesGaessler (Collaborator) commented on May 14, 2023:

For my GPU acceleration PR #1412 I used a template to decouple the code for matrix-vector multiplication from the code for dequantization. This PR applies the same principle to the dequantization done during prompt processing: the same per-format dequantization functions can be reused by a different templated kernel. This deduplicates the CUDA code, which helps keep the two code paths consistent. As a side effect, the new kernels are also slightly faster on my hardware.
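
To illustrate the pattern: both kernels are templated on a per-format dequantization callback, so each quantization format only has to define that callback once. Below is a minimal sketch using a made-up block format (`block_demo`); the names and layout are illustrative only, not the actual llama.cpp kernels.

```cpp
#include <cstdint>
#include <cuda_runtime.h>

#define QK_DEMO 32  // values per quantization block (toy format)

// Toy block format: one fp32 scale and 32 unsigned 8-bit values.
// NOT a real llama.cpp format; it only illustrates the template pattern.
struct block_demo {
    float   d;            // scale
    uint8_t qs[QK_DEMO];  // quantized values
};

// Dequantization callback type: given the raw block data, a block index ib
// and an (even) index iqs within the block, produce two consecutive floats.
typedef void (*dequantize_t)(const void * vx, int ib, int iqs, float & v0, float & v1);

static __device__ void dequantize_demo(const void * vx, int ib, int iqs, float & v0, float & v1) {
    const block_demo * x = (const block_demo *) vx;
    v0 = (x[ib].qs[iqs + 0] - 128) * x[ib].d;
    v1 = (x[ib].qs[iqs + 1] - 128) * x[ib].d;
}

// Kernel 1: plain dequantization into a float buffer (prompt processing path).
// Assumes k is even.
template <int qk, dequantize_t dequantize>
static __global__ void dequantize_block(const void * vx, float * y, int k) {
    const int i = 2*(blockIdx.x*blockDim.x + threadIdx.x);
    if (i >= k) return;
    dequantize(vx, i/qk, i % qk, y[i], y[i + 1]);
}

// Kernel 2: fused dequantize + matrix-vector multiply (generation path).
// One warp per matrix row; the SAME dequantize callback is reused, so each
// quantization format only provides its callback once. Assumes ncols is a
// multiple of qk and blockDim.x == 32.
template <int qk, dequantize_t dequantize>
static __global__ void dequantize_mul_mat_vec(const void * vx, const float * y, float * dst, int ncols) {
    const int row = blockIdx.x;
    float tmp = 0.0f;
    for (int i = 2*threadIdx.x; i < ncols; i += 2*blockDim.x) {
        const int idx = row*ncols + i;
        float v0, v1;
        dequantize(vx, idx/qk, idx % qk, v0, v1);
        tmp += v0*y[i] + v1*y[i + 1];
    }
    // warp-level reduction of the partial dot products
    for (int mask = 16; mask > 0; mask >>= 1) {
        tmp += __shfl_xor_sync(0xffffffff, tmp, mask);
    }
    if (threadIdx.x == 0) dst[row] = tmp;
}

// Host-side launches, instantiating both kernels from the one callback:
//   dequantize_block<QK_DEMO, dequantize_demo><<<(k/2 + 255)/256, 256>>>(vx, y, k);
//   dequantize_mul_mat_vec<QK_DEMO, dequantize_demo><<<nrows, 32>>>(vx, y, dst, ncols);
```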

Performance numbers for perplexity calculation on the first 100 lines of wikitext:

| GPU | Model | ms/token (master) | ms/token (PR) |
| --- | --- | --- | --- |
| RTX 3090 | 7b q4_0 | 3.61 | 3.60 |
| RTX 3090 | 7b q4_1 | 3.82 | 3.74 |
| RTX 3090 | 7b q5_0 | 3.75 | 3.64 |
| RTX 3090 | 7b q5_1 | 3.75 | 3.67 |
| RTX 3090 | 7b q8_0 | 4.29 | 4.05 |
| RTX 3090 | 7b f16 | 4.91 | 4.86 |
| GTX 1070 | 7b q4_0 | 9.78 | 7.39 |
| GTX 1070 | 7b q4_1 | 9.86 | 7.67 |
| GTX 1070 | 7b q5_0 | 10.01 | 7.63 |
| GTX 1070 | 7b q5_1 | 10.12 | 7.79 |
| GTX 1070 | 7b q8_0 | 11.88 | 8.28 |
| GTX 1070 | 7b f16 | 10.62 | 10.69 |

Edit: I said I would add GTX 1070 numbers once I had them; they are now included in the table above.

The goal of this PR is not to optimize performance but to simplify the codebase and make further development easier. As long as the new kernels don't cause a performance regression, I consider that good enough.

JohannesGaessler added the refactoring label on May 14, 2023.
ggerganov (Member) commented:

I don't see any performance degradation on an RTX 4080.

ggerganov requested a review from slaren on May 14, 2023 at 18:17.