
CUDA: tighter VRAM scratch size for 65b/70b #2551

Merged
JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:cuda-tighter-65b-70b-scratch on Aug 8, 2023


Conversation

@JohannesGaessler
Contributor

This PR is a follow-up to #2056 . At the time I did not have enough VRAM to properly measure the minimum required VRAM scratch sizes for 65b (and 70b was not yet published). This PR tightens the VRAM scratch sizes based on testing. The specific methodology is that I hard-coded VRAM scratch sizes with a granularity of 1 MiB and determined the minimum VRAM scratch size at which perplexity calculations are not affected. I then added a ~25% margin on top of that minimum. These are the test results that the new numbers are based on:

| Model | Context size | Min. VRAM scratch size [MiB] | Delta [MiB] |
|---|---|---|---|
| 65b q2_k | 512 | 272 | - |
| 65b q2_k | 1024 | 342 | 70 |
| 65b q2_k | 1536 | 486 | 144 |
| 65b q2_k | 2048 | 636 | 150 |
| 65b q2_k | 2560 | 700 | 64 |
| 65b q2_k | 3072 | 764 | 64 |
| 65b q2_k | 3584 | 828 | 64 |
| 65b q2_k | 4096 | 892 | 64 |

| Model | Context size | Min. VRAM scratch size [MiB] | Delta [MiB] |
|---|---|---|---|
| 70b q4_k_m | 512 | 324 | - |
| 70b q4_k_m | 1024 | 336 | 12 |
| 70b q4_k_m | 1536 | 402 | 66 |
| 70b q4_k_m | 2048 | 576 | 174 |
| 70b q4_k_m | 2560 | 724 | 148 |
| 70b q4_k_m | 3072 | 788 | 64 |
| 70b q4_k_m | 3584 | 852 | 64 |
| 70b q4_k_m | 4096 | 916 | 64 |

When I tested with smaller models the min. required VRAM scratch size at some point always started increasing linearly with context size (measured up to 8192 context). Because the results are so close I just used the same scratch size for both 65b and 70b.
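The numbers above can be summarized as a simple estimate: a measured base minimum plus a linear tail of ~64 MiB per 512 tokens of context, with the ~25% margin added on top. The following is a minimal sketch of that arithmetic, not the actual llama.cpp code; the function name, the choice of the 70b q4_k_m figures as the base, and the assumption that the linear tail continues past 4096 context are all mine.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t MiB = 1024 * 1024;

// Hypothetical estimate of the VRAM scratch size, based on the measured
// minima in the tables above (70b q4_k_m: 916 MiB at 4096 context, with an
// observed slope of ~64 MiB per 512 tokens in the linear tail).
int64_t scratch_size_estimate(int n_ctx) {
    const int64_t base_mib    = 916; // measured minimum at n_ctx = 4096
    const int64_t per_512_mib = 64;  // observed linear slope
    // Assumption: the linear tail continues beyond the measured range.
    int64_t min_mib = base_mib + (int64_t)(n_ctx - 4096) / 512 * per_512_mib;
    // Add the ~25% safety margin, rounding up to 1 MiB granularity.
    return (min_mib * 5 + 3) / 4 * MiB;
}
```

For example, at 4096 context this yields 916 MiB × 1.25 = 1145 MiB, which is the kind of "minimum plus margin" figure the PR hard-codes.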

@JohannesGaessler JohannesGaessler merged commit acfc547 into ggml-org:master Aug 8, 2023