[Q] Memory Requirements for Different Model Sizes #13

Closed
NightMachinery opened this issue Mar 11, 2023 · 18 comments
Labels: documentation, question

Comments

@NightMachinery

No description provided.

satyajitghana commented Mar 11, 2023

7B (4-bit): 4.14 GB MEM
65B (4-bit): 38 GB MEM

prusnak commented Mar 11, 2023

Since the original models use FP16 and llama.cpp quantizes them to 4-bit, the memory requirements are roughly 4 times smaller than the original:

  • 7B => ~4 GB
  • 13B => ~8 GB
  • 30B => ~16 GB
  • 65B => ~32 GB
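
A back-of-the-envelope check of these numbers (a minimal sketch: it assumes the 4-bit q4_0 format costs roughly 4.5 bits per weight once per-block scales are counted, and it ignores context/runtime overhead):

```python
# Rough memory estimate for 4-bit quantized LLaMA models.
# Assumption: ~4.5 effective bits per weight (4-bit values plus per-block scales);
# context buffers and other runtime overhead are not counted.
def approx_mem_gib(n_params: float, bits_per_weight: float = 4.5) -> float:
    return n_params * bits_per_weight / 8 / 2**30

for billions in (7, 13, 30, 65):
    print(f"{billions}B: ~{approx_mem_gib(billions * 1e9):.1f} GiB")
```

Actual usage runs a bit higher (e.g. the 38.5 GB reported below for 65B) because the context and scratch buffers also need RAM.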

cannin commented Mar 12, 2023

With an M1 Max (64 GB RAM), 4-bit:

65B: 38.5 GB, 850 ms per token
30B: 19.5 GB, 450 ms per token
13B: 7.8 GB, 150 ms per token
7B: 4.0 GB, 75 ms per token

ggerganov added the question and documentation labels on Mar 12, 2023
dbddv01 commented Mar 15, 2023

For the record: on an Intel® Core™ i5-7600K CPU @ 3.80GHz × 4 with 16 GB RAM, under Ubuntu, the 13B model runs with an acceptable response time. Note that, as mentioned in previous comments, the -t 4 parameter gives the best results.
main: mem per token = 22357508 bytes
main: load time = 83076.67 ms
main: sample time = 267.12 ms
main: predict time = 193441.61 ms / 367.76 ms per token
main: total time = 277980.41 ms

Great work!

@ggerganov (Owner)

Should add these to readme

@sinanisler

@prusnak is that PC RAM or GPU VRAM?

prusnak commented Mar 18, 2023

> @prusnak is that PC RAM or GPU VRAM?

llama.cpp runs on the CPU, not the GPU, so it's PC RAM.

@whitepapercg

> > @prusnak is that PC RAM or GPU VRAM?
>
> llama.cpp runs on the CPU, not the GPU, so it's PC RAM.

Is it possible that at some point we will get a video card version?

prusnak commented Mar 18, 2023

> Is it possible that at some point we will get a video card version?

I don't think so. You can run the original Whisper model on a GPU: https://github.com/openai/whisper

mrpher commented Mar 18, 2023

FWIW, running on my M2 MacBook Air with 8 GB of RAM comes to a grinding halt. On first run, about 2-3 minutes of a completely unresponsive machine (mouse and keyboard locked), then about 10-20 seconds per response word. I didn't expect great response times, but that's a bit slower than anticipated.

  • Edit: using the 7B model

prusnak commented Mar 18, 2023

> M2 MacBook Air with 8 GB

Close every other app, ideally reboot to a clean state. This should help. If you see an unresponsive machine, then it is swapping memory to disk. 8 GB is not that much, especially if you have browsers, Slack, etc. running.

j-f1 commented Mar 18, 2023

Also make sure you’re using 4 threads instead of 8 — you don’t want to be using any of the 4 efficiency cores.
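
If you're unsure how many performance cores your machine has, you can query it and feed that to -t. A minimal sketch (assuming Apple Silicon on macOS 12+, where the hw.perflevel0.physicalcpu sysctl key reports the performance-core count; the model path and prompt shown are just placeholders):

```python
import subprocess

# Number of performance cores on Apple Silicon (assumes macOS 12+).
n_perf = int(subprocess.run(
    ["sysctl", "-n", "hw.perflevel0.physicalcpu"],
    capture_output=True, text=True, check=True,
).stdout.strip())

# Pass this to llama.cpp's -t flag; on an M1/M2 MacBook Air this prints -t 4.
print(f'./main -m ./models/7B/ggml-model-q4_0.bin -t {n_perf} -p "Hello"')
```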

mrpher commented Mar 18, 2023

Working well now; good recommendations @prusnak @j-f1, thank you!

prusnak commented Mar 18, 2023

Requirements added in #269

prusnak closed this as completed on Mar 18, 2023
@SpeedyCraftah

> Since the original models use FP16 and llama.cpp quantizes them to 4-bit, the memory requirements are roughly 4 times smaller than the original:
>
> • 7B => ~4 GB
> • 13B => ~8 GB
> • 30B => ~16 GB
> • 65B => ~32 GB

32 GB is probably a little too optimistic. I have 32 GB of DDR4 clocked at 3600 MHz and it generates a token every 2 minutes.

prusnak commented Mar 21, 2023

> 32 GB is probably a little too optimistic

Yeah, 38.5 GB is more realistic.

See https://github.com/ggerganov/llama.cpp#memorydisk-requirements for current values

@SpeedyCraftah

> > 32 GB is probably a little too optimistic
>
> Yeah, 38.5 GB is more realistic.
>
> See https://github.com/ggerganov/llama.cpp#memorydisk-requirements for current values

I see. That makes more sense, since you mention the whole model is loaded into memory as of now. Linux would probably run better in this case, given its better swap handling and lower memory usage.

Thanks!

@Yitzhokchaim

What languages does it work with? Does it support the same input and output languages as GPT?

abetlen pushed a commit to abetlen/llama.cpp that referenced this issue Apr 10, 2023
SlyEcho pushed a commit to SlyEcho/llama.cpp that referenced this issue Jun 2, 2023