Hardware Requirement for Running Llama-2 inferences #61

Open
shang-zhu opened this issue Jan 19, 2024 · 2 comments

Comments

@shang-zhu

Hi, I successfully ran inference with Llama-2-7b and Unlimiformer, but ran into memory errors when I jumped to larger models. What are the minimum GPU memory requirements for running the 13b and 70b models? Thank you!

@abertsch72
Owner

Thanks for your interest in our work!

The memory required depends on two things:

  1. The base memory needed for that model (as you'd expect!). I haven't personally tried the 70b model, but this NVIDIA guide gives numbers that look pretty reasonable to me:

     The file size of the model varies depending on how large the model is.
     Llama2-7B-Chat requires about 30GB of storage.
     Llama2-13B-Chat requires about 50GB of storage.
     Llama2-70B-Chat requires about 150GB of storage.

  2. The number of layers you apply Unlimiformer at. The good news here is that the additional cost from Unlimiformer doesn't depend on the model size (since we're only saving hidden states, and the models all have the same hidden dimension). You can calculate this for your input/use case by looking at the difference in GPU memory used between your 7b Llama+Unlimiformer setup and the base 7b model (see the rough sketch after this list).
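
If it helps, here's a rough, untested sketch of how I'd measure that difference; `base_7b_generate` and `unlimiformer_7b_generate` are placeholders for however you already run generation on the same long input, so adapt them to your setup:

```python
# Sketch: measure the extra GPU memory Unlimiformer adds on top of the base model.
# `base_7b_generate` and `unlimiformer_7b_generate` are hypothetical placeholders
# for your own generation calls on the same input.
import torch

def peak_gpu_gb(run_fn):
    """Run `run_fn` once and return the peak GPU memory it used, in GB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    run_fn()
    return torch.cuda.max_memory_allocated() / 1e9

overhead_gb = peak_gpu_gb(unlimiformer_7b_generate) - peak_gpu_gb(base_7b_generate)
print(f"Unlimiformer overhead on this input: {overhead_gb:.1f} GB")
```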

As a general recipe, I'd guess (amount of memory for the model) + (2-3 GB per layer you'd like to apply Unlimiformer at) will get you pretty close to the amount needed, but this depends on how long your inputs are and whether you choose flat or trained indices.
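
As a concrete back-of-the-envelope example (treat these numbers as rough assumptions, not measurements):

```python
# Rough estimate only: uses the storage figures quoted above as a stand-in for base
# model memory, plus ~2-3 GB of Unlimiformer overhead per layer. The real cost
# depends on input length and on whether you use flat or trained indices.
def estimate_total_gb(base_model_gb, num_unlimiformer_layers, per_layer_gb=2.5):
    return base_model_gb + num_unlimiformer_layers * per_layer_gb

# e.g. Llama2-13B-Chat (~50 GB above) with Unlimiformer applied at 4 layers:
print(estimate_total_gb(50, 4))  # ~60 GB
```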

@zhangwenhao666

I would like to ask about an input of roughly 100,000 tokens: using the llama2-13b model, how long would inference take on an H100 GPU?
