Hardware requirements for Llama 2 #425

Closed
g1sbi opened this issue Jul 19, 2023 · 22 comments

@g1sbi

g1sbi commented Jul 19, 2023

Similar to #79, but for Llama 2. Post your hardware setup and what model you managed to run on it.

@hans-ekbrand

hans-ekbrand commented Jul 20, 2023

Using https://github.com/ggerganov/llama.cpp (without BLAS) for inference and quantization, I ran an INT4 version of 7B on CPU; it required 3.6 GB of RAM.

13B, quantised to 3 bits per parameter (https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/blob/main/llama-2-13b-chat.ggmlv3.q3_K_S.bin) on a Pentium(R) Dual-Core CPU E5400 @ 2.70GHz (bogomips=5400.11, address sizes: 36 bits physical, 48 bits virtual):
RAM allocation: 5.6 GB, Tokens per second: 0.14

13B, quantised to 4 bits per parameter (https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/blob/main/llama-2-13b-chat.ggmlv3.q4_K_S.bin):
RAM allocation: 7.1 GB, Tokens per second: 0.09

On a newer computer (Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz), 13B quantised to INT8 (https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/blob/main/llama-2-13b-chat.ggmlv3.q8_0.bin):
RAM allocation: 13 GB, Tokens per second: 1.68
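
For anyone wanting to reproduce these CPU-only numbers, here is a minimal sketch of the kind of invocation involved, assuming a plain CPU build of llama.cpp and one of TheBloke's pre-quantized GGML files linked above (the exact flags are only illustrative):

# CPU-only build of llama.cpp (no BLAS)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && make

# Download a pre-quantized 13B GGML file from TheBloke (q3_K_S in this example)
wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q3_K_S.bin

# Run in instruct mode; -t should match the number of physical cores
./main -m llama-2-13b-chat.ggmlv3.q3_K_S.bin -t 2 --instruct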

@bobbyt2012

I ran an unmodified llama-2-7b-chat.

2x E5-2690v2
576GB DDR3 ECC
RTX A4000 16GB

Loaded in 15.68 seconds, used about 15GB of VRAM and 14GB of system memory (above the idle usage of 7.3GB)
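
(For context, "unmodified" here presumably means the stock example script from this repo; a sketch of that invocation for the 7B chat weights, with paths as in the repo README:)

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6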

@ssutharzan

> I ran an unmodified llama-2-7b-chat.
>
> 2x E5-2690v2 576GB DDR3 ECC RTX A4000 16GB
>
> Loaded in 15.68 seconds, used about 15GB of VRAM and 14GB of system memory (above the idle usage of 7.3GB)

How about the heat generation during continuous usage?

@bobbyt2012

> How about the heat generation during continuous usage?

I have it in a rack in my basement, so I don't really notice much. I've used this server for much heavier workloads and it's not bad. The GPU is only 140W at full load. This is only really using 1-2 CPU cores. This server puts out a lot more heat with high CPU loads.

@ssutharzan

> How about the heat generation during continuous usage?
>
> I have it in a rack in my basement, so I don't really notice much. I've used this server for much heavier workloads and it's not bad. The GPU is only 140W at full load. This is only really using 1-2 CPU cores. This server puts out a lot more heat with high CPU loads.

Thanks for the info!

@dmavroeidis

dmavroeidis commented Jul 22, 2023

> I ran an unmodified llama-2-7b-chat.
>
> 2x E5-2690v2 576GB DDR3 ECC RTX A4000 16GB
>
> Loaded in 15.68 seconds, used about 15GB of VRAM and 14GB of system memory (above the idle usage of 7.3GB)

Thanks for the useful information!
Can you also provide info for inference? For instance, tokens/second. Thanks!

@longyee

longyee commented Jul 23, 2023

(Last update: 2023-08-12, added NVIDIA GeForce RTX 3060 Ti)

Using llama.cpp, llama-2-13b-chat.ggmlv3.q4_0.bin, llama-2-13b-chat.ggmlv3.q8_0.bin and llama-2-70b-chat.ggmlv3.q4_0.bin from TheBloke.

MacBook Pro (6-Core Intel Core i7 @ 2.60GHz, 16 GB RAM)

- llama-2-13b-chat.ggmlv3.q4_0.bin (CPU only): 3.32 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (CPU only): can run, but extremely slow, unusable

Gaming Laptop (12-Core Intel Core i5 @ 2.70GHz, 16GB RAM, GeForce RTX 3050 mobile 4GB)

- llama-2-13b-chat.ggmlv3.q4_0.bin (CPU only): 2.12 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (CPU only): 2.10 tokens per second
- llama-2-13b-chat.ggmlv3.q4_0.bin (offloaded  8/43 layers to GPU): 5.51 tokens per second
- llama-2-13b-chat.ggmlv3.q4_0.bin (offloaded 16/43 layers to GPU): 6.68 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (offloaded  8/43 layers to GPU): 3.10 tokens per second

Cloud Server (4-Core Intel Xeon Skylake @ 2.40GHz, 12GB RAM, NVIDIA GeForce RTX 3060 Ti 8GB)

- llama-2-13b-chat.ggmlv3.q4_0.bin (offloaded 38/43 layers to GPU): 11.06 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (offloaded 21/43 layers to GPU):  2.56 tokens per second

Cloud Server with 2x GPUs (8-Core Intel Xeon Skylake @ 2.40GHz, 24GB RAM, 2x NVIDIA GeForce RTX 3080 10GB)

- llama-2-13b-chat.ggmlv3.q4_0.bin (offloaded 43/43 layers to GPU): 33.27 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (offloaded 43/43 layers to GPU): 28.27 tokens per second

Cloud Server (4-Core Intel Xeon Skylake @ 2.40GHz, 24GB RAM, NVIDIA GeForce RTX 3090 24GB)

- llama-2-13b-chat.ggmlv3.q4_0.bin (CPU only): 1.52 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (CPU only): 1.18 tokens per second
- llama-2-13b-chat.ggmlv3.q4_0.bin (offloaded 43/43 layers to GPU): 62.81 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (offloaded 43/43 layers to GPU): 36.39 tokens per second

Cloud Server (24-Core Intel Xeon CPU E5-2650 v4 @ 2.20GHz, 96GB RAM, NVIDIA GeForce A40 48GB)

- llama-2-13b-chat.ggmlv3.q4_0.bin (CPU only): 3.81 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (CPU only): 2.24 tokens per second
- llama-2-70b-chat.ggmlv3.q4_0.bin (CPU only): 0.74 tokens per second
- llama-2-13b-chat.ggmlv3.q4_0.bin (offloaded 43/43 layers to GPU): 22.46 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (offloaded 43/43 layers to GPU): 16.49 tokens per second
- llama-2-70b-chat.ggmlv3.q4_0.bin (offloaded 83/83 layers to GPU):  7.06 tokens per second

Cloud Server (4-Core Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz, 16GB RAM, NVIDIA GeForce RTX A4000 16GB)

- llama-2-13b-chat.ggmlv3.q4_0.bin (CPU only): 2.20 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (CPU only): 1.42 tokens per second
- llama-2-13b-chat.ggmlv3.q4_0.bin (offloaded 43/43 layers to GPU): 41.58 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (offloaded 40/43 layers to GPU):  9.87 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (offloaded 43/43 layers to GPU): CUDA error, out of memory

Cloud Server (8-Core AMD Ryzen Threadripper 3960X @ 2.20GHz, 32GB RAM, NVIDIA GeForce RTX A6000 48GB)

- llama-2-13b-chat.ggmlv3.q4_0.bin (CPU only): 2.96 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (CPU only): 1.85 tokens per second
- llama-2-70b-chat.ggmlv3.q4_0.bin (CPU only): 0.62 tokens per second
- llama-2-13b-chat.ggmlv3.q4_0.bin (offloaded 43/43 layers to GPU): 27.63 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (offloaded 43/43 layers to GPU): 19.94 tokens per second
- llama-2-70b-chat.ggmlv3.q4_0.bin (offloaded 83/83 layers to GPU):  9.37 tokens per second

Google Colab (2-Core Intel Xeon CPU @ 2.20GHz, 13GB RAM, NVIDIA Tesla T4 16GB)

How to run llama.cpp in Google Colab?

!git clone https://github.com/ggerganov/llama.cpp.git
!(cd llama.cpp; make) # make LLAMA_CUBLAS=1 if GPU
!wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin
!llama.cpp/main  ...

- llama-2-13b-chat.ggmlv3.q4_0.bin (CPU only): 0.13 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (CPU only): 0.02 tokens per second
- llama-2-13b-chat.ggmlv3.q4_0.bin (offloaded 43/43 layers to GPU): 20.75 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (offloaded 40/43 layers to GPU):  3.97 tokens per second
- llama-2-13b-chat.ggmlv3.q8_0.bin (offloaded 43/43 layers to GPU): CUDA error, out of memory

(Note: the cloud servers are sometimes not reliable; their results depend on other users on the same host.)
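
(For readers unfamiliar with the "offloaded N/43 layers" notation: in llama.cpp this is controlled by the -ngl / --n-gpu-layers flag on a cuBLAS build. A minimal sketch, with illustrative flag values rather than the exact commands used for the numbers above:)

# CUDA-enabled build, then offload 16 of the 43 layers of the 13B model to the GPU
make clean && make LLAMA_CUBLAS=1
./main -m llama-2-13b-chat.ggmlv3.q4_0.bin -ngl 16 -p "Hello"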

@zacps

zacps commented Jul 24, 2023

Using llama.cpp, llama-2-70b-chat converted to fp16 (no quantisation) works with 4x A100 40GB (all layers offloaded) and fails with three or fewer.

Best result so far is just over 8 tokens/s.

@dmavroeidis

Thanks a lot @longyee!

@iakashpaul

Llama2 7B-Chat on an RTX 2070S with bitsandbytes FP4, Ryzen 5 3600, 32GB RAM.

Completely loaded into VRAM (~6300MB); took ~12 seconds to process ~2200 tokens and generate a summary (~30 tokens/sec).

Also ran the same on an A10 (24GB VRAM) LambdaLabs VM with similar results.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = 'meta-llama/Llama-2-7b-chat-hf'

if torch.cuda.is_available():
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # 4-bit (bitsandbytes FP4) quantized load, placed on the GPU via accelerate
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map='auto', load_in_4bit=True
    )
    # Illustrative generation call (a TextIteratorStreamer can be passed to stream tokens)
    inputs = tokenizer("Summarize the following text: ...", return_tensors='pt').to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

@AmoghM

AmoghM commented Jul 27, 2023

Ran llama2-7b-chat, quantized to 4 bits, on CPU via llama.cpp on a MacBook M1 Pro with 32 GB RAM.

load time = 186.99 ms
sample time = 430.34 ms / 512 runs ( 0.84 ms per token, 1189.75 tokens per second)
prompt eval time = 6034.92 ms / 271 tokens (22.27 ms per token, 44.91 tokens per second)
eval time = 67462.46 ms / 510 runs (132.28 ms per token, 7.56 tokens per second)
total time = 73999.12 ms

@bobbyt2012

Ran llama2-70b-chat with llama.cpp, using TheBloke's ggmlv3 model quantized to 6 bits, on CPU (dual Xeon E5-2690v2). Consumed roughly 55GB of RAM.

load time: 5s
sample time: 1100 tokens per second
prompt eval: 0.52 tokens per second
eval time: 0.45 tokens per second

@iakashpaul

iakashpaul commented Jul 28, 2023

M1 MacBook Pro (16GB RAM, 10 cores) with Llama2, using the Replicate + Gerganov bash script.

~2GB RAM used, 17 tokens per second, 8 threads

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build it
LLAMA_METAL=1 make

# Download model
export MODEL=llama-2-7b-chat.ggmlv3.q4_0.bin
if [ ! -f models/${MODEL} ]; then
    curl -L "https://huggingface.co/TheBloke/Llama-2-7B-chat-GGML/resolve/main/${MODEL}" -o models/${MODEL}
fi

# Set prompt
PROMPT="Hello! How are you?"

# Run
./main -m ./models/llama-2-7b-chat.ggmlv3.q4_0.bin \
  --color \
  --ctx_size 2048 \
  -n -1 \
  -ins -b 256 \
  --top_k 10000 \
  --temp 0.2 \
  --repeat_penalty 1.1 \
  -t 8

@hans-ekbrand

Using https://github.com/ggerganov/llama.cpp (with cuBLAS) for inference and quantization, I ran an INT4 version of 13B on 4 x GTX 1060 at 7.5 tokens per second. To make it work, I had to load fewer layers onto the main device with the -ts argument, see below.

./main -ngl 43 -ts 7,12,12,12 -m llama-2-13b-chat.ggmlv3.q4_1.bin --instruct
main: build = 926 (8a88e58)
main: seed  = 1690800072
ggml_init_cublas: found 4 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1060 3GB, compute capability 6.1
  Device 1: NVIDIA GeForce GTX 1060 3GB, compute capability 6.1
  Device 2: NVIDIA GeForce GTX 1060 3GB, compute capability 6.1
  Device 3: NVIDIA GeForce GTX 1060 3GB, compute capability 6.1
llama.cpp: loading model from [...] llama-2-13b-chat.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5,0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000,0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0,11 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce GTX 1060 3GB) as main device
llama_model_load_internal: mem required  =  463,77 MB (+  400,00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 360 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 8422 MB
llama_new_context_with_model: kv self size  =  400,00 MB

system_info: n_threads = 2 / 2 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
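
(A follow-up note on the flags above, in case it helps others with mismatched GPUs: -ts / --tensor-split takes relative proportions per device, and -mg / --main-gpu selects the device that also holds the scratch buffers, which is presumably why device 0 above gets a smaller share. An illustrative variation, with split values that would need tuning per setup:)

./main -ngl 43 -mg 0 -ts 5,13,13,12 -m llama-2-13b-chat.ggmlv3.q4_1.bin --instruct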

@niryuu

niryuu commented Aug 2, 2023

Gaming Laptop (8-Core Ryzen 9 7940HS @ 5.20GHz, 32GB RAM, GeForce RTX 4080 mobile 12GB)

Turbo:

- llama-2-13b-chat.ggmlv3.q4_0.bin (offloaded 43/43 layers to GPU): 37.39 tokens per second

Power Saving:

- llama-2-13b-chat.ggmlv3.q4_0.bin (offloaded 43/43 layers to GPU): 29.61 tokens per second

@kaderghal

I ran: TheBloke_Llama-2-7b-chat-fp16.

CPU: Core™ i9-13900K
GPU: RTX 4080 (16 GB VRAM)
RAM: DDR5 32GB

Loaded in 12.68 seconds, used about 14GB of VRAM.

@Brenden2008

> I ran: TheBloke_Llama-2-7b-chat-fp16.
>
> CPU: Core™ i9-13900K GPU: RTX 4080 (16 GB VRAM) RAM: DDR5 32GB
>
> Loaded in 12.68 seconds, used about 14GB of VRAM.

How many tokens per second did you get?

@vantuyen-dev

vantuyen-dev commented Aug 9, 2023

I would like to ask whether the hardware setups above, which handle a single Q&A session, can also meet the needs of multiple concurrent chat sessions, or whether a load balancer and a queue need to be added for concurrent processing.
I would also appreciate guidance on how to scale the configuration for practical applications.

@FaraneJalaliFarahani

> Using llama.cpp, llama-2-70b-chat converted to fp16 (no quantisation) works with 4x A100 40GB (all layers offloaded) and fails with three or fewer.
>
> Best result so far is just over 8 tokens/s.

Can you please explain it in more detail, like how you offloaded all layers?

@calypso

calypso commented Aug 11, 2023

> I ran an unmodified llama-2-7b-chat.
>
> 2x E5-2690v2 576GB DDR3 ECC RTX A4000 16GB
>
> Loaded in 15.68 seconds, used about 15GB of VRAM and 14GB of system memory (above the idle usage of 7.3GB)

Can this be scaled across multiple cards with something like k8s to abstract multiple GPUs?

@IvanSivak

Just FYI for anybody looking at the non-quantized default llama-2-70b-chat model.

During inference on 8xA100 40GB SXM:
(torchrun --nproc_per_node 8 example_text_completion.py --ckpt_dir llama-2-70b-chat/ --tokenizer_path tokenizer.model --max_seq_len 384 --max_batch_size 8)

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:06:00.0 Off |                  Off |
| N/A   33C    P0   130W / 400W |  19657MiB / 40960MiB |     67%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                  Off |
| N/A   32C    P0   119W / 400W |  19801MiB / 40960MiB |     94%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:08:00.0 Off |                  Off |
| N/A   32C    P0   134W / 400W |  19801MiB / 40960MiB |     52%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:09:00.0 Off |                  Off |
| N/A   32C    P0   142W / 400W |  19801MiB / 40960MiB |     56%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:0A:00.0 Off |                  Off |
| N/A   33C    P0   139W / 400W |  19801MiB / 40960MiB |     65%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:0B:00.0 Off |                  Off |
| N/A   33C    P0   136W / 400W |  19801MiB / 40960MiB |     93%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:0C:00.0 Off |                  Off |
| N/A   33C    P0   135W / 400W |  19801MiB / 40960MiB |     78%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:0D:00.0 Off |                  Off |
| N/A   33C    P0   135W / 400W |  19657MiB / 40960MiB |     85%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     76635      C   /usr/bin/python3                19654MiB |
|    1   N/A  N/A     76636      C   /usr/bin/python3                19798MiB |
|    2   N/A  N/A     76637      C   /usr/bin/python3                19798MiB |
|    3   N/A  N/A     76638      C   /usr/bin/python3                19798MiB |
|    4   N/A  N/A     76640      C   /usr/bin/python3                19798MiB |
|    5   N/A  N/A     76642      C   /usr/bin/python3                19798MiB |
|    6   N/A  N/A     76644      C   /usr/bin/python3                19798MiB |
|    7   N/A  N/A     76645      C   /usr/bin/python3                19654MiB |
+-----------------------------------------------------------------------------+

@namaggarwal namaggarwal added the invalid Not an issue label Sep 6, 2023
@namaggarwal namaggarwal self-assigned this Sep 6, 2023
@slrealtech

slrealtech commented Jul 7, 2024

Can I run this model on my PC with 4GB of RAM and an i5 3rd gen CPU? 😅😥
