What happened?
When trying to run FatLlama-1.7T-Instruct, llama.cpp fails while loading the model with the error n > N_MAX: 525 > 512 for key llama.feed_forward_length. This happens because the model has 525 layers, but LLAMA_MAX_LAYERS is hardcoded to 512.
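For context, the metadata loader stores per-layer hyperparameters in fixed-size arrays and rejects any model whose layer count exceeds the compile-time cap. A minimal sketch of that kind of check (names and structure are illustrative, not the exact llama.cpp code):

```cpp
// Illustrative sketch (simplified): per-layer hyperparameters live in
// fixed-size arrays of LLAMA_MAX_LAYERS entries, and the loader refuses
// to fill more slots than the array can hold.
#include <array>
#include <cstdint>
#include <stdexcept>
#include <string>

constexpr size_t LLAMA_MAX_LAYERS = 512; // compile-time cap

std::array<uint32_t, LLAMA_MAX_LAYERS> n_ff_arr; // one value per layer

void set_per_layer(const std::string & key, uint32_t value, uint32_t n_layer) {
    if (n_layer > n_ff_arr.size()) { // n_ff_arr.size() is N_MAX in the error
        throw std::runtime_error("n > N_MAX: " + std::to_string(n_layer) +
            " > " + std::to_string(n_ff_arr.size()) + " for key " + key);
    }
    for (uint32_t i = 0; i < n_layer; ++i) {
        n_ff_arr[i] = value; // a scalar key is broadcast to all layers
    }
}
```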
Running FatLlama 1.7T therefore requires bumping LLAMA_MAX_LAYERS to 525 or larger and recompiling llama.cpp. While this is relatively easy when using llama.cpp directly, as soon as third-party software built on backend-specific pre-built llama-cpp-python bindings is involved (like oobabooga/text-generation-webui), changing LLAMA_MAX_LAYERS becomes infeasible for the general user.
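For reference, the manual workaround is a one-line edit followed by a rebuild. A sketch, assuming the constant is still a plain #define in src/llama.cpp as in recent builds:

```cpp
// src/llama.cpp (exact location may differ between versions)
// was: #define LLAMA_MAX_LAYERS  512
#define LLAMA_MAX_LAYERS  1024  // raised so 525-layer models can load
```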
To fix this issue, LLAMA_MAX_LAYERS could either be increased or made dynamic.
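The dynamic variant could, for example, size the per-layer arrays from llama.block_count at load time instead of a compile-time constant. A minimal sketch of that idea (hypothetical struct, not the project's actual plan):

```cpp
// Hypothetical sketch: per-layer hyperparameters sized from the model's
// block count at load time instead of a compile-time LLAMA_MAX_LAYERS.
#include <cstdint>
#include <vector>

struct hparams_dynamic {
    uint32_t n_layer = 0;

    // one entry per layer; sized once llama.block_count is known
    std::vector<uint32_t> n_head_arr;
    std::vector<uint32_t> n_head_kv_arr;
    std::vector<uint32_t> n_ff_arr;

    void init(uint32_t block_count) {
        n_layer = block_count;
        n_head_arr.assign(n_layer, 0);
        n_head_kv_arr.assign(n_layer, 0);
        n_ff_arr.assign(n_layer, 0);
    }
};
```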
Name and Version
version: 3927 (10433e8)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
root@AI:/bpool# ./llama.cpp/llama-cli -m FATLLAMA-1.7T-Instruct.SOURCE.gguf -p "I believe the meaning of life is" -n 128
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 3927 (10433e8b) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 24 key-value pairs and 4729 tensors from FATLLAMA-1.7T-Instruct.SOURCE.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Merged
llama_model_loader: - kv 3: general.size_label str = 1.7T
llama_model_loader: - kv 4: llama.block_count u32 = 525
llama_model_loader: - kv 5: llama.context_length u32 = 131072
llama_model_loader: - kv 6: llama.embedding_length u32 = 16384
llama_model_loader: - kv 7: llama.feed_forward_length u32 = 53248
llama_model_loader: - kv 8: llama.attention.head_count u32 = 128
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 16
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 1
llama_model_loader: - kv 13: llama.vocab_size u32 = 128256
llama_model_loader: - kv 14: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", ...
llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 22: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 1052 tensors
llama_model_loader: - type f16: 3677 tensors
llama_model_load: error loading model: error loading model hyperparameters: n > N_MAX: 525 > 512 for key llama.feed_forward_length
llama_load_model_from_file: failed to load model
common_init_from_params: failed to load model 'FATLLAMA-1.7T-Instruct.SOURCE.gguf'
main: error: unable to load model