Bug: LLAMA_MAX_LAYERS must be increased to run FatLlama 1.7T #9909

@nicoboss

What happened?

When trying to run FatLlama-1.7T-Instruct, llama.cpp crashes while loading the model with the error: n > N_MAX: 525 > 512 for key llama.feed_forward_length. This is because the model has 525 layers, while LLAMA_MAX_LAYERS is hardcoded to 512.
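For context, the check that fires lives in the model loader: per-layer hyperparameters such as llama.feed_forward_length are read into fixed-size arrays of LLAMA_MAX_LAYERS entries, so any block count above that cap is rejected up front. A paraphrased sketch of the check (not a verbatim quote from the sources):

    // per-layer keys land in std::array<uint32_t, LLAMA_MAX_LAYERS>,
    // so N_MAX == LLAMA_MAX_LAYERS == 512 and a 525-layer model fails here
    if (n > N_MAX) {
        throw std::runtime_error(format("n > N_MAX: %u > %u for key %s", n, N_MAX, key));
    }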

In order to run FatLlama 1.7T, LLAMA_MAX_LAYERS has to be bumped to 525 or larger and llama.cpp recompiled. While this is relatively easy if you use llama.cpp directly, as soon as you deal with third-party software that ships backend-specific pre-built llama-cpp-python bindings (like oobabooga/text-generation-webui), changing LLAMA_MAX_LAYERS becomes infeasible for the general user.
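For anyone who does build from source, the workaround is a one-line patch; a minimal sketch, assuming the define still lives in src/llama.cpp (search the tree for LLAMA_MAX_LAYERS to find the exact spot):

    // src/llama.cpp -- raise the compile-time layer cap
    // before: #define LLAMA_MAX_LAYERS 512
    #define LLAMA_MAX_LAYERS 1024   // any value >= 525 is enough for this model

then rebuild as usual.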

To fix this issue, LLAMA_MAX_LAYERS could either be increased or be made dynamic.
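A rough sketch of what the dynamic variant could look like, assuming the per-layer hyperparameters are kept in fixed-size std::array fields of llama_hparams as in current sources (the field names mirror the real ones, but this is an illustration, not a drop-in patch):

    #include <cstdint>
    #include <vector>

    struct llama_hparams_sketch {
        uint32_t n_layer = 0;             // llama.block_count from the GGUF, e.g. 525

        // before: std::array<uint32_t, LLAMA_MAX_LAYERS> n_ff_arr, n_head_arr, ...
        // after:  vectors sized from the actual block count, no compile-time cap
        std::vector<uint32_t> n_ff_arr;   // llama.feed_forward_length per layer
        std::vector<uint32_t> n_head_arr; // llama.attention.head_count per layer

        void resize_per_layer() {
            n_ff_arr.resize(n_layer);
            n_head_arr.resize(n_layer);
        }
    };

One plausible reason a fixed-size array was chosen in the first place is that it keeps llama_hparams trivially copyable and comparable; switching to vectors would give that up.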

Name and Version

version: 3927 (10433e8)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

root@AI:/bpool# ./llama.cpp/llama-cli -m FATLLAMA-1.7T-Instruct.SOURCE.gguf -p "I believe the meaning of life is" -n 128
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 3927 (10433e8b) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 24 key-value pairs and 4729 tensors from FATLLAMA-1.7T-Instruct.SOURCE.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Merged
llama_model_loader: - kv   3:                         general.size_label str              = 1.7T
llama_model_loader: - kv   4:                          llama.block_count u32              = 525
llama_model_loader: - kv   5:                       llama.context_length u32              = 131072
llama_model_loader: - kv   6:                     llama.embedding_length u32              = 16384
llama_model_loader: - kv   7:                  llama.feed_forward_length u32              = 53248
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 128
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 16
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 1
llama_model_loader: - kv  13:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  14:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", ...
llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32: 1052 tensors
llama_model_loader: - type  f16: 3677 tensors
llama_model_load: error loading model: error loading model hyperparameters: n > N_MAX: 525 > 512 for key llama.feed_forward_length
llama_load_model_from_file: failed to load model
common_init_from_params: failed to load model 'FATLLAMA-1.7T-Instruct.SOURCE.gguf'
main: error: unable to load model

Labels

bug-unconfirmed, medium severity, stale
