Description
Name and Version
I used the RPC backend example from llama.cpp, with both the rpc-server and the client running on Arm CPUs. The connection succeeds, but the client does not run distributed inference with the RPC backend: watching the rpc-server side, its CPU and memory usage never increase, which suggests the rpc-server does not take part in the computation. Does the current RPC backend not support the CPU backend? The commands I ran are as follows:
rpc server:
./rpc-server -H 192.168.13.12 -p 50020
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('192.168.13.12') is != '127.0.0.1'
Never expose the RPC server to an open network!
This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
create_backend: using CPU backend
Starting RPC server on 192.168.13.12:50020, backend memory: 7937 MB
Accepted client connection, free_mem=8323272704, total_mem=8323272704
Client connection closed
Accepted client connection, free_mem=8323272704, total_mem=8323272704
Client connection closed
Accepted client connection, free_mem=8323272704, total_mem=8323272704
Client connection closed
client (llama-cli):
./llama-cli -m /root/root-data/06-llama-models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf -p "I believe the life is" -n 1280 --rpc 192.168.13.12:50020
register_backend: registered backend RPC (0 devices)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (CPU)
build: 4202 (9f91251) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for aarch64-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device RPC[192.168.13.12:50020] (RPC[192.168.13.12:50020]) - 7937 MiB free
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from /root/root-data/06-llama-models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = .
llama_model_loader: - kv 2: llama.vocab_size u32 = 32000
llama_model_loader: - kv 3: llama.context_length u32 = 2048
llama_model_loader: - kv 4: llama.embedding_length u32 = 2048
llama_model_loader: - kv 5: llama.block_count u32 = 22
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 12: general.file_type u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.model str = llama
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type q4_0: 155 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 606.53 MiB (4.63 BPW)
llm_load_print_meta: general.name = .
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
8
32
8
255
16
0
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/23 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 606.53 MiB
llm_load_tensors: CPU_AARCH64 model buffer size = 519.75 MiB
......................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (4096) > n_ctx_train (2048) -- possible training context overflow
llama_kv_cache_init: CPU KV buffer size = 88.00 MiB
llama_new_context_with_model: KV self size = 88.00 MiB, K (f16): 44.00 MiB, V (f16): 44.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 280.01 MiB
llama_new_context_with_model: graph nodes = 710
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4
main: model was trained on only 2048 context tokens (4096 specified)
system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
sampler seed: 756249252
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1280, n_keep = 1
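For reference on the log above: llama.cpp registers the RPC endpoint as a GPU-class device, so tensors are only placed on the rpc-server when layers are offloaded with -ngl / --n-gpu-layers. The load log reports "offloaded 0/23 layers to GPU" and only CPU buffers, which matches the idle remote side. A sketch of the same invocation with offloading enabled (assuming -ngl 99 as shorthand for "offload all 22 layers plus the output layer" to the RPC device):
./llama-cli -m /root/root-data/06-llama-models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf -p "I believe the life is" -n 1280 --rpc 192.168.13.12:50020 -ngl 99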
Operating systems
Linux
GGML backends
CPU
Hardware
cat /proc/cpuinfo
processor : 0
BogoMIPS : 108.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x4
CPU part : 0xd0b
CPU revision : 1
processor : 1
BogoMIPS : 108.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x4
CPU part : 0xd0b
CPU revision : 1
processor : 2
BogoMIPS : 108.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x4
CPU part : 0xd0b
CPU revision : 1
processor : 3
BogoMIPS : 108.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x4
CPU part : 0xd0b
CPU revision : 1
Revision : d04170
Serial : cd446cc30f3bb088
Model : Raspberry Pi 5 Model B Rev 1.0
Models
tinyllama-1.1b-chat-v1.0.Q4_0.gguf
Problem description & steps to reproduce
build command:
mkdir build-rpc-cuda
cd build-rpc-cuda
cmake .. -DGGML_RPC=ON
cmake --build . --config Release
First, I run the rpc-server:
./rpc-server -H 192.168.13.12 -p 50020
Second, I run llama-cli on the client with the --rpc option:
./llama-cli -m /root/root-data/06-llama-models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf -p "I believe the life is" -n 1280 --rpc 192.168.13.12:50020
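To confirm that the remote device is registered before loading the model, a device listing can be checked first (assumption: the --list-devices flag is available in this build; if it is not, the "using device RPC[...]" line in the load log serves the same purpose):
./llama-cli --list-devices --rpc 192.168.13.12:50020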
First Bad Commit
No response
Relevant log output
llama_load_model_from_file: using device RPC[192.168.13.12:50020] (RPC[192.168.13.12:50020]) - 7937 MiB free
The full log is included above.