How can I configure the parameters of llama #1669
Description
How can I configure the parameters to make llama-cpp-python faster? In my tests, its inference speed is about the same as the original Hugging Face model. How should I configure the `Llama` parameters to maximize inference speed? Thanks.
llama-cpp-python qwen1.5-7b model

I send a single message:
```python
import time
from llama_cpp import Llama

llm = Llama(
    model_path="/llama.cpp/ggml-qwen-chat-f16.gguf",
    n_gpu_layers=-1,
    chat_format="qwen",
    logits_all=False,
    verbose=False,
)

start_time = time.time()
x = llm.create_chat_completion(
    messages=messages,  # chat messages, defined elsewhere
    top_p=1.0,
    top_k=50,
    temperature=1.0,
    max_tokens=512,
)
end_time = time.time()
elapsed_time = end_time - start_time
print(f"generate time: {elapsed_time:.6f} s")
```
llama-cpp-python needs 5.1 s.
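For reference, these are the `Llama` constructor arguments that most directly affect speed. The values below are illustrative assumptions, not a tested configuration, and availability of some options (e.g. `flash_attn`) depends on the installed llama-cpp-python version and build:

```python
# Hypothetical speed-oriented settings; parameter names follow the
# llama-cpp-python Llama constructor, values are placeholders to tune.
speed_kwargs = dict(
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=2048,        # smaller context window -> smaller KV cache
    n_batch=512,       # prompt-processing batch size
    n_threads=8,       # CPU threads for any layers left on the CPU
    flash_attn=True,   # flash attention, if the build supports it
)
# llm = Llama(model_path="/llama.cpp/ggml-qwen-chat-f16.gguf", **speed_kwargs)
```

With `n_gpu_layers=-1` already set, the largest remaining levers are usually the context size and the batch size; whether they help depends on your hardware.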
huggingface qwen1.5-7b model

I send the same message:
```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

start_time = time.time()
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Strip the prompt tokens so only the generated continuation is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
end_time = time.time()
elapsed_time = end_time - start_time
print(f"generate time: {elapsed_time:.6f} s")
```
The Hugging Face qwen1.5-7b model needs 6.5 s.
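One caveat about comparing the two numbers: wall-clock time conflates prompt processing and output length, so throughput in tokens per second is a fairer metric. A minimal sketch, assuming the OpenAI-style response dict that `create_chat_completion` returns (the `usage` values below are made up for illustration):

```python
def tokens_per_second(response, elapsed_seconds):
    """Generation throughput from an OpenAI-style completion response."""
    completion_tokens = response["usage"]["completion_tokens"]
    return completion_tokens / elapsed_seconds

# Illustrative numbers: 512 generated tokens in 5.1 s.
fake_response = {"usage": {"completion_tokens": 512}}
print(f"{tokens_per_second(fake_response, 5.1):.1f} tok/s")
```

Comparing tokens/s (and using the same sampling settings on both sides) makes it clearer whether the two backends actually generated the same amount of text.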