
How can I configure the parameters of llama  #1669

@xiangxinhello

Description


How can I configure the parameters to make llama-cpp-python faster? In my tests, its inference speed is about the same as the original model from Hugging Face. How should I configure the Llama parameters to maximize inference speed? Thanks.
**qwen1.5-7b model with llama-cpp-python**, using a chat message:

```python
import time
from llama_cpp import Llama

llm = Llama(
    model_path="/llama.cpp/ggml-qwen-chat-f16.gguf",
    n_gpu_layers=-1,
    chat_format="qwen",
    logits_all=False,
    verbose=False,
)

start_time = time.time()
x = llm.create_chat_completion(
    messages=messages,
    top_p=1.0,
    top_k=50,
    temperature=1.0,
    max_tokens=512,
)
end_time = time.time()
elapsed_time = end_time - start_time
print(f"generate time: {elapsed_time:.6f} s")
```
llama-cpp-python needs 5.1 s.
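For reference, the Llama constructor parameters that most directly affect throughput are the GPU-offload, context, batch, and thread settings. A minimal sketch of a speed-oriented configuration; the values below are assumptions to tune for your hardware, not measured optima, and availability of individual parameters depends on the installed llama-cpp-python version:

```python
from llama_cpp import Llama

# Illustrative speed-related settings (values are assumptions, tune per machine).
llm = Llama(
    model_path="/llama.cpp/ggml-qwen-chat-f16.gguf",
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=2048,        # smaller context window -> less KV-cache work
    n_batch=512,       # prompt-processing batch size
    n_threads=8,       # CPU threads used for generation
    chat_format="qwen",
    verbose=False,
)
```

Using an f16 GGUF keeps the comparison with the f16 Hugging Face checkpoint fair; a quantized GGUF (e.g. Q4_K_M) would trade some quality for noticeably higher speed.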

**qwen1.5-7b model with Hugging Face transformers**, using the same message:
```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

start_time = time.time()
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
end_time = time.time()
elapsed_time = end_time - start_time
print(f"generate time: {elapsed_time:.6f} s")
```
The Hugging Face qwen1.5-7b model needs 6.5 s.
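One caveat with these timings: both runs cap generation at 512 new tokens, but the number actually generated can differ, so comparing wall-clock time alone can mislead. Tokens per second is a fairer metric. A minimal helper, assuming the OpenAI-style `usage` dict that `create_chat_completion` returns:

```python
def tokens_per_second(usage: dict, elapsed: float) -> float:
    """Throughput from an OpenAI-style usage dict and elapsed seconds."""
    return usage["completion_tokens"] / elapsed

# Example: 512 completion tokens in 5.1 s
rate = tokens_per_second({"completion_tokens": 512}, 5.1)
print(f"{rate:.1f} tok/s")  # -> 100.4 tok/s
```

On the transformers side there is no `usage` dict, but the generated-token count is available as `len(generated_ids[0])` after the slicing step above.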
