How can I configure the parameters of llama #1669
Description
How can I configure the parameters to make llama-cpp-python faster? In my tests, its inference speed is about the same as the original Hugging Face model. How should I configure the `Llama` parameters to maximize inference speed? Thanks.
llama-cpp-python qwen1.5-7b model

I send a single message:
```python
import time
from llama_cpp import Llama

llm = Llama(
    model_path="/llama.cpp/ggml-qwen-chat-f16.gguf",
    n_gpu_layers=-1,
    chat_format="qwen",
    logits_all=False,
    verbose=False,
)

start_time = time.time()
x = llm.create_chat_completion(
    messages=messages,  # chat messages, defined elsewhere
    top_p=1.0,
    top_k=50,
    temperature=1.0,
    max_tokens=512,
)
end_time = time.time()
elapsed_time = end_time - start_time
print(f"generate time: {elapsed_time:.6f} s")
```
llama-cpp-python needs 5.1 s.
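For reference, these are the `Llama` constructor arguments that most directly affect speed. The values below are illustrative assumptions, not a tested configuration, and availability of some options (e.g. `flash_attn`) depends on the installed llama-cpp-python version and build:

```python
# Hypothetical speed-oriented settings; parameter names follow the
# llama-cpp-python Llama constructor, values are placeholders to tune.
speed_kwargs = dict(
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=2048,        # smaller context window -> smaller KV cache
    n_batch=512,       # prompt-processing batch size
    n_threads=8,       # CPU threads for any layers left on the CPU
    flash_attn=True,   # flash attention, if the build supports it
)
# llm = Llama(model_path="/llama.cpp/ggml-qwen-chat-f16.gguf", **speed_kwargs)
```

With `n_gpu_layers=-1` already set, the largest remaining levers are usually the context size and the batch size; whether they help depends on your hardware.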
huggingface qwen1.5-7b model

I send the same message:
```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

start_time = time.time()
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Strip the prompt tokens so only the generated continuation is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
end_time = time.time()
elapsed_time = end_time - start_time
print(f"generate time: {elapsed_time:.6f} s")
```
The Hugging Face qwen1.5-7b model needs 6.5 s.
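One caveat about comparing the two numbers: wall-clock time conflates prompt processing and output length, so throughput in tokens per second is a fairer metric. A minimal sketch, assuming the OpenAI-style response dict that `create_chat_completion` returns (the `usage` values below are made up for illustration):

```python
def tokens_per_second(response, elapsed_seconds):
    """Generation throughput from an OpenAI-style completion response."""
    completion_tokens = response["usage"]["completion_tokens"]
    return completion_tokens / elapsed_seconds

# Illustrative numbers: 512 generated tokens in 5.1 s.
fake_response = {"usage": {"completion_tokens": 512}}
print(f"{tokens_per_second(fake_response, 5.1):.1f} tok/s")
```

Comparing tokens/s (and using the same sampling settings on both sides) makes it clearer whether the two backends actually generated the same amount of text.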