Description
Hi! I'm trying to run the Q4_K_M quantization of Meta-Llama-3-8B-Instruct on my Mac (M2 Pro, 16GB VRAM) using llama-cpp-python, with the following test code:
```python
from llama_cpp import Llama

llm4 = Llama(model_path="/path/to/model/Q4_K_M.gguf", chat_format="chatml")

response = llm4.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful dietologist.",
        },
        {
            "role": "user",
            "content": "Can I eat oranges after 7 pm?",
        },
    ],
    response_format={
        "type": "json_object",
    },
    temperature=0.7,
)
print(response)
```
However, the returned content is consistently empty (just an empty JSON object):
```
{'id': 'chatcmpl-d6b4c8ae-0f0a-4112-bb32-3c567f383d13', 'object': 'chat.completion', 'created': 1724142021, 'model': 'path/to/model/Q4_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '{} '}, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 51, 'completion_tokens': 2, 'total_tokens': 53}}
```
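In case it matters, this is roughly how I'm reading the assistant's reply out of that dictionary (the key names are exactly the ones in the output above):

```python
# Extract the assistant's reply from the chat-completion dict printed above.
content = response["choices"][0]["message"]["content"]
print(repr(content))  # consistently just '{} ' -- an empty JSON object
```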
Everything works fine when I run the same model with llama-cli in the terminal (a rough version of the command is included below). I've also reinstalled llama-cpp-python and rebuilt llama.cpp as per the instructions, but it didn't help. The same thing happens with the Q8 and F16 quantizations (F16 gives an insufficient-memory error when run through llama-cli, but still an empty response through llama-cpp-python). Is there anything obvious I may be missing here?
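For reference, the terminal run is along these lines (illustrative; the exact flags I pass may differ slightly), and it produces a normal, non-empty answer:

```bash
# Rough shape of the llama-cli invocation that works (illustrative, not the exact command).
./llama-cli -m /path/to/model/Q4_K_M.gguf \
    -p "Can I eat oranges after 7 pm?" \
    -n 256
```

So the model file itself seems fine; only the llama-cpp-python path gives the empty reply.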