LMQL fails to use llama-cpp-python to load models #297

Open
klutzydrummer opened this issue Dec 13, 2023 · 3 comments
Labels
question Questions about using LMQL.

Comments

@klutzydrummer

I run into multiple issues when trying to use LMQL in Colab.

When running a query with the verbose flag unset or set to False, I get this error:

Code:

import requests
from pathlib import Path
import lmql

model_file = "/content/zephyr-7b-beta.Q4_K_M.gguf"
if Path(model_file).exists():
    print("Model file exists")
else:
    print("Model file does not exist")
    # download model weights
    model_url = "https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf"
    r = requests.get(model_url)
    with open(model_file, "wb") as f:
        f.write(r.content)

query_string = """
"Q: What is the sentiment of the following review: ```The food was very good.```?\\n"
"A: [SENTIMENT]"
"""

lmql.run_sync(
    query_string, 
    model = lmql.model(f"local:llama.cpp:{model_file}", 
        tokenizer = 'HuggingFaceH4/zephyr-7b-beta', verbose=False))

Output:

[Loading llama.cpp model from  llama.cpp:/content/zephyr-7b-beta.Q4_K_M.gguf  with {'verbose': False} ]
Exception in thread scheduler-worker:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lmql/models/lmtp/lmtp_scheduler.py", line 269, in worker
    model = LMTPModel.load(self.model_identifier, **self.model_args)
  File "/usr/local/lib/python3.10/dist-packages/lmql/models/lmtp/backends/lmtp_model.py", line 41, in load
    return LMTPModel.registry[backend_name](model_name, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lmql/models/lmtp/backends/llama_cpp_model.py", line 22, in __init__
    self.llm = Llama(model_path=model_identifier[len("llama.cpp:"):], logits_all=True, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 841, in __init__
    with suppress_stdout_stderr(disable=self.verbose):
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/_utils.py", line 23, in __enter__
    self.old_stdout_fileno_undup = self.sys.stdout.fileno()
io.UnsupportedOperation: fileno
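
For what it's worth, the failure reproduces without llama.cpp at all: in a Colab/Jupyter kernel, sys.stdout is usually not backed by a real file descriptor, which is exactly what llama_cpp's suppress_stdout_stderr (used when verbose=False) trips over. A minimal check along these lines shows it, assuming a notebook kernel:

import io
import sys

# In a notebook kernel, sys.stdout is typically an ipykernel stream without
# a real OS-level file descriptor, so fileno() raises instead of returning one.
try:
    sys.stdout.fileno()
    print("stdout has a real file descriptor; suppress_stdout_stderr would work")
except io.UnsupportedOperation:
    print("stdout has no file descriptor; this is the same failure as in the traceback above")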

Restarting the runtime (or deleting the instance and spinning up a new one) and setting verbose to True gets past that part of the code, but execution never progresses beyond the following:
Code:

import requests
from pathlib import Path
import lmql

model_file = "/content/zephyr-7b-beta.Q4_K_M.gguf"
if Path(model_file).exists():
    print("Model file exists")
else:
    print("Model file does not exist")
    # download model weights
    model_url = "https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf"
    r = requests.get(model_url)
    with open(model_file, "wb") as f:
        f.write(r.content)

query_string = """
"Q: What is the sentiment of the following review: ```The food was very good.```?\\n"
"A: [SENTIMENT]"
"""

lmql.run_sync(
    query_string, 
    model = lmql.model(f"local:llama.cpp:{model_file}", 
        tokenizer = 'HuggingFaceH4/zephyr-7b-beta', verbose=True))

Output:

[Loading llama.cpp model from llama.cpp:/content/zephyr-7b-beta.Q4_K_M.gguf  with  {'verbose': True} ]
lmtp generate: [1, 28824, 28747, 1824, 349, 272, 21790, 302, 272, 2296, 4058, 28747, 8789, 1014, 2887, 403, 1215, 1179, 28723, 13940, 28832, 28804, 13, 28741, 28747, 28705] / '<s>Q: What is the sentiment of the following review: ```The food was very good.```?\nA: ' (26 tokens, temperature=0.0, max_tokens=128)

GPU RAM never increases; system RAM usage sometimes increases, but not consistently. It appears the model never actually loads, so no inference is ever performed.

I've tried following examples from both the documentation and from other sites with no luck. A minimal example of the error can be found in the following Colab notebook:
https://colab.research.google.com/drive/1aND43pi3v11fW_2kTYDHLaoxXV69HPWq?usp=sharing

@klutzydrummer klutzydrummer changed the title LMQL fails to use llama to load models LMQL fails to use llama-cpp-python to load models Dec 13, 2023
@klutzydrummer
Author

I realize the verbosity error is a llama-cpp-python bug that already has an open issue.

I've updated my example to show the LMQL-specific issue I am encountering.
I can load the LLM with llama-cpp-python directly and get completions; however, when I try to use LMQL I still get no output, just hanging.
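
For reference, the direct check looks roughly like this (a minimal sketch; model_file is the same GGUF path as in the snippets above, and n_ctx/max_tokens are arbitrary placeholders):

from llama_cpp import Llama

# Load the same GGUF file with llama-cpp-python directly, bypassing LMQL.
llm = Llama(model_path=model_file, n_ctx=2048, verbose=True)

# A single plain completion to confirm the model loads and can generate.
out = llm("Q: What is the sentiment of: The food was very good.\nA:", max_tokens=16)
print(out["choices"][0]["text"])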

@lbeurerkellner
Collaborator

lbeurerkellner commented Dec 17, 2023

Hi there, I just tried to reproduce this on my workstation, and it all seems to work. Can you make sure to re-install llama.cpp with the correct build flags to enable GPU support?
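
For example, from a Colab cell, something along these lines should rebuild llama-cpp-python with cuBLAS enabled (the CMAKE_ARGS value is a sketch based on the llama-cpp-python build instructions at the time; please check the project's README for your version):

import os
import subprocess

# Rebuild llama-cpp-python from source with GPU (cuBLAS) support enabled.
env = {**os.environ, "CMAKE_ARGS": "-DLLAMA_CUBLAS=on", "FORCE_CMAKE": "1"}
subprocess.run(
    ["pip", "install", "--force-reinstall", "--no-cache-dir", "llama-cpp-python"],
    env=env, check=True,
)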

My test code:

import requests
from pathlib import Path
import lmql

model_file = "/home/luca/repos/lmql/zephyr-7b-beta.Q5_K_M.gguf"

query_string = """
"Q: What is the sentiment of the following review: ```The food was very good.```?\\n"
"A: [SENTIMENT]" where len(TOKENS(SENTIMENT)) < 32
"""

result = lmql.run_sync(
    query_string, 
    model = lmql.model(f"local:llama.cpp:{model_file}", 
        tokenizer = 'HuggingFaceH4/zephyr-7b-beta', 
        verbose=False
    ))

print([result])

Note that I put a token constraint on SENTIMENT, just to make sure we don't run into termination issues here. For Colab, please note that the root cause of termination problems is sometimes asyncio-related, so maybe try running in a standalone script first.
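
If the event loop is indeed the problem, one thing to try in a notebook is awaiting the async entry point directly instead of calling run_sync inside the already-running loop; a sketch, assuming the same query_string and model_file as above:

import lmql

# In Colab/Jupyter an asyncio loop is already running, so top-level await of
# the async API avoids nesting run_sync inside that loop.
result = await lmql.run(
    query_string,
    model=lmql.model(f"local:llama.cpp:{model_file}",
                     tokenizer='HuggingFaceH4/zephyr-7b-beta'))
print(result)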

After re-installing llama.cpp, do you still see the same issue?

@lbeurerkellner lbeurerkellner added the question Questions about using LMQL. label Dec 17, 2023
@lawyinking

[Screenshot 2024-01-21 20:33:55]

Not sure why I got this error: lmql has no attribute 'run_sync'.
