
Why doesn't it use the GPU? #1723

Open
@suwenzhuo

Description


Install:

```bash
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121 --no-cache-dir
```
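A quick sanity check after installing (a minimal sketch; `llama_supports_gpu_offload` is a low-level binding exposed by recent llama-cpp-python versions): if pip ended up installing a CPU-only build instead of the cu121 wheel, this prints False and `n_gpu_layers` has no effect.

```python
# Sketch: check whether the installed wheel was built with GPU offload.
# False here means a CPU-only build is in use (e.g. a cached wheel),
# so n_gpu_layers will be silently ignored.
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())
```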

Code:

```python
from llama_cpp import Llama

model_path = "/root/model/Llama3.1-8B-Chinese-Chat-gguf/Llama3.1-8B-Chinese-Chat.Q3_K_M.gguf"

model_kwargs = {
    "n_ctx": 8192,       # Context length to use
    "n_threads": 4,      # Number of CPU threads to use
    "n_gpu_layers": 20,  # Number of model layers to offload to GPU. Set to 0 if only using CPU
}

llm = Llama(model_path=model_path, **model_kwargs)

# NOTE: defined here but never passed to create_chat_completion() below
generation_kwargs = {
    "max_tokens": 2000,  # Max number of new tokens to generate
    # "stop": ["<|endoftext|>", ""],  # Text sequences to stop generation on
    "echo": False,       # Echo the prompt in the output
    "top_k": 3,          # Essentially greedy decoding, since the model will almost always
                         # return the highest-probability token. Set > 1 for sampling decoding
}

def chat(messages):
    res = llm.create_chat_completion(
        messages=messages
    )
    print(res['choices'][0]['message']['content'])

if __name__ == '__main__':
    while True:
        prompt = input()
        messages = [
            {"role": "user", "content": prompt}
        ]
        chat(messages)
```
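As a side note, `generation_kwargs` above is defined but never forwarded to `create_chat_completion`, so none of those settings take effect. A sketch of wiring the sampling options through (assumption: `max_tokens` and `top_k` are accepted by `create_chat_completion`, while `echo` belongs to `create_completion` and is dropped here):

```python
# Sketch: actually forward the sampling options to the chat API.
# "echo" is omitted: it is a create_completion() option and, as far
# as I know, not accepted by create_chat_completion().
def chat(messages):
    res = llm.create_chat_completion(
        messages=messages,
        max_tokens=2000,  # max number of new tokens to generate
        top_k=3,          # small top-k, close to greedy decoding
    )
    print(res['choices'][0]['message']['content'])
```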

Question:
Why isn't the GPU being used?
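One way to narrow this down (a minimal sketch, assuming the same model path as above): load with `verbose=True` so llama.cpp prints its load-time log, which reports the detected backend and how many layers were offloaded. If that log shows no CUDA device or zero offloaded layers, a CPU-only build is being used despite the cu121 index URL.

```python
# Sketch: reload the model with verbose output to inspect GPU offload.
# verbose=True is the default, but set it explicitly here; watch stderr
# for llama.cpp's load log (lines about offloaded layers / CUDA devices).
from llama_cpp import Llama

llm = Llama(
    model_path="/root/model/Llama3.1-8B-Chinese-Chat-gguf/Llama3.1-8B-Chinese-Chat.Q3_K_M.gguf",
    n_ctx=8192,
    n_gpu_layers=20,
    verbose=True,
)
```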
