# Llama C++ Python Bindings

In [1]:
from llama_cpp import Llama

### Pulling models from Hugging Face Hub

In [2]:
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen1.5-0.5B-Chat-GGUF",
    filename="*q8_0.gguf",
    verbose=False
)

qwen1_5-0_5b-chat-q8_0.gguf:   0%|          | 0.00/665M [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
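The `filename` argument in the cell above is a glob pattern rather than an exact file name. The following is a minimal sketch of how such a pattern can narrow a repository listing down to one file, assuming shell-style matching as provided by Python's `fnmatch`. The file listing is illustrative (only the `q8_0` name appears in the output above), and `pick_single_match` is a hypothetical helper, not part of `llama_cpp`.

```python
from fnmatch import fnmatch


def pick_single_match(files, pattern):
    """Return the one file name that matches a shell-style pattern."""
    matches = [name for name in files if fnmatch(name, pattern)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match for {pattern!r}, got {matches}")
    return matches[0]


# Illustrative listing; only the q8_0 file name is taken from the output above.
repo_files = [
    "README.md",
    "qwen1_5-0_5b-chat-q4_0.gguf",
    "qwen1_5-0_5b-chat-q8_0.gguf",
]

selected = pick_single_match(repo_files, "*q8_0.gguf")
print(selected)
```

A pattern that is too broad, such as `*.gguf` here, matches several files, so it is worth keeping the pattern specific to one quantization.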


### Chat Completion

In [3]:
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are an assistant who perfectly describes images."
        },
        {
            "role": "user",
            "content": "Describe this image in detail please."
        }
    ]
)

{'id': 'chatcmpl-9ab286e5-533d-4c30-b852-8e4fb4f120be',
 'object': 'chat.completion',
 'created': 1716653870,
 'model': 'C:\\Users\\IRACK\\.cache\\huggingface\\hub\\models--Qwen--Qwen1.5-0.5B-Chat-GGUF\\snapshots\\cfab082d2fef4a8736ef384dc764c2fb6887f387\\.\\qwen1_5-0_5b-chat-q8_0.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': "I'm sorry, but I cannot describe an image without seeing it first. Please provide the image you would like me to describe, and I will do my best to assist you with your request."},
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 30, 'completion_tokens': 39, 'total_tokens': 69}}
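The return value is a plain dictionary, so the reply text and token counts can be read with ordinary indexing. A small sketch using a trimmed copy of the response above (the message text is shortened); `reply_text` and `token_usage` are hypothetical helper names introduced only for illustration.

```python
def reply_text(completion):
    """Return the assistant's text from the first choice."""
    return completion["choices"][0]["message"]["content"]


def token_usage(completion):
    """Return (prompt_tokens, completion_tokens) from the usage block."""
    usage = completion["usage"]
    return usage["prompt_tokens"], usage["completion_tokens"]


# Trimmed copy of the response shown above.
sample = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "I'm sorry, but I cannot describe an image without seeing it first.",
            },
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 30, "completion_tokens": 39, "total_tokens": 69},
}

print(reply_text(sample))
print(token_usage(sample))
```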

### JSON and JSON Schema Mode

#### JSON Mode

In [4]:
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
    },
    temperature=0.7,
)

{'id': 'chatcmpl-08ac2a4b-7ccb-4ab0-a661-a079234fae32',
 'object': 'chat.completion',
 'created': 1716653894,
 'model': 'C:\\Users\\IRACK\\.cache\\huggingface\\hub\\models--Qwen--Qwen1.5-0.5B-Chat-GGUF\\snapshots\\cfab082d2fef4a8736ef384dc764c2fb6887f387\\.\\qwen1_5-0_5b-chat-q8_0.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': '{ "year": 2020, "team": "Los Angeles Dodgers", "title": "World Series", "result": "Los Angeles Dodgers defeated New York Mets, 6-4"}'},
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 35, 'completion_tokens': 42, 'total_tokens': 77}}
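In JSON mode the `content` field is still a string, so it has to be parsed before use. The sketch below parses the reply above with the standard `json` module; `parse_json_reply` is a hypothetical helper that returns `None` for text that is not valid JSON, for example a reply cut off by a token limit. Note that JSON mode constrains only the form of the reply: the `result` field above is not accurate, since the Dodgers' opponent in the 2020 World Series was the Tampa Bay Rays.

```python
import json


def parse_json_reply(content):
    """Parse a JSON-mode reply; return None if the text is not valid JSON."""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None


# The content string from the response above.
content = (
    '{ "year": 2020, "team": "Los Angeles Dodgers", "title": "World Series", '
    '"result": "Los Angeles Dodgers defeated New York Mets, 6-4"}'
)

record = parse_json_reply(content)
print(record["team"], record["year"])

# A truncated reply fails to parse and yields None instead of raising.
truncated = parse_json_reply('{ "year": 2020, "team": ')
print(truncated)
```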

#### JSON Schema Mode

In [5]:
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {"team_name": {"type": "string"}},
            "required": ["team_name"],
        },
    },
    temperature=0.7,
)

{'id': 'chatcmpl-47249210-1b59-4543-b881-310cd6d3bc9b',
 'object': 'chat.completion',
 'created': 1716653921,
 'model': 'C:\\Users\\IRACK\\.cache\\huggingface\\hub\\models--Qwen--Qwen1.5-0.5B-Chat-GGUF\\snapshots\\cfab082d2fef4a8736ef384dc764c2fb6887f387\\.\\qwen1_5-0_5b-chat-q8_0.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': '{ "team_name": "Los Angeles Angels of Baseball" }'},
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 35, 'completion_tokens': 13, 'total_tokens': 48}}
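With a schema supplied, the reply above contains only the `team_name` key. The following is a minimal sketch of checking a parsed reply against the same schema using only the standard library. It covers just `required` keys and flat property types, and `check_against_schema` is a hypothetical helper; a full validator such as the `jsonschema` package handles far more of the specification. As in JSON mode, the schema guarantees structure, not factual accuracy.

```python
import json

schema = {
    "type": "object",
    "properties": {"team_name": {"type": "string"}},
    "required": ["team_name"],
}

# Mapping from JSON Schema type names to Python types (partial).
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_against_schema(value, schema):
    """Return a list of problems; an empty list means the checks passed."""
    if not isinstance(value, dict):
        return ["top-level value is not an object"]
    problems = [
        f"missing required key: {key}"
        for key in schema.get("required", [])
        if key not in value
    ]
    for key, spec in schema.get("properties", {}).items():
        expected = JSON_TYPES.get(spec.get("type"))
        if key not in value or expected is None:
            continue
        item = value[key]
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(item, bool) and spec["type"] in ("integer", "number")
        if bool_as_number or not isinstance(item, expected):
            problems.append(f"{key} should be of type {spec['type']}")
    return problems


reply = json.loads('{ "team_name": "Los Angeles Angels of Baseball" }')
print(check_against_schema(reply, schema))
```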

### Function Calling

In [6]:
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary"
        },
        {
            "role": "user",
            "content": "Extract Jason is 25 years old"
        }
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "UserDetail",
            "parameters": {
                "type": "object",
                "title": "UserDetail",
                "properties": {
                    "name": {
                        "title": "Name",
                        "type": "string"
                    },
                    "age": {
                        "title": "Age",
                        "type": "integer"
                    }
                },
                "required": [ "name", "age" ]
            }
        }
    }],
    tool_choice={
        "type": "function",
        "function": {
            "name": "UserDetail"
        }
    }
)

{'id': 'chatcmpl-1cf03656-5f21-4e0d-ab3a-62dd2f849882',
 'object': 'chat.completion',
 'created': 1716653960,
 'model': 'C:\\Users\\IRACK\\.cache\\huggingface\\hub\\models--Qwen--Qwen1.5-0.5B-Chat-GGUF\\snapshots\\cfab082d2fef4a8736ef384dc764c2fb6887f387\\.\\qwen1_5-0_5b-chat-q8_0.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': None,
    'function_call': {'name': 'UserDetail',
     'arguments': '{ "name": "Jason", "age": 25 }'},
    'tool_calls': [{'id': 'call__0_UserDetail_cmpl-1cf03656-5f21-4e0d-ab3a-62dd2f849882',
      'type': 'function',
      'function': {'name': 'UserDetail',
       'arguments': '{ "name": "Jason", "age": 25 }'}}]},
   'logprobs': None,
   'finish_reason': 'tool_calls'}],
 'usage': {'prompt_tokens': 60, 'completion_tokens': 14, 'total_tokens': 74}}
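The function arguments come back as a JSON string inside `tool_calls`, so they need to be decoded before they can be used. A minimal sketch, using a trimmed copy of the response above: the `UserDetail` dataclass and the `first_tool_call` helper are hypothetical names that mirror the tool schema and are not provided by `llama_cpp`.

```python
import json
from dataclasses import dataclass


@dataclass
class UserDetail:
    """Python-side mirror of the UserDetail tool schema."""
    name: str
    age: int


def first_tool_call(completion):
    """Return (function name, decoded arguments) of the first tool call."""
    message = completion["choices"][0]["message"]
    call = message["tool_calls"][0]["function"]
    return call["name"], json.loads(call["arguments"])


# Trimmed copy of the response shown above.
sample = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call__0_UserDetail_cmpl-1cf03656-5f21-4e0d-ab3a-62dd2f849882",
                        "type": "function",
                        "function": {
                            "name": "UserDetail",
                            "arguments": '{ "name": "Jason", "age": 25 }',
                        },
                    }
                ],
            },
            "finish_reason": "tool_calls",
        }
    ]
}

function_name, arguments = first_tool_call(sample)
user = UserDetail(**arguments)
print(function_name, user)
```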

### Multi-modal Models

In [7]:
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

In [8]:
chat_handler = MoondreamChatHandler.from_pretrained(
    repo_id="vikhyatk/moondream2",
    filename="*mmproj*",
)

llm = Llama.from_pretrained(
    repo_id="vikhyatk/moondream2",
    filename="*text-model*",
    chat_handler=chat_handler,
    n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)

moondream2-mmproj-f16.gguf:   0%|          | 0.00/910M [00:00<?, ?B/s]


moondream2-text-model-f16.gguf:   0%|          | 0.00/2.84G [00:00<?, ?B/s]

llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from C:\Users\IRACK\.cache\huggingface\hub\models--vikhyatk--moondream2\snapshots\fa8398d264205ac3890b62e97d3c588268ed9ec4\.\moondream2-text-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = moondream2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2048
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi2.block_count u32              = 24
llama_model_loader: - kv   6:                  phi2.atte

In [9]:
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
            ]
        }
    ]
)

Llama.generate: prefix-match hit

llama_print_timings:        load time =   47680.51 ms
llama_print_timings:      sample time =      16.83 ms /    76 runs   (    0.22 ms per token,  4516.28 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (-nan(ind) ms per token, -nan(ind) tokens per second)
llama_print_timings:        eval time =    5397.93 ms /    76 runs   (   71.03 ms per token,    14.08 tokens per second)
llama_print_timings:       total time =    5518.26 ms /    76 tokens


In [13]:
print(response["choices"][0]["message"]["content"])

 The image depicts a long wooden walkway that stretches across the center of the frame, leading towards an open field. The path is surrounded by lush green grass and trees, creating a serene atmosphere. The sky above is filled with clouds, suggesting a partly cloudy day. The scene conveys a sense of tranquility as one would expect to find in such a peaceful setting.
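The cell above passes the image as a remote URL. For a local image, one option is to embed the bytes as a base64 data URI in the same `image_url` field. The sketch below assumes the chat handler accepts data URIs there; `image_bytes_to_data_uri` and `image_message` are hypothetical helpers, and the bytes used are only a placeholder (the PNG signature), not a complete image.

```python
import base64


def image_bytes_to_data_uri(data, mime="image/png"):
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(prompt, image_url):
    """Build a user message with the same shape as the cell above."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


# Placeholder bytes: the 8-byte PNG signature only. In practice, read the
# bytes of a real image file instead.
png_signature = b"\x89PNG\r\n\x1a\n"

uri = image_bytes_to_data_uri(png_signature)
messages = [image_message("What's in this image?", uri)]
print(messages[0]["content"][1]["image_url"]["url"][:22])
```

The resulting `messages` list can then be passed to `llm.create_chat_completion(messages=messages)` in place of the URL-based version.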
