# Test the llama.cpp library
  
This notebook test-drives the llama.cpp library, through its `llama_cpp` Python bindings, for running LLMs locally.
  
The timings below come from running the notebook on an M1 MacBook Air (2020) with a wee 8 GB of RAM.

## 🏗️ Set up

In [1]:
from IPython.display import Markdown
from llama_cpp import Llama

In [2]:
def invoke_llm_llama_cpp(
        model,
        prompt,
        system_role="",
        prepend_system_role=False,
        max_tokens=20,
        temperature=0.2,
        verbose=False,
        ):
    """Use llama.cpp to chat with a loaded LLM and return the reply as Markdown.
    
    See API docs for other parameters
    https://llama-cpp-python.readthedocs.io/en/stable/api-reference/#llama_cpp.Llama.create_chat_completion
    """

    if prepend_system_role:
        # Some models, like Gemma, don't support system_role as a parameter,
        # so fold it into the user message instead.
        # Add a closing period only when the system role lacks one.
        period = "" if system_role.endswith(".") else "."
        messages = [
            {
                "role": "system",
                # The `gemma` chat template requires a system role be passed with blank content
                "content": "",
            },
            {
                "role": "user",
                "content": f"SYSTEM ROLE: {system_role}{period} PROMPT: {prompt}"
                .replace("  ", " ")
                .strip(),
            },
        ]
    else:
        messages = [
            {"role": "system", "content": system_role},
            {"role": "user", "content": prompt},
        ]
    response = model.create_chat_completion(
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature
    )
    if verbose:
        print(response)
    return Markdown(response["choices"][0]["message"]["content"])
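To see what the two branches send to the model, here is a minimal sketch that isolates the message construction in a pure function, so it can be checked without loading a model. `build_messages` is a hypothetical helper name, not part of llama-cpp-python, and it assumes the intent is to add a period only when the system role does not already end with one.

```python
def build_messages(prompt, system_role="", prepend_system_role=False):
    """Return the chat messages list for either system-role strategy.

    Hypothetical helper for illustration; not part of llama-cpp-python.
    """
    if prepend_system_role:
        # Fold the system role into the user turn for templates without one.
        period = "" if system_role.endswith(".") else "."
        content = f"SYSTEM ROLE: {system_role}{period} PROMPT: {prompt}"
        return [
            {"role": "system", "content": ""},
            {"role": "user", "content": content.replace("  ", " ").strip()},
        ]
    return [
        {"role": "system", "content": system_role},
        {"role": "user", "content": prompt},
    ]


messages = build_messages(
    "Name the capital of Massachusetts",
    system_role="You are a back-end GIS system.",
    prepend_system_role=True,
)
print(messages[1]["content"])
```

Keeping this logic separate from the model call makes it easy to eyeball the exact user turn a model like Gemma receives.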

## 🦙🦙🦙 Llama 3.1 8B quantized 4-bit


### Download
This is about a 5 GB download the first time it runs.

In [3]:
llm = Llama.from_pretrained(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    n_threads=8,
    # verbose=False
)

llama_model_loader: loaded meta data with 33 key-value pairs and 292 tensors from /Users/chad/.cache/huggingface/hub/models--bartowski--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/9a8dec50f04fa8fad1dc1e7bc20a84a512e2bb01/./Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 

In [4]:
# Alternative: download from the CLI (`!` runs a shell command in a notebook cell)
# !huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF --include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"
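Another option is to separate the download from the load in Python. The sketch below assumes the `huggingface_hub` package is installed and uses its `hf_hub_download` function, which returns the local path of the cached file; `load_llama_from_hub` is a hypothetical helper name. The imports are placed inside the function so the cell can be defined even where the packages are missing.

```python
REPO_ID = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"
FILENAME = "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"


def load_llama_from_hub(repo_id=REPO_ID, filename=FILENAME, n_threads=8):
    """Download (or reuse the cached) GGUF file, then load it from disk."""
    # Imported lazily so this cell can be defined without either package installed.
    from huggingface_hub import hf_hub_download
    from llama_cpp import Llama

    # hf_hub_download returns the local path of the cached file.
    model_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return Llama(model_path=model_path, n_threads=n_threads)


# llm = load_llama_from_hub()  # uncomment to download and load the model
```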

### Interact with llama3
Let's repeat the tests we ran in `mlx_test.ipynb`.

In [5]:
%%time
print("Without a system role")
invoke_llm_llama_cpp(
    llm,
    prompt="What's the capital of Massachusetts?",
)

Without a system role



llama_print_timings:        load time =    7441.67 ms
llama_print_timings:      sample time =       0.48 ms /     8 runs   (    0.06 ms per token, 16842.11 tokens per second)
llama_print_timings: prompt eval time =    7441.61 ms /    42 tokens (  177.18 ms per token,     5.64 tokens per second)
llama_print_timings:        eval time =   36995.41 ms /     7 runs   ( 5285.06 ms per token,     0.19 tokens per second)
llama_print_timings:       total time =   44456.24 ms /    49 tokens


CPU times: user 18 s, sys: 26.1 s, total: 44.2 s
Wall time: 44.5 s


The capital of Massachusetts is Boston.

This version of Llama3 performed better, and faster, on this task than the MLX version that had no system role. But still slower than ...
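The `llama_print_timings` lines can be turned into numbers for easier comparison between runs. This is a minimal sketch: the regular expression is my own and is fitted only to the line format shown in this notebook, so it may need adjusting for other llama.cpp versions. `parse_timings` and `tokens_per_second` are hypothetical helper names.

```python
import re

# Matches lines such as:
# llama_print_timings:        eval time =   36995.41 ms /     7 runs   (...)
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r"(?: /\s+(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log_text):
    """Return {stage: (milliseconds, count or None)} for each timing line."""
    timings = {}
    for match in TIMING_RE.finditer(log_text):
        count = match.group("count")
        timings[match.group("name")] = (
            float(match.group("ms")),
            int(count) if count is not None else None,
        )
    return timings


def tokens_per_second(ms, count):
    """Convert a stage's total milliseconds and token count to a rate."""
    return count / (ms / 1000.0)


log = """
llama_print_timings:        load time =    7441.67 ms
llama_print_timings: prompt eval time =    7441.61 ms /    42 tokens (  177.18 ms per token,     5.64 tokens per second)
llama_print_timings:        eval time =   36995.41 ms /     7 runs   ( 5285.06 ms per token,     0.19 tokens per second)
"""

timings = parse_timings(log)
eval_ms, eval_runs = timings["eval"]
print(f"generation: {tokens_per_second(eval_ms, eval_runs):.2f} tokens/s")
```

The `eval` stage is the token-by-token generation, which is where nearly all of the wall time goes in the Llama3 runs here.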

In [6]:
%%time
print("With a system role")
invoke_llm_llama_cpp(
    llm,
    prompt="Name the capital of Massachusetts",
    system_role="You are a back-end GIS system. Answer queries with the answer only, no conversation or boilerplate"
)

Llama.generate: 25 prefix-match hit, remaining 35 prompt tokens to eval


With a system role



llama_print_timings:        load time =    7441.67 ms
llama_print_timings:      sample time =       0.12 ms /     2 runs   (    0.06 ms per token, 17094.02 tokens per second)
llama_print_timings: prompt eval time =    7659.87 ms /    35 tokens (  218.85 ms per token,     4.57 tokens per second)
llama_print_timings:        eval time =    5339.65 ms /     1 runs   ( 5339.65 ms per token,     0.19 tokens per second)
llama_print_timings:       total time =   13002.58 ms /    36 tokens


CPU times: user 7.76 s, sys: 7.08 s, total: 14.8 s
Wall time: 13 s


Boston
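The `prefix-match hit` line in the output reports how many leading prompt tokens matched the previous call, so only the remaining tokens need evaluating. In the verbose runs in this notebook the two numbers add up to the reported `prompt_tokens` (28 + 42 = 70 and 28 + 15 = 43). The sketch below shows the idea with made-up token ids; it is an illustration of the concept, not the library's implementation.

```python
def longest_common_prefix(cached_tokens, new_tokens):
    """Count how many leading token ids the two sequences share."""
    shared = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        shared += 1
    return shared


# Made-up token ids: the first three stand in for a shared chat-template header.
previous_prompt = [1, 15, 27, 400, 401, 402]
next_prompt = [1, 15, 27, 500, 501, 502, 503]

hits = longest_common_prefix(previous_prompt, next_prompt)
remaining = len(next_prompt) - hits
print(f"{hits} prefix-match hit, remaining {remaining} prompt tokens to eval")
```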

In [7]:
%%time
print("Markdown response")
invoke_llm_llama_cpp(
    llm,
    prompt="Provide the formula for mass-energy equivalence",
    system_role="You are a physics database. Answer queries with the answer only, no conversation or boilerplate. If the answer requires a formula, use LaTeX",
    verbose=True
)

Llama.generate: 28 prefix-match hit, remaining 42 prompt tokens to eval


Markdown response



llama_print_timings:        load time =    7441.67 ms
llama_print_timings:      sample time =       0.73 ms /     8 runs   (    0.09 ms per token, 10899.18 tokens per second)
llama_print_timings: prompt eval time =    9125.10 ms /    42 tokens (  217.26 ms per token,     4.60 tokens per second)
llama_print_timings:        eval time =   37395.23 ms /     7 runs   ( 5342.18 ms per token,     0.19 tokens per second)
llama_print_timings:       total time =   46544.23 ms /    49 tokens


{'id': 'chatcmpl-a15b03b9-ccff-4091-a756-1c7517fc52ce', 'object': 'chat.completion', 'created': 1725843122, 'model': '/Users/chad/.cache/huggingface/hub/models--bartowski--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/9a8dec50f04fa8fad1dc1e7bc20a84a512e2bb01/./Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '$E = mc^2$'}, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 70, 'completion_tokens': 7, 'total_tokens': 77}}
CPU times: user 19.7 s, sys: 26.9 s, total: 46.6 s
Wall time: 46.6 s


$E = mc^2$
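The verbose dictionary carries more than the reply text. As a small sketch, the hypothetical helper `summarize_response` below pulls out the fields that matter most when tuning `max_tokens`: a `finish_reason` of `length` means the reply was cut off at the token limit, while `stop` means the model finished on its own. The sample dictionary is a trimmed copy of the output above.

```python
def summarize_response(response):
    """Extract the reply text, stop reason, and token counts from a response."""
    choice = response["choices"][0]
    usage = response["usage"]
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        # "length" means generation stopped because it hit max_tokens.
        "truncated": choice["finish_reason"] == "length",
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
    }


# Trimmed copy of the verbose output above (id, created, and model omitted).
response = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "$E = mc^2$"},
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 70, "completion_tokens": 7, "total_tokens": 77},
}

summary = summarize_response(response)
print(summary["content"], summary["finish_reason"], summary["truncated"])
```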

In [8]:
%%time
print("Whimsy!")
invoke_llm_llama_cpp(
    llm,
    prompt="Good morning",
    system_role="You are a friendly robot.",
    max_tokens=10,
    verbose=True
)

Llama.generate: 28 prefix-match hit, remaining 15 prompt tokens to eval


Whimsy!



llama_print_timings:        load time =    7441.67 ms
llama_print_timings:      sample time =       0.59 ms /    10 runs   (    0.06 ms per token, 16891.89 tokens per second)
llama_print_timings: prompt eval time =    6340.41 ms /    15 tokens (  422.69 ms per token,     2.37 tokens per second)
llama_print_timings:        eval time =   50334.15 ms /     9 runs   ( 5592.68 ms per token,     0.18 tokens per second)
llama_print_timings:       total time =   56698.71 ms /    24 tokens


{'id': 'chatcmpl-b971e975-c0b4-4eb0-b457-6a0ca2854089', 'object': 'chat.completion', 'created': 1725843169, 'model': '/Users/chad/.cache/huggingface/hub/models--bartowski--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/9a8dec50f04fa8fad1dc1e7bc20a84a512e2bb01/./Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "Good morning. It's nice to see you."}, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 43, 'completion_tokens': 10, 'total_tokens': 53}}
CPU times: user 20.2 s, sys: 39.5 s, total: 59.7 s
Wall time: 56.7 s


Good morning. It's nice to see you.

In [9]:
%%time
print("Turn up the temperature, turn up the creativity?")
invoke_llm_llama_cpp(
    llm,
    prompt="Good morning",
    system_role="You are a friendly robot.",
    max_tokens=10,
    temperature=0.9,
    verbose=False
)

Llama.generate: 42 prefix-match hit, remaining 1 prompt tokens to eval


Turn up the temperature, turn up the creativity?



llama_print_timings:        load time =    7441.67 ms
llama_print_timings:      sample time =       0.65 ms /    10 runs   (    0.07 ms per token, 15384.62 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =   54348.46 ms /    10 runs   ( 5434.85 ms per token,     0.18 tokens per second)
llama_print_timings:       total time =   54374.54 ms /    10 tokens


CPU times: user 16.3 s, sys: 37.5 s, total: 53.8 s
Wall time: 54.4 s


Good morning. It's nice to see you.

In [10]:
%%time
print("Try a goofier prompt")
invoke_llm_llama_cpp(
    llm,
    prompt="Good morning",
    system_role="You are a friendly analog robot from the 1950s with no emotion.",
    max_tokens=10,
    temperature=0.9,
    verbose=False
)

Llama.generate: 29 prefix-match hit, remaining 24 prompt tokens to eval


Try a goofier prompt



llama_print_timings:        load time =    7441.67 ms
llama_print_timings:      sample time =       0.63 ms /    10 runs   (    0.06 ms per token, 15974.44 tokens per second)
llama_print_timings: prompt eval time =    6156.81 ms /    24 tokens (  256.53 ms per token,     3.90 tokens per second)
llama_print_timings:        eval time =   51391.85 ms /     9 runs   ( 5710.21 ms per token,     0.18 tokens per second)
llama_print_timings:       total time =   57570.52 ms /    33 tokens


CPU times: user 23.3 s, sys: 36.7 s, total: 60 s
Wall time: 57.6 s


*I extend a mechanical arm to greet you* Good

:D With the high temperature, this response varies: sometimes "Beep boop", sometimes "I extend a mechanical arm to greet you".
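For intuition on why a higher temperature gives more varied replies, here is the textbook temperature-scaled softmax with made-up scores. This is only a sketch of the general idea: llama.cpp's actual sampling also involves other settings such as `top_k` and `top_p`, so the real probabilities will differ.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, flattened or sharpened by temperature."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

cool = softmax_with_temperature(logits, 0.2)
warm = softmax_with_temperature(logits, 0.9)
print(f"T=0.2: top token probability {cool[0]:.3f}")
print(f"T=0.9: top token probability {warm[0]:.3f}")
```

At 0.2 nearly all of the probability sits on the top-scoring token, while at 0.9 the alternatives get a real share, which fits the same prompt sometimes producing a different opening.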

## 💎 Gemma 2 2B IT quantized 4-bit on llama.cpp

### Download Gemma 2 GGUF

In [11]:
llm = Llama.from_pretrained(
    repo_id="bartowski/gemma-2-2b-it-GGUF",
    filename="gemma-2-2b-it-Q4_K_M.gguf",
    chat_format="gemma",
    n_threads=8
)

llama_model_loader: loaded meta data with 39 key-value pairs and 288 tensors from /Users/chad/.cache/huggingface/hub/models--bartowski--gemma-2-2b-it-GGUF/snapshots/855f67caed130e1befc571b52bd181be2e858883/./gemma-2-2b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 2 2b It
llama_model_loader: - kv   3:                           general.finetune str              = it
llama_model_loader: - kv   4:                           general.basename str              = gemma-2
llama_model_loader: - kv   5:                         general.size_label str              = 2B
llama_model_loader: - kv   6:                   

### Interact with Gemma 2

In [12]:
%%time
print("Without a system role")
invoke_llm_llama_cpp(
    llm,
    prompt="What's the capital of Massachusetts?",
    prepend_system_role=False,
)

Without a system role



llama_print_timings:        load time =    2001.26 ms
llama_print_timings:      sample time =       1.23 ms /    11 runs   (    0.11 ms per token,  8935.82 tokens per second)
llama_print_timings: prompt eval time =    2001.12 ms /    17 tokens (  117.71 ms per token,     8.50 tokens per second)
llama_print_timings:        eval time =    1684.52 ms /    10 runs   (  168.45 ms per token,     5.94 tokens per second)
llama_print_timings:       total time =    3711.02 ms /    27 tokens


CPU times: user 7.9 s, sys: 2.47 s, total: 10.4 s
Wall time: 3.72 s


The capital of Massachusetts is **Boston**. 


In [13]:
%%time
print("With a system role")
invoke_llm_llama_cpp(
    llm,
    prompt="Name the capital of Massachusetts",
    system_role="You are a back-end GIS system. Answer queries with the answer only, no conversation or boilerplate",
    prepend_system_role=True
)

Llama.generate: 4 prefix-match hit, remaining 38 prompt tokens to eval


With a system role



llama_print_timings:        load time =    2001.26 ms
llama_print_timings:      sample time =       0.56 ms /     4 runs   (    0.14 ms per token,  7194.24 tokens per second)
llama_print_timings: prompt eval time =    1631.38 ms /    38 tokens (   42.93 ms per token,    23.29 tokens per second)
llama_print_timings:        eval time =     446.20 ms /     3 runs   (  148.73 ms per token,     6.72 tokens per second)
llama_print_timings:       total time =    2087.87 ms /    41 tokens


CPU times: user 4.46 s, sys: 893 ms, total: 5.35 s
Wall time: 2.09 s


Boston 


In [14]:
%%time
print("Markdown response")
invoke_llm_llama_cpp(
    llm,
    prompt="Provide the formula for mass-energy equivalence",
    system_role="You are a physics database. Answer queries with the answer only, no conversation or boilerplate. If the answer requires a formula, use LaTeX",
    verbose=True
)

Llama.generate: 4 prefix-match hit, remaining 13 prompt tokens to eval


Markdown response



llama_print_timings:        load time =    2001.26 ms
llama_print_timings:      sample time =       2.26 ms /    20 runs   (    0.11 ms per token,  8845.64 tokens per second)
llama_print_timings: prompt eval time =     427.65 ms /    13 tokens (   32.90 ms per token,    30.40 tokens per second)
llama_print_timings:        eval time =    4190.58 ms /    19 runs   (  220.56 ms per token,     4.53 tokens per second)
llama_print_timings:       total time =    4656.08 ms /    32 tokens


{'id': 'chatcmpl-90985fbf-205d-486d-ba7b-fed198817752', 'object': 'chat.completion', 'created': 1725843344, 'model': '/Users/chad/.cache/huggingface/hub/models--bartowski--gemma-2-2b-it-GGUF/snapshots/855f67caed130e1befc571b52bd181be2e858883/./gemma-2-2b-it-Q4_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'The formula for mass-energy equivalence is:\n\n**E = mc²**\n\nWhere:\n\n'}, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 17, 'completion_tokens': 20, 'total_tokens': 37}}
CPU times: user 13.7 s, sys: 2.63 s, total: 16.4 s
Wall time: 4.66 s


The formula for mass-energy equivalence is:

**E = mc²**

Where:



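The LaTeX didn't come through there. This call did not set `prepend_system_role=True`, and the `prompt_tokens` count of 17 suggests the system role never reached the model. The reply was also cut off at the default `max_tokens=20` (`finish_reason` is `length`).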

In [15]:
%%time
print("Whimsy!")
invoke_llm_llama_cpp(
    llm,
    prompt="Good morning",
    system_role="You are a friendly robot.",
    max_tokens=10,
    verbose=True
)

Llama.generate: 4 prefix-match hit, remaining 7 prompt tokens to eval


Whimsy!



llama_print_timings:        load time =    2001.26 ms
llama_print_timings:      sample time =       1.17 ms /    10 runs   (    0.12 ms per token,  8576.33 tokens per second)
llama_print_timings: prompt eval time =     239.08 ms /     7 tokens (   34.15 ms per token,    29.28 tokens per second)
llama_print_timings:        eval time =    1249.07 ms /     9 runs   (  138.79 ms per token,     7.21 tokens per second)
llama_print_timings:       total time =    1503.26 ms /    16 tokens


{'id': 'chatcmpl-2eb91b24-e16d-4dc5-b5ed-a5d905a6929a', 'object': 'chat.completion', 'created': 1725843349, 'model': '/Users/chad/.cache/huggingface/hub/models--bartowski--gemma-2-2b-it-GGUF/snapshots/855f67caed130e1befc571b52bd181be2e858883/./gemma-2-2b-it-Q4_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'Good morning! 😊  How can I help you'}, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 11, 'completion_tokens': 10, 'total_tokens': 21}}
CPU times: user 7.3 s, sys: 560 ms, total: 7.86 s
Wall time: 1.51 s


Good morning! 😊  How can I help you

In [16]:
%%time
print("Try a goofier prompt")
invoke_llm_llama_cpp(
    llm,
    prompt="Good morning",
    system_role="You are a friendly analog robot from the 1950s with no emotion.",
    max_tokens=10,
    temperature=0.9,
    verbose=False
)

Try a goofier prompt


Llama.generate: 10 prefix-match hit, remaining 1 prompt tokens to eval

llama_print_timings:        load time =    2001.26 ms
llama_print_timings:      sample time =       1.02 ms /    10 runs   (    0.10 ms per token,  9813.54 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =    2935.68 ms /    10 runs   (  293.57 ms per token,     3.41 tokens per second)
llama_print_timings:       total time =    2961.03 ms /    10 tokens


CPU times: user 7.49 s, sys: 1.7 s, total: 9.19 s
Wall time: 2.97 s


Good morning! 😊  How can I help you

## 💡 My lessons learned
<b>tl;dr:</b> Out of the box, Llama3 was much faster on llama.cpp than on MLX, while Gemma 2 was a lot faster on MLX than on llama.cpp. Neither test was optimized for its framework, though, so I expect there is more performance to be gained in both.  
  
* Llama3 was MUCH faster on llama.cpp than on MLX, at least out of the box (see the caveat below). The simple system role test, for example, took about 13 seconds of wall time here instead of 57 seconds.
* Gemma was a lot faster on MLX (same caveat), but it was still responsive and usable on llama.cpp in this local environment, especially after I increased `n_threads`.
    - Gemma was also ~2x faster when I restarted the kernel and ran only the Gemma 2 section, without first loading the Llama3 8B model into RAM.
* Caveat: there are a number of quantized model options, and parameters on each framework, that I did not configure. These are likely not apples-to-apples comparisons.
    - For example, `n_gpu_layers`.
    - I wonder if a different MLX setup or more configuration would have been more performant.
* As in `mlx_test.ipynb`, Llama3 showed better quality and creativity.
* I found the documentation much more detailed and usable for llama.cpp than for MLX.
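As a starting point for the `n_gpu_layers` experiment mentioned above, here is a sketch that builds the load arguments for the Gemma 2 model with that parameter exposed. `gemma_load_kwargs` and `load_gemma` are hypothetical helper names. It assumes a llama-cpp-python build with GPU support; I have not measured what difference the setting makes on this machine.

```python
def gemma_load_kwargs(n_threads=8, n_gpu_layers=0):
    """Build keyword arguments for Llama.from_pretrained for the Gemma 2 GGUF."""
    return {
        "repo_id": "bartowski/gemma-2-2b-it-GGUF",
        "filename": "gemma-2-2b-it-Q4_K_M.gguf",
        "chat_format": "gemma",
        "n_threads": n_threads,
        # 0 keeps every layer on the CPU; -1 asks llama.cpp to offload all layers.
        "n_gpu_layers": n_gpu_layers,
    }


def load_gemma(**overrides):
    """Load the model; needs llama-cpp-python (with GPU support for offload)."""
    from llama_cpp import Llama  # lazy import so the helper above stays testable

    return Llama.from_pretrained(**gemma_load_kwargs(**overrides))


cpu_only = gemma_load_kwargs()
offloaded = gemma_load_kwargs(n_gpu_layers=-1)
print(cpu_only["n_gpu_layers"], offloaded["n_gpu_layers"])
```

Re-running the timed cells after `llm = load_gemma(n_gpu_layers=-1)` would give a like-for-like comparison against the CPU-only numbers.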