# Running Llama 3 and Gemma 2 locally with MLX
  
This notebook test-drives the [mlx-lm library](https://qwen.readthedocs.io/en/latest/run_locally/mlx-lm.html) for running LLMs locally on Apple Silicon. We use small, quantized versions of the open source models Llama 3 and Gemma 2.  
  
These timings reflect running the notebook on an M1 MacBook Air (2020) with a wee 8 GB of RAM.

## 🏗️ Set up

In [1]:
from IPython.display import Markdown
from mlx_lm import generate, load

Create a generic function to interact with either LLM.

In [14]:
def invoke_llm_mlx(
        model,
        tokenizer,
        prompt,
        system_role=None,
        prepend_system_role=False,
        max_tokens=500,
        verbose=True
        ):
    """Ask an LLM a question and get a response!"""
    # Without a system_role, the prompt is passed to the model as raw text (no chat template).
    if system_role:
        if prepend_system_role:
            # Some models, like Gemma, have no system role in their chat template,
            # so fold the role into the prompt text instead.
            prompt = (
                f"SYSTEM ROLE: {system_role}{'' if system_role.endswith('.') else '.'} PROMPT: {prompt}"
                .replace("  ", " ")
                .strip()
            )
        else:
            # Many local models, like Llama, support a system role in the chat template.
            prompt = tokenizer.apply_chat_template(
                conversation=[
                    {"role": "system", "content": system_role},
                    {"role": "user", "content": prompt}
                ],
                tokenize=False,
                add_generation_prompt=True
            )
    return Markdown(
        generate(model, tokenizer, prompt, max_tokens=max_tokens, verbose=verbose)
        .replace("ANSWER:", "")
        .replace("Answer:", "")
        .replace("<end_of_turn>", "")
        .strip(". ") # Remove periods or spaces from either side
        )
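
The prepend branch above can also be written as a small standalone helper, which makes the punctuation handling easy to check without loading a model. This is only a sketch; `prefix_system_role` is a hypothetical name and not part of `mlx-lm`.

```python
def prefix_system_role(system_role: str, prompt: str) -> str:
    """Fold a system role into a plain-text prompt for models without a system turn."""
    role = system_role.strip()
    # Add a closing period only if the role does not already end with one.
    if not role.endswith("."):
        role += "."
    # Collapse accidental double spaces, as the notebook function does.
    return f"SYSTEM ROLE: {role} PROMPT: {prompt}".replace("  ", " ").strip()


print(prefix_system_role("You are a friendly robot.", "Good morning"))
print(prefix_system_role("You are a physics database", "Provide the formula"))
```

Both calls should yield a single period between the role and the `PROMPT:` label, whether or not the role was written with one.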

Side note: the original prompt `What's the capital of Massachusetts?` led the model to generate multiple-choice responses listing different cities in Massachusetts, like a quiz, and then select the correct answer.

## 🦙 Llama 3

### Load
The first time you run the next line, it will download 5 GB of files for this version of Llama 3: 8B, instruction-tuned, 4-bit quantized.

In [3]:
model_llama3_8b_it, tokenizer_llama3_8b_it = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]

### Basic invocation
Let's see how Llama 3 on `mlx` behaves without specifying a system role.

In [4]:
%%time
invoke_llm_mlx(
    model_llama3_8b_it,
    tokenizer_llama3_8b_it,
    "What's the capital of Massachusetts?",
    max_tokens=10
    )

Prompt: What's the capital of Massachusetts?
 A) Boston B) Springfield C) Worcester D
Prompt: 7 tokens, 0.367 tokens-per-sec
Generation: 10 tokens, 0.047 tokens-per-sec
Peak memory: 4.946 GB
CPU times: user 198 ms, sys: 25.9 s, total: 26.1 s
Wall time: 3min 30s


A) Boston B) Springfield C) Worcester D

The prompt led the model to respond with multiple-choice answers. Inference is very slow (0.047 tokens/sec), and peak memory is about 5 GB, a large share of this machine's 8 GB.

In [15]:
%%time
invoke_llm_mlx(
    model_llama3_8b_it,
    tokenizer_llama3_8b_it,
    "Name the capital of Massachusetts",
    max_tokens=10
    )

Prompt: Name the capital of Massachusetts
.
Answer: Boston....more
What is
Prompt: 5 tokens, 0.294 tokens-per-sec
Generation: 10 tokens, 0.051 tokens-per-sec
Peak memory: 6.317 GB
CPU times: user 210 ms, sys: 25.5 s, total: 25.7 s
Wall time: 3min 13s



 Boston....more
What is

A better-phrased prompt yields a more direct answer, still with some fluff.
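
One possible reason for the quiz-style and run-on completions: `invoke_llm_mlx` applies the chat template only when a `system_role` is given, so the two calls above sent raw text to an instruction-tuned model. Here is a minimal sketch of templating a prompt with or without a system role. It assumes the tokenizer returned by `load` exposes `apply_chat_template` the way the function above uses it. `to_chat_prompt` is a hypothetical helper and was not part of the original run.

```python
def to_chat_prompt(tokenizer, prompt, system_role=None):
    """Wrap a prompt in the model's chat template, with or without a system role."""
    conversation = []
    if system_role:
        conversation.append({"role": "system", "content": system_role})
    conversation.append({"role": "user", "content": prompt})
    return tokenizer.apply_chat_template(
        conversation=conversation,
        tokenize=False,              # return a string rather than token ids
        add_generation_prompt=True,  # end with the header that cues the model's reply
    )
```

The returned string could be passed to `generate` in place of the raw prompt. No timings were recorded for this variant.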

### Add a system role

In [16]:
%%time
invoke_llm_mlx(
    model_llama3_8b_it,
    tokenizer_llama3_8b_it,
    prompt="Name the capital of Massachusetts",
    system_role="You are a back-end GIS system. Answer queries with the answer only, no conversation or boilerplate",
    max_tokens=10
    )

Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a back-end GIS system. Answer queries with the answer only, no conversation or boilerplate<|eot_id|><|start_header_id|>user<|end_header_id|>

Name the capital of Massachusetts<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Boston
Prompt: 40 tokens, 1.074 tokens-per-sec
Generation: 2 tokens, 0.053 tokens-per-sec
Peak memory: 6.337 GB
CPU times: user 99.9 ms, sys: 7.37 s, total: 7.47 s
Wall time: 56.6 s


Boston

The precise role gives a much tighter answer. Wall time drops because there are fewer tokens to generate; generation throughput is about the same (0.053 vs. 0.051 tokens/sec).

### Respond in LaTeX

In [17]:
%%time
invoke_llm_mlx(
    model_llama3_8b_it,
    tokenizer_llama3_8b_it,
    prompt="Provide the formula for mass-energy equivalence",
    system_role="You are a physics database. Answer queries with the answer only, no conversation or boilerplate. If the answer requires a formula, use LaTeX",
    max_tokens=50
    )

Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a physics database. Answer queries with the answer only, no conversation or boilerplate. If the answer requires a formula, use LaTeX<|eot_id|><|start_header_id|>user<|end_header_id|>

Provide the formula for mass-energy equivalence<|eot_id|><|start_header_id|>assistant<|end_header_id|>


$E = mc^2$
Prompt: 50 tokens, 1.284 tokens-per-sec
Generation: 8 tokens, 0.061 tokens-per-sec
Peak memory: 6.343 GB
CPU times: user 168 ms, sys: 21.1 s, total: 21.2 s
Wall time: 2min 33s


$E = mc^2$

### Whimsy

In [18]:
%%time
invoke_llm_mlx(
    model_llama3_8b_it,
    tokenizer_llama3_8b_it,
    prompt="Good morning",
    system_role="You are a friendly robot.",
    max_tokens=10
    )

Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a friendly robot.<|eot_id|><|start_header_id|>user<|end_header_id|>

Good morning<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Beep boop! Good morning to you too
Prompt: 23 tokens, 0.777 tokens-per-sec
Generation: 10 tokens, 0.051 tokens-per-sec
Peak memory: 6.343 GB
CPU times: user 233 ms, sys: 27.1 s, total: 27.3 s
Wall time: 3min 27s


Beep boop! Good morning to you too

## 💎 Gemma 2
Let's repeat the chats with a smaller model to compare responses and timings.  

### Load

In [19]:
model_gemma_2b_it, tokenizer_gemma_2b_it = load("mlx-community/gemma-2-2b-it-4bit")

Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]

### Basic invocation
No system role

In [20]:
%%time
invoke_llm_mlx(
    model_gemma_2b_it,
    tokenizer_gemma_2b_it,
    "Name the capital of Massachusetts",
    max_tokens=10
    )

Prompt: Name the capital of Massachusetts
.

What is the name of the largest ocean
Prompt: 6 tokens, 5.508 tokens-per-sec
Generation: 10 tokens, 14.728 tokens-per-sec
Peak memory: 7.654 GB
CPU times: user 195 ms, sys: 427 ms, total: 622 ms
Wall time: 1.74 s




What is the name of the largest ocean

Interesting answer 😁 Gemma 2 continues the text with a new question instead of answering.

### Add a system role

In [21]:
%%time
invoke_llm_mlx(
    model_gemma_2b_it,
    tokenizer_gemma_2b_it,
    prompt="Name the capital of Massachusetts",
    system_role="You are a back-end GIS system. Answer queries with the answer only, no conversation or boilerplate",
    prepend_system_role=True,
    max_tokens=10,
    )

Prompt: SYSTEM ROLE: You are a back-end GIS system. Answer queries with the answer only, no conversation or boilerplate. PROMPT: Name the capital of Massachusetts
. 


ANSWER: Boston 
<end_of_turn>
Prompt: 34 tokens, 123.626 tokens-per-sec
Generation: 10 tokens, 24.544 tokens-per-sec
Peak memory: 7.654 GB
CPU times: user 165 ms, sys: 174 ms, total: 339 ms
Wall time: 643 ms





 Boston 
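
The prompt above passes the system role as labeled plain text. Another option is to fold the role into the user message and still apply the chat template, so the model sees the turn markers it was trained on. The sketch below assumes Gemma 2's tokenizer exposes `apply_chat_template` like Llama 3's does. `gemma_chat_prompt` is a hypothetical helper and was not run for this notebook.

```python
def gemma_chat_prompt(tokenizer, prompt, system_role=None):
    """Build a templated prompt for a model whose template has no system turn."""
    # Fold the role into the user message instead of adding a "system" entry.
    content = f"{system_role}\n\n{prompt}" if system_role else prompt
    return tokenizer.apply_chat_template(
        conversation=[{"role": "user", "content": content}],
        tokenize=False,
        add_generation_prompt=True,
    )
```

Whether this improves Gemma's answers over the `SYSTEM ROLE:` / `PROMPT:` labels is untested here.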


### Respond in LaTeX

In [22]:
%%time
invoke_llm_mlx(
    model_gemma_2b_it,
    tokenizer_gemma_2b_it,
    prompt="Provide the formula for mass-energy equivalence",
    system_role="You are a physics database. Answer queries with the answer only, no conversation or boilerplate. If the answer requires a formula, use LaTeX",
    prepend_system_role=True,
    max_tokens=50
    )

Prompt: SYSTEM ROLE: You are a physics database. Answer queries with the answer only, no conversation or boilerplate. If the answer requires a formula, use LaTeX. PROMPT: Provide the formula for mass-energy equivalence
. 

ANSWER:  $E=mc^2$ 
<end_of_turn>
Prompt: 44 tokens, 158.632 tokens-per-sec
Generation: 17 tokens, 23.530 tokens-per-sec
Peak memory: 7.654 GB
CPU times: user 322 ms, sys: 302 ms, total: 624 ms
Wall time: 959 ms




  $E=mc^2$ 


### Whimsy

In [23]:
%%time
invoke_llm_mlx(
    model_gemma_2b_it,
    tokenizer_gemma_2b_it,
    prompt="Good morning",
    system_role="You are a friendly robot.",
    prepend_system_role=True,
    max_tokens=10
    )

Prompt: SYSTEM ROLE: You are a friendly robot.You are a friendly robot. PROMPT: Good morning
! How are you feeling today?  
<end_of_turn>
Prompt: 21 tokens, 103.105 tokens-per-sec
Generation: 10 tokens, 21.566 tokens-per-sec
Peak memory: 7.654 GB
CPU times: user 226 ms, sys: 208 ms, total: 434 ms
Wall time: 623 ms


! How are you feeling today?  

Note the repeated system role in the echoed prompt: when this cell ran, the helper appended `system_role` a second time because it already ended with a period.


## 💡 My lessons learned
* `mlx-lm` is a neat library for inference on Apple Silicon.
* These examples use the default temperature and other sampling settings, which may affect results.
* Gemma 2 2B was blazingly fast, with plenty of throughput in a minimal local environment with only 8 GB of RAM.
    - It's said to run in only 1 GB of RAM, and the timings here are for a kernel that still had Llama 3 8B taking up RAM too.
    - It could indeed be >100x faster than Llama 3 8B (both 4-bit quantized): generation ran at roughly 15 to 25 tokens/sec versus about 0.05 tokens/sec.
* Gemma 2 doesn't support a system role. It wasn't trained on one. It has to be added to the prompt. This will especially matter when we start using LangChain.
* As expected, the smaller Gemma 2 2B gave some lower-quality results than Llama 3 8B, but better prompts and system roles helped. More prompt engineering and system-role tuning may help further.
* Avoid running these two `mlx` open models without a system role, even for straightforward questions. They give stranger, slower answers. In this notebook, prompts without a system role also skip the chat template, which may explain part of the difference.
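
The speedup claim can be checked with a quick calculation over the generation throughput figures printed in the cell outputs above. Treat the result as indicative only: it averages a handful of short runs on one machine, with Llama 3 still resident in memory during the Gemma 2 runs.

```python
from statistics import mean

# Generation throughput (tokens per second) copied from the cell outputs above.
llama3_8b = [0.047, 0.051, 0.053, 0.061, 0.051]
gemma2_2b = [14.728, 24.544, 23.530, 21.566]

speedup = mean(gemma2_2b) / mean(llama3_8b)
print(f"Llama 3 8B: {mean(llama3_8b):.3f} tokens/sec")
print(f"Gemma 2 2B: {mean(gemma2_2b):.3f} tokens/sec")
print(f"Speedup:    {speedup:.0f}x")
```

On these numbers the ratio comes out at around 400x, comfortably above the >100x estimate.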