# Basics: Running Llama 3 and Gemma 2 locally

This is a 'Hello World!' for local LLMs!  
  
It runs the open source models Llama 3 and Gemma 2 in a relatively minimal local environment. It uses small, quantized models through the MLX library, which is optimized for Apple Silicon.
  
These timings reflect running the notebook on an M1 MacBook Air (2020) with a wee 8GB of RAM.

## 🏗️ Set up

In [1]:
from IPython.display import Markdown
from mlx_lm import generate, load

Create a generic function to interact with either LLM

In [2]:
def invoke_llm(
        model,
        tokenizer,
        prompt,
        system_role=None,
        prepend_system_role=False,
        max_tokens=500,
        verbose=True
        ):
    """Ask an LLM a question and get a response!"""
    if system_role:
        if prepend_system_role:
            # Some models like Gemma don't support system_role as a parameter
            prompt = (
                f"SYSTEM ROLE: {system_role}{'.' if not system_role.endswith('.') else system_role} PROMPT: {prompt}"
                .replace("  ", " ")
                .strip()
            )
        else:
            # Many local models like Llama support system_role explicitly.
            prompt = tokenizer.apply_chat_template(
                conversation=[
                    {"role": "system", "content": system_role}, 
                    {"role": "user", "content": prompt}
                    ], 
                tokenize=False,
                add_generation_prompt=True
        )
    return Markdown(
        generate(model, tokenizer, prompt, max_tokens=max_tokens, verbose=verbose)
        .replace("ANSWER:", "")
        .replace("Answer:", "")
        .replace("<end_of_turn>", "")
        .strip(". ") # Remove periods or spaces from either side
        )

## 🦙 Llama 3

### Load
The first time you run the next line, it will download 5 GB of files for this version of Llama 3: 8B, instruction-tuned, 4-bit quantized.

In [8]:
model_llama3_8b_it, tokenizer_llama3_8b_it = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]

### Basic invocation
No system role

In [4]:
%%time
invoke_llm(
    model_llama3_8b_it,
    tokenizer_llama3_8b_it,
    "Name the capital of Massachusetts",
    max_tokens=10
    )

Prompt: Name the capital of Massachusetts
.
Answer: Boston....more
What is
Prompt: 5 tokens, 0.235 tokens-per-sec
Generation: 10 tokens, 0.046 tokens-per-sec
Peak memory: 4.945 GB
CPU times: user 221 ms, sys: 25.9 s, total: 26.1 s
Wall time: 3min 36s



 Boston....more
What is

Side note: the original prompt `What's the capital of Massachusetts?` led the model to generate multiple choice responses with different cities in Massacshueetts, like a quiz, and then select the correct answer

### Add a system role

In [5]:
%%time
invoke_llm(
    model_llama3_8b_it,
    tokenizer_llama3_8b_it,
    prompt="Name the capital of Massachusetts",
    system_role="You are a back-end GIS system. Answer queries with the answer only, no conversation or boilerplate",
    max_tokens=10
    )

Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a back-end GIS system. Answer queries with the answer only, no conversation or boilerplate<|eot_id|><|start_header_id|>user<|end_header_id|>

Name the capital of Massachusetts<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Boston
Prompt: 40 tokens, 1.132 tokens-per-sec
Generation: 2 tokens, 0.054 tokens-per-sec
Peak memory: 4.965 GB
CPU times: user 79 ms, sys: 7.25 s, total: 7.33 s
Wall time: 54.3 s


Boston

Substantially faster with a precise role

### Respond in LaTeX

In [6]:
%%time
invoke_llm(
    model_llama3_8b_it,
    tokenizer_llama3_8b_it,
    prompt="Provide the formula for mass-energy equivalence",
    system_role="You are a physics database. Answer queries with the answer only, no conversation or boilerplate. If the answer requires a formula, use LaTeX",
    max_tokens=50
    )

Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a physics database. Answer queries with the answer only, no conversation or boilerplate. If the answer requires a formula, use LaTeX<|eot_id|><|start_header_id|>user<|end_header_id|>

Provide the formula for mass-energy equivalence<|eot_id|><|start_header_id|>assistant<|end_header_id|>


$E = mc^2$
Prompt: 50 tokens, 1.368 tokens-per-sec
Generation: 8 tokens, 0.049 tokens-per-sec
Peak memory: 4.971 GB
CPU times: user 172 ms, sys: 22.1 s, total: 22.3 s
Wall time: 2min 59s


$E = mc^2$

### Whimsy

In [7]:
%%time
invoke_llm(
    model_llama3_8b_it,
    tokenizer_llama3_8b_it,
    prompt="Good morning",
    system_role="You are a friendly robot.",
    max_tokens=10
    )

Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a friendly robot.<|eot_id|><|start_header_id|>user<|end_header_id|>

Good morning<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Beep boop! Good morning to you too
Prompt: 23 tokens, 0.576 tokens-per-sec
Generation: 10 tokens, 0.044 tokens-per-sec
Peak memory: 4.971 GB
CPU times: user 232 ms, sys: 30.2 s, total: 30.4 s
Wall time: 4min 2s


Beep boop! Good morning to you too

## 💎 Gemma 2
Let's repeat the chats with a smaller model to compare responses and timings.  
  
> ℹ️ **Info:** The cell timings below are when the `Gemma 2` section was run in a fresh kernel, in which we did *not* first run the `Llama 3` section. When we load Llama 3 into RAM too, the Gemma timings were sometimes slower.

### Load

In [3]:
model_gemma_2b_it, tokenizer_gemma_2b_it = load("mlx-community/gemma-2-2b-it-4bit")

Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]

### Basic invocation
No system role

In [4]:
%%time
invoke_llm(
    model_gemma_2b_it,
    tokenizer_gemma_2b_it,
    "Name the capital of Massachusetts",
    max_tokens=10
    )

Prompt: Name the capital of Massachusetts
.

What is the name of the largest ocean
Prompt: 6 tokens, 9.938 tokens-per-sec
Generation: 10 tokens, 22.907 tokens-per-sec
Peak memory: 1.501 GB
CPU times: user 151 ms, sys: 213 ms, total: 364 ms
Wall time: 1 s




What is the name of the largest ocean

Interesting answer 😁

### Add a system role

In [5]:
%%time
invoke_llm(
    model_gemma_2b_it,
    tokenizer_gemma_2b_it,
    prompt="Name the capital of Massachusetts",
    system_role="You are a back-end GIS system. Answer queries with the answer only, no conversation or boilerplate",
    prepend_system_role=True,
    max_tokens=10,
    )

Prompt: SYSTEM ROLE: You are a back-end GIS system. Answer queries with the answer only, no conversation or boilerplate. PROMPT: Name the capital of Massachusetts
. 


ANSWER: Boston 
<end_of_turn>
Prompt: 34 tokens, 129.861 tokens-per-sec
Generation: 10 tokens, 25.351 tokens-per-sec
Peak memory: 1.534 GB
CPU times: user 145 ms, sys: 69 ms, total: 214 ms
Wall time: 619 ms





 Boston 


### Respond in LaTeX

In [6]:
%%time
invoke_llm(
    model_gemma_2b_it,
    tokenizer_gemma_2b_it,
    prompt="Provide the formula for mass-energy equivalence",
    system_role="You are a physics database. Answer queries with the answer only, no conversation or boilerplate. If the answer requires a formula, use LaTeX",
    prepend_system_role=True,
    max_tokens=50
    )

Prompt: SYSTEM ROLE: You are a physics database. Answer queries with the answer only, no conversation or boilerplate. If the answer requires a formula, use LaTeX. PROMPT: Provide the formula for mass-energy equivalence
. 

ANSWER:  $E=mc^2$ 
<end_of_turn>
Prompt: 44 tokens, 164.438 tokens-per-sec
Generation: 17 tokens, 27.463 tokens-per-sec
Peak memory: 1.545 GB
CPU times: user 251 ms, sys: 99.8 ms, total: 351 ms
Wall time: 851 ms




  $E=mc^2$ 


### Whimsy

In [7]:
%%time
invoke_llm(
    model_gemma_2b_it,
    tokenizer_gemma_2b_it,
    prompt="Good morning",
    system_role="You are a friendly robot.",
    prepend_system_role=True,
    max_tokens=10
    )

Prompt: SYSTEM ROLE: You are a friendly robot.You are a friendly robot. PROMPT: Good morning
! How are you feeling today?  
<end_of_turn>
Prompt: 21 tokens, 137.741 tokens-per-sec
Generation: 10 tokens, 24.385 tokens-per-sec
Peak memory: 1.545 GB
CPU times: user 151 ms, sys: 64.2 ms, total: 215 ms
Wall time: 522 ms


! How are you feeling today?  


## 💡 My lessons learned
* MLX is a great library for inference on Apple Silicon.
* Gemma 2 2B is said to run in only 1 GB of RAM, and indeed it could be >100x faster than Llama3 8B (both 4bit quantized versions)
* Gemma 2 2B was plenty fast enough on a light local environment with only 8 GB of RAM, hitting several tokens per second. 
    - Except when the RAM was jammed, such as when Llama 3 was loaded into memory at the same time. 
* As expected, the smaller model Gemma 2B gave some lower-quality results than Llama3 8B. More prompt engineering and tuning of system role?
* Avoid running these two open models without a system role, even for straightforward questions. Both gave responses that were stranger and slower.