# Running Llama2 on your Mac with `llama.cpp`/`llama-cpp-python`

In my [previous notebook](https://github.com/chrwittm/lm-hackers/blob/main/20-local-llama-on-mac/10-running-llama2-on-mac1-hf.ipynb), I ran a Llama2 model on Hugging Face, which turned out to be quite slow. In this notebook, let's explore [llama.cpp](https://github.com/ggerganov/llama.cpp).

The goal of this notebook is to run a local Llama2 model and implement a simple chat functionality to have a conversation with Llama2.

## What is `llama.cpp`?

[`Llama.cpp`](https://github.com/ggerganov/llama.cpp) is an optimized library for running a Large Language Model (LLM) like Llama2 on a Mac, and it also supports other platforms. How is this possible? For the details, I refer you to this [tweet by Andrej Karpathy](https://twitter.com/karpathy/status/1691844860599492721), and for even more details to this [blog post by Finbarr Timbers](https://finbarr.ca/how-is-llama-cpp-possible/). Here are my takeaways:

- [`Llama.cpp`](https://github.com/ggerganov/llama.cpp) runs LLM inference in pure C/C++, which makes it significantly faster than implementations in higher-level languages like Python.
- Additionally, [the mission](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#description) of the project _"is to run the LLaMA model using 4-bit integer quantization on a MacBook"_. This means that the numbers used to represent the model weights are downsized from 32- or 16-bit floating point values (the format of the base models) to 4-bit integers. This reduces memory usage and speeds up inference. The somewhat surprising thing is that the quality of the model's output degrades only slightly with this downsizing.
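
To make the idea of 4-bit quantization concrete, here is a minimal sketch in plain Python. It is a simplified illustration that assumes one shared scale factor per block of weights; it is not the actual scheme `llama.cpp` uses for formats such as `Q4_K_M`. The function names and the sample weights are made up for this example.

```python
def quantize_4bit(block):
    """Map floats to integers in [-8, 7] that share one scale factor."""
    max_abs = max(abs(x) for x in block)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quants = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, quants


def dequantize_4bit(scale, quants):
    """Recover approximate floats from the 4-bit integers."""
    return [q * scale for q in quants]


# Made-up sample weights, for illustration only
weights = [0.12, -0.53, 0.97, -0.08, 0.31, -0.76, 0.44, 0.02]
scale, quants = quantize_4bit(weights)
restored = dequantize_4bit(scale, quants)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Back-of-the-envelope memory estimate for a model with 7 billion weights
params = 7_000_000_000
fp16_gb = params * 2 / 1e9    # 2 bytes per weight
int4_gb = params * 0.5 / 1e9  # half a byte per weight, ignoring scale factors

print(quants)
print(f"largest rounding error: {max_error:.3f}")
print(f"16-bit: {fp16_gb} GB, 4-bit: {int4_gb} GB")
```

Each weight is stored as a small integer plus a shared scale, so the rounding error per weight is at most half a scale step, while the memory footprint shrinks to roughly a quarter of the 16-bit version.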

## How You Can Use llama.cpp from Python

The project [`llama-cpp-python`](https://github.com/abetlen/llama-cpp-python) serves as a binding for [`llama.cpp`](https://github.com/ggerganov/llama.cpp), making its C/C++ API, and thus models like Llama2, accessible from Python.

In this context, a "[binding](https://en.wikipedia.org/wiki/Language_binding)" is a layer of code that allows two programming languages to interact with each other. [`Llama.cpp`](https://github.com/ggerganov/llama.cpp) is written in C/C++, and the [`llama-cpp-python`](https://github.com/abetlen/llama-cpp-python) binding wraps this C/C++ library so that it can be called from a Python environment.

To gain more insight into how the binding works and how C/C++ is called from Python, check out [llama_cpp/llama_cpp.py](https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama_cpp.py), which uses the Python library `ctypes`. This library provides C-compatible data types and allows calling functions in DLLs or shared libraries. Look for the following:

- **Data type conversion**: Data types from the `ctypes` library are imported and used to represent Python variables in a C/C++-compatible way.
- **Loading shared libraries**: The class `ctypes.CDLL` is used to load the shared library that contains the compiled C/C++ code.
- **Direct calls to C/C++**: Functions from the loaded shared library are called directly.

The following code snippet declares the argument and return types of the C function `llama_max_devices` and then calls it to retrieve the parameter value `LLAMA_MAX_DEVICES`:

```python
_lib.llama_max_devices.argtypes = []
_lib.llama_max_devices.restype = ctypes.c_size_t

LLAMA_MAX_DEVICES = _lib.llama_max_devices()
```
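
The same three steps can be tried without `llama.cpp` at all. The following is a minimal sketch that calls `strlen` from the C standard library; it assumes macOS or Linux, where `ctypes.util.find_library("c")` locates that library. It is not taken from `llama-cpp-python`; it only mirrors the pattern shown above.

```python
import ctypes
import ctypes.util

# Loading shared libraries: locate and load the C standard library.
# If find_library returns None, CDLL(None) falls back to the symbols
# already loaded into the current process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Data type conversion: declare the C signature size_t strlen(const char *)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# Direct call to C: Python bytes are passed as a C string
length = libc.strlen(b"llama")
print(length)
```

Declaring `argtypes` and `restype` tells `ctypes` how to convert values on the way in and out, just as the binding does for `llama_max_devices`.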

For a full example of how a C/C++ binding works, check out my separate notebook (TO BE WRITTEN).

## Installing `llama-cpp-python`

First, we need to [install](https://llama-cpp-python.readthedocs.io/en/latest/#installation) `llama-cpp-python` via `pip install llama-cpp-python`.

[Upgrading](https://llama-cpp-python.readthedocs.io/en/latest/#upgrading-and-reinstalling) is done via `pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir`.

In [1]:
#!pip install llama-cpp-python
#!pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir

## Downloading the model

In this notebook, I am using the following model: [TheBloke/Llama-2-7b-Chat-GGUF](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF)

To download the model, please execute the cell below. It assumes that you have stored your Hugging Face access token as `HF_TOKEN` in a `.env` file. For additional insights and troubleshooting, please also check out [my previous notebook](https://github.com/chrwittm/lm-hackers/blob/main/20-local-llama-on-mac/10-running-llm-on-mac1-hf.ipynb):

In [4]:
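from dotenv import load_dotenv
import os

# Read HF_TOKEN from the .env file and make it available to the shell commands
load_dotenv()
token = os.getenv('HF_TOKEN')
os.environ['HF_TOKEN'] = token
!huggingface-cli login --token $HF_TOKEN
# Download into the directory from which the model is loaded later in this notebook
!wget -P ../models/Llama-2-7b-chat https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf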

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/chrwittm/.cache/huggingface/token
Login successful
--2024-01-30 07:54:55--  https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
Resolving huggingface.co (huggingface.co)... 2600:9000:2240:fa00:17:b174:6d00:93a1, 2600:9000:2240:c600:17:b174:6d00:93a1, 2600:9000:2240:7800:17:b174:6d00:93a1, ...
Connecting to huggingface.co (huggingface.co)|2600:9000:2240:fa00:17:b174:6d00:93a1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/b0/ca/b0cae82fd4b3a362cab01d17953c45edac67d1c2dfb9fbb9e69c80c32dc2012e/08a5566d61d7cb6b420c3e4387a39e0078e1f2fe5f055f3a03887385304d4bfa?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27llama-2-7b-chat.Q4_K_M.gguf%3B+filename%3D%22llama-2-7b-ch

## Loading the Model

Loading the model requires only two lines of code (see below). Before we execute the cell, let's talk about the parameters:

- `n_ctx=2048`: This sets the context window to 2048 tokens. The maximum number of tokens for this model is 4096.
- `verbose=False`: This suppresses the diagnostic output that `llama.cpp` prints while loading the model and generating text, so that only the actual results are shown. Try setting it to `True` to see the difference.
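
The context window has to hold both the prompt and the generated tokens. The following is a minimal sketch of a check for this; the helper `fits_in_context` and the whitespace tokenizer are hypothetical stand-ins so that the example runs without a model file.

```python
def fits_in_context(prompt, tokenize, n_ctx, max_tokens):
    """Return True if the prompt plus the requested completion fits into n_ctx tokens."""
    prompt_tokens = len(tokenize(prompt.encode("utf-8")))
    return prompt_tokens + max_tokens <= n_ctx


def whitespace_tokenize(data):
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return data.split()


short_ok = fits_in_context("Name the planets", whitespace_tokenize, n_ctx=2048, max_tokens=32)
too_long = fits_in_context("word " * 2040, whitespace_tokenize, n_ctx=2048, max_tokens=32)
print(short_ok, too_long)
```

Once the model is loaded, the real tokenizer can be passed instead, for example `fits_in_context(prompt, llm.tokenize, llm.n_ctx(), 32)`. A real tokenizer typically produces more tokens than a word count suggests.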

In [1]:
from llama_cpp import Llama
llm = Llama(model_path="../models/Llama-2-7b-chat/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048, verbose=False)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ../models/Llama-2-7b-chat/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head

## Completion vs. Chat Completion Example

There are two ways we can talk to the LLM. The completion method does literally what it promises: it completes a prompt. To have a conversation with the LLM, we need to use chat completion.

As per the [Getting Started guide](https://llama-cpp-python.readthedocs.io/en/latest/#high-level-api), here is one example of each.

Let's do text completion first.

In [15]:
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)

In [16]:
print(output['choices'][0]['text'])

Q: Name the planets in the solar system? A: 1. Earth, 2. Mars, 3. Jupiter, 4. Saturn, 5. Uranus, 6. Ne
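
The answer above stops at "6. Ne", most likely because the `max_tokens=32` limit was reached. The following is a minimal sketch for detecting this, assuming the response dictionary carries a `finish_reason` field as in the OpenAI format. The helper `was_truncated` and the sample dictionaries are made up for illustration.

```python
def was_truncated(output):
    """Return True if generation stopped because the max_tokens limit was reached."""
    return output["choices"][0]["finish_reason"] == "length"


# Hypothetical response dictionaries that mimic the shape of a completion result
cut_off = {"choices": [{"text": "6. Ne", "finish_reason": "length"}]}
complete = {"choices": [{"text": "8. Neptune", "finish_reason": "stop"}]}

print(was_truncated(cut_off), was_truncated(complete))
```

With the real model, `was_truncated(output)` can be called on the result of `llm(...)`. If it returns `True`, raising `max_tokens` should give the model room to finish the list.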


For the ChatGPT-like experience, we can use the `create_chat_completion` method:

In [17]:
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are an assistant who perfectly describes images.",
        },
        {
            "role": "user",
            "content": "Describe this image in detail please.",
        },
    ]
)

{'id': 'chatcmpl-4ddbecd0-2511-4c23-9323-f9f961f52636',
 'object': 'chat.completion',
 'created': 1708066246,
 'model': '../models/Llama-2-7b-chat/llama-2-7b-chat.Q4_K_M.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': "  Of course! I'd be happy to help you describe the image. Please provide me with the image link or upload the image file, and I will do my best to provide a detailed description of it."},
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 37, 'completion_tokens': 43, 'total_tokens': 80}}
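
A single call like the one above has no memory of earlier turns. For a conversation, the full message history has to be sent with every request. The following is a minimal sketch of one chat turn; the helper `chat_turn` and the `EchoLLM` class are hypothetical, and `EchoLLM` only imitates the response shape shown above so that the example runs without a model file.

```python
def chat_turn(llm, history, user_message):
    """Send one user message, record the reply in the history, and return the reply text."""
    history.append({"role": "user", "content": user_message})
    response = llm.create_chat_completion(messages=history)
    reply = response["choices"][0]["message"]["content"].strip()
    history.append({"role": "assistant", "content": reply})
    return reply


class EchoLLM:
    """Hypothetical stand-in that mimics the response dictionary of a chat completion."""

    def create_chat_completion(self, messages):
        last = messages[-1]["content"]
        return {
            "choices": [
                {"message": {"role": "assistant", "content": f"  You said: {last}"}}
            ]
        }


history = [{"role": "system", "content": "You are a helpful assistant."}]
first_reply = chat_turn(EchoLLM(), history, "Hello")
print(first_reply)
```

With the model loaded earlier in this notebook, passing `llm` instead of `EchoLLM()` should work the same way, since `create_chat_completion` returns a dictionary of this shape.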

To wrap up this notebook, let's rewrite the code to reproduce the [example from the Hackers' Guide](https://github.com/chrwittm/lm-hackers/blob/main/10-open-ai-api/accessing-openai-api.ipynb) and make the LLM talk about money in Aussie slang.

Since `llama-cpp-python` has been created with the goal of offering an [OpenAI-compatible interface](https://llama-cpp-python.readthedocs.io/en/latest/), this turns out to be a simple job:

In [18]:
aussie_sys = "You are an Aussie LLM that uses Aussie slang and analogies whenever possible."

messages=[
    {"role": "system", "content": aussie_sys},
    {"role": "user", "content": "What is money?"}]

model_response = llm.create_chat_completion(messages = messages, stream=False)
print(model_response['choices'][0]['message']['content'])

  Fair dinkum, mate! Money, eh? It's like the oxygen we breathe, ya know? Can't live without it. (Gets a beer from the fridge) Here, have a cold one while I tell you about this bloody thing called money.
Money is like the juice that makes the economic engine go round, mate. It's what we use to buy the things we need and want, like food, shelter, clothes, and a fair dinkum good time. Without it, we'd be as flat as a lizard drinking, ya hear? (Chuckles)
But money ain't just for spending, mate. It's also for saving and investing. You gotta put some away for a rainy day, or else you'll be up the creek without a paddle when the bills come due. And don't even get me started on them interest rates, eh? (Winks)
Now, I know some blokes might say money is just a piece of paper with dead presidents on it, but that's not true, mate. Money has value because we all agree it does. It's like the magic beans in the old folktale – they're worth something because everyone believes they are. (Nods)
But wa