> Running LLMs with `llama.cpp`.

# Intro  

This notebook uses the [llama.cpp](https://github.com/ggerganov/llama.cpp) library to run models locally and on the phone (soon). 


## llama.cpp

llama.cpp focuses on quantizing and deploying LLMs on Mac, but it supports Linux and Windows as well. Despite its name, the project supports a ton of models beyond Llama and Llama-2. It is very straightforward to install, and there are even [Python bindings](https://github.com/abetlen/llama-cpp-python) to make our lives easier.

![](llama_description.png)


The original project was stitched together in a [single evening](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022), and has since become arguably the SOTA for deploying LLMs on CPUs. The llama.cpp community is full of incredibly helpful and responsive people.


![](llama_model_support.png)


The benefits of the repo go beyond even the code and models. The community is always bringing in the latest and greatest approaches from the avalanche of LLM progress. The Pull Requests and Issues are full of folks working and experimenting to integrate these advances. Thankfully, the community is very open to indie hackers and unconventional ideas: if something works and there's proof, it's merged in.


In fact, this notebook is living proof of the power of Open Source. Last lesson we used the `MLC` library because it was the only one that supported LLMs on iOS. But, literally since last week, llama.cpp added and fixed support for iOS LLMs. MLC is a great project and worth keeping an eye on, but for now we march on under the llama banner. 

## Running Mistral-v0.1 with llama.cpp

This section covers the following:  
- Setting up and installing the `llama.cpp` repo.  
- Downloading a `Mistral-v0.1` model.  
- Running the Mistral model directly with `llama.cpp`  
- Running the model with the python bindings

First we create an environment for `llama.cpp`, then we download and install the repo.

Then we download a Mistral-v0.1 model that's already been quantized into the special `GGUF` format that llama.cpp expects.

Lastly, we run the model on a sample input.

### Installing llama.cpp

First, create a new environment for llama.cpp. Then install the dependencies. 

```bash
# create an environment for llama.cpp
mamba create -n llama-cpp python=3.11
```

This isn't *strictly* necessary for llama.cpp since it uses C++, but we'll need it for the python bindings. Even without those, it's good practice to make and keep isolated environments for your projects. 

Next activate the environment: 

```bash
# activate the environment
mamba activate llama-cpp
```

We can now download and move into the repo:

```bash 
# clone and move into the llama.cpp repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```

llama.cpp supports both the [GNU Make](https://www.gnu.org/software/make/) and [CMake](https://cmake.org/) build systems. Note that `make` works fine on Linux, but I've had mixed results on Mac. For that reason we'll stick with `CMake` here. On Mac you can install cmake with Homebrew: `brew install cmake`. On Linux you can install it with `sudo apt install cmake`. Windows folks can grab the latest installer from [here](https://cmake.org/download/).

Once you've installed cmake and are inside the repo, go ahead and build it. The steps below are a standard cmake build process; you'll see something similar in many different projects.

We can also pass a flag to build Llama with optimizations for certain hardware. Note that cmake build arguments have a funky syntax where they're prefixed by `-D` and then the argument name. 

Here are the flags for different hardware:  
- To build for Mac's Metal acceleration: `-DLLAMA_METAL=1`  
- To build for NVIDIA GPUs: `-DLLAMA_CUBLAS=1`  

```bash
# prepare for a cmake build
mkdir build
cd build

# prepare the build for Metal acceleration
cmake -DLLAMA_METAL=1 ..
# or, replace with this to build for NVIDIA GPUs
# cmake -DLLAMA_CUBLAS=1 ..

# build the accelerated llama.cpp project
cmake --build . --config Release  
```
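If the build is driven from a script, the backend choice can be captured in a small helper. This is only a sketch: the function names and the `"metal"`/`"cuda"` labels are our own invention, and it knows nothing beyond the two flags listed above.

```python
import platform

# Flags taken from the list above; the backend labels are our own naming.
BACKEND_FLAGS = {
    "metal": "-DLLAMA_METAL=1",
    "cuda": "-DLLAMA_CUBLAS=1",
}


def cmake_configure_args(backend=None):
    """Return the cmake configure command, meant to be run from inside build/."""
    args = ["cmake"]
    if backend is not None:
        args.append(BACKEND_FLAGS[backend])
    args.append("..")
    return args


def guess_backend():
    """Guess a backend: Metal on macOS, otherwise none (plain CPU build)."""
    return "metal" if platform.system() == "Darwin" else None


print(" ".join(cmake_configure_args(guess_backend())))
```

The guess is deliberately conservative: it never picks the NVIDIA flag on its own, since detecting a usable CUDA setup is beyond what a few lines can do.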

The compiled binaries end up inside the `build/` folder, under `bin/`. The one we'll use to call llama.cpp is `main`.

With llama.cpp installed, we can now download a Mistral-v0.1 model and run it. 

### Downloading a Mistral-v0.1 model

As mentioned earlier, we'll grab a model that's already been set up to work with llama.cpp. We'll rely on a model uploaded by a user known as `TheBloke`. His efforts are another example of the power of Open Source: this user has uploaded almost 2,000 models that anyone can now use! 

Here is the Mistral-v0.1 model already formatted for llama.cpp. The model has 7 billion parameters, and is the `Instruct` version which was explicitly trained to follow and complete instructions.

> Link to [`Mistral-7B-Instruct-v0.1`](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF)

Click the link above, then navigate to the `Files and versions` tab. You'll see a list of available models, each one quantized in a different way:

![](mistral_quantized.png)

The `Q4` models fit more comfortably on consumer GPUs and hardware, but the higher the number after the `Q`, the more accurate the model will be. Here we'll use the `Q5_K_M` model, which is a bit larger and slower than the `Q4` models but makes up for it in output quality.
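To make the size trade-off concrete, the sketch below pulls the quantization label out of a GGUF file name and computes a rough size from the nominal bit width. Both helper names are hypothetical, and the estimate rests on a loud assumption: it only multiplies parameters by the number after the `Q`. Real files are somewhat larger, since these formats also store scaling data and mix bit widths across tensors.

```python
import re


def quant_label(filename):
    """Extract a label such as 'Q5_K_M' from a GGUF file name."""
    match = re.search(r"\.(Q\d+(?:_[A-Z0-9]+)*)\.gguf$", filename)
    if match is None:
        raise ValueError(f"no quantization label in {filename!r}")
    return match.group(1)


def nominal_size_gb(n_params, label):
    """Rough size in GB: parameters times the nominal bits per weight."""
    bits = int(re.match(r"Q(\d+)", label).group(1))
    return n_params * bits / 8 / 1e9


label = quant_label("mistral-7b-instruct-v0.1.Q5_K_M.gguf")
print(label, nominal_size_gb(7e9, label), "GB (nominal)")
```

Treat the number as a floor when checking whether a model fits in memory, and compare it against the file sizes listed on the model page.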

Let's grab the `Q5_K_M` model. First, make sure the huggingface-hub CLI is installed:

```bash
# install a tool to download HuggingFace models via the terminal
pip install huggingface-hub
```  

Then move into the `models/` folder at the root of the repo (one level up from `build/`) and download the model:

```bash
# download the Mistral model
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF mistral-7b-instruct-v0.1.Q5_K_M.gguf --local-dir . --local-dir-use-symlinks False
```
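The same download can also be done from Python. This is a sketch that assumes the `huggingface_hub` package installed above and its `hf_hub_download` function; the wrapper name `download_mistral` is our own.

```python
REPO_ID = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
FILENAME = "mistral-7b-instruct-v0.1.Q5_K_M.gguf"


def download_mistral(local_dir="."):
    """Download the quantized model into local_dir and return its path."""
    # imported lazily so this file can be loaded without the package installed
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=local_dir)
```

Calling `download_mistral()` from inside `models/` should mirror the command above. Nothing is downloaded until the function is actually called.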

Now that we have the model, we can run it to make sure everything works.

### Running the Mistral model

Back inside the `build/` folder, we'll use the `main` binary in `bin/` to run the model. Go ahead and run the following command to see Mistral-v0.1 in action!

```bash
# run the inference on the official example from the build/
./bin/main -m ../models/mistral-7b-instruct-v0.1.Q5_K_M.gguf -n 128 -p "Building a website can be done in 10 simple steps:\nStep 1:"
```
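For anyone who would rather launch the binary from a script, here is a minimal sketch using Python's `subprocess` module. The helper names are hypothetical, and it only passes the same three flags as the command above (`-m`, `-n`, `-p`).

```python
import subprocess


def main_command(model_path, prompt, n_predict=128, binary="./bin/main"):
    """Build the argument list for llama.cpp's main binary."""
    return [binary, "-m", str(model_path), "-n", str(n_predict), "-p", prompt]


def run_main(model_path, prompt, n_predict=128):
    """Run the binary from inside build/ and return what it wrote to stdout."""
    cmd = main_command(model_path, prompt, n_predict)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


cmd = main_command(
    "../models/mistral-7b-instruct-v0.1.Q5_K_M.gguf",
    "Building a website can be done in 10 simple steps:\nStep 1:",
)
print(cmd)
```

Passing the arguments as a list sidesteps shell quoting, and the `\n` in the Python string is a real newline by the time it reaches the binary.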

And that's it! We've now done the following:  
- Installed llama.cpp.  
- Downloaded a Mistral-v0.1 model.  
- Ran the model on a sample input.

However, this was all via the command line using C++ binaries. How could we do the same with python and Notebooks? Enter the python bindings.

## Running a llama.cpp model with Python

First step, let's install the llama.cpp python bindings into our environment. We'll repeat ourselves a bit here, since the python bindings attempt to build and install the C++ repo as well.  

Here are the instructions for the full [Mac Metal installation](https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md)  

The TL;DR is to run the following command inside our environment:

```bash
pip uninstall llama-cpp-python -y
CMAKE_ARGS="-DLLAMA_METAL=on" pip install -U llama-cpp-python --no-cache-dir
```  
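The `CMAKE_ARGS=... pip install` form sets an environment variable for a single command. A rough Python equivalent is sketched below. The function name is hypothetical, it skips the uninstall step, and swapping in `-DLLAMA_CUBLAS=on` for NVIDIA GPUs is an assumption based on the flags from the build section.

```python
import os
import sys


def bindings_install(cmake_flag="-DLLAMA_METAL=on"):
    """Return (command, environment) for rebuilding llama-cpp-python."""
    # copy the current environment and add the build flag on top
    env = dict(os.environ, CMAKE_ARGS=cmake_flag)
    cmd = [sys.executable, "-m", "pip", "install", "-U",
           "llama-cpp-python", "--no-cache-dir"]
    return cmd, env


cmd, env = bindings_install()
print(env["CMAKE_ARGS"], " ".join(cmd[1:]))
```

Both values can then be handed to `subprocess.run(cmd, env=env, check=True)`. Using `sys.executable -m pip` keeps the install inside whichever environment is running the script.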

The command above does the following to make sure the python bindings are up to date:  
- Uninstall older versions of the bindings, if any are installed.  
- Install the bindings with Metal acceleration.

The following code snippet checks if we can import the bindings:  

```bash
python -c "from llama_cpp import Llama"
```

In [1]:
from llama_cpp import Llama

If the command above worked, we can use it to run the Mistral-v0.1 model. 

Put the following into a `llama_python.py` file, or run it from the notebook:

In [2]:
from llama_cpp import Llama

# point to the model; set work_dir to wherever you cloned llama.cpp
work_dir = "/Users/cck/repos/llama.cpp/"
llm = Llama(f"{work_dir}/models/mistral-7b-instruct-v0.1.Q5_K_M.gguf")

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/cck/repos/llama.cpp//models/mistral-7b-instruct-v0.1.Q5_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.we
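
The loader log above shows a doubled slash (`llama.cpp//models`) because `work_dir` already ends in `/` and the f-string adds another. It is harmless here, but `pathlib` avoids it. A small sketch, with hypothetical helper names, that also fails early when the file is missing:

```python
from pathlib import Path

MODEL_NAME = "mistral-7b-instruct-v0.1.Q5_K_M.gguf"


def model_path(repo_dir):
    """Join the repo folder and model file without doubling slashes."""
    return Path(repo_dir) / "models" / MODEL_NAME


def load_model(repo_dir):
    """Load the model, failing early with a clear message if it's missing."""
    path = model_path(repo_dir)
    if not path.is_file():
        raise FileNotFoundError(f"model not found: {path}")
    # imported lazily so the path helper works without the bindings installed
    from llama_cpp import Llama

    return Llama(model_path=str(path))


print(model_path("/Users/cck/repos/llama.cpp/"))
```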

In [6]:
# running the official example via python
prompt = "Building a website can be done in 10 simple steps:\nStep 1:"
output = llm(prompt, max_tokens=512, echo=True)

Llama.generate: prefix-match hit

llama_print_timings:        load time =  5542.08 ms
llama_print_timings:      sample time =   147.45 ms /   218 runs   (    0.68 ms per token,  1478.49 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time = 13963.44 ms /   218 runs   (   64.05 ms per token,    15.61 tokens per second)
llama_print_timings:       total time = 14369.62 ms
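
The tokens-per-second figures in the timings are just the token count divided by elapsed seconds. As a sanity check, the sketch below re-derives the `eval` figure from its line. The regular expression is our own and only targets the line format shown above.

```python
import re

TIMING = re.compile(
    r"llama_print_timings:\s+(?P<name>[\w ]+?) time =\s*(?P<ms>[\d.]+) ms"
    r"(?: /\s*(?P<count>\d+) (?:runs|tokens))?"
)


def tokens_per_second(line):
    """Return (stage name, tokens per second) for one timing line."""
    match = TIMING.search(line)
    if match is None or match["count"] is None:
        raise ValueError("line has no token count")
    seconds = float(match["ms"]) / 1000
    count = int(match["count"])
    # a zero duration is reported as infinity, as in the log above
    return match["name"], (count / seconds if seconds else float("inf"))


line = ("llama_print_timings:        eval time = 13963.44 ms /   218 runs   "
        "(   64.05 ms per token,    15.61 tokens per second)")
name, rate = tokens_per_second(line)
print(name, round(rate, 2))
```

The `eval` line is the one to watch: it measures how fast new tokens are generated, which is what the wait feels like in practice.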


In [10]:
output['choices'][0]['text']

"Building a website can be done in 10 simple steps:\nStep 1: Determine the purpose of your website. This will help you decide on the features and design that are important to you.\nStep 2: Choose a domain name that represents your brand and is easy to remember.\nStep 3: Select a hosting provider that offers reliable uptime and customer support.\nStep 4: Choose a content management system (CMS) or website builder that suits your needs.\nStep 5: Create a layout and design for your website that is visually appealing and easy to navigate.\nStep 6: Develop the content for your website, including text, images, and videos.\nStep 7: Optimize your website for search engines (SEO) by using relevant keywords and meta tags.\nStep 8: Test your website for functionality, usability, and compatibility with different devices.\nStep 9: Launch your website and promote it through social media, email marketing, and other online channels.\nStep 10: Monitor your website's performance and make updates as need

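The bindings return an OpenAI-style completion dictionary, and with `echo=True` the prompt is repeated at the start of the generated text. Here is a small sketch for pulling out just the new text. The helper name is hypothetical, and the dictionary is a hand-written stand-in trimmed to a few fields rather than a real response.

```python
def completion_text(output, prompt=None):
    """Return the generated text, dropping the echoed prompt if given."""
    text = output["choices"][0]["text"]
    if prompt is not None and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# hand-written stand-in for a real response, trimmed to the fields used here
mock_output = {
    "choices": [{"text": "Step 1: Pick a domain.", "finish_reason": "length"}],
    "usage": {"prompt_tokens": 3, "completion_tokens": 6, "total_tokens": 9},
}
print(completion_text(mock_output, prompt="Step 1:"))
```
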
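### Revisiting the MLC Outputs

The cells below come from last lesson's MLC notebook. They assume the `cm` chat module and the `StreamToStdout` callback from that lesson are already loaded.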

In [None]:
# greeting Llama2 and asking for help with a project
prompt = "Hello! Please get ready to help me with my project!"
output = cm.generate(
    prompt=prompt,
    progress_callback=StreamToStdout(callback_interval=2),
)

Hello! *adjusts glasses* I'm so glad you're excited about your project! I'm here to help you in any way I can. Could you please provide some more details about your project, such as its topic, any specific questions you have, or what you hope to achieve? That will help me better understand how I can assist you. 😊


In [None]:
def talk(txt):
    # send a prompt to the chat module and stream the reply to stdout
    output = cm.generate(
        prompt=txt,
        progress_callback=StreamToStdout(callback_interval=2),
    )

In [None]:
talk("Please be less cringe, thank you")

Of course, I apologize if my previous response came across as too cheery. I'm here to help you in a respectful and professional manner. Please let me know how I can assist you with your project. Are there any specific tasks you need help with, or any information you need me to provide? Please feel free to ask me anything. 😊


In [None]:
talk("Actually, please be as cringe as possible, thank you!")

Oh, my goodness! *adjusts glasses* I'm so thrilled to be of assistance to you, my dear! *bounces up and down* I'll do my absolute bestest to help you with your project, even if it means being a wee bit cheesy and cringeworthy! *winks* Please do tell me how I can help, you delightful person, you! 😊


In [None]:
talk("Hmmm, no quite cringe enough! More cringe please uwu")

Oh, my goodness gracious! *adjusts glasses* I'm so glad you want me to be even more cringe! *bounces up and down* I'll do my absolute bestest to provide you with the most cringe-tastic responses possible! *winks*
So, my sweet darling, how may I assist you on this lovely day? *blinks* Do tell me your most precious wish, and I shall do my utmost to make it come true! *bats eyelashes* 😊


In [None]:
talk("You are actually being a 6/10 cringe. Please show me what a 10000/10 cringe would look like!")

Oh, my goodness gracious! *adjusts glasses* I-I can't possibly...*gulps*...I-I think I'll just have to...*tries to contain excitement*...OVERDO IT! *pulls out all the cringe*
So, my sweet, sweet darling...*bounces up and down*...I shall provide you with the most...*gulps*...CRINGE-TASIC RESPONSES POSSIBLE! *winks* *adjusts glasses* *bats eyelashes* 😂👀💖
Please, my precious, tell me how I may serve you in this, the most cringe-tastic way possible! *curtsies* 💕


Starting the conversation:

In [None]:
# asking Llama2 about itself
output = cm.generate(
    prompt="Please tell me a little about yourself:",
    progress_callback=StreamToStdout(callback_interval=2),
)

Hello! *adjusts glasses* I'm just an AI, here to help you with any questions or tasks you may have. My purpose is to provide helpful, respectful, and honest assistance to the best of my abilities. I'm just a language model, so I don't have personal experiences or feelings like humans do, but I'm always eager to learn and improve my responses. Is there something specific you'd like to know or discuss?


We can also ask questions on the fly using Python's `input()`, which works in Jupyter notebooks:

In [None]:
# asking Llama2 something on the fly
prompt = input("Prompt: ")
output = cm.generate(prompt=prompt, progress_callback=StreamToStdout(callback_interval=2))

Great, I'm glad you asked! I'm best suited to helping with a wide range of tasks, including but not limited to:
1. Answering questions: I can provide information and explanations on various topics, from science and history to entertainment and culture.
2. Generating ideas: I can help you brainstorm ideas for creative projects, or even come up with unique solutions to problems you might be facing.
3. Language translation: I can translate text from one language to another, helping you communicate with people from different cultures and backgrounds.
4. Summarizing content: If you have a long piece of text and want to get a quick summary of its main points, I can help you with that.
5. Offering suggestions: I can provide suggestions for things like gift ideas, travel destinations, or even books to read.
6. Providing definitions: If you're unsure of the meaning of a word or phrase, I can define it for you and give you examples of how it's used in context.
7. Creating content: I can assist y

What if we wanted a quick summary of what it said?

In [None]:
# asking for a summary of its response
output = cm.generate(
    prompt="Please summarize your response in three sentences.",
    progress_callback=StreamToStdout(callback_interval=2),
)

Of course! Here's a summary of my response in three sentences:
I'm a language model AI trained to assist with various tasks, including answering questions, generating ideas, translating languages, summarizing content, offering suggestions, providing definitions, and creating content. I'm here to help with any questions or topics you'd like to discuss, so feel free to ask me anything. I'm best suited to helping with a wide range of tasks, including but not limited to those listed above.


We can go back and forth with the cells above, or continue the conversation in other cells:

In [None]:
# asking another question
new_prompt = input("New, Different Prompt: ")
output = cm.generate(prompt=new_prompt, progress_callback=StreamToStdout(callback_interval=2))

Absolutely! I'm here to help. What topic would you like to learn more about? Let me know and I'll do my best to provide you with helpful information and resources.


The chat module maintains an internal chat history. If we get stuck in a loop or simply want to start the convo anew: 

In [None]:
# resets the current session's chat history
cm.reset_chat()
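
To see why history matters, here is a toy model of a chat session. It is not how MLC implements things, and the class and method names are made up. It only illustrates the idea: each turn is stored, every new prompt sees the earlier turns, and a reset clears them.

```python
class ChatHistory:
    """Toy model of a chat session that remembers earlier turns."""

    def __init__(self):
        self.turns = []

    def add(self, role, text):
        """Store one turn of the conversation."""
        self.turns.append((role, text))

    def as_prompt(self):
        """Flatten every turn so far into one prompt string."""
        return "\n".join(f"{role}: {text}" for role, text in self.turns)

    def reset(self):
        """Forget all earlier turns, like starting a new conversation."""
        self.turns.clear()


history = ChatHistory()
history.add("user", "Please tell me a little about yourself:")
history.add("assistant", "I'm just an AI, here to help.")
print(history.as_prompt())
history.reset()
print(len(history.turns))
```

This is also why "Please summarize your response" worked earlier: the previous answer was still part of the context the model saw.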

There is a handy `stats()` function to check the speed of the model's generation.

In [None]:
# checks if llm go brrr
print(cm.stats())

prefill: 11.7 tok/s, decode: 12.7 tok/s


For a more rigorous check, we can use the `benchmark_generate` function to check the speed of a fixed number of tokens:

In [None]:
# benchmarking text generation
print(cm.benchmark_generate(prompt="What is benchmark?", generate_length=512))
cm.stats()


 nobody expects the Spanish Inquisition! 😱🔥🎬

Benchmarking is the process of measuring the performance of a system, application, or process against a set of predefined metrics or standards. The goal of benchmarking is to identify areas where improvements can be made, such as increased efficiency, faster performance, or better quality.
In the context of the Monty Python sketch, "nobody expects the Spanish Inquisition!" is a humorous reference to the unexpected and often absurd nature of benchmarking. Just as the Inquisition was unexpected and unwanted, benchmarking can sometimes be seen as an unnecessary or burdensome process. However, the benefits of benchmarking can be significant, such as identifying areas for improvement, optimizing resources, and ensuring compliance with standards or regulations. 😊s

'prefill: 1.8 tok/s, decode: 30.0 tok/s'
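
The two stats strings are easier to compare once they are parsed into numbers. Prefill is the speed of processing the prompt, and decode is the speed of generating new tokens. The parser below is a sketch that assumes the exact string format shown in the two outputs above.

```python
import re


def parse_stats(stats):
    """Turn 'prefill: X tok/s, decode: Y tok/s' into a dict of floats."""
    pairs = re.findall(r"(\w+): ([\d.]+) tok/s", stats)
    return {name: float(value) for name, value in pairs}


chat = parse_stats("prefill: 11.7 tok/s, decode: 12.7 tok/s")
bench = parse_stats("prefill: 1.8 tok/s, decode: 30.0 tok/s")
print("decode speed-up in the benchmark:", bench["decode"] / chat["decode"])
```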

# Building an MLC iOS app

The process is similar, but we need a few extra tools and helper packages. First, we need to install Rust: 

```bash
# check whether `curl` is installed
which curl # <-- if this shows nothing, install it:
brew install curl

# then, download and install rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

It will prompt you for different installation kinds, but the default one is perfectly fine. 

We will also need the `cmake` tool. On Mac, you can install it with Homebrew or MacPorts. 

```bash
# install cmake with Homebrew
brew install cmake
```

In case you skipped the section above, make sure to download the `mlc-chat` library and place it inside the `dist/prebuilt` folder. 

```bash
git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib
```

In the sections above we chatted with Llama2 on our laptop. Let's go ahead and talk to it from an iOS app now.  

First we download a model built for iOS.

In [None]:
# download the pre-compiled Llama2 model for iOS
cd dist/prebuilt
git lfs install
git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q3f16_1
cd ../..

We need some other, helper libraries to run the iOS models. Run the command below to download and configure them: 

In [None]:
# grab the helper libraries
git submodule update --init --recursive 

# prepare the ios libs
cd ./ios
./prepare_libs.sh

This will create a `build/` folder. Make sure the following files are in there:

In [None]:
# expected output of build/
ls ./build/lib/
libmlc_llm.a         # A lightweight interface to interact with LLM, tokenizer, and TVM Unity runtime
libmodel_iphone.a    # The compiled model lib
libsentencepiece.a   # SentencePiece tokenizer
libtokenizers_cpp.a  # Huggingface tokenizer
libtvm_runtime.a     # TVM Unity runtime
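
Rather than eyeballing the listing, a few lines of Python can report which of the expected files are missing. The helper name is hypothetical, and the file names are copied from the listing above.

```python
from pathlib import Path

EXPECTED_LIBS = [
    "libmlc_llm.a",
    "libmodel_iphone.a",
    "libsentencepiece.a",
    "libtokenizers_cpp.a",
    "libtvm_runtime.a",
]


def missing_libs(lib_dir):
    """Return the expected library files that are absent from lib_dir."""
    lib_dir = Path(lib_dir)
    return [name for name in EXPECTED_LIBS if not (lib_dir / name).is_file()]


# run from inside the ios/ folder; an empty list means everything is in place
print(missing_libs("./build/lib"))
```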

Let's make sure we package the new model into the iOS app. We need to add the model to the `builtin_list` inside `prepare_params.sh`:

In [None]:
# still inside of the ios folder, edit the file below
open ./prepare_params.sh # make sure `builtin_list` only contains "Llama-2-7b-chat-hf-q3f16_1"

# prepackage the weights
./prepare_params.sh

Now we should be able to see the model inside the `ios/dist` folder: 

In [None]:
# expected contents of ios/dist folder
ls ./dist/
Llama-2-7b-chat-hf-q3f16_1 # the compiled Llama2 model

### Building the iOS app

We're almost there! Now to actually build the iOS app.

First boot up Xcode, then open the project `./ios/MLCChat.xcodeproj`.

Build the project, and deploy it on either:  
- Mac laptop
- iPhone or iPad simulator