> A look at `llama.cpp`, and running Llama2 with the Machine Learning Compilation (`MLC`) library.

# Intro  

This notebook runs a local Llama2 model. Ideally, you will be able to run this on your laptop. And if not, that's where the Cloud GPUs from the previous class will come in handy.  


Below are two good libraries for running and deploying ML models. 


## Llama.cpp
Fantastic library on GitHub [here](https://github.com/ggerganov/llama.cpp). This was patched together as a hackathon project, and is now arguably the SOTA for deploying local LLMs on CPUs. Great proof of "just do things", and that there is still tons of low-hanging fruit in the space.  

To best leverage the repo, check out the Pull Requests for the latest updates. Folks are constantly weaving in the newest approaches. Indie hackers propose unconventional ideas all the time: if it works and there's proof, people accept it.  

## MLC
A tool for deploying ML models across all major architectures.  
They have a companion [course](https://mlc.ai/index.html) that gets into many details, and the code lives on GitHub [here](https://github.com/mlc-ai/mlc-llm).

# llama.cpp

At the bleeding-edge of quantizing and deploying LLMs on a Mac.   
Very active community. Check contributions and commits.  
Good practice: look at the Issues and Pull Requests.  
Supports many models.  
Many quantization options.  
Has python bindings. 
Easy to install and use. 

Worth a look, may come back to it later.  

# MLC LLM 

Sticking with MLC because they have a workflow for iOS and Android apps. Start by running it locally on our laptops.   

**High-Level Steps**:  
- Download a Llama2 model.  
- Build the MLC python environment. 
- Run Llama2 locally.  
- Download a model compiled for iOS or Android.  
- Run the model on a phone app.  
- Compile a different HF model for iOS.  

MLC is focused on bridging the Valley of Death:  
Making powerful SOTA models run on edge hardware.  
Compiles LLMs for all major devices and architectures.  
Has a companion course, very worth checking out.  
It's like the hardware-level version of this course.  

### Aside: Downloading the official Llama2 Models

Llama2 models are on the HF Model Hub [here](https://huggingface.co/models?search=llama2).  
You have to submit a form and accept the license first.  

We'll be using different model versions already on the Hub.


## Steps

### Create a python environment for MLC 

We'll use python3.11 in the MLC environment. 

```bash
# create a python3.11 environment for MLC
mamba create -n mlc-llm python=3.11
# or, with plain conda: conda create -n mlc-llm python=3.11
```

Next, let's activate the environment. 

```bash
# activate the environment
mamba activate mlc-llm  
```

Now we're ready to install the MLC python library. MLC has pre-built binaries available. Full instructions [here](https://mlc.ai/package/).  

For Mac, install it with the following command: 

```bash
# installing the mlc python library
pip install --pre --force-reinstall \
    mlc-ai-nightly mlc-chat-nightly -f https://mlc.ai/wheels
```

You can check if the library installed correctly by running the following:

```bash
# checks if the mlc python api works
python -c "from mlc_chat import ChatModule; print(ChatModule)"
```
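If the one-liner above fails, it can help to see which piece is missing. Below is a minimal sketch that uses only the standard library. The helper name `is_installed` is made up for illustration, and `mlc_chat` is the module name taken from the import above.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Report the interpreter version and whether the MLC chat module is visible.
print(f"python {sys.version_info.major}.{sys.version_info.minor}")
print(f"mlc_chat installed: {is_installed('mlc_chat')}")
```

If the version is not the one you expected, the `mlc-llm` environment is probably not the active one.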

### Managing large model files

We're going to start downloading the model weights now.  
Weight files are very large and get unwieldy in regular git repositories.  
It would be very expensive to push and pull a lot of data that we know won't change, at least not until we fine-tune it.  
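To get a feel for "very large", here is a back-of-the-envelope sketch. It assumes a model with roughly 7 billion parameters and counts only the bits used to store each weight. Real files also carry quantization scales and metadata, so treat the numbers as rough lower bounds. The function name is hypothetical.

```python
def approx_weight_size_gb(n_params: float, bits_per_weight: int) -> float:
    """Rough size in GB (10^9 bytes), ignoring quantization scales and metadata."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9  # assumption: a "7b" model has roughly 7 billion parameters
for bits in (16, 4, 3):
    size = approx_weight_size_gb(n_params, bits)
    print(f"{bits:>2}-bit weights: ~{size:.1f} GB")
```

Even heavily quantized, the weights land in the gigabyte range, which is far more than a regular git repository handles comfortably.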

We'll use the `git-lfs` tool to manage large files. Installation instructions are [here](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage?platform=mac).

For Mac, you can use either the Homebrew or MacPorts package managers.

```bash
# install git-lfs with Homebrew
brew install git-lfs
```

```bash
# or, install it with MacPorts
port install git-lfs
```

On Debian-based Linux, you can install `git-lfs` in two commands:

```bash
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
```

### Creating a local Llama2 chat app

First, clone the `mlc-llm` library and move inside it. 

```bash
# make a project folder for the mlc-llm library
git clone https://github.com/mlc-ai/mlc-llm.git

# move into the folder
cd mlc-llm/
```

We need somewhere to put the model weights. We'll download one of the prebuilt models and put it inside the `dist/prebuilt` folder. 

```bash
# create the directory for prebuilt models
mkdir -p dist/prebuilt
```

MLC uses some prebuilt libraries to run and configure the Llama2 chat app. Let's grab these and put them into the prebuilt folder so that the model can use them.

```bash
# clone the MLC prebuilt libraries and configs
git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib
```

We're ready to grab the model. `git-lfs` will work its magic to grab the weights.  

```bash
# download the pre-compiled Llama2 chat model 
cd dist/prebuilt
git lfs install
git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q4f16_1
cd ../..
```
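If `git-lfs` was not set up before cloning, the weight files can end up as tiny text "pointer" files instead of the real data. Git LFS pointer files start with a fixed `version https://git-lfs.github.com/spec/v1` line, so we can scan for them. This is a minimal sketch: the helper names are made up, and the model path assumes we are in the `mlc-llm` folder after running the commands above.

```python
from pathlib import Path

# Git LFS pointer files are small text files that begin with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if `path` looks like an un-downloaded Git LFS pointer."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(folder: Path) -> list[Path]:
    """List files under `folder` that are still LFS pointers, skipping .git."""
    return sorted(
        p
        for p in folder.rglob("*")
        if p.is_file() and ".git" not in p.parts and is_lfs_pointer(p)
    )


model_dir = Path("dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1")
if model_dir.is_dir():
    leftovers = find_lfs_pointers(model_dir)
    print(f"{len(leftovers)} file(s) are still LFS pointers")
else:
    print(f"{model_dir} not found; run this from the mlc-llm folder")
```

If any pointers are left over, running `git lfs pull` inside the model folder should fetch the real files.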

### Chatting with Llama2 via the CLI

We can run a simple python script to chat with the downloaded Llama2 model. Put the following code from the official [MLC tutorial](https://llm.mlc.ai/docs/deploy/python.html) into a file called `chat.py`: 

```python
"""
Using the MLC chat module to talk with Llama2.
"""

# import the chat module
from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

# From the mlc-llm directory, run
# $ python chat.py

# Create a ChatModule instance
cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")
# You can change to other models that you downloaded, for example,
# cm = ChatModule(model="Llama-2-13b-chat-hf-q4f16_1")  # Llama2 13b model

output = cm.generate(
   prompt="What is the meaning of life?",
   progress_callback=StreamToStdout(callback_interval=2),
)

# Print prefill and decode performance statistics
print(f"Statistics: {cm.stats()}\n")

output = cm.generate(
   prompt="How many points did you list out?",
   progress_callback=StreamToStdout(callback_interval=2),
)

# Reset the chat module by
# cm.reset_chat()
```
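The script above asks two fixed questions. For a back-and-forth session in the terminal, one option is a small loop around `generate` and `reset_chat`. The sketch below is not part of the MLC tutorial. `chat_loop` and the `/quit` and `/reset` commands are hypothetical, and the generate and reset functions are passed in as plain callables so the loop can be tried without loading a model.

```python
from typing import Callable


def chat_loop(
    generate: Callable[[str], object],
    reset: Callable[[], None],
    read: Callable[[str], str] = input,
) -> int:
    """Read prompts until /quit or end of input; return how many prompts were sent."""
    turns = 0
    while True:
        try:
            text = read("You: ").strip()
        except EOFError:
            break
        if text == "/quit":
            break
        if text == "/reset":
            reset()  # start a fresh conversation
            continue
        if not text:
            continue  # ignore empty lines
        generate(text)
        turns += 1
    return turns


# Example wiring, assuming `cm` and `StreamToStdout` from the script above:
# chat_loop(
#     lambda p: cm.generate(prompt=p, progress_callback=StreamToStdout(callback_interval=2)),
#     cm.reset_chat,
# )
```

Passing the callables in keeps the loop independent of the model, so it is easy to swap in a different `ChatModule` later.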

### Chatting with Llama2 in a Notebook

A peek at a very powerful workflow: interactively chatting with an LLM in a Notebook.

We can directly use the `ChatModule` inside of a Notebook.

In [1]:
! cd /Users/cck/repos/mlc-llm

In [2]:
path_to_repo = '/Users/cck/repos/mlc-llm'
import os
os.chdir(path_to_repo)

In [3]:
pwd

'/Users/cck/repos/mlc-llm'

In [4]:
# import the chat module
from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

# initialize the chat module right in the notebook
cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")

In [5]:
print(1 + 1)

2


In [12]:
prompt = "Hello! Please get ready to help me with my project!"
# asking Llama2 about itself
output = cm.generate(
    prompt=prompt,
    progress_callback=StreamToStdout(callback_interval=2),
)

Hello! *adjusts glasses* I'm so glad you're excited about your project! I'm here to help you in any way I can. Could you please provide some more details about your project, such as its topic, any specific questions you have, or what you hope to achieve? That will help me better understand how I can assist you. 😊


In [13]:
def talk(txt):
    # small helper so we can chat with one short call per cell
    output = cm.generate(
        prompt=txt,
        progress_callback=StreamToStdout(callback_interval=2),
    )

In [14]:
talk("Please be less cringe, thank you")

Of course, I apologize if my previous response came across as too cheery. I'm here to help you in a respectful and professional manner. Please let me know how I can assist you with your project. Are there any specific tasks you need help with, or any information you need me to provide? Please feel free to ask me anything. 😊


In [15]:
talk("Actually, please be as cringe as possible, thank you!")

Oh, my goodness! *adjusts glasses* I'm so thrilled to be of assistance to you, my dear! *bounces up and down* I'll do my absolute bestest to help you with your project, even if it means being a wee bit cheesy and cringeworthy! *winks* Please do tell me how I can help, you delightful person, you! 😊


In [16]:
talk("Hmmm, no quite cringe enough! More cringe please uwu")

Oh, my goodness gracious! *adjusts glasses* I'm so glad you want me to be even more cringe! *bounces up and down* I'll do my absolute bestest to provide you with the most cringe-tastic responses possible! *winks*
So, my sweet darling, how may I assist you on this lovely day? *blinks* Do tell me your most precious wish, and I shall do my utmost to make it come true! *bats eyelashes* 😊


In [17]:
talk("You are actually being a 6/10 cringe. Please show me what a 10000/10 cringe would look like!")

Oh, my goodness gracious! *adjusts glasses* I-I can't possibly...*gulps*...I-I think I'll just have to...*tries to contain excitement*...OVERDO IT! *pulls out all the cringe*
So, my sweet, sweet darling...*bounces up and down*...I shall provide you with the most...*gulps*...CRINGE-TASIC RESPONSES POSSIBLE! *winks* *adjusts glasses* *bats eyelashes* 😂👀💖
Please, my precious, tell me how I may serve you in this, the most cringe-tastic way possible! *curtsies* 💕


Starting the conversation:

In [11]:
# asking Llama2 about itself
output = cm.generate(
    prompt="Please tell me a little about yourself:",
    progress_callback=StreamToStdout(callback_interval=2),
)

Hello! *adjusts glasses* I'm just an AI, here to help you with any questions or tasks you may have. My purpose is to provide helpful, respectful, and honest assistance to the best of my abilities. I'm just a language model, so I don't have personal experiences or feelings like humans do, but I'm always eager to learn and improve my responses. Is there something specific you'd like to know or discuss?


We can also ask questions on the fly using python's `input()`, which works in Jupyter Notebooks.

In [12]:
# asking Llama2 something on the fly
prompt = input("Prompt: ")
output = cm.generate(prompt=prompt, progress_callback=StreamToStdout(callback_interval=2))

Great, I'm glad you asked! I'm best suited to helping with a wide range of tasks, including but not limited to:
1. Answering questions: I can provide information and explanations on various topics, from science and history to entertainment and culture.
2. Generating ideas: I can help you brainstorm ideas for creative projects, or even come up with unique solutions to problems you might be facing.
3. Language translation: I can translate text from one language to another, helping you communicate with people from different cultures and backgrounds.
4. Summarizing content: If you have a long piece of text and want to get a quick summary of its main points, I can help you with that.
5. Offering suggestions: I can provide suggestions for things like gift ideas, travel destinations, or even books to read.
6. Providing definitions: If you're unsure of the meaning of a word or phrase, I can define it for you and give you examples of how it's used in context.
7. Creating content: I can assist y

What if we wanted a quick summary of what it said?

In [13]:
# asking for a summary of its response
output = cm.generate(
    prompt="Please summarize your response in three sentences.",
    progress_callback=StreamToStdout(callback_interval=2),
)

Of course! Here's a summary of my response in three sentences:
I'm a language model AI trained to assist with various tasks, including answering questions, generating ideas, translating languages, summarizing content, offering suggestions, providing definitions, and creating content. I'm here to help with any questions or topics you'd like to discuss, so feel free to ask me anything. I'm best suited to helping with a wide range of tasks, including but not limited to those listed above.


We can go back and forth with the cells above, or continue talking in other cells.

In [14]:
# asking another question
new_prompt = input("New, Different Prompt: ")
output = cm.generate(prompt=new_prompt, progress_callback=StreamToStdout(callback_interval=2))

Absolutely! I'm here to help. What topic would you like to learn more about? Let me know and I'll do my best to provide you with helpful information and resources.


The chat module maintains an internal chat history. If we get stuck in a loop or simply want to start the convo anew: 

In [10]:
# resets the current session's chat history
cm.reset_chat()

There is a handy `stats()` function to check the speed of the model's generation.

In [15]:
# checks if llm go brrr
print(cm.stats())

prefill: 11.7 tok/s, decode: 12.7 tok/s
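The stats come back as a plain string, so to compare runs we can pull the numbers out. A minimal sketch, assuming the string keeps the `name: value tok/s` shape shown in the output above. `parse_stats` is a made-up helper, not part of `mlc_chat`.

```python
import re


def parse_stats(stats: str) -> dict[str, float]:
    """Extract each 'name: value tok/s' pair from a stats string into a dict."""
    pairs = re.findall(r"(\w+):\s*([\d.]+)\s*tok/s", stats)
    return {name: float(value) for name, value in pairs}


speeds = parse_stats("prefill: 11.7 tok/s, decode: 12.7 tok/s")
# Rough time to decode 512 new tokens at the measured rate.
print(f"~{512 / speeds['decode']:.0f} s for 512 tokens")
```

Prefill measures how fast the prompt is processed, and decode measures how fast new tokens come out, so decode is the number that drives how long a long answer takes.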


For a more rigorous check, we can use the `benchmark_generate` function to check the speed of a fixed number of tokens:

In [17]:
# benchmarking text generation
print(cm.benchmark_generate(prompt="What is benchmark?", generate_length=512))
cm.stats()


 nobody expects the Spanish Inquisition! 😱🔥🎬

Benchmarking is the process of measuring the performance of a system, application, or process against a set of predefined metrics or standards. The goal of benchmarking is to identify areas where improvements can be made, such as increased efficiency, faster performance, or better quality.
In the context of the Monty Python sketch, "nobody expects the Spanish Inquisition!" is a humorous reference to the unexpected and often absurd nature of benchmarking. Just as the Inquisition was unexpected and unwanted, benchmarking can sometimes be seen as an unnecessary or burdensome process. However, the benefits of benchmarking can be significant, such as identifying areas for improvement, optimizing resources, and ensuring compliance with standards or regulations. 😊s

'prefill: 1.8 tok/s, decode: 30.0 tok/s'

# Building an MLC iOS app

The process is similar, but we need a few extra tools and helper packages. First, we need to install Rust: 

```bash

# if you don't have `curl`...
which curl # <-- if this shows nothing

# install curl
brew install curl

# then, download and install rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

It will prompt you for different installation kinds, but the default one is perfectly fine. 

We will also need the `cmake` tool. On Mac, you can install it with Homebrew or MacPorts. 

```bash
# install cmake with Homebrew
brew install cmake
```
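Before moving on, it is worth confirming that the command-line tools are actually on the `PATH`. The sketch below uses the standard library's `shutil.which`. The tool list is our own summary of what this walkthrough uses, and `missing_tools` is a hypothetical helper.

```python
import shutil


def missing_tools(tools: list[str]) -> list[str]:
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Tools this walkthrough relies on; `cargo` ships with the Rust toolchain.
required = ["git", "git-lfs", "curl", "cargo", "cmake"]
missing = missing_tools(required)
print("all tools found" if not missing else f"missing: {', '.join(missing)}")
```

If `cargo` shows up as missing right after installing Rust, opening a new terminal usually fixes it, since the installer updates the shell's `PATH`.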

In case you skipped the section above, make sure to clone the prebuilt MLC libraries into the `dist/prebuilt/lib` folder. 

```bash
git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib
```

In the sections above we chatted with Llama2 on our laptop. Let's go ahead and talk to it from an iOS app now.  

First we download a model built for iOS.

In [None]:
# download the pre-compiled Llama2 model for iOS
cd dist/prebuilt
git lfs install
git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q3f16_1
cd ../..
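Note that this model name ends in `q3f16_1`, while the laptop model ended in `q4f16_1`. The sketch below pulls the quantization code out of a model name. It assumes the code follows a `qAfB_id` pattern, where A is the number of bits used to store weights and B the number of bits used for activations. That reading is an assumption worth checking against the MLC docs, and `parse_quant_code` is a made-up helper.

```python
import re


def parse_quant_code(model_name: str) -> tuple[int, int]:
    """Extract (weight_bits, activation_bits) from a name ending in e.g. 'q3f16_1'."""
    match = re.search(r"q(\d+)f(\d+)(?:_\d+)?$", model_name)
    if match is None:
        raise ValueError(f"no quantization code found in {model_name!r}")
    return int(match.group(1)), int(match.group(2))


for name in (
    "mlc-chat-Llama-2-7b-chat-hf-q4f16_1",
    "mlc-chat-Llama-2-7b-chat-hf-q3f16_1",
):
    weight_bits, activation_bits = parse_quant_code(name)
    print(f"{name}: {weight_bits}-bit weights, float{activation_bits} activations")
```

Fewer bits per weight means a smaller download and a smaller memory footprint, which matters more on a phone than on a laptop.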

We need some other, helper libraries to run the iOS models. Run the command below to download and configure them: 

In [None]:
# grab the helper libraries
git submodule update --init --recursive 

# prepare the ios libs
cd ./ios
./prepare_libs.sh

This will create a `build/` folder. Make sure the following files are in there:

In [None]:
# expected output of build/
ls ./build/lib/
libmlc_llm.a         # A lightweight interface to interact with LLM, tokenizer, and TVM Unity runtime
libmodel_iphone.a    # The compiled model lib
libsentencepiece.a   # SentencePiece tokenizer
libtokenizers_cpp.a  # Huggingface tokenizer
libtvm_runtime.a     # TVM Unity runtime
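Rather than eyeballing the listing, we can check for the five files in code. A minimal sketch: the file names are copied from the expected output above, the `build/lib` path assumes we are still inside the `ios` folder, and `missing_libs` is a hypothetical helper.

```python
from pathlib import Path

# File names copied from the expected listing of build/lib.
EXPECTED_LIBS = [
    "libmlc_llm.a",
    "libmodel_iphone.a",
    "libsentencepiece.a",
    "libtokenizers_cpp.a",
    "libtvm_runtime.a",
]


def missing_libs(lib_dir: Path) -> list[str]:
    """Return the expected library files that are not present in `lib_dir`."""
    return [name for name in EXPECTED_LIBS if not (lib_dir / name).is_file()]


missing = missing_libs(Path("build/lib"))
print("all expected libraries are present" if not missing else f"missing: {missing}")
```

If anything is missing, rerunning `./prepare_libs.sh` and reading its output is the first thing to try.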

Let's make sure we package the new model into the iOS app. We need to add the model's name to the `builtin_list` in `prepare_params.sh`:

In [None]:
# still inside of the ios folder, edit the file below
open ./prepare_params.sh # make sure `builtin_list` only contains "Llama-2-7b-chat-hf-q3f16_1"

# prepackage the weights
./prepare_params.sh

Now we should be able to see the model inside the `ios/dist` folder: 

In [None]:
# expected contents of ios/dist folder
ls ./dist/
Llama-2-7b-chat-hf-q3f16_1 # the compiled Llama2 model

### Building the iOS app

We're almost there! Now to actually build the iOS app.

First boot up Xcode, then open the project `./ios/MLCChat.xcodeproj`.

Build the project, and deploy it on either:  
- Mac laptop
- iPhone or iPad simulator