# Getting Started with MLC-LLM using the Llama 2 Model

Here's a quick overview of how to get started with the MLC-LLM `ChatModule` in Python. In this tutorial, we will chat with the [Llama2](https://ai.meta.com/llama/) model. For the easiest setup, we recommend trying this out in a Google Colab notebook. Click the button below to get started!

<a target="_blank" href="https://colab.research.google.com/github/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Environment Setup

Let's set up your environment so you can run the `ChatModule`. First, create the Conda environment in which this notebook will run (not required if you are using Google Colab).

```bash
conda create --name mlc-llm python=3.10
conda activate mlc-llm
```

**Google Colab:** If you are running this in a Google Colab notebook, be sure to change your runtime to GPU by going to Runtime > Change runtime type and setting the Hardware accelerator to be "GPU". Select "Connect" on the top right to instantiate your GPU session.

If you are using CUDA, you can run the following command to confirm that CUDA is set up correctly, and check the version number.

In [1]:
!nvidia-smi

Sat Oct 14 05:42:32 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
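
If you would rather run this check from Python (for example in a script outside a notebook), a minimal sketch follows. The helper name `gpu_visible` is hypothetical and not part of MLC-LLM. It only reports whether the `nvidia-smi` executable is on the `PATH` and exits cleanly, which is a reasonable proxy for a working NVIDIA driver but not a guarantee that the installed wheels match your CUDA version.

```python
import shutil
import subprocess


def gpu_visible(tool="nvidia-smi"):
    """Return True if `tool` is on PATH and exits with status 0, else False.

    Hypothetical helper for illustration; not part of MLC-LLM.
    """
    path = shutil.which(tool)
    if path is None:
        # The tool is not installed, so no NVIDIA driver is visible.
        return False
    try:
        result = subprocess.run([path], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Calling `gpu_visible()` before installing the packages gives an early hint that the Colab runtime type still needs to be changed to GPU.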

Next, let's install the MLC-AI and MLC-Chat nightly build packages. Go to https://mlc.ai/package/ and replace the command below with the one appropriate for your hardware and OS (the `cu118` suffix used here selects the CUDA 11.8 builds).

In [2]:
!pip install --pre --force-reinstall mlc-ai-nightly-cu118 mlc-chat-nightly-cu118 -f https://mlc.ai/wheels

Looking in links: https://mlc.ai/wheels
Collecting mlc-ai-nightly-cu118
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_ai_nightly_cu118-0.12.dev1689-cp310-cp310-manylinux_2_28_x86_64.whl (510.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 510.6/510.6 MB 2.9 MB/s eta 0:00:00
Collecting mlc-chat-nightly-cu118
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_chat_nightly_cu118-0.1.dev493-cp310-cp310-manylinux_2_28_x86_64.whl (47.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47.6/47.6 MB 8.6 MB/s eta 0:00:00
Collecting attrs (from mlc-ai-nightly-cu118)
  Downloading attrs-23.1.0-py3-none-any.whl (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.2/61.2 kB 1.9 MB/s eta 0:00:00
Collecting cloudpickle (from mlc-ai-nightly-cu118)
  Downloading cloudpickle-2.2.1-py3-none-any.whl

**Google Colab:** If in Google Colab, you may see a message warning you to restart the runtime. Simply run the following code in a new code cell to restart the runtime.

```python
import os

# Send signal 9 (SIGKILL) to the current kernel process; Colab then starts a fresh runtime.
os.kill(os.getpid(), 9)
```

Next, let's download the prebuilt model libraries from GitHub and the Llama2 model weights from Hugging Face. Because the weights are large, we need `git lfs` to download them.

Note: If you are NOT running in **Google Colab**, you may need to run `!conda install git git-lfs` to install `git` and `git-lfs` before running the following cell, which completes the `git lfs` setup.

In [2]:
!git lfs install

Git LFS initialized.


The commands below download many prebuilt libraries as well as the chat configuration and weights for Llama-2-7b that `mlc_chat` needs, which may take a long time. If you are in **Google Colab**, you can verify that the files are being downloaded by clicking the folder icon on the left and navigating to the `dist` and then `prebuilt` folders, which should update as the download progresses.

In [3]:
!mkdir -p dist/prebuilt
!git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib

Cloning into 'dist/prebuilt/lib'...
remote: Enumerating objects: 328, done.
remote: Counting objects: 100% (75/75), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 328 (delta 63), reused 68 (delta 59), pack-reused 253
Receiving objects: 100% (328/328), 118.48 MiB | 14.53 MiB/s, done.
Resolving deltas: 100% (235/235), done.
Updating files: 100% (77/77), done.


In [4]:
!cd dist/prebuilt && git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q4f16_1

Cloning into 'mlc-chat-Llama-2-7b-chat-hf-q4f16_1'...
remote: Enumerating objects: 129, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 129 (delta 0), reused 0 (delta 0), pack-reused 126
Receiving objects: 100% (129/129), 500.53 KiB | 20.02 MiB/s, done.
Filtering content: 100% (116/116), 3.53 GiB | 62.91 MiB/s, done.
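
Once both clones finish, it can be useful to confirm that the expected folders exist and are not empty before loading the model. The sketch below is one way to do that. The helper `missing_dirs` is a hypothetical name introduced for illustration, and the two paths are simply the clone targets used in the cells above.

```python
from pathlib import Path

# Clone targets used in this tutorial (relative to the working directory).
EXPECTED_DIRS = (
    "dist/prebuilt/lib",
    "dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1",
)


def missing_dirs(root=".", expected=EXPECTED_DIRS):
    """Return the expected directories that are absent or empty under `root`.

    Hypothetical helper for illustration; not part of MLC-LLM.
    """
    missing = []
    for rel in expected:
        path = Path(root) / rel
        # A directory that exists but contains nothing counts as missing.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing
```

An empty list from `missing_dirs()` means both downloads left files on disk. It does not verify that the `git lfs` weight files were fetched completely.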


## Let's Chat!

Before we can chat with the model, we must import `ChatModule` from the `mlc_chat` package and create an instance. The `ChatModule` must be initialized with the appropriate model name.

In [5]:
from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")

Note that the above invocation abstracts away the logic for finding the relevant model directory and prebuilt library paths. To specify these manually, you could run the following instead (which would be equivalent to the above).

```python
cm = ChatModule(
    model="dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1",
    lib_path="dist/prebuilt/lib/Llama-2-7b-chat-hf-q4f16_1-cuda.so",
)
```
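
To make the relationship between the short model name and the two explicit paths concrete, here is a small sketch that rebuilds them from the folder layout used in this tutorial. The function `prebuilt_paths` and its `device` and `root` parameters are hypothetical. This is not the lookup logic that `ChatModule` runs internally, and the `.so` suffix is an assumption that only matches the Linux CUDA library shown above.

```python
from pathlib import PurePosixPath


def prebuilt_paths(model_name, device="cuda", root="dist/prebuilt"):
    """Build (model_dir, lib_path) following the layout used in this tutorial.

    Hypothetical helper for illustration; the real ChatModule search logic
    may look in other locations or use other file suffixes.
    """
    base = PurePosixPath(root)
    model_dir = base / f"mlc-chat-{model_name}"
    lib_path = base / "lib" / f"{model_name}-{device}.so"
    return str(model_dir), str(lib_path)
```

For `"Llama-2-7b-chat-hf-q4f16_1"` this yields the same `model` and `lib_path` strings that are passed explicitly in the manual invocation.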

That is all that is needed to set up the `ChatModule`. You can now chat with the model by entering any prompt you'd like. Try it out below!

In [6]:
output = cm.generate(
    prompt="When was Python released?",
    progress_callback=StreamToStdout(callback_interval=2),
)

Hello! I'm glad you asked! Python was first released in 1991 by Guido van Rossum. It was initially called "Python" because van Rossum was a fan of the British comedy group Monty Python's Flying Circus. The language was created with the goal of being easy to learn and use, and it has since become one of the most popular programming languages in the world. Python 2.0 was released in 1994, and Python 3.0 was released in 2008. I hope that helps! Let me know if you have any other questions.


You can also run the code block below repeatedly to interact with the model over multiple rounds, in a chat style.

In [7]:
prompt = input("Prompt: ")
output = cm.generate(prompt=prompt, progress_callback=StreamToStdout(callback_interval=2))

Prompt: what is a stick
Hello! I'm here to help you with your question. However, I want to clarify that the term "stick" can have different meanings depending on the context. Could you please provide more information or context about what you mean by "stick"? Are you referring to a physical object, a tool, or something else? I want to make sure I give you the most accurate answer possible. Please let me know how I can help!


In [8]:
output = cm.generate(
    prompt="Please summarize your response in three sentences.",
    progress_callback=StreamToStdout(callback_interval=2),
)

Of course! Here is a summary of my response in three sentences:
Python was first released in 1991 by Guido van Rossum. The language was created with the goal of being easy to learn and use, and it has since become one of the most popular programming languages in the world. Python 2.0 was released in 1994, and Python 3.0 was released in 2008.
I hope this summary helps! Let me know if you have any other questions.
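
The multi-round pattern above can also be scripted instead of re-running cells by hand. The sketch below is a minimal illustration: `chat_rounds` is a hypothetical helper, and it assumes only that the callable you pass in accepts a `prompt=` keyword argument and returns the reply, as `cm.generate` appears to do in the cells above.

```python
def chat_rounds(generate, prompts):
    """Send each prompt in order and collect (prompt, reply) pairs.

    `generate` is any callable that accepts a `prompt=` keyword argument and
    returns the reply. Hypothetical helper for illustration.
    """
    transcript = []
    for prompt in prompts:
        reply = generate(prompt=prompt)
        transcript.append((prompt, reply))
    return transcript
```

With the module from this tutorial you could pass `lambda prompt: cm.generate(prompt=prompt, progress_callback=StreamToStdout(callback_interval=2))` as `generate`. Because the `ChatModule` keeps the chat history, later prompts in the list can refer back to earlier replies.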


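To check the generation speed of the chatbot, you can print the statistics, which report the prefill (prompt processing) and decode (token generation) rates in tokens per second.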

In [9]:
print(cm.stats())

prefill: 211.6 tok/s, decode: 42.0 tok/s
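
If you want these numbers as floats, for example to log them across runs, the string can be parsed with a regular expression. This is a sketch that assumes the text keeps the `name: value tok/s` shape shown in the output above. `parse_stats` is a hypothetical helper, not an MLC-LLM function.

```python
import re

# Matches fragments such as "prefill: 211.6 tok/s".
_STATS_RE = re.compile(r"(\w+):\s*([0-9]+(?:\.[0-9]+)?)\s*tok/s")


def parse_stats(text):
    """Map each stage name in a stats string to its rate in tokens per second.

    Hypothetical helper; it assumes the format shown in this tutorial's output.
    """
    return {name: float(rate) for name, rate in _STATS_RE.findall(text)}
```

Passing the string returned by `cm.stats()` gives a dictionary with `prefill` and `decode` keys when the output has the format shown above.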


By default, the `ChatModule` will keep a history of your chat. You can reset the chat history by running the following.

In [10]:
cm.reset_chat()

### Benchmark Performance

To benchmark performance, we can use the `benchmark_generate` method of `ChatModule`. It takes an input prompt and the number of tokens to generate, ignores the system prompt and the model's stop criterion, and generates tokens as a plain language model would until it has produced the requested number. After calling `benchmark_generate`, we can use `stats` to check the performance.

In [11]:
print(cm.benchmark_generate(prompt="What is benchmark?", generate_length=512))
cm.stats()


 Unterscheidung between benchmark and standard?  Is there a difference between benchmark and standard?

Answer:
Benchmark and standard are related but distinct terms in the context of quality control and performance measurement. Here's how they differ:

Benchmark:
A benchmark is a reference point or standard against which something can be measured or compared. In quality control, a benchmark is a target or a standard against which the performance of a process or product is measured. For example, a company might use a benchmark to measure the efficiency of its manufacturing process or the quality of its products.
Standard:
A standard, on the other hand, is a set of rules, guidelines, or specifications that define how something should be done or what should be achieved. In quality control, a standard is a set of guidelines or requirements that define how a process or product should be designed, manufactured, or tested. For example, a company might have a quality standard that defines ho

'prefill: 48.1 tok/s, decode: 41.3 tok/s'
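
To relate these rates to wall-clock time, a rough estimate is the prompt length divided by the prefill rate plus the generated length divided by the decode rate. The sketch below encodes that arithmetic. `estimated_seconds` is a hypothetical helper, and the model it uses is a simplifying assumption that ignores startup and sampling overhead.

```python
def estimated_seconds(prompt_tokens, generated_tokens, prefill_rate, decode_rate):
    """Rough time estimate from token counts and rates in tokens per second.

    Assumes prompt tokens are processed at `prefill_rate` and generated tokens
    at `decode_rate`. Hypothetical helper for illustration.
    """
    if prefill_rate <= 0 or decode_rate <= 0:
        raise ValueError("rates must be positive")
    return prompt_tokens / prefill_rate + generated_tokens / decode_rate
```

With the decode rate reported above, generating 512 tokens at 41.3 tok/s takes roughly 12.4 seconds, not counting the short prefill phase for the prompt.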