# Quantization Example: Running a Large Language Model with Different Levels of Quantization

In this example we want to compare the capabilities of an LLM in different quantized versions.

## Setup: Ollama

Unsurprisingly, LLMs are often large, both in file size and in memory footprint. A convenient way to manage local LLMs is Ollama.

In [None]:
from ai_dojo import show

In [None]:
show.github_repo("https://github.com/ollama/ollama")

Follow the installation steps outlined in the Ollama repo to install it and start it on your system. Alternatively, if you have Docker, you can run:

```
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
``` 
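Before moving on, it can help to confirm that the server is actually reachable. The following is a minimal sketch using only the standard library. It assumes the default address `http://localhost:11434` from the Docker command above; the helper name `ollama_is_reachable` is made up for this example.

```python
import http.client
import urllib.error
import urllib.request


def ollama_is_reachable(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url (assumed Ollama default)."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or malformed reply: treat as not reachable.
        return False


print("Ollama reachable:", ollama_is_reachable())
```

If this prints `False`, check that the Ollama process or container is running and that the port mapping matches.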

We are going to use this Python module to interact with the Ollama server.

In [None]:
show.github_repo("https://github.com/ollama/ollama-python")

In [None]:
import ollama

## An LLM in Quantized Versions


> Rocket 🦝 is a 3 billion large language model that was trained on a mix of publicly available datasets. [...] The outcome is a highly effective chat model



### Quantized Versions

We are now going to get some quantized versions of this model. A 🤗 user has already applied various quantization methods to the model and provided the results.

In [None]:
import huggingface_hub

In [None]:
huggingface_user = "TheBloke"
model_name = "rocket-3B-GGUF"
full_model_name = f"{huggingface_user}/{model_name}"
model_page_url = f"https://huggingface.co/{full_model_name}"
model_page_url

**ℹ The GGUF File Format**

[**GGUF**](https://huggingface.co/docs/hub/gguf) is a binary format that is optimized for quick loading and saving of models, making it highly efficient for inference purposes. Unlike tensor-only file formats like safetensors, GGUF encodes both the tensors and a standardized set of metadata.


![](https://cdn-lfs.huggingface.co/datasets/huggingface/documentation-images/60dc8f9e25311d5ab671019499edd6f847bf3c9796d97b5579240c652ef445da?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27gguf-spec.png%3B+filename%3D%22gguf-spec.png%22%3B&response-content-type=image%2Fpng&Expires=1715686360&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxNTY4NjM2MH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9kYXRhc2V0cy9odWdnaW5nZmFjZS9kb2N1bWVudGF0aW9uLWltYWdlcy82MGRjOGY5ZTI1MzExZDVhYjY3MTAxOTQ5OWVkZDZmODQ3YmYzYzk3OTZkOTdiNTU3OTI0MGM2NTJlZjQ0NWRhP3Jlc3BvbnNlLWNvbnRlbnQtZGlzcG9zaXRpb249KiZyZXNwb25zZS1jb250ZW50LXR5cGU9KiJ9XX0_&Signature=hiPC3Nn7JtVghKOC-CCE5imxed5dmote3JYSSQkdHA-hM7bvvFoJ-QtGYYzxaDaiV6ldIzhjDkLO1FwWK3OC6fx3ocKxRvfh0ebB-l8qwGxOCuGVePaj7b2a9Z6i54EkGwD6n7lUX6A4i0Ip%7ExlAAJsj7I7NzWLfjtE7O-XZE-Lq-iQsqOyaAR26tYTAhD4PpaicGVfXt-Kxko8MzKxE2zrHs4vpXodjPsRKxSJWuytjWMmXNH1-PptvCV35Y1HM1UpbytyX8nhyGg3lchYFvQa6keI%7Ehhh-o05755%7EUin2dLXhZpdjcqtqpkM0BJytBspU4onjdIrlH-Ay4PUzaIQ__&Key-Pair-Id=KVTP0A1DKRTAX)
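To make the "tensors plus metadata" layout more concrete, here is a rough sketch that parses the fixed-size start of a GGUF file. It assumes a little-endian file of GGUF version 2 or later, where the magic bytes `GGUF` are followed by a 32-bit version and two 64-bit counters. The function name `read_gguf_header` is made up for this example, and the demo parses a synthetic header built in memory rather than a real model file.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(stream) -> dict:
    """Parse magic, version, tensor count and metadata count from a binary stream."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"Not a GGUF file, magic bytes are {magic!r}")
    # Assumption: little-endian layout with 64-bit counters (GGUF v2 and later).
    (version,) = struct.unpack("<I", stream.read(4))
    tensor_count, metadata_kv_count = struct.unpack("<QQ", stream.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Synthetic header for illustration only: version 3, 2 tensors, 5 metadata entries.
fake_file = io.BytesIO(GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5))
header = read_gguf_header(fake_file)
print(header)
```

The same function should work on a downloaded model when given a file object opened in binary mode, for example `open(path, "rb")`.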

The model page lists a variety of quantized models in different quantization types. Let's focus on a few.

In [None]:
quantization_types = [
    "Q8_0",
    "Q4_K_M",
    "Q2_K",
]
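The numbers in these type names roughly indicate how many bits are spent per weight. To build some intuition for why fewer bits cost accuracy, the following toy sketch applies simple symmetric round-to-nearest quantization to blocks of random weights. This is a simplification for illustration and not the actual `Q8_0`, `Q4_K_M` or `Q2_K` formats, which use more elaborate block layouts; all names and the random test data are assumptions of this example.

```python
import random


def quantize_block(values, bits):
    """Map one block of floats to signed integers with `bits` bits and a shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    quants = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Reconstruct approximate floats from the integers and the shared scale."""
    return [scale * q for q in quants]


def mean_abs_error(values, bits, block_size=32):
    """Average reconstruction error when quantizing `values` block by block."""
    total = 0.0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale, quants = quantize_block(block, bits)
        restored = dequantize_block(scale, quants)
        total += sum(abs(a - b) for a, b in zip(block, restored))
    return total / len(values)


random.seed(0)
# Random stand-in for model weights, not taken from any real model.
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]
errors = {bits: mean_abs_error(weights, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit: mean absolute error = {err:.6f}")
```

The reconstruction error grows as the bit width shrinks, which is the trade-off the rest of this notebook probes at the level of model outputs.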

These quantized models are not available from the Ollama model registry. Fortunately, adding custom models downloaded from 🤗 to Ollama is quite straightforward. We need to give Ollama a model name and a simple configuration file.

In [None]:
import tqdm

In [None]:
quantized_model_paths = {}
for qtype in tqdm.tqdm(quantization_types):
    model_file = f"rocket-3b.{qtype}.gguf"
    tqdm.tqdm.write(f"Downloading {qtype} model")
    # hf_hub_download stores the file in the local Hugging Face cache and returns its path
    model_path = huggingface_hub.hf_hub_download(full_model_name, filename=model_file)
    quantized_model_paths[qtype] = model_path

quantized_model_paths

In [None]:
def make_model_file(model_path) -> str:
    """Creates a simple Ollama model configuration file for the given serialized model."""
    content = f"FROM {model_path}"
    return content
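A Modelfile can carry more than the `FROM` line. As a hedged sketch, the variant below also emits `PARAMETER` instructions, which Ollama uses for runtime settings such as `temperature` or `num_ctx`. The function name, the placeholder path and the parameter values are illustrative assumptions and are not used elsewhere in this notebook.

```python
def make_model_file_with_parameters(model_path: str, parameters: dict) -> str:
    """Build Modelfile text with a FROM line followed by one PARAMETER line per setting."""
    lines = [f"FROM {model_path}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines)


# Placeholder path and example values, chosen only for illustration.
example_model_file = make_model_file_with_parameters(
    "/path/to/rocket-3b.Q4_K_M.gguf",
    {"temperature": 0.7, "num_ctx": 2048},
)
print(example_model_file)
```

For the comparison in this notebook, the plain `FROM` line is enough, since all models then run with the same default settings.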

In [None]:
for qtype, model_path in quantized_model_paths.items():
    ollama_model_name = f"{model_name}:{qtype}"
    print(f"Creating Ollama model {ollama_model_name}")
    response = ollama.create(
        model=ollama_model_name,
        modelfile=make_model_file(model_path)
    )
    print(response["status"])


Let's verify that all of the models above are now available in Ollama. Also, have a look at the size on disk.

In [None]:
import pandas
from pandas import DataFrame


In [None]:
def show_ollama_model_table() -> DataFrame:
    """List Ollama models in tabular form, with sizes converted to GiB."""
    model_data = pandas.json_normalize(ollama.list()["models"])
    columns = ["model", "size", "details.family", "details.format", "details.parameter_size", "details.quantization_level"]
    # Copy the selection so the assignment below does not write into a view of the full table
    model_data = model_data[columns].copy()
    # Convert the size from bytes to GiB
    model_data["size"] = model_data["size"].apply(lambda s: round(s / (1024**3), 2))
    return model_data

In [None]:
show_ollama_model_table()
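One way to read the sizes in this table is to convert them into an average number of bits per weight. The sketch below shows the arithmetic. The file size and parameter count are hypothetical round numbers, not measurements of the models above, and the helper name is made up for this example.

```python
def bits_per_weight(size_bytes: int, n_parameters: int) -> float:
    """Average number of bits stored per model parameter for a file of the given size."""
    return size_bytes * 8 / n_parameters


# Hypothetical values: a 3 GiB file holding a model with 3 billion parameters.
example_size_bytes = 3 * 1024**3
example_parameters = 3_000_000_000
example_bpw = bits_per_weight(example_size_bytes, example_parameters)
print(f"{example_bpw:.2f} bits per weight")
```

The result is only an average: a model file also contains metadata, and quantization schemes typically store scales alongside the quantized weights, so the value tends to sit somewhat above the nominal bit width.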

## Test

Let's prompt each of the quantized models with the same task. This should give us a first impression of how output quality relates to the level of quantization.

In [None]:
test_prompt = "Recommend some sights in Florence, Italy."
show.text(test_prompt)

### 8-bit

In [None]:
quantization_level = "Q8_0"
ollama_model_name = f"{model_name}:{quantization_level}"
ollama_model_name

In [None]:
response = ollama.chat(
    model=ollama_model_name,
    messages=[
        {
            "role": "user",
            "content": test_prompt,
        },
    ],
    stream=True,
)
show.stream(response)

### 4-bit

In [None]:
quantization_level = "Q4_K_M"
ollama_model_name = f"{model_name}:{quantization_level}"
ollama_model_name

In [None]:
response = ollama.chat(
    model=ollama_model_name,
    messages=[
        {
            "role": "user",
            "content": test_prompt,
        },
    ],
    stream=True,
)
show.stream(response)

### 2-bit

In [None]:
quantization_level = "Q2_K"
ollama_model_name = f"{model_name}:{quantization_level}"
ollama_model_name

In [None]:
response = ollama.chat(
    model=ollama_model_name,
    messages=[
        {
            "role": "user",
            "content": test_prompt,
        },
    ],
    stream=True,
)
show.stream(response)

How would you rate the quality of each response? Asking a larger, more capable LLM to review them might help you spot factual errors.
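One possible way to set up such a review is to combine the question and the collected answers into a single prompt. The sketch below only builds that prompt text. The function name, the wording of the instructions and the placeholder answers are assumptions of this example; the actual answers would first have to be collected as strings, since the cells above only stream them to the display.

```python
def build_review_prompt(question: str, responses: dict) -> str:
    """Combine the original question and the labeled answers into one review request."""
    parts = [
        "Several assistants answered the question below.",
        "Point out factual errors in each answer and rank the answers by quality.",
        "",
        f"Question: {question}",
    ]
    for label, text in responses.items():
        parts.append("")
        parts.append(f"Answer from {label}:")
        parts.append(text.strip())
    return "\n".join(parts)


# Placeholder texts standing in for the collected model outputs.
example_responses = {
    "Q8_0": "<response of the 8-bit model>",
    "Q2_K": "<response of the 2-bit model>",
}
review_prompt = build_review_prompt("Recommend some sights in Florence, Italy.", example_responses)
print(review_prompt)
```

The resulting string could then be sent as the `content` of a user message with `ollama.chat`, using whichever larger model you have available as the reviewer.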

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2024 [Christian Staudt](https://clstaudt.me), [Katharina Rasch](https://krasch.io)_