# Quantization Example: Running a Large Language Model with Different Levels of Quantization

In this example we want to compare the capabilities of an LLM in different quantized versions.

## Setup: Ollama

Unsurprisingly, LLMs are often large, both in file size and in memory footprint. A convenient way of managing local LLMs is Ollama.

In [1]:
from ai_dojo import show

In [2]:
show.github_repo("https://github.com/ollama/ollama")

Follow the installation steps outlined in the Ollama repo to install and start it on your system. Alternatively, if you have Docker, you can run:

```
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
``` 
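Before going on, it can help to check that the server is actually reachable. The following is a minimal sketch, assuming the default address `http://localhost:11434` from the Docker command above; the helper name `is_ollama_running` is our own and not part of any Ollama library.

```python
import urllib.error
import urllib.request


def is_ollama_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error status, ...
        return False


print("Ollama reachable:", is_ollama_running())
```

If this prints `False`, check that the container or the local Ollama service has been started and that the port mapping matches.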

We are going to use this Python module to interact with the Ollama server.

In [3]:
show.github_repo("https://github.com/ollama/ollama-python")

In [4]:
import ollama

## An LLM in Quantized Versions


> Rocket 🦝 is a 3 billion large language model that was trained on a mix of publicly available datasets. [...] The outcome is a highly effective chat model



### Quantized Versions

We are now going to get some quantized versions of this model. A 🤗 user has already applied various quantization methods to the model and provided the results.

In [5]:
import huggingface_hub

In [6]:
huggingface_user = "TheBloke"
model_name = "rocket-3B-GGUF"
full_model_name = f"{huggingface_user}/{model_name}"
model_page_url= f"https://huggingface.co/{full_model_name}"
model_page_url

'https://huggingface.co/TheBloke/rocket-3B-GGUF'

**ℹ The GGUF File Format**

[**GGUF**](https://huggingface.co/docs/hub/gguf) is a binary format optimized for quick loading and saving of models, which makes it efficient for inference. Unlike tensor-only file formats such as safetensors, GGUF encodes both the tensors and a standardized set of metadata.


_Figure: structure of a GGUF file (`gguf-spec.png`, from the Hugging Face GGUF documentation)._
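To make the "tensors plus metadata" idea a bit more tangible, here is a small sketch that parses the fixed-size header at the start of a GGUF file. It assumes the layout of GGUF version 2 and later (4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian). The names `GGUFHeader` and `parse_gguf_header` are our own, and the demo runs on a hand-built header rather than a real model file.

```python
import io
import struct
from typing import BinaryIO, NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def parse_gguf_header(stream: BinaryIO) -> GGUFHeader:
    """Parse the first 24 bytes of a GGUF (v2+) file from a binary stream."""
    raw = stream.read(24)
    if len(raw) < 24:
        raise ValueError("stream too short to contain a GGUF header")
    # "<" = little-endian without padding: 4s magic, I uint32, Q uint64, Q uint64
    magic, version, tensor_count, kv_count = struct.unpack("<4sIQQ", raw)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return GGUFHeader(version, tensor_count, kv_count)


# Demo on a synthetic header: version 3, 5 tensors, 7 metadata key-value pairs
fake_file = io.BytesIO(struct.pack("<4sIQQ", b"GGUF", 3, 5, 7))
header = parse_gguf_header(fake_file)
print(header)
```

With a real model file, the same function could be called as `parse_gguf_header(open(path, "rb"))`, where `path` points to a downloaded `.gguf` file.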

The model page lists a variety of quantized models in different quantization types. Let's focus on a few.

In [7]:
quantization_types = [
    "Q8_0",
    "Q4_K_M",
    "Q2_K",
]
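What do these type names stand for? As a rough illustration, the sketch below mimics the idea behind `Q8_0`: weights are grouped into blocks of 32, and each block is stored as 8-bit integers plus one shared scale factor. This is a simplified pure-Python sketch, not the actual llama.cpp implementation (which, among other things, stores the scale as a 16-bit float); the K-quant types `Q4_K_M` and `Q2_K` use more elaborate schemes that are not shown here.

```python
import random

BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32 values


def quantize_block(block):
    """Map one block of floats to int8-range integers plus a shared scale."""
    max_abs = max(abs(x) for x in block)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quants = [round(x / scale) for x in block]
    return scale, quants


def dequantize_block(scale, quants):
    """Reconstruct approximate float weights from the quantized block."""
    return [scale * q for q in quants]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]  # made-up weights
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.6f}, max abs error={max_error:.6f}")
```

The reconstruction error per weight is at most half the scale. With fewer bits per value the grid of representable values gets coarser, so the error grows, which is the effect we want to observe in the model outputs.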

These quantized models are not available from the Ollama model registry. Fortunately, adding custom models downloaded from 🤗 to Ollama is quite straightforward. We need to give Ollama a model name and a simple configuration file.

In [8]:
import tqdm

In [9]:
quantized_model_paths = {}
for qtype in tqdm.tqdm(quantization_types):
    model_file = f"rocket-3b.{qtype}.gguf" 
    tqdm.tqdm.write(f"Downloading {qtype} model")
    model_path = huggingface_hub.hf_hub_download(full_model_name, filename=model_file)
    quantized_model_paths[qtype] = model_path
    
quantized_model_paths



Downloading Q8_0 model


 67%|██████▋   | 2/3 [00:00<00:00,  3.72it/s]

Downloading Q4_K_M model
Downloading Q2_K model


100%|██████████| 3/3 [00:00<00:00,  3.87it/s]


{'Q8_0': '/Users/cls/.cache/huggingface/hub/models--TheBloke--rocket-3B-GGUF/snapshots/23c0500471cb5e466aab7e9ade9b4da32dea43d2/rocket-3b.Q8_0.gguf',
 'Q4_K_M': '/Users/cls/.cache/huggingface/hub/models--TheBloke--rocket-3B-GGUF/snapshots/23c0500471cb5e466aab7e9ade9b4da32dea43d2/rocket-3b.Q4_K_M.gguf',
 'Q2_K': '/Users/cls/.cache/huggingface/hub/models--TheBloke--rocket-3B-GGUF/snapshots/23c0500471cb5e466aab7e9ade9b4da32dea43d2/rocket-3b.Q2_K.gguf'}

In [10]:
def make_model_file(model_path) -> str:
    """Creates a simple Ollama model configuration file for the given serialized model."""
    content = f"FROM {model_path}"
    return content
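A Modelfile can carry more than the `FROM` line. As a sketch, the variant below also writes `PARAMETER` lines, which could be used to pin sampling settings such as `temperature` or `seed` so that the quantized models are compared under the same conditions. The function name `make_model_file_with_parameters` and the chosen values are our own assumptions for illustration; they are not used in the rest of this notebook.

```python
def make_model_file_with_parameters(model_path: str, parameters: dict) -> str:
    """Build Modelfile content with a FROM line and one PARAMETER line per setting."""
    lines = [f"FROM {model_path}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines)


# Hypothetical path and settings, for illustration only
example = make_model_file_with_parameters(
    "/path/to/model.gguf",
    {"temperature": 0, "seed": 42},
)
print(example)
```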

In [11]:
for qtype, model_path in quantized_model_paths.items():
    ollama_model_name = f"{model_name}:{qtype}"
    print(f"Creating Ollama model {ollama_model_name}")
    response = ollama.create(
        model=ollama_model_name,
        modelfile=make_model_file(model_path)
    )
    print(response["status"])


Creating Ollama model rocket-3B-GGUF:Q8_0
success
Creating Ollama model rocket-3B-GGUF:Q4_K_M
success
Creating Ollama model rocket-3B-GGUF:Q2_K
success


Let's verify that all of the models above are now available in Ollama. Also, have a look at their size on disk (the `size` column, in GiB).

In [12]:
import pandas
from pandas import DataFrame


In [13]:
def show_ollama_model_table() -> DataFrame:
    """List Ollama models in tabular form"""
    model_data = pandas.json_normalize(ollama.list()["models"])
    model_data = model_data[["model", "size", "details.family", "details.format", "details.parameter_size", "details.quantization_level"]]
    model_data["size"] = model_data["size"].apply(lambda s: round(s / (1024**3), 2))
    return model_data

In [14]:
show_ollama_model_table()

| | model | size | details.family | details.format | details.parameter_size | details.quantization_level |
|---|---|---|---|---|---|---|
| 0 | rocket-3B-GGUF:Q2_K | 1.12 | stablelm | gguf | 2.8B | Q2_K |
| 1 | rocket-3B-GGUF:Q4_K_M | 1.59 | stablelm | gguf | 2.8B | Q4_K_M |
| 2 | rocket-3B-GGUF:Q8_0 | 2.77 | stablelm | gguf | 2.8B | Q8_0 |
| 3 | tinyllama:1.1b-chat | 0.59 | llama | gguf | 1B | Q4_0 |
| 4 | codeqwen:7b-chat | 3.89 | qwen2 | gguf | 7B | Q4_0 |
| 5 | tinyllama:latest | 0.59 | llama | gguf | 1B | Q4_0 |
| 6 | Mistral-7B-OpenOrca:Q2_K | 2.87 | llama | gguf | 7B | Q2_K |
| 7 | TheBloke/Mistral-7B-OpenOrca-GGUF:mistral-7b-o... | 2.87 | llama | gguf | 7B | Q2_K |
| 8 | dolphin-mixtral:latest | 24.63 | llama | gguf | 47B | Q4_0 |
| 9 | orca-mini:13b | 6.86 | llama | gguf | 13B | Q4_0 |
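The sizes in the listing can be turned into a rough estimate of how many bits each model spends per weight. The sketch below uses the three Rocket sizes from the output above and the reported parameter count of 2.8B. Both are rounded, and the files also contain metadata, so treat the result as a ballpark figure only.

```python
GIB = 1024**3  # the size column above is in GiB


def bits_per_weight(size_gib: float, n_parameters: float) -> float:
    """Estimate the average number of bits stored per model parameter."""
    return size_gib * GIB * 8 / n_parameters


N_PARAMETERS = 2.8e9  # "2.8B" as reported by Ollama, rounded
sizes_gib = {"Q8_0": 2.77, "Q4_K_M": 1.59, "Q2_K": 1.12}

for qtype, size in sizes_gib.items():
    print(f"{qtype}: ~{bits_per_weight(size, N_PARAMETERS):.1f} bits per weight")
```

The estimates come out at roughly 8.5, 4.9 and 3.4 bits. They are higher than the nominal 8, 4 and 2 bits, presumably because block scales are stored alongside the quantized values and because the K-quant types keep some tensors at higher precision.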


## Test

Let's prompt the quantized models to do something useful. This should give us a first impression of how output quality relates to the level of quantization.

In [15]:
test_prompt = "Recommend some sights in Florence, Italy."
show.text(test_prompt)

Recommend some sights in Florence, Italy.
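The following subsections send this prompt to one model at a time and stream the answer. As an alternative sketch, the same comparison could be run in a loop that also records how long each model takes. The helper `compare_models` is our own; it takes the chat function as an argument so that it can be tried out with a stand-in, and it assumes a non-streaming response that exposes the text under `["message"]["content"]`, as `ollama.chat` does.

```python
import time


def compare_models(chat, model_names, prompt):
    """Send the same prompt to each model and collect answer text and wall time."""
    results = {}
    for name in model_names:
        start = time.perf_counter()
        response = chat(model=name, messages=[{"role": "user", "content": prompt}])
        elapsed = time.perf_counter() - start
        results[name] = {
            "seconds": elapsed,
            "text": response["message"]["content"],
        }
    return results


def fake_chat(model, messages):
    """Stand-in for ollama.chat so the sketch runs without a server."""
    return {"message": {"role": "assistant", "content": f"[{model}] answer"}}


demo = compare_models(fake_chat, ["a:Q8_0", "a:Q2_K"], "Hello")
for name, result in demo.items():
    print(name, f"{result['seconds']:.4f}s", result["text"])
```

With a running Ollama server, `fake_chat` would be replaced by `ollama.chat` and the model names by the ones created above.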

### 8-bit 

In [16]:
quantization_level = "Q8_0"
ollama_model_name = f"{model_name}:{quantization_level}"
ollama_model_name

'rocket-3B-GGUF:Q8_0'

In [17]:
response = ollama.chat(
    model=ollama_model_name,
    messages=[
        {
            "role": "user",
            "content": test_prompt,
        },
    ],
    stream=True,
)
show.stream(response)


1. Duomo di Firenze (Florence Cathedral) - A stunning and iconic Gothic cathedral located in the heart of Florence.
2. Ponte Vecchio (Old Bridge) - An ancient stone bridge spanning over the Arno River that is lined with shops selling souvenirs and jewelry.
3. Uffizi Gallery - One of the world's oldest and most famous art museums, filled to the brim with breathtaking Renaissance masterpieces by Botticelli, Michelangelo, Leonardo da Vinci, and Raphael.
4. Piazza della Signoria (Square of the Citizens) - A picturesque public square that houses a number of stunning sculptures, including Michelangelo's famous "David."
5. Arno River - A scenic river that winds through Florence, offering great views from several terraces built along its banks.
6. Palazzo Vecchio (Old Palace) - A grand historical building that once served as the seat of power for the Signoria of Florence. It is now home to a number of art collections and exhibitions.
7. Bargello Museum - Housed in a 16th-century palace, this 

### 4-bit

In [18]:
quantization_level = "Q4_K_M"
ollama_model_name = f"{model_name}:{quantization_level}"
ollama_model_name

'rocket-3B-GGUF:Q4_K_M'

In [19]:
response = ollama.chat(
    model=ollama_model_name,
    messages=[
        {
            "role": "user",
            "content": test_prompt,
        },
    ],
    stream=True,
)
show.stream(response)



Florence is known for its beautiful and historic city center with stunning Gothic, Renaissance, and Baroque architecture. Here are a few places you must visit to get the full experience of Florence's charm:
1) Ponte Vecchio (Old Bridge): A medieval stone bridge spanning over the Arno River, lined with shops selling local handcrafted souvenirs such as leather goods, jewelry, and gold items.
2) Piazza della Signoria (Uffizi Gallery & Giotto’s Campo Dome): The grand plaza where you can find the iconic Uffizi Gallery housing a vast collection of Italian Renaissance artworks including Botticelli's "The Birth of Venus." Also, take a stroll to admire the stunning Campanile di Giotto (Giotto’s Bell Tower) situated opposite to it.
3) Ponte alle Carraje (Arno River Bridge): A bridge that connects two picturesque banks of the Arno river where you can enjoy scenic views of the river and historic buildings surrounding it.
4) Palazzo Vecchio: The medieval palace turned into a civic museum showcasi

### 2-bit

In [20]:
quantization_level = "Q2_K"
ollama_model_name = f"{model_name}:{quantization_level}"
ollama_model_name

'rocket-3B-GGUF:Q2_K'

In [21]:
response = ollama.chat(
    model=ollama_model_name,
    messages=[
        {
            "role": "user",
            "content": test_prompt,
        },
    ],
    stream=True,
)
show.stream(response)

�#florence #italy #piazzadelduomo #domevimovie#vatican city#tuscanarchitecture#levinagroup#scenicviews. 

Florence is a beautiful city located in the heart of Tuscany. The historic center of Florence is recognized as a UNESCO World Heritage Site. Apart from being an artistic and cultural hub, Florence also attracts travelers for its captivating sceneries. Here are some picturesque sites to explore:
1) Piazza del Duomo: This beautiful square in the heart of Florence houses two historic churches - Santa Maria del Fiore and Duomo. The stunning cathedral with intricate designs on the facade is located here. You can climb up to the roof terrace for an incredible panoramic view of the city.
2) Dome in Movie: The symbol of Florence, the iconic dome has featured in several movies like "La La Land" and "Assassin's Creed." It’s a stunning architectural masterpiece adorned with intricate carvings, frescoes, and ornate sculptures.
3) Vatican City: Known as the "Eternal City," Vatican City is an in

How would you rate the quality of each response? Asking a larger, more capable LLM to review them might help you spot the factual errors.
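One way to set up such a review is to collect the answers in a single prompt for the reviewing model. The sketch below only builds the prompt text; the function name `build_review_prompt`, the wording of the instructions and the 1 to 5 scale are our own assumptions, and which model to send the prompt to is left open.

```python
def build_review_prompt(question: str, answers: dict) -> str:
    """Combine a question and several labeled answers into one review request."""
    parts = [
        "You are reviewing answers from several language models.",
        f"Question: {question}",
        "For each answer, list factual errors and give a score from 1 to 5.",
    ]
    for label, answer in answers.items():
        parts.append(f"--- Answer from {label} ---\n{answer}")
    return "\n\n".join(parts)


# Placeholder answers; in practice, paste or collect the model outputs here
review_prompt = build_review_prompt(
    "Recommend some sights in Florence, Italy.",
    {"Q8_0": "(answer text)", "Q2_K": "(answer text)"},
)
print(review_prompt)
```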

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2024 [Christian Staudt](https://clstaudt.me), [Katharina Rasch](https://krasch.io)_