# Quantization Example: Running a Large Language Model with Different Levels of Quantization

In this example, we compare the capabilities of an LLM across several quantized versions.

## Setup: Ollama

Unsurprisingly, LLMs are often large, both in file size and in memory footprint. A convenient way of managing local LLMs is Ollama.

In [1]:
from ai_dojo import show

In [2]:
show.github_repo("https://github.com/ollama/ollama")

Follow the installation steps outlined in the Ollama repository to install it and start it on your system. Alternatively, if you have Docker, you can run

```
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
``` 
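Before moving on, it can help to check that the server is actually reachable. The following is a minimal sketch, not part of the original notebook: the helper name `ollama_is_running` is hypothetical, and it assumes that the server listens on port 11434 (as in the Docker command above) and answers a plain HTTP request to its root URL with status 200.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url.

    Assumption: a running Ollama server responds to a GET on its root URL.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error status, ...
        return False


print("Ollama reachable:", ollama_is_running())
```

If this prints `False`, the later `ollama.create` and `ollama.chat` calls will most likely fail as well, so it is worth fixing the setup first.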

We are going to use the following Python module to interact with Ollama.

In [3]:
show.github_repo("https://github.com/ollama/ollama-python")

In [4]:
import ollama

## An LLM in Quantized Versions


> Rocket 🦝 is a 3 billion large language model that was trained on a mix of publicly available datasets. [...] The outcome is a highly effective chat model



### Quantized Versions

We are now going to get some quantized versions of this model. A 🤗 user has already applied various quantization methods to the model and provided the results.

In [5]:
import huggingface_hub

In [6]:
huggingface_user = "TheBloke"
model_name = "rocket-3B-GGUF"
full_model_name = f"{huggingface_user}/{model_name}"
model_page_url= f"https://huggingface.co/{full_model_name}"
model_page_url

'https://huggingface.co/TheBloke/rocket-3B-GGUF'

**ℹ The GGUF File Format**

[**GGUF**](https://huggingface.co/docs/hub/gguf) is a binary format that is optimized for quick loading and saving of models, making it highly efficient for inference purposes. Unlike tensor-only file formats like safetensors, GGUF encodes both the tensors and a standardized set of metadata.


![](img/gguf-spec.png)
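To make the "tensors plus metadata" idea more concrete, here is a rough sketch that reads only the fixed-size header of a GGUF file. It assumes the layout used by GGUF versions 2 and 3 as we understand it: a 4-byte magic `GGUF`, a 32-bit version, then 64-bit counts for tensors and metadata key-value pairs, all little-endian. The function name `read_gguf_header` is hypothetical, and the demo writes a synthetic header to a temporary file, which is not a valid model.

```python
import struct
import tempfile
from pathlib import Path

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata key-value count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(path) -> dict:
    """Read version, tensor count and metadata key-value count from a GGUF file."""
    with open(path, "rb") as f:
        raw = f.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE:
        raise ValueError("File is too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack(HEADER_FORMAT, raw)
    if magic != GGUF_MAGIC:
        raise ValueError(f"Not a GGUF file (magic was {magic!r})")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Demo on a synthetic header (illustrative values, not a real model file)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.gguf"
    demo_path.write_bytes(struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 2, 5))
    header = read_gguf_header(demo_path)

print(header)
```

Pointing `read_gguf_header` at one of the downloaded `.gguf` files should report how many tensors and metadata entries that file declares.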

The model page lists a variety of quantized models in different quantization types. Let's focus on a few.

In [7]:
quantization_types = [
    "Q8_0",
    "Q4_K_M",
    "Q2_K",
]
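For some intuition about what a type name like `Q8_0` stands for, here is a simplified sketch of block-wise 8-bit quantization. It assumes blocks of 32 weights that share one 16-bit scale, with each weight stored as a signed 8-bit integer. This illustrates the general idea only; it is not the actual llama.cpp implementation, and the weights are random illustrative values.

```python
import random

BLOCK_SIZE = 32  # assumption: number of weights that share one scale


def quantize_block(weights):
    """Map floats to integers in [-127, 127] plus one shared scale."""
    amax = max(abs(w) for w in weights)
    scale = amax / 127 if amax > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return scale, quantized


def dequantize_block(scale, quantized):
    """Recover approximate floats from the integers and the scale."""
    return [scale * q for q in quantized]


random.seed(0)
block = [random.gauss(0, 0.02) for _ in range(BLOCK_SIZE)]  # illustrative weights

scale, quantized = quantize_block(block)
restored = dequantize_block(scale, quantized)
max_error = max(abs(a - b) for a, b in zip(block, restored))

# Storage cost: 8 bits per weight plus one 16-bit scale per block
bits_per_weight = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE

print(f"max rounding error: {max_error:.6f} (scale: {scale:.6f})")
print(f"bits per weight: {bits_per_weight}")
```

The rounding error per weight is bounded by half the scale. Lower-bit types shrink the integer range, so the same bound grows, which is one reason to expect quality to degrade as we move from `Q8_0` towards `Q2_K`.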

These quantized models are not available from the Ollama model registry. Fortunately, adding custom models downloaded from 🤗 to Ollama is quite straightforward. We need to give Ollama a model name and a simple configuration file.

In [8]:
import tqdm

In [9]:
quantized_model_paths = {}
for qtype in tqdm.tqdm(quantization_types):
    model_file = f"rocket-3b.{qtype}.gguf" 
    tqdm.tqdm.write(f"Downloading {qtype} model")
    model_path = huggingface_hub.hf_hub_download(full_model_name, filename=model_file)
    quantized_model_paths[qtype] = model_path
    
quantized_model_paths



Downloading Q8_0 model


 33%|███▎      | 1/3 [00:00<00:00,  2.06it/s]

Downloading Q4_K_M model


100%|██████████| 3/3 [00:01<00:00,  2.97it/s]

Downloading Q2_K model





{'Q8_0': '/Users/cls/.cache/huggingface/hub/models--TheBloke--rocket-3B-GGUF/snapshots/23c0500471cb5e466aab7e9ade9b4da32dea43d2/rocket-3b.Q8_0.gguf',
 'Q4_K_M': '/Users/cls/.cache/huggingface/hub/models--TheBloke--rocket-3B-GGUF/snapshots/23c0500471cb5e466aab7e9ade9b4da32dea43d2/rocket-3b.Q4_K_M.gguf',
 'Q2_K': '/Users/cls/.cache/huggingface/hub/models--TheBloke--rocket-3B-GGUF/snapshots/23c0500471cb5e466aab7e9ade9b4da32dea43d2/rocket-3b.Q2_K.gguf'}

In [10]:
def make_model_file(model_path) -> str:
    """Creates a simple Ollama model configuration file for the given serialized model."""
    content = f"FROM {model_path}"
    return content
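A Modelfile can carry more than the `FROM` line. The sketch below is a hypothetical variant of the helper above (the name `make_model_file_with_options` and the example values are ours, not from the notebook). It assumes the Modelfile instructions `PARAMETER` and `SYSTEM`, and a placeholder model path.

```python
def make_model_file_with_options(model_path, parameters=None, system_prompt=None) -> str:
    """Build Modelfile text with optional PARAMETER and SYSTEM instructions."""
    lines = [f"FROM {model_path}"]
    for name, value in (parameters or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    if system_prompt:
        # Triple quotes allow the system prompt to span multiple lines
        lines.append(f'SYSTEM """{system_prompt}"""')
    return "\n".join(lines)


example = make_model_file_with_options(
    "/path/to/model.gguf",  # placeholder path
    parameters={"temperature": 0.7, "num_ctx": 2048},
    system_prompt="You are a helpful travel guide.",
)
print(example)
```

For the comparison in this notebook, the plain `FROM` version is sufficient, and it keeps all three quantized models on identical settings.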

In [11]:
for qtype, model_path in quantized_model_paths.items():
    ollama_model_name = f"{model_name}:{qtype}"
    print(f"Creating Ollama model {ollama_model_name}")
    response = ollama.create(
        model=ollama_model_name,
        modelfile=make_model_file(model_path)
    )
    print(response["status"])


Creating Ollama model rocket-3B-GGUF:Q8_0
success
Creating Ollama model rocket-3B-GGUF:Q4_K_M
success
Creating Ollama model rocket-3B-GGUF:Q2_K
success


Let's verify that all of the models above are now available in Ollama. Also, have a look at the size on disk, which the table below reports in GiB.

In [12]:
import pandas
from pandas import DataFrame


In [13]:
def show_ollama_model_table() -> DataFrame:
    """List Ollama models in tabular form, with sizes in GiB."""
    model_data = pandas.json_normalize(ollama.list()["models"])
    columns = ["model", "size", "details.family", "details.format", "details.parameter_size", "details.quantization_level"]
    # Copy the selection so the size conversion below does not write into a view
    model_data = model_data[columns].copy()
    # Convert bytes to GiB
    model_data["size"] = model_data["size"].apply(lambda s: round(s / (1024**3), 2))
    return model_data

In [14]:
show_ollama_model_table()

| | model | size | details.family | details.format | details.parameter_size | details.quantization_level |
|---|---|---|---|---|---|---|
| 0 | rocket-3B-GGUF:Q2_K | 1.12 | stablelm | gguf | 2.8B | Q2_K |
| 1 | rocket-3B-GGUF:Q4_K_M | 1.59 | stablelm | gguf | 2.8B | Q4_K_M |
| 2 | rocket-3B-GGUF:Q8_0 | 2.77 | stablelm | gguf | 2.8B | Q8_0 |
| 3 | tinyllama:1.1b-chat | 0.59 | llama | gguf | 1B | Q4_0 |
| 4 | codeqwen:7b-chat | 3.89 | qwen2 | gguf | 7B | Q4_0 |
| 5 | tinyllama:latest | 0.59 | llama | gguf | 1B | Q4_0 |
| 6 | Mistral-7B-OpenOrca:Q2_K | 2.87 | llama | gguf | 7B | Q2_K |
| 7 | TheBloke/Mistral-7B-OpenOrca-GGUF:mistral-7b-o... | 2.87 | llama | gguf | 7B | Q2_K |
| 8 | dolphin-mixtral:latest | 24.63 | llama | gguf | 47B | Q4_0 |
| 9 | orca-mini:13b | 6.86 | llama | gguf | 13B | Q4_0 |

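The sizes in the table allow a quick back-of-the-envelope estimate of how many bits each variant spends per weight. This is a rough sketch: it takes the parameter count to be exactly 2.8 billion (the table only gives the rounded value "2.8B"), reads the sizes as GiB, and ignores that the file also holds metadata.

```python
GIB = 1024**3
N_PARAMETERS = 2.8e9  # assumption: rounded value from details.parameter_size

# Sizes in GiB, copied from the table above
sizes_gib = {"Q8_0": 2.77, "Q4_K_M": 1.59, "Q2_K": 1.12}


def bits_per_weight(size_gib: float, n_parameters: float = N_PARAMETERS) -> float:
    """Estimate the average number of bits stored per model weight."""
    return size_gib * GIB * 8 / n_parameters


estimates = {qtype: round(bits_per_weight(size), 2) for qtype, size in sizes_gib.items()}
for qtype, bits in estimates.items():
    print(f"{qtype}: ~{bits} bits per weight")
```

The estimates come out at roughly 8.5, 4.9 and 3.4 bits per weight. The number in a type name is therefore a nominal bit width rather than the exact storage cost: scales stored alongside the weights add overhead, and the `Q2_K` file is presumably well above 2 bits because some tensors are kept at higher precision.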

## Test

Let's prompt the quantized models to do something useful. This should give us a first impression of how output quality relates to the level of quantization.

In [15]:
test_prompt = "Recommend some sights in Florence, Italy."
show.text(test_prompt)

Recommend some sights in Florence, Italy.

### 8-bit 

In [16]:
quantization_level = "Q8_0"
ollama_model_name = f"{model_name}:{quantization_level}"
ollama_model_name

'rocket-3B-GGUF:Q8_0'

In [17]:
response = ollama.chat(
    model=ollama_model_name,
    messages=[
        {
            "role": "user",
            "content": test_prompt,
        },
    ],
    stream=True,
)
show.stream(response)



1. Ponte Vecchio: This iconic stone bridge over the Arno river offers stunning views of the historic Florence from above. It's lined with shops that sell gold and jewellery but beware as they tend to rip off tourists! 

2. Uffizi Gallery: A world-famous art museum that houses some of the most famous paintings such as Botticelli’s ‘The Birth of Venus,’ Michelangelo's ‘Doni Tondo,’ and Raphael’s ‘Sistine Madonna.’

3. Piazza della Signoria: Often referred to as the ‘Main Square of Florence,’ this central square is a gathering place for both locals and tourists alike, where you can admire the famous statue of David by Michelangelo which stands tall in the middle. 

4. Duomo di Firenze (Florence Cathedral): A magnificent Gothic cathedral that took over 400 years to complete with its iconic dome designed by Filippo Brunelleschi. The interiors are breathtaking, with intricate frescoes and mosaics adorning the walls and ceilings.

5. Boboli Gardens: Situated behind the Pitti Palace, these g

### 4-bit

In [18]:
quantization_level = "Q4_K_M"
ollama_model_name = f"{model_name}:{quantization_level}"
ollama_model_name

'rocket-3B-GGUF:Q4_K_M'

In [19]:
response = ollama.chat(
    model=ollama_model_name,
    messages=[
        {
            "role": "user",
            "content": test_prompt,
        },
    ],
    stream=True,
)
show.stream(response)

 Florence is a city located in central Italy and is known for its rich history, art, and architecture. There are plenty of things to see and do in this beautiful city. Here are some must-visit attractions in Florence:
1. The Uffizi Gallery - This world-famous museum houses an incredible collection of Italian Renaissance artwork, including masterpieces by Leonardo da Vinci, Michelangelo, Botticelli, and Raphael.
2. Ponte Vecchio - A medieval bridge that spans the Arno River, this iconic landmark is lined with shops selling souvenirs and jewelry made from precious stones and metals.
3. Duomo di Firenze (Florence Cathedral) - This stunning Gothic-style cathedral boasts a striking red and white striped facade, towering spires, and intricate sculptures and decorations. Inside the cathedral, admire the beautiful mosaics and take in sweeping views of the city from its bell tower.
4. Piazza della Signoria (Town Square) - A bustling public square at the heart of Florence, this is where you'll f

### 2-bit

In [20]:
quantization_level = "Q2_K"
ollama_model_name = f"{model_name}:{quantization_level}"
ollama_model_name

'rocket-3B-GGUF:Q2_K'

In [26]:
response = ollama.chat(
    model=ollama_model_name,
    messages=[
        {
            "role": "user",
            "content": test_prompt,
        },
    ],
    stream=True,
)
show.stream(response)

 […]
Florence is a city full of history and art. Here are some famous sights to visit:
1. The Duomo: This iconic cathedral towers over the Piazza del Duomo, one of Italy's largest squares. Climb up to the roof terrace for an incredible panoramic view of Florence.
2. Gallerica di Santa Maria del Fiore: This is a beautiful church in Florence featuring stunning frescoes by Italian Renaissance painter Benoît de la Vallée.
3. Ponte Vecchio: This medieval bridge over the Arno river offers tourists an opportunity to browse local artisanal shops selling souvenirs, jewelry and leather goods.
4. Uffizi Gallery: This world-famous museum contains a vast collection of artwork and artifacts from various parts of history such as Ancient Greek sculptures.
5. The Baptistery: Located near the Duomo in Florence's historic center, this baptistry is known for its stunning Byzantine mosaics inside.
6. Piazza della Signoria: This famous public square offers an amazing view of Florence Cathedral and also hous

How would you rate the quality of each response? Letting a large, highly factual LLM review them might help you spot the errors.
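To compare the answers side by side, or to hand them to a reviewer model, it can be convenient to collect them in one dictionary instead of streaming them cell by cell. The helper below is a sketch with a hypothetical name (`collect_responses`). It assumes that `ollama.chat` accepts an `options` argument and that `temperature` and `seed` are valid options for making runs more repeatable. The demo uses a stand-in chat function so that it runs without a server; its output is not real model output.

```python
def collect_responses(model_names, prompt, chat_fn=None, options=None) -> dict:
    """Send the same prompt to each model and return {model name: answer text}."""
    if chat_fn is None:
        import ollama  # imported lazily so the helper can be tried without Ollama

        chat_fn = ollama.chat
    responses = {}
    for name in model_names:
        result = chat_fn(
            model=name,
            messages=[{"role": "user", "content": prompt}],
            options=options or {},
        )
        responses[name] = result["message"]["content"]
    return responses


def fake_chat(model, messages, options):
    """Stand-in for ollama.chat, used only to demonstrate the helper."""
    return {"message": {"content": f"[{model}] would answer: {messages[-1]['content']}"}}


demo_models = [f"rocket-3B-GGUF:{qtype}" for qtype in ("Q8_0", "Q4_K_M", "Q2_K")]
demo = collect_responses(
    demo_models,
    "Recommend some sights in Florence, Italy.",
    chat_fn=fake_chat,
    options={"temperature": 0, "seed": 42},  # assumed option names
)
for name, answer in demo.items():
    print(name, "->", answer)
```

In the notebook, calling `collect_responses(demo_models, test_prompt)` without `chat_fn` would query the real models through Ollama.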

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2024 [Christian Staudt](https://clstaudt.me), [Katharina Rasch](https://krasch.io)_