# ZipNN Usage Example
This example shows how to use the [compressed meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/royleibov/Llama-3.2-11B-Vision-Instruct-ZipNN-Compressed) model hosted on Hugging Face.

## Requirements

In [1]:
!pip install --upgrade transformers
!pip install --upgrade zipnn

Collecting transformers
  Downloading transformers-4.45.1-py3-none-any.whl.metadata (44 kB)
Collecting tokenizers<0.21,>=0.20 (from transformers)
  Downloading tokenizers-0.20.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.45.1-py3-none-any.whl (9.9 MB)
Downloading tokenizers-0.20.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.9 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.19.1
    Uninstalling tokenizers-0.19.1:
      Successfully uninstalled t

## Running the model

In [2]:
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor
from zipnn import zipnn_hf

To run the model, simply add `zipnn_hf()` at the beginning of the file, and it will take care of decompression for you:

In [3]:
zipnn_hf()

In [4]:
# Load the model
model_id = "royleibov/Llama-3.2-11B-Vision-Instruct-ZipNN-Compressed"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

config.json:   0%|          | 0.00/5.07k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/93.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/5 [00:00<?, ?it/s]

model-00001-of-00005.safetensors.znn:   0%|          | 0.00/3.28G [00:00<?, ?B/s]

model-00002-of-00005.safetensors.znn:   0%|          | 0.00/3.30G [00:00<?, ?B/s]

model-00003-of-00005.safetensors.znn:   0%|          | 0.00/3.26G [00:00<?, ?B/s]

model-00004-of-00005.safetensors.znn:   0%|          | 0.00/3.31G [00:00<?, ?B/s]

model-00005-of-00005.safetensors.znn:   0%|          | 0.00/974M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

Decompressing model-00001-of-00005.safetensors.znn
Decompressing model-00002-of-00005.safetensors.znn
Decompressing model-00003-of-00005.safetensors.znn
Decompressing model-00004-of-00005.safetensors.znn
Decompressing model-00005-of-00005.safetensors.znn


generation_config.json:   0%|          | 0.00/215 [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/437 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.8k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

chat_template.json:   0%|          | 0.00/557 [00:00<?, ?B/s]
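
As a rough check on how much data the compressed download involves, the shard sizes reported in the progress bars above can be totalled. This is only a minimal sketch: the sizes are copied by hand from that output, `974M` is assumed to mean 0.974 GB, and `total_download_gb` is an illustrative helper rather than part of ZipNN.

```python
# Compressed shard sizes in GB, copied from the download progress bars above.
# Treating "974M" as 0.974 GB is an assumption about the units in that output.
shard_sizes_gb = [3.28, 3.30, 3.26, 3.31, 0.974]


def total_download_gb(sizes):
    """Return the summed size of all shards, in GB."""
    return sum(sizes)


total = total_download_gb(shard_sizes_gb)
print(f"Compressed download: {total:.2f} GB across {len(shard_sizes_gb)} shards")
```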

In [5]:
# Get an image for Llama3.2
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

In [6]:
# Build the messages
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]

In [7]:
# Process messages
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

In [8]:
# Get output!
output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

<|image|>If I had to write a haiku for this one, it would be: <|eot_id|><|start_header_id|>assistant<|end_header_id|>

Here is a haiku for the image:

A rabbit in a coat
Stands on a dirt path, flowers
A charming scene unfolds<|eot_id|>
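
The printed output above includes the prompt and the chat special tokens. If only the model's reply is wanted, one common approach is to slice off the prompt tokens before decoding. The sketch below shows the slicing on a plain list so that it runs without the model; `reply_token_ids` is a hypothetical helper, and the commented lines show how it might be applied to `output` and `inputs` from the cells above.

```python
def reply_token_ids(output_ids, prompt_length):
    """Return only the tokens generated after the prompt.

    `output_ids` is one generated sequence (prompt followed by reply) and
    `prompt_length` is the number of prompt tokens at its start.
    """
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return output_ids[prompt_length:]


# Toy example with made-up token ids: 3 prompt tokens, 2 generated tokens.
toy_output = [101, 7, 8, 42, 43]
print(reply_token_ids(toy_output, 3))

# With the objects from the cells above, this might look like (untested sketch):
# prompt_length = inputs["input_ids"].shape[-1]
# reply = reply_token_ids(output[0], prompt_length)
# print(processor.decode(reply, skip_special_tokens=True))
```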


### ZipNN Extra Tips
ZipNN also allows you to seamlessly save local disk space in your cache after the model is downloaded.

To compress the cached model, simply run:

In [9]:
!wget -i https://raw.githubusercontent.com/zipnn/zipnn/main/scripts/scripts.txt

  pid, fd = os.forkpty()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


--2024-09-30 09:34:57--  https://raw.githubusercontent.com/zipnn/zipnn/main/scripts/scripts.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 347 [text/plain]
Saving to: 'scripts.txt'


2024-09-30 09:34:58 (5.30 MB/s) - 'scripts.txt' saved [347/347]

--2024-09-30 09:34:58--  https://raw.githubusercontent.com/royleibov/zipnn/main/scripts/zipnn_compress_file.py
Reusing existing connection to raw.githubusercontent.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 5488 (5.4K) [text/plain]
Saving to: 'zipnn_compress_file.py'


2024-09-30 09:34:58 (51.6 MB/s) - 'zipnn_compress_file.py' saved [5488/5488]

--2024-09-30 09:34:58--  https://raw.githubusercontent.com/royleibov/zipnn/main/scripts/zipnn_compress_path.py
Reusing existin
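
The `huggingface/tokenizers` message in the output above is a warning, not an error, and it names its own remedy: setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it is placed in the first cell of the notebook, before the tokenizer is used and before any `!` shell command forks the process:

```python
import os

# Set this before `tokenizers`/`transformers` are used so the library does not
# enable parallelism that a later fork (e.g. a `!` shell command) would disable.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Choosing `"false"` trades tokenizer parallelism for quiet output; the warning text lists `true` as the other accepted value.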

In [10]:
!python zipnn_compress_path.py safetensors --model royleibov/Llama-3.2-11B-Vision-Instruct-ZipNN-Compressed --hf_cache

Found repo royleibov/Llama-3.2-11B-Vision-Instruct-ZipNN-Compressed in cache
Fixing Hugging Face model json...
Compressing /root/.cache/huggingface/hub/models--royleibov--Llama-3.2-11B-Vision-Instruct-ZipNN-Compressed/snapshots/e8b34daefcb0a9c3a64c7edbae5bacfdd9033b60/model-00002-of-00005.safetensors...
Compressed /root/.cache/huggingface/hub/models--royleibov--Llama-3.2-11B-Vision-Instruct-ZipNN-Compressed/snapshots/e8b34daefcb0a9c3a64c7edbae5bacfdd9033b60/model-00002-of-00005.safetensors to /root/.cache/huggingface/hub/models--royleibov--Llama-3.2-11B-Vision-Instruct-ZipNN-Compressed/snapshots/e8b34daefcb0a9c3a64c7edbae5bacfdd9033b60/model-00002-of-00005.safetensors.znn
Original size:  4.63GB size after compression: 3.07GB, Remaining size is 66.37% of original, time: 38.44
Reorganizing Hugging Face cache...
Compressing /root/.cache/huggingface/hub/models--royleibov--Llama-3.2-11B-Vision-Instruct-ZipNN-Compressed/snapshots/e8b34daefcb0a9c3a64c7edbae5ba
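
The size line in the log above reports both shard sizes and the remaining percentage, so the ratio can be recomputed as a sanity check. The sketch below parses a copy of that line; `remaining_percent` is a hypothetical helper, not part of the ZipNN scripts. Because the printed sizes are rounded to two decimals, the recomputed value is expected to differ slightly from the logged 66.37%, which was presumably computed from exact byte counts (an assumption).

```python
import re

# Size line copied from the compression log above.
log_line = (
    "Original size:  4.63GB size after compression: 3.07GB, "
    "Remaining size is 66.37% of original, time: 38.44"
)


def remaining_percent(line):
    """Recompute compressed/original as a percentage from the printed sizes."""
    # Only the two size values carry a "GB" suffix, so they match in order.
    sizes = re.findall(r"(\d+(?:\.\d+)?)GB", line)
    original_gb, compressed_gb = (float(s) for s in sizes[:2])
    return 100.0 * compressed_gb / original_gb


print(f"{remaining_percent(log_line):.2f}% of original size")
```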

The model will be decompressed automatically and safely as long as `zipnn_hf()` is called at the top of the file, as in the example above.

To decompress manually, simply run:

In [11]:
!python zipnn_decompress_path.py --model royleibov/Llama-3.2-11B-Vision-Instruct-ZipNN-Compressed --hf_cache

Found repo royleibov/Llama-3.2-11B-Vision-Instruct-ZipNN-Compressed in cache
Fixing Hugging Face model json...
Decompressing /root/.cache/huggingface/hub/models--royleibov--Llama-3.2-11B-Vision-Instruct-ZipNN-Compressed/snapshots/e8b34daefcb0a9c3a64c7edbae5bacfdd9033b60/model-00002-of-00005.safetensors.znn...
Decompressed /root/.cache/huggingface/hub/models--royleibov--Llama-3.2-11B-Vision-Instruct-ZipNN-Compressed/snapshots/e8b34daefcb0a9c3a64c7edbae5bacfdd9033b60/model-00002-of-00005.safetensors.znn to /root/.cache/huggingface/hub/models--royleibov--Llama-3.2-11B-Vision-Instruct-ZipNN-Compressed/snapshots/e8b34daefcb0a9c3a64c7edbae5bacfdd9033b60/model-00002-of-00005.safetensors
Back to original size: 4.63GB size before decompression: 3.07GB, time: 38.78
Reorganizing Hugging Face cache...
Decompressing /root/.cache/huggingface/hub/models--royleibov--Llama-3.2-11B-Vision-Instruct-ZipNN-Compressed/snapshots/e8b34daefcb0a9c3a64c7edbae5bacfdd9033b60/model-
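
To check whether the cached shards are currently compressed, the snapshot directory shown in the logs above can be inspected by file extension. This is a minimal sketch using only the standard library: `count_shards` is a hypothetical helper, and the demo runs on a temporary directory with made-up file names rather than on the real cache.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_shards(snapshot_dir):
    """Count compressed (.znn) and uncompressed (.safetensors) shard files."""
    counts = Counter()
    for path in Path(snapshot_dir).iterdir():
        if path.name.endswith(".safetensors.znn"):
            counts["compressed"] += 1
        elif path.name.endswith(".safetensors"):
            counts["uncompressed"] += 1
    return counts


# Demo on a throwaway directory with illustrative file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "model-00001-of-00002.safetensors.znn",
        "model-00002-of-00002.safetensors",
        "config.json",
    ):
        (Path(tmp) / name).touch()
    print(dict(count_shards(tmp)))

# On a real machine, pass the snapshot directory printed in the logs above,
# e.g. the `.../snapshots/<revision>` folder under ~/.cache/huggingface/hub.
```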