# Running Inference with Turkish-LLaVA-v0.1 Using 4-bit Quantization

## 1. Introduction

Welcome! This notebook will guide you through running inference with the multimodal [Turkish-LLaVA-v0.1](https://huggingface.co/ytu-ce-cosmos/Turkish-LLaVA-v0.1) model using 4-bit quantization for efficient memory usage.

What you'll learn:
- Setting up your environment for LLaVA inference
- Loading and quantizing the Turkish-LLaVA model
- Preparing and processing images and prompts
- Running inference and interpreting the results

Requirements:
- A GPU-enabled environment (NVIDIA A100 or similar recommended)
- Basic familiarity with Python and Jupyter Notebooks

> **Tip:** This notebook is designed to run on Google Colab, Kaggle, or your own GPU machine. If you're new to multimodal models, don't worry: each step is explained in detail.

## 2. Install Requirements

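We start by installing the necessary libraries: LLaVA (from source, which pulls in its own dependencies such as Hugging Face Transformers), flash-attn, Accelerate, PEFT, huggingface_hub, and datasets.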

⚠️ **Notes:**
- If you are running this on Google Colab, make sure to select a GPU runtime.
- After installation, you may need to restart the notebook kernel.
- The `flash-attn` library provides the efficient attention kernels used here.
- The `bitsandbytes` library provides the 4-bit quantization used to load the model.

> If you encounter installation issues, try upgrading pip or restarting the runtime.

In [None]:
!pip install --upgrade pip  # enable PEP 660 support
!git clone https://github.com/haotian-liu/LLaVA.git
!(cd LLaVA && pip install -e . && pip install -e ".[train]")
!pip install flash-attn==2.7.3 --no-build-isolation --no-cache-dir
!pip install -U accelerate==0.34.2 peft==0.10.0 huggingface_hub datasets
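
After the install cell finishes (and after any kernel restart), it can help to confirm that the key packages are importable before moving on. The following is a minimal sketch that uses only the standard library. The helper name `missing_packages` is hypothetical, and the module names in `required` are the usual import names for these packages, listed here as an assumption rather than an exhaustive check.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if find_spec(name) is None]


# Top-level import names for the packages this notebook relies on (assumed list).
required = ["llava", "flash_attn", "bitsandbytes", "peft", "accelerate", "transformers"]

missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, re-running the install cell or restarting the runtime is usually the first thing to try.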

## 3. Define Helper Functions

We define utility functions for:
- Downloading images from a URL
- Preprocessing prompts for the LLaVA model
- Running inference with the model

**Why?**
- These helpers make the code modular and reusable.
- The prompt formatting is crucial for multimodal models like LLaVA.

> **Tip:** You can modify the prompt templates to suit your use case.

In [None]:
import torch
import requests
from PIL import Image
from io import BytesIO
from llava.mm_utils import process_images, tokenizer_image_token


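def download_image(url):
    # Fail early on HTTP errors instead of handing an error page to PIL
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # Convert to RGB so grayscale or RGBA images have the expected three channels
    return Image.open(BytesIO(response.content)).convert("RGB")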


def preprocess_prompt(system_prompt, user_prompt):
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"<image>\n{user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )


@torch.inference_mode()
def inference(model, tokenizer, image_processor, image, prompt, generation_config):
    input_ids = (
        tokenizer_image_token(
            prompt,
            tokenizer,
            return_tensors="pt",
        )
        .unsqueeze(0)
        .to(model.device)
    )
    image_tensor = (
        process_images(
            [image],
            image_processor,
            model.config,
        )
        .to(model.device, dtype=torch.float16)
    )
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        **generation_config,
    )
    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return outputs
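
The `<image>` placeholder in the prompt matters because `tokenizer_image_token` does not treat it as ordinary text. To our understanding, LLaVA splits the prompt at `<image>`, tokenizes the surrounding text chunks, and inserts a sentinel id (`-200` in LLaVA's constants) at the position where the image features are spliced in later. The toy sketch below mimics that idea with a hypothetical whitespace tokenizer. The names `toy_tokenize` and `toy_image_token_ids` are illustrative, not LLaVA functions, and the real implementation additionally handles BOS tokens.

```python
IMAGE_PLACEHOLDER = "<image>"
IMAGE_TOKEN_INDEX = -200  # sentinel id for image positions (assumed to match LLaVA's constant)


def toy_tokenize(text, vocab):
    """Hypothetical tokenizer: map each whitespace-separated word to an integer id."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def toy_image_token_ids(prompt, vocab):
    """Tokenize the text around each placeholder and mark image slots with the sentinel."""
    chunks = prompt.split(IMAGE_PLACEHOLDER)
    ids = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            ids.append(IMAGE_TOKEN_INDEX)  # one sentinel per placeholder
        ids.extend(toy_tokenize(chunk, vocab))
    return ids


vocab = {}
ids = toy_image_token_ids("user: <image>\ndescribe the picture", vocab)
print(ids)  # → [0, -200, 1, 2, 3]
```

A prompt without the placeholder would produce no sentinel, so the model would have no position at which to attend to the image.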

## 4. Download and Prepare the Model

We now download the pretrained Turkish-LLaVA model and set up 4-bit quantization using BitsAndBytes.

**Why quantize?**
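- 4-bit quantization stores each weight in 4 bits instead of 16, cutting weight memory to roughly a quarter and making it possible to run large models on consumer GPUs.
- The `nf4` (4-bit NormalFloat) quantization type is designed for normally distributed weights and is the usual choice for pretrained models.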
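
To make the memory saving concrete, here is a rough back-of-the-envelope sketch. It counts weight storage only and ignores quantization constants, activations, the KV cache, and the vision tower. The parameter count of 8 billion is an assumed round figure for illustration, not a number taken from the model card, and `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: a round 8 billion parameters for the language model.
num_params = 8e9

for label, bits in [("float16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label:>8}: {weight_memory_gib(num_params, bits):5.1f} GiB")
```

Under these assumptions the weights drop from roughly 15 GiB in float16 to under 4 GiB in 4-bit, which is the difference between needing a data-center GPU and fitting on a typical consumer card.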

**Tips:**
- If you encounter CUDA or memory errors, try reducing batch size or using a smaller model.
- bfloat16 needs an Ampere-or-newer GPU (A100, RTX 30xx, etc.). On older GPUs, set `bnb_4bit_compute_dtype` to `torch.float16` instead.
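
The bfloat16 decision can be sketched as a small rule on the GPU's CUDA compute capability. This is a simplified illustration: `pick_compute_dtype` is a hypothetical helper that returns a dtype name as a string, and it assumes the capability tuple is supplied by the caller. In a real session the tuple would come from `torch.cuda.get_device_capability()`, and `torch.cuda.is_bf16_supported()` offers a more direct check.

```python
def pick_compute_dtype(compute_capability):
    """Return a compute dtype name for a (major, minor) CUDA compute capability."""
    major, _minor = compute_capability
    # bfloat16 is natively supported from compute capability 8.0 (Ampere) onward.
    return "bfloat16" if major >= 8 else "float16"


# Example capabilities: T4 is (7, 5), A100 is (8, 0), RTX 3090 is (8, 6).
for name, capability in [("T4", (7, 5)), ("A100", (8, 0)), ("RTX 3090", (8, 6))]:
    print(name, "->", pick_compute_dtype(capability))
```

The returned name can then be turned into the actual dtype with `getattr(torch, name)` before passing it to `BitsAndBytesConfig`.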

In [None]:
import torch
from transformers import BitsAndBytesConfig
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init

model_path = "ytu-ce-cosmos/Turkish-LLaVA-v0.1"

# apply 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

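# skip PyTorch's default weight initialization; the weights come from the checkpoint
disable_torch_init()
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path,
    None,  # model_base: not needed when loading a full checkpoint
    "llava_llama",  # model_name: contains "llava" so the LLaVA loading path is used
    quantization_config=quantization_config,
)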

## 5. Prepare the Input (Image and Prompt)

Let's prepare an example image and a prompt for the model.

- We use a sample image from the Hugging Face `documentation-images` dataset.
- The prompt consists of a system message (defining the assistant's behavior) and a user message (the actual question or instruction).

**Tips:**
- You can change the image URL to try your own images.
- Modify the user prompt to ask different questions about the image.

In [None]:
# get image from URL address
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-im-captioning.jpg"
image = download_image(image_url)

# create a prompt with system and user messages
system_prompt = "Sen yardımsever bir asistansın."  # "You are a helpful assistant."
user_prompt = "Görüntüyü detaylı olarak açıkla."  # "Describe the image in detail."
prompt = preprocess_prompt(system_prompt, user_prompt)
display(image)
print("Prompt:", prompt, sep="\n")

## 6. Run Inference

Now, let's run the model on the prepared image and prompt.

- The model will generate a detailed caption or answer about the image.
- You can adjust `generation_config` (e.g., `max_new_tokens`) for longer or shorter outputs. With `do_sample=False`, decoding is greedy, so repeated runs should give the same output.

> **Tip:** If you get CUDA out-of-memory errors, try reducing `max_new_tokens` or using a smaller image.

In [None]:
# define generation config
generation_config = dict(
    do_sample=False,
    max_new_tokens=256,
)

# run inference
outputs = inference(
    model,
    tokenizer,
    image_processor,
    image,
    prompt,
    generation_config,
)
print(outputs)

## 7. Additional Tips and Troubleshooting

- If you want to use your own images, upload them and use `Image.open('your_image.jpg')`.
- For different tasks (e.g., VQA, conversation), adjust the prompt template accordingly.
- If you encounter errors related to CUDA, try restarting the runtime or reducing memory usage.
- For more advanced usage, see the [LLaVA GitHub repository](https://github.com/haotian-liu/LLaVA) and [Hugging Face documentation](https://huggingface.co/docs/transformers/main/en/model_doc/llava).
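
As one way to adapt the prompt template for conversation-style use, the sketch below extends the single-turn format from `preprocess_prompt` to several turns. It reuses the same header tokens as the notebook and attaches `<image>` to the first user turn only. The helper `build_prompt` is hypothetical, the earlier assistant reply in the example is made up for illustration, and how well the model handles multi-turn prompts is untested here.

```python
def build_prompt(system_prompt, turns):
    """Build a prompt from (role, text) turns, ending with an open assistant turn.

    The <image> placeholder is attached to the first user turn only.
    """
    parts = [
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
    ]
    image_added = False
    for role, text in turns:
        if role == "user" and not image_added:
            text = f"<image>\n{text}"
            image_added = True
        parts.append(f"<|start_header_id|>{role}<|end_header_id|>\n\n{text}<|eot_id|>")
    # Leave the assistant header open so the model writes the next reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = build_prompt(
    "Sen yardımsever bir asistansın.",  # "You are a helpful assistant."
    [
        ("user", "Görüntüde ne var?"),  # "What is in the image?"
        ("assistant", "Çiçeklerin arasında bir köpek var."),  # hypothetical earlier reply
        ("user", "Köpek ne yapıyor?"),  # "What is the dog doing?"
    ],
)
print(prompt)
```

With a single user turn, `build_prompt` produces the same string as `preprocess_prompt`, so it can be used as a drop-in replacement when calling `inference`.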

Happy inferencing! 🚀