# Fine-tuning Turkish-LLaVA-v0.1 with QLoRA

## 1. Introduction

Welcome! This notebook will guide you through fine-tuning the multimodal [Turkish-LLaVA-v0.1](https://huggingface.co/ytu-ce-cosmos/Turkish-LLaVA-v0.1) model using QLoRA (Quantized Low-Rank Adaptation).

What you'll learn:
- Setting up your environment for LLaVA fine-tuning
- Preparing and preprocessing a vision-language dataset
- Using QLoRA for efficient training
- Saving and sharing your fine-tuned model

Requirements:
- A GPU-enabled environment (NVIDIA A100 GPU recommended)
- Basic familiarity with Python and Jupyter Notebooks
- (Optional) A Hugging Face account for model sharing

## 2. Install Requirements

We'll start by installing all necessary libraries: LLaVA, Hugging Face Transformers, PEFT (for LoRA), and more.

⚠️ **Note:**
- If you are running this on Google Colab, select a GPU runtime.
- After installation, you may need to restart the notebook kernel.

What is QLoRA?
- QLoRA fine-tunes large models efficiently by keeping the base weights frozen in 4-bit quantized form and training only small low-rank adapters (LoRA) on top, which substantially reduces GPU memory usage.
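
To make the memory saving concrete, here is a rough back-of-the-envelope estimate of weight storage alone. The 8-billion parameter count is an assumption for illustration, not a figure taken from the model card, and the estimate ignores activations, optimizer state, and quantization constants.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 8e9  # assumed parameter count, for illustration only
for bits in (16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(num_params, bits):.1f} GiB")
```

Storing weights in 4 bits takes a quarter of the space of 16-bit weights, which is where most of the saving comes from.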

In [None]:
!pip install --upgrade pip  # enable PEP 660 support
!git clone https://github.com/haotian-liu/LLaVA.git
!(cd LLaVA && pip install -e . && pip install -e ".[train]")
!pip install flash-attn==2.7.3 --no-build-isolation --no-cache-dir
!pip install -U accelerate==0.34.2 peft==0.10.0 huggingface_hub datasets
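
After the installation (and a kernel restart, if needed), it can help to confirm what actually got installed. The following is a small sketch using only the standard library; the helper name `installed_versions` is made up for this example, and the distribution names are assumed to match the pip names used above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names are assumptions based on the pip commands above
wanted = ["torch", "transformers", "peft", "accelerate", "llava"]
for name, version in installed_versions(wanted).items():
    print(f"{name:<14} {version or 'NOT INSTALLED'}")
```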

## 3. Prepare the Model

In this section, we:
- Import necessary modules
- Set up model, vision tower, and output directories
- Define model, data, and training arguments

**Tip:** You can change the model or vision tower by editing the variables below. Make sure they are compatible.

What is a vision tower?
- A vision tower is a neural network (often a CLIP model) that processes images and extracts features for the language model.
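
As a small worked example, the name `clip-vit-large-patch14-336` encodes a 336x336 input resolution and a 14x14 patch size. The sketch below computes how many patches that yields, which is roughly the number of visual feature vectors each image contributes; the helper name is hypothetical.

```python
def num_image_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT splits an image into."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


print(num_image_patches(336, 14))  # 576
```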

In [None]:
import os
from llava import conversation as conv
from llava.train.train import ModelArguments, DataArguments, TrainingArguments


vision_tower_name = "openai/clip-vit-large-patch14-336"
model_path = "ytu-ce-cosmos/Turkish-LLaVA-v0.1"
dataset_dir = "llava_dataset"
output_dir = "output"

data_args = DataArguments(
    data_path=f"{dataset_dir}/data.json",
    image_folder=f"{dataset_dir}/images",
    lazy_preprocess=True,
)

model_args = ModelArguments(
    model_name_or_path=model_path,
    vision_tower=vision_tower_name,
    mm_vision_select_layer=-2,
    mm_use_im_start_end=False,
    mm_use_im_patch_token=False,
)

training_args = TrainingArguments(
    output_dir=output_dir,
    bf16=True,  # Use bf16 if your GPU supports it; else set to False or use fp16
    per_device_train_batch_size=4,  # Increase for faster training if you have more GPU memory
    gradient_accumulation_steps=4,  # Increase to simulate larger batch size without more memory
    optim="adamw_8bit",  # 8-bit optimizer saves memory; try "adamw_torch" for standard
    learning_rate=1e-4,  # Lower for stability, higher for faster learning (e.g., 5e-5 to 2e-4)
    warmup_ratio=0.03,  # Try 0.01–0.1; higher can help stabilize large models
    weight_decay=0.01,  # Regularization; 0.01 is common, try 0.0–0.1
    lr_scheduler_type="cosine",  # "linear" or "cosine" are common; try both
    # num_train_epochs=1,  # Use epochs or max_steps, not both; increase for more training
    max_steps=1000,  # Increase for longer training; set to -1 to use num_train_epochs
    logging_steps=5,  # Log more frequently for debugging, less for speed
    save_strategy="steps",  # "epoch" or "steps"; "steps" is good for long epochs
    save_steps=100,  # Save more often to avoid losing progress, less for disk space
    save_total_limit=5,  # Keep last N checkpoints; increase if you want more history
    group_by_modality_length=True,  # Set False if you have OOM issues or single modality
    mm_projector_lr=2e-5,  # Lower if overfitting, higher if underfitting vision branch
    report_to="none",  # Set to "wandb" or "tensorboard" for experiment tracking
)

# this is required for LLaMA-3 compatibility
system_prompt = "Sen yardımsever bir asistansın."
conv.default_conversation = conv.Conversation(
    system=f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_prompt}",
    roles=(
        "<|start_header_id|>user<|end_header_id|>\n\n",
        "<|start_header_id|>assistant<|end_header_id|>\n\n",
    ),
    version="llama3",
    messages=[],
    offset=0,
    sep_style=conv.SeparatorStyle.MPT,
    sep="<|eot_id|>",
)
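
To get a feel for how `warmup_ratio` and `lr_scheduler_type="cosine"` shape the learning rate, here is a standalone sketch of a linear-warmup plus cosine-decay curve using the values from the training arguments above. It approximates the shape of the schedule and is not the Trainer's own code; the function name `lr_at_step` is made up for this example.

```python
import math


def lr_at_step(step, max_steps=1000, base_lr=1e-4, warmup_ratio=0.03):
    """Learning rate after `step` optimizer steps (linear warmup, cosine decay)."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        # linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 15, 30, 500, 1000):
    print(f"step {step:>4}: lr = {lr_at_step(step):.2e}")
```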

## 4. Load & Quantize the Model (4-bit)

Here, we load the pretrained model and quantize it to 4-bit precision using BitsAndBytes.

Why quantize?
- Reduces memory usage and allows you to train larger models on smaller GPUs.
- 4-bit quantization is a good balance between efficiency and performance.
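
The idea behind block-wise quantization can be sketched in a few lines of plain Python. Note that this is a simplified uniform absmax scheme chosen for readability; the NF4 format used in this notebook relies on non-uniform levels, and bitsandbytes implements it differently. The function names are hypothetical.

```python
def quantize_block(values, bits=4):
    """Symmetric absmax quantization of one block to small signed integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


block = [0.51, -1.40, 0.03, 0.72]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
print("codes:   ", codes)
print("restored:", [round(v, 3) for v in restored])
```

Each value is stored as a 4-bit code plus one shared scale per block, so the restored values are close to, but not identical to, the originals.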

**Note:** If you encounter CUDA or memory errors, try reducing the batch size or using a smaller model.

In [None]:
import torch
from transformers import BitsAndBytesConfig
from llava.model.builder import load_pretrained_model


quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path,
    None,
    "llava_llama",
    quantization_config=quantization_config,
)
model.config.use_cache = False
model.config.torch_dtype = torch.bfloat16

## 5. Set Up LoRA (Low-Rank Adaptation)

LoRA is a parameter-efficient fine-tuning method. Instead of updating all model weights, it adds small trainable matrices (adapters) to certain layers.

Why use LoRA?
- Dramatically reduces the number of trainable parameters
- Makes fine-tuning feasible on consumer hardware

What you can change: The LoRA rank (r), alpha, and dropout. Higher rank = more capacity, but more memory usage.
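
To see why the rank matters, the following sketch counts the parameters a single LoRA adapter adds to one linear layer. The 4096x4096 layer size is a hypothetical example, not a value read from the model; `r` and `lora_alpha` match the configuration used in this section.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


d_in = d_out = 4096      # hypothetical layer size, for illustration only
r, lora_alpha = 32, 64   # values used in this notebook's LoraConfig

full = d_in * d_out
extra = lora_extra_params(d_in, d_out, r)
print(f"full layer: {full:,} params")
print(f"LoRA adds:  {extra:,} params ({100 * extra / full:.2f}% of the layer)")
print(f"update scaling alpha / r = {lora_alpha / r}")
```

Doubling `r` doubles the adapter size, while `lora_alpha / r` controls how strongly the adapter output is scaled.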

In [None]:
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model
from llava.train.train import find_all_linear_names

# Set the LoRA parameters below
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=find_all_linear_names(model),
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(
    model,
    use_gradient_checkpointing=True,
)

if hasattr(model, "enable_input_require_grads"):
    model.enable_input_require_grads()
else:

    def make_inputs_require_grad(module, input, output):
        output.requires_grad_(True)

    model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)


model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

## 6. Align Vision Tower and Tokenizer

This step ensures that the image processor and tokenizer are correctly set up and aligned with the model.

Why is this important?
- Multimodal models need to process both text and images.
- Proper alignment ensures that images and text are handled consistently during training.

Advanced: If you want to freeze or tune specific parts of the model (like the vision tower), you can adjust the flags here.

In [None]:
vision_tower = model.get_vision_tower()
vision_tower.to(dtype=torch.bfloat16, device="cuda")

data_args.image_processor = vision_tower.image_processor
data_args.is_multimodal = True

model.config.image_aspect_ratio = data_args.image_aspect_ratio
model.config.tokenizer_padding_side = tokenizer.padding_side
model.config.tokenizer_model_max_length = tokenizer.model_max_length

model.config.tune_mm_mlp_adapter = training_args.tune_mm_mlp_adapter = model_args.tune_mm_mlp_adapter
if model_args.tune_mm_mlp_adapter:
    model.requires_grad_(False)
    for p in model.get_model().mm_projector.parameters():
        p.requires_grad = True

model.config.freeze_mm_mlp_adapter = training_args.freeze_mm_mlp_adapter
if training_args.freeze_mm_mlp_adapter:
    for p in model.get_model().mm_projector.parameters():
        p.requires_grad = False

model.get_model().mm_projector.to(dtype=torch.bfloat16, device="cuda")

model.config.mm_use_im_start_end = data_args.mm_use_im_start_end = model_args.mm_use_im_start_end
model.config.mm_projector_lr = training_args.mm_projector_lr
training_args.use_im_start_end = model_args.mm_use_im_start_end
model.config.mm_use_im_patch_token = model_args.mm_use_im_patch_token
model.initialize_vision_tokenizer(model_args, tokenizer=tokenizer)

## 7. Set LoRA Layer Data Types

For best performance and stability, we set the data types (dtypes) of LoRA and normalization layers.

Why?
- Some layers work better in float32, others in bfloat16.
- This helps prevent numerical issues during training.
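
The precision gap between the two formats is easy to demonstrate without a GPU. The sketch below emulates bfloat16 rounding by keeping only the top 16 bits of a float32 value (round to nearest even); it is an illustration only, ignores NaN and overflow handling, and the helper name is made up.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round a value to the nearest bfloat16 and return it as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # round to nearest even on the 16 bits that bfloat16 drops
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


print(to_bfloat16(1.001))  # 1.0        (the small increment is lost)
print(to_bfloat16(1.01))   # 1.0078125  (nearest representable value)
```

Near 1.0, bfloat16 can only represent steps of 1/128, which is why small statistics in normalization layers are safer in float32.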

In [None]:
from peft.tuners.lora import LoraLayer

for name, module in model.named_modules():
    if isinstance(module, LoraLayer):
        module = module.to(torch.bfloat16)
    if 'norm' in name:
        module = module.to(torch.float32)
    if 'lm_head' in name or 'embed_tokens' in name:
        if hasattr(module, 'weight'):
            if module.weight.dtype == torch.float32:
                module = module.to(torch.bfloat16)

## 8. Prepare the Dataset

We will download and preprocess a Turkish vision-language dataset.

What happens here:
- Download the dataset from Hugging Face
- Preprocess images and captions into the format required by LLaVA
- Save the processed data for training

**Tip:** You can use your own dataset by changing the dataset path and preprocessing logic. Make sure to split your data into training and validation sets for best results.
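
A train/validation split of LLaVA-style records can be sketched as follows. The helper name `train_val_split` and the toy records are made up for illustration; how you feed the validation portion into evaluation depends on your own setup.

```python
import random


def train_val_split(records, val_fraction=0.1, seed=42):
    """Shuffle a copy of the records and split off a validation portion."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# toy records with the same keys as the entries written to data.json
records = [{"id": i, "image": f"{i}.jpg", "conversations": []} for i in range(20)]
train, val = train_val_split(records)
print(len(train), len(val))  # 18 2
```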

In [None]:
import json
from tqdm import tqdm
from pathlib import Path
from datasets import load_dataset
from typing import Union


def prepare_dataset_path(dataset_dir: Union[str, Path]):
    # Create the dataset folder and its images/ subfolder if they are missing
    (Path(dataset_dir) / "images").mkdir(parents=True, exist_ok=True)


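def preprocess(batch: dict, batch_size: int, user_prompt: str):
    # `batch` is a dict of column lists. The last batch can hold fewer than
    # `batch_size` rows, so iterate over the rows actually present.
    # (`batch_size` is kept in the signature so the map() call below still works.)
    num_rows = len(batch["imgid"])
    batch["json"] = []
    for i in range(num_rows):
        img = f"{batch['imgid'][i]}.jpg"
        # JPEG has no alpha channel, so convert to RGB before saving
        batch["image"][i].convert("RGB").save(f"{dataset_dir}/images/{img}", format="JPEG")
        batch["json"].append(
            {
                "id": batch["imgid"][i],
                "image": img,
                "conversations": [
                    {"from": "human", "value": f"<image>\n{user_prompt}"},
                    {"from": "gpt", "value": batch["detailed_caption"][i]},
                ],
            },
        )
    return batch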

batch_size = 1000
dataset_dir = "llava_dataset"  # must match the dataset_dir used in section 3
user_prompt = "Görüntüyü detaylı olarak açıkla."

# create the dataset folders before any images are written
prepare_dataset_path(dataset_dir)
ds = load_dataset("atasoglu/flickr8k-turkish-detailed-captions", split="train")
ds = ds.map(
    preprocess,
    batched=True,
    batch_size=batch_size,
    fn_kwargs=dict(batch_size=batch_size, user_prompt=user_prompt),
)
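# ensure_ascii=False keeps Turkish characters readable, so write the file as UTF-8
with open(f"{dataset_dir}/data.json", "w", encoding="utf-8") as f:
    json.dump(list(ds["json"]), f, indent=2, ensure_ascii=False)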

## 9. Fix End-of-Sequence (EOS) Token Issue

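With the Llama-3 conversation template defined in section 3, the training text needs the `<|eot_id|>` token between turns and at the end of the conversation. Here, we patch LLaVA's `_add_speaker_and_signal` function so that each turn is prefixed with `<|eot_id|>` and its role header, and the whole conversation ends with `<|eot_id|>`.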

**Note:** This is a temporary workaround. If the library updates, this patch may break. Always check for official fixes or updates.

In [None]:
import functools
import llava.train.train as llava_train

@functools.wraps(llava_train._add_speaker_and_signal)
def patched_fn(header, source, get_conversation=True):
    EOS_TOKEN = "<|eot_id|>" # for Llama-3
    conversation = header
    for sentence in source:
        from_str = sentence["from"]
        if from_str.lower() == "human":
            from_str = conv.default_conversation.roles[0]
        elif from_str.lower() == "gpt":
            from_str = conv.default_conversation.roles[1]
        else:
            from_str = 'unknown'
        sentence["value"] = (EOS_TOKEN + from_str + sentence["value"])
        if get_conversation:
            conversation += sentence["value"]
    conversation += EOS_TOKEN
    return conversation

llava_train._add_speaker_and_signal = patched_fn
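
To see what text the patch produces, here is a standalone re-implementation of the same string logic that runs without LLaVA installed. The sample caption is made up, and the header is assumed to be the system string from section 3 (the library may add extra whitespace around it).

```python
EOT = "<|eot_id|>"
ROLES = {
    "human": "<|start_header_id|>user<|end_header_id|>\n\n",
    "gpt": "<|start_header_id|>assistant<|end_header_id|>\n\n",
}


def render(header, turns):
    """Join turns as: header, then (EOT + role header + text) per turn, then EOT."""
    text = header
    for turn in turns:
        text += EOT + ROLES[turn["from"]] + turn["value"]
    return text + EOT


header = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Sen yardımsever bir asistansın."
)
turns = [
    {"from": "human", "value": "<image>\nGörüntüyü detaylı olarak açıkla."},
    {"from": "gpt", "value": "Bir köpek çimenlerde koşuyor."},  # made-up caption
]
print(render(header, turns))
```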

## 10. Prepare the Data Module

We create a data module that handles tokenization and batching for training.

Why?
- Data modules make it easy to manage datasets and data loaders.
- Ensures consistent preprocessing and batching.

In [None]:
from llava.train.train import make_supervised_data_module

data_module = make_supervised_data_module(
    tokenizer=tokenizer,
    data_args=data_args,
)

## 11. Inspect Tokenized Input (Optional)

Let's look at a sample of the tokenized input to make sure everything is working as expected.

**Tip:** If the decoded text looks strange, check your preprocessing and tokenizer settings.

In [None]:
from llava.constants import IMAGE_TOKEN_INDEX
example_text_input = data_module["train_dataset"][0]["input_ids"].unsqueeze(0)
# skip the image placeholder token: it is not a real vocabulary id and cannot be decoded
decoded = tokenizer.decode(example_text_input[example_text_input != IMAGE_TOKEN_INDEX])
print(decoded)

## 12. Start Training!

Now we're ready to train the model.

Tips for training:
- Monitor GPU memory usage and training loss.
- Adjust batch size or gradient accumulation steps if you run out of memory.
- Training can take several hours depending on your hardware and dataset size.
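
When trading batch size against gradient accumulation, it helps to know how many samples a run actually processes. The sketch below uses the values from the training arguments in section 3; the `num_samples` value is a placeholder, so substitute the real size of your dataset (for example `len(ds)`). The function name is hypothetical.

```python
def training_footprint(per_device_batch, grad_accum, max_steps, num_samples, num_gpus=1):
    """Return (effective batch size, samples seen, approximate epochs)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    samples_seen = effective_batch * max_steps
    return effective_batch, samples_seen, samples_seen / num_samples


# num_samples=8000 is a placeholder, not the measured dataset size
eff, seen, epochs = training_footprint(4, 4, 1000, num_samples=8000)
print(f"effective batch size: {eff}")
print(f"samples seen:         {seen}")
print(f"approx. epochs:       {epochs:.2f}")
```

Halving `per_device_train_batch_size` while doubling `gradient_accumulation_steps` keeps the effective batch size unchanged and lowers peak memory.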

**Advanced:** For best results, use a validation set and monitor validation loss to avoid overfitting.

In [None]:
from llava.train.llava_trainer import LLaVATrainer

trainer = LLaVATrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    **data_module,
)
trainer.train()

## 13. Save and Share Your Model

After training, you can save your model and push it to the Hugging Face Hub for sharing and future use.

What is the Hugging Face Hub?
- A platform for sharing models, datasets, and demos.
- You can create a free account at https://huggingface.co

**Note:** You will be prompted to log in to your Hugging Face account. Make sure to save all components: model, tokenizer, and image processor.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from peft import PeftModel

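# Reload the base model in half precision (not 4-bit) so the adapter can be merged.
# PeftModel.from_pretrained expects a model object, not a model path.
_, base_model, _, _ = load_pretrained_model(model_path, None, "llava_llama")
# output_dir must contain the saved adapter; if not, point to a checkpoint subfolder
model = PeftModel.from_pretrained(base_model, output_dir)
model = model.merge_and_unload().bfloat16()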

In [None]:
merged_path = "merged_output"
repo_id = "atasoglu/Turkish-LLaVA-v0.1-ft"
push_to_hub = True

# Save (and push to hub) the model
model.save_pretrained(
    merged_path,
    repo_id=repo_id,
    push_to_hub=push_to_hub,
)
tokenizer.save_pretrained(
    merged_path,
    repo_id=repo_id,
    push_to_hub=push_to_hub,
)
image_processor.save_pretrained(
    merged_path,
    repo_id=repo_id,
    push_to_hub=push_to_hub,
)

## 14. Additional Tips and Best Practices

- Set random seeds for reproducibility (see torch.manual_seed, numpy, etc.).
- Use a validation split to monitor overfitting.
- Visualize training progress (e.g., with TensorBoard).
- Check hardware compatibility (CUDA, bfloat16 support, etc.).
- Consult official documentation for LLaVA, Hugging Face, and PEFT for updates and troubleshooting.
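
As a starting point for the reproducibility tip, here is a minimal seeding helper. It is a sketch that seeds NumPy and PyTorch only when they are installed; the `transformers` library also ships its own `set_seed` utility that you may prefer.

```python
import os
import random


def set_seed(seed: int) -> None:
    """Seed Python, and NumPy / PyTorch when they are available."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


set_seed(42)
```

Call it once at the top of the notebook, before the model and dataset are created.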

Happy fine-tuning! 🚀