# 4. Training Qwen-VL (VLM Embedding)

This notebook demonstrates how to fine-tune a large vision-language model (Qwen-VL, here `Qwen3-VL-Embedding-2B`) for embedding tasks.
Expect it to need significantly more GPU memory than CLIP training.

Ensure you have run `01_setup_and_data.ipynb` first.

In [None]:
import os

# Ensure we are in the project root
if os.path.exists("vembed-factory"):
    os.chdir("vembed-factory")
elif os.getcwd().endswith("notebooks"):
    os.chdir("..")

print(f"Working Directory: {os.getcwd()}")

In [None]:
# Configure data paths: use Flickr30k if it is present,
# otherwise fall back to the dummy dataset.
if os.path.exists("data/flickr30k/train.jsonl"):
    DATA_PATH = "data/flickr30k/train.jsonl"
    IMAGE_ROOT = "data/flickr30k"
    VAL_DATA_PATH = "data/flickr30k/val.jsonl"
else:
    DATA_PATH = "data/dummy/train.jsonl"
    IMAGE_ROOT = "data/dummy"
    VAL_DATA_PATH = ""  # empty path: the evaluation cell below is skipped
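
Before training, it can help to confirm that the selected file parses as one JSON object per line. The helper below is only a sketch: `peek_jsonl` is a hypothetical name, not part of vembed-factory, and it makes no assumption about which fields the records contain.

```python
import json


def peek_jsonl(path, n=2):
    """Return the first n non-empty records of a JSON Lines file as Python objects."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            records.append(json.loads(line))
            if len(records) >= n:
                break
    return records
```

In the notebook, `peek_jsonl(DATA_PATH)` would return the first two training records, which is usually enough to check the field names against what the config expects.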

## Configuration

- **Model**: `Qwen3-VL-Embedding-2B`
- **Config**: Uses `examples/qwen3_2b_train.yaml` (standardized)
- **Method**: LoRA + FlashAttention + Gradient Cache + MRL
- **Memory Optimization**: Gradient Cache is enabled, with chunk sizes set in the config
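
MRL in the method list is read here as Matryoshka Representation Learning (an assumption; the config file is the authority), where the leading dimensions of an embedding are trained to remain usable on their own. The pure-Python sketch below shows how such an embedding is typically consumed: truncate to a prefix, re-normalize, and compare. The vectors and the `dims` values are made up for illustration and are not taken from `qwen3_2b_train.yaml`.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in prefix]


def cosine_at_dims(a, b, dims):
    """Cosine similarity of a and b at each nested prefix length in dims."""
    scores = {}
    for d in dims:
        ta = truncate_and_normalize(a, d)
        tb = truncate_and_normalize(b, d)
        scores[d] = sum(x * y for x, y in zip(ta, tb))
    return scores


# Illustrative 8-dimensional "embeddings" (not real model outputs).
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, 0.1, -0.1]
image = [0.8, 0.2, 0.25, -0.1, 0.0, 0.1, 0.05, -0.2]

scores = cosine_at_dims(query, image, dims=(2, 4, 8))
for d, s in scores.items():
    print(f"dim={d}: cosine={s:.3f}")
```

Shorter prefixes trade some retrieval quality for smaller indexes and faster search; how much quality is lost depends on the trained model.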

In [None]:
# Train Qwen3-VL using the standardized config.
# Gradient Cache settings come from qwen3_2b_train.yaml; the overrides below
# only change the output directory, epoch count, and batch size.

!python run.py examples/qwen3_2b_train.yaml \
    --data_path "$DATA_PATH" \
    --val_data_path "$VAL_DATA_PATH" \
    --image_root "$IMAGE_ROOT" \
    --config_override \
        output_dir=experiments/output_qwen_vl \
        epochs=1 \
        batch_size=2
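
The `!` prefix and `$VAR` interpolation in the cell above are IPython features. Outside a notebook, roughly the same invocation could be assembled as an argument list. This is a sketch that assumes the flags behave exactly as in the cell above; `build_train_command` is a hypothetical helper, not part of the project.

```python
import shlex
import sys


def build_train_command(data_path, val_data_path, image_root,
                        output_dir="experiments/output_qwen_vl",
                        epochs=1, batch_size=2):
    """Build the argv list for the training run shown in the notebook cell."""
    return [
        sys.executable, "run.py", "examples/qwen3_2b_train.yaml",
        "--data_path", data_path,
        "--val_data_path", val_data_path,  # may be an empty string
        "--image_root", image_root,
        "--config_override",
        f"output_dir={output_dir}",
        f"epochs={epochs}",
        f"batch_size={batch_size}",
    ]


cmd = build_train_command("data/dummy/train.jsonl", "", "data/dummy")
print(shlex.join(cmd))  # shell-quoted form, for inspection only

# To launch training from the project root:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list avoids shell quoting problems, which matters here because the validation path can be an empty string.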

## Evaluation

In [None]:
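# Evaluate only when a validation file exists (VAL_DATA_PATH is empty for the dummy data).
# The checkpoint path assumes the epochs=1 training run above.
if os.path.exists(VAL_DATA_PATH):
    !python benchmark/run.py flickr30k \
        --model_path experiments/output_qwen_vl/checkpoint-epoch-1 \
        --flickr_root "$IMAGE_ROOT" \
        --output_dir experiments/eval_results_qwen \
        --batch_size 16 \
        --encoder_mode qwen3_vl \
        --flickr_split val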