# 🚀 Vision-LLM Zero to Hero: Optimized Fine-Tuning (QLoRA)

## Introduction
This notebook aims to be the **"Best Ever"** solution for training a Vision-LLM (such as **Qwen-VL** or **BLIP-2**) on the RAF-CE dataset without needing a supercomputer.

### Why this approach?
1.  **Vision-LLM (SOTA)**: Unlike classic CNNs (ResNet) or ViTs, this model *understands* the image and can explain why it sees a compound emotion.
2.  **QLoRA (4-bit Quantization)**: We load the model in **4-bit** precision (aggressive compression) and train only small adapters (**LoRA**).
    *   *Result*: Training becomes feasible on a single modest GPU (T4/L4/A10) in a few hours instead of days.
3.  **Professional Quality**: Modular code, error handling, and a robust data pipeline.

---

## 1. Setup "Zero Config" 🛠️
Automatic installation of all the required libraries (bitsandbytes, peft, transformers).

In [None]:
import os
import sys
import subprocess

def install_dependencies():
    packages = [
        "torch torchvision torchaudio",
        "transformers>=4.37.0",
        "peft",
        "bitsandbytes",
        "accelerate",
        "datasets",
        "pillow",
        "scikit-learn",
        "scipy",
        "tensorboard"
    ]
    print("⚡ Installing optimized libraries for QLoRA...")
    for package in packages:
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + package.split())
        except Exception as e:
            print(f"⚠️ Warning installing {package}: {e}")
    print("✅ Environment Ready!")

# Uncomment to run installation
# install_dependencies()
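
Before running the installer, it can help to see which libraries are already importable. The sketch below is one way to do that with the standard library only. The helper name `missing_packages` and the pip-name-to-module mapping are assumptions made for illustration, not part of the original notebook.

```python
import importlib.util

# Hypothetical mapping from pip package name to the module it installs.
# Extend it to match the package list you actually use.
IMPORT_NAMES = {
    "torch": "torch",
    "transformers": "transformers",
    "peft": "peft",
    "bitsandbytes": "bitsandbytes",
    "accelerate": "accelerate",
    "datasets": "datasets",
    "pillow": "PIL",
    "scikit-learn": "sklearn",
}

def missing_packages(import_names=IMPORT_NAMES):
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module_name in import_names.items()
        if importlib.util.find_spec(module_name) is None
    ]

missing = missing_packages()
print("Missing:", ", ".join(missing) if missing else "none")
```

If the list is empty, the installation cell can stay commented out.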

In [None]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    AutoProcessor
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
    TaskType
)
from datasets import Dataset
from PIL import Image
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Use CUDA if available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"🚀 Using device: {device}")

## 2. Model Configuration (The "Smart" Part) 🧠
We will use **Qwen-VL-Chat** (or an alternative such as BLIP-2). Qwen-VL is currently one of the best open-source models for precise visual understanding.

**QLoRA magic**:
*   `load_in_4bit=True`: Cuts the memory needed for the weights by roughly 4x compared with 16-bit loading.
*   `bnb_4bit_quant_type="nf4"`: NormalFloat4, a 4-bit data type built for normally distributed weights, which limits the precision lost to quantization.
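
To make the "roughly 4x" saving concrete, here is a back-of-the-envelope sketch. The parameter count is an illustrative assumption, not the exact size of any checkpoint, and the estimate covers weights only: quantization constants, activations, LoRA weights, and optimizer state all add to the real footprint.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7e9  # illustrative assumption, not the size of a specific model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {nf4_gb:.1f} GB ({fp16_gb / nf4_gb:.0f}x smaller)")
```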


In [None]:
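MODEL_ID = "Qwen/Qwen-VL-Chat"  # Unquantized checkpoint; bitsandbytes quantizes it to 4-bit at load time
# Note: "Qwen/Qwen-VL-Chat-Int4" is already GPTQ-quantized, so it should not be combined with BitsAndBytesConfig.
# Alternative if Qwen is too heavy: "Salesforce/blip2-opt-2.7b"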

def load_model_and_processor():
    print(f"🔄 Loading {MODEL_ID}...")
    
    # 1. Load Tokenizer & Processor
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    
    # 2. 4-Bit Configuration (QLoRA)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16, # Use float16 for speed
        bnb_4bit_use_double_quant=True,
    )
    
    # 3. Load Base Model
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True
    )
    
    # 4. Prepare for Training
    model = prepare_model_for_kbit_training(model)
    
    # 5. Apply LoRA Adapters
    peft_config = LoraConfig(
        r=16,               # Rank (higher = more adapter capacity, but more memory and compute)
        lora_alpha=32,      # Scaling factor
        target_modules=["c_attn", "attn.c_proj", "w1", "w2"], # Targeted Linear Layers
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )
    
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters() # Show how efficient we are!
    
    return model, processor, tokenizer

# model, processor, tokenizer = load_model_and_processor()
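
To get a feel for what `print_trainable_parameters()` reports, the arithmetic for a single adapted layer can be sketched by hand. LoRA adds two small matrices, A of shape (r x d_in) and B of shape (d_out x r), next to a frozen weight of shape (d_out x d_in). The layer size below is an illustrative assumption; `r` and `lora_alpha` are taken from the `LoraConfig` in this section.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

d_in = d_out = 4096      # illustrative layer size (assumption)
r, lora_alpha = 16, 32   # values from the LoraConfig in this section

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)

print(f"frozen weights : {full:,}")
print(f"LoRA parameters: {added:,} ({100 * added / full:.2f}% of the layer)")
print(f"update scaling : lora_alpha / r = {lora_alpha / r}")
```

Doubling `r` doubles the adapter size, which is the trade-off behind the rank setting.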

## 3. Smart Data Loading 📂
We turn the images and labels into **conversations**.

**Vision-LLM dataset format**:
*   **User**: `<Image> Analyze the facial expression. What is the compound emotion?`
*   **Assistant**: `The person is Happily Surprised. Facial cues: raised eyebrows, smiling mouth.`
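
As a minimal sketch of this format, the helper below turns one labeled image into a user/assistant pair and then into a single training string. The function names and the example file path are hypothetical; the `<img>...</img>` tag and the `User:` / `Assistant:` layout follow the prompt used in this notebook.

```python
def build_conversation(image_path: str, label_text: str, explanation: str) -> dict:
    """Build one user/assistant exchange for a labeled face image."""
    user = (
        f"<img>{image_path}</img> Analyze the facial expression. "
        "What is the compound emotion?"
    )
    assistant = f"The emotion is {label_text}. {explanation}"
    return {"user": user, "assistant": assistant}

def to_training_text(conversation: dict) -> str:
    """Flatten the exchange into the single string fed to the tokenizer."""
    return f"User: {conversation['user']}\nAssistant: {conversation['assistant']}"

conv = build_conversation(
    "faces/0001.jpg",  # hypothetical path
    "Happily Surprised",
    "Facial cues: raised eyebrows, smiling mouth.",
)
print(to_training_text(conv))
```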

In [None]:
class RAFCE_LLM_Dataset(torch.utils.data.Dataset):
    def __init__(self, data_list, processor, tokenizer):
        self.data = data_list
        self.processor = processor
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        img_path = item["path"]
        label_text = item["label_text"] # e.g., "Happily Surprised"
        explanation = item.get("explanation", "Facial features align with this emotion.") # Add logic here if you have AU data

        # 1. Create Prompt
        prompt = f"User: <img>{img_path}</img> Analyze the facial expression. What is the compound emotion?\nAssistant: The emotion is {label_text}. {explanation}<|endoftext|>"
        
        # 2. Process Image & Text using Qwen's specific method
        # Note: This part depends highly on the specific model's API (Qwen vs BLIP)
        # Here is a generic wrapper for Qwen-VL
        
        inputs = self.processor(
            text=[prompt],
            images=None, # Qwen handles image path inside text
            return_tensors="pt",
            padding="max_length",
            max_length=512,
            truncation=True
        )
        
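        input_ids = inputs["input_ids"].squeeze(0)
        attention_mask = inputs["attention_mask"].squeeze(0)

        # Causal LM training: labels mirror input_ids, but padded positions
        # are set to -100 so the loss ignores them.
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels,
        }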
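
For causal LM training, the labels are a copy of `input_ids`, and any position set to -100 is skipped by PyTorch's cross-entropy loss. The sketch below shows the idea on plain Python lists so it runs without a GPU. The `prompt_len` option, which also hides the prompt tokens so that only the assistant's answer is scored, is an optional variation and not something the original notebook does.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def mask_labels(input_ids, attention_mask, prompt_len=0):
    """Copy input_ids into labels, hiding padding and (optionally) the prompt."""
    labels = list(input_ids)
    for i, keep in enumerate(attention_mask):
        if keep == 0 or i < prompt_len:
            labels[i] = IGNORE_INDEX
    return labels

ids = [11, 12, 13, 14, 15, 0, 0]   # toy token ids, last two are padding
mask = [1, 1, 1, 1, 1, 0, 0]

print(mask_labels(ids, mask))                # padding hidden
print(mask_labels(ids, mask, prompt_len=3))  # prompt tokens hidden as well
```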

## 4. Optimized Training Loop 🔥
We use `transformers.Trainer` with settings chosen to avoid wasting time.
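
A quick way to sanity-check the batch and checkpoint settings is to work out the effective batch size and the approximate number of optimizer steps. The dataset size below is a made-up figure for illustration, and the `Trainer`'s own rounding may differ by a step or so.

```python
import math

def training_schedule(num_samples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Return (effective batch size, steps per epoch, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

eff, per_epoch, total = training_schedule(
    num_samples=4000,     # hypothetical dataset size
    per_device_batch=4,   # per_device_train_batch_size
    grad_accum=8,         # gradient_accumulation_steps
    epochs=3,             # num_train_epochs
)
print(f"effective batch: {eff}, steps/epoch: {per_epoch}, total steps: {total}")
```

With numbers like these, saving and evaluating every 100 steps yields only a handful of checkpoints, so `save_steps` and `eval_steps` may need lowering on small datasets.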

In [None]:
from transformers import TrainingArguments, Trainer

def run_training(model, train_dataset, val_dataset):
    training_args = TrainingArguments(
        output_dir="./qwen_rafce_finetuned",
        per_device_train_batch_size=4,    # Low batch size because model is huge
        gradient_accumulation_steps=8,    # Accumulate gradients to simulate batch_size=32
        num_train_epochs=3,               # Vision-LLMs learn FAST. 3 epochs is often enough.
        learning_rate=2e-4,               # LoRA allows higher LR than full fine-tuning
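        bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),      # bfloat16 where the GPU supports it (Ampere or newer)
        fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),  # otherwise fall back to float16 (e.g. T4)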
        logging_steps=10,
        save_steps=100,
        save_total_limit=2,
        evaluation_strategy="steps",      # Note: renamed to eval_strategy in newer transformers releases
        eval_steps=100,
        report_to="tensorboard",
        remove_unused_columns=False       # Important for custom multimodal datasets
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        # DataCollator would be needed here for complex padding
    )
    
    print("🔥 Starting QLoRA Fine-Tuning...")
    trainer.train()
    
    print("✅ Training Complete. Saving Adapters...")
    model.save_pretrained("./best_vision_llm_adapter")

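# To run:
# 1. Prepare data_list from RAFCE files (same logic as before but with text labels)
#    and split it into train_list / val_list
# 2. train_dataset = RAFCE_LLM_Dataset(train_list, processor, tokenizer)
#    val_dataset = RAFCE_LLM_Dataset(val_list, processor, tokenizer)
# 3. run_training(model, train_dataset, val_dataset)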
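
Step 1 can be sketched as follows, using only the standard library. The two-column CSV layout, the file paths, and the second label name are assumptions for illustration; they are not the actual RAF-CE distribution format. The record keys (`path`, `label_text`) match what `RAFCE_LLM_Dataset` reads.

```python
import csv
import io
import random

# Hypothetical label listing; replace with the real RAF-CE annotations.
SAMPLE_CSV = """path,label_text
faces/0001.jpg,Happily Surprised
faces/0002.jpg,Sadly Angry
faces/0003.jpg,Happily Surprised
faces/0004.jpg,Sadly Angry
faces/0005.jpg,Happily Surprised
"""

def load_records(csv_text):
    """Parse the listing into the dict records the dataset class expects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{"path": row["path"], "label_text": row["label_text"]} for row in reader]

def split_records(records, val_fraction=0.2, seed=42):
    """Shuffle reproducibly and hold out a validation slice."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

data_list = load_records(SAMPLE_CSV)
train_list, val_list = split_records(data_list)
print(len(train_list), "train /", len(val_list), "val")
```

The `train_test_split` function imported earlier from scikit-learn would serve the same purpose and also supports stratifying by label.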

## 5. Inference & Demo (The "Wow" Factor) ✨
A simple function to test the trained model on a new image.

In [None]:
def predict_emotion(model, tokenizer, processor, image_path):
    prompt = f"User: <img>{image_path}</img> Analyze this face. What is the emotion?\nAssistant:"
    
    # With device_map="auto" the model chooses its own device, so follow it.
    inputs = processor(text=[prompt], return_tensors="pt").to(model.device)
    
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=False,  # Greedy decoding: deterministic for evaluation
    )
    
    output_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return output_text.split("Assistant:")[-1].strip()

# demo_img = "test_image.jpg"
# print(predict_emotion(model, tokenizer, processor, demo_img))
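
Because the model answers in free text, scoring it against ground truth needs a small parsing step. The sketch below is one possible approach: it looks for the earliest known label mentioned in the answer and then computes plain accuracy. The helper names and the label list are illustrative assumptions; use the full set of RAF-CE class names in practice.

```python
# Illustrative subset of labels (assumption); replace with the full class list.
KNOWN_LABELS = ["Happily Surprised", "Sadly Angry"]

def extract_label(output_text, known_labels=KNOWN_LABELS):
    """Return the known label that appears earliest in the text, or None."""
    lowered = output_text.lower()
    hits = [
        (lowered.find(label.lower()), label)
        for label in known_labels
        if label.lower() in lowered
    ]
    return min(hits)[1] if hits else None

def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference label."""
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

outputs = [
    "The emotion is Happily Surprised. Raised eyebrows and a smile.",
    "I cannot tell from this image.",
]
preds = [extract_label(text) for text in outputs]
print(preds, accuracy(preds, ["Happily Surprised", "Sadly Angry"]))
```

Answers that mention no known label come back as `None` and count as errors, which makes parsing failures visible in the score.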