# QLoRA Fine-Tuning

**Introduction**:

The purpose of this notebook is to perform fine-tuning on the base Qwen-2.5-VL-7B-Instruct model from HuggingFace in order to improve the accuracy of W2 data extraction from the base model. 

Steps: 
  - Bootstrap the environment
  - Load prompts and perform additional dataset setup for fine-tuning
  - Load model and QLoRA configuration
  - Prepare training and evaluation collation functions, as well as data loaders
  - Prepare code for model training
  - Fine-tune the model and save
  - Re-execute inference testing and save the results in the `reports/finetuned` directory

See the README for a detailed discussion of project setup steps, background, and measured performance improvement from the base model. 

# Bootstrap Environment

In [4]:
# Set working directory
import os
os.environ["APP_PROJECT_DIR"] = "/content/ai-image-to-text"  # override with project directory
os.chdir(os.environ["APP_PROJECT_DIR"])

# Install packages and bootstrap environment
%pip install -q python-dotenv
from src.utils.env_setup import setup_environment
env = setup_environment()
%pip install -q -r requirements-{env}.txt

Loaded application properties from: /content/ai-image-to-text/.env.colab
Working directory: /content/ai-image-to-text
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
# Import libraries
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from huggingface_hub import login as hf_login
import torch
import pandas as pd
from torch.utils.data import Dataset
import os
import json
from src.utils import data_loader
from src.model import evaluator, qwen_vl_model_adapter, reporting
from src.model.executor import Executor
from PIL import Image
from peft import get_peft_model, LoraConfig, PeftModel
from transformers import BitsAndBytesConfig
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
from qwen_vl_utils import process_vision_info
import lightning as L
from torch.optim import AdamW
from lightning.pytorch.callbacks import ModelCheckpoint, Callback
from lightning.pytorch.callbacks.early_stopping import EarlyStopping


# Connect to huggingface
hf_login(os.environ["APP_HF_TOKEN"])

# File and directory paths
base_dir = os.environ["APP_PROJECT_DIR"]
datasets_dir = os.environ["APP_DATA_DIR"]
models_dir = os.environ["APP_MODELS_DIR"]
output_dir = os.environ["APP_OUTPUT_DIR"]
dataset_w2s_dir = f"{datasets_dir}/w2s"
dataset_processed_dir = f"{dataset_w2s_dir}/processed"
dataset_processed_final_dir = f"{dataset_processed_dir}/final"
output_results_dir = f"{output_dir}/finetuned"
output_results_file = f"{output_results_dir}/results.csv"
output_report_file = f"{output_results_dir}/results_report.txt"
output_report_ADP1_file = f"{output_results_dir}/results_report_ADP1.txt"
output_report_ADP2_file = f"{output_results_dir}/results_report_ADP2.txt"
output_report_IRS1_file = f"{output_results_dir}/results_report_IRS1.txt"
output_report_IRS2_file = f"{output_results_dir}/results_report_IRS2.txt"
system_prompt_file_path = f"{base_dir}/config/system_prompt.txt"
user_prompt_file_path = f"{base_dir}/config/user_prompt.txt"

# general constants
batch_size = 2
max_new_tokens = 256

# general constants
random_state = 42
max_new_tokens = 512
ignore_id = -100
model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model_fine_tuned_id = "qwen2.5-vl-7b-instruct-w2-finetuned"
use_qlora = True
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28

# training hypperparameters
max_epochs = 10
learning_rate = 5e-5
batch_size = 2
num_workers = 8
accumulate_grad_batches = 8
lora_alpha = 16
lora_dropout = 0.05
r = 8

In [None]:
# Load model
model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()  # Since this model will only be used for inference

# Load processor
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_id, min_pixels=min_pixels, max_pixels=max_pixels
)

# Prepare Datasets

In [None]:
# load prompts
system_prompt = file.read_text(system_prompt_file_path)
print(system_prompt)
user_prompt = file.read_text(user_prompt_file_path)
print(f"\n{user_prompt}")


class JSONLDataset(Dataset):
    def __init__(self, jsonl_file_path: str, image_directory_path: str):
        self.jsonl_file_path = jsonl_file_path
        self.image_directory_path = image_directory_path
        self.entries = self._load_entries()

    def _load_entries(self):
        entries = []
        with open(self.jsonl_file_path, "r") as file:
            for line in file:
                data = json.loads(line)
                entries.append(data)
        return entries

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx: int):
        if idx < 0 or idx >= len(self.entries):
            raise IndexError("Index out of range")

        entry = self.entries[idx]
        image_path = os.path.join(self.image_directory_path, entry["file_name"])
        image = Image.open(image_path)
        return (
            image,
            entry,
            qwen_vl_model_adapter.format_prompt(
                system_prompt, user_prompt, image_path, entry["text"]
            ),
        )

You are an expert in processing W-2 forms. Your task is to extract specific information from the 
authoritative W-2 form in the provided image and present it in a structured JSON object. If the 
image contains multiple forms, the authoritative form is always located in the upper left portion 
of the image. Extract data only from this form, ignoring any duplicates. Use the standard box numbers 
to locate the fields: 

    - Employee Name (Box e),
    - Employer Name (Box c), 
    - Wages and Tips (Box 1), 
    - Federal Income Tax Withheld (Box 2), 
    - Social Security Wages (Box 3), 
    - Medicare Wages and Tips (Box 5), 
    - State (Box 15)
    - State Wages (Box 16)
    - State Income Tax Withheld (Box 17)

For state information, multiple states may be listed. Do not use information from Boxes c or e/f or 
any other areas of the image for state data. 

If a field is missing or blank, use an empty string as the value. Return only the completed JSON object 
without additional comme

In [None]:
train_dataset = JSONLDataset(
    jsonl_file_path=f"{dataset_processed_final_dir}/train/metadata.jsonl",
    image_directory_path=f"{dataset_processed_final_dir}/train",
)
val_dataset = JSONLDataset(
    jsonl_file_path=f"{dataset_processed_final_dir}/val/metadata.jsonl",
    image_directory_path=f"{dataset_processed_final_dir}/val",
)
test_dataset = JSONLDataset(
    jsonl_file_path=f"{dataset_processed_final_dir}/test/metadata.jsonl",
    image_directory_path=f"{dataset_processed_final_dir}/test",
)

Now let's inspect an example.

In [None]:
train_dataset[0]

(<PIL.PngImagePlugin.PngImageFile image mode=RGB size=2550x3300>,
 {'file_name': 'W2_Multi_Sample_Data_input_ADP1_clean_15500.png',
  'text': '{"Employee Name": "Tara Wilson", "Employer Name": "Lewis Ltd and Sons", "Wages and Tips": "193488.36", "Federal Income Tax Withheld": "56204.70", "Social Security Wages": "144247.30", "Medicare Wages and Tips": "219880.52", "State 1": "WI", "State 1 Wages and Tips": "93120.37", "State 1 Income Tax Withheld": "6308.88", "State 2": "", "State 2 Wages and Tips": "", "State 2 Income Tax Withheld": ""}'},
 [{'role': 'system',
   'content': [{'type': 'text',
     'text': 'You are an expert in processing W-2 forms. Your task is to extract specific information from the \nauthoritative W-2 form in the provided image and present it in a structured JSON object. If the \nimage contains multiple forms, the authoritative form is always located in the upper left portion \nof the image. Extract data only from this form, ignoring any duplicates. Use the standard

# Model Loading and Configuration

Load the Qwen2.5-VL model with QLoRA (Quantized LoRA). LoRA injects small trainable weights (the “low-rank matrices”) into the model’s layers, saving significant memory and compute compared to fine-tuning the entire model. QLoRA further applies 4-bit quantization to reduce memory usage while still preserving much of the model’s performance.

In [None]:
lora_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=r,
    bias="none",
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

if use_qlora:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_type=torch.bfloat16,
    )

# Set float32 matmul precision to 'high' to take advantage of the high precision
# A100 Tensor Cores for matmul operations
torch.set_float32_matmul_precision("high")

# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config if use_qlora else None,
    torch_dtype=torch.bfloat16,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Load processor
processor = Qwen2_5_VLProcessor.from_pretrained(
    model_id, min_pixels=min_pixels, max_pixels=max_pixels
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


trainable params: 2,523,136 || all params: 8,294,689,792 || trainable%: 0.0304


# Data Collation and Tokenization

This uses a masking technique that replaces padding tokens, special image tokens, and tokens from system and user turns with -100 so that the loss function ignores them and only evaluates the assistant's response.

In [None]:
def train_collate_fn(batch):
    _, _, examples = zip(*batch)

    texts = [
        processor.apply_chat_template(example, tokenize=False) for example in examples
    ]
    image_inputs = [process_vision_info(example)[0] for example in examples]

    model_inputs = processor(
        text=texts, images=image_inputs, return_tensors="pt", padding=True
    )

    labels = model_inputs["input_ids"].clone()

    # mask system message and image token IDs in the labels
    for i, example in enumerate(examples):
        sysuser_conv = example[:-1]
        sysuser_text = processor.apply_chat_template(sysuser_conv, tokenize=False)
        sysuser_img, _ = process_vision_info(sysuser_conv)

        sysuser_inputs = processor(
            text=[sysuser_text],
            images=[sysuser_img],
            return_tensors="pt",
            padding=True,
        )

        sysuser_len = sysuser_inputs["input_ids"].shape[1]
        labels[i, :sysuser_len] = -100

    input_ids = model_inputs["input_ids"]
    attention_mask = model_inputs["attention_mask"]
    pixel_values = model_inputs["pixel_values"]
    image_grid_thw = model_inputs["image_grid_thw"]

    return input_ids, attention_mask, pixel_values, image_grid_thw, labels

Now separate out the ground-truth target text for later comparison. The line `examples = [e[:2] for e in examples]` drops the assistant’s section from the template -- only system and user prompts are provided. The model must generate the assistant output itself.

In [None]:
def evaluation_collate_fn(batch):
    _, data, examples = zip(*batch)
    responses = [d["text"] for d in data]

    # drop the assistant portion so the model must generate it
    examples = [e[:2] for e in examples]

    texts = [
        processor.apply_chat_template(example, tokenize=False) for example in examples
    ]
    image_inputs = [process_vision_info(example)[0] for example in examples]

    model_inputs = processor(
        text=texts, images=image_inputs, return_tensors="pt", padding=True
    )

    input_ids = model_inputs["input_ids"]
    attention_mask = model_inputs["attention_mask"]
    pixel_values = model_inputs["pixel_values"]
    image_grid_thw = model_inputs["image_grid_thw"]

    return input_ids, attention_mask, pixel_values, image_grid_thw, responses

In [None]:
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    collate_fn=train_collate_fn,
    num_workers=num_workers,
    shuffle=True,
)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    collate_fn=evaluation_collate_fn,
    num_workers=num_workers,
)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    collate_fn=evaluation_collate_fn,
    num_workers=num_workers,
)

Validate the training data loader.

In [None]:
input_ids, attention_mask, pixel_values, image_grid_thw, labels = next(
    iter(train_loader)
)
processor.batch_decode(input_ids)

['<|im_start|>system\nYou are an expert in processing W-2 forms. Your task is to extract specific information from the \nauthoritative W-2 form in the provided image and present it in a structured JSON object. If the \nimage contains multiple forms, the authoritative form is always located in the upper left portion \nof the image. Extract data only from this form, ignoring any duplicates. Use the standard box numbers \nto locate the fields: \n\n    - Employee Name (Box e),\n    - Employer Name (Box c), \n    - Wages and Tips (Box 1), \n    - Federal Income Tax Withheld (Box 2), \n    - Social Security Wages (Box 3), \n    - Medicare Wages and Tips (Box 5), \n    - State (Box 15)\n    - State Wages (Box 16)\n    - State Income Tax Withheld (Box 17)\n\nFor state information, multiple states may be listed. Do not use information from Boxes c or e/f or \nany other areas of the image for state data. \n\nIf a field is missing or blank, use an empty string as the value. Return only the comple

Validate the validation data loader. You'll note that the 'assistant' section is removed from the model input (as this represents the ground truth for that example).

In [None]:
input_ids, attention_mask, pixel_values, image_grid_thw, labels = next(iter(val_loader))
processor.batch_decode(input_ids)

['<|im_start|>system\nYou are an expert in processing W-2 forms. Your task is to extract specific information from the \nauthoritative W-2 form in the provided image and present it in a structured JSON object. If the \nimage contains multiple forms, the authoritative form is always located in the upper left portion \nof the image. Extract data only from this form, ignoring any duplicates. Use the standard box numbers \nto locate the fields: \n\n    - Employee Name (Box e),\n    - Employer Name (Box c), \n    - Wages and Tips (Box 1), \n    - Federal Income Tax Withheld (Box 2), \n    - Social Security Wages (Box 3), \n    - Medicare Wages and Tips (Box 5), \n    - State (Box 15)\n    - State Wages (Box 16)\n    - State Income Tax Withheld (Box 17)\n\nFor state information, multiple states may be listed. Do not use information from Boxes c or e/f or \nany other areas of the image for state data. \n\nIf a field is missing or blank, use an empty string as the value. Return only the comple

# Training

In [None]:
class Qwen2_5_Trainer(L.LightningModule):
    def __init__(self, config, processor, model):
        super().__init__()
        self.config = config
        self.processor = processor
        self.model = model

    def training_step(self, batch, batch_idx):
        input_ids, attention_mask, pixel_values, image_grid_thw, labels = batch
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values=pixel_values,
            image_grid_thw=image_grid_thw,
            labels=labels,
        )
        loss = outputs.loss
        self.log("train_loss", loss, prog_bar=True, logger=True)
        return loss

    def validation_step(self, batch, batch_idx, dataset_idx=0):
        input_ids, attention_mask, pixel_values, image_grid_thw, responses = batch
        generated_ids = self.model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values=pixel_values,
            image_grid_thw=image_grid_thw,
            max_new_tokens=self.config.get("max_new_tokens"),
        )
        generated_ids_trimmed = [
            out_ids[len(in_ids) :] for in_ids, out_ids in zip(input_ids, generated_ids)
        ]
        generated_responses = self.processor.batch_decode(
            generated_ids_trimmed,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False,
        )
        scores = []
        for generated_response, response in zip(generated_responses, responses):
            prediction_dict = qwen_vl_model_adapter.process_response(generated_response)
            ground_truth_dict = qwen_vl_model_adapter.process_response(response)
            accuracy = evaluator.evaluate_accuracy(ground_truth_dict, prediction_dict)
            scores.append(accuracy)
        score = sum(scores) / len(scores)
        self.log(
            "val_accuracy",
            score,
            prog_bar=True,
            logger=True,
            batch_size=self.config.get("batch_size"),
        )
        return scores

    def test_step(self, batch, batch_idx):
        input_ids, attention_mask, pixel_values, image_grid_thw, responses = batch
        generated_ids = self.model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values=pixel_values,
            image_grid_thw=image_grid_thw,
            max_new_tokens=self.config.get("max_new_tokens"),
        )
        generated_ids_trimmed = [
            out_ids[len(in_ids) :] for in_ids, out_ids in zip(input_ids, generated_ids)
        ]
        generated_responses = self.processor.batch_decode(
            generated_ids_trimmed,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False,
        )
        scores = []
        for generated_response, response in zip(generated_responses, responses):
            prediction_dict = qwen_vl_model_adapter.process_response(generated_response)
            ground_truth_dict = qwen_vl_model_adapter.process_response(response)
            accuracy = evaluator.evaluate_accuracy(ground_truth_dict, prediction_dict)
            scores.append(accuracy)
        score = sum(scores) / len(scores)
        self.log(
            "test_accuracy",
            score,
            prog_bar=True,
            logger=True,
            batch_size=self.config.get("batch_size"),
        )
        return scores

    def configure_optimizers(self):
        optimizer = AdamW(self.model.parameters(), lr=self.config.get("lr"))
        scheduler = {
            "scheduler": torch.optim.lr_scheduler.CosineAnnealingLR(
                optimizer, T_max=self.config.get("max_epochs")
            ),
            "interval": "epoch",
            "frequency": 1,
        }
        return {"optimizer": optimizer, "lr_scheduler": scheduler}

    def train_dataloader(self):
        return DataLoader(
            train_dataset,  # 800 samples
            batch_size=self.config.get("batch_size"),
            collate_fn=train_collate_fn,
            shuffle=True,
            num_workers=num_workers,
        )

    def val_dataloader(self):
        return DataLoader(
            val_dataset,  # 100 samples
            batch_size=self.config.get("batch_size"),
            collate_fn=evaluation_collate_fn,
            num_workers=num_workers,
        )

    def test_dataloader(self):
        return DataLoader(
            test_dataset,  # 100 samples
            batch_size=self.config.get("batch_size"),
            collate_fn=evaluation_collate_fn,
            num_workers=num_workers,
        )

    def on_train_end(self):
        """
        Save the model and processor after training completes.
        """
        # Define the save path
        final_model_path = os.path.join(self.config["result_path"], "final_model")

        # Create the directory if it doesn't exist
        os.makedirs(final_model_path, exist_ok=True)

        # Save the model and processor
        self.model.save_pretrained(final_model_path)
        self.processor.save_pretrained(final_model_path)

        # Print confirmation
        print(f"Model and processor saved to {final_model_path}")


# Adjusted configuration for 7B model with larger dataset
config = {
    "max_epochs": max_epochs,  # Increased for full training
    "batch_size": batch_size,  # Increased, adjust based on memory
    "lr": learning_rate,  # Kept as a reasonable starting point
    "check_val_every_n_epoch": 1,
    "gradient_clip_val": 1.0,
    "accumulate_grad_batches": accumulate_grad_batches,  # Adjusted to maintain effective batch size
    "num_nodes": 1,
    "warmup_steps": 100,  # Increased for larger dataset
    "result_path": f"{models_dir}/{model_fine_tuned_id}",
    "max_new_tokens": max_new_tokens,  # Added for generation consistency
}

model_module = Qwen2_5_Trainer(config, processor, model)

early_stopping_callback = EarlyStopping(
    monitor="val_accuracy", patience=5, verbose=False, mode="max"
)

checkpoint_callback = ModelCheckpoint(
    dirpath=config["result_path"],
    filename="best_model",
    save_top_k=1,
    monitor="val_accuracy",
    mode="max",
    save_last=True,
)


class SaveProcessorCallback(Callback):
    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        # Check if this is the best model being saved
        if trainer.checkpoint_callback.best_model_path:
            save_path = os.path.dirname(trainer.checkpoint_callback.best_model_path)
            pl_module.processor.save_pretrained(save_path)
            print(f"Processor saved to {save_path}")


trainer = L.Trainer(
    accelerator="gpu",
    devices=[0],
    max_epochs=config.get("max_epochs"),
    accumulate_grad_batches=config.get("accumulate_grad_batches"),
    check_val_every_n_epoch=config.get("check_val_every_n_epoch"),
    gradient_clip_val=config.get("gradient_clip_val"),
    num_sanity_val_steps=0,
    log_every_n_steps=50,  # Increased for less frequent logging
    precision="bf16",
    callbacks=[checkpoint_callback, early_stopping_callback, SaveProcessorCallback()],
)

/usr/local/lib/python3.11/dist-packages/lightning/fabric/connector.py:572: `precision=bf16` is supported for historical reasons but its usage is discouraged. Please set your precision to bf16-mixed instead!
INFO: Using bfloat16 Automatic Mixed Precision (AMP)
INFO:lightning.pytorch.utilities.rank_zero:Using bfloat16 Automatic Mixed Precision (AMP)
INFO: GPU available: True (cuda), used: True
INFO:lightning.pytorch.utilities.rank_zero:GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO:lightning.pytorch.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO:lightning.pytorch.utilities.rank_zero:HPU available: False, using: 0 HPUs


In [None]:
# Train the model
trainer.fit(model_module)

# Test the model
trainer.test(model_module)

INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:lightning.pytorch.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name  | Type                 | Params | Mode 
-------------------------------------------------------
0 | model | PeftModelForCausalLM | 4.7 B  | train
-------------------------------------------------------
2.5 M     Trainable params
4.7 B     Non-trainable params
4.7 B     Total params
18,779.124Total estimated model params size (MB)
562       Modules in train mode
762       Modules in eval mode
INFO:lightning.pytorch.callbacks.model_summary:
  | Name  | Type                 | Params | Mode 
-------------------------------------------------------
0 | model | PeftModelForCausalLM | 4.7 B  | train
-------------------------------------------------------
2.5 M     Trainable params
4.7 B     Non-trainable params
4.7 B     Total params
18,779.124Total estimated model params size (MB)
562       Modules in train mode
762       Modules in eval mode


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned
Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned


Validation: |          | 0/? [00:00<?, ?it/s]

Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned
Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned


Validation: |          | 0/? [00:00<?, ?it/s]

Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned
Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned


Validation: |          | 0/? [00:00<?, ?it/s]

Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned
Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned


Validation: |          | 0/? [00:00<?, ?it/s]

Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned
Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned


Validation: |          | 0/? [00:00<?, ?it/s]

Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned
Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned


Validation: |          | 0/? [00:00<?, ?it/s]

Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned
Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned


Validation: |          | 0/? [00:00<?, ?it/s]

Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned


Validation: |          | 0/? [00:00<?, ?it/s]

Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned
Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned


Validation: |          | 0/? [00:00<?, ?it/s]

Processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned


INFO: `Trainer.fit` stopped: `max_epochs=10` reached.
INFO:lightning.pytorch.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=10` reached.


Model and processor saved to /content/drive/MyDrive/.models/qwen2.5-vl-7b-instruct-w2-finetuned/final_model


INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:lightning.pytorch.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: |          | 0/? [00:00<?, ?it/s]

[{'test_accuracy': 0.9708333015441895}]

# Test Fine-Tuned Model

Load fine-tuned model and processor.

In [6]:
# Load processor
final_model_path = os.path.join(models_dir, model_fine_tuned_id, "final_model")
processor = Qwen2_5_VLProcessor.from_pretrained(final_model_path)

# Load the base model with quantization (QLoRA settings)
if use_qlora:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_type=torch.bfloat16,
    )
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config if use_qlora else None,
    torch_dtype=torch.bfloat16,
)

# Load the fine-tuned LoRA weights
model = PeftModel.from_pretrained(base_model, final_model_path)
model.eval()

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.37k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/57.6k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/5 [00:00<?, ?it/s]

model-00001-of-00005.safetensors:   0%|          | 0.00/3.90G [00:00<?, ?B/s]

model-00002-of-00005.safetensors:   0%|          | 0.00/3.86G [00:00<?, ?B/s]

model-00003-of-00005.safetensors:   0%|          | 0.00/3.86G [00:00<?, ?B/s]

model-00004-of-00005.safetensors:   0%|          | 0.00/3.86G [00:00<?, ?B/s]

model-00005-of-00005.safetensors:   0%|          | 0.00/1.09G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/244 [00:00<?, ?B/s]

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen2_5_VLForConditionalGeneration(
      (visual): Qwen2_5_VisionTransformerPretrainedModel(
        (patch_embed): Qwen2_5_VisionPatchEmbed(
          (proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
        )
        (rotary_pos_emb): Qwen2_5_VisionRotaryEmbedding()
        (blocks): ModuleList(
          (0-31): 32 x Qwen2_5_VLVisionBlock(
            (norm1): Qwen2RMSNorm((1280,), eps=1e-06)
            (norm2): Qwen2RMSNorm((1280,), eps=1e-06)
            (attn): Qwen2_5_VLVisionSdpaAttention(
              (qkv): Linear4bit(in_features=1280, out_features=3840, bias=True)
              (proj): Linear4bit(in_features=1280, out_features=1280, bias=True)
            )
            (mlp): Qwen2_5_VLMLP(
              (gate_proj): Linear4bit(in_features=1280, out_features=3420, bias=True)
              (up_proj): Linear4bit(in_features=1280, out_features=3420, bias=True)
              (down_pr

In [8]:
# load system prompt
system_prompt = data_loader.get_text(system_prompt_file_path)
print(system_prompt)

# load user prompt
user_prompt = data_loader.get_text(user_prompt_file_path)
print(f"\n{user_prompt}")

You are an expert in processing W-2 forms. Your task is to extract specific information from the 
authoritative W-2 form in the provided image and present it in a structured JSON object. If the 
image contains multiple forms, the authoritative form is always located in the upper left portion 
of the image. Extract data only from this form, ignoring any duplicates. Use the standard box numbers 
to locate the fields: 

    - Employee Name (Box e),
    - Employer Name (Box c), 
    - Wages and Tips (Box 1), 
    - Federal Income Tax Withheld (Box 2), 
    - Social Security Wages (Box 3), 
    - Medicare Wages and Tips (Box 5), 
    - State (Box 15)
    - State Wages (Box 16)
    - State Income Tax Withheld (Box 17)

For state information, multiple states may be listed. Do not use information from Boxes c or e/f or 
any other areas of the image for state data. 

If a field is missing or blank, use an empty string as the value. Return only the completed JSON object 
without additional comme

In [12]:
# Load ground truth data
metadata = data_loader.get_metadata(
    f"{dataset_processed_final_dir}/test/metadata.jsonl",
    f"{dataset_processed_final_dir}/test",
)
print(f"Selected {len(metadata)} ground truth examples for testing.")

# Run test
executor = Executor(
    model=model,
    processor=processor,
    system_prompt=system_prompt,
    user_prompt=user_prompt,
)
df = executor.execute_inference_test(metadata, batch_size, max_new_tokens)

# Save comparison results to CSV
os.makedirs(output_results_dir, exist_ok=True)
df.to_csv(output_results_file, index=False)

Selected 100 ground truth examples for testing.
Processing 100 examples for testing.
Processing batch (1 of 50); batch size = 2.
Processing batch (2 of 50); batch size = 2.
Processing batch (3 of 50); batch size = 2.
Processing batch (4 of 50); batch size = 2.
Processing batch (5 of 50); batch size = 2.
Processing batch (6 of 50); batch size = 2.
Processing batch (7 of 50); batch size = 2.
Processing batch (8 of 50); batch size = 2.
Processing batch (9 of 50); batch size = 2.
Processing batch (10 of 50); batch size = 2.
Processing batch (11 of 50); batch size = 2.
Processing batch (12 of 50); batch size = 2.
Processing batch (13 of 50); batch size = 2.
Processing batch (14 of 50); batch size = 2.
Processing batch (15 of 50); batch size = 2.
Processing batch (16 of 50); batch size = 2.
Processing batch (17 of 50); batch size = 2.
Processing batch (18 of 50); batch size = 2.
Processing batch (19 of 50); batch size = 2.
Processing batch (20 of 50); batch size = 2.
Processing batch (21 of 

In [13]:
# Read from persisted CSV
df = pd.read_csv(output_results_file)

# output main report - to file and std out
reporting.output_results(df, output_report_file)

# output report for each form type - to file only
form_types = [
    ("ADP1", output_report_ADP1_file),
    ("ADP2", output_report_ADP2_file),
    ("IRS1", output_report_IRS1_file),
    ("IRS2", output_report_IRS2_file),
]
for form_type, report_file_path in form_types:
    reporting.output_results_by_form_type(df, report_file_path, form_type)

**Overall Accuracy**: 97.23%

**Field Summary**:
| Field                       |   total_comparisons |   matches |   mismatches |   accuracy |   mismatch_percentage |
|:----------------------------|--------------------:|----------:|-------------:|-----------:|----------------------:|
| Employee Name               |                 100 |       100 |            0 |       1    |               0       |
| Employer Name               |                 100 |        97 |            3 |       0.97 |               8.33333 |
| Federal Income Tax Withheld |                 100 |        99 |            1 |       0.99 |               2.77778 |
| Field Count Check           |                 100 |       100 |            0 |       1    |               0       |
| Medicare Wages and Tips     |                 100 |        99 |            1 |       0.99 |               2.77778 |
| Social Security Wages       |                 100 |        88 |           12 |       0.88 |              33.3333  |
| State