# Data Preprocessing for Qwen2-VL Fine-tuning

This notebook documents and visualizes the data preprocessing pipeline for fine-tuning Qwen2-VL on the nutrition table detection task.

**Purpose:**
- Convert raw dataset samples to Qwen2-VL conversation format
- Understand the data transformations step-by-step
- Debug and visualize the preprocessing pipeline

**Key Transformations:**
1. OpenFoodFacts bbox format `[y_min, x_min, y_max, x_max]` in [0,1] → Qwen2-VL format `(x1,y1),(x2,y2)` in [0,1000)
2. Dataset samples → OpenAI conversation format (system, user, assistant)
3. PIL images handled via IMAGE_PLACEHOLDER pattern for HuggingFace dataset serialization
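The first transformation can be sketched in isolation. The helper name `off_bbox_to_qwen` and the clamping to `scale - 1` are assumptions made for this illustration (the clamp keeps a coordinate of exactly 1.0 inside the half-open range); the bbox values are those of the first training sample.

```python
def off_bbox_to_qwen(bbox, scale=1000):
    """Map [y_min, x_min, y_max, x_max] in [0,1] to ((x1, y1), (x2, y2)) in [0, scale)."""
    y_min, x_min, y_max, x_max = bbox  # OpenFoodFacts order: y first, then x

    def to_grid(value):
        # Truncate to an integer grid cell and clamp into [0, scale - 1] (assumed policy)
        return min(max(int(value * scale), 0), scale - 1)

    return (to_grid(x_min), to_grid(y_min)), (to_grid(x_max), to_grid(y_max))


bbox = [0.057098764926195145, 0.014274691231548786, 0.603501558303833, 0.991126537322998]
top_left, bottom_right = off_bbox_to_qwen(bbox)
print(top_left, bottom_right)
```

Note how the x and y values swap positions: the first number of the input bbox (`y_min`) ends up as the second number of the top-left corner.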

## Dependencies

In [1]:
# Basic imports
import torch
import os
import gc
import time
from pprint import pprint
from PIL import Image
import matplotlib.pyplot as plt

# HuggingFace imports
from datasets import load_dataset
from transformers import Qwen2VLProcessor
from qwen_vl_utils import process_vision_info

print("Dependencies loaded successfully!")



Dependencies loaded successfully!


In [2]:
# Load dataset
dataset_id = "openfoodfacts/nutrition-table-detection"
ds = load_dataset(dataset_id)

# Split into train and validation
train_dataset = ds['train']
eval_dataset = ds['val']

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(eval_dataset)}")
print(f"Dataset features: {train_dataset.features}")

Training samples: 1083
Validation samples: 123
Dataset features: {'image_id': Value('string'), 'image': Image(mode=None, decode=True), 'width': Value('int64'), 'height': Value('int64'), 'meta': {'barcode': Value('string'), 'off_image_id': Value('string'), 'image_url': Value('string')}, 'objects': {'bbox': List(List(Value('float32'))), 'category_id': List(Value('int64')), 'category_name': List(Value('string'))}}


# Data preprocessing

The dataset must be converted before it can be used with the Hugging Face training stack. Specifically, each sample is reformatted into the OpenAI conversation format, comprising:

- Roles: system, user, and assistant
- User input: an image plus the prompt "Detect the bounding box of the nutrition table."
- Assistant response: a string in the format Qwen2-VL uses when answering detection questions
    * See pages 7 and 43 of this [paper](https://arxiv.org/pdf/2409.12191) and the [Model Card](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct#more-usage-tips) for tips
    * Include the class name and the bounding box coordinates, wrapped in the proper special tokens.
    * Check the expected range of the bounding box coordinates
    * Pay attention to the order of the x,y coordinates expected by Qwen
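One way to sanity-check the coordinate order and range is to parse an assistant response back into pixel coordinates. The sketch below is a hypothetical helper (`parse_detections` and `BOX_PATTERN` are names invented here, not part of any library); it assumes the `(x1,y1),(x2,y2)` layout on a 0 to 1000 grid described above.

```python
import re

# Matches one detection: class name followed by a single box
BOX_PATTERN = re.compile(
    r"<\|object_ref_start\|>(.*?)<\|object_ref_end\|>"
    r"<\|box_start\|>\((\d+),(\d+)\),\((\d+),(\d+)\)<\|box_end\|>"
)


def parse_detections(text, width, height, scale=1000):
    """Return a list of (label, (x1, y1, x2, y2)) with the box in pixel units."""
    detections = []
    for label, x1, y1, x2, y2 in BOX_PATTERN.findall(text):
        box = (
            int(x1) / scale * width,   # x values scale with image width
            int(y1) / scale * height,  # y values scale with image height
            int(x2) / scale * width,
            int(y2) / scale * height,
        )
        detections.append((label, box))
    return detections


answer = (
    "<|object_ref_start|>nutrition-table<|object_ref_end|>"
    "<|box_start|>(14,57),(991,603)<|box_end|>"
)
for label, box in parse_detections(answer, width=2592, height=1944):
    print(label, [round(v) for v in box])
```

If x and y were swapped during conversion, the recovered box would not line up with the table when drawn over the image, which makes this a quick visual check.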

Here is an example system prompt:

In [3]:
system_message = """You are a Vision Language Model specialized in interpreting visual data from product images.
Your task is to analyze the provided product images and detect the nutrition tables in a certain format.
Focus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary."""

In [4]:
# Task: write a function to map each sample to a list of 3 dicts (one for each role)

def convert_to_conversation_format(example):
    """
    Convert a dataset example to Qwen2-VL conversation format.
    
    Why IMAGE_PLACEHOLDER?
    - HuggingFace dataset.map() needs serializable data (PIL images aren't)
    - The placeholder is replaced with the actual image during training (in collate_fn)
    - Image is stored separately at example['image'] to avoid duplication
    
    Coordinate Conversion:
    - Input: OpenFoodFacts [y_min, x_min, y_max, x_max] in [0,1]
    - Output: Qwen2-VL (x1,y1),(x2,y2) in [0,1000)
    
    Args:
        example: Dataset sample with 'image' and 'objects' fields
        
    Returns:
        Dict with 'messages' (conversation) and 'image' (PIL object)
    """
    # Validate input
    if 'objects' not in example or 'bbox' not in example['objects']:
        raise ValueError("Missing objects or bbox in example")
    
    # Extract nutrition table bounding boxes
    bboxes = example['objects']['bbox']
    categories = example['objects']['category_name']  # one class name per bbox (the field is 'category_name', not 'category')
    
    # Format the assistant response with Qwen2-VL special tokens
    # Convert normalized [0,1] bbox to Qwen's [0,1000) format
    assistant_responses = []
    for bbox, category in zip(bboxes, categories):
        # Validate bbox values are in [0,1] range (with small tolerance for rounding)
        if not all(-0.001 <= coord <= 1.001 for coord in bbox):
            print(f"Warning: bbox coordinates out of [0,1] range: {bbox}")
        
        # CRITICAL: OpenFoodFacts uses [y_min, x_min, y_max, x_max] format
        # But Qwen2VL expects (x_top_left, y_top_left), (x_bottom_right, y_bottom_right)
        y_min, x_min, y_max, x_max = bbox  # Unpack OpenFoodFacts format
        
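        # Convert to Qwen format: (x,y) coordinates in [0,1000) range
        # Note: multiply by 1000, then clamp to [0, 999] so that a coordinate
        # of exactly 1.0 (or slightly outside [0,1]) stays inside [0,1000)
        x1 = max(0, min(int(x_min * 1000), 999))  # x_top_left
        y1 = max(0, min(int(y_min * 1000), 999))  # y_top_left
        x2 = max(0, min(int(x_max * 1000), 999))  # x_bottom_right
        y2 = max(0, min(int(y_max * 1000), 999))  # y_bottom_right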
        
        # Format: <|object_ref_start|>object<|object_ref_end|><|box_start|>(x1,y1),(x2,y2)<|box_end|>
        response = f"<|object_ref_start|>{category}<|object_ref_end|><|box_start|>({x1},{y1}),({x2},{y2})<|box_end|>"
        assistant_responses.append(response)
    
    # Combine multiple detections if present
    assistant_text = " ".join(assistant_responses)
    
    # Create conversation format WITHOUT the PIL image embedded
    conversation = [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": system_message
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "IMAGE_PLACEHOLDER"  # Use placeholder instead of actual image
                },
                {
                    "type": "text",
                    "text": "Detect the bounding box of the nutrition table."
                }
            ]
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": assistant_text
                }
            ]
        }
    ]
    
    # Return messages and image separately
    return {
        "messages": conversation,
        "image": example['image']  # Store PIL image at top level
    }

Next, we format the data into this conversation structure. Before mapping, we filter out samples with a missing or invalid image so that no `None` image reaches the collate step.

In [5]:
def _has_image(example):
    """
    Filter out samples with missing or invalid images.
    
    This function is crucial for preventing None values from leaking into
    process_vision_info during collation/training, which would cause crashes.
    
    Why this is necessary:
    - Some dataset samples may have corrupted or missing images
    - PIL Image loading can fail silently, leaving None values
    - The collate_fn and process_vision_info expect valid PIL images
    - Filtering ensures training stability and prevents runtime errors
    
    Args:
        example: Dataset sample that should contain an 'image' field
    
    Returns:
        bool: True if example has a valid PIL image with 'size' attribute
    """
    img = example.get('image')
    # Treat as valid only if it looks like a PIL image (PIL images expose 'size')
    return (img is not None) and hasattr(img, 'size')

In [6]:
# Apply filtering before formatting to remove samples with missing/invalid images
train_dataset = train_dataset.filter(_has_image)
eval_dataset = eval_dataset.filter(_has_image)

print(f"After filtering:")
print(f"  Training samples: {len(train_dataset)}")
print(f"  Validation samples: {len(eval_dataset)}")
print(f"\ntrain_dataset[0]: {train_dataset[0]}")

After filtering:
  Training samples: 1083
  Validation samples: 123

train_dataset[0]: {'image_id': '0009800892204_1', 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2592x1944 at 0x7F8620064EB0>, 'width': 2592, 'height': 1944, 'meta': {'barcode': '0009800892204', 'off_image_id': '1', 'image_url': 'https://static.openfoodfacts.org/images/products/000/980/089/2204/1.jpg'}, 'objects': {'bbox': [[0.057098764926195145, 0.014274691231548786, 0.603501558303833, 0.991126537322998]], 'category_id': [0], 'category_name': ['nutrition-table']}}


In [7]:
# Task: apply the function above to all samples in the training and eval datasets
# Use remove_columns to get clean output with only 'messages' and 'image' fields
columns_to_remove = ['image_id', 'width', 'height', 'meta', 'objects']

# Unfortunately, HuggingFace datasets adds None fields during serialization of nested dicts
# This is a known behavior. We have two options:
# Option 1: Accept the None fields (they don't affect training, collate_fn handles them)
# Option 2: Post-process to remove them (adds overhead but cleaner)

# For now, using Option 1 - the collate_fn already handles None values correctly
train_dataset_formatted = train_dataset.map(
    convert_to_conversation_format,
    remove_columns=columns_to_remove
)
eval_dataset_formatted = eval_dataset.map(
    convert_to_conversation_format, 
    remove_columns=columns_to_remove
)

print(f"Formatted training samples: {len(train_dataset_formatted)}")
print(f"Formatted evaluation samples: {len(eval_dataset_formatted)}")

# NOTE: HuggingFace automatically adds 'image': None and 'text': None to content items
# This is expected behavior and the collate_fn handles it correctly

Formatted training samples: 1083
Formatted evaluation samples: 123


## Important: None fields in Hugging Face Datasets

**Key points:**
1. Hugging Face Datasets stores data in Apache Arrow, which enforces one consistent schema per column
2. When nested dicts have varying keys, `Dataset.map()` (like `Dataset.from_list()`) fills the missing keys with None
3. The None fields don't affect training, because the collate_fn filters them out
4. They are only a cosmetic issue when inspecting the dataset
**Example of the None values issue:**

When `convert_to_conversation_format` returns clean data:
```python
{"type": "image", "image": "IMAGE_PLACEHOLDER"}
{"type": "text", "text": "Detect the bounding box..."}
```

After `dataset.map()`, HuggingFace adds None fields for schema consistency:
```python
{"type": "image", "image": "IMAGE_PLACEHOLDER", "text": None}  # <-- "text": None added!
{"type": "text", "text": "Detect the bounding box...", "image": None}  # <-- "image": None added!
```

This happens because Apache Arrow requires all dicts in a list to have the same keys.

We keep the current approach: use .map() with remove_columns for efficiency.
The collate_fn properly filters out None values during training.
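The effect can be imitated in plain Python. This is only an illustration of the behaviour, not how `datasets` or Arrow implement it, and `pad_to_common_keys` is a name made up for this sketch.

```python
def pad_to_common_keys(items):
    """Give every dict the union of all keys, filling the gaps with None."""
    all_keys = []
    for item in items:
        for key in item:
            if key not in all_keys:
                all_keys.append(key)
    return [{key: item.get(key) for key in all_keys} for item in items]


content = [
    {"type": "image", "image": "IMAGE_PLACEHOLDER"},
    {"type": "text", "text": "Detect the bounding box of the nutrition table."},
]
padded = pad_to_common_keys(content)
for item in padded:
    print(item)
```

In the real dataset the schema is shared by the whole `messages` column, which is why even the system message, which never had an `image` key, ends up with `'image': None`.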

## Comparing Output: BEFORE vs AFTER `dataset.map()`

The following cells show the **same data** at two stages:

1. **`sample`** = Direct call to `convert_to_conversation_format(train_dataset[0])` → **CLEAN** output, no None fields
2. **`train_dataset_formatted[0]`** = After `dataset.map()` → **Has None fields** added by HuggingFace

Comparing the two shows exactly which None fields get added.

In [8]:
# Simple check - just print what convert_to_conversation_format produces
print("\n" + "="*60)
print("Sample 0 after conversion:")
print("="*60)
sample = convert_to_conversation_format(train_dataset[0])
print(sample)

print("\n" + "="*60)
print("Sample 1 after conversion:")
print("="*60)
sample2 = convert_to_conversation_format(train_dataset[1])
print(sample2)


Sample 0 after conversion:
{'messages': [{'role': 'system', 'content': [{'type': 'text', 'text': 'You are a Vision Language Model specialized in interpreting visual data from product images.\nYour task is to analyze the provided product images and detect the nutrition tables in a certain format.\nFocus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary.'}]}, {'role': 'user', 'content': [{'type': 'image', 'image': 'IMAGE_PLACEHOLDER'}, {'type': 'text', 'text': 'Detect the bounding box of the nutrition table.'}]}, {'role': 'assistant', 'content': [{'type': 'text', 'text': '<|object_ref_start|>nutrition-table<|object_ref_end|><|box_start|>(14,57),(991,603)<|box_end|>'}]}], 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2592x1944 at 0x7F85E80CDD50>}

Sample 1 after conversion:
{'messages': [{'role': 'system', 'content': [{'type': 'text', 'text': 'You are a Vision Language Model specialized in 

In [9]:
# Display the same samples with better formatting (line breaks)
from pprint import pprint

print("\n" + "="*60)
print("Sample 0 - Better formatted")
print("="*60)
print()
pprint(sample)

print("\n" + "="*60)
print("Sample 1 - Better formatted")
print("="*60)
print()
pprint(sample2)


Sample 0 - Better formatted

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2592x1944 at 0x7F85E80CDD50>,
 'messages': [{'content': [{'text': 'You are a Vision Language Model '
                                    'specialized in interpreting visual data '
                                    'from product images.\n'
                                    'Your task is to analyze the provided '
                                    'product images and detect the nutrition '
                                    'tables in a certain format.\n'
                                    'Focus on delivering accurate, succinct '
                                    'answers based on the visual information. '
                                    'Avoid additional explanation unless '
                                    'absolutely necessary.',
                            'type': 'text'}],
               'role': 'system'},
              {'content': [{'image': 'IMAGE_PLACEHOLDER', 'type': '

In [10]:
# Show what happens after apply_chat_template
print("\n" + "="*60)
print("After apply_chat_template")
print("="*60)

# Load processor to apply chat template
from transformers import Qwen2VLProcessor
processor = Qwen2VLProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Apply chat template to sample 0
text_sample0 = processor.apply_chat_template(
    sample['messages'], 
    tokenize=False, 
    add_generation_prompt=False  # False for training
)
print("\nSample 0 after apply_chat_template:")
print(text_sample0)

# Apply chat template to sample 1
text_sample1 = processor.apply_chat_template(
    sample2['messages'],
    tokenize=False,
    add_generation_prompt=False
)
print("\n" + "="*60)
print("\nSample 1 after apply_chat_template:")
print(text_sample1)


After apply_chat_template


The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.


You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.



Sample 0 after apply_chat_template:
<|im_start|>system
You are a Vision Language Model specialized in interpreting visual data from product images.
Your task is to analyze the provided product images and detect the nutrition tables in a certain format.
Focus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Detect the bounding box of the nutrition table.<|im_end|>
<|im_start|>assistant
<|object_ref_start|>nutrition-table<|object_ref_end|><|box_start|>(14,57),(991,603)<|box_end|><|im_end|>



Sample 1 after apply_chat_template:
<|im_start|>system
You are a Vision Language Model specialized in interpreting visual data from product images.
Your task is to analyze the provided product images and detect the nutrition tables in a certain format.
Focus on delivering accurate, succinct answers based on the visual information. Avoid additional 

## Now let's see what ACTUALLY goes to the DataLoader

Above, we saw `sample` - the **clean** output from `convert_to_conversation_format()`. 

You might expect that `train_dataset_formatted = train_dataset.map(convert_to_conversation_format)` would simply apply our function to every sample and give us the same clean format.

**But that's not what happens!** HuggingFace's `dataset.map()` adds extra `None` fields due to Apache Arrow's schema requirements.

**Why does this matter?** If the collate_fn doesn't handle these None values, it could break `process_vision_info()` or `apply_chat_template()` during training. Understanding this difference is key to debugging data pipeline issues.
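A small diagnostic makes the added fields easy to spot. The function below is a hypothetical helper written for this notebook (`find_none_fields` is not a library function); it assumes the `messages` layout produced by `convert_to_conversation_format`.

```python
def find_none_fields(messages):
    """Return (message_index, item_index, key) for every None-valued content field."""
    hits = []
    for m_idx, message in enumerate(messages):
        for c_idx, item in enumerate(message["content"]):
            for key, value in item.items():
                if value is None:
                    hits.append((m_idx, c_idx, key))
    return hits


# A user message as it looks after dataset.map()
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "IMAGE_PLACEHOLDER", "text": None},
            {"type": "text", "text": "Detect the bounding box of the nutrition table.", "image": None},
        ],
    },
]
print(find_none_fields(messages))
```

Running it on a clean conversation should return an empty list, so it can double as an assertion inside a collate function while debugging.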

In [11]:
# Inspect the first two elements of train_dataset_formatted (what actually goes to DataLoader)
print("\n" + "="*80)
print("INSPECTING train_dataset_formatted - ACTUAL DATASET PASSED TO TRAINER")
print("="*80)

# Simple inspection first
for i in range(2):
    print(f"\nSample {i}:")
    print(train_dataset_formatted[i])
    print("-" * 50)


INSPECTING train_dataset_formatted - ACTUAL DATASET PASSED TO TRAINER

Sample 0:
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2592x1944 at 0x7F85E8127FA0>, 'messages': [{'content': [{'image': None, 'text': 'You are a Vision Language Model specialized in interpreting visual data from product images.\nYour task is to analyze the provided product images and detect the nutrition tables in a certain format.\nFocus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary.', 'type': 'text'}], 'role': 'system'}, {'content': [{'image': 'IMAGE_PLACEHOLDER', 'text': None, 'type': 'image'}, {'image': None, 'text': 'Detect the bounding box of the nutrition table.', 'type': 'text'}], 'role': 'user'}, {'content': [{'image': None, 'text': '<|object_ref_start|>nutrition-table<|object_ref_end|><|box_start|>(14,57),(991,603)<|box_end|>', 'type': 'text'}], 'role': 'assistant'}]}
---------------------------------

In [12]:
# Display the same train_dataset_formatted samples with better formatting
from pprint import pprint

print("\n" + "="*60)
print("train_dataset_formatted[0] - Better formatted")
print("="*60)
print()
formatted_sample_0 = train_dataset_formatted[0]
pprint(formatted_sample_0)

print("\n" + "="*60)
print("train_dataset_formatted[1] - Better formatted")
print("="*60)
print()
formatted_sample_1 = train_dataset_formatted[1]
pprint(formatted_sample_1)


train_dataset_formatted[0] - Better formatted

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2592x1944 at 0x7F85E8127340>,
 'messages': [{'content': [{'image': None,
                            'text': 'You are a Vision Language Model '
                                    'specialized in interpreting visual data '
                                    'from product images.\n'
                                    'Your task is to analyze the provided '
                                    'product images and detect the nutrition '
                                    'tables in a certain format.\n'
                                    'Focus on delivering accurate, succinct '
                                    'answers based on the visual information. '
                                    'Avoid additional explanation unless '
                                    'absolutely necessary.',
                            'type': 'text'}],
               'role': 'system'},
      

# Step-by-step debugging of the collate function before applying the chat template

The following cells walk through the data transformation pipeline step by step,
showing exactly what happens to the data at each stage.
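As a compact preview of where these steps end up, the cleanup can be written as one function. This is a sketch under assumptions: `restore_conversation` is a name invented here, and the stand-in `object()` replaces a real PIL image so the example runs without any image data.

```python
def restore_conversation(messages, image, placeholder="IMAGE_PLACEHOLDER"):
    """Drop None-valued fields and put the real image back into the user turn."""
    cleaned = []
    for msg in messages:
        content = []
        for item in msg["content"]:
            if item is None:
                continue
            if item.get("type") == "text":
                if item.get("text") is not None:
                    content.append({"type": "text", "text": item["text"]})
            elif item.get("type") == "image" and msg["role"] == "user":
                value = item.get("image")
                is_placeholder = isinstance(value, str) and value == placeholder
                # Swap in the real image unless the item already carries one
                content.append({"type": "image", "image": image if value is None or is_placeholder else value})
        if content:
            cleaned.append({"role": msg["role"], "content": content})
    return cleaned


fake_image = object()  # stand-in for a PIL image
raw = [
    {"role": "user", "content": [
        {"type": "image", "image": "IMAGE_PLACEHOLDER", "text": None},
        {"type": "text", "text": "Detect the bounding box of the nutrition table.", "image": None},
    ]},
]
restored = restore_conversation(raw, fake_image)
print(restored[0]["content"][1])
```

Building new dicts instead of mutating the input keeps the dataset rows untouched, which matters when the same sample is collated again in a later epoch.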

In [13]:
print(train_dataset_formatted[0])
print(type(train_dataset_formatted))
print(type(train_dataset_formatted[0]))
print(type([train_dataset_formatted[0:10]]))  # always <class 'list'>: the slice (a dict of columns) is wrapped in [...]

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2592x1944 at 0x7F85E80CE260>, 'messages': [{'content': [{'image': None, 'text': 'You are a Vision Language Model specialized in interpreting visual data from product images.\nYour task is to analyze the provided product images and detect the nutrition tables in a certain format.\nFocus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary.', 'type': 'text'}], 'role': 'system'}, {'content': [{'image': 'IMAGE_PLACEHOLDER', 'text': None, 'type': 'image'}, {'image': None, 'text': 'Detect the bounding box of the nutrition table.', 'type': 'text'}], 'role': 'user'}, {'content': [{'image': None, 'text': '<|object_ref_start|>nutrition-table<|object_ref_end|><|box_start|>(14,57),(991,603)<|box_end|>', 'type': 'text'}], 'role': 'assistant'}]}
<class 'datasets.arrow_dataset.Dataset'>
<class 'dict'>
<class 'list'>


In [14]:
batch = train_dataset_formatted

# Extract messages and images from each sample
messages_list = [sample['messages'] for sample in batch]
images_list = [sample.get('image', None) for sample in batch]

In [15]:
print(type(messages_list))
print(type(images_list))
# print(messages_list)
# print(images_list)
for i in range(5):
    print(messages_list[i])
    print(images_list[i])
    print("-"*100)

<class 'list'>
<class 'list'>
[{'content': [{'image': None, 'text': 'You are a Vision Language Model specialized in interpreting visual data from product images.\nYour task is to analyze the provided product images and detect the nutrition tables in a certain format.\nFocus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary.', 'type': 'text'}], 'role': 'system'}, {'content': [{'image': 'IMAGE_PLACEHOLDER', 'text': None, 'type': 'image'}, {'image': None, 'text': 'Detect the bounding box of the nutrition table.', 'type': 'text'}], 'role': 'user'}, {'content': [{'image': None, 'text': '<|object_ref_start|>nutrition-table<|object_ref_end|><|box_start|>(14,57),(991,603)<|box_end|>', 'type': 'text'}], 'role': 'assistant'}]
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2592x1944 at 0x7F8620066260>
----------------------------------------------------------------------------------------------------
[{'conte

In [16]:
print(len(messages_list))

1083


In [17]:
# Filter out samples without images
valid_pairs = [(m, img) for m, img in zip(messages_list, images_list) if img is not None]
if not valid_pairs:
    raise ValueError("Batch contains no valid images.")

messages_list, images_list = zip(*valid_pairs)
messages_list = list(messages_list)
images_list = list(images_list)

# Process each sample
texts = []
all_images = []

In [18]:
all_conversations = []  # This will hold complete conversations

for messages, image in zip(messages_list, images_list):
    messages_with_image = []
    # Clean up messages: remove None values and restore images
    for msg in messages:
        msg_copy = {'role': msg['role'], 'content': []}
        
        for content_item in msg['content']:
            # Skip None entries entirely
            if content_item is None:
                continue
                
            # Process text content - filter out None text values
            if content_item.get('type') == 'text':
                text_value = content_item.get('text')
                if text_value is not None and text_value != 'None':  # Check for actual None and string 'None'
                    msg_copy['content'].append({
                        'type': 'text',
                        'text': text_value
                    })
            # Process image content - replace IMAGE_PLACEHOLDER with actual PIL image
            elif content_item.get('type') == 'image' and msg['role'] == 'user':
                # Check if it's IMAGE_PLACEHOLDER and replace with actual image
                image_value = content_item.get('image')
                if image_value == 'IMAGE_PLACEHOLDER' or image_value is None:
                    # Replace with the actual PIL image from top level
                    msg_copy['content'].append({
                        'type': 'image',
                        'image': image  # Use the actual PIL image
                    })
                elif image_value and image_value != 'None':
                    # Use existing image if it's not placeholder
                    msg_copy['content'].append({
                        'type': 'image',
                        'image': image_value
                    })
        
        if msg_copy['content']:
            messages_with_image.append(msg_copy)

    # Add the complete conversation (3 messages) to collection
    all_conversations.append(messages_with_image)

print("First 5 messages_with_image examples:")
for i in range(min(5, len(all_conversations))):
    print(f"\nExample {i}:")
    print(all_conversations[i])

First 5 messages_with_image examples:

Example 0:
[{'role': 'system', 'content': [{'type': 'text', 'text': 'You are a Vision Language Model specialized in interpreting visual data from product images.\nYour task is to analyze the provided product images and detect the nutrition tables in a certain format.\nFocus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary.'}]}, {'role': 'user', 'content': [{'type': 'image', 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2592x1944 at 0x7F8620066260>}, {'type': 'text', 'text': 'Detect the bounding box of the nutrition table.'}]}, {'role': 'assistant', 'content': [{'type': 'text', 'text': '<|object_ref_start|>nutrition-table<|object_ref_end|><|box_start|>(14,57),(991,603)<|box_end|>'}]}]

Example 1:
[{'role': 'system', 'content': [{'type': 'text', 'text': 'You are a Vision Language Model specialized in interpreting visual data from product images.\nYour

In [19]:
# Final Code to make sure the data is formatted correctly after dataset.map:


batch = train_dataset_formatted

# Extract messages and images from each sample
messages_list = [sample['messages'] for sample in batch]
images_list = [sample.get('image', None) for sample in batch]

# Filter out samples without images
valid_pairs = [(m, img) for m, img in zip(messages_list, images_list) if img is not None]
if not valid_pairs:
    raise ValueError("Batch contains no valid images.")

messages_list, images_list = zip(*valid_pairs)
messages_list = list(messages_list)
images_list = list(images_list)

# Process each sample
# texts = []
# all_images = []

all_conversations = []  # This will hold complete conversations, each conversation is a list of 3 messages (system, user, assistant)

for messages, image in zip(messages_list, images_list):
    messages_with_image = []
    # Clean up messages: remove None values and restore images
    for msg in messages:
        msg_copy = {'role': msg['role'], 'content': []}
        
        for content_item in msg['content']:
            # Skip None entries entirely
            if content_item is None:
                continue
                
            # Process text content - filter out None text values
            if content_item.get('type') == 'text':
                text_value = content_item.get('text')
                if text_value is not None and text_value != 'None':  # Check for actual None and string 'None'
                    msg_copy['content'].append({
                        'type': 'text',
                        'text': text_value
                    })
            # Process image content - replace IMAGE_PLACEHOLDER with actual PIL image
            elif content_item.get('type') == 'image' and msg['role'] == 'user':
                # Check if it's IMAGE_PLACEHOLDER and replace with actual image
                image_value = content_item.get('image')
                if image_value == 'IMAGE_PLACEHOLDER' or image_value is None:
                    # Replace with the actual PIL image from top level
                    msg_copy['content'].append({
                        'type': 'image',
                        'image': image  # Use the actual PIL image
                    })
                elif image_value and image_value != 'None':
                    # Use existing image if it's not placeholder
                    msg_copy['content'].append({
                        'type': 'image',
                        'image': image_value
                    })
        
        if msg_copy['content']:
            messages_with_image.append(msg_copy)

    # Add the complete conversation (3 messages) to collection
    all_conversations.append(messages_with_image)

print("First 5 all_conversations examples:")
for i in range(min(5, len(all_conversations))):
    print(f"\nExample {i}:")
    print(all_conversations[i])

First 5 all_conversations examples:

Example 0:
[{'role': 'system', 'content': [{'type': 'text', 'text': 'You are a Vision Language Model specialized in interpreting visual data from product images.\nYour task is to analyze the provided product images and detect the nutrition tables in a certain format.\nFocus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary.'}]}, {'role': 'user', 'content': [{'type': 'image', 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2592x1944 at 0x7F7FA93BC280>}, {'type': 'text', 'text': 'Detect the bounding box of the nutrition table.'}]}, {'role': 'assistant', 'content': [{'type': 'text', 'text': '<|object_ref_start|>nutrition-table<|object_ref_end|><|box_start|>(14,57),(991,603)<|box_end|>'}]}]

Example 1:
[{'role': 'system', 'content': [{'type': 'text', 'text': 'You are a Vision Language Model specialized in interpreting visual data from product images.\nYour t

In [20]:
# Pass the whole batch of conversations to apply_chat_template (returns one string per conversation)

text = processor.apply_chat_template(
    all_conversations,
    tokenize=False,
    add_generation_prompt=False
)

for i in range(2):
    print(text[i])
    print("-"*100)

print(type(text))

<|im_start|>system
You are a Vision Language Model specialized in interpreting visual data from product images.
Your task is to analyze the provided product images and detect the nutrition tables in a certain format.
Focus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Detect the bounding box of the nutrition table.<|im_end|>
<|im_start|>assistant
<|object_ref_start|>nutrition-table<|object_ref_end|><|box_start|>(14,57),(991,603)<|box_end|><|im_end|>

----------------------------------------------------------------------------------------------------
<|im_start|>system
You are a Vision Language Model specialized in interpreting visual data from product images.
Your task is to analyze the provided product images and detect the nutrition tables in a certain format.
Focus on delivering accurate, succinct answers based on the visual info
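
The assistant turn above encodes the box as `<|box_start|>(x1,y1),(x2,y2)<|box_end|>` on the 0 to 1000 scale. When inspecting model outputs later, it can be handy to parse that string back into numbers. The following is a minimal sketch, not part of the original pipeline; the helper names `parse_qwen_box` and `to_pixels` are hypothetical.

```python
import re

# Matches the box markup seen in the template output, e.g.
# <|box_start|>(14,57),(991,603)<|box_end|>
BOX_PATTERN = re.compile(
    r"<\|box_start\|>\((\d+),(\d+)\),\((\d+),(\d+)\)<\|box_end\|>"
)

def parse_qwen_box(text):
    """Return (x1, y1, x2, y2) on the 0-1000 scale, or None if no box is found."""
    match = BOX_PATTERN.search(text)
    if match is None:
        return None
    x1, y1, x2, y2 = (int(group) for group in match.groups())
    return x1, y1, x2, y2

def to_pixels(box, width, height):
    """Scale a 0-1000 box to pixel coordinates for an image of the given size."""
    x1, y1, x2, y2 = box
    return (x1 * width / 1000, y1 * height / 1000,
            x2 * width / 1000, y2 * height / 1000)

sample = ("<|object_ref_start|>nutrition-table<|object_ref_end|>"
          "<|box_start|>(14,57),(991,603)<|box_end|><|im_end|>")
print(parse_qwen_box(sample))
```

Returning `None` instead of raising keeps the helper usable on free-form generations that may not contain a box at all.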

In [21]:
image, video = process_vision_info(all_conversations)
for i in range(5):
    print(image[i])
    print("-"*100)
# print(video[0:5])

print(type(image))

<PIL.Image.Image image mode=RGB size=2604x1932 at 0x7F85E80CD480>
----------------------------------------------------------------------------------------------------
<PIL.Image.Image image mode=RGB size=308x420 at 0x7F8626DC9120>
----------------------------------------------------------------------------------------------------
<PIL.Image.Image image mode=RGB size=700x728 at 0x7F85E80CEA70>
----------------------------------------------------------------------------------------------------
<PIL.Image.Image image mode=RGB size=3080x4144 at 0x7F85E8064B80>
----------------------------------------------------------------------------------------------------
<PIL.Image.Image image mode=RGB size=1932x2604 at 0x7F8626DCA740>
----------------------------------------------------------------------------------------------------
<class 'list'>
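
Note that the sizes printed here differ from the raw image sizes: Example 0 was 2592x1944 in the conversation, but comes back as 2604x1932. Every printed size is a multiple of 28, which is consistent with `process_vision_info` rounding each side to the nearest multiple of 28 (14-pixel patches merged 2x2). The sketch below reproduces only that rounding step; it is a simplification and assumes the image is within the minimum/maximum pixel limits that the real utility also enforces. The function names are hypothetical.

```python
def round_to_multiple(value, factor=28):
    """Round value to the nearest multiple of factor (at least one factor)."""
    return max(factor, round(value / factor) * factor)

def approx_resized_size(width, height, factor=28):
    """Approximate the (width, height) after rounding; ignores pixel-count limits."""
    return round_to_multiple(width, factor), round_to_multiple(height, factor)

print(approx_resized_size(2592, 1944))

# Sizes printed by the cell above; each side should divide evenly by 28.
printed_sizes = [(2604, 1932), (308, 420), (700, 728), (3080, 4144), (1932, 2604)]
print(all(w % 28 == 0 and h % 28 == 0 for w, h in printed_sizes))
```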


In [22]:
batch_inputs = processor(
    text=text,
    images=image,
    # videos=video,
    padding=True,
    truncation=False,
    return_tensors="pt"
)

In [23]:
print(f'batch_inputs["input_ids"]: \n {batch_inputs["input_ids"][0:2, -10:]}')  # First 2 samples, last 10 tokens
print(f'batch_inputs["attention_mask"]: \n {batch_inputs["attention_mask"][0:2, -10:]}')  # 1 = real token, 0 = padding

# First 2 samples, first 10 tokens (all padding here, because the batch is left-padded)
print(f'batch_inputs["input_ids"]: \n {batch_inputs["input_ids"][0:2, 0:10]}')
print(f'batch_inputs["attention_mask"]: \n {batch_inputs["attention_mask"][0:2, 0:10]}')

print("-"*100)

print("Input shapes:")
for key, value in batch_inputs.items():
    if hasattr(value, 'shape'):
        print(f"{key}: {value.shape}")

print("-"*100)
print(f'type(batch_inputs): \n {type(batch_inputs)}')

batch_inputs["input_ids"]: 
 tensor([[    24,     16,     11,     21,     15,     18,      8, 151649, 151645,
            198],
        [    16,     21,     11,     20,     23,     23,      8, 151649, 151645,
            198]])
batch_inputs["attention_mask"]: 
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
batch_inputs["input_ids"]: 
 tensor([[151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
         151643],
        [151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
         151643]])
batch_inputs["attention_mask"]: 
 tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
----------------------------------------------------------------------------------------------------
Input shapes:
input_ids: torch.Size([1083, 16433])
attention_mask: torch.Size([1083, 16433])
pixel_values: torch.Size([31255872, 1176])
image_grid_thw: torch.Size([1083, 3])
----------------------------------------------------
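
The shapes above are related: `pixel_values` holds one row per vision patch across the whole batch, and `image_grid_thw` gives the (temporal, height, width) patch grid of each image. The sketch below shows the arithmetic, assuming the usual Qwen2-VL settings (14-pixel patches, temporal patch size 2, 2x2 patch merging); these constants are assumptions here and should be checked against the image processor's configuration. The example grid `(1, 138, 186)` is what a 2604x1932 image would give if the processor applied no further resizing.

```python
# Assumed Qwen2-VL defaults; verify against the loaded image processor.
PATCH_SIZE = 14
TEMPORAL_PATCH_SIZE = 2
CHANNELS = 3
MERGE_SIZE = 2

def patch_feature_dim():
    """Length of one flattened patch, i.e. the second dimension of pixel_values."""
    return CHANNELS * TEMPORAL_PATCH_SIZE * PATCH_SIZE * PATCH_SIZE

def count_patches(grid_thw):
    """Total rows expected in pixel_values for a list of (t, h, w) grids."""
    return sum(t * h * w for t, h, w in grid_thw)

def count_image_tokens(grid_thw):
    """Number of <|image_pad|> tokens per image after MERGE_SIZE x MERGE_SIZE merging."""
    return [t * h * w // (MERGE_SIZE ** 2) for t, h, w in grid_thw]

example_grid = [(1, 1932 // PATCH_SIZE, 2604 // PATCH_SIZE)]  # (1, 138, 186)
print(patch_feature_dim())
print(count_patches(example_grid))
print(count_image_tokens(example_grid))
```

With the real tensors, the same check would be comparing the sum of `t * h * w` over `image_grid_thw` against `pixel_values.shape[0]`. The per-image token counts also explain why the padded sequence length is dominated by image tokens rather than text.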

In [24]:
# Final code, starting from apply_chat_template

# Apply chat template to get text
text = processor.apply_chat_template(
    all_conversations,
    tokenize=False,
    add_generation_prompt=False
)

# Extract the image and video inputs from the conversations
image, video = process_vision_info(all_conversations)

# Process texts and images together.
# batch_inputs is a dict-like BatchFeature containing input_ids, attention_mask,
# pixel_values, and image_grid_thw.
batch_inputs = processor(
    text=text,
    images=image,
    # videos=video,
    padding=True,
    truncation=False,
    return_tensors="pt"
)

In [25]:
# Special token IDs (e.g. to look up once in a collator's __init__ method)
print(
    f"processor.tokenizer.pad_token_id: {processor.tokenizer.pad_token_id}\n"
    f"processor.tokenizer.convert_tokens_to_ids('<|vision_start|>'): {processor.tokenizer.convert_tokens_to_ids('<|vision_start|>')}\n"
    f"processor.tokenizer.convert_tokens_to_ids('<|vision_end|>'): {processor.tokenizer.convert_tokens_to_ids('<|vision_end|>')}\n"
    f"processor.tokenizer.convert_tokens_to_ids('<|image_pad|>'): {processor.tokenizer.convert_tokens_to_ids('<|image_pad|>')}"
)

processor.tokenizer.pad_token_id: 151643
processor.tokenizer.convert_tokens_to_ids('<|vision_start|>'): 151652
processor.tokenizer.convert_tokens_to_ids('<|vision_end|>'): 151653
processor.tokenizer.convert_tokens_to_ids('<|image_pad|>'): 151655
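
These IDs are typically needed when building training labels, so that padding and vision placeholder positions do not contribute to the loss. The following is a minimal sketch under that assumption, using plain Python lists for clarity; the notebook does not show its actual label logic, and `mask_labels` is a hypothetical helper. `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

# IDs printed by the cell above
VISION_START_ID = 151652
VISION_END_ID = 151653
IMAGE_PAD_ID = 151655
VISION_IDS = {VISION_START_ID, VISION_END_ID, IMAGE_PAD_ID}

def mask_labels(input_ids, attention_mask):
    """Copy input_ids to labels, ignoring padded and vision placeholder positions."""
    labels = []
    for token_id, mask in zip(input_ids, attention_mask):
        if mask == 0 or token_id in VISION_IDS:
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token_id)
    return labels

# Toy left-padded sequence: pad, vision_start, image_pad, vision_end, text, <|im_end|>
toy_ids = [151643, 151652, 151655, 151653, 24, 151645]
toy_mask = [0, 1, 1, 1, 1, 1]
print(mask_labels(toy_ids, toy_mask))
```

Padding is detected through `attention_mask` rather than by comparing against `pad_token_id`, so a genuine occurrence of that token ID inside the text would still be kept as a label.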


## Clear Memory

Before proceeding with training, delete variables that are no longer needed and release cached GPU memory to free up resources.

In [26]:
import gc
import time

def clear_memory():
    # Delete these variables if they exist in the current global scope
    for name in ('inputs', 'model', 'processor', 'trainer', 'peft_model', 'bnb_config'):
        globals().pop(name, None)
    time.sleep(2)

    # Garbage collection and clearing CUDA memory
    gc.collect()
    time.sleep(2)
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    time.sleep(2)
    gc.collect()
    time.sleep(2)

    print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")

clear_memory()

GPU allocated memory: 0.00 GB
GPU reserved memory: 0.00 GB
