# 01 - Dataset Exploration: OpenFoodFacts Nutrition Tables

This notebook provides a detailed exploration of the **OpenFoodFacts nutrition-table-detection** dataset used for fine-tuning vision-language models.

## Objectives
- Load and understand the dataset structure
- Visualize sample images with bounding boxes
- Analyze image size distributions
- Study bounding box characteristics
- Generate insights for model training

## Dataset Info
- **Source**: [openfoodfacts/nutrition-table-detection](https://huggingface.co/datasets/openfoodfacts/nutrition-table-detection)
- **Task**: Object detection for nutrition tables in product images
- **Format**: Images with normalized bounding boxes [y_min, x_min, y_max, x_max]
- **Splits**: `train` and `val` (used here as the training and validation sets)
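
As a minimal sketch of how this box format maps onto pixels, the helper below converts one normalized box to `(left, top, right, bottom)` pixel coordinates, which is the order PIL's `ImageDraw.rectangle` accepts. The function name `to_pixel_box` and the sample values are hypothetical; they are not part of the dataset or of any library.

```python
def to_pixel_box(bbox, width, height):
    """Convert a normalized [y_min, x_min, y_max, x_max] box
    to pixel coordinates in (left, top, right, bottom) order."""
    y_min, x_min, y_max, x_max = bbox
    # x values scale with the image width, y values with the image height
    return (x_min * width, y_min * height, x_max * width, y_max * height)


# Made-up example: a box covering the central half of an 800x600 image
pixel_box = to_pixel_box([0.25, 0.25, 0.75, 0.75], width=800, height=600)
print(pixel_box)  # (200.0, 150.0, 600.0, 450.0)
```

Note that the y/x order is swapped between the two representations, which is the most likely source of mistakes when drawing or cropping.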

## 1. Setup and Load Dataset

In [None]:
# Import required libraries
from datasets import load_dataset
from PIL import Image, ImageDraw, ImageFont
import matplotlib.pyplot as plt
import numpy as np
import pprint
from collections import Counter

# Load the dataset
dataset_id = "openfoodfacts/nutrition-table-detection"
print(f"Loading dataset: {dataset_id}")
ds = load_dataset(dataset_id)

# Split into training and evaluation sets
train_dataset = ds['train']
eval_dataset = ds['val']

print("\n" + "=" * 60)
print("DATASET LOADED SUCCESSFULLY")
print("=" * 60)
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(eval_dataset)}")

## 2. Dataset Structure Overview

In [None]:
print("=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)
print(ds)

print("\n" + "=" * 60)
print("FEATURE SCHEMA")
print("=" * 60)
for feature_name, feature_type in train_dataset.features.items():
    print(f"  {feature_name:12} : {feature_type}")

## 3. Inspect Sample Data

Let's examine the structure of a single training example to understand the data format.

In [None]:
# Get the first training example
example = train_dataset[0]

print("=" * 60)
print("FIRST TRAINING EXAMPLE - DETAILED INSPECTION")
print("=" * 60)

print(f"\n🆔 BASIC INFO:")
print(f"  Image ID     : {example['image_id']}")
print(f"  Dimensions   : {example['width']} x {example['height']}")
print(f"  Image Object : {example['image']}")

print(f"\n📊 METADATA:")
pp = pprint.PrettyPrinter(indent=4, width=80)
pp.pprint(example['meta'])

print(f"\n🎯 ANNOTATIONS:")
print(f"  Number of objects: {len(example['objects']['category_name'])}")
print(f"  Category names   : {example['objects']['category_name']}")
print(f"  Category IDs     : {example['objects']['category_id']}")
print(f"  Bounding boxes   :")
for i, bbox in enumerate(example['objects']['bbox']):
    print(f"    Object {i}: [y_min={bbox[0]:.3f}, x_min={bbox[1]:.3f}, y_max={bbox[2]:.3f}, x_max={bbox[3]:.3f}]")

print(f"\n💡 NOTE: Bounding box coordinates are normalized to [0, 1] for resolution independence.")
print(f"   Format: [y_min, x_min, y_max, x_max] (OpenFoodFacts convention)")
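
Before relying on the normalized format, it may be worth checking that every box is well formed. The sketch below is one way to do that, assuming each example exposes `example['objects']['bbox']` as in the cell above. The helpers `is_valid_box` and `find_invalid_boxes` and the toy rows are hypothetical and only illustrate the check.

```python
def is_valid_box(bbox, eps=1e-6):
    """Return True if a normalized [y_min, x_min, y_max, x_max] box is well formed."""
    if len(bbox) != 4:
        return False
    y_min, x_min, y_max, x_max = bbox
    # Every coordinate must lie in [0, 1], allowing a small numerical tolerance
    in_range = all(-eps <= v <= 1 + eps for v in bbox)
    # Min corners must be strictly smaller than max corners
    return in_range and y_min < y_max and x_min < x_max


def find_invalid_boxes(rows):
    """Yield (row_index, box_index, bbox) for each malformed box in an iterable of examples."""
    for row_idx, row in enumerate(rows):
        for box_idx, bbox in enumerate(row["objects"]["bbox"]):
            if not is_valid_box(bbox):
                yield row_idx, box_idx, bbox


# Toy rows standing in for dataset examples (values are made up)
rows = [
    {"objects": {"bbox": [[0.1, 0.2, 0.5, 0.9]]}},
    {"objects": {"bbox": [[0.6, 0.2, 0.4, 0.9], [0.0, 0.0, 1.0, 1.2]]}},
]
invalid = list(find_invalid_boxes(rows))
print(invalid)
```

In the toy data, the first box of the second row has `y_min > y_max` and the second box has an `x_max` above 1, so both are reported.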

## 4. Visualize Single Example

Let's visualize one example to understand how bounding boxes map to actual images.

In [None]:
print("=" * 60)
print("SINGLE IMAGE VISUALIZATION")
print("=" * 60)

# Get the bounding box and image
bbox = example['objects']['bbox'][0]
pil_width, pil_height = example['image'].size

print(f"Image dimensions: {pil_width} x {pil_height} pixels")
print(f"Raw bbox (normalized): {bbox}")

# Convert normalized coordinates to pixel coordinates
y_min, x_min, y_max, x_max = bbox
x_min_px = x_min * pil_width
y_min_px = y_min * pil_height
x_max_px = x_max * pil_width
y_max_px = y_max * pil_height

print(f"Pixel coordinates: [{x_min_px:.1f}, {y_min_px:.1f}, {x_max_px:.1f}, {y_max_px:.1f}]")
print(f"Box size: {x_max_px-x_min_px:.1f} x {y_max_px-y_min_px:.1f} pixels")

# Create side-by-side visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Left: Original image
ax1.imshow(example['image'])
ax1.set_title("Original Image")
ax1.axis('off')

# Right: Image with bounding box
img_with_bbox = example['image'].copy()
draw = ImageDraw.Draw(img_with_bbox)
draw.rectangle([x_min_px, y_min_px, x_max_px, y_max_px], outline='red', width=3)

ax2.imshow(img_with_bbox)
ax2.set_title(f"With Bounding Box: {example['objects']['category_name'][0]}")
ax2.axis('off')

plt.tight_layout()
plt.show()

print(f"\n✅ Category detected: {example['objects']['category_name'][0]}")

## 5. Visualize Multiple Examples

Display a grid of training examples to see the variety in the dataset.

In [None]:
print("=" * 60)
print("MULTIPLE EXAMPLES VISUALIZATION")
print("=" * 60)
print("Displaying 6 training examples with nutrition table bounding boxes:\n")

# Create a 2x3 grid for displaying 6 examples
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

for idx in range(6):
    example = train_dataset[idx]
    img = example['image'].copy()
    draw = ImageDraw.Draw(img)
    
    pil_width, pil_height = img.size
    num_objects = len(example['objects']['bbox'])
    
    print(f"  Example {idx}: {pil_width}x{pil_height}px, {num_objects} object(s)")
    
    # Calculate proportional line width
    line_width = max(3, min(pil_width, pil_height) // 150)
    
    # Draw all bounding boxes for this image
    for i, (bbox, category) in enumerate(zip(example['objects']['bbox'], 
                                             example['objects']['category_name'])):
        y_min, x_min, y_max, x_max = bbox
        x_min_px = x_min * pil_width
        y_min_px = y_min * pil_height
        x_max_px = x_max * pil_width
        y_max_px = y_max * pil_height
        
        draw.rectangle([x_min_px, y_min_px, x_max_px, y_max_px], 
                      outline='red', width=line_width)
        
        # Add category label
        try:
            font = ImageFont.truetype("arial.ttf", 20)
        except OSError:
            # Fall back to PIL's built-in font when Arial is not installed
            font = ImageFont.load_default()
        
        text_y = max(10, y_min_px - 25)
        draw.text((x_min_px, text_y), f"{category}", fill='red', font=font)
    
    axes[idx].imshow(img)
    axes[idx].set_title(f"Example {idx}\nSize: {pil_width}x{pil_height}px | Objects: {num_objects}", 
                       fontsize=10, pad=10)
    axes[idx].axis('off')

plt.tight_layout()
plt.show()

print(f"\n✅ Visualization complete!")

## 6. Image Size Distribution Analysis

Analyze the distribution of image sizes to understand the variety in input dimensions.

In [None]:
print("=" * 60)
print("IMAGE SIZE DISTRIBUTION ANALYSIS")
print("=" * 60)
print(f"Processing {len(train_dataset)} training examples...\n")

# Collect statistics
image_widths = []
image_heights = []
image_areas = []
num_bboxes_per_image = []

for example in train_dataset:
    image_widths.append(example['width'])
    image_heights.append(example['height'])
    image_areas.append(example['width'] * example['height'])
    num_bboxes_per_image.append(len(example['objects']['bbox']))

# Convert to numpy arrays
image_widths = np.array(image_widths)
image_heights = np.array(image_heights)
image_areas = np.array(image_areas)
num_bboxes_per_image = np.array(num_bboxes_per_image)

print(f"📊 IMAGE DIMENSIONS SUMMARY:")
print(f"  Width  - Min: {image_widths.min():4d}px | Max: {image_widths.max():4d}px | Mean: {image_widths.mean():.1f}px")
print(f"  Height - Min: {image_heights.min():4d}px | Max: {image_heights.max():4d}px | Mean: {image_heights.mean():.1f}px")
print(f"  Area   - Min: {image_areas.min():8.0f} | Max: {image_areas.max():8.0f} | Mean: {image_areas.mean():.0f}")

print(f"\n🎯 BOUNDING BOXES SUMMARY:")
print(f"  Min boxes per image: {num_bboxes_per_image.min()}")
print(f"  Max boxes per image: {num_bboxes_per_image.max()}")
print(f"  Mean boxes per image: {num_bboxes_per_image.mean():.2f}")
print(f"  Total bounding boxes: {num_bboxes_per_image.sum()}")
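
The loop above iterates over full examples, so every image is decoded even though only `width`, `height`, and `objects` are read. One possible speed-up, sketched here, is to drop the `image` column before iterating. `Dataset.remove_columns` is an existing `datasets` method; the function `collect_size_stats` and the toy rows are hypothetical, and the actual gain depends on the dataset size and storage.

```python
def collect_size_stats(rows):
    """Collect widths, heights, areas, and box counts from an iterable of example dicts."""
    stats = {"width": [], "height": [], "area": [], "num_boxes": []}
    for row in rows:
        stats["width"].append(row["width"])
        stats["height"].append(row["height"])
        stats["area"].append(row["width"] * row["height"])
        stats["num_boxes"].append(len(row["objects"]["bbox"]))
    return stats


# Intended use in this notebook (assumes `train_dataset` from section 1):
#   stats = collect_size_stats(train_dataset.remove_columns(["image"]))

# Toy rows with made-up values, so the sketch runs without downloading anything
rows = [
    {"width": 800, "height": 600, "objects": {"bbox": [[0.1, 0.1, 0.5, 0.5]]}},
    {"width": 1000, "height": 2000,
     "objects": {"bbox": [[0.0, 0.0, 0.3, 0.3], [0.5, 0.5, 0.9, 0.9]]}},
]
stats = collect_size_stats(rows)
print(stats["area"], stats["num_boxes"])
```

The resulting lists can be converted with `np.array(...)` exactly as in the cell above.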

### Visualize Distributions

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

# 1. Image widths histogram
ax1.hist(image_widths, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
ax1.set_xlabel('Image Width (pixels)')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of Image Widths')
ax1.grid(True, alpha=0.3)
ax1.axvline(image_widths.mean(), color='red', linestyle='--', 
            label=f'Mean: {image_widths.mean():.0f}px')
ax1.legend()

# 2. Image heights histogram
ax2.hist(image_heights, bins=30, alpha=0.7, color='lightgreen', edgecolor='black')
ax2.set_xlabel('Image Height (pixels)')
ax2.set_ylabel('Frequency')
ax2.set_title('Distribution of Image Heights')
ax2.grid(True, alpha=0.3)
ax2.axvline(image_heights.mean(), color='red', linestyle='--', 
            label=f'Mean: {image_heights.mean():.0f}px')
ax2.legend()

# 3. Image areas histogram
ax3.hist(image_areas/1e6, bins=30, alpha=0.7, color='orange', edgecolor='black')
ax3.set_xlabel('Image Area (Megapixels)')
ax3.set_ylabel('Frequency')
ax3.set_title('Distribution of Image Areas')
ax3.grid(True, alpha=0.3)
ax3.axvline(image_areas.mean()/1e6, color='red', linestyle='--', 
            label=f'Mean: {image_areas.mean()/1e6:.1f}MP')
ax3.legend()

# 4. Bounding boxes per image
bbox_counts = np.bincount(num_bboxes_per_image)
bbox_labels = np.arange(len(bbox_counts))
ax4.bar(bbox_labels, bbox_counts, alpha=0.7, color='coral', edgecolor='black')
ax4.set_xlabel('Number of Bounding Boxes per Image')
ax4.set_ylabel('Number of Images')
ax4.set_title('Distribution of Bounding Boxes per Image\n(Critical for Model Configuration)')
ax4.grid(True, alpha=0.3)
ax4.set_xticks(bbox_labels)

# Add percentage labels on bars
for i, count in enumerate(bbox_counts):
    if count > 0:
        percentage = (count / len(train_dataset)) * 100
        ax4.text(i, count + 0.5, f'{count}\n({percentage:.1f}%)', 
                ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

# Print detailed bbox distribution
unique_bbox_counts, bbox_frequencies = np.unique(num_bboxes_per_image, return_counts=True)
print(f"\n📊 Bounding box distribution breakdown:")
for count, freq in zip(unique_bbox_counts, bbox_frequencies):
    percentage = (freq / len(train_dataset)) * 100
    print(f"  {count} box(es): {freq:3d} images ({percentage:5.1f}%)")

## 7. Bounding Box Characteristics Analysis

Detailed analysis of bounding box properties for anchor configuration and model optimization.

In [None]:
print("=" * 60)
print("BOUNDING BOX DETAILED ANALYSIS")
print("=" * 60)
print(f"Analyzing bbox sizes, positions, and coverage...\n")

# Collect detailed bbox statistics
bbox_widths = []
bbox_heights = []
bbox_areas = []
bbox_aspect_ratios = []
coverage_ratios = []
center_x_positions = []
center_y_positions = []

for example in train_dataset:
    img_width, img_height = example['width'], example['height']
    img_area = img_width * img_height
    
    for bbox in example['objects']['bbox']:
        # Convert normalized coordinates to pixels
        # Note: bbox format is [y_min, x_min, y_max, x_max]
        y_min, x_min, y_max, x_max = bbox
        w = (x_max - x_min) * img_width
        h = (y_max - y_min) * img_height
        area = w * h
        aspect_ratio = w / h if h > 0 else 0
        coverage = area / img_area if img_area > 0 else 0
        
        # Calculate center position (normalized)
        center_x = (x_min + x_max) / 2
        center_y = (y_min + y_max) / 2
        
        bbox_widths.append(w)
        bbox_heights.append(h)
        bbox_areas.append(area)
        bbox_aspect_ratios.append(aspect_ratio)
        coverage_ratios.append(coverage)
        center_x_positions.append(center_x)
        center_y_positions.append(center_y)

# Convert to numpy
bbox_widths = np.array(bbox_widths)
bbox_heights = np.array(bbox_heights)
bbox_areas = np.array(bbox_areas)
bbox_aspect_ratios = np.array(bbox_aspect_ratios)
coverage_ratios = np.array(coverage_ratios)

print(f"✅ Analyzed {len(bbox_widths)} bounding boxes!")

print(f"\n📏 BOUNDING BOX SIZE STATISTICS:")
print(f"  Width  - Min: {bbox_widths.min():4.0f}px | Max: {bbox_widths.max():4.0f}px | Mean: {bbox_widths.mean():.0f}px | Std: {bbox_widths.std():.0f}px")
print(f"  Height - Min: {bbox_heights.min():4.0f}px | Max: {bbox_heights.max():4.0f}px | Mean: {bbox_heights.mean():.0f}px | Std: {bbox_heights.std():.0f}px")
print(f"  Area   - Min: {bbox_areas.min():8.0f} | Max: {bbox_areas.max():8.0f} | Mean: {bbox_areas.mean():.0f}")

print(f"\n📐 ASPECT RATIO & COVERAGE:")
print(f"  Aspect Ratio (W/H) - Min: {bbox_aspect_ratios.min():.2f} | Max: {bbox_aspect_ratios.max():.2f} | Mean: {bbox_aspect_ratios.mean():.2f}")
print(f"  Image Coverage     - Min: {coverage_ratios.min():.1%} | Max: {coverage_ratios.max():.1%} | Mean: {coverage_ratios.mean():.1%}")
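
A small observation about the coverage ratio computed above: because the boxes are normalized, the image dimensions cancel out, so coverage equals normalized width times normalized height. The sketch below checks this on one made-up box. Both function names are hypothetical.

```python
import math


def coverage_from_normalized(bbox):
    """Coverage as normalized width times normalized height."""
    y_min, x_min, y_max, x_max = bbox
    return (x_max - x_min) * (y_max - y_min)


def coverage_from_pixels(bbox, width, height):
    """Coverage as pixel box area divided by pixel image area (as in the cell above)."""
    y_min, x_min, y_max, x_max = bbox
    box_area = ((x_max - x_min) * width) * ((y_max - y_min) * height)
    return box_area / (width * height)


bbox = [0.2, 0.1, 0.7, 0.9]  # made-up box: 80% of the width, 50% of the height
cov_norm = coverage_from_normalized(bbox)
cov_px = coverage_from_pixels(bbox, width=1200, height=1600)
print(math.isclose(cov_norm, cov_px))
```

The same does not hold for the aspect ratio: `w / h` in pixels depends on the image's own width and height, so the pixel conversion is still needed there.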

### Visualize Bounding Box Characteristics

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

# 1. Bbox widths and heights
ax1.hist(bbox_widths, bins=25, alpha=0.6, color='purple', edgecolor='black', label='Width')
ax1.hist(bbox_heights, bins=25, alpha=0.6, color='orange', edgecolor='black', label='Height')
ax1.set_xlabel('Size (pixels)')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of Bounding Box Dimensions')
ax1.legend()
ax1.grid(True, alpha=0.3)

# 2. Aspect ratios
ax2.hist(bbox_aspect_ratios, bins=25, alpha=0.7, color='brown', edgecolor='black')
ax2.set_xlabel('Aspect Ratio (Width/Height)')
ax2.set_ylabel('Frequency')
ax2.set_title('Distribution of Bounding Box Aspect Ratios')
ax2.axvline(bbox_aspect_ratios.mean(), color='red', linestyle='--', 
            label=f'Mean: {bbox_aspect_ratios.mean():.2f}')
ax2.legend()
ax2.grid(True, alpha=0.3)

# 3. Coverage ratios
ax3.hist(coverage_ratios * 100, bins=25, alpha=0.7, color='green', edgecolor='black')
ax3.set_xlabel('Image Coverage (%)')
ax3.set_ylabel('Frequency')
ax3.set_title('Bounding Box Coverage of Image')
ax3.axvline(coverage_ratios.mean() * 100, color='red', linestyle='--', 
            label=f'Mean: {coverage_ratios.mean():.1%}')
ax3.legend()
ax3.grid(True, alpha=0.3)

# 4. Bbox center positions
ax4.scatter(center_x_positions, center_y_positions, alpha=0.6, color='red', s=20)
ax4.set_xlabel('Center X (normalized)')
ax4.set_ylabel('Center Y (normalized)')
ax4.set_title('Bounding Box Center Positions')
ax4.set_xlim(0, 1)
ax4.set_ylim(0, 1)
ax4.grid(True, alpha=0.3)
ax4.invert_yaxis()  # Match image coordinates (0,0 at top-left)

plt.tight_layout()
plt.show()

## 8. Category Analysis

In [None]:
print("=" * 60)
print("CATEGORY ANALYSIS")
print("=" * 60)

# Collect all categories
all_categories = []
for example in train_dataset:
    all_categories.extend(example['objects']['category_name'])

category_counts = Counter(all_categories)

print(f"\nCategory distribution:")
for category, count in category_counts.items():
    percentage = (count / len(all_categories)) * 100
    print(f"  {category}: {count} instances ({percentage:.1f}%)")

# Check for category variations
unique_categories = set(all_categories)
print(f"\nUnique category strings: {len(unique_categories)}")
if len(unique_categories) > 1:
    print(f"  ⚠️  Multiple category variants detected: {unique_categories}")
else:
    print(f"  ✅ Consistent single category: {list(unique_categories)[0]}")
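
If the check above reports several category strings, some of them may differ only in case, spacing, or separators. The following sketch shows one way to normalize such variants before counting. The function `normalize_category` and the label strings are made up for illustration and are not taken from the dataset.

```python
from collections import Counter


def normalize_category(name):
    """Lowercase a label and unify spaces, underscores, and hyphens into single hyphens."""
    words = name.strip().lower().replace("_", " ").replace("-", " ").split()
    return "-".join(words)


# Made-up label variants
raw_labels = ["nutrition-table", "Nutrition Table", " nutrition_table ", "nutrition-table-small"]
counts = Counter(normalize_category(label) for label in raw_labels)
print(counts)
```

Labels that remain distinct after normalization, such as the fourth one here, are likely genuinely different categories and should be handled deliberately rather than merged.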

## 9. Key Insights and Recommendations

Based on the dataset analysis, here are important insights for model training.

In [None]:
print("=" * 60)
print("KEY INSIGHTS AND RECOMMENDATIONS")
print("=" * 60)

print(f"\n🔧 MODEL PREPARATION INSIGHTS:")
print(f"  Ratio of mean width to mean height: {image_widths.mean()/image_heights.mean():.2f}")
print(f"  Average bbox coverage: {coverage_ratios.mean():.1%} of image")
print(f"  Average bbox aspect ratio: {bbox_aspect_ratios.mean():.2f}")

print(f"\n💡 RECOMMENDATIONS:")

# Single vs multi-object detection
if num_bboxes_per_image.max() == 1:
    print(f"  ✅ Single object detection - simpler model configuration")
else:
    print(f"  ⚠️  Multi-object detection - configure model for up to {num_bboxes_per_image.max()} objects")

# Input resolution recommendation
suggested_resolution = int(np.sqrt(image_areas.mean()))
print(f"  📐 Consider input resolution around {suggested_resolution}×{suggested_resolution} pixels")

# Anchor configuration
mean_coverage = coverage_ratios.mean()
std_coverage = coverage_ratios.std()
scale_small = max(0.2, mean_coverage - std_coverage)
scale_medium = mean_coverage
scale_large = min(0.9, mean_coverage + std_coverage)
print(f"\n🎯 ANCHOR CONFIGURATION RECOMMENDATIONS:")
print(f"  Recommended anchor scales: [{scale_small:.1f}, {scale_medium:.1f}, {scale_large:.1f}]")

# Aspect ratios
mean_aspect = bbox_aspect_ratios.mean()
std_aspect = bbox_aspect_ratios.std()
ratio_narrow = max(0.5, mean_aspect - std_aspect)
ratio_medium = mean_aspect
ratio_wide = min(2.0, mean_aspect + std_aspect)
print(f"  Recommended aspect ratios: [{ratio_narrow:.1f}, {ratio_medium:.1f}, {ratio_wide:.1f}]")

print(f"\n✅ Dataset analysis complete!")
print(f"   Use these insights to configure your vision-language model for optimal performance.")
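
The image aspect ratio printed in the cell above is computed as mean width divided by mean height, which is not the same as the typical per-image aspect ratio. The sketch below uses made-up sizes to show how the two can diverge, and how the median of per-image ratios is less sensitive to a few unusually wide images.

```python
from statistics import mean, median

# Made-up (width, height) pairs: three portrait images and one very wide one
sizes = [(600, 800), (600, 800), (600, 800), (1600, 400)]

# What the cell above computes: ratio of the two means
ratio_of_means = mean(w for w, _ in sizes) / mean(h for _, h in sizes)

# Per-image aspect ratios, then their mean and median
per_image = [w / h for w, h in sizes]
mean_of_ratios = mean(per_image)
median_ratio = median(per_image)

print(f"ratio of means: {ratio_of_means:.3f}")
print(f"mean of ratios: {mean_of_ratios:.3f}")
print(f"median ratio:   {median_ratio:.3f}")
```

With the notebook's arrays, the per-image version would be `image_widths / image_heights`, followed by `np.median(...)`.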

## Summary

This notebook provided a comprehensive exploration of the OpenFoodFacts nutrition table detection dataset. Key findings:

1. **Dataset size**: `train` and `val` splits with nutrition table annotations (sample counts are printed in section 1)
2. **Image characteristics**: Variable sizes with diverse aspect ratios (section 6)
3. **Bounding boxes**: Normalized coordinates in [y_min, x_min, y_max, x_max] format
4. **Objects per image**: One or more boxes per image, with the exact distribution reported in section 6
5. **Coverage patterns**: The share of each image occupied by nutrition tables is quantified in section 7

Use these insights when:
- Configuring model input resolution
- Setting up data augmentation
- Designing anchor boxes (if applicable)
- Tuning training hyperparameters
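
As one illustration of the data augmentation point, geometric transforms have to be applied to the boxes as well as to the images. The sketch below mirrors a normalized box for a horizontal flip, assuming the [y_min, x_min, y_max, x_max] format described above. The function name `hflip_box` is hypothetical.

```python
def hflip_box(bbox):
    """Mirror a normalized [y_min, x_min, y_max, x_max] box left-to-right."""
    y_min, x_min, y_max, x_max = bbox
    # The old right edge becomes the new left edge, and vice versa
    return [y_min, 1.0 - x_max, y_max, 1.0 - x_min]


# Made-up box hugging the left edge of the image
original = [0.25, 0.0, 0.75, 0.25]
flipped = hflip_box(original)
print(flipped)  # [0.25, 0.75, 0.75, 1.0]
```

The y coordinates are unchanged, and applying the flip twice returns the original box.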

**Next steps**: Proceed to model understanding and fine-tuning in the main training notebook.