# Dataset Exploration – Checkbox Identification

This notebook presents a **retrospective exploration** of the datasets used in this
project for checkbox state classification (checked vs unchecked) using a
vision-language model.

The original dataset was sourced from Roboflow Universe in COCO format and was
designed for checkbox **object detection**. As part of the experimental workflow,
bounding box annotations were used to crop checkbox regions and construct
classification-ready datasets. A reduced and balanced subset was later created
for efficient model training and evaluation.

The purpose of this notebook is to document and analyze the **final dataset
pipeline**, including:
- the original COCO dataset structure,
- the cropped checkbox dataset,
- and the reduced dataset used for training.

The goals of this analysis are to:
- Understand dataset structure at each stage of processing,
- Inspect class distribution across train, validation, and test splits,
- Identify dataset challenges such as class imbalance and visual ambiguity in
  checkbox regions.

This exploration provides the foundation for the preprocessing choices and model
design decisions applied in the subsequent stages of the project.


## Original Dataset Source

The primary dataset used in this project was sourced from **Roboflow Universe**, specifically
the *Checkbox Detection* dataset (Version 7):

https://universe.roboflow.com/checkbox-detection-ztyeq/checkbox-kwtcz-qkbid/dataset/7/images

This dataset contains 843 document images annotated for checkbox object detection, with
predefined train, validation, and test splits (611, 151, and 81 images respectively). The
images include scanned forms, UI screenshots, and application interfaces, so the dataset
covers both scanned and digitally rendered checkboxes.

While the original dataset is designed for object detection, it was selected for this
project due to the quality and diversity of checkbox annotations. These annotations were
used to extract checkbox regions, enabling the construction of a downstream checkbox
state classification dataset (checked vs unchecked).

The dataset was downloaded in **COCO format**
(`checkbox.v7i.coco`). The COCO annotation format stores, for each annotated object,
a bounding box as `[x, y, width, height]` in pixels together with a category label.
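
As a minimal sketch of how these annotations can be inspected, the snippet below counts
bounding-box annotations per category in a COCO dictionary. The `images`, `categories`,
and `annotations` keys belong to the COCO format itself; the annotation file name
`_annotations.coco.json`, the helper names, and the category names in the toy example are
assumptions for illustration and are not confirmed by this notebook.

```python
import json
from collections import Counter
from pathlib import Path


def load_coco(annotation_path):
    """Read a COCO annotation file into a plain dictionary."""
    with open(annotation_path, encoding="utf-8") as f:
        return json.load(f)


def count_annotations_per_category(coco):
    """Return a Counter mapping category name -> number of annotations."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    return Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])


# Toy in-memory example; file name, category names, and boxes are hypothetical.
toy_coco = {
    "images": [{"id": 0, "file_name": "form_0.jpg", "width": 640, "height": 640}],
    "categories": [{"id": 1, "name": "checked"}, {"id": 2, "name": "unchecked"}],
    "annotations": [
        {"id": 0, "image_id": 0, "category_id": 1, "bbox": [10, 20, 30, 30]},
        {"id": 1, "image_id": 0, "category_id": 2, "bbox": [60, 20, 30, 30]},
        {"id": 2, "image_id": 0, "category_id": 2, "bbox": [110, 20, 30, 30]},
    ],
}
toy_counts = count_annotations_per_category(toy_coco)
print(toy_counts)

# Assumed usage on disk (the file name is an assumption about the export layout):
# coco = load_coco(Path("../data/raw/checkbox.v7i.coco") / "train" / "_annotations.coco.json")
# print(count_annotations_per_category(coco))
```

Keeping the counting logic separate from file loading makes it easy to run the same check
on each split once the actual annotation file name is known.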



## Motivation for Cropping Checkbox Regions

To adapt the detection dataset for checkbox state classification, the annotated
bounding boxes were used to crop checkbox regions from the original images.

Cropping serves several purposes:
- Removes irrelevant background content
- Focuses the model on local visual cues
- Reduces input complexity
- Produces small, single-object inputs that can be resized to a consistent size for training

This transformation converts the detection dataset into a classification-ready
dataset while preserving the original annotation information.
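
The cropping step described above can be sketched as a conversion from a COCO
`[x, y, width, height]` box to a crop rectangle. This is an illustrative helper under
stated assumptions, not the code used in this project: the function name, the rounding
choice, and the optional `padding` parameter are all hypothetical.

```python
def coco_bbox_to_crop_box(bbox, image_width, image_height, padding=0):
    """Convert a COCO [x, y, width, height] box to (left, upper, right, lower).

    The result is rounded to whole pixels, optionally padded, and clamped to
    the image bounds so the crop never reaches outside the image.
    """
    x, y, w, h = bbox
    left = max(0, int(round(x)) - padding)
    upper = max(0, int(round(y)) - padding)
    right = min(image_width, int(round(x + w)) + padding)
    lower = min(image_height, int(round(y + h)) + padding)
    if right <= left or lower <= upper:
        raise ValueError(f"Empty crop region for bbox {bbox}")
    return left, upper, right, lower


# Hypothetical box on a 640x640 image, with 2 pixels of padding (an assumed value).
box = coco_bbox_to_crop_box([10.4, 20.6, 30.0, 30.0], 640, 640, padding=2)
print(box)  # (8, 19, 42, 53)
```

With Pillow, the returned tuple can be passed directly to `Image.crop`, which expects the
same `(left, upper, right, lower)` ordering.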


In [11]:
import json
from pathlib import Path
from collections import Counter

# Path to COCO dataset root
COCO_ROOT = Path("../data/raw/checkbox.v7i.coco")

print("COCO dataset exists:", COCO_ROOT.exists())
print("Splits:", [p.name for p in COCO_ROOT.iterdir() if p.is_dir()])


COCO dataset exists: True
Splits: ['valid', 'test', 'train']


In [16]:
from pathlib import Path

COCO_ROOT = Path("../data/raw/checkbox.v7i.coco")

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def count_images_robust(dataset_root):
    """Count image files in each split folder, searching subfolders recursively."""
    counts = {}
    for split in dataset_root.iterdir():
        if split.is_dir():
            # Compare the lowercased suffix so ".JPG" and ".jpg" are both counted.
            images = [
                p for p in split.rglob("*")
                if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
            ]
            counts[split.name] = len(images)
    return counts

coco_image_counts = count_images_robust(COCO_ROOT)

total = 0
for split, count in coco_image_counts.items():
    print(f"{split.upper()} SET — Images: {count}")
    total += count

print("\nTOTAL IMAGES IN ORIGINAL COCO DATASET:", total)


VALID SET — Images: 151
TEST SET — Images: 81
TRAIN SET — Images: 611

TOTAL IMAGES IN ORIGINAL COCO DATASET: 843


## Cropped Checkbox Dataset

Using the bounding box annotations from the original COCO dataset, checkbox regions
were cropped from the full document images. The resulting dataset was stored in the
folder `cropped_checkboxes_binary`.

This dataset represents the first classification-ready version of the data, where
each sample corresponds to a single checkbox labeled as either **checked** or
**unchecked**. The cropped dataset retains the original train, validation, and test
splits, so crops from the same source image stay in one split, which avoids data leakage.


In [17]:
from pathlib import Path

CROPPED_FULL = Path("../data/processed/cropped_checkboxes_binary")

def explore_cropped_dataset(dataset_root):
    """Return {split: {class_name: file_count}} for a split/class folder layout."""
    stats = {}
    for split in dataset_root.iterdir():
        if split.is_dir():
            split_stats = {}
            for cls in split.iterdir():
                if cls.is_dir():
                    # Count files only, so stray subfolders are not counted as samples.
                    split_stats[cls.name] = sum(1 for p in cls.glob("*") if p.is_file())
            stats[split.name] = split_stats
    return stats

cropped_full_stats = explore_cropped_dataset(CROPPED_FULL)

for split, classes in cropped_full_stats.items():
    total = sum(classes.values())
    print(f"\n{split.upper()} SET — Total cropped checkboxes: {total}")
    for cls, count in classes.items():
        pct = (count / total) * 100 if total > 0 else 0
        print(f"  {cls}: {count} ({pct:.2f}%)")



VALID SET — Total cropped checkboxes: 978
  unchecked: 451 (46.11%)
  checked: 527 (53.89%)

TEST SET — Total cropped checkboxes: 714
  unchecked: 481 (67.37%)
  checked: 233 (32.63%)

TRAIN SET — Total cropped checkboxes: 4403
  unchecked: 2374 (53.92%)
  checked: 2029 (46.08%)


## Observations on Cropped Dataset (`cropped_checkboxes_binary`)

The cropped dataset contains a total of 6,095 checkbox samples distributed across
training, validation, and test splits. Each sample is labeled as either **checked**
or **unchecked**.

Analysis of the class distribution shows that the degree of imbalance varies by split.
The training and validation splits are close to balanced (53.92% unchecked and 53.89%
checked, respectively), while the test split is skewed toward unchecked samples (67.37%).

An imbalance of this size can make accuracy misleading, because a model that favors the
majority class is rewarded on the test split. Additionally, the full cropped dataset is
relatively large, increasing computational cost during fine-tuning experiments.
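
To make the evaluation concern concrete, the short sketch below computes the accuracy a
trivial classifier would reach by always predicting the majority class of each split. The
counts are copied from the class-distribution output above; the helper name
`majority_baseline` is hypothetical.

```python
# Class counts copied from the cropped dataset output above.
full_counts = {
    "train": {"unchecked": 2374, "checked": 2029},
    "valid": {"unchecked": 451, "checked": 527},
    "test": {"unchecked": 481, "checked": 233},
}


def majority_baseline(class_counts):
    """Return (majority class, accuracy of always predicting that class)."""
    total = sum(class_counts.values())
    label = max(class_counts, key=class_counts.get)
    return label, class_counts[label] / total


baselines = {split: majority_baseline(c) for split, c in full_counts.items()}
for split, (label, acc) in baselines.items():
    print(f"{split}: always predicting '{label}' gives {acc:.2%} accuracy")
# train: always predicting 'unchecked' gives 53.92% accuracy
# valid: always predicting 'checked' gives 53.89% accuracy
# test: always predicting 'unchecked' gives 67.37% accuracy
```

On the full test split, any reported accuracy therefore needs to be read against a 67.37%
floor rather than 50%.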


## Motivation for Dataset Reduction

To address class imbalance and reduce computational overhead, a smaller and balanced
subset of the cropped dataset was constructed. This reduced dataset preserves the
original train, validation, and test splits while ensuring equal representation of
checked and unchecked samples.

The resulting dataset, stored as `cropped_checkboxes_binary_small`, was used for all
model training and evaluation experiments in this project.
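
This notebook does not show how the subset was drawn. One possible approach, sketched
below under that caveat, is random sampling without replacement with a fixed seed, taking
the same number of files from each class. The function name `sample_balanced`, the seed
value, and the toy file names are hypothetical; only the class counts (481 unchecked and
233 checked in the full test split, 200 per class in the reduced one) come from this
notebook.

```python
import random


def sample_balanced(files_by_class, per_class, seed=42):
    """Draw `per_class` items from each class without replacement."""
    rng = random.Random(seed)
    subset = {}
    for label, files in files_by_class.items():
        if len(files) < per_class:
            raise ValueError(
                f"Class '{label}' has {len(files)} files, need {per_class}"
            )
        # Sort first so the draw does not depend on filesystem ordering.
        subset[label] = rng.sample(sorted(files), per_class)
    return subset


# Hypothetical file names standing in for the full test split (481 / 233).
toy_split = {
    "unchecked": [f"unchecked_{i}.jpg" for i in range(481)],
    "checked": [f"checked_{i}.jpg" for i in range(233)],
}
balanced = sample_balanced(toy_split, per_class=200)
print({label: len(files) for label, files in balanced.items()})
# {'unchecked': 200, 'checked': 200}
```

Sampling within each split separately, as here, keeps the reduced subset from moving any
sample across the original split boundaries.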


In [25]:
from pathlib import Path

CROPPED_SMALL = Path("../data/processed/cropped_checkboxes_binary_small")

def explore_small_dataset(dataset_root):
    """Return {split: {class_name: file_count}}; same logic as explore_cropped_dataset."""
    stats = {}
    for split in dataset_root.iterdir():
        if split.is_dir():
            split_stats = {}
            for cls in split.iterdir():
                if cls.is_dir():
                    # Count files only, so stray subfolders are not counted as samples.
                    split_stats[cls.name] = sum(1 for p in cls.glob("*") if p.is_file())
            stats[split.name] = split_stats
    return stats

cropped_small_stats = explore_small_dataset(CROPPED_SMALL)

for split, classes in cropped_small_stats.items():
    total = sum(classes.values())
    print(f"\n{split.upper()} SET — Final images used: {total}")
    for cls, count in classes.items():
        pct = (count / total) * 100 if total > 0 else 0
        print(f"  {cls}: {count} ({pct:.2f}%)")



VALID SET — Final images used: 300
  unchecked: 150 (50.00%)
  checked: 150 (50.00%)

TEST SET — Final images used: 400
  unchecked: 200 (50.00%)
  checked: 200 (50.00%)

TRAIN SET — Final images used: 1000
  unchecked: 500 (50.00%)
  checked: 500 (50.00%)


## Reduced and Balanced Dataset (`cropped_checkboxes_binary_small`)

Based on the analysis of the full cropped dataset, a reduced and balanced subset was
constructed to support efficient and unbiased model training. The reduced dataset
ensures equal representation of checked and unchecked samples across all splits.

The final dataset consists of 1,700 cropped images:
- **1000** training images
- **300** validation images
- **400** test images

Each split maintains a 50–50 class balance, which helps prevent bias toward either class
during training and keeps accuracy from being inflated by a majority class at evaluation
time. This dataset was used for all baseline and fine-tuning experiments reported in this
project.


## Dataset Summary

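| Dataset Stage        | Unit            | Train | Validation | Test | Total |
|----------------------|-----------------|-------|------------|------|-------|
| Original (COCO)      | document images | 611   | 151        | 81   | 843   |
| Cropped (Full)       | checkbox crops  | 4403  | 978        | 714  | 6095  |
| Cropped (Small)      | checkbox crops  | 1000  | 300        | 400  | 1700  |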
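
Because the rows of this table count different things (full images in the first row,
checkbox crops in the other two), two derived ratios can help when reading it: the average
number of checkbox crops per source image, and the fraction of crops kept in the reduced
dataset. The sketch below computes both from the numbers in the table; the dictionary
layout and key names are illustrative choices.

```python
# Counts copied from the summary table.
summary = {
    "train": {"images": 611, "crops_full": 4403, "crops_small": 1000},
    "valid": {"images": 151, "crops_full": 978, "crops_small": 300},
    "test": {"images": 81, "crops_full": 714, "crops_small": 400},
}

derived = {}
for split, row in summary.items():
    derived[split] = {
        # Average number of cropped checkboxes obtained per document image.
        "crops_per_image": row["crops_full"] / row["images"],
        # Share of the full cropped split retained in the reduced dataset.
        "fraction_kept": row["crops_small"] / row["crops_full"],
    }
    print(
        f"{split}: {derived[split]['crops_per_image']:.2f} crops per image, "
        f"{derived[split]['fraction_kept']:.1%} of crops kept"
    )
```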
