# TACO Dataset Splitting and Organization

This notebook splits the TACO dataset into training, validation, and test subsets. It includes:
1. Loading annotations and image metadata.
2. Creating separate directories for each subset.
3. Randomly shuffling and assigning images and annotations to the subsets.
4. Saving a COCO annotation file for each subset that contains only that subset's images and annotations.

The structured organization ensures that the dataset is ready for training and evaluation in downstream tasks such as object detection and classification.


In [15]:
import os
import json
import shutil
import random
import logging

In [16]:
# Set paths
SAVE_DIR = r"C:\Users\bumin\Downloads\DLCV project\TACO\data\images"
AUG_SAVE_DIR = r"C:\Users\bumin\Downloads\DLCV project\TACO\data\augmented_images"
ANNOTATIONS_PATH = r"C:\Users\bumin\Downloads\DLCV project\TACO\data\annotations.json"
SPLIT_SAVE_DIR = r"C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split"
os.makedirs(SPLIT_SAVE_DIR, exist_ok=True)

In [17]:
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')


In [18]:
# Create directories for train, val, and test splits
for split in ['train', 'val', 'test']:
    os.makedirs(os.path.join(SPLIT_SAVE_DIR, 'images', split), exist_ok=True)

## Load and Parse Annotations

The COCO annotations are loaded from a JSON file. The script extracts the `images` and `annotations` sections, which provide metadata about the images and their bounding box details. These are then used to organize the dataset into the defined subsets.
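
As a minimal sketch of how the `images` and `annotations` sections relate, the snippet below groups annotations by their `image_id` on a tiny made-up COCO-style dictionary. The helper name `group_annotations_by_image` and the sample records are illustrative assumptions; they are not TACO data and not part of any TACO tooling.

```python
from collections import defaultdict


def group_annotations_by_image(coco):
    """Map each image id to the list of annotations that reference it."""
    by_image = defaultdict(list)
    for ann in coco["annotations"]:
        by_image[ann["image_id"]].append(ann)
    return dict(by_image)


# Tiny hypothetical COCO-style structure (not real TACO data)
sample_coco = {
    "images": [{"id": 0, "file_name": "a.jpg"}, {"id": 1, "file_name": "b.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 0, "category_id": 5, "bbox": [1, 2, 30, 40]},
        {"id": 11, "image_id": 0, "category_id": 7, "bbox": [5, 5, 10, 10]},
        {"id": 12, "image_id": 1, "category_id": 5, "bbox": [0, 0, 20, 20]},
    ],
}

grouped = group_annotations_by_image(sample_coco)
# Print how many annotations each image has
print({image_id: len(anns) for image_id, anns in grouped.items()})
```

Each annotation points back to exactly one image through `image_id`, which is the link the split step relies on when it filters annotations per subset.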


In [19]:
# Load COCO annotations
with open(ANNOTATIONS_PATH, 'r') as file:
    annotations = json.load(file)

images = annotations['images']
annotations_list = annotations['annotations']

In [20]:
images

[{'id': 0,
  'width': 1537,
  'height': 2049,
  'file_name': 'batch_1/000006.jpg',
  'license': None,
  'flickr_url': 'https://farm66.staticflickr.com/65535/33978196618_e30a59e0a8_o.png',
  'coco_url': None,
  'date_captured': None,
  'flickr_640_url': 'https://farm66.staticflickr.com/65535/33978196618_632623b4fc_z.jpg'},
 {'id': 1,
  'width': 1537,
  'height': 2049,
  'file_name': 'batch_1/000008.jpg',
  'license': None,
  'flickr_url': 'https://farm66.staticflickr.com/65535/47803331152_ee00755a2e_o.png',
  'coco_url': None,
  'date_captured': None,
  'flickr_640_url': 'https://farm66.staticflickr.com/65535/47803331152_19beae025a_z.jpg'},
 {'id': 2,
  'width': 1537,
  'height': 2049,
  'file_name': 'batch_1/000010.jpg',
  'license': None,
  'flickr_url': 'https://farm66.staticflickr.com/65535/40888872753_08ffb24902_o.png',
  'coco_url': None,
  'date_captured': None,
  'flickr_640_url': 'https://farm66.staticflickr.com/65535/40888872753_631ab0f441_z.jpg'},
 {'id': 3,
  'width': 2049,


In [21]:
# List the original images (augmented images are currently excluded)
all_images = [f for f in os.listdir(SAVE_DIR) if f.endswith('.jpg')]
#augmented_images = [f for f in os.listdir(AUG_SAVE_DIR) if f.endswith('.jpg')]
#all_images.extend(augmented_images)

In [22]:
# Create a new images list including augmented images (currently disabled)
#augmented_image_ids = len(images) + 1
#for aug_img in augmented_images:
#    images.append({
#        "id": augmented_image_ids,
#        "file_name": aug_img,
#        "width": 640,  # Assuming width of augmented images
#        "height": 480  # Assuming height of augmented images
#    })
#    augmented_image_ids += 1

## Split Dataset into Train, Validation, and Test Subsets

The dataset is split into three subsets:
- **Training**: 70% of the data for model training.
- **Validation**: 15% of the data for model tuning.
- **Testing**: 15% of the data for final evaluation.

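Random shuffling makes each subset a random sample of the images, so class proportions are similar on average. It does not guarantee class balance: rare classes can be under-represented or missing in the smaller validation and test subsets.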
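
One way to check how classes are actually distributed after a random split is to count the annotation categories in each subset. The sketch below is an illustration under stated assumptions: the helper names `split_images` and `category_counts`, the seed value, and the toy records are all hypothetical, and the cut points use integer percentages rather than the float multiplication used in the notebook cell.

```python
import random
from collections import Counter


def split_images(images, train_pct=70, val_pct=15, seed=42):
    """Shuffle a copy of the image list and cut it into train/val/test."""
    shuffled = list(images)                # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n = len(shuffled)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


def category_counts(subset, annotations):
    """Count annotations per category for the images in one subset."""
    ids = {img["id"] for img in subset}
    return Counter(a["category_id"] for a in annotations if a["image_id"] in ids)


# Toy data (hypothetical): 20 images, one annotation each, 3 categories
demo_images = [{"id": i} for i in range(20)]
demo_annotations = [{"image_id": i, "category_id": i % 3} for i in range(20)]

train, val, test = split_images(demo_images)
for name, subset in (("train", train), ("val", val), ("test", test)):
    print(name, len(subset), dict(category_counts(subset, demo_annotations)))
```

If the per-subset counts look too uneven for the rare classes, a stratified split is the usual alternative; that is a different technique from the plain random split used in this notebook.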


In [23]:
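# Split dataset (70% train, 15% val, 15% test)
# Note: no random seed is set, so every run produces a different split.
# random.shuffle also reorders the `images` list in place.
random.shuffle(images)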
train_split = int(0.7 * len(images))
val_split = int(0.85 * len(images))

train_images = images[:train_split]
val_images = images[train_split:val_split]
test_images = images[val_split:]

In [24]:
train_images

[{'id': 749,
  'width': 3264,
  'height': 2448,
  'file_name': 'batch_2/000072.JPG',
  'license': None,
  'flickr_url': 'https://farm66.staticflickr.com/65535/47066269614_c86c132a1e_o.png',
  'coco_url': None,
  'date_captured': None,
  'flickr_640_url': 'https://farm66.staticflickr.com/65535/47066269614_f39a0f7d5d_z.jpg'},
 {'id': 722,
  'width': 2448,
  'height': 3264,
  'file_name': 'batch_2/000042.JPG',
  'license': None,
  'flickr_url': 'https://farm66.staticflickr.com/65535/40889058283_5a60b778e8_o.png',
  'coco_url': None,
  'date_captured': None,
  'flickr_640_url': 'https://farm66.staticflickr.com/65535/40889058283_2e117733c9_z.jpg'},
 {'id': 697,
  'width': 2448,
  'height': 3264,
  'file_name': 'batch_2/000015.JPG',
  'license': None,
  'flickr_url': 'https://farm66.staticflickr.com/65535/46939420395_06beac8f89_o.png',
  'coco_url': None,
  'date_captured': None,
  'flickr_640_url': 'https://farm66.staticflickr.com/65535/46939420395_100e4c7431_z.jpg'},
 {'id': 1025,
  'width

In [25]:
# Save annotations per split
def save_annotations(split_images, split_name):
    """Write a COCO-style JSON file with only this split's images and annotations."""
    # Build the id set once so each membership test is O(1)
    split_image_ids = {img['id'] for img in split_images}
    split_annotations = {
        "images": split_images,
        "annotations": [ann for ann in annotations_list if ann['image_id'] in split_image_ids],
        "categories": annotations['categories']
    }
    with open(os.path.join(SPLIT_SAVE_DIR, f'{split_name}_annotations.json'), 'w') as f:
        json.dump(split_annotations, f)

In [26]:
save_annotations(train_images, 'train')
save_annotations(val_images, 'val')
save_annotations(test_images, 'test')

## Copy Images and Annotations to Subset Directories

Images are copied into the `images/train`, `images/val`, and `images/test` folders under the split directory, next to the per-split annotation files written in the previous step. Each subset is then self-contained and ready for training, validation, or testing.
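
The copy step can be sketched on throwaway directories as follows. This is a minimal illustration, not the notebook's own cell: the helper name `copy_split_images` and the placeholder files are hypothetical, and it assumes images are stored as `<id>.jpg`, which is the naming the copy cell in this notebook looks for.

```python
import os
import shutil
import tempfile


def copy_split_images(split_images, src_dir, dest_dir):
    """Copy '<id>.jpg' for every image record; return the names not found."""
    os.makedirs(dest_dir, exist_ok=True)
    missing = []
    for img_info in split_images:
        img_name = f"{img_info['id']}.jpg"
        src_path = os.path.join(src_dir, img_name)
        if not os.path.isfile(src_path):
            missing.append(img_name)  # record it instead of failing
            continue
        shutil.copy2(src_path, os.path.join(dest_dir, img_name))
    return missing


# Demonstration with placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "images")
    os.makedirs(src)
    for image_id in (0, 1):
        with open(os.path.join(src, f"{image_id}.jpg"), "wb") as fh:
            fh.write(b"placeholder")

    dest = os.path.join(tmp, "dataset_split", "images", "train")
    missing = copy_split_images([{"id": 0}, {"id": 1}, {"id": 2}], src, dest)
    copied = sorted(os.listdir(dest))
    print("copied:", copied)
    print("missing:", missing)
```

Returning the missing names makes it easy to confirm afterwards that every image listed in a split's annotation file actually exists in that split's folder.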


In [32]:
# Copy images to their respective split folders
def move_images(split_images, split_name):
    """Copy (not move) each image of a split into its split folder."""
    for img_info in split_images:
        # Images are looked up as '<id>.jpg', not by the annotation's 'file_name'
        img_name = f"{img_info['id']}.jpg"

        # Check the original directory first, then the augmented directory.
        # os.path.isfile avoids re-listing the directories for every image
        # and does not fail if the augmented directory is absent.
        src_path = os.path.join(SAVE_DIR, img_name)
        if not os.path.isfile(src_path):
            src_path = os.path.join(AUG_SAVE_DIR, img_name)
        if not os.path.isfile(src_path):
            logging.warning(f"Image not found: {img_name}")
            continue

        dest_path = os.path.join(SPLIT_SAVE_DIR, 'images', split_name, img_name)

        # Copy the image to the respective split folder
        logging.info(f"Copying image {img_name} to {dest_path}")
        shutil.copy2(src_path, dest_path)

move_images(train_images, 'train')
move_images(val_images, 'val')
move_images(test_images, 'test')


2024-12-04 13:07:46,072 - INFO - Copying image 749.jpg to C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split\images\train\749.jpg
2024-12-04 13:07:46,088 - INFO - Copying image 722.jpg to C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split\images\train\722.jpg
2024-12-04 13:07:46,100 - INFO - Copying image 697.jpg to C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split\images\train\697.jpg
2024-12-04 13:07:46,109 - INFO - Copying image 1025.jpg to C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split\images\train\1025.jpg
2024-12-04 13:07:46,115 - INFO - Copying image 629.jpg to C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split\images\train\629.jpg
2024-12-04 13:07:46,123 - INFO - Copying image 929.jpg to C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split\images\train\929.jpg
2024-12-04 13:07:46,137 - INFO - Copying image 609.jpg to C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split\images\train\609.jpg
2024-12-04 13:07:46,147 - INFO - Copying image