# Divide TACO Dataset by Category

This notebook organizes the TACO dataset by supercategory. It involves:
1. Parsing COCO-style annotations to map each category ID to its supercategory.
2. Assigning each image the supercategory of its largest annotation, measured by bounding-box area.
3. Copying the images of each split (train, val, test) into supercategory-specific directories.

By the end of this notebook, each split is divided into one directory per supercategory, which simplifies category-wise analysis and processing.
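
As a rough orientation before the cells below, here is a hand-written miniature of the COCO-style layout that the notebook parses. The field names follow the COCO convention; every value is made up for illustration and is not taken from the TACO files.

```python
# Hypothetical miniature of a COCO-style annotation file (illustrative values only).
coco = {
    "images": [
        {"id": 0, "file_name": "batch_1/000000.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 5, "name": "Clear plastic bottle", "supercategory": "Bottle"},
    ],
    "annotations": [
        # bbox is [x_min, y_min, width, height] in pixels
        {"id": 0, "image_id": 0, "category_id": 5, "bbox": [10.0, 20.0, 100.0, 50.0]},
    ],
}

# An annotation reaches its supercategory through its category_id.
category_by_id = {cat["id"]: cat for cat in coco["categories"]}
annotation = coco["annotations"][0]
supercategory = category_by_id[annotation["category_id"]]["supercategory"]
print(supercategory)  # prints "Bottle"
```

The three top-level lists are linked only by IDs: `image_id` points into `images` and `category_id` points into `categories`.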


In [1]:
import json
import os
from collections import defaultdict
import shutil



In [2]:
# Paths to your annotations
train_annotations_path = r"C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split\train_annotations.json"
val_annotations_path = r"C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split\val_annotations.json"
test_annotations_path = r"C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split\test_annotations.json"

In [3]:
with open(train_annotations_path, 'r') as f:
    data = json.load(f)

In [4]:
data['images']

[{'id': 749,
  'width': 3264,
  'height': 2448,
  'file_name': 'batch_2/000072.JPG',
  'license': None,
  'flickr_url': 'https://farm66.staticflickr.com/65535/47066269614_c86c132a1e_o.png',
  'coco_url': None,
  'date_captured': None,
  'flickr_640_url': 'https://farm66.staticflickr.com/65535/47066269614_f39a0f7d5d_z.jpg'},
 {'id': 722,
  'width': 2448,
  'height': 3264,
  'file_name': 'batch_2/000042.JPG',
  'license': None,
  'flickr_url': 'https://farm66.staticflickr.com/65535/40889058283_5a60b778e8_o.png',
  'coco_url': None,
  'date_captured': None,
  'flickr_640_url': 'https://farm66.staticflickr.com/65535/40889058283_2e117733c9_z.jpg'},
 {'id': 697,
  'width': 2448,
  'height': 3264,
  'file_name': 'batch_2/000015.JPG',
  'license': None,
  'flickr_url': 'https://farm66.staticflickr.com/65535/46939420395_06beac8f89_o.png',
  'coco_url': None,
  'date_captured': None,
  'flickr_640_url': 'https://farm66.staticflickr.com/65535/46939420395_100e4c7431_z.jpg'},
 {'id': 1025,
  'width

## Retrieve Category Details

A helper function retrieves a single annotation from a COCO-format file by its annotation ID. Inspecting one annotation shows the fields, such as `image_id` and `category_id`, that the later steps use to assign images to supercategories.


In [5]:
def get_id_details(annotation_path, target_id):
    """
    Retrieve a single annotation by its ID from a COCO-format annotations file.

    Args:
        annotation_path (str): Path to the annotations file.
        target_id (int): The annotation ID to search for.

    Returns:
        dict or None: The matching annotation, or None if no annotation has this ID.
    """
    with open(annotation_path, 'r') as f:
        data = json.load(f)

    # Linear scan over all annotations; stop at the first match
    for annotation in data['annotations']:
        if annotation['id'] == target_id:
            return annotation
    return None

# Example usage
target_id = 4
details = get_id_details(train_annotations_path, target_id)

if details:
    print(f"Details for ID {target_id}: {details}")
else:
    print(f"ID {target_id} not found.")


Details for ID 4: {'id': 4, 'image_id': 2, 'category_id': 5, 'segmentation': [[670.0, 993.0, 679.0, 998.0, 684.0, 1001.0, 688.0, 1005.0, 690.0, 1009.0, 697.0, 1010.0, 707.0, 1011.0, 743.0, 1013.0, 754.0, 1012.0, 782.0, 1016.0, 801.0, 1021.0, 815.0, 1027.0, 843.0, 1047.0, 912.0, 1091.0, 924.0, 1101.0, 948.0, 1118.0, 1026.0, 1166.0, 1094.0, 1208.0, 1111.0, 1219.0, 1124.0, 1230.0, 1131.0, 1242.0, 1132.0, 1253.0, 1129.0, 1264.0, 1124.0, 1275.0, 1118.0, 1285.0, 1108.0, 1298.0, 1100.0, 1313.0, 1092.0, 1323.0, 1088.0, 1331.0, 1081.0, 1342.0, 1074.0, 1350.0, 1066.0, 1358.0, 1054.0, 1361.0, 1040.0, 1358.0, 1025.0, 1349.0, 1018.0, 1347.0, 1013.0, 1344.0, 1008.0, 1340.0, 1006.0, 1336.0, 998.0, 1333.0, 993.0, 1329.0, 968.0, 1313.0, 918.0, 1278.0, 895.0, 1259.0, 881.0, 1250.0, 871.0, 1244.0, 858.0, 1231.0, 844.0, 1226.0, 834.0, 1221.0, 826.0, 1216.0, 786.0, 1188.0, 746.0, 1164.0, 725.0, 1148.0, 709.0, 1130.0, 699.0, 1111.0, 679.0, 1073.0, 670.0, 1056.0, 668.0, 1051.0, 664.0, 1049.0, 659.0, 1048.0, 
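
The helper above reloads the JSON file and scans every annotation on each call. If many lookups are needed, one possible alternative is to load the file once and index the annotations by ID. The sketch below assumes annotation IDs are unique within a file; `build_annotation_index` is a hypothetical name, not part of the original notebook.

```python
import json


def build_annotation_index(annotation_path):
    """Load a COCO-format file once and index its annotations by annotation ID."""
    with open(annotation_path, "r") as f:
        coco = json.load(f)
    # Assumes IDs are unique; a duplicate ID would silently overwrite the earlier entry.
    return {annotation["id"]: annotation for annotation in coco["annotations"]}


# Usage (commented out because it needs the annotation file on disk):
# index = build_annotation_index(train_annotations_path)
# details = index.get(4)  # None if the ID is absent
```

After the index is built, each lookup is a dictionary access instead of a full pass over the file.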

In [9]:
def parse_annotations_with_supercategories(annotation_path):
    """Map each image file name to the supercategory of its largest annotation."""
    with open(annotation_path, 'r') as f:
        data = json.load(f)

    # Map image_id to the file name used in the split image folders.
    # This assumes the split images are stored as "<image_id>.jpg"
    # rather than under the original 'file_name' field.
    image_id_to_file = {image['id']: f"{image['id']}.jpg" for image in data['images']}

    # Map category_id to supercategory
    category_id_to_supercategory = {cat['id']: cat['supercategory'] for cat in data['categories']}

    # For each image_id, keep the supercategory of the annotation with the largest bbox area
    image_to_supercategories = {}
    for annotation in data['annotations']:
        image_id = annotation['image_id']
        bbox = annotation['bbox']  # [x_min, y_min, width, height]
        area = bbox[2] * bbox[3]  # Bounding-box area (width * height)

        # Replace the stored entry only if this annotation is strictly larger
        if image_id not in image_to_supercategories or area > image_to_supercategories[image_id]['area']:
            image_to_supercategories[image_id] = {
                'supercategory': category_id_to_supercategory[annotation['category_id']],
                'area': area
            }
    # Debug output: the selected supercategory and area per image_id
    print(image_to_supercategories)

    # Map image file names to a one-element list holding the selected supercategory
    image_to_supercategory_map = {}
    for image_id, info in image_to_supercategories.items():
        image_to_supercategory_map[image_id_to_file[image_id]] = [info['supercategory']]

    return image_to_supercategory_map


# Parse train, validation, and test annotations
train_image_to_supercategory = parse_annotations_with_supercategories(train_annotations_path)
val_image_to_supercategory = parse_annotations_with_supercategories(val_annotations_path)
test_image_to_supercategory = parse_annotations_with_supercategories(test_annotations_path)

{0: {'supercategory': 'Bottle', 'area': 590934.0}, 1: {'supercategory': 'Carton', 'area': 2170651.0}, 2: {'supercategory': 'Bottle', 'area': 187000.0}, 3: {'supercategory': 'Bottle', 'area': 255148.0}, 5: {'supercategory': 'Bottle', 'area': 952413.0}, 6: {'supercategory': 'Can', 'area': 401625.0}, 9: {'supercategory': 'Can', 'area': 167475.0}, 10: {'supercategory': 'Bottle', 'area': 135150.0}, 11: {'supercategory': 'Plastic bag & wrapper', 'area': 74529.0}, 14: {'supercategory': 'Bottle', 'area': 1321902.0}, 15: {'supercategory': 'Bottle', 'area': 767988.0}, 16: {'supercategory': 'Can', 'area': 589734.0}, 17: {'supercategory': 'Cup', 'area': 55458.0}, 18: {'supercategory': 'Can', 'area': 29280.0}, 19: {'supercategory': 'Can', 'area': 158865.0}, 20: {'supercategory': 'Can', 'area': 314568.0}, 22: {'supercategory': 'Can', 'area': 357903.0}, 23: {'supercategory': 'Bottle', 'area': 523908.0}, 25: {'supercategory': 'Plastic bag & wrapper', 'area': 282726.0}, 28: {'supercategory': 'Can', 'ar

In [11]:
train_image_to_supercategory['1401.jpg']

['Can']
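
Before copying any files, it can help to see how many images land in each supercategory. A minimal sketch follows, assuming a mapping in the same shape as the one returned by `parse_annotations_with_supercategories` (file name to a list of supercategories). The function name `count_images_per_supercategory` and the example mapping are hypothetical.

```python
from collections import Counter


def count_images_per_supercategory(image_to_supercategory_map):
    """Count how many images are assigned to each supercategory."""
    return Counter(
        supercategory
        for supercategories in image_to_supercategory_map.values()
        for supercategory in supercategories
    )


# Made-up mapping in the same shape as the parser's return value
example_map = {"1.jpg": ["Can"], "2.jpg": ["Bottle"], "3.jpg": ["Can"]}
print(count_images_per_supercategory(example_map).most_common())
# prints [('Can', 2), ('Bottle', 1)]
```

Calling it on `train_image_to_supercategory` would give the per-supercategory image counts for the training split, which is one way to spot heavily imbalanced classes.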

## Divide Data by Categories

Using the parsed mapping, this section copies each image into a directory named after its assigned supercategory, that is, the supercategory of its largest annotation. Source files that cannot be found are skipped without a warning. The resulting layout supports category-wise analysis or training.


In [12]:
def organize_images_by_supercategory(image_to_supercategory_map, images_dir, output_dir):
    """Copy each image into output_dir/<supercategory>/ for every supercategory it maps to."""
    os.makedirs(output_dir, exist_ok=True)
    for image_name, supercategories in image_to_supercategory_map.items():
        # Each image maps to a list of supercategories (a single entry with the
        # current parser); the image is copied into every listed folder
        for supercategory in supercategories:
            supercategory_dir = os.path.join(output_dir, supercategory)
            os.makedirs(supercategory_dir, exist_ok=True)

            # Copy (not move) the image; skip it silently if the source file is missing
            src_path = os.path.join(images_dir, image_name)
            dest_path = os.path.join(supercategory_dir, image_name)
            if os.path.exists(src_path):
                shutil.copy(src_path, dest_path)

# Directories where images are stored and where to save organized data
train_images_dir = r"C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split\images\train"
val_images_dir = r"C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split\images\val"
test_images_dir = r"C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split\images\test"

train_output_dir = r"C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split\train"
val_output_dir = r"C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split\val"
test_output_dir = r"C:\Users\bumin\Downloads\DLCV project\TACO\dataset_split\test"

# Organize datasets
organize_images_by_supercategory(train_image_to_supercategory, train_images_dir, train_output_dir)
organize_images_by_supercategory(val_image_to_supercategory, val_images_dir, val_output_dir)
organize_images_by_supercategory(test_image_to_supercategory, test_images_dir, test_output_dir)
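
Because missing source files are skipped silently during the copy, a quick check of the result may be worthwhile. The sketch below counts the files in each immediate subdirectory of an output directory. `count_files_per_subdirectory` is a hypothetical helper, not part of the original notebook.

```python
import os


def count_files_per_subdirectory(root_dir):
    """Return {subdirectory name: number of regular files directly inside it}."""
    counts = {}
    for entry in sorted(os.listdir(root_dir)):
        sub_dir = os.path.join(root_dir, entry)
        if os.path.isdir(sub_dir):
            # Count regular files only; nested directories are ignored
            counts[entry] = sum(
                os.path.isfile(os.path.join(sub_dir, name))
                for name in os.listdir(sub_dir)
            )
    return counts


# Usage (commented out because it needs the organized dataset on disk):
# print(count_files_per_subdirectory(train_output_dir))
```

If the total across all subdirectories is smaller than `len(train_image_to_supercategory)`, some source images were not found in `train_images_dir`.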
