<a href="https://colab.research.google.com/github/astrid12345/recyclo/blob/convert_taco_to_yolo/scripts/convert_taco_dataset_to_yolov8_format.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The purpose of this notebook is to convert a dataset to a format that can be used to train a YOLOv8 model.

To use, make a copy of this notebook, and adapt it to work with your specific dataset. Please save your version of this ipynb file on GitHub in *recyclo/scripts*.

(File > Save a copy in GitHub > File path = "scripts/my_filename.ipynb" to save notebook in scripts folder)

Once you've generated your YOLOv8 dataset and are confident you can train a model with it, please upload the converted dataset to the Recyclo datasets Google Drive: https://drive.google.com/drive/folders/1bUkIYQRXX08OKI5TuOSg-eqntSudGaFB.

(Why Google Drive? Because these datasets are too large for GitHub!)

In [None]:
!pip install -U ultralytics

from ultralytics import YOLO
import os

# Convert datasets for YOLOv8

### General

In general, YOLO models output the following for a given image:
* Bounding box
* Class label
* Confidence score

To train a YOLO model, we need object detection datasets that contain images of what we're looking for (trash), and annotations: class labels and bounding boxes.

### YOLOv8

In this project we will use Ultralytics YOLOv8 object detection.

YOLOv8 expects datasets in the following format:

```
dataset/
├── images/
│   ├── train/  <-- image files for training
│   └── val/    <-- image files for validation after each epoch. Must not overlap with images in train.
├── labels/
│   ├── train/  <-- one .txt file per train image. Contains class and bbox info.
│   └── val/    <-- one .txt file per val image
└── data.yaml   <-- config file
```

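Each file in labels/train holds one line per object in the matching image. All four box values are normalized to the range 0 to 1 (fractions of the image width and height):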
```
<class_id> <x_center> <y_center> <width> <height>
```
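
As a worked illustration of that line format, here is a minimal sketch with made-up numbers (not taken from any dataset). It assumes the source box is COCO-style, `[x_min, y_min, width, height]` in pixels; the helper name `coco_to_yolo_line` is hypothetical.

```python
def coco_to_yolo_line(class_id, bbox, img_w, img_h):
    """Format one COCO-style pixel box as a YOLO label line."""
    x_min, y_min, w, h = bbox
    # YOLO stores the box center, not the top-left corner
    x_center = (x_min + w / 2) / img_w
    y_center = (y_min + h / 2) / img_h
    values = (x_center, y_center, w / img_w, h / img_h)
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)

# Made-up example: a 200x100 px box at (100, 50) in a 640x480 image, class 2
line = coco_to_yolo_line(2, [100, 50, 200, 100], 640, 480)
print(line)  # 2 0.312500 0.208333 0.312500 0.208333
```

The center lands at (200, 100) pixels, which becomes 200/640 and 100/480 after normalization.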

Example data.yaml file:
```
path: /content/dataset  # Root folder
train: images/train
val: images/val

nc: 5  # number of classes
names: ['bottle', 'can', 'plastic bag', 'wrapper', 'paper']

```

### Colabs

When you open the "Files" tab on the left, you'll find yourself in a folder containing
* ..
* sample_data

This is Colab's default working directory, the "content" folder, which comes with sample data to get you started.
Ignore it: click the .. to go up a level.

---
⚠️ ***CHANGE THIS FILE FROM HERE DOWN TO SUIT YOUR DATASET*** ⚠️

The sections above apply for all dataset conversions.

---

### TACO dataset

#### Overview

The TACO dataset uses COCO-style annotations, which include segmentation data as well as bounding boxes.

Sidebar: COCO (Common Objects in Context) is an object detection, segmentation, and captioning dataset developed by Microsoft. It uses an annotations.json file to organize image data. This JSON layout has become a de facto standard that many other datasets follow.

So, TACO has an annotations.json file containing:
*   "images":  List of image metadata
*   "annotations":  List of label data (type of trash, bounding box definition, segmentation data; corresponds to images list)
*   "categories":  List of the different categories this dataset uses
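
To make the three lists concrete, the sketch below parses a tiny hand-written COCO-style document and links annotations to their image. The values are invented for illustration, and the real TACO file carries more fields (segmentation data, for example) that are omitted here.

```python
import json
from collections import defaultdict

# Tiny made-up COCO-style document with the three top-level keys
coco_text = """
{
  "images": [
    {"id": 0, "file_name": "batch_1/000001.jpg", "width": 640, "height": 480}
  ],
  "annotations": [
    {"id": 10, "image_id": 0, "category_id": 5, "bbox": [100, 50, 200, 100]},
    {"id": 11, "image_id": 0, "category_id": 7, "bbox": [10, 20, 30, 40]}
  ],
  "categories": [
    {"id": 5, "name": "bottle"},
    {"id": 7, "name": "can"}
  ]
}
"""
coco_example = json.loads(coco_text)

# Each annotation points at its image through "image_id"
anns_by_image = defaultdict(list)
for ann in coco_example["annotations"]:
    anns_by_image[ann["image_id"]].append(ann)

# Category ids need not start at 0 or be contiguous,
# so remap them to 0..N-1 for use as YOLO class ids
id_to_index = {cat["id"]: i for i, cat in enumerate(coco_example["categories"])}

print(len(anns_by_image[0]))  # 2
print(id_to_index)            # {5: 0, 7: 1}
```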

#### Conversion
To convert the TACO dataset to a format YOLOv8 can use, we must:
* Split the TACO images into train and val sets
* Extract label and bbox info from annotations.json, and save it in individual txt files corresponding to the image files
* Make a data.yaml file

In [1]:
import json
import shutil
import random
from pathlib import Path
from sklearn.model_selection import train_test_split
import yaml

# === IMPORT DATASET ===
import kagglehub
taco_dataset_path = kagglehub.dataset_download('kneroma/tacotrashdataset')  # https://www.kaggle.com/datasets/kneroma/tacotrashdataset
print(f"Dataset downloaded to {taco_dataset_path}\n")

# === CONFIG ===
input_root = Path(taco_dataset_path)
output_root = Path('/kaggle/working/taco_yolo')   # Since TACO is a kaggle dataset, this will output to kaggle/working/taco_yolo
train_ratio = 0.8  # 80% training, 20% validation

# === OUTPUT STRUCTURE ===
images_train = output_root / 'images' / 'train'
images_val = output_root / 'images' / 'val'
labels_train = output_root / 'labels' / 'train'
labels_val = output_root / 'labels' / 'val'

# Create folders
for folder in [images_train, images_val, labels_train, labels_val]:
    folder.mkdir(parents=True, exist_ok=True)

# === LOAD ANNOTATIONS ===
with open(input_root / 'data' / 'annotations.json', 'r') as f:
    coco = json.load(f)

image_id_to_info = {img['id']: img for img in coco['images']}
annotations_by_image = {}

for ann in coco['annotations']:
    image_id = ann['image_id']
    annotations_by_image.setdefault(image_id, []).append(ann)

category_map = {cat['id']: idx for idx, cat in enumerate(coco['categories'])}
category_names = [cat['name'] for cat in sorted(coco['categories'], key=lambda x: category_map[x['id']])]

# === SPLIT DATA ===
all_image_ids = list(image_id_to_info.keys())
train_ids, val_ids = train_test_split(all_image_ids, train_size=train_ratio, random_state=42)

def convert_bbox_to_yolo(bbox, img_w, img_h):
    x, y, w, h = bbox
    x_center = (x + w / 2) / img_w
    y_center = (y + h / 2) / img_h
    w /= img_w
    h /= img_h
    return x_center, y_center, w, h

def process_image(image_id, split):
    image_info = image_id_to_info[image_id]
    img_path = input_root / 'data' / image_info['file_name']
    img_w, img_h = image_info['width'], image_info['height']

    # Output paths
    out_img_dir = images_train if split == 'train' else images_val
    out_label_dir = labels_train if split == 'train' else labels_val

    rel_path = Path(image_info['file_name'])  # e.g., batch_1/000123.jpg
    batch_prefix = rel_path.parts[0]  # "batch_1" (a single path part never contains '/')
    filename = batch_prefix + "_" + rel_path.name  # "batch_1_000123.jpg"

    # Output flattened path
    out_img_path = out_img_dir / filename
    # with_suffix works for any extension or case (e.g. .JPG); replace('.jpg', ...) would miss those
    out_label_path = (out_label_dir / filename).with_suffix('.txt')

    # Copy image
    shutil.copy(input_root / 'data' / rel_path, out_img_path)

    # Write label
    with open(out_label_path, 'w') as label_file:
        for ann in annotations_by_image.get(image_id, []):
            class_id = category_map[ann['category_id']]
            bbox = convert_bbox_to_yolo(ann['bbox'], img_w, img_h)
            label_file.write(f"{class_id} {' '.join(f'{x:.6f}' for x in bbox)}\n")

# Process images
for image_id in train_ids:
    process_image(image_id, 'train')

for image_id in val_ids:
    process_image(image_id, 'val')

# === WRITE data.yaml ===
data_yaml = {
    'path': str(output_root),
    'train': 'images/train',
    'val': 'images/val',
    'nc': len(category_names),
    'names': category_names
}

with open(output_root / 'data.yaml', 'w') as f:
    yaml.dump(data_yaml, f)

print("Conversion complete. YOLOv8 dataset created at:", output_root)

Dataset downloaded to /kaggle/input/tacotrashdataset

Conversion complete. YOLOv8 dataset created at: /kaggle/working/taco_yolo


To verify that your conversion worked, make sure you can train a model and that it outputs images with a bounding box and label.
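
Before spending time on training, a quick structural check can catch common conversion mistakes such as missing label files, class ids outside the range of `nc`, or coordinates that were never normalized. The function below is a minimal sketch, not part of Ultralytics; the name `check_yolo_split` and the demo files are hypothetical.

```python
import tempfile
from pathlib import Path

def check_yolo_split(root, split, num_classes):
    """Return a list of problems found in one split of a YOLO-format dataset."""
    problems = []
    image_dir = Path(root) / "images" / split
    label_dir = Path(root) / "labels" / split
    for image_path in sorted(image_dir.iterdir()):
        label_path = label_dir / (image_path.stem + ".txt")
        if not label_path.exists():
            problems.append(f"missing label file for {image_path.name}")
            continue
        lines = label_path.read_text().splitlines()
        for line_no, line in enumerate(lines, start=1):
            fields = line.split()
            if len(fields) != 5:
                problems.append(f"{label_path.name}:{line_no}: expected 5 fields")
                continue
            class_id = int(fields[0])
            coords = [float(v) for v in fields[1:]]
            if not 0 <= class_id < num_classes:
                problems.append(f"{label_path.name}:{line_no}: class id {class_id} out of range")
            if any(v < 0.0 or v > 1.0 for v in coords):
                problems.append(f"{label_path.name}:{line_no}: coordinate outside [0, 1]")
    return problems

# Demo on a throwaway directory with one good and one bad sample
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images" / "train").mkdir(parents=True)
    (root / "labels" / "train").mkdir(parents=True)
    (root / "images" / "train" / "a.jpg").touch()
    (root / "images" / "train" / "b.JPG").touch()
    (root / "labels" / "train" / "a.txt").write_text("0 0.5 0.5 0.2 0.2\n")
    (root / "labels" / "train" / "b.txt").write_text("7 0.5 0.5 1.4 0.2\n")
    demo_problems = check_yolo_split(root, "train", num_classes=5)

print(demo_problems)
```

In this notebook it could be called as `check_yolo_split(output_root, 'train', len(category_names))` and again with `'val'`; an empty list means no structural problems were found. An empty label file is treated as valid, since it simply marks an image with no objects.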

In [None]:
model = YOLO('yolov8n.pt')
results = model.train(data='/kaggle/working/taco_yolo/data.yaml', epochs=5, imgsz=640)  # only a few epochs - this is just to see if it can work!

If the model outputs even one image with a bounding box and label, then the dataset should work for our project! Verify this using the code below.

In [None]:
import cv2
from random import sample
import matplotlib.pyplot as plt

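# Ultralytics numbers repeated runs (train, train2, ...); adjust this path if you trained more than once
model = YOLO('runs/detect/train/weights/best.pt')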

train_images_path = Path('/kaggle/working/taco_yolo/images/train')
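# Extensions may differ in case (e.g. .JPG), and glob('*.jpg') is case-sensitive on Linux
image_files = [p for p in train_images_path.iterdir() if p.suffix.lower() == '.jpg']

# Guard against datasets with fewer than 10 images
sample_images = sample(image_files, min(10, len(image_files)))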

for image_path in sample_images:
    result = model(image_path)[0]
    annotated_image = result.plot()

    plt.figure(figsize=(8, 6))
    # result.plot() returns a BGR array; matplotlib expects RGB
    plt.imshow(cv2.cvtColor(annotated_image, cv2.COLOR_BGR2RGB))
    plt.title(f'Predictions: {image_path.name}')
    plt.axis('off')
    plt.show()

If the model successfully generated even one image with a bounding box and label, please download the converted dataset and upload it to the Recyclo datasets Google Drive: https://drive.google.com/drive/folders/1bUkIYQRXX08OKI5TuOSg-eqntSudGaFB.

In [None]:
from datetime import datetime

dataset_name = 'taco_yolo'

# Generate date prefix
date_str = datetime.now().strftime('%Y%m%d')
zip_name = f"{date_str}_{dataset_name}.zip"

# Change directory and zip
%cd /kaggle/working/{dataset_name}
!zip -r /content/{zip_name} .

print(f"✅ Zip created at /content/{zip_name}")