This notebook converts the AquaTrash dataset to a format that can be used to train an Ultralytics YOLO model.

# Using this notebook: workflow

To use, make a copy of this notebook, and adapt it to work with your specific dataset. Please save your version of this ipynb file on GitHub in *recyclo/scripts*.

(File > Save a copy in GitHub > File path = "scripts/my_filename.ipynb" to save notebook in scripts folder)

Once you've generated your YOLO dataset and are confident you can train a model with it, please upload your converted dataset to the Recyclo datasets Google Drive, https://drive.google.com/drive/folders/1bUkIYQRXX08OKI5TuOSg-eqntSudGaFB.

(Why Google Drive? Because these datasets are too large for GitHub!)

# What's in this notebook: contents

Notebook contents:
- intro to YOLO
- intro to AquaTrash
- dataset-specific notes (update for your specific dataset)

# Pro tips about Colabs

When you open the "Files" tab on the left, you'll find yourself in a folder containing
* ..
* sample data

This is Colab's default "content" folder, which it creates to get you started.
You can ignore it: click the .. to go up a level.

# Intro to YOLO

## General

In general, YOLO models output the following for a given image:
* Bounding box
* Class label
* Confidence score

To train a YOLO model, we need object detection datasets that contain images of what we're looking for (trash), and annotations: class labels and bounding boxes.

## Ultralytics YOLO

In this project we will use Ultralytics YOLO object detection, e.g. their YOLO11n model. YOLO11n is a pretrained object detection model developed by Ultralytics.

Ultralytics YOLO expects datasets in the following format:

```
dataset/
├── images/
│   ├── train/  <-- image files for training.
│   ├── val/    <-- image files for validation after each epoch. Must not overlap with images in train.
│   └── test/   <-- optional: can put some image files here for benchmarking.
├── labels/
│   ├── train/  <-- one .txt file per train image (must have same name). Contains class and bbox info.
│   ├── val/    <-- one .txt file per val image.
│   └── test/   <-- one .txt file per test image.
└── data.yaml   <-- config file; helps tie all the above together.
```
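
The pairing rule in the layout above (one `.txt` label file per image, with the same name) is easy to check programmatically. The sketch below is an illustration, not part of Ultralytics: `find_unpaired_images` is a hypothetical helper, and the set of image extensions is an assumption you may need to adjust.

```python
from pathlib import Path

# Assumption: the image types present in your dataset; adjust as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def find_unpaired_images(dataset_root, split):
    """Return names of images in images/<split> that have no labels/<split>/<stem>.txt."""
    root = Path(dataset_root)
    image_dir = root / "images" / split
    label_dir = root / "labels" / split
    missing = []
    for image_path in sorted(image_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not an image
        if not (label_dir / f"{image_path.stem}.txt").exists():
            missing.append(image_path.name)
    return missing
```

For a dataset rooted at `/content/dataset`, calling `find_unpaired_images('/content/dataset', 'train')` should return an empty list if every training image has a matching label file.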

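Example label file (in labels/train). Each line describes one object, with coordinates normalized to the range 0 to 1 relative to the image width and height: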
```
<class_id> <x_center> <y_center> <width> <height>
```
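
To make the label format concrete, here is a minimal sketch that turns one label line back into a pixel-coordinate box. It assumes the coordinates are normalized by image width and height (which is what the conversion code later in this notebook produces); `yolo_line_to_pixel_box` is a hypothetical helper name, and the 640x480 image size is made up for the example.

```python
def yolo_line_to_pixel_box(line, image_width, image_height):
    """Convert '<class_id> <x_center> <y_center> <width> <height>' to pixel corners."""
    class_id, x_center, y_center, width, height = line.split()
    x_center, y_center, width, height = (
        float(value) for value in (x_center, y_center, width, height)
    )
    # Undo the normalization and move from center/size to corner coordinates.
    x_min = (x_center - width / 2) * image_width
    y_min = (y_center - height / 2) * image_height
    x_max = (x_center + width / 2) * image_width
    y_max = (y_center + height / 2) * image_height
    return int(class_id), (x_min, y_min, x_max, y_max)


# A box centered in a 640x480 image, a quarter of its width and half its height.
print(yolo_line_to_pixel_box("0 0.500000 0.500000 0.250000 0.500000", 640, 480))
# → (0, (240.0, 120.0, 400.0, 360.0))
```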

Example data.yaml file:
```
path: /content/dataset  # Root folder
train: images/train
val: images/val

nc: 5  # number of classes
names: ['bottle', 'can', 'plastic bag', 'wrapper', 'paper']

```


---
⚠️‼️ ***THE SECTION TO CHANGE FOR YOUR SPECIFIC DATASET STARTS HERE*** ‼️⚠️

The sections above apply to all dataset conversions.

---

In [None]:
# ✏️ Enter your dataset-specific code here
# This cell is for importing your dataset to the notebook, and defining its name and path.

from pathlib import Path

dataset_name = "AquaTrash"
dataset_path = Path('/content/AquaTrash_dataset')
if not dataset_path.exists():
    print(f"Cloning AquaTrash dataset to {dataset_path}...")
    !git clone https://github.com/Harsh9524/AquaTrash.git /content/AquaTrash_dataset
    print(f"Dataset downloaded to {dataset_path}\n")
else:
    print(f"Dataset directory {dataset_path} already exists. Skipping clone.\n")

print(f"{dataset_name} dataset is available at {dataset_path}\n")

# AquaTrash dataset
✏️ Modify this section for your specific dataset.

The AquaTrash dataset uses an unusual format. It has an annotations.csv file that contains the labelling information, one row per bounding box, in the format

```
<image name> <x_min> <y_min> <x_max> <y_max> <class_name>
```

## Conversion
To convert the AquaTrash dataset to a format Ultralytics YOLO can use, we must:
* Split the images into train, val, and test sets
* Extract label and bbox info from annotations.csv, and convert it to

```
<class_id> <x_center> <y_center> <width> <height>
```
* Make a data.yaml file
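
The split step is easiest to reason about in isolation. A minimal sketch, assuming the same 80/10/10 proportions used in the cell below and an arbitrary fixed seed of 42 for reproducibility; `split_items` is a hypothetical helper, not part of the dataset or of Ultralytics.

```python
import random


def split_items(items, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle items reproducibly and split them into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, leaves global random state alone
    n_train = int(train_frac * len(shuffled))
    n_val = int(val_frac * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # whatever remains goes to test
    }


splits = split_items([f"img_{i}.jpg" for i in range(20)])
print({name: len(files) for name, files in splits.items()})
# → {'train': 16, 'val': 2, 'test': 2}
```

Shuffling before slicing matters because a split that follows file order can put visually similar images (for example, consecutive frames) entirely in one set.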

In [None]:
import shutil
from pathlib import Path
import pandas as pd
from PIL import Image
dataset_name = "AquaTrash"
dataset_path = Path('/content/AquaTrash_dataset')
source_images_path = dataset_path / 'Images'
annotations_path = dataset_path / 'annotations.csv'
n_total = sum(1 for _ in source_images_path.glob('*.jpg'))  # a quick way to find out how many files are in the Images folder
# Set up output folder system
output_root = dataset_path.parent / f"{dataset_name}_yolo_{n_total}"
yolo_img_dirs = {
    'train': output_root / 'images' / 'train',
    'val': output_root / 'images' / 'val',
    'test': output_root / 'images' / 'test',
}
yolo_lbl_dirs = {
    'train': output_root / 'labels' / 'train',
    'val': output_root / 'labels' / 'val',
    'test': output_root / 'labels' / 'test',
}
# Clear and recreate folders if the script is run a 2nd time
for d in list(yolo_img_dirs.values()) + list(yolo_lbl_dirs.values()):
    if d.exists():
        shutil.rmtree(d)
    d.mkdir(parents=True, exist_ok=True)
# Load annotations
df = pd.read_csv(annotations_path)
default_class_id = 0              # We decided to only use one class, 'trash', so all labels will have a class ID of 0
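# Group annotations by image, since one image often has multiple labels in the csv.
# Convert to a list so it can be shuffled.
grouped = list(df.groupby('image_name'))
import random
random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(grouped)  # shuffle so the train/val/test split does not follow file-name order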
# Compute dataset split
n_train = int(0.8 * len(grouped)) # 80% of the images to training
n_val = int(0.1 * len(grouped))   # 10% to val
n_test = len(grouped) - n_train - n_val  # the rest to test
splits = ['train'] * n_train + ['val'] * n_val + ['test'] * n_test
# Process each image
for i, ((file_path, group), split) in enumerate(zip(grouped, splits)):
    src_img_path = source_images_path / Path(file_path).name
    if not src_img_path.exists():
        print(f"Warning: {src_img_path} not found, skipping.")
        continue
    # Open image to get dimensions
    with Image.open(src_img_path) as img:
        width, height = img.size
    # Let's rename the files while we're at it
    base_name = f"{dataset_name}_{i:06}"
    new_img_name = base_name + ".jpg"
    new_lbl_name = base_name + ".txt"
    # Copy image to destination YOLO folder (train, val, or test)
    shutil.copy(src_img_path, yolo_img_dirs[split] / new_img_name)
    # Convert label data to YOLO format, and write it to the corresponding label file
    label_lines = []
    for _, row in group.iterrows():
        x_min, y_min, x_max, y_max = row[['x_min', 'y_min', 'x_max', 'y_max']]
        x_center = ((x_min + x_max) / 2) / width
        y_center = ((y_min + y_max) / 2) / height
        box_width = (x_max - x_min) / width
        box_height = (y_max - y_min) / height
        label_lines.append(f"{default_class_id} {x_center:.6f} {y_center:.6f} {box_width:.6f} {box_height:.6f}")
    with open(yolo_lbl_dirs[split] / new_lbl_name, 'w') as f:
        f.write('\n'.join(label_lines))
# generate data.yaml
data_yaml_path = output_root / 'data.yaml'
with open(data_yaml_path, 'w') as f:
    f.write("names: ['trash']\n")
    f.write("nc: 1\n")
    f.write(f"path: {output_root}\n")
    f.write("train: images/train\n")
    f.write("val: images/val\n")
    f.write("test: images/test\n")
print(f"Converted {len(grouped)} images to YOLO format with 80/10/10 train/val/test split at: {output_root}")


---
⚠️‼️ ***THE SECTION TO CHANGE FOR YOUR SPECIFIC DATASET STOPS HERE*** ‼️⚠️

The sections below apply to all dataset conversions.

---

To verify that your conversion worked, make sure you can train a model and that it outputs images with a bounding box and label.
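
Before spending time on a training run, a cheap pre-check can catch malformed label files. The sketch below is not part of the original workflow: `find_bad_label_lines` is a hypothetical helper, and the rules it checks (five fields per line, class ID within range, coordinates in [0, 1]) are assumptions based on the label format described earlier in this notebook.

```python
from pathlib import Path


def find_bad_label_lines(label_dir, num_classes=1):
    """Return (file name, line number, problem) for each suspicious label line."""
    problems = []
    for label_path in sorted(Path(label_dir).glob("*.txt")):
        lines = label_path.read_text().splitlines()
        for line_no, line in enumerate(lines, start=1):
            if not line.strip():
                continue  # ignore blank lines
            fields = line.split()
            if len(fields) != 5:
                problems.append((label_path.name, line_no, "expected 5 fields"))
                continue
            try:
                class_id = int(fields[0])
                coords = [float(value) for value in fields[1:]]
            except ValueError:
                problems.append((label_path.name, line_no, "non-numeric value"))
                continue
            if not 0 <= class_id < num_classes:
                problems.append((label_path.name, line_no, "class id out of range"))
            if any(not 0.0 <= coord <= 1.0 for coord in coords):
                problems.append((label_path.name, line_no, "coordinate outside [0, 1]"))
    return problems
```

With the variables defined in the conversion cell, `find_bad_label_lines(yolo_lbl_dirs['train'])` should return an empty list for a clean conversion.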

In [None]:
# ⚠️ DO NOT MODIFY THIS CELL
# This cell imports the ultralytics library required for training a model

!pip install -U ultralytics

from ultralytics import YOLO
import os

In [None]:
# ⚠️ DO NOT MODIFY THIS CELL
# This cell trains a YOLO model on the converted YOLO dataset to see if it's set up correctly
# Tip: inspect the output of this cell to assess whether training occurred properly.

model = YOLO('yolo11n.pt')
# Use the output_root directly as the data path
results = model.train(data=str(output_root / 'data.yaml'), epochs=20, imgsz=640)  # the number of epochs is small: this is just to see if it can work!

If the model outputs even one image with a bounding box and label, then the dataset should work for our project! Verify this using the code below.

In [None]:
# ⚠️ DO NOT MODIFY THIS CELL
# This cell passes the trained model some images, to see if the model can identify some trash

import cv2
from random import sample
import matplotlib.pyplot as plt
from pathlib import Path

# Get the latest model
runs_detect_dir = Path('runs/detect')
train_dirs = [d for d in runs_detect_dir.iterdir() if d.is_dir() and d.name.startswith("train")]
train_dirs.sort(key=lambda d: d.stat().st_mtime, reverse=True)  # sort by modification time
latest_train_dir = train_dirs[0]
best_model_path = latest_train_dir / 'weights' / 'best.pt'
print(f"Loading {best_model_path}")

# Load the model and try it out
model = YOLO(best_model_path)
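train_images_path = output_root / "images" / "train"
image_files = list(train_images_path.glob('*.jpg'))

sample_images = sample(image_files, min(10, len(image_files)))  # avoids an error if there are fewer than 10 images

for image_path in sample_images:
    result = model(image_path)[0]
    # result.plot() returns a BGR array; convert to RGB so matplotlib shows the right colors
    annotated_image = cv2.cvtColor(result.plot(), cv2.COLOR_BGR2RGB)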

    plt.figure(figsize=(8, 6))
    plt.imshow(annotated_image)
    plt.title(f'Predictions: {image_path.name}')
    plt.axis('off')
    plt.show()

If the model successfully generated even one image with a bounding box and label, please run the following code block to zip the YOLO dataset, download the zipped file, and upload it to Google Drive, https://drive.google.com/drive/folders/1bUkIYQRXX08OKI5TuOSg-eqntSudGaFB.

In [None]:
# ⚠️ DO NOT MODIFY THIS CELL
# This cell zips your converted YOLO dataset with an informative name, so you can download it and upload it to Google Drive.

from datetime import datetime
from pathlib import Path
import os

# Generate date prefix
date_str = datetime.now().strftime('%Y%m%d')

# Define the folder to zip (the final YOLO dataset folder)
folder_to_zip = output_root

# Name the zip file after the dataset folder, prefixed with today's date
zip_name = f"{date_str}_{folder_to_zip.name}.zip"

# Define the full path for the output zip file
output_zip_path = Path('/content') / zip_name

print(f"Creating zip archive from {folder_to_zip}...")

# Archive the folder with zip. The -r flag zips recursively (includes subdirectories).
# The first argument is the output zip file; the second is the folder to zip.
# folder_to_zip is an absolute path, so this works from any working directory.
!zip -r {output_zip_path} {folder_to_zip}

print(f"Zip created at {output_zip_path}")