# YOLOv8 Training Notebook (Google Colab)
This notebook helps you upload the dataset assets to Google Colab (or mount Google Drive), prepare the YOLOv8 dataset layout, and train a YOLOv8 model using the GPU.

What this notebook does:
- Checks GPU availability and PyTorch/Ultralytics install
- Installs Ultralytics (YOLOv8) and required packages
- Lets you either mount Google Drive or upload dataset zip files directly via the file picker
- Unzips and arranges the dataset into the YOLOv8 expected structure (images/labels split)
- Writes a `data.yaml` describing the dataset for YOLOv8
- Launches training with the Ultralytics Python API

Notes:
- In Colab, select Runtime -> Change runtime type -> GPU.
- For large datasets, prefer mounting Google Drive to avoid repeated uploads and to preserve outputs.
- Edit the training cell parameters (epochs, batch, imgsz) to match your hardware.

In [None]:
# Check if running in Google Colab and GPU availability
import sys
in_colab = 'google.colab' in sys.modules
print('Running in Colab:', in_colab)

# Show GPU status (works in Colab)
try:
    import torch
    print('torch.__version__ =', torch.__version__)
    print('cuda available =', torch.cuda.is_available())
    if torch.cuda.is_available():
        print('cuda device count =', torch.cuda.device_count())
except Exception as e:
    print('Torch not available yet:', e)

# Try nvidia-smi (only works if driver present, e.g., Colab GPU runtime)
import os
if os.name == 'posix':
    try:
        !nvidia-smi
    except Exception:
        pass

In [None]:
# Install/upgrade Ultralytics (YOLOv8) and wandb (runs in Colab)
# This cell is safe to re-run; pip only upgrades when a newer version is available.
import sys
if 'google.colab' in sys.modules:
    print('Installing/Upgrading ultralytics...')
    !pip install -U ultralytics --quiet
    !pip install -q wandb
else:
    print('Not in Colab; please ensure you have ultralytics installed locally (pip install -U ultralytics).')

# Show installed ultralytics version
try:
    import ultralytics
    print('ultralytics version =', ultralytics.__version__)
except Exception as e:
    print('Ultralytics not available yet:', e)

## Upload dataset assets or mount Google Drive
Choose one of the two options below:
1) Mount Google Drive (recommended for large datasets / to persist outputs)
2) Use the file uploader to upload the dataset zip files directly to the Colab VM.

You should have these files (from this repository's `datasets/` folder):
- train_images.zip
- train_labels.zip
- val_images.zip
- val_labels.zip
- test_images.zip (optional)
- test_labels.zip (optional)

The next two cells provide both flows. Run the one you want.
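
Whichever flow you pick, it can help to confirm that the expected zips actually arrived before moving on. The sketch below is one possible check, assuming the zips end up in `/content/datasets`; the helper name `missing_zips` and the two constant lists are illustrative and not part of the notebook's cells.

```python
import os

# Zip names listed above; the optional test zips are tracked separately.
REQUIRED_ZIPS = ['train_images.zip', 'train_labels.zip', 'val_images.zip', 'val_labels.zip']
OPTIONAL_ZIPS = ['test_images.zip', 'test_labels.zip']

def missing_zips(dataset_dir, names=REQUIRED_ZIPS):
    """Return the entries of `names` that are not present in dataset_dir (hypothetical helper)."""
    present = set(os.listdir(dataset_dir)) if os.path.isdir(dataset_dir) else set()
    return [n for n in names if n not in present]

# Assumed location; change it if your zips live elsewhere (e.g. on Drive).
print('Missing required zips:', missing_zips('/content/datasets'))
print('Missing optional zips:', missing_zips('/content/datasets', OPTIONAL_ZIPS))
```

An empty list for the required zips means the preparation step has everything it expects.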

In [None]:
# Option A: Mount Google Drive (uncomment and run if you want persistence)
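# Mounting exposes your Drive under /content/drive. Create MyDrive/microplastic_dataset in Drive and upload the zip files there via the Drive web interface.
# Note: later cells read the zips from /content/datasets, so copy them there or change `base` in those cells.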
# from google.colab import drive
# drive.mount('/content/drive')
# dataset_dir = '/content/drive/MyDrive/microplastic_dataset'
# print('Dataset dir (Drive):', dataset_dir)

print('Option A: mount Google Drive is available but commented out. If you prefer Drive, uncomment the lines and run this cell.')

In [None]:
# Option B: Upload dataset zip files directly to the Colab VM using the upload widget
# This is convenient for small/medium sized zips. Uploaded files are written to /content/datasets/
import os
os.makedirs('/content/datasets', exist_ok=True)
try:
    from google.colab import files
    uploaded = files.upload()
    for fname, data in uploaded.items():
        out_path = os.path.join('/content/datasets', fname)
        with open(out_path, 'wb') as f:
            f.write(data)
        print('Saved', out_path)
except Exception as e:
    print('files.upload is only available in Colab. If you are running locally, copy the dataset zips into /content/datasets or mount Drive. Error:', e)

## Prepare dataset layout for YOLOv8
The code below expects the image zips and label zips to be present in `/content/datasets`. It will: 
- extract the zips into `/content/datasets/raw`
- place images into `/content/datasets/images/{train,val,test}` and labels into `/content/datasets/labels/{train,val,test}`
- write a `data.yaml` at `/content/datasets/data.yaml` suitable for YOLOv8 training.

If your zips already contain folder structures, the script will try to locate images (.jpg/.png) and label files (.txt) and move them appropriately.

In [None]:
# Unzip and organize dataset files into YOLOv8 expected layout
import os, zipfile, shutil, glob
base = '/content/datasets'
raw = os.path.join(base, 'raw')
os.makedirs(raw, exist_ok=True)
# Extract any zip files found in /content/datasets
for z in glob.glob(os.path.join(base, '*.zip')):
    print('Extracting', z)
    try:
        with zipfile.ZipFile(z, 'r') as zip_ref:
            zip_ref.extractall(raw)
    except Exception as e:
        print('Failed to extract', z, e)

# Create expected folders
images_dir = os.path.join(base, 'images')
labels_dir = os.path.join(base, 'labels')
for subset in ['train','val','test']:
    os.makedirs(os.path.join(images_dir, subset), exist_ok=True)
    os.makedirs(os.path.join(labels_dir, subset), exist_ok=True)

# Helper to move files by extension from raw to target folder
def move_files(patterns, target_dir):
    moved = 0
    for pat in patterns:
        for f in glob.glob(os.path.join(raw, '**', pat), recursive=True):
            try:
                shutil.move(f, target_dir)
                moved += 1
            except Exception as e:
                print('move failed', f, e)
    return moved

# Try to detect and move images and labels using heuristic names
image_patterns = ['*.jpg','*.jpeg','*.png','*.bmp']
label_patterns = ['*.txt']

# Heuristics: if raw contains folders named train, val, test, move accordingly,
# else distribute by file names if they contain train/val/test

for subset in ['train','val','test']:
    # move images whose path contains subset
    moved_imgs = 0
    moved_lbls = 0
    for pat in image_patterns:
        for f in glob.glob(os.path.join(raw, '**', '*'+subset+'*', pat), recursive=True):
            try:
                shutil.move(f, os.path.join(images_dir, subset))
                moved_imgs += 1
            except Exception as e:
                pass
    for pat in label_patterns:
        for f in glob.glob(os.path.join(raw, '**', '*'+subset+'*', pat), recursive=True):
            try:
                shutil.move(f, os.path.join(labels_dir, subset))
                moved_lbls += 1
            except Exception as e:
                pass
    print(f'{subset}: moved images={moved_imgs}, labels={moved_lbls}')

# If nothing moved by heuristic, fall back to placing each zip's files by the zip filename
if sum(len(os.listdir(os.path.join(images_dir, s))) for s in ['train','val','test']) == 0:
    print('Heuristic failed to detect subsets; falling back to zip-name based placement')
    # If zip names contained 'train_images' etc., use them
    name_map = {'train_images':'train','val_images':'val','test_images':'test', 'train_labels':'train','val_labels':'val','test_labels':'test'}
    # Reset the counters; they still hold the values from the last subset above
    moved_imgs = 0
    moved_lbls = 0
    for z in glob.glob(os.path.join(base, '*.zip')):
        zname = os.path.basename(z).lower()
        target = None
        for key,v in name_map.items():
            if key in zname:
                target = v
                break
        if target is None:
            continue
        # Extract this zip to its own temp dir and move only the files found there,
        # so files from other zips already sitting in raw do not land in the wrong subset
        tmp_dir = os.path.join(raw, 'tmp_'+os.path.splitext(os.path.basename(z))[0])
        with zipfile.ZipFile(z, 'r') as zip_ref:
            zip_ref.extractall(tmp_dir)
        for f in glob.glob(os.path.join(tmp_dir, '**', '*.*'), recursive=True):
            ext = os.path.splitext(f)[1].lower()
            try:
                if ext in ('.jpg', '.jpeg', '.png', '.bmp'):
                    shutil.move(f, os.path.join(images_dir, target))
                    moved_imgs += 1
                elif ext == '.txt':
                    shutil.move(f, os.path.join(labels_dir, target))
                    moved_lbls += 1
            except Exception as e:
                print('move failed', f, e)
    print('Fallback moved imgs, labels:', moved_imgs, moved_lbls)

# Final counts
for subset in ['train','val','test']:
    imgs = len(list(glob.glob(os.path.join(images_dir, subset, '*'))))
    lbls = len(list(glob.glob(os.path.join(labels_dir, subset, '*.txt'))))
    print(f'Subset {subset}: images={imgs}, labels={lbls}')
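
Matching counts do not guarantee that every image has a label with the same file stem. Once the preparation cell has run, a check along the following lines can list images that lack a `.txt` partner. This is only a sketch: `unpaired_images` and `IMAGE_EXTS` are illustrative names, and the paths assume the `/content/datasets` layout described above.

```python
import os

IMAGE_EXTS = {'.jpg', '.jpeg', '.png', '.bmp'}

def unpaired_images(images_dir, labels_dir):
    """Return image file names in images_dir with no same-stem .txt file in labels_dir."""
    if not os.path.isdir(images_dir):
        return []
    label_stems = set()
    if os.path.isdir(labels_dir):
        label_stems = {os.path.splitext(f)[0] for f in os.listdir(labels_dir) if f.endswith('.txt')}
    return sorted(
        f for f in os.listdir(images_dir)
        if os.path.splitext(f)[1].lower() in IMAGE_EXTS
        and os.path.splitext(f)[0] not in label_stems
    )

# Assumed layout: /content/datasets/images/<subset> and /content/datasets/labels/<subset>
for subset in ['train', 'val', 'test']:
    missing = unpaired_images(f'/content/datasets/images/{subset}',
                              f'/content/datasets/labels/{subset}')
    print(f'{subset}: {len(missing)} image(s) without a label file')
```

An image without a label file is not necessarily an error, since it may be an intentional background example, but a large number of them usually points to a zip that was placed in the wrong subset.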

In [None]:
# Write a YOLOv8-compatible data.yaml into /content/datasets/data.yaml
import yaml, os
base = '/content/datasets'
data_yaml = {
    'path': base,
    'train': 'images/train',
    'val': 'images/val',
    'test': 'images/test',
    'names': ['Microplastic']
}
with open(os.path.join(base, 'data.yaml'), 'w') as f:
    yaml.dump(data_yaml, f, default_flow_style=False)
print('Wrote', os.path.join(base, 'data.yaml'))
print('Contents:')
with open(os.path.join(base, 'data.yaml')) as f:
    print(f.read())

## Train YOLOv8
The cell below launches training using the Ultralytics YOLOv8 Python API. Adjust `epochs`, `batch`, `imgsz` and `model_size` as needed.
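- `model_size` defaults to `yolov8s.pt` (small), which trains faster than the medium and large variants.
- `device=0` uses the first CUDA GPU. Leave `device` unset to let Ultralytics pick a GPU when one is available and fall back to CPU otherwise.
- To persist results, point `project` at a Drive folder if you mounted Drive; `name` is the run subfolder created inside it.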
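
One way to avoid editing `project` by hand is to choose it based on whether Drive is mounted. The sketch below assumes the usual Colab mount point `/content/drive/MyDrive`; `pick_project_dir` is a hypothetical helper, and the folder name `microplastic_yolov8_runs` is borrowed from the save cell at the end of this notebook.

```python
import os

def pick_project_dir(drive_root='/content/drive/MyDrive',
                     local_dir='/content/runs/YOLOv8',
                     drive_subdir='microplastic_yolov8_runs'):
    """Return a Drive-backed project folder if Drive is mounted, else the local default."""
    if os.path.isdir(drive_root):
        return os.path.join(drive_root, drive_subdir)
    return local_dir

# The result could be assigned to `project` in the training cell.
project = pick_project_dir()
print('Runs will be written to', project)
```

Writing directly to Drive can be slower than writing to the VM disk, so copying the run folder after training remains a reasonable alternative.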

In [None]:
# Training cell - edit hyperparameters before running
from ultralytics import YOLO
import os
data_yaml = '/content/datasets/data.yaml'
# Choose model_size: 'yolov8n.pt','yolov8s.pt','yolov8m.pt','yolov8l.pt' etc.
model_size = 'yolov8s.pt'
epochs = 100
imgsz = 640
batch = 16
device = 0  # GPU index (0) - change to 'cpu' to run on CPU
project = '/content/runs/YOLOv8'  # change to Drive path if mounted for persistence
name = 'microplastic_exp'

print('Starting training with', model_size, 'epochs=', epochs, 'batch=', batch)
# Create model (loads pretrained weights)
model = YOLO(model_size)
# Train
model.train(data=data_yaml, epochs=epochs, imgsz=imgsz, batch=batch, device=device, project=project, name=name)

# model.train() blocks until training finishes; the best weights are saved under project/name/weights/best.pt
print('Training finished. Check the run folder for checkpoints and results.')

## Quick inference example (after training)
After training completes, you can run a quick inference using the trained weights. Update the `weights_path` below to the best checkpoint saved by the training run.

In [None]:
# Quick inference example - update weights_path if needed
import os, glob  # needed below; do not rely on imports from earlier cells
from ultralytics import YOLO
weights_path = '/content/runs/YOLOv8/microplastic_exp/weights/best.pt'  # default training save location
if os.path.exists(weights_path):
    y = YOLO(weights_path)
    # run inference on a sample image if present
    sample_images = glob.glob('/content/datasets/images/val/*')
    if len(sample_images) > 0:
        print('Running inference on', sample_images[0])
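        results = y.predict(sample_images[0], save=True)
        # Ultralytics picks the output folder itself (typically runs/detect/predict*), so report the actual one
        print('Saved predictions to', results[0].save_dir)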
    else:
        print('No validation images found for inference')
else:
    print('Weights not found at', weights_path)

## Notes & tips
- For long trainings or large datasets, mount Google Drive and set `project` to a Drive folder so results are saved persistently.
- If you encounter out-of-memory errors, reduce `batch` or `imgsz`, or use a smaller model (yolov8n/yolov8s).
- Use `epochs=10` for a quick smoke test before running longer experiments.
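- To resume an interrupted run, load its last checkpoint with `model = YOLO(checkpoint_path)` (typically `weights/last.pt` in the run folder) and call `model.train(resume=True)`. Calling `model.train(...)` without `resume=True` starts a new run initialized from those weights instead.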
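
If several runs have accumulated, locating the checkpoint to resume from can be scripted. The following is a minimal sketch that assumes the `project/<run name>/weights/last.pt` layout used by the training cell; `latest_checkpoint` is an illustrative helper, and the Ultralytics call is left commented out so the snippet runs without the library.

```python
import glob
import os

def latest_checkpoint(project='/content/runs/YOLOv8'):
    """Return the most recently modified <run>/weights/last.pt under project, or None."""
    candidates = glob.glob(os.path.join(project, '*', 'weights', 'last.pt'))
    return max(candidates, key=os.path.getmtime) if candidates else None

ckpt = latest_checkpoint()
if ckpt is None:
    print('No last.pt found; nothing to resume.')
else:
    print('Resume candidate:', ckpt)
    # Uncomment to resume (requires ultralytics):
    # from ultralytics import YOLO
    # YOLO(ckpt).train(resume=True)
```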

In [None]:
# Save trained weights and run folder to Google Drive
# Run this cell AFTER training (you can queue it and leave Colab running). It will mount Drive (if not already mounted) and copy weights and the whole run folder.
import os, shutil, glob, time
# Attempt to mount Google Drive if running in Colab
try:
    from google.colab import drive
    print('Mounting Google Drive...')
    drive.mount('/content/drive', force_remount=False)
    drive_mounted = True
except Exception as e:
    print('Google Drive mount not available or already handled:', e)
    drive_mounted = os.path.exists('/content/drive')

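# Destination folder in Drive where artifacts will be copied
dst_base = '/content/drive/MyDrive/microplastic_yolov8_runs'
if not drive_mounted:
    # Without a mounted Drive this path is just a folder on the VM and will not persist
    print('Warning: Google Drive is not mounted; copies will stay on the local VM.')
os.makedirs(dst_base, exist_ok=True)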
# Default run and weights paths (match training cell defaults)
run_folder = '/content/runs/YOLOv8/microplastic_exp'
weights_dir = os.path.join(run_folder, 'weights')
best_pt = os.path.join(weights_dir, 'best.pt')
# Helper to copy with retries (handles transient Drive issues)
def copy_with_retries(src, dst, retries=3, wait=3):
    for i in range(retries):
        try:
            if os.path.isdir(src):
                if os.path.exists(dst):
                    shutil.rmtree(dst)
                shutil.copytree(src, dst)
            else:
                shutil.copy2(src, dst)
            return True
        except Exception as e:
            print(f'Copy attempt {i+1} failed:', e)
            time.sleep(wait)
    return False

# Copy best.pt if it exists
if os.path.exists(best_pt):
    print('Found best.pt at', best_pt)
    dst_weights = os.path.join(dst_base, 'weights')
    os.makedirs(dst_weights, exist_ok=True)
    out = copy_with_retries(best_pt, os.path.join(dst_weights, 'best.pt'))
    if out:
        print('Copied best.pt to', dst_weights)
    else:
        print('Failed to copy best.pt')
else:
    # Try to find any .pt files in weights folder
    pts = glob.glob(os.path.join(weights_dir, '*.pt'))
    if len(pts) > 0:
        dst_weights = os.path.join(dst_base, 'weights')
        os.makedirs(dst_weights, exist_ok=True)
        for p in pts:
            print('Copying', p)
            copy_with_retries(p, os.path.join(dst_weights, os.path.basename(p)))
        print('Copied', len(pts), 'pt files to', dst_weights)
    else:
        print('No .pt files found at', weights_dir)

# Copy entire run folder (checkpoints, logs, plots) to Drive for safety
if os.path.exists(run_folder):
    dst_run = os.path.join(dst_base, os.path.basename(run_folder))
    print('Copying run folder', run_folder, '->', dst_run)
    ok = copy_with_retries(run_folder, dst_run)
    if ok:
        print('Run folder copied successfully to', dst_run)
    else:
        print('Failed to copy run folder')
else:
    print('Run folder not found:', run_folder)

print('Done. Check your Google Drive at', dst_base)