# Train YOLOv5 (and v8 soon!) on Weed-AI Datasets

This guide will take you through training a state-of-the-art object detection architecture - YOLOv5 - on Weed-AI datasets. It combines elements of the official Ultralytics guide, with elements of other custom training and conversion guides.

**Steps:**
1. Set up the project: creating folders, cloning YOLOv5
2. Download the Weed-AI dataset
3. Convert weedCOCO to YOLO annotation format
4. Create YOLOv5 supporting files
5. Train YOLOv5
6. Inference on pictures/videos

The tutorial requires access to a Google Drive account and the ability to upload images/data to specific folders. Models train fastest on a GPU: under 'Runtime' > 'Change Runtime Type', make sure a GPU is selected. Premium or High-RAM runtimes allow larger models to be trained, and trained faster.
Run each cell in the tutorial by pressing the 'Play' button on its left-hand side. Options that you may need to change are written in capital letters.



# Create Project Folder

To begin, create a project folder in your Google Drive. We'll call this one `weedai_yolo`. Replace this with whatever name you decide. 

It will be created in the root folder of your Google Drive. Into this folder we'll clone the [YOLOv5 GitHub Repository](https://github.com/ultralytics/yolov5) and save our data too. The official repository links to many guides on training YOLOv5; check those for tips/tricks on tuning your model.

In [1]:
# print(os.getcwd())
# %cd /home/zhou/Desktop/WeedX/working_dir/
# print(os.getcwd())

In [2]:
YOUR_DIRECTORY = 'weedai_yolo'

!mkdir {YOUR_DIRECTORY}
%ls './' # should list everything in your Google Drive - double check that your project folder is there.

 [0m[01;34mcombined[0m/    merge_coco.ipynb   [01;34mweedai_yolo[0m/         [01;34mWeedCOCO[0m/
 detect.txt   untitled.txt       weed_ai_yolo.ipynb  [01;34m'WeedCOCO (copy)'[0m/


**(first time only)**

Clone the YOLOv5 repository so we can use it to train our models. Only do this ONCE at the start of the project.

In [3]:
%cd {YOUR_DIRECTORY}
!git clone https://github.com/ultralytics/ultralytics # clone the Ultralytics repository (renamed to yolov8 further down). It is a large repository and may take some time depending on your internet speed.

/home/zhou/Desktop/WeedX/merge/weedai_yolo
Cloning into 'ultralytics'...
remote: Enumerating objects: 6450, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 6450 (delta 10), reused 9 (delta 5), pack-reused 6415
Receiving objects: 100% (6450/6450), 5.11 MiB | 15.52 MiB/s, done.
Resolving deltas: 100% (4377/4377), done.


# Downloading a Weed-AI dataset

For this example I've used the [Northern WA Wheatbelt Blue Lupins](https://weed-ai.sydney.edu.au/datasets/9df290f4-a29b-44b2-9de6-24bca1cee846) dataset, but any of the other object detection datasets would work too, including the recently uploaded [Amsinckia in chickpeas](https://weed-ai.sydney.edu.au/datasets/21675efe-9d25-4096-be76-3a541475efd4) dataset.

Download the dataset to a default place on your computer and unzip it. Rename it to something more memorable, in this case `blue_lupins`. Then, we'll create a folder called `datasets` in the cloned repository directory (the code below renames it to `yolov8`) and move the Weed-AI download (now called `blue_lupins`) to that folder.

To summarise, the steps we will follow below are:
1. Download the dataset on Weed-AI by clicking the button 'Download in WEEDCOCO format'
2. Unzip the download and rename it to something memorable, in this case I've called it `blue_lupins`
3. Create the `datasets` folder in the cloned repository directory (renamed to `yolov8`) using the code below
4. Move the Weed-AI download into the Google Drive `yolov8/datasets` folder. For me, this is now `'weedai_yolo/yolov8/datasets'`
5. Convert the data from WeedCOCO to [YOLOv5 format](https://roboflow.com/formats/yolov5-pytorch-txt)

Assuming you've downloaded the dataset, unzipped it and changed its name, I'll go through each of these other steps in more detail below.
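If you would rather unzip the download from Python than by hand, the step can be sketched roughly as below. This is only a minimal sketch: `unpack_dataset` is a hypothetical helper, and it assumes the archive holds `images/` and `weedcoco.json` at its top level, which you should check against your own download.

```python
import zipfile
from pathlib import Path


def unpack_dataset(zip_path, datasets_dir, name):
    """Extract a dataset zip into datasets_dir/name and return that path.

    Hypothetical helper: the archive layout is an assumption, not something
    guaranteed by Weed-AI.
    """
    target = Path(datasets_dir) / name
    if target.exists():
        raise FileExistsError(f"{target} already exists; remove or rename it first")
    target.mkdir(parents=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

Calling it with the path to the downloaded zip, your `datasets` folder and a memorable name such as `blue_lupins` does the unzip, rename and move in one go.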

In [4]:
YOUR_DATASET = 'combined' # this should match the memorable name of the Weed-AI download you just created.



Create the `datasets` folder, then copy the unzipped and renamed dataset folder into it.

In [5]:
import os
import shutil
os.rename("ultralytics", "yolov8")
!mkdir yolov8/datasets



In [6]:
shutil.copytree(f"../{YOUR_DATASET}", f"yolov8/datasets/{YOUR_DATASET}")

'yolov8/datasets/combined'

Once the dataset has downloaded and is in the datasets folder, it should have a similar structure to the following:
* yolov8/datasets
    * blue_lupins
        * images
        * weedcoco.json
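
Before converting anything, it can be worth confirming that the folder really has this layout. A small sketch of such a check follows; `check_dataset_layout` is a hypothetical helper, not part of any library.

```python
from pathlib import Path


def check_dataset_layout(dataset_dir):
    """Return a list of problems found in a downloaded dataset folder.

    An empty list means both the images folder and weedcoco.json were found.
    """
    dataset_dir = Path(dataset_dir)
    problems = []
    if not (dataset_dir / "images").is_dir():
        problems.append("missing 'images' folder")
    if not (dataset_dir / "weedcoco.json").is_file():
        problems.append("missing 'weedcoco.json'")
    return problems
```

Pass it the path to your dataset folder; if the returned list is not empty, fix the folder structure before moving on.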


# Convert weedCOCO to YOLO

The first step in the process is converting the downloaded weedCOCO dataset into the YOLO .txt format. The method below is adapted from the official [Ultralytics GitHub repository](https://github.com/ultralytics/JSON2YOLO/blob/master/labelbox_json2yolo.py). You don't need to understand the code in detail, though it is worth reading; just run the cell by pressing 'Play' on the left side.
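
To make the conversion less of a black box, here is the core arithmetic on its own. COCO stores a box as `[top-left x, top-left y, width, height]` in pixels, while a YOLO label line stores `class x_center y_center width height` with everything divided by the image size. The two function names are hypothetical and exist only for this illustration.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO pixel box to a normalised YOLO (xc, yc, w, h) tuple."""
    x, y, w, h = bbox
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)


def yolo_label_line(cls, bbox, img_w, img_h):
    """Format one label line as 'class xc yc w h'."""
    values = (cls, *coco_to_yolo(bbox, img_w, img_h))
    return " ".join("%g" % v for v in values)


# A 200x100 box with its top-left corner at (100, 50) in a 1000x500 image
print(coco_to_yolo([100, 50, 200, 100], 1000, 500))        # → (0.2, 0.2, 0.2, 0.2)
print(yolo_label_line(0, [100, 50, 200, 100], 1000, 500))  # → 0 0.2 0.2 0.2 0.2
```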

In [7]:
import os
from pathlib import Path

import yaml
import shutil
from tqdm import tqdm
import contextlib
import json

import pandas as pd
import numpy as np
from PIL import Image
from collections import defaultdict

def make_dirs(dir='new_dir/'):
    # Create folders
    dir = Path(dir)
    for p in dir, dir / 'labels', dir / 'images':
        p.mkdir(parents=True, exist_ok=True)  # make dir
    return dir


def convert_weedcoco_json(json_dir='', yaml_dir=''):
    save_dir = make_dirs(dir=f'{json_dir}')  # output directory
    print()

    # Import json
    for json_file in sorted(Path(json_dir).resolve().glob('*.json')):
        fn = Path(save_dir) # / 'labels' # folder name
        fn.mkdir(exist_ok=True)
        with open(json_file) as f:
            data = json.load(f)

        # Create image dict
        images = {'%g' % x['id']: x for x in data['images']}
        # Create image-annotations dict
        imgToAnns = defaultdict(list)
        for ann in data['annotations']:
            imgToAnns[ann['image_id']].append(ann)
 

        # Write labels file
        for img_id, anns in tqdm(imgToAnns.items(), desc=f'Annotations {json_file}'):
            # print(img_id, anns)
            img = images['%g' % img_id]
            h, w, f = img['height'], img['width'], img['file_name']

            bboxes = []
            segments = []
            for ann in anns:
                # The COCO box format is [top left x, top left y, width, height]
                box = np.array(ann['bbox'], dtype=np.float64)
                box[:2] += box[2:] / 2  # xy top-left corner to center
                box[[0, 2]] /= w  # normalize x
                box[[1, 3]] /= h  # normalize y
                if box[2] <= 0 or box[3] <= 0:  # skip boxes with zero or negative width or height
                    continue

                cls = ann['category_id']  # class
                box = [cls] + box.tolist()
                if box not in bboxes:
                    bboxes.append(box)

            # Write
            with open((fn / f.replace('images', 'labels')).with_suffix('.txt'), 'a') as file:
                for i in range(len(bboxes)):
                    line = *(bboxes[i]),  # cls, box or segments
                    file.write(('%g ' * len(line)).rstrip() % line + '\n')

    # Save dataset.yaml
    names = [data['categories'][i]['name'].split(': ')[1] for i in range(len(data['categories']))]
    d = {'path': yaml_dir,
         'train': 'images/train',
         'val': 'images/val',  # folders created by the train/val/test split further down
         'test': 'images/test',
         'nc': len(names),
         'names': names}  # dictionary

    with open(f"{save_dir}/weedcoco.yaml", 'w') as f:
        yaml.dump(d, f, sort_keys=False)


    print('\nweedCOCO to YOLO conversion completed successfully!')


In [8]:
WEED_COCO_LOCATION = f"yolov8/datasets/{YOUR_DATASET}"
YAML_LOCATION = os.path.join(os.getcwd(), f"yolov8/datasets/{YOUR_DATASET}")
print(YAML_LOCATION)
#convert the weedcoco file
convert_weedcoco_json(json_dir=WEED_COCO_LOCATION, yaml_dir=YAML_LOCATION)

/home/zhou/Desktop/WeedX/merge/weedai_yolo/yolov8/datasets/combined



Annotations /home/zhou/Desktop/WeedX/merge/weedai_yolo/yolov8/datasets/combined/


weedCOCO to YOLO conversion completed successfully!





## Splitting the dataset into train/validation/test
An algorithm needs a training portion to learn from and a validation portion to check its progress as it learns. The test portion is left entirely unseen and can be used afterwards for an unbiased estimate of performance and to make sure the algorithm hasn't overfit.

If you find the algorithm performs well on the training data but terribly on the val/test data, then it is likely overfitting. This is more common on small datasets and larger models when trained for many epochs.
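
The split only makes sense if each image travels together with its own label file. As an illustration of that idea, here is a standard-library sketch that pairs files by name before splitting. It is not the method used in the cell below (which relies on scikit-learn and on the two sorted lists lining up); `split_pairs` and its parameters are hypothetical.

```python
import random
from pathlib import Path


def split_pairs(image_paths, label_paths, val_frac=0.1, test_frac=0.1, seed=1):
    """Pair images with labels by file stem, then split into train/val/test."""
    labels_by_stem = {Path(p).stem: p for p in label_paths}
    # Images without a label file are left out of every split
    pairs = [(img, labels_by_stem[Path(img).stem])
             for img in sorted(image_paths)
             if Path(img).stem in labels_by_stem]
    random.Random(seed).shuffle(pairs)
    n_val = round(len(pairs) * val_frac)
    n_test = round(len(pairs) * test_frac)
    return {
        "val": pairs[:n_val],
        "test": pairs[n_val:n_val + n_test],
        "train": pairs[n_val + n_test:],
    }
```

Pairing by stem means an image with no annotations cannot silently shift every later image onto the wrong label.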

In [9]:
from sklearn.model_selection import train_test_split

# Read images and annotations
images = [os.path.join(f'{WEED_COCO_LOCATION}/images', x) for x in os.listdir(f'{WEED_COCO_LOCATION}/images')]
annotations = [os.path.join(f'{WEED_COCO_LOCATION}/labels', x) for x in os.listdir(f'{WEED_COCO_LOCATION}/labels') if x[-3:] == "txt"]

images.sort()
annotations.sort()

# Split the dataset into train-val-test splits 80-10-10%
train_images, val_images, train_annotations, val_annotations = train_test_split(images, annotations, test_size = 0.2, random_state = 1)
val_images, test_images, val_annotations, test_annotations = train_test_split(val_images, val_annotations, test_size = 0.5, random_state = 1)

%cd {WEED_COCO_LOCATION}
!mkdir images/train images/val images/test labels/train labels/val labels/test


/home/zhou/Desktop/WeedX/merge/weedai_yolo/yolov8/datasets/combined


In [10]:
%cd "../../.."

/home/zhou/Desktop/WeedX/merge/weedai_yolo


In [11]:
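# Utility function to move files
def move_files_to_folder(list_of_files, destination_folder):
    for f in list_of_files:
        try:
            shutil.move(f, destination_folder)
        except OSError:
            # Report which file failed, then stop with the original error
            print(f"Could not move {f} to {destination_folder}")
            raise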

# Move the splits into their folders
move_files_to_folder(train_images, f'{WEED_COCO_LOCATION}/images/train/')
move_files_to_folder(val_images, f'{WEED_COCO_LOCATION}/images/val/')
move_files_to_folder(test_images, f'{WEED_COCO_LOCATION}/images/test/')
move_files_to_folder(train_annotations, f'{WEED_COCO_LOCATION}/labels/train/')
move_files_to_folder(val_annotations, f'{WEED_COCO_LOCATION}/labels/val/')
move_files_to_folder(test_annotations, f'{WEED_COCO_LOCATION}/labels/test/')

In [12]:
# Check the images have been moved
%cd {WEED_COCO_LOCATION}
print(len(os.listdir('images/train')), len(os.listdir('labels/train')))
print(len(os.listdir('images/val')), len(os.listdir('labels/val')))
print(len(os.listdir('images/test')), len(os.listdir('labels/test')))

/home/zhou/Desktop/WeedX/merge/weedai_yolo/yolov8/datasets/combined
2224 2224
278 278
278 278
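
Matching counts are a good sign, but they don't prove that every image has its own label. A slightly stricter check, sketched here with a hypothetical `unmatched_stems` helper, compares file names instead of counts.

```python
from pathlib import Path


def unmatched_stems(images_dir, labels_dir):
    """Return (images without a label, labels without an image) as sorted stems."""
    image_stems = {p.stem for p in Path(images_dir).iterdir() if p.is_file()}
    label_stems = {p.stem for p in Path(labels_dir).glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

Run it once per split, for example on `images/train` and `labels/train`; two empty lists mean the split is consistent.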


# Preparing for training
Now that all the splits are made, we need to import some packages and install the remaining requirements before we can start training a model.

In [13]:
# import necessary packages
import torch
from IPython.display import Image as IPyImage  # for displaying images (aliased so the PIL Image import below does not shadow it)
import os 
import random
import shutil
from sklearn.model_selection import train_test_split
import xml.etree.ElementTree as ET
from xml.dom import minidom
from tqdm import tqdm
from PIL import Image, ImageDraw
import numpy as np
import matplotlib.pyplot as plt

random.seed(0)

print('torch %s %s' % (torch.__version__, torch.cuda.get_device_properties(0) if torch.cuda.is_available() else 'CPU'))

torch 2.0.0+cu118 _CudaDeviceProperties(name='NVIDIA GeForce RTX 4090', major=8, minor=9, total_memory=24183MB, multi_processor_count=128)


In [14]:
%cd "../.."
!pip install -r requirements.txt

/home/zhou/Desktop/WeedX/merge/weedai_yolo/yolov8
Defaulting to user installation because normal site-packages is not writeable


In [15]:
# Weights & Biases  (optional) - this will let you track and visualise the training process with a WandB account; however, it isn't necessary 
# %pip install -q wandb
# import wandb
# wandb.login()

# YOLOv5 Training

Now we get to train a model! Change the name of your run to whatever you like, and try adjusting the image size, batch size, number of epochs and model variant. Larger variants and larger images will usually do better, but require more GPU memory. If you run out of memory, reduce the batch size, the image size or the variant size (for example `m` instead of `x`) and try again.

Information on selecting batch size: https://twitter.com/rasbt/status/1617544195220312066
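
Since a typo in one of the capitalised options only shows up once training has started, it can help to check them first. The sketch below uses the allowed values listed in the training cell's comments; the helper names are hypothetical, and the image sizes are the suggested ones from this guide, not a hard limit of the library.

```python
VALID_MODELS = ("n", "s", "m", "l", "x")
SUGGESTED_IMAGE_SIZES = (320, 640, 1280, 1920)


def check_training_options(batch, epochs, image_size, model):
    """Return a list of problems with the options; an empty list means OK."""
    problems = []
    if model not in VALID_MODELS:
        problems.append(f"MODEL must be one of {VALID_MODELS} in lower case, got {model!r}")
    if image_size not in SUGGESTED_IMAGE_SIZES:
        problems.append(f"IMAGE_SIZE {image_size} is not one of {SUGGESTED_IMAGE_SIZES}")
    if batch < 1:
        problems.append("BATCH must be at least 1")
    if epochs < 1:
        problems.append("EPOCHS must be at least 1")
    return problems


def make_run_name(dataset, batch, epochs, image_size, model):
    """Build the run name in the same pattern as the training cell."""
    return f"{dataset}_TRAIN_B{batch}_E{epochs}_SZ{image_size}_M{model}"


print(check_training_options(16, 30, 640, "m"))      # → []
print(make_run_name("combined", 16, 30, 640, "m"))  # → combined_TRAIN_B16_E30_SZ640_Mm
```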


In [None]:
# train a YOLOv8 model (the variant is chosen by MODEL below)
BATCH = 16
EPOCHS = 30
IMAGE_SIZE = 640 # (should be one of 320, 640, 1280, 1920)
MODEL = 'm' # (should be one of 'n', 's', 'm', 'l', 'x' and must be in lower case)

# this is the name of your run, and how it will be saved
RUN_NAME = f'{YOUR_DATASET}_TRAIN_B{str(BATCH)}_E{str(EPOCHS)}_SZ{str(IMAGE_SIZE)}_M{MODEL}'

# avoid making any changes to the below, or check the Ultralytics docs for other commands
# !python3 train.py --img {IMAGE_SIZE} --cfg yolov8{MODEL}.yaml --batch {BATCH} --epochs {EPOCHS} --data datasets/{YOUR_DATASET}/weedcoco.yaml --weights yolov8{MODEL}.pt --name {RUN_NAME}
!yolo task=detect mode=train model=yolov8{MODEL}.pt imgsz={IMAGE_SIZE} data="datasets/{YOUR_DATASET}/weedcoco.yaml" epochs={EPOCHS} batch={BATCH} name={RUN_NAME}


Downloading https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8m.pt to yolov8m.pt...
100%|██████████████████████████████████████| 49.7M/49.7M [00:01<00:00, 26.4MB/s]
New https://pypi.org/project/ultralytics/8.0.73 available 😃 Update with 'pip install -U ultralytics'
Ultralytics YOLOv8.0.58 🚀 Python-3.10.6 torch-2.0.0+cu118 CUDA:0 (NVIDIA GeForce RTX 4090, 24184MiB)
[34m[1myolo/engine/trainer: [0mtask=detect, mode=train, model=yolov8m.pt, data=datasets/combined/weedcoco.yaml, epochs=30, patience=50, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=None, workers=16, project=None, name=combined_TRAIN_B16_E30_SZ640_Mm, exist_ok=False, pretrained=False, optimizer=SGD, verbose=True, seed=0, deterministic=True, single_cls=False, image_weights=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half

[34m[1mtrain: [0mNew cache created: /home/zhou/Desktop/WeedX/merge/weedai_yolo/yolov8/datasets/combined/labels/train.cache
[34m[1mval: [0mScanning /home/zhou/Desktop/WeedX/merge/weedai_yolo/yolov8/datasets/combine[0m


Plotting labels to runs/detect/combined_TRAIN_B16_E30_SZ640_Mm/labels.jpg... 
Image sizes 640 train, 640 val
Using 16 dataloader workers
Logging results to runs/detect/combined_TRAIN_B16_E30_SZ640_Mm
Starting training for 30 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       1/30      13.5G      1.485      2.686      1.341        135        640: 1
                 Class     Images  Instances      Box(P          R      mAP50  m

# Detect
This is where you can run the model you've just trained on a sample video or other dataset to see how it goes. The `source` argument below accepts videos, folders of images and single images. All you need to do is upload these to the `datasets` directory and then specify the name/path below.

In [None]:
DETECTION_FILES = '' # e.g. 'test_video.mp4' OR test_image_directory OR test_image.jpg
CONFIDENCE_THRESHOLD = 0.50 # this should be between 0 and 1. It changes the cutoff value for a detection. Lower = more sensitive, higher = less sensitive

# the training run above logged its results to runs/detect/{RUN_NAME}, so the best weights are loaded from there
!yolo task=detect mode=predict model=runs/detect/{RUN_NAME}/weights/best.pt source="datasets/{DETECTION_FILES}" imgsz={IMAGE_SIZE} conf={CONFIDENCE_THRESHOLD} name={RUN_NAME}_predict
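
To get a feel for what the confidence threshold does, here is a toy illustration. It is not the library's own code: the detections are made-up dictionaries, and `filter_by_confidence` is a hypothetical helper that keeps anything scoring at or above the cutoff.

```python
def filter_by_confidence(detections, threshold=0.5):
    """Keep only detections whose confidence is at or above the threshold."""
    return [d for d in detections if d["confidence"] >= threshold]


# Made-up detections for illustration only
detections = [
    {"label": "lupin", "confidence": 0.91},
    {"label": "lupin", "confidence": 0.55},
    {"label": "lupin", "confidence": 0.30},
]

print(len(filter_by_confidence(detections, 0.50)))  # → 2
print(len(filter_by_confidence(detections, 0.25)))  # → 3
```

Lowering the threshold lets the weaker detection through, which is what "more sensitive" means here: more weeds found, but more false alarms too.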