# Final Notebook: Yolo Detection of Bacteria on Agar
###### Dietrich Nigh

## Business Understanding of the Problem

Since the discovery of bacteria in the 17th century, scientists have been trying to identify and classify those tiny specks under the microscope. In the 19th century, Julius Richard Petri, a German physician working under the famous Robert Koch, developed his namesake, the Petri dish, for this purpose. He needed to grow bacteria reliably without risk of contamination so he could study his specimens accurately. Since then, the classification of bacteria has come a long way: many bacterial samples can now be sequenced to elucidate all of their secrets. That said, the Petri dish is still a critical tool in the culturing and classification of bacterial samples.

Presently, the identification of bacteria is a laborious and time-consuming task that also takes years of training to perform properly. Even then, mistakes can be made. To reduce the cost in both time and money, recent years have seen an explosion of research into machine learning models that identify bacteria from a sample. Agar plates (Petri dishes with agar media) are a widely available, affordable, and effective means of growing isolated samples. If a model could quickly and accurately differentiate bacteria based on their growth, medical diagnostics could be done faster and with less training for technicians. For research applications, time spent memorizing such works as Bergey's Manual could be diverted elsewhere.

The model chosen for this task is YOLO (here, Ultralytics' YOLOv8), a single-stage object detection model with many hidden layers. Single-stage means it performs the bounding-box regression around each object of interest and the classification of that object in parallel. This makes it much faster than two-stage detectors, like Faster R-CNN, which perform these tasks sequentially. The model is also relatively lightweight once trained; it was developed for tasks such as live object detection, after all. With it, we were able to construct a highly precise detector.

As this model only covers 5 species, I am hoping to gain the funding necessary to improve it. I am seeking this funding from the NIH.

<img src='https://editor.analyticsvidhya.com/uploads/1512812.png' width="540" height="270" />


In [3]:
# Import Necessary libraries

import json
import pandas as pd
import numpy as np
import cv2
import os
from pathlib import Path
from datetime import datetime
from collections import Counter
import random
import shutil
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from ultralytics import YOLO

## Data Exploration

A test JSON file was read in, and its keys and items were explored.

In [9]:
example = {    "background": "bright",
    "classes": [
        "P.aeruginosa"
    ],
    "colonies_number": 4,
    "labels": [
        {
            "class": "P.aeruginosa",
            "height": 307,
            "id": 1,
            "width": 307,
            "x": 549,
            "y": 1141
        },
        {
            "class": "P.aeruginosa",
            "height": 190,
            "id": 2,
            "width": 190,
            "x": 1667,
            "y": 1880
        },
        {
            "class": "P.aeruginosa",
            "height": 279,
            "id": 3,
            "width": 279,
            "x": 510,
            "y": 2300
        },
        {
            "class": "P.aeruginosa",
            "height": 157,
            "id": 4,
            "width": 157,
            "x": 3086,
            "y": 2480
        }
    ],
    "sample_id": 356
}


In [10]:
example.keys()

dict_keys(['background', 'classes', 'colonies_number', 'labels', 'sample_id'])

In [11]:
example['labels'][0].items()

dict_items([('class', 'P.aeruginosa'), ('height', 307), ('id', 1), ('width', 307), ('x', 549), ('y', 1141)])

## Data Cleaning and Reformatting

As I am using a YOLO model, data structure is quite important. The data was first converted into KITTI format and written into txt files. The data was then reorganized into the proper directories before a second conversion from KITTI to YOLO format. The kitti2yolo formatter belongs to ZeroEyes, Inc.; I obtained permission before using it.
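
To make the two conversions concrete, here is a minimal sketch that takes the first label from the example JSON, builds the KITTI-style corner coordinates, and then normalizes them into a YOLO line. The helper names (`json_label_to_corners`, `corners_to_yolo`) and the 4000 x 4000 image size are assumptions for illustration only; the actual pipeline reads each image's size with `cv2.imread` and does not round the values.

```python
def json_label_to_corners(label):
    """Return (x_min, y_min, x_max, y_max) from an AGAR-style label dict."""
    x_min, y_min = label["x"], label["y"]
    return x_min, y_min, x_min + label["width"], y_min + label["height"]


def corners_to_yolo(corners, image_width, image_height, class_id):
    """Return 'class_id center_x center_y width height', all normalized to [0, 1]."""
    x_min, y_min, x_max, y_max = corners
    center_x = (x_min + x_max) / 2 / image_width
    center_y = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {center_x:.6f} {center_y:.6f} {width:.6f} {height:.6f}"


label = {"class": "P.aeruginosa", "height": 307, "id": 1, "width": 307, "x": 549, "y": 1141}
corners = json_label_to_corners(label)
# Assumed image size of 4000 x 4000 pixels; class id 2 is the id this notebook maps P.aeruginosa to
yolo_line = corners_to_yolo(corners, image_width=4000, image_height=4000, class_id=2)
print(corners)
print(yolo_line)
```

The corner tuple corresponds to fields 5 through 8 of a KITTI line, and the YOLO line is what ends up in the per-image label file.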

In [20]:
filenames = list(range(1, 18001)) # file numbers 1 - 18000 (there is no file 0)

In [51]:
fileswithoutcolonies = [] # file numbers whose JSON has no labelled colonies
for file in filenames:
    kitlines = [] # converted lines to be placed in a txt file

    with open(f'data/AGAR_dataset/dataset/{file}.json') as f: # the with block closes the file
        data = json.load(f) # read json file

    for colony in data['labels']:
        species = colony['class'] # extract species
        x = colony['x'] # extract bounding box corner
        y = colony['y'] # extract bounding box corner
        x2 = colony['width'] + colony['x'] # calculate position of second corner
        y2 = colony['height'] + colony['y'] # calculate position of second corner
        kitlines.append(f'{species} 0.0 0 0.0 {x} {y} {x2} {y2} 0.00 0.00 0.00 0.00 0.00 0.00 0.00\n') # line to write to file

    if kitlines: # check if any colonies available
        with open(f'./data/txtfiles/{file}.txt', 'w') as textfile: # create new file, file.txt with kitlines
            textfile.writelines(kitlines)
    else:
        print(f'File {file} had no colonies labelled') # report files without colonies and append to list
        fileswithoutcolonies.append(file)

File 1 had no colonies labelled
File 2 had no colonies labelled
File 3 had no colonies labelled
File 4 had no colonies labelled
File 5 had no colonies labelled
File 6 had no colonies labelled
File 7 had no colonies labelled
File 8 had no colonies labelled
File 9 had no colonies labelled
File 10 had no colonies labelled
File 11 had no colonies labelled
File 12 had no colonies labelled
File 13 had no colonies labelled
File 14 had no colonies labelled
File 15 had no colonies labelled
File 16 had no colonies labelled
File 17 had no colonies labelled
File 18 had no colonies labelled
File 19 had no colonies labelled
File 20 had no colonies labelled
File 21 had no colonies labelled
File 22 had no colonies labelled
File 23 had no colonies labelled
File 24 had no colonies labelled
File 25 had no colonies labelled
File 26 had no colonies labelled
File 27 had no colonies labelled
File 28 had no colonies labelled
File 29 had no colonies labelled
File 30 had no colonies labelled
File 31 had no colonies labelled
...

In [53]:
len(fileswithoutcolonies) # check number without colonies

5728

In [64]:
(18000 - len(fileswithoutcolonies)) * 2 # expected file count: one txt and one jpg per image with colonies

24544

In [61]:
for file in filenames: # put jpg files in same location as txt files
    if file not in fileswithoutcolonies:
        shutil.copyfile(src=f'./data/AGAR_dataset/dataset/{file}.jpg',dst=f'./data/txtfiles/{file}.jpg')


In [16]:
# This code was provided by ZeroEyes, Inc.

def kitti2bboxes(kitti_file):
    """
    Description: Convert kitti label file to arrays of bboxes and labels
    :param kitti_file: Path to kitti label file
    :return bboxes: 2D array of bboxes where each sub-array is bbox coordinates in the form of
    [x_min, y_min, x_max, y_max]
    :return labels: Array of labels corresponding to the bboxes array
    """

    bboxes = []
    labels = []
    for line in open(kitti_file).readlines():
        line_contents = line.split(' ')
        assert len(line_contents) > 1, f'File {kitti_file} does not have required number of fields'
        labels.append(line_contents[0])
        bbox = [int(float(line_contents[4])), int(float(line_contents[5])),
                int(float(line_contents[6])), int(float(line_contents[7]))]
        bboxes.append(bbox)

    return bboxes, labels


def kitti2yolo(dataset_path, resolution=(960, 540), use_images=True,
               class_map={'r_1': 'r_1', 'p_1': 'p_1'}):
    """
    Create annotations for YOLO implementation in format:
    class_id center_x center_y bbox_width bb_height
    where all values are normalized and one label file is generated per image file
    NOTE: This script will also generate a class dict that will be placed in the dab location
    :param dataset_path: Path to dataset where labels will be converted. expects <dataset_path>/train/labels
    :param resolution: Resolution to normalize over if use_images is False
    :param use_images: Flag to use DAB image resolutions to normalize, will slow down
    process
    :param class_map: Class map to map labels
    """

    # Set paths and subdirectories
    subsets = ['train', 'val', 'test']
    assert os.path.exists(dataset_path), f'{dataset_path} not present'

    # Create label ID map
    current_label_value = 0
    id_map = {}
    for label_value in class_map.values():
        if label_value not in id_map.keys():
            id_map[label_value] = current_label_value
            current_label_value += 1

    # Iterate through subdirectories and create conversions. Converted files will be placed in the yolo_v5_labels directory
    for subdirectory in subsets:
        kitti_labels_path = os.path.join(dataset_path, subdirectory, 'labels')
        images_path = os.path.join(dataset_path, subdirectory, 'images')
        yolo_v5_labels_path = os.path.join(dataset_path, subdirectory, 'yolo_v5_labels')

        # Check that labels are present and whether yolo_v5 labels have already been made
        if not os.path.exists(kitti_labels_path):
            print(f'Path not found for subset {subdirectory}, skipping')
            continue

        if not len(os.listdir(kitti_labels_path)) > 0:
            print(f'No labels found in {kitti_labels_path}, skipping subset {subdirectory}')
            continue

        if os.path.exists(yolo_v5_labels_path):
            print(f'Labels present in {yolo_v5_labels_path}, these will be overwritten')
            shutil.rmtree(yolo_v5_labels_path)
        os.mkdir(yolo_v5_labels_path)

        # Iterate through labels and create label format for YOLO
        total_labels = len(os.listdir(kitti_labels_path))
        for label_file in tqdm(os.listdir(kitti_labels_path), total=total_labels):

            # Get image file to get image width and height
            if use_images:
                image_file = label_file.split('.txt')[0] + '.jpg'
                image_file_path = os.path.join(images_path, image_file)
                img = cv2.imread(image_file_path)
                image_height = img.shape[0]
                image_width = img.shape[1]
            else:
                image_height = resolution[1]
                image_width = resolution[0]

            # Create normalized labels
            bboxes, labels = kitti2bboxes(os.path.join(kitti_labels_path, label_file))
            yolo_v5_label_lines = []
            for index in range(len(labels)):
                if labels[index] not in class_map:
                    continue
                x_min = bboxes[index][0]
                y_min = bboxes[index][1]
                x_max = bboxes[index][2]
                y_max = bboxes[index][3]
                height = (y_max - y_min) / image_height
                width = (x_max - x_min) / image_width
                center_y = (y_min + ((y_max - y_min) / 2)) / image_height
                center_x = (x_min + ((x_max - x_min) / 2)) / image_width
                label = labels[index]
                class_label = class_map[label]
                class_id = id_map[class_label]
                annotation_line = f'{class_id} {center_x} {center_y} {width} {height}'
                yolo_v5_label_lines.append(annotation_line)

            # Write YOLO label lines
            yolo_v5_file = os.path.join(yolo_v5_labels_path, label_file)
            with open(yolo_v5_file, 'w+') as f:
                for line in yolo_v5_label_lines:
                    f.writelines(line + '\n')

    # Create label dict JSON
    label_json_file = os.path.join(dataset_path, 'yolo_v5_id_map.json')
    with open(label_json_file, 'w+') as f:
        json.dump(id_map, f, indent=4)

    print(f'Label values mapped to {id_map}')

In [28]:
# Isolating files with colonies in them
empty_files = {str(file) for file in fileswithoutcolonies} # compare as strings, since fileswithoutcolonies may hold ints
good_file_nums = []
for file in filenames:
    if str(file) not in empty_files: # exclude files without colonies
        good_file_nums.append(str(file)) # write numbers with colonies to good_file_nums

len(good_file_nums) # check number of good images

12272

In [32]:
y_nums = good_file_nums.copy() # copy for y in train-test

train_val_nums, test_nums, train_val_y, test_y = train_test_split(good_file_nums, y_nums) # Train, test split
train_nums, val_nums, train_y, val_y = train_test_split(train_val_nums, train_val_y) # Train, validation split


In [33]:
print(len(train_nums), len(val_nums), len(test_nums)) # confirming length of datasets

6903 2301 3068
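
The split sizes above follow from the default `test_size` of 0.25 in `train_test_split`, applied twice. The sketch below reproduces only the arithmetic; `default_split_sizes` is a hypothetical helper, not scikit-learn code. Note also that no `random_state` was passed, so rerunning the split will assign different files to each set.

```python
import math


def default_split_sizes(n_samples, test_fraction=0.25):
    """Approximate how train_test_split sizes its splits: ceil for test, the rest for train."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


# First split: train+val vs. test. Second split: train vs. val.
n_train_val, n_test = default_split_sizes(12272)
n_train, n_val = default_split_sizes(n_train_val)
print(n_train, n_val, n_test)
```

Overall this works out to roughly 56% train, 19% validation, and 25% test.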


In [34]:
# create dictionary for classes, needed by kitti2yolo

label_dict = {'S.aureus': 'S.aureus', 'B.subtilis': 'B.subtilis',
              'P.aeruginosa': 'P.aeruginosa', 'E.coli': 'E.coli', 'C.albicans': 'C.albicans'}

In [39]:
# copy train data to new directory
for file in train_nums:
    shutil.copyfile(src=f'/home/zeloada/Flatiron/Yolo_Detection_of_Bacteria_on_Agar/data/{file}.jpg',dst=f'./data/bacteria_data/train/images/{file}.jpg')
    shutil.copyfile(src=f'/home/zeloada/Flatiron/Yolo_Detection_of_Bacteria_on_Agar/data/{file}.txt',dst=f'./data/bacteria_data/train/labels/{file}.txt')

In [40]:
# copy test data to new directory
for file in test_nums:
    shutil.copyfile(src=f'/home/zeloada/Flatiron/Yolo_Detection_of_Bacteria_on_Agar/data/{file}.jpg',
                    dst=f'./data/bacteria_data/test/images/{file}.jpg')
    shutil.copyfile(src=f'/home/zeloada/Flatiron/Yolo_Detection_of_Bacteria_on_Agar/data/{file}.txt',
                    dst=f'./data/bacteria_data/test/labels/{file}.txt')

In [41]:
# copy validation data to new directory
for file in val_nums:
    shutil.copyfile(src=f'/home/zeloada/Flatiron/Yolo_Detection_of_Bacteria_on_Agar/data/{file}.jpg',
                    dst=f'./data/bacteria_data/val/images/{file}.jpg')
    shutil.copyfile(src=f'/home/zeloada/Flatiron/Yolo_Detection_of_Bacteria_on_Agar/data/{file}.txt',
                    dst=f'./data/bacteria_data/val/labels/{file}.txt')

In [42]:
# perform conversion from kitti data to yolo data
kitti2yolo(dataset_path='./data/bacteria_data/', resolution=(1000, 1000), use_images=True, class_map= label_dict)

100%|██████████| 6903/6903 [08:38<00:00, 13.31it/s]
100%|██████████| 2301/2301 [02:48<00:00, 13.67it/s]
100%|██████████| 3068/3068 [03:44<00:00, 13.64it/s]

Label values mapped to {'S.aureus': 0, 'B.subtilis': 1, 'P.aeruginosa': 2, 'E.coli': 3, 'C.albicans': 4}
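
The training calls in the following sections read a `data.yaml` file that is not shown in this notebook. As a rough sketch of what such a file could contain, the snippet below builds the text of an Ultralytics-style dataset YAML from the class map reported above. The function name `build_data_yaml`, the dataset root, and the `images/train` and `images/val` subfolders are assumptions; only `images/test` appears in a path used later in this notebook.

```python
# Class ids as reported by kitti2yolo
id_map = {"S.aureus": 0, "B.subtilis": 1, "P.aeruginosa": 2, "E.coli": 3, "C.albicans": 4}


def build_data_yaml(root, id_map):
    """Return the text of an Ultralytics-style dataset YAML for the given class map."""
    lines = [
        f"path: {root}  # dataset root (assumed location)",
        "train: images/train  # assumed subfolder",
        "val: images/val  # assumed subfolder",
        "test: images/test",
        "names:",
    ]
    # List classes in id order so the file reads 0, 1, 2, ...
    for name, class_id in sorted(id_map.items(), key=lambda item: item[1]):
        lines.append(f"  {class_id}: {name}")
    return "\n".join(lines) + "\n"


yaml_text = build_data_yaml("./datasets/data/bacteria_data", id_map)
print(yaml_text)
```

Writing `yaml_text` to `data.yaml` would give the trainer a dataset description; the label files are expected in `labels/` folders that mirror the `images/` folders.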





## First model constructed and trained

In [45]:
# Instantiate first model
model = YOLO('./yolov8.yaml')


                   from  n    params  module                                       arguments                     
  0                  -1  1       464  ultralytics.nn.modules.Conv                  [3, 16, 3, 2]                 
  1                  -1  1      4672  ultralytics.nn.modules.Conv                  [16, 32, 3, 2]                
  2                  -1  1      7360  ultralytics.nn.modules.C2f                   [32, 32, 1, True]             
  3                  -1  1     18560  ultralytics.nn.modules.Conv                  [32, 64, 3, 2]                
  4                  -1  2     49664  ultralytics.nn.modules.C2f                   [64, 64, 2, True]             
  5                  -1  1     73984  ultralytics.nn.modules.Conv                  [64, 128, 3, 2]               
  6                  -1  2    197632  ultralytics.nn.modules.C2f                   [128, 128, 2, True]           
  7                  -1  1    295424  ultralytics.nn.modules.Conv                  [128

In [52]:
#train first model for 3 epochs
model.train(data = './data.yaml', epochs = 3)

New https://pypi.org/project/ultralytics/8.0.69 available 😃 Update with 'pip install -U ultralytics'
Ultralytics YOLOv8.0.67 🚀 Python-3.8.10 torch-2.0.0+cu117 CUDA:0 (NVIDIA GeForce RTX 3090, 24252MiB)
[34m[1myolo/engine/trainer: [0mtask=detect, mode=train, model=./yolov8.yaml, data=./data.yaml, epochs=3, patience=50, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=None, workers=8, project=None, name=None, exist_ok=False, pretrained=False, optimizer=SGD, verbose=True, seed=0, deterministic=True, single_cls=False, image_weights=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, vid_stride=1, line_thickness=3, visualize=False, augment=False, agnostic_nms=False, classe

In [60]:
# perform test prediction
model.predict(source='./datasets/data/bacteria_data/images/test/13512.jpg', save=True, show=True, imgsz=640, conf=0.5)


image 1/1 /home/zeloada/Flatiron/Yolo_Detection_of_Bacteria_on_Agar/datasets/data/bacteria_data/images/test/13512.jpg: 640x640 5 S.aureuss, 21 E.colis, 3.4ms
Speed: 9.5ms preprocess, 3.4ms inference, 1.1ms postprocess per image at shape (1, 3, 640, 640)
Results saved to runs/detect/predict


[ultralytics.yolo.engine.results.Results object with attributes:
 
 _keys: ('boxes', 'masks', 'probs', 'keypoints')
 boxes: ultralytics.yolo.engine.results.Boxes object
 keypoints: None
 keys: ['boxes']
 masks: None
 names: {0: 'S.aureus', 1: 'B.subtilis', 2: 'P.aeruginosa', 3: 'E.coli', 4: 'C.albicans'}
 orig_img: array([[[ 69, 108, 110],
         [ 67, 106, 108],
         [ 64, 103, 105],
         ...,
         [ 99, 125, 139],
         [ 96, 122, 136],
         [ 96, 122, 136]],
 
        [[ 67, 106, 108],
         [ 68, 107, 109],
         [ 68, 107, 109],
         ...,
         [103, 129, 143],
         [103, 129, 143],
         [104, 130, 144]],
 
        [[ 67, 106, 108],
         [ 65, 104, 106],
         [ 62, 101, 103],
         ...,
         [103, 129, 141],
         [105, 131, 143],
         [107, 133, 145]],
 
        ...,
 
        [[121, 184, 198],
         [153, 216, 230],
         [114, 177, 191],
         ...,
         [122, 162, 190],
         [110, 153, 180],
      


## Creation of second model
This model was a fresh start. The image size was doubled (640 to 1280) and the model was trained for 10 epochs rather than 3, about 3.33 times as long as the initial model.

In [None]:
# Second model was constructed and run for 10 epochs. Image size was doubled
second_model = YOLO('./yolov8.yaml')
second_model.train(data='./data.yaml', epochs=10, imgsz = 1280, batch=8)

## Final Model
The model architecture has not changed. However, by starting from the trained weights of second_model, I was able to improve the accuracy drastically. Please see the README for further information; a summary from it is provided below.

In [None]:
# Final model was constructed from the pretrained weights of second_model and run for a maximum of 100 epochs
final_model = YOLO('./runs/detect/train6/weights/best.pt') #location of previous model with pre-trained weights
final_model.train(data='./data.yaml', patience=10, imgsz = 1280, batch=8) # stopped itself at 98 epochs after no improvement
final_model.val() # validation ran on test set
final_model.predict(source='./datasets/data/bacteria_data/images/test/3645.jpg', save= True) # save an example image locally

### Model Deployment
Below is the annotated application that was used. A member of ZeroEyes, Inc. helped build it.

In [None]:
import os
from flask import Flask, flash, request, redirect, url_for
from werkzeug.utils import secure_filename
from PIL import Image
import numpy as np
import cv2
from ultralytics import YOLO
from base64 import b64encode
import matplotlib
from collections import defaultdict
import json
import webcolors

model = YOLO('models_train/97epoch.pt') # instantiate the model
UPLOAD_FOLDER = 'upload_folder' # define where files will be saved
os.makedirs(UPLOAD_FOLDER, exist_ok=True) # create the upload directory if it does not exist
ALLOWED_EXTENSIONS = {'png', 'jpg', 'jpeg'} # define allowed file types
app = Flask(__name__) # instantiate flask application
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER # set upload folder
app.secret_key = os.urandom(24) # flash() stores messages in the session, which needs a secret key


def allowed_file(filename): # check if uploaded file has an allowed extension
    return '.' in filename and \
           filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


@app.route('/', methods=['GET', 'POST']) # handle GET and POST requests
def upload_file():
    if request.method == 'POST':
        if 'file' not in request.files: # if no file is submitted
            flash('No file part')
            return redirect(request.url)
        file = request.files['file']
        if file and allowed_file(file.filename): # if there is a file and it's allowed continue
            filename = secure_filename(file.filename) # ensure safe file
            file.save(os.path.join(app.config['UPLOAD_FOLDER'], filename))# save file
            # img = Image.open(f'{UPLOAD_FOLDER}/{filename}')
            img = cv2.imread(f'{UPLOAD_FOLDER}/{filename}') # open file
            output_img, classes, colorHex = inference_img(img) # perform inference
            _, buffer = cv2.imencode('.jpg', output_img) # encode image as jpg
            b64_img = b64encode(buffer).decode() # base64-encode buffer for html display
            ## HTML for post
            return f''' 
            <!doctype html>
            <h1> Your Agar Plate with Detections </h1>
            <img src=data:image/jpeg;base64,{b64_img} width="960" height="960">
            <h3> Colonies per class </h3>
            <p> {json.dumps(classes)} </p>
            <h3> LEGEND </h3>
            <p style= "color:{colorHex['S.aureus']};">S.aureus</p>
            <p style= "color:{colorHex['B.subtilis']};">B.subtilis</p>
            <p style= "color:{colorHex['P.aeruginosa']};">P.aeruginosa</p>
            <p style= "color:{colorHex['E.coli ']};">E.coli</p>
            <p style= "color:{colorHex['C.albicans']};">C.albicans</p>
            <form method="GET" action="/">
                <input type="submit" value="Test another image"/>
            </form>
            '''
        return redirect(request.url) # no valid file was uploaded, so show the form again
    else:
        return '''
        <!doctype html>
        <title>Upload new File</title>
        <h1>Upload new File</h1>
        <form method=post enctype=multipart/form-data>
        <input type=file name=file>
        <input type=submit value=Upload>
        </form>
        '''


def inference_img(img: np.ndarray): # take in image as numpy array; return annotated image, colony counts, and hex colors
    imgsize = 1280
    results = model.predict(img, imgsz=imgsize) # run inference at the training image size
    output_img = img.copy() # copy image for bbox drawing
    # define classes
    class_maps = [
        "S.aureus",
        "B.subtilis", 
        "P.aeruginosa",
        "E.coli ",
        "C.albicans",
    ]
    colorsRGB = matplotlib.cm.tab20(range(len(class_maps))) # one RGBA color per class from the tab20 colormap
    colors = [(i[:-1][::-1]*255) for i in colorsRGB] # drop alpha, reverse to BGR, and scale to 0-255 for OpenCV
    colorsRev = [(i[:-1]*255) for i in colorsRGB] # drop alpha and scale to 0-255, keeping RGB order
    colorsTuple = [(int(x),int(y),int(z)) for x,y,z in colorsRev] # convert from arrays to tuples of ints
    colorHex = {x:webcolors.rgb_to_hex(y) for x, y in zip(class_maps, colorsTuple)} # convert RGB to hex code
    print(colors)
    classes_found = defaultdict(int) # instantiate dictionary counting found colonies
    for result in results: # take bboxes and draw them on to image
        boxes = result.boxes.to('cpu').numpy()
        classes = boxes.cls.astype(int)
        for box, cls in zip(boxes, classes):
            bbox_class = class_maps[cls]
            coord = box.xyxy.astype(int).squeeze() # return bbox coord in smallest format
            xmin, ymin, xmax, ymax = coord
            classes_found[bbox_class] += 1 # count colonies
            
            color = colors[cls] # define color by class
            color = tuple(color) # convert color to a tuple

            cv2.rectangle(output_img, (xmin, ymin), (xmax, ymax), color, 2) # draw rectangle on image
    print(classes_found)
    return output_img, classes_found, colorHex # return image with bboxes, colony count, and hex values for colors used


if __name__ == '__main__': # run application
    app.run()
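
The color handling in `inference_img` can be hard to follow, so here is a dependency-free sketch of the same idea: a colormap color arrives as RGBA floats in [0, 1], OpenCV needs it as BGR integers in 0-255, and the HTML legend needs it as an RGB hex code. The helper `rgba_to_bgr_and_hex` is hypothetical and stands in for the matplotlib and webcolors calls; it also rounds the channels, whereas the application truncates them with `int()`.

```python
def rgba_to_bgr_and_hex(rgba):
    """Convert an RGBA color with floats in [0, 1] to an OpenCV BGR tuple and an HTML hex string."""
    red, green, blue = (round(channel * 255) for channel in rgba[:3])  # drop alpha, scale to 0-255
    bgr = (blue, green, red)  # OpenCV drawing functions expect BGR order
    hex_code = f"#{red:02x}{green:02x}{blue:02x}"  # HTML/CSS expects RGB order
    return bgr, hex_code


bgr, hex_code = rgba_to_bgr_and_hex((0.2, 0.4, 0.6, 1.0))
print(bgr, hex_code)
```

Keeping both orderings from one source color is what lets the boxes drawn on the image match the legend colors in the page.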


#### Results
My final YOLO model takes images as input and consists of __many__ hidden layers. The first layer is a convolutional layer that applies a set of filters to the input image to detect simple features. Its output is then passed to the next layer to detect more complicated features, and so on until the final output layer. YOLO classifies objects and regresses bounding boxes around them in parallel, which makes it much faster than two-stage detectors such as Faster R-CNN. For those interested, here is [Ultralytics' GitHub](https://github.com/ultralytics/ultralytics).


During training, the model keeps refining its layers and adjusting the weights of individual neurons until it can no longer improve. After 10 epochs without improvement, training stops and the weights from the best-scoring epoch are saved. My model continued to improve for 87 epochs before leveling off, and stopped itself at epoch 97. This model was then saved and used in the application deployment.
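
The stopping rule described above can be sketched in a few lines. This is a simplified illustration with synthetic scores, not the Ultralytics implementation; `train_with_patience` is a hypothetical helper, and the library's own bookkeeping may count epochs differently.

```python
def train_with_patience(scores, patience):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation scores.

    Epochs are numbered from 1. Training stops once `patience` consecutive
    epochs pass without beating the best score seen so far.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(scores)


# Synthetic scores: steady improvement through epoch 87, then a plateau
scores = [epoch / 100 for epoch in range(1, 88)] + [0.80] * 13
print(train_with_patience(scores, patience=10))
```

With a best epoch of 87 and a patience of 10, the loop ends at epoch 97, which matches the behavior reported for the final model.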


After training, the model is scored on the test set, which it has not seen before. Here are the final scores:

Metrics: 

* __mAP50__ = 0.971
* __mAP50-95__ = 0.701

This represents a 76% and 129% improvement over the baseline in mAP50 and mAP50-95, respectively.
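
For readers who want to check the percentages, the sketch below shows how a relative improvement is computed and back-calculates the baseline scores implied by the figures above. The baseline values are not stated in this notebook, so the numbers printed here are only approximations derived from the rounded percentages; both function names are hypothetical.

```python
def relative_improvement(baseline, final):
    """Percentage improvement of `final` over `baseline`."""
    return (final - baseline) / baseline * 100


def implied_baseline(final, improvement_pct):
    """Baseline score implied by a final score and a reported percentage improvement."""
    return final / (1 + improvement_pct / 100)


# Back-calculated from the reported final scores and improvements (approximate)
base_map50 = implied_baseline(0.971, 76)
base_map50_95 = implied_baseline(0.701, 129)
print(round(base_map50, 3), round(base_map50_95, 3))
```

Feeding the implied baselines back into `relative_improvement` returns 76 and 129, which confirms the two formulas are inverses of each other.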

Below is a summary of all results:

![results](images/resultsscreenshot.jpg)