In [1]:
%load_ext autoreload
%autoreload 1  
# Automatically reload bioscout package
%aimport bioscout_tech_challenge

In [2]:
from bioscout_tech_challenge import *
from bioscout_tech_challenge.imagery import *
from bioscout_tech_challenge.models import BoundingBox
from bioscout_tech_challenge.utils.file_operations import *
from bioscout_tech_challenge.utils import * 
from bioscout_tech_challenge.utils.image import *
# Now any changes to your package will be automatically reloaded
import pandas as pd
import numpy as np
import os
from PIL import Image, ImageDraw

# 3. Explore Imagery & Bounding Boxes

## Visualise the data

To analyse the performance of the machine learning model for detecting disease within imagery data, we first need a mechanism for transforming the model outputs and ground truths into a comparable format.

Let's start by plotting the ground truth data on top of the image data. Then let's see if we can plot the model outputs on top of the image data for a visual comparison.

In [None]:
# hardcode some data
data_folder = "../data/"
imagery_folder = os.path.join(data_folder, "imagery/")
model_output_folder = os.path.join(data_folder, "tables/imagery/model predictions/")
ground_truth_folder = os.path.join(data_folder, "tables/imagery/ground truths/")
files = os.listdir(imagery_folder)
idx = 0
test_file = files[idx]
file_number = test_file.split(".")[0]

def display_image(filename):
    img_path = os.path.join(imagery_folder, filename)
    img = Image.open(img_path)
    return img

test_image = display_image(test_file)
test_image.show()

In [None]:
# Now lets read in the ground truth data
ground_truth_file = os.path.join(ground_truth_folder, f"{file_number}.csv")
ground_truth_df = read_csv_file(ground_truth_file)
display(ground_truth_df.head())


Looks good. Let's plot the ground truth data on top of the image data. The ground truth data stores each disease area as a bounding box, with the top-left and bottom-right corners given as pixel coordinates.



In [None]:
draw = ImageDraw.Draw(test_image)   
for idx, row in ground_truth_df.iterrows():
    draw.rectangle([(row["x_min"], row["y_min"]), (row["x_max"], row["y_max"])], outline="red", width=3)
test_image.show()


This looks great. We need to develop this into the package for usability when analysing.

Now let's plot the model outputs on top of the image data for a visual comparison.

In [None]:
# read in the model outputs
# Search for file with any extension matching the number
matching_files = [f for f in os.listdir(model_output_folder) if f.startswith(f"{file_number}.")]
if matching_files:
    model_output_file = os.path.join(model_output_folder, matching_files[0])
else:
    raise FileNotFoundError(f"No file found starting with {file_number} in {model_output_folder}")
model_output_df = read_csv_file(model_output_file)
model_output_df.head()



In [None]:
#Load this into a bounding box object
from bioscout_tech_challenge.models.bounding_box import BoundingBox

for idx, row in model_output_df.iterrows():
    box = BoundingBox.from_centroid(row["x_center_normalised"], row["y_center_normalised"], row["width_normalised"], row["height_normalised"])
    draw.rectangle(box.to_absolute_coordinates(test_image.width, test_image.height), outline="blue", width=3)
    
test_image.show()
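
The model outputs are stored as normalised centroid boxes, whereas the ground truths use absolute pixel corners. As a rough illustration of the conversion between the two, here is a minimal standalone sketch. The helper name `centroid_to_corners` is hypothetical and is not the package's `BoundingBox` implementation, which may differ (for example in rounding or clamping).

```python
def centroid_to_corners(x_center, y_center, width, height, image_width, image_height):
    """Convert a normalised centroid box to absolute pixel corners.

    All four box values are assumed to be fractions of the image size (0 to 1).
    Returns (x_min, y_min, x_max, y_max) in pixels.
    """
    x_min = (x_center - width / 2) * image_width
    y_min = (y_center - height / 2) * image_height
    x_max = (x_center + width / 2) * image_width
    y_max = (y_center + height / 2) * image_height
    return (x_min, y_min, x_max, y_max)


# Illustrative values only: a box centred in a 1000x800 image
corners = centroid_to_corners(0.5, 0.5, 0.25, 0.5, 1000, 800)
print(corners)
```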


## Calculate IoU
Looks good, we now have a visual comparison of the model outputs and ground truths. Next we need a quantitative way to compare them. IoU (Intersection over Union) is commonly used for this: it measures the area of overlap between two bounding boxes divided by the area of their union, giving a value between 0 and 1. Typically a threshold of 0.5 is used to decide whether a prediction is a true positive.

Let's calculate the IoU for the boxes identified above and show the results on the image.
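
For reference, the IoU calculation itself can be sketched in a few lines. This is a standalone illustration on plain `(x_min, y_min, x_max, y_max)` tuples; the function name `iou` is an assumption for this example and is not the package's `calculate_iou` method.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlapping region (zero if the boxes do not overlap)
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    # Guard against degenerate (zero-area) boxes
    return intersection / union if union > 0 else 0.0


# Two 10x10 boxes overlapping by half: intersection 50, union 150
half_overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
print(f"{half_overlap:.3f}")
```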

In [8]:
from PIL import ImageFont  # needed for the font lookup below

def draw_text_with_outline(draw, box, image_width, image_height, text, font_size=72):
    """
    Draw text centered in the bounding box with a black outline for better visibility.
    """
    # Get box coordinates
    x1, y1, x2, y2 = box.to_absolute_coordinates(image_width, image_height)
    
    # Calculate center of box
    center_x = (x1 + x2) // 2
    center_y = (y1 + y2) // 2
    
    # Get text size using default font
    font = ImageFont.load_default().font_variant(size=font_size)
    text_width = draw.textlength(text, font=font)  # Get width of text
    text_height = font_size  # Approximate height for default font
    
    # Calculate text position (centered)
    text_x = center_x - text_width // 2
    text_y = center_y - text_height // 2
    
    # Draw outline (using the same font as the main text so the outline lines up)
    offset = 2
    for dx, dy in [(-offset,-offset), (-offset,offset), (offset,-offset), (offset,offset)]:
        draw.text((text_x+dx, text_y+dy), text, fill="black", font=font)
    
    # Draw main text
    draw.text((text_x, text_y), text, fill="white",font=font)

In [None]:
# Create a list of bounding boxes from the model outputs
model_output_boxes = [BoundingBox.from_centroid(row["x_center_normalised"], row["y_center_normalised"], row["width_normalised"], row["height_normalised"]) for idx, row in model_output_df.iterrows()]
from PIL import ImageFont


# Create a list of bounding boxes from the ground truth data
ground_truth_boxes = [
    BoundingBox.from_absolute_coordinates(
        row["x_min"],
        row["y_min"],
        row["x_max"],
        row["y_max"],
        test_image.width,
        test_image.height,
    )
    for idx, row in ground_truth_df.iterrows()
]

matches = find_box_matches(model_output_boxes, ground_truth_boxes)

for pred_idx, gt_idx in matches.items():
    iou = model_output_boxes[pred_idx].calculate_iou(ground_truth_boxes[gt_idx])
    # Label each matched prediction with its IoU score
    draw_text_with_outline(draw, model_output_boxes[pred_idx], test_image.width, test_image.height, f"{iou:.2f}")
test_image.show()


Great, we can visualise the IoU scores on the image data. There is one issue here: only a single IoU score is displayed for the two sets of boxes that overlap at the bottom centre of the image. This appears to be an issue with both the ground truth labelling and the model outputs. The ground truth data has labelled two of the three overlapping disease areas as a single bounding box. The model outputs have successfully separated one of these out from the single box, but have also incorrectly identified all three disease areas as a single bounding box.

This is a complicated issue and as such will be left for now. Possible solutions include:

1. Post-processing the model outputs to merge the boxes that are too close to each other.
2. Post-processing the ground truth data to merge the boxes that are too close to each other.
3. Using a more complex model that can identify multiple disease areas within a single bounding box.



## Calculate some metrics for a single image
Let's calculate some metrics for the single test image so far. We can use the IoU scores to determine the number of true positives and false positives. We can then use these counts to calculate the precision, recall and F1 score, which are common metrics for evaluating the performance of a machine learning model. Whilst these metrics are simple to calculate, it was first important to establish a way to compare the model outputs and ground truths using the IoU scores.
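
To make the counting step concrete, here is a minimal sketch of one common approach: greedily match each prediction to its best unmatched ground truth box and count it as a true positive if the IoU meets the threshold. The names `iou` and `count_true_positives` are hypothetical, and the package's `find_true_positives` may use a different matching strategy.

```python
def iou(box_a, box_b):
    """Intersection over Union for (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def count_true_positives(predictions, ground_truths, iou_threshold=0.5):
    """Greedy one-to-one matching; returns (true_positive_count, matched_gt_indices)."""
    matched = set()
    true_positives = 0
    for pred in predictions:
        best_iou, best_idx = 0.0, None
        for gt_idx, gt in enumerate(ground_truths):
            if gt_idx in matched:
                continue  # each ground truth box can be matched at most once
            score = iou(pred, gt)
            if score > best_iou:
                best_iou, best_idx = score, gt_idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_positives += 1
    return true_positives, matched


# Illustrative boxes only (not taken from the dataset)
preds = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
gts = [(1, 1, 10, 10), (100, 100, 110, 110)]
tp, matched_gts = count_true_positives(preds, gts)
fp = len(preds) - tp  # predictions with no matching ground truth
fn = len(gts) - tp    # ground truths the model missed
print(tp, fp, fn)
```

Because the matching is greedy and follows the order of the predictions, a different ordering (for example by confidence score) can change which boxes are paired when several overlap.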


In [None]:
# Calculate the number of true positives
true_positives, matched_gt_boxes = find_true_positives(model_output_boxes, ground_truth_boxes,iou_threshold=0.5)
print(f"True Positives: {true_positives}")
    
# Calculate the number of false positives
false_positives = len(model_output_boxes) - true_positives
print(f"False Positives: {false_positives}")

# Calculate the number of false negatives
false_negatives = len(ground_truth_boxes) - true_positives
print(f"False Negatives: {false_negatives}")



This matches the visualisation of the IoU scores on the image data. We have one false positive and one false negative from the overlap of the two sets of boxes. There is an additional false positive that appears to be a correct detection but was not labelled in the ground truth data. There are three additional false negatives that were not identified by the model outputs. However, the last of these appears to be a true negative, as the labelled ground truth box runs off the image and cannot be identified as a disease area.

Overall, just this one image has given us a good indication of the performance of the model and has raised some issues with the data labelling.

Although this is a small sample, let's test the precision, recall and F1 score calculations.
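
As a reminder of the definitions, precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is the harmonic mean of the two. The sketch below is a standalone illustration with a hypothetical helper name and made-up counts; it is not the package's `calculate_precision`, `calculate_recall` or `calculate_f1`.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1), using 0.0 where a denominator is zero."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative counts only (not the results for the test image)
p, r, f = precision_recall_f1(true_positives=6, false_positives=2, false_negatives=4)
print(f"Precision: {p:.2f}, Recall: {r:.2f}, F1: {f:.2f}")
```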

In [None]:

precision = calculate_precision(true_positives, len(model_output_boxes))
recall = calculate_recall(true_positives, len(ground_truth_boxes))
f1 = calculate_f1(precision, recall)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")



This seems like a good result. The precision, recall and F1 score are all relatively high. However, this is only a single image and the metrics are not very robust. We need to calculate the metrics for all the images to get a better indication of the model performance.

## Create dataframe to store metrics of all images
We will create a dataframe to store and keep track of the metrics for the images as we process them. This will allow us to easily calculate the metrics per image and across the dataset, and to add functionality such as calculating the average precision across the dataset.

In [None]:
# Lets read in all the model outputs and ground truths and return a dataframe combining the data.
# We can reuse our functions developed in utils.file_operations
import bioscout_tech_challenge.utils.image as image_utils
ground_truth_files = find_csv_files(ground_truth_folder,sort=True)
model_output_files = find_csv_files(model_output_folder,sort=True)

ground_truth_df = combine_csv_files(ground_truth_files,detect_header=False)
model_output_df = combine_csv_files(model_output_files,detect_header=False)

# since we automatically tag the source file we just need to remove the file type to get the file number
# regex=False treats the suffix as a literal string, so "." is not a wildcard
ground_truth_df["file_number"] = ground_truth_df["source_file"].str.replace(".csv", "", regex=False)
model_output_df["file_number"] = model_output_df["source_file"].str.replace(".jpg.csv", "", regex=False)

display(ground_truth_df.head())
display(model_output_df.head())

image_files = image_utils.find_image_files(imagery_folder,sort=True)
image_df = image_utils.get_image_dimensions(image_files)
display(image_df.head())

ground_truth_df = ground_truth_df.merge(image_df, on="file_number", how="left")



In [13]:
from bioscout_tech_challenge.utils.bounding_box import df_to_bounding_boxes

# Now lets turn the ground truth data into a list of bounding boxes
ground_truth_boxes = df_to_bounding_boxes(ground_truth_df, method="absolute",file_name="file_number")
model_output_boxes = df_to_bounding_boxes(model_output_df, method="centroid",file_name="file_number")


In [None]:
display(ground_truth_boxes[0])
display(model_output_boxes[0])

We now have two lists of bounding boxes but no easy way to compare them. We need to find the IoU scores for the boxes in the two lists, but only between boxes that are from the same image.

In [15]:
# Turn into two new dataframes using the dataclass bounding box
from dataclasses import asdict
ground_box_df = pd.DataFrame.from_records([asdict(box) for box in ground_truth_boxes])
model_box_df = pd.DataFrame.from_records([asdict(box) for box in model_output_boxes])


It is more efficient to calculate the statistics with the bounding boxes broken out into dataframes. However, we have already created the tools to calculate the statistics using the bounding box dataclass. We will add the list of bounding boxes to our image dataframe and then calculate the statistics.

In [16]:
filename_list = ground_truth_df["file_number"].unique()


# Split ground_truth_boxes into sublists
start = 0
ground_truth_boxes_split = []
model_output_boxes_split = []
counts = ground_truth_df["file_number"].value_counts().sort_index(key=lambda x: x.astype(int))
for count in counts.tolist():
    end = start + count
    ground_truth_boxes_split.append(ground_truth_boxes[start:end])
    start = end
start = 0
counts = model_output_df["file_number"].value_counts().sort_index(key=lambda x: x.astype(int))
for count in counts.tolist():
    end = start + count
    model_output_boxes_split.append(model_output_boxes[start:end])
    start = end


# Add the lists of bounding boxes to the image dataframe
image_df['ground_truth_boxes'] = ground_truth_boxes_split
image_df['model_output_boxes'] = model_output_boxes_split



The above was not a great way to process the data, as there are no sanity checks that only bounding boxes from the same image are being added to each row of the dataframe. We can check this by looking at the bounding boxes for a single image, but ideally we would process the data in a way that ensures it.
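
One way to make the grouping safe by construction is to key the boxes on the image name each box already carries, rather than relying on counts and ordering. The sketch below uses only the standard library; `NamedBox` is a hypothetical stand-in for the package's `BoundingBox`, which is assumed here to expose a `name` attribute.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class NamedBox:
    """Hypothetical stand-in for a bounding box tagged with its source image."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def group_boxes_by_name(boxes):
    """Group boxes into a dict of lists keyed by the image name on each box."""
    grouped = defaultdict(list)
    for box in boxes:
        grouped[box.name].append(box)
    return dict(grouped)


# Illustrative boxes only; note the input order does not need to be sorted
boxes = [
    NamedBox("12", 0, 0, 10, 10),
    NamedBox("7", 5, 5, 20, 20),
    NamedBox("12", 30, 30, 40, 40),
]
grouped = group_boxes_by_name(boxes)
print({name: len(items) for name, items in grouped.items()})
```

The grouped dictionary could then be attached to the image dataframe by mapping over the `file_number` column with `grouped.get(name, [])`, so that images with no boxes receive an empty list instead of shifting every later row.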

In [None]:
idx = 17
name = image_df["file_number"].iloc[idx]
print(all([box.name==name for box in image_df["model_output_boxes"].iloc[idx]]))
print(all([box.name==name for box in image_df["ground_truth_boxes"].iloc[idx]]))



## Calculate the metrics for all the images
Let's use our metrics functions to calculate the true positives, false positives, false negatives, precision, recall and F1 score for each image. We will use an IoU threshold of 0.5 to determine a match.

In [None]:
from bioscout_tech_challenge.imagery.metrics import calculate_metrics_for_predictions
# Calculate metrics for each row and expand the dictionary into new columns directly
metrics_df = pd.DataFrame(
    image_df.apply(
        lambda row: calculate_metrics_for_predictions(
            row['model_output_boxes'],
            row['ground_truth_boxes'],
            iou_threshold=0.5
        ),
        axis=1
    ).tolist()
)

# Add metrics columns to original dataframe
processed_df = pd.concat([image_df, metrics_df], axis=1)
processed_df.drop(columns=["model_output_boxes", "ground_truth_boxes"], inplace=True)

processed_df.head()

Let's do a sanity check with our original test image.

In [None]:
file_name = test_file.split(".")[0]
idx = processed_df[processed_df["file_number"]==file_name].index[0]
display(processed_df.iloc[idx])

print("precision:", precision)
print("recall:", recall) 
print("f1_score:", f1)
print("true_positives:", true_positives)
print("false_positives:", false_positives)
print("false_negatives:", false_negatives)

## Analyse low scores
Everything matches up. Let's pick some of the images with lower precision, recall and F1 score and see if we can understand why.

In [None]:

def display_image_with_boxes(image_df, idx):
    file_name = image_df.loc[idx]["file_number"] + ".jpg"
    image = display_image(file_name)
    draw = ImageDraw.Draw(image)
    for box in image_df.loc[idx]["ground_truth_boxes"]:
        draw.rectangle(box.to_absolute_coordinates(image.width, image.height), outline="red", width=3)
    for box in image_df.loc[idx]["model_output_boxes"]:
        draw.rectangle(box.to_absolute_coordinates(image.width, image.height), outline="blue", width=3)
    image.show()


precision_min = processed_df["precision"].idxmin()
recall_min = processed_df["recall"].idxmin()
f1_min = processed_df["f1_score"].idxmin()
display(f"Precision Minimum idx: {precision_min}")
display(f"Recall Minimum idx: {recall_min}")
display(f"F1 Minimum idx: {f1_min}")

display(processed_df.loc[precision_min])
display_image_with_boxes(image_df, precision_min)

Unsurprisingly, a single image has the lowest precision, recall and F1 score. Whilst the model did successfully identify quite a few of the disease areas, the image contains a large number of false positives as well as a large number of false negatives. The image itself is quite busy and contains a large number of disease areas. It also seems that most of the false positives are in fact true positives, and the ground truth data has not labelled all of the disease areas. This is the second image we have visualised, and as both have shown unlabelled disease areas, this is a cause for concern and will be investigated further.


## Highest scoring images
Whilst the lowest scoring images are the main cause for concern, we can also analyse high scoring images to see where the model is performing well.

In [None]:
precision_max = processed_df["precision"].idxmax()
recall_max = processed_df["recall"].idxmax()
f1_max = processed_df["f1_score"].idxmax()
display(f"Precision Maximum idx: {precision_max}")
display(f"Recall Maximum idx: {recall_max}")
display(f"F1 Maximum idx: {f1_max}")

display(processed_df.loc[precision_max])
display_image_with_boxes(image_df, precision_max)

display(processed_df.loc[recall_max])
display_image_with_boxes(image_df, recall_max)


Unsurprisingly, the image with the highest precision, recall and F1 score has no false positives or false negatives. The image itself is quite sparse, the ground truth data has labelled all of the disease areas, and the model has identified all of them. A second image, with perfect recall but some false positives, again looks like a case where the model has performed perfectly but there have been issues with the labelling of the ground truth data.

Let's have a look at the image with the highest F1 score that has both false positives and false negatives.

In [None]:
f1_max = processed_df[(processed_df["precision"]<1) & (processed_df["recall"]<1)]["f1_score"].idxmax()
display(f"F1 Maximum idx: {f1_max}")
display(processed_df.loc[f1_max])
display_image_with_boxes(image_df, f1_max)


This is interesting, as the false negative seems to be due to the opacity of the disease area, whilst the false positive appears to be a detection of a disease area that is similar to what we are looking for, although it is unclear whether it is the same type of disease. We will need to discuss this with a biologist to determine whether the false positive is in fact a true positive, again due to the issue of unlabelled disease areas in the ground truth data.

## Calculate the metrics for the entire dataset

Since we have all the metrics in a dataframe, let's calculate the precision, recall and F1 score for the entire dataset.
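
There are two common ways to aggregate per-image results: pooling the counts across all images before computing the metric (micro-averaging), or averaging the per-image scores (macro-averaging). The sketch below contrasts the two for precision, using hypothetical helper names and made-up counts; images with many boxes dominate the pooled figure, whereas every image counts equally in the averaged one.

```python
def safe_divide(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def micro_precision(counts):
    """Pool true and false positives across images, then compute precision once."""
    total_tp = sum(tp for tp, fp, fn in counts)
    total_fp = sum(fp for tp, fp, fn in counts)
    return safe_divide(total_tp, total_tp + total_fp)


def macro_precision(counts):
    """Compute precision per image, then take the unweighted mean."""
    per_image = [safe_divide(tp, tp + fp) for tp, fp, fn in counts]
    return safe_divide(sum(per_image), len(per_image))


# Illustrative (true_positives, false_positives, false_negatives) per image
counts = [(8, 2, 2), (1, 3, 1)]
micro = micro_precision(counts)
macro = macro_precision(counts)
print(f"micro: {micro:.3f}, macro: {macro:.3f}")
```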

In [None]:
# Sum up all true positives, false positives, and false negatives across all images
total_true_positives = processed_df['true_positives'].sum()
total_false_positives = processed_df['false_positives'].sum() 
total_false_negatives = processed_df['false_negatives'].sum()
# Calculate overall metrics
overall_precision = total_true_positives / (total_true_positives + total_false_positives)
overall_recall = total_true_positives / (total_true_positives + total_false_negatives)
overall_f1 = 2 * (overall_precision * overall_recall) / (overall_precision + overall_recall)

print(f"Overall Precision: {overall_precision:.3f}")
print(f"Overall Recall: {overall_recall:.3f}") 
print(f"Overall F1 Score: {overall_f1:.3f}")


Excellent, the overall precision, recall and F1 score are all relatively high. This is a good indication that the model is performing well across the dataset. It is likely that the true precision is higher than reported, because unlabelled disease areas in the ground truth data cause correct detections to be counted as false positives.

## Conclusion and Recommendations

The model is performing well across the dataset. However, there are some issues with the ground truth data labelling that need to be addressed. The unlabelled disease areas are a cause for concern and will need to be investigated further. They make the precision score misleading, as there are multiple cases of true positives being counted as false positives. Correcting the labels would likely greatly improve the precision score and, as a consequence, the F1 score.

Next steps would be to integrate this analysis as a tool in the package. This could be included as a function in the `imagery` module. It could be used to analyse the performance of the model on a single image or across the entire dataset. It could also be used to plot the IoU scores on the image data. To help with the identification of unlabelled disease areas in the ground truth data, a function needs to be added that draws the false positive bounding boxes on the image data and exports the result so that it can be reviewed by a biologist. This would allow investigation of the issue without needing to re-review all the images and disease areas.

Additionally, the way in which the data was processed from dataframe to bounding box and back to dataframe is not robust and raises issues with data integrity. The process could be made more efficient by processing all the bounding boxes within a dataframe. This would also allow us to keep track of model metadata, specifically confidence scores. These are necessary in order to calculate the AP (average precision) score across different confidence thresholds, which in turn would allow us to calculate the mAP (mean average precision) score, giving a single value to describe the performance of the model across the dataset.
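
As a rough sketch of what the AP calculation could look like once confidence scores are tracked, the example below ranks detections by confidence, builds the precision-recall curve, and sums the area under its interpolated envelope (the all-point interpolation used in later PASCAL VOC evaluations). The function name and input format are assumptions: each detection is a `(confidence, is_true_positive)` pair whose flag has already been decided by IoU matching at a fixed threshold.

```python
def average_precision(detections, num_ground_truths):
    """Area under the interpolated precision-recall curve.

    detections: list of (confidence, is_true_positive) pairs.
    num_ground_truths: total number of ground truth boxes.
    """
    if num_ground_truths == 0 or not detections:
        return 0.0

    # Rank detections from most to least confident
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)

    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in ordered:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truths)

    # Interpolate: precision at each point becomes the best precision at any higher recall
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas between consecutive recall values
    ap = 0.0
    previous_recall = 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Illustrative detections only (not taken from the dataset)
example = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
ap = average_precision(example, num_ground_truths=2)
print(f"AP: {ap:.3f}")
```

Averaging this value over classes (and, in some conventions, over several IoU thresholds) gives the mAP.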

