# Exploratory Data Analysis

In this notebook we take a closer look at the pothole data. At first glance the data is messy and fairly low quality (labels, images, and annotations do not all line up), so it needs some tidying before we start building the model.

## Assess missing data

In [1]:
import pandas as pd
import shutil
import os

In [2]:
# Step 1: Load the CSV file
train_labels = pd.read_csv('./data_full/train_labels_big.csv')

In [3]:
train_labels.head()

Unnamed: 0,Pothole number,Bags used
0,101,0.5
1,102,1.0
2,106,0.5
3,107,0.5
4,109,0.5


In [27]:
# Step 2: List all images and annotations
image_folder = './data_full/train_images'
annotation_folder = './data_full/train_annotations'

# Get list of image files (without extensions)
image_files = {os.path.splitext(f)[0] for f in os.listdir(image_folder)}

# Get list of annotation files (without extensions)
annotation_files = {os.path.splitext(f)[0]
                    for f in os.listdir(annotation_folder)}

In [5]:
# Step 3: Cross-check for missing data
label_ids = set("p" + train_labels['Pothole number'].astype(str))

# Potholes with complete data
complete_data = label_ids.intersection(
    image_files).intersection(annotation_files)

# Missing components
missing_images = label_ids - image_files
missing_annotations = label_ids - annotation_files
missing_both = missing_images.intersection(missing_annotations)

# Filter the original DataFrame to include only complete entries
filtered_train_labels = train_labels[("p" + train_labels['Pothole number'].astype(
    str)).isin(complete_data)]

In [6]:
# Save the filtered DataFrame to a new CSV file
filtered_train_labels.to_csv(
    './data_full/filtered_train_labels.csv', index=False)

# Save the results to CSV files for further analysis
missing_images_df = pd.DataFrame(
    list(missing_images), columns=['Pothole number'])
missing_images_df.to_csv('./data_full/missing_images.csv', index=False)

missing_annotations_df = pd.DataFrame(
    list(missing_annotations), columns=['Pothole number'])
missing_annotations_df.to_csv(
    './data_full/missing_annotations.csv', index=False)

In [7]:
# Step 4: Output results
print(f"Total entries in train_labels.csv: {len(train_labels)}")
print(f"Total images: {len(image_files)}")
print(f"Total annotations: {len(annotation_files)}")
print()
print(f"Potholes with complete data: {len(complete_data)}")
print(f"Potholes missing images: {len(missing_images)}")
print(f"Potholes missing annotations: {len(missing_annotations)}")
print(f"Potholes missing both images and annotations: {len(missing_both)}")
print()
print(f"Original CSV entries: {len(train_labels)}")
print(f"Filtered CSV entries: {len(filtered_train_labels)}")
print("Filtered CSV saved as 'filtered_train_labels.csv'")

Total entries in train_labels.csv: 644
Total images: 485
Total annotations: 482

Potholes with complete data: 386
Potholes missing images: 255
Potholes missing annotations: 258
Potholes missing both images and annotations: 255

Original CSV entries: 644
Filtered CSV entries: 386
Filtered CSV saved as 'filtered_train_labels.csv'


In [8]:
missing_annotations - missing_images

{'p1188', 'p1239', 'p1450'}
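The cross-check above is plain set arithmetic, and the printed counts are consistent with it: 386 complete entries plus 258 missing an annotation gives the 644 labels, and 258 - 255 = 3 matches the three IDs that have an image but no annotation. As a minimal sketch, the same logic can be wrapped in a helper and tried on toy IDs. The function name `coverage_report` and the IDs below are invented for illustration and are not part of the original notebook.

```python
def coverage_report(label_ids, image_ids, annotation_ids):
    """Summarise which labelled IDs have an image, an annotation, or both."""
    label_ids = set(label_ids)
    image_ids = set(image_ids)
    annotation_ids = set(annotation_ids)

    missing_images = label_ids - image_ids
    missing_annotations = label_ids - annotation_ids
    return {
        "complete": label_ids & image_ids & annotation_ids,
        "missing_images": missing_images,
        "missing_annotations": missing_annotations,
        "missing_both": missing_images & missing_annotations,
        # Image exists, annotation does not
        "annotation_only_missing": missing_annotations - missing_images,
    }


# Toy IDs, invented for illustration only
report = coverage_report(
    label_ids={"p1", "p2", "p3", "p4"},
    image_ids={"p1", "p2", "p3"},
    annotation_ids={"p1", "p2"},
)
for name, ids in report.items():
    print(f"{name}: {sorted(ids)}")
```

Every labelled ID is either complete or missing at least one component, so `len(complete) + len(missing_images | missing_annotations)` should always equal the number of labels. That makes a cheap sanity check on the printed totals.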

## Make new filtered image/annotation folders

In [9]:
# Step 1: Load the filtered train_labels.csv (from the previous step)
filtered_train_labels = pd.read_csv('./data_full/filtered_train_labels.csv')
filtered_ids = set(filtered_train_labels['Pothole number'].astype(str))

# Define the original and new folder paths
original_image_folder = './data_full/train_images'
original_annotation_folder = './data_full/train_annotations'

new_image_folder = './data_full/filtered_train_images'
new_annotation_folder = './data_full/filtered_train_annotations'

# Create new folders if they don't exist
os.makedirs(new_image_folder, exist_ok=True)
os.makedirs(new_annotation_folder, exist_ok=True)

In [10]:
# Step 2: Copy the image and annotation for each filtered pothole
# (assumes .jpg images and .txt annotations)
for pothole_id in filtered_ids:
    image_file = f"p{pothole_id}.jpg"
    annotation_file = f"p{pothole_id}.txt"

    # Copy image
    if os.path.exists(os.path.join(original_image_folder, image_file)):
        shutil.copy(os.path.join(original_image_folder,
                    image_file), new_image_folder)

    # Copy annotation
    if os.path.exists(os.path.join(original_annotation_folder, annotation_file)):
        shutil.copy(os.path.join(original_annotation_folder,
                    annotation_file), new_annotation_folder)

print(f"Filtered images copied to {new_image_folder}")
print(f"Filtered annotations copied to {new_annotation_folder}")

Filtered images copied to ./data_full/filtered_train_images
Filtered annotations copied to ./data_full/filtered_train_annotations


I have also manually constructed a validation set from the tail end of the filtered training data, using an 80/20 train/validation split.
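The split itself was done by hand, so the following is only a sketch of one way a tail-end 80/20 split could be reproduced in code. The helper name `tail_split` and the toy ID list are hypothetical. It assumes the IDs are already in the order used for the manual split.

```python
def tail_split(items, val_fraction=0.2):
    """Split a sequence into (train, val), taking val from the tail end."""
    n_val = round(len(items) * val_fraction)
    split_idx = len(items) - n_val
    return items[:split_idx], items[split_idx:]


# Toy pothole IDs, invented for illustration only
toy_ids = [f"p{n}" for n in range(101, 111)]  # 10 IDs: p101 .. p110
train_ids, val_ids = tail_split(toy_ids)

print(len(train_ids), len(val_ids))
print(val_ids)
```

A tail-end split is deterministic, but it is not random: if the pothole numbers correlate with location or time, the validation set may not be representative of the training set.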

**Note:** All of the analysis so far uses the original data, which has not yet been re-annotated or modified.

## Really bad data in original form

Even after matching the training labels with their images and annotations, the data is still not good to work with. Specifically, the YOLOv8 bounding box detection model is inaccurate when identifying the L2 boxes of the meter sticks. There are very few L2 labels, and they are applied inconsistently. The model cannot learn the difference between L1 and L2 because there is no visual difference: both are just halves of a meter stick, so the ambiguous labels poison the model. We can therefore drop L2 altogether by relabeling it as L1, which leaves a simplified model that detects only potholes and half-meter sticks.
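Before merging the labels, it can help to measure how rare L2 actually is. The sketch below counts class IDs across a folder of annotation files. It assumes YOLO-style text files in which the first whitespace-separated field on each line is the class ID. The helper name `count_class_ids` and the demo files are hypothetical, and the demo boxes are made up.

```python
import os
import tempfile
from collections import Counter


def count_class_ids(folder):
    """Count how often each class ID appears across all .txt files in folder."""
    counts = Counter()
    for file_name in os.listdir(folder):
        if not file_name.endswith(".txt"):
            continue
        with open(os.path.join(folder, file_name)) as f:
            for line in f:
                fields = line.split()
                if fields:  # skip blank lines
                    counts[fields[0]] += 1
    return counts


# Demo on throwaway files with made-up boxes
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "a.txt"), "w") as f:
        f.write("0 0.5 0.5 0.2 0.2\n1 0.3 0.7 0.1 0.4\n")
    with open(os.path.join(demo_dir, "b.txt"), "w") as f:
        f.write("0 0.4 0.6 0.3 0.3\n2 0.2 0.2 0.1 0.5\n")
    demo_counts = count_class_ids(demo_dir)

print(dict(demo_counts))
```

Running `count_class_ids('./data_full/filtered_train_annotations')` before and after the relabeling step would show whether class `2` has disappeared and whether the class `1` count grew by the same amount.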

In [12]:
# Define the path to the annotation folder
annotation_folder = './data_full/filtered_train_annotations'

In [13]:
# Function to replace L2 labels with L1 in each annotation file
def update_annotations(folder):
    for file_name in os.listdir(folder):
        if file_name.endswith('.txt'):
            file_path = os.path.join(folder, file_name)
            with open(file_path, 'r') as file:
                lines = file.readlines()

            # Relabel only when the class ID (first field) is '2'.
            # A plain line.replace('2 ', '1 ') would also rewrite any
            # coordinate ending in 2, e.g. '0.512 ' -> '0.511 '.
            updated_lines = []
            for line in lines:
                fields = line.split()
                if fields and fields[0] == '2':
                    fields[0] = '1'
                    line = ' '.join(fields) + '\n'
                updated_lines.append(line)

            # Write the updated content back to the file
            with open(file_path, 'w') as file:
                file.writelines(updated_lines)

            print(f"Updated {file_name}")
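When remapping class IDs in YOLO-style annotation lines, only the first field should change. A substring replacement of `'2 '` matches anywhere in the line, so it also alters any coordinate whose last digit is 2. The sketch below contrasts the two approaches on made-up lines. The helper name `remap_class` is hypothetical, and the format is assumed to be `class x y w h`.

```python
def remap_class(line, old="2", new="1"):
    """Change the class ID (first field) from old to new; leave other lines as-is."""
    fields = line.split()
    if fields and fields[0] == old:
        fields[0] = new
        return " ".join(fields) + "\n"
    return line


# Made-up annotation lines for illustration
l1_line = "1 0.512 0.302 0.1 0.2\n"
l2_line = "2 0.512 0.302 0.1 0.2\n"

naive = l1_line.replace("2 ", "1 ")   # corrupts coordinates ending in 2
safe_l1 = remap_class(l1_line)        # untouched: class is already 1
safe_l2 = remap_class(l2_line)        # only the class ID changes

print(repr(naive))
print(repr(safe_l1))
print(repr(safe_l2))
```

Here the naive version turns `0.512` into `0.511` and `0.302` into `0.301`, which shifts the box. The field-based version leaves every coordinate unchanged.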

In [14]:
# Run the update function
update_annotations(annotation_folder)
print("All L2 labels have been updated to L1.")

Updated p101.txt
Updated p102.txt
Updated p1032.txt
Updated p1033.txt
Updated p1034.txt
Updated p1035.txt
Updated p1036.txt
Updated p1037.txt
Updated p1038.txt
Updated p1039.txt
Updated p1041.txt
Updated p1042.txt
Updated p1043.txt
Updated p1047.txt
Updated p1048.txt
Updated p1049.txt
Updated p1051.txt
Updated p1052.txt
Updated p1054.txt
Updated p1055.txt
Updated p1056.txt
Updated p1058.txt
Updated p1059.txt
Updated p106.txt
Updated p1060.txt
Updated p1061.txt
Updated p1062.txt
Updated p1063.txt
Updated p1064.txt
Updated p1065.txt
Updated p1069.txt
Updated p107.txt
Updated p1070.txt
Updated p1071.txt
Updated p1072.txt
Updated p1073.txt
Updated p1074.txt
Updated p1075.txt
Updated p1076.txt
Updated p1077.txt
Updated p1078.txt
Updated p1079.txt
Updated p1080.txt
Updated p1081.txt
Updated p1082.txt
Updated p1083.txt
Updated p1084.txt
Updated p1087.txt
Updated p1088.txt
Updated p1089.txt
Updated p109.txt
Updated p1090.txt
Updated p1091.txt
Updated p1092.txt
Updated p1095.txt
Updated p1096.t
