In this example:

Each defect type (e.g., `line_crack`, `particle_material`) appears with a certain frequency in both the original and the sliced data.
Discrepancies between the two sets of counts indicate that some classes became more or less frequent after slicing.
For example, if `residue_stain` had a count of 4 in the original data but 0 in the sliced data, the slicing process would have dropped that defect type entirely.
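
The comparison described above can be sketched as follows, assuming the per-class counts are held in `collections.Counter` objects keyed by class name. The helper name `find_count_changes` and the sample counts are hypothetical and only mirror the `residue_stain` example; they are not taken from the dataset.

```python
from collections import Counter


def find_count_changes(original, sliced):
    """Return classes that vanished after slicing and the per-class count delta."""
    all_classes = set(original) | set(sliced)
    # A Counter returns 0 for missing keys, so absent classes compare cleanly
    deltas = {name: sliced[name] - original[name] for name in sorted(all_classes)}
    vanished = [name for name in sorted(original)
                if original[name] > 0 and sliced[name] == 0]
    return vanished, deltas


# Hypothetical counts chosen to mirror the residue_stain example
original = Counter({"line_crack": 15, "residue_stain": 4})
sliced = Counter({"line_crack": 25})

vanished, deltas = find_count_changes(original, sliced)
print("Vanished after slicing:", vanished)
print("Count change per class:", deltas)
```

A negative delta flags a class that lost annotations, and an entry in `vanished` flags a class that slicing removed completely.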

<details open>
<summary>Why Does Class Imbalance Matter in ML?</summary>
<br>
When training a model, we want it to identify all defect types, regardless of their frequency in the dataset. If some defect types are rare, the model might:

- **Predict only the common defects**: It might say “crack” for every image, missing other defects.
  
- **Miss rare but important defects**: Some rare defects could be critical, and we want the model to catch those as well.
</details>


| Step                     | Action                                           | Goal                                                      |
|--------------------------|--------------------------------------------------|-----------------------------------------------------------|
| 1. Count Each Defect     | Check how many times each defect type appears.   | Identify if any defect type is much rarer than others.    |
| 2. Apply Augmentation    | Create new variations of rare defects.           | Increase the number of examples for rare classes.         |
| 3. Use Class Weights     | Assign higher importance to rare classes.        | Encourage the model to focus more on rare defects.        |
| 4. Balance the Dataset   | Use sampling techniques to even class counts.    | Ensure the model sees all defect types fairly often.      |
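
As a rough illustration of step 3, class weights can be derived from the counts with an inverse-frequency rule, `total / (num_classes * class_count)`. This is one common heuristic, not necessarily the weighting used in this project. The function name `inverse_frequency_weights` is hypothetical, and the counts are the original-annotation counts reported later in this notebook.

```python
from collections import Counter


def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    # Assumes every class in `counts` has at least one annotation
    return {name: total / (num_classes * n) for name, n in counts.items()}


counts = Counter({
    "line_crack": 15,
    "particle_material": 9,
    "light_stain": 2,
    "bubble": 2,
    "chip_crack": 1,
})
weights = inverse_frequency_weights(counts)

# Print from the lowest weight (most common class) to the highest (rarest class)
for name, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{name}: {weight:.2f}")
```

Rare classes receive the largest weights, so a loss function that accepts per-class weights would penalise mistakes on them more heavily.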


In [21]:
%pip install pandas

Collecting pandas
  Downloading pandas-2.2.3-cp311-cp311-win_amd64.whl.metadata (19 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2024.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2024.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.3-cp311-cp311-win_amd64.whl (11.6 MB)
Downloading pytz-2024.2-py2.py3-none-any.whl (508 kB)
Downloading tzdata-2024.2-py2.py3-none-any.whl (346 kB)
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.2.3 pytz-2024.2 tzdata-2024.2
Note: you may need to restart the kernel to use updated packages.


In [27]:
# class_imbalance_check.py
import json
from collections import Counter, defaultdict

import os.path

from PIL import Image, ImageDraw

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

In [56]:
coco_file_name = 'coco'  # Base name of the original COCO JSON file; change as needed

# Load the original and sliced annotation files
# Paths start with ".." when run from the notebook; use "." when run as a script
with open(f"../data/coco_json_files/{coco_file_name}.json") as f:
    original_annotations = json.load(f)

with open(f"../data/coco_json_files/{coco_file_name}_sliced_coco.json") as f:
    sliced_annotations = json.load(f)

print(original_annotations)
print(sliced_annotations)

{'images': [{'width': 4096, 'height': 2000, 'id': 0, 'file_name': '01BE01.bmp'}], 'annotations': [{'id': 0, 'image_id': 0, 'category_id': 6, 'segmentation': [], 'bbox': [3243.732519135956, 674.1576397397976, 96.69150425145112, 116.35757291276306], 'ignore': 0, 'iscrowd': 0, 'area': 11250.788755982963}, {'id': 1, 'image_id': 0, 'category_id': 6, 'segmentation': [], 'bbox': [2752.15197481034, 1706.6478091354281, 45.88749354306201, 80.30311370035747], 'ignore': 0, 'iscrowd': 0, 'area': 3684.9086114129277}, {'id': 2, 'image_id': 0, 'category_id': 6, 'segmentation': [], 'bbox': [1529.1878991321566, 557.4169134727985, 45.887493543061424, 37.693298267514876], 'ignore': 0, 'iscrowd': 0, 'area': 1729.6509808672772}, {'id': 3, 'image_id': 0, 'category_id': 6, 'segmentation': [], 'bbox': [1567.6792100432908, 505.00420582112133, 23.630943165701463, 29.886192827210394], 'ignore': 0, 'iscrowd': 0, 'area': 706.2389241390035}, {'id': 4, 'image_id': 0, 'category_id': 6, 'segmentation': [], 'bbox': [170

## Count Checker

In [10]:
# Count how many annotations belong to each category ID
def calculate_class_counts(annotations):
    defect_labels = [ann['category_id'] for ann in annotations['annotations']]
    return Counter(defect_labels)

In [None]:

# Get class counts for original and sliced annotations
original_counts = calculate_class_counts(original_annotations)
sliced_counts = calculate_class_counts(sliced_annotations)

print("Original counts:", original_counts)
print("Sliced counts:", sliced_counts)

# Map category IDs to names for better readability
category_mapping = {cat['id']: cat['name'] for cat in original_annotations['categories']}
print("Category Mapping:", category_mapping)


Original counts: Counter({6: 15, 7: 9, 5: 2, 1: 2, 2: 1})
Sliced counts: Counter({6: 25, 7: 16, 1: 3, 5: 2, 2: 1})
Category Mapping: {0: 'blocked_valve', 1: 'bubble', 2: 'chip_crack', 3: 'excessive_flash', 4: 'improper_welding', 5: 'light_stain', 6: 'line_crack', 7: 'particle_material', 8: 'residue_stain', 9: 'unknown', 10: 'welding_blob'}


In [18]:
print("Class Imbalance in Original Annotations:")
for cat_id, count in original_counts.items():
    print(f"{category_mapping[cat_id]}: {count}")

print("\nClass Imbalance in Sliced Annotations:")
for cat_id, count in sliced_counts.items():
    print(f"{category_mapping[cat_id]}: {count}")

Class Imbalance in Original Annotations:
line_crack: 15
particle_material: 9
light_stain: 2
chip_crack: 1
bubble: 2

Class Imbalance in Sliced Annotations:
line_crack: 25
chip_crack: 1
bubble: 3
light_stain: 2
particle_material: 16


In [57]:
# Convert counts into DataFrame format
data = {
    "Defect Type": [category_mapping[cat_id] for cat_id in original_counts.keys()],
    "Original Count": [original_counts[cat_id] for cat_id in original_counts.keys()],
    "Sliced Count": [sliced_counts.get(cat_id, 0) for cat_id in original_counts.keys()]
}

df = pd.DataFrame(data)

print("Class Imbalance Comparison")
print(df.to_string(index=False))


Class Imbalance Comparison
      Defect Type  Original Count  Sliced Count
       line_crack              15            25
particle_material               9            16
      light_stain               2             2
       chip_crack               1             1
           bubble               2             3


#### Interpretation

- **Class Imbalance Remains**: In both the original and sliced datasets, `line_crack` and `particle_material` are far **more common** than `chip_crack`, `light_stain`, and `bubble`.

- **Increased Counts After Slicing**: Slicing raised the counts for `line_crack` (15 to 25), `particle_material` (9 to 16), and `bubble` (2 to 3), while `light_stain` and `chip_crack` stayed at 2 and 1. The extra annotations most likely come from defects that span more than one slice, so they add training examples without adding new defects. The rarest class, `chip_crack`, still has only 1 instance.
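
To put numbers on the interpretation, the sketch below converts the counts from the comparison table into percentage shares and a simple imbalance ratio (most common count divided by rarest count). The helper names `class_shares` and `imbalance_ratio` are illustrative, not part of the original script.

```python
# Counts copied from the comparison table above
original = {"line_crack": 15, "particle_material": 9, "light_stain": 2,
            "chip_crack": 1, "bubble": 2}
sliced = {"line_crack": 25, "particle_material": 16, "light_stain": 2,
          "chip_crack": 1, "bubble": 3}


def class_shares(counts):
    """Return each class's percentage share of all annotations."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


def imbalance_ratio(counts):
    """Ratio of the most common class count to the rarest class count."""
    return max(counts.values()) / min(counts.values())


original_shares = class_shares(original)
sliced_shares = class_shares(sliced)

for name in original:
    print(f"{name}: {original_shares[name]:.1f}% -> {sliced_shares[name]:.1f}%")

print(f"Imbalance ratio: {imbalance_ratio(original):.0f}:1 -> {imbalance_ratio(sliced):.0f}:1")
```

With these counts, the share of `chip_crack` falls from about 3.4% to about 2.1%, and the ratio rises from 15:1 to 25:1, so slicing adds examples overall but leaves the rare classes relatively rarer.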

## Bounding Box Size Checker

In [58]:
# Function to calculate bounding box area
def calculate_area(bbox):
    return bbox[2] * bbox[3]  # width * height

# Collect all bounding box areas from the original dataset
all_areas = [calculate_area(ann['bbox']) for ann in original_annotations['annotations']]

# Calculate dynamic thresholds based on percentiles
small_threshold = np.percentile(all_areas, 33)
medium_threshold = np.percentile(all_areas, 66)

print("Small Threshold: ", small_threshold)
print("Medium Threshold: ", medium_threshold)
print("All Areas: ", all_areas)

Small Threshold:  152.9398707527034
Medium Threshold:  1194.2404039037347
All Areas:  [11250.788755982963, 3684.9086114129277, 1729.6509808672772, 706.2389241390035, 536.200551158886, 956.4658480131437, 524.1239621688045, 127.2271400619153, 149.1001533907278, 57.86618380479451, 28.05633154172121, 19.63272858873214, 1771.2918776398874, 2486.0236879155777, 6605.720085032894, 225.5296437842936, 52.17205830644165, 634.823906021296, 488.32608155478863, 2097.582486678786, 110.25893696121001, 1194.215924463596, 165.09897573229276, 128.49490685429316, 3573.99412372357, 3529.6336383840285, 83.94520793686539, 71.80437153737263, 1194.2669232972185]


### Functions

#### Benefits of Dynamic Thresholds

1. **Adaptability**: Since the thresholds are derived from the dataset, they better represent the natural distribution of bounding box sizes.
2. **Scalability**: This approach automatically adjusts if the dataset grows or changes in defect types and bounding box sizes.
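
A small sketch of this adaptability, using made-up bounding box areas: the same percentile rule yields different cut-offs for a dataset of small boxes and a dataset of large boxes. The helper `dynamic_thresholds` and both area lists are hypothetical; only `np.percentile` and the 33rd/66th percentile choice come from the notebook.

```python
import numpy as np


def dynamic_thresholds(areas, lower_pct=33, upper_pct=66):
    """Return (small_threshold, medium_threshold) as percentiles of the areas."""
    return np.percentile(areas, lower_pct), np.percentile(areas, upper_pct)


# Hypothetical bounding box areas in square pixels
small_boxes = [10, 20, 30, 40, 50, 60, 70]
large_boxes = [area * 100 for area in small_boxes]

for label, areas in [("small-box dataset", small_boxes),
                     ("large-box dataset", large_boxes)]:
    low, high = dynamic_thresholds(areas)
    # Boundaries follow the notebook's rule: above `high` is Large,
    # above `low` is Medium, everything else is Small
    print(f"{label}: Small <= {low:.1f} < Medium <= {high:.1f} < Large")
```

Because the thresholds scale with the data, fixed pixel cut-offs do not need to be re-tuned when image resolution or defect sizes change.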

In [59]:
# Categorise a bounding box area as Small, Medium, or Large using the dynamic thresholds
def categorize_by_dynamic_size(area, small_threshold, medium_threshold):
    if area > medium_threshold:
        return "Large"
    elif area > small_threshold:
        return "Medium"
    else:
        return "Small"

In [60]:
# Prepare data for bounding box size analysis with dynamic thresholds
def create_bbox_size_data(annotations, category_mapping, small_threshold, medium_threshold):
    bbox_data = []
    for ann in annotations['annotations']:
        defect_type = category_mapping[ann['category_id']]
        area = calculate_area(ann['bbox'])
        size_category = categorize_by_dynamic_size(area, small_threshold, medium_threshold)
        bbox_data.append({
            "Defect Type": defect_type,
            "Bounding Box Area": area,
            "Size Category": size_category
        })
    return bbox_data

In [61]:
# Create bounding box data for original and sliced annotations
original_bbox_data = create_bbox_size_data(original_annotations, category_mapping, small_threshold, medium_threshold)
sliced_bbox_data = create_bbox_size_data(sliced_annotations, category_mapping, small_threshold, medium_threshold)

print("Original bounding box: ", original_bbox_data)
print("Sliced bounding box: ", sliced_bbox_data)

Original bounding box:  [{'Defect Type': 'line_crack', 'Bounding Box Area': 11250.788755982963, 'Size Category': 'Large'}, {'Defect Type': 'line_crack', 'Bounding Box Area': 3684.9086114129277, 'Size Category': 'Large'}, {'Defect Type': 'line_crack', 'Bounding Box Area': 1729.6509808672772, 'Size Category': 'Large'}, {'Defect Type': 'line_crack', 'Bounding Box Area': 706.2389241390035, 'Size Category': 'Medium'}, {'Defect Type': 'line_crack', 'Bounding Box Area': 536.200551158886, 'Size Category': 'Medium'}, {'Defect Type': 'line_crack', 'Bounding Box Area': 956.4658480131437, 'Size Category': 'Medium'}, {'Defect Type': 'line_crack', 'Bounding Box Area': 524.1239621688045, 'Size Category': 'Medium'}, {'Defect Type': 'particle_material', 'Bounding Box Area': 127.2271400619153, 'Size Category': 'Small'}, {'Defect Type': 'particle_material', 'Bounding Box Area': 149.1001533907278, 'Size Category': 'Small'}, {'Defect Type': 'particle_material', 'Bounding Box Area': 57.86618380479451, 'Size

In [None]:
original_bbox_df = pd.DataFrame(original_bbox_data)
sliced_bbox_df = pd.DataFrame(sliced_bbox_data)

# Group by defect type and size category, counting occurrences
grouped_original_bbox_df = original_bbox_df.groupby(['Defect Type', 'Size Category']).size().reset_index(name='Original Count')
grouped_sliced_bbox_df = sliced_bbox_df.groupby(['Defect Type', 'Size Category']).size().reset_index(name='Sliced Count')

# Merge original and sliced data on defect type and size category
merged_bbox_df = pd.merge(grouped_original_bbox_df, grouped_sliced_bbox_df, on=['Defect Type', 'Size Category'], how='outer').fillna(0)

# Converting the count to integers
merged_bbox_df['Original Count'] = merged_bbox_df['Original Count'].astype(int)
merged_bbox_df['Sliced Count'] = merged_bbox_df['Sliced Count'].astype(int)

print("Grouped by Dynamic Bounding Box Size Category")
print(merged_bbox_df)