# Class Imbalance Checker Script

The class imbalance checker script will serve as a diagnostic tool that:

- Confirms that the slicing process has not unintentionally altered the dataset (such as by adding new defect types).
  
- Identifies any class imbalances in the original dataset that may require balancing measures.
  
- Highlights any changes in the distribution of bounding box sizes due to slicing, allowing you to see if the slicing process fragments large defects excessively.

<details open>
<summary>Why Does Class Imbalance Matter in ML?</summary>
<br>
When training a model, we want it to identify all defect types, regardless of their frequency in the dataset. If some defect types are rare, the model might:

- **Predict only the common defects**: It might say “crack” for every image, missing other defects.
  
- **Miss rare but important defects**: Some rare defects could be critical, and we want the model to catch those as well.

In this example:

Each defect type (e.g., `line_crack`, `particle_material`) appears with a certain frequency in both the original and sliced data. If the counts differ between the two, some classes may have become more or less frequent after slicing. For example, if `residue_stain` had a count of 4 in the original data but 0 in the sliced data, the slicing process would have failed to preserve that defect type.
</details>




| Step                          | Output                        | Purpose                                                        |
|-------------------------------|-------------------------------|----------------------------------------------------------------|
| 1                             | Loaded datasets               | Verify data structure and integrity                            |
| 2                             | Defect types in each dataset  | Identify unexpected defect types in sliced data               |
| 3                             | Total class counts            | Compare overall defect counts before and after slicing         |
| 4                             | Size-based classification     | Analyze if slicing affects size distribution of defects       |
| 5                             | Final summary table           | Comprehensive view of class imbalance and size distribution, guiding augmentation and balancing decisions |

| Step                     | Action                                           | Goal                                                      |
|--------------------------|--------------------------------------------------|-----------------------------------------------------------|
| 1. Count Each Defect     | Check how many times each defect type appears.   | Identify if any defect type is much rarer than others.    |
| 2. Apply Augmentation    | Create new variations of rare defects.           | Increase the number of examples for rare classes.         |
| 3. Use Class Weights     | Assign higher importance to rare classes.        | Encourage the model to focus more on rare defects.        |
| 4. Balance the Dataset   | Use sampling techniques to even class counts.    | Ensure the model sees all defect types fairly often.      |
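
Step 3 in the table above can be illustrated with inverse-frequency weights. The sketch below shows one common heuristic, not necessarily the weighting this project uses. The helper name `inverse_frequency_weights` is hypothetical, and `example_counts` copies the original-dataset counts reported later in this notebook. Classes with zero examples are skipped, on the assumption that no weight can compensate for missing data.

```python
def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * count); rarer classes get larger weights.

    Classes with a count of zero are left out (assumption: they cannot be trained on).
    """
    present = {name: n for name, n in counts.items() if n > 0}
    total = sum(present.values())
    num_classes = len(present)
    return {name: total / (num_classes * n) for name, n in present.items()}


# Original-dataset counts as reported in this notebook
example_counts = {
    "blocked_valve": 1, "bubble": 34, "chip_crack": 15, "excessive_flash": 0,
    "improper_welding": 13, "light_stain": 34, "line_crack": 246,
    "particle_material": 181, "residue_stain": 0, "unknown": 24, "welding_blob": 0,
}

weights = inverse_frequency_weights(example_counts)
for name, weight in sorted(weights.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {weight:.3f}")
```

A weight above 1 marks a class that is rarer than average, and a weight below 1 marks a class that is more common than average. How these weights are passed to a loss function depends on the training framework.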


In [None]:
%pip install pandas pycocotools matplotlib numpy pillow

Collecting pandas
  Downloading pandas-2.2.3-cp311-cp311-win_amd64.whl.metadata (19 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2024.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2024.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.3-cp311-cp311-win_amd64.whl (11.6 MB)
   ---------------------------------------- 0.0/11.6 MB ? eta -:--:--
   ---------------------------------------- 11.6/11.6 MB 66.0 MB/s eta 0:00:00
Downloading pytz-2024.2-py2.py3-none-any.whl (508 kB)
Downloading tzdata-2024.2-py2.py3-none-any.whl (346 kB)
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.2.3 pytz-2024.2 tzdata-2024.2
Note: you may need to restart the kernel to use updated packages.


In [1]:
# class_imbalance_check.py
import json
from collections import Counter, defaultdict
import os.path
from PIL import Image, ImageDraw
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from pycocotools.coco import COCO

In [None]:
coco_file_name = 'cassette1_train' # Change file name of original coco json file

# Load the original and sliced annotation files
with open(f"../data/coco/{coco_file_name}.json") as f:  # . in script, .. in notebook
    original_annotations = json.load(f)

with open(f"../data/coco/{coco_file_name}_sliced_coco.json") as f:  # . in script, .. in notebook
    sliced_annotations = json.load(f)

# Use COCO objects specifically for counting bounding boxes
coco_original = COCO(f"../data/coco/{coco_file_name}.json")
coco_sliced = COCO(f"../data/coco/{coco_file_name}_sliced_coco.json")

print(original_annotations)
print(sliced_annotations)

# Inspect the structure of both datasets to ensure they are loaded correctly
print("\nKeys in original dataset:", original_annotations.keys())
print("Keys in sliced dataset:", sliced_annotations.keys())
print("Sample data from original dataset:", original_annotations['annotations'][:2])
print("Sample data from sliced dataset:", sliced_annotations['annotations'][:2])


## Count Checker

In [27]:
# Function to calculate class counts by using COCO API
def calculate_class_counts(coco_obj):
    category_ids = coco_obj.getCatIds()  # Get all category IDs in the dataset
    bbox_counts = Counter()
    
    for cat_id in category_ids:
        # Get all annotation IDs for each category ID and count them
        ann_ids = coco_obj.getAnnIds(catIds=[cat_id])
        bbox_counts[cat_id] = len(ann_ids)
    
    return bbox_counts


# Calculate total counts for original and sliced datasets
original_counts = calculate_class_counts(coco_original)
sliced_counts = calculate_class_counts(coco_sliced)

print(original_counts)
print(sliced_counts)

Counter({6: 246, 7: 181, 1: 34, 5: 34, 9: 24, 2: 15, 4: 13, 0: 1, 3: 0, 8: 0, 10: 0})
Counter({6: 423, 7: 312, 9: 53, 1: 51, 5: 50, 2: 27, 4: 21, 0: 2, 3: 0, 8: 0, 10: 0})


In [28]:

# Map category IDs to names for readability
category_mapping = {cat['id']: cat['name'] for cat in coco_original.loadCats(original_counts.keys())}
original_counts_named = {category_mapping[k]: v for k, v in original_counts.items()}
sliced_counts_named = {category_mapping.get(k, 'Unknown'): v for k, v in sliced_counts.items()}

print("Category Mapping: ",category_mapping)
print("Bounding Box Counts in Original Dataset:", original_counts_named)
print("Bounding Box Counts in Sliced Dataset:", sliced_counts_named)

Category Mapping:  {0: 'blocked_valve', 1: 'bubble', 2: 'chip_crack', 3: 'excessive_flash', 4: 'improper_welding', 5: 'light_stain', 6: 'line_crack', 7: 'particle_material', 8: 'residue_stain', 9: 'unknown', 10: 'welding_blob'}
Bounding Box Counts in Original Dataset: {'blocked_valve': 1, 'bubble': 34, 'chip_crack': 15, 'excessive_flash': 0, 'improper_welding': 13, 'light_stain': 34, 'line_crack': 246, 'particle_material': 181, 'residue_stain': 0, 'unknown': 24, 'welding_blob': 0}
Bounding Box Counts in Sliced Dataset: {'blocked_valve': 2, 'bubble': 51, 'chip_crack': 27, 'excessive_flash': 0, 'improper_welding': 21, 'light_stain': 50, 'line_crack': 423, 'particle_material': 312, 'residue_stain': 0, 'unknown': 53, 'welding_blob': 0}


In [29]:
# Calculate the total number of bounding boxes across all defect types
total_original_bboxes = sum(original_counts.values())
total_sliced_bboxes = sum(sliced_counts.values())

print(f"\nTotal bounding boxes in the original dataset: {total_original_bboxes}")
print(f"Total bounding boxes in the sliced dataset: {total_sliced_bboxes}")


Total bounding boxes in the original dataset: 548
Total bounding boxes in the sliced dataset: 939


In [31]:

# Check unique defect types in each dataset
original_defect_types = {ann['category_id'] for ann in original_annotations['annotations']}
sliced_defect_types = {ann['category_id'] for ann in sliced_annotations['annotations']}

print("Defect types in the original dataset:", original_defect_types)
print("Defect types in the sliced dataset:", sliced_defect_types)

# Ensure that all defect types in the sliced dataset exist in the original dataset
unexpected_defects = sliced_defect_types - original_defect_types
if unexpected_defects:
    print("Warning: These defect types appear in the sliced dataset but not in the original dataset:", unexpected_defects)
else:
    print("All defect types in the sliced dataset are also present in the original dataset.")

Defect types in the original dataset: {0, 1, 2, 4, 5, 6, 7, 9}
Defect types in the sliced dataset: {0, 1, 2, 4, 5, 6, 7, 9}
All defect types in the sliced dataset are also present in the original dataset.
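
The check above only flags defect types that slicing added. The opposite case, a defect type that exists in the original data but disappears after slicing, can be caught the same way. This is a minimal sketch, assuming category IDs are comparable between the two files; `compare_defect_types` is a hypothetical helper and the sample IDs are illustrative.

```python
def compare_defect_types(original_ids, sliced_ids):
    """Return (added, lost): IDs only in the sliced set and IDs only in the original set."""
    original_ids, sliced_ids = set(original_ids), set(sliced_ids)
    added = sliced_ids - original_ids   # unexpected types introduced by slicing
    lost = original_ids - sliced_ids    # types that slicing failed to preserve
    return added, lost


# Illustrative IDs: category 8 is present in the original but missing after slicing
added, lost = compare_defect_types({0, 1, 2, 8}, {0, 1, 2})
print("Added by slicing:", sorted(added))
print("Lost by slicing:", sorted(lost))
```

In the notebook, the two arguments would be `original_defect_types` and `sliced_defect_types`.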


In [11]:
print("Class Imbalance in Original Annotations:")
for cat_id, count in original_counts.items():
    print(f"{category_mapping[cat_id]}: {count}")

print("\nClass Imbalance in Sliced Annotations:")
for cat_id, count in sliced_counts.items():
    print(f"{category_mapping[cat_id]}: {count}")

Class Imbalance in Original Annotations:
blocked_valve: 1
bubble: 34
chip_crack: 15
excessive_flash: 0
improper_welding: 13
light_stain: 34
line_crack: 246
particle_material: 181
residue_stain: 0
unknown: 24
welding_blob: 0

Class Imbalance in Sliced Annotations:
blocked_valve: 2
bubble: 51
chip_crack: 27
excessive_flash: 0
improper_welding: 21
light_stain: 50
line_crack: 423
particle_material: 312
residue_stain: 0
unknown: 53
welding_blob: 0


In [96]:
# Convert counts into DataFrame format
data = {
    "Defect Type": [category_mapping[cat_id] for cat_id in original_counts.keys()],
    "Original Count": [original_counts[cat_id] for cat_id in original_counts.keys()],
    "Sliced Count": [sliced_counts.get(cat_id, 0) for cat_id in original_counts.keys()]
}

df = pd.DataFrame(data)

print("Class Imbalance Comparison")
print(df.to_string(index=False))

Class Imbalance Comparison
      Defect Type  Original Count  Sliced Count
    blocked_valve               1             2
           bubble              34            51
       chip_crack              15            27
  excessive_flash               0             0
 improper_welding              13            21
      light_stain              34            50
       line_crack             246           423
particle_material             181           312
    residue_stain               0             0
          unknown              24            53
     welding_blob               0             0


#### Interpretation

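- **Class Imbalance Remains**: In both the original and sliced datasets, `line_crack` and `particle_material` are far more common than every other class. Together they account for 427 of 548 boxes (about 78%) in the original data and 735 of 939 (about 78%) after slicing.

- **Increased Counts After Slicing**: Every class present in the original data gained boxes after slicing (for example, `chip_crack` rose from 15 to 27 and `light_stain` from 34 to 50). This adds training examples, although part of the increase may come from the same defect appearing in more than one slice. Rare defects stay rare: `blocked_valve` has only 1 instance in the original data and 2 after slicing, and `excessive_flash`, `residue_stain`, and `welding_blob` have none in either dataset.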
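
To put numbers on this interpretation, the sketch below computes each class's share of the dataset and its growth factor after slicing. It uses the non-zero counts printed in the comparison table above; the helper name `summarize_counts` is hypothetical.

```python
def summarize_counts(original, sliced):
    """Return {class: (original share, sliced share, growth factor or None)}."""
    total_original = sum(original.values())
    total_sliced = sum(sliced.values())
    summary = {}
    for name, n_orig in original.items():
        n_sliced = sliced.get(name, 0)
        growth = n_sliced / n_orig if n_orig else None  # undefined for absent classes
        summary[name] = (n_orig / total_original, n_sliced / total_sliced, growth)
    return summary


original = {"blocked_valve": 1, "bubble": 34, "chip_crack": 15, "improper_welding": 13,
            "light_stain": 34, "line_crack": 246, "particle_material": 181, "unknown": 24}
sliced = {"blocked_valve": 2, "bubble": 51, "chip_crack": 27, "improper_welding": 21,
          "light_stain": 50, "line_crack": 423, "particle_material": 312, "unknown": 53}

summary = summarize_counts(original, sliced)
for name, (share_orig, share_sliced, growth) in summary.items():
    growth_text = f"x{growth:.2f}" if growth is not None else "n/a"
    print(f"{name}: {share_orig:.1%} -> {share_sliced:.1%} ({growth_text})")
```

The shares change only slightly (the largest shift, for `unknown`, is from about 4.4% to 5.6%), so slicing adds examples without reducing the imbalance between classes.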

## Bounding Box Size Checker

In [94]:
# Function to calculate bounding box area
def calculate_area(bbox):
    return bbox[2] * bbox[3]  # width * height

# Collect all bounding box areas from the original dataset
all_areas = [calculate_area(ann['bbox']) for ann in original_annotations['annotations']]

# Calculate dynamic thresholds based on percentiles
small_threshold = np.percentile(all_areas, 33)
medium_threshold = np.percentile(all_areas, 66)

print("Small Threshold: ", small_threshold)
print("Medium Threshold: ", medium_threshold)
print("All Areas: ", all_areas)

Small Threshold:  184.98383589988535
Medium Threshold:  872.8431891833333
All Areas:  [11250.788755982963, 3684.9086114129277, 1729.6509808672772, 706.2389241390035, 536.200551158886, 956.4658480131437, 524.1239621688045, 127.2271400619153, 149.1001533907278, 57.86618380479451, 28.05633154172121, 19.63272858873214, 1771.2918776398874, 2486.0236879155777, 6605.720085032894, 225.5296437842936, 52.17205830644165, 634.823906021296, 488.32608155478863, 2097.582486678786, 110.25893696121001, 1194.215924463596, 165.09897573229276, 128.49490685429316, 3573.99412372357, 3529.6336383840285, 83.94520793686539, 71.80437153737263, 1194.2669232972185, 2209.8708494682955, 8257.767394851355, 1520.9111140458522, 5661.652589546959, 484.6091519332006, 261.978879143369, 608.8679088391576, 298.22101657427, 159.7101634148239, 2487.6613966990813, 128.59151956620022, 871.7347203495611, 252.50698387546726, 618.1394345547818, 75.25879292348442, 6160.859301859299, 2439.670589034275, 1372.0237741698409, 175.64256

### Functions

**Benefits of Dynamic Thresholds**

1. **Adaptability**: Because the thresholds are the 33rd and 66th percentiles of the original dataset's bounding box areas, they reflect the actual distribution of box sizes.
2. **Scalability**: The thresholds adjust automatically if the dataset grows or its defect types and bounding box sizes change.
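
As a small worked example of how such thresholds are derived, the sketch below computes the 33rd and 66th percentiles of a handful of made-up areas and sorts each area into a size bucket. It uses a pure-Python percentile with linear interpolation, which matches the default behavior of `np.percentile`. The sample areas and the names `percentile`, `size_category`, `example_small`, and `example_medium` are illustrative assumptions.

```python
def percentile(values, q):
    """q-th percentile (0-100) of a non-empty list, using linear interpolation."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def size_category(area, small_threshold, medium_threshold):
    """Bucket an area using the same comparisons as the notebook's categorizer."""
    if area > medium_threshold:
        return "Large"
    if area > small_threshold:
        return "Medium"
    return "Small"


sample_areas = [20, 50, 120, 300, 800, 1500, 4000]  # illustrative values only
example_small = percentile(sample_areas, 33)
example_medium = percentile(sample_areas, 66)
categories = [size_category(a, example_small, example_medium) for a in sample_areas]

print(f"Small threshold: {example_small:.1f}, medium threshold: {example_medium:.1f}")
for area, category in zip(sample_areas, categories):
    print(area, category)
```

Because the thresholds come from the data, roughly a third of the sample lands in each bucket regardless of the absolute scale of the areas.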

In [93]:
# Categorize a bounding box area as Small, Medium, or Large using the dynamic thresholds
def categorize_by_dynamic_size(area, small_threshold, medium_threshold):
    if area > medium_threshold:
        return "Large"
    elif area > small_threshold:
        return "Medium"
    else:
        return "Small"
    
# Prepare data for bounding box size analysis with dynamic thresholds
def create_bbox_size_data(annotations, category_mapping, small_threshold, medium_threshold):
    bbox_data = []
    for ann in annotations['annotations']:
        defect_type = category_mapping[ann['category_id']]
        area = calculate_area(ann['bbox'])
        size_category = categorize_by_dynamic_size(area, small_threshold, medium_threshold)
        bbox_data.append({
            "Defect Type": defect_type,
            "Bounding Box Area": area,
            "Size Category": size_category
        })
    return bbox_data

In [47]:
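# Build per-annotation size data (defect type, area, size category) for both datasets
original_bbox_data = create_bbox_size_data(original_annotations, category_mapping, small_threshold, medium_threshold)
sliced_bbox_data = create_bbox_size_data(sliced_annotations, category_mapping, small_threshold, medium_threshold)

original_bbox_df = pd.DataFrame(original_bbox_data)
sliced_bbox_df = pd.DataFrame(sliced_bbox_data)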

# Group by defect type and size category, counting occurrences
grouped_original_bbox_df = original_bbox_df.groupby(['Defect Type', 'Size Category']).size().reset_index(name='Original Count')
grouped_sliced_bbox_df = sliced_bbox_df.groupby(['Defect Type', 'Size Category']).size().reset_index(name='Sliced Count')

# Merge original and sliced data on defect type and size category
merged_bbox_df = pd.merge(grouped_original_bbox_df, grouped_sliced_bbox_df, on=['Defect Type', 'Size Category'], how='outer').fillna(0)

# Converting the count to integers
merged_bbox_df['Original Count'] = merged_bbox_df['Original Count'].astype(int)
merged_bbox_df['Sliced Count'] = merged_bbox_df['Sliced Count'].astype(int)

print("Grouped by Dynamic Bounding Box Size Category")
print(merged_bbox_df)

Grouped by Dynamic Bounding Box Size Category
          Defect Type Size Category  Original Count  Sliced Count
0       blocked_valve         Large               1             2
1              bubble         Large               4             7
2              bubble        Medium               8            12
3              bubble         Small              22            32
4          chip_crack         Large              10            20
5          chip_crack        Medium               4             6
6          chip_crack         Small               1             1
7    improper_welding         Large               6             9
8    improper_welding        Medium               4             7
9    improper_welding         Small               3             5
10        light_stain         Large              17            23
11        light_stain        Medium              14            23
12        light_stain         Small               3             4
13         line_crack         

In [92]:
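def calculate_iou(boxA, boxB):
    """Return the intersection over union of two COCO boxes given as [x, y, width, height]."""
    # Corners of the intersection rectangle
    xA = max(boxA[0], boxB[0])
    yA = max(boxA[1], boxB[1])
    xB = min(boxA[0] + boxA[2], boxB[0] + boxB[2])
    yB = min(boxA[1] + boxA[3], boxB[1] + boxB[3])

    # Clamp to zero when the boxes do not overlap
    inter_width = max(0, xB - xA)
    inter_height = max(0, yB - yA)
    inter_area = inter_width * inter_height

    boxA_area = boxA[2] * boxA[3]
    boxB_area = boxB[2] * boxB[3]

    union_area = boxA_area + boxB_area - inter_area
    if union_area <= 0:
        return 0.0  # Degenerate (zero-area) boxes: avoid division by zero
    return inter_area / float(union_area)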

def remove_duplicate_bboxes(annotations, iou_threshold=0.7):
    # Set to track which annotation IDs have been marked as duplicates
    removed_ids = set()

    # Group annotations by image and category. Bounding box coordinates are
    # relative to their own image, so boxes from different images are never compared.
    groups = defaultdict(list)
    for ann in annotations:
        groups[(ann['image_id'], ann['category_id'])].append(ann)

    # Loop through each group's annotations to detect duplicates
    for bboxes in groups.values():
        # Compare each pair of bounding boxes within the same image and category
        for i in range(len(bboxes)):
            for j in range(i + 1, len(bboxes)):
                # Skip if either bounding box has already been marked as duplicate
                if bboxes[i]['id'] in removed_ids or bboxes[j]['id'] in removed_ids:
                    continue

                iou = calculate_iou(bboxes[i]['bbox'], bboxes[j]['bbox'])

                if iou > iou_threshold:
                    # Mark the second bounding box as duplicate
                    removed_ids.add(bboxes[j]['id'])

    # Collect annotations that are not marked as duplicates
    unique_annotations = [ann for ann in annotations if ann['id'] not in removed_ids]

    return unique_annotations, removed_ids
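
For intuition about the `iou_threshold` of 0.7, here is a worked IoU calculation on three pairs of illustrative boxes in COCO `[x, y, width, height]` format. The function `iou_xywh` is a compact restatement of the IoU formula written for this demonstration, and the box values are made up.

```python
def iou_xywh(box_a, box_b):
    """Intersection over union for two [x, y, width, height] boxes."""
    x_left = max(box_a[0], box_b[0])
    y_top = max(box_a[1], box_b[1])
    x_right = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y_bottom = min(box_a[1] + box_a[3], box_b[1] + box_b[3])

    intersection = max(0, x_right - x_left) * max(0, y_bottom - y_top)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0


half_shifted = iou_xywh([0, 0, 10, 10], [5, 5, 10, 10])  # overlap 5x5 = 25, union 175
nearly_same = iou_xywh([0, 0, 10, 10], [1, 0, 10, 10])   # overlap 9x10 = 90, union 110
disjoint = iou_xywh([0, 0, 10, 10], [20, 20, 5, 5])      # no overlap

print(f"{half_shifted:.3f} {nearly_same:.3f} {disjoint:.3f}")  # prints: 0.143 0.818 0.000
```

With a threshold of 0.7, only the second pair would be treated as duplicates. A box shifted by half its width and height scores far below the threshold.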

In [91]:
# Remove duplicates
unique_annotations, removed_ids = remove_duplicate_bboxes(sliced_annotations['annotations'], iou_threshold=0.7)

# Print results
print(f"Total bounding boxes before removing duplicates: {len(sliced_annotations['annotations'])}")
print(f"Total bounding boxes after removing duplicates: {len(unique_annotations)}")
print(f"Number of duplicates removed: {len(removed_ids)}")

sliced_annotations['annotations'] = unique_annotations
with open(f"../data/coco/{coco_file_name}_sliced_coco.json", "w") as f:  # Same sliced file that was loaded above
    json.dump(sliced_annotations, f)

Total bounding boxes before removing duplicates: 939
Total bounding boxes after removing duplicates: 939
Number of duplicates removed: 0


In [90]:
def calculate_class_counts_deduped(annotations, category_mapping):
    bbox_counts = Counter(ann['category_id'] for ann in annotations)
    return {category_mapping[cat_id]: count for cat_id, count in bbox_counts.items()}

# Calculate class counts on deduplicated annotations
deduped_counts = calculate_class_counts_deduped(unique_annotations, category_mapping)

# Display the class imbalance in a DataFrame format
deduped_counts_df = pd.DataFrame(list(deduped_counts.items()), columns=['Defect Type', 'Bounding Box Count'])
deduped_counts_df = deduped_counts_df.sort_values(by="Bounding Box Count", ascending=False).reset_index(drop=True)

print("Class Imbalance After Deduplication")
print(deduped_counts_df)

Class Imbalance After Deduplication
         Defect Type  Bounding Box Count
0         line_crack                 423
1  particle_material                 312
2            unknown                  53
3             bubble                  51
4        light_stain                  50
5         chip_crack                  27
6   improper_welding                  21
7      blocked_valve                   2


In [89]:
# Create bounding box data for original and sliced annotations
original_bbox_data = create_bbox_size_data(original_annotations, category_mapping, small_threshold, medium_threshold)
sliced_bbox_data = create_bbox_size_data(sliced_annotations, category_mapping, small_threshold, medium_threshold)

# Remove duplicates from sliced dataset
unique_annotations, removed_ids = remove_duplicate_bboxes(sliced_annotations['annotations'], iou_threshold=0.7)
sliced_annotations['annotations'] = unique_annotations  # Overwrite sliced_annotations with deduped annotations

# Calculate class counts after deduplication
deduped_counts = calculate_class_counts_deduped(unique_annotations, category_mapping)

# Prepare data for DataFrame display
data = {
    "Defect Type": [category_mapping[cat_id] for cat_id in original_counts.keys()],
    "Original Count": [original_counts[cat_id] for cat_id in original_counts.keys()],
    "Sliced Count": [sliced_counts.get(cat_id, 0) for cat_id in original_counts.keys()],
    "Deduplicated Sliced Count": [deduped_counts.get(category_mapping[cat_id], 0) for cat_id in original_counts.keys()]
}

# Create and display DataFrame
df = pd.DataFrame(data).sort_values(by="Original Count", ascending=False).reset_index(drop=True)
print("Class Imbalance Comparison\n", df)

# Display the bounding box data for original and sliced images
#print("\nBounding Box Data for Original Dataset:", original_bbox_data)
#print("\nBounding Box Data for Sliced Dataset (After Deduplication):", sliced_bbox_data)

# Optional: Save deduplicated annotations back to the sliced annotation file
with open(f"../data/coco/{coco_file_name}_sliced_coco.json", "w") as f:
    json.dump(sliced_annotations, f)

Class Imbalance Comparison
           Defect Type  Original Count  Sliced Count  Deduplicated Sliced Count
0          line_crack             246           423                        423
1   particle_material             181           312                        312
2              bubble              34            51                         51
3         light_stain              34            50                         50
4             unknown              24            53                         53
5          chip_crack              15            27                         27
6    improper_welding              13            21                         21
7       blocked_valve               1             2                          2
8     excessive_flash               0             0                          0
9       residue_stain               0             0                          0
10       welding_blob               0             0                          0


In [88]:
# Calculate size-category-based counts
def calculate_size_category_counts(bbox_data):
    size_category_counts = defaultdict(lambda: defaultdict(int))
    for data in bbox_data:
        defect_type = data["Defect Type"]
        size_category = data["Size Category"]
        size_category_counts[defect_type][size_category] += 1
    return size_category_counts

# Calculate counts for original, sliced, and deduplicated sliced datasets
original_size_counts = calculate_size_category_counts(original_bbox_data)
sliced_size_counts = calculate_size_category_counts(sliced_bbox_data)
deduped_size_counts = calculate_size_category_counts(create_bbox_size_data(sliced_annotations, category_mapping, small_threshold, medium_threshold))

size_data = []
for defect_type in category_mapping.values():  # Iterate over all known defect types
    for size_category in ["Small", "Medium", "Large"]:
        size_data.append({
            "Defect Type": defect_type,
            "Size Category": size_category,
            "Original Count": original_size_counts[defect_type].get(size_category, 0),
            "Sliced Count": sliced_size_counts[defect_type].get(size_category, 0),
            "Deduplicated Sliced Count": deduped_size_counts[defect_type].get(size_category, 0)
        })

# Create and display DataFrame
df = pd.DataFrame(size_data).sort_values(by=["Defect Type", "Size Category"]).reset_index(drop=True)
print("Class Imbalance by Size Category\n", df)

# Optional: Save deduplicated annotations back to the sliced annotation file
with open(f"../data/coco/{coco_file_name}_sliced_coco.json", "w") as f:
    json.dump(sliced_annotations, f)

Class Imbalance by Size Category
           Defect Type Size Category  Original Count  Sliced Count  \
0       blocked_valve         Large               1             2   
1       blocked_valve        Medium               0             0   
2       blocked_valve         Small               0             0   
3              bubble         Large               4             7   
4              bubble        Medium               8            12   
5              bubble         Small              22            32   
6          chip_crack         Large              10            20   
7          chip_crack        Medium               4             6   
8          chip_crack         Small               1             1   
9     excessive_flash         Large               0             0   
10    excessive_flash        Medium               0             0   
11    excessive_flash         Small               0             0   
12   improper_welding         Large               6             9   


In [87]:
def calculate_median_bbox_size(annotations, category_mapping):
    bbox_sizes = defaultdict(list)
    for ann in annotations['annotations']:
        defect_type = category_mapping[ann['category_id']]
        bbox_area = calculate_area(ann['bbox'])
        bbox_sizes[defect_type].append(bbox_area)
    return {defect_type: np.median(sizes) if sizes else 0 for defect_type, sizes in bbox_sizes.items()}


def classify_defect_scale(median_bbox_sizes, category_mapping):
    # Set a dynamic threshold using the median of non-zero median sizes
    size_threshold = np.median([size for size in median_bbox_sizes.values() if size > 0])
    return {
        defect_type: "Large Scale" if median_size > size_threshold else "Small Scale"
        for defect_type, median_size in median_bbox_sizes.items()
    }
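
The classification rule can be checked by hand. The sketch below applies the same median-of-medians threshold to the per-class medians that this notebook reports for the original dataset. It uses `statistics.median` from the standard library instead of NumPy, an assumption made only to keep the example dependency-free.

```python
from statistics import median

# Median bounding box area per defect type (original dataset), as reported in this notebook
median_sizes = {
    "blocked_valve": 14686.476591, "unknown": 2102.931893, "chip_crack": 1112.886786,
    "line_crack": 937.670135, "light_stain": 878.451083, "improper_welding": 864.219932,
    "bubble": 160.517035, "particle_material": 83.051081,
}

# Threshold: median of the non-zero per-class medians
threshold = median(size for size in median_sizes.values() if size > 0)

# A class is "Large Scale" only if its median area is strictly above the threshold
scale = {
    name: "Large Scale" if size > threshold else "Small Scale"
    for name, size in median_sizes.items()
}

print(f"Threshold: {threshold:.2f}")
for name, label in scale.items():
    print(f"{name}: {label}")
```

With eight classes, the threshold falls midway between the medians of `line_crack` and `light_stain`, so four classes are labeled "Large Scale" and four "Small Scale". The labels are therefore relative to the other classes in the dataset, not to any absolute size.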

In [86]:
bbox_sizes = defaultdict(list)

for ann in original_annotations['annotations']:
    defect_type = category_mapping[ann['category_id']]
    bbox_area = calculate_area(ann['bbox'])
    bbox_sizes[defect_type].append(bbox_area)

# Calculate the median size for each defect type, including all categories in the mapping
median_bbox_sizes = {defect_type: np.median(sizes) if sizes else 0 for defect_type, sizes in bbox_sizes.items()}
median_bbox_sizes = {defect_type: median_bbox_sizes.get(defect_type, 0) for defect_type in category_mapping.values()}

# Determine a threshold for "Small Scale" vs "Large Scale" using the median of non-zero medians
size_threshold = np.median([size for size in median_bbox_sizes.values() if size > 0])

# Classify each defect type based on the threshold
scale_classification = {
    defect_type: "Large Scale" if median_size > size_threshold else "Small Scale"
    for defect_type, median_size in median_bbox_sizes.items()
}

# Remove duplicates from sliced dataset
unique_annotations, removed_ids = remove_duplicate_bboxes(sliced_annotations['annotations'], iou_threshold=0.7)
sliced_annotations['annotations'] = unique_annotations  # Overwrite with deduplicated annotations

# Calculate median bounding box sizes and classify scale for original annotations
median_bbox_sizes_original = calculate_median_bbox_size(original_annotations, category_mapping)
scale_classification_original = classify_defect_scale(median_bbox_sizes_original, category_mapping)

# Calculate median bounding box sizes and classify scale for sliced annotations
median_bbox_sizes_sliced = calculate_median_bbox_size(sliced_annotations, category_mapping)
scale_classification_sliced = classify_defect_scale(median_bbox_sizes_sliced, category_mapping)

# Prepare data for display
scale_data_original = [
    {"Defect Type": defect_type, "Median Bounding Box Size": median_size, "Scale Classification": scale_classification_original[defect_type]}
    for defect_type, median_size in median_bbox_sizes_original.items()
]

scale_data_sliced = [
    {"Defect Type": defect_type, "Median Bounding Box Size": median_size, "Scale Classification": scale_classification_sliced[defect_type]}
    for defect_type, median_size in median_bbox_sizes_sliced.items()
]

# Create and display DataFrames
df_original = pd.DataFrame(scale_data_original).sort_values(by="Median Bounding Box Size", ascending=False).reset_index(drop=True)
df_sliced = pd.DataFrame(scale_data_sliced).sort_values(by="Median Bounding Box Size", ascending=False).reset_index(drop=True)

print("\nDefect Type Scale Classification (Original)\n", df_original)
print("\nDefect Type Scale Classification (Sliced)\n", df_sliced)


Defect Type Scale Classification (Original)
          Defect Type  Median Bounding Box Size Scale Classification
0      blocked_valve              14686.476591          Large Scale
1            unknown               2102.931893          Large Scale
2         chip_crack               1112.886786          Large Scale
3         line_crack                937.670135          Large Scale
4        light_stain                878.451083          Small Scale
5   improper_welding                864.219932          Small Scale
6             bubble                160.517035          Small Scale
7  particle_material                 83.051081          Small Scale

Defect Type Scale Classification (Sliced)
          Defect Type  Median Bounding Box Size Scale Classification
0      blocked_valve              11270.149334          Large Scale
1            unknown               1898.650919          Large Scale
2         chip_crack               1113.572241          Large Scale
3         line_crack      