# Class Imbalance Checker Script

The class imbalance checker script is a diagnostic tool that:

- Confirms that the slicing process has not unintentionally altered the dataset (such as by adding new defect types).
  
- Identifies any class imbalances in the original dataset that may require balancing measures.
  
- Highlights any changes in the distribution of bounding box sizes due to slicing, allowing you to see if the slicing process fragments large defects excessively.

<details open>
<summary>Why Does Class Imbalance Matter in ML?</summary>
<br>
When training a model, we want it to identify all defect types, regardless of their frequency in the dataset. If some defect types are rare, the model might:

- **Predict only the common defects**: It might say “crack” for every image, missing other defects.
  
- **Miss rare but important defects**: Some rare defects could be critical, and we want the model to catch those as well.

In this notebook:

Each defect type (e.g., `line_crack`, `particle_material`) appears with a certain frequency in both the original and the sliced data.
A discrepancy between the two counts indicates that a class became more or less frequent after slicing.
For example, if `residue_stain` had a count of 4 in the original data but 0 in the sliced data, the slicing process would have dropped that defect type entirely.
</details>
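
The discrepancy check described above can be sketched in a few lines. This is a minimal illustration, not part of the notebook's pipeline: `find_count_discrepancies` is a hypothetical helper, and the counts are made up to mirror the `residue_stain` example.

```python
def find_count_discrepancies(original_counts, sliced_counts):
    """Return (lost, introduced): classes that vanish after slicing,
    and classes that only appear after slicing."""
    lost = sorted(
        name for name, n in original_counts.items()
        if n > 0 and sliced_counts.get(name, 0) == 0
    )
    introduced = sorted(
        name for name, n in sliced_counts.items()
        if n > 0 and original_counts.get(name, 0) == 0
    )
    return lost, introduced


# Hypothetical counts mirroring the residue_stain example
original = {"line_crack": 22, "residue_stain": 4}
sliced = {"line_crack": 33, "residue_stain": 0}

lost, introduced = find_count_discrepancies(original, sliced)
print("Lost after slicing:", lost)
print("Introduced by slicing:", introduced)
```

An empty `lost` list and an empty `introduced` list together suggest that slicing preserved the set of defect types, although the per-class counts may still have shifted.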




| Step                          | Output                        | Purpose                                                        |
|-------------------------------|-------------------------------|----------------------------------------------------------------|
| 1                             | Loaded datasets               | Verify data structure and integrity                            |
| 2                             | Defect types in each dataset  | Identify unexpected defect types in sliced data               |
| 3                             | Total class counts            | Compare overall defect counts before and after slicing         |
| 4                             | Size-based classification     | Analyze if slicing affects size distribution of defects       |
| 5                             | Final summary table           | Comprehensive view of class imbalance and size distribution, guiding augmentation and balancing decisions |

| Step                     | Action                                           | Goal                                                      |
|--------------------------|--------------------------------------------------|-----------------------------------------------------------|
| 1. Count Each Defect     | Check how many times each defect type appears.   | Identify if any defect type is much rarer than others.    |
| 2. Apply Augmentation    | Create new variations of rare defects.           | Increase the number of examples for rare classes.         |
| 3. Use Class Weights     | Assign higher importance to rare classes.        | Encourage the model to focus more on rare defects.        |
| 4. Balance the Dataset   | Use sampling techniques to even class counts.    | Ensure the model sees all defect types fairly often.      |
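
As a rough illustration of step 3, class weights can be derived from the per-class counts. The sketch below uses the inverse-frequency heuristic `total / (n_classes * count)`, which is the same formula scikit-learn applies for `class_weight="balanced"`. The helper name `inverse_frequency_weights` is hypothetical, the counts are the original-dataset counts reported later in this notebook, and the sketch assumes every class has at least one annotation.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by total / (n_classes * count).

    Assumes every count is positive; a zero count would divide by zero.
    """
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        name: total / (n_classes * count)
        for name, count in class_counts.items()
    }


# Original-dataset counts as printed further down in this notebook
counts = {
    "blocked_valve": 1,
    "bubble": 1,
    "light_stain": 4,
    "line_crack": 22,
    "particle_material": 18,
    "unknown": 5,
}

weights = inverse_frequency_weights(counts)
for name, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{name:>18}: {weight:.2f}")
```

Rare classes receive weights above 1 and common classes weights below 1. How the weights are passed to the loss function depends on the training framework, which this notebook does not specify.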


In [47]:
%pip install pandas pycocotools matplotlib numpy pillow

Note: you may need to restart the kernel to use updated packages.


In [48]:
# class_imbalance_check.py
import json
from collections import Counter, defaultdict
import os
from PIL import Image, ImageDraw
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from pycocotools.coco import COCO
import shutil

In [49]:
def calculate_class_counts(coco_obj):
    # Count annotations per category ID; categories without annotations are omitted
    return Counter(cat_id for cat_id in coco_obj.getCatIds() for _ in coco_obj.getAnnIds(catIds=[cat_id]))

def calculate_area(bbox):
    return bbox[2] * bbox[3]

def categorize_by_dynamic_size(area, small_threshold, medium_threshold):
    if area > medium_threshold:
        return "Large"
    elif area > small_threshold:
        return "Medium"
    return "Small"

def create_bbox_size_data(annotations, category_mapping, small_threshold, medium_threshold):
    return [
        {
            "Defect Type": category_mapping[ann['category_id']],
            "Bounding Box Area": calculate_area(ann['bbox']),
            "Size Category": categorize_by_dynamic_size(calculate_area(ann['bbox']), small_threshold, medium_threshold),
        }
        for ann in annotations['annotations']
    ]

def calculate_size_category_counts(bbox_data):
    size_counts = defaultdict(lambda: defaultdict(int))
    for data in bbox_data:
        size_counts[data["Defect Type"]][data["Size Category"]] += 1
    return size_counts

def calculate_iou(boxA, boxB):
    xA, yA = max(boxA[0], boxB[0]), max(boxA[1], boxB[1])
    xB, yB = min(boxA[0] + boxA[2], boxB[0] + boxB[2]), min(boxA[1] + boxA[3], boxB[1] + boxB[3])
    inter_area = max(0, xB - xA) * max(0, yB - yA)
    union_area = boxA[2] * boxA[3] + boxB[2] * boxB[3] - inter_area
    return inter_area / union_area if union_area > 0 else 0.0

def remove_duplicate_bboxes(annotations, iou_threshold=0.7):
    removed_ids = set()
    unique_annotations = []
    category_groups = defaultdict(list)

    for ann in annotations:
        category_groups[ann['category_id']].append(ann)

    for bboxes in category_groups.values():
        for i in range(len(bboxes)):
            for j in range(i + 1, len(bboxes)):
                if bboxes[i]['id'] in removed_ids or bboxes[j]['id'] in removed_ids:
                    continue
                if calculate_iou(bboxes[i]['bbox'], bboxes[j]['bbox']) > iou_threshold:
                    removed_ids.add(bboxes[j]['id'])

    unique_annotations = [ann for ann in annotations if ann['id'] not in removed_ids]
    return unique_annotations, removed_ids


In [50]:
coco_file_name = 'cassette1_val'
base_path = "../data/coco/"

with open(f"{base_path}{coco_file_name}_corrected_coco.json") as f:
    original_annotations = json.load(f)

with open(f"{base_path}{coco_file_name}_sliced_coco.json") as f:
    sliced_annotations = json.load(f)

coco_original = COCO(f"{base_path}{coco_file_name}_corrected_coco.json")
coco_sliced = COCO(f"{base_path}{coco_file_name}_sliced_coco.json")

print(original_annotations)
print(sliced_annotations)

# Inspect the structure of both datasets to ensure they are loaded correctly
print("\nKeys in original dataset:", original_annotations.keys())
print("Keys in sliced dataset:", sliced_annotations.keys())
print("Sample data from original dataset:", original_annotations['annotations'][:2])
print("Sample data from sliced dataset:", sliced_annotations['annotations'][:2])

loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
{'images': [{'width': 4096, 'height': 2000, 'id': 4, 'file_name': '01BN02.bmp'}, {'width': 4096, 'height': 2000, 'id': 25, 'file_name': '01FW01.bmp'}], 'annotations': [{'id': 98, 'image_id': 4, 'category_id': 5, 'segmentation': [], 'bbox': [2214.833880112831, 223.5076988929449, 21.737717054118985, 45.54569287529642], 'ignore': 0, 'iscrowd': 0, 'area': 990.0593847569965}, {'id': 99, 'image_id': 4, 'category_id': 5, 'segmentation': [], 'bbox': [2729.2931837269757, 203.84024060588504, 28.983622738824923, 38.29978719059021], 'ignore': 0, 'iscrowd': 0, 'area': 1110.066582909346}, {'id': 100, 'image_id': 4, 'category_id': 5, 'segmentation': [], 'bbox': [1882.5573479998739, 603.4001826482585, 12.421552602353332, 22.772846437648226], 'ignore': 0, 'iscrowd': 0, 'area': 282.87410993056216}, {'id': 101, 'image_id': 4, 'category_id': 

## Count Checker

In [51]:
# Calculate total counts for original and sliced datasets
original_counts = calculate_class_counts(coco_original)
sliced_counts = calculate_class_counts(coco_sliced)

coco_original = COCO(f"{base_path}{coco_file_name}_corrected_coco.json")
coco_sliced = COCO(f"{base_path}{coco_file_name}_sliced_coco.json")

category_mapping = {cat['id']: cat['name'] for cat in coco_original.loadCats(coco_original.getCatIds())}

all_areas = [calculate_area(ann['bbox']) for ann in original_annotations['annotations']]
small_threshold, medium_threshold = np.percentile(all_areas, 33), np.percentile(all_areas, 66)

original_bbox_data = create_bbox_size_data(original_annotations, category_mapping, small_threshold, medium_threshold)
sliced_bbox_data = create_bbox_size_data(sliced_annotations, category_mapping, small_threshold, medium_threshold)

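# Drop near-duplicate boxes (IoU > 0.7 within a category); the counts and size data above were computed before this step
unique_annotations, removed_ids = remove_duplicate_bboxes(sliced_annotations['annotations'], iou_threshold=0.7)
sliced_annotations['annotations'] = unique_annotations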

print(original_counts)
print(sliced_counts)

loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Counter({6: 22, 7: 18, 9: 5, 5: 4, 0: 1, 1: 1})
Counter({7: 34, 6: 33, 9: 14, 5: 6, 0: 2, 1: 1})


In [52]:
# Map category IDs to names for readability
category_mapping = {cat['id']: cat['name'] for cat in coco_original.loadCats(original_counts.keys())}
original_counts_named = {category_mapping[k]: v for k, v in original_counts.items()}
sliced_counts_named = {category_mapping.get(k, 'Unknown'): v for k, v in sliced_counts.items()}

print("Category Mapping: ",category_mapping)
print("Bounding Box Counts in Original Dataset:", original_counts_named)
print("Bounding Box Counts in Sliced Dataset:", sliced_counts_named)

Category Mapping:  {0: 'blocked_valve', 1: 'bubble', 5: 'light_stain', 6: 'line_crack', 7: 'particle_material', 9: 'unknown'}
Bounding Box Counts in Original Dataset: {'blocked_valve': 1, 'bubble': 1, 'light_stain': 4, 'line_crack': 22, 'particle_material': 18, 'unknown': 5}
Bounding Box Counts in Sliced Dataset: {'blocked_valve': 2, 'bubble': 1, 'light_stain': 6, 'line_crack': 33, 'particle_material': 34, 'unknown': 14}


In [53]:
# Calculate the total number of bounding boxes across all defect types
total_original_bboxes = sum(original_counts.values())
total_sliced_bboxes = sum(sliced_counts.values())

print(f"\nTotal bounding boxes in the original dataset: {total_original_bboxes}")
print(f"Total bounding boxes in the sliced dataset: {total_sliced_bboxes}")


Total bounding boxes in the original dataset: 51
Total bounding boxes in the sliced dataset: 90


In [54]:

# Check unique defect types in each dataset
original_defect_types = {ann['category_id'] for ann in original_annotations['annotations']}
sliced_defect_types = {ann['category_id'] for ann in sliced_annotations['annotations']}

print("Defect types in the original dataset:", original_defect_types)
print("Defect types in the sliced dataset:", sliced_defect_types)

# Ensure that all defect types in the sliced dataset exist in the original dataset
unexpected_defects = sliced_defect_types - original_defect_types
if unexpected_defects:
    print("Warning: These defect types appear in the sliced dataset but not in the original dataset:", unexpected_defects)
else:
    print("All defect types in the sliced dataset are also present in the original dataset.")

Defect types in the original dataset: {0, 1, 5, 6, 7, 9}
Defect types in the sliced dataset: {0, 1, 5, 6, 7, 9}
All defect types in the sliced dataset are also present in the original dataset.


In [55]:
print("Class Imbalance in Original Annotations:")
for cat_id, count in original_counts.items():
    print(f"{category_mapping[cat_id]}: {count}")

print("\nClass Imbalance in Sliced Annotations:")
for cat_id, count in sliced_counts.items():
    print(f"{category_mapping[cat_id]}: {count}")

Class Imbalance in Original Annotations:
blocked_valve: 1
bubble: 1
light_stain: 4
line_crack: 22
particle_material: 18
unknown: 5

Class Imbalance in Sliced Annotations:
blocked_valve: 2
bubble: 1
light_stain: 6
line_crack: 33
particle_material: 34
unknown: 14


In [56]:
# Convert counts into DataFrame format
data = {
    "Defect Type": [category_mapping[cat_id] for cat_id in original_counts.keys()],
    "Original Count": [original_counts[cat_id] for cat_id in original_counts.keys()],
    "Sliced Count": [sliced_counts.get(cat_id, 0) for cat_id in original_counts.keys()]
}

df = pd.DataFrame(data)

print("Class Imbalance Comparison")
print(df.to_string(index=False))

Class Imbalance Comparison
      Defect Type  Original Count  Sliced Count
    blocked_valve               1             2
           bubble               1             1
      light_stain               4             6
       line_crack              22            33
particle_material              18            34
          unknown               5            14


#### Interpretation

- **Class Imbalance Remains**: In both the original and sliced datasets, `line_crack` and `particle_material` are far **more common** than the other defect types, such as `blocked_valve`, `bubble`, and `light_stain`.

- **Increased Counts After Slicing**: Slicing increased the count of every class except `bubble`, which adds training examples. However, the rarest defects remain rare: `bubble` still has only 1 instance and `blocked_valve` only 2.
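
To put numbers on the interpretation, the sketch below computes a per-class growth factor and a simple imbalance ratio (largest class count divided by smallest) from the counts in the comparison table. The helper `imbalance_ratio` is a hypothetical name and this ratio is only one of several possible imbalance measures.

```python
# Counts copied from the "Class Imbalance Comparison" table
original = {
    "blocked_valve": 1, "bubble": 1, "light_stain": 4,
    "line_crack": 22, "particle_material": 18, "unknown": 5,
}
sliced = {
    "blocked_valve": 2, "bubble": 1, "light_stain": 6,
    "line_crack": 33, "particle_material": 34, "unknown": 14,
}


def imbalance_ratio(counts):
    """Largest class count divided by the smallest non-zero class count."""
    present = [n for n in counts.values() if n > 0]
    return max(present) / min(present)


# Growth factor per class: sliced count relative to original count
growth = {name: sliced.get(name, 0) / n for name, n in original.items() if n > 0}

for name, factor in growth.items():
    print(f"{name:>18}: x{factor:.2f}")
print("Imbalance ratio (original):", imbalance_ratio(original))
print("Imbalance ratio (sliced):  ", imbalance_ratio(sliced))
```

On these counts the ratio rises from 22 to 34, so slicing adds examples overall but widens the gap between the most and least frequent classes.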

## Bounding Box Size Checker

In [57]:
# Function to calculate bounding box area
def calculate_area(bbox):
    return bbox[2] * bbox[3]  # width * height

# Collect all bounding box areas from the original dataset
all_areas = [calculate_area(ann['bbox']) for ann in original_annotations['annotations']]

# Calculate dynamic thresholds based on percentiles
small_threshold = np.percentile(all_areas, 33)
medium_threshold = np.percentile(all_areas, 66)

print("Small Threshold: ", small_threshold)
print("Medium Threshold: ", medium_threshold)
print("All Areas: ", all_areas)

Small Threshold:  250.12635587586595
Medium Threshold:  953.813782137882
All Areas:  [990.0593847569965, 1110.066582909346, 282.87410993056216, 56.33275559766667, 900.3757195004466, 906.7840238195217, 1042.1911786788512, 603.3301968299852, 529.317463279127, 642.8957043876513, 754.9425645364165, 197.72805826358837, 953.813782137882, 489.45707241286266, 47.601199211601596, 3128.9597991885657, 113.55992584693855, 11435.185893943242, 14593.475331318026, 14686.47659130488, 82.62137487657748, 494.65708503305996, 63.76438986754367, 35.1550236437419, 27.22455574764209, 19.293073369316055, 80.28010200613629, 187.33935232430693, 3629.448557420617, 2397.237127632902, 432.8390548598679, 726.7066299716221, 2155.061040605481, 60.13361926459078, 4175.623823354017, 415.86497427712925, 7879.924412757136, 982.0961624978435, 7932.0885674048195, 407.0583457910841, 162.88424158501132, 217.37860182116972, 1463.682585595875, 355.5347798675234, 1704.2482382779692, 37.53094768786432, 422.3218060865453, 871.446

### Functions

**Benefits of Dynamic Thresholds**

1. **Adaptability**: Since the thresholds are derived from the dataset, they better represent the natural distribution of bounding box sizes.
2. **Scalability**: This approach automatically adjusts if the dataset grows or changes in defect types and bounding box sizes.
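
A small worked example may make the thresholding concrete. The sketch below uses only the standard library: the `percentile` helper is hypothetical and is intended to mirror the default linear interpolation of `np.percentile`, and the areas are made-up values rather than data from this notebook.

```python
def percentile(values, q):
    """q-th percentile using linear interpolation between closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def categorize(area, small_threshold, medium_threshold):
    """Same rule as categorize_by_dynamic_size in this notebook."""
    if area > medium_threshold:
        return "Large"
    if area > small_threshold:
        return "Medium"
    return "Small"


# Hypothetical bounding box areas (pixels squared)
areas = [10, 20, 30, 40, 50, 60, 70]

small_threshold = percentile(areas, 33)
medium_threshold = percentile(areas, 66)
categories = [categorize(a, small_threshold, medium_threshold) for a in areas]

print("Small threshold:", small_threshold)
print("Medium threshold:", medium_threshold)
print(list(zip(areas, categories)))
```

Because the thresholds come from the data itself, roughly a third of the boxes fall into each category for the dataset the thresholds were computed on. In this notebook they are computed on the original dataset and then applied unchanged to the sliced one.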

In [58]:
original_bbox_df = pd.DataFrame(original_bbox_data)
sliced_bbox_df = pd.DataFrame(sliced_bbox_data)

# Group by defect type and size category, counting occurrences
grouped_original_bbox_df = original_bbox_df.groupby(['Defect Type', 'Size Category']).size().reset_index(name='Original Count')
grouped_sliced_bbox_df = sliced_bbox_df.groupby(['Defect Type', 'Size Category']).size().reset_index(name='Sliced Count')

# Merge original and sliced data on defect type and size category
merged_bbox_df = pd.merge(grouped_original_bbox_df, grouped_sliced_bbox_df, on=['Defect Type', 'Size Category'], how='outer').fillna(0)

# Converting the count to integers
merged_bbox_df['Original Count'] = merged_bbox_df['Original Count'].astype(int)
merged_bbox_df['Sliced Count'] = merged_bbox_df['Sliced Count'].astype(int)

print("Grouped by Dynamic Bounding Box Size Category")
print(merged_bbox_df)

Grouped by Dynamic Bounding Box Size Category
          Defect Type Size Category  Original Count  Sliced Count
0       blocked_valve         Large               1             2
1              bubble         Small               1             1
2         light_stain         Large               2             2
3         light_stain        Medium               2             4
4          line_crack         Large              10            15
5          line_crack        Medium              11            16
6          line_crack         Small               1             2
7   particle_material        Medium               3             5
8   particle_material         Small              15            29
9             unknown         Large               4            10
10            unknown        Medium               1             4


In [59]:
def calculate_iou(boxA, boxB):
    xA = max(boxA[0], boxB[0])
    yA = max(boxA[1], boxB[1])
    xB = min(boxA[0] + boxA[2], boxB[0] + boxB[2])
    yB = min(boxA[1] + boxA[3], boxB[1] + boxB[3])

    inter_width = max(0, xB - xA)
    inter_height = max(0, yB - yA)
    inter_area = inter_width * inter_height

    boxA_area = boxA[2] * boxA[3]
    boxB_area = boxB[2] * boxB[3]

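    union_area = boxA_area + boxB_area - inter_area
    # Guard against zero-area boxes, which would otherwise divide by zero
    return inter_area / union_area if union_area > 0 else 0.0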

def remove_duplicate_bboxes(annotations, iou_threshold=0.7):
    # List of annotations kept after deduplication
    unique_annotations = []
    # Set to track which annotation IDs have been removed
    removed_ids = set()
    
    # Group annotations by category_id
    category_groups = defaultdict(list)
    for ann in annotations:
        category_groups[ann['category_id']].append(ann)

    # Loop through each category's annotations to detect duplicates
    for cat_id, bboxes in category_groups.items():
        # Compare each pair of bounding boxes within the same category
        for i in range(len(bboxes)):
            for j in range(i + 1, len(bboxes)):
                # Skip if either bounding box has already been marked as duplicate
                if bboxes[i]['id'] in removed_ids or bboxes[j]['id'] in removed_ids:
                    continue
                
                bboxA = bboxes[i]['bbox']
                bboxB = bboxes[j]['bbox']
                iou = calculate_iou(bboxA, bboxB)
                
                if iou > iou_threshold:
                    # Mark the second bounding box as duplicate
                    removed_ids.add(bboxes[j]['id'])

    # Collect annotations that are not marked as duplicates
    unique_annotations = [ann for ann in annotations if ann['id'] not in removed_ids]
    
    return unique_annotations, removed_ids
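
The function above compares boxes across the whole dataset, grouped only by category. If bounding box coordinates are relative to each sliced image (an assumption about the slicing output that this notebook does not confirm), boxes from different images can overlap numerically without being duplicates. The sketch below is a hedged variant that also groups by `image_id`; `iou_xywh` and `remove_duplicates_per_image` are hypothetical names, and the annotations are made-up examples.

```python
from collections import defaultdict


def iou_xywh(box_a, box_b):
    """IoU of two COCO-style [x, y, width, height] boxes."""
    x_left = max(box_a[0], box_b[0])
    y_top = max(box_a[1], box_b[1])
    x_right = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y_bottom = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0, x_right - x_left) * max(0, y_bottom - y_top)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def remove_duplicates_per_image(annotations, iou_threshold=0.7):
    """Drop later boxes that overlap an earlier box of the same image and category."""
    groups = defaultdict(list)
    for ann in annotations:
        groups[(ann["image_id"], ann["category_id"])].append(ann)

    removed_ids = set()
    for group in groups.values():
        for i, first in enumerate(group):
            if first["id"] in removed_ids:
                continue
            for second in group[i + 1:]:
                if second["id"] in removed_ids:
                    continue
                if iou_xywh(first["bbox"], second["bbox"]) > iou_threshold:
                    removed_ids.add(second["id"])

    kept = [ann for ann in annotations if ann["id"] not in removed_ids]
    return kept, removed_ids


# Hypothetical annotations for illustration
annotations = [
    {"id": 1, "image_id": 10, "category_id": 6, "bbox": [0, 0, 10, 10]},
    {"id": 2, "image_id": 10, "category_id": 6, "bbox": [1, 0, 10, 10]},  # IoU 90/110 with id 1
    {"id": 3, "image_id": 11, "category_id": 6, "bbox": [0, 0, 10, 10]},  # same coordinates, other image
    {"id": 4, "image_id": 10, "category_id": 6, "bbox": [5, 0, 10, 10]},  # IoU 50/150 with id 1
]

kept, removed = remove_duplicates_per_image(annotations)
print("Removed IDs:", removed)
print("Kept IDs:", [ann["id"] for ann in kept])
```

Only annotation 2 is dropped here: annotation 3 has identical coordinates but belongs to a different image, and annotation 4 overlaps annotation 1 by less than the threshold.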

In [60]:
original_size_counts = calculate_size_category_counts(original_bbox_data)
sliced_size_counts = calculate_size_category_counts(sliced_bbox_data)
deduped_size_counts = calculate_size_category_counts(create_bbox_size_data(sliced_annotations, category_mapping, small_threshold, medium_threshold))

size_data = [
    {
        "Defect Type": defect_type,
        "Size Category": size_category,
        "Original Count": original_size_counts[defect_type].get(size_category, 0),
        "Sliced Count": sliced_size_counts[defect_type].get(size_category, 0),
        "Deduplicated Sliced Count": deduped_size_counts[defect_type].get(size_category, 0),
    }
    for defect_type in category_mapping.values()
    for size_category in ["Small", "Medium", "Large"]
]

df = pd.DataFrame(size_data).sort_values(by=["Defect Type", "Size Category"]).reset_index(drop=True)
print("Class Imbalance by Size Category\n", df)

Class Imbalance by Size Category
           Defect Type Size Category  Original Count  Sliced Count  \
0       blocked_valve         Large               1             2   
1       blocked_valve        Medium               0             0   
2       blocked_valve         Small               0             0   
3              bubble         Large               0             0   
4              bubble        Medium               0             0   
5              bubble         Small               1             1   
6         light_stain         Large               2             2   
7         light_stain        Medium               2             4   
8         light_stain         Small               0             0   
9          line_crack         Large              10            15   
10         line_crack        Medium              11            16   
11         line_crack         Small               1             2   
12  particle_material         Large               0             0   


In [61]:
def calculate_class_counts_deduped(annotations, category_mapping):
    bbox_counts = Counter(ann['category_id'] for ann in annotations)
    return {category_mapping[cat_id]: count for cat_id, count in bbox_counts.items()}

# Calculate class counts on deduplicated annotations
deduped_counts = calculate_class_counts_deduped(unique_annotations, category_mapping)

# Display the class imbalance in a DataFrame format
deduped_counts_df = pd.DataFrame(list(deduped_counts.items()), columns=['Defect Type', 'Bounding Box Count'])
deduped_counts_df = deduped_counts_df.sort_values(by="Bounding Box Count", ascending=False).reset_index(drop=True)

print("Class Imbalance After Deduplication")
print(deduped_counts_df)

Class Imbalance After Deduplication
         Defect Type  Bounding Box Count
0  particle_material                  34
1         line_crack                  33
2            unknown                  14
3        light_stain                   6
4      blocked_valve                   2
5             bubble                   1


In [66]:
def calculate_median_bbox_size(annotations, category_mapping):
    bbox_sizes = defaultdict(list)
    for ann in annotations['annotations']:
        bbox_sizes[category_mapping[ann['category_id']]].append(calculate_area(ann['bbox']))
    return {defect_type: np.median(sizes) if sizes else 0 for defect_type, sizes in bbox_sizes.items()}

def classify_defect_scale(median_bbox_sizes):
    size_threshold = np.median([size for size in median_bbox_sizes.values() if size > 0])
    return {defect_type: "Large Scale" if size > size_threshold else "Small Scale" for defect_type, size in median_bbox_sizes.items()}

# Overwrite the sliced COCO file with the deduplicated annotations
with open(f"{base_path}{coco_file_name}_sliced_coco.json", "w") as f:
    json.dump(sliced_annotations, f)

median_bbox_sizes_original = calculate_median_bbox_size(original_annotations, category_mapping)
median_bbox_sizes_sliced = calculate_median_bbox_size(sliced_annotations, category_mapping)

scale_classification_original = classify_defect_scale(median_bbox_sizes_original)
scale_classification_sliced = classify_defect_scale(median_bbox_sizes_sliced)

scale_data_original = [
    {"Defect Type": defect_type, "Median Bounding Box Size": median_size, "Scale Classification": scale_classification_original[defect_type]}
    for defect_type, median_size in median_bbox_sizes_original.items()
]

scale_data_sliced = [
    {"Defect Type": defect_type, "Median Bounding Box Size": median_size, "Scale Classification": scale_classification_sliced[defect_type]}
    for defect_type, median_size in median_bbox_sizes_sliced.items()
]

df_original = pd.DataFrame(scale_data_original).sort_values(by="Median Bounding Box Size", ascending=False).reset_index(drop=True)
df_sliced = pd.DataFrame(scale_data_sliced).sort_values(by="Median Bounding Box Size", ascending=False).reset_index(drop=True)

print("\nDefect Type Scale Classification (Original)\n", df_original)
print("\nDefect Type Scale Classification (Sliced)\n", df_sliced)



Defect Type Scale Classification (Original)
          Defect Type  Median Bounding Box Size Scale Classification
0      blocked_valve              14686.476591          Large Scale
1            unknown               7879.924413          Large Scale
2        light_stain                930.753023          Large Scale
3         line_crack                930.298903          Small Scale
4             bubble                217.378602          Small Scale
5  particle_material                 61.949005          Small Scale

Defect Type Scale Classification (Sliced)
          Defect Type  Median Bounding Box Size Scale Classification
0      blocked_valve              11270.149334          Large Scale
1            unknown               1985.273988          Large Scale
2         line_crack                906.784024          Large Scale
3        light_stain                736.192562          Small Scale
4             bubble                217.378602          Small Scale
5  particle_material      

## Transfer

In [65]:
import shutil

def move_coco_file(coco_file_name, source_dir=base_path):
    # Automatically determine target split based on coco_file_name
    if "train" in coco_file_name.lower():
        target_split = "train"
    elif "test" in coco_file_name.lower():
        target_split = "test"
    elif "val" in coco_file_name.lower():
        target_split = "val"
    else:
        print("Unable to determine target split from coco_file_name.")
        return

    # Define source and destination paths
    source_path = os.path.join(source_dir, f"{coco_file_name}_sliced_coco.json")
    destination_path = os.path.join(source_dir, target_split, f"{coco_file_name}_sliced_coco.json")

    # Check if source file exists
    if not os.path.exists(source_path):
        print(f"Source file {source_path} does not exist.")
        return

    # Ensure target directory exists
    os.makedirs(os.path.dirname(destination_path), exist_ok=True)

    # Move the file
    shutil.move(source_path, destination_path)

    source_path = os.path.join(source_dir, f"{coco_file_name}_corrected_coco.json")
    destination_path = os.path.join(source_dir, target_split, f"{coco_file_name}_corrected_coco.json")

    # Check if source file exists
    if not os.path.exists(source_path):
        print(f"Source file {source_path} does not exist.")
        return

    # Ensure target directory exists
    os.makedirs(os.path.dirname(destination_path), exist_ok=True)

    # Move the file
    shutil.move(source_path, destination_path)
    print(f"File moved to {destination_path}")

    

move_coco_file(coco_file_name)

File moved to ../data/coco/val\cassette1_val_corrected_coco.json


In [76]:
# Preview the DataFrames before saving
print(df)
print(df_original)
print(df_sliced)

          Defect Type Size Category  Original Count  Sliced Count  \
0       blocked_valve         Large               1             2   
1       blocked_valve        Medium               0             0   
2       blocked_valve         Small               0             0   
3              bubble         Large               0             0   
4              bubble        Medium               0             0   
5              bubble         Small               1             1   
6         light_stain         Large               2             2   
7         light_stain        Medium               2             4   
8         light_stain         Small               0             0   
9          line_crack         Large              10            15   
10         line_crack        Medium              11            16   
11         line_crack         Small               1             2   
12  particle_material         Large               0             0   
13  particle_material        Mediu

In [77]:
df.to_csv(f'{coco_file_name}_imbalance.csv', index=False)

OSError: [Errno 22] Invalid argument: 'cassette1_val_imbalance.csv'