In [1]:
from analyze import parse_bdd100k_labels

In [2]:
!pwd

/workspace/data_analysis


In [3]:
train_json = "/workspace/bdd100k_labels_release/bdd100k/labels/bdd100k_labels_images_train.json"
val_json = "/workspace/bdd100k_labels_release/bdd100k/labels/bdd100k_labels_images_val.json"

In [None]:
df = parse_bdd100k_labels(train_json)
print(df.head())
print(f"\nParsed {len(df)} annotations across {df['image_name'].nunique()} images.")

In [None]:
from analyze import plot_bdd_piecharts
plot_bdd_piecharts(train_json, val_json)

# Image Attribute Distribution

### Overview
1. Image attributes in the dataset include **weather**, **scene**, and **timeofday**.  
2. The attribute distributions of the **training** and **validation** sets are similar.  
3. However, for each attribute a **single category accounts for more than 50%** of the images.
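
The dominant-category check above can be reproduced with a few lines of Python. This is a minimal sketch, not the code behind `plot_bdd_piecharts`; it assumes each record follows the BDD100K label layout (an `attributes` dict with `weather`, `scene`, and `timeofday`), and the helper name `dominant_share` and the toy records are hypothetical.

```python
from collections import Counter

def dominant_share(records, attribute):
    """Return (most common value, its share of images) for one image attribute."""
    values = [r.get("attributes", {}).get(attribute, "undefined") for r in records]
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)

# Toy records in the BDD100K label layout (illustrative values, not dataset statistics).
toy_records = [
    {"name": "a.jpg", "attributes": {"weather": "clear", "scene": "city street", "timeofday": "daytime"}},
    {"name": "b.jpg", "attributes": {"weather": "clear", "scene": "highway", "timeofday": "night"}},
    {"name": "c.jpg", "attributes": {"weather": "rainy", "scene": "city street", "timeofday": "daytime"}},
]

for attr in ("weather", "scene", "timeofday"):
    value, share = dominant_share(toy_records, attr)
    print(f"{attr}: {value} ({share:.0%})")
```

On the real label files, `records` would be the list returned by `json.load` on `train_json` or `val_json`.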

---

## Conclusion

As seen from the dataset:

- **Weather:** Most images are captured in **clear weather** conditions.  
- **Scene:** The majority of images depict **city streets**.  
- **Time of Day:** Most images were taken during **daytime**.

---

## Solution

### Recommended Data Augmentations

1. **HSV Hue (`hsv_h: 0.015`)**  
   - Minor adjustment to maintain **color consistency** across different camera sensors and lighting conditions.  

2. **HSV Saturation (`hsv_s: 0.7`)**  
   - Adjusts color vividness, making images more **vibrant** or **washed out**.  
   - Useful for handling **overcast vs. clear weather** conditions.  
   - High variation is suggested to mitigate **weather distribution imbalance**.  

3. **HSV Brightness (`hsv_v: 0.4`)**  
   - Alters image brightness (darker or brighter).  
   - Helps handle **day vs. night** distribution imbalance.
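
To make the effect of these gains concrete, here is a minimal per-pixel sketch of HSV jitter. It assumes the three settings are multiplicative gains for hue, saturation, and value (named `hsv_h`, `hsv_s`, `hsv_v` here) and an OpenCV-style 8-bit HSV range; the function names are hypothetical, and a real training pipeline applies the same idea to whole images inside its own augmentation code.

```python
import random

def sample_hsv_gains(rng, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4):
    """Draw one multiplicative gain per channel, each in [1 - g, 1 + g]."""
    return tuple(1.0 + rng.uniform(-1.0, 1.0) * g for g in (hsv_h, hsv_s, hsv_v))

def jitter_pixel(hsv_pixel, gains):
    """Apply gains to one (H, S, V) pixel; assumes H in [0, 180), S and V in [0, 255]."""
    h, s, v = hsv_pixel
    gain_h, gain_s, gain_v = gains
    return (
        (h * gain_h) % 180,                  # hue wraps around the color wheel
        min(max(s * gain_s, 0.0), 255.0),    # saturation is clipped
        min(max(v * gain_v, 0.0), 255.0),    # value (brightness) is clipped
    )

rng = random.Random(0)
gains = sample_hsv_gains(rng)
print("gains:", gains)
print("jittered pixel:", jitter_pixel((90, 128, 128), gains))
```

With these settings the hue gain stays within 1.5% of identity, while saturation can vary by up to 70% and brightness by up to 40% in either direction.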

---



In [None]:
from analyze import plot_class_distribution_pie
train_counts, val_counts = plot_class_distribution_pie(train_json, val_json)

# Class Distribution Analysis

### Overview
1. The **class distribution** of the training and validation sets is **nearly identical**.  
2. However, the distribution is **heavily skewed** toward the **car** category, which represents **55.4%** of the dataset.  
3. The **train** category is **severely underrepresented**, accounting for roughly **0.01%** of the data (only **136 annotations** in the training set).  
4. A few **incorrect annotations** have been observed in the **train** category.  
5. Some **annotation errors** have also been noted in the **rider** class.

---

## Solution

### Recommended Mitigation Strategies

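1. **Mosaic Augmentation**  
   - Combines multiple images into one training sample, so the model sees more objects, including **rare classes**, per step without explicit oversampling.  
   - Increases the variety of contexts in which each class appears while maintaining spatial context; it does not change the underlying class ratios.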

2. **Varifocal Loss**  
   - Down-weights the many **easy negative samples** so they contribute less to the loss, leaving more of the gradient for positive samples, including those of **rare classes**.  
   - Weights each positive sample by its IoU with the ground truth, adjusting the focus dynamically based on prediction confidence and box quality.
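
For reference, the per-sample varifocal loss can be written in a few lines. This is a sketch of the formula as published for VarifocalNet, with the commonly cited defaults `alpha=0.75` and `gamma=2.0`; whether a given training framework actually uses this loss is an assumption to verify against its configuration.

```python
import math

def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-9):
    """Per-sample varifocal loss.

    p: predicted IoU-aware classification score in (0, 1)
    q: target score (IoU with the ground-truth box for positives, 0 for negatives)
    """
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    if q > 0:
        # Positive sample: binary cross-entropy against q, weighted by q itself.
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    # Negative sample: down-weighted by p**gamma, so easy negatives contribute little.
    return -alpha * (p ** gamma) * math.log(1.0 - p)

print(varifocal_loss(0.1, 0.0))  # easy negative
print(varifocal_loss(0.9, 0.0))  # hard negative
print(varifocal_loss(0.8, 0.8))  # well-calibrated positive
print(varifocal_loss(0.2, 0.8))  # under-confident positive
```

The asymmetry is the point: negatives are scaled down by `p ** gamma`, while positives keep their full weight in proportion to their target IoU.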

---


In [None]:
from analyze import plot_annotations_per_image
train_counts, val_counts, stats_df = plot_annotations_per_image(train_json, val_json)

# Annotations per Image

### Overview
1. **Annotation density** is roughly consistent between the **training** and **validation** sets.  
2. The dataset exhibits a **moderate annotation density** overall.  
3. Some **crowded scenes** are present, with up to **91 objects per image**.  
4. There are **no empty images** in the dataset.  
5. **No sampling bias** is observed: the per-image annotation statistics of the train and validation sets match.
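
The statistics quoted above (maximum objects per image, number of empty images) can be computed as follows. This is a sketch under the assumption that only labels carrying a `box2d` entry count as detection annotations; the helper names and toy records are hypothetical and do not reproduce `plot_annotations_per_image`.

```python
from statistics import mean, median

def boxes_per_image(records):
    """Count labels that carry a 2D box for each image record."""
    return [sum(1 for lab in r.get("labels", []) if "box2d" in lab) for r in records]

def density_summary(counts):
    """Summarize annotation density across images."""
    return {
        "min": min(counts),
        "max": max(counts),
        "mean": mean(counts),
        "median": median(counts),
        "empty_images": sum(1 for c in counts if c == 0),
    }

# Toy records in the BDD100K label layout (illustrative only).
toy_records = [
    {"name": "a.jpg", "labels": [
        {"category": "car", "box2d": {"x1": 10, "y1": 20, "x2": 110, "y2": 70}},
        {"category": "person", "box2d": {"x1": 200, "y1": 100, "x2": 220, "y2": 150}},
        {"category": "lane", "poly2d": []},  # no box2d, so not counted
    ]},
    {"name": "b.jpg", "labels": [
        {"category": "traffic sign", "box2d": {"x1": 5, "y1": 5, "x2": 25, "y2": 25}},
    ]},
    {"name": "c.jpg", "labels": []},
]

counts = boxes_per_image(toy_records)
print(counts)
print(density_summary(counts))
```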

---

## Solution

1. **Anchor-Free Models**  
   - Better suited for **handling overlapping objects** compared to anchor-based models.  
   - Improve detection in **crowded or high-density scenes**.

2. **FPN / PAN Architectures (as in YOLOv8)**  
   - Utilize **Feature Pyramid Network (FPN)** and **Path Aggregation Network (PAN)** structures.  
   - These allow the model to **combine local details** and **broader contextual information**, enhancing detection performance across different scales.

---


In [None]:
from analyze import compare_mean_relative_bbox_size
compare_df = compare_mean_relative_bbox_size(train_json, val_json)

# Mean Relative Size per Class

### Overview
1. There is a **large variation in object scales** across classes.  
2. A significant portion of annotations fall into the **tiny-to-small** size range.  
3. This indicates that the detector must effectively handle **multi-scale variations** during training and inference.
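
Relative size here means box area divided by image area. The sketch below assumes the standard BDD100K frame size of 1280×720 pixels; the bucket thresholds (`tiny`, `small`) are hypothetical values chosen for illustration and are not the ones used by `compare_mean_relative_bbox_size`.

```python
IMG_W, IMG_H = 1280, 720  # assumed BDD100K frame size

def relative_area(box2d, img_w=IMG_W, img_h=IMG_H):
    """Fraction of the image covered by a box given as x1, y1, x2, y2."""
    w = max(box2d["x2"] - box2d["x1"], 0.0)
    h = max(box2d["y2"] - box2d["y1"], 0.0)
    return (w * h) / (img_w * img_h)

def size_bucket(rel_area, tiny=0.001, small=0.01):
    """Map a relative area to a coarse size label (thresholds are illustrative)."""
    if rel_area < tiny:
        return "tiny"
    if rel_area < small:
        return "small"
    return "medium/large"

box = {"x1": 100, "y1": 100, "x2": 120, "y2": 120}  # a 20x20 pixel box
area = relative_area(box)
print(f"{area:.6f} -> {size_bucket(area)}")
```

A 20×20 pixel box covers well under 0.1% of a 1280×720 frame, which is why downscaling the input can erase such objects entirely.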

---

## Solution

1. **Multi-Scale Detection Head (YOLOv8 Architecture)**  
   - Enables simultaneous detection of **small** and **large** objects.  
   - Improves feature representation across multiple scales, enhancing detection accuracy for size-diverse classes.

2. **Higher Input Resolution**  
   - Helps **preserve features** of small and tiny objects that might otherwise be lost at lower resolutions.  
   - Beneficial when dealing with datasets dominated by **fine-grained, small-scale objects**.

---


In [None]:
from analyze import plot_aspect_ratio_distribution
train_summary, val_summary = plot_aspect_ratio_distribution(train_json, val_json)

# Aspect Ratio per Class

### Overview
1. **Aspect ratio diversity** is high, with mean values ranging from **0.46 to 3.73**.  
2. There is also **significant variation within individual classes**.  
3. The **train** and **traffic sign** classes exhibit the **greatest variability** in aspect ratios.
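
A per-class mean and spread of the aspect ratio can be computed as sketched below. It assumes aspect ratio is defined as width divided by height (the definition is not stated here, but it is consistent with narrow and wide classes spanning values below and above 1); the helper name and toy records are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def aspect_ratio_stats(records):
    """Return {category: (mean w/h, population std of w/h)} over all 2D boxes."""
    ratios = defaultdict(list)
    for rec in records:
        for lab in rec.get("labels", []):
            box = lab.get("box2d")
            if not box:
                continue  # skip labels without a 2D box
            w = box["x2"] - box["x1"]
            h = box["y2"] - box["y1"]
            if w > 0 and h > 0:  # ignore degenerate boxes
                ratios[lab["category"]].append(w / h)
    return {cat: (mean(vals), pstdev(vals)) for cat, vals in ratios.items()}

# Toy records (illustrative only).
toy_records = [
    {"name": "a.jpg", "labels": [
        {"category": "car", "box2d": {"x1": 0, "y1": 0, "x2": 100, "y2": 50}},
        {"category": "car", "box2d": {"x1": 0, "y1": 0, "x2": 120, "y2": 40}},
        {"category": "person", "box2d": {"x1": 0, "y1": 0, "x2": 20, "y2": 50}},
    ]},
]

for cat, (m, s) in aspect_ratio_stats(toy_records).items():
    print(f"{cat}: mean={m:.2f}, std={s:.2f}")
```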

---

## Solution

1. **Avoid Rotation Augmentation**  
   - Most classes have a **characteristic aspect ratio distribution** that rotation would distort, so avoiding rotation helps the model **retain shape context** and improves generalization for shape-based recognition.  

2. **Anchor-Free Models (e.g., YOLOv8)**  
   - Better equipped to handle **high aspect ratio diversity** compared to anchor-based approaches.  
   - Improves bounding box adaptability across objects with irregular proportions.  

3. **Varifocal + DFL Loss**  
   - Enhances YOLOv8’s tolerance to **variable box shapes** and improves training stability.  
   - Enables the model to assign **dynamic confidence weighting** based on bounding box quality and aspect ratio variation.

---


In [None]:
from analyze import compute_mean_iou_matrix, plot_iou_heatmap
iou_matrix, overlap_counts = compute_mean_iou_matrix(df, iou_thresh=0.5)

print("\n📊 Mean IoU Overlaps per Class Pair (IoU ≥ 0.5):")
print(iou_matrix.round(3))

plot_iou_heatmap(iou_matrix)

# Mean IoU Overlap per Class Pair

### Overview
1. The dataset shows **moderate overlap** between certain object classes.  
2. This can cause **true positives** to be mistakenly suppressed by standard **Non-Maximum Suppression (NMS)**.  
3. Proper handling of overlapping detections is crucial to preserve detection accuracy, especially in crowded scenes.

---

## Solution

1. **Soft-NMS**  
   - Use **Soft Non-Maximum Suppression** to prevent moderate overlaps between true positives from being wrongly discarded.  
   - Instead of completely removing overlapping boxes, Soft-NMS **reduces their confidence scores** based on IoU, allowing the detector to retain valid overlapping detections.

2. **IoU-Aware Loss Functions (e.g., Varifocal Loss)**  
   - Implementing **IoU-aware loss** during training helps the model better understand spatial overlap between bounding boxes.  
   - **Varifocal Loss** improves confidence calibration by aligning predicted scores with IoU values, enhancing performance in **densely overlapped** object regions.
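
The Soft-NMS idea can be sketched in plain Python. This is a minimal Gaussian variant for illustration, assuming boxes in `(x1, y1, x2, y2)` format; `sigma` and `score_thresh` are illustrative defaults, and a production detector would use a vectorized implementation.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay scores of overlapping boxes instead of removing them.

    Returns a list of (original index, final score) in selection order.
    """
    remaining = list(enumerate(scores))
    kept = []
    while remaining:
        best = max(remaining, key=lambda item: item[1])
        remaining.remove(best)
        kept.append(best)
        decayed = []
        for idx, score in remaining:
            overlap = iou(boxes[best[0]], boxes[idx])
            score *= math.exp(-(overlap ** 2) / sigma)  # larger overlap, stronger decay
            if score >= score_thresh:
                decayed.append((idx, score))
        remaining = decayed
    return kept

boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
for idx, score in soft_nms(boxes, scores):
    print(idx, round(score, 3))
```

In this example the second box overlaps the first with an IoU above 0.8. Hard NMS at a 0.5 threshold would discard it, whereas Soft-NMS keeps it with a reduced score, so a later confidence threshold decides its fate.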

---


In [None]:
from analyze import compute_cooccurrence_matrix, summarize_cooccurrence, plot_cooccurrence_heatmap

co_matrix = compute_cooccurrence_matrix(df)

print("\n📊 Inter-class Co-occurrence Matrix (Train Set):")
print(co_matrix.round(3))

summarize_cooccurrence(co_matrix)
plot_cooccurrence_heatmap(co_matrix)