# Step 2: Data Pipeline & Evaluation Foundations

This notebook downloads the dataset and builds every Step 2 artifact.
Once it has run, Step 2 is **FROZEN**: do not regenerate these files.

## 1. Setup - Clone Repository

In [None]:
# Go to /content first, clean previous clone, then clone fresh
%cd /content
!rm -rf Deep_Learning_Project_Gil_Alon
!git clone https://github.com/gil-attar/Deep_Learning_Project_Gil_Alon.git
%cd Deep_Learning_Project_Gil_Alon


In [None]:
# Install dependencies
!pip install -r requirements.txt -q

## 2. Set API Keys

Replace with your actual keys below, or use Colab Secrets (🔑 icon in the sidebar).

In [None]:
import os

# Option A: Set directly (replace the placeholder with your key; never commit a real key)
os.environ["ROBOFLOW_API_KEY"] = "YOUR_ROBOFLOW_API_KEY"

# Option B: Use Colab Secrets (recommended)
# from google.colab import userdata
# os.environ["ROBOFLOW_API_KEY"] = userdata.get('ROBOFLOW_API_KEY')
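
A small guard can catch a missing key before the download step fails partway through. The sketch below assumes only that the key lives in the `ROBOFLOW_API_KEY` environment variable, as in the cell above. The helper name `require_env` is hypothetical and is not part of the repository.

```python
import os


def require_env(name, env=None):
    """Return the value of environment variable `name`, or raise if unset or blank.

    `require_env` is a hypothetical helper, not part of the project scripts.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Set it directly or load it from Colab Secrets."
        )
    return value
```

Calling `require_env("ROBOFLOW_API_KEY")` just before the download cell would stop the notebook early with a readable message instead of an authentication error from the script.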

## 3. Download Dataset

In [None]:
# Download dataset to data/raw/ (immutable source)
!python scripts/download_dataset.py --output_dir data/raw

## 4. Build Evaluation Index

Creates all Step 2 artifacts:
- `splits/split_manifest.json` - Lists all filenames per split
- `evaluation/test_index.json` - Ground truth + occlusion difficulty  
- `evaluation/difficulty_summary.csv` - Statistics per difficulty
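
The verification cell in this notebook reports difficulty as based on the maximum pairwise IoU between boxes in an image. A minimal sketch of that idea follows. It is not the code in `scripts/build_evaluation_index.py`: the threshold values (`0.05`, `0.20`), the labels `easy`/`medium`/`hard`, and all function names are placeholder assumptions. The actual thresholds are stored under `difficulty_thresholds` in `test_index.json`.

```python
from itertools import combinations


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def max_pairwise_iou(boxes):
    """Largest IoU over all box pairs; 0.0 when there are fewer than two boxes."""
    return max((iou(a, b) for a, b in combinations(boxes, 2)), default=0.0)


def difficulty(boxes, medium_at=0.05, hard_at=0.20):
    """Map an image's boxes to a difficulty label (thresholds are assumed values)."""
    score = max_pairwise_iou(boxes)
    if score >= hard_at:
        return "hard"
    if score >= medium_at:
        return "medium"
    return "easy"
```

Using the maximum rather than the mean means a single heavily overlapped pair is enough to mark an image as difficult, which fits an occlusion-oriented measure.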

In [None]:
# Build all Step 2 evaluation artifacts
!python scripts/build_evaluation_index.py --dataset_root data/raw --output_dir data/processed --seed 42

## 5. Verify Results

In [None]:
# Check directory structure
!echo "=== Raw Data Structure ==="
!ls -la data/raw/

!echo ""
!echo "=== Processed Artifacts ==="
!ls -la data/processed/
!ls -la data/processed/splits/
!ls -la data/processed/evaluation/

!echo ""
!echo "=== Train/Valid/Test Counts ==="
!echo "Train images: $(ls data/raw/train/images/ | wc -l)"
!echo "Valid images: $(ls data/raw/valid/images/ | wc -l)"
!echo "Test images: $(ls data/raw/test/images/ | wc -l)"
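
The `ls | wc -l` counts above include every file in each directory. A Python sketch that counts only image files is shown below; the set of suffixes and the name `count_images` are assumptions, and the directory layout is the one used in the cell above.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions


def count_images(images_dir):
    """Count files with an image suffix directly inside `images_dir`."""
    directory = Path(images_dir)
    if not directory.is_dir():
        return 0
    return sum(
        1
        for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


for split in ("train", "valid", "test"):
    print(f"{split}: {count_images(Path('data/raw') / split / 'images')} images")
```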

In [None]:
# Check the evaluation index and difficulty summary
import json
import pandas as pd

# Load test index
with open('data/processed/evaluation/test_index.json', 'r') as f:
    index = json.load(f)

print("=== Evaluation Index Summary ===")
print(f"Total test images: {index['metadata']['num_images']}")
print(f"Total objects: {index['metadata']['total_objects']}")
print(f"Classes: {index['metadata']['num_classes']}")
print("\nDifficulty distribution (based on MAX pairwise IoU):")
for diff, count in index['metadata']['difficulty_distribution'].items():
    pct = 100 * count / index['metadata']['num_images']
    print(f"  {diff.upper():8s}: {count:4d} ({pct:.1f}%)")

print("\nThresholds:")
for diff, thresh in index['metadata']['difficulty_thresholds'].items():
    print(f"  {diff}: {thresh}")

# Load and display difficulty summary CSV
print("\n=== Difficulty Summary (CSV) ===")
df = pd.read_csv('data/processed/evaluation/difficulty_summary.csv')
print(df.to_string(index=False))

In [None]:
# Show 5 sample images with bounding boxes (from data/raw/)
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image
from pathlib import Path
import yaml

# Load class names from raw data
with open('data/raw/data.yaml', 'r') as f:
    config = yaml.safe_load(f)
class_names = config['names']

# Get 5 sample images from test set (sorted so the selection is reproducible)
test_images_dir = Path('data/raw/test/images')
test_labels_dir = Path('data/raw/test/labels')
sample_images = sorted(test_images_dir.glob('*.jpg'))[:5]

fig, axes = plt.subplots(1, 5, figsize=(20, 4))

for ax, img_path in zip(axes, sample_images):
    # Load image
    img = Image.open(img_path)
    ax.imshow(img)
    ax.set_title(img_path.name[:20] + '...', fontsize=8)
    ax.axis('off')
    
    # Load labels and draw boxes
    label_path = test_labels_dir / (img_path.stem + '.txt')
    if label_path.exists():
        with open(label_path, 'r') as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) >= 5:
                    class_id = int(parts[0])
                    x_center, y_center, w, h = [float(p) for p in parts[1:5]]
                    
                    # Convert YOLO format to pixel coordinates
                    img_w, img_h = img.size
                    x1 = (x_center - w/2) * img_w
                    y1 = (y_center - h/2) * img_h
                    box_w = w * img_w
                    box_h = h * img_h
                    
                    # Draw rectangle
                    rect = patches.Rectangle((x1, y1), box_w, box_h, 
                                            linewidth=2, edgecolor='lime', facecolor='none')
                    ax.add_patch(rect)
                    
                    # Add label
                    label = class_names[class_id] if class_id < len(class_names) else f'class_{class_id}'
                    ax.text(x1, y1-5, label, fontsize=6, color='lime', 
                           bbox=dict(boxstyle='round', facecolor='black', alpha=0.7))

plt.tight_layout()
plt.show()
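
The box conversion used in the plotting loop can be isolated into a small function, which makes it easy to check by hand. This is a sketch: `yolo_to_pixel_box` is a hypothetical name, and it assumes the standard YOLO label layout of normalized `x_center y_center width height`.

```python
def yolo_to_pixel_box(x_center, y_center, w, h, img_w, img_h):
    """Convert a normalized YOLO box to (x1, y1, box_w, box_h) in pixels.

    (x1, y1) is the top-left corner, matching what patches.Rectangle expects.
    """
    x1 = (x_center - w / 2) * img_w
    y1 = (y_center - h / 2) * img_h
    return x1, y1, w * img_w, h * img_h


# A box centered in a 640x480 image, covering half of each dimension.
print(yolo_to_pixel_box(0.5, 0.5, 0.5, 0.5, 640, 480))
# → (160.0, 120.0, 320.0, 240.0)
```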

## ✅ Step 2 Complete - Data Pipeline FROZEN!

**Artifacts created:**
- `data/processed/splits/split_manifest.json` - All filenames per split
- `data/processed/evaluation/test_index.json` - Ground truth + difficulty labels
- `data/processed/evaluation/difficulty_summary.csv` - Statistics per difficulty

**Important:** These files should be committed to Git. Do NOT regenerate them.
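
One way to make the freeze checkable is to record a SHA-256 digest for each artifact and compare it later. The sketch below is an assumption rather than an existing project script: the function names are hypothetical, and only the standard library is used.

```python
import hashlib
from pathlib import Path

# The three Step 2 artifacts listed above.
ARTIFACTS = [
    "data/processed/splits/split_manifest.json",
    "data/processed/evaluation/test_index.json",
    "data/processed/evaluation/difficulty_summary.csv",
]


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_artifacts(expected, root="."):
    """Return relative paths that are missing or whose digest differs from `expected`."""
    changed = []
    for rel_path, digest in expected.items():
        path = Path(root) / rel_path
        if not path.exists() or sha256_of(path) != digest:
            changed.append(rel_path)
    return changed
```

A possible workflow is to build `expected = {p: sha256_of(p) for p in ARTIFACTS}` once after this notebook runs, store it alongside the artifacts, and check in later steps that `changed_artifacts(expected)` returns an empty list.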

**Next steps:**
- Step 3: Train baseline models (YOLOv8 & RT-DETR)
- Step 4: Evaluate and compare