# NBA Basketball Foul Detection - Colab Training

This notebook trains the E2E-Spot model for basketball foul detection on Google Colab Pro.

## Setup Overview

1. **Environment Setup** - Install dependencies (~2 min)
2. **Mount Google Drive** - Connect persistent storage (~10 sec)
3. **AWS Credentials** - Configure S3 access (~30 sec)
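4. **Download Frames to Drive** - One-time download of 22GB (~10-15 min), then a one-time reorganization into per-clip folders (~5-10 min) and a copy to local Colab storage (~10 min)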
5. **Clone Repository** - Get training code (~30 sec)
6. **Verify Setup** - Check everything is ready (~10 sec)
7. **Start Training** - Begin training (~14-17 hours with Colab Pro)
8. **Resume Training** - Continue if session disconnects

## Important Notes

- **Colab Pro:** Sessions can run up to 24 hours (vs. up to 12 hours on the free tier); neither limit is guaranteed
- **Frames:** Downloaded once to Drive, persistent across sessions
- **Checkpoints:** Saved directly to Drive, no separate backup needed
- **Total time:** ~14-17 hours (might finish in one Colab Pro session)

**Enable GPU:** Runtime → Change runtime type → GPU (T4/V100/A100)

## Cell 1: Environment Setup

Install PyTorch and dependencies. Run this first.

In [None]:
# Install dependencies
print("Installing dependencies for Colab environment...")
!pip install -q torch torchvision timm tqdm tabulate opencv-python pillow matplotlib

# Verify GPU availability
import torch
print(f"\n✓ PyTorch version: {torch.__version__}")
print(f"✓ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✓ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✓ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    # Colab Pro may give you a V100 (16GB) or A100 (40GB)
else:
    print("\n⚠️  WARNING: No GPU detected!")
    print("Please enable GPU: Runtime → Change runtime type → GPU")

print("\n✓ Environment setup complete!")

## Cell 2: Mount Google Drive

Connect Google Drive for persistent storage of frames and checkpoints.

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Create directories in Drive (persistent across sessions)
DRIVE_ROOT = '/content/drive/MyDrive/nba_foul_training'
FRAME_DIR = f'{DRIVE_ROOT}/frames'
CHECKPOINT_DIR = f'{DRIVE_ROOT}/checkpoints'

os.makedirs(DRIVE_ROOT, exist_ok=True)
os.makedirs(FRAME_DIR, exist_ok=True)
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

print(f"\n✓ Google Drive mounted successfully")
print(f"✓ Frames will be stored in: {FRAME_DIR}")
print(f"✓ Checkpoints will be saved to: {CHECKPOINT_DIR}")
print(f"\nCurrent Drive contents:")
!ls -lh "$DRIVE_ROOT" 2>/dev/null || echo "  (empty)"

## Cell 3: AWS Credentials

Configure AWS credentials to download frames from S3.

**Security:** Credentials are stored only in this session (temporary).

In [None]:
import os
from getpass import getpass

print("Enter your AWS credentials (input is hidden):")
print("These are used ONLY in this Colab session to download frames from S3.")
print("Your local AWS config is NOT affected.\n")

AWS_ACCESS_KEY_ID = getpass("AWS Access Key ID: ")
AWS_SECRET_ACCESS_KEY = getpass("AWS Secret Access Key: ")
AWS_REGION = input("AWS Region (default: us-east-2): ") or "us-east-2"

# Set environment variables (temporary, session-only)
os.environ['AWS_ACCESS_KEY_ID'] = AWS_ACCESS_KEY_ID
os.environ['AWS_SECRET_ACCESS_KEY'] = AWS_SECRET_ACCESS_KEY
os.environ['AWS_DEFAULT_REGION'] = AWS_REGION

# Install AWS CLI
print("\nInstalling AWS CLI...")
!pip install -q awscli

# Test credentials (without displaying them)
import subprocess
result = subprocess.run(['aws', 's3', 'ls', 's3://nba-foul-dataset-oh/'], 
                       capture_output=True, text=True)
if result.returncode == 0:
    print("\n✓ AWS credentials verified successfully!")
    print("✓ S3 bucket access confirmed")
else:
    print("\n❌ AWS credential verification failed. Please check your keys.")
    print(f"Error: {result.stderr}")

## Cell 4: Download Frames to Google Drive

**Downloads 22GB of frame data from S3 directly to Google Drive.**

- **First time:** Takes ~10-15 minutes
- **Subsequent sessions:** Skipped (frames already in Drive)
- **Persistent:** Frames stay in Drive across sessions

⚠️ **This is the critical step.** Make sure it completes successfully.
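
One way to confirm the download finished is to total what actually landed on disk and compare it with the expected ~22GB. The sketch below is a hypothetical helper (`summarize_frames` is not part of the repository) and assumes frames are `.jpg` files somewhere under the frame directory.

```python
import os

def summarize_frames(root):
    """Return (jpg_count, total_bytes, empty_files) for all .jpg files under root."""
    jpg_count = 0
    total_bytes = 0
    empty_files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith('.jpg'):
                continue
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            jpg_count += 1
            total_bytes += size
            if size == 0:
                # Zero-byte files usually point to an interrupted transfer
                empty_files.append(path)
    return jpg_count, total_bytes, empty_files
```

Calling `summarize_frames(FRAME_DIR)` after Cell 4 and printing `total_bytes / 1e9` gives a size to compare against the 22GB figure. A non-empty `empty_files` list suggests re-running the sync.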

In [None]:
import os
import re
import time

S3_BUCKET = 's3://nba-foul-dataset-oh/frames/'

def count_clips_in_dir(base_dir):
    """Count unique clips from both foul and non-foul directories"""
    clips = set()
    
    # Check both directories with correct paths
    paths_to_check = [
        os.path.join(base_dir, '2023-24'),           # Foul clips
        os.path.join(base_dir, 'non_fouls', '2023-24')  # Non-foul clips
    ]
    
    for dir_path in paths_to_check:
        if not os.path.exists(dir_path):
            continue
        
        for game_folder in os.listdir(dir_path):
            game_path = os.path.join(dir_path, game_folder)
            if not os.path.isdir(game_path):
                continue
                
            for filename in os.listdir(game_path):
                match = re.match(r'(\d+)_(\d+)_frame_\d+\.jpg', filename)
                if match:
                    game_id, event_id = match.groups()
                    clips.add(f"{game_id}_{event_id}")
    
    return len(clips)

# Check existing
num_existing = count_clips_in_dir(FRAME_DIR)
expected_total = 2214  # 1213 fouls + 1001 non-fouls

print(f"Current: {num_existing} / {expected_total} clips")

if num_existing >= 2200:  # Tolerate a few missing clips out of the expected 2214
    print(f"✓ Download appears complete, skipping")
else:
    print(f"Syncing from S3 (output suppressed for speed)...\n")
    
    start_time = time.time()
    
    # Download with no output (fastest)
    !aws s3 sync {S3_BUCKET} {FRAME_DIR} --region {AWS_REGION} --no-progress --only-show-errors
    
    # Final count
    final_count = count_clips_in_dir(FRAME_DIR)
    elapsed = time.time() - start_time
    
    print(f"\n✓ Sync complete: {final_count} / {expected_total} clips in {elapsed/60:.1f} min")
    
    if final_count < 2200:
        print(f"⚠️  Only {final_count} clips - check S3 or re-run to resume")
    else:
        print(f"✓ Run Cell 5 to reorganize frames")

## Cell 5: Reorganize Frames for Training

**IMPORTANT: Run this after Cell 4 completes!**

The downloaded S3 structure has frames organized by game, but training expects each clip in its own folder.

This cell:
- Reorganizes frames from `2023-24/{game_id}/{game_id}_{event_id}_frame_X.jpg`
- To training format: `{game_id}_{event_id}/XXXXXX.jpg`
- Takes ~5-10 minutes to reorganize all clips
- Creates new directory: `frames_training/`

⚠️ **This only needs to run once** after initial download.

In [None]:
import os
import re
import shutil
from tqdm import tqdm

TRAINING_DIR = f'{DRIVE_ROOT}/frames_training'

# Check if already reorganized
if os.path.exists(TRAINING_DIR):
    existing_clips = [d for d in os.listdir(TRAINING_DIR) if os.path.isdir(os.path.join(TRAINING_DIR, d))]
    print(f"✓ Training frames already reorganized!")
    print(f"✓ Found {len(existing_clips)} clips in {TRAINING_DIR}")
    print(f"\nSkipping reorganization. Delete {TRAINING_DIR} to re-run.")
else:
    print("="*80)
    print("REORGANIZING FRAMES FOR TRAINING")
    print("="*80)
    print(f"Source: {FRAME_DIR}")
    print(f"Destination: {TRAINING_DIR}")
    print(f"\nProcessing both foul and non-foul clips...")
    print("Converting: {game_id}/{game_id}_{event_id}_frame_X.jpg")
    print("To:         {game_id}_{event_id}/XXXXXX.jpg")
    print()
    
    os.makedirs(TRAINING_DIR, exist_ok=True)
    
    clips_created = 0
    frames_copied = 0
    
    # Process both foul and non-foul directories with correct paths
    sources = [
        ('2023-24', os.path.join(FRAME_DIR, '2023-24')),
        ('non_fouls', os.path.join(FRAME_DIR, 'non_fouls', '2023-24'))
    ]
    
    for label, source_dir in sources:
        if not os.path.exists(source_dir):
            print(f"Skipping {label} (not found)")
            continue
        
        # Get list of all game folders
        game_folders = [d for d in os.listdir(source_dir) if os.path.isdir(os.path.join(source_dir, d))]
        
        print(f"Processing {len(game_folders)} folders from {label}...")
        
        for game_folder in tqdm(game_folders, desc=label):
            game_path = os.path.join(source_dir, game_folder)
            
            for filename in os.listdir(game_path):
                # Parse: {game_id}_{event_id}_frame_{idx}.jpg
                match = re.match(r'(\d+)_(\d+)_frame_(\d+)\.jpg', filename)
                if match:
                    game_id, event_id, frame_idx = match.groups()
                    
                    # Create clip directory
                    clip_dir = os.path.join(TRAINING_DIR, f"{game_id}_{event_id}")
                    if not os.path.exists(clip_dir):
                        os.makedirs(clip_dir, exist_ok=True)
                        clips_created += 1
                    
                    # Copy frame with new name: XXXXXX.jpg (6-digit zero-padded)
                    src_file = os.path.join(game_path, filename)
                    dst_file = os.path.join(clip_dir, f"{int(frame_idx):06d}.jpg")
                    shutil.copy2(src_file, dst_file)
                    frames_copied += 1
    
    print(f"\n✓ Reorganization complete!")
    print(f"  Clips created: {clips_created}")
    print(f"  Frames copied: {frames_copied}")
    print(f"  Location: {TRAINING_DIR}")
    
    # Verify
    !du -sh {TRAINING_DIR}
    
    print(f"\n✓ Ready for training! Use this path in Cell 7: {TRAINING_DIR}")
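
The renaming rule for the reorganization step can be looked at in isolation. The sketch below maps one source filename to its per-clip destination. `training_path` is a hypothetical helper name, and the game and event IDs in the example call are made up for illustration.

```python
import os
import re

# Source names follow {game_id}_{event_id}_frame_{idx}.jpg
FRAME_RE = re.compile(r'(\d+)_(\d+)_frame_(\d+)\.jpg$')

def training_path(filename, training_dir):
    """Map a source frame filename to its per-clip path, or None if it does not match."""
    match = FRAME_RE.match(filename)
    if match is None:
        return None
    game_id, event_id, frame_idx = match.groups()
    clip_dir = f"{game_id}_{event_id}"
    # The frame index becomes a 6-digit zero-padded filename
    return os.path.join(training_dir, clip_dir, f"{int(frame_idx):06d}.jpg")

# Illustrative IDs only; real names come from the S3 bucket
print(training_path('0022300001_15_frame_7.jpg', 'frames_training'))
```

Files that do not match the pattern return `None`, which mirrors how the reorganization loop silently skips them.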

## Cell 5.5: Copy Frames to Local Storage (FAST I/O)

**IMPORTANT: Run this for much faster training!**

Google Drive I/O is very slow (5+ sec/batch). Copying frames to local Colab storage provides 10-20x faster training:
- **Drive I/O:** ~3+ hours per epoch
- **Local I/O:** ~15-20 minutes per epoch

This one-time copy takes ~10 minutes but saves hours during training.
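
The payoff over a full run can be estimated from the per-epoch figures above with a few lines of arithmetic. The 50-epoch count comes from the training configuration. The timings are the rough estimates quoted in this section, not measurements.

```python
EPOCHS = 50
COPY_MINUTES = 10              # one-time copy to local storage
DRIVE_MIN_PER_EPOCH = 180      # "~3+ hours per epoch", taken as a lower bound
LOCAL_MIN_PER_EPOCH = (15, 20)

def total_hours(min_per_epoch, epochs=EPOCHS, overhead_min=0):
    """Total wall-clock hours for a run, including any one-time overhead."""
    return (min_per_epoch * epochs + overhead_min) / 60

drive_hours = total_hours(DRIVE_MIN_PER_EPOCH)
local_hours = tuple(total_hours(m, overhead_min=COPY_MINUTES) for m in LOCAL_MIN_PER_EPOCH)
speedup = tuple(DRIVE_MIN_PER_EPOCH / m for m in LOCAL_MIN_PER_EPOCH)

print(f"Drive: at least {drive_hours:.0f} h")
print(f"Local: {local_hours[0]:.1f}-{local_hours[1]:.1f} h including the copy")
print(f"Per-epoch speedup: {speedup[1]:.0f}-{speedup[0]:.0f}x")
```

With the 3-hour lower bound this works out to at least 150 hours from Drive versus roughly 12.7-16.8 hours locally, a 9-12x per-epoch speedup. The quoted 10-20x corresponds to Drive epochs that run longer than 3 hours.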

In [None]:
import os
import shutil
import time

# Paths
DRIVE_TRAINING = f'{DRIVE_ROOT}/frames_training'
LOCAL_TRAINING = '/content/frames_training'

# Check if already copied
if os.path.exists(LOCAL_TRAINING):
    clips = [d for d in os.listdir(LOCAL_TRAINING) if os.path.isdir(os.path.join(LOCAL_TRAINING, d))]
    print(f"✓ Frames already in local storage!")
    print(f"✓ Found {len(clips)} clips in {LOCAL_TRAINING}")
    print(f"\nSkipping copy. Delete {LOCAL_TRAINING} to re-copy.")
else:
    print("="*80)
    print("COPYING FRAMES TO LOCAL STORAGE")
    print("="*80)
    print(f"Source: {DRIVE_TRAINING} (Google Drive - slow)")
    print(f"Dest:   {LOCAL_TRAINING} (Local SSD - fast)")
    print("\nThis takes ~10 minutes but makes training 10-20x faster!")
    print("="*80)
    
    if not os.path.exists(DRIVE_TRAINING):
        print(f"\n❌ Error: {DRIVE_TRAINING} not found")
        print("Run Cell 5 first to reorganize frames in Drive")
    else:
        start = time.time()
        
        print("\nCopying frames...")
        shutil.copytree(DRIVE_TRAINING, LOCAL_TRAINING)
        
        elapsed = time.time() - start
        clips = [d for d in os.listdir(LOCAL_TRAINING) if os.path.isdir(os.path.join(LOCAL_TRAINING, d))]
        
        print(f"\n✓ Copy complete in {elapsed/60:.1f} minutes!")
        print(f"✓ Copied {len(clips)} clips to local storage")
        print(f"✓ Training will use fast local I/O")
        
        # Verify size
        !du -sh {LOCAL_TRAINING}
        
        print(f"\n✓ Ready for fast training!")

## Cell 5: Clone Repository

Clone the basketball foul detection training code.

In [None]:
import os

# Clone repository if not already present
REPO_DIR = '/content/basketball_foul_detection'

if os.path.exists(REPO_DIR):
    print(f"✓ Repository already exists at {REPO_DIR}")
    %cd {REPO_DIR}
    !git pull origin main 2>/dev/null || echo "(git pull skipped)"
else:
    print("Cloning repository...")
    !git clone https://github.com/githubhomie/basketball_foul_detection.git {REPO_DIR}
    %cd {REPO_DIR}

# Install dependencies
print("\nInstalling dependencies...")
!pip install -q -r requirements.txt

# Verify critical files exist
print("\nVerifying project structure...")
critical_files = [
    'train_e2e.py',
    'data/basketball/train.json',
    'data/basketball/val.json',
    'data/basketball/test.json',
    'data/basketball/class.txt'
]

all_present = True
for file in critical_files:
    if os.path.exists(file):
        print(f"  ✓ {file}")
    else:
        print(f"  ❌ {file} - MISSING!")
        all_present = False

if all_present:
    print("\n✓ All critical files present!")
    print(f"✓ Working directory: {os.getcwd()}")
else:
    print("\n❌ Some files are missing. Check repository.")

## Cell 6: Verify Setup

**Run this before training to verify everything is ready.**

This cell checks:
- GPU is available
- Frames are in Drive
- Dataset files are correct
- All dependencies are installed
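
The checks listed above could be sketched roughly as follows. This is an illustrative helper, not code from the repository: `check_setup` and its parameters are hypothetical, and it assumes each clip is a sub-directory of the frame directory and that the split files are plain JSON.

```python
import json
import os

def check_setup(frame_dir, data_dir, min_clips=1, check_gpu=True):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    if check_gpu:
        # torch is imported lazily so the other checks still run without it
        try:
            import torch
            if not torch.cuda.is_available():
                problems.append("No CUDA GPU detected")
        except ImportError:
            problems.append("PyTorch is not installed")

    # Frame check: count clip sub-directories
    if os.path.isdir(frame_dir):
        clips = [d for d in os.listdir(frame_dir)
                 if os.path.isdir(os.path.join(frame_dir, d))]
        if len(clips) < min_clips:
            problems.append(f"Only {len(clips)} clip folders in {frame_dir}")
    else:
        problems.append(f"Frame directory missing: {frame_dir}")

    # Dataset check: split files must exist and parse as JSON
    for split in ('train', 'val', 'test'):
        path = os.path.join(data_dir, f'{split}.json')
        try:
            with open(path) as f:
                json.load(f)
        except (OSError, ValueError) as exc:
            problems.append(f"{path}: {exc}")
    class_file = os.path.join(data_dir, 'class.txt')
    if not os.path.isfile(class_file):
        problems.append(f"Missing {class_file}")

    return problems
```

In the notebook this might be called as `check_setup('/content/frames_training', 'data/basketball', min_clips=2200)`, printing each returned problem before training starts.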

In [None]:
import os
import time
from datetime import datetime

# Training configuration
DATASET = "basketball"
MODEL_ARCH = "rny002_gsm"  # RegNet-Y 200MF + Gated Shift Module
TEMPORAL_ARCH = "gru"      # Bidirectional GRU

# Hyperparameters - UPDATED CONFIG
BATCH_SIZE = 24       # Optimized for A100 (use 16 if OOM)
CLIP_LEN = 30         # 30 frames per clip
NUM_EPOCHS = 50       # Total training epochs
LEARNING_RATE = 0.001 # Initial learning rate (with warmup)
CROP_DIM = 224        # Input image size

# Use LOCAL frames directory for FAST training (not Drive!)
TRAINING_FRAME_DIR = '/content/frames_training'

# Save directory in Drive (persistent)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
SAVE_DIR = f"{CHECKPOINT_DIR}/basketball_colab_{timestamp}"
os.makedirs(SAVE_DIR, exist_ok=True)

print("="*80)
print("NBA BASKETBALL FOUL DETECTION TRAINING")
print("="*80)
print(f"Dataset:        {DATASET}")
print(f"Frame dir:      {TRAINING_FRAME_DIR} (local SSD - FAST)")
print(f"Model:          {MODEL_ARCH} + {TEMPORAL_ARCH}")
print(f"Clip length:    {CLIP_LEN} frames")
print(f"Batch size:     {BATCH_SIZE}")
print(f"Epochs:         {NUM_EPOCHS}")
print(f"Learning rate:  {LEARNING_RATE}")
print(f"Mixup:          False (disabled for single-frame labels)")
print(f"Dilate len:     1 (±1 frame tolerance)")
print(f"FG upsample:    0.5 (50% clips contain events)")
print(f"Start val:      Epoch 5 (early mAP)")
print(f"Save dir:       {SAVE_DIR}")
print(f"Storage:        Frames=Local (fast), Checkpoints=Drive (persistent)")
print("="*80)
print()
print("Expected time: ~6-8 hours (~8-10 min/epoch × 50 epochs)")
print("Keep this tab open or training will stop!")
print("If Colab disconnects, use Cell 8 to resume.")
print()
print("Starting training...\n")

start_time = time.time()

# Run training - UPDATED CONFIG
!python3 train_e2e.py "{DATASET}" "{TRAINING_FRAME_DIR}" \
    -m "{MODEL_ARCH}" \
    -t "{TEMPORAL_ARCH}" \
    -s "{SAVE_DIR}" \
    --clip_len {CLIP_LEN} \
    --crop_dim {CROP_DIM} \
    --batch_size {BATCH_SIZE} \
    --num_epochs {NUM_EPOCHS} \
    --learning_rate {LEARNING_RATE} \
    --mixup False \
    --criterion map \
    --dilate_len 1 \
    --fg_upsample 0.5 \
    --start_val_epoch 5 \
    --warm_up_epochs 3

elapsed = time.time() - start_time
print(f"\n\n{'='*80}")
print(f"Training completed in {elapsed/3600:.1f} hours!")
print(f"{'='*80}")
print(f"\nResults saved to: {SAVE_DIR}")
print(f"\nCheckpoints are in Google Drive (persistent).")
print(f"You can close Colab now.")

## Cell 7: Start Training

**Starts training the foul detection model.**

### Configuration:
- **Model:** RegNet-Y 200MF + Gated Shift Module + BiGRU
- **Batch size:** 24 in the training cell (tuned for an A100; use 16 if OOM). On smaller GPUs start from 8, and reduce to 6 if OOM on T4
- **Epochs:** 50
- **Learning rate:** 0.001 with warmup and cosine decay
- **Clip length:** 30 frames
- **Input size:** 224×224
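
To make "warmup and cosine decay" concrete, here is a minimal sketch of one common form of that schedule: a linear ramp over the warmup epochs followed by a half-cosine decay to zero. The exact schedule inside `train_e2e.py` may differ. `lr_at_step` and `steps_per_epoch` are hypothetical names, while the defaults (0.001, 3 warmup epochs, 50 epochs) mirror the configuration above.

```python
import math

def lr_at_step(step, steps_per_epoch, base_lr=0.001, warm_up_epochs=3, num_epochs=50):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = warm_up_epochs * steps_per_epoch
    total_steps = num_epochs * steps_per_epoch
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Print the rate at the start of a few epochs, assuming 100 steps per epoch
for epoch in (0, 1, 3, 25, 49):
    print(epoch, f"{lr_at_step(epoch * 100, steps_per_epoch=100):.6f}")
```

The rate peaks at `base_lr` when warmup ends and falls to about half of it midway through the remaining epochs.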

### Timeline:
- **Each epoch:** ~17-25 minutes (depends on GPU)
- **Total time:** ~14-17 hours at 17-20 min/epoch; closer to 21 hours at 25 min/epoch (50 epochs)
- **Colab Pro:** A 24-hour session should be enough to finish in one go

### During Training:
- Checkpoints saved to Drive automatically
- You can step away and check back later, as long as the browser tab stays open
- If session disconnects, use Cell 8 to resume

⚠️ **Keep this tab open** or training will stop!

In [None]:
import os
import time
from glob import glob
from datetime import datetime

# Use LOCAL frames directory (FAST I/O)
TRAINING_FRAME_DIR = '/content/frames_training'

# Find available checkpoints in Drive
checkpoint_dirs = glob(os.path.join(CHECKPOINT_DIR, 'basketball_*'))

if not checkpoint_dirs:
    print("❌ No checkpoints found in Google Drive.")
    print(f"Expected location: {CHECKPOINT_DIR}")
    print("\nMake sure Cell 7 (training) created checkpoints.")
    print("Or run Cell 7 to start fresh training.")
else:
    print("Available checkpoints in Drive:")
    for i, ckpt_dir in enumerate(sorted(checkpoint_dirs)):
        name = os.path.basename(ckpt_dir)
        files = os.listdir(ckpt_dir)
        checkpoint_files = [f for f in files if f.startswith('checkpoint_') and f.endswith('.pt')]
        print(f"  [{i}] {name} ({len(checkpoint_files)} checkpoint files)")
    
    # Auto-select most recent checkpoint
    selected_checkpoint = sorted(checkpoint_dirs)[-1]
    checkpoint_name = os.path.basename(selected_checkpoint)
    
    print(f"\nResuming from: {checkpoint_name}")
    print(f"Location: {selected_checkpoint}")
    
    # Training configuration (must match the training cell)
    DATASET = "basketball"
    MODEL_ARCH = "rny002_gsm"
    TEMPORAL_ARCH = "gru"
    BATCH_SIZE = 24  # Use the same value as the run being resumed (16 if lowered for OOM)
    CLIP_LEN = 30
    NUM_EPOCHS = 50
    LEARNING_RATE = 0.001
    CROP_DIM = 224
    
    print("\n" + "="*80)
    print("RESUMING TRAINING")
    print("="*80)
    print(f"Checkpoint: {checkpoint_name}")
    print(f"Frame dir:  {TRAINING_FRAME_DIR} (local SSD - FAST)")
    print(f"Save dir:   {selected_checkpoint}")
    print("="*80)
    print()
    
    start_time = time.time()
    
    # Resume training - using LOCAL frames for fast I/O
    !python3 train_e2e.py "{DATASET}" "{TRAINING_FRAME_DIR}" \
        -m "{MODEL_ARCH}" \
        -t "{TEMPORAL_ARCH}" \
        -s "{selected_checkpoint}" \
        --clip_len {CLIP_LEN} \
        --crop_dim {CROP_DIM} \
        --batch_size {BATCH_SIZE} \
        --num_epochs {NUM_EPOCHS} \
        --learning_rate {LEARNING_RATE} \
        --mixup False \
        --criterion map \
        --dilate_len 1 \
        --fg_upsample 0.5 \
        --start_val_epoch 5 \
        --warm_up_epochs 3 \
        --resume
    
    elapsed = time.time() - start_time
    print(f"\n\n{'='*80}")
    print(f"✓ Training session completed in {elapsed/3600:.1f} hours!")
    print(f"{'='*80}")
    print(f"\nResults saved to: {selected_checkpoint}")

## Cell 8: Resume Training from Checkpoint

**Use this if your Colab session disconnected during training.**

### When to use:
- Colab session timed out or disconnected
- You manually stopped training and want to continue
- Starting a new session to finish remaining epochs

### Prerequisites:
1. Start a new Colab session
2. Run the setup cells first (environment, Drive mount, repository clone); skip the frame download and reorganization if the frames are already in Drive
3. Then run this cell to resume

This will:
- Find your latest checkpoint in Drive
- Resume training from the last completed epoch
- Continue until all epochs are done
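
Finding the latest run can be sketched as below. The resume cell uses `sorted()` on the directory names, which works because the `%Y%m%d_%H%M%S` suffix sorts chronologically. This variant parses the timestamp so that stray directories are ignored. `latest_run` is a hypothetical helper, and the `basketball_colab_` prefix is taken from the training cell.

```python
import os
from datetime import datetime
from glob import glob

def latest_run(checkpoint_dir, prefix='basketball_colab_'):
    """Return the run directory with the newest timestamp suffix, or None if none exist."""
    runs = []
    for path in glob(os.path.join(checkpoint_dir, prefix + '*')):
        if not os.path.isdir(path):
            continue
        stamp = os.path.basename(path)[len(prefix):]
        try:
            when = datetime.strptime(stamp, '%Y%m%d_%H%M%S')
        except ValueError:
            continue  # Skip directories that do not follow the naming scheme
        runs.append((when, path))
    return max(runs)[1] if runs else None
```

A `None` result corresponds to the "No checkpoints found" branch of the resume cell.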

In [None]:
import os
import time
from glob import glob
from datetime import datetime

# Use reorganized frames directory
TRAINING_FRAME_DIR = f'{DRIVE_ROOT}/frames_training'

# Find available checkpoints in Drive
checkpoint_dirs = glob(os.path.join(CHECKPOINT_DIR, 'basketball_*'))

if not checkpoint_dirs:
    print("❌ No checkpoints found in Google Drive.")
    print(f"Expected location: {CHECKPOINT_DIR}")
    print("\nMake sure Cell 7 (training) created checkpoints.")
    print("Or run Cell 7 to start fresh training.")
else:
    print("Available checkpoints in Drive:")
    for i, ckpt_dir in enumerate(sorted(checkpoint_dirs)):
        name = os.path.basename(ckpt_dir)
        files = os.listdir(ckpt_dir)
        checkpoint_files = [f for f in files if f.startswith('checkpoint_') and f.endswith('.pt')]
        print(f"  [{i}] {name} ({len(checkpoint_files)} checkpoint files)")
    
    # Auto-select most recent checkpoint
    selected_checkpoint = sorted(checkpoint_dirs)[-1]
    checkpoint_name = os.path.basename(selected_checkpoint)
    
    print(f"\nResuming from: {checkpoint_name}")
    print(f"Location: {selected_checkpoint}")
    
    # Training configuration (must match the training cell)
    DATASET = "basketball"
    MODEL_ARCH = "rny002_gsm"
    TEMPORAL_ARCH = "gru"
    BATCH_SIZE = 24  # Use the same value as the run being resumed (16 if lowered for OOM)
    CLIP_LEN = 30
    NUM_EPOCHS = 50
    LEARNING_RATE = 0.001
    CROP_DIM = 224
    
    print("\n" + "="*80)
    print("RESUMING TRAINING")
    print("="*80)
    print(f"Checkpoint: {checkpoint_name}")
    print(f"Frame dir:  {TRAINING_FRAME_DIR} (reorganized)")
    print(f"Save dir:   {selected_checkpoint}")
    print("="*80)
    print()
    
    start_time = time.time()
    
    # Resume training - using reorganized frames from Drive
    !python3 train_e2e.py "{DATASET}" "{TRAINING_FRAME_DIR}" \
        -m "{MODEL_ARCH}" \
        -t "{TEMPORAL_ARCH}" \
        -s "{selected_checkpoint}" \
        --clip_len {CLIP_LEN} \
        --crop_dim {CROP_DIM} \
        --batch_size {BATCH_SIZE} \
        --num_epochs {NUM_EPOCHS} \
        --learning_rate {LEARNING_RATE} \
        --mixup False \
        --criterion map \
        --dilate_len 1 \
        --fg_upsample 0.5 \
        --start_val_epoch 5 \
        --warm_up_epochs 3 \
        --resume
    
    elapsed = time.time() - start_time
    print(f"\n\n{'='*80}")
    print(f"✓ Training session completed in {elapsed/3600:.1f} hours!")
    print(f"{'='*80}")
    print(f"\nResults saved to: {selected_checkpoint}")

## Cell 9: Check Training Progress

View training metrics and progress without interrupting training.

Run this anytime to check:
- Which checkpoints have been saved
- Training loss history
- Validation mAP (if computed)
- Current training status
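
For a quick summary instead of a full listing, the best validation score can be pulled out of the history. The sketch below assumes the same layout the progress cell reads (one JSON object per line with `split`, `epoch`, `loss` and an optional `mAP`) and assumes validation records use `split == "val"`. `best_val_map` is a hypothetical helper, and the sample records are made up.

```python
import json

def best_val_map(lines):
    """Return (epoch, mAP) for the best validation record, or None if no mAP was logged."""
    best = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        score = record.get('mAP')
        if record.get('split') != 'val' or score is None:
            continue
        if best is None or score > best[1]:
            best = (record['epoch'], score)
    return best

# Illustrative records only, not real training output
sample = [
    '{"split": "train", "epoch": 5, "loss": 0.41}',
    '{"split": "val", "epoch": 5, "loss": 0.44, "mAP": 0.31}',
    '{"split": "val", "epoch": 6, "loss": 0.40, "mAP": 0.37}',
]
print(best_val_map(sample))
```

In the notebook, `lines` would come from `open(loss_file).readlines()` on the latest run's `loss.json`.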

In [None]:
import os
import json
from glob import glob

# Find checkpoint directories
checkpoint_dirs = sorted(glob(os.path.join(CHECKPOINT_DIR, 'basketball_*')))

if not checkpoint_dirs:
    print("No training runs found yet.")
    print("Start training with Cell 7.")
else:
    latest_run = checkpoint_dirs[-1]
    run_name = os.path.basename(latest_run)
    
    print(f"Latest training run: {run_name}")
    print(f"Location: {latest_run}\n")
    
    # List checkpoint files
    checkpoint_files = sorted([f for f in os.listdir(latest_run) 
                               if f.startswith('checkpoint_') and f.endswith('.pt')])
    
    if checkpoint_files:
        print(f"Saved checkpoints ({len(checkpoint_files)}):")
        for ckpt in checkpoint_files:
            ckpt_path = os.path.join(latest_run, ckpt)
            size_mb = os.path.getsize(ckpt_path) / (1024*1024)
            print(f"  {ckpt} ({size_mb:.1f} MB)")
    else:
        print("No checkpoint files saved yet (training may be starting...)")
    
    # Show training history if available
    loss_file = os.path.join(latest_run, 'loss.json')
    if os.path.exists(loss_file):
        print(f"\nTraining history:")
        with open(loss_file) as f:
            lines = f.readlines()
        
        # Show last 10 epochs
        print(f"  (showing last 10 epochs)\n")
        for line in lines[-20:]:  # Last 20 lines (train+val per epoch)
            data = json.loads(line)
            split = data['split']
            epoch = data['epoch']
            loss = data['loss']
            map_score = data.get('mAP', 'N/A')
            if map_score != 'N/A':
                print(f"  Epoch {epoch:3d} [{split:5s}]: Loss={loss:.4f}, mAP={map_score:.4f}")
            else:
                print(f"  Epoch {epoch:3d} [{split:5s}]: Loss={loss:.4f}")
    else:
        print(f"\nNo loss.json found yet (training may be in first epoch)")
    
    # Show directory listing
    print(f"\nFull directory contents:")
    !ls -lh "{latest_run}"

---

## Troubleshooting

### Session disconnected during training
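1. Start a new Colab session
2. Run Cells 1-2 (environment setup and Drive mount); Cell 3 (AWS credentials) is only needed if frames must be downloaded again
3. Skip Cell 4 and the reorganization cell (frames are already in Drive)
4. Re-run Cell 5.5 if the resume cell reads from `/content/frames_training`; local Colab storage does not persist across sessions
5. Run the Clone Repository cell
6. Run Cell 8 (resume training)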

### Out of memory error
- In the training cell, lower `BATCH_SIZE` one step at a time (for example 24 → 16 → 8, then 6 or 4 on a T4)
- Restart the runtime and run the training cell again

### Frames not downloading
- Check AWS credentials in Cell 3
- Verify S3 bucket access: `!aws s3 ls s3://nba-foul-dataset-oh/`
- Check internet connection
- Re-run Cell 4 (aws s3 sync will resume interrupted downloads)

### GPU not detected
- Runtime → Change runtime type → GPU
- If still no GPU, try Runtime → Disconnect and delete runtime (called "Factory reset runtime" in older Colab versions), then reconnect

### Training is slow
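- Check the GPU reported by Cell 1 (Colab Pro should show a V100 or A100)
- T4 GPU is slower (~25-30 min/epoch)
- V100/A100 is faster (~15-20 min/epoch)
- Confirm training reads frames from local storage (Cell 5.5); reading from Drive is much slower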

---

## Expected Results

**Success criteria:**
- Overall mAP @ tolerance=2: **≥0.60** (good), **≥0.65** (excellent)
- Training time: ~14-17 hours
- Final test mAP computed automatically
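
"Tolerance=2" means a predicted foul frame counts as correct when it falls within 2 frames of a labeled foul. The sketch below shows only that matching rule at a single confidence threshold, using greedy one-to-one matching. It is not the repository's evaluation code and does not compute full mAP. `count_hits` is a hypothetical helper.

```python
def count_hits(pred_frames, gt_frames, tolerance=2):
    """Count predictions that match a ground-truth frame within +/- tolerance.

    pred_frames should be ordered by descending confidence; each ground-truth
    frame can be matched at most once.
    """
    unmatched = list(gt_frames)
    hits = 0
    for pred in pred_frames:
        # Ground-truth frames still available and close enough to this prediction
        candidates = [g for g in unmatched if abs(g - pred) <= tolerance]
        if candidates:
            nearest = min(candidates, key=lambda g: abs(g - pred))
            unmatched.remove(nearest)
            hits += 1
    return hits

# Frame 10 matches the label at 12; frame 11 finds no label left in range
print(count_hits([10, 11, 40], [12, 30]))
```

Precision and recall at a threshold follow from the hit count. Average precision is then obtained by sweeping the threshold, which this sketch leaves out.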

**After training completes:**
- Checkpoints are in Drive: `/content/drive/MyDrive/nba_foul_training/checkpoints/`
- Best model saved as `checkpoint_best.pt`
- Training history in `loss.json`
- Test predictions in `pred-test.*.json`

You can download checkpoints from Drive to your computer or use them directly for inference.