# NB00 - Video Keyframe Summarization: Project Overview

## Goal
This project builds an automated system that identifies the most representative frames (keyframes) of a video. We compare two temporal models for scoring frame importance:
1. **Bidirectional LSTM (BiLSTM)**: Modeling local temporal context.
2. **Transformer (Self-Attention)**: Modeling global dependencies across all frames.

---

## Project Map

The following diagram illustrates the complete pipeline of the project from raw video input to final evaluation.

```mermaid
graph TD
    A[Input Video] --> B[Frame Extraction @ 2 FPS]
    B --> C[CNN Feature Extraction - GoogLeNet]
    C --> D{Temporal Modeling}
    D --> E[BiLSTM Regressor]
    D --> F[Transformer Encoder]
    E --> G[Importance Scores]
    F --> G
    G --> H[Keyframe Selection - Top-K + Suppression]
    H --> I[Summary Visualization]
    I --> J[Evaluation - F1 Score & Diversity]
```
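
The "Top-K + Suppression" step in the diagram can be illustrated with a minimal sketch. It assumes suppression means a minimum temporal gap between selected frames; the function name `select_keyframes` and the parameters `k` and `min_gap` are hypothetical and are not taken from the project notebooks.

```python
from typing import List, Sequence


def select_keyframes(scores: Sequence[float], k: int, min_gap: int) -> List[int]:
    """Greedy Top-K selection with temporal suppression (illustrative only).

    Frames are visited in descending score order. A frame is kept only if it
    is at least `min_gap` frames away from every frame already chosen.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    for idx in order:
        if len(chosen) == k:
            break
        if all(abs(idx - c) >= min_gap for c in chosen):
            chosen.append(idx)
    return sorted(chosen)  # return keyframes in temporal order


# Hypothetical per-frame importance scores for an 8-frame clip.
scores = [0.1, 0.9, 0.85, 0.2, 0.3, 0.8, 0.75, 0.1]
print(select_keyframes(scores, k=3, min_gap=2))  # -> [1, 3, 5]
```

Without suppression, the top three frames would be 1, 2, and 5, two of which are adjacent and likely near-duplicates. The gap constraint trades a little score for a more diverse summary.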

---

## Notebook Roadmap

| ID | Title | Description |
|:---|:---|:---|
| **NB00** | **Project Overview** | Architecture and goals. |
| **NB01** | **Setup and Sanity Checks** | GPU, FFmpeg, and library validation. |
| **NB02** | **Dataset Structure** | Metadata indexing and dataset validation. |
| **NB03** | **TVSum Preprocessing** | Annotation alignment to 2 FPS and train/validation splits. |
| **NB04** | **Feature Extraction** | Precomputing frozen CNN embeddings. |
| **NB05** | **BiLSTM Training** | Training the baseline temporal model. |
| **NB06** | **Transformer Training** | Training the attention-based model. |
| **NB07** | **Inference and Selection** | Generating summaries and visualizations. |
| **NB08** | **TVSum Evaluation** | Quantitative results on benchmark data. |
| **NB09** | **SumMe Transfer** | Evaluating generalization capabilities. |
| **NB10** | **Ablation Studies** | Measuring how individual model choices affect performance. |
| **NB11** | **Final Report Assets** | Generating figures for the final package. |
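
As a rough illustration of the annotation alignment listed for NB03, the sketch below resamples frame-level scores recorded at a video's native frame rate down to the 2 FPS sampling grid. It assumes alignment is done by averaging the scores inside each half-second window, which may differ from what NB03 actually does; `align_scores_to_fps` and its parameters are hypothetical names.

```python
from typing import List, Sequence


def align_scores_to_fps(
    frame_scores: Sequence[float],
    native_fps: float,
    target_fps: float = 2.0,
) -> List[float]:
    """Average native-rate scores over each 1/target_fps window (assumed scheme)."""
    if native_fps < target_fps:
        raise ValueError("native_fps must be >= target_fps")
    step = native_fps / target_fps  # native frames per sampled frame
    n_out = int(len(frame_scores) / step)  # drop a trailing partial window
    aligned: List[float] = []
    for j in range(n_out):
        start = int(round(j * step))
        end = max(start + 1, int(round((j + 1) * step)))
        window = frame_scores[start:end]
        aligned.append(sum(window) / len(window))
    return aligned


# Hypothetical scores for 2 seconds of video at 4 FPS, resampled to 2 FPS.
print(align_scores_to_fps([1, 1, 3, 3, 5, 5, 7, 7], native_fps=4.0))
```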

---

## Datasets
We use two benchmark datasets for video summarization:
- **TVSum**: 50 videos across 10 categories, with frame-level human importance scores.
- **SumMe**: 25 videos with user-selected summary segments, used here for external validation.
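
Both datasets let a predicted summary be scored against human annotations by overlap. The sketch below computes an F1 score between two sets of selected frame indices. It is a simplified illustration and not the project's evaluation code: `summary_f1` is a hypothetical name, and how scores are aggregated across multiple annotators is left out.

```python
from typing import Iterable


def summary_f1(predicted: Iterable[int], reference: Iterable[int]) -> float:
    """F1 overlap between predicted and reference frame indices (illustrative)."""
    pred, ref = set(predicted), set(reference)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0  # also covers empty inputs, avoiding division by zero
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of 4 predicted frames fall inside a 6-frame reference.
print(round(summary_f1([1, 2, 3, 4], [3, 4, 5, 6, 7, 8]), 3))  # -> 0.4
```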

## Design Philosophy
- **Decoupled Extraction**: Features are extracted once and cached to disk to allow for rapid training iterations on temporal models.
- **Deterministic Splits**: All train/validation splits are seeded so that results are reproducible.
- **Qualitative and Quantitative**: Beyond reporting F1 scores, we visualize the importance the model assigns to each frame on a timeline.
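
One way to get a deterministic split is sketched below. The function name `seeded_split`, the 20% validation fraction, and the seed value are assumptions for illustration only; the notebooks may use different values or a different mechanism.

```python
import random
from typing import List, Sequence, Tuple


def seeded_split(
    video_ids: Sequence[str],
    val_fraction: float = 0.2,  # assumed fraction, not taken from the project
    seed: int = 42,             # assumed seed, not taken from the project
) -> Tuple[List[str], List[str]]:
    """Return (train_ids, val_ids) that depend only on the ids and the seed."""
    ids = sorted(video_ids)    # sort first so input order cannot change the split
    rng = random.Random(seed)  # local RNG; leaves the global random state untouched
    rng.shuffle(ids)
    n_val = max(1, int(round(len(ids) * val_fraction)))
    return ids[n_val:], ids[:n_val]


# Hypothetical ids standing in for the 50 TVSum videos.
video_ids = [f"video_{i:02d}" for i in range(50)]
train_ids, val_ids = seeded_split(video_ids)
print(len(train_ids), len(val_ids))  # -> 40 10
```

Sorting before shuffling matters: directory listings are not guaranteed to come back in the same order on every machine, so seeding alone would not be enough to reproduce the split.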