# NB04 — CNN Feature Extraction

**Goal**: Convert every sampled frame from **NB03** into a high-dimensional visual embedding using a frozen CNN backbone.

### Why Feature Extraction?
1. **Efficiency**: Running the CNN backbone (MobileNetV3) on every frame during every epoch of temporal model training (BiLSTM/Transformer) is computationally prohibitive. Computing the embeddings once and caching them saves hours of GPU time.
2. **Standardization**: Ensures that both the BiLSTM and Transformer see the exact same visual inputs.
3. **Focus**: Allows Phase 3 models to focus purely on temporal importance rather than pixel-level visual recognition.

In [1]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torchvision import models, transforms
import pandas as pd
import numpy as np
import cv2
from pathlib import Path
import json
from tqdm.auto import tqdm
from PIL import Image

# Environment detection
IS_KAGGLE = Path("/kaggle/input").exists()

if IS_KAGGLE:
    # Kaggle dataset paths
    FRAME_INDEX_PATH = Path("/kaggle/input/tvsum-frame-index/tvsum_frame_index.parquet")
    VIDEO_INDEX_PATH = Path("/kaggle/input/tvsum-index/tvsum_index.csv")
    # Note: NB04 will use the 'video_path' from tvsum_index.csv directly.
else:
    FRAME_INDEX_PATH = Path("data/processed/tvsum_frame_index.parquet")
    VIDEO_INDEX_PATH = Path("data/processed/tvsum_index.csv")

PROCESSED_DIR = Path("data/processed")
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(f"Using device: {DEVICE}")

Using device: cuda


## 1. Metadata Preparation

We merge `video_path` and `fps` from the master index into our sampling list to ensure the Dataset has everything it needs for accurate frame seeking.

In [2]:
# Load indices
frame_df = pd.read_parquet(FRAME_INDEX_PATH)
video_df = pd.read_csv(VIDEO_INDEX_PATH)

# Attach video_path and fps to each sampled frame (left join keeps every frame row)
merged_df = frame_df.merge(video_df[['video_id', 'video_path', 'fps']], on='video_id', how='left')

# Final check for missing paths
missing_count = merged_df['video_path'].isna().sum()
if missing_count > 0:
    raise ValueError(f"Found {missing_count} frames with no matching video path. Ensure NB02 was run correctly.")

print(f"Ready to process {len(merged_df)} frames across {merged_df['video_id'].nunique()} videos.")

Ready to process 25190 frames across 50 videos.
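
The merge-then-check pattern above can be tried on toy data. This is a minimal sketch, not the notebook's own cell: the video ids, paths, and fps values are made up, and the `validate` and `indicator` arguments are optional extras that the cell above does not use.

```python
import pandas as pd

# Toy stand-ins for the frame index and the master video index (hypothetical values).
frame_df = pd.DataFrame({
    "video_id": ["vid_a", "vid_a", "vid_b", "vid_c"],
    "timestamp_sec": [0.0, 0.5, 0.0, 0.0],
})
video_df = pd.DataFrame({
    "video_id": ["vid_a", "vid_b"],
    "video_path": ["videos/vid_a.mp4", "videos/vid_b.mp4"],
    "fps": [30.0, 25.0],
})

# validate="many_to_one" raises if video_id is duplicated in video_df,
# which would otherwise silently multiply frame rows.
merged = frame_df.merge(
    video_df[["video_id", "video_path", "fps"]],
    on="video_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows tagged "left_only" are frames whose video_id has no entry in video_df.
unmatched_ids = sorted(merged.loc[merged["_merge"] == "left_only", "video_id"].unique())
print(f"rows: {len(merged)}, unmatched video ids: {unmatched_ids}")
```

Reporting the unmatched ids, rather than only a count, makes it easier to see which videos are missing from the master index.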


## 2. Model Initialization (Frozen MobileNetV3)

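We use **MobileNetV3-Large** for its excellent balance of semantic richness and speed. We replace the classification head with an identity layer, so the model returns the 960-dimensional global-average-pooled (GAP) embedding instead of class logits.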

In [3]:
def get_feature_extractor():
    # Load pre-trained model
    model = models.mobilenet_v3_large(weights="IMAGENET1K_V1")
    
    # Identity layer removes the final classifier but keeps the GAP output
    model.classifier = nn.Identity()
    
    model = model.to(DEVICE)
    model.eval()
    return model

model = get_feature_extractor()
FEATURE_DIM = 960  # Default for MobileNetV3-Large architecture
print(f"Extractor ready. Feature dimension: {FEATURE_DIM}")

Downloading: "https://download.pytorch.org/models/mobilenet_v3_large-8738ca79.pth" to /root/.cache/torch/hub/checkpoints/mobilenet_v3_large-8738ca79.pth


100%|██████████| 21.1M/21.1M [00:00<00:00, 178MB/s]


Extractor ready. Feature dimension: 960
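
To make the GAP step concrete, the sketch below applies global average pooling to a random activation map in NumPy. It is illustrative only: the 7x7 spatial size is an assumption about the backbone's final feature map for a 224x224 input, and no real model is involved.

```python
import numpy as np

# Hypothetical activation map: batch of 2 images, 960 channels, 7x7 spatial grid.
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((2, 960, 7, 7)).astype(np.float32)

# Global average pooling: average each channel over all of its spatial positions.
embedding = feature_map.mean(axis=(2, 3))

print(embedding.shape)
```

Each image ends up with one value per channel, which is why the stored feature dimension equals the channel count of the last convolutional stage.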


## 3. High-Resolution Data Pipe

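To keep extracted frames aligned with their labels, we compute `native_idx = round(timestamp_sec * fps)` for each sampled frame. OpenCV then seeks to the native frame nearest the timestamp that carries the ground-truth label, rather than relying on the position of the row in the sampled index.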

In [4]:
class VideoFrameDataset(Dataset):
    def __init__(self, df, transform=None):
        self.df = df
        self.transform = transform
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        video_path = row['video_path']
        timestamp = row['timestamp_sec']
        fps = row['fps']
        
        # FIX: Calculate native frame index from timestamp to ensure accurate seeking
        native_idx = int(round(timestamp * fps))
        
        cap = cv2.VideoCapture(video_path)
        cap.set(cv2.CAP_PROP_POS_FRAMES, native_idx)
        ret, frame = cap.read()
        cap.release()
        
        if not ret:
            # Fallback: Create black frame if read fails to prevent crash
            frame = np.zeros((224, 224, 3), dtype=np.uint8)
            
        # Convert BGR (OpenCV) to RGB (PIL/Torchvision)
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        img = Image.fromarray(frame)
        
        if self.transform:
            img = self.transform(img)
            
        return img

# Standard ImageNet transforms
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

dataset = VideoFrameDataset(merged_df, transform=preprocess)
loader = DataLoader(dataset, batch_size=128, shuffle=False, num_workers=0)

print("Pipeline initialized. Sequential extraction (shuffle=False) is ACTIVE.")

Pipeline initialized. Sequential extraction (shuffle=False) is ACTIVE.
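
The timestamp-to-frame mapping can be checked in isolation. The helper below is a hypothetical sketch: `timestamp_to_frame_index` and its optional `frame_count` clamp are not part of the notebook, which instead falls back to a black frame when a read fails.

```python
def timestamp_to_frame_index(timestamp_sec, fps, frame_count=None):
    """Map a timestamp in seconds to the nearest native frame index."""
    idx = int(round(timestamp_sec * fps))
    if frame_count is not None:
        # Clamp so a timestamp at the very end of the video cannot seek past the last frame.
        idx = max(0, min(idx, frame_count - 1))
    return idx


examples = [
    timestamp_to_frame_index(2.0, 29.97),                   # 59.94 rounds to 60
    timestamp_to_frame_index(10.0, 25.0),                   # exactly frame 250
    timestamp_to_frame_index(12.0, 25.0, frame_count=300),  # 300 is clamped to 299
]
print(examples)
```

Clamping is one way to avoid out-of-range seeks. Whether it is preferable to the black-frame fallback depends on how often sampled timestamps land at the end of a video.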


## 4. Batch Extraction Loop

In [5]:
num_frames = len(merged_df)
all_features = np.zeros((num_frames, FEATURE_DIM), dtype=np.float32)

start_ptr = 0
with torch.no_grad():
    for batch in tqdm(loader):
        batch = batch.to(DEVICE)
        features = model(batch)
        
        # Move to CPU and store; the output is already (batch, FEATURE_DIM)
        features_np = features.cpu().numpy()
        batch_size = features_np.shape[0]
        
        all_features[start_ptr:start_ptr + batch_size] = features_np
        start_ptr += batch_size

print(f"Extraction complete. Shape: {all_features.shape}")

  0%|          | 0/197 [00:00<?, ?it/s]

Extraction complete. Shape: (25190, 960)
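
The pointer-based fill used in the loop above can be exercised without a GPU. In this sketch a stand-in function, `fake_extract` (hypothetical), replaces the CNN and writes each frame's id into its row, so the final order is easy to verify.

```python
import numpy as np

num_frames, feature_dim, batch_size = 10, 4, 3
out = np.zeros((num_frames, feature_dim), dtype=np.float32)


def fake_extract(frame_ids):
    # Stand-in for the CNN: each row is filled with its own frame id.
    return np.repeat(frame_ids[:, None], feature_dim, axis=1).astype(np.float32)


start = 0
for batch_start in range(0, num_frames, batch_size):
    frame_ids = np.arange(batch_start, min(batch_start + batch_size, num_frames))
    feats = fake_extract(frame_ids)
    # The last batch may be smaller, so advance by the actual number of rows.
    out[start:start + len(feats)] = feats
    start += len(feats)

# Every row should have been written exactly once, in input order.
assert start == num_frames
print(out[:, 0])
```

A final check that the write pointer equals the number of frames is a cheap guard against a silently truncated run.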


## 5. Serialization & Manifest

In [7]:
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

FEATURE_SAVE_PATH = PROCESSED_DIR / "tvsum_features.npy"
MANIFEST_SAVE_PATH = PROCESSED_DIR / "feature_manifest.json"

# Save array
np.save(FEATURE_SAVE_PATH, all_features)

# Save metadata
manifest = {
    "backbone": "MobileNetV3-Large",
    "weights": "IMAGENET1K_V1",
    "feature_dim": FEATURE_DIM,
    "total_frames": num_frames,
    # Set by construction (timestamp-based seeking); no alignment check is run in this notebook
    "alignment_verified": True,
    "order": "sequential_per_frame_index"
}

with open(MANIFEST_SAVE_PATH, 'w') as f:
    json.dump(manifest, f, indent=4)

print(f"Features saved to {FEATURE_SAVE_PATH}")
print("NB04 complete. Ready for Phase 3 Modeling.")

Features saved to data/processed/tvsum_features.npy
NB04 complete. Ready for Phase 3 Modeling.
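
Because features are stored in frame-index order, a downstream notebook can recover per-video sequences by masking on `video_id`. The sketch below uses a tiny made-up array and temporary files, not the real `tvsum_features.npy`. The file names and the dictionary layout are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

tmp_dir = Path(tempfile.mkdtemp())

# Toy feature matrix: 6 frames, 2 dimensions, rows in frame-index order.
features = np.arange(12, dtype=np.float32).reshape(6, 2)
video_ids = np.array(["vid_a", "vid_a", "vid_a", "vid_b", "vid_b", "vid_c"])

np.save(tmp_dir / "features.npy", features)
(tmp_dir / "manifest.json").write_text(json.dumps({"feature_dim": 2, "total_frames": 6}))

# Reload and confirm the array matches what the manifest promises.
manifest = json.loads((tmp_dir / "manifest.json").read_text())
loaded = np.load(tmp_dir / "features.npy", mmap_mode="r")
assert loaded.shape == (manifest["total_frames"], manifest["feature_dim"])

# Row i of the array corresponds to row i of the frame index, so a boolean
# mask on video_id selects one video's frames in temporal order.
per_video = {
    str(vid): np.asarray(loaded[video_ids == vid]) for vid in np.unique(video_ids)
}
print({vid: arr.shape for vid, arr in per_video.items()})
```

Loading with `mmap_mode="r"` avoids reading the whole array into memory at once, which may matter more for larger datasets than for this one.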
