# VGGSound Sound Effects Filtering Pipeline - Demo

This notebook demonstrates the VGGSound filtering pipeline for identifying videos
containing only sound effects (no music, no speech).

**Requirements:**
- Google Colab with T4 GPU (free tier works)
- VGGSound data files (vggsound_00.tar.gz and vggsound.csv)

**Supported Platforms:**
- Linux with CUDA (Colab, cloud GPUs) - 8-bit quantization
- macOS Apple Silicon (MPS with fp16) - GPU-accelerated
- Linux/Windows CPU (fp16) - slower fallback
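
The platform list above amounts to a choice of device and precision. Here is a minimal sketch of that choice, assuming PyTorch is the backend. `choose_device` and `detect_device` are hypothetical helpers written for illustration, not part of the pipeline's API, and the pipeline's own selection logic may differ.

```python
def choose_device(cuda_available: bool, mps_available: bool) -> dict:
    """Map hardware availability to the device/precision pairs listed above."""
    if cuda_available:
        return {"device": "cuda", "precision": "8-bit"}
    if mps_available:
        return {"device": "mps", "precision": "fp16"}
    # CPU fallback, with the precision stated in the platform list
    return {"device": "cpu", "precision": "fp16"}


def detect_device() -> dict:
    """Probe PyTorch for available backends; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return choose_device(False, False)
    mps = getattr(torch.backends, "mps", None)  # absent on older torch builds
    return choose_device(
        torch.cuda.is_available(),
        bool(mps is not None and mps.is_available()),
    )
```

Calling `detect_device()` on a Colab T4 runtime should report `cuda`. The resulting device name can then be passed to the CLI's `--device` option used later in this notebook.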

## 1. Setup Environment

In [None]:
# Check GPU availability
!nvidia-smi

In [None]:
# Install uv (much faster than pip)
!curl -LsSf https://astral.sh/uv/install.sh | sh
import os
os.environ['PATH'] = f"{os.environ['HOME']}/.local/bin:{os.environ['PATH']}"

In [None]:
# Clone and install the pipeline with CUDA 12.1 support (Colab T4).
# Uncomment the lines below and replace the URL with your repo URL.
# !git clone https://github.com/YOUR_USERNAME/vggsound-pipeline.git
# %cd vggsound-pipeline
# !uv sync --extra cu121

In [None]:
# Alternative: Install with pip if uv doesn't work
# !pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
# !pip install transformers accelerate "bitsandbytes>=0.44" \
#     typer pydantic pydantic-settings orjson ffmpeg-python tqdm
# !pip install -e ".[cu121]"

## 2. Upload Data Files

Upload your VGGSound files:
- `vggsound.csv` - metadata file
- `vggsound_00.tar.gz` - video archive (or a smaller sample)

In [None]:
from google.colab import files
import os

# Check if files exist, otherwise upload
if not os.path.exists('vggsound.csv'):
    print("Please upload vggsound.csv")
    uploaded = files.upload()

if not os.path.exists('vggsound_00.tar.gz'):
    print("Please upload vggsound_00.tar.gz (or a sample tar file)")
    uploaded = files.upload()
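
After uploading, it can help to confirm that the archive is readable before starting a GPU run. This is a minimal sketch using only the standard library. `peek_archive` is a hypothetical helper, and the layout of files inside the archive is not assumed.

```python
import tarfile
from pathlib import Path


def peek_archive(archive_path: Path, limit: int = 5) -> list[str]:
    """Return the names of the first `limit` regular files in a tar archive."""
    names = []
    # "r:*" lets tarfile auto-detect the compression (gzip for .tar.gz)
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            if member.isfile():
                names.append(member.name)
            if len(names) >= limit:
                break
    return names


archive = Path("vggsound_00.tar.gz")
if archive.exists():
    for name in peek_archive(archive):
        print(name)
```

Because the loop stops after `limit` files, the check stays fast even on a large archive.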

## 3. Quick Test - Label Extraction

Let's start with a simple test that doesn't require a GPU: extracting and categorizing labels.

In [None]:
from vggsound_pipeline.extraction import parse_vggsound_csv
from vggsound_pipeline.label_filter import get_unique_labels, categorize_label
from pathlib import Path
from collections import Counter

# Parse CSV
metadata = parse_vggsound_csv(Path('vggsound.csv'))
print(f"Total samples in CSV: {len(metadata)}")

# Get unique labels
labels = get_unique_labels(Path('vggsound.csv'))
print(f"Unique labels: {len(labels)}")

# Categorize
categories = Counter(categorize_label(label) for label in labels)
print("\nLabel categories:")
for cat, count in categories.items():
    print(f"  {cat}: {count}")
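
The internals of `categorize_label` are not shown in this notebook. As a rough illustration of what label-level categorization can look like, here is a keyword-based sketch. The function name `categorize_by_keyword`, the keyword lists, and the category names are all assumptions made for illustration, not the pipeline's actual rules.

```python
# Hypothetical keyword lists; the real pipeline may use different rules.
SPEECH_KEYWORDS = ("speech", "speaking", "talking", "whispering", "conversation")
MUSIC_KEYWORDS = ("playing", "singing", "orchestra", "music")


def categorize_by_keyword(label: str) -> str:
    """Assign a coarse category to a label by substring matching."""
    text = label.lower()
    if any(keyword in text for keyword in SPEECH_KEYWORDS):
        return "speech"
    if any(keyword in text for keyword in MUSIC_KEYWORDS):
        return "music"
    return "sfx"


for example in ("playing violin", "people whispering", "dog barking"):
    print(f"{example!r} -> {categorize_by_keyword(example)}")
```

Substring rules like these are blunt: a label such as "playing tennis" would be filed under music. That is one reason a label filter is usually only a first pass before audio-level scoring.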

## 4. Run Pipeline (Small Sample)

Run the full pipeline on a small sample to check that everything works end to end.

In [None]:
# Run with CLI (simplest approach)
!uv run vggsound run vggsound_00.tar.gz vggsound.csv \
    --sample-limit 50 \
    --output demo_output.jsonl \
    --device cuda

## 5. View Results

In [None]:
import orjson
from pathlib import Path

# Load results
results = [
    orjson.loads(line)
    for line in Path('demo_output.jsonl').read_text().splitlines()
    if line.strip()
]

print(f"Total SFX samples found: {len(results)}")
print("\n" + "="*60)
print("Sample Results:")
print("="*60)

for r in results[:5]:
    print(f"\n[{r['video_id']}]")
    print(f"  Original label: {r.get('original_label', 'N/A')}")
    print(f"  Speech score: {r['speech_score']:.4f}")
    print(f"  Music score: {r['music_score']:.4f}")
    print(f"  Confidence: {r['confidence']}")
    print(f"  Caption: {r['audio_text_description'][:150]}...")

In [None]:
# Show statistics
!uv run vggsound stats demo_output.jsonl
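
If the CLI is not available, similar summary numbers can be computed directly from the loaded records. This is a minimal sketch that assumes each record carries the `speech_score`, `music_score`, and `confidence` fields printed above. `summarize` is a hypothetical helper, and its output is not meant to match the `stats` command exactly.

```python
from collections import Counter
from statistics import mean, median


def summarize(records: list[dict]) -> dict:
    """Aggregate score statistics and confidence counts over result records."""
    if not records:
        return {"count": 0}
    speech = [r["speech_score"] for r in records]
    music = [r["music_score"] for r in records]
    return {
        "count": len(records),
        "speech_mean": mean(speech),
        "speech_median": median(speech),
        "music_mean": mean(music),
        "music_median": median(music),
        "confidence_counts": dict(Counter(r["confidence"] for r in records)),
    }


# Illustrative values only, not real pipeline output
demo_records = [
    {"speech_score": 0.02, "music_score": 0.10, "confidence": "high"},
    {"speech_score": 0.04, "music_score": 0.20, "confidence": "high"},
]
print(summarize(demo_records))
```

In the notebook, `summarize(results)` would run the same aggregation over the records loaded from `demo_output.jsonl`.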

## 6. Score Distribution Visualization

In [None]:
import matplotlib.pyplot as plt

speech_scores = [r['speech_score'] for r in results]
music_scores = [r['music_score'] for r in results]

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(speech_scores, bins=20, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Speech Score')
axes[0].set_ylabel('Count')
axes[0].set_title('Speech Score Distribution')
axes[0].axvline(x=0.1, color='r', linestyle='--', label='Threshold')
axes[0].legend()

axes[1].hist(music_scores, bins=20, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Music Score')
axes[1].set_ylabel('Count')
axes[1].set_title('Music Score Distribution')
axes[1].axvline(x=0.3, color='r', linestyle='--', label='Threshold')
axes[1].legend()

plt.tight_layout()
plt.show()
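
The dashed lines in the plots mark cutoffs of 0.1 for speech and 0.3 for music. The sketch below shows how such thresholds could be applied to result records, assuming a sample passes only when both scores fall below their cutoffs. `passes_thresholds` is a hypothetical helper, and the pipeline's actual decision rule may differ.

```python
def passes_thresholds(
    record: dict,
    speech_threshold: float = 0.1,
    music_threshold: float = 0.3,
) -> bool:
    """Return True when both scores fall strictly below their thresholds."""
    return (
        record["speech_score"] < speech_threshold
        and record["music_score"] < music_threshold
    )


# Illustrative values only, not real pipeline output
candidates = [
    {"video_id": "a", "speech_score": 0.02, "music_score": 0.10},
    {"video_id": "b", "speech_score": 0.25, "music_score": 0.10},
    {"video_id": "c", "speech_score": 0.02, "music_score": 0.45},
]
kept = [r["video_id"] for r in candidates if passes_thresholds(r)]
print(kept)
```

Tightening either threshold trades recall for purity, so it is worth re-plotting the histograms after any change.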

## 7. Larger Run (Optional)

If the small sample worked, try a larger run.

In [None]:
# Run on 500 samples
!uv run vggsound run vggsound_00.tar.gz vggsound.csv \
    --sample-limit 500 \
    --output full_output.jsonl \
    --device cuda \
    --batch-size 8

In [None]:
# Download results
from google.colab import files
files.download('full_output.jsonl')
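
For review in a spreadsheet, the JSONL output can be flattened to CSV before downloading. This is a minimal standard-library sketch. The field list is taken from the keys used earlier in this notebook, and `jsonl_to_csv` is a hypothetical helper rather than a pipeline command.

```python
import csv
import json
from pathlib import Path

FIELDS = [
    "video_id", "original_label", "speech_score",
    "music_score", "confidence", "audio_text_description",
]


def jsonl_to_csv(jsonl_path: Path, csv_path: Path) -> int:
    """Write the selected fields of each JSONL record to CSV; return the row count."""
    rows = 0
    with jsonl_path.open(encoding="utf-8") as src, \
            csv_path.open("w", newline="", encoding="utf-8") as dst:
        # extrasaction="ignore" drops any fields not listed in FIELDS;
        # missing fields are written as empty strings
        writer = csv.DictWriter(dst, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        for line in src:
            if line.strip():
                writer.writerow(json.loads(line))
                rows += 1
    return rows


source = Path("full_output.jsonl")
if source.exists():
    print(jsonl_to_csv(source, Path("full_output.csv")), "rows written")
```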

## 8. Manual Validation

Listen to a few samples to verify quality.

In [None]:
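import random
from pathlib import Path

import IPython.display as ipd

# Pick random samples to validate
sample = random.sample(results, min(5, len(results)))

for r in sample:
    # Audio extracted by the pipeline is expected in its cache directory
    audio_path = Path('.cache/vggsound/audio') / f"{r['video_id']}.wav"
    if audio_path.exists():
        print(f"\n[{r['video_id']}]")
        print(f"Caption: {r['audio_text_description']}")
        print(f"Speech: {r['speech_score']:.3f}, Music: {r['music_score']:.3f}")
        ipd.display(ipd.Audio(str(audio_path)))
    else:
        # Report missing files instead of skipping them silently
        print(f"\n[{r['video_id']}] audio not found at {audio_path}")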
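
Random sampling can miss the cases most likely to be wrong. As a complement, the sketch below orders records so that those closest to a threshold come first. It assumes the 0.1 and 0.3 cutoffs shown in the plots above, and `borderline_first` is a hypothetical helper, not part of the pipeline.

```python
def borderline_first(
    records: list[dict],
    speech_threshold: float = 0.1,
    music_threshold: float = 0.3,
) -> list[dict]:
    """Sort records so those closest to either threshold come first."""
    def margin(record: dict) -> float:
        # Distance below each threshold; the smaller one decides how borderline it is
        return min(
            speech_threshold - record["speech_score"],
            music_threshold - record["music_score"],
        )
    return sorted(records, key=margin)


# Illustrative values only, not real pipeline output
demo = [
    {"video_id": "a", "speech_score": 0.01, "music_score": 0.05},
    {"video_id": "b", "speech_score": 0.09, "music_score": 0.05},
    {"video_id": "c", "speech_score": 0.01, "music_score": 0.28},
]
print([r["video_id"] for r in borderline_first(demo)])
```

Listening to the first few records from `borderline_first(results)` gives a quick sense of whether the thresholds are too loose.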