# SAM Audio Demo - Stadium Mode

This notebook demonstrates how to use Meta's SAM Audio model to remove commentary from sports broadcasts, keeping only the stadium atmosphere.

**Prerequisites:**
- Run this in Google Colab with GPU enabled (Runtime → Change runtime type → GPU)
- You need approved access to `facebook/sam-audio-large` on Hugging Face
- Have your Hugging Face token ready

**What this demo does:**
1. Downloads a sports video from YouTube
2. Extracts the audio
3. Uses SAM Audio to isolate the commentator's voice
4. Subtracts the commentary to leave the stadium atmosphere
5. Re-muxes video with processed audio


## Step 1: Check GPU and Install Dependencies


In [None]:
# Check GPU availability
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ No GPU detected! Go to Runtime → Change runtime type → GPU")


In [None]:
# Install SAM Audio and dependencies (this takes 2-3 minutes)
!pip install -q git+https://github.com/facebookresearch/sam-audio.git
!pip install -q yt-dlp soundfile librosa
print("✓ Dependencies installed!")


## Step 2: Login to Hugging Face

You need to authenticate to access the gated SAM Audio model.


In [None]:
# Login to Hugging Face
from huggingface_hub import login

# Option 1: Use Colab secrets (recommended - add HF_TOKEN in the Secrets panel on the left)
try:
    from google.colab import userdata
    HF_TOKEN = userdata.get('HF_TOKEN')
    login(token=HF_TOKEN)
    print("✓ Logged in using Colab secrets")
except Exception:
    # Option 2: Manual login - enter your token when prompted
    print("Enter your Hugging Face token below:")
    login()


## Step 3: Load SAM Audio Model

This downloads and loads the model (takes 3-5 minutes on first run).


In [None]:
import torch
from sam_audio import SAMAudio

# Load the model
print("Loading SAM Audio model (this may take a few minutes)...")
model = SAMAudio.from_pretrained("facebook/sam-audio-large")
model = model.cuda().eval()
print("✓ SAM Audio model loaded successfully!")


## Step 4: Download a Sports Video

Enter the URL of a sports video with commentary. Good examples:
- NBA/NFL/Soccer highlights with commentary
- Full game broadcasts (process shorter clips for faster results)
- Videos with clear crowd noise in background
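
For a full broadcast, one way to get a shorter clip is to cut a time window out of the download with ffmpeg before extracting audio. The sketch below only builds the command. The helper name `build_trim_command` and the file paths are hypothetical, and the 60-second start and 120-second length are arbitrary examples.

```python
import subprocess

def build_trim_command(src, dst, start_s, duration_s):
    """Build an ffmpeg command that copies a time window from src into dst."""
    if start_s < 0 or duration_s <= 0:
        raise ValueError("start_s must be >= 0 and duration_s must be > 0")
    return [
        "ffmpeg",
        "-ss", str(start_s),    # seek to the start of the window
        "-i", src,
        "-t", str(duration_s),  # keep this many seconds
        "-c", "copy",           # stream copy: fast, but cuts snap to keyframes
        "-y", dst,
    ]

# Hypothetical paths, for illustration only
cmd = build_trim_command("data/input/game.mp4", "data/input/game_clip.mp4", 60, 120)
print(" ".join(cmd))
# subprocess.run(cmd, check=True, capture_output=True)  # uncomment to run the cut
```

Stream copy avoids re-encoding, so the cut is quick but may start slightly before the requested time. If you use a clip like this, point `video_path` at the trimmed file in the later steps.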


In [None]:
import yt_dlp
import os

# Create output directories
os.makedirs('data/input', exist_ok=True)
os.makedirs('data/output', exist_ok=True)

# ========== ENTER YOUR VIDEO URL HERE ==========
VIDEO_URL = "https://www.youtube.com/watch?v=YOUR_VIDEO_ID"  # @param {type:"string"}
# ===============================================

# Download the video
ydl_opts = {
    'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best',
    'outtmpl': 'data/input/%(title)s.%(ext)s',
    'quiet': False,
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info(VIDEO_URL, download=True)
    video_path = ydl.prepare_filename(info)
    print(f"✓ Downloaded: {video_path}")


## Step 5: Extract and Load Audio


In [None]:
import subprocess
from pathlib import Path
import librosa
import numpy as np

# Extract audio from video
video_name = Path(video_path).stem
audio_path = f"data/input/{video_name}.wav"

subprocess.run([
    'ffmpeg', '-i', video_path,
    '-vn', '-acodec', 'pcm_s16le',
    '-ar', '48000', '-ac', '2',
    '-y', audio_path
], check=True, capture_output=True)

print(f"✓ Audio extracted to: {audio_path}")

# Load audio
print("Loading audio...")
audio, sr = librosa.load(audio_path, sr=48000, mono=False)

# If mono, convert to stereo
if len(audio.shape) == 1:
    audio = np.stack([audio, audio])

print(f"Audio shape: {audio.shape}")
print(f"Sample rate: {sr} Hz")
print(f"Duration: {audio.shape[1] / sr:.2f} seconds")


## Step 6: Process Audio with SAM Audio 🏟️

The key insight: We **isolate the commentator** and then **subtract** it from the original audio. This leaves us with the stadium atmosphere (crowd + game sounds).
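
The subtraction idea can be sketched on synthetic signals, independent of the model. The helper `subtract_isolated` is a hypothetical name, and the sine wave and noise are stand-ins for commentary and crowd. The sketch assumes the isolated track is sample-aligned with the original mix; if a model's output is shifted, resampled, or re-synthesized, plain subtraction will leave artifacts.

```python
import numpy as np

def subtract_isolated(mix, isolated):
    """Return mix - isolated after aligning channel count and length.

    Both inputs are (channels, samples) or (samples,) arrays.
    """
    mix = np.atleast_2d(mix)
    isolated = np.atleast_2d(isolated)
    # Repeat a mono estimate across every channel of the mix
    if isolated.shape[0] == 1 and mix.shape[0] > 1:
        isolated = np.repeat(isolated, mix.shape[0], axis=0)
    # Trim both to the shorter length so the subtraction stays sample-aligned
    n = min(mix.shape[1], isolated.shape[1])
    return mix[:, :n] - isolated[:, :n]

# Synthetic check: a 220 Hz tone as the "voice", noise as the "crowd"
sr = 48000
t = np.arange(sr) / sr
voice = 0.5 * np.sin(2 * np.pi * 220 * t)
crowd = 0.1 * np.random.default_rng(0).standard_normal(sr)
mix = np.stack([voice + crowd, voice + crowd])

residual = subtract_isolated(mix, voice)
print(residual.shape, np.allclose(residual[0], crowd))
```

With a perfect estimate of the voice, the residual matches the crowd signal. With a real model, the quality of the residual depends on how closely the isolated track matches the commentary in the mix.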


In [None]:
# Process with SAM Audio
print("Processing audio with SAM Audio...")
print("This may take a few minutes depending on audio length...")

# Convert to tensor [batch, channels, samples]
audio_tensor = torch.from_numpy(audio).float().unsqueeze(0).cuda()

# The prompt identifies what we want to ISOLATE (and then subtract)
# Try different prompts for different types of content:
PROMPT = "sports commentator announcing the game"  # @param {type:"string"}

with torch.no_grad():
    # Get the isolated commentator audio
    result = model.separate(
        audio_tensor,
        sample_rate=sr,
        text_prompt=PROMPT
    )

print("✓ SAM Audio processing complete!")


In [None]:
# Extract the separated audio and compute stadium atmosphere
import soundfile as sf

# Handle different output formats from the model
if isinstance(result, dict):
    print("Result keys:", result.keys())
    # Get the isolated audio (commentator)
    if 'separated' in result:
        isolated_audio = result['separated'].squeeze().cpu().numpy()
    elif 'audio' in result:
        isolated_audio = result['audio'].squeeze().cpu().numpy()
    else:
        # Use the first tensor in the dict
        isolated_audio = list(result.values())[0].squeeze().cpu().numpy()
    
    # Check for residual (what's left after removal)
    if 'residual' in result:
        stadium_audio = result['residual'].squeeze().cpu().numpy()
    else:
        # Compute by subtraction: original - isolated = stadium atmosphere
        stadium_audio = audio - isolated_audio
else:
    # Result is the separated audio directly
    isolated_audio = result.squeeze().cpu().numpy()
    stadium_audio = audio - isolated_audio

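# Normalize to prevent clipping (skip silent tracks to avoid dividing by zero)
stadium_peak = np.max(np.abs(stadium_audio))
if stadium_peak > 0:
    stadium_audio = stadium_audio / stadium_peak * 0.95
isolated_peak = np.max(np.abs(isolated_audio))
if isolated_peak > 0:
    isolated_audio = isolated_audio / isolated_peak * 0.95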

print(f"Stadium audio shape: {stadium_audio.shape}")
print(f"Isolated audio shape: {isolated_audio.shape}")


## Step 7: Save Results and Create Output Video


In [None]:
# Save audio files
stadium_audio_path = f"data/output/{video_name}_stadium_atmosphere.wav"
isolated_audio_path = f"data/output/{video_name}_commentator_only.wav"

# soundfile expects (samples, channels), so transpose each multi-channel array
def to_frames_first(x):
    return x.T if x.ndim == 2 else x

sf.write(stadium_audio_path, to_frames_first(stadium_audio), sr)
sf.write(isolated_audio_path, to_frames_first(isolated_audio), sr)

print(f"✓ Stadium atmosphere saved: {stadium_audio_path}")
print(f"✓ Commentator audio saved: {isolated_audio_path}")


In [None]:
# Create final video with stadium atmosphere audio
output_video_path = f"data/output/{video_name}_STADIUM_MODE.mp4"

subprocess.run([
    'ffmpeg',
    '-i', video_path,
    '-i', stadium_audio_path,
    '-c:v', 'copy',
    '-c:a', 'aac',
    '-map', '0:v:0',
    '-map', '1:a:0',
    '-shortest',
    '-y', output_video_path
], check=True, capture_output=True)

print(f"✓ Stadium Mode video created: {output_video_path}")


## Step 8: Listen and Compare 🎧


In [None]:
from IPython.display import Audio, display

print("🎧 ORIGINAL AUDIO:")
display(Audio(audio_path))

print("\n🏟️ STADIUM ATMOSPHERE (Commentary Removed):")
display(Audio(stadium_audio_path))

print("\n🎙️ COMMENTATOR ONLY (What was removed):")
display(Audio(isolated_audio_path))


## Step 9: Download Your Results 📥


In [None]:
from google.colab import files

print("📥 Downloading Stadium Mode video...")
files.download(output_video_path)

# Uncomment to also download audio files:
# files.download(stadium_audio_path)
# files.download(isolated_audio_path)


---

## 🎉 Demo Complete!

You now have:
1. **Stadium Mode Video** - Original video with commentary removed
2. **Stadium Atmosphere Audio** - Just the crowd and game sounds  
3. **Commentator Audio** - The isolated commentary track

### Tips for Best Results:

**Good prompts to try:**
- `"sports commentator announcing the game"`
- `"play-by-play announcer voice"`
- `"male voice commentary"` or `"female voice commentary"`
- `"English sports broadcast commentary"`

**Videos that work well:**
- Sports highlights with enthusiastic crowds
- Games with clear commentary vs. crowd separation
- Shorter clips (1-5 min) process faster

**Videos that are challenging:**
- Quiet indoor sports (tennis, golf)
- Videos with music mixed in
- Very long videos (process in chunks)
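
Chunked processing for long videos could look roughly like the sketch below, which splits the audio into overlapping windows, processes each one, and crossfades the results back together. Here `process_in_chunks` and `process_fn` are hypothetical names: `process_fn` stands in for whatever wrapper you write around the separation step, and is assumed to return an array of the same shape as its input. The 30-second chunk and 1-second overlap defaults are arbitrary starting points.

```python
import numpy as np

def process_in_chunks(audio, sr, process_fn, chunk_s=30.0, overlap_s=1.0):
    """Run process_fn on overlapping chunks of a (channels, samples) array."""
    chunk = int(chunk_s * sr)
    overlap = int(overlap_s * sr)
    if chunk <= 0 or not 0 <= overlap < chunk:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")
    hop = chunk - overlap
    total = audio.shape[1]
    out = np.zeros(audio.shape, dtype=np.float64)
    weight = np.zeros(total, dtype=np.float64)

    for start in range(0, total, hop):
        end = min(start + chunk, total)
        n = end - start
        piece = np.asarray(process_fn(audio[:, start:end]), dtype=np.float64)
        # Strictly positive fades, so every sample ends up with non-zero weight
        win = np.ones(n)
        ramp = min(overlap, n)
        if ramp > 0:
            fade = (np.arange(ramp) + 1) / (ramp + 1)
            win[:ramp] *= fade
            win[-ramp:] *= fade[::-1]
        out[:, start:end] += piece * win
        weight[start:end] += win
        if end == total:
            break  # the final chunk already reached the end of the audio

    # Dividing by the summed window weights turns the overlaps into a crossfade
    return out / weight

# Sanity check with an identity "model": the output should match the input
sr = 8000
audio = np.random.default_rng(0).standard_normal((2, 5 * sr))
restored = process_in_chunks(audio, sr, lambda x: x, chunk_s=1.0, overlap_s=0.1)
print(np.allclose(restored, audio))
```

The overlap gives neighbouring chunks shared context, and the weighted average smooths over small differences at the seams. Separation quality near chunk boundaries can still vary, so it is worth listening to the joins.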
