# Notebook 19: ASR with Whisper

---

## Inference Engineering Course

Welcome to Notebook 19! Here we explore **Automatic Speech Recognition (ASR)** using OpenAI's **Whisper** model, one of the most robust and widely used ASR systems.

### What You Will Learn

| Topic | Description |
|-------|-------------|
| **Whisper Architecture** | How audio is processed into text |
| **Model Loading** | Load and run Whisper models of different sizes |
| **Transcription** | Transcribe audio samples |
| **Real-Time Factor** | Measure RTF (key ASR performance metric) |
| **Chunked Transcription** | Handle long audio efficiently |
| **Mel Spectrograms** | Visualize audio preprocessing |
| **Model Comparison** | Compare tiny, base, and small models |

### How Whisper Works

```
Audio Waveform -> Mel Spectrogram -> Encoder (Transformer) -> Decoder -> Text
```

Whisper processes audio in **30-second chunks**, converting each chunk into a mel spectrogram (80 frequency bins x 3000 time steps), then using an encoder-decoder Transformer to generate text.
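
The 3000-step figure follows directly from the frame rate. The sketch below works through the arithmetic, assuming 16 kHz audio and a 10 ms hop between frames (the pipeline described in Part 2). The constant names are illustrative and not taken from the Whisper source.

```python
SAMPLE_RATE = 16_000   # samples per second (assumed input rate)
HOP_LENGTH = 160       # samples between consecutive frames (10 ms)
CHUNK_SECONDS = 30     # length of one Whisper input window
N_MELS = 80            # mel frequency bins

frames_per_second = SAMPLE_RATE // HOP_LENGTH   # one frame every 10 ms
n_frames = CHUNK_SECONDS * frames_per_second    # time steps per chunk
n_samples = CHUNK_SECONDS * SAMPLE_RATE         # raw samples per chunk

print(f"{n_samples} samples -> ({N_MELS}, {n_frames}) mel spectrogram")
```

The spectrogram is thus roughly 160 times shorter along the time axis than the raw waveform, which is what makes the encoder's attention over a full 30-second window tractable.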

---

## Part 1: Setup & Installations

In [None]:
%%capture
!pip install openai-whisper torch torchaudio matplotlib numpy librosa soundfile

In [None]:
import torch
import whisper
import numpy as np
import matplotlib.pyplot as plt
import time
import warnings
import os
warnings.filterwarnings('ignore')

plt.style.use('default')
plt.rcParams['figure.figsize'] = (14, 5)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Device: {device}")
if device == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")

print(f"Whisper version: {whisper.__version__}")

## Part 2: Understanding the Whisper Architecture

### Whisper Model Sizes

| Model | Parameters | Multilingual | Relative Speed | English WER |
|-------|-----------|-------------|----------------|-------------|
| tiny | 39M | Yes (`tiny.en` is English-only) | ~32x | ~7.7% |
| base | 74M | Yes (`base.en` is English-only) | ~16x | ~5.7% |
| small | 244M | Yes (`small.en` is English-only) | ~6x | ~4.2% |
| medium | 769M | Yes (`medium.en` is English-only) | ~2x | ~3.6% |
| large | 1.55B | Yes | 1x (baseline) | ~2.7% |
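
The parameter counts in the table translate into a rough lower bound on memory. The sketch below estimates the weights-only footprint at 4 bytes per parameter (FP32) and 2 bytes (FP16). It deliberately ignores activations, decoder caches, and framework overhead, so real usage will be higher. `weight_megabytes` is an illustrative helper, not a Whisper function.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

# Parameter counts (in millions) from the table above
PARAMS_MILLIONS = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1550}

def weight_megabytes(params_millions: float, dtype: str) -> float:
    """Weights-only footprint in MB (1 MB = 1e6 bytes)."""
    # (millions of params * 1e6 * bytes) / 1e6 bytes-per-MB simplifies to this product
    return params_millions * BYTES_PER_PARAM[dtype]

for name, params in PARAMS_MILLIONS.items():
    fp32 = weight_megabytes(params, "fp32")
    fp16 = weight_megabytes(params, "fp16")
    print(f"{name:<7} fp32: {fp32:>6.0f} MB   fp16: {fp16:>6.0f} MB")
```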

### Audio Processing Pipeline

1. **Resample** audio to 16kHz
2. **Extract** log-mel spectrogram (80 bins, 25ms window, 10ms hop)
3. **Pad/trim** to 30 seconds (3000 time steps)
4. **Encode** with Transformer encoder
5. **Decode** text autoregressively with Transformer decoder
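
To make step 2 concrete, here is a NumPy-only sketch of the framing behind a spectrogram: a 25 ms window (400 samples) advanced in 10 ms hops (160 samples). It is a simplified stand-in and not Whisper's implementation. It stops at a log-power STFT, omits the mel filterbank, and does not pad the signal, so its frame count is slightly below 100 per second. `log_power_spectrogram` is a hypothetical helper name.

```python
import numpy as np

SAMPLE_RATE = 16_000
N_FFT = 400        # 25 ms analysis window
HOP_LENGTH = 160   # 10 ms hop between windows

def log_power_spectrogram(audio: np.ndarray) -> np.ndarray:
    """Log-power STFT with shape (freq_bins, frames); no mel filterbank, no padding."""
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(audio) - N_FFT) // HOP_LENGTH
    # Slice overlapping frames and apply the window to each
    frames = np.stack([
        audio[i * HOP_LENGTH : i * HOP_LENGTH + N_FFT] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Clamp before the log to avoid log(0)
    return np.log10(np.maximum(power, 1e-10)).T

# One second of a 440 Hz tone
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
tone = np.sin(2 * np.pi * 440 * t).astype(np.float32)

spec = log_power_spectrogram(tone)
# Each FFT bin spans SAMPLE_RATE / N_FFT = 40 Hz
peak_hz = spec.mean(axis=1).argmax() * SAMPLE_RATE / N_FFT

print(f"Spectrogram shape: {spec.shape}")
print(f"Strongest frequency: {peak_hz:.0f} Hz")
```

Whisper then maps these linear frequency bins onto 80 mel bands and pads or trims the result to exactly 3000 frames.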

In [None]:
# Load the Whisper tiny model first (fast loading)
print("Loading Whisper tiny model...")
model_tiny = whisper.load_model("tiny", device=device)
print(f"Model loaded!")
print(f"Parameters: {sum(p.numel() for p in model_tiny.parameters()) / 1e6:.1f}M")
print(f"Encoder layers: {len(model_tiny.encoder.blocks)}")
print(f"Decoder layers: {len(model_tiny.decoder.blocks)}")
print(f"Model dimension: {model_tiny.dims.n_audio_state}")
print(f"Attention heads: {model_tiny.dims.n_audio_head}")

## Part 3: Creating and Processing Audio

Let's create synthetic audio for testing and also download a real audio sample.

In [None]:
import soundfile as sf

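def create_synthetic_audio(text_hint: str, duration_s: float = 5.0, 
                            sample_rate: int = 16000) -> np.ndarray:
    """
    Create synthetic audio (sine waves) for testing.
    This won't transcribe meaningfully but demonstrates the pipeline.
    `text_hint` is only a label for the caller; it does not affect the waveform.
    """
    # Sample times spaced exactly 1/sample_rate apart (linspace would include the endpoint)
    t = np.arange(int(sample_rate * duration_s)) / sample_rate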
    
    # Create a mixture of frequencies (simulating speech-like audio)
    audio = (
        0.3 * np.sin(2 * np.pi * 200 * t) +  # Fundamental
        0.2 * np.sin(2 * np.pi * 400 * t) +  # Harmonic
        0.1 * np.sin(2 * np.pi * 800 * t) +  # Higher harmonic
        0.05 * np.random.randn(len(t))         # Noise
    )
    
    # Add amplitude envelope (speech-like)
    envelope = np.ones_like(t)
    for i in range(5):
        pos = int(len(t) * (i + 0.5) / 5)
        width = len(t) // 10
        start = max(0, pos - width)
        end = min(len(t), pos + width)
        envelope[start:end] *= 0.3  # Create pauses
    
    audio *= envelope
    audio = audio / np.max(np.abs(audio))  # Normalize
    
    return audio.astype(np.float32)

# Create a test audio file
sample_rate = 16000
test_audio = create_synthetic_audio("test audio", duration_s=5.0)
sf.write('/tmp/test_audio.wav', test_audio, sample_rate)
print(f"Created test audio: {len(test_audio)} samples, {len(test_audio)/sample_rate:.1f}s")

# Also download a real speech sample (an excerpt of JFK's inaugural address)
print("\nDownloading a real speech sample...")
# The URL serves MP3 data; ffmpeg (used by whisper.load_audio) detects the format from the content
!wget -q -O /tmp/jfk.mp3 https://upload.wikimedia.org/wikipedia/commons/transcoded/a/a7/JFK_Inaugural_Address_-_Excerpt.ogg/JFK_Inaugural_Address_-_Excerpt.ogg.mp3
real_audio_path = '/tmp/jfk.mp3'
# wget -O creates the output file even when the download fails, so check the size as well.
# A failed shell command does not raise a Python exception, so try/except would not catch it.
if os.path.exists(real_audio_path) and os.path.getsize(real_audio_path) > 0:
    print("Downloaded real audio sample.")
else:
    real_audio_path = None
    print("Download failed, will use synthetic audio.")

## Part 4: Visualizing Mel Spectrograms

The mel spectrogram is Whisper's input representation of audio. It captures:
- **Time** (horizontal axis): when things happen
- **Frequency** (vertical axis): pitch content
- **Intensity** (color): energy at each time-frequency point
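
The frequency axis is not linear: mel bands are narrow at low frequencies and wide at high ones, mirroring how pitch is perceived. The sketch below places 80 band centres between 0 Hz and 8 kHz (the Nyquist frequency for 16 kHz audio). It uses the HTK-style mel formula, which is an assumption chosen for simplicity. Whisper loads precomputed filters that may follow a different mel variant, so treat the exact values as illustrative.

```python
import numpy as np

def hz_to_mel(hz):
    """HTK-style mel formula (an assumption for illustration)."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

N_MELS = 80
F_MAX = 8000.0  # Nyquist frequency for 16 kHz audio

# N_MELS + 2 points evenly spaced in mel; the interior points are the band centres
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(F_MAX), N_MELS + 2)
centers_hz = mel_to_hz(mel_points[1:-1])
spacing_hz = np.diff(centers_hz)

print(f"First 3 centres: {np.round(centers_hz[:3], 1)} Hz")
print(f"Last 3 centres:  {np.round(centers_hz[-3:], 1)} Hz")
print(f"Spacing grows from {spacing_hz[0]:.1f} Hz to {spacing_hz[-1]:.1f} Hz")
```

This is why the lower rows of the spectrogram resolve fine pitch differences while the upper rows each cover a much wider slice of the spectrum.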

In [None]:
def plot_mel_spectrogram(audio_path_or_array, title="Mel Spectrogram", 
                          sample_rate=16000):
    """
    Load audio and plot its mel spectrogram using Whisper's preprocessing.
    """
    # Load audio
    if isinstance(audio_path_or_array, str):
        audio = whisper.load_audio(audio_path_or_array)
    else:
        audio = audio_path_or_array
    
    duration = len(audio) / sample_rate
    
    # Pad/trim to 30 seconds (Whisper's expected input)
    audio_padded = whisper.pad_or_trim(audio)
    
    # Compute mel spectrogram
    mel = whisper.log_mel_spectrogram(audio_padded).numpy()
    
    fig, axes = plt.subplots(2, 1, figsize=(16, 8))
    
    # Top: Waveform
    ax = axes[0]
    time_axis = np.arange(len(audio)) / sample_rate
    ax.plot(time_axis, audio, color='#2196F3', linewidth=0.5, alpha=0.7)
    ax.set_xlabel('Time (s)', fontsize=12)
    ax.set_ylabel('Amplitude', fontsize=12)
    ax.set_title(f'{title} - Waveform ({duration:.1f}s, {sample_rate}Hz)', 
                 fontsize=14, fontweight='bold')
    ax.set_xlim(0, min(duration, 30))
    
    # Bottom: Mel spectrogram
    ax = axes[1]
    im = ax.imshow(mel, aspect='auto', origin='lower', cmap='magma',
                   extent=[0, 30, 0, 80])
    ax.set_xlabel('Time (s)', fontsize=12)
    ax.set_ylabel('Mel Frequency Bin', fontsize=12)
    ax.set_title(f'Log-Mel Spectrogram (80 bins x 3000 frames)', 
                 fontsize=14, fontweight='bold')
    plt.colorbar(im, ax=ax, label='Log Energy')
    
    plt.tight_layout()
    plt.show()
    
    return mel

# Plot mel spectrogram of synthetic audio
mel_synthetic = plot_mel_spectrogram(test_audio, "Synthetic Audio")
print(f"Mel spectrogram shape: {mel_synthetic.shape}")
print(f"  80 frequency bins x 3000 time frames = 240,000 values")
print(f"  Each frame covers ~10ms of audio")

In [None]:
# If real audio is available, visualize it too
if real_audio_path and os.path.exists(real_audio_path):
    try:
        mel_real = plot_mel_spectrogram(real_audio_path, "Real Speech Audio")
    except Exception as e:
        print(f"Could not process real audio: {e}")
else:
    print("Real audio not available. Using synthetic audio for demonstrations.")

## Part 5: Transcribing Audio

Let's transcribe audio and measure performance.

In [None]:
def transcribe_and_measure(model, audio_path: str, model_name: str = "unknown") -> dict:
    """
    Transcribe audio and measure performance metrics.
    
    Returns:
        dict with transcription text, timing, and RTF
    """
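    # Load audio once: this gives the duration and keeps the ffmpeg decode out of the timing
    audio = whisper.load_audio(audio_path)
    audio_duration = len(audio) / 16000  # whisper.load_audio resamples to 16 kHz
    
    # Transcribe with timing (perf_counter is monotonic, unlike time.time)
    start = time.perf_counter()
    result = model.transcribe(audio, fp16=(device == 'cuda'))
    elapsed = time.perf_counter() - start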
    
    # Real-Time Factor: ratio of processing time to audio duration
    # RTF < 1 means faster than real-time
    rtf = elapsed / audio_duration if audio_duration > 0 else float('inf')
    
    return {
        'model': model_name,
        'text': result['text'].strip(),
        'language': result.get('language', 'unknown'),
        'audio_duration_s': round(audio_duration, 2),
        'process_time_s': round(elapsed, 3),
        'rtf': round(rtf, 4),
        'faster_than_realtime': rtf < 1.0,
        'segments': result.get('segments', []),
    }

# Transcribe with tiny model
print("Transcribing synthetic audio (tiny model)...")
result = transcribe_and_measure(model_tiny, '/tmp/test_audio.wav', 'tiny')
print(f"\nResult:")
print(f"  Text: '{result['text']}'")
print(f"  Language: {result['language']}")
print(f"  Audio duration: {result['audio_duration_s']}s")
print(f"  Processing time: {result['process_time_s']}s")
print(f"  Real-Time Factor: {result['rtf']}")
print(f"  Faster than real-time: {result['faster_than_realtime']}")

# Also transcribe real audio if available
if real_audio_path and os.path.exists(real_audio_path):
    print("\n" + "=" * 60)
    print("Transcribing real speech audio...")
    try:
        result_real = transcribe_and_measure(model_tiny, real_audio_path, 'tiny')
        print(f"\nResult:")
        print(f"  Text: '{result_real['text'][:200]}'")
        print(f"  Audio duration: {result_real['audio_duration_s']}s")
        print(f"  Processing time: {result_real['process_time_s']}s")
        print(f"  RTF: {result_real['rtf']}")
    except Exception as e:
        print(f"  Error: {e}")

## Part 6: Measuring Real-Time Factor (RTF)

**Real-Time Factor (RTF)** is the key performance metric for ASR:

$$\text{RTF} = \frac{\text{Processing Time}}{\text{Audio Duration}}$$

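- **RTF < 1**: Faster than real time. Required for live transcription, and the lower the value, the more headroom for batch throughput.
- **RTF = 1**: Processing exactly keeps pace with the audio, leaving no headroom for load spikes.
- **RTF > 1**: Slower than real time. A live stream falls further and further behind, although offline batch jobs still complete.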
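
As a worked example of the formula, the sketch below computes the RTF for a hypothetical run that takes 3 seconds to process 30 seconds of audio. The numbers are made up for illustration and `real_time_factor` is not a Whisper function.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return processing_s / audio_s

# Hypothetical measurement: 3 s of compute for a 30 s clip
rtf = real_time_factor(processing_s=3.0, audio_s=30.0)
speedup = 1.0 / rtf  # how many seconds of audio are handled per second of compute

print(f"RTF = {rtf:.2f} ({speedup:.0f}x faster than real time)")
```

The reciprocal of RTF is often easier to reason about for capacity planning: it is the number of audio seconds one worker can process per wall-clock second.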

In [None]:
# Create audio samples of different durations for RTF measurement
durations = [2, 5, 10, 15, 20, 30]
rtf_results = []

print("Measuring RTF for different audio durations...")
for dur in durations:
    # Create audio
    audio = create_synthetic_audio("test", duration_s=dur)
    audio_path = f'/tmp/test_audio_{dur}s.wav'
    sf.write(audio_path, audio, 16000)
    
    # Measure RTF (run twice: the first run is a warmup, the second is timed)
    _ = model_tiny.transcribe(audio_path, fp16=(device == 'cuda'))
    result = transcribe_and_measure(model_tiny, audio_path, 'tiny')
    rtf_results.append(result)
    print(f"  Duration {dur:>3d}s: RTF = {result['rtf']:.4f} | Process time = {result['process_time_s']:.3f}s")

# Visualize RTF
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Processing time vs audio duration
ax = axes[0]
audio_durs = [r['audio_duration_s'] for r in rtf_results]
proc_times = [r['process_time_s'] for r in rtf_results]

ax.plot(audio_durs, proc_times, 'o-', color='#2196F3', linewidth=2.5, markersize=10, label='Actual')
ax.plot([0, max(audio_durs)], [0, max(audio_durs)], '--', color='red', 
        linewidth=2, label='Real-time line (RTF=1)')
ax.fill_between([0, max(audio_durs)], [0, 0], [0, max(audio_durs)], 
                alpha=0.1, color='green', label='Faster than real-time')

ax.set_xlabel('Audio Duration (s)', fontsize=12)
ax.set_ylabel('Processing Time (s)', fontsize=12)
ax.set_title('Processing Time vs Audio Duration (Whisper-tiny)',
             fontsize=14, fontweight='bold')
ax.legend(fontsize=10)

# Right: RTF by duration
ax = axes[1]
rtfs = [r['rtf'] for r in rtf_results]
colors = ['#4CAF50' if rtf < 1 else '#F44336' for rtf in rtfs]
bars = ax.bar(range(len(durations)), rtfs, color=colors, alpha=0.8, edgecolor='black')
ax.axhline(y=1.0, color='red', linestyle='--', linewidth=2, label='Real-time threshold')

ax.set_xticks(range(len(durations)))
ax.set_xticklabels([f'{d}s' for d in durations])
ax.set_xlabel('Audio Duration', fontsize=12)
ax.set_ylabel('Real-Time Factor', fontsize=12)
ax.set_title('RTF by Audio Duration', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)

for bar, rtf in zip(bars, rtfs):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.02,
           f'{rtf:.3f}', ha='center', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

## Part 7: Chunked Transcription for Long Audio

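Whisper processes audio in **30-second chunks**. For long audio, the input must be split into windows and the results stitched together. `model.transcribe()` already does this internally with a sliding 30-second window; here we implement a simplified, non-overlapping version by hand to see what each chunk costs, then visualize the process.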
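
Before running the model, it helps to see the chunk boundaries on their own. The sketch below lists the sample ranges for a 70-second clip split into non-overlapping 30-second chunks, assuming 16 kHz audio. `chunk_bounds` is an illustrative helper, not part of Whisper.

```python
from typing import Iterator, Tuple

SAMPLE_RATE = 16_000

def chunk_bounds(n_samples: int, chunk_s: int = 30) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices of consecutive non-overlapping chunks."""
    step = chunk_s * SAMPLE_RATE
    for start in range(0, n_samples, step):
        # The final chunk may be shorter than chunk_s
        yield start, min(start + step, n_samples)

bounds = list(chunk_bounds(70 * SAMPLE_RATE))
for start, end in bounds:
    print(f"{start / SAMPLE_RATE:5.1f}s -> {end / SAMPLE_RATE:5.1f}s")
```

The last chunk covers only 10 seconds, so it has to be padded to 30 seconds before encoding. That padding is wasted compute, which is why short trailing chunks tend to show a worse per-chunk RTF.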

In [None]:
def chunked_transcription(model, audio_path: str, chunk_length_s: int = 30) -> dict:
    """
    Transcribe long audio by manually splitting it into fixed-length chunks.
    
    A simplified illustration of processing audio longer than 30 seconds:
    chunks do not overlap and are decoded independently, one after another.
    """
    # Load full audio
    audio = whisper.load_audio(audio_path)
    total_duration = len(audio) / 16000
    samples_per_chunk = chunk_length_s * 16000
    
    # Split into chunks
    chunks = []
    for start in range(0, len(audio), samples_per_chunk):
        chunk = audio[start:start + samples_per_chunk]
        chunks.append(chunk)
    
    # Process each chunk
    chunk_results = []
    language = 'unknown'  # stays 'unknown' if the audio is empty
    total_start = time.perf_counter()
    
    for i, chunk in enumerate(chunks):
        chunk_duration = len(chunk) / 16000
        chunk_start = time.perf_counter()
        
        # Pad to 30s if needed
        padded = whisper.pad_or_trim(chunk)
        mel = whisper.log_mel_spectrogram(padded, n_mels=model.dims.n_mels).to(device)
        
        # Detect language on the first chunk and reuse it for the rest
        if i == 0:
            _, probs = model.detect_language(mel)
            language = max(probs, key=probs.get)
        
        # Decode with the detected language so it is not re-detected for every chunk
        options = whisper.DecodingOptions(language=language, fp16=(device == 'cuda'))
        result = whisper.decode(model, mel, options)
        
        chunk_time = time.perf_counter() - chunk_start
        
        chunk_results.append({
            'chunk_idx': i,
            'start_time': i * chunk_length_s,
            'duration': round(chunk_duration, 2),
            'text': result.text,
            'process_time': round(chunk_time, 3),
            'rtf': round(chunk_time / chunk_duration, 4) if chunk_duration > 0 else 0,
        })
    
    total_time = time.perf_counter() - total_start
    full_text = ' '.join(cr['text'] for cr in chunk_results)
    
    return {
        'total_duration': round(total_duration, 2),
        'total_process_time': round(total_time, 3),
        'overall_rtf': round(total_time / total_duration, 4) if total_duration > 0 else 0,
        'num_chunks': len(chunks),
        'full_text': full_text,
        'chunk_results': chunk_results,
        'language': language,
    }

# Create a longer audio sample (60 seconds)
long_audio = create_synthetic_audio("long test", duration_s=60.0)
sf.write('/tmp/long_audio.wav', long_audio, 16000)

print("Processing 60-second audio with chunking...")
chunked_result = chunked_transcription(model_tiny, '/tmp/long_audio.wav')

print(f"\nResults:")
print(f"  Total audio: {chunked_result['total_duration']}s")
print(f"  Chunks: {chunked_result['num_chunks']}")
print(f"  Total processing: {chunked_result['total_process_time']}s")
print(f"  Overall RTF: {chunked_result['overall_rtf']}")
print(f"\nPer-chunk breakdown:")
for cr in chunked_result['chunk_results']:
    print(f"  Chunk {cr['chunk_idx']}: {cr['duration']}s audio -> {cr['process_time']}s processing (RTF={cr['rtf']})")

In [None]:
# Visualize chunked processing
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Timeline view of chunk processing
ax = axes[0]
chunks = chunked_result['chunk_results']
cum_audio = 0
cum_process = 0

for cr in chunks:
    # Audio chunk (blue)
    ax.barh(0, cr['duration'], left=cum_audio, height=0.3,
           color='#2196F3', alpha=0.7, edgecolor='black')
    # Processing time (red/green)
    color = '#4CAF50' if cr['rtf'] < 1 else '#F44336'
    ax.barh(1, cr['process_time'], left=cum_process, height=0.3,
           color=color, alpha=0.7, edgecolor='black')
    
    cum_audio += cr['duration']
    cum_process += cr['process_time']

ax.set_yticks([0, 1])
ax.set_yticklabels(['Audio Duration', 'Processing Time'], fontsize=11)
ax.set_xlabel('Time (seconds)', fontsize=12)
ax.set_title('Chunked Processing Timeline', fontsize=14, fontweight='bold')

# Right: RTF per chunk
ax = axes[1]
chunk_rtfs = [cr['rtf'] for cr in chunks]
chunk_labels = [f'Chunk {cr["chunk_idx"]}' for cr in chunks]
colors = ['#4CAF50' if rtf < 1 else '#F44336' for rtf in chunk_rtfs]

ax.bar(chunk_labels, chunk_rtfs, color=colors, alpha=0.8, edgecolor='black')
ax.axhline(y=1.0, color='red', linestyle='--', linewidth=2, label='Real-time')
ax.set_ylabel('RTF', fontsize=12)
ax.set_title('RTF per Chunk', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)

plt.tight_layout()
plt.show()

## Part 8: Comparing Model Sizes

Let's compare Whisper tiny, base, and small models on the same audio.

In [None]:
# Load multiple model sizes
models_to_compare = ['tiny', 'base', 'small']
loaded_models = {}
model_load_times = {}

for model_name in models_to_compare:
    print(f"Loading Whisper {model_name}...")
    start = time.time()
    loaded_models[model_name] = whisper.load_model(model_name, device=device)
    load_time = time.time() - start
    model_load_times[model_name] = load_time
    params = sum(p.numel() for p in loaded_models[model_name].parameters())
    print(f"  Loaded in {load_time:.1f}s | Parameters: {params/1e6:.1f}M")

print("\nAll models loaded!")

In [None]:
# Benchmark all models on the same audio
comparison_results = []

# Create a 10-second test audio so every model sees identical input
audio_path = '/tmp/benchmark_audio.wav'
test_audio_10s = create_synthetic_audio("benchmark", duration_s=10.0)
sf.write(audio_path, test_audio_10s, 16000)

print("Benchmarking models (10-second audio)...\n")
print(f"{'Model':<10} {'Params':>8} {'Time (s)':>10} {'RTF':>8} {'Text (preview)'}")
print("=" * 80)

for model_name in models_to_compare:
    # Warmup run
    _ = loaded_models[model_name].transcribe('/tmp/benchmark_audio.wav', 
                                              fp16=(device == 'cuda'))
    
    # Timed run
    result = transcribe_and_measure(
        loaded_models[model_name], 
        '/tmp/benchmark_audio.wav', 
        model_name
    )
    
    params = sum(p.numel() for p in loaded_models[model_name].parameters())
    result['params_m'] = params / 1e6
    comparison_results.append(result)
    
    text_preview = result['text'][:40] + '...' if len(result['text']) > 40 else result['text']
    print(f"{model_name:<10} {result['params_m']:>7.1f}M {result['process_time_s']:>10.3f} {result['rtf']:>8.4f} {text_preview}")

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

model_names = [r['model'] for r in comparison_results]
colors = ['#4CAF50', '#FF9800', '#F44336']

# Plot 1: Parameters
ax = axes[0]
params = [r['params_m'] for r in comparison_results]
bars = ax.bar(model_names, params, color=colors, alpha=0.8, edgecolor='black')
for bar, p in zip(bars, params):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 2,
           f'{p:.0f}M', ha='center', fontsize=11, fontweight='bold')
ax.set_ylabel('Parameters (Millions)', fontsize=12)
ax.set_title('Model Size', fontsize=14, fontweight='bold')

# Plot 2: Processing Speed (RTF)
ax = axes[1]
rtfs = [r['rtf'] for r in comparison_results]
bars = ax.bar(model_names, rtfs, color=colors, alpha=0.8, edgecolor='black')
ax.axhline(y=1.0, color='red', linestyle='--', linewidth=2, label='Real-time')
for bar, rtf in zip(bars, rtfs):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.01,
           f'{rtf:.3f}', ha='center', fontsize=11, fontweight='bold')
ax.set_ylabel('Real-Time Factor', fontsize=12)
ax.set_title('Inference Speed (lower = faster)', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)

# Plot 3: Speed vs Size tradeoff
ax = axes[2]
for i, r in enumerate(comparison_results):
    ax.scatter(r['params_m'], r['rtf'], s=200, c=colors[i], edgecolors='black',
              linewidths=2, zorder=5, label=r['model'])
ax.plot([r['params_m'] for r in comparison_results],
        [r['rtf'] for r in comparison_results],
        '--', color='gray', alpha=0.5)
ax.axhline(y=1.0, color='red', linestyle=':', alpha=0.5)
ax.set_xlabel('Parameters (Millions)', fontsize=12)
ax.set_ylabel('RTF', fontsize=12)
ax.set_title('Speed vs Size Tradeoff', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)

plt.tight_layout()
plt.show()

print("\nKey insight: Larger models are slower. They are generally more accurate too, but this synthetic benchmark measures speed only.")
print("Choose the smallest model that meets your accuracy requirements.")

## Part 9: Whisper Inference Optimizations

Let's explore what optimizations make Whisper faster in production.

In [None]:
# Benchmark: FP32 vs FP16 (if GPU available)
if device == 'cuda':
    print("Comparing FP32 vs FP16 inference...\n")
    
    audio_path = '/tmp/benchmark_audio.wav'
    model = loaded_models['base']
    
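    # Warm up both precisions so one-time CUDA setup is not charged to whichever runs first
    for use_fp16 in (True, False):
        _ = model.transcribe(audio_path, fp16=use_fp16)
    
    # FP16
    times_fp16 = []
    for _ in range(3):
        start = time.perf_counter()
        _ = model.transcribe(audio_path, fp16=True)
        times_fp16.append(time.perf_counter() - start)
    avg_fp16 = np.mean(times_fp16)
    
    # FP32
    times_fp32 = []
    for _ in range(3):
        start = time.perf_counter()
        _ = model.transcribe(audio_path, fp16=False)
        times_fp32.append(time.perf_counter() - start)
    avg_fp32 = np.mean(times_fp32)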
    
    speedup = avg_fp32 / avg_fp16
    
    fig, ax = plt.subplots(figsize=(8, 5))
    bars = ax.bar(['FP32', 'FP16'], [avg_fp32, avg_fp16], 
                  color=['#F44336', '#4CAF50'], alpha=0.8, edgecolor='black')
    for bar, val in zip(bars, [avg_fp32, avg_fp16]):
        ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.01,
               f'{val:.3f}s', ha='center', fontsize=12, fontweight='bold')
    
    ax.set_ylabel('Processing Time (seconds)', fontsize=12)
    ax.set_title(f'FP32 vs FP16 Inference (Whisper-base)\nSpeedup: {speedup:.2f}x',
                 fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    print(f"FP32: {avg_fp32:.3f}s")
    print(f"FP16: {avg_fp16:.3f}s")
    print(f"Speedup: {speedup:.2f}x")
else:
    print("GPU not available for FP16 comparison.")
    print("On GPU, FP16 typically provides 1.5-2x speedup over FP32.")

## Part 10: Key Takeaways

### Summary

1. **Whisper processes 30-second chunks** of audio at a time, converting mel spectrograms to text via encoder-decoder Transformers.

2. **Real-Time Factor (RTF)** is the critical metric: RTF < 1 means the system can keep up with live audio.

3. **Model size tradeoff**: Smaller models (tiny, base) are fast but less accurate. Larger models (medium, large) are more accurate but need more compute.

4. **FP16 inference** provides significant speedup on GPU with minimal quality loss.

5. **Long audio** requires chunking. The chunking strategy (overlap, stride) affects boundary quality.
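
Item 3 refers to accuracy, which this notebook does not measure directly. The usual metric, and the one behind the WER column in Part 2, is word error rate: the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below is a minimal version that assumes lowercase whitespace tokenization; real evaluations normally apply a text normalizer first. `word_error_rate` is an illustrative helper, not part of Whisper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

reference = "ask not what your country can do for you"
hypothesis = "ask not what you country can do for you"  # one substituted word
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}")
```

Running each model on real speech with known reference transcripts and comparing WER alongside RTF gives the full picture of the speed/accuracy tradeoff.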

### Production Considerations

| Consideration | Recommendation |
|--------------|----------------|
| Live transcription | Use tiny/base with streaming |
| Batch processing | Use small/medium for accuracy |
| Multilingual | Use multilingual models |
| Optimization | FP16, batching, TensorRT |
| Long audio | Chunked processing with overlap |

### Further Optimizations

- **Faster Whisper** (CTranslate2): up to 4x faster than the original implementation at the same accuracy
- **Whisper.cpp**: CPU-optimized C++ implementation
- **Distil-Whisper**: Knowledge-distilled smaller models
- **TensorRT**: NVIDIA's inference optimizer

---

## Exercises

### Exercise 1: Language Detection Accuracy
Test Whisper's language detection on audio in different languages.

In [None]:
# Exercise 1: Test language detection
# Whisper can detect the language of audio automatically
# Try using: model.detect_language(mel_spectrogram)

# TODO: Load audio in different languages and test detection

print("Exercise 1: Test language detection accuracy!")

### Exercise 2: Chunked Processing with Overlap
Implement chunked transcription with overlapping windows to improve boundary quality.

In [None]:
# Exercise 2: Implement overlapping chunks
# Instead of non-overlapping 30s chunks, use:
# - 30s chunks with 5s overlap
# - Merge text from overlapping regions

def overlapping_transcription(model, audio_path, chunk_s=30, overlap_s=5):
    """
    TODO: Implement chunked transcription with overlap.
    """
    pass

print("Exercise 2: Implement overlapping chunk transcription!")

### Exercise 3: Batch Processing Throughput
Measure throughput when processing multiple audio files in sequence vs parallel.

In [None]:
# Exercise 3: Measure batch processing throughput
# Create 10 audio files of varying lengths (5-30 seconds)
# Process them sequentially and measure total throughput
# Calculate: total audio minutes processed per minute of compute

# TODO: Implement batch throughput measurement

print("Exercise 3: Measure batch processing throughput!")

---

**End of Notebook 19: ASR with Whisper**

Next: [Notebook 20 - Text-to-Speech Inference](./20_tts_inference.ipynb)