# Notebook 07: Audio - Automatic Speech Recognition (ASR)

**Learning Objectives:**
- Convert speech to text using ASR models
- Use Whisper model for transcription
- Handle different audio formats
- Process audio files and microphone input

## Prerequisites

### Hardware Requirements

| Model Option | Model Name | Size | Min RAM | Recommended Setup | Notes |
|--------------|------------|------|---------|-------------------|-------|
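| **small (CPU-friendly)** | openai/whisper-tiny | 72MB | 4GB | 4GB RAM, CPU | 39M parameters; fast, 99+ languages, ~85-90% accuracy on clear speech |
| **large (GPU-optimized)** | openai/whisper-small | 483MB | 6GB | GPU with 8GB+ VRAM (e.g., RTX 4080) | 244M parameters; better accuracy, production-grade |

**Note:** Throughout this notebook, "small model" and "large model" refer to the two options in this table (`openai/whisper-tiny` and `openai/whisper-small`), not to Whisper's own `small` and `large` checkpoints.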

### Software Requirements
- Python 3.8+
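- Libraries: `transformers`, `torch`, `soundfile`, `librosa`, `datasets` (used for the LibriSpeech cell)
- `ffmpeg` on the system path, which the `transformers` ASR pipeline typically uses to decode audio given as a file path or URL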

In [None]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline, set_seed
import soundfile as sf
import requests
from io import BytesIO
import warnings
warnings.filterwarnings('ignore')

# Set seed for reproducibility
set_seed(1103)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## Expected Behaviors

### First Time Running
- **Model Download**: ~72MB for `openai/whisper-tiny`, the default "small" option (~1-2 minutes)
- Smallest model in this tutorial!
- Very fast even on CPU

### Setup Cell Output
```
PyTorch version: 2.x.x
CUDA available: True/False
```

### Model Loading
```
Loading openai/whisper-tiny...
```
- **CPU**: 2-4 seconds
- **GPU**: 1-2 seconds

### Transcription Output Format
```python
{'text': 'I have a dream that one day this nation will rise up...'}
```

### With Timestamps
```python
{
  'text': 'Full transcription...',
  'chunks': [
    {'timestamp': [0.0, 2.5], 'text': 'I have a dream'},
    {'timestamp': [2.5, 5.0], 'text': 'that one day'}
  ]
}
```
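
The timestamped result above can be turned into readable lines with a small helper. This is a minimal sketch: `format_timestamp` and `format_chunks` are hypothetical names, and the handling of `None` assumes that the final chunk may lack an end timestamp, which Whisper can produce when the audio stops mid-segment.

```python
def format_timestamp(seconds):
    """Render seconds as MM:SS.ss; None (an open-ended chunk) becomes '??:??'."""
    if seconds is None:
        return "??:??"
    # Round to centiseconds first so 59.999 s rolls over to 01:00.00
    total_centiseconds = round(float(seconds) * 100)
    minutes, centiseconds = divmod(total_centiseconds, 6000)
    return f"{minutes:02d}:{centiseconds / 100:05.2f}"


def format_chunks(result):
    """Return one '[start -> end] text' line per chunk of a pipeline result."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        lines.append(f"[{format_timestamp(start)} -> {format_timestamp(end)}] {text}")
    return lines


# Illustrative data shaped like the output above (not real model output)
example = {
    "text": "I have a dream that one day",
    "chunks": [
        {"timestamp": (0.0, 2.5), "text": " I have a dream"},
        {"timestamp": (2.5, None), "text": " that one day"},
    ],
}
for line in format_chunks(example):
    print(line)
```

Passing the real `result_with_timestamps` dict to `format_chunks` should work the same way, since only the `chunks`, `timestamp`, and `text` keys are read.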

### Transcription Quality
- **small model** (clear speech, quiet background): 85-90% accuracy
- **large model** (clear speech, quiet background): 92-95% accuracy
- **Noisy environment**: 75-85% accuracy (small), 85-90% (large)
- **Multiple speakers**: Overlapping speech degrades results, and Whisper does not label who is speaking; works best with a single speaker
- **Accents**: Handles various accents reasonably well
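
The accuracy figures above are easiest to check against your own audio with word error rate (WER), where accuracy is roughly `1 - WER`. That reading of "accuracy" is an assumption, since the notebook does not define the metric. The sketch below computes WER with a word-level edit distance; `word_error_rate` is a hypothetical helper and expects both strings to be normalized (case, punctuation) beforehand.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing from hypothesis
                current[j - 1] + 1,      # extra word in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate("i have a dream", "i had a dream")
print(f"WER: {wer:.2f}, accuracy: {1 - wer:.0%}")
```

WER can exceed 1.0 when the hypothesis contains many extra words, so `1 - WER` is only a rough accuracy proxy.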

### Language Support
- **Whisper supports 99+ languages**!
- Detects the language automatically when none is specified; setting it explicitly can help on short or noisy clips
- Quality varies by language (best for English)
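
When automatic detection picks the wrong language, the language and task can be passed to Whisper through the pipeline's `generate_kwargs` argument. The sketch below only builds that dictionary; `whisper_generate_kwargs` is a hypothetical helper and `french_clip.wav` is a placeholder file name, not something shipped with this notebook.

```python
def whisper_generate_kwargs(language=None, task="transcribe"):
    """Build generate_kwargs for a multilingual Whisper pipeline call.

    language: e.g. "french"; None leaves language detection to the model.
    task: "transcribe" keeps the source language, "translate" outputs English.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    kwargs = {"task": task}
    if language is not None:
        kwargs["language"] = language
    return kwargs


print(whisper_generate_kwargs("french"))

# Usage with the `asr` pipeline from this notebook (not executed here;
# the file name is a placeholder):
# result = asr("french_clip.wav", generate_kwargs=whisper_generate_kwargs("french"))
```

These options apply to the multilingual checkpoints used in this notebook; English-only checkpoints (names ending in `.en`) do not accept them.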

### Performance
- **30-second audio clip**:
  - small model (CPU): 5-10 seconds (~3-6x realtime)
  - small model (GPU): 1-2 seconds
  - large model (CPU): 20-30 seconds
  - large model (GPU): 3-5 seconds
- **5-minute audio**:
  - small model (CPU): 1-2 minutes
  - small model (GPU): 10-20 seconds
  - large model (GPU): 30-60 seconds
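
These timings depend heavily on hardware, so it is worth measuring them on your own machine. A minimal sketch follows: speed relative to realtime is audio duration divided by processing time, so a 30-second clip transcribed in 5 to 10 seconds runs at 3x to 6x realtime. `realtime_speedup` and `time_call` are hypothetical helper names.

```python
import time


def realtime_speedup(audio_seconds, processing_seconds):
    """How many seconds of audio are processed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def time_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Worked arithmetic for a 30-second clip
for label, processing_seconds in [("fast end", 5), ("slow end", 10)]:
    print(f"{label}: {realtime_speedup(30, processing_seconds):.1f}x realtime")

# Usage with the `asr` pipeline from this notebook (not executed here):
# result, elapsed = time_call(asr, sample_audio_url)
# print(f"{realtime_speedup(13.0, elapsed):.1f}x realtime")  # 13.0 is an assumed clip length
```

The first call to a pipeline usually includes one-off warm-up costs, so timing a second call gives a fairer number.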

### Audio Format Support
- WAV, MP3, FLAC, OGG, M4A
- Automatically resamples to 16kHz
- Mono or stereo (converted to mono)
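
Before transcribing, it can help to check what a file actually contains. The sketch below reads a WAV header with Python's standard `wave` module and reports whether resampling or downmixing would be needed. `describe_wav` and `write_silent_wav` are hypothetical helpers, and the approach only covers uncompressed PCM WAV files; other formats need a decoder such as `soundfile` or `librosa`.

```python
import os
import tempfile
import wave

TARGET_SAMPLE_RATE = 16_000  # Whisper expects 16 kHz input


def describe_wav(path):
    """Read channel count, sample rate, and duration from a PCM WAV header."""
    with wave.open(str(path), "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        frames = wav_file.getnframes()
    return {
        "channels": channels,
        "sample_rate": sample_rate,
        "duration_s": frames / sample_rate,
        "needs_resampling": sample_rate != TARGET_SAMPLE_RATE,
        "needs_downmix": channels > 1,
    }


def write_silent_wav(path, seconds=1.0, sample_rate=8000, channels=2):
    """Write a silent 16-bit PCM WAV file, used here only as demo input."""
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate) * channels)


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.wav")
    write_silent_wav(demo_path)
    info = describe_wav(demo_path)
    print(info)
```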

### Common Observations
- **Punctuation**: Added automatically (mostly accurate)
- **Capitalization**: Generally correct
- **Filler words** ("um", "uh"): Often transcribed
- **Background music**: May confuse transcription
- **Long silences**: Usually skipped, though Whisper can occasionally hallucinate text during long silent stretches
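
If transcribed filler words are unwanted, they can be removed in a post-processing step. The sketch below is one simple regex-based approach, not something Whisper or `transformers` provides; `strip_fillers` and the filler list are assumptions for illustration and only cover a few English fillers.

```python
import re

# Standalone "um", "umm", "uh", "uhh", "er", "erm", plus trailing comma/period and spaces
FILLER_PATTERN = re.compile(r"\b(?:um+|uh+|erm?)\b[,.]?\s*", flags=re.IGNORECASE)


def strip_fillers(text):
    """Remove common English filler words and tidy the remaining whitespace."""
    cleaned = FILLER_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


print(strip_fillers("Um, I think, uh, we should go."))
```

The word boundaries keep words such as "umbrella" intact, but a filler at the start of a sentence may leave the next word uncapitalized.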

### Model Comparison
- **small model**: Fast, good for real-time, ~85-90% accurate, CPU-friendly
- **large model**: Roughly 3-5x slower than the small model, ~92-95% accurate, production-grade

In [None]:
# CHOOSE YOUR MODEL:

# Option 1: small model (CPU-friendly, fast, 99+ languages)
MODEL_NAME = "openai/whisper-tiny"  # 72MB, ~3-6x realtime on CPU

# Option 2: large model (GPU-optimized, production-grade)
# MODEL_NAME = "openai/whisper-small"  # 483MB, better accuracy, multilingual

print(f"Selected model: {MODEL_NAME}")

In [None]:
# Create ASR pipeline
print(f"Loading {MODEL_NAME}...")
asr = pipeline(
    "automatic-speech-recognition",
    model=MODEL_NAME,
    device=0 if torch.cuda.is_available() else -1
)

In [None]:
# Example: Transcribe a sample audio file
# Using HuggingFace's audio sample
sample_audio_url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"

# Download and transcribe
result = asr(sample_audio_url)

print("=== TRANSCRIPTION ===")
print(result["text"])

In [None]:
# Transcribe with timestamps
result_with_timestamps = asr(
    sample_audio_url,
    return_timestamps=True
)

print("\n=== TRANSCRIPTION WITH TIMESTAMPS ===")
print(f"Full text: {result_with_timestamps['text']}")
print("\nChunks:")
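for chunk in result_with_timestamps.get('chunks', []):
    start, end = chunk['timestamp']
    # The final end timestamp can be None if the audio stops mid-segment
    end_label = f"{end:.2f}s" if end is not None else "end"
    print(f"  [{start:.2f}s - {end_label}]: {chunk['text']}")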

In [None]:
# Process local audio files (if any)
import os

sample_data_path = "../sample_data"
AUDIO_EXTENSIONS = ('.wav', '.mp3', '.flac', '.ogg', '.m4a')

if os.path.exists(sample_data_path):
    # Sort for a deterministic order; os.listdir() returns files in arbitrary order
    audio_files = sorted(f for f in os.listdir(sample_data_path)
                         if f.lower().endswith(AUDIO_EXTENSIONS))

    if audio_files:
        print("=== TRANSCRIBING LOCAL AUDIO FILES ===")
        for audio_file in audio_files[:3]:  # Limit to 3
            audio_path = os.path.join(sample_data_path, audio_file)
            result = asr(audio_path)
            print(f"\n{audio_file}:")
            print(f"  {result['text']}")
    else:
        print(f"No audio files found. Add {', '.join(AUDIO_EXTENSIONS)} files to {sample_data_path}/ to test!")
else:
    print(f"{sample_data_path}/ not found. Create it and add audio files to test transcription.")

In [None]:
# Using LibriSpeech ASR dummy dataset (small test dataset for speech recognition)
from datasets import load_dataset

print("Loading LibriSpeech ASR dummy dataset...")
# Load the validation split (very small, intended for testing only)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

print(f"Loaded {len(dataset)} audio samples\n")

# Transcribe a couple of examples
print("=== LibriSpeech Dataset Transcription ===")
for i in range(min(2, len(dataset))):
    sample = dataset[i]
    
    # Get audio array and sampling rate
    audio_array = sample['audio']['array']
    sampling_rate = sample['audio']['sampling_rate']
    true_text = sample['text']
    
    # Transcribe
    result = asr({"array": audio_array, "sampling_rate": sampling_rate})
    predicted_text = result['text']
    
    print(f"\nSample {i+1}:")
    print(f"  True text:      {true_text}")
    print(f"  Predicted text: {predicted_text}")
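
LibriSpeech references are uppercase without punctuation, while Whisper output is cased and punctuated, so comparing the two directly overstates the error. A minimal normalization sketch follows; `normalize_transcript` is a hypothetical helper, and the example strings are illustrative rather than actual model output.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize_transcript(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(_PUNCTUATION_TABLE)
    return " ".join(text.split())


reference = "MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES"
prediction = " Mister Quilter is the apostle, of the middle classes."

print(normalize_transcript(reference))
print(normalize_transcript(prediction))
print("Match:", normalize_transcript(reference) == normalize_transcript(prediction))
```

This does not reconcile spelling variants such as "Mr." versus "MISTER" or digits versus spelled-out numbers; those need a fuller text normalizer.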

## Exercises

1. **Custom Audio**: Record your own audio and transcribe it
2. **Language Support**: Test with non-English audio (Whisper supports 99+ languages)
3. **Model Comparison**: Compare small vs large model accuracy
4. **Long Audio**: Test with longer audio files (5+ minutes)
5. **Noisy Audio**: How does the model handle background noise?
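
For the long-audio exercise, note that Whisper processes audio in 30-second windows, so longer files are usually split into overlapping chunks. The sketch below is a simplified illustration of how such windows can be laid out, not the library's exact implementation; `chunk_windows` is a hypothetical helper. The commented call shows the `chunk_length_s` and `stride_length_s` pipeline arguments, with `long_recording.mp3` as a placeholder file name.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_s=5.0):
    """Return (start, end) windows that overlap by stride_s on each side."""
    if chunk_length_s <= 2 * stride_s:
        raise ValueError("chunk_length_s must be greater than twice stride_s")

    step = chunk_length_s - 2 * stride_s  # new audio covered by each window
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break  # the last window reached the end of the audio
        start += step
    return windows


for start, end in chunk_windows(70):
    print(f"{start:5.1f}s -> {end:5.1f}s")

# Usage with the `asr` pipeline from this notebook (not executed here;
# the file name is a placeholder):
# result = asr("long_recording.mp3", chunk_length_s=30, stride_length_s=5,
#              return_timestamps=True)
```

The overlap gives the model context on both sides of each boundary, which helps avoid words being cut in half where chunks meet.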

In [None]:
# Your code here for exercises


## Key Takeaways

✅ **Whisper** is a powerful multilingual ASR model

✅ Supports **99+ languages** out of the box

✅ Can provide **timestamps** for each segment

✅ Works with various **audio formats** (WAV, MP3, FLAC)

✅ The small model (`openai/whisper-tiny`) is often good enough for clear, single-speaker audio

## Next Steps

- Try **Notebook 08**: Text-to-Speech for audio generation
- Explore [audio models](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition)
- Learn about fine-tuning for domain-specific vocabulary

## Resources

- [Whisper Paper](https://arxiv.org/abs/2212.04356)
- [ASR Task Guide](https://huggingface.co/docs/transformers/tasks/asr)
- [Whisper Model Card](https://huggingface.co/openai/whisper-tiny)