# Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition
A unified model for audio, visual, and audio-visual speech recognition.
Training paradigm: USR 2.0 uses self-supervised pre-training followed by semi-supervised fine-tuning. We provide both the self-supervised checkpoints (for extracting representations for your own downstream tasks) and the fine-tuned checkpoints (for speech recognition). See Extract Encoder Features for details on using either type.
Installation • Demo • Features • Models • Evaluation • Citation
## Installation

FFmpeg is required for video/audio processing. Check if it is installed:

```bash
ffmpeg -version
```

If not, install it:

```bash
# Ubuntu / Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg
```

Install PyTorch, torchaudio, and torchvision for your system from pytorch.org. For example:

```bash
pip install torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu121
```

Then install the remaining dependencies:

```bash
pip install -r requirements.txt
```

This installs all remaining packages, including MediaPipe for face landmark detection (used for mouth cropping).
### Optional: Higher-accuracy face detection with RetinaFace+FAN

MediaPipe is the default (CPU-based). For higher-accuracy landmark detection, install the ibug packages. This requires a CUDA GPU.

```bash
# face_detection uses git-lfs for model weights
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/hhj1897/face_detection.git
cd face_detection
wget -O ibug/face_detection/retina_face/weights/Resnet50_Final.pth \
    https://huggingface.co/public-data/ibug-face-detection/resolve/main/retina_face/Resnet50_Final.pth
pip install -e .
cd ..

# face_alignment must be installed from a local clone (editable mode)
git clone https://github.com/hhj1897/face_alignment.git
pip install -e face_alignment
```

Then pass `detector=retinaface` to `demo.py` or `extract_features.py`.
## Demo

Run `demo.py` to transcribe a video. Face detection and mouth cropping are handled automatically.
Download a pretrained model before running. The Huge (high-resource) checkpoint is recommended for best accuracy. For a lighter alternative, use Base+ (high-resource). See Pretrained Models for all options.
```bash
python demo.py \
    video=path/to/video.mp4 \
    model.pretrained_model_path=path/to/checkpoint.pth
```

Output:

```
============================================================
Modality : av
Video    : path/to/video.mp4
Result   : THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG
============================================================
```
```bash
# Audio-visual (default)
python demo.py video=video.mp4 model.pretrained_model_path=model.pth

# Lip reading only
python demo.py video=video.mp4 model.pretrained_model_path=model.pth modality=v

# Audio only
python demo.py video=video.mp4 model.pretrained_model_path=model.pth modality=a
```

By default, MediaPipe is used for face landmark detection. For higher accuracy, use RetinaFace+FAN (requires the ibug packages and a CUDA GPU):

```bash
python demo.py video=video.mp4 model.pretrained_model_path=model.pth detector=retinaface
```

To use a different backbone, override the backbone config:

```bash
python demo.py video=video.mp4 model.pretrained_model_path=model.pth \
    model/backbone=resnet_transformer_large
```

Any Hydra override works. For example, to change the beam size: `decode.beam_size=10`.
## Extract Encoder Features

Extract learned audio-visual representations for your own downstream tasks (e.g., emotion recognition, speaker verification, audio-visual synchronization).
While demo.py uses the full model (encoder + decoder) for transcription, extract_features.py outputs only the encoder representations.
You can extract features from two types of checkpoints:
- Fine-tuned models (self-supervised + semi-supervised): Best for tasks that benefit from supervised speech knowledge
- Self-supervised models: Pure self-supervised representations, useful if you want to fine-tune on a task that requires more general knowledge.
See Pretrained Models for download links.
```bash
# Using a fine-tuned checkpoint
python extract_features.py \
    video=path/to/video.mp4 \
    model.pretrained_model_path=path/to/finetuned_checkpoint.pth \
    output=features.pt

# Using a self-supervised checkpoint
python extract_features.py \
    video=path/to/video.mp4 \
    model.pretrained_model_path=path/to/selfsup_checkpoint.pth \
    output=features.pt

# Single modality
python extract_features.py \
    video=path/to/video.mp4 \
    model.pretrained_model_path=path/to/checkpoint.pth \
    modality=v output=video_features.pt

# Use RetinaFace+FAN for face detection (optional, requires ibug packages)
python extract_features.py \
    video=path/to/video.mp4 \
    model.pretrained_model_path=path/to/checkpoint.pth \
    detector=retinaface output=features.pt
```

Load in Python:
```python
import torch

features = torch.load("features.pt")
features["audio_visual"]  # numpy array, shape (T, D) — fused audio-visual encoder output
features["video"]         # numpy array, shape (T, D) — visual encoder output
features["audio"]         # numpy array, shape (T, D) — audio encoder output
```
## Pretrained Models

### Self-supervised checkpoints

These are self-supervised pre-trained checkpoints from USR. Use these if you want to extract representations without any supervised fine-tuning, e.g., to fine-tune for your own downstream task.

| Model | Data | Download |
|---|---|---|
| Base | LRS3 | checkpoint |
| Base+ | LRS3+Vox2 | checkpoint |
| Large | LRS3+Vox2 | checkpoint |
| Huge | LRS2+LRS3+Vox2+AVS | checkpoint |
### Fine-tuned checkpoints

These checkpoints have been fine-tuned with semi-supervised learning for speech recognition. Use these for transcription (demo.py) or to extract features that include supervised speech knowledge.
**Low-resource**

| Model | Data | VSR (%) | ASR (%) | AVSR (%) | Download |
|---|---|---|---|---|---|
| Base | LRS3 | 36.2 | 3.0 | 2.9 | checkpoint |
| Base+ | LRS3+Vox2 | 26.4 | 2.5 | 2.4 | checkpoint |
| Large | LRS3+Vox2 | 23.7 | 2.3 | 2.2 | checkpoint |
**High-resource**

| Model | Data | VSR (%) | ASR (%) | AVSR (%) | Download |
|---|---|---|---|---|---|
| Base+ | LRS3+Vox2 | 24.8 | 1.4 | 1.2 | checkpoint |
| Large | LRS3+Vox2 | 21.5 | 1.3 | 1.0 | checkpoint |
| Huge | LRS2+LRS3+Vox2+AVS | 17.6 | 0.9 | 0.8 | checkpoint |
Backbone configs: `resnet_transformer_base` / `resnet_transformer_baseplus` / `resnet_transformer_large` / `resnet_transformer_huge`.
## Evaluation

Evaluate on a test set with WER computation:

```bash
python main.py \
    data.dataset.test_csv=path/to/test.csv \
    model/backbone=resnet_transformer_base \
    model.pretrained_model_path=path/to/checkpoint.pth
```

Greedy decoding:

```bash
python main.py ... decode.beam_size=1 decode.ctc_weight=0.0
```

### Decoding parameters
| Parameter | Default | Description |
|---|---|---|
| `decode.beam_size` | 40 | Beam search width |
| `decode.ctc_weight` | 0.1 | CTC weight (0.0 = pure attention) |
| `decode.maxlenratio` | 1.0 | Max output length ratio |
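`decode.ctc_weight` controls joint CTC/attention decoding (via the vendored ESPnet beam search): the attention-decoder and CTC scores are interpolated when ranking beam hypotheses. A schematic sketch of the scoring rule, not the repo's actual decoder code:

```python
# Schematic joint CTC/attention scoring for one hypothesis (illustrative).
# att_logp: log-probability from the attention decoder
# ctc_logp: log-probability from the CTC prefix scorer
def joint_score(att_logp: float, ctc_logp: float, ctc_weight: float = 0.1) -> float:
    # ctc_weight=0.0 -> pure attention decoding (as in the greedy example above)
    # ctc_weight=1.0 -> pure CTC scoring
    return (1.0 - ctc_weight) * att_logp + ctc_weight * ctc_logp
```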
**Speed tip:** For faster decoding at the cost of some accuracy, reduce the beam size and disable CTC rescoring:

```bash
decode.beam_size=1 decode.ctc_weight=0.0
```

The default (`beam_size=40`, `ctc_weight=0.1`) gives the best results but is slower. Intermediate values (e.g., `beam_size=10`) offer a middle ground.
### Robustness to noise
WER (%) at different babble-noise SNRs:

| Modality | 10 dB | 5 dB | 0 dB | -5 dB | Avg |
|---|---|---|---|---|---|
| ASR | 5.2 | 13.4 | 44.0 | 94.4 | 39.3 |
| AVSR | 3.7 | 5.6 | 14.0 | 33.1 | 14.1 |
```bash
python main.py \
    model/backbone=resnet_transformer_base \
    model.pretrained_model_path=path/to/checkpoint.pth \
    data.dataset.test_csv=path/to/test.csv \
    data.noise_path=path/to/babble_noise.npy \
    decode.beam_size=30 decode.ctc_weight=0.1 \
    decode.maxlenratio=0.4 decode.snr_target=0
```

Babble noise: download
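For reference, mixing noise at a target SNR just scales the noise so the speech-to-noise power ratio matches the target. A minimal numpy sketch, assuming mono waveforms (the helper name and file handling are illustrative, not the repo's loader):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at the target SNR (dB)."""
    # Tile or truncate the noise to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Scale the noise so that 10 * log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(speech**2)
    p_noise = np.mean(noise**2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# e.g. noisy = mix_at_snr(speech, np.load("babble_noise.npy"), snr_db=0.0)
# matches decode.snr_target=0 above
```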
### Robustness to long utterances
```bash
python main.py \
    data.frames_per_gpu_val=700 \
    model/backbone=resnet_transformer_base \
    model.pretrained_model_path=path/to/checkpoint.pth \
    data.dataset.test_csv=path/to/length_bucket.csv \
    decode.beam_size=1 decode.ctc_weight=0.0 \
    decode.maxlenratio=0.4
```

Length-bucketed test CSVs: 100-150 | 150-200 | 200-250 | 250-300 | 300-350 | 350-400 | 400-450 | 450-500 | 500-550 | 550-600 | Combined
### Out-of-distribution datasets
| Modality | Dataset | WER (%) |
|---|---|---|
| ASR | LibriSpeech test-clean | 15.4 |
| VSR | WildVSR | 73.7 |
| AVSR | AVSpeech | 25.0 |
```bash
python main.py \
    data.frames_per_gpu_val=700 \
    model/backbone=resnet_transformer_base \
    model.pretrained_model_path=path/to/checkpoint.pth \
    data.dataset.test_csv=path/to/test.csv \
    decode.beam_size=1 decode.ctc_weight=0.0 \
    decode.maxlenratio=0.4
```

Test CSVs: LibriSpeech | WildVSR | AVSpeech
### Full data preparation instructions (for batch evaluation)
Note: Several of these datasets are no longer available for download from their official sources. Unfortunately there is nothing we can do about this. If you need access to a specific dataset, we recommend contacting the original authors directly.
```bash
python preprocessing/extract_mouths.py \
    --src_dir /path/to/raw/videos \
    --tgt_dir /path/to/mouth/videos \
    --landmarks_dir /path/to/landmarks
```

Pre-computed landmarks for LRS2 and LRS3 can be downloaded from the Visual Speech Recognition repo.
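For intuition, a mouth crop is essentially a fixed-size region around the centroid of the mouth landmarks. A minimal sketch assuming the 68-point iBUG landmark convention (the repo's actual preprocessing may additionally smooth and align landmarks across frames):

```python
import numpy as np

def crop_mouth(frame: np.ndarray, landmarks: np.ndarray, size: int = 96) -> np.ndarray:
    """Crop a square mouth ROI around the mouth-landmark centroid.

    Assumes 68-point iBUG landmarks, where indices 48-67 are the mouth.
    """
    cx, cy = landmarks[48:68].mean(axis=0).round().astype(int)
    half = size // 2
    # Clamp so the crop stays inside the frame (edge cases kept simple here).
    y0 = max(cy - half, 0)
    x0 = max(cx - half, 0)
    return frame[y0 : y0 + size, x0 : x0 + size]
```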
Pre-built dataset CSVs:

| Split | Link |
|---|---|
| LRS3 test | download |
| LRS3 trainval | download |
| LRS3 train | download |
| LRS3 val | download |
| LRS3+Vox2 | download |
| LRS2+LRS3+Vox2+AVS | download |
Edit `conf/data/default.yaml` and set the video/audio directory prefixes for each dataset.
### Repository structure

```
.
├── demo.py                 # Transcribe a single video
├── extract_features.py     # Extract encoder features
├── main.py                 # Batch evaluation with WER
├── evaluator.py            # PyTorch Lightning evaluation module
├── models/usr.py           # USR model wrapper
├── data/                   # Dataset, transforms, samplers
├── preprocessing/          # Face detection + mouth cropping
├── espnet/                 # Vendored ESPnet (transformer, beam search, CTC)
├── conf/                   # Hydra configuration
└── utils/utils.py          # Tokenization
```
## Citation

```bibtex
@article{usr2,
  title={Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition},
  author={},
  journal={},
  year={2025}
}
```

This codebase builds on ESPnet, PyTorch Lightning, and Hydra. Preprocessing code is adapted from Visual Speech Recognition for Multiple Languages.


