ahaliassos/usr2

USR 2.0

Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition

A unified model for audio, visual, and audio-visual speech recognition.

Training paradigm: USR 2.0 uses self-supervised pre-training followed by semi-supervised fine-tuning. We provide both the self-supervised checkpoints (for extracting representations for your own downstream tasks) and the fine-tuned checkpoints (for speech recognition). See Extract Encoder Features for details on using either type.

Installation • Demo • Features • Models • Evaluation • Citation


Installation

Prerequisites

FFmpeg is required for video/audio processing. Check if it is installed:

ffmpeg -version

If not, install it:

# Ubuntu / Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

Step 1: Install PyTorch

Install PyTorch, torchaudio, and torchvision for your system from pytorch.org. For example:

pip install torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu121

Step 2: Install remaining dependencies

pip install -r requirements.txt

This installs all remaining packages, including MediaPipe for face landmark detection (used for mouth cropping).

Optional: Higher-accuracy face detection with RetinaFace+FAN

MediaPipe is the default (CPU-based). For higher-accuracy landmark detection, install the ibug packages. This requires a CUDA GPU.

# face_detection uses git-lfs for model weights
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/hhj1897/face_detection.git
cd face_detection
wget -O ibug/face_detection/retina_face/weights/Resnet50_Final.pth \
  https://huggingface.co/public-data/ibug-face-detection/resolve/main/retina_face/Resnet50_Final.pth
pip install -e .
cd ..

# face_alignment must be installed from a local clone (editable mode)
git clone https://github.com/hhj1897/face_alignment.git
pip install -e face_alignment

Then pass detector=retinaface to demo.py or extract_features.py.


Transcribe a Video

Run demo.py to transcribe a video. Face detection and mouth cropping are handled automatically.

Download a pretrained model before running. The Huge (high-resource) checkpoint is recommended for best accuracy. For a lighter alternative, use Base+ (high-resource). See Pretrained Models for all options.

python demo.py \
  video=path/to/video.mp4 \
  model.pretrained_model_path=path/to/checkpoint.pth

Output:

============================================================
 Modality : av
 Video    : path/to/video.mp4
 Result   : THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG
============================================================

Modalities

# Audio-visual (default)
python demo.py video=video.mp4 model.pretrained_model_path=model.pth

# Lip reading only
python demo.py video=video.mp4 model.pretrained_model_path=model.pth modality=v

# Audio only
python demo.py video=video.mp4 model.pretrained_model_path=model.pth modality=a

Face detector

By default, MediaPipe is used for face landmark detection. For higher accuracy, use RetinaFace+FAN (requires ibug packages and a CUDA GPU):

python demo.py video=video.mp4 model.pretrained_model_path=model.pth detector=retinaface

Using a different model size

python demo.py video=video.mp4 model.pretrained_model_path=model.pth \
  model/backbone=resnet_transformer_large

Any Hydra override works. For example, to change beam size: decode.beam_size=10.


Extract Encoder Features

Extract learned audio-visual representations for your own downstream tasks (e.g., emotion recognition, speaker verification, audio-visual synchronization).

While demo.py uses the full model (encoder + decoder) for transcription, extract_features.py outputs only the encoder representations.

You can extract features from two types of checkpoints:

  • Fine-tuned models (self-supervised + semi-supervised): best for tasks that benefit from supervised speech knowledge.
  • Self-supervised models: pure self-supervised representations, useful for fine-tuning on tasks that need more general-purpose features.

See Pretrained Models for download links.

# Using a fine-tuned checkpoint
python extract_features.py \
  video=path/to/video.mp4 \
  model.pretrained_model_path=path/to/finetuned_checkpoint.pth \
  output=features.pt

# Using a self-supervised checkpoint
python extract_features.py \
  video=path/to/video.mp4 \
  model.pretrained_model_path=path/to/selfsup_checkpoint.pth \
  output=features.pt

# Single modality
python extract_features.py \
  video=path/to/video.mp4 \
  model.pretrained_model_path=path/to/checkpoint.pth \
  modality=v output=video_features.pt

# Use RetinaFace+FAN for face detection (optional, requires ibug packages)
python extract_features.py \
  video=path/to/video.mp4 \
  model.pretrained_model_path=path/to/checkpoint.pth \
  detector=retinaface output=features.pt

Load in Python:

import torch

features = torch.load("features.pt")
features["audio_visual"]  # numpy array, shape (T, D) — fused audio-visual encoder output
features["video"]         # numpy array, shape (T, D) — visual encoder output
features["audio"]         # numpy array, shape (T, D) — audio encoder output
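Once loaded, the per-frame features are plain arrays and can be pooled or compared directly. The sketch below builds a toy `features` dictionary in the saved layout (the real one comes from `extract_features.py`; the dimension `D = 768` is a placeholder, not a documented value) and shows two common starting points: mean-pooling over time for a clip-level embedding, and frame-wise audio/video cosine similarity as a crude synchronisation signal.

```python
import numpy as np
import torch

# Toy stand-in for a saved features.pt file; shapes follow the (T, D)
# layout described above. D is hypothetical -- use whatever dimension
# your checkpoint actually produces.
T, D = 50, 768
features = {
    "audio_visual": np.random.randn(T, D).astype(np.float32),
    "video": np.random.randn(T, D).astype(np.float32),
    "audio": np.random.randn(T, D).astype(np.float32),
}

# Mean-pool over time to get one clip-level embedding per modality,
# a common first step for downstream tasks such as emotion recognition.
clip_embeddings = {k: v.mean(axis=0) for k, v in features.items()}

# Frame-wise cosine similarity between the audio and video streams can
# serve as a rough audio-visual synchronisation signal.
a = torch.from_numpy(features["audio"])
v = torch.from_numpy(features["video"])
sync = torch.nn.functional.cosine_similarity(a, v, dim=-1)  # shape (T,)
```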

Pretrained Models

Self-supervised (encoder only)

These are self-supervised pre-trained checkpoints from USR. Use these if you want to extract representations without any supervised fine-tuning, e.g., to fine-tune for your own downstream task.

| Model | Data | Download |
|---|---|---|
| Base | LRS3 | checkpoint |
| Base+ | LRS3+Vox2 | checkpoint |
| Large | LRS3+Vox2 | checkpoint |
| Huge | LRS2+LRS3+Vox2+AVS | checkpoint |

Fine-tuned (full model)

These checkpoints have been fine-tuned with semi-supervised learning for speech recognition. Use these for transcription (demo.py) or to extract features that include supervised speech knowledge.

Low-resource

| Model | Data | VSR (%) | ASR (%) | AVSR (%) | Download |
|---|---|---|---|---|---|
| Base | LRS3 | 36.2 | 3.0 | 2.9 | checkpoint |
| Base+ | LRS3+Vox2 | 26.4 | 2.5 | 2.4 | checkpoint |
| Large | LRS3+Vox2 | 23.7 | 2.3 | 2.2 | checkpoint |

High-resource

| Model | Data | VSR (%) | ASR (%) | AVSR (%) | Download |
|---|---|---|---|---|---|
| Base+ | LRS3+Vox2 | 24.8 | 1.4 | 1.2 | checkpoint |
| Large | LRS3+Vox2 | 21.5 | 1.3 | 1.0 | checkpoint |
| Huge | LRS2+LRS3+Vox2+AVS | 17.6 | 0.9 | 0.8 | checkpoint |

Backbone configs: resnet_transformer_base / resnet_transformer_baseplus / resnet_transformer_large / resnet_transformer_huge.


Evaluation

Evaluate on a test set with WER computation:

python main.py \
  data.dataset.test_csv=path/to/test.csv \
  model/backbone=resnet_transformer_base \
  model.pretrained_model_path=path/to/checkpoint.pth

Greedy decoding:

python main.py ... decode.beam_size=1 decode.ctc_weight=0.0

Decoding parameters

| Parameter | Default | Description |
|---|---|---|
| decode.beam_size | 40 | Beam search width |
| decode.ctc_weight | 0.1 | CTC weight (0.0 = pure attention) |
| decode.maxlenratio | 1.0 | Max output length ratio |

Speed tip: For faster decoding at the cost of some accuracy, reduce the beam size and disable CTC rescoring:

decode.beam_size=1 decode.ctc_weight=0.0

The default (beam_size=40, ctc_weight=0.1) gives the best results but is slower. Intermediate values (e.g., beam_size=10) offer a middle ground.
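For reference, the WER that main.py reports is the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal self-contained sketch (not the repo's actual implementation, which lives in the evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown dog"))  # 0.25
```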


Reproducing Paper Results

Robustness to noise

| Modality | 10 dB | 5 dB | 0 dB | -5 dB | Avg |
|---|---|---|---|---|---|
| ASR | 5.2 | 13.4 | 44.0 | 94.4 | 39.3 |
| AVSR | 3.7 | 5.6 | 14.0 | 33.1 | 14.1 |

python main.py \
  model/backbone=resnet_transformer_base \
  model.pretrained_model_path=path/to/checkpoint.pth \
  data.dataset.test_csv=path/to/test.csv \
  data.noise_path=path/to/babble_noise.npy \
  decode.beam_size=30 decode.ctc_weight=0.1 \
  decode.maxlenratio=0.4 decode.snr_target=0

Babble noise: download
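decode.snr_target controls the signal-to-noise ratio at which the babble noise is mixed into the audio. The sketch below shows the standard SNR-scaled mixing recipe (not necessarily the repo's exact procedure): scale the noise so the speech-to-noise power ratio matches the target, then add it.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`, then add it."""
    # Tile and crop the noise to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # From SNR(dB) = 10 * log10(P_speech / P_noise_scaled), the noise gain is:
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # 1 s of 16 kHz audio (toy data)
babble = rng.standard_normal(8000)    # shorter noise clip, gets tiled
noisy = mix_at_snr(speech, babble, snr_db=0.0)
```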

Robustness to long utterances

python main.py \
  data.frames_per_gpu_val=700 \
  model/backbone=resnet_transformer_base \
  model.pretrained_model_path=path/to/checkpoint.pth \
  data.dataset.test_csv=path/to/length_bucket.csv \
  decode.beam_size=1 decode.ctc_weight=0.0 \
  decode.maxlenratio=0.4

Length-bucketed test CSVs: 100-150 | 150-200 | 200-250 | 250-300 | 300-350 | 350-400 | 400-450 | 450-500 | 500-550 | 550-600 | Combined
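Sweeping all ten buckets is easy to script. The CSV filenames below (csvs/length_100_150.csv and so on) are hypothetical; substitute whatever names you saved the downloads under. The loop is a dry run that only prints each command, so you can inspect it before removing the `echo`.

```shell
# Dry-run sweep over the length buckets; remove `echo` to actually evaluate.
# CSV paths are hypothetical -- rename to match your downloaded files.
for bucket in 100_150 150_200 200_250 250_300 300_350 \
              350_400 400_450 450_500 500_550 550_600; do
  echo python main.py \
    data.frames_per_gpu_val=700 \
    model/backbone=resnet_transformer_base \
    model.pretrained_model_path=path/to/checkpoint.pth \
    data.dataset.test_csv=csvs/length_${bucket}.csv \
    decode.beam_size=1 decode.ctc_weight=0.0 \
    decode.maxlenratio=0.4
done
```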

Out-of-distribution datasets

| Modality | Dataset | WER (%) |
|---|---|---|
| ASR | LibriSpeech test-clean | 15.4 |
| VSR | WildVSR | 73.7 |
| AVSR | AVSpeech | 25.0 |

python main.py \
  data.frames_per_gpu_val=700 \
  model/backbone=resnet_transformer_base \
  model.pretrained_model_path=path/to/checkpoint.pth \
  data.dataset.test_csv=path/to/test.csv \
  decode.beam_size=1 decode.ctc_weight=0.0 \
  decode.maxlenratio=0.4

Test CSVs: LibriSpeech | WildVSR | AVSpeech


Data Preparation

Full data preparation instructions (for batch evaluation)

1. Download datasets

Note: Several of these datasets are no longer available for download from their official sources; this is outside our control. If you need access to a specific dataset, we recommend contacting the original authors directly.

2. Extract mouth ROIs

python preprocessing/extract_mouths.py \
  --src_dir /path/to/raw/videos \
  --tgt_dir /path/to/mouth/videos \
  --landmarks_dir /path/to/landmarks

Pre-computed landmarks for LRS2 and LRS3 can be downloaded from the Visual Speech Recognition repo.

3. Download test CSVs

| Split | Link |
|---|---|
| LRS3 test | download |
| LRS3 trainval | download |
| LRS3 train | download |
| LRS3 val | download |
| LRS3+Vox2 | download |
| LRS2+LRS3+Vox2+AVS | download |

4. Set dataset paths

Edit conf/data/default.yaml and set the video/audio directory prefixes for each dataset.


Architecture

*(model architecture figure)*


Repository Structure

.
├── demo.py                   # Transcribe a single video
├── extract_features.py       # Extract encoder features
├── main.py                   # Batch evaluation with WER
├── evaluator.py              # PyTorch Lightning evaluation module
├── models/usr.py             # USR model wrapper
├── data/                     # Dataset, transforms, samplers
├── preprocessing/            # Face detection + mouth cropping
├── espnet/                   # Vendored ESPnet (transformer, beam search, CTC)
├── conf/                     # Hydra configuration
└── utils/utils.py            # Tokenization

Citation

@article{usr2,
  title={Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition},
  author={},
  journal={},
  year={2025}
}

Acknowledgements

This codebase builds on ESPnet, PyTorch Lightning, and Hydra. Preprocessing code is adapted from Visual Speech Recognition for Multiple Languages.

About

PyTorch implementation of USR 2.0 (ICLR 2026)
