# Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition
A unified model for audio, visual, and audio-visual speech recognition.
Training paradigm: USR 2.0 uses self-supervised pre-training followed by semi-supervised fine-tuning. We provide both the self-supervised checkpoints (for extracting representations for your own downstream tasks) and the fine-tuned checkpoints (for speech recognition). See Extract Encoder Features for details on using either type.
Installation • Demo • Features • Models • Evaluation • Citation
## Installation

FFmpeg is required for video/audio processing. Check if it is installed:

```bash
ffmpeg -version
```

If not, install it:

```bash
# Ubuntu / Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg
```

Install PyTorch, torchaudio, and torchvision for your system from pytorch.org. For example:

```bash
pip install torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu121
```

Then install the remaining dependencies:

```bash
pip install -r requirements.txt
```

This installs all remaining packages, including MediaPipe for face landmark detection (used for mouth cropping).
### Optional: Higher-accuracy face detection with RetinaFace+FAN

MediaPipe is the default (CPU-based). For higher-accuracy landmark detection, install the ibug packages. This requires a CUDA GPU.

```bash
# face_detection uses git-lfs for model weights
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/hhj1897/face_detection.git
cd face_detection
wget -O ibug/face_detection/retina_face/weights/Resnet50_Final.pth \
    https://huggingface.co/public-data/ibug-face-detection/resolve/main/retina_face/Resnet50_Final.pth
pip install -e .
cd ..

# face_alignment must be installed from a local clone (editable mode)
git clone https://github.com/hhj1897/face_alignment.git
pip install -e face_alignment
```

Then pass `detector=retinaface` to `demo.py` or `extract_features.py`.
## Demo

Run `demo.py` to transcribe a video. Face detection and mouth cropping are handled automatically.
Download a pretrained model before running. The Huge (high-resource) checkpoint is recommended for best accuracy. For a lighter alternative, use Base+ (high-resource). See Pretrained Models for all options.
```bash
python demo.py \
    video=path/to/video.mp4 \
    model.pretrained_model_path=path/to/checkpoint.pth
```

Output:

```
============================================================
Modality : av
Video    : path/to/video.mp4
Result   : THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG
============================================================
```
```bash
# Audio-visual (default)
python demo.py video=video.mp4 model.pretrained_model_path=model.pth

# Lip reading only
python demo.py video=video.mp4 model.pretrained_model_path=model.pth modality=v

# Audio only
python demo.py video=video.mp4 model.pretrained_model_path=model.pth modality=a
```

By default, MediaPipe is used for face landmark detection. For higher accuracy, use RetinaFace+FAN (requires the ibug packages and a CUDA GPU):

```bash
python demo.py video=video.mp4 model.pretrained_model_path=model.pth detector=retinaface
```

To use a different backbone, override the backbone config:

```bash
python demo.py video=video.mp4 model.pretrained_model_path=model.pth \
    model/backbone=resnet_transformer_large
```

Any Hydra override works. For example, to change the beam size: `decode.beam_size=10`.
## Extract Encoder Features

Extract learned audio-visual representations for your own downstream tasks (e.g., emotion recognition, speaker verification, audio-visual synchronization).
While demo.py uses the full model (encoder + decoder) for transcription, extract_features.py outputs only the encoder representations.
You can extract features from two types of checkpoints:
- Fine-tuned models (self-supervised + semi-supervised): Best for tasks that benefit from supervised speech knowledge
- Self-supervised models: Pure self-supervised representations, useful if you want to fine-tune on a task that requires more general knowledge.
See Pretrained Models for download links.
```bash
# Using a fine-tuned checkpoint
python extract_features.py \
    video=path/to/video.mp4 \
    model.pretrained_model_path=path/to/finetuned_checkpoint.pth \
    output=features.pt

# Using a self-supervised checkpoint
python extract_features.py \
    video=path/to/video.mp4 \
    model.pretrained_model_path=path/to/selfsup_checkpoint.pth \
    output=features.pt

# Single modality
python extract_features.py \
    video=path/to/video.mp4 \
    model.pretrained_model_path=path/to/checkpoint.pth \
    modality=v output=video_features.pt

# Use RetinaFace+FAN for face detection (optional, requires ibug packages)
python extract_features.py \
    video=path/to/video.mp4 \
    model.pretrained_model_path=path/to/checkpoint.pth \
    detector=retinaface output=features.pt
```

Load in Python:
```python
import torch

features = torch.load("features.pt")
features["audio_visual"]  # numpy array, shape (T, D) — fused audio-visual encoder output
features["video"]         # numpy array, shape (T, D) — visual encoder output
features["audio"]         # numpy array, shape (T, D) — audio encoder output
```
## Pretrained Models

### Self-supervised checkpoints

These are self-supervised pre-trained checkpoints from USR. Use these if you want to extract representations without any supervised fine-tuning, e.g., to fine-tune for your own downstream task.

| Model | Data | Download |
|---|---|---|
| Base | LRS3 | checkpoint |
| Base+ | LRS3+Vox2 | checkpoint |
| Large | LRS3+Vox2 | checkpoint |
| Huge | LRS2+LRS3+Vox2+AVS | checkpoint |
### Fine-tuned checkpoints

These checkpoints have been fine-tuned with semi-supervised learning for speech recognition. Use these for transcription (demo.py) or to extract features that include supervised speech knowledge.
**Low-resource**

| Model | Data | VSR (%) | ASR (%) | AVSR (%) | Download |
|---|---|---|---|---|---|
| Base | LRS3 | 36.2 | 3.0 | 2.9 | checkpoint |
| Base+ | LRS3+Vox2 | 26.4 | 2.5 | 2.4 | checkpoint |
| Large | LRS3+Vox2 | 23.7 | 2.3 | 2.2 | checkpoint |
**High-resource**

| Model | Data | VSR (%) | ASR (%) | AVSR (%) | Download |
|---|---|---|---|---|---|
| Base+ | LRS3+Vox2 | 24.8 | 1.4 | 1.2 | checkpoint |
| Large | LRS3+Vox2 | 21.5 | 1.3 | 1.0 | checkpoint |
| Huge | LRS2+LRS3+Vox2+AVS | 17.6 | 0.9 | 0.8 | checkpoint |
Backbone configs: `resnet_transformer_base` / `resnet_transformer_baseplus` / `resnet_transformer_large` / `resnet_transformer_huge`.
## Evaluation

Evaluate on a test set with WER computation:

```bash
python main.py \
    data.dataset.test_csv=path/to/test.csv \
    model/backbone=resnet_transformer_base \
    model.pretrained_model_path=path/to/checkpoint.pth
```

Greedy decoding:

```bash
python main.py ... decode.beam_size=1 decode.ctc_weight=0.0
```

### Decoding parameters
| Parameter | Default | Description |
|---|---|---|
| `decode.beam_size` | 40 | Beam search width |
| `decode.ctc_weight` | 0.1 | CTC weight (0.0 = pure attention) |
| `decode.maxlenratio` | 1.0 | Max output length ratio |
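`decode.ctc_weight` controls joint CTC/attention decoding (via the vendored ESPnet beam search): the attention-decoder and CTC scores are interpolated when ranking beam hypotheses. A schematic sketch of the scoring rule, not the repo's actual decoder code:

```python
# Schematic joint CTC/attention scoring for one hypothesis (illustrative).
# att_logp: log-probability from the attention decoder
# ctc_logp: log-probability from the CTC prefix scorer
def joint_score(att_logp: float, ctc_logp: float, ctc_weight: float = 0.1) -> float:
    # ctc_weight=0.0 -> pure attention decoding (as in the greedy example above)
    # ctc_weight=1.0 -> pure CTC scoring
    return (1.0 - ctc_weight) * att_logp + ctc_weight * ctc_logp
```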
**Speed tip:** For faster decoding at the cost of some accuracy, reduce the beam size and disable CTC rescoring:

```bash
decode.beam_size=1 decode.ctc_weight=0.0
```

The default (`beam_size=40`, `ctc_weight=0.1`) gives the best results but is slower. Intermediate values (e.g., `beam_size=10`) offer a middle ground.
### Robustness to noise
WER (%) at different babble-noise SNRs:

| Modality | 10 dB | 5 dB | 0 dB | -5 dB | Avg |
|---|---|---|---|---|---|
| ASR | 5.2 | 13.4 | 44.0 | 94.4 | 39.3 |
| AVSR | 3.7 | 5.6 | 14.0 | 33.1 | 14.1 |
```bash
python main.py \
    model/backbone=resnet_transformer_base \
    model.pretrained_model_path=path/to/checkpoint.pth \
    data.dataset.test_csv=path/to/test.csv \
    data.noise_path=path/to/babble_noise.npy \
    decode.beam_size=30 decode.ctc_weight=0.1 \
    decode.maxlenratio=0.4 decode.snr_target=0
```

Babble noise: download
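For reference, mixing noise at a target SNR just scales the noise so the speech-to-noise power ratio matches the target. A minimal numpy sketch, assuming mono waveforms (the helper name and file handling are illustrative, not the repo's loader):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at the target SNR (dB)."""
    # Tile or truncate the noise to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Scale the noise so that 10 * log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(speech**2)
    p_noise = np.mean(noise**2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# e.g. noisy = mix_at_snr(speech, np.load("babble_noise.npy"), snr_db=0.0)
# matches decode.snr_target=0 above
```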
### Robustness to long utterances
```bash
python main.py \
    data.frames_per_gpu_val=700 \
    model/backbone=resnet_transformer_base \
    model.pretrained_model_path=path/to/checkpoint.pth \
    data.dataset.test_csv=path/to/length_bucket.csv \
    decode.beam_size=1 decode.ctc_weight=0.0 \
    decode.maxlenratio=0.4
```

Length-bucketed test CSVs: 100-150 | 150-200 | 200-250 | 250-300 | 300-350 | 350-400 | 400-450 | 450-500 | 500-550 | 550-600 | Combined
### Out-of-distribution datasets
| Modality | Dataset | WER (%) |
|---|---|---|
| ASR | LibriSpeech test-clean | 15.4 |
| VSR | WildVSR | 73.7 |
| AVSR | AVSpeech | 25.0 |
```bash
python main.py \
    data.frames_per_gpu_val=700 \
    model/backbone=resnet_transformer_base \
    model.pretrained_model_path=path/to/checkpoint.pth \
    data.dataset.test_csv=path/to/test.csv \
    decode.beam_size=1 decode.ctc_weight=0.0 \
    decode.maxlenratio=0.4
```

Test CSVs: LibriSpeech | WildVSR | AVSpeech
### Full data preparation instructions (for batch evaluation)
Note: Several of these datasets are no longer available for download from their official sources. Unfortunately there is nothing we can do about this. If you need access to a specific dataset, we recommend contacting the original authors directly.
```bash
python preprocessing/extract_mouths.py \
    --src_dir /path/to/raw/videos \
    --tgt_dir /path/to/mouth/videos \
    --landmarks_dir /path/to/landmarks
```

Pre-computed landmarks for LRS2 and LRS3 can be downloaded from the Visual Speech Recognition repo.
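For intuition, a mouth crop is essentially a fixed-size region around the centroid of the mouth landmarks. A minimal sketch assuming the 68-point iBUG landmark convention (the repo's actual preprocessing may additionally smooth and align landmarks across frames):

```python
import numpy as np

def crop_mouth(frame: np.ndarray, landmarks: np.ndarray, size: int = 96) -> np.ndarray:
    """Crop a square mouth ROI around the mouth-landmark centroid.

    Assumes 68-point iBUG landmarks, where indices 48-67 are the mouth.
    """
    cx, cy = landmarks[48:68].mean(axis=0).round().astype(int)
    half = size // 2
    # Clamp so the crop stays inside the frame (edge cases kept simple here).
    y0 = max(cy - half, 0)
    x0 = max(cx - half, 0)
    return frame[y0 : y0 + size, x0 : x0 + size]
```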
Pre-built dataset CSVs:

| Split | Link |
|---|---|
| LRS3 test | download |
| LRS3 trainval | download |
| LRS3 train | download |
| LRS3 val | download |
| LRS3+Vox2 | download |
| LRS2+LRS3+Vox2+AVS | download |
Edit `conf/data/default.yaml` and set the video/audio directory prefixes for each dataset.
### Repository structure

```
.
├── demo.py                 # Transcribe a single video
├── extract_features.py     # Extract encoder features
├── main.py                 # Batch evaluation with WER
├── evaluator.py            # PyTorch Lightning evaluation module
├── models/usr.py           # USR model wrapper
├── data/                   # Dataset, transforms, samplers
├── preprocessing/          # Face detection + mouth cropping
├── espnet/                 # Vendored ESPnet (transformer, beam search, CTC)
├── conf/                   # Hydra configuration
└── utils/utils.py          # Tokenization
```
## Citation

```bibtex
@article{usr2,
  title={Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition},
  author={},
  journal={},
  year={2025}
}
```

This codebase builds on ESPnet, PyTorch Lightning, and Hydra. Preprocessing code is adapted from Visual Speech Recognition for Multiple Languages.


