
SGEM: Test-Time Adaptation for Automatic Speech Recognition via Sequential-Level Generalized Entropy Minimization (INTERSPEECH 2023 Oral Presentation)


Introduction

This repository contains the official PyTorch implementation of the following paper:

SGEM: Test-Time Adaptation for Automatic Speech Recognition via Sequential-Level Generalized Entropy Minimization
Changhun Kim, Joonhyung Park, Hajin Shim and Eunho Yang
Conference of the International Speech Communication Association (INTERSPEECH), 2023 (Oral Presentation, 348/2293 = 15.18%)

Abstract: Automatic speech recognition (ASR) models are frequently exposed to data distribution shifts in many real-world scenarios, leading to erroneous predictions. To tackle this issue, an existing test-time adaptation (TTA) method has recently been proposed to adapt the pre-trained ASR model on unlabeled test instances without source data. Despite decent performance gain, this work relies solely on naive greedy decoding and performs adaptation across timesteps at a frame level, which may not be optimal given the sequential nature of the model output. Motivated by this, we propose a novel TTA framework, dubbed SGEM, for general ASR models. To treat the sequential output, SGEM first exploits beam search to explore candidate output logits and selects the most plausible one. Then, it utilizes generalized entropy minimization and negative sampling as unsupervised objectives to adapt the model. SGEM achieves state-of-the-art performance for three mainstream ASR models under various domain shifts.
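
To make the objective concrete, here is a minimal PyTorch sketch of a generalized (Rényi) entropy objective computed over the per-frame class distributions of the selected logits. It illustrates the idea only and is not the exact loss in this repository; the tensor shape (time, vocab) and the order alpha are assumptions.

import torch
import torch.nn.functional as F

def generalized_entropy(logits: torch.Tensor, alpha: float = 0.3) -> torch.Tensor:
    """Illustrative sketch (not the repo's exact loss): Renyi entropy of order
    alpha (alpha != 1), averaged over time frames.

    H_alpha(p) = log(sum_i p_i ** alpha) / (1 - alpha); Shannon entropy is
    recovered in the limit alpha -> 1. Assumed shape: logits is (T, V).
    """
    probs = F.softmax(logits, dim=-1)  # per-frame distributions over the vocabulary
    h = torch.log(probs.pow(alpha).sum(dim=-1)) / (1.0 - alpha)
    return h.mean()

# Minimizing this sharpens the per-frame predictions at test time, e.g.:
#   loss = generalized_entropy(selected_beam_logits)
#   loss.backward()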

Environment Setup

conda create -y -n sgem python=3.7
conda activate sgem
pip install -r requirements.txt

Datasets

  • LibriSpeech

    • You can get test-other.tar.gz from LibriSpeech using the link above.
  • CHiME-3

    • You need to manually download the CHiME-3 dataset using the link above with a standard Linguistic Data Consortium account.
  • TED-LIUM 2

    • You can get the TED-LIUM 2 dataset using the link above.
    • You also need to preprocess the data with data/preprocess_ted.py and data/preprocess_ted.sh.
  • CommonVoice

    • You can get the Common Voice Corpus 5.1 dataset using the link above.
  • Valentini

    • You can get noisy_testset_wav.zip and testset_txt.zip in the Valentini dataset using the link above.
  • L2-Arctic

    • You can get the L2-Arctic dataset using the link above.
    • The speakers used for each native language are as follows:
    Language     Speaker
    Arabic       SKA
    Mandarin     BWC
    Hindi        RRBI
    Korean       HKK
    Spanish      EBVS
    Vietnamese   PNV
  • MS-SNSD

    • All background noises used in the paper are included in the res folder (res/*.wav).
    • Set speech_dir and snr_lower in conf/noisyspeech_synthesizer.cfg (see the sketch after this list).
    • You can create synthetic distribution-shift datasets with the following command:
    python corpus/noisyspeech_synthesizer.py
    
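For reference, here is a minimal sketch of the two settings mentioned above in conf/noisyspeech_synthesizer.cfg. The section name, comment style, and values follow the upstream MS-SNSD config layout and are assumptions here; check them against the file shipped in this repository.

[noisy_speech]
# directory containing the clean test utterances to corrupt (placeholder path)
speech_dir: /path/to/clean/speech
# lowest SNR (in dB) at which background noise is mixed in
snr_lower: 0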

Pre-trained Models

  • CTC-based Model
    • The CTC-based model will be downloaded automatically if you set asr to facebook/wav2vec2-base-960h (see the loading sketch after this list).
  • Conformer
    • You need to download the Conformer model yourself using the following command:
    wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_conformer_ctc_small_ls/versions/1.0.0/zip -P pretrained_models
    
  • Transducer
    • You need to download the Transducer model yourself using the following command:
    wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_conformer_transducer_small/versions/1.6.0/zip -P pretrained_models
    
  • 4-gram Language Model for CTC-based Model
    • You need to download the 4-gram language model yourself using the following commands:
    git lfs install
    git clone https://huggingface.co/patrickvonplaten/wav2vec2-base-100h-with-lm pretrained_models/wav2vec2-base-100h-with-lm
    
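As a quick sanity check that the downloads above are usable, here is a hedged Python sketch of how such checkpoints are typically loaded. The .nemo filenames are assumptions about what the NGC zip archives contain once extracted into pretrained_models; adjust them to the actual archive contents.

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import nemo.collections.asr as nemo_asr

# CTC-based model: fetched automatically from the Hugging Face Hub on first use
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
ctc_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Conformer / Transducer: restored from the .nemo files inside the NGC zips
# (the filenames below are assumptions; check the unzipped archives)
conformer = nemo_asr.models.ASRModel.restore_from(
    "pretrained_models/stt_en_conformer_ctc_small_ls.nemo")
transducer = nemo_asr.models.ASRModel.restore_from(
    "pretrained_models/stt_en_conformer_transducer_small.nemo")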

Run

You can run main.py using the command below:

python main.py \
    --config-name [CONFIG.YAML] \
    dataset_name=[DATASET_NAME] \
    dataset_dir=[DATASET_DIR]

Currently available parameters are as follows:

Parameter      Value
CONFIG.YAML    config.yaml, config_{sgem|suta}_{ctc|conformer|transducer}.yaml
DATASET_NAME   librispeech, chime, ted, commonvoice, valentini, l2arctic
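
For example, to run SGEM with the CTC-based model on LibriSpeech (the dataset path below is a placeholder):

python main.py \
    --config-name config_sgem_ctc.yaml \
    dataset_name=librispeech \
    dataset_dir=/path/to/LibriSpeech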

Contact

If you have any questions or comments, feel free to contact us at changhun.kim@kaist.ac.kr.

Citation

@inproceedings{sgem,
  title={{SGEM}: Test-Time Adaptation for Automatic Speech Recognition via Sequential-Level Generalized Entropy Minimization},
  author={Kim, Changhun and Park, Joonhyung and Shim, Hajin and Yang, Eunho},
  booktitle={Conference of the International Speech Communication Association (INTERSPEECH)},
  year={2023}
}

Acknowledgement

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).
