Conventional speech quality assessment (SQA) models predict a single score for the entire audio clip. In many applications, however, quality estimates at a finer temporal resolution are desirable. This repository implements a framework for training SQA models that predict frame-level quality scores. Building upon SSL-MOS [1], the idea is to add a consistency constraint that brings the encoder outputs of an audio segment processed within its context and detached from it close to each other in the embedding space.
[1] Erica Cooper, Wen-Chin Huang, Tomoki Toda, and Junichi Yamagishi, “Generalization Ability of MOS Prediction Networks,” International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8442–8446, 2022.
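The following is a minimal sketch of such a consistency loss, not the code in this repository; the encoder interface, the segment handling, and the use of an MSE distance are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def consistency_loss(encoder, audio, seg_start, seg_end, frames_per_sample):
    """Sketch of a consistency constraint between two views of a segment.

    Assumed interface: `encoder` maps a (batch, samples) waveform to
    frame-level embeddings of shape (batch, frames, dim); `seg_start` and
    `seg_end` are sample indices, `frames_per_sample` is the encoder's
    frame rate. All of these are illustrative assumptions.
    """
    # View 1: frame embeddings of the segment encoded within its full context
    ctx = encoder(audio.unsqueeze(0))
    f0 = int(seg_start * frames_per_sample)
    f1 = int(seg_end * frames_per_sample)
    # View 2: frame embeddings of the segment encoded detached from the context
    iso = encoder(audio[seg_start:seg_end].unsqueeze(0))
    # Truncate both views to a common length and pull them together
    n = min(f1 - f0, iso.shape[1])
    return F.mse_loss(ctx[:, f0:f0 + n], iso[:, :n])
```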
```
git clone https://github.com/fgnt/local_sqa.git
cd local_sqa
pip install -e .[fgnt]
```

Alternatively, install directly from GitHub:

```
pip install git+https://github.com/fgnt/local_sqa.git
```

We train our models on BVCC + NISQA. Training can be extended to other datasets with utterance-level MOS annotations.
Please refer to the data preparation instructions for downloading and preparing the data.
We use SSL-based encoders with a simple decoder architecture (one BLSTM layer followed by a linear layer and average pooling). Encoder configurations are provided in `conf/encoder`.
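For illustration, a decoder of this shape could look as follows in PyTorch; the class name, layer sizes, and input dimension are placeholders rather than the repository's actual implementation.

```python
import torch.nn as nn

class FrameLevelDecoder(nn.Module):
    """Sketch of the decoder: BLSTM -> linear yields frame-level scores;
    average pooling over time yields the utterance-level MOS estimate."""

    def __init__(self, input_dim=768, hidden_dim=256):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        self.linear = nn.Linear(2 * hidden_dim, 1)

    def forward(self, encoder_out):  # encoder_out: (B, T, input_dim)
        frame_scores = self.linear(self.blstm(encoder_out)[0])  # (B, T, 1)
        utt_score = frame_scores.mean(dim=1)                    # (B, 1)
        return frame_scores.squeeze(-1), utt_score.squeeze(-1)
```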
Training can be started with
```
python -m local_sqa.train
```

The default behaviour is as follows:
- Use `conf/default.yaml` as configuration
- Create a new directory under `./exp/` to save logs and checkpoints
- Load `bvcc.json` and `nisqa.json` from `local_sqa/data`
- Use `wav2vec2_base` as encoder
You can visualize the training progress with TensorBoard:
```
tensorboard --logdir ./exp/
```

We use Hydra for configuration management.
You can change the output directory by overriding `base_dir`.
Databases are configured under the key `databases`.
You can add or remove databases by adding or deleting entries there.
To change the path pointing to the database structure file, override `json_path`, e.g., `databases.bvcc.json_path=path/to/bvcc.json`.
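For example, several overrides can be combined in one call; the experiment directory below is illustrative, and the `encoder` group name is inferred from the `conf/encoder` layout rather than confirmed by the repository:

```
python -m local_sqa.train base_dir=./my_exp encoder=wav2vec2_base databases.bvcc.json_path=path/to/bvcc.json
```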
Instructions to follow soon.