
Audio input requirements #36

Closed
AndyLogi opened this issue Aug 22, 2023 · 2 comments

@AndyLogi

Are there any specific requirements for audio files to make the results of NISQA valid?
I couldn't find any documentation in this repo or the original paper describing the audio requirements, but I was hoping to use home-made recordings to evaluate the performance of speech enhancement algorithms. Can any audio be used, and will it give valid results?

I've been running NISQA on some local files and have found that the MOS scores don't always correlate with subjectively listening to the files. Is there anything I should be doing to make these files valid for use in NISQA?
For example, are there requirements/recommendations on:

  • total duration of audio file;
  • proportion of speech and non-speech in the file;
  • level requirements;
  • suggested SNR for evaluation files (before speech enhancement is applied).
@v-nhandt21

Also, which is the best sample rate for input audio? :))

@gabrielmittag
Owner

gabrielmittag commented Sep 12, 2023

Hi, the model is trained to predict the quality of speech that was transmitted via telecommunication systems. It wasn't trained to predict the quality of enhanced speech, so the correlations might not be that high if you apply it to your samples. Regarding the other questions, I can give the following recommendations. Let me know if you have more questions.

  • total duration of audio file: I recommend 6–12 seconds, but it should also work for longer or shorter files.

  • proportion of speech and non-speech in the file: I recommend at least 50%. The model should be able to handle 100% speech.

  • level requirements: The model is trained with an active speech level of -26 dB (according to ITU-T P.56) as default. Speech samples with a different level will be judged as a loudness degradation and result in a lower MOS score. The following figure shows the overall predicted MOS for different speech levels:
    [figure: overall predicted MOS vs. active speech level]

  • suggested SNR for evaluation files (before speech enhancement is applied): I cannot comment on this as the model is not trained for speech enhancement algorithms.

  • best sample rate: The model is trained with 48 kHz, but it is able to handle any sample rate. However, the missing frequencies for lower sample rates will be judged as a quality degradation by the model and will result in lower MOS scores. The following figure shows the overall MOS for different cut-off frequencies of a low-pass filter:
    [figure: overall predicted MOS vs. low-pass cut-off frequency]

(Figures taken from Deep Learning Based Speech Quality Prediction)
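To act on the level and sample-rate recommendations above, a minimal preprocessing sketch could resample to 48 kHz and scale to roughly -26 dBFS before running NISQA. Note this is an assumption on my part, not part of the NISQA repo: a plain RMS over the whole file is only a rough stand-in for the ITU-T P.56 active speech level, which gates out silent segments, so results will differ somewhat for files with long pauses.

```python
import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 48000        # model was trained on 48 kHz audio
TARGET_LEVEL_DB = -26.0  # recommended active speech level (ITU-T P.56)

def prepare_for_nisqa(audio, sr):
    """Resample to 48 kHz and scale to ~-26 dBFS RMS.

    NOTE: whole-file RMS is only an approximation of the P.56
    active speech level, which excludes non-speech segments.
    """
    if sr != TARGET_SR:
        g = np.gcd(sr, TARGET_SR)
        audio = resample_poly(audio, TARGET_SR // g, sr // g)
    rms = np.sqrt(np.mean(audio ** 2))
    if rms > 0:
        gain = 10 ** (TARGET_LEVEL_DB / 20) / rms
        audio = audio * gain
    return audio

# example: a 2-second, 8 kHz test tone standing in for a recording
sr = 8000
t = np.arange(2 * sr) / sr
x = 0.5 * np.sin(2 * np.pi * 440 * t)
y = prepare_for_nisqa(x, sr)
```

After this step, `y` has the 48 kHz rate the model expects and an overall RMS level of about -26 dBFS, which should avoid the loudness-degradation penalty shown in the figure above.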
