
Audio input requirements #36

Closed
AndyLogi opened this issue Aug 22, 2023 · 2 comments

@AndyLogi

Are there any specific requirements for audio files to make the results of NISQA valid?
I couldn't find any documentation in this repo or the original paper describing the audio requirements, but I was hoping to use home-made recordings to evaluate the performance of speech enhancement algorithms. Can any audio be used, and will it give valid results?

I've been running NISQA on some local files and have found that the MOS scores don't always correlate with subjectively listening to the files. Is there anything I should be doing to make these files valid for use in NISQA?
For example, are there requirements/recommendations on:

  • total duration of audio file;
  • proportion of speech and non-speech in the file;
  • level requirements;
  • suggested SNR for evaluation files (before speech enhancement is applied).
@v-nhandt21

Also, which is the best sample rate for input audio? :))

@gabrielmittag
Owner

gabrielmittag commented Sep 12, 2023

Hi, the model is trained to predict the quality of speech that was transmitted via telecommunication systems. It wasn't trained to predict the quality of enhanced speech, so the correlations might not be that high if you apply it to your samples. Regarding the other questions, I can give the following recommendations. Let me know if you have more questions.

  • total duration of audio file: I recommend 6–12 seconds, but it should also work for longer or shorter files.

  • proportion of speech and non-speech in the file: I recommend at least 50%. The model should be able to handle 100% speech.

  • level requirements: The model is trained with an active speech level of -26 dB (according to ITU-T P.56) as default. Speech samples with a different level will be judged as a loudness degradation and result in a lower MOS score. The following figure shows the overall predicted MOS for different speech levels:
    [figure: overall predicted MOS vs. active speech level]

  • suggested SNR for evaluation files (before speech enhancement is applied): I cannot comment on this as the model is not trained for speech enhancement algorithms.

  • best sample rate: The model is trained with 48 kHz, but it is able to handle any sample rate. However, the missing frequencies for lower sample rates will be judged as a quality degradation by the model and will result in lower MOS scores. The following figure shows the overall MOS for different cut-off frequencies of a low-pass filter:
    [figure: overall predicted MOS vs. low-pass cut-off frequency]

(Figures taken from Deep Learning Based Speech Quality Prediction)
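To act on the level and sample-rate recommendations above, a minimal preprocessing sketch could resample to 48 kHz and scale to roughly -26 dBFS before running NISQA. Note this is an assumption on my part, not part of the NISQA repo: a plain RMS over the whole file is only a rough stand-in for the ITU-T P.56 active speech level, which gates out silent segments, so results will differ somewhat for files with long pauses.

```python
import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 48000        # model was trained on 48 kHz audio
TARGET_LEVEL_DB = -26.0  # recommended active speech level (ITU-T P.56)

def prepare_for_nisqa(audio, sr):
    """Resample to 48 kHz and scale to ~-26 dBFS RMS.

    NOTE: whole-file RMS is only an approximation of the P.56
    active speech level, which excludes non-speech segments.
    """
    if sr != TARGET_SR:
        g = np.gcd(sr, TARGET_SR)
        audio = resample_poly(audio, TARGET_SR // g, sr // g)
    rms = np.sqrt(np.mean(audio ** 2))
    if rms > 0:
        gain = 10 ** (TARGET_LEVEL_DB / 20) / rms
        audio = audio * gain
    return audio

# example: a 2-second, 8 kHz test tone standing in for a recording
sr = 8000
t = np.arange(2 * sr) / sr
x = 0.5 * np.sin(2 * np.pi * 440 * t)
y = prepare_for_nisqa(x, sr)
```

After this step, `y` has the 48 kHz rate the model expects and an overall RMS level of about -26 dBFS, which should avoid the loudness-degradation penalty shown in the figure above.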
