VOSK Versus Coqui STT - A practical comparison. #892
Comments
Thank you! A single-file evaluation might not be perfect, but it is still very interesting. If you share the intermediate files (logs, output, reference file), I'd be happy to take a closer look at what is going on there.
I will put up the full procedure and files; I just need some free time. Meanwhile, can you comment on this guide to training Vosk models? Is it worth trying? Kindly clone/fork and correct it if corrections are needed. I am also working on remuxing and trimming original DVD films (with stereo audio) using SRT time codes with ffmpeg, so that models can be trained on complex audio aligned to character dialogue. This also makes SRT file corrections easy, since there is no need to process the full file for spelling/grammar corrections.
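The SRT-driven trimming idea above can be sketched roughly as follows. The function names are mine, not from any project; only the ffmpeg flags (`-ss`, `-t`, `-c copy`) are real, and `-c copy` keeps the stream unencoded, as described:

```python
def srt_time_to_seconds(ts):
    """Convert an SRT timestamp like '00:01:02,500' to seconds."""
    h, m, rest = ts.split(":")
    s, ms = rest.split(",")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def ffmpeg_trim_cmd(infile, outfile, start_ts, end_ts):
    """Build an ffmpeg command that copies (no re-encode) one dialogue span."""
    start = srt_time_to_seconds(start_ts)
    duration = srt_time_to_seconds(end_ts) - start
    return ["ffmpeg", "-ss", f"{start:.3f}", "-i", infile,
            "-t", f"{duration:.3f}", "-c", "copy", outfile]

# One clip per subtitle cue; run each command with subprocess.run() if desired.
cmd = ffmpeg_trim_cmd("movie.wav", "clip_001.wav",
                      "00:01:02,500", "00:01:05,000")
```

Looping this over every cue in the SRT file would give you one audio clip per dialogue line, which is the pairing a training pipeline needs.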
It is not complete; it's better to follow our training process instead: https://github.com/alphacep/vosk-api/tree/master/training
That badly needs a good description of how to use it with prepared audio and available text (the original transcription). I am not deeply into training or the inner workings of ASR; all I can see is that it is shell-based and meant for Linux. Could you kindly provide at least a brief outline, if not the full steps?

Vosk model: en-us-0.22. An original DVD was re-muxed (without re-encoding) using the stereo audio track, and the subtitle file was downloaded from opensubtitles. Some necessary steps carried out for an ideal transcription were:
The original & transcribed files are linked below (please note that the SRT files are renamed to .txt, as GitHub does not allow .srt uploads): Reservoir Dogs.1992.BluRay.720p.x264.YIFY.txt
Since SRT files are prone to spelling and grammar mistakes (they are mostly generated by OCR-based converters), the file was further corrected using a US-English spelling and grammar AI project. One can also use the DSpellCheck plugin in Notepad++ with a US dictionary installed, but the results are not as accurate as with the former.

Since SER (sentence error rate) is also an important parameter, I ran asr-evaluation on the original (base.txt) and vosk.txt with sentences kept intact, but the sentence error rate was 100% for both STT and Vosk, and the WER scores were also lower in that setup. I therefore processed both files using NeuSpell or SymSpell (to be frank, I don't remember which one I used). The final file contained all-lowercase words in a single sentence (since SER was 100% anyway, sentence structure was irrelevant).

The final processed files after spelling and grammar corrections are attached. Your suggestions, ideas, and tips are always welcome.
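For reference, the kind of normalization that produces those final files (everything lowercase, punctuation stripped, one long "sentence") can be reproduced in a few lines of Python. This is a sketch of the idea, not the exact tool used in the comparison:

```python
import string

def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace so that
    WER compares bare words only. Note apostrophes are removed too, so
    "don't" becomes "dont" - acceptable when both sides get the same pass."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())
```

Applying the same pass to both the reference and the hypothesis keeps the comparison fair, since casing and punctuation differences would otherwise count as word errors.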
VOSK vs Coqui STT
Description:
A hard-to-transcribe audio track (profanity, slang, low voices, background noise, typical US accents, etc.) from the famous US film Reservoir Dogs (1992) was used to get output from both models, which was then compared to the film's original SRT file.
The original SRT file of the film was converted to text, with US spelling and grammar corrections applied by another AI project. The audio was also split using Spleeter to check the effect of removing background noise on the transcriptions.
Finally, the transcribed files from the two models were post-processed using the same project that was used to correct the original SRT file.
Please note: the best models available for Vosk and STT were used (Coqui STT does not have a dedicated US model).
WER & WRR scores were calculated using asr-evaluation, comparing the original and post-processed transcribed files (with and without Spleeter):
------------- Original Transcriptions -------------
STT
WER: 79.357% ( 10795 / 13603)
WRR: 21.135% ( 2875 / 13603)
STT-Spleeter
WER: 79.821% ( 10858 / 13603)
WRR: 20.635% ( 2807 / 13603)
VOSK
WER: 54.554% ( 7421 / 13603)
WRR: 47.151% ( 6414 / 13603)
VOSK-Spleeter
WER: 56.333% ( 7663 / 13603)
WRR: 45.181% ( 6146 / 13603)
------------- Post-Processing (with spelling corrections) -------------
STT
WER: 78.130% ( 10628 / 13603)
WRR: 22.473% ( 3057 / 13603)
STT-Spleeter
WER: 78.571% ( 10688 / 13603)
WRR: 21.951% ( 2986 / 13603)
VOSK
WER: 52.047% ( 7080 / 13603)
WRR: 49.835% ( 6779 / 13603)
VOSK-Spleeter
WER: 53.922% ( 7335 / 13603)
WRR: 47.835% ( 6507 / 13603)
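For anyone who wants to sanity-check numbers like these without installing asr-evaluation, here is a minimal word-level sketch of how WER and WRR are computed (asr-evaluation does this more carefully; the function name is mine). Note that WER and WRR can sum to more than 100% because insertions count against WER but do not reduce WRR, which is exactly what the figures above show:

```python
def wer_wrr(ref_words, hyp_words):
    """Word error rate and word recognition rate via Levenshtein alignment.
    WER = (S + D + I) / N and WRR = H / N, where H = N - S - D."""
    n, m = len(ref_words), len(hyp_words)
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    # Backtrack along one optimal alignment to count exact matches (hits).
    i, j, hits = n, m, 0
    while i > 0 and j > 0:
        if ref_words[i - 1] == hyp_words[j - 1] and d[i][j] == d[i - 1][j - 1]:
            hits += 1
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1          # substitution
        elif d[i][j] == d[i][j - 1] + 1:
            j -= 1                        # insertion
        else:
            i -= 1                        # deletion
    return d[n][m] / n, hits / n
```

For example, reference "a b" against hypothesis "a x b" gives WER 50% and WRR 100% (one insertion, both reference words recognized), so the two rates sum to 150%.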
Conclusion:
Future:
Vosk needs better-documented training workflows for film audio, so it can do exceptionally well in real-world SRT file generation and stay ahead of the competition.