
VOSK Versus Coqui STT - A practical comparison. #892

Open
ls-milkyway opened this issue Mar 21, 2022 · 4 comments

@ls-milkyway

VOSK Vs Coqui STT

Description:

A hard-to-transcribe audio track (bad words, slang, low voices, background noise, a typical US accent, etc.) from the famous US film Reservoir Dogs (1992) was used to get outputs from both models, which were then compared against the film's original SRT file.

The original SRT file of the film was converted to text, with US spelling & grammar corrections applied by another AI project. The audio was also split using Spleeter to check the effect of background-noise removal on the transcriptions.

Finally, post-processing of the transcribed files from the 2 models was carried out using the same project that was used to correct the original SRT file.

Please note: the best models available for Vosk & STT were used (STT does not have a dedicated US model).

WER & WRR scores were calculated using asr-evaluation for the original and post-processed transcribed files (with & without Spleeter):

-------------Original Transcriptions-----------------------
STT

WER: 79.357% ( 10795 / 13603)
WRR: 21.135% ( 2875 / 13603)

STT-Spleeter

WER: 79.821% ( 10858 / 13603)
WRR: 20.635% ( 2807 / 13603)

VOSK

WER: 54.554% ( 7421 / 13603)
WRR: 47.151% ( 6414 / 13603)

VOSK-Spleeter

WER: 56.333% ( 7663 / 13603)
WRR: 45.181% ( 6146 / 13603)

---------------Post Processing (with spelling corrections)------------
STT

WER: 78.130% ( 10628 / 13603)
WRR: 22.473% ( 3057 / 13603)

STT-Spleeter

WER: 78.571% ( 10688 / 13603)
WRR: 21.951% ( 2986 / 13603)

VOSK

WER: 52.047% ( 7080 / 13603)
WRR: 49.835% ( 6779 / 13603)

VOSK-Spleeter

WER: 53.922% ( 7335 / 13603)
WRR: 47.835% ( 6507 / 13603)
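
For reference, the WER and WRR figures above follow the usual alignment-based definitions: with N reference words, WER = (substitutions + deletions + insertions) / N and WRR = matched words / N, which is why each pair does not sum to exactly 100% when insertions occur. A minimal Python sketch of that computation (not asr-evaluation's own code; the file names are the ones attached later in this thread):

```python
# Minimal WER/WRR sketch: align reference and hypothesis word sequences with
# Levenshtein edit distance, then count exact matches on the optimal path.
def wer_wrr(ref_words, hyp_words):
    n, m = len(ref_words), len(hyp_words)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # match / substitution
    # Walk back through the table to count exactly matched reference words.
    i, j, hits = n, m, 0
    while i > 0 and j > 0:
        sub = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + sub:
            hits += 1 - sub
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return dist[n][m] / n, hits / n

ref = open("base-final.txt", encoding="utf-8").read().lower().split()
hyp = open("vosk-final.txt", encoding="utf-8").read().lower().split()
wer, wrr = wer_wrr(ref, hyp)
print(f"WER: {wer:.3%}  WRR: {wrr:.3%}")
```

The actual numbers above came from the asr-evaluation package, which additionally reports per-sentence statistics such as SER.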

Conclusion:

  1. Spleeter does not improve transcription in any way.
  2. Post-processing of the transcribed file is recommended (~2.49% improvement with the best model, i.e. VOSK).
  3. Dedicated models for language dialects are the better approach, at least until some exceptional breakthrough in detecting dialects from voice is practically implemented.
  4. Winner: VOSK. It performed better not only on WER scores but also on word segmentation; STT has a word segmentation problem.

Future:
Vosk needs better training methods, with proper documentation for training models on film audio, so that it can do exceptionally well at real-world SRT file generation and keep itself ahead of the competition.

@nshmyrev
Collaborator

Thank you! A single-file evaluation might not be perfect, but it is still very interesting. If you share the intermediate files (logs, output, reference file), I'd be happy to take a closer look at what is going on there.

@ls-milkyway
Author

ls-milkyway commented Mar 23, 2022

I will post the full procedure & files... just need some free time. Meanwhile, can you comment on this guide for training Vosk models? Is it worth trying? Kindly clone/fork & correct it (if some corrections are needed).

I am also working on remuxing & trimming original DVD films (with stereo audio) using SRT time codes with ffmpeg, so that models can be trained on some complex audio tied to character dialogue. This also makes SRT file correction easy: there is no need to process the full file for spelling/grammar corrections.
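
A rough sketch of that trimming step (Python driving ffmpeg; "film.mkv" / "film.srt" are placeholder names and the SRT parsing is simplified):

```python
# Sketch: cut one clip per subtitle cue so each training sample pairs a short
# stereo audio segment with its known dialogue line. Assumes ffmpeg on PATH.
import re
import subprocess

TIME = re.compile(r"(\d+):(\d+):(\d+),(\d+) --> (\d+):(\d+):(\d+),(\d+)")

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

with open("film.srt", encoding="utf-8") as f:
    cues = TIME.findall(f.read())

for idx, cue in enumerate(cues):
    start, end = to_seconds(*cue[:4]), to_seconds(*cue[4:])
    # Stream copy keeps the original audio untouched, though cut points
    # may snap to packet boundaries rather than the exact timestamps.
    subprocess.run([
        "ffmpeg", "-ss", str(start), "-to", str(end),
        "-i", "film.mkv", "-vn", "-acodec", "copy",
        f"clip_{idx:05d}.mka",
    ], check=True)
```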

@nshmyrev
Collaborator

can you comment on this guide for training Vosk models? Is it worth trying?

It is not complete; it's better to follow our training process instead: https://github.com/alphacep/vosk-api/tree/master/training

@ls-milkyway
Author

ls-milkyway commented Mar 24, 2022

It is not complete; it's better to follow our training process instead: https://github.com/alphacep/vosk-api/tree/master/training

That badly needs a good description: how to use it with prepared audio and available text (the original transcription). I am not much into training or the deeper workings of AI; all I can see is that it is shell-based and meant to be used on Linux. Kindly provide at least a brief outline, if not full steps?

---------------------------Below is the procedure & files for VOSK Vs STT-------------------------------------------
Models:

Vosk Model: en-us-0.22
STT Model: huge vocab

An original DVD was re-muxed (without re-encoding) using the stereo audio, and the subtitle file was downloaded from OpenSubtitles.

Some necessary steps carried out for ideal transcription were:

  1. The original audio was passed, without re-encoding, to Spleeter, which converts it into a WAV file; this may be the reason for the low Spleeter scores. The WAV file was then merged back into the video, so 2 files were created: one normal, the other with Spleeter (with the minimum encoding/transformation possible). A rough equivalent is sketched below.
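
For anyone reproducing this, the Spleeter pass and the merge back look roughly like the following (a sketch based on Spleeter's documented Python API; "audio.wav" and "film.mkv" are placeholder names, and container details may vary):

```python
# Sketch: isolate vocals with Spleeter, then mux the vocals back into the
# video with both streams copied untouched. Note Spleeter outputs WAV,
# which is the extra transformation suspected of hurting the Spleeter scores.
import subprocess
from spleeter.separator import Separator

separator = Separator("spleeter:2stems")  # 2stems = vocals + accompaniment
separator.separate_to_file("audio.wav", "separated/")
# -> separated/audio/vocals.wav and separated/audio/accompaniment.wav

subprocess.run([
    "ffmpeg", "-i", "film.mkv", "-i", "separated/audio/vocals.wav",
    "-map", "0:v", "-map", "1:a",     # video from the film, audio from Spleeter
    "-c:v", "copy", "-c:a", "copy",   # no re-encoding of either stream
    "film-spleeter.mkv",
], check=True)
```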

The original & transcribed files are:
a) Reservoir Dogs.1992.BluRay.720p.x264.YIFY.srt.
b) stt.txt (note: I did not generate an SRT file, as there was no clear documentation for it).
c) stt-spleeter.txt (transcribed with spleeter).
d) vosk.srt
e) vosk-spleeter.srt

Links (please note that the SRT files are renamed to .txt, as GitHub does not allow .srt uploads):

Reservoir Dogs.1992.BluRay.720p.x264.YIFY.txt
stt.txt
stt-spleeter.txt
vosk.txt
vosk-spleeter.txt

  1. The SRT file was converted to a text file using happyscribe, which just removes the time codes from the SRT; then the empty lines and the leading + trailing spaces of the sentences were removed using Notepad++. A scripted equivalent is sketched below.
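
A scripted equivalent of that conversion (a sketch; the regex assumes the standard SRT timecode layout, and base.txt is the cleaned output used later for scoring):

```python
# Sketch: drop SRT cue numbers, timecode lines, and blank lines, and trim
# leading/trailing whitespace, leaving one subtitle line per output line.
import re

TIMECODE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

kept = []
with open("Reservoir Dogs.1992.BluRay.720p.x264.YIFY.srt", encoding="utf-8") as f:
    for raw in f:
        line = raw.strip()
        if not line or line.isdigit() or TIMECODE.match(line):
            continue  # cue index, timecode, or empty line
        kept.append(line)

with open("base.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(kept))
```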

Since SRT files are prone to spelling & grammar mistakes (as they are mostly generated by OCR-based converters), the text was further corrected using a US spelling & grammar AI project. One can also use the DSpellCheck plugin in Notepad++ with a US dictionary installed, but the results are not as accurate as with the former.

Since SER (sentence error rate) is also an important parameter, I ran asr-evaluation on the original (base.txt) & vosk.txt (both with sentences intact), but the sentence error rate was 100% in both the STT & Vosk cases, and the WER scores were also lower. Hence I opted to process both files with neuspell or symspell; to be frank, I don't remember which one I used.

The final generated files contained all-lowercase words in a single sentence (as the SER was 100% anyway, sentence structure was irrelevant). An example of this step is sketched below.
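
As a concrete example, here is what that post-processing can look like with symspellpy (one of the two candidates mentioned above; the dictionary loading follows the symspellpy docs, and the file names are the ones from this thread, used illustratively):

```python
# Sketch: compound spelling correction with symspellpy, then normalising to
# lowercase words on a single line, matching the *-final.txt files scored above.
import pkg_resources
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

text = open("vosk.txt", encoding="utf-8").read()
# lookup_compound corrects word splits and merges as well as misspellings;
# for a long transcript it may be faster to correct line by line.
suggestions = sym_spell.lookup_compound(text, max_edit_distance=2)
corrected = suggestions[0].term if suggestions else text

with open("vosk-final.txt", "w", encoding="utf-8") as f:
    f.write(" ".join(corrected.lower().split()))
```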

The final processed files after spelling & grammar corrections are:
a) base-final.txt (Reservoir Dogs.1992.BluRay.720p.x264.YIFY.srt corrected & renamed base-final for asr-evaluation)
b) stt-final.txt
c) stt-spleeter-final.txt
d) vosk-final.txt
e) vosk-spleeter-final.txt

I have attached all of them.
base-final.txt
stt-final.txt
stt-spleeter-final.txt
vosk-final.txt
vosk-spleeter-final.txt

Your suggestions, ideas, or tips are always welcome.
