
VOSK Versus Coqui STT - A practical comparison. #892

Open
ls-milkyway opened this issue Mar 21, 2022 · 4 comments

@ls-milkyway

VOSK Vs Coqui STT

Description:

A hard-to-transcribe audio track (bad words, slang, low voices, background noise, a typical US accent, etc.) from the famous US film Reservoir Dogs (1992) was used to get outputs from both models, which were then compared against the film's original SRT file.

The original SRT file of the film was converted to text, with US spelling & grammar corrections applied by another AI project. The audio was also split using Spleeter to check the effect of background-noise removal on the transcriptions.

Finally, post-processing of the transcribed files from the 2 models was carried out using the same project that was used to correct the original SRT file.

Please note: the best models available for Vosk & STT were used (STT does not have a dedicated US model).

WER & WRR scores were calculated using asr-evaluation for the original and post-processed transcribed files (with & without Spleeter):

-------------Original Transcriptions-----------------------
STT

WER: 79.357% ( 10795 / 13603)
WRR: 21.135% ( 2875 / 13603)

STT-Spleeter

WER: 79.821% ( 10858 / 13603)
WRR: 20.635% ( 2807 / 13603)

VOSK

WER: 54.554% ( 7421 / 13603)
WRR: 47.151% ( 6414 / 13603)

VOSK-Spleeter

WER: 56.333% ( 7663 / 13603)
WRR: 45.181% ( 6146 / 13603)

---------------Post Processing (with spelling corrections)------------
STT

WER: 78.130% ( 10628 / 13603)
WRR: 22.473% ( 3057 / 13603)

STT-Spleeter

WER: 78.571% ( 10688 / 13603)
WRR: 21.951% ( 2986 / 13603)

VOSK

WER: 52.047% ( 7080 / 13603)
WRR: 49.835% ( 6779 / 13603)

VOSK-Spleeter

WER: 53.922% ( 7335 / 13603)
WRR: 47.835% ( 6507 / 13603)
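
For reference, the WER and WRR figures above follow the usual alignment-based definitions: with N reference words, WER = (substitutions + deletions + insertions) / N and WRR = matched words / N, which is why each pair does not sum to exactly 100% when insertions occur. A minimal Python sketch of that computation (not asr-evaluation's own code; the file names are the ones attached later in this thread):

```python
# Minimal WER/WRR sketch: align reference and hypothesis word sequences with
# Levenshtein edit distance, then count exact matches on the optimal path.
def wer_wrr(ref_words, hyp_words):
    n, m = len(ref_words), len(hyp_words)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # match / substitution
    # Walk back through the table to count exactly matched reference words.
    i, j, hits = n, m, 0
    while i > 0 and j > 0:
        sub = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + sub:
            hits += 1 - sub
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return dist[n][m] / n, hits / n

ref = open("base-final.txt", encoding="utf-8").read().lower().split()
hyp = open("vosk-final.txt", encoding="utf-8").read().lower().split()
wer, wrr = wer_wrr(ref, hyp)
print(f"WER: {wer:.3%}  WRR: {wrr:.3%}")
```

The actual numbers above came from the asr-evaluation package, which additionally reports per-sentence statistics such as SER.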

Conclusion:

  1. Spleeter does not improve transcription in any way.
  2. Post-processing of the transcribed file is recommended (~2.49% improvement with the best model, i.e. VOSK).
  3. Dedicated models for language dialects are the better approach, at least until some exceptional breakthrough in detecting dialects from voice is practically implemented.
  4. Winner: VOSK. It performed better not only on WER scores but also on word segmentation; STT has a word segmentation problem.

Future:
Vosk needs better training methods, with proper documentation for training models on film audio, so that it can do exceptionally well at real-world SRT file generation and keep itself ahead of the competition.

@nshmyrev
Collaborator

Thank you! A single-file evaluation might not be perfect, but it is still very interesting. If you share the intermediate files (logs, output, reference file), I'd be happy to take a closer look at what is going on there.

@ls-milkyway
Author

ls-milkyway commented Mar 23, 2022

I will post the full procedure & files... just need some free time. Meanwhile, can you comment on this guide for training Vosk models? Is it worth trying? Kindly clone/fork & correct it (if some corrections are needed).

I am also working on remuxing & trimming original DVD films (with stereo audio) using SRT time codes with ffmpeg, so that models can be trained on some complex audio tied to character dialogue. This also makes SRT file correction easy: there is no need to process the full file for spelling/grammar corrections.
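
A rough sketch of that trimming step (Python driving ffmpeg; "film.mkv" / "film.srt" are placeholder names and the SRT parsing is simplified):

```python
# Sketch: cut one clip per subtitle cue so each training sample pairs a short
# stereo audio segment with its known dialogue line. Assumes ffmpeg on PATH.
import re
import subprocess

TIME = re.compile(r"(\d+):(\d+):(\d+),(\d+) --> (\d+):(\d+):(\d+),(\d+)")

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

with open("film.srt", encoding="utf-8") as f:
    cues = TIME.findall(f.read())

for idx, cue in enumerate(cues):
    start, end = to_seconds(*cue[:4]), to_seconds(*cue[4:])
    # Stream copy keeps the original audio untouched, though cut points
    # may snap to packet boundaries rather than the exact timestamps.
    subprocess.run([
        "ffmpeg", "-ss", str(start), "-to", str(end),
        "-i", "film.mkv", "-vn", "-acodec", "copy",
        f"clip_{idx:05d}.mka",
    ], check=True)
```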

@nshmyrev
Collaborator

can you comment on this guide for training Vosk models? Is it worth trying?

It is not complete; it's better to follow our training process instead: https://github.com/alphacep/vosk-api/tree/master/training

@ls-milkyway
Author

ls-milkyway commented Mar 24, 2022

It is not complete; it's better to follow our training process instead: https://github.com/alphacep/vosk-api/tree/master/training

That badly needs a good description: how to use it with prepared audio and available text (the original transcription). I am not much into training or the deeper workings of AI; all I can see is that it is shell-based and meant to be used on Linux. Kindly provide at least a brief outline, if not full steps?

---------------------------Below is the procedure & files for VOSK Vs STT-------------------------------------------
Models:

Vosk Model: en-us-0.22
STT Model: huge vocab

An original DVD was re-muxed (without re-encoding) using the stereo audio, and the subtitle file was downloaded from OpenSubtitles.

Some necessary steps carried out for ideal transcription were:

  1. The original audio was passed, without re-encoding, to Spleeter, which converts it into a WAV file; this may be the reason for the low Spleeter scores. The WAV file was then merged back into the video, so 2 files were created: one normal, the other with Spleeter (with the minimum encoding/transformation possible). A rough equivalent is sketched below.
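
For anyone reproducing this, the Spleeter pass and the merge back look roughly like the following (a sketch based on Spleeter's documented Python API; "audio.wav" and "film.mkv" are placeholder names, and container details may vary):

```python
# Sketch: isolate vocals with Spleeter, then mux the vocals back into the
# video with both streams copied untouched. Note Spleeter outputs WAV,
# which is the extra transformation suspected of hurting the Spleeter scores.
import subprocess
from spleeter.separator import Separator

separator = Separator("spleeter:2stems")  # 2stems = vocals + accompaniment
separator.separate_to_file("audio.wav", "separated/")
# -> separated/audio/vocals.wav and separated/audio/accompaniment.wav

subprocess.run([
    "ffmpeg", "-i", "film.mkv", "-i", "separated/audio/vocals.wav",
    "-map", "0:v", "-map", "1:a",     # video from the film, audio from Spleeter
    "-c:v", "copy", "-c:a", "copy",   # no re-encoding of either stream
    "film-spleeter.mkv",
], check=True)
```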

The original & transcribed files are:
a) Reservoir Dogs.1992.BluRay.720p.x264.YIFY.srt.
b) stt.txt (note: I did not generate an SRT file, as there was no clear documentation for it).
c) stt-spleeter.txt (transcribed with spleeter).
d) vosk.srt
e) vosk-spleeter.srt

Links (please note that the SRT files are renamed to .txt, as GitHub does not allow .srt uploads):

Reservoir Dogs.1992.BluRay.720p.x264.YIFY.txt
stt.txt
stt-spleeter.txt
vosk.txt
vosk-spleeter.txt

  1. The SRT file was converted to a text file using happyscribe, which just removes the time codes from the SRT; then the empty lines and the leading + trailing spaces of the sentences were removed using Notepad++. A scripted equivalent is sketched below.
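
A scripted equivalent of that conversion (a sketch; the regex assumes the standard SRT timecode layout, and base.txt is the cleaned output used later for scoring):

```python
# Sketch: drop SRT cue numbers, timecode lines, and blank lines, and trim
# leading/trailing whitespace, leaving one subtitle line per output line.
import re

TIMECODE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

kept = []
with open("Reservoir Dogs.1992.BluRay.720p.x264.YIFY.srt", encoding="utf-8") as f:
    for raw in f:
        line = raw.strip()
        if not line or line.isdigit() or TIMECODE.match(line):
            continue  # cue index, timecode, or empty line
        kept.append(line)

with open("base.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(kept))
```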

Since SRT files are prone to spelling & grammar mistakes (as they are mostly generated by OCR-based converters), the text was further corrected using a US spelling & grammar AI project. One can also use the DSpellCheck plugin in Notepad++ with a US dictionary installed, but the results are not as accurate as with the former.

Since SER (sentence error rate) is also an important parameter, I ran asr-evaluation on the original (base.txt) & vosk.txt (both with sentences intact), but the sentence error rate was 100% in both the STT & Vosk cases, and the WER scores were also lower. Hence I opted to process both files with neuspell or symspell; to be frank, I don't remember which one I used.

The final generated files contained all-lowercase words in a single sentence (as the SER was 100% anyway, sentence structure was irrelevant). An example of this step is sketched below.
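
As a concrete example, here is what that post-processing can look like with symspellpy (one of the two candidates mentioned above; the dictionary loading follows the symspellpy docs, and the file names are the ones from this thread, used illustratively):

```python
# Sketch: compound spelling correction with symspellpy, then normalising to
# lowercase words on a single line, matching the *-final.txt files scored above.
import pkg_resources
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

text = open("vosk.txt", encoding="utf-8").read()
# lookup_compound corrects word splits and merges as well as misspellings;
# for a long transcript it may be faster to correct line by line.
suggestions = sym_spell.lookup_compound(text, max_edit_distance=2)
corrected = suggestions[0].term if suggestions else text

with open("vosk-final.txt", "w", encoding="utf-8") as f:
    f.write(" ".join(corrected.lower().split()))
```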

The final processed files after spelling & grammar corrections are:
a) base-final.txt (Reservoir Dogs.1992.BluRay.720p.x264.YIFY.srt corrected & renamed base-final for asr-evaluation)
b) stt-final.txt
c) stt-spleeter-final.txt
d) vosk-final.txt
e) vosk-spleeter-final.txt

I have attached all of them.
base-final.txt
stt-final.txt
stt-spleeter-final.txt
vosk-final.txt
vosk-spleeter-final.txt

Your suggestions, ideas, or tips are always welcome.
