Added script to compute phoneme labels and timestamps #528

Open. rutujaubale wants to merge 6 commits into base: master.

@rutujaubale rutujaubale commented May 4, 2021

This adds the ability to generate phone labels and timestamps in the Vosk recognizer output.

  • Updated model.cc to read phone symbol table (i.e. phones.txt)
  • The phone table should be placed at "/graph/phones.txt" in your model directory, following the standard Kaldi convention
  • Updated Kaldi recognizer script to compute phoneme labels and timestamps and add them to the json output
  • Adds phone label, start and end timestamps in the word-level results only if you provide the phone symbol table. If you do not provide the phone symbol table then the recognizer will only generate the existing word-level features.
  • Prints silence words along with the corresponding phone information. "Gaps" (silences with a duration of 0 seconds that have no corresponding phone information) are filtered out.
  • MBR decoding is disabled only when phone information is extracted, so that the word and phone outputs align. If you don't need phone output, you still get word-level results from MBR decoding.
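
For readers unfamiliar with the file: a Kaldi phone symbol table is plain text with one `symbol id` pair per line. The sketch below is not part of this PR. It only illustrates, in Python, how one might check that a model directory contains `graph/phones.txt` and read it. The helper name `load_phone_table` and the sample symbols are assumptions made for illustration.

```python
import os
import tempfile


def load_phone_table(model_dir):
    """Return {phone_id: symbol} from <model_dir>/graph/phones.txt, or None if absent.

    Hypothetical helper, not part of Vosk: it mirrors the PR's behaviour of
    treating the phone table as optional.
    """
    path = os.path.join(model_dir, "graph", "phones.txt")
    if not os.path.isfile(path):
        return None
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            symbol, phone_id = parts
            table[int(phone_id)] = symbol
    return table


# Demo on a throwaway directory with a tiny, made-up symbol table.
with tempfile.TemporaryDirectory() as model_dir:
    missing = load_phone_table(model_dir)  # no graph/phones.txt yet
    os.makedirs(os.path.join(model_dir, "graph"))
    table_path = os.path.join(model_dir, "graph", "phones.txt")
    with open(table_path, "w", encoding="utf-8") as f:
        f.write("<eps> 0\nSIL 1\nDH_B 2\nAH1_E 3\n")
    phones = load_phone_table(model_dir)

print(missing)
print(phones)
```

Returning `None` when the file is missing lets a caller fall back to word-level output only, which matches the optional behaviour described in the list above.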

The output looks like this:

{
  "result" : [{
      "conf" : 0.997802,
      "end" : 0.450000,
      "phone_end" : [0.450000],
      "phone_label" : ["SIL"],
      "phone_start" : [0.000000],
      "start" : 0.000000,
      "word" : "<eps>"
    }, {
      "conf" : 0.997153,
      "end" : 0.600000,
      "phone_end" : [0.540000, 0.600000],
      "phone_label" : ["DH_B", "AH1_E"],
      "phone_start" : [0.450000, 0.540000],
      "start" : 0.450000,
      "word" : "THE"
    }, {
      "conf" : 0.553237,
      "end" : 1.200000,
      "phone_end" : [0.720000, 0.810000, 0.870000, 0.930000, 0.990000, 1.080000, 1.110000, 1.200000],
      "phone_label" : ["S_B", "T_I", "UW1_I", "D_I", "AH0_I", "N_I", "T_I", "S_E"],
      "phone_start" : [0.600000, 0.720000, 0.810000, 0.870000, 0.930000, 0.990000, 1.080000, 1.110000],
      "start" : 0.600000,
      "word" : "STUDENT'S"
    }, {
      "conf" : 0.922575,
      "end" : 1.260000,
      "phone_end" : [1.260130],
      "phone_label" : ["SIL"],
      "phone_start" : [1.200130],
      "start" : 1.200130,
      "word" : "<eps>"
    }, {
      "conf" : 1.000000,
      "end" : 1.800000,
      "phone_end" : [1.440000, 1.500000, 1.590000, 1.680000, 1.800000],
      "phone_label" : ["S_B", "T_I", "AH1_I", "D_I", "IY0_E"],
      "phone_start" : [1.260000, 1.440000, 1.500000, 1.590000, 1.680000],
      "start" : 1.260000,
      "word" : "STUDY"
    }, {
      "conf" : 1.000000,
      "end" : 1.860000,
      "phone_end" : [1.860000],
      "phone_label" : ["AH0_S"],
      "phone_start" : [1.800000],
      "start" : 1.800000,
      "word" : "A"
    }, {
      "conf" : 1.000000,
      "end" : 2.190000,
      "phone_end" : [1.980000, 2.100000, 2.190000],
      "phone_label" : ["L_B", "AA1_I", "T_E"],
      "phone_start" : [1.860000, 1.980000, 2.100000],
      "start" : 1.860000,
      "word" : "LOT"
    }, {
      "conf" : 1.000000,
      "end" : 2.880000,
      "phone_end" : [2.880000],
      "phone_label" : ["SIL"],
      "phone_start" : [2.190000],
      "start" : 2.190000,
      "word" : "<eps>"
    }],
  "text" : " THE STUDENT'S STUDY A LOT"
}
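
As a rough illustration of how a client might consume this output, the Python sketch below flattens the parallel `phone_label`, `phone_start` and `phone_end` arrays into one tuple per phone. It uses two entries copied from the sample above. The function name `phone_segments` and the `skip_silence` flag are assumptions for this example, not part of the PR.

```python
import json

# Two entries copied from the sample output, wrapped in an outer JSON object.
sample = """
{
  "result": [
    {"conf": 0.997802, "end": 0.45, "phone_end": [0.45], "phone_label": ["SIL"],
     "phone_start": [0.0], "start": 0.0, "word": "<eps>"},
    {"conf": 0.997153, "end": 0.6, "phone_end": [0.54, 0.6],
     "phone_label": ["DH_B", "AH1_E"], "phone_start": [0.45, 0.54],
     "start": 0.45, "word": "THE"}
  ],
  "text": " THE"
}
"""


def phone_segments(result_json, skip_silence=True):
    """Flatten word entries into (word, phone, start, end) tuples."""
    segments = []
    for entry in json.loads(result_json).get("result", []):
        if skip_silence and entry["word"] == "<eps>":
            continue
        # Phone fields are only present when the model ships a phone table,
        # so fall back to empty lists.
        labels = entry.get("phone_label", [])
        starts = entry.get("phone_start", [])
        ends = entry.get("phone_end", [])
        for label, start, end in zip(labels, starts, ends):
            segments.append((entry["word"], label, start, end))
    return segments


segments = phone_segments(sample)
for seg in segments:
    print(seg)
```

Because the phone fields are read with `.get`, the same function returns an empty list for results produced without a phone symbol table.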

@nshmyrev
Collaborator

Thank you, I'll try to merge coming week.

@rutujaubale
Author

Thanks @nshmyrev! As an update, our latest commit implements a separate function that computes word and phone results together, for cases where you want phone-level results aligned with word results (with confidences). The result type can now be configured through SetResultOptions(), which takes words or phones as input. The C example under c/test_phone_results.c shows how to set this option. Hope this looks much better!

@j-j-kam

j-j-kam commented Jun 8, 2021

Dear @nshmyrev , this functionality by @rutujaubale is fantastic!

I believe that many people would hugely appreciate it if you merged it any time soon just as you have previously planned. Thank you very much for your efforts.

@speechcon

speechcon commented Nov 11, 2021

I found this pull request while looking into the same feature. I hope it can be merged soon, while it still has no conflicts with the base branch.

@entenbein

> I found this pull request while looking into the same feature. I hope it can be merged soon, while it still has no conflicts with the base branch.

That would be greatly appreciated!

@zhenxili96

Hi @rutujaubale, I'm trying to rebuild Vosk with your modification, but I get an error when building it.

It seems that the function you are using, CompactLatticeToWordProns, is currently defined neither in the Kaldi fork that Vosk uses (https://github.com/alphacep/kaldi/blob/master/src/lat/lattice-functions.cc) nor in the official Kaldi (https://github.com/kaldi-asr/kaldi/blob/master/src/lat/lattice-functions.cc). The Kaldi docs still describe this function, though (https://kaldi-asr.org/doc/namespacekaldi.html#a8a2110207264ab1d31c2b04150541834).

Could you please let me know which version of Kaldi you are using, or how to build Vosk with your modification?

sh-4.1# KALDI_ROOT=/opt/kaldi make
g++ -g -O3 -std=c++17 -Wno-deprecated-declarations -fPIC -DFST_NO_DYNAMIC_LINKING -I. -I/opt/kaldi/src -I/opt/kaldi/tools/openfst/include   -I/opt/kaldi/tools/OpenBLAS/install/include -c -o kaldi_recognizer.o kaldi_recognizer.cc
kaldi_recognizer.cc: In function ‘void ComputePhoneInfo(const kaldi::TransitionModel&, const CompactLattice&, const fst::SymbolTable&, const fst::SymbolTable&, std::vector<std::vector<std::basic_string<char> > >*, std::vector<std::vector<int> >*)’:
kaldi_recognizer.cc:425:12: error: ‘CompactLatticeToWordProns’ is not a member of ‘kaldi’
     kaldi::CompactLatticeToWordProns(tmodel, best_path, &words_ph_ids, &times_lat, &lengths,&prons, phone_lengths);
            ^~~~~~~~~~~~~~~~~~~~~~~~~
kaldi_recognizer.cc:425:12: note: suggested alternative: ‘CompactLatticeToWordAlignment’
     kaldi::CompactLatticeToWordProns(tmodel, best_path, &words_ph_ids, &times_lat, &lengths,&prons, phone_lengths);
            ^~~~~~~~~~~~~~~~~~~~~~~~~
            CompactLatticeToWordAlignment
make: *** [kaldi_recognizer.o] Error 1

@mmende
Contributor

mmende commented Dec 7, 2021

@zhenxili96 I had to include lat/lattice-functions-transition-model.h in kaldi_recognizer.h to get it working like so:

#include "lat/lattice-functions-transition-model.h"
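
When a build fails with "is not a member of" like this, one quick way to see which header mentions the symbol in a given Kaldi checkout is to search the source tree. The Python sketch below does that with the standard library only. The function name `headers_mentioning` is hypothetical, and the demo runs on a fake directory whose header contents are placeholders, not real Kaldi code.

```python
import os
import tempfile


def headers_mentioning(root, symbol):
    """Return sorted relative paths of .h files under root whose text contains symbol."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".h"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                if symbol in f.read():
                    hits.append(os.path.relpath(path, root))
    return sorted(hits)


# Demo on a fake source tree; file names and contents are placeholders.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "lat"))
    with open(os.path.join(root, "lat", "a.h"), "w", encoding="utf-8") as f:
        f.write("void SomethingElse();\n")
    with open(os.path.join(root, "lat", "b.h"), "w", encoding="utf-8") as f:
        f.write("bool CompactLatticeToWordProns();\n")
    found = headers_mentioning(root, "CompactLatticeToWordProns")

print(found)
```

Against a real checkout one would pass the Kaldi source directory instead, for example `headers_mentioning("/opt/kaldi/src", "CompactLatticeToWordProns")` for the `KALDI_ROOT` used in the build log above.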

@zhenxili96

Thanks @mmende, it really helps.

@x3a1n4

x3a1n4 commented Jul 2, 2022

Hello, I also found this pull request while looking for the same feature. I figured I would leave a comment (given it has been a while, and no official comment appears to have been made) since it would be extremely useful to have this sort of functionality!

@Nathravorn

Hey @nshmyrev, I'd love to see this merged too! Is there anything I could help with to get it through? I've fixed the merge conflicts and tested on my branch here: https://github.com/Nathravorn/vosk-api

I also fixed a few issues with this PR's code (most notably, the result_opts_ setting was not being taken into account).

@ChenFangDart

Is there any way to get both phonemes and words at the same time for Spanish? I checked the two available Spanish models, and neither of them has a phones.txt.

Thanks and appreciate your help!

@erikh2000

I would also love to see this merged. I've written automatic lip synch animation software based on Vosk using word timings. The algorithm makes guesses about the timings of phonemes. It works really great, which is a testament to Vosk. But the lip synching would be much better if Vosk could return the phoneme timings.
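
To make the kind of guess described here concrete: with only word timings available, the simplest estimate divides each word's duration evenly among its expected phonemes. The Python sketch below shows that baseline. It is an assumption made for illustration, not the algorithm used in the software mentioned above, and the function name `guess_phone_times` is hypothetical.

```python
def guess_phone_times(start, end, phones):
    """Split [start, end] evenly across phones; returns (phone, start, end) tuples."""
    if not phones:
        return []
    step = (end - start) / len(phones)
    return [
        (phone, start + i * step, start + (i + 1) * step)
        for i, phone in enumerate(phones)
    ]


# Word timing and phones for "LOT", taken from the sample output in this PR.
guessed = guess_phone_times(1.86, 2.19, ["L", "AA1", "T"])
for phone, s, e in guessed:
    print(f"{phone}: {s:.2f}-{e:.2f}")
```

For this word the even split puts the internal boundaries near 1.97 and 2.08 seconds, while the PR's sample output reports 1.98 and 2.10, which shows the sort of drift that real phone timestamps would remove.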

@madhephaestus

> I would also love to see this merged. I've written automatic lip synch animation software based on Vosk using word timings. The algorithm makes guesses about the timings of phonemes. It works really great, which is a testament to Vosk. But the lip synching would be much better if Vosk could return the phoneme timings.

Hey, I am looking for this exact thing! Is it possible that this is open source? I am trying to add lip syncing to TTS by listening to the audio stream and parsing out the phonemes. There is a project, https://github.com/DanielSWolf/rhubarb-lip-sync, that parses the audio into visemes. I would love to be able to do that live. If Vosk were able to merge this feature, I would be able to get it working with an engine I'm already using.
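
As a sketch of how phone output in the format proposed by this PR could feed a viseme track, the Python below strips the word-position suffix and stress digit from each phone label, looks up a mouth shape, and merges adjacent segments with the same shape. The mapping table and its class names are invented for this example. They are not the viseme set of rhubarb-lip-sync or of any other tool.

```python
import re

# Illustrative phone-to-mouth-shape table; class names are made up for this
# sketch and do not come from any lip-sync tool.
VISEMES = {
    "SIL": "rest",
    "DH": "tongue",
    "AH": "open",
    "AA": "open",
    "L": "tongue",
    "T": "closed_teeth",
    "S": "closed_teeth",
}


def base_phone(label):
    """Strip the word-position suffix (_B, _I, _E, _S) and the stress digit."""
    label = re.sub(r"_[BIES]$", "", label)
    return re.sub(r"\d$", "", label)


def to_visemes(labels, starts, ends):
    """Map parallel phone arrays to (viseme, start, end), merging equal neighbours."""
    out = []
    for label, start, end in zip(labels, starts, ends):
        viseme = VISEMES.get(base_phone(label), "unknown")
        if out and out[-1][0] == viseme and out[-1][2] == start:
            out[-1] = (viseme, out[-1][1], end)  # extend the previous segment
        else:
            out.append((viseme, start, end))
    return out


# Phone arrays for "LOT", taken from the sample output in this PR.
visemes = to_visemes(["L_B", "AA1_I", "T_E"], [1.86, 1.98, 2.10], [1.98, 2.10, 2.19])
print(visemes)
```

Merging neighbours keeps the animation from retriggering the same mouth shape on every phone boundary. Phones missing from the table fall through to `"unknown"` so gaps in the mapping stay visible.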

@madhephaestus

> Hey @nshmyrev, I'd love to see this merged too! Is there anything I could help with to get it through? I've fixed the merge conflicts and tested on my branch here: https://github.com/Nathravorn/vosk-api
>
> I also fixed a few issues with this PR's code (most notably, the result_opts_ setting was not being taken into account).

Maybe just go ahead and open your own PR if you have a mergeable version of this code? I would love to see this feature released and to use it!

@erikh2000

erikh2000 commented Jun 1, 2023 via email
