Added script to compute phoneme labels and timestamps #528

rutujaubale · 2021-05-04T22:15:52Z

This PR adds the ability to generate phone labels and timestamps in the Vosk recognizer output.

Updated model.cc to read the phone symbol table (i.e. phones.txt).
The phone table should be placed under "/graph/phones.txt" in your model directory, following the standard Kaldi convention.
Updated the Kaldi recognizer to compute phoneme labels and timestamps and add them to the JSON output.
Phone labels and their start and end timestamps are added to the word-level results only if you provide the phone symbol table. Without the phone symbol table, the recognizer generates only the existing word-level fields.
Silence words are printed along with the corresponding phone information. "Gaps", i.e. silences with a duration of 0 seconds that have no corresponding phone information, are filtered out.
MBR decoding is disabled only when phone information is extracted, so that the outputs align. If you don't need phone output, you still get word-level results from MBR.
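
To check whether a model directory can produce phone output, one option is to look for the phone symbol table before creating the recognizer. The sketch below assumes the graph/phones.txt location described above and the usual Kaldi symbol-table layout (one "symbol id" pair per line). The function name load_phone_table and the demo directory are illustrative only and are not part of Vosk or of this PR.

```python
import os
import tempfile


def load_phone_table(model_dir):
    """Return {id: symbol} read from <model_dir>/graph/phones.txt, or None if absent."""
    path = os.path.join(model_dir, "graph", "phones.txt")
    if not os.path.isfile(path):
        return None  # no phone table: only word-level output can be expected
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            symbol, idx = parts
            table[int(idx)] = symbol
    return table


# Demo on a throwaway directory standing in for a real model (hypothetical contents).
with tempfile.TemporaryDirectory() as model_dir:
    print(load_phone_table(model_dir))  # no graph/phones.txt yet
    os.makedirs(os.path.join(model_dir, "graph"))
    with open(os.path.join(model_dir, "graph", "phones.txt"), "w", encoding="utf-8") as f:
        f.write("<eps> 0\nSIL 1\nDH_B 2\nAH1_E 3\n")
    demo_table = load_phone_table(model_dir)
    print(demo_table[2])
```

A None return here corresponds to the case where the recognizer falls back to word-level results only.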

The output looks like this:

{
  "result" : [{
      "conf" : 0.997802,
      "end" : 0.450000,
      "phone_end" : [0.450000],
      "phone_label" : ["SIL"],
      "phone_start" : [0.000000],
      "start" : 0.000000,
      "word" : "<eps>"
    }, {
      "conf" : 0.997153,
      "end" : 0.600000,
      "phone_end" : [0.540000, 0.600000],
      "phone_label" : ["DH_B", "AH1_E"],
      "phone_start" : [0.450000, 0.540000],
      "start" : 0.450000,
      "word" : "THE"
    }, {
      "conf" : 0.553237,
      "end" : 1.200000,
      "phone_end" : [0.720000, 0.810000, 0.870000, 0.930000, 0.990000, 1.080000, 1.110000, 1.200000],
      "phone_label" : ["S_B", "T_I", "UW1_I", "D_I", "AH0_I", "N_I", "T_I", "S_E"],
      "phone_start" : [0.600000, 0.720000, 0.810000, 0.870000, 0.930000, 0.990000, 1.080000, 1.110000],
      "start" : 0.600000,
      "word" : "STUDENT'S"
    }, {
      "conf" : 0.922575,
      "end" : 1.260000,
      "phone_end" : [1.260130],
      "phone_label" : ["SIL"],
      "phone_start" : [1.200130],
      "start" : 1.200130,
      "word" : "<eps>"
    }, {
      "conf" : 1.000000,
      "end" : 1.800000,
      "phone_end" : [1.440000, 1.500000, 1.590000, 1.680000, 1.800000],
      "phone_label" : ["S_B", "T_I", "AH1_I", "D_I", "IY0_E"],
      "phone_start" : [1.260000, 1.440000, 1.500000, 1.590000, 1.680000],
      "start" : 1.260000,
      "word" : "STUDY"
    }, {
      "conf" : 1.000000,
      "end" : 1.860000,
      "phone_end" : [1.860000],
      "phone_label" : ["AH0_S"],
      "phone_start" : [1.800000],
      "start" : 1.800000,
      "word" : "A"
    }, {
      "conf" : 1.000000,
      "end" : 2.190000,
      "phone_end" : [1.980000, 2.100000, 2.190000],
      "phone_label" : ["L_B", "AA1_I", "T_E"],
      "phone_start" : [1.860000, 1.980000, 2.100000],
      "start" : 1.860000,
      "word" : "LOT"
    }, {
      "conf" : 1.000000,
      "end" : 2.880000,
      "phone_end" : [2.880000],
      "phone_label" : ["SIL"],
      "phone_start" : [2.190000],
      "start" : 2.190000,
      "word" : "<eps>"
    }],
  "text" : " THE STUDENT'S STUDY A LOT"
}
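
As a rough illustration of how a client might consume this output, the sketch below flattens the per-word phone arrays into a list of (word, phone, start, end) tuples. It assumes the field names shown above (phone_label, phone_start, phone_end) and uses a shortened copy of the sample as input. The helper names are made up for this example. Stripping the _B/_I/_E/_S suffixes assumes the model uses Kaldi's word-position-dependent phones, which the labels above suggest.

```python
import json

# Shortened excerpt of the sample output above (first two entries only).
sample = json.loads("""
{
  "result": [
    {"conf": 0.997802, "start": 0.0, "end": 0.45, "word": "<eps>",
     "phone_label": ["SIL"], "phone_start": [0.0], "phone_end": [0.45]},
    {"conf": 0.997153, "start": 0.45, "end": 0.6, "word": "THE",
     "phone_label": ["DH_B", "AH1_E"],
     "phone_start": [0.45, 0.54], "phone_end": [0.54, 0.6]}
  ],
  "text": " THE"
}
""")

# Kaldi word-position markers: begin, internal, end, singleton.
POSITION_SUFFIXES = ("_B", "_I", "_E", "_S")


def base_phone(label):
    """Drop the word-position suffix, e.g. 'DH_B' -> 'DH'."""
    if label.endswith(POSITION_SUFFIXES):
        return label[:-2]
    return label


def phone_segments(result, skip_silence=True):
    """Flatten word entries into (word, phone, start, end) tuples."""
    segments = []
    for entry in result.get("result", []):
        labels = entry.get("phone_label")
        if labels is None:
            continue  # word-level entry without phone information
        if skip_silence and entry["word"] == "<eps>":
            continue
        for label, start, end in zip(labels, entry["phone_start"], entry["phone_end"]):
            segments.append((entry["word"], base_phone(label), start, end))
    return segments


segments = phone_segments(sample)
for word, phone, start, end in segments:
    print(f"{word:<6} {phone:<4} {start:.2f}-{end:.2f}")
```

Entries without a phone_label key are skipped, so the same loop also tolerates results produced without a phone symbol table.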

nshmyrev · 2021-05-14T19:29:45Z

Thank you, I'll try to merge coming week.

rutujaubale · 2021-05-26T00:45:13Z

Thanks @nshmyrev! As an update, our latest commit implements a separate function that computes word and phone results, for cases where you want phone-level results aligned with the word results (with confidences). The result option can now be configured using SetResultOptions(), with words or phones as input. The C program c/test_phone_results.c provides an example of how to set this option. Hope this looks much better!

j-j-kam · 2021-06-08T15:52:47Z

Dear @nshmyrev , this functionality by @rutujaubale is fantastic!

I believe that many people would hugely appreciate it if you merged it any time soon just as you have previously planned. Thank you very much for your efforts.


speechcon · 2021-11-11T16:54:40Z

I found this pull request when I was looking into the same feature. I hope it can be merged soon, while it still has no conflicts with the base branch.

entenbein · 2021-12-07T10:49:23Z

I found this pull request when I was looking into the same feature. I hope it can be merged soon, while it still has no conflicts with the base branch.

That would be greatly appreciated!

zhenxili96 · 2021-12-07T13:45:25Z

Hi @rutujaubale, I'm trying to rebuild Vosk with your modification, but the build fails with an error.

It seems that the function you are using, CompactLatticeToWordProns, is defined neither in the Kaldi fork that Vosk uses (https://github.com/alphacep/kaldi/blob/master/src/lat/lattice-functions.cc) nor in official Kaldi (https://github.com/kaldi-asr/kaldi/blob/master/src/lat/lattice-functions.cc) now. However, the Kaldi docs still describe this function (https://kaldi-asr.org/doc/namespacekaldi.html#a8a2110207264ab1d31c2b04150541834).

Could you please let me know which version of Kaldi you are using, or how to build Vosk with your modification?

sh-4.1# KALDI_ROOT=/opt/kaldi make
g++ -g -O3 -std=c++17 -Wno-deprecated-declarations -fPIC -DFST_NO_DYNAMIC_LINKING -I. -I/opt/kaldi/src -I/opt/kaldi/tools/openfst/include   -I/opt/kaldi/tools/OpenBLAS/install/include -c -o kaldi_recognizer.o kaldi_recognizer.cc
kaldi_recognizer.cc: In function ‘void ComputePhoneInfo(const kaldi::TransitionModel&, const CompactLattice&, const fst::SymbolTable&, const fst::SymbolTable&, std::vector<std::vector<std::basic_string<char> > >*, std::vector<std::vector<int> >*)’:
kaldi_recognizer.cc:425:12: error: ‘CompactLatticeToWordProns’ is not a member of ‘kaldi’
     kaldi::CompactLatticeToWordProns(tmodel, best_path, &words_ph_ids, &times_lat, &lengths,&prons, phone_lengths);
            ^~~~~~~~~~~~~~~~~~~~~~~~~
kaldi_recognizer.cc:425:12: note: suggested alternative: ‘CompactLatticeToWordAlignment’
     kaldi::CompactLatticeToWordProns(tmodel, best_path, &words_ph_ids, &times_lat, &lengths,&prons, phone_lengths);
            ^~~~~~~~~~~~~~~~~~~~~~~~~
            CompactLatticeToWordAlignment
make: *** [kaldi_recognizer.o] Error 1

mmende · 2021-12-07T13:56:00Z

@zhenxili96 I had to include lat/lattice-functions-transition-model.h in kaldi_recognizer.h to get it working like so:

#include "lat/lattice-functions-transition-model.h"

zhenxili96 · 2021-12-08T03:56:54Z

Thanks @mmende, it really helps.

x3a1n4 · 2022-07-02T02:36:08Z

Hello, I also found this pull request while looking for the same feature. I figured I would leave a comment (given it has been a while, and no official comment appears to have been made) since it would be extremely useful to have this sort of functionality!

Nathravorn · 2022-09-14T13:49:19Z

Hey @nshmyrev, I'd love to see this merged too! Is there anything I could help with to get it through? I've fixed the merge conflicts and tested on my branch here: https://github.com/Nathravorn/vosk-api

I also fixed a few issues with this PR's code (most notably, the result_opts_ setting was not being taken into account).

ChenFangDart · 2023-01-22T16:35:51Z

Is there any way to get both phonemes and words at the same time for Spanish? I checked the two available Spanish models, and neither of them has a phones.txt.

Thanks and appreciate your help!

erikh2000 · 2023-01-22T18:06:34Z

I would also love to see this merged. I've written automatic lip synch animation software based on Vosk using word timings. The algorithm makes guesses about the timings of phonemes. It works really great, which is a testament to Vosk. But the lip synching would be much better if Vosk could return the phoneme timings.
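
For readers wondering what "guessing" phoneme timings from word timings can look like, here is a deliberately naive sketch that splits a word's time span evenly across its phones. This is an assumption made for illustration and not the algorithm used in the software mentioned above. The function name is hypothetical, and the phone list is assumed to come from a pronunciation dictionary.

```python
def guess_phone_timings(word_start, word_end, phones):
    """Split a word's time span evenly across its phones (a crude guess)."""
    if not phones:
        return []
    step = (word_end - word_start) / len(phones)
    return [
        (phone, word_start + i * step, word_start + (i + 1) * step)
        for i, phone in enumerate(phones)
    ]


# Word "THE" from the sample output in the PR description: 0.45 s to 0.60 s.
guessed = guess_phone_timings(0.45, 0.60, ["DH", "AH1"])
for phone, start, end in guessed:
    print(f"{phone:<4} {start:.3f}-{end:.3f}")
```

For this word, the even split places the phone boundary at about 0.525 s, while the recognizer output in the PR description reports 0.54 s. Differences like this are why real phone timings would help.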

madhephaestus · 2023-06-01T00:38:37Z

I would also love to see this merged. I've written automatic lip synch animation software based on Vosk using word timings. The algorithm makes guesses about the timings of phonemes. It works really great, which is a testament to Vosk. But the lip synching would be much better if Vosk could return the phoneme timings.

Hey, I am looking for this exact thing! Is it possible that this is open source? I am trying to add lip syncing to TTS by listening to the audio stream and parsing out the phonemes. There is a project, https://github.com/DanielSWolf/rhubarb-lip-sync, that parses the audio into visemes. I would love to be able to do that live. If Vosk merged this feature, I would be able to get it working with an engine I'm already using.

madhephaestus · 2023-06-01T01:01:21Z

Hey @nshmyrev, I'd love to see this merged too! Is there anything I could help with to get it through? I've fixed the merge conflicts and tested on my branch here: https://github.com/Nathravorn/vosk-api

I also fixed a few issues with this PR's code (most notably, the result_opts_ setting was not being taken into account).

Maybe just go ahead and open your own PR if you have a mergeable version of this code? I would love to see this feature released and to use it!

erikh2000 · 2023-06-01T02:47:19Z

@kevin, Rhubarb looks cool. I will check it out more later.


nshmyrev mentioned this pull request Jun 30, 2021

A speech segmentation model with phone level #615

Closed

nshmyrev force-pushed the master branch from 924b447 to 2498bc5 Compare July 11, 2021 08:22

nshmyrev mentioned this pull request Sep 16, 2021

Is it possible to get the timing of phonemes, instead of full words? #687

Open

steveway mentioned this pull request Sep 27, 2021

Feature Request - Speech / Phonetics automatic generation/ alignment morevnaproject-org/papagayo-ng#49

Open

rutujaubale added 6 commits November 8, 2021 13:23

Added script to compute phoneme labels and timestamps

df7b238

Removed some debugging changes

4f1cbbb

Added handling for sizeable silences without phone outputs and option to disable MBR to generate phone outputs

fb0d25d

Added an example script of how to set the recognition result options to output phone results

edf0dca

Renamed vosk_recognizer_set_result_opts to vosk_recognizer_set_result_options for better consistency

7c0b55b

Updated Makefile to include compiling test_phone_results.c

9f438fe

rutujaubale force-pushed the phoneInfo branch from 9374f11 to 9f438fe Compare November 8, 2021 18:26

nshmyrev force-pushed the master branch from a490f35 to b090341 Compare January 21, 2022 12:22

madhephaestus mentioned this pull request Jun 1, 2023

Plans for Rhubarb Lip Sync 2 DanielSWolf/rhubarb-lip-sync#95

Open

madhephaestus mentioned this pull request Jun 1, 2023

Add Phoneme labels and timestamps - take two #1377

Open
