This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Cannot combine custom en-in AM with custom LM: ERROR: "dict.c", Phone 'AE' is mising in the acoustic model; word 'absolutely' ignored #47

Closed
MuruganR96 opened this issue Sep 22, 2018 · 3 comments

@MuruganR96

Problem

Sir, I am unable to combine a custom en-in acoustic model (the default Indian English acoustic model, adapted) with a custom language model.
I downloaded an acoustic model from this link:
I followed the instructions at this link: https://cmusphinx.github.io/wiki/tutorialam/

sphinx_fe -argfile en_in/feat.params -samprate 16000 -c audio.fileids -di . -do . -ei wav -eo mfc -mswav yes

pocketsphinx_mdef_convert -text en_in/mdef en_in/mdef.txt

cp -a /usr/local/libexec/sphinxtrain/bw .
cp -a /usr/local/libexec/sphinxtrain/mk_s2sendump .
cp -a /usr/local/libexec/sphinxtrain/map_adapt .
cp -a /usr/local/libexec/sphinxtrain/mllr_solve .

./bw \
 -hmmdir en_in \
 -moddeffn en_in/mdef.txt \
 -ts2cbfn .cont. \
 -feat 1s_c_d_dd \
 -cmn current \
 -agc none \
 -dictfn en_in.dic \
 -ctlfn audio.fileids \
 -lsnfn audio.transcription \
 -accumdir .

./mllr_solve \
 -meanfn en_in/means \
 -varfn en_in/variances \
 -outmllrfn mllr_matrix -accumdir .

cp -a en_in en_in_own

./map_adapt \
 -moddeffn en_in/mdef.txt \
 -ts2cbfn .cont. \
 -meanfn en_in/means \
 -varfn en_in/variances \
 -mixwfn en_in/mixture_weights \
 -tmatfn en_in/transition_matrices \
 -accumdir . \
 -mapmeanfn en_in_own/means \
 -mapvarfn en_in_own/variances \
 -mapmixwfn en_in_own/mixture_weights \
 -maptmatfn en_in_own/transition_matrices

./mk_s2sendump \
 -pocketsphinx yes \
 -moddeffn en_in_own/mdef.txt \
 -mixwfn en_in_own/mixture_weights \
 -sendumpfn en_in_own/sendump

pocketsphinx_continuous -hmm en_in_own -lm en-us.lm.bin -dict en_in.dic -infile 38.wav > 4.txt

This works, but it does not recognize certain words that are specific to the banking sector. So I built my own language model with the language model building tool (Building a simple language model using a web service).

own language model: lm.dict & lm.bin:
transcript file: own_vocab.txt

sphinx_lm_convert -i own.lm -o own.lm.bin
sphinx_lm_convert -i own.lm.bin -ifmt bin -o own.lm -ofmt arpa

pocketsphinx_continuous -inmic yes -lm own.lm.bin -dict own.dic

Sir, this works fine and detects those particular words. But I have one point of confusion:

Which default acoustic model does it use when I run the command "pocketsphinx_continuous -inmic yes -lm own.lm.bin -dict own.dic"?

But when I combine the adapted AM with the custom LM and run:

pocketsphinx_continuous -hmm en_in_own -lm own.lm.bin -dict own.dic -infile 1.wav > result_own.txt

It did not return any words, and it showed errors: phones used in the LM dictionary are not present in the AM.

INFO: dict.c(333): Reading main dictionary: lm_model_resources/other/own.dic
ERROR: "dict.c", line 195: Line 5: Phone 'EH' is mising in the acoustic model; word 's' ignored
ERROR: "dict.c", line 195: Line 6: Phone 'EH' is mising in the acoustic model; word 's' ignored
ERROR: "dict.c", line 195: Line 7: Phone 'EY' is mising in the acoustic model; word 'a' ignored
ERROR: "dict.c", line 195: Line 8: Phone 'EY' is mising in the acoustic model; word 'able' ignored
ERROR: "dict.c", line 195: Line 9: Phone 'AH' is mising in the acoustic model; word 'about' ignored
ERROR: "dict.c", line 195: Line 10: Phone 'AE' is mising in the acoustic model; word 'absolutely' ignored

I think I have identified the issue: the phones in own.dic (EH, EY, AH, AE) are also present in the en_in acoustic model (the Indian English mdef phones), but in lowercase (en_in/mdef file).
Other English mdef phone sets for comparison: wsj_all_cd30.mllt_cd_cont_4000, hub4_cd_continuous_8gau_1s_c_d_dd.
# Columns definitions
#base lft rt p attrib tmat ... state id's ...
SIL - - - filler 0 0 1 2 N
UNK - - - n/a 1 3 4 5 N
aa - - - n/a 2 6 7 8 N
ae - - - n/a 3 9 10 11 N
ah - - - n/a 4 12 13 14 N

I tried converting the phones in own.dic to lowercase, but that did not make the AM and the LM work together.

Basically, the LM tool produces words and phones in this form, and it does not match the acoustic model. The two are not in sync.
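
The mismatch described above can be checked directly rather than read off the error log. The following is a minimal sketch, assuming the acoustic model definition has been converted to text (as done earlier with pocketsphinx_mdef_convert -text) and that the dictionary has one "word phone phone ..." entry per line. The function names are hypothetical, and the sample lines are copied from the excerpts in this post. It reports which dictionary phones are missing with an exact match, and which are still missing when case is ignored, so it shows whether case alone explains the errors.

```python
def mdef_base_phones(mdef_lines):
    """Collect base phone names from a text-format mdef file."""
    phones = set()
    for line in mdef_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment/header line
        fields = line.split()
        # Phone rows look like: base lft rt p attrib tmat state-ids... N
        # Count rows such as "0.3" or "42 n_base" have fewer fields.
        if len(fields) >= 6 and fields[4] in ("filler", "n/a"):
            phones.add(fields[0])
    return phones


def dict_phones(dict_lines):
    """Collect every phone used in a pronunciation dictionary."""
    phones = set()
    for line in dict_lines:
        fields = line.split()
        if len(fields) >= 2:
            phones.update(fields[1:])  # fields[0] is the word itself
    return phones


def missing_phones(dict_lines, mdef_lines):
    """Return (missing with exact match, missing even when ignoring case)."""
    am = mdef_base_phones(mdef_lines)
    used = dict_phones(dict_lines)
    exact = sorted(used - am)
    am_lower = {p.lower() for p in am}
    ignoring_case = sorted(p for p in used if p.lower() not in am_lower)
    return exact, ignoring_case


# Sample data taken from the excerpts in this post (not the full files).
sample_mdef = [
    "#base lft rt p attrib tmat ... state id's ...",
    "SIL - - - filler 0 0 1 2 N",
    "aa - - - n/a 2 6 7 8 N",
    "ae - - - n/a 3 9 10 11 N",
    "ah - - - n/a 4 12 13 14 N",
]
sample_dict = [
    "ABLE EY B AH L",
    "ABSOLUTELY AE B S AH L UW T L IY",
]
exact, ignoring_case = missing_phones(sample_dict, sample_mdef)
print("missing (exact match):", exact)
print("missing (ignoring case):", ignoring_case)
```

With the real files, the two lists would be built from open("own.dic") and open("en_in/mdef.txt") instead of the sample lines. If the second list is not empty, lowercasing the dictionary cannot be enough on its own, because the two phone sets differ in more than case.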

I also tried another way to create own.lm.bin and own.dic.

Building the LM another way:
text2wfreq < own_vocab.txt | wfreq2vocab > own_vocab.tmp.vocab

text2idngram -vocab own_vocab.tmp.vocab -idngram own_vocab.idngram < own_vocab.txt

idngram2lm -vocab_type 0 -idngram own_vocab.idngram -vocab own_vocab.tmp.vocab -arpa own.lm

sphinx_lm_convert -i own.lm -o own.lm.bin

Building own.dic another way:
I followed these links: &

g2p-seq2seq --decode own_vocab.tmp.vocab --model_dir g2p-seq2seq/g2p-seq2seq-model-6.2-cmudict-nostress --output own.dic
pocketsphinx_continuous -lm own.lm.bin -dict own.dic -infile 10.wav > 10.txt

This works fine and recognizes those particular words, but again my confusion is:

Which acoustic model is used when I run the command "pocketsphinx_continuous -lm own.lm.bin -dict own.dic -infile 10.wav > 10.txt"?

But when I combine the adapted AM with the custom LM and run:
pocketsphinx_continuous -hmm en_in_own -lm own.lm.bin -dict own.dic -infile 10.wav > 10.txt

It returned the same error again and did not output any text. The error log is:

INFO: dict.c(333): Reading main dictionary: lm_model_resources/other/own.dic
ERROR: "dict.c", line 195: Line 5: Phone 'EH' is mising in the acoustic model; word 's' ignored
ERROR: "dict.c", line 195: Line 6: Phone 'EH' is mising in the acoustic model; word 's' ignored
ERROR: "dict.c", line 195: Line 7: Phone 'EY' is mising in the acoustic model; word 'a' ignored
ERROR: "dict.c", line 195: Line 8: Phone 'EY' is mising in the acoustic model; word 'able' ignored
ERROR: "dict.c", line 195: Line 9: Phone 'AH' is mising in the acoustic model; word 'about' ignored
ERROR: "dict.c", line 195: Line 10: Phone 'AE' is mising in the acoustic model; word 'absolutely' ignored

Dictionary (word-phone) format produced by the LM tool:
A AH
A(2) EY
ABLE EY B AH L
ABOUT AH B AW T
ABSOLUTELY AE B S AH L UW T L IY

Dictionary (word-phone) format produced by g2p-seq2seq:
s EH S
s EH S
a EY
able EY B AH L
about AH B AW T
absolutely AE B S AH L UW T L IY

en_in_own mdef phones structure:
ia f aa s n/a 20 2023 2038 2063 N
ia f ae e n/a 20 2023 2038 2063 N
ia f ae s n/a 20 2023 2038 2063 N
ia f ah e n/a 20 2023 2038 2063 N
ia f ah s n/a 20 2023 2038 2063 N
ia f ao e n/a 20 2023 2038 2063 N
ia f ao s n/a 20 2023 2038 2063 N
ia f aw e n/a 20 2023 2038 2063 N

Is the lowercase really the issue or not? I was not able to work it out.

Sir, how can I fix this issue?
I am unable to combine the custom en-in AM with the custom LM. ERROR: "dict.c", Phone 'AE' is mising in the acoustic model; word 'absolutely' ignored
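
One way to test the lowercase question in isolation is to lowercase only the phone column of the dictionary and leave the words untouched. This is a rough sketch under the assumption that each dictionary line is "word phone phone ..."; the function name is hypothetical. It only helps if the dictionary and the acoustic model use the same phone set apart from case, which may not hold here, since the en_in_own mdef excerpt above appears to contain a base phone "ia" that the dictionary entries shown above never use.

```python
def lowercase_phones(dict_lines):
    """Lowercase the phones of each entry; keep the word as written."""
    converted = []
    for line in dict_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        word, phones = fields[0], fields[1:]
        converted.append(" ".join([word] + [p.lower() for p in phones]))
    return converted


# Entries copied from the dictionary excerpts in this post.
for entry in lowercase_phones(["A(2) EY", "ABLE EY B AH L"]):
    print(entry)
```

The returned lines can be written to a new file and passed to -dict in place of own.dic.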

  • OS: Linux with version 16.04
  • Python3:
  • Sphinx version:
    PocketSphinx 5prealpha
@nshmyrev

You have to use the Indian English phonetic dictionary with this model, and train an Indian English g2p model with seq2seq.
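
As a rough illustration of the first half of this advice, the sketch below builds a small dictionary for the custom LM by looking its vocabulary up in the Indian English dictionary (en_in.dic in this thread), instead of generating pronunciations with a CMUdict-based tool. It assumes one "word phone phone ..." entry per line, with alternative pronunciations marked as word(2), word(3), and so on. The function name is hypothetical, and the reference entries in the example are invented placeholders, not real en_in.dic content. Words that are not found are returned separately; those are the ones that would need an Indian English g2p model.

```python
import re

# Matches an alternative-pronunciation marker such as "(2)" at the end of a word.
VARIANT = re.compile(r"\(\d+\)$")


def build_dictionary(vocab_words, reference_dict_lines):
    """Select reference entries for the vocabulary; also return missing words."""
    wanted = {w.lower() for w in vocab_words}
    selected = []
    found = set()
    for line in reference_dict_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        base_word = VARIANT.sub("", fields[0]).lower()
        if base_word in wanted:
            selected.append(" ".join(fields))
            found.add(base_word)
    out_of_vocabulary = sorted(wanted - found)
    return selected, out_of_vocabulary


# Placeholder reference entries (invented for this example).
reference = [
    "able ey b ah l",
    "about ah b aw t",
    "about(2) ah b aw t",
]
entries, missing = build_dictionary(["ABOUT", "able", "neft"], reference)
print(entries)
print("needs g2p:", missing)
```

The selected entries keep the spelling and phones of the reference dictionary, so the phones match the acoustic model by construction.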

@MuruganR96

@nshmyrev Thank you, sir. I will train, build, and run a model, and then I will post an update.

@MuruganR96

MuruganR96 commented Sep 22, 2018

Sir, I did not understand what this means:

"train Indian English g2p model with seq2seq"

We have our en_in.dic (the predefined Indian English phonetic dictionary) and the custom acoustic model (en_in_own).
And then we have a g2p model:

```
wget -O g2p-seq2seq-cmudict.tar.gz https://sourceforge.net/projects/cmusphinx/files/G2P%20Models/g2p-seq2seq-model-6.2-cmudict-nostress.tar.gz/download
tar xf g2p-seq2seq-cmudict.tar.gz
```

and the available downloads are:

G2P Models:
g2p-seq2seq-model-6.2-cmudict-nostress.tar.gz
g2p-seq2seq-model-6.2-pronasyl.tar.gz
g2p-seq2seq-model-5.2-cmudict.tar.gz
phonetisaurus-cmudict-split.tar.gz

fst:
it.tar.gz (Italian)
en_us_nostress.tar.gz (English)
zh.tar.gz (Mandarin)
ru.tar.gz (Russian)
nl.tar.gz (Dutch)
fr.tar.gz (French)
es.tar.gz (Spanish)
es_mx.tar.gz (Mexican Spanish)
de.tar.gz (German)

These are the seq2seq g2p models. None of them is specifically for Indian English, so what is meant by an Indian English g2p model with seq2seq?

How can I train this Indian English g2p model? I am really confused, sir.
We take the word list, a text file with one word per line (own_vocab.tmp.vocab),
and run the command below:
g2p-seq2seq --decode own_vocab.tmp.vocab --model_dir g2p-seq2seq-model-6.2-cmudict-nostress --output own.dic
This produces own.dic. That is all I know, sir.
How can I train an Indian English g2p model with seq2seq? Sir, can you explain?
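
One possible reading of the advice, sketched here with several assumptions, is that the Indian English dictionary itself serves as the training data for a new g2p model, in place of the CMUdict-based models listed above. The sketch below only prepares such a training file: it drops comment lines, removes alternative-pronunciation markers such as (2), and removes exact duplicates. Whether g2p-seq2seq needs the markers removed is an assumption, and the function name is hypothetical.

```python
import re

# Matches an alternative-pronunciation marker such as "(2)" at the end of a word.
VARIANT = re.compile(r"\(\d+\)$")


def g2p_training_lines(dict_lines):
    """Turn dictionary entries into plain "word phone phone ..." training lines."""
    seen = set()
    cleaned = []
    for line in dict_lines:
        fields = line.split()
        if len(fields) < 2 or fields[0].startswith(("#", ";;;")):
            continue  # skip blank, malformed, or comment lines
        word = VARIANT.sub("", fields[0])
        entry = " ".join([word] + fields[1:])
        if entry not in seen:  # drop exact duplicates, keep first occurrence
            seen.add(entry)
            cleaned.append(entry)
    return cleaned


# Entries in the style of the dictionary excerpts in this post.
print(g2p_training_lines(["a ah", "a(2) ey", "a ah"]))

# Assumed next step, to be checked against the g2p-seq2seq README:
#   g2p-seq2seq --train <prepared file> --model_dir <new model directory>
# The --train flag is quoted from memory; --model_dir is the same flag used
# with --decode above.
```

The resulting model directory would then be passed to --model_dir in the --decode command above, so that the generated phones come from the same phone set as the acoustic model.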
@nshmyrev Thank you so much, sir.
