Detect speech pauses; Out of Memory Crash #14

Open
jonathanglasmeyer opened this issue Oct 25, 2014 · 5 comments

Comments

@jonathanglasmeyer

I'm not entirely sure if this is the best place to ask this kind of question, so please point me to a better place in case there is one.

We are currently using the Sphinx4 Long Aligner with some success for a subtitling project at the University of Hamburg.

Today was the first time that I tried it successfully "in the field".
I took the transcription and this video from the CCC Congress and aligned the 35-minute video (that is, the WAV file converted according to your instructions) in ~88 min with the Sphinx Long Aligner, which is pretty good, I think. (You can see the manually optimized results on the linked video page.)

Right now the biggest problem for this application is pauses in speech. In the output, words are always directly next to each other, even when there are long pauses between them. This means a lot of manual dragging around of the results. Long story short: is there an option to turn on speech pause detection?

Also, a second, smaller problem: when I try the Aligner with audio longer than 50 min, it fails with an Out of Memory error at the liveCMN stage after about 2 h (the Java VM has a 7G limit). Is there a way to change this?
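
To rule out a misconfigured limit, here is a minimal sketch that prints the heap limit the JVM actually runs with. It only uses the standard `Runtime` API; the class name `HeapCheck` and the helper `toMiB` are made up for illustration and are not part of Sphinx4.

```java
// Hypothetical helper class, not part of Sphinx4.
class HeapCheck {
    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the upper bound the JVM will try to use for the heap.
        System.out.println("max heap (MiB):   " + toMiB(rt.maxMemory()));
        // totalMemory() is what is currently reserved, freeMemory() the unused part of it.
        System.out.println("total heap (MiB): " + toMiB(rt.totalMemory()));
        System.out.println("free heap (MiB):  " + toMiB(rt.freeMemory()));
    }
}
```

Started with the same options as the aligner, for example `java -Xmx7g HeapCheck`, this should report a maximum close to the configured limit.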

Thanks for your help and your great work, which enables us to subtitle the CCC videos an order of magnitude faster.

@nshmyrev
Contributor

Hi Jonathan

Thanks for using CMUSphinx

Could you please elaborate more on this problem with pauses? I'm not sure I get it.

Also, please share the files that cause problems with the aligner.

Thank you.

@jonathanglasmeyer
Author

Hi,
Say the speaker makes a longer pause. Then this pause isn't represented in the timing information of the last word before the pause and the first word after the pause -- they are aligned as though they were directly next to each other.

An example where it failed with the same error on two PCs is this audio (https://transfer.sh/fd21Z/datamining.wav) with this transcription.

The Aligner runs for ~45 min and then hangs at the same position in the logging output (it just stands still for more than 60 min):

.
.
.

INFO: Skipping text range due to a high density [and]
Oct 25, 2014 8:55:49 PM edu.cmu.sphinx.api.SpeechAligner align
INFO: Aligning frame 0:15580 to text [id, like, to, introduce, our, speaker, here, patrick, here, has, made, a, carrer, of, datamining, for, good, prosecuting, war, crimes, got, a, conviction, in, his, own, country, gouatemala, thank] range edu.cmu.sphinx.util.Range@61dfee2f
20:55:49.086 INFO dictionary           Loading dictionary from: jar:file:/home/jwerner/dev/prosub/modules/aligner/sphinx4-samples.jar!/edu/cmu/sphinx/models/acoustic/wsj/dict/cmudict.0.6d
20:55:49.175 INFO dictionary           Loading filler dictionary from: jar:file:/home/jwerner/dev/prosub/modules/aligner/sphinx4-samples.jar!/edu/cmu/sphinx/models/acoustic/wsj/noisedict
20:55:49.176 INFO dictionary           The dictionary is missing a phonetic transcription for the word 'carrer'
20:55:49.176 INFO dictionary           The dictionary is missing a phonetic transcription for the word 'datamining'
20:55:49.177 INFO dictionary           The dictionary is missing a phonetic transcription for the word 'gouatemala'
20:55:49.597 INFO dictionary           The dictionary is missing a phonetic transcription for the word 'carrer'
20:55:49.598 INFO dictionary           The dictionary is missing a phonetic transcription for the word 'datamining'
20:55:49.598 INFO dictionary           The dictionary is missing a phonetic transcription for the word 'gouatemala'
20:55:49.608 INFO lexTreeLinguist      Max CI Units 50
20:55:49.608 INFO lexTreeLinguist      Unit table size 125000
20:55:49.640 INFO liveCMN              15.56 -0.79 -1.05 -0.39 -0.27 -0.12 -0.13 -0.16 -0.17 -0.15 -0.19 -0.16 -0.16 
20:55:49.684 INFO liveCMN              13.81 -0.78 -0.88 -0.39 -0.23 -0.10 -0.10 -0.12 -0.16 -0.15 -0.16 -0.15 -0.14 
20:55:49.823 INFO liveCMN              11.58 -0.66 -0.60 -0.30 -0.12 -0.03 -0.02 -0.07 -0.12 -0.12 -0.12 -0.11 -0.13 
20:55:50.114 INFO liveCMN              11.15 -0.74 -0.72 -0.33 -0.12 0.00 -0.01 -0.05 -0.11 -0.06 -0.10 -0.11 -0.10 
20:55:50.729 INFO liveCMN              12.25 -0.87 -0.85 -0.39 -0.17 -0.03 -0.06 -0.06 -0.11 -0.07 -0.10 -0.11 -0.12 
20:55:51.236 INFO liveCMN              13.46 -0.75 -0.88 -0.39 -0.15 -0.05 -0.07 -0.06 -0.11 -0.10 -0.12 -0.13 -0.13

So here it is probably not an Out of Memory problem, but some other kind.

Could this be related to the poor quality of the transcription?
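
One rough way to gauge the transcription quality is to collect the words that the log reports as missing from the dictionary. The following is only a sketch: the class `MissingWords` and its method are made up for illustration, and the pattern assumes the message wording shown in the log above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Sphinx4.
class MissingWords {
    // Matches the dictionary message as it appears in the log above.
    private static final Pattern MISSING =
        Pattern.compile("missing a phonetic transcription for the word '([^']+)'");

    /** Returns each word reported as missing, once, in order of first appearance. */
    static Set<String> missingWords(List<String> logLines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = MISSING.matcher(line);
            if (m.find()) {
                words.add(m.group(1));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Three lines copied from the log output above.
        List<String> log = Arrays.asList(
            "20:55:49.176 INFO dictionary The dictionary is missing a phonetic transcription for the word 'carrer'",
            "20:55:49.176 INFO dictionary The dictionary is missing a phonetic transcription for the word 'datamining'",
            "20:55:49.597 INFO dictionary The dictionary is missing a phonetic transcription for the word 'carrer'");
        System.out.println(missingWords(log));
    }
}
```

Words found this way are either misspelled in the transcription (such as 'carrer') or simply not in the dictionary, so the list is a starting point for cleaning up the text.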

@mbait
Contributor

mbait commented Oct 25, 2014

Hi Jonathan,

> Then this pause isn't represented in the timing information of the last
> word before the pause and the first word after the pause -- they are
> aligned as though they were directly next to each other.

It still isn't clear what your expected and actual outputs are.


Sincerely, Alexander

@jonathanglasmeyer
Author

Ok, let me rephrase it with an example:
Say you have two words A and B, with the following real start and stop times (in seconds):
A start=2, stop=2.2
B start=4, stop=4.2

So you have a speech pause between 2.2 and 4.
We would like to have this pause represented in the alignment.

But the actual alignment looks, for example, like this:
A start=2, stop=2.2
B start=2.2, stop=4.2
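
The same difference as a small Java sketch, computing the silence between consecutive words as the next word's start minus the current word's stop. The `Word` class and the `gaps` method are made up for illustration; they are not the Sphinx4 result types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of Sphinx4.
class PauseGaps {
    /** A word with start and stop times in seconds (illustrative type only). */
    static final class Word {
        final String text;
        final double start;
        final double stop;

        Word(String text, double start, double stop) {
            this.text = text;
            this.start = start;
            this.stop = stop;
        }
    }

    /** Silence in seconds between each word and the one that follows it. */
    static List<Double> gaps(List<Word> words) {
        List<Double> result = new ArrayList<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            result.add(words.get(i + 1).start - words.get(i).stop);
        }
        return result;
    }

    public static void main(String[] args) {
        // Timings from the example above.
        List<Word> expected = Arrays.asList(new Word("A", 2.0, 2.2), new Word("B", 4.0, 4.2));
        List<Word> actual = Arrays.asList(new Word("A", 2.0, 2.2), new Word("B", 2.2, 4.2));
        System.out.println("expected gaps: " + gaps(expected)); // roughly 1.8 s of silence
        System.out.println("actual gaps:   " + gaps(actual));   // no silence at all
    }
}
```

In the actual alignment the gap is zero and the pause is absorbed into word B, which then appears to last 2 seconds instead of 0.2.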

@nshmyrev
Contributor

I can take a look.

Btw, for better alignment quality you should use the generic en-us acoustic model:

http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English%20Generic%20Acoustic%20Model/en-us.tar.gz/download
