Word-by-word tagging #120

brannondorsey · 2014-11-09T07:35:59Z

Hi there,

I am interested in identifying the precise in and out timestamps of specific words embedded in the closed caption data of an MPEG-2 stream. It seems that with CCExtractor, only lines of text are indexed in this way. Is this a limitation of CCExtractor specifically, or of the CC standards used in digital broadcast? If this functionality is not directly built into CCExtractor, would you have any suggestions as to how to extract and use this very specific data?

anshul1912 · 2014-11-10T08:32:50Z

It is possible only if the video has word-by-word timing inside. Do you have a video where the words of a line are shown at different times?

For example, with the caption below:
"This is caption"
first "This" is shown at second x,
then "is" is shown at second x+1,
then "caption" is shown at second x+2.

I have never seen a video like that, though people do make captions that
show "This", then "This is", then "This is caption". The data is redundant there, but people use it for effect.
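If a stream does use that cumulative style, per-word onset times could be recovered by comparing each caption with the previous one. A minimal sketch, assuming the captions have already been extracted as (time in seconds, text) pairs; the `word_onsets` helper and the sample data are hypothetical and not part of CCExtractor:

```python
def word_onsets(frames):
    """Return (word, onset_seconds) pairs from progressively growing captions.

    frames: list of (time_seconds, text) in display order (assumed input shape).
    """
    onsets = []
    prev_words = []
    for t, text in frames:
        words = text.split()
        if words[:len(prev_words)] == prev_words:
            # Caption extends the previous one: only the tail is new.
            new_words = words[len(prev_words):]
        else:
            # Caption restarted: every word is new.
            new_words = words
        onsets.extend((w, t) for w in new_words)
        prev_words = words
    return onsets


# Hypothetical data mirroring the example above, with x = 10 seconds.
frames = [(10.0, "This"), (11.0, "This is"), (12.0, "This is caption")]
result = word_onsets(frames)
print(result)
```

This only gives the time each word first appears on screen, not when it was spoken.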

In closed captions, timing is generally taken from the PES packet that contains the caption data, so if each word is in a different PES packet, you can get that timing. I don't think any sane closed caption encoder displays one word at a time with each frame; that would reduce the readability of the text and would not be useful either.
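For reference, a PES presentation timestamp (PTS) is a 33-bit counter on a 90 kHz clock, so converting it to seconds is simple arithmetic. A rough sketch, assuming the raw PTS values have already been read from the packets; `pts_to_seconds` and `first_pts` are hypothetical names used only for illustration:

```python
PTS_CLOCK_HZ = 90_000   # MPEG-2 PTS ticks per second
PTS_MODULUS = 1 << 33   # PTS is a 33-bit counter and wraps around


def pts_to_seconds(pts, first_pts=0):
    """Seconds elapsed between first_pts and pts, tolerating one wraparound."""
    return ((pts - first_pts) % PTS_MODULUS) / PTS_CLOCK_HZ


print(pts_to_seconds(900_000))           # 10.0
print(pts_to_seconds(270_000, 90_000))   # 2.0
```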

Can you elaborate on why you are interested in identifying the precise in and out timestamps of specific words?

cfsmp3 · 2014-11-10T11:00:38Z

For captions transmitted in roll-up we could have word-by-word timing
(since characters are displayed as received); however in roll-up, which is
used mostly for newscasts and other content transcribed in real time,
there's no lipsync (captions are at least a couple seconds behind audio) so
there's no value in doing that either.
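To illustrate the roll-up case: if each character were logged with its arrival time, words could be grouped with their first and last character times. This is only a sketch under that assumption; the `words_from_chars` helper and the (time, character) input format are hypothetical and do not describe CCExtractor's output:

```python
def words_from_chars(chars):
    """Group timed characters into (word, start_seconds, end_seconds) tuples.

    chars: iterable of (time_seconds, character) in arrival order (assumed).
    """
    words = []
    current = ""
    start = end = None
    for t, ch in chars:
        if ch.isspace():
            # Whitespace closes the word in progress, if any.
            if current:
                words.append((current, start, end))
                current = ""
        else:
            if not current:
                start = t  # first character of a new word
            current += ch
            end = t        # most recent character of this word
    if current:
        words.append((current, start, end))
    return words


# Hypothetical arrival times for the text "Hi you".
chars = [(0.0, "H"), (0.03, "i"), (0.07, " "),
         (0.1, "y"), (0.13, "o"), (0.17, "u")]
timed_words = words_from_chars(chars)
print(timed_words)
```

The times are arrival times of the caption data, so for live roll-up captions they would still trail the audio as described above.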

brannondorsey · 2014-11-12T01:35:37Z

Hi all,
Thank you both for your timely responses! It seems that no closed captioning will provide me with the level of control that I am looking for in tagging words in television programs. I am looking to create a database of precise in and out points of words in network TV and movies. That said, CCExtractor is a really fine piece of software.

cfsmp3 added the invalid label Nov 11, 2014

cfsmp3 closed this as completed Nov 11, 2014
