
Different subtitle outputs with CLI commands #33

Open
GalenMarek14 opened this issue Jan 1, 2024 · 1 comment


@GalenMarek14

Is there a way to customize the subtitle output with CLI commands? It would be very useful to get output like this:

1
00:00:00,030 --> 00:00:00,070
<font color="#00ff00">The</font> first sentence.

2
00:00:00,070 --> 00:00:00,080
The first sentence.

3
00:00:00,080 --> 00:00:00,450
The <font color="#00ff00">first</font> sentence.

4
00:00:00,450 --> 00:00:00,530
The first sentence.

5
00:00:00,530 --> 00:00:01,100
The first <font color="#00ff00">sentence</font>.

6
00:00:01,740 --> 00:00:01,780
<font color="#00ff00">The</font> second sentence.

7
00:00:01,780 --> 00:00:01,800
The second sentence.

8
00:00:01,800 --> 00:00:02,250
The <font color="#00ff00">second</font> sentence.

9
00:00:02,250 --> 00:00:02,260
The second sentence.

10
00:00:02,260 --> 00:00:02,800
The second <font color="#00ff00">sentence</font>.

So far, the only method I can think of is converting the JSON output myself, but that's a bit hard for me as a non-coder.
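For reference, a conversion like this can be scripted outside the tool itself. The sketch below assumes a word timeline shaped as arrays of `{ text, startTime, endTime }` objects (times in seconds, one inner array per sentence) — the exact field names in Echogarden's JSON output may differ — and emits an SRT where the active word is wrapped in a colored font tag, as in the example above (omitting the unhighlighted gap cues):

```javascript
// Hypothetical input shape: sentences = [[{ text, startTime, endTime }, ...], ...]
// Times are in seconds; each inner array holds one sentence's words.

function formatSrtTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const h = Math.floor(totalMs / 3600000);
  const m = Math.floor((totalMs % 3600000) / 60000);
  const s = Math.floor((totalMs % 60000) / 1000);
  const ms = totalMs % 1000;
  const pad = (n, width) => String(n).padStart(width, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

function highlightedSrt(sentences, color = '#00ff00') {
  const cues = [];
  for (const words of sentences) {
    words.forEach((word, i) => {
      // Rebuild the full sentence with only the current word highlighted.
      const line = words
        .map((w, j) => (j === i ? `<font color="${color}">${w.text}</font>` : w.text))
        .join(' ');
      cues.push({ start: word.startTime, end: word.endTime, line });
    });
  }
  return cues
    .map((cue, i) =>
      `${i + 1}\n${formatSrtTime(cue.start)} --> ${formatSrtTime(cue.end)}\n${cue.line}\n`)
    .join('\n');
}
```

Each sentence of n words produces n cues; the plain "gap" cues from the example (the sentence with no highlight, shown between word timings) could be interleaved the same way.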

@rotemdan
Member

rotemdan commented Mar 11, 2024

There's no standardized format (that I know of) for word-level subtitles, unfortunately.

The auto-subtitles from YouTube internally use both a custom JSON format like:

	"events": [
		{
			"tStartMs": 0,
			"dDurationMs": 502120,
			"id": 1,
			"wpWinPosId": 1,
			"wsWinStyleId": 1
		},
		{
			"tStartMs": 120,
			"dDurationMs": 7239,
			"wWinId": 1,
			"segs": [
				{
					"utf8": "great",
					"acAsrConf": 0
				},
				{
					"utf8": " paper",
					"tOffsetMs": 400,
					"acAsrConf": 0
				},
				{
					"utf8": " today",
					"tOffsetMs": 760,
					"acAsrConf": 0
				},
				{
					"utf8": " fellow",
					"tOffsetMs": 1240,
					"acAsrConf": 0
				},
				{
					"utf8": " Scholars",
					"tOffsetMs": 1640,
					"acAsrConf": 0
				},
				{
					"utf8": " stable",
					"tOffsetMs": 2519,
					"acAsrConf": 0
				}
			]
		},
		{
			"tStartMs": 3149,
			"dDurationMs": 4210,
			"wWinId": 1,
			"aAppend": 1,
			"segs": [
				{
					"utf8": "\n"
				}
			]
		},
		{
			"tStartMs": 3159,
			"dDurationMs": 6841,
			"wWinId": 1,
			"segs": [
				{
					"utf8": "diffusion",
					"acAsrConf": 0
				},
				{
					"utf8": " XL",
					"tOffsetMs": 800,
					"acAsrConf": 0
				},
				{
					"utf8": " turbo",
					"tOffsetMs": 1761,
					"acAsrConf": 0
				},
				{
					"utf8": " why",
					"tOffsetMs": 2761,
					"acAsrConf": 0
				},
				{
					"utf8": " well",
					"tOffsetMs": 3441,
					"acAsrConf": 0
				},
				{
					"utf8": " because",
					"tOffsetMs": 3881,
					"acAsrConf": 0
				}
			]
		},

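In this format, `tOffsetMs` appears to be relative to the containing event's `tStartMs`. A sketch of flattening such events into absolute word timings (field names taken from the sample above; this is a guess at the format's semantics, not a documented spec):

```javascript
// Flatten YouTube-style caption events into absolute word timings.
// A segment's tOffsetMs is relative to its event's tStartMs; segments
// without an offset start at the event time itself.
function extractWords(events) {
  const words = [];
  for (const event of events) {
    if (!event.segs) continue; // window/style definition events carry no text
    for (const seg of event.segs) {
      const text = seg.utf8.trim();
      if (!text) continue; // skip the "\n" append segments
      words.push({
        text,
        startMs: event.tStartMs + (seg.tOffsetMs ?? 0),
      });
    }
  }
  return words;
}
```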
And also an extended form of the WebVTT subtitle format that embeds special word-timestamp tags:

WEBVTT
Kind: captions
Language: en

00:00:00.120 --> 00:00:03.149 align:start position:0%
 
great<00:00:00.520><c> paper</c><00:00:00.880><c> today</c><00:00:01.360><c> fellow</c><00:00:01.760><c> Scholars</c><00:00:02.639><c> stable</c>

00:00:03.149 --> 00:00:03.159 align:start position:0%
great paper today fellow Scholars stable
 

00:00:03.159 --> 00:00:07.349 align:start position:0%
great paper today fellow Scholars stable
diffusion<00:00:03.959><c> XL</c><00:00:04.920><c> turbo</c><00:00:05.920><c> why</c><00:00:06.600><c> well</c><00:00:07.040><c> because</c>

00:00:07.349 --> 00:00:07.359 align:start position:0%
diffusion XL turbo why well because

These are internal formats, which I fetched using a specialized downloader like youtube-dl; they are not otherwise publicly accessible.
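The word-timestamp tags in the VTT variant above can be extracted with a small parser. A sketch, assuming each timed line follows the `word<hh:mm:ss.mmm><c> word</c>…` pattern shown, with the cue's own start time supplying the first word's timestamp:

```javascript
// Parse one timed line from the extended VTT format above into word timings.
// lineStartMs is the start time of the cue containing the line.
function parseTimedVttLine(line, lineStartMs) {
  const words = [];
  // The first word precedes any tag; take the text up to the first "<".
  const firstTagAt = line.indexOf('<');
  const first = (firstTagAt === -1 ? line : line.slice(0, firstTagAt)).trim();
  if (first) words.push({ text: first, startMs: lineStartMs });
  // Each later word is a "<hh:mm:ss.mmm>" timestamp followed by "<c>...</c>".
  const tagRe = /<(\d{2}):(\d{2}):(\d{2})\.(\d{3})><c>([^<]*)<\/c>/g;
  let m;
  while ((m = tagRe.exec(line)) !== null) {
    const [, h, min, s, ms, text] = m;
    words.push({
      text: text.trim(),
      startMs: ((+h * 60 + +min) * 60 + +s) * 1000 + +ms,
    });
  }
  return words;
}
```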

I don't know of any software that actually supports these formats for viewing, so I'm not sure what the benefit would be of supporting or imitating them. (Echogarden could support reading and converting them in the future, but remember that they can only be fetched using special downloaders, not via the official YouTube API, so the priority for implementing this is currently low.)

The JSON format produced by Echogarden contains a lot of extra linguistic information, like phonetic pronunciation and sub-word timing, and also includes word offsets into the original raw text.
