
diarization: add diarization support for all current output types #1031

Merged
merged 2 commits into ggerganov:master on Jun 25, 2023

Conversation

colinc (Contributor) commented on Jun 19, 2023

This is a first pass at addressing #1020: getting diarization labeling working for all current output types.

This extracts all diarization code into its own reusable function (I'd welcome any improvements to the function name: estimate_diarization_speaker 😅).
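
For reference, here is a minimal sketch of the kind of stereo-energy heuristic such a function can be built around. This is a sketch under assumptions, not necessarily the PR's exact code: the channel layout, the 10 ms timestamp units, the 16 kHz sample rate, and the 1.1 dominance threshold are all illustrative.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <string>
#include <vector>

// Sketch: pick the louder stereo channel over the segment's time range.
// Assumes pcmf32s holds the two de-interleaved stereo channels and t0/t1
// are segment timestamps in units of 10 ms at 16 kHz audio.
static std::string estimate_diarization_speaker(
        const std::vector<std::vector<float>> & pcmf32s,
        int64_t t0, int64_t t1, bool id_only = false) {
    const int64_t n_samples = (int64_t) pcmf32s[0].size();

    // timestamp (10 ms units) -> sample index at 16 kHz
    const int64_t is0 = std::min(n_samples, t0 * 16000 / 100);
    const int64_t is1 = std::min(n_samples, t1 * 16000 / 100);

    double energy0 = 0.0, energy1 = 0.0;
    for (int64_t j = is0; j < is1; ++j) {
        energy0 += std::fabs(pcmf32s[0][j]);
        energy1 += std::fabs(pcmf32s[1][j]);
    }

    std::string speaker = "?"; // ambiguous when neither channel dominates
    if (energy0 > 1.1 * energy1) {
        speaker = "0";
    } else if (energy1 > 1.1 * energy0) {
        speaker = "1";
    }

    return id_only ? speaker : "(speaker " + speaker + ")";
}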

This updates all output formats to include diarization labeling, when applicable. For intermediate formats (JSON, CSV, etc.) I think it makes sense to include only the speaker ID rather than a formatted string (i.e. "1" vs. "(speaker 1)"), so the consuming program or system can decide how to handle the speaker label. For the remaining, more "final" formats, the formatted speaker string is included.
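
Concretely, the intent is for one helper to serve both cases. A hedged sketch of the two call styles (parameter names and the id_only flag as assumed in the sketch above):

// Intermediate formats (JSON, CSV) ask for the bare ID:
const std::string id    = estimate_diarization_speaker(pcmf32s, t0, t1, /*id_only=*/true); // "1"
// Subtitle-style formats (text, SRT, VTT, LRC) ask for the formatted label:
const std::string label = estimate_diarization_speaker(pcmf32s, t0, t1);                   // "(speaker 1)"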

Examples of all output formats from a one-minute excerpt of Lex Fridman's interview of Andrej Karpathy:

text file

(speaker 0) Then these neural nets take on pretty surprising magical
(speaker 0) properties.
(speaker 0) I think it's kind of interesting how much you can get out
(speaker 0) of even very simple mathematical
(speaker 0) formalism.
(speaker 1) When your brain right now is talking, is it doing next word
(speaker 1) prediction or is it doing
(speaker 1) something more interesting?
(speaker 0) Well, it's definitely some kind of a generative model that
(speaker 0)'s a GPT-like and prompted by you.
(speaker 0) So you're giving me a prompt and I'm kind of like
(speaker 0) responding to it in a generative way.
(speaker 1) And by yourself perhaps a little bit?
(speaker 1) Like are you adding extra prompts from your own memory
(speaker 1) inside your head?
(speaker 0) Hmm.
(speaker 1) Or no?
(speaker 0) Well, it definitely feels like you're referencing some kind
(speaker 0) of a declarative structure of like
(speaker 0) memory and so on.
(speaker 0) And then you're putting that together with your prompt and
(speaker 0) giving away some answers.
(speaker 1) How much of what you just said has been said by you before?
(speaker 0) Nothing basically, right?
(speaker 1) No, but if you actually look at all the words you've ever
(speaker 1) said in your life and you do a
(speaker 1) search you'll probably said a lot of the same words in the
(speaker 1) same order before.
(speaker 0) Yeah, could be.

vtt file

WEBVTT

00:00:00.000 --> 00:00:03.820
<v Speaker0> Then these neural nets take on pretty surprising magical

00:00:03.820 --> 00:00:04.700
<v Speaker0> properties.

00:00:04.700 --> 00:00:06.990
<v Speaker0> I think it's kind of interesting how much you can get out

00:00:06.990 --> 00:00:08.440
<v Speaker0> of even very simple mathematical

00:00:08.440 --> 00:00:09.440
<v Speaker0> formalism.

00:00:09.440 --> 00:00:12.870
<v Speaker1> When your brain right now is talking, is it doing next word

00:00:12.870 --> 00:00:14.380
<v Speaker1> prediction or is it doing

00:00:14.380 --> 00:00:15.380
<v Speaker1> something more interesting?

00:00:15.380 --> 00:00:18.520
<v Speaker0> Well, it's definitely some kind of a generative model that

00:00:18.520 --> 00:00:20.520
<v Speaker0>'s a GPT-like and prompted by you.

00:00:20.520 --> 00:00:23.360
<v Speaker0> So you're giving me a prompt and I'm kind of like

00:00:23.360 --> 00:00:25.580
<v Speaker0> responding to it in a generative way.

00:00:25.580 --> 00:00:27.740
<v Speaker1> And by yourself perhaps a little bit?

00:00:27.740 --> 00:00:31.460
<v Speaker1> Like are you adding extra prompts from your own memory

00:00:31.460 --> 00:00:32.440
<v Speaker1> inside your head?

00:00:32.440 --> 00:00:33.440
<v Speaker0> Hmm.

00:00:33.440 --> 00:00:34.440
<v Speaker1> Or no?

00:00:34.440 --> 00:00:36.520
<v Speaker0> Well, it definitely feels like you're referencing some kind

00:00:36.520 --> 00:00:37.680
<v Speaker0> of a declarative structure of like

00:00:37.680 --> 00:00:39.240
<v Speaker0> memory and so on.

00:00:39.240 --> 00:00:42.400
<v Speaker0> And then you're putting that together with your prompt and

00:00:42.400 --> 00:00:43.960
<v Speaker0> giving away some answers.

00:00:43.960 --> 00:00:49.100
<v Speaker1> How much of what you just said has been said by you before?

00:00:49.100 --> 00:00:50.400
<v Speaker0> Nothing basically, right?

00:00:50.400 --> 00:00:53.270
<v Speaker1> No, but if you actually look at all the words you've ever

00:00:53.270 --> 00:00:54.840
<v Speaker1> said in your life and you do a

00:00:54.840 --> 00:00:59.000
<v Speaker1> search you'll probably said a lot of the same words in the

00:00:59.000 --> 00:01:00.380
<v Speaker1> same order before.

00:01:00.380 --> 00:01:03.500
<v Speaker0> Yeah, could be.

srt file

1
00:00:00,000 --> 00:00:03,820
(speaker 0) Then these neural nets take on pretty surprising magical

2
00:00:03,820 --> 00:00:04,700
(speaker 0) properties.

3
00:00:04,700 --> 00:00:06,990
(speaker 0) I think it's kind of interesting how much you can get out

4
00:00:06,990 --> 00:00:08,440
(speaker 0) of even very simple mathematical

5
00:00:08,440 --> 00:00:09,440
(speaker 0) formalism.

6
00:00:09,440 --> 00:00:12,870
(speaker 1) When your brain right now is talking, is it doing next word

7
00:00:12,870 --> 00:00:14,380
(speaker 1) prediction or is it doing

8
00:00:14,380 --> 00:00:15,380
(speaker 1) something more interesting?

9
00:00:15,380 --> 00:00:18,520
(speaker 0) Well, it's definitely some kind of a generative model that

10
00:00:18,520 --> 00:00:20,520
(speaker 0)'s a GPT-like and prompted by you.

11
00:00:20,520 --> 00:00:23,360
(speaker 0) So you're giving me a prompt and I'm kind of like

12
00:00:23,360 --> 00:00:25,580
(speaker 0) responding to it in a generative way.

13
00:00:25,580 --> 00:00:27,740
(speaker 1) And by yourself perhaps a little bit?

14
00:00:27,740 --> 00:00:31,460
(speaker 1) Like are you adding extra prompts from your own memory

15
00:00:31,460 --> 00:00:32,440
(speaker 1) inside your head?

16
00:00:32,440 --> 00:00:33,440
(speaker 0) Hmm.

17
00:00:33,440 --> 00:00:34,440
(speaker 1) Or no?

18
00:00:34,440 --> 00:00:36,520
(speaker 0) Well, it definitely feels like you're referencing some kind

19
00:00:36,520 --> 00:00:37,680
(speaker 0) of a declarative structure of like

20
00:00:37,680 --> 00:00:39,240
(speaker 0) memory and so on.

21
00:00:39,240 --> 00:00:42,400
(speaker 0) And then you're putting that together with your prompt and

22
00:00:42,400 --> 00:00:43,960
(speaker 0) giving away some answers.

23
00:00:43,960 --> 00:00:49,100
(speaker 1) How much of what you just said has been said by you before?

24
00:00:49,100 --> 00:00:50,400
(speaker 0) Nothing basically, right?

25
00:00:50,400 --> 00:00:53,270
(speaker 1) No, but if you actually look at all the words you've ever

26
00:00:53,270 --> 00:00:54,840
(speaker 1) said in your life and you do a

27
00:00:54,840 --> 00:00:59,000
(speaker 1) search you'll probably said a lot of the same words in the

28
00:00:59,000 --> 00:01:00,380
(speaker 1) same order before.

29
00:01:00,380 --> 00:01:03,500
(speaker 0) Yeah, could be.

lrc file

[by:whisper.cpp]
[00:00.00](speaker 0) Then these neural nets take on pretty surprising magical
[00:03.82](speaker 0) properties.
[00:04.70](speaker 0) I think it's kind of interesting how much you can get out
[00:06.99](speaker 0) of even very simple mathematical
[00:08.44](speaker 0) formalism.
[00:09.44](speaker 1) When your brain right now is talking, is it doing next word
[00:12.87](speaker 1) prediction or is it doing
[00:14.38](speaker 1) something more interesting?
[00:15.38](speaker 0) Well, it's definitely some kind of a generative model that
[00:18.52](speaker 0)'s a GPT-like and prompted by you.
[00:20.52](speaker 0) So you're giving me a prompt and I'm kind of like
[00:23.36](speaker 0) responding to it in a generative way.
[00:25.58](speaker 1) And by yourself perhaps a little bit?
[00:27.74](speaker 1) Like are you adding extra prompts from your own memory
[00:31.46](speaker 1) inside your head?
[00:32.44](speaker 0) Hmm.
[00:33.44](speaker 1) Or no?
[00:34.44](speaker 0) Well, it definitely feels like you're referencing some kind
[00:36.52](speaker 0) of a declarative structure of like
[00:37.68](speaker 0) memory and so on.
[00:39.24](speaker 0) And then you're putting that together with your prompt and
[00:42.40](speaker 0) giving away some answers.
[00:43.96](speaker 1) How much of what you just said has been said by you before?
[00:49.10](speaker 0) Nothing basically, right?
[00:50.40](speaker 1) No, but if you actually look at all the words you've ever
[00:53.27](speaker 1) said in your life and you do a
[00:54.84](speaker 1) search you'll probably said a lot of the same words in the
[00:59.00](speaker 1) same order before.
[01:00.38](speaker 0) Yeah, could be.

karaoke video
diarize_sample_rework.mp4
csv file

start,end,speaker,text
0,3820,0," Then these neural nets take on pretty surprising magical"
3820,4700,0," properties."
4700,6990,0," I think it's kind of interesting how much you can get out"
6990,8440,0," of even very simple mathematical"
8440,9440,0," formalism."
9440,12870,1," When your brain right now is talking, is it doing next word"
12870,14380,1," prediction or is it doing"
14380,15380,1," something more interesting?"
15380,18520,0," Well, it's definitely some kind of a generative model that"
18520,20520,0,"'s a GPT-like and prompted by you."
20520,23360,0," So you're giving me a prompt and I'm kind of like"
23360,25580,0," responding to it in a generative way."
25580,27740,1," And by yourself perhaps a little bit?"
27740,31460,1," Like are you adding extra prompts from your own memory"
31460,32440,1," inside your head?"
32440,33440,0," Hmm."
33440,34440,1," Or no?"
34440,36520,0," Well, it definitely feels like you're referencing some kind"
36520,37680,0," of a declarative structure of like"
37680,39240,0," memory and so on."
39240,42400,0," And then you're putting that together with your prompt and"
42400,43960,0," giving away some answers."
43960,49100,1," How much of what you just said has been said by you before?"
49100,50400,0," Nothing basically, right?"
50400,53270,1," No, but if you actually look at all the words you've ever"
53270,54840,1," said in your life and you do a"
54840,59000,1," search you'll probably said a lot of the same words in the"
59000,60380,1," same order before."
60380,63500,0," Yeah, could be."

json file
{
	"systeminfo": "AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 1 | ",
	"model": {
		"type": "medium",
		"multilingual": false,
		"vocab": 51864,
		"audio": {
			"ctx": 1500,
			"state": 1024,
			"head": 16,
			"layer": 24
		},
		"text": {
			"ctx": 448,
			"state": 1024,
			"head": 16,
			"layer": 24
		},
		"mels": 80,
		"ftype": 1
	},
	"params": {
		"model": "/Users/administrator/Code/whisper.cpp/models/ggml-medium.en.bin",
		"language": "en",
		"translate": false
	},
	"result": {
		"language": "en"
	},
	"transcription": [
		{
			"timestamps": {
				"from": "00:00:00,000",
				"to": "00:00:03,820"
			},
			"offsets": {
				"from": 0,
				"to": 3820
			},
			"text": " Then these neural nets take on pretty surprising magical",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:03,820",
				"to": "00:00:04,700"
			},
			"offsets": {
				"from": 3820,
				"to": 4700
			},
			"text": " properties.",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:04,700",
				"to": "00:00:06,990"
			},
			"offsets": {
				"from": 4700,
				"to": 6990
			},
			"text": " I think it's kind of interesting how much you can get out",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:06,990",
				"to": "00:00:08,440"
			},
			"offsets": {
				"from": 6990,
				"to": 8440
			},
			"text": " of even very simple mathematical",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:08,440",
				"to": "00:00:09,440"
			},
			"offsets": {
				"from": 8440,
				"to": 9440
			},
			"text": " formalism.",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:09,440",
				"to": "00:00:12,870"
			},
			"offsets": {
				"from": 9440,
				"to": 12870
			},
			"text": " When your brain right now is talking, is it doing next word",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:12,870",
				"to": "00:00:14,380"
			},
			"offsets": {
				"from": 12870,
				"to": 14380
			},
			"text": " prediction or is it doing",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:14,380",
				"to": "00:00:15,380"
			},
			"offsets": {
				"from": 14380,
				"to": 15380
			},
			"text": " something more interesting?",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:15,380",
				"to": "00:00:18,520"
			},
			"offsets": {
				"from": 15380,
				"to": 18520
			},
			"text": " Well, it's definitely some kind of a generative model that",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:18,520",
				"to": "00:00:20,520"
			},
			"offsets": {
				"from": 18520,
				"to": 20520
			},
			"text": "'s a GPT-like and prompted by you.",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:20,520",
				"to": "00:00:23,360"
			},
			"offsets": {
				"from": 20520,
				"to": 23360
			},
			"text": " So you're giving me a prompt and I'm kind of like",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:23,360",
				"to": "00:00:25,580"
			},
			"offsets": {
				"from": 23360,
				"to": 25580
			},
			"text": " responding to it in a generative way.",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:25,580",
				"to": "00:00:27,740"
			},
			"offsets": {
				"from": 25580,
				"to": 27740
			},
			"text": " And by yourself perhaps a little bit?",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:27,740",
				"to": "00:00:31,460"
			},
			"offsets": {
				"from": 27740,
				"to": 31460
			},
			"text": " Like are you adding extra prompts from your own memory",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:31,460",
				"to": "00:00:32,440"
			},
			"offsets": {
				"from": 31460,
				"to": 32440
			},
			"text": " inside your head?",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:32,440",
				"to": "00:00:33,440"
			},
			"offsets": {
				"from": 32440,
				"to": 33440
			},
			"text": " Hmm.",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:33,440",
				"to": "00:00:34,440"
			},
			"offsets": {
				"from": 33440,
				"to": 34440
			},
			"text": " Or no?",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:34,440",
				"to": "00:00:36,520"
			},
			"offsets": {
				"from": 34440,
				"to": 36520
			},
			"text": " Well, it definitely feels like you're referencing some kind",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:36,520",
				"to": "00:00:37,680"
			},
			"offsets": {
				"from": 36520,
				"to": 37680
			},
			"text": " of a declarative structure of like",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:37,680",
				"to": "00:00:39,240"
			},
			"offsets": {
				"from": 37680,
				"to": 39240
			},
			"text": " memory and so on.",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:39,240",
				"to": "00:00:42,400"
			},
			"offsets": {
				"from": 39240,
				"to": 42400
			},
			"text": " And then you're putting that together with your prompt and",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:42,400",
				"to": "00:00:43,960"
			},
			"offsets": {
				"from": 42400,
				"to": 43960
			},
			"text": " giving away some answers.",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:43,960",
				"to": "00:00:49,100"
			},
			"offsets": {
				"from": 43960,
				"to": 49100
			},
			"text": " How much of what you just said has been said by you before?",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:49,100",
				"to": "00:00:50,400"
			},
			"offsets": {
				"from": 49100,
				"to": 50400
			},
			"text": " Nothing basically, right?",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:50,400",
				"to": "00:00:53,270"
			},
			"offsets": {
				"from": 50400,
				"to": 53270
			},
			"text": " No, but if you actually look at all the words you've ever",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:53,270",
				"to": "00:00:54,840"
			},
			"offsets": {
				"from": 53270,
				"to": 54840
			},
			"text": " said in your life and you do a",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:54,840",
				"to": "00:00:59,000"
			},
			"offsets": {
				"from": 54840,
				"to": 59000
			},
			"text": " search you'll probably said a lot of the same words in the",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:59,000",
				"to": "00:01:00,380"
			},
			"offsets": {
				"from": 59000,
				"to": 60380
			},
			"text": " same order before.",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:01:00,380",
				"to": "00:01:03,500"
			},
			"offsets": {
				"from": 60380,
				"to": 63500
			},
			"text": " Yeah, could be.",
			"speaker": "0"
		}
	]
}
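
Since the intermediate formats carry only the bare speaker ID, a downstream program can attach whatever labels it wants. A hypothetical post-processing sketch (the ID-to-name mapping is assumed for this excerpt, not produced by the PR):

#include <cstdio>
#include <map>
#include <string>

int main() {
    // Hypothetical mapping from bare speaker IDs (as in the CSV/JSON output)
    // to display names chosen by the consuming program.
    const std::map<std::string, std::string> names = {
        {"0", "Andrej Karpathy"},
        {"1", "Lex Fridman"},
    };

    // One row from the CSV sample above: start,end,speaker,text
    const std::string speaker = "1";
    const std::string text    = " How much of what you just said has been said by you before?";

    std::printf("[%s]%s\n", names.at(speaker).c_str(), text.c_str());
    return 0;
}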

ggerganov merged commit 14baf2e into ggerganov:master on Jun 25, 2023
18 checks passed
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
…v#1031)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023
…v#1031)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>