
diarization: add diarization support for all current output types #1031

Merged
merged 2 commits into ggerganov:master on Jun 25, 2023

Conversation

colinc (Contributor) commented on Jun 19, 2023

This is a first pass at addressing #1020: getting diarization labeling working for all current output types.

This extracts all diarization code into its own reusable function (I'd welcome any improvements to the function name: estimate_diarization_speaker 😅).
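
For reference, here is a minimal sketch of the kind of stereo-energy heuristic such a function can be built around. This is a sketch under assumptions, not necessarily the PR's exact code: the channel layout, the 10 ms timestamp units, the 16 kHz sample rate, and the 1.1 dominance threshold are all illustrative.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <string>
#include <vector>

// Sketch: pick the louder stereo channel over the segment's time range.
// Assumes pcmf32s holds the two de-interleaved stereo channels and t0/t1
// are segment timestamps in units of 10 ms at 16 kHz audio.
static std::string estimate_diarization_speaker(
        const std::vector<std::vector<float>> & pcmf32s,
        int64_t t0, int64_t t1, bool id_only = false) {
    const int64_t n_samples = (int64_t) pcmf32s[0].size();

    // timestamp (10 ms units) -> sample index at 16 kHz
    const int64_t is0 = std::min(n_samples, t0 * 16000 / 100);
    const int64_t is1 = std::min(n_samples, t1 * 16000 / 100);

    double energy0 = 0.0, energy1 = 0.0;
    for (int64_t j = is0; j < is1; ++j) {
        energy0 += std::fabs(pcmf32s[0][j]);
        energy1 += std::fabs(pcmf32s[1][j]);
    }

    std::string speaker = "?"; // ambiguous when neither channel dominates
    if (energy0 > 1.1 * energy1) {
        speaker = "0";
    } else if (energy1 > 1.1 * energy0) {
        speaker = "1";
    }

    return id_only ? speaker : "(speaker " + speaker + ")";
}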

This updates all output formats to include diarization labeling, when applicable. For intermediate formats (JSON, CSV, etc.) I think it makes sense to include only the speaker ID rather than a formatted string (i.e. "1" vs. "(speaker 1)"), so the consuming program or system can decide how to handle the speaker label. For the remaining, more "final" formats, the formatted speaker string is included.
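
Concretely, the intent is for one helper to serve both cases. A hedged sketch of the two call styles (parameter names and the id_only flag as assumed in the sketch above):

// Intermediate formats (JSON, CSV) ask for the bare ID:
const std::string id    = estimate_diarization_speaker(pcmf32s, t0, t1, /*id_only=*/true); // "1"
// Subtitle-style formats (text, SRT, VTT, LRC) ask for the formatted label:
const std::string label = estimate_diarization_speaker(pcmf32s, t0, t1);                   // "(speaker 1)"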

Examples of all output formats from a one-minute excerpt of Lex Fridman's interview of Andrej Karpathy:

text file

(speaker 0) Then these neural nets take on pretty surprising magical
(speaker 0) properties.
(speaker 0) I think it's kind of interesting how much you can get out
(speaker 0) of even very simple mathematical
(speaker 0) formalism.
(speaker 1) When your brain right now is talking, is it doing next word
(speaker 1) prediction or is it doing
(speaker 1) something more interesting?
(speaker 0) Well, it's definitely some kind of a generative model that
(speaker 0)'s a GPT-like and prompted by you.
(speaker 0) So you're giving me a prompt and I'm kind of like
(speaker 0) responding to it in a generative way.
(speaker 1) And by yourself perhaps a little bit?
(speaker 1) Like are you adding extra prompts from your own memory
(speaker 1) inside your head?
(speaker 0) Hmm.
(speaker 1) Or no?
(speaker 0) Well, it definitely feels like you're referencing some kind
(speaker 0) of a declarative structure of like
(speaker 0) memory and so on.
(speaker 0) And then you're putting that together with your prompt and
(speaker 0) giving away some answers.
(speaker 1) How much of what you just said has been said by you before?
(speaker 0) Nothing basically, right?
(speaker 1) No, but if you actually look at all the words you've ever
(speaker 1) said in your life and you do a
(speaker 1) search you'll probably said a lot of the same words in the
(speaker 1) same order before.
(speaker 0) Yeah, could be.

vtt file

WEBVTT

00:00:00.000 --> 00:00:03.820
<v Speaker0> Then these neural nets take on pretty surprising magical

00:00:03.820 --> 00:00:04.700
<v Speaker0> properties.

00:00:04.700 --> 00:00:06.990
<v Speaker0> I think it's kind of interesting how much you can get out

00:00:06.990 --> 00:00:08.440
<v Speaker0> of even very simple mathematical

00:00:08.440 --> 00:00:09.440
<v Speaker0> formalism.

00:00:09.440 --> 00:00:12.870
<v Speaker1> When your brain right now is talking, is it doing next word

00:00:12.870 --> 00:00:14.380
<v Speaker1> prediction or is it doing

00:00:14.380 --> 00:00:15.380
<v Speaker1> something more interesting?

00:00:15.380 --> 00:00:18.520
<v Speaker0> Well, it's definitely some kind of a generative model that

00:00:18.520 --> 00:00:20.520
<v Speaker0>'s a GPT-like and prompted by you.

00:00:20.520 --> 00:00:23.360
<v Speaker0> So you're giving me a prompt and I'm kind of like

00:00:23.360 --> 00:00:25.580
<v Speaker0> responding to it in a generative way.

00:00:25.580 --> 00:00:27.740
<v Speaker1> And by yourself perhaps a little bit?

00:00:27.740 --> 00:00:31.460
<v Speaker1> Like are you adding extra prompts from your own memory

00:00:31.460 --> 00:00:32.440
<v Speaker1> inside your head?

00:00:32.440 --> 00:00:33.440
<v Speaker0> Hmm.

00:00:33.440 --> 00:00:34.440
<v Speaker1> Or no?

00:00:34.440 --> 00:00:36.520
<v Speaker0> Well, it definitely feels like you're referencing some kind

00:00:36.520 --> 00:00:37.680
<v Speaker0> of a declarative structure of like

00:00:37.680 --> 00:00:39.240
<v Speaker0> memory and so on.

00:00:39.240 --> 00:00:42.400
<v Speaker0> And then you're putting that together with your prompt and

00:00:42.400 --> 00:00:43.960
<v Speaker0> giving away some answers.

00:00:43.960 --> 00:00:49.100
<v Speaker1> How much of what you just said has been said by you before?

00:00:49.100 --> 00:00:50.400
<v Speaker0> Nothing basically, right?

00:00:50.400 --> 00:00:53.270
<v Speaker1> No, but if you actually look at all the words you've ever

00:00:53.270 --> 00:00:54.840
<v Speaker1> said in your life and you do a

00:00:54.840 --> 00:00:59.000
<v Speaker1> search you'll probably said a lot of the same words in the

00:00:59.000 --> 00:01:00.380
<v Speaker1> same order before.

00:01:00.380 --> 00:01:03.500
<v Speaker0> Yeah, could be.

srt file

1
00:00:00,000 --> 00:00:03,820
(speaker 0) Then these neural nets take on pretty surprising magical

2
00:00:03,820 --> 00:00:04,700
(speaker 0) properties.

3
00:00:04,700 --> 00:00:06,990
(speaker 0) I think it's kind of interesting how much you can get out

4
00:00:06,990 --> 00:00:08,440
(speaker 0) of even very simple mathematical

5
00:00:08,440 --> 00:00:09,440
(speaker 0) formalism.

6
00:00:09,440 --> 00:00:12,870
(speaker 1) When your brain right now is talking, is it doing next word

7
00:00:12,870 --> 00:00:14,380
(speaker 1) prediction or is it doing

8
00:00:14,380 --> 00:00:15,380
(speaker 1) something more interesting?

9
00:00:15,380 --> 00:00:18,520
(speaker 0) Well, it's definitely some kind of a generative model that

10
00:00:18,520 --> 00:00:20,520
(speaker 0)'s a GPT-like and prompted by you.

11
00:00:20,520 --> 00:00:23,360
(speaker 0) So you're giving me a prompt and I'm kind of like

12
00:00:23,360 --> 00:00:25,580
(speaker 0) responding to it in a generative way.

13
00:00:25,580 --> 00:00:27,740
(speaker 1) And by yourself perhaps a little bit?

14
00:00:27,740 --> 00:00:31,460
(speaker 1) Like are you adding extra prompts from your own memory

15
00:00:31,460 --> 00:00:32,440
(speaker 1) inside your head?

16
00:00:32,440 --> 00:00:33,440
(speaker 0) Hmm.

17
00:00:33,440 --> 00:00:34,440
(speaker 1) Or no?

18
00:00:34,440 --> 00:00:36,520
(speaker 0) Well, it definitely feels like you're referencing some kind

19
00:00:36,520 --> 00:00:37,680
(speaker 0) of a declarative structure of like

20
00:00:37,680 --> 00:00:39,240
(speaker 0) memory and so on.

21
00:00:39,240 --> 00:00:42,400
(speaker 0) And then you're putting that together with your prompt and

22
00:00:42,400 --> 00:00:43,960
(speaker 0) giving away some answers.

23
00:00:43,960 --> 00:00:49,100
(speaker 1) How much of what you just said has been said by you before?

24
00:00:49,100 --> 00:00:50,400
(speaker 0) Nothing basically, right?

25
00:00:50,400 --> 00:00:53,270
(speaker 1) No, but if you actually look at all the words you've ever

26
00:00:53,270 --> 00:00:54,840
(speaker 1) said in your life and you do a

27
00:00:54,840 --> 00:00:59,000
(speaker 1) search you'll probably said a lot of the same words in the

28
00:00:59,000 --> 00:01:00,380
(speaker 1) same order before.

29
00:01:00,380 --> 00:01:03,500
(speaker 0) Yeah, could be.

lrc file

[by:whisper.cpp]
[00:00.00](speaker 0) Then these neural nets take on pretty surprising magical
[00:03.82](speaker 0) properties.
[00:04.70](speaker 0) I think it's kind of interesting how much you can get out
[00:06.99](speaker 0) of even very simple mathematical
[00:08.44](speaker 0) formalism.
[00:09.44](speaker 1) When your brain right now is talking, is it doing next word
[00:12.87](speaker 1) prediction or is it doing
[00:14.38](speaker 1) something more interesting?
[00:15.38](speaker 0) Well, it's definitely some kind of a generative model that
[00:18.52](speaker 0)'s a GPT-like and prompted by you.
[00:20.52](speaker 0) So you're giving me a prompt and I'm kind of like
[00:23.36](speaker 0) responding to it in a generative way.
[00:25.58](speaker 1) And by yourself perhaps a little bit?
[00:27.74](speaker 1) Like are you adding extra prompts from your own memory
[00:31.46](speaker 1) inside your head?
[00:32.44](speaker 0) Hmm.
[00:33.44](speaker 1) Or no?
[00:34.44](speaker 0) Well, it definitely feels like you're referencing some kind
[00:36.52](speaker 0) of a declarative structure of like
[00:37.68](speaker 0) memory and so on.
[00:39.24](speaker 0) And then you're putting that together with your prompt and
[00:42.40](speaker 0) giving away some answers.
[00:43.96](speaker 1) How much of what you just said has been said by you before?
[00:49.10](speaker 0) Nothing basically, right?
[00:50.40](speaker 1) No, but if you actually look at all the words you've ever
[00:53.27](speaker 1) said in your life and you do a
[00:54.84](speaker 1) search you'll probably said a lot of the same words in the
[00:59.00](speaker 1) same order before.
[01:00.38](speaker 0) Yeah, could be.

karaoke video
diarize_sample_rework.mp4
csv file

start,end,speaker,text
0,3820,0," Then these neural nets take on pretty surprising magical"
3820,4700,0," properties."
4700,6990,0," I think it's kind of interesting how much you can get out"
6990,8440,0," of even very simple mathematical"
8440,9440,0," formalism."
9440,12870,1," When your brain right now is talking, is it doing next word"
12870,14380,1," prediction or is it doing"
14380,15380,1," something more interesting?"
15380,18520,0," Well, it's definitely some kind of a generative model that"
18520,20520,0,"'s a GPT-like and prompted by you."
20520,23360,0," So you're giving me a prompt and I'm kind of like"
23360,25580,0," responding to it in a generative way."
25580,27740,1," And by yourself perhaps a little bit?"
27740,31460,1," Like are you adding extra prompts from your own memory"
31460,32440,1," inside your head?"
32440,33440,0," Hmm."
33440,34440,1," Or no?"
34440,36520,0," Well, it definitely feels like you're referencing some kind"
36520,37680,0," of a declarative structure of like"
37680,39240,0," memory and so on."
39240,42400,0," And then you're putting that together with your prompt and"
42400,43960,0," giving away some answers."
43960,49100,1," How much of what you just said has been said by you before?"
49100,50400,0," Nothing basically, right?"
50400,53270,1," No, but if you actually look at all the words you've ever"
53270,54840,1," said in your life and you do a"
54840,59000,1," search you'll probably said a lot of the same words in the"
59000,60380,1," same order before."
60380,63500,0," Yeah, could be."

json file
{
	"systeminfo": "AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 1 | ",
	"model": {
		"type": "medium",
		"multilingual": false,
		"vocab": 51864,
		"audio": {
			"ctx": 1500,
			"state": 1024,
			"head": 16,
			"layer": 24
		},
		"text": {
			"ctx": 448,
			"state": 1024,
			"head": 16,
			"layer": 24
		},
		"mels": 80,
		"ftype": 1
	},
	"params": {
		"model": "/Users/administrator/Code/whisper.cpp/models/ggml-medium.en.bin",
		"language": "en",
		"translate": false
	},
	"result": {
		"language": "en"
	},
	"transcription": [
		{
			"timestamps": {
				"from": "00:00:00,000",
				"to": "00:00:03,820"
			},
			"offsets": {
				"from": 0,
				"to": 3820
			},
			"text": " Then these neural nets take on pretty surprising magical",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:03,820",
				"to": "00:00:04,700"
			},
			"offsets": {
				"from": 3820,
				"to": 4700
			},
			"text": " properties.",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:04,700",
				"to": "00:00:06,990"
			},
			"offsets": {
				"from": 4700,
				"to": 6990
			},
			"text": " I think it's kind of interesting how much you can get out",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:06,990",
				"to": "00:00:08,440"
			},
			"offsets": {
				"from": 6990,
				"to": 8440
			},
			"text": " of even very simple mathematical",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:08,440",
				"to": "00:00:09,440"
			},
			"offsets": {
				"from": 8440,
				"to": 9440
			},
			"text": " formalism.",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:09,440",
				"to": "00:00:12,870"
			},
			"offsets": {
				"from": 9440,
				"to": 12870
			},
			"text": " When your brain right now is talking, is it doing next word",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:12,870",
				"to": "00:00:14,380"
			},
			"offsets": {
				"from": 12870,
				"to": 14380
			},
			"text": " prediction or is it doing",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:14,380",
				"to": "00:00:15,380"
			},
			"offsets": {
				"from": 14380,
				"to": 15380
			},
			"text": " something more interesting?",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:15,380",
				"to": "00:00:18,520"
			},
			"offsets": {
				"from": 15380,
				"to": 18520
			},
			"text": " Well, it's definitely some kind of a generative model that",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:18,520",
				"to": "00:00:20,520"
			},
			"offsets": {
				"from": 18520,
				"to": 20520
			},
			"text": "'s a GPT-like and prompted by you.",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:20,520",
				"to": "00:00:23,360"
			},
			"offsets": {
				"from": 20520,
				"to": 23360
			},
			"text": " So you're giving me a prompt and I'm kind of like",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:23,360",
				"to": "00:00:25,580"
			},
			"offsets": {
				"from": 23360,
				"to": 25580
			},
			"text": " responding to it in a generative way.",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:25,580",
				"to": "00:00:27,740"
			},
			"offsets": {
				"from": 25580,
				"to": 27740
			},
			"text": " And by yourself perhaps a little bit?",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:27,740",
				"to": "00:00:31,460"
			},
			"offsets": {
				"from": 27740,
				"to": 31460
			},
			"text": " Like are you adding extra prompts from your own memory",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:31,460",
				"to": "00:00:32,440"
			},
			"offsets": {
				"from": 31460,
				"to": 32440
			},
			"text": " inside your head?",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:32,440",
				"to": "00:00:33,440"
			},
			"offsets": {
				"from": 32440,
				"to": 33440
			},
			"text": " Hmm.",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:33,440",
				"to": "00:00:34,440"
			},
			"offsets": {
				"from": 33440,
				"to": 34440
			},
			"text": " Or no?",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:34,440",
				"to": "00:00:36,520"
			},
			"offsets": {
				"from": 34440,
				"to": 36520
			},
			"text": " Well, it definitely feels like you're referencing some kind",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:36,520",
				"to": "00:00:37,680"
			},
			"offsets": {
				"from": 36520,
				"to": 37680
			},
			"text": " of a declarative structure of like",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:37,680",
				"to": "00:00:39,240"
			},
			"offsets": {
				"from": 37680,
				"to": 39240
			},
			"text": " memory and so on.",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:39,240",
				"to": "00:00:42,400"
			},
			"offsets": {
				"from": 39240,
				"to": 42400
			},
			"text": " And then you're putting that together with your prompt and",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:42,400",
				"to": "00:00:43,960"
			},
			"offsets": {
				"from": 42400,
				"to": 43960
			},
			"text": " giving away some answers.",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:43,960",
				"to": "00:00:49,100"
			},
			"offsets": {
				"from": 43960,
				"to": 49100
			},
			"text": " How much of what you just said has been said by you before?",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:49,100",
				"to": "00:00:50,400"
			},
			"offsets": {
				"from": 49100,
				"to": 50400
			},
			"text": " Nothing basically, right?",
			"speaker": "0"
		},
		{
			"timestamps": {
				"from": "00:00:50,400",
				"to": "00:00:53,270"
			},
			"offsets": {
				"from": 50400,
				"to": 53270
			},
			"text": " No, but if you actually look at all the words you've ever",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:53,270",
				"to": "00:00:54,840"
			},
			"offsets": {
				"from": 53270,
				"to": 54840
			},
			"text": " said in your life and you do a",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:54,840",
				"to": "00:00:59,000"
			},
			"offsets": {
				"from": 54840,
				"to": 59000
			},
			"text": " search you'll probably said a lot of the same words in the",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:00:59,000",
				"to": "00:01:00,380"
			},
			"offsets": {
				"from": 59000,
				"to": 60380
			},
			"text": " same order before.",
			"speaker": "1"
		},
		{
			"timestamps": {
				"from": "00:01:00,380",
				"to": "00:01:03,500"
			},
			"offsets": {
				"from": 60380,
				"to": 63500
			},
			"text": " Yeah, could be.",
			"speaker": "0"
		}
	]
}
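
Since the intermediate formats carry only the bare speaker ID, a downstream program can attach whatever labels it wants. A hypothetical post-processing sketch (the ID-to-name mapping is assumed for this excerpt, not produced by the PR):

#include <cstdio>
#include <map>
#include <string>

int main() {
    // Hypothetical mapping from bare speaker IDs (as in the CSV/JSON output)
    // to display names chosen by the consuming program.
    const std::map<std::string, std::string> names = {
        {"0", "Andrej Karpathy"},
        {"1", "Lex Fridman"},
    };

    // One row from the CSV sample above: start,end,speaker,text
    const std::string speaker = "1";
    const std::string text    = " How much of what you just said has been said by you before?";

    std::printf("[%s]%s\n", names.at(speaker).c_str(), text.c_str());
    return 0;
}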

ggerganov merged commit 14baf2e into ggerganov:master on Jun 25, 2023
18 checks passed
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
…v#1031)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023
…v#1031)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>