diarization: add diarization support for all current output types #1031
Addressing #1020 with a first pass at getting diarization labeling working for all current output types.
This extracts all diarization code into its own reusable function (I'd welcome any improvements to the function name: estimate_diarization_speaker 😅).
This updates all output formats to include diarization labeling where applicable. For intermediate formats (JSON, CSV, etc.) I think it makes sense to include only the speaker ID rather than a formatted string (e.g. "1" vs. "(speaker 1)"), so the consuming program or system can decide how to handle the speaker label. For all of the more "final" formats, the formatted speaker string is included.
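For anyone skimming the diff, the idea behind the shared helper is simple channel-energy comparison on stereo input: whichever channel carries clearly more energy over a segment's time range is assumed to be that speaker. A minimal sketch of that idea (hypothetical signature and thresholds, not necessarily the PR's exact code):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <string>
#include <vector>

// Sketch of an energy-based two-speaker estimator for stereo audio.
// pcmf32s holds one float buffer per channel; t0/t1 are centisecond
// timestamps, matching whisper.cpp's segment timestamp units.
std::string estimate_diarization_speaker(
        const std::vector<std::vector<float>> & pcmf32s,
        int64_t t0, int64_t t1,
        int sample_rate = 16000) {
    const int64_t n = (int64_t) pcmf32s[0].size();

    // convert centisecond timestamps to sample indices, clamped to the buffer
    int64_t is0 = std::min(t0 * sample_rate / 100, n);
    int64_t is1 = std::min(t1 * sample_rate / 100, n);

    double e0 = 0.0, e1 = 0.0;
    for (int64_t i = is0; i < is1; ++i) {
        e0 += std::fabs(pcmf32s[0][i]);
        e1 += std::fabs(pcmf32s[1][i]);
    }

    // whichever channel has clearly more energy wins; otherwise unknown
    if (e0 > 1.1 * e1) return "0";
    if (e1 > 1.1 * e0) return "1";
    return "?";
}
```

Returning the bare ID string also makes it easy for each output writer to wrap it however that format needs ("(speaker 0)", "<v Speaker0>", a CSV column, etc.).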
Examples of all output formats from a 1-minute excerpt of Lex Fridman's interview of Andrej Karpathy:
text file
(speaker 0) Then these neural nets take on pretty surprising magical
(speaker 0) properties.
(speaker 0) I think it's kind of interesting how much you can get out
(speaker 0) of even very simple mathematical
(speaker 0) formalism.
(speaker 1) When your brain right now is talking, is it doing next word
(speaker 1) prediction or is it doing
(speaker 1) something more interesting?
(speaker 0) Well, it's definitely some kind of a generative model that
(speaker 0)'s a GPT-like and prompted by you.
(speaker 0) So you're giving me a prompt and I'm kind of like
(speaker 0) responding to it in a generative way.
(speaker 1) And by yourself perhaps a little bit?
(speaker 1) Like are you adding extra prompts from your own memory
(speaker 1) inside your head?
(speaker 0) Hmm.
(speaker 1) Or no?
(speaker 0) Well, it definitely feels like you're referencing some kind
(speaker 0) of a declarative structure of like
(speaker 0) memory and so on.
(speaker 0) And then you're putting that together with your prompt and
(speaker 0) giving away some answers.
(speaker 1) How much of what you just said has been said by you before?
(speaker 0) Nothing basically, right?
(speaker 1) No, but if you actually look at all the words you've ever
(speaker 1) said in your life and you do a
(speaker 1) search you'll probably said a lot of the same words in the
(speaker 1) same order before.
(speaker 0) Yeah, could be.
vtt file
WEBVTT

00:00:00.000 --> 00:00:03.820
<v Speaker0> Then these neural nets take on pretty surprising magical
00:00:03.820 --> 00:00:04.700
<v Speaker0> properties.
00:00:04.700 --> 00:00:06.990
<v Speaker0> I think it's kind of interesting how much you can get out
00:00:06.990 --> 00:00:08.440
<v Speaker0> of even very simple mathematical
00:00:08.440 --> 00:00:09.440
<v Speaker0> formalism.
00:00:09.440 --> 00:00:12.870
<v Speaker1> When your brain right now is talking, is it doing next word
00:00:12.870 --> 00:00:14.380
<v Speaker1> prediction or is it doing
00:00:14.380 --> 00:00:15.380
<v Speaker1> something more interesting?
00:00:15.380 --> 00:00:18.520
<v Speaker0> Well, it's definitely some kind of a generative model that
00:00:18.520 --> 00:00:20.520
<v Speaker0>'s a GPT-like and prompted by you.
00:00:20.520 --> 00:00:23.360
<v Speaker0> So you're giving me a prompt and I'm kind of like
00:00:23.360 --> 00:00:25.580
<v Speaker0> responding to it in a generative way.
00:00:25.580 --> 00:00:27.740
<v Speaker1> And by yourself perhaps a little bit?
00:00:27.740 --> 00:00:31.460
<v Speaker1> Like are you adding extra prompts from your own memory
00:00:31.460 --> 00:00:32.440
<v Speaker1> inside your head?
00:00:32.440 --> 00:00:33.440
<v Speaker0> Hmm.
00:00:33.440 --> 00:00:34.440
<v Speaker1> Or no?
00:00:34.440 --> 00:00:36.520
<v Speaker0> Well, it definitely feels like you're referencing some kind
00:00:36.520 --> 00:00:37.680
<v Speaker0> of a declarative structure of like
00:00:37.680 --> 00:00:39.240
<v Speaker0> memory and so on.
00:00:39.240 --> 00:00:42.400
<v Speaker0> And then you're putting that together with your prompt and
00:00:42.400 --> 00:00:43.960
<v Speaker0> giving away some answers.
00:00:43.960 --> 00:00:49.100
<v Speaker1> How much of what you just said has been said by you before?
00:00:49.100 --> 00:00:50.400
<v Speaker0> Nothing basically, right?
00:00:50.400 --> 00:00:53.270
<v Speaker1> No, but if you actually look at all the words you've ever
00:00:53.270 --> 00:00:54.840
<v Speaker1> said in your life and you do a
00:00:54.840 --> 00:00:59.000
<v Speaker1> search you'll probably said a lot of the same words in the
00:00:59.000 --> 00:01:00.380
<v Speaker1> same order before.
00:01:00.380 --> 00:01:03.500
<v Speaker0> Yeah, could be.
srt file
1
00:00:00,000 --> 00:00:03,820
(speaker 0) Then these neural nets take on pretty surprising magical
2
00:00:03,820 --> 00:00:04,700
(speaker 0) properties.
3
00:00:04,700 --> 00:00:06,990
(speaker 0) I think it's kind of interesting how much you can get out
4
00:00:06,990 --> 00:00:08,440
(speaker 0) of even very simple mathematical
5
00:00:08,440 --> 00:00:09,440
(speaker 0) formalism.
6
00:00:09,440 --> 00:00:12,870
(speaker 1) When your brain right now is talking, is it doing next word
7
00:00:12,870 --> 00:00:14,380
(speaker 1) prediction or is it doing
8
00:00:14,380 --> 00:00:15,380
(speaker 1) something more interesting?
9
00:00:15,380 --> 00:00:18,520
(speaker 0) Well, it's definitely some kind of a generative model that
10
00:00:18,520 --> 00:00:20,520
(speaker 0)'s a GPT-like and prompted by you.
11
00:00:20,520 --> 00:00:23,360
(speaker 0) So you're giving me a prompt and I'm kind of like
12
00:00:23,360 --> 00:00:25,580
(speaker 0) responding to it in a generative way.
13
00:00:25,580 --> 00:00:27,740
(speaker 1) And by yourself perhaps a little bit?
14
00:00:27,740 --> 00:00:31,460
(speaker 1) Like are you adding extra prompts from your own memory
15
00:00:31,460 --> 00:00:32,440
(speaker 1) inside your head?
16
00:00:32,440 --> 00:00:33,440
(speaker 0) Hmm.
17
00:00:33,440 --> 00:00:34,440
(speaker 1) Or no?
18
00:00:34,440 --> 00:00:36,520
(speaker 0) Well, it definitely feels like you're referencing some kind
19
00:00:36,520 --> 00:00:37,680
(speaker 0) of a declarative structure of like
20
00:00:37,680 --> 00:00:39,240
(speaker 0) memory and so on.
21
00:00:39,240 --> 00:00:42,400
(speaker 0) And then you're putting that together with your prompt and
22
00:00:42,400 --> 00:00:43,960
(speaker 0) giving away some answers.
23
00:00:43,960 --> 00:00:49,100
(speaker 1) How much of what you just said has been said by you before?
24
00:00:49,100 --> 00:00:50,400
(speaker 0) Nothing basically, right?
25
00:00:50,400 --> 00:00:53,270
(speaker 1) No, but if you actually look at all the words you've ever
26
00:00:53,270 --> 00:00:54,840
(speaker 1) said in your life and you do a
27
00:00:54,840 --> 00:00:59,000
(speaker 1) search you'll probably said a lot of the same words in the
28
00:00:59,000 --> 00:01:00,380
(speaker 1) same order before.
29
00:01:00,380 --> 00:01:03,500
(speaker 0) Yeah, could be.
lrc file
[by:whisper.cpp]
[00:00.00](speaker 0) Then these neural nets take on pretty surprising magical
[00:03.82](speaker 0) properties.
[00:04.70](speaker 0) I think it's kind of interesting how much you can get out
[00:06.99](speaker 0) of even very simple mathematical
[00:08.44](speaker 0) formalism.
[00:09.44](speaker 1) When your brain right now is talking, is it doing next word
[00:12.87](speaker 1) prediction or is it doing
[00:14.38](speaker 1) something more interesting?
[00:15.38](speaker 0) Well, it's definitely some kind of a generative model that
[00:18.52](speaker 0)'s a GPT-like and prompted by you.
[00:20.52](speaker 0) So you're giving me a prompt and I'm kind of like
[00:23.36](speaker 0) responding to it in a generative way.
[00:25.58](speaker 1) And by yourself perhaps a little bit?
[00:27.74](speaker 1) Like are you adding extra prompts from your own memory
[00:31.46](speaker 1) inside your head?
[00:32.44](speaker 0) Hmm.
[00:33.44](speaker 1) Or no?
[00:34.44](speaker 0) Well, it definitely feels like you're referencing some kind
[00:36.52](speaker 0) of a declarative structure of like
[00:37.68](speaker 0) memory and so on.
[00:39.24](speaker 0) And then you're putting that together with your prompt and
[00:42.40](speaker 0) giving away some answers.
[00:43.96](speaker 1) How much of what you just said has been said by you before?
[00:49.10](speaker 0) Nothing basically, right?
[00:50.40](speaker 1) No, but if you actually look at all the words you've ever
[00:53.27](speaker 1) said in your life and you do a
[00:54.84](speaker 1) search you'll probably said a lot of the same words in the
[00:59.00](speaker 1) same order before.
[01:00.38](speaker 0) Yeah, could be.
karaoke video
diarize_sample_rework.mp4
csv file
start,end,speaker,text
0,3820,0," Then these neural nets take on pretty surprising magical"
3820,4700,0," properties."
4700,6990,0," I think it's kind of interesting how much you can get out"
6990,8440,0," of even very simple mathematical"
8440,9440,0," formalism."
9440,12870,1," When your brain right now is talking, is it doing next word"
12870,14380,1," prediction or is it doing"
14380,15380,1," something more interesting?"
15380,18520,0," Well, it's definitely some kind of a generative model that"
18520,20520,0,"'s a GPT-like and prompted by you."
20520,23360,0," So you're giving me a prompt and I'm kind of like"
23360,25580,0," responding to it in a generative way."
25580,27740,1," And by yourself perhaps a little bit?"
27740,31460,1," Like are you adding extra prompts from your own memory"
31460,32440,1," inside your head?"
32440,33440,0," Hmm."
33440,34440,1," Or no?"
34440,36520,0," Well, it definitely feels like you're referencing some kind"
36520,37680,0," of a declarative structure of like"
37680,39240,0," memory and so on."
39240,42400,0," And then you're putting that together with your prompt and"
42400,43960,0," giving away some answers."
43960,49100,1," How much of what you just said has been said by you before?"
49100,50400,0," Nothing basically, right?"
50400,53270,1," No, but if you actually look at all the words you've ever"
53270,54840,1," said in your life and you do a"
54840,59000,1," search you'll probably said a lot of the same words in the"
59000,60380,1," same order before."
60380,63500,0," Yeah, could be."
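Because the CSV carries the raw speaker ID rather than a pre-formatted label, a downstream consumer can render it however it likes. A minimal sketch of such a consumer (naive parsing that assumes the quoted text field is last and contains no embedded quotes; Row, parse_row, and format_label are illustrative names, not part of the PR):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// One row of the CSV output: start/end in milliseconds, raw speaker ID, text.
struct Row { long start; long end; int speaker; std::string text; };

Row parse_row(const std::string & line) {
    Row r;
    std::istringstream ss(line);
    std::string field;
    std::getline(ss, field, ','); r.start   = std::stol(field);
    std::getline(ss, field, ','); r.end     = std::stol(field);
    std::getline(ss, field, ','); r.speaker = std::stoi(field);
    std::getline(ss, r.text);  // rest of the line, still quoted
    if (r.text.size() >= 2) {
        r.text = r.text.substr(1, r.text.size() - 2);  // strip quotes
    }
    return r;
}

// The consumer decides how to present the speaker ID.
std::string format_label(int speaker) {
    return "(speaker " + std::to_string(speaker) + ")";
}
```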
json file