whisper : support speaker segmentation (local diarization) of mono audio via tinydiarize #1058

Merged
merged 17 commits into from Jul 4, 2023

akashmjn
Contributor

@akashmjn akashmjn commented Jun 27, 2023

As discussed in #64, this PR adds experimental support for local diarization (marking of speaker turns) via integration of checkpoints from this project https://github.com/akashmjn/tinydiarize/tree/main.

This is an early functional prototype done for the small.en models.

@ggerganov - this should be functionally done save for the last two points on the checklist, for which I'd appreciate some comments on the right way to expose this.

(also please excuse my C++, I haven't written a lot of it, so this is heavily Copilot-assisted 😉)

Screenshot 2023-05-27 at 7 15 46 AM

Example usage

make
./models/download-ggml-model.sh small.en-tdrz

make samples
./main -m models/ggml-small.en-tdrz.bin -f samples/a13.wav

After running the above, you should see this:

Screenshot 2023-06-20 at 11 29 32 AM

JSON output contains an extra speaker_turn_next field for each segment with this information.

Example JSON output
{
	"systeminfo": "AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | COREML = 0 | ",
	"model": {
		"type": "small",
		"multilingual": false,
		"vocab": 51864,
		"audio": {
			"ctx": 1500,
			"state": 768,
			"head": 12,
			"layer": 12
		},
		"text": {
			"ctx": 448,
			"state": 768,
			"head": 12,
			"layer": 12
		},
		"mels": 80,
		"ftype": 1
	},
	"params": {
		"model": "models/whisper-small.en.tdrz/ggml-small.en-tdrz.bin",
		"language": "en",
		"translate": false
	},
	"result": {
		"language": "en"
	},
	"transcription": [
		{
			"timestamps": {
				"from": "00:00:00,000",
				"to": "00:00:03,800"
			},
			"offsets": {
				"from": 0,
				"to": 3800
			},
			"text": " Okay Houston, we've had a problem here. [SPEAKER TURN]",
			"speaker_turn_next": true
		},
                ...
	]
}
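
A minimal sketch of how the speaker_turn_next field might be consumed downstream, assuming the JSON layout shown above (a top-level "transcription" list whose items carry "text" and "speaker_turn_next"). The function name group_into_turns and the inline sample data are hypothetical and only for illustration; the rendered marker is stripped from the text because the boolean already carries that information.

```python
import json
import re

# Matches the rendered marker in either spelling seen in this thread.
MARKER = re.compile(r"\s*\[SPEAKER[ _]TURN\]")

def group_into_turns(segments):
    """Group consecutive segments into turns; a turn ends after a segment
    whose speaker_turn_next flag is true."""
    turns, current = [], []
    for seg in segments:
        current.append(MARKER.sub("", seg["text"]).strip())
        if seg.get("speaker_turn_next", False):
            turns.append(" ".join(current))
            current = []
    if current:  # trailing segments with no closing turn marker
        turns.append(" ".join(current))
    return turns

# Hypothetical sample mirroring the "transcription" array above.
sample = json.loads("""
{"transcription": [
  {"text": " Okay Houston, we've had a problem here. [SPEAKER TURN]", "speaker_turn_next": true},
  {"text": " This is Houston. Say again please. [SPEAKER TURN]", "speaker_turn_next": true},
  {"text": " Uh Houston we've had a problem.", "speaker_turn_next": false},
  {"text": " We've had a main beam up on a volt. [SPEAKER TURN]", "speaker_turn_next": true}
]}
""")

turns = group_into_turns(sample["transcription"])
for i, turn in enumerate(turns):
    print(f"turn {i}: {turn}")
```

Note that this only groups text by turn boundaries; it does not assign speaker identities, which would need the clustering step discussed later in the thread.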

Checklist:

Some terminology context for the last two points: this is technically not complete diarization yet, but speaker segmentation https://www.perplexity.ai/search/d01e6743-d2dc-4f5e-b5c2-2bf2212068f7?s=u (which can be thought of as local diarization).
Also technically the stereo audio input used by the current --diarize flag is already diarized (as it is separated into individual channels), so the naming isn't strictly consistent here either?

@akashmjn akashmjn changed the title from "whisper: support speaker segmentation (local diarization) of mono audio via integration of tinydiarize" to "whisper: support speaker segmentation (local diarization) of mono audio via tinydiarize" on Jun 27, 2023
@JianbangZ

Does this support multi language or just English?

@Harith163

Excited! Will this support multiple speaker labelling or will it just mark speaker turns?

@akashmjn
Contributor Author

akashmjn commented Jun 30, 2023

Hi @Harith163 and @JianbangZ:

  • at the moment, just speaker turns and no clustering
  • this PR is merging a PoC done for the small.en models, so English-only

Both of these are doable I think, but they are a little more involved and honestly depend on how the project evolves.

For multilingual - I think it's easiest done by OpenAI themselves, since ultimately that boils down to a reasonably multilingual finetuning dataset, and I'm pretty sure all released Whisper models had a final finetuning stage.

I'd say clustering has fewer dependencies and is a bit more tractable. I will sketch a rough plan for that once a few immediate things are done.

You can take a look at the immediate roadmap over at https://github.com/akashmjn/tinydiarize/tree/main#roadmap.

@akashmjn
Contributor Author

In fact @ggerganov I notice that you've already implemented C-means by hand in cpp here #130 😅 . Once I free up a little, I'll try running some clustering experiments over on the python repo.

In the meantime if you are interested, this is the best method out there NME-SC:

@ggerganov
Owner

Yes :) Felt like doing some experiments (I cannot guarantee correctness of that implementation)

Btw, will be reviewing the PR over the weekend. Adding a diarization flag should be easy

@akashmjn
Contributor Author

akashmjn commented Jul 2, 2023

> Yes :) Felt like doing some experiments (I cannot guarantee correctness of that implementation)
>
> Btw, will be reviewing the PR over the weekend. Adding a diarization flag should be easy

Sounds good! For the last two points on my checklist - for now, I'll wait for your review. I've left //TODO@Akash at places where the behaviour needs to be toggled. If you find it more efficient, feel free to directly modify the PR however you think best to expose this feature.

I think it should just be clear to the user that this is an experimental feature and requires using a specific *.tdrz checkpoint.

@ggerganov
Owner

I synced the latest ggml from llama.cpp, and tomorrow I will add the config option for tinydiarize and merge

@ohmguru

ohmguru commented Jul 3, 2023

Excited to see this PR merged. I noticed that this PR doesn't yet support the word-level timestamp flag. I wanted to flag that for consideration, as word-level timestamps are quite helpful when building applications that show diarization output.

@ggerganov
Owner

@akashmjn

This should be ready to merge now. Please take a look at my changes and let me know if you agree.
For now, let's leave the stereo "diarize" flag as it is - will rename it later to reflect what it actually does.

The most important change is that I added token_tdrz and kept token_solm as it is.

Also, you now have to add the -tdrz flag to explicitly enable speaker turn detection even when using tinydiarize models.
The flag should not do anything if the model used is not a tinydiarize one.

$ ./main -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin -tdrz

main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, tdrz = 1, timestamps = 1 ...

[00:00:00.000 --> 00:00:03.800]   Okay Houston, we've had a problem here. [SPEAKER_TURN]
[00:00:03.800 --> 00:00:06.200]   This is Houston. Say again please. [SPEAKER_TURN]
[00:00:06.200 --> 00:00:08.260]   Uh Houston we've had a problem.
[00:00:08.260 --> 00:00:11.320]   We've had a main beam up on a volt. [SPEAKER_TURN]
[00:00:11.320 --> 00:00:13.820]   Roger main beam interval. [SPEAKER_TURN]
[00:00:13.820 --> 00:00:15.100]   Uh uh [SPEAKER_TURN]
[00:00:15.100 --> 00:00:18.020]   So okay stand, by thirteen we're looking at it. [SPEAKER_TURN]
[00:00:18.020 --> 00:00:25.740]   Okay uh right now uh Houston the uh voltage is uh is looking good um.
[00:00:27.620 --> 00:00:29.940]   And we had a a pretty large bank or so.
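
For anyone post-processing this terminal output rather than the JSON, here is a minimal sketch of parsing one printed segment line into its parts. It assumes the line format shown above (a bracketed "start --> end" pair followed by the text, with an optional trailing [SPEAKER_TURN] marker); the names LINE, MARKER and parse_line are hypothetical helpers, not part of whisper.cpp.

```python
import re

# One segment line: "[HH:MM:SS.mmm --> HH:MM:SS.mmm]   text"
LINE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*?)\s*$"
)
MARKER = "[SPEAKER_TURN]"

def parse_line(line):
    """Return (start, end, text, turn_next), or None if the line is not a segment."""
    m = LINE.match(line)
    if m is None:
        return None
    start, end, text = m.groups()
    turn_next = text.endswith(MARKER)
    if turn_next:
        # Drop the marker; the boolean now carries that information.
        text = text[: -len(MARKER)].rstrip()
    return start, end, text, turn_next

print(parse_line(
    "[00:00:03.800 --> 00:00:06.200]   This is Houston. Say again please. [SPEAKER_TURN]"
))
```

Lines that are not segments, such as the "main: processing ..." banner, return None and can simply be skipped.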

Here is without it:

$ ./main -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin

main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:03.760]   Okay Houston, we've had a problem here.
[00:00:03.760 --> 00:00:08.340]   Uh Houston we've had a problem.
[00:00:08.340 --> 00:00:11.320]   We've had a main beam up on a volt.
[00:00:11.320 --> 00:00:13.760]   Roger main beam interval.
[00:00:13.760 --> 00:00:17.960]   So okay stand, by thirteen we're looking at it.
[00:00:17.960 --> 00:00:25.740]   Okay uh right now uh Houston the uh voltage is uh is looking good um.
[00:00:27.620 --> 00:00:29.940]   And we had a a pretty large bank or so.

Here is word-level timestamps with speaker turn detection:

$ ./main -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin -ml 1 -sow -tdrz

main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, tdrz = 1, timestamps = 1 ...

[00:00:00.000 --> 00:00:00.060]  
[00:00:00.060 --> 00:00:00.500]   Okay
[00:00:00.500 --> 00:00:01.340]   Houston,
[00:00:01.340 --> 00:00:01.850]   we've
[00:00:01.850 --> 00:00:02.160]   had
[00:00:02.160 --> 00:00:02.260]   a
[00:00:02.260 --> 00:00:02.990]   problem
[00:00:02.990 --> 00:00:03.800]   here. [SPEAKER_TURN]
[00:00:03.800 --> 00:00:04.030]   This
[00:00:04.030 --> 00:00:04.140]   is
[00:00:04.140 --> 00:00:04.710]   Houston.
[00:00:04.710 --> 00:00:04.880]   Say
[00:00:04.880 --> 00:00:05.170]   again
[00:00:05.170 --> 00:00:06.200]   please. [SPEAKER_TURN]
[00:00:06.200 --> 00:00:06.340]   Uh
[00:00:06.340 --> 00:00:06.850]   Houston
[00:00:06.850 --> 00:00:07.210]   we've
[00:00:07.210 --> 00:00:07.430]   had
[00:00:07.430 --> 00:00:07.530]   a
[00:00:07.530 --> 00:00:08.260]   problem.
[00:00:08.260 --> 00:00:08.770]   We've
[00:00:08.770 --> 00:00:09.080]   had
[00:00:09.080 --> 00:00:09.180]   a
[00:00:09.180 --> 00:00:09.610]   main
[00:00:09.610 --> 00:00:10.000]   beam
[00:00:10.000 --> 00:00:10.200]   up
[00:00:10.200 --> 00:00:10.400]   on
[00:00:10.400 --> 00:00:10.500]   a
[00:00:10.500 --> 00:00:11.320]   volt. [SPEAKER_TURN]
[00:00:11.320 --> 00:00:11.840]   Roger
[00:00:11.840 --> 00:00:12.250]   main
[00:00:12.250 --> 00:00:12.740]   beam
[00:00:12.740 --> 00:00:13.820]   interval. [SPEAKER_TURN]
[00:00:13.820 --> 00:00:15.080]   Uh
[00:00:15.080 --> 00:00:15.100]   uh [SPEAKER_TURN]
[00:00:15.100 --> 00:00:15.230]   So
[00:00:15.230 --> 00:00:15.500]   okay
[00:00:15.500 --> 00:00:15.970]   stand,
[00:00:15.970 --> 00:00:16.100]   by
[00:00:16.100 --> 00:00:16.660]   thirteen
[00:00:16.660 --> 00:00:16.980]   we're
[00:00:16.980 --> 00:00:17.460]   looking
[00:00:17.460 --> 00:00:17.610]   at
[00:00:17.610 --> 00:00:18.020]   it. [SPEAKER_TURN]
[00:00:18.020 --> 00:00:18.570]   Okay
[00:00:18.570 --> 00:00:18.840]   uh
[00:00:18.840 --> 00:00:19.530]   right
[00:00:19.530 --> 00:00:19.940]   now
[00:00:19.940 --> 00:00:20.210]   uh
[00:00:20.210 --> 00:00:21.170]   Houston
[00:00:21.170 --> 00:00:21.580]   the
[00:00:21.580 --> 00:00:21.850]   uh
[00:00:21.850 --> 00:00:22.810]   voltage
[00:00:22.810 --> 00:00:23.080]   is
[00:00:23.080 --> 00:00:23.400]   uh
[00:00:23.400 --> 00:00:23.730]   is
[00:00:23.730 --> 00:00:24.810]   looking
[00:00:24.810 --> 00:00:25.440]   good
[00:00:25.440 --> 00:00:25.740]   um.
[00:00:27.620 --> 00:00:27.670]  
[00:00:27.670 --> 00:00:27.840]   And
[00:00:27.840 --> 00:00:27.980]   we
[00:00:27.980 --> 00:00:28.210]   had
[00:00:28.210 --> 00:00:28.270]   a
[00:00:28.270 --> 00:00:28.340]   a
[00:00:28.340 --> 00:00:28.780]   pretty
[00:00:28.780 --> 00:00:29.150]   large
[00:00:29.150 --> 00:00:29.440]   bank
[00:00:29.440 --> 00:00:29.580]   or
[00:00:29.580 --> 00:00:29.940]   so.
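
As a rough illustration of why word-level output combines well with turn markers, the sketch below merges word entries back into one timed span per turn. It assumes each word has already been parsed into a (start, end, text, turn_next) tuple; to_seconds and merge_words are hypothetical helper names, and the sample list is an abridged excerpt of the output above.

```python
def to_seconds(ts):
    """Convert 'HH:MM:SS.mmm' to seconds as a float."""
    h, m, s = ts.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

def merge_words(words):
    """Merge (start, end, text, turn_next) word entries into one
    (start_s, end_s, text) span per speaker turn."""
    spans, buf = [], []
    span_start = span_end = None
    for start, end, text, turn_next in words:
        if text:  # skip the empty segments that appear in the output
            if not buf:
                span_start = to_seconds(start)
            buf.append(text)
            span_end = to_seconds(end)
        if turn_next and buf:
            spans.append((span_start, span_end, " ".join(buf)))
            buf = []
    if buf:  # trailing words with no closing turn marker
        spans.append((span_start, span_end, " ".join(buf)))
    return spans

# Abridged, hand-copied excerpt of the word-level output above.
words = [
    ("00:00:00.000", "00:00:00.060", "", False),
    ("00:00:00.060", "00:00:00.500", "Okay", False),
    ("00:00:00.500", "00:00:01.340", "Houston,", False),
    ("00:00:02.990", "00:00:03.800", "here.", True),
    ("00:00:03.800", "00:00:04.030", "This", False),
    ("00:00:04.030", "00:00:04.140", "is", False),
    ("00:00:04.140", "00:00:04.710", "Houston.", False),
]

spans = merge_words(words)
for start_s, end_s, text in spans:
    print(f"{start_s:7.2f} - {end_s:7.2f}  {text}")
```

Each span starts at its first non-empty word and ends at its last one, so the silent gaps between turns are not counted toward either speaker.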

Contributor Author

@akashmjn akashmjn left a comment

Added some comments relating to some tricky token ID stuff

JEF1056 added a commit to JEF1056/whisper.rn that referenced this pull request Sep 29, 2023
Enables tinydiarize models ggerganov/whisper.cpp#1058
@tingyuchang

@karolszafranski I don't think any special settings are needed: set tdrz_enable to true, and you can read the flag for each segment with whisper_full_get_segment_speaker_turn_next.

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
…dio via tinydiarize (ggerganov#1058)

* add HuggingFace mirror to download  ggml model

* support tdrz via simple hack overriding solm tokens

* fix incorrect translate/transcribe token_ids that are not static const

* add apollo 13 sample for tdrz demo

* render [SPEAKER TURN] consistently in all terminal output using vocab.id_to_token

* extend whisper_segment with speaker_turn_next field and save in json output

* fix failing go build

* slipped in some python syntax whoops

* whisper : finalize tinydiarize support (add flag + fixes)

* whisper : tdrz support for word-level timestamps (respect max_len)

* java : try to fix tests after adding tdrz_enable flag

* main : remove TODO leftover

* java : fix params order list after adding "tdrz_enable"

* whisper : fix solm and add nosp token

* main : print tinydiarize help

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@khimaros

khimaros commented Dec 1, 2023

i'm not sure if this is expected, but with medium.en-q5_0, i'm seeing that speaker turns are pretty reliably marked with >>. i'm not using the --diarize or --tdrz flags.

i wasn't seeing this behavior with large-v2, large-v3, or large-v3-q5_0. any thoughts on why that would be happening?

landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023
…dio via tinydiarize (ggerganov#1058)

@rben01

rben01 commented Jan 31, 2024

Is there a way to use this with coreml models?

whisper_init_from_file_with_params_no_state: loading model from './models/ggml-small.en-tdrz.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3 (small)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   464.64 MiB, (  466.27 / 10922.67)
whisper_model_load:    Metal buffer size =   487.20 MB
whisper_model_load: model size    =  487.00 MB
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    47.25 MiB, (  513.52 / 10922.67)
whisper_init_state: kv self size  =   49.55 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    52.73 MiB, (  566.25 / 10922.67)
whisper_init_state: kv cross size =   55.30 MB
whisper_init_state: loading Core ML model from './models/ggml-small.en-tdrz-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: failed to load Core ML model from './models/ggml-small.en-tdrz-encoder.mlmodelc'
ggml_metal_free: deallocating
error: failed to initialize whisper context

@barolo

barolo commented Feb 9, 2024

> i'm not sure if this is expected, but with medium.en-q5_0, i'm seeing that speaker turns are pretty reliably marked with >>. i'm not using the --diarize or --tdrz flags.
>
> i wasn't seeing this behavior with large-v2, large-v3, or large-v3-q5_0. any thoughts on why that would be happening?

It also happens with the small model, on its own or when pushed via a ">>" prompt. Unfortunately, for the life of me I cannot combine it with my other prompt, which resulted in proper quote-unquote behavior, i.e.

Knock on the door and I had to be like, "Oh my God, please, is there anybody in there?"
And she was like, "Okay, let's see how this goes"

And quotes only happen when using -oved GPU [unfortunately it hallucinates a lot], whereas -oved CPU is much more likely to trigger ">>" diarizations on its own.
This is so weird...
