Speaker Diarization #45

Closed
yehiaabdelm opened this issue Sep 14, 2023 · 11 comments

Comments

@yehiaabdelm
Contributor

I want to add speaker diarization. I'm wondering how much of the existing code I would have to change, and whether you can point me to how you would approach this, since I'm still getting used to the codebase.

@makaveli10
Collaborator

@yehiaabdelm To start with, I think it could be added similarly to how Voice Activity Detection is done, i.e. run before transcription to figure out the speaker id for each segment and send it back to the client.
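
A rough sketch of how that hook could look (none of these names come from the actual WhisperLive code; this is just to illustrate the idea):

def handle_audio_chunk(chunk, vad, diarizer, transcriber, client):
    # Hypothetical per-chunk pipeline, mirroring how VAD gates transcription today.
    # 1. Gate on voice activity, as is already done.
    if not vad.is_speech(chunk):
        return
    # 2. Assign a speaker id to the chunk before transcription.
    speaker_id = diarizer.identify_speaker(chunk)
    # 3. Transcribe and attach the speaker id to each segment sent to the client.
    for segment in transcriber.transcribe(chunk):
        segment["speaker"] = speaker_id
        client.send(segment)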

@zoq
Contributor

zoq commented Sep 19, 2023

Adding to that, we do have some code that performs offline diarization. I guess you need live diarization? I'm trying to think about how we could take the offline code we have and apply it to a live setting.

@yehiaabdelm
Contributor Author

Can you please share the offline code? I want to try to add it. @makaveli10 Yes, I think it can be done like the VAD model, but wouldn't we be processing the whole audio stream every time? It needs to segment/cluster the recording on every update.

@zoq
Contributor

zoq commented Oct 2, 2023

Yes, I'll clean up the code and share it.

@yehiaabdelm
Contributor Author

I gave it a shot using pyannote, but it slows transcription down until it becomes unusable because I'm passing the whole audio stream every time. I think we need to store speaker embeddings somehow and use some sort of sliding window. Thought I'd provide an update.

diarization.py

import os

import torch
from dotenv import load_dotenv, find_dotenv
from intervaltree import IntervalTree
from pyannote.audio import Pipeline

load_dotenv(find_dotenv())
HUGGINGFACE_ACCESS_TOKEN = os.environ["HUGGINGFACE_ACCESS_TOKEN"]


class Diarization():
    def __init__(self):
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.0",
            use_auth_token=HUGGINGFACE_ACCESS_TOKEN).to(device)

    def transform_diarization_output(self, diarization):
        # itertracks(yield_label=True) yields (segment, track, speaker_label);
        # without yield_label the speaker label is not returned.
        segments = []
        for segment, _, speaker in diarization.itertracks(yield_label=True):
            segments.append({"start": segment.start,
                             "end": segment.end, "speaker": speaker})
        return segments

    def process(self, waveform, sample_rate):
        # convert samples to tensor
        audio_tensor = torch.tensor(waveform, dtype=torch.float32).unsqueeze(0)
        # run diarization model on tensor
        diarization = self.pipeline(
            {"waveform": audio_tensor, "sample_rate": sample_rate})
        # convert output to list of dicts
        diarization = self.transform_diarization_output(diarization)
        return diarization

    def join_transcript_with_diarization(self, transcript, diarization):

        diarization_tree = IntervalTree()
        # Add diarization to interval tree
        for dia in diarization:
            diarization_tree.addi(dia['start'], dia['end'], dia['speaker'])

        joined = []
        for seg in transcript:
            interval_start = seg['start']
            interval_end = seg['end']
            # Get overlapping diarization
            overlaps = diarization_tree[interval_start:interval_end]
            speakers = {overlap.data for overlap in overlaps}
            # Add to result
            joined.append({
                'start': interval_start,
                'end': interval_end,
                'speakers': list(speakers),
                'text': seg['text']
            })

        return joined
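
For reference, a minimal usage sketch for the class above (the audio and transcript segments here are placeholders; the real whisper-live segment format may differ):

import numpy as np

diarizer = Diarization()

# Placeholder audio: 10 seconds of silence at 16 kHz, standing in for streamed audio.
sample_rate = 16000
waveform = np.zeros(10 * sample_rate, dtype=np.float32)

diarization = diarizer.process(waveform, sample_rate)

# Placeholder transcript segments with start/end times in seconds.
transcript = [
    {"start": 0.0, "end": 2.5, "text": "Hello there."},
    {"start": 2.5, "end": 5.0, "text": "Hi, how are you?"},
]
for seg in diarizer.join_transcript_with_diarization(transcript, diarization):
    print(seg["speakers"], seg["text"])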

@makaveli10
Collaborator

makaveli10 commented Oct 11, 2023

@yehiaabdelm Looks good for a start. We can process only the new segment we receive from the whisper-live transcription pipeline. But in that case we would need correct timestamps from whisper, because right now with VAD we don't process audio frames that are below the speech threshold, so the timestamps are off. I think I can fix that if it would help.
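
Picking up on the speaker-embedding idea above, a rough sketch of per-segment speaker assignment could look like this (untested; it uses pyannote's "pyannote/embedding" model rather than the full diarization pipeline, and the similarity threshold is an arbitrary placeholder that would need tuning):

import numpy as np
import torch
from pyannote.audio import Model, Inference


class IncrementalSpeakerTracker:
    """Assign a speaker label to each new segment by comparing its embedding
    against running centroids of previously seen speakers."""

    def __init__(self, hf_token, threshold=0.6):
        model = Model.from_pretrained("pyannote/embedding", use_auth_token=hf_token)
        self.inference = Inference(model, window="whole")
        self.centroids = []  # one L2-normalized embedding per known speaker
        self.threshold = threshold  # arbitrary value; needs tuning

    def assign(self, waveform, sample_rate):
        audio = torch.tensor(waveform, dtype=torch.float32).unsqueeze(0)
        emb = self.inference({"waveform": audio, "sample_rate": sample_rate})
        emb = emb / np.linalg.norm(emb)
        # Find the closest known speaker by cosine similarity.
        best_id, best_sim = None, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = float(np.dot(emb, centroid))
            if sim > best_sim:
                best_id, best_sim = i, sim
        if best_id is not None and best_sim >= self.threshold:
            # Update the running centroid for the matched speaker.
            updated = self.centroids[best_id] + emb
            self.centroids[best_id] = updated / np.linalg.norm(updated)
            return f"SPEAKER_{best_id:02d}"
        # No close match: register a new speaker.
        self.centroids.append(emb)
        return f"SPEAKER_{len(self.centroids) - 1:02d}"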

@yehiaabdelm
Contributor Author

Yup, that would be super helpful.

@makaveli10
Collaborator

@yehiaabdelm I can take a look, but I'm not sure when. A quick workaround in the meantime would be to remove VAD.

@yehiaabdelm
Contributor Author

True, but I need the VAD.

@makaveli10
Collaborator

@yehiaabdelm Sorry for the late response. #64 should give you correct timestamps with VAD.

@makaveli10
Collaborator

Closing due to inactivity. Feel free to re-open the issue.
