Speaker Diarization #45

Closed
yehiaabdelm opened this issue Sep 14, 2023 · 11 comments

Comments

@yehiaabdelm
Contributor

I want to add speaker diarization. I'm wondering how much of the existing code I would have to change, and whether you can point me to how you would approach this, since I'm still getting used to the codebase.

@makaveli10
Collaborator

@yehiaabdelm To start with, I think it could be added similarly to how Voice Activity Detection is done, i.e. run before transcription to figure out the speaker id for each segment and send it back to the client.
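
A rough sketch of how that hook could look (none of these names come from the actual WhisperLive code; this is just to illustrate the idea):

def handle_audio_chunk(chunk, vad, diarizer, transcriber, client):
    # Hypothetical per-chunk pipeline, mirroring how VAD gates transcription today.
    # 1. Gate on voice activity, as is already done.
    if not vad.is_speech(chunk):
        return
    # 2. Assign a speaker id to the chunk before transcription.
    speaker_id = diarizer.identify_speaker(chunk)
    # 3. Transcribe and attach the speaker id to each segment sent to the client.
    for segment in transcriber.transcribe(chunk):
        segment["speaker"] = speaker_id
        client.send(segment)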

@zoq
Contributor

zoq commented Sep 19, 2023

Adding to that, we do have some code that performs offline diarization. I guess you need live diarization? I'm trying to think about how we could take the offline code we have and apply it to a live setting.

@yehiaabdelm
Contributor Author

Can you please share the offline code? I want to try to add it. @makaveli10 Yes, I think it can be done like the VAD model, but wouldn't we be processing the whole audio stream every time? It needs to segment/cluster the recording on every update.

@zoq
Contributor

zoq commented Oct 2, 2023

Yes, I'll clean up the code and share it.

@yehiaabdelm
Contributor Author

I gave it a shot using pyannote, but it slows transcription down until it becomes unusable because I'm passing the whole audio stream every time. I think we need to store speaker embeddings somehow and use some sort of sliding window. Thought I'd provide an update.

diarization.py

import os

import torch
from dotenv import load_dotenv, find_dotenv
from intervaltree import IntervalTree
from pyannote.audio import Pipeline

load_dotenv(find_dotenv())
HUGGINGFACE_ACCESS_TOKEN = os.environ["HUGGINGFACE_ACCESS_TOKEN"]


class Diarization():
    def __init__(self):
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.0",
            use_auth_token=HUGGINGFACE_ACCESS_TOKEN).to(device)

    def transform_diarization_output(self, diarization):
        # itertracks(yield_label=True) yields (segment, track, speaker_label);
        # without yield_label the speaker label is not returned.
        segments = []
        for segment, _, speaker in diarization.itertracks(yield_label=True):
            segments.append({"start": segment.start,
                             "end": segment.end, "speaker": speaker})
        return segments

    def process(self, waveform, sample_rate):
        # convert samples to tensor
        audio_tensor = torch.tensor(waveform, dtype=torch.float32).unsqueeze(0)
        # run diarization model on tensor
        diarization = self.pipeline(
            {"waveform": audio_tensor, "sample_rate": sample_rate})
        # convert output to list of dicts
        diarization = self.transform_diarization_output(diarization)
        return diarization

    def join_transcript_with_diarization(self, transcript, diarization):

        diarization_tree = IntervalTree()
        # Add diarization to interval tree
        for dia in diarization:
            diarization_tree.addi(dia['start'], dia['end'], dia['speaker'])

        joined = []
        for seg in transcript:
            interval_start = seg['start']
            interval_end = seg['end']
            # Get overlapping diarization
            overlaps = diarization_tree[interval_start:interval_end]
            speakers = {overlap.data for overlap in overlaps}
            # Add to result
            joined.append({
                'start': interval_start,
                'end': interval_end,
                'speakers': list(speakers),
                'text': seg['text']
            })

        return joined
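
For reference, a minimal usage sketch for the class above (the audio and transcript segments here are placeholders; the real whisper-live segment format may differ):

import numpy as np

diarizer = Diarization()

# Placeholder audio: 10 seconds of silence at 16 kHz, standing in for streamed audio.
sample_rate = 16000
waveform = np.zeros(10 * sample_rate, dtype=np.float32)

diarization = diarizer.process(waveform, sample_rate)

# Placeholder transcript segments with start/end times in seconds.
transcript = [
    {"start": 0.0, "end": 2.5, "text": "Hello there."},
    {"start": 2.5, "end": 5.0, "text": "Hi, how are you?"},
]
for seg in diarizer.join_transcript_with_diarization(transcript, diarization):
    print(seg["speakers"], seg["text"])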

@makaveli10
Collaborator

makaveli10 commented Oct 11, 2023

@yehiaabdelm Looks good for a start. We can process only the new segment we receive from the whisper-live transcription pipeline. But in that case we would need correct timestamps from whisper, because right now with VAD we don't process audio frames that are below the speech threshold, so the timestamps are off. I think I can fix that if it would help.
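
Picking up on the speaker-embedding idea above, a rough sketch of per-segment speaker assignment could look like this (untested; it uses pyannote's "pyannote/embedding" model rather than the full diarization pipeline, and the similarity threshold is an arbitrary placeholder that would need tuning):

import numpy as np
import torch
from pyannote.audio import Model, Inference


class IncrementalSpeakerTracker:
    """Assign a speaker label to each new segment by comparing its embedding
    against running centroids of previously seen speakers."""

    def __init__(self, hf_token, threshold=0.6):
        model = Model.from_pretrained("pyannote/embedding", use_auth_token=hf_token)
        self.inference = Inference(model, window="whole")
        self.centroids = []  # one L2-normalized embedding per known speaker
        self.threshold = threshold  # arbitrary value; needs tuning

    def assign(self, waveform, sample_rate):
        audio = torch.tensor(waveform, dtype=torch.float32).unsqueeze(0)
        emb = self.inference({"waveform": audio, "sample_rate": sample_rate})
        emb = emb / np.linalg.norm(emb)
        # Find the closest known speaker by cosine similarity.
        best_id, best_sim = None, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = float(np.dot(emb, centroid))
            if sim > best_sim:
                best_id, best_sim = i, sim
        if best_id is not None and best_sim >= self.threshold:
            # Update the running centroid for the matched speaker.
            updated = self.centroids[best_id] + emb
            self.centroids[best_id] = updated / np.linalg.norm(updated)
            return f"SPEAKER_{best_id:02d}"
        # No close match: register a new speaker.
        self.centroids.append(emb)
        return f"SPEAKER_{len(self.centroids) - 1:02d}"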

@yehiaabdelm
Contributor Author

Yup, that would be super helpful.

@makaveli10
Collaborator

@yehiaabdelm I can take a look, but I'm not sure when. A quick workaround in the meantime would be to remove VAD.

@yehiaabdelm
Contributor Author

True, but I need the VAD.

@makaveli10
Collaborator

@yehiaabdelm Sorry for the late response. #64 should give you correct timestamps with VAD.

@makaveli10
Collaborator

Closing due to inactivity. Feel free to re-open the issue.
