Speaker Diarization #45

I want to add speaker diarization. Wondering how much I would have to change (existing code), and whether you can point me to how you would approach this, since I'm still getting used to the codebase.

Comments
@yehiaabdelm To start with, I think it could be added similar to how the VAD model is integrated. |
Adding on to that, we do have some code that performs offline diarization. I guess you need live diarization? I'm trying to think about how we could take the offline code we have and apply it to a live setting. |
Can you please share the offline code? I want to try and add it. @makaveli10 Yes, I think it can be done like the VAD model, but I think we would be processing the whole audio stream every time, right? It needs to segment/cluster the recording on every update. |
Yes, I'll clean up the code and share it. |
Gave it a shot using pyannote, but it slows down transcription until it becomes unusable because I'm passing the whole audio stream every time. I think we need to store speaker embeddings somehow and have some sort of sliding window. Thought I'd provide an update.

diarization.py:

```python
import os

import torch
from dotenv import load_dotenv, find_dotenv
from intervaltree import IntervalTree
from pyannote.audio import Pipeline

load_dotenv(find_dotenv())
HUGGINGFACE_ACCESS_TOKEN = os.environ["HUGGINGFACE_ACCESS_TOKEN"]


class Diarization:
    def __init__(self):
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.0",
            use_auth_token=HUGGINGFACE_ACCESS_TOKEN).to(device)

    def transform_diarization_output(self, diarization):
        # flatten the pyannote Annotation into a list of dicts;
        # yield_label=True is needed to get speaker labels, not track ids
        segments = []
        for segment, _, speaker in diarization.itertracks(yield_label=True):
            segments.append({"start": segment.start,
                             "end": segment.end,
                             "speaker": speaker})
        return segments

    def process(self, waveform, sample_rate):
        # convert samples to a (channel, time) float tensor
        audio_tensor = torch.tensor(waveform, dtype=torch.float32).unsqueeze(0)
        # run the diarization pipeline on the tensor
        diarization = self.pipeline(
            {"waveform": audio_tensor, "sample_rate": sample_rate})
        # convert the output to a list of dicts
        return self.transform_diarization_output(diarization)

    def join_transcript_with_diarization(self, transcript, diarization):
        # index diarization segments in an interval tree for fast overlap queries
        diarization_tree = IntervalTree()
        for dia in diarization:
            diarization_tree.addi(dia['start'], dia['end'], dia['speaker'])

        joined = []
        for seg in transcript:
            interval_start = seg['start']
            interval_end = seg['end']
            # collect every speaker whose turn overlaps this transcript segment
            overlaps = diarization_tree[interval_start:interval_end]
            speakers = {overlap.data for overlap in overlaps}
            joined.append({
                'start': interval_start,
                'end': interval_end,
                'speakers': list(speakers),
                'text': seg['text'],
            })
        return joined
```
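By the way, here is a rough, untested sketch of the sliding-window idea I mean (the class, thresholds, and helper names are made up; it assumes pyannote's `Inference` API with the `pyannote/embedding` model for speaker embeddings): diarize only the last N seconds, embed each local speaker turn, and cosine-match the embeddings against running centroids so labels stay consistent across windows.

```python
import numpy as np
import torch
from pyannote.audio import Model, Inference


class SlidingWindowDiarizer:
    """Hypothetical sketch: bounded-cost diarization over a rolling window."""

    def __init__(self, pipeline, window_s=30.0, match_threshold=0.7):
        self.pipeline = pipeline          # the pyannote pipeline from above
        self.window_s = window_s          # only the last N seconds are diarized
        self.match_threshold = match_threshold
        self.centroids = {}               # global speaker label -> unit-norm centroid
        self.embedder = Inference(
            Model.from_pretrained("pyannote/embedding",
                                  use_auth_token=HUGGINGFACE_ACCESS_TOKEN),
            window="whole")

    def process(self, waveform, sample_rate):
        # crop to the most recent window so per-update cost stays bounded
        max_samples = int(self.window_s * sample_rate)
        offset_s = max(0, len(waveform) - max_samples) / sample_rate
        window = np.asarray(waveform[-max_samples:], dtype=np.float32)
        audio = {"waveform": torch.from_numpy(window).unsqueeze(0),
                 "sample_rate": sample_rate}

        segments = []
        for segment, _, _ in self.pipeline(audio).itertracks(yield_label=True):
            # embed this turn and map it to a stable global speaker label
            emb = np.asarray(self.embedder.crop(audio, segment)).flatten()
            segments.append({"start": offset_s + segment.start,
                             "end": offset_s + segment.end,
                             "speaker": self._match(emb)})
        return segments

    def _match(self, emb):
        # cosine-match against running centroids; open a new speaker otherwise
        emb = emb / (np.linalg.norm(emb) + 1e-8)
        best_label, best_sim = None, self.match_threshold
        for label, centroid in self.centroids.items():
            sim = float(np.dot(emb, centroid))
            if sim > best_sim:
                best_label, best_sim = label, sim
        if best_label is None:
            best_label = f"SPEAKER_{len(self.centroids):02d}"
            self.centroids[best_label] = emb
        else:
            updated = self.centroids[best_label] + emb
            self.centroids[best_label] = updated / np.linalg.norm(updated)
        return best_label
```

Very short turns may need padding or skipping before embedding, and the centroid update is naive, but this avoids re-diarizing the full stream on every chunk. |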
@yehiaabdelm Looks good for a start. We can process only the new segment we receive from the whisper-live transcription pipeline, but in that case we would need correct timestamps from whisper: right now, with VAD, we don't process audio frames below the speech threshold, so the timestamps are off. I think I can fix that if it would help.
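Roughly, the fix would be to track how much audio VAD drops and add it back when reporting timestamps. A hypothetical sketch (not the actual patch):

```python
class VadTimestampOffset:
    """Hypothetical sketch: map timestamps from VAD-filtered audio back to
    the live stream's clock by accounting for dropped non-speech samples."""

    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.dropped_samples = 0  # non-speech samples skipped so far

    def add_frames(self, num_samples, is_speech):
        # call this for every incoming chunk as VAD classifies it
        if not is_speech:
            self.dropped_samples += num_samples

    def to_stream_time(self, t):
        # shift a timestamp from the filtered audio into stream time
        return t + self.dropped_samples / self.sample_rate
```

A single running offset is only exact when all dropped audio precedes the segment being timestamped; interleaved drops would need a per-chunk offset table. |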
Yup, that would be super helpful. |
@yehiaabdelm I can take a look, but I'm not sure when. A quick way to solve it in the meantime would be to remove VAD. |
True, but I need the VAD. |
@yehiaabdelm Sorry for the late response. #64 should give you correct timestamps with VAD. |
Closing due to inactivity. Feel free to re-open the issue. |