Load the pre-trained speaker diarization pipeline locally on our device:

In [1]:
import os

from pyannote.audio import Pipeline

# Read the Hugging Face access token from the environment rather than hardcoding it.
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization@2.1", use_auth_token=os.environ["HF_TOKEN"]
)

INFO:speechbrain.utils.quirks:Applied quirks (see `speechbrain.utils.quirks`): [allow_tf32, disable_jit_profiling]
INFO:speechbrain.utils.quirks:Excluded quirks specified by the `SB_DISABLE_QUIRKS` environment (comma-separated list): []
  if ismodule(module) and hasattr(module, '__file__'):
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.0.post0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\Eric\.cache\torch\pyannote\models--pyannote--segmentation\snapshots\c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b\pytorch_model.bin`
INFO:speechbrain.utils.fetching:Fetch hyperparams.yaml: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached
INFO:speechbrain.utils.fetching:Fetch custom.py: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached
  wrapped_fwd = torch.cuda.amp.custom_fwd(fwd, cast_inputs=cast_inputs)


Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.6.0+cpu. Bad things might happen unless you revert torch to 1.x.


INFO:speechbrain.utils.fetching:Fetch embedding_model.ckpt: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached
INFO:speechbrain.utils.fetching:Fetch mean_var_norm_emb.ckpt: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached
INFO:speechbrain.utils.fetching:Fetch classifier.ckpt: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached
INFO:speechbrain.utils.fetching:Fetch label_encoder.txt: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached
INFO:speechbrain.utils.parameter_transfer:Loading pretrained files for: embedding_model, mean_var_norm_emb, classifier, label_encoder


Load the audio file we created: a WAV file with 3 speakers in 4 tracks, all concatenated together.

In [2]:
import soundfile as sf

filename = "combined_smaller_sample_2"
combined_sample, combined_sample_rate = sf.read(f'combined/{filename}.wav')

print(combined_sample_rate)
print(combined_sample)


16000
[ 3.05175781e-05  9.15527344e-05  3.05175781e-05 ...  1.52587891e-04
  1.52587891e-04 -3.05175781e-05]


We can listen to the audio to see what it sounds like:

In [3]:
from IPython.display import Audio

Audio(combined_sample, rate=combined_sample_rate)

Note that pyannote.audio expects the audio input to be a PyTorch tensor of shape (channels, seq_len),
so we need to perform this conversion prior to running the model:

In [4]:
import torch

input_tensor = torch.from_numpy(combined_sample[None, :]).float()
outputs = diarization_pipeline(
    {"waveform": input_tensor, "sample_rate": combined_sample_rate}
)


# annotation_dict_list = list();
# for segment in outputs.itersegments():
#         annotation_dict_list.append({
#                 "segment": segment,
#                 "track": outputs.get_tracks(segment),
#                 "label": outputs.get_labels(segment)
#         })

# outputs.for_json()["content"]       
# for annotation in annotation_dict_list:
#         print(annotation)


diarization_output = []
for segment, track, label in outputs.itertracks(yield_label=True):
    diarization_output.append({
        'segment': {'start': segment.start, 'end': segment.end},
        'track': track,
        'label': label,
    })

for entry in diarization_output:
    print(entry)

{'segment': {'start': 0.45284375, 'end': 0.5372187500000001}, 'track': 'A', 'label': 'SPEAKER_01'}
{'segment': {'start': 0.5372187500000001, 'end': 8.02971875}, 'track': 'B', 'label': 'SPEAKER_00'}
{'segment': {'start': 9.160343750000003, 'end': 9.177218750000002}, 'track': 'C', 'label': 'SPEAKER_00'}
{'segment': {'start': 9.177218750000002, 'end': 17.36159375}, 'track': 'D', 'label': 'SPEAKER_02'}
{'segment': {'start': 17.98596875, 'end': 23.43659375}, 'track': 'E', 'label': 'SPEAKER_01'}
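
The (channels, seq_len) requirement matters most for multi-channel files: `soundfile.read` returns a 1-D array for mono audio but a (frames, channels) array for multi-channel audio, so indexing with `[None, :]` is only correct for mono input. The helper below is a minimal sketch of a conversion that covers both cases. The name `to_channels_first` is hypothetical and not part of pyannote.audio or soundfile.

```python
import numpy as np


def to_channels_first(samples):
    """Return a float32 array of shape (channels, seq_len).

    Hypothetical helper: accepts mono audio of shape (frames,) or
    multi-channel audio of shape (frames, channels), as soundfile returns it.
    """
    samples = np.asarray(samples, dtype=np.float32)
    if samples.ndim == 1:
        # Mono: add a leading channel axis.
        return samples[None, :]
    if samples.ndim == 2:
        # Multi-channel: move channels to the first axis.
        return np.ascontiguousarray(samples.T)
    raise ValueError(f"expected 1-D or 2-D audio, got {samples.ndim} dimensions")


mono = np.zeros(16000)          # one second of silence at 16 kHz
stereo = np.zeros((16000, 2))   # the same, with two channels
print(to_channels_first(mono).shape)    # (1, 16000)
print(to_channels_first(stereo).shape)  # (2, 16000)
```

The result can then be wrapped with `torch.from_numpy(...)` before being passed to the pipeline as the `"waveform"` entry.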


Use the Whisper model for our speech transcription system.

In [5]:
from transformers import pipeline

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
)

Device set to use cpu


Get the transcription for our sample audio, returning segment-level timestamps so that we know the start and end times of each segment.

Whisper processes audio in 30-second windows, so it does not work well on longer files without chunking. Our sample is about 23 seconds long, so a single pass is enough.
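
For recordings longer than 30 seconds, one option is to split the waveform into windows of at most 30 seconds and transcribe each window separately. The sketch below only computes the window boundaries. The function name `chunk_bounds` and the 70-second example are assumptions for illustration, not part of this notebook's pipeline.

```python
def chunk_bounds(n_samples, sample_rate, max_seconds=30.0):
    """Return (start, end) sample indices of consecutive windows.

    Each window covers at most `max_seconds` of audio; the last one may be shorter.
    """
    window = int(max_seconds * sample_rate)
    if window <= 0:
        raise ValueError("max_seconds * sample_rate must be at least one sample")
    return [
        (start, min(start + window, n_samples))
        for start in range(0, n_samples, window)
    ]


# A hypothetical 70-second recording at 16 kHz splits into 30 s + 30 s + 10 s.
bounds = chunk_bounds(70 * 16000, 16000)
print(bounds)  # [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each window would be transcribed as `combined_sample[start:end]`, with `start / sample_rate` added to its timestamps. Hard cuts like these can split a word in two; the transformers ASR pipeline also accepts a `chunk_length_s` argument that chunks with overlap instead.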

In [6]:
asr_output = asr_pipeline(
    combined_sample.copy(),
    generate_kwargs={"max_new_tokens": 256},
    return_timestamps=True,
)

print(asr_output)

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


{'text': " Yet, he is of all creatures the most formatively armed. What then is the devilfish? It is the sea vampire. To make sure this sad part of my story, we went the way of Hall sailors. The punch was made and I was made half drunk with it. At about two o'clock we heard the loud cry of sailho from a loft.", 'chunks': [{'timestamp': (0.0, 4.52), 'text': ' Yet, he is of all creatures the most formatively armed.'}, {'timestamp': (4.52, 6.44), 'text': ' What then is the devilfish?'}, {'timestamp': (6.44, 9.24), 'text': ' It is the sea vampire.'}, {'timestamp': (9.24, 14.24), 'text': ' To make sure this sad part of my story, we went the way of Hall sailors.'}, {'timestamp': (14.24, 18.04), 'text': ' The punch was made and I was made half drunk with it.'}, {'timestamp': (18.04, 23.4), 'text': " At about two o'clock we heard the loud cry of sailho from a loft."}]}


Align the diarization and transcription outputs by their timestamps: each speaker segment is paired with the transcription chunks whose timestamps are closest, minimising the absolute difference between the two.
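
To make the idea concrete, here is a minimal sketch of one way such an alignment could work. It assumes that consecutive segments from the same speaker are merged first, and that each merged segment then claims every remaining transcription chunk up to the one whose end timestamp is nearest to the segment's end. The function names are hypothetical and this is not necessarily the exact logic inside `ASRDiarizationPipeline`. The input values are the diarization segments (rounded) and ASR chunks printed earlier in this notebook.

```python
def merge_consecutive(segments):
    """Merge neighbouring diarization segments that share a speaker label."""
    merged = []
    for seg in segments:
        start, end = seg["segment"]["start"], seg["segment"]["end"]
        if merged and merged[-1]["speaker"] == seg["label"]:
            merged[-1]["end"] = end  # extend the previous segment
        else:
            merged.append({"speaker": seg["label"], "start": start, "end": end})
    return merged


def align(segments, chunks):
    """Assign ASR chunks to speakers by nearest end timestamp."""
    remaining = list(chunks)
    aligned = []
    for seg in merge_consecutive(segments):
        if not remaining:
            break
        # Index of the remaining chunk whose end time is closest to the segment end.
        upto = min(
            range(len(remaining)),
            key=lambda i: abs(remaining[i]["timestamp"][1] - seg["end"]),
        )
        group = remaining[: upto + 1]
        aligned.append({
            "speaker": seg["speaker"],
            "text": "".join(chunk["text"] for chunk in group),
            "timestamp": (group[0]["timestamp"][0], group[-1]["timestamp"][1]),
        })
        remaining = remaining[upto + 1:]
    return aligned


# Diarization segments from above, with times rounded to three decimals.
segments = [
    {"segment": {"start": 0.453, "end": 0.537}, "track": "A", "label": "SPEAKER_01"},
    {"segment": {"start": 0.537, "end": 8.030}, "track": "B", "label": "SPEAKER_00"},
    {"segment": {"start": 9.160, "end": 9.177}, "track": "C", "label": "SPEAKER_00"},
    {"segment": {"start": 9.177, "end": 17.362}, "track": "D", "label": "SPEAKER_02"},
    {"segment": {"start": 17.986, "end": 23.437}, "track": "E", "label": "SPEAKER_01"},
]
# ASR chunks from above.
chunks = [
    {"timestamp": (0.0, 4.52), "text": " Yet, he is of all creatures the most formatively armed."},
    {"timestamp": (4.52, 6.44), "text": " What then is the devilfish?"},
    {"timestamp": (6.44, 9.24), "text": " It is the sea vampire."},
    {"timestamp": (9.24, 14.24), "text": " To make sure this sad part of my story, we went the way of Hall sailors."},
    {"timestamp": (14.24, 18.04), "text": " The punch was made and I was made half drunk with it."},
    {"timestamp": (18.04, 23.4), "text": " At about two o'clock we heard the loud cry of sailho from a loft."},
]

aligned = align(segments, chunks)
for entry in aligned:
    print(entry["speaker"], entry["timestamp"])
```

This sketch assumes every chunk has a numeric end timestamp; a chunk with a missing end time would need separate handling.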

In [None]:
from speechbox_trycatch_upto_idx import ASRDiarizationPipeline

# Use a distinct name so we do not shadow the `pipeline` function imported from transformers.
asr_diarization_pipeline = ASRDiarizationPipeline(
    asr_pipeline=asr_pipeline, diarization_pipeline=diarization_pipeline
)

Pass the audio file to the composite pipeline and see what we get out:

In [None]:
# Note: ASRDiarizationPipeline runs both models again internally. The "boxing" step
# has been extracted from ASRDiarizationPipeline into box.ipynb.
# When hooking this up to the application, either use the box.ipynb method on the outputs
# computed above, or call this pipeline on its own without the separate model calls above.
final_output = asr_diarization_pipeline(combined_sample.copy())

print(final_output)

[{'speaker': 'SPEAKER_01', 'text': ' Yet, he is of all creatures the most formatively armed.', 'timestamp': (0.0, 4.52)}, {'speaker': 'SPEAKER_00', 'text': ' What then is the devilfish? It is the sea vampire.', 'timestamp': (4.52, 9.24)}, {'speaker': 'SPEAKER_02', 'text': ' To make sure this sad part of my story, we went the way of Hall sailors. The punch was made and I was made half drunk with it.', 'timestamp': (9.24, 18.04)}, {'speaker': 'SPEAKER_01', 'text': " At about two o'clock we heard the loud cry of sailho from a loft.", 'timestamp': (18.04, 23.4)}]


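Finally, save the speaker-attributed transcript together with the raw ASR and diarization outputs to a single JSON file:

In [9]: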
import json

all_output = {
    "speakers": final_output,
    "asr": asr_output,
    "diarization": diarization_output
}

with open(f"output/{filename}.json", "w") as f:
    json.dump(all_output, f, indent=4)
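
One detail to keep in mind when reading this file back: JSON has no tuple type, so the `(start, end)` timestamp tuples are written as arrays and come back as lists. The short sketch below demonstrates this with a single hypothetical record shaped like an entry of `final_output`.

```python
import json

# A record shaped like one entry of final_output (values taken from the output above).
record = {"speaker": "SPEAKER_01", "timestamp": (0.0, 4.52)}

# Round-trip through JSON: the tuple is serialised as an array.
restored = json.loads(json.dumps(record))
print(restored["timestamp"])  # [0.0, 4.52]

# Convert back to a tuple if downstream code compares against tuples.
restored["timestamp"] = tuple(restored["timestamp"])
print(restored == record)  # True
```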