Load the pre-trained speaker diarization pipeline locally on our device:

In [30]:
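import os

from pyannote.audio import Pipeline

# Read the Hugging Face access token from the environment instead of
# hard-coding it in the notebook (the HF_TOKEN variable name is an assumption).
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization@2.1", use_auth_token=os.environ["HF_TOKEN"]
)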

Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.0.post0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\Eric\.cache\torch\pyannote\models--pyannote--segmentation\snapshots\c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b\pytorch_model.bin`
INFO:speechbrain.utils.fetching:Fetch hyperparams.yaml: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached


Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.6.0+cpu. Bad things might happen unless you revert torch to 1.x.


INFO:speechbrain.utils.fetching:Fetch custom.py: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached
INFO:speechbrain.utils.fetching:Fetch embedding_model.ckpt: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached
INFO:speechbrain.utils.fetching:Fetch mean_var_norm_emb.ckpt: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached
INFO:speechbrain.utils.fetching:Fetch classifier.ckpt: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached
INFO:speechbrain.utils.fetching:Fetch label_encoder.txt: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached
INFO:speechbrain.utils.parameter_transfer:Loading pretrained files for: embedding_model, mean_var_norm_emb, classifier, label_encoder


Load the audio file that we created by concatenating 4 tracks from 3 speakers (stored here as a WAV file):

In [31]:
import soundfile as sf

filename = "combined_smaller"
combined_sample, combined_sample_rate = sf.read(f'combined/{filename}.wav')

print(combined_sample_rate)
print(combined_sample)


16000
[ 3.05175781e-05  1.52587891e-04  6.10351562e-05 ... -1.12915039e-03
 -7.62939453e-04 -8.85009766e-04]


We can listen to the audio to see what it sounds like:

In [32]:
from IPython.display import Audio

Audio(combined_sample, rate=combined_sample_rate)

Note that pyannote.audio expects the audio input to be a PyTorch tensor of shape (channels, seq_len),
whereas sf.read returned a 1-D NumPy array for this mono file, so we add a channel dimension and
convert to a float32 tensor before running the model:

In [33]:
import torch

input_tensor = torch.from_numpy(combined_sample[None, :]).float()
outputs = diarization_pipeline(
    {"waveform": input_tensor, "sample_rate": combined_sample_rate}
)


# annotation_dict_list = list();
# for segment in outputs.itersegments():
#         annotation_dict_list.append({
#                 "segment": segment,
#                 "track": outputs.get_tracks(segment),
#                 "label": outputs.get_labels(segment)
#         })

# outputs.for_json()["content"]       
# for annotation in annotation_dict_list:
#         print(annotation)


diarization_output = []
for segment, track, label in outputs.itertracks(yield_label=True):
    diarization_output.append(
        {
            "segment": {"start": segment.start, "end": segment.end},
            "track": track,
            "label": label,
        }
    )

for segment in diarization_output:
    print(segment)

{'segment': {'start': 0.58784375, 'end': 6.13971875}, 'track': 'A', 'label': 'SPEAKER_00'}
{'segment': {'start': 4.806593750000001, 'end': 5.14409375}, 'track': 'B', 'label': 'SPEAKER_01'}
{'segment': {'start': 6.764093750000001, 'end': 9.970343750000001}, 'track': 'C', 'label': 'SPEAKER_01'}
{'segment': {'start': 10.662218750000001, 'end': 10.67909375}, 'track': 'D', 'label': 'SPEAKER_01'}
{'segment': {'start': 10.67909375, 'end': 15.859718750000003}, 'track': 'E', 'label': 'SPEAKER_02'}
{'segment': {'start': 16.28159375, 'end': 20.88846875}, 'track': 'F', 'label': 'SPEAKER_00'}


Load the Whisper base model to serve as our speech transcription system:

In [34]:
from transformers import pipeline

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
)

Device set to use cpu


Get the transcription for our sample audio, returning the segment-level timestamps as well so that we know the start and end times of each segment.

Whisper works on 30-second windows of audio, so it does not work well on files longer than 30 seconds unless they are split into chunks.

In [35]:
asr_output = asr_pipeline(
    combined_sample.copy(),
    generate_kwargs={"max_new_tokens": 256},
    return_timestamps=True,
)

print(asr_output)

{'text': ' Explosions rock the enemy line, providing much needed cover for the Az and his team. The truth is that, as there is ample testimony, and I look to the land where the smoke we had seen three-quarters of an hour ago. As they reached the extraction point, they heard the roar of engines overhead.', 'chunks': [{'timestamp': (0.0, 6.8), 'text': ' Explosions rock the enemy line, providing much needed cover for the Az and his team.'}, {'timestamp': (6.8, 13.32), 'text': ' The truth is that, as there is ample testimony, and I look to the land where the smoke we had'}, {'timestamp': (13.32, 16.4), 'text': ' seen three-quarters of an hour ago.'}, {'timestamp': (16.4, 20.84), 'text': ' As they reached the extraction point, they heard the roar of engines overhead.'}]}
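
The 30-second limit noted above is usually handled by splitting long audio into overlapping windows and transcribing each one. The sketch below only computes the window boundaries in samples. The function name `chunk_bounds` and the 5-second overlap are assumptions made for illustration, and this is not a description of how the transformers pipeline chunks audio internally (it has its own `chunk_length_s` option).

```python
def chunk_bounds(num_samples, sample_rate, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping windows.

    Hypothetical helper: chunk_s is the window length in seconds and
    overlap_s is how much consecutive windows share.
    """
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += step
    return bounds


# A 70-second clip at 16 kHz is split into three overlapping windows.
for start, end in chunk_bounds(70 * 16000, 16000):
    print(start / 16000, end / 16000)
```

Each window could then be passed to `asr_pipeline` as `combined_sample[start:end]`, with the window's start time added to the returned timestamps. Our sample is about 21 seconds long, so it fits in a single window.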


Find the closest alignment between the diarization and transcription timestamps by minimising the absolute distance between the two:

In [36]:
from speechbox_trycatch_upto_idx import ASRDiarizationPipeline

pipeline = ASRDiarizationPipeline(
    asr_pipeline=asr_pipeline, diarization_pipeline=diarization_pipeline
)
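
To make the alignment idea concrete, here is a minimal pure-Python sketch of one way to match timestamps by minimum absolute distance. It is not the code inside `ASRDiarizationPipeline`: the helpers `merge_turns` and `align_chunks` are hypothetical, and the sample values are rounded from the diarization and ASR outputs printed earlier, with the texts shortened.

```python
# Illustrative sketch only: merge_turns and align_chunks are hypothetical
# helpers, not the functions used by ASRDiarizationPipeline.

def merge_turns(segments):
    """Collapse consecutive segments from the same speaker into one turn."""
    turns = []
    for seg in segments:
        span = seg["segment"]
        if turns and turns[-1]["label"] == seg["label"]:
            turns[-1]["end"] = span["end"]  # extend the current turn
        else:
            turns.append(
                {"label": seg["label"], "start": span["start"], "end": span["end"]}
            )
    return turns


def align_chunks(chunks, turns):
    """Assign ASR chunks to speaker turns.

    For each turn, find the unassigned chunk whose end timestamp has the
    smallest absolute distance to the turn's end, then give that chunk and
    all earlier unassigned chunks to the turn's speaker. Assumes every chunk
    has a numeric end timestamp; chunks left over after the last turn are
    dropped.
    """
    aligned = []
    remaining = list(chunks)
    for turn in turns:
        if not remaining:
            break
        closest = min(
            range(len(remaining)),
            key=lambda i: abs(remaining[i]["timestamp"][1] - turn["end"]),
        )
        for chunk in remaining[: closest + 1]:
            aligned.append(
                {
                    "speaker": turn["label"],
                    "text": chunk["text"],
                    "timestamp": chunk["timestamp"],
                }
            )
        remaining = remaining[closest + 1 :]
    return aligned


# Rounded copies of the outputs shown earlier in the notebook.
segments = [
    {"segment": {"start": 0.59, "end": 6.14}, "track": "A", "label": "SPEAKER_00"},
    {"segment": {"start": 4.81, "end": 5.14}, "track": "B", "label": "SPEAKER_01"},
    {"segment": {"start": 6.76, "end": 9.97}, "track": "C", "label": "SPEAKER_01"},
    {"segment": {"start": 10.66, "end": 10.68}, "track": "D", "label": "SPEAKER_01"},
    {"segment": {"start": 10.68, "end": 15.86}, "track": "E", "label": "SPEAKER_02"},
    {"segment": {"start": 16.28, "end": 20.89}, "track": "F", "label": "SPEAKER_00"},
]
chunks = [
    {"timestamp": (0.0, 6.8), "text": " Explosions rock the enemy line..."},
    {"timestamp": (6.8, 13.32), "text": " The truth is that..."},
    {"timestamp": (13.32, 16.4), "text": " seen three-quarters of an hour ago."},
    {"timestamp": (16.4, 20.84), "text": " As they reached the extraction point..."},
]

turns = merge_turns(segments)
aligned = align_chunks(chunks, turns)
for item in aligned:
    print(item["speaker"], item["timestamp"])
```

Matching on end timestamps keeps the sketch short, but it assigns a whole ASR chunk to a single speaker, so a chunk that spans a speaker change is attributed to only one of them.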

Pass the audio file to the composite pipeline and see what we get out:

In [37]:
# ASRDiarizationPipeline runs both models again. The boxing (alignment) step has been
# extracted from ASRDiarizationPipeline into box.ipynb.
# When hooking this up to the application, either use the box.ipynb method, or use the
# call below without first running the separate diarization and ASR cells above.
final_output = pipeline(combined_sample.copy())

print(final_output)

[{'speaker': 'SPEAKER_00', 'text': ' Explosions rock the enemy line, providing much needed cover for the Az and his team.', 'timestamp': (0.0, 6.8)}, {'speaker': 'SPEAKER_01', 'text': ' The truth is that, as there is ample testimony, and I look to the land where the smoke we had', 'timestamp': (6.8, 13.32)}, {'speaker': 'SPEAKER_02', 'text': ' seen three-quarters of an hour ago.', 'timestamp': (13.32, 16.4)}, {'speaker': 'SPEAKER_00', 'text': ' As they reached the extraction point, they heard the roar of engines overhead.', 'timestamp': (16.4, 20.84)}]


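Save the speaker-attributed transcript together with the raw ASR and diarization outputs to a JSON file:

In [38]: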
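import json
import os

all_output = {
    "speakers": final_output,
    "asr": asr_output["chunks"],
    "diarization": diarization_output
}

# Create the output directory if it does not exist yet.
os.makedirs("output", exist_ok=True)
with open(f"output/{filename}.json", "w") as f:
    json.dump(all_output, f, indent=4)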