# OpenAI Whisper Demo

Modified from the Python usage example in the Whisper README:
[Whisper Github](https://github.com/openai/whisper#python-usage)

The audio is an MP3 file extracted from this brief "Laughter Lift" video:
[Wedding Receptions, North London Pubs and French Toast - Laughter Lift](https://www.youtube.com/watch?v=1R29NlHOoGA)

In [40]:
# Imports
from dotenv import find_dotenv
from dotenv import dotenv_values
config = dotenv_values(find_dotenv())
# NOTE: empty `.env` file was added beneath `src` directory. Ignored by gitignore rules.
import os
import sys
sys.path.append(os.path.dirname(find_dotenv()))
from notebooks.notebook_utils import DevData

# ----------------------------------------
import jiwer
import whisper

In [2]:
# define paths
external_dir = DevData().external_dir
mp3_file = os.path.join(external_dir, "Laughter_Lift.mp3")
print(f"mp3_file exists: {os.path.exists(mp3_file)}")

mp3_file exists: True


Demo the `base` model

In [None]:
bs_model = whisper.load_model("base")
base_result = bs_model.transcribe(mp3_file, fp16=False)
print(base_result["text"])

Demo the `small` model

In [None]:
sm_model = whisper.load_model("small")
sm_result = sm_model.transcribe(mp3_file, fp16=False)
print(sm_result["text"])


Demo the `medium` model

In [None]:
md_model = whisper.load_model("medium")
md_result = md_model.transcribe(mp3_file, fp16=False)
print(md_result["text"])

## Analysis of results

- Use the `jiwer` package to compute the word error rate (WER) of each model's transcript against a reference transcript

In [None]:
# reference transcript, written as adjacent string literals so the line break is valid Python
correct = (
    "More with Emily Watson in take two, ads in a minute. But first, it's time once again, very, very good news everybody, we step into our laughter lift. Huzzah. Shazam. Here we go. Hey Mark, seeing as you loved the noble gases joke so much last week, I was going to tell you another one, but all the good ones are gone. I got that, I got that, because argon's a gas. It is, it's a noble gas. I got it, okay. Anyway, here's another sciencey one for you. are you ready? I was out at a pub with rooms in showbiz North London on Saturday for a wedding reception. A neutron walked in. How much for a pint, he said. For you, no charge, said the barman. And I get that as well, because a neutron has no charge. Subatomic particle with no charge. And then a photon walked in. I'd like a room for the night, please, he said. Certainly, do you have any luggage we can take up to the room? Asked the receptionist. No, said the photon. I'm traveling light. Because a photon is a... It is a light particle. Quantum of light. It was in fact... It's an education today. It's not funny, but it's an education. It was in fact cousin Cecil's wedding to his delightful Parisian fiancee Noémie on Saturday. At the reception, I raised my champagne glass and said, A dish of sliced bread soaked in beaten eggs and often milk or cream, then pan fried. Alternative names and variants include eggy bread, Bombay toast, gypsy toast, and poor nights of Windsor. It was a French toast. The evening did not end well. I got the bar bill and had a massive row with the bar staff. I argued with my cashier that the bill was £70.20, not £7,000... £7,020. He didn't get the point. Anyway, what have we got? You've got that as well. Yes. Yes, got that. What's still to come? Dungeons and Dragons. Okay, back after this. "
    "Unless you're a vanguardista, in which case you definitely don't have a nickname that everyone else knows apart from you and your service will not be interrupted. Thanks very much for watching this video I hope you enjoyed watching it. While you're here, check out all the other videos because they're cool too, aren't they? Yeah, and if you want to keep up to date with everything Kermode and Mayo's take, then check out our social channels. I mean, why wouldn't you? I mean, I would. I have done. Excellent."
)

# jiwer.wer takes the reference first, then the hypothesis
base_wer = jiwer.wer(correct, base_result["text"])
sm_wer = jiwer.wer(correct, sm_result["text"])
md_wer = jiwer.wer(correct, md_result["text"])
print(f"base WER:\t{base_wer}")
print(f"small WER:\t{sm_wer}")
print(f"medium WER:\t{md_wer}")


## Findings

The `base` model is not satisfactory: punctuation and sentence boundaries are unreliable, and a number of words are wrong.

The `medium` model is much better. Sentences and punctuation appear correct, and most of the problems seen with `base` are resolved. It even got the French name Noémie right. One thing `medium` got wrong that `base` got right was "showbiz north London", which it transcribed as "Chobhub north London". It also produced "Vanguard Easter" instead of "vanguardista", although this isn't a common term, so the error is forgivable. I think both of these examples (and other Wittertainment jargon) could be improved with model fine-tuning.

### Running time
| model  | time   |
| -----  | ------ |
| base   | 0m 32s |
| medium | 4m 56s |

### Guessing at full processing time
Note: the YouTube video the audio was extracted from is only 2:08 long, whereas a full Take 1 podcast generally runs a bit over an hour. The running times above were measured locally on CPU. The `medium` model took 4m 56s for 2m 08s of audio, which is roughly 2.3x real time. If Whisper's processing time scales linearly with audio length, that means well over two hours of processing for ONE podcast. This suggests that GPU processing is necessary.
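
The linear-scaling estimate can be checked with a few lines of arithmetic. This is a minimal sketch using the timings from the running-time table; the helper name `estimate_processing_seconds` and the one-hour episode length are assumptions made for illustration.

```python
def estimate_processing_seconds(clip_audio_s, clip_processing_s, target_audio_s):
    """Scale a measured processing time linearly to a longer recording."""
    real_time_factor = clip_processing_s / clip_audio_s
    return real_time_factor * target_audio_s


clip_audio_s = 2 * 60 + 8                          # the 2:08 clip
timings_s = {"base": 32, "medium": 4 * 60 + 56}    # from the running-time table
podcast_s = 60 * 60                                # assumption: a one-hour episode

for name, seconds in timings_s.items():
    estimate = estimate_processing_seconds(clip_audio_s, seconds, podcast_s)
    print(f"{name}: {estimate / 60:.0f} min")
# base: 15 min
# medium: 139 min
```

Real processing time may not scale linearly, so these figures are only a rough guide.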

### Word Error Rate (WER)
The WER metric shows a marked improvement for the `medium` model over the `base` model.
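
For readers unfamiliar with the metric, WER is the number of word substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below is a plain-Python illustration of that idea, not `jiwer`'s implementation. The `normalize` step (lowercasing and stripping punctuation) is an assumption of this example and may differ from the defaults `jiwer` applies.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation (keeping apostrophes) and split into words."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return text.split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "I was out at a pub with rooms in showbiz North London"
hypothesis = "I was out at a pub with rooms in Chobhub North London"
print(f"{word_error_rate(reference, hypothesis):.3f}")  # one substitution in 12 words: 0.083
```

Because this sketch ignores case and punctuation, it is more forgiving than a comparison on the raw strings.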


------

# Diarization using WhisperX

IMPORTANT: note the requirements below before running the diarization cells.

Requirements
- Install pyannote.audio 3.0 with `pip install pyannote.audio`
- Accept the `pyannote/segmentation-3.0` user conditions
- Accept the `pyannote/speaker-diarization-3.0` user conditions
- Create an access token at hf.co/settings/tokens
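
The cells below look the token up under the name `HUGGINGFACE_TOKEN`, sometimes in the `.env` config and sometimes in `os.environ`. A small helper can check both places and fail with a clear message. This is a sketch only; `resolve_hf_token` is a hypothetical name and is not part of pyannote.audio or WhisperX.

```python
import os


def resolve_hf_token(config, environ=os.environ, key="HUGGINGFACE_TOKEN"):
    """Return the token from a dotenv-style mapping, falling back to the environment."""
    token = config.get(key) or environ.get(key)
    if not token:
        raise RuntimeError(
            f"{key} is not set; create a token at hf.co/settings/tokens "
            "and add it to the .env file or the environment"
        )
    return token


# usage in the notebook (assumes `config` was loaded with dotenv_values as above):
# HUGGINGFACE_TOKEN = resolve_hf_token(config)
```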


In [None]:
from whisperx import load_align_model, align
from whisperx.diarize import DiarizationPipeline, assign_word_speakers

In [47]:
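# load Hugging Face token from .env file
HUGGINGFACE_TOKEN = config["HUGGINGFACE_TOKEN"]
# later cells read the token from os.environ, so make it available there too
os.environ.setdefault("HUGGINGFACE_TOKEN", HUGGINGFACE_TOKEN)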

In [None]:

diarization_pipeline = DiarizationPipeline(use_auth_token=os.environ["HUGGINGFACE_TOKEN"])
diarization_result = diarization_pipeline(mp3_file)

In [None]:
# ----------  take 2 ----------

In [None]:
# instantiate the pipeline
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
  "pyannote/speaker-diarization-3.0",
  use_auth_token=os.environ["HUGGINGFACE_TOKEN"])

# run the pipeline on an audio file
diarization = pipeline(mp3_file)

In [None]:
outfile = os.path.join(external_dir, "audio_out.rttm")
# dump the diarization output to disk using RTTM format
with open(outfile, "w") as rttm:
   diarization.write_rttm(rttm)
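
The RTTM file written above is plain text with one `SPEAKER` line per turn, where the fourth and fifth fields are the start time and duration in seconds and the eighth field is the speaker label. A minimal sketch for reading it back follows. The `Turn` tuple, the `parse_rttm` helper and the sample lines (including the `Laughter_Lift` file identifier) are illustrative assumptions, not output copied from this notebook.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    start: float
    end: float
    speaker: str


def parse_rttm(lines):
    """Parse RTTM SPEAKER lines into (start, end, speaker) turns."""
    turns = []
    for line in lines:
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue  # skip blank or non-speaker lines
        start, duration = float(fields[3]), float(fields[4])
        turns.append(Turn(start, start + duration, fields[7]))
    return turns


# illustrative lines in RTTM layout (values are assumptions)
sample = [
    "SPEAKER Laughter_Lift 1 0.000 9.200 <NA> <NA> SPEAKER_00 <NA> <NA>",
    "SPEAKER Laughter_Lift 1 21.200 3.600 <NA> <NA> SPEAKER_01 <NA> <NA>",
]
for turn in parse_rttm(sample):
    print(f"{turn.speaker}: {turn.start:.1f}s to {turn.end:.1f}s")

# with the file written above:
# with open(outfile) as rttm:
#     turns = parse_rttm(rttm)
```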

In [None]:
# ----------  take 3 ----------
# putting it all together

In [None]:
from whisperx import load_align_model, align
from whisperx.diarize import DiarizationPipeline, assign_word_speakers

model = whisper.load_model("medium")
result = model.transcribe(mp3_file, fp16=False)

language_code = result["language"]
segments = result["segments"]


device: str = "cpu"
model_a, metadata = load_align_model(language_code=language_code, device=device)
aligned_segments = align(segments, model_a, metadata, mp3_file, device)


diarization_pipeline = DiarizationPipeline(use_auth_token=os.environ["HUGGINGFACE_TOKEN"])
diarization_result = diarization_pipeline(mp3_file)
diarization_result

In [4]:
# ---------- Take 4 ----------

In [6]:
import whisperx
import gc 

device = "cpu" 
audio_file = mp3_file
batch_size = 1 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
# model = whisperx.load_model("medium", device, compute_type=compute_type)
model = whisper.load_model("medium")

audio = whisperx.load_audio(audio_file)
# result = model.transcribe(audio, batch_size=batch_size)
result = model.transcribe(mp3_file, fp16=False)
print(result["segments"]) # before alignment
# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model



[{'id': 0, 'seek': 0, 'start': 0.0, 'end': 3.04, 'text': ' More with Emily Watson in take two, ads in a minute.', 'tokens': [50364, 5048, 365, 15034, 25640, 294, 747, 732, 11, 10342, 294, 257, 3456, 13, 50516], 'temperature': 0.0, 'avg_logprob': -0.1925884839650747, 'compression_ratio': 1.6193771626297577, 'no_speech_prob': 0.45980846881866455}, {'id': 1, 'seek': 0, 'start': 3.04, 'end': 8.4, 'text': " But first, it's time once again, very, very good news everybody, we step into our laughter lift.", 'tokens': [50516, 583, 700, 11, 309, 311, 565, 1564, 797, 11, 588, 11, 588, 665, 2583, 2201, 11, 321, 1823, 666, 527, 13092, 5533, 13, 50784], 'temperature': 0.0, 'avg_logprob': -0.1925884839650747, 'compression_ratio': 1.6193771626297577, 'no_speech_prob': 0.45980846881866455}, {'id': 2, 'seek': 0, 'start': 8.4, 'end': 10.56, 'text': ' Huzzah. Shazam.', 'tokens': [50784, 389, 16740, 545, 13, 1160, 921, 335, 13, 50892], 'temperature': 0.0, 'avg_logprob': -0.1925884839650747, 'compression_ra

In [7]:
# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model_a

Failed to align segment (" £7,020."): no characters in this segment found in model dictionary, resorting to original...
[{'start': 0.322, 'end': 3.0, 'text': ' More with Emily Watson in take two, ads in a minute.', 'words': [{'word': 'More', 'start': 0.322, 'end': 0.483, 'score': 0.829}, {'word': 'with', 'start': 0.503, 'end': 0.604, 'score': 0.807}, {'word': 'Emily', 'start': 0.664, 'end': 0.886, 'score': 0.527}, {'word': 'Watson', 'start': 0.926, 'end': 1.349, 'score': 0.609}, {'word': 'in', 'start': 1.49, 'end': 1.671, 'score': 0.798}, {'word': 'take', 'start': 1.731, 'end': 1.973, 'score': 0.976}, {'word': 'two,', 'start': 2.054, 'end': 2.275, 'score': 0.539}, {'word': 'ads', 'start': 2.416, 'end': 2.617, 'score': 0.805}, {'word': 'in', 'start': 2.657, 'end': 2.738, 'score': 0.808}, {'word': 'a', 'start': 2.758, 'end': 2.778, 'score': 0.979}, {'word': 'minute.', 'start': 2.819, 'end': 3.0, 'score': 0.844}]}, {'start': 3.421, 'end': 8.28, 'text': " But first, it's time once again, v

In [28]:
# 3. Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=os.environ["HUGGINGFACE_TOKEN"], device=device)

# add min/max number of speakers if known
# diarize_segments = diarize_model(audio)
diarize_segments = diarize_model(audio, min_speakers=2)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
# print(diarize_segments)

print(f'Speakers: {list(set([item["speaker"] for item in result["segments"]]))}')
print(result["segments"]) # segments are now assigned speaker IDs

Speakers: ['SPEAKER_00', 'SPEAKER_01']
[{'start': 0.322, 'end': 3.0, 'text': ' More with Emily Watson in take two, ads in a minute.', 'words': [{'word': 'More', 'start': 0.322, 'end': 0.483, 'score': 0.829, 'speaker': 'SPEAKER_00'}, {'word': 'with', 'start': 0.503, 'end': 0.604, 'score': 0.807, 'speaker': 'SPEAKER_00'}, {'word': 'Emily', 'start': 0.664, 'end': 0.886, 'score': 0.527, 'speaker': 'SPEAKER_00'}, {'word': 'Watson', 'start': 0.926, 'end': 1.349, 'score': 0.609, 'speaker': 'SPEAKER_00'}, {'word': 'in', 'start': 1.49, 'end': 1.671, 'score': 0.798, 'speaker': 'SPEAKER_00'}, {'word': 'take', 'start': 1.731, 'end': 1.973, 'score': 0.976, 'speaker': 'SPEAKER_00'}, {'word': 'two,', 'start': 2.054, 'end': 2.275, 'score': 0.539, 'speaker': 'SPEAKER_00'}, {'word': 'ads', 'start': 2.416, 'end': 2.617, 'score': 0.805, 'speaker': 'SPEAKER_00'}, {'word': 'in', 'start': 2.657, 'end': 2.738, 'score': 0.808, 'speaker': 'SPEAKER_00'}, {'word': 'a', 'start': 2.758, 'end': 2.778, 'score': 0.979

In [29]:
for item in result["segments"]:
    print(f'{item["speaker"]}:  {item["text"]}')


SPEAKER_00:   More with Emily Watson in take two, ads in a minute.
SPEAKER_00:   But first, it's time once again, very, very good news everybody, we step into our laughter lift.
SPEAKER_00:   Huzzah.
SPEAKER_00:  Shazam.
SPEAKER_00:   Here we go.
SPEAKER_00:   Hey Mark, seeing as you loved the noble gases joke so much last week,
SPEAKER_00:   I was going to tell you another one, but all the good ones are gone.
SPEAKER_01:   I got that, I got that, because argon's a gas.
SPEAKER_01:   It is, it's a noble gas.
SPEAKER_00:   I got it, okay.
SPEAKER_00:   Anyway, here's another science you want for you, are you ready?
SPEAKER_00:   I was out at a pub with rooms in Chobhub's North London on Saturday for a wedding reception.
SPEAKER_00:   A neutron walked in.
SPEAKER_00:   How much for a pint, he said.
SPEAKER_00:   For you, no charge, said the barman.
SPEAKER_00:   And I get that as well, because a neutron has no charge.
SPEAKER_00:   Subatomic particle with no charge.
SPEAKER_00:   And the
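
The per-segment listing above repeats the speaker label on every line. For a more readable transcript, consecutive segments from the same speaker can be collapsed into one turn. The following is a minimal sketch, assuming segments are dicts with `speaker` and `text` keys as in the printed result; `merge_speaker_turns` is a hypothetical helper, and the `UNKNOWN` fallback label is an assumption for segments without a speaker.

```python
def merge_speaker_turns(segments):
    """Collapse consecutive segments with the same speaker into single turns."""
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # same speaker as the previous turn: append the text
            turns[-1]["text"] += " " + text
        else:
            turns.append({"speaker": speaker, "text": text})
    return turns


# a few segments taken from the output above
sample_segments = [
    {"speaker": "SPEAKER_00", "text": " I was going to tell you another one, but all the good ones are gone."},
    {"speaker": "SPEAKER_01", "text": " I got that, I got that, because argon's a gas."},
    {"speaker": "SPEAKER_01", "text": " It is, it's a noble gas."},
    {"speaker": "SPEAKER_00", "text": " I got it, okay."},
]
for turn in merge_speaker_turns(sample_segments):
    print(f'{turn["speaker"]}: {turn["text"]}')
```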

In [24]:
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token=os.environ["HUGGINGFACE_TOKEN"])

# run the pipeline on CPU; use torch.device("cuda") instead when a GPU is available
import torch
pipeline.to(torch.device("cpu"))

# apply pretrained pipeline
diarization = pipeline(mp3_file)

# print the result
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
# start=0.2s stop=1.5s speaker_0
# start=1.8s stop=3.9s speaker_1
# start=4.2s stop=5.7s speaker_0

start=0.0s stop=9.2s speaker_SPEAKER_00
start=9.8s stop=11.4s speaker_SPEAKER_00
start=12.1s stop=20.2s speaker_SPEAKER_00
start=21.1s stop=21.2s speaker_SPEAKER_00
start=21.2s stop=24.8s speaker_SPEAKER_01
start=23.8s stop=52.4s speaker_SPEAKER_00
start=53.2s stop=54.2s speaker_SPEAKER_00
start=53.6s stop=56.9s speaker_SPEAKER_01
start=54.6s stop=56.7s speaker_SPEAKER_00
start=57.6s stop=57.6s speaker_SPEAKER_01
start=57.6s stop=57.8s speaker_SPEAKER_00
start=57.8s stop=62.7s speaker_SPEAKER_01
start=59.0s stop=59.4s speaker_SPEAKER_00
start=60.6s stop=61.0s speaker_SPEAKER_00
start=61.6s stop=82.7s speaker_SPEAKER_00
start=83.6s stop=96.2s speaker_SPEAKER_00
start=97.2s stop=109.8s speaker_SPEAKER_00
start=97.5s stop=99.5s speaker_SPEAKER_01
start=100.3s stop=102.8s speaker_SPEAKER_01
start=110.4s stop=118.6s speaker_SPEAKER_00
start=118.6s stop=127.3s speaker_SPEAKER_01
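
The turns above overlap in places, so attaching a speaker to a transcript segment means choosing between candidates. One plausible rule is to pick the speaker whose turns overlap the segment the most. The sketch below illustrates that rule only; it is not taken from WhisperX's `assign_word_speakers`, and the helper names and the example segment times are assumptions.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def dominant_speaker(seg_start, seg_end, turns):
    """Return the speaker with the largest total overlap, or None if nobody overlaps."""
    totals = {}
    for start, end, speaker in turns:
        shared = overlap(seg_start, seg_end, start, end)
        if shared > 0:
            totals[speaker] = totals.get(speaker, 0.0) + shared
    return max(totals, key=totals.get) if totals else None


# a few (start, end, speaker) turns from the diarization output above
turns = [
    (12.1, 20.2, "SPEAKER_00"),
    (21.1, 21.2, "SPEAKER_00"),
    (21.2, 24.8, "SPEAKER_01"),
    (23.8, 52.4, "SPEAKER_00"),
]

# hypothetical transcript segment spanning 21.0s to 24.0s
print(dominant_speaker(21.0, 24.0, turns))  # SPEAKER_01
```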


In [27]:
# print the result
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}: {dir(turn)}")

start=0.0s stop=9.2s speaker_SPEAKER_00: ['__and__', '__annotations__', '__bool__', '__class__', '__contains__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__match_args__', '__module__', '__ne__', '__new__', '__or__', '__post_init__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '__xor__', '_repr_png_', '_str_helper', 'copy', 'duration', 'end', 'intersects', 'middle', 'overlaps', 'set_precision', 'start']
start=9.8s stop=11.4s speaker_SPEAKER_00: ['__and__', '__annotations__', '__bool__', '__class__', '__contains__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__