# OpenAI Whisper Demo

Modified from:
[Whisper GitHub](https://github.com/openai/whisper#python-usage)

Uses an MP3 file extracted from this brief "Laughter Lift" video:
[Wedding Receptions, North London Pubs and French Toast - Laughter Lift](https://www.youtube.com/watch?v=1R29NlHOoGA)

In [1]:
# Imports
from dotenv import find_dotenv
# NOTE: an empty `.env` file was added under the `src` directory (ignored by the .gitignore rules) so that find_dotenv() can locate it and its folder can be added to sys.path.
import os
import sys
sys.path.append(os.path.dirname(find_dotenv()))
from notebooks.notebook_utils import DevData

# ----------------------------------------
import jiwer
import whisper

In [2]:
# define paths
external_dir = DevData().external_dir
mp3_file = os.path.join(external_dir, "Laughter_Lift.mp3")
print(f"mp3_file exists: {os.path.exists(mp3_file)}")

mp3_file exists: True


Demo the `base` model

In [3]:
bs_model = whisper.load_model("base")
base_result = bs_model.transcribe(mp3_file, fp16=False)
print(base_result["text"])

 More with Emily Watson intake two adds in a minute, but first it's time once again very very good news everybody We step into our love to lift as are Shazam here we go Hey mark singers you love the noble gases joke so much last week I was gonna tell you another one, but all the good ones are gone I got that I got that because argons are gas it is is no gas. I got it. Okay. Anyway. Here's another science you want for you I was out at a pub with rooms in showbiz north slendon on Saturday for a wedding reception a neutron walked in How much for a pint? He said for you no charge said the barman and I get that as well because in neutron has no charge Cyber atomic particle with no charge and then a photon walked in. I'd like a room for the night. Please. He said Certainly do you have any luggage we can take up to the room ask the receptionist no said the photon. I'm traveling light Because the photon is like a little like particle quantum of light In it was in fact it's an education today. 

Demo the `small` model

In [5]:
sm_model = whisper.load_model("small")
sm_result = sm_model.transcribe(mp3_file, fp16=False)
print(sm_result["text"])


 14%|█████▍                                 | 64.0M/461M [02:51<17:46, 391kiB/s]


RuntimeError: Model has been downloaded but the SHA256 checksum does not not match. Please retry loading the model.

Demo the `medium` model

In [None]:
md_model = whisper.load_model("medium")
md_result = md_model.transcribe(mp3_file, fp16=False)
print(md_result["text"])

 More with Emily Watson in take two, ads in a minute. But first, it's time once again, very, very good news everybody, we step into our laughter lift. Huzzah. Shazam. Here we go. Hey Mark, seeing as you loved the noble gases joke so much last week, I was going to tell you another one, but all the good ones are gone. I got that, I got that, because argon's a gas. It is, it's a noble gas. I got it, okay. Anyway, here's another science you want for you, are you ready? I was out at a pub with rooms in Chobhub's North London on Saturday for a wedding reception. A neutron walked in. How much for a pint, he said. For you, no charge, said the barman. And I get that as well, because a neutron has no charge. Subatomic particle with no charge. And then a photon walked in. I'd like a room for the night, please, he said. Certainly, do you have any luggage we can take up to the room? Asked the receptionist. No, said the photon. I'm traveling light. Because a photon is a... It is a light particle. Qu

## Analysis of results

- Use the `jiwer` package to compute the word error rate (WER)

In [None]:
# Hand-corrected reference transcript. Adjacent string literals are concatenated,
# so the text can span two lines without a SyntaxError.
correct = (
    "More with Emily Watson in take two, ads in a minute. But first, it's time once again, very, very good news everybody, we step into our laughter lift. Huzzah. Shazam. Here we go. Hey Mark, seeing as you loved the noble gases joke so much last week, I was going to tell you another one, but all the good ones are gone. I got that, I got that, because argon's a gas. It is, it's a noble gas. I got it, okay. Anyway, here's another sciencey one for you. are you ready? I was out at a pub with rooms in showbiz North London on Saturday for a wedding reception. A neutron walked in. How much for a pint, he said. For you, no charge, said the barman. And I get that as well, because a neutron has no charge. Subatomic particle with no charge. And then a photon walked in. I'd like a room for the night, please, he said. Certainly, do you have any luggage we can take up to the room? Asked the receptionist. No, said the photon. I'm traveling light. Because a photon is a... It is a light particle. Quantum of light. It was in fact... It's an education today. It's not funny, but it's an education. It was in fact cousin Cecil's wedding to his delightful Parisian fiancee Noémie on Saturday. At the reception, I raised my champagne glass and said, A dish of sliced bread soaked in beaten eggs and often milk or cream, then pan fried. Alternative names and variants include eggy bread, Bombay toast, gypsy toast, and poor nights of Windsor. It was a French toast. The evening did not end well. I got the bar bill and had a massive row with the bar staff. I argued with my cashier that the bill was £70.20, not £7,000... £7,020. He didn't get the point. Anyway, what have we got? You've got that as well. Yes. Yes, got that. What's still to come? Dungeons and Dragons. Okay, back after this. "
    "Unless you're a vanguardista, in which case you definitely don't have a nickname that everyone else knows apart from you and your service will not be interrupted.Thanks very much for watching this video I hope you enjoyed watching it. While you're here, check out all the other videos because they're cool too, aren't they? Yeah, and if you want to keep up to date with everything Kermode and Mayo's take, then check out our social channels. I mean, why wouldn't you? I mean, I would. I have done. Excellent."
)

# `small` is skipped: its model download failed the checksum check above, so sm_result does not exist.
base_wer = jiwer.wer(correct, base_result["text"])
md_wer = jiwer.wer(correct, md_result["text"])
print(f"base WER:\t{base_wer}")
print(f"medium WER:\t{md_wer}")


base WER:	0.4028103044496487
medium WER:	0.05620608899297424


## Findings

The `base` model is not satisfactory: the punctuation and sentence boundaries are problematic and a number of words are wrong.
The `medium` model is much better. Its sentences and punctuation look correct and most of the problems in `base` are resolved; it even got the French name Noémie right. One thing `medium` got wrong that `base` got right was "showbiz" (`medium` transcribed "showbiz North London" as "Chobhub's North London"). Another thing `medium` got wrong was "Vanguard Easter" instead of "vanguardista", but this isn't a common term, so it is forgivable. I think both of these examples (and other Wittertainment jargon) could be improved with model fine-tuning.

### Running time
| model  | running time (local CPU) |
| ------ | ------------------------ |
| base   | 0m 32s                   |
| medium | 4m 56s                   |

### Guessing at full processing time
Note: the YouTube video the audio was extracted from is only 2 min 08 s long, while the full Take 1 podcast is generally a bit over an hour. The `medium` model took 4m 56s for the 2m 08s clip, i.e. about 2.3× real time. If Whisper's processing time scales linearly, that means well over 2 hours for ONE podcast! N.B. the running times here were measured locally on CPU, which suggests that GPU processing will be necessary.
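The extrapolation is simple arithmetic on the real-time factor (processing time divided by audio length). The sketch below makes it explicit, assuming the 2:08 clip length and the CPU times from the table above; the one-hour episode length and the helper names are illustrative assumptions, not part of the original notebook.

```python
def parse_clip_length(mm_ss: str) -> int:
    """Convert an 'M:SS' string such as '2:08' into seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing seconds per second of audio (values > 1 are slower than real time)."""
    return processing_s / audio_s


def extrapolate_minutes(processing_s: float, clip_s: float, target_audio_s: float) -> float:
    """Estimate processing minutes for target_audio_s, assuming linear scaling."""
    return real_time_factor(processing_s, clip_s) * target_audio_s / 60


clip_s = parse_clip_length("2:08")
cpu_times_s = {"base": 32, "medium": 4 * 60 + 56}  # from the running-time table
ONE_HOUR_S = 60 * 60  # assumption: a one-hour episode

for model, seconds in cpu_times_s.items():
    rtf = real_time_factor(seconds, clip_s)
    estimate = extrapolate_minutes(seconds, clip_s, ONE_HOUR_S)
    print(f"{model}: {rtf:.2f}x real time, ~{estimate:.0f} min per hour of audio")
```

Linear scaling is itself an assumption: Whisper processes audio in 30-second windows, so it is a reasonable first guess, but model loading time is included in a single short run and would be amortised over a long one.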

### Word Error Rate (WER)
The WER shows a marked improvement from the `base` model (≈0.40) to the `medium` model (≈0.056), roughly a sevenfold reduction.
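For reference, WER is the word-level edit distance between the reference and the hypothesis divided by the number of reference words, i.e. WER = (S + D + I) / N for substitutions, deletions and insertions. The minimal pure-Python sketch below implements that definition; `word_error_rate` is a hypothetical helper, not part of `jiwer`. As far as I know, `jiwer.wer` with its default transformation also only strips and splits on whitespace, so case and punctuation differences count as errors; this is likely part of why the poorly punctuated `base` output is penalised so heavily.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the first i-1 reference words and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)  # i deletions to match an empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)  # free if the words match exactly
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the comparison is exact string matching per word, "Mark," and "mark" are different words here; normalising case and punctuation before scoring would give a lower, more forgiving WER.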


------

# Diarization using WhisperX

In [3]:
from whisperx import load_align_model, align
from whisperx.diarize import DiarizationPipeline, assign_word_speakers

In [7]:

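# pyannote's diarization models are gated on Hugging Face: the account behind the
# token must have accepted their user conditions (see the error message below).
HF_TOKEN = os.environ["HUGGINGFACE_TOKEN"]
diarization_pipeline = DiarizationPipeline(use_auth_token=HF_TOKEN)
diarization_result = diarization_pipeline(mp3_file)
diarization_result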


Could not download 'pyannote/speaker-diarization-3.0' pipeline.
It might be because the pipeline is private or gated so make
sure to authenticate. Visit https://hf.co/settings/tokens to
create your access token and retry with:

   >>> Pipeline.from_pretrained('pyannote/speaker-diarization-3.0',
   ...                          use_auth_token=YOUR_AUTH_TOKEN)

If this still does not work, it might be because the pipeline is gated:
visit https://hf.co/pyannote/speaker-diarization-3.0 to accept the user conditions.


AttributeError: 'NoneType' object has no attribute 'to'
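Once diarization runs, the imported `assign_word_speakers` is what WhisperX uses to merge the speaker turns with the word timestamps produced by `align`. As a hedged illustration of the idea (a simplified sketch, not WhisperX's actual code), the pure-Python version below gives each word the speaker whose turns overlap it the most in time. The `Turn` and `Word` classes and the `SPEAKER_00`-style labels are hypothetical stand-ins for the real data structures.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Turn:
    """One diarization turn: a speaker label active between start and end (seconds)."""
    start: float
    end: float
    speaker: str


@dataclass
class Word:
    """One transcribed word with start/end timestamps (seconds)."""
    text: str
    start: float
    end: float
    speaker: Optional[str] = None


def assign_speakers(words: List[Word], turns: List[Turn]) -> List[Word]:
    """Label each word with the speaker that has the largest total time overlap with it."""
    for word in words:
        overlap_by_speaker: Dict[str, float] = {}
        for turn in turns:
            overlap = min(word.end, turn.end) - max(word.start, turn.start)
            if overlap > 0:
                overlap_by_speaker[turn.speaker] = overlap_by_speaker.get(turn.speaker, 0.0) + overlap
        if overlap_by_speaker:  # words outside every turn keep speaker=None
            word.speaker = max(overlap_by_speaker, key=overlap_by_speaker.get)
    return words


turns = [Turn(0.0, 2.0, "SPEAKER_00"), Turn(2.0, 4.0, "SPEAKER_01")]
words = [Word("Hey", 0.1, 0.4), Word("Mark", 1.8, 2.6), Word("Yes", 3.0, 3.3), Word("later", 5.0, 5.5)]
for w in assign_speakers(words, turns):
    print(w.text, w.speaker)
```

A word straddling a speaker change ("Mark" above) goes to whichever speaker covers more of it, and a word outside every turn is left unlabelled rather than guessed.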