# OpenAI Whisper Notebook

## Section 1 - Notebook setup

The following command will pull and install the latest commit from [OpenAI's Whisper repository](https://github.com/openai/whisper) along with its Python dependencies.

In [None]:
pip install git+https://github.com/openai/whisper.git 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-dgdftwci
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-dgdftwci
  Resolved https://github.com/openai/whisper.git to commit c09a7ae299c4c34c5839a76380ae407e7d785914
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken==0.3.1
  Downloading tiktoken-0.3.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting ffmpeg-python==0.2.0
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Building wheels for collected packages:

You'll also want to set Colab's hardware accelerator to 'GPU'. You can do this by going to 'view resources' (available from the drop-down list next to the RAM/Disk bars) and then selecting 'change runtime type'.
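
Once the runtime has been switched, a quick check can confirm that a GPU is actually visible. The sketch below assumes PyTorch is importable (it is installed as a Whisper dependency); `pick_device` and `use_fp16` are hypothetical names used for illustration and are not part of Whisper.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # assumed to be present as a Whisper dependency
    except ImportError:
        return "cpu"  # no PyTorch available: fall back to CPU
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
# Half precision is only worth requesting when a GPU is present.
use_fp16 = device == "cuda"
print(f"device={device}, fp16={use_fp16}")
```

These values could then be passed along, for example as `whisper.load_model("base.en", device=device)` and `model.transcribe(..., fp16=use_fp16)`. The cells in this notebook pass `fp16=False` throughout, which works on both CPU and GPU.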

## Section 2 - High level model access

### 2.1 - English to English Transcription

In this sub-section we'll upload one or more audio files containing English speech and transcribe the content of that audio into English text. So first things first, let's upload the audio:

In [None]:
from google.colab import files
import IPython.display as ipd

uploaded = files.upload() # run this to get an upload widget

ipd.Audio(filename="gt.wav")


Saving gt.wav to gt.wav


Next, we'll load Whisper and ask it to transcribe the audio file we just uploaded:

In [None]:
import whisper

model = whisper.load_model("base.en")
result = model.transcribe("gt.wav", language="en", fp16=False)

print(f"\n\nTranscription: {result['text']}")
print(f"Reference:     Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid.")

100%|███████████████████████████████████████| 139M/139M [00:03<00:00, 38.9MiB/s]




Transcription:  Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid.
Reference:     Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid.
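
Comparing the two lines by eye works for a single sentence, but a numeric score scales better. One common choice is word error rate (WER): the word-level edit distance divided by the number of reference words. The following is a minimal pure-Python sketch, not something Whisper provides; `normalize` and `word_error_rate` are hypothetical helper names, and the normalization (lowercasing, stripping punctuation) is an assumption that may not suit every use case.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid."
hypothesis = " Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid."
print(word_error_rate(reference, hypothesis))  # the two strings match after normalization
```

In the notebook, `hypothesis` would simply be `result['text']` from the cell above.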


### 2.2 - Arabic to English Translation

In this sub-section we'll upload one or more audio files containing Arabic speech and translate the content of that audio into English text. Let's upload the audio:

In [None]:
from google.colab import files
uploaded = files.upload() # run this to get an upload widget

ipd.Audio(filename="ar.m4a")

Saving ar.m4a to ar.m4a


Let's first see how Whisper fares transcribing Arabic speech to Arabic text:

In [None]:
model = whisper.load_model("base")
result = model.transcribe("ar.m4a", language='ar', fp16=False)
print(result["text"])

100%|████████████████████████████████████████| 139M/139M [00:00<00:00, 321MiB/s]


 أهلا بكم جميعا في هذه المحاضرة


Now let's see how well it translates Arabic speech to English text:

In [None]:
model = whisper.load_model("base")
result = model.transcribe("ar.m4a", language='ar', task='translate', fp16=False)
print(result["text"])

# `base` translates poorly here: the output below does not match the Arabic transcription above

 Thank you for watching.


Let's try the same translation with the larger, more accurate `small` model:

In [None]:
model = whisper.load_model("small")
result = model.transcribe("ar.m4a", language='ar', task='translate', fp16=False)
print(result["text"])

100%|███████████████████████████████████████| 461M/461M [00:09<00:00, 48.4MiB/s]


 Welcome everyone in this lecture.


## Section 3 - Low level model access

Below we'll look at lower-level Whisper access using `model.detect_language()` and `whisper.decode()`:

In [None]:
model = whisper.load_model('small')

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio('ar.m4a')
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)
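
To make the "fit 30 seconds" step concrete: Whisper works on 16 kHz audio in 30-second windows, so the padded or trimmed array always holds 480,000 samples. The sketch below is a simplified stand-in that uses plain Python lists rather than the NumPy arrays or tensors the real `whisper.pad_or_trim` operates on; `pad_or_trim_sketch` is a hypothetical name used only for illustration.

```python
SAMPLE_RATE = 16_000                      # Hz; load_audio resamples to this rate
CHUNK_LENGTH = 30                         # seconds per decoding window
N_SAMPLES = SAMPLE_RATE * CHUNK_LENGTH    # 480,000 samples


def pad_or_trim_sketch(samples, length=N_SAMPLES):
    """Simplified stand-in: cut extra samples or append zeros (silence)."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


short_clip = [0.1] * (5 * SAMPLE_RATE)    # 5 s of dummy audio
long_clip = [0.1] * (45 * SAMPLE_RATE)    # 45 s of dummy audio
print(len(pad_or_trim_sketch(short_clip)), len(pad_or_trim_sketch(long_clip)))
```

Note that trimming discards everything after the first 30 seconds, so for longer recordings `model.transcribe()`, which slides a 30-second window over the whole file, is usually the better fit.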

### 3.1 - Language detection

In [None]:
# detect the spoken language
_, probs = model.detect_language(mel)
lang = max(probs, key=probs.get)
prob = "{0:.0%}".format(max(probs.values()))

# print the language that scored the highest likelihood
print(f'Detected language (and probability): {lang}', f'({prob})')

Detected language (and probability): ar (91%)
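
Because `probs` is a dictionary mapping language codes to probabilities, it is easy to look past the single best guess and inspect the runners-up, which can help when the top score is low. The sketch below uses a hypothetical helper, `top_languages`, and a made-up dictionary: only the `ar` value mirrors the output above, and the other entries are invented for illustration.

```python
def top_languages(probs, k=3):
    """Return the k most likely (language, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up probabilities for illustration; only the 'ar' value mirrors the output above.
example_probs = {"ar": 0.91, "fa": 0.04, "ur": 0.02, "en": 0.01}
for code, p in top_languages(example_probs):
    print(f"{code}: {p:.0%}")
```

In the notebook, the `probs` returned by `model.detect_language(mel)` would be passed in place of `example_probs`.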


### 3.2 - Arabic to English Translation

In [None]:
# decode the audio
options = whisper.DecodingOptions(language='ar', task='translate')
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)

Welcome everyone in this lecture.
