<a href="https://colab.research.google.com/github/fastforwardlabs/whisper-openai/blob/master/WhisperDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install all the things

The commands below will install the Python packages needed to record audio snippets and use Whisper models for speech-to-text transcription.

In [1]:
! pip install git+https://github.com/openai/whisper.git
! pip install jiwer
! pip install sounddevice wavio
! pip install ipywebrtc notebook

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-tupy8mha
  Running command git clone -q https://github.com/openai/whisper.git /tmp/pip-req-build-tupy8mha
Collecting transformers>=4.19.0
  Downloading transformers-4.22.1-py3-none-any.whl (4.9 MB)
     |████████████████████████████████| 4.9 MB 5.1 MB/s
Collecting ffmpeg-python==0.2.0
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
     |████████████████████████████████| 6.6 MB 48.0 MB/s
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
     |████████████████████████████████| 120 kB 70.0 MB/s
Building wheels for collected packages: whisper
  Buildi

We also need two system packages, ffmpeg and PortAudio, to record audio from this notebook and process the resulting files.

In [2]:
!apt install ffmpeg
!apt-get install libportaudio2

Reading package lists... Done
Building dependency tree       
Reading state information... Done
ffmpeg is already the newest version (7:3.4.11-0ubuntu0.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 20 not upgraded.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  libportaudio2
0 upgraded, 1 newly installed, 0 to remove and 20 not upgraded.
Need to get 64.6 kB of archives.
After this operation, 215 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libportaudio2 amd64 19.6.0-1 [64.6 kB]
Fetched 64.6 kB in 0s (344 kB/s)
Selecting previously unselected package libportaudio2:
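
Before moving on to the imports, it can be useful to confirm that the installs above actually produced importable modules. The snippet below is a minimal sketch of such a check. The helper `missing_modules` and the `REQUIRED` list are illustrative names that are not part of the original notebook, and the module names are assumed to match the packages installed above.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# ASSUMPTION: these top-level module names correspond to the packages installed above.
REQUIRED = ["whisper", "jiwer", "sounddevice", "wavio", "ipywebrtc", "torchaudio"]

missing = missing_modules(REQUIRED)
if missing:
    print("Not importable yet:", ", ".join(missing))
else:
    print("All required modules were found.")
```

`find_spec` only locates a module without importing it, so the check stays cheap and does not trigger side effects such as loading audio drivers.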

In [3]:
import os
import numpy as np

try:
    import tensorflow  # required in Colab to avoid protobuf compatibility issues
except ImportError:
    pass

import torch
import pandas as pd
import whisper
import torchaudio

from ipywebrtc import AudioRecorder, CameraStream
from IPython.display import Audio

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


Moving 0 files to the new cache system


0it [00:00, ?it/s]

# Record some sound!

First, we need to enable Colab's custom widget manager so that the audio recorder widget can be displayed.

In [4]:
from google.colab import output
output.enable_custom_widget_manager()

### Time to record! 

Press the circle button and start speaking. The widget may not show any sign of it, but it is recording. Click the circle button again when you have finished your audio snippet.

In [5]:
camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder

AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …
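
The widget output above shows `value=b''`, meaning the recorder holds no audio until a recording has been made. If the next cell runs too early, it silently writes an empty file and the conversion fails. The sketch below guards against that. `save_recording` is a hypothetical helper, not part of the original notebook, and it assumes that `recorder.audio.value` is a `bytes` object as the output above suggests.

```python
from pathlib import Path


def save_recording(data: bytes, path: str = "recording.webm") -> Path:
    """Write recorded bytes to `path`, refusing to write an empty recording."""
    if not data:
        raise ValueError(
            "The recording is empty: press the circle button, speak, "
            "then press it again before saving."
        )
    out = Path(path)
    out.write_bytes(data)
    return out


# In this notebook the call would look like:
#     save_recording(recorder.audio.value)
```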

Next, we must convert our recording into a format that PyTorch can read.

In [7]:
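# Save the raw WebM bytes captured by the recorder widget.
with open('recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
# Convert to a mono (-ac 1) WAV file, overwriting any existing output (-y).
!ffmpeg -i recording.webm -ac 1 -f wav my_recording.wav -y -hide_banner -loglevel panic
# Load the waveform tensor and its sample rate, then preview the audio.
sig, sr = torchaudio.load("my_recording.wav")
print(sig.shape)
Audio(data=sig, rate=sr)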

torch.Size([1, 440640])
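
The printed shape reads as (channels, samples): one channel, because ffmpeg was asked for mono output, and 440,640 samples. Dividing the sample count by the sample rate gives the clip length in seconds. The sketch below shows that arithmetic. The 48 kHz rate is only an assumed value for illustration, since the notebook does not print `sr`. In practice, use the `sr` returned by `torchaudio.load`.

```python
def duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Length of a clip in seconds, given its sample count and sample rate."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


# Shape printed above: (channels, samples) = (1, 440640).
channels, num_samples = 1, 440640

# ASSUMPTION: 48 kHz is an illustrative rate only; the real value is `sr`.
assumed_sr = 48_000

print(f"{channels} channel(s), {duration_seconds(num_samples, assumed_sr):.2f} s")
```

In the notebook itself, the equivalent call would be `duration_seconds(sig.shape[1], sr)`.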


Now we're ready to get on with the machine learning!

# Load Whisper model

In [8]:
model = whisper.load_model("base.en", device=DEVICE)
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)

100%|███████████████████████████████████████| 139M/139M [00:03<00:00, 45.6MiB/s]


Model is English-only and has 71,825,408 parameters.


In [9]:
# Predict without timestamps for short-form transcription.
# Half precision (fp16) is only enabled when running on a GPU.
options = whisper.DecodingOptions(language="en", without_timestamps=True, fp16=(DEVICE == "cuda"))

### Process our recording through the Whisper model

In [10]:
audio = whisper.load_audio("my_recording.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
result = model.decode(mel, options)
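
The `pad_or_trim` step matters because `model.decode` works on a fixed 30-second window: shorter clips are padded with silence and longer ones are cut off. The sketch below illustrates that behaviour with NumPy. It is a simplified stand-in, not Whisper's own implementation, and the constants are written out here on the assumption that they match the library's 16 kHz, 30-second, 160-sample-hop configuration.

```python
import numpy as np

# ASSUMPTION: these values mirror Whisper's audio configuration.
SAMPLE_RATE = 16_000                     # samples per second
CHUNK_LENGTH = 30                        # seconds per decoding window
N_SAMPLES = SAMPLE_RATE * CHUNK_LENGTH   # samples per window
HOP_LENGTH = 160                         # samples between spectrogram frames
N_FRAMES = N_SAMPLES // HOP_LENGTH       # spectrogram frames per window


def pad_or_trim_sketch(audio: np.ndarray, length: int = N_SAMPLES) -> np.ndarray:
    """Zero-pad or truncate a 1-D waveform so it has exactly `length` samples."""
    if audio.shape[0] > length:
        return audio[:length]
    return np.pad(audio, (0, length - audio.shape[0]))


short_clip = np.ones(10 * SAMPLE_RATE, dtype=np.float32)   # 10 s of audio
long_clip = np.ones(45 * SAMPLE_RATE, dtype=np.float32)    # 45 s of audio

print(pad_or_trim_sketch(short_clip).shape, pad_or_trim_sketch(long_clip).shape)
print(N_SAMPLES, N_FRAMES)
```

A practical consequence is that anything said after the first 30 seconds of a recording is dropped by this cell. For longer recordings, the library's `model.transcribe` method is likely a better fit, since it processes audio in successive windows.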

In [11]:
result.text

"I'm recording my audio snippet. This is audio snippet number 1,264,374."

How well did Whisper do?
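
One common way to answer that question is the word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the transcript into what was actually said, divided by the number of words actually said. The sketch below computes it in plain Python so the calculation is visible. The function name and the example sentences are made up for illustration and are not results from this notebook. The `jiwer` package installed at the top provides a `wer` function for the same metric.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two strings, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one wrong word out of six.
spoken = "this is audio snippet number one"
transcribed = "this is audio snippet number won"
print(f"WER: {word_error_rate(spoken, transcribed):.3f}")  # WER: 0.167
```

This sketch only lowercases the text before comparing. Punctuation and number formatting (for example "1,264,374" versus the spoken words) would count as errors unless both strings are normalized first.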