# Five Minutes With AI: Whisper

Today we are going to use a very new, very accurate, and very useful speech recognizer called [Whisper](https://openai.com/blog/whisper/).



## Before we get started, what even is this thing you are showing me?



This is a *Jupyter notebook*. It works like a lab book in the natural sciences.

A notebook is a list of *cells*. Each cell can be one of:
* code - in this case, Python code
* Markdown - text: your thoughts about what you are doing

To "execute" (run) a code cell, use Shift+Enter (hold down the Shift and Enter keys at once). 
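For example, a first code cell to try could be as small as the one below (the variable name is made up for illustration). Click into the cell and press Shift+Enter; the output shows up right under it.

```python
# A tiny code cell: store some text in a variable, then print it.
greeting = "Hello from a code cell!"
print(greeting)
```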

If you'd like help with notebooks, we are here.

## Why do you like Whisper?




Here are some things we like about Whisper:
* It can be used to transcribe speech of varying length.
* It is **very accurate**.
* It is pretty fast.
* It is **multilingual**!
* You don't need to first become a computer scientist and figure out how to get your speech into the right "shape"; some thoughtful engineers have done that already.


## What do you not like about Whisper?



Here are some things we don't like about Whisper:
* It outputs transcripts in ~30 second segments. So it might cut off the speaker mid-turn, mid-utterance or mid-word.
* It does not provide phone- or word-level time alignments. So if you need that, this is not for you (but we know of things that might work for you! Come see us!).
* It runs faster on GPUs, so it is slower if you don't have one.
* In the [preprint describing Whisper](https://arxiv.org/pdf/2212.04356.pdf), the authors say they trained on 680,000 hours of speech+text collected from the internet. However, they don't indicate *which speech*, *from where*, *transcribed by whom*, or *with whose consent*. It's possible, for example, that they collected all the audiobooks from Audible (with or without consent) or speeches from Colby.

[AI researchers really, really, really, really, really need to learn how to be more careful with data and more clearly document the sources, nature and permissions in data they use.]
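On the GPU point in the list above: here is a minimal sketch of how you might check whether a GPU is available before loading a model. It assumes PyTorch (which Whisper is built on) is the thing doing the checking; the helper name `pick_device` is ours, not part of Whisper. If PyTorch is not installed at all, the sketch simply falls back to the CPU.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # imported here so the cell still runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)

# Once whisper is installed, the result could be passed along like this:
# model = whisper.load_model("base.en", device=device)
```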


## Tell me more about how Whisper works



Whisper uses **transformer** models (just like DALL-E, ChatGPT, etc.). In the case of Whisper, the models are trained on 30-second chunks of speech. The input is the speech chunk (transformed into a special kind of representation called a log-Mel spectrogram, so that it can be treated kind of like a picture!) and the output is the transcript.
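To make the "30-second chunks" idea concrete, here is a simplified sketch that cuts a list of audio samples into fixed 30-second pieces and pads the last piece with silence. The 16 kHz sample rate is our assumption about how Whisper reads audio (it is not stated above), the function name is made up, and the real library is smarter about where it moves its window. The spectrogram step is left out entirely.

```python
SAMPLE_RATE = 16_000  # samples per second (assumption: Whisper works on 16 kHz audio)
CHUNK_SECONDS = 30    # chunk length described in the text above


def split_into_chunks(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split samples into fixed-length chunks, zero-padding the final chunk."""
    chunk_size = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), chunk_size):
        chunk = list(samples[start:start + chunk_size])
        # Pad a short final chunk with silence so every chunk has the same length.
        chunk.extend([0.0] * (chunk_size - len(chunk)))
        chunks.append(chunk)
    return chunks


# Toy example: 70 seconds of silence at only 10 samples per second.
toy_audio = [0.0] * 700
chunks = split_into_chunks(toy_audio, sample_rate=10)
print(len(chunks), [len(c) for c in chunks])  # prints: 3 [300, 300, 300]
```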

Whisper comes with multiple *models* (any modern AI will be a code shell plus one or more model fillings, just like a pie pan can be filled with multiple kinds of pie). [Here](https://github.com/openai/whisper/blob/main/model-card.md) they are. Today we will use the English "base" model, which does pretty well on English.
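As a rough guide to the naming, the sketch below builds the name you would hand to `whisper.load_model`. The sizes and approximate parameter counts are as we recall them from the Whisper README at the time of writing (treat them as an assumption and check the model card linked above); the helper `model_name` is ours, not part of Whisper.

```python
# Approximate parameter counts in millions (assumption: values from the Whisper README).
MODEL_PARAMS_MILLIONS = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1550}

# Sizes that also come in an English-only ".en" flavor; "large" is multilingual only.
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}


def model_name(size, english_only=False):
    """Build a Whisper model name such as "base" or "base.en"."""
    if size not in MODEL_PARAMS_MILLIONS:
        raise ValueError(f"unknown model size: {size!r}")
    if english_only:
        if size not in ENGLISH_ONLY_SIZES:
            raise ValueError(f"there is no English-only {size!r} model")
        return f"{size}.en"
    return size


print(model_name("base", english_only=True))  # prints: base.en
```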

## Okay, let's get to it!

### First, we install whisper.

In [None]:
!pip install -U openai-whisper


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai-whisper
  Downloading openai-whisper-20230124.tar.gz (1.2 MB)
  Preparing metadata (setup.py) ... done
Collecting transformers>=4.19.0
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting ffmpeg-python==0.2.0
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86

### Second, we use whisper to recognize some speech

For this purpose, we need an audio file! I will use a recording of Martin Luther King's most famous speech, downloaded from [here](https://ia800207.us.archive.org/29/items/MLKDream/MLKDream.wav).

In [None]:
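import whisper

# Load the English-only "base" model (the weights are downloaded the first time).
model = whisper.load_model("base.en")
# Transcribe the recording; whisper reads audio through ffmpeg, which is handed the URL here.
result = model.transcribe("https://ia800207.us.archive.org/29/items/MLKDream/MLKDream.wav")
# The full transcript is stored under the "text" key.
print(result["text"])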

100%|███████████████████████████████████████| 139M/139M [00:02<00:00, 53.7MiB/s]


 I have the pleasure to present to you Dr. Martin Luther King, Jr. I am happy to join with you today in what will go down in history as the greatest demonstration for freedom in the history of our nation. Five score years ago, a great American in whose symbolic shadow we stand today signed the Emancipation Proclamation. This momentous decree came as a great beacon light of hope to millions of Negro slaves who had been seared in the flames of withering injustice. It came as a joyous daybreak to end the long night of their captivity. But one hundred years later, the Negro still is not free. One hundred years later, the life of the Negro is still sadly crippled by the manacles of segregation and the chains of discrimination. One hundred years later, the Negro lives on a lonely island of poverty in the midst of a vast ocean of material prosperity. One hundred years later, the Negro is still languished in the corners of American society and finds himself in exile in his own land. And so we'
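Besides `result["text"]`, the dictionary returned by `transcribe` also has a `"segments"` list, where each segment carries `"start"` and `"end"` times (in seconds) and its own `"text"`. Here is a small sketch for printing the transcript segment by segment. The two helper functions and the `example_result` dictionary are made up for illustration (it is a stand-in, not real Whisper output), so swap in your own `result`.

```python
def format_timestamp(seconds):
    """Turn a number of seconds into MM:SS (fractions of a second are dropped)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def segment_lines(result):
    """Return one line of text per segment, prefixed with its start and end time."""
    lines = []
    for seg in result["segments"]:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} - {end}] {seg['text'].strip()}")
    return lines


# Hypothetical stand-in for the dictionary returned by model.transcribe(...).
example_result = {
    "text": " Hello there. How are you?",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Hello there."},
        {"start": 2.5, "end": 65.0, "text": " How are you?"},
    ],
}

for line in segment_lines(example_result):
    print(line)
```

These are segment-level times only, so they tell you roughly where a stretch of speech sits in the recording, not where each word starts.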

## I want to transcribe my own voice!



There's a company called Hugging Face that hosts certain types of AI models and applications. They are based out of Brooklyn, NY; they have never made a profit; they give away almost everything they do; and for this they have received more than [USD$160m in funding](https://www.crunchbase.com/organization/hugging-face). There are good reasons why I think they are worth this money, though! And mostly they are physicists by training, which is interesting *and* explains a lot about how they work.

The kind folks at Hugging Face host the Whisper models so you can try them yourself with your own speech:
* [English base model](https://huggingface.co/openai/whisper-base.en)
* [Multilingual medium model](https://huggingface.co/openai/whisper-medium)
* [Multilingual large model](https://huggingface.co/openai/whisper-large)
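If you would rather run one of these hosted models from code than from the web page, the sketch below shows one way it might be done with the `transformers` library (which pip pulled in earlier). The model ids come from the three links above; the dictionary keys, the function name, and the idea of recording yourself to a file first are our assumptions. Nothing is downloaded until you actually call the function.

```python
# Model ids taken from the three Hugging Face links listed above.
# The short keys on the left are made-up labels for this sketch.
HUB_MODELS = {
    "english-base": "openai/whisper-base.en",
    "multilingual-medium": "openai/whisper-medium",
    "multilingual-large": "openai/whisper-large",
}


def transcribe_with_transformers(audio_path, model_key="english-base"):
    """Transcribe an audio file with a Whisper model hosted on Hugging Face."""
    # Imported here so this cell can be run even if transformers is missing.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=HUB_MODELS[model_key],
        chunk_length_s=30,  # process long recordings in 30-second pieces
    )
    return asr(audio_path)["text"]
```

To try it, record yourself, save the file next to the notebook, and call something like `transcribe_with_transformers("my_recording.wav")`, where `my_recording.wav` is whatever you named your own file.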

# Show Me More!



If you like Whisper, we have some other great resources coming up for you!
* If you have *video*, we can add pose detection and person tracking.
* For the *audio*, we can add extraction of acoustic/prosodic features.
* Once you have a *transcript*, we can add NLP to identify sentiment, named entities, and more.

# The Tax

If you use Whisper, please **let us know**. We want to work with you! We want to know what works and what doesn't! We want to understand your joys and your concerns.