# Building a Simple ASR Application with Whisper 🤫💬
---

This notebook uses the Whisper model from [Hugging Face Transformers](https://huggingface.co/docs/transformers/model_doc/whisper) and [Gradio](https://gradio.app/) to build a simple automatic speech recognition (ASR) demo.

Note that this application also requires the command-line tool [`ffmpeg`](https://ffmpeg.org/) to be installed on your system:

```bash
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
```
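
Before loading the model, it can help to confirm that `ffmpeg` is actually reachable from Python. The following is a minimal sketch using only the standard library; the helper name `find_ffmpeg` is made up for this example and is not part of Transformers or Gradio.

```python
import shutil


def find_ffmpeg(executable: str = "ffmpeg"):
    """Return the full path to the executable, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg not found on PATH; install it with one of the commands above.")
    else:
        print(f"ffmpeg found at {path}")
```

If the check reports that `ffmpeg` is missing right after installation, opening a new terminal (so that `PATH` is refreshed) is usually worth trying first.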

You can also try the [demo](https://huggingface.co/spaces/openai/whisper) hosted on [Hugging Face Spaces](https://huggingface.co/spaces/launch).

In [None]:
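from transformers import pipeline
import gradio as gr

# Load the English-only "small" Whisper checkpoint as a speech recognition pipeline.
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-small.en")

def transcribe(audio):
    """Transcribe the audio file at the given path and return the recognized text."""
    return pipe(audio)["text"]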
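
The `transcribe` function above assumes it always receives a file path. Gradio may call it with `None` when nothing has been recorded or uploaded, so a small guard can make the demo more forgiving. The sketch below is one possible way to do that; `make_transcriber` and `fake_asr` are hypothetical names introduced here, and `fake_asr` only stands in for the real pipeline so the example runs without downloading a model.

```python
from typing import Callable, Optional


def make_transcriber(asr: Callable[[str], dict]) -> Callable[[Optional[str]], str]:
    """Wrap an ASR callable so that missing audio yields an empty string."""

    def transcribe(audio: Optional[str]) -> str:
        if audio is None:  # nothing was recorded or uploaded
            return ""
        # The callable is assumed to return a dict with a "text" key,
        # as the pipeline in this notebook does.
        return asr(audio)["text"].strip()

    return transcribe


def fake_asr(path: str) -> dict:
    """Stand-in for the real pipeline (assumption for illustration only)."""
    return {"text": f" transcript of {path}"}


safe_transcribe = make_transcriber(fake_asr)
print(repr(safe_transcribe(None)))
print(repr(safe_transcribe("sample.wav")))
```

In the notebook itself, `make_transcriber(pipe)` could be passed as `fn` in place of `transcribe`.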

In [None]:
iface = gr.Interface(
    fn=transcribe,
    # Gradio 3.x argument; set source="upload" to accept audio files. Gradio 4+ uses sources=["microphone"].
    inputs=gr.Audio(source="microphone", type="filepath"),
    outputs="text",
    title="Whisper App",
    description="Realtime demo for automatic speech recognition using a Whisper model.",
)

iface.launch(debug=True)

## Model Information
---

You can also check out Whisper's official [GitHub repository](https://github.com/openai/whisper) for a more comprehensive (non-Hugging Face) tutorial.

## Available models and languages

There are five model sizes, four of which also have English-only versions, offering tradeoffs between speed and accuracy. The table below lists the available models with their approximate memory requirements and their speed relative to the `large` model.


|  Size  | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
|  tiny  |    39 M    |     `tiny.en`      |       `tiny`       |     ~1 GB     |      ~32x      |
|  base  |    74 M    |     `base.en`      |       `base`       |     ~1 GB     |      ~16x      |
| small  |   244 M    |     `small.en`     |      `small`       |     ~2 GB     |      ~6x       |
| medium |   769 M    |    `medium.en`     |      `medium`      |     ~5 GB     |      ~2x       |
| large  |   1550 M   |        N/A         |      `large`       |    ~10 GB     |       1x       |

The `.en` models for English-only applications tend to perform better, especially for the `tiny.en` and `base.en` models. We observed that the difference becomes less significant for the `small.en` and `medium.en` models.
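
As a rough illustration of how the table might guide model choice, the sketch below picks the largest size whose approximate VRAM requirement fits a given budget. The helper `pick_checkpoint` is hypothetical, and the `openai/whisper-<size>` naming is an assumption extrapolated from the `openai/whisper-small.en` checkpoint used earlier in this notebook; the VRAM figures are the approximate values from the table.

```python
# (size, has English-only version, approx. VRAM in GB), copied from the table above
MODELS = [
    ("tiny", True, 1),
    ("base", True, 1),
    ("small", True, 2),
    ("medium", True, 5),
    ("large", False, 10),
]


def pick_checkpoint(vram_gb: float, english_only: bool = False) -> str:
    """Return the assumed Hub ID of the largest model that fits in vram_gb."""
    candidates = [
        size
        for size, has_en, vram in MODELS
        if vram <= vram_gb and (has_en or not english_only)
    ]
    if not candidates:
        raise ValueError(f"No model fits in {vram_gb} GB of VRAM")
    suffix = ".en" if english_only else ""
    # MODELS is ordered from smallest to largest, so the last candidate is the largest.
    return f"openai/whisper-{candidates[-1]}{suffix}"


print(pick_checkpoint(2, english_only=True))
print(pick_checkpoint(16))
```

Because `large` has no English-only version, an English-only request with ample memory falls back to `medium.en` in this sketch.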

Whisper's performance varies widely by language. The figure below shows the WER (word error rate) for each language in the FLEURS dataset using the `large-v2` model; lower values indicate better performance.

![WER breakdown by language](https://raw.githubusercontent.com/openai/whisper/main/language-breakdown.svg)
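
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python version for illustration; it splits on whitespace only, whereas published benchmarks typically normalize text (casing, punctuation) first, so it should not be expected to reproduce the numbers in the figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion (reference word missing)
                    curr[j - 1] + 1,     # insertion (extra hypothesis word)
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Here one substitution ("sat" to "sit") and one deletion ("the") over six reference words give a WER of 2/6.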