### Build a demo with Gradio

Now that we’ve fine-tuned a Whisper model for Dhivehi speech recognition, let’s go ahead and build a Gradio demo to showcase it to the community!

The first step is to load the fine-tuned checkpoint using `pipeline()`, which will be familiar from the section on pre-trained models. You can change `model_id` to the namespace of your fine-tuned model on the Hugging Face Hub, or to one of the pre-trained Whisper models to perform zero-shot speech recognition:

```python
from transformers import pipeline

model_id = "sanchit-gandhi/whisper-small-dv"  # update with your model id
pipe = pipeline("automatic-speech-recognition", model=model_id)
```

Secondly, we’ll define a function that takes the filepath for an audio input and passes it through the pipeline. Here, the pipeline automatically takes care of loading the audio file, resampling it to the correct sampling rate, and running inference with the model. We can then simply return the transcribed text as the output of the function. To ensure our model can handle audio inputs of arbitrary length, we’ll enable chunking as described in the section on pre-trained models:



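```python
def transcribe_speech(filepath):
    """Transcribe the audio file at `filepath` and return the text."""
    output = pipe(
        filepath,
        max_new_tokens=256,
        generate_kwargs={
            "task": "transcribe",
            "language": "sinhalese",  # update with the language you've fine-tuned on
        },
        chunk_length_s=30,  # split long audio into 30-second chunks
        batch_size=8,  # number of chunks transcribed in parallel
    )
    return output["text"]
```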
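
To make the chunking idea more concrete, here is a minimal sketch of how a long recording could be divided into overlapping 30-second windows. It is an illustration only: `chunk_windows` and `stride_s` are hypothetical names, the 5-second overlap is an assumed value, and the real pipeline operates on audio samples and also merges the overlapping transcriptions, which this sketch does not attempt.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_s=5.0):
    """Return (start, end) times in seconds of overlapping windows.

    Hypothetical helper for illustration; `stride_s` is the assumed
    overlap kept on each side of a chunk.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    # Consecutive windows advance by the chunk length minus both overlaps.
    step = chunk_length_s - 2 * stride_s
    if step <= 0:
        raise ValueError("stride_s must be less than half of chunk_length_s")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        # Stop once a window reaches the end of the audio.
        if start + chunk_length_s >= duration_s:
            break
        start += step
    return windows


# A 70-second recording is covered by three overlapping windows.
print(chunk_windows(70.0))  # [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

Audio shorter than one chunk yields a single window, which is why short clips behave the same with or without chunking enabled.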

We’ll use the Gradio blocks feature to launch two tabs on our demo: one for microphone transcription, and the other for file upload.



```python
!pip install -q gradio
```

```python
import gradio as gr

demo = gr.Blocks()

mic_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources="microphone", type="filepath"),
    outputs=gr.Textbox(),
)

file_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources="upload", type="filepath"),
    outputs=gr.Textbox(),
)
```

Finally, we launch the Gradio demo using the two interfaces that we’ve just defined:



```python
with demo:
    gr.TabbedInterface(
        [mic_transcribe, file_transcribe],
        ["Transcribe Microphone", "Transcribe Audio File"],
    )

demo.launch(debug=True)
```

Should you wish to host your demo on the Hugging Face Hub, you can use this Space as a template for your fine-tuned model.

Click the link to duplicate the template demo to your account: https://huggingface.co/spaces/course-demos/whisper-small?duplicate=true

We recommend giving your Space a similar name to your fine-tuned model (e.g. whisper-small-dv-demo) and setting the visibility to “Public”.

Once you’ve duplicated the Space to your account, click “Files and versions” -> “app.py” -> “edit”. Then change the model identifier to your fine-tuned model (line 6). Scroll to the bottom of the page and click “Commit changes to main”. The demo will reboot, this time using your fine-tuned model. You can share this demo with your friends and family so that they can use the model that you’ve trained!

Check out our video tutorial to get a better understanding of how to duplicate the Space 👉️ [YouTube Video](https://www.youtube.com/watch?v=VQYuvl6-9VE)

We look forward to seeing your demos on the Hub!

## Hands-on exercise (TODO)

In this unit, we explored the challenges of fine-tuning ASR models, acknowledging the time and resources required to fine-tune a model like Whisper (even a small checkpoint) on a new language. To provide hands-on experience, we have designed an exercise that walks you through the process of fine-tuning an ASR model on a smaller dataset. The main goal of this exercise is to familiarize you with the process rather than to achieve production-level results. We have intentionally set a lenient WER threshold so that you should be able to reach it even with limited resources.

Here are the instructions:

- Fine-tune the `openai/whisper-tiny` model using the American English (`en-US`) subset of the `PolyAI/minds14` dataset.
- Use the first 450 examples for training, and the rest for evaluation. Ensure you set `num_proc=1` when pre-processing the dataset using the `.map` method (this will ensure your model is submitted correctly for assessment).
- To evaluate the model, use the `wer` and `wer_ortho` metrics as described in this unit. However, do not convert the metric into a percentage by multiplying by 100 (e.g. if WER is 42%, we’ll expect to see the value 0.42 in this exercise).
- Once you have fine-tuned a model, make sure to upload it to the 🤗 Hub with the following kwargs:

```
kwargs = {
    "dataset_tags": "PolyAI/minds14",
    "finetuned_from": "openai/whisper-tiny",
    "tasks": "automatic-speech-recognition",
}
```

You will pass this assignment if your model’s normalised WER (wer) is lower than 0.37.
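
As a rough illustration of the two metrics and of the fractional scale expected here, the sketch below computes WER for a single sentence pair in plain Python. It is a simplified stand-in, not the evaluation used for assessment: `word_error_rate` and `simple_normalize` are hypothetical helpers, the normaliser only lowercases and strips punctuation, and the example sentences are made up. For the exercise itself, use the metric and normaliser described in this unit.

```python
import re


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


def simple_normalize(text):
    """Assumed minimal normaliser: lowercase and drop punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower())


reference = "I want to pay my bill."
hypothesis = "i want to pay the bill"

# Orthographic WER counts casing and punctuation differences as errors.
wer_ortho = word_error_rate(reference, hypothesis)  # 3 errors / 6 words
# Normalised WER compares the text after normalisation.
wer = word_error_rate(simple_normalize(reference), simple_normalize(hypothesis))  # 1 error / 6 words

print(f"wer_ortho={wer_ortho:.2f}, wer={wer:.2f}")  # wer_ortho=0.50, wer=0.17
print(wer < 0.37)  # True: values stay as fractions, not percentages
```

Note that a corpus-level metric typically pools errors over all reference words rather than averaging per-sentence scores, so this single-pair sketch is only meant to show the scale and the effect of normalisation.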

Feel free to build a demo of your model, and share it on Discord! If you have questions, post them in the #audio-study-group channel.

## Supplemental reading and resources

This unit provided a hands-on introduction to speech recognition, one of the most popular tasks in the audio domain. Want to learn more? Here you will find additional resources that will help you deepen your understanding of the topics and enhance your learning experience.

- [Whisper Talk](https://www.youtube.com/live/fZMiD8sDzzg) by Jong Wook Kim: a presentation on the Whisper model, explaining the motivation, architecture, training and results, delivered by one of the Whisper authors
- [End-to-End Speech Benchmark (ESB)](https://arxiv.org/abs/2210.13352): a paper that comprehensively argues for using the orthographic WER as opposed to the normalised WER for evaluating ASR systems and presents an accompanying benchmark
- [Fine-Tuning Whisper for Multilingual ASR](https://huggingface.co/blog/fine-tune-whisper): an in-depth blog post that explains how the Whisper model works in more detail, and the pre- and post-processing steps involved with the feature extractor and tokenizer
- [Fine-tuning MMS Adapter Models for Multi-Lingual ASR](https://huggingface.co/blog/mms_adapters): an end-to-end guide for fine-tuning Meta AI’s new MMS speech recognition models, freezing the base model weights and only fine-tuning a small number of adapter layers
- [Boosting Wav2Vec2 with n-grams in 🤗 Transformers](https://huggingface.co/blog/wav2vec2-with-ngram): a blog post for combining CTC models with external language models (LMs) to combat spelling and punctuation errors