<a href="https://colab.research.google.com/github/Vaibhavs10/notebooks/blob/main/zero_to_asr_101.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Zero to Recognising Speech: Speed Run Edition ⚡️

Welcome to a one-size-fits-all notebook that takes you from finding a `transformers`-compatible speech recognition checkpoint to building an easy-to-use demo with it. 🏃‍♂️

## Setup environment

🤗 `transformers` was made to get you up and running with as little friction as possible. To transcribe audio with *any* compatible checkpoint on the Hub, all you need to install is `transformers`. Later on we'll use `gradio` to build a neat demo that you can share with your friends, colleagues and network! 💥

In [3]:
!pip install --quiet transformers "gradio<4"  # the demo below uses the Gradio 3.x API

## Transcription pipeline

Pipelines within 🤗 `transformers` are a unified API for all supported architectures. This lets you keep the same code and swap in different architectures.

First, let's import `pipeline` from `transformers`.

In [6]:
import torch
from transformers import pipeline

Next, let's define a couple of helper variables so we can easily switch between devices and models!

In [8]:
device = 0 if torch.cuda.is_available() else "cpu" # make sure to select GPU runtime
MODEL_NAME = "openai/whisper-small"
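
To make the "switch models" idea concrete, here is a minimal sketch that collects the pipeline arguments in one place so that only the checkpoint name changes. The helper name `asr_pipeline_kwargs` is hypothetical, and `facebook/wav2vec2-base-960h` is just one example of another speech recognition checkpoint on the Hub; nothing here downloads a model.

```python
def asr_pipeline_kwargs(model_name, device="cpu", chunk_length_s=30):
    """Collect the keyword arguments for `pipeline(...)` in one place (hypothetical helper)."""
    return {
        "task": "automatic-speech-recognition",
        "model": model_name,
        "chunk_length_s": chunk_length_s,
        "device": device,
    }


# Two checkpoints with different architectures; only the name changes.
whisper_kwargs = asr_pipeline_kwargs("openai/whisper-small")
wav2vec2_kwargs = asr_pipeline_kwargs("facebook/wav2vec2-base-960h")

# With `transformers` installed, either dict can be passed straight through:
# pipe = pipeline(**whisper_kwargs)
```

Keeping the arguments in a plain dict makes it easy to compare configurations before paying the cost of loading a model.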

Last, we define our pipeline, which creates a wrapper around our model's `generate` function. All we strictly need to pass is an appropriate `task` and `model`; `chunk_length_s`, `device` and `return_timestamps` are optional extras.

In [9]:
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
    return_timestamps=True
)
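
The `chunk_length_s=30` argument tells the pipeline to process long audio in 30-second windows. As a rough illustration of the idea (not the library's exact algorithm), the sketch below computes overlapping windows over a clip. The function name `chunk_windows` and the 5-second overlap on each side are assumptions made for this example.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_s=5.0):
    """Return (start, end) windows in seconds that cover a clip with overlap.

    Illustrative only: `stride_s` is an assumed overlap kept on each side of a
    window so that words cut at a boundary appear whole in a neighbouring window.
    """
    if chunk_length_s <= 2 * stride_s:
        raise ValueError("chunk_length_s must be larger than twice the stride")

    step = chunk_length_s - 2 * stride_s  # new audio contributed by each window
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:  # the clip is fully covered
            break
        start += step
    return windows


# A 70-second clip is covered by three overlapping 30-second windows.
print(chunk_windows(70.0))
```

A clip shorter than one window, such as the 5.84-second sample used in this notebook, yields a single window.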

Awesome! Let's take this out for a spin!

In [15]:
pipe("sentence-1-1.wav")



{'text': ' This is kind of the conventional wisdom that they stand to gain or to from this environment right now.',
 'chunks': [{'timestamp': (0.0, 5.84),
   'text': ' This is kind of the conventional wisdom that they stand to gain or to from this environment right now.'}]}

## Voila!

Within seconds we have the transcriptions along with their associated timestamps 💥
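
To show how the timestamped output can be used, here is a small sketch that turns the `chunks` list into readable lines. The helper name `format_chunks` is hypothetical, the example dict is copied from the output above, and the handling of a missing end timestamp is a defensive assumption rather than documented behaviour.

```python
def format_chunks(result):
    """Render each timestamped chunk as '[start -> end] text' (hypothetical helper)."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # Assumption: guard against a missing end time instead of crashing.
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s -> {end_label}] {chunk['text'].strip()}")
    return "\n".join(lines)


# The result shown above, reproduced as a plain dict.
example = {
    "text": " This is kind of the conventional wisdom that they stand to gain or to from this environment right now.",
    "chunks": [
        {
            "timestamp": (0.0, 5.84),
            "text": " This is kind of the conventional wisdom that they stand to gain or to from this environment right now.",
        }
    ],
}

print(format_chunks(example))
```

With a live pipeline, `format_chunks(pipe("sentence-1-1.wav"))` would be the equivalent call.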

## Let's package it all up into an application now!

In [16]:
import torch

import gradio as gr
from transformers import pipeline

MODEL_NAME = "openai/whisper-small"  # swap in any compatible checkpoint from the Hub
lang = "en"

device = 0 if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
)

def transcribe(microphone, file_upload):
    warn_output = ""
    if (microphone is not None) and (file_upload is not None):
        warn_output = (
            "WARNING: You've uploaded an audio file and used the microphone. "
            "The recorded file from the microphone will be used and the uploaded audio will be discarded.\n"
        )

    elif (microphone is None) and (file_upload is None):
        return "ERROR: You have to either use the microphone or upload an audio file"

    file = microphone if microphone is not None else file_upload

    text = pipe(file)["text"]

    return warn_output + text


demo = gr.Blocks()

mf_transcribe = gr.Interface(
    fn=transcribe,
    inputs=[
        gr.inputs.Audio(source="microphone", type="filepath", optional=True),
        gr.inputs.Audio(source="upload", type="filepath", optional=True),
    ],
    outputs="text",
    layout="horizontal",
    theme="huggingface",
    title="Whisper Demo: Transcribe Audio",
    description=(
        "Transcribe long-form microphone or audio inputs with the click of a button! Demo uses the"
        f" checkpoint [{MODEL_NAME}](https://huggingface.co/{MODEL_NAME}) and 🤗 Transformers to transcribe audio files"
        " of arbitrary length."
    ),
    allow_flagging="never",
)


with demo:
    gr.TabbedInterface([mf_transcribe], ["Transcribe Audio"])

demo.launch(enable_queue=True)



Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://49fce7475f25f3044d.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces




That's it! You can now take this exact code snippet over to https://hf.co/spaces and share your own demos! 

Try it out and tweet your results at me: [@reach_vb](https://twitter.com/reach_vb)