# BUILDING NLP WEB APPS WITH GRADIO, HUGGING FACE TRANSFORMERS, AND FB'S WAV2VEC2

NLP is no longer limited to text inputs and outputs: models that convert images, audio, or video to text (or the reverse) are increasingly common. Multimedia web apps are far more complicated to build, but thankfully the latest version of Gradio supports these "mixed-media" apps as well.

This notebook walks through a quick example of a speech-to-text app with audio input and text output. According to [Gradio's documentation](https://gradio.app/docs), the library supports video and image inputs just as easily.

The key issue to note here is that both Gradio and Transformers are being updated at a rapid clip, so certain models, features, and classes may not be in sync across their latest versions.

I've suppressed the warnings to keep the notebook clean. If you comment out the warnings filter, you'll see deprecation warnings for Wav2Vec2ForMaskedLM and Wav2Vec2Tokenizer, but their replacements don't work with Gradio yet. At some point they'll sync up; for now, keep that in mind in case the code breaks for you.
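
If you'd rather not silence everything, a narrower filter is one option. Here's a minimal sketch using only the standard library. `legacy_call` is a hypothetical stand-in for a library call, and I'm assuming the deprecation notices are raised as FutureWarning or DeprecationWarning, which may not hold for every release. Messages that go through a logger rather than the warnings module are not affected by any of this.

```python
import warnings


def legacy_call():
    """Hypothetical stand-in for a library call that emits two kinds of warnings."""
    warnings.warn("this class is deprecated", FutureWarning)
    warnings.warn("something else worth seeing", UserWarning)
    return "ok"


def run_quietly(func):
    """Call func, hiding only FutureWarning/DeprecationWarning.

    Returns the function's result and the messages of the warnings that stayed visible.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record everything by default
        # filters added later take precedence, so these two win for their categories
        warnings.simplefilter("ignore", category=FutureWarning)
        warnings.simplefilter("ignore", category=DeprecationWarning)
        result = func()
    return result, [str(w.message) for w in caught]


result, visible = run_quietly(legacy_call)
print(result, visible)
```

The point of the design is that unrelated warnings still surface, so a genuine problem is less likely to be hidden along with the deprecation noise.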

In [1]:
import gradio as gr
import librosa
import soundfile as sf
import torch
import warnings

from transformers import Wav2Vec2ForMaskedLM, Wav2Vec2Tokenizer

warnings.filterwarnings("ignore")

# 1. LOAD WAV2VEC2 MODEL, DEFINE SPEECH-TO-TEXT FUNCTION

There are about [400 versions of Wav2Vec2 models](https://huggingface.co/models?search=wav2vec2) on Hugging Face's model hub, so feel free to swap in a different one to suit your use case.

The asr_transcript function should allow you to process long audio clips. Just keep in mind that excessively long clips will take a long time to process and could well crash the app. I've tested the app on clips of a little over 5 minutes.

I've included two audio clips in the repo's data folder: jfk.flac (62 seconds) and amanda_gorman.flac (5 min 31 s).
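
To get a feel for how much work a clip means, here's a rough back-of-the-envelope sketch. It assumes the audio is sampled at 16 kHz and uses the same streaming settings as the function in the next cell (block_length=20, hop_length=16000, no overlap between frames). `chunk_plan` is a hypothetical helper, and the chunk counts are estimates rather than measured values.

```python
import math


def chunk_plan(duration_s, sample_rate=16000, block_length=20, hop_length=16000):
    """Return (seconds per chunk, number of chunks) for non-overlapping frames."""
    samples_per_chunk = block_length * hop_length
    chunk_seconds = samples_per_chunk / sample_rate
    total_samples = duration_s * sample_rate
    # the final chunk is usually shorter, but it still costs one model call
    n_chunks = math.ceil(total_samples / samples_per_chunk)
    return chunk_seconds, n_chunks


for name, seconds in [("jfk.flac", 62), ("amanda_gorman.flac", 331)]:
    chunk_seconds, n_chunks = chunk_plan(seconds)
    print(f"{name}: {n_chunks} chunks of up to {chunk_seconds:.0f} s")
```

Note that the chunk length in seconds depends on the file's sample rate: the same settings applied to a 44.1 kHz file would give chunks of roughly 7 seconds instead of 20.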

In [2]:
# load wav2vec2 tokenizer and model
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

model = Wav2Vec2ForMaskedLM.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

# define speech-to-text function
def asr_transcript(audio_file):
    transcript = ""

    # Stream over blocks of 20 frames x 16,000 samples (20-second chunks for 16 kHz audio)
    stream = librosa.stream(
        audio_file.name, block_length=20, frame_length=16000, hop_length=16000
    )

    for speech in stream:
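        # librosa.stream yields mono blocks by default; if mono=False were passed,
        # blocks would have shape (channels, samples), so downmix with to_mono
        if len(speech.shape) > 1:
            speech = librosa.to_mono(speech)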

        input_values = tokenizer(speech, return_tensors="pt").input_values
        # inference only, so skip gradient tracking to save memory
        with torch.no_grad():
            logits = model(input_values).logits

        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = tokenizer.batch_decode(predicted_ids)[0]
        transcript += transcription.lower() + " "

    return transcript


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of Wav2Vec2ForMaskedLM were not initialized from the model checkpoint at facebook/wav2vec2-large-960h-lv60-self and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
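
The last two steps of asr_transcript (torch.argmax followed by tokenizer.batch_decode) amount to greedy CTC decoding: pick the most likely token per audio frame, collapse consecutive repeats, then drop the blank token. Here's a toy sketch of that idea in plain Python. The vocabulary and frame ids are made up for illustration, and `greedy_ctc_decode` is a hypothetical helper, not the tokenizer's actual implementation.

```python
def greedy_ctc_decode(ids, vocab, blank="<pad>", word_sep="|"):
    """Collapse repeated ids, drop blanks, and map the word separator to a space."""
    tokens = []
    previous = None
    for i in ids:
        if i != previous:  # collapse consecutive repeats of the same id
            tokens.append(vocab[i])
        previous = i
    text = "".join(t for t in tokens if t != blank)
    return text.replace(word_sep, " ").strip()


# Hypothetical toy vocabulary, not the checkpoint's real one
vocab = ["<pad>", "|", "H", "I", "T", "E", "R"]
# Per-frame argmax ids, standing in for torch.argmax(logits, dim=-1) on one clip
frame_ids = [2, 2, 0, 3, 3, 1, 1, 4, 0, 2, 5, 5, 0, 6, 5]
print(greedy_ctc_decode(frame_ids, vocab))
```

The blank token is what lets a genuinely doubled letter survive: the ids 2, 0, 2 decode to "HH", while 2, 2 collapses to a single "H".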


# 2. DEFINE GRADIO INTERFACE

In [3]:
gradio_ui = gr.Interface(
    fn=asr_transcript,
    title="Speech-to-Text with HuggingFace+Wav2Vec2",
    description="Upload an audio clip, and let AI do the hard work of transcribing",
    inputs=gr.inputs.Audio(label="Upload Audio File", type="file"),
    outputs=gr.outputs.Textbox(label="Auto-Transcript"),
)
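
With type="file", Gradio hands the function a temporary file object, which is why asr_transcript reads audio_file.name. As far as I know, other configurations (such as a "filepath" type in later Gradio releases) pass a plain path string instead, so treat that as an assumption to check against the version you have installed. A small hypothetical helper, `resolve_audio_path`, is one way to accept either form:

```python
import os
import tempfile


def resolve_audio_path(audio):
    """Return a filesystem path from a path string or a file-like object with .name."""
    if isinstance(audio, (str, os.PathLike)):
        return os.fspath(audio)
    name = getattr(audio, "name", None)
    if name is None:
        raise TypeError(f"Unsupported audio input: {type(audio).__name__}")
    return name


# Both call styles resolve to the same path
with tempfile.NamedTemporaryFile(suffix=".flac") as tmp:
    print(resolve_audio_path(tmp) == resolve_audio_path(tmp.name))
```

Inside asr_transcript, you would then pass resolve_audio_path(audio_file) to librosa.stream in place of audio_file.name.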


## 2.1 LAUNCH

In [4]:
#gradio_ui.launch(share=True)
gradio_ui.launch()

Running locally at: http://127.0.0.1:7860/
To create a public link, set `share=True` in `launch()`.
Interface loading below...


(<Flask 'gradio.networking'>, 'http://127.0.0.1:7860/', None)