# TRANSCRIBING AUDIO FILES WITH WAV2VEC2

Transcribing audio files into text is among the most tedious and time-consuming tasks in business and research. As such, I was particularly keen to try [Hugging Face's implementation of Facebook's Wav2Vec2 model](https://huggingface.co/transformers/model_doc/wav2vec2.html) when it was released in transformers 4.3.

I was curious to see how well it would perform for short and long speeches, different accents and different "delivery formats" - be it a formal speech or a poetry recital. The three notebooks in this repo cover the results from my trials, ranging from a short audio snippet (62s) to a poetry recital (5 minutes 34s) and a 12-minute-plus political speech.

The accents in these audio clips involve: White American, African American and Singaporean Chinese.

I find the results from Wav2Vec2 to be really impressive, and think this can help open up new ways to "chain link" NLP tasks directly from audio to textual analysis.

Long audio clips are very memory-intensive, however, and efforts to process audio files longer than 90s tend to crash normal work machines and even Colab Pro notebooks. The workaround in notebooks 2.0/2.1 is somewhat clumsy, but it gets the job done within a reasonable period of time. I will figure out a more efficient way to do this at some point.


## REFERENCES

- Documentation of [Hugging Face's implementation of Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html)

- Hosted inference [API on Hugging Face](https://huggingface.co/facebook/wav2vec2-base-960h)

- Paper on [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477)


## RESULTS

- The output text files of the two longer trials can be downloaded [here](https://www.dropbox.com/s/zx4bfct1zhl18az/amanda_gorman.txt) and [here](https://www.dropbox.com/s/gu3e6ns4x4tty61/lhl_wef.txt)


## REQUIREMENTS

- [transformers](https://pypi.org/project/transformers/) >= 4.3
- [librosa](https://pypi.org/project/librosa/)
- If you want to use your own audio clips, make sure to downsample them to 16kHz, as the Wav2Vec2 model used here was pretrained and fine-tuned on speech audio sampled at 16kHz. I used [Audacity](https://www.audacityteam.org/) to split up the audio files in this repo.


## MODELS

- There are several versions of the Wav2Vec2 model on Hugging Face's model hub. I haven't compared their output quality. Check them out [here](https://huggingface.co/models?search=wav2ve).

- This repo uses the [wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h) model throughout.

# 1. TRANSCRIBE LONGER AUDIO CLIPS ON COLAB (PRO)

This is the second of two notebooks on transcribing longer audio clips with Wav2Vec2. With clips beyond 10 minutes (I've tried clips of up to around 21 minutes), I find it better to run the notebook on Colab.

If you have a Colab Pro account, switch to the "TPU" option so that you'll be allocated 35GB of RAM, the maximum available under the Pro settings (the "GPU" setting gets you about 25GB of RAM).

For this trial, I picked a 12.5-minute speech by [Singapore Prime Minister Lee Hsien Loong](https://www.youtube.com/watch?v=izrdoAm4_Gw) to see how the Wav2Vec2 model deals with an Asian accent, on top of the longer audio clip.

The original audio clip was split into 13 parts, with the first 12 all being a minute long. You can split the original speech into fewer but longer clips if you have access to better compute.
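The split described above can also be worked out in code. The sketch below is only an illustration, not the method used in this repo (the clips here were cut by hand in Audacity). `chunk_bounds` is a hypothetical helper, and the 750-second duration is an approximation of the 12.5-minute speech.

```python
def chunk_bounds(n_samples, sample_rate=16000, max_seconds=60):
    """Return (start, end) sample indices for consecutive chunks.

    Every chunk is at most `max_seconds` long; the last one holds
    whatever is left over.
    """
    if n_samples < 0 or sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("n_samples must be >= 0; rate and length must be > 0")
    step = int(sample_rate * max_seconds)
    return [(start, min(start + step, n_samples))
            for start in range(0, n_samples, step)]


# Roughly 12.5 minutes of 16kHz audio (assumed duration: 750 seconds).
bounds = chunk_bounds(n_samples=750 * 16000)
print(len(bounds))                              # 13 chunks
print((bounds[-1][1] - bounds[-1][0]) / 16000)  # 30.0 seconds in the last chunk
```

Each `(start, end)` pair could then be used to slice the array returned by `librosa.load`, for example `speech[start:end]`. Raising `max_seconds` gives fewer, longer chunks if more memory is available.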

This took about 2 minutes 4s to run on Colab Pro, a fraction of the time a human transcriber would need. The model didn't struggle with the change in accent, though it ought to be noted that the Prime Minister is a very experienced public speaker. Results could vary if the trial were conducted with an Asian speaker who is less proficient in English and/or public speaking.

To run this, upload the "lhl_wef" sub-folder in the "audio" folder to your Google Drive and adjust the file paths in the notebook as necessary.

In [1]:
! pip install -q transformers



In [2]:
# mount your Google Drive, where the audio files have to be stored

from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [3]:
import librosa
import pandas as pd
import os
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

In [4]:
os.chdir("/content/drive/My Drive/Colab Notebooks")

In [5]:
#load tokenizer and pre-trained model
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")


## 1.1 DEFINE FUNCTIONS TO TRANSCRIBE SHORTER AUDIO CLIPS ONE AT A TIME

The functions return a dataframe for ease of "transfer" to other NLP tasks.

## NOTE:

- Make sure the numbering of your split audio files starts at "1", and not "01" or "001".

- Change the file path/names for the audio files as necessary
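The numbering rule matters because the function below builds each file name with `%d`, which produces `lhl_wef-1.flac` and never `lhl_wef-01.flac`. If your clips are already zero-padded, one option is to sort the file names by their numeric suffix instead. This is a minimal sketch: `clip_number` and `sort_clips` are hypothetical helpers, and the `<name>-<number>.flac` pattern is assumed from the path used in this notebook.

```python
import re


def clip_number(filename):
    """Extract the trailing clip number from names like 'lhl_wef-07.flac'."""
    match = re.search(r"-(\d+)\.flac$", filename)
    if match is None:
        raise ValueError(f"no clip number found in {filename!r}")
    return int(match.group(1))


def sort_clips(filenames):
    """Order clips by their number, regardless of zero-padding."""
    return sorted(filenames, key=clip_number)


names = ["lhl_wef-10.flac", "lhl_wef-02.flac", "lhl_wef-1.flac"]
print(sort_clips(names))
# ['lhl_wef-1.flac', 'lhl_wef-02.flac', 'lhl_wef-10.flac']
```

The sorted list (built, say, from `os.listdir`) could then be iterated directly rather than formatting an index into the path.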

In [6]:
def split(x, y):
    """Transcribe clips numbered x to y and return {clip_number: text}."""
    transcribe = {}
    for i in range(x, y + 1):
        # librosa resamples to 16kHz on load, the rate the model expects
        speech, rate = librosa.load("lhl_wef/lhl_wef-%d.flac" % i, sr=16000)
        input_values = tokenizer(speech, return_tensors="pt").input_values
        # inference only, so skip gradient tracking to save memory
        with torch.no_grad():
            logits = model(input_values).logits
        # pick the most likely token per frame; decode() merges them into text
        predicted_ids = torch.argmax(logits, dim=-1)
        transcribe[i] = tokenizer.decode(predicted_ids[0])
    return transcribe

In [7]:
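def transcript(num_clips):
    trans = {}
    # Process the clips in overlapping pairs (1-2, 2-3, ...), so that only
    # two clips are handled per call to split(). Every clip except the first
    # and last is transcribed twice; drop_duplicates removes the repeats.
    for j in range(1, num_clips):
        trans[j] = pd.DataFrame.from_dict(
            split(j, j + 1), orient="index"
        ).rename(columns={0: "Transcribed_Text"})
    return (
        pd.concat(trans)
        .drop_duplicates(subset=["Transcribed_Text"])
        .reset_index(drop=True)
    )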
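Because `transcript` walks the clips in overlapping pairs, 13 clips trigger 24 transcriptions. A possible single-pass alternative is sketched below; I have not timed it on Colab. `transcribe_clips` and the `transcribe_one` callable are hypothetical names, with `transcribe_one` standing in for the load, model and decode steps inside `split`.

```python
def transcribe_clips(num_clips, transcribe_one):
    """Transcribe clips 1..num_clips exactly once each.

    `transcribe_one` is any function that takes a clip number and returns
    its text (hypothetical; e.g. a wrapper around the body of `split`).
    """
    rows = []
    for i in range(1, num_clips + 1):
        rows.append({"clip": i, "Transcribed_Text": transcribe_one(i)})
    return rows


# Stand-in transcriber so the sketch runs without the model or audio files.
def fake_transcribe(i):
    return f"TEXT OF CLIP {i}"


rows = transcribe_clips(3, fake_transcribe)
print(rows[0])
# {'clip': 1, 'Transcribed_Text': 'TEXT OF CLIP 1'}
```

`pd.DataFrame(rows)` would turn the result into a dataframe with the same `Transcribed_Text` column, and no `drop_duplicates` step is needed since nothing is transcribed twice.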

## 1.2 TRANSCRIBE SHORTER CLIPS

Even on Colab Pro, the notebook crashes with individual clips longer than 90s, so keep each clip shorter than that.

In [8]:
%%time
df = transcript(num_clips = 13)

CPU times: user 20min 16s, sys: 2min 55s, total: 23min 11s
Wall time: 2min 4s


## 1.3 CHECK RESULTS

In [9]:
#checking that all 13 clips were transcribed
df.head()

|   | Transcribed_Text |
|---|---|
| 0 | IAM VERY HONOR TO SPEAK AT DISCLOSING ADDRESS ... |
| 1 | CRITICAL THAT VAXINES ARE RULED OUT QUICKLY AC... |
| 2 | UTIONS AND RULES AND NORMES WAS ERODING POPULO... |
| 3 | SO THAT ALL COUNTRIES AV SPECIALLY THE LEAST D... |
| 4 | NORMOUS STRESS ONLY UNPRECEDENTED LEVELS OF EM... |


In [10]:
df.shape

(13, 1)

In [11]:
df["Transcribed_Text"].values

array(["IAM VERY HONOR TO SPEAK AT DISCLOSING ADDRESS AND I LIKE TO CONGRATULATE PROFESSOR SCHOEB YOURSELF AND THE WHOLE BLW  F TEAM FOR PUTTING TOGETHER A SUCCESSFUL PROGRAMM IT 'S BEEN A YEAR SINCE WE WERE ALL PHYSICALLY GATHERED IN DAVORCE FOR THE FIFTIETH ANNUAL MEETING OFER THE DBU F AT THAT TIME WE WERE JUST STARTING TO HEAR ABOUT THIS NEW VIRUS AND TRYING TO UNDERSTAND WHAT WAS HAPPENING NONE OF US ANTICIPATED HOW QUICKLY A FULL SCALE PANDAMIC WOULD BLOW UP AND DRAMATICALLY CHANGE OUR WORLD THE DESRUPTION TO LIVES AND LIVELIHOODS HAS BEEN MASSIVE AND UNPRECEDENTED THE VIRAS IS STILL RAGING IN MANY COUNTRIES IN THE DEVELOPED WORLD IN THE US AN EUROPE AND ALSO IN THE DEVELOPING WORLD IN AFRICA SOUTH AMERICA AND SOUTHASIA THANKFULLY WITH BAXINES BECOMING AVAILABLE THERE IS SOME LIGHT AT THE END OF THE TUNNEL IT IS NOW",
       "CRITICAL THAT VAXINES ARE RULED OUT QUICKLY ACROSS THE WORLD BUT EVEN WITH VAXINES THE PANDAMIC IS FAR FROM BEING QUELLED THE NEW VARIANCE DISCOVERED AND TH

In [14]:
#lhl_wef = df["Transcribed_Text"].apply(''.join)

#lhl_wef.to_csv("lhl_wef.txt", sep="\t", index=False)
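For a plain-text transcript, the per-clip strings can be joined and written out without going through `to_csv`. This is a minimal sketch: `save_transcript` is a hypothetical helper and the output path is a placeholder. Note that clips cut mid-word (such as the one starting with "UTIONS" above) will still leave a fragment after joining.

```python
import os
import tempfile


def save_transcript(texts, path):
    """Join per-clip transcriptions with spaces and write them to `path`."""
    full_text = " ".join(t.strip() for t in texts if t.strip())
    with open(path, "w", encoding="utf-8") as f:
        f.write(full_text + "\n")
    return full_text


# Example with placeholder strings; in the notebook the input would be
# df["Transcribed_Text"].tolist().
out_path = os.path.join(tempfile.gettempdir(), "transcript_example.txt")
print(save_transcript(["FIRST CLIP", "SECOND CLIP "], out_path))
# FIRST CLIP SECOND CLIP
```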

## NOTE:

You'll have to clean up some of the transcribed text, but the amount of time saved here is pretty obvious.
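Some of that cleanup is mechanical and can be scripted. The sketch below covers only spacing and casing issues visible in the output above (double spaces, `IT 'S`); it does not correct misrecognised words such as "VAXINES". `tidy` is a hypothetical helper.

```python
import re


def tidy(text):
    """Light cleanup of raw transcriptions: spacing and casing only."""
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    text = re.sub(r"\s+'", "'", text)         # "IT 'S" -> "IT'S"
    return text.lower()


print(tidy("IT 'S BEEN A YEAR  SINCE WE WERE ALL PHYSICALLY GATHERED"))
# it's been a year since we were all physically gathered
```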