# TRANSCRIBING AUDIO FILES WITH WAV2VEC2

Transcribing audio files into text is among the most tedious and time-consuming tasks in business and research. As such, I was particularly keen to try [Hugging Face's implementation of Facebook's Wav2Vec2 model](https://huggingface.co/transformers/model_doc/wav2vec2.html) when it was released for transformers 4.3.

I was curious to see how well it would perform for short and long speeches, different accents and different "delivery formats", be it formal speeches or a poetry recital. The three notebooks in this repo cover the results from my trials, ranging from a short audio snippet (62s) to a poetry recital (5 minutes 34s) and a 12-minute-plus political speech.

The accents in these audio clips are White American, African American and Singaporean Chinese.

I find the results from Wav2Vec2 to be really impressive, and think this can help open up new ways to "chain link" NLP tasks directly from audio to textual analysis.

Long audio clips are very memory-intensive, however, and efforts to process audio files longer than 90s tend to crash normal work machines and even Colab Pro notebooks. I have a minor workaround in notebooks 2.0/2.1 that is somewhat clumsy, but it gets the job done within a reasonable period of time. I will figure out a more efficient way to do this at some point.
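
The general idea of working on short pieces can be sketched as follows. This is an illustration only, not the code in the notebooks: the helper name `chunk_bounds`, the 60-second default and the use of sample indices are all assumptions made for this example.

```python
def chunk_bounds(n_samples, sample_rate=16000, max_seconds=60):
    """Return (start, end) sample indices that cut a clip into short chunks.

    Hypothetical helper: every chunk is at most `max_seconds` long, so each
    one can be transcribed on its own and the texts joined afterwards.
    """
    step = int(sample_rate * max_seconds)
    if n_samples < 0 or step < 1:
        raise ValueError("need n_samples >= 0 and at least one sample per chunk")
    return [
        (start, min(start + step, n_samples))
        for start in range(0, n_samples, step)
    ]


# A 5 min 34 s clip sampled at 16 kHz holds 334 * 16000 samples.
bounds = chunk_bounds(334 * 16000, sample_rate=16000, max_seconds=60)
print(len(bounds))  # number of chunks
print(bounds[-1])   # the last chunk is shorter than the others
```

Fixed-length cuts can land in the middle of a word, so cutting at pauses in the speech tends to give cleaner text.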


## REFERENCES

- Documentation of [Hugging Face's implementation of Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html)

- Hosted inference [API on Hugging Face](https://huggingface.co/facebook/wav2vec2-base-960h)

- Paper on [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477)


## RESULTS

- The output text files of the two longer trials can be downloaded [here](https://www.dropbox.com/s/zx4bfct1zhl18az/amanda_gorman.txt) and [here](https://www.dropbox.com/s/gu3e6ns4x4tty61/lhl_wef.txt)


## REQUIREMENTS

- [transformers](https://pypi.org/project/transformers/) >= 4.3
- [librosa](https://pypi.org/project/librosa/)
- If you want to use your own audio clips, make sure to downsample them to 16kHz, as the Wav2Vec2 model used here was pretrained and fine-tuned on 16kHz sampled speech audio. I used [Audacity](https://www.audacityteam.org/) to split up the audio files in this repo.
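
The notebooks pass `sr=16000` to `librosa.load`, which resamples a clip as it is loaded. For a quick check of a clip beforehand, the sketch below reads the sample rate from a WAV header using Python's standard `wave` module. It assumes uncompressed WAV input (the `wave` module does not read FLAC), and the helper names `wav_sample_rate` and `is_16khz` are made up for this example.

```python
import wave


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz(path, expected_rate=16000):
    """Return True if the WAV file at `path` is sampled at `expected_rate`."""
    return wav_sample_rate(path) == expected_rate
```

A clip that fails the check can be resampled in Audacity or left to `librosa.load(..., sr=16000)`.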


## MODELS

- There are several versions of the Wav2Vec2 model on Hugging Face's model hub. I haven't tried them out to see what the difference in output quality is like. Check them out [here](https://huggingface.co/models?search=wav2ve).

- This repo uses the [wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h) model throughout.

# 1. TRANSCRIBE POETRY RECITAL

This is the first of two notebooks on transcribing longer audio clips with Wav2Vec2. I wanted to try something different from a routine speech, so I picked [Amanda Gorman's](https://twitter.com/TheAmandaGorman) evocative poem at the inauguration of Joe Biden on Jan 20, 2021.

In [2]:
import librosa
import pandas as pd
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

In [3]:
#load pre-trained model and tokenizer
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

In [4]:
# transcribe the numbered clips x..y (inclusive); returns {clip number: text}
def transcript(x, y):
    transcribe = {}
    for i in range(x, y + 1):
        # librosa resamples to 16 kHz on load, the rate the model expects
        speech, rate = librosa.load(
            "../audio/poet/amanda_gorman-%d.flac" % i, sr=16000
        )
        input_values = tokenizer(speech, return_tensors="pt").input_values
        # gradients are not needed for inference; skipping them saves memory
        with torch.no_grad():
            logits = model(input_values).logits
        # pick the most likely token for each audio frame, then decode to text
        predicted_ids = torch.argmax(logits, dim=-1)
        transcribe[i] = tokenizer.decode(predicted_ids[0])
    return transcribe
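
The `argmax` step yields one token id per audio frame, and `tokenizer.decode` then turns that long sequence into text. As far as I understand it, this is greedy CTC decoding: runs of repeated ids are merged and the blank (padding) token is dropped. A minimal sketch of that collapse in plain Python is below; the function name `ctc_greedy_collapse` and the four-entry vocabulary are invented for this example and are not the model's real vocabulary.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, blank_id=0):
    """Merge runs of repeated ids, then drop the blank id."""
    merged = [key for key, _ in groupby(ids)]
    return [i for i in merged if i != blank_id]


# Toy vocabulary, made up for this example ("|" marks a word boundary).
vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I"}

# One id per frame, as argmax over the logits would produce.
frame_ids = [0, 2, 2, 0, 3, 3, 3, 1, 1, 0, 2, 3]

tokens = [vocab[i] for i in ctc_greedy_collapse(frame_ids)]
text = "".join(tokens).replace("|", " ")
print(text)  # HI HI
```

The blank between two identical ids is what lets a genuinely doubled letter survive the merge.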

In [5]:
def full_transcript(num_clips):
    trans = {}
    # each call covers clips j and j + 1, so neighbouring calls overlap;
    # the repeated rows are removed by drop_duplicates below
    for j in range(1, num_clips):
        trans[j] = pd.DataFrame.from_dict(
            transcript(j, j + 1), orient="index"
        ).rename(columns={0: "Transcribed_Text"})
    return (
        pd.concat(trans)
        .drop_duplicates(subset=["Transcribed_Text"])
        .reset_index(drop=True)
    )


In [6]:
%%time
df = full_transcript(num_clips = 10)

CPU times: user 5min 50s, sys: 28.7 s, total: 6min 18s
Wall time: 2min 20s


In [7]:
df.shape

(10, 1)

In [8]:
df.head(10)

| | Transcribed_Text |
|---|---|
| 0 | MISTER PRESIDENT DOCTER BYDEN MADAM VICE PRESI... |
| 1 | IS ISN'T ALWAYS JUST IS AND YET THE DAWN IS HO... |
| 2 | FOR ONE AND YES WE ARE FAR FROM POLISHED FAR F... |
| 3 | PUT OUR FUTSURE FIRST WE MUST FIRST PUT OUR DI... |
| 4 | EVER AGAIN SO DIVISION SKIPSER TELLS US TO INV... |
| 5 | HAR IT WE'VE SEEN A FOREST THAT WOULD SHATTER ... |
| 6 | CEPTION WE DID NOT FEEL PREPARED TO BE THE EIR... |
| 7 | VOLENCE BUT BOLD FIERCE AND FREE WE WILL NOT B... |
| 8 | BETTER THAN ONE WE WERE LEFT WITH EVERY BREATH... |
| 9 | AND BEAUTIFUL WHEN DAY COMES WE STEP OUT OF TH... |


In [9]:
df["Transcribed_Text"].values

array(["MISTER PRESIDENT DOCTER BYDEN MADAM VICE PRESIDENT MISTER MHOFF AMERICANS AND THE WORLD WHEN DAY COMES WE ASK OURSELVES WHERE CAN WE FIND LIGHT IN THIS NEVER ENDING SHADE THE LOSS WE CARRY A SEA WE MUST WADE WE BRAVE THE BELLY OF THE BEAST WE'VE LEARNED THAT QUIET ISN'T ALWAYS PEACE IN THE NORMS IN NOTIONS OF WHAT JUST",
       "IS ISN'T ALWAYS JUST IS AND YET THE DAWN IS HOURS BEFORE WE KNEW IT SOMEHOW WE DO IT SOMEHOW WE'VE WEATHERED AND WITNESSED A NATION THAT ISN'T BROKEN BUT SIMPLY UNFINISHED WE THE SUCCESSORS OF A COUNTRY AND A TIME WERE A SKINNY BLACK GIRL DESCENDED FROM SLAVES AND RAISED BY A SINGLE MOTHER CAN DREAM OF BECOMING PRESIDENT ONLY TO FIND HERSELF RECITING",
       "FOR ONE AND YES WE ARE FAR FROM POLISHED FAR FROM PESTIM BUT THAT DOESN'T MEAN WE ARE STRIVING TO FORM A UNION THAT IS PERFECT WE OR STRIVING TO FORGE OR UNION WITH PURPOSE TO COMPOSE A COUNTRY COMMITTED TO ALL CULTURES COLORS CHARACTERS AND CONDITIONS OF MAN AND SO WE LIFT OUR GAZES NOT TO WHAT S

In [10]:
range(10)

range(0, 10)

In [11]:
stop

NameError: name 'stop' is not defined

In [None]:
def full_text(a, b):
    trans = {}
    for j in range(a, a + b):
        trans[j] = pd.DataFrame.from_dict(
            transcript(j, j + b), orient="index"
        ).rename(columns={0: "Transcribed_Text"})
    return pd.concat(trans)

In [None]:
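def full_trans(dicts):
    # merge a list of {clip number: text} dicts into one mapping
    trans = {}
    for j in dicts:
        for k, v in j.items():
            trans.setdefault(k, []).append(v)
    # join the texts in clip order (a space between clips is an assumption)
    return " ".join(text for k in sorted(trans) for text in trans[k])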

In [None]:
%%time
transcripts1 = pd.DataFrame.from_dict(transcript(1,4), orient="index").rename(
    columns={0: "Transcribed_Text"}
)

transcripts2 = pd.DataFrame.from_dict(transcript(5,8), orient="index").rename(
    columns={0: "Transcribed_Text"}
)

transcripts3 = pd.DataFrame.from_dict(transcript(9,10), orient="index").rename(
    columns={0: "Transcribed_Text"}
)

In [None]:
transcripts = pd.concat(
    [
        transcripts1,
        transcripts2,
        transcripts3,
    ]
)

In [None]:
transcripts.shape

In [None]:
poet = transcripts["Transcribed_Text"].apply(''.join)

In [None]:
#poet.to_csv("../transcript/amanda_gorman.txt", sep="\t", index=False)

In [None]:
transcripts["Transcribed_Text"].values

In [None]:
# https://www.youtube.com/watch?v=LZ055ilIiN4