# NOTE: 

This is an alternative and, in my view, better approach to transcribing long audio clips. The code is taken from Lysandre Jik's (of Hugging Face) response to an issue on [GitHub](https://github.com/huggingface/transformers/issues/10366).

Unlike my original version in notebook2.0, the code here doesn't require you to split up your audio files manually. Instead, it uses Librosa to stream the audio file in fixed-length chunks and transcribes them one chunk at a time.

I kept both versions in the repo for reference. See [notebook2.0](https://github.com/chuachinhon/wav2vec2_transformers/blob/main/notebooks/2.0_wav2vec2_poetry.ipynb) as well for my original approach to this problem.


# TRANSCRIBING LONG AUDIO FILES WITH WAV2VEC2 - ALT VERSION


## REFERENCES

- Documentation of [Hugging Face's implementation of Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html)

- Hosted inference [API on Hugging Face](https://huggingface.co/facebook/wav2vec2-base-960h)

- Paper on [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477)


## RESULTS

- The output text files of the two longer trials can be downloaded [here](https://www.dropbox.com/s/aevfd7f6mwlk7ru/amanda_gorman_alt.txt?dl=0)


## REQUIREMENTS

- [transformers](https://pypi.org/project/transformers/) >= 4.3
- [librosa](https://pypi.org/project/librosa/)
- If you want to use your own audio clips, make sure to downsample them to 16kHz, as the Wav2Vec2 model used here was pretrained and fine-tuned on speech audio sampled at 16kHz. I used [Audacity](https://www.audacityteam.org/) to split up the audio files in this repo.


## MODELS

- There are several versions of the Wav2Vec2 model on Hugging Face's model hub. Check them out [here](https://huggingface.co/models?search=wav2ve).

- For this notebook, I switched to the [wav2vec2-large-960h-lv60-self](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self) model. In earlier notebooks, I used the base version.

# 1. TRANSCRIBE POETRY RECITAL

Source: [Amanda Gorman's](https://twitter.com/TheAmandaGorman) [evocative poem](https://www.youtube.com/watch?v=LZ055ilIiN4) at the inauguration of Joe Biden on Jan 20, 2021.

Length: 5 minutes 34 seconds

In [1]:
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

## 1.1 DEFINE FUNCTION TO TRANSCRIBE AUDIO CLIP IN 20-SECOND CHUNKS

Technically, you can set the `block_length` parameter to any value, but block lengths above 60s result in considerable out-of-memory issues. Blocks of 20 to 30 seconds seem to make the most sense to me.

I'm setting block_length to 20s in this notebook.
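
One detail that is easy to miss: `librosa.stream` counts `block_length` in frames, not seconds. With `frame_length` and `hop_length` both set to 16000 (as in the function below), one frame is one second only if the file is sampled at 16kHz. The sketch below works through that arithmetic; the helper names are my own for illustration and are not part of librosa, and the block count assumes non-overlapping frames (`frame_length == hop_length`).

```python
import math


def block_seconds(block_length, frame_length, hop_length, sr):
    """Duration in seconds of one block of `block_length` frames."""
    # A block of N frames spans frame_length + (N - 1) * hop_length samples.
    samples = frame_length + (block_length - 1) * hop_length
    return samples / sr


def num_blocks(duration_s, block_length, hop_length, sr):
    """Approximate number of blocks for a clip (assumes frame_length == hop_length)."""
    samples_per_block = block_length * hop_length
    return math.ceil(duration_s * sr / samples_per_block)


# Settings used in this notebook, assuming a 16kHz audio file
seconds = block_seconds(block_length=20, frame_length=16000, hop_length=16000, sr=16000)
blocks = num_blocks(duration_s=5 * 60 + 34, block_length=20, hop_length=16000, sr=16000)
print(seconds, blocks)
```

If the file were sampled at a different rate, the same settings would give blocks of a different duration, which is one more reason to downsample to 16kHz first.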

In [2]:
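# function adapted from: https://github.com/huggingface/transformers/issues/10366

def asr_transcript(tokenizer, model, audio_file):
    transcript = ""

    # Stream over 20-second chunks: 20 frames of 16000 samples each,
    # which is 20 seconds only if the file is sampled at 16kHz
    stream = librosa.stream(
        audio_file, block_length=20, frame_length=16000, hop_length=16000
    )

    for speech in stream:
        # librosa.stream yields mono blocks by default; if a multi-channel
        # block of shape (channels, samples) arrives, downmix it to mono
        if speech.ndim > 1:
            speech = librosa.to_mono(speech)

        input_values = tokenizer(speech, return_tensors="pt").input_values

        # inference only: skip gradient tracking to save memory
        with torch.no_grad():
            logits = model(input_values).logits

        # greedy decoding: most likely token per frame, then collapse to text
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = tokenizer.decode(predicted_ids[0])
        transcript += transcription.lower() + " "

    return transcript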
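
The `torch.argmax` plus `tokenizer.decode` step above amounts to greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, then drop the blank token. The toy sketch below illustrates the idea in plain Python. The `BLANK` symbol, the `|` word separator, and the frame sequence are stand-ins I made up for illustration; this is not the tokenizer's actual implementation or vocabulary.

```python
from itertools import groupby

BLANK = "_"  # hypothetical stand-in for the CTC blank token


def greedy_ctc_collapse(frame_tokens, blank=BLANK, word_sep="|"):
    """Collapse a per-frame token sequence into text, CTC style."""
    # 1) merge runs of identical consecutive tokens
    merged = [token for token, _ in groupby(frame_tokens)]
    # 2) drop the blank token, which only separates repeated characters
    chars = [token for token in merged if token != blank]
    # 3) turn the word separator into spaces
    return "".join(chars).replace(word_sep, " ").strip()


# One token per audio frame, as if produced by argmax over the logits
frames = list("HH_II||_YY_OO_U")
print(greedy_ctc_collapse(frames))
```

Note that a blank between two identical characters keeps them apart, so `L_L` decodes to a double letter while `LL` collapses to a single one.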

## 1.2 LOAD CHOICE OF MODEL-TOKENIZER, AUDIO FILE AND CHECK RESULTS

In [3]:
# load tokenizer and pre-trained model
tokenizer_transcribe = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

model_transcribe = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

audio_file = "../audio/amanda_gorman.flac"

In [4]:
%%time
poet = asr_transcript(tokenizer_transcribe, model_transcribe, audio_file)

CPU times: user 8min 43s, sys: 24.4 s, total: 9min 8s
Wall time: 2min 50s


In [5]:
print(poet)

mister president doctor byden madam vice president mister mhof americans and the world when day comes we ask ourselves where can we find light in this never ending shade the loss we carry a sea we must wad we've braved the belly of the beast we've learned that quiet isn't always peace in the norms and notions of what just is isn't always just is and yet the n is ours before we knew it somehow we do it somehow we've weathered and witnessed a nation that isn't broken but simply unfinished we the successors of a country and a time were a skinny black girl descended from slaves and raised by a single mother can dream of becoming president only to find herself reciting for one and yes we are far from polished far from pestine but that doesn't mean we are striving to form a union that is perfect we are striving to forge our union with purpose to compose a country committed to all cultures colors characters and conditions of man and so we lift our gaze not to what stands between us but what s

In [6]:
# Output the transcript to a text file if you wish.

# with open("../transcripts/amanda_gorman_alt.txt", "w") as file:
#    file.write(poet)

## NOTE:

2 minutes 50s might seem like a long time to transcribe something like this, but if you were to do this manually, it would probably take you at least twice as long.
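
To put the timing in perspective, here is a quick calculation of the real-time factor (processing time divided by audio length), using the wall time reported by the `%%time` cell above and the length of the clip. The helper name is just for illustration.

```python
def real_time_factor(processing_s, audio_s):
    """Processing time divided by audio duration."""
    return processing_s / audio_s


wall_time_s = 2 * 60 + 50  # "Wall time: 2min 50s" from the %%time output
audio_s = 5 * 60 + 34      # length of the recital

rtf = real_time_factor(wall_time_s, audio_s)
print(f"{rtf:.2f}")
```

A value below 1 means the transcription ran faster than the audio plays, on the machine used for this notebook.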

Again, there are some obvious errors, but the results from the large model are pretty damn good if you ask me. As the models get more sophisticated, I'm pretty sure the quality of the output will get better.
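
One way to put a number on the errors is the word error rate (WER): the word-level edit distance between a reference text and the transcript, divided by the number of reference words. This is a minimal sketch and was not part of the original run. The reference line is my own normalized rendering of the poem's opening address (an assumption, including the spelled-out titles), and the hypothesis is copied from the transcript above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i - 1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Assumed reference (normalized by hand) vs. the model output for the same span
reference = ("mister president doctor biden madam vice president "
             "mister emhoff americans and the world")
hypothesis = ("mister president doctor byden madam vice president "
              "mister mhof americans and the world")

wer = word_error_rate(reference, hypothesis)
print(f"{wer:.2f}")
```

Scoring the whole transcript this way would need a full reference text normalized the same way (lowercase, no punctuation), so treat this opening line only as an illustration.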