# CHAIN-LINKING NLP TASKS WITH WAV2VEC2 & TRANSFORMERS: PART 2

In this notebook, let's try a different and arguably more difficult combination of tasks - speech-to-text with summarisation.

While there's been a fair amount of work done on auto-summarisation in recent years, the results are still not ready for prime time, in my view. Summarisation is intrinsically tough, of course: it is subjective, highly domain-specific, and often more art than science.

To keep the tasks here more manageable, I picked a relatively short and focused audio clip - [Singapore Prime Minister Lee Hsien Loong speaking on populism](https://www.youtube.com/watch?v=4bUl9R2N90A) at a business conference in Oct 2019.

I tried two transformer models for summarisation: Facebook's BART and Google's Pegasus. I'm not terribly satisfied with the summaries, but this is a pretty decent start, considering we are going from an audio clip directly to a text summary within minutes.


## RESULTS

The transcript is available via Dropbox for those who want to skip ahead to the result:
 - [Transcript of Mr Lee's comments](https://www.dropbox.com/s/y1xp27ktd41js1o/pm_populism.txt?dl=0)
 

## REFERENCES

- Documentation of [Hugging Face's implementation of Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html)

- Hosted inference [API on Hugging Face](https://huggingface.co/facebook/wav2vec2-base-960h)

- Paper on [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477)


## REQUIREMENTS

- [transformers](https://pypi.org/project/transformers/) >= 4.3
- [librosa](https://pypi.org/project/librosa/)
- If you want to use your own audio clips, make sure to downsample them to 16kHz, as the Wav2Vec2 model used here was pretrained and fine-tuned on speech audio sampled at 16kHz. I used [Audacity](https://www.audacityteam.org/) to split up the audio files in this repo.


## MODELS

- There are several versions of the Wav2Vec2 model on Hugging Face's model hub. Check them out [here](https://huggingface.co/models?search=wav2ve).

- In this notebook, I'll be using the large [wav2vec2-large-960h-lv60-self](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self) model.

In [1]:
import librosa
import numpy as np
import pandas as pd
import re
import torch

from nltk.tokenize import sent_tokenize
from transformers import (
    AutoTokenizer,
    AutoModel,
    AutoModelForSeq2SeqLM,
    Wav2Vec2ForCTC,
    Wav2Vec2Tokenizer,
    pipeline,
)

## 1. TRANSCRIBE

More detailed notes on using Wav2Vec2 and my work-around can be found in notebooks 1 and 2. I'll skip over the details here.

Mr Lee's speech is about 4 minutes long and is transcribed in 30-second chunks.

## 1.1 DEFINE FUNCTION TO TRANSCRIBE AUDIO CLIP IN PRE-SET CHUNKS

Technically speaking, you can set the "block_length" parameter to any value, but anything above a 60s block length results in considerable out-of-memory issues. 20- or 30-second blocks seem to make the most sense to me.

I decided to set the block_length to 30s in this notebook after a few trials. Note that the transcript comes out unpunctuated: there's currently no good way to deal with punctuation in Wav2Vec2.
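To see why a block_length of 30 means 30 seconds here, the arithmetic can be sketched as below. As I understand `librosa.stream`, each block holds at most `frame_length + (block_length - 1) * hop_length` samples; with both set to 16000 on 16kHz audio, one frame is one second. The helper names and the 4-minute clip length are assumptions for illustration.

```python
import math

SR = 16_000            # sampling rate of the clip (Hz)
FRAME_LENGTH = 16_000  # samples per frame: one second at 16 kHz
HOP_LENGTH = 16_000    # hop equals frame length, so frames do not overlap


def block_seconds(block_length, frame_length=FRAME_LENGTH,
                  hop_length=HOP_LENGTH, sr=SR):
    """Duration in seconds of one streamed block."""
    samples = frame_length + (block_length - 1) * hop_length
    return samples / sr


def n_blocks(clip_seconds, block_length):
    """Number of blocks needed to cover a clip (the last one may be shorter)."""
    return math.ceil(clip_seconds / block_seconds(block_length))


print(block_seconds(30))     # 30.0
print(n_blocks(4 * 60, 30))  # 8
```

So a clip of roughly 4 minutes means about eight forward passes through the model, each on 480,000 samples.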

In [2]:
# function adapted via: https://github.com/huggingface/transformers/issues/10366

def asr_transcript(tokenizer, model, audio_file, clip_length):
    transcript = ""

    stream = librosa.stream(
        audio_file, block_length=clip_length, frame_length=16000, hop_length=16000
    )

    for speech in stream:
        if len(speech.shape) > 1:
            speech = speech[:, 0] + speech[:, 1]

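        input_values = tokenizer(speech, return_tensors="pt").input_values

        # Inference only: skip gradient tracking to save memory
        with torch.no_grad():
            logits = model(input_values).logits

        # Greedy decoding: take the most likely token at each time step
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = tokenizer.decode(predicted_ids[0])
        transcript += transcription.lower() + " "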
        
    return transcript
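
The `argmax` plus `decode` step in the function above is greedy CTC decoding: pick the most likely token per time step, merge consecutive repeats, then drop the blank token. The toy sketch below illustrates that collapse on hand-written frame labels. The `_` blank symbol and the `ctc_collapse` helper are made up for illustration and are not the tokenizer's actual internals.

```python
from itertools import groupby

BLANK = "_"  # placeholder for the CTC blank token in this toy example


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive duplicate labels, then remove blanks."""
    merged = (label for label, _ in groupby(frame_labels))
    return "".join(label for label in merged if label != blank)


# One label per time step, as a greedy argmax over the logits might produce.
frames = list("pp_oo_p__uu_ll__i_ss_mm")
print(ctc_collapse(frames))  # populism

# A genuine double letter only survives if a blank separates the repeats.
print(ctc_collapse(list("hel_lo")))  # hello
print(ctc_collapse(list("hello")))   # helo
```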

## 1.2 LOAD CHOICE OF MODEL-TOKENIZER, AUDIO FILE AND CHECK RESULTS

Transcribing the clip took a little over 3 minutes of wall time on my late-2015 iMac.

In [3]:
# Load tokenizer and pre-trained model; there are several versions available
tokenizer_transcribe = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

model_transcribe = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

audio_file = "../audio/pm_populism.flac"

clip_length = 30 # Stream over 30-second chunks

In [4]:
%%time
pm = asr_transcript(tokenizer_transcribe, model_transcribe, audio_file, clip_length)

CPU times: user 7min 9s, sys: 49.1 s, total: 7min 58s
Wall time: 3min 19s


In [5]:
print(pm)

populism is an issue in many countries many develop countries particularly and i think they can see some of the reasons for it because there is a divide between the elites and the masses of the population the population pat population feel that the system has not been fair to them they have not got their share of the efforts they have put into it that they have been left behind at least relatively and not just that it has anequal but that they are not getting the respect and the status which they feel that they deserve as citizens in the country certainly i think that sentiment is very strong in donald trump supporters you would see that the people who voted for bexe in england you would see that in the yellow vest in france in the national front our national party supporters who workefoor money thepen and you see that in other countries too i would say even in hong kong that you could described the what the demonstrators and protestors want es a kind of populacm t system not working f

In [6]:
# Output the transcript to a text file if you wish.

#with open("../transcripts/pm_populism.txt", "w") as file:
#    file.write(pm)

## 2. SUMMARIZE

## 2.1 SUMMARY VIA FB-BART

In [7]:
%%time
nlp_summarizer = pipeline(
    "summarization", 
    model="facebook/bart-large-cnn", 
    tokenizer="facebook/bart-large-cnn", 
    framework="pt"
)

extr_summary = nlp_summarizer(pm)

CPU times: user 46.8 s, sys: 1.71 s, total: 48.5 s
Wall time: 29.7 s


In [8]:
extr_summary

[{'summary_text': 'Populism is an issue in many countries many develop countries particularly and i think they can see some of the reasons for it. There is a divide between the elites and the masses of the population the population pat population feel that the system has not been fair to them. In singapore we have tried very hard to do to avoid being in that position.'}]

## 2.2 SUMMARY VIA GOOGLE-PEGASUS

In [9]:
model_name = 'google/pegasus-large'

tokenizer_pegasus = AutoTokenizer.from_pretrained(model_name)
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [10]:
%%time
# Call the tokenizer directly; prepare_seq2seq_batch is deprecated in newer transformers releases
batch = tokenizer_pegasus(
    pm, truncation=True, padding="longest", return_tensors="pt"
)

translated = model_pegasus.generate(**batch)

abstr_summary = tokenizer_pegasus.batch_decode(
    translated, skip_special_tokens=True
)


CPU times: user 3min 46s, sys: 5.98 s, total: 3min 52s
Wall time: 1min 6s


In [11]:
abstr_summary

['populism is an issue in many countries many develop countries particularly and i think they can see some of the reasons for it because there is a divide between the elites and the masses of the population the population pat population feel that the system has not been fair to them they have not got their share of the efforts they have put into it that they have been left behind at least and not just that it has anequal but that they are not getting the respect and the status which they feel that they deserve as citizens in the country certainly i think that sentiment is very strong in donald trump supporters you would see that the people who voted for bexe in england you would see that in the yellow vest in france in the national front our national party supporters who workefoor money thepen and you see that in other countries too i would say even in hong kong that you could described the what the demonstrators and protestors want es a kind of populacm t system not working for me let
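
One rough way to compare the two outputs is to check how much each summary actually shortens its source. The sketch below computes a simple word-count ratio; the `compression_ratio` helper and the short stand-in strings are assumptions for illustration, not results from the models above. A ratio close to 1.0 suggests the model has mostly copied its input rather than summarised it.

```python
def compression_ratio(source, summary):
    """Words in the summary divided by words in the source (lower = more compressed)."""
    src_words = len(source.split())
    if src_words == 0:
        raise ValueError("source text is empty")
    return len(summary.split()) / src_words


# Short stand-in strings; in the notebook you would pass `pm` and a summary string.
source = "the system has not been fair to them and they have been left behind"
short = "the system has not been fair"

print(round(compression_ratio(source, short), 2))  # 0.43
print(compression_ratio(source, source))           # 1.0
```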