# CHAIN-LINKING NLP TASKS WITH WAV2VEC2 & TRANSFORMERS

Take an audio clip in English, transcribe it, translate it into another language, and finally apply a layer of sentiment analysis to the speech or try to summarise it. For a human, getting all these tasks done could well take hours, if not a full working day.

Even for folks familiar with NLP tools, linking up these disparate tasks wasn't easy until recently. But with [Hugging Face's implementation of Facebook's Wav2Vec2 model](https://huggingface.co/transformers/model_doc/wav2vec2.html) in transformers 4.3, "chain linking" NLP tasks directly from audio to text has become a reality, with caveats on the quality of the results.

The output, be it the transcribed or translated text, still requires a certain amount of clean-up by a human user. But the amount of time saved is not inconsiderable, in my view. And with Hugging Face's pipeline API and other transformer models capable of supporting a growing range of NLP tasks, the possible combinations and permutations of audio-to-text tasks are pretty mind-boggling.

This notebook demos a simple workflow to:
 - transcribe a longish English speech (~24 minutes)
 - translate it into Chinese
 - plot the 'sentiment structure' of the English speech.

I decided to use Biden's [first prime time speech](https://www.youtube.com/watch?v=JYBatFW-BP4) on March 11/12, 2021 (depending on which time zone you are in). The audio was split into 71 clips of 20 seconds each.


## RESULTS

The output files are available via Dropbox for those who want to skip ahead to the results:
 - [English transcript](https://www.dropbox.com/s/mc8mcav2qol9tw1/biden.txt?dl=0)
 - [Chinese translated text](https://www.dropbox.com/s/vakol0xw76c2lqx/biden_chinese.txt?dl=0)
 - [Full CSV with transcript, translated text and sentiment labels/scores](https://www.dropbox.com/s/0t3uyi1vcu7ti3x/biden_prime_time.csv?dl=0)



## REFERENCES

- Documentation of [Hugging Face's implementation of Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html)

- Hosted inference [API on Hugging Face](https://huggingface.co/facebook/wav2vec2-base-960h)

- Paper on [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477)



## REQUIREMENTS

- [transformers](https://pypi.org/project/transformers/) >= 4.3
- [librosa](https://pypi.org/project/librosa/)
- If you want to use your own audio clips, make sure to downsample them to 16 kHz, as the Wav2Vec2 model used here was pretrained and fine-tuned on speech audio sampled at 16 kHz. I used [Audacity](https://www.audacityteam.org/) to split up the audio files in this repo.
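
For readers who would rather script the preparation step than use Audacity, here is a minimal sketch of cutting a waveform into 20-second clips. It assumes the audio is already loaded as a 1-D NumPy array at 16 kHz (for example with `librosa.load(path, sr=16000)`, which resamples on load). The helper name `split_waveform` and the constants are hypothetical and not part of this repo.

```python
import numpy as np

SAMPLE_RATE = 16_000   # Wav2Vec2 expects 16 kHz input
CLIP_SECONDS = 20      # clip length used in this notebook


def split_waveform(samples, sample_rate=SAMPLE_RATE, clip_seconds=CLIP_SECONDS):
    """Cut a 1-D waveform into consecutive clips; the last clip may be shorter."""
    clip_len = sample_rate * clip_seconds
    return [samples[start:start + clip_len] for start in range(0, len(samples), clip_len)]


# Toy check with 50 seconds of silence instead of real audio
waveform = np.zeros(50 * SAMPLE_RATE, dtype=np.float32)
clips = split_waveform(waveform)
print(len(clips))                              # 3
print([len(c) / SAMPLE_RATE for c in clips])   # [20.0, 20.0, 10.0]
```

Each clip would then need to be written to disk and numbered from 1 so that it matches the file pattern used by the loader in section 1.1.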


## MODELS

- There are several versions of the Wav2Vec2 model on Hugging Face's model hub. Check them out [here](https://huggingface.co/models?search=wav2ve).

- In this notebook, I'll be using the large [wav2vec2-large-960h-lv60-self](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self) model.

In [1]:
import librosa
import matplotlib as mpl
import pandas as pd
import plotly
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import re
import torch

from nltk.tokenize import sent_tokenize
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Tokenizer,
    MarianMTModel,
    MarianTokenizer,
    pipeline,
)

mpl.rcParams["figure.dpi"] = 300
%matplotlib inline
%config InlineBackend.figure_format ='retina'

# 1. TRANSCRIBE

More detailed notes on using Wav2Vec2 and my work-around can be found in notebooks 1 and 2. I'll skip over the details here.

Biden's speech is about 24 minutes long, and is broken into 71 clips of 20s each to keep the memory load manageable. Another issue is Wav2Vec2's inability to generate punctuation, so longer clips would be tricky for downstream tasks like summarization. But more on that later.

In [2]:
#load tokenizer and pre-trained model; there are several versions available

tokenizer_transcribe = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model_transcribe = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

## 1.1 DEFINE FUNCTIONS TO TRANSCRIBE SHORT AUDIO CLIPS ONE AT A TIME

The functions return a dataframe for ease of "transfer" to other NLP tasks.

## NOTE:

Make sure the numbering of your split audio files starts at "1", and not "01" or "001".
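
The reason is the `%d` placeholder in the file path used by the loader in this section, which formats the clip number without leading zeros. As a rough sketch, the check below builds the file names the loader will look for and reports any that are missing. `expected_clip_paths` and `missing_clips` are hypothetical helpers, and the folder and prefix simply mirror the path used in this notebook.

```python
from pathlib import Path


def expected_clip_paths(folder, prefix, num_clips, ext="flac"):
    """Build the paths the loader will look for: prefix-1.ext ... prefix-N.ext."""
    return [Path(folder) / ("%s-%d.%s" % (prefix, i, ext)) for i in range(1, num_clips + 1)]


def missing_clips(folder, prefix, num_clips, ext="flac"):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_clip_paths(folder, prefix, num_clips, ext) if not p.exists()]


paths = expected_clip_paths("../audio/biden", "biden", 71)
print(paths[0].name, paths[-1].name)   # biden-1.flac biden-71.flac
```

Running `missing_clips(...)` before a long transcription job is a cheap way to catch a file named `biden-01.flac` before the loader fails on it.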

In [3]:
def split(x, y):
    # Transcribe clips numbered x to y (inclusive); returns {clip number: text}
    transcribe = {}
    for i in range(x, y + 1):
        # librosa resamples to 16 kHz on load
        speech, rate = librosa.load(
            "../audio/biden/biden-%d.flac" % i, sr=16000
        )
        input_values = tokenizer_transcribe(
            speech, return_tensors="pt"
        ).input_values
        # Inference only, so skip gradient tracking to save memory
        with torch.no_grad():
            logits = model_transcribe(input_values).logits
        # Pick the most likely token for each audio frame, then decode to text
        predicted_ids = torch.argmax(logits, dim=-1)
        transcribe[i] = tokenizer_transcribe.decode(predicted_ids[0])
    return transcribe
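
The `torch.argmax` call followed by `tokenizer_transcribe.decode` in `split()` amounts to greedy CTC decoding: take the most likely token for each audio frame, collapse consecutive repeats, drop the blank token, and turn word delimiters into spaces. The toy sketch below illustrates the idea in plain Python. The vocabulary, ids and function name are made up for illustration and are not the model's real vocabulary or the tokenizer's actual implementation.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real model's vocabulary is larger
VOCAB = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}
BLANK_ID = 0          # CTC blank token
WORD_DELIMITER = "|"  # marks word boundaries


def greedy_ctc_decode(ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated ids, drop blanks, then map ids to characters."""
    collapsed = [token_id for token_id, _ in groupby(ids)]
    chars = [vocab[i] for i in collapsed if i != blank_id]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# One predicted id per audio frame, as an argmax over the logits would produce
frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 1, 2, 3]
print(greedy_ctc_decode(frame_ids))  # HELLO HE
```

Note how the blank between the two runs of `L` is what lets the double letter in "HELLO" survive the collapsing step. Nothing in this procedure produces punctuation, which is why the transcripts come out as unbroken upper-case text.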


In [4]:
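def transcript(num_clips):
    # Transcribe the clips in overlapping pairs (j, j + 1). Every clip except
    # the first and last is therefore transcribed twice; the repeated rows are
    # removed by drop_duplicates() below. Note that two different clips with
    # identical text would also be collapsed into one row.
    trans = {}
    for j in range(1, num_clips):
        trans[j] = pd.DataFrame.from_dict(
            split(j, j + 1), orient="index"
        ).rename(columns={0: "Transcribed_Text"})
    return (
        pd.concat(trans)
        .drop_duplicates(subset=["Transcribed_Text"])
        .reset_index(drop=True)
    )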
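
Because `transcript()` calls `split(j, j + 1)` for every `j`, most clips pass through the model twice before `drop_duplicates` removes the repeats. As a possible alternative (a sketch only, not the approach used in this notebook), each clip could be transcribed exactly once. The function `transcribe_once` and the stand-in `fake_transcribe` are hypothetical; with the real model, `transcribe_fn` could be something like `lambda i: split(i, i)[i]`.

```python
def transcribe_once(num_clips, transcribe_fn):
    """Call transcribe_fn(i) exactly once for each clip number 1..num_clips."""
    return [
        {"Clip": i, "Transcribed_Text": transcribe_fn(i)}
        for i in range(1, num_clips + 1)
    ]


# Stand-in for the real model call, so the sketch runs without audio files
def fake_transcribe(i):
    return "CLIP %d TEXT" % i


rows = transcribe_once(3, fake_transcribe)
print(len(rows))                      # 3
print(rows[0]["Transcribed_Text"])    # CLIP 1 TEXT
```

Passing `rows` to `pd.DataFrame(rows)` would give a dataframe with one row per clip, with no need for de-duplication.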


## 1.2 TRANSCRIBE SHORT CLIPS

This took 23min and 46s on my late-2015 iMac. It should be considerably faster on Colab or newer machines.

PS: You can choose not to transcribe all the clips, but it's hard to tell beforehand what's in which clip.

In [5]:
%%time
df = transcript(num_clips = 71)

CPU times: user 1h 11min 48s, sys: 4min 49s, total: 1h 16min 37s
Wall time: 23min 46s


## 1.3 CHECK RESULTS

In [6]:
#checking that all clips were transcribed
df.shape

(71, 1)

In [7]:
df.tail()

|    | Transcribed_Text |
|---:|:---|
| 66 | WHAT WE ARE ABOUT TO GO THROUGH BUT NOW W'RE C... |
| 67 | FOR ALSO BOUND IN GETHER BY THE WHOPE AND THE ... |
| 68 | ONE AMERICA I BELIEVE WI CAN AND WE WILL WE'RE... |
| 69 | WILL COME OUT STRONGER WITH A RENEWED FAITH IN... |
| 70 | WE DOIT TOGETHER SO GOD BLESS YOU ALL PLEASE G... |


In [8]:
# Output the transcript to a text file if you wish, or use it for other downstream NLP tasks.

#biden = df["Transcribed_Text"].apply(''.join)
#biden.to_csv("../transcripts/biden.txt", sep="\t", index=False)

# 2. TRANSLATE TO CHINESE

There are various ways to translate the transcribed text using transformer models, but in this case I'll use [Hugging Face's version of MarianMT](https://huggingface.co/transformers/model_doc/marian.html) to translate the speech into Chinese.

See my earlier [notebooks](https://github.com/chuachinhon/practical_nlp/tree/master/notebooks) on using MarianMT on speeches of varying lengths.

In [9]:
tokenizer_translate = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")

model_translate = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-zh")


In [10]:
def clean_text(text):
    text = text.encode("ascii", errors="ignore").decode(
        "ascii"
    )  # remove non-ascii, Chinese characters
    text = text.lower()
    text = re.sub(r"\n", " ", text)
    text = text.strip(" ")
    text = re.sub(' +',' ', text).strip() # gets rid of multiple spaces and replace with a single
    return text

In [11]:
def translate(text):
    # Guard against empty input
    if text is None or text == "":
        return "Error"

    # Sentence-tokenize, then encode the sentences as one padded batch
    batch = tokenizer_translate(
        sent_tokenize(text), return_tensors="pt", padding=True, truncation=True
    )

    # Run the model and decode each translated sentence
    translated = model_translate.generate(**batch)
    tgt_text = [tokenizer_translate.decode(t, skip_special_tokens=True) for t in translated]

    return " ".join(tgt_text)
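
Since the transcripts contain no punctuation, `sent_tokenize` will typically return each clip as a single "sentence". That is fine for 20-second clips, but longer clips could exceed the translation model's maximum input length. One possible workaround, sketched below, is to cut the text into fixed-size word chunks before translating. `chunk_words` is a hypothetical helper and `max_words=50` is an arbitrary illustrative value, not a tuned setting.

```python
def chunk_words(text, max_words=50):
    """Split unpunctuated text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Placeholder text standing in for a long, unpunctuated transcript
sample = " ".join("word%d" % n for n in range(12))
for chunk in chunk_words(sample, max_words=5):
    print(chunk)
```

The chunks could then be passed to the tokenizer in place of the `sent_tokenize` output. The trade-off is that a chunk boundary may fall in the middle of a phrase, which can hurt translation quality at the seams.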

In [12]:
%%time
df["Clean_Text"] = df["Transcribed_Text"].map(clean_text)

df["Machine_Translation"] = df["Clean_Text"].map(translate)


CPU times: user 3min 47s, sys: 1.69 s, total: 3min 49s
Wall time: 1min 1s


In [13]:
df["Machine_Translation"].values

array(['一年前的一年,被病毒击中,病毒遭到沉默,无声无声无声的传播,持续数日数星期,然后数月,无无声无息地传播了数日,导致更多的死亡人数增加,感染增加,压力增加,孤独照片和录像增加。',
       '和朋友一起过最后的生日 和大家族的最后假日 而对每个人都不同 我们都失去了一件 集体受苦受难的东西',
       '一年充满了生命的丧失 以及我们所有人的生命的丧失 但是在失去的一年中',
       '事实上,美国要做的事情 可能是美国最 的事情,我们做的, 而这就是我们已经做的 我们见过的前线 和被派遣工人 冒着生命危险 冒着生命危险,有时失去他们 拯救和帮助其他人 研究人员和科学家赛车',
       '你们当中这么多人 都认为这条路很坚固 我知道所有破碎的地方都很难熬',
       '已经死亡的美国人数量 已经死亡的美国人数量 已经死亡的美国人数量',
       '10次战争 一场战争 一场战争 两次战争 一场战争 一场战争 两次战争 一场战争 九十一次战争',
       '连葬礼都无法真正悲伤或愈合 甚至可以举行葬礼 但我也想着去年失去 不幸的意外或其他疾病的自然原因的其他人 他们也孤独地死去 他们也抛弃了爱人',
       '你知道,你经常听到我说 之前我谈论 最长的步行走来走去, 而你父母不能和 是一个短短的楼梯 飞到他孩子的卧室 说我很抱歉,但失去了我的工作 不能再在这里了',
       '就像我爸爸告诉我的 当他失去了工作的时候 他不得不做 同样的步行式补你 你失去了工作 你关闭了繁忙的繁忙 面对震动 饥饿失去了控制',
       '最糟糕的可能是失去希望 看着一代孩子 可能被放回 一年或更多 节食笔记的学校 因为失去学习 失去学习 这是生活的细节',
       '婚礼生日毕业典礼上 需要发生的所有事情 都会在第一天',
       '我们很多人的心理上都有一个可怕的 可怕的cas 在我们的心理上,因为我们是 从根本上来说,我们是人们 谁想成为无依无靠的人 与一个和Ande交谈 以相互拥抱,但这种暴力使我们和祖父母分离',
       '从未见过他们的孩子或孙孙子 父母从未见过他们的孩子 从未见过他们的孩子 从未见过他们的朋友',
       '我们彼此对对立 却把一个最容易拯救生命的面具 变成一个

## NOTE:
Results are pretty rough, which raises the question of whether translating from scratch would be better. At a minimum, it's probably a good idea to clean up the transcribed text before launching into translation.

But if you are in a big hurry to get the audio file into another language in text format, well, this is an option.

In [14]:
# Output the translated speech to a text file if you wish.

#biden_translated = df["Machine_Translation"].apply(''.join)
#biden_translated.to_csv("../transcripts/biden_chinese.txt", sep="\t", index=False)

# 3. SENTIMENT ANALYSIS

Hugging Face's pipeline has made it very easy to generate results for sentiment analysis, but that's just half the mission. Figuring out a good way to visualize the results for immediate and long-term analysis can be just as challenging.

I've been experimenting with these sentiment charts, using Plotly, to better understand the overall sentiment of a speech.

In [16]:
%%time
corpus = list(df['Clean_Text'].values)

nlp_sentiment = pipeline(
    "sentiment-analysis"
)


df["Sentiment"] = nlp_sentiment(corpus)

# The pipeline's sentiment analysis output consists of a label and a score
# I prefer to extract them into separate columns

df['Sentiment_Label'] = [x.get('label') for x in df['Sentiment']]

df['Sentiment_Score'] = [x.get('score') for x in df['Sentiment']]

CPU times: user 8.92 s, sys: 352 ms, total: 9.27 s
Wall time: 12.9 s


In [17]:
df['Sentiment_Label'].value_counts()

POSITIVE    37
NEGATIVE    34
Name: Sentiment_Label, dtype: int64

## NOTE: 

With 37 clips labelled positive and 34 negative, Biden's speech seems rather even-handed: not excessively positive or rah-rah, even though he could have chosen to go that way.

## 3.1 PLOT SENTIMENT CHART

In [18]:
# We won't need all the cols for the chart, so let's narrow down the selection

cols = ["Clean_Text", "Sentiment_Label", "Sentiment_Score"]

df_sentiment = df[cols].copy()

In [19]:
# Tweak the sentiment score column for visualisation:
# flip the sign of the score for NEGATIVE clips so the chart diverges around zero.
# The magnitude of each score is unchanged.

df_sentiment["Sentiment_Score"] = np.where(
    df_sentiment["Sentiment_Label"] == "NEGATIVE", -(df_sentiment["Sentiment_Score"]), df_sentiment["Sentiment_Score"]
)
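
To make the sign flip concrete, here is a small standalone sketch of the same `np.where` pattern on made-up labels and scores (the values are illustrative and not taken from the speech):

```python
import numpy as np

labels = np.array(["NEGATIVE", "POSITIVE", "NEGATIVE"])
scores = np.array([0.99, 0.97, 0.87])   # made-up confidence scores

# Negative clips get a negative sign; positive clips are left as they are
signed = np.where(labels == "NEGATIVE", -scores, scores)
print(signed.tolist())   # [-0.99, 0.97, -0.87]
```

The pipeline's score is the model's confidence in whichever label it picked, so it is always positive on its own. Putting the sign back in is what lets a diverging colour scale separate the two labels.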

In [20]:
df_sentiment.head()

|    | Clean_Text | Sentiment_Label | Sentiment_Score |
|---:|:---|:---|---:|
| 0 | a year ago were hit with a virus that was met ... | NEGATIVE | -0.995461 |
| 1 | nineteen feel like they were taken in another ... | NEGATIVE | -0.986311 |
| 2 | a year filled with the loss of life and the lo... | POSITIVE | 0.965319 |
| 3 | american thing to do in fact it may be the mos... | POSITIVE | 0.991791 |
| 4 | for a vaccine and so many of you ad heming way... | NEGATIVE | -0.967911 |


In [25]:
# I've experimented with various plots and settled on Plotly's Heatmap

fig = go.Figure(
    data=go.Heatmap(
        z=df_sentiment["Sentiment_Score"],
        x=df_sentiment.index,
        y=df_sentiment["Sentiment_Label"],
        colorscale=px.colors.sequential.RdBu,
    )
)

fig.update_layout(
    title=go.layout.Title(
        text="Sentiment Analysis of Biden's First Prime-Time Speech (2021)"
    ),
    autosize=False,
    width=1200,
    height=600,
)

#fig.update_layout(yaxis_autorange = "reversed")

fig.show()

In [22]:
cols_final = [
    "Transcribed_Text",
    "Clean_Text",
    "Machine_Translation",
    "Sentiment_Label",
    "Sentiment_Score",
]

df_final = df[cols_final]


# 4. FINAL OUTPUT

Here's the final dataframe with the transcribed + translated text + sentiment labels/scores. Not entirely ready for prime time, but the possibilities are pretty exciting.

In [26]:
df_final.head(10)

|    | Transcribed_Text | Clean_Text | Machine_Translation | Sentiment_Label | Sentiment_Score |
|---:|:---|:---|:---|:---|---:|
| 0 | A YEAR AGO WERE HIT WITH A VIRUS THAT WAS MET ... | a year ago were hit with a virus that was met ... | 一年前的一年,被病毒击中,病毒遭到沉默,无声无声无声的传播,持续数日数星期,然后数月,无无声... | NEGATIVE | 0.995461 |
| 1 | NINETEEN FEEL LIKE THEY WERE TAKEN IN ANOTHER ... | nineteen feel like they were taken in another ... | 和朋友一起过最后的生日 和大家族的最后假日 而对每个人都不同 我们都失去了一件 集体受苦受难的东西 | NEGATIVE | 0.986311 |
| 2 | A YEAR FILLED WITH THE LOSS OF LIFE AND THE LO... | a year filled with the loss of life and the lo... | 一年充满了生命的丧失 以及我们所有人的生命的丧失 但是在失去的一年中 | POSITIVE | 0.965319 |
| 3 | AMERICAN THING TO DO IN FACT IT MAY BE THE MOS... | american thing to do in fact it may be the mos... | 事实上,美国要做的事情 可能是美国最 的事情,我们做的, 而这就是我们已经做的 我们见过的前... | POSITIVE | 0.991791 |
| 4 | FOR A VACCINE AND SO MANY OF YOU AD HEMING WAY... | for a vaccine and so many of you ad heming way... | 你们当中这么多人 都认为这条路很坚固 我知道所有破碎的地方都很难熬 | NEGATIVE | 0.967911 |
| 5 | TIT THE NUMBER OF AMERICANS WHO HAVE DIED FOMD... | tit the number of americans who have died fomd... | 已经死亡的美国人数量 已经死亡的美国人数量 已经死亡的美国人数量 | NEGATIVE | 0.997795 |
| 6 | TEN WAR WAR ONE WAR WAR TWO VIA NA WAR AND NIN... | ten war war one war war two via na war and nin... | 10次战争 一场战争 一场战争 两次战争 一场战争 一场战争 两次战争 一场战争 九十一次战争 | POSITIVE | 0.982951 |
| 7 | UABLE TO TRULY GRIEVE OR HEAL EVEN TO HAVE A F... | uable to truly grieve or heal even to have a f... | 连葬礼都无法真正悲伤或愈合 甚至可以举行葬礼 但我也想着去年失去 不幸的意外或其他疾病的自然... | NEGATIVE | 0.873095 |
| 8 | FO'RE HURTING BADLY YOU KNOW YOU'VE OFTEN HEAR... | fo're hurting badly you know you've often hear... | 你知道,你经常听到我说 之前我谈论 最长的步行走来走去, 而你父母不能和 是一个短短的楼梯 ... | NEGATIVE | 0.998996 |
| 9 | LIKE MY DAD TOLD ME WHEN HE LOST THE JOB AN CR... | like my dad told me when he lost the job an cr... | 就像我爸爸告诉我的 当他失去了工作的时候 他不得不做 同样的步行式补你 你失去了工作 你关闭... | NEGATIVE | 0.999683 |


In [None]:
# df_final.to_csv("../transcripts/biden_prime_time.csv", index=False)