Clayton Cohn <br>
14 Nov 2022 <br>
DS5899: Transformers <br>
Vanderbilt University <br>

## <center> Audeering Processing </center>

This notebook uses the PyTorch HuggingFace Transformers ```pipeline``` courtesy of [Audeering](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim) on HuggingFace. The model scores audio along three dimensions: arousal, dominance, and valence. The author of this notebook thanks the contributors for making their code publicly available.

This notebook was created by Clayton Cohn for his DS5899: Transformers class project on evaluating "educational" emotion detection in speech via Audeering.

## Import Data

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Text

In [2]:
TEXT_PATH = "drive/My Drive/DS5899_Transformers/transcriptions.csv"
AUDIO_PATH = "drive/My Drive/DS5899_Transformers/wav/"

In [3]:
import pandas as pd

df = pd.read_csv(TEXT_PATH, header=0).sort_values(by ='file' )
df.reset_index(inplace=True, drop=True)
df

Unnamed: 0,file,text
0,confusion_0.wav,This just isn't making sense to me.
1,confusion_1.wav,I'm just not getting it.
2,confusion_2.wav,I don't think I understand what you mean.
3,confusion_3.wav,Why won't the code run?
4,confusion_4.wav,"Wait, so what are you talking about?"
5,eureka_0.wav,"Aha, finally!"
6,eureka_1.wav,"Ohhh, yeah that makes so much sense!"
7,eureka_2.wav,"Of course, now I get it!"
8,eureka_3.wav,That was so easy!
9,eureka_4.wav,I can't believe I didn't figure that out earli...


### Audio

In [4]:
from os import listdir
from os.path import isfile, join

AUDIO_FILES = sorted([f for f in listdir(AUDIO_PATH) \
  if isfile(join(AUDIO_PATH, f))])

AUDIO_FILES, len(AUDIO_FILES)

(['confusion_0.wav',
  'confusion_1.wav',
  'confusion_2.wav',
  'confusion_3.wav',
  'confusion_4.wav',
  'eureka_0.wav',
  'eureka_1.wav',
  'eureka_2.wav',
  'eureka_3.wav',
  'eureka_4.wav',
  'frustration_0.wav',
  'frustration_1.wav',
  'frustration_2.wav',
  'frustration_3.wav',
  'frustration_4.wav',
  'neutral_0.wav',
  'neutral_1.wav',
  'neutral_2.wav',
  'neutral_3.wav',
  'neutral_4.wav'],
 20)
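
The listing above keeps every regular file in the folder, so a stray non-audio file would later break the equality check against the transcription table. As a minimal sketch (the helper name `list_wav_files` and the temporary folder are hypothetical, not part of the notebook), the same listing can be restricted to `.wav` files with `pathlib`:

```python
import tempfile
from pathlib import Path


def list_wav_files(folder):
    """Return the sorted names of the .wav files directly inside `folder`."""
    return sorted(
        p.name for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".wav"
    )


# Demonstrate on a throwaway directory rather than the Drive folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["eureka_0.wav", "confusion_0.wav", "notes.txt"]:
        (Path(tmp) / name).touch()
    (Path(tmp) / "subdir").mkdir()
    wav_files = list_wav_files(tmp)

print(wav_files)
```

Sorting is lexicographic, as in the original cell, so a name such as `eureka_10.wav` would sort before `eureka_2.wav`.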

In [5]:
assert list(df["file"]) == AUDIO_FILES
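
A bare `assert` fails without saying which entries differ. As a minimal sketch (the helper name `diff_file_lists` is hypothetical, and the inputs are toy values), a set difference reports what is missing on each side:

```python
def diff_file_lists(csv_files, disk_files):
    """Return (missing_on_disk, missing_in_csv) as sorted lists."""
    csv_set, disk_set = set(csv_files), set(disk_files)
    return sorted(csv_set - disk_set), sorted(disk_set - csv_set)


# Toy inputs for illustration only; in the notebook these would be
# list(df["file"]) and AUDIO_FILES.
missing_on_disk, missing_in_csv = diff_file_lists(
    ["confusion_0.wav", "eureka_0.wav"],
    ["confusion_0.wav", "neutral_0.wav"],
)
print(missing_on_disk)  # files listed in the CSV but absent from the folder
print(missing_in_csv)   # files in the folder but absent from the CSV
```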

## Prepare Environment

In [6]:
!pip install torch --quiet
!pip install transformers --quiet
!pip install torchaudio --quiet
!pip install librosa --quiet
!pip install pydub --quiet

In [7]:
import numpy as np
import torch
import torch.nn as nn
from transformers import Wav2Vec2Processor
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    Wav2Vec2Model,
    Wav2Vec2PreTrainedModel,
)
import torchaudio
from pydub import AudioSegment
from pydub.silence import split_on_silence
import pandas as pd
import IPython.display as ipd
import librosa

## Set Up Model

The ```RegressionHead``` sits on top of the Transformer and outputs one score for each of the three dimensions (arousal, dominance, and valence), each in a range of approximately 0 to 1.

In [8]:
class RegressionHead(nn.Module):
    r"""Classification head."""

    def __init__(self, config):

        super().__init__()

        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):

        x = features
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)

        return x
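
To make the data flow through the head concrete, here is a rough NumPy sketch of the same sequence of operations at inference time. The sizes and the random weights are illustrative assumptions, not the real configuration or the trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size, num_labels = 8, 3   # illustrative sizes, not the real config
batch = 2

# Random stand-ins for the learned weights of `dense` and `out_proj`.
w_dense = rng.normal(size=(hidden_size, hidden_size))
b_dense = np.zeros(hidden_size)
w_out = rng.normal(size=(hidden_size, num_labels))
b_out = np.zeros(num_labels)

features = rng.normal(size=(batch, hidden_size))  # pooled encoder output

# In evaluation mode dropout is the identity, so it is omitted here.
x = np.tanh(features @ w_dense + b_dense)
scores = x @ w_out + b_out

print(scores.shape)  # (2, 3): one arousal/dominance/valence triple per clip
```

The final projection is linear, so nothing in the head itself clips the scores; with random weights like these the values are not meaningful, only the shapes are.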

```EmotionModel``` is the pretrained Wav2Vec2 model with the regression head on top.

In [9]:
class EmotionModel(Wav2Vec2PreTrainedModel):
    r"""Speech emotion classifier."""

    def __init__(self, config):

        super().__init__(config)

        self.config = config
        self.wav2vec2 = Wav2Vec2Model(config)
        self.classifier = RegressionHead(config)
        self.init_weights()

    def forward(
            self,
            input_values,
    ):

        outputs = self.wav2vec2(input_values)
        hidden_states = outputs[0]
        hidden_states = torch.mean(hidden_states, dim=1)
        logits = self.classifier(hidden_states)

        return hidden_states, logits
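
Wav2Vec2 emits one hidden vector per audio frame, so clips of different lengths yield different numbers of frames. Averaging over the time axis, as `forward` does with `torch.mean(hidden_states, dim=1)`, collapses them into one fixed-size vector per clip. A small NumPy sketch of that pooling step, with made-up shapes:

```python
import numpy as np

# Hypothetical shapes: 2 clips, 5 time frames, hidden size 4.
hidden_states = np.arange(2 * 5 * 4, dtype=np.float32).reshape(2, 5, 4)

# Average over the time axis (axis 1), as torch.mean(..., dim=1) does.
pooled = hidden_states.mean(axis=1)

print(pooled.shape)  # (2, 4)
```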

We will now prepare to instantiate the model. 

The ```DEVICE``` constant is our device (either CPU or GPU).

The ```MODEL_NAME``` is the fine-tuned model from the HuggingFace hub.

The ```PROCESSOR``` is the Wav2Vec2 processor that wraps the feature extractor and tokenizer. 

In [10]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# DEVICE = torch.device("cpu")

MODEL_NAME = 'audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'

PROCESSOR = Wav2Vec2Processor.from_pretrained(MODEL_NAME)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Instantiate the model with the fine-tuned Audeering weights, and transfer it to the GPU if we have one.

In [11]:
model = EmotionModel.from_pretrained(MODEL_NAME)
model.to(DEVICE);

## Prep for Processing

Define the ```SAMPLING_RATE``` constant. Wav2Vec2 requires a sampling rate of 16,000 Hz.

In [12]:
SAMPLING_RATE = 16000

Test that the shape is correct by using ```librosa``` to load one of our audio files and wrapping it in a NumPy array with a leading batch dimension for Wav2Vec2.

In [13]:
test_audio, _ = librosa.load(AUDIO_PATH + AUDIO_FILES[0], sr=SAMPLING_RATE)
test_audio = np.array([test_audio])
test_audio.shape

(1, 33977)

In [14]:
test_audio

array([[1.0534357e-04, 6.1323291e-05, 4.7091438e-05, ..., 6.3422631e-04,
        5.4774678e-04, 0.0000000e+00]], dtype=float32)
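
The second dimension of the shape printed above is the number of samples, so dividing by the sampling rate gives the clip length. A quick arithmetic check:

```python
SAMPLING_RATE = 16000  # samples per second, as defined above
num_samples = 33977    # second dimension of the shape printed above

duration_seconds = num_samples / SAMPLING_RATE
print(f"{duration_seconds:.3f} s")  # 2.124 s
```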

Define the processing function that will run the instances through the model.

In [15]:
def process(
    x: np.ndarray,
    sampling_rate: int,
    embeddings: bool = False,
) -> np.ndarray:
    r"""Predict emotions or extract embeddings from raw audio signal."""

    # run through processor to normalize signal
    # always returns a batch, so we just get the first entry
    # then we put it on the device
    # use the sampling_rate argument rather than the global constant
    y = PROCESSOR(x, sampling_rate=sampling_rate)
    y = y['input_values'][0]
    y = torch.from_numpy(y).to(DEVICE)

    # run through model
    with torch.no_grad():
        y = model(y)[0 if embeddings else 1]

    # convert to numpy
    y = y.detach().cpu().numpy()

    return y
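
The "normalize signal" comment in `process` refers to the processor's preprocessing. Assuming this is the zero-mean, unit-variance scaling that the Wav2Vec2 feature extractor applies when its `do_normalize` option is enabled, the step can be sketched roughly as follows. The function name `zero_mean_unit_var` and the synthetic tone are illustrative, not taken from the notebook:

```python
import numpy as np


def zero_mean_unit_var(signal, eps=1e-7):
    """Scale a 1-D signal to zero mean and (approximately) unit variance."""
    signal = np.asarray(signal, dtype=np.float32)
    return (signal - signal.mean()) / np.sqrt(signal.var() + eps)


# A synthetic one-second 440 Hz tone stands in for a real recording.
t = np.arange(16000) / 16000
tone = 0.1 * np.sin(2 * np.pi * 440 * t)

normalized = zero_mean_unit_var(tone)
print(normalized.mean(), normalized.std())
```

After this scaling, a quiet recording and a loud recording of the same utterance look much alike to the model, which is one reason the raw amplitude of the input does not need to be adjusted by hand.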

Test the processing function.

In [16]:
process(test_audio, SAMPLING_RATE, embeddings=False)

array([[0.49106094, 0.5793368 , 0.1995954 ]], dtype=float32)

## Process Audio

Run all of the audio through the model and collect the results in a list of rows for later use in a DataFrame.

In [17]:
def predict_adv(path, files, rate):
  # file_name, audio, arousal, dominance, valence
  arr = [["file_name", "audio", "arousal", "dominance", "valence"]]

  for fil in files:
    audio = AudioSegment.from_wav(path+fil)
    audio = audio.set_frame_rate(rate)
    audio_arr, _ = librosa.load(path+fil, sr = rate)
    audio_arr = np.array([audio_arr])
    arousal, dominance, valence = process(audio_arr, rate, \
                                          embeddings=False)[0]
    a = [fil, audio, arousal, dominance, valence]
    arr.append(a)

  # Prints final row
  print(fil, f"\narousal: {arousal},", f"dominance: {dominance},", \
        f"valence: {valence}")
  return arr

Process all of the audio clips.

In [18]:
arr = predict_adv(AUDIO_PATH, AUDIO_FILES, SAMPLING_RATE)

neutral_4.wav 
arousal: 0.4385491609573364, dominance: 0.5383925437927246, valence: 0.5630528330802917


Put everything into DataFrames.

In [19]:
df_audio = pd.DataFrame(data=arr[1:], columns=arr[0])
df_audio.head(1)

Unnamed: 0,file_name,audio,arousal,dominance,valence
0,confusion_0.wav,(((<pydub.audio_segment.AudioSegment object at...,0.491061,0.579337,0.199595
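
The list returned by `predict_adv` stores the column names in its first row and the data in the remaining rows, which is why the DataFrame is built from `arr[1:]` and `arr[0]`. A library-free sketch of the same split, using a toy list whose values are copied from the outputs above (the audio column is left out for brevity):

```python
# Toy stand-in for the list returned by predict_adv: a header row followed
# by one row per clip.
arr = [
    ["file_name", "arousal", "dominance", "valence"],
    ["confusion_0.wav", 0.491061, 0.579337, 0.199595],
    ["eureka_0.wav", 0.779292, 0.827495, 0.942782],
]

header, *rows = arr                      # same split as arr[0] and arr[1:]
records = [dict(zip(header, row)) for row in rows]

print(records[0]["valence"])  # 0.199595
```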


## Display

Display everything using IPython.

In [20]:
df_audio['audio'] = df_audio['audio'].apply(lambda x:x._repr_html_().replace('\n', '').strip())
df_html = df_audio.to_html(escape=False, index=False)
ipd.display(ipd.HTML(df_html))

file_name,audio,arousal,dominance,valence
confusion_0.wav,[audio player],0.491061,0.579337,0.199595
confusion_1.wav,[audio player],0.496632,0.564744,0.250415
confusion_2.wav,[audio player],0.340029,0.463059,0.223187
confusion_3.wav,[audio player],0.476211,0.533171,0.327067
confusion_4.wav,[audio player],0.452598,0.498192,0.518139
eureka_0.wav,[audio player],0.779292,0.827495,0.942782
eureka_1.wav,[audio player],0.614347,0.624977,0.468896
eureka_2.wav,[audio player],0.594661,0.659573,0.474535
eureka_3.wav,[audio player],0.732855,0.750316,0.576097
eureka_4.wav,[audio player],0.644227,0.692528,0.307335


## Merge

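We will now merge the text and audio DataFrames into one. The columns are copied by row position, which is safe here because both DataFrames follow the same sorted file order (checked by the assertion in the Audio section).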

In [21]:
df["arousal"] = df_audio["arousal"]
df["dominance"] = df_audio["dominance"]
df["valence"] = df_audio["valence"]
df

Unnamed: 0,file,text,arousal,dominance,valence
0,confusion_0.wav,This just isn't making sense to me.,0.491061,0.579337,0.199595
1,confusion_1.wav,I'm just not getting it.,0.496632,0.564744,0.250415
2,confusion_2.wav,I don't think I understand what you mean.,0.340029,0.463059,0.223187
3,confusion_3.wav,Why won't the code run?,0.476211,0.533171,0.327067
4,confusion_4.wav,"Wait, so what are you talking about?",0.452598,0.498192,0.518139
5,eureka_0.wav,"Aha, finally!",0.779292,0.827495,0.942782
6,eureka_1.wav,"Ohhh, yeah that makes so much sense!",0.614347,0.624977,0.468896
7,eureka_2.wav,"Of course, now I get it!",0.594661,0.659573,0.474535
8,eureka_3.wav,That was so easy!,0.732855,0.750316,0.576097
9,eureka_4.wav,I can't believe I didn't figure that out earli...,0.644227,0.692528,0.307335
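
Copying columns across relies on both DataFrames sharing the same row order. If that could not be guaranteed, one alternative would be to join on the file name instead. The sketch below is an assumption about how that might look, not the notebook's method; it uses toy frames with the notebook's column names, and `df_audio` is deliberately given in a different order:

```python
import pandas as pd

# Toy frames; values are copied from the outputs above.
df = pd.DataFrame({
    "file": ["confusion_0.wav", "eureka_0.wav"],
    "text": ["This just isn't making sense to me.", "Aha, finally!"],
})
df_audio = pd.DataFrame({
    "file_name": ["eureka_0.wav", "confusion_0.wav"],
    "arousal": [0.779292, 0.491061],
    "dominance": [0.827495, 0.579337],
    "valence": [0.942782, 0.199595],
})

merged = df.merge(
    df_audio,
    left_on="file",
    right_on="file_name",
    how="left",
    validate="one_to_one",   # raise if a file name is duplicated on either side
).drop(columns="file_name")

print(merged)
```

A left merge keeps the row order of `df`, so the scores end up next to the right transcription even though the two inputs were ordered differently.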


Add the labels for classification later. Each label is the file name with its last six characters (for example, ```_0.wav```) removed.

In [22]:
df["label"] = df["file"].str.slice(0,-6)
df

Unnamed: 0,file,text,arousal,dominance,valence,label
0,confusion_0.wav,This just isn't making sense to me.,0.491061,0.579337,0.199595,confusion
1,confusion_1.wav,I'm just not getting it.,0.496632,0.564744,0.250415,confusion
2,confusion_2.wav,I don't think I understand what you mean.,0.340029,0.463059,0.223187,confusion
3,confusion_3.wav,Why won't the code run?,0.476211,0.533171,0.327067,confusion
4,confusion_4.wav,"Wait, so what are you talking about?",0.452598,0.498192,0.518139,confusion
5,eureka_0.wav,"Aha, finally!",0.779292,0.827495,0.942782,eureka
6,eureka_1.wav,"Ohhh, yeah that makes so much sense!",0.614347,0.624977,0.468896,eureka
7,eureka_2.wav,"Of course, now I get it!",0.594661,0.659573,0.474535,eureka
8,eureka_3.wav,That was so easy!,0.732855,0.750316,0.576097,eureka
9,eureka_4.wav,I can't believe I didn't figure that out earli...,0.644227,0.692528,0.307335,eureka
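
Slicing off a fixed six characters works for the files here because every index is a single digit. A hedged alternative, in case a dataset ever has ten or more clips per label, is to split on the last underscore instead. The helper name `label_from_filename` and the file `confusion_12.wav` are hypothetical:

```python
def label_from_filename(file_name):
    """Return the text before the last underscore, e.g. 'eureka_3.wav' -> 'eureka'."""
    stem = file_name.rsplit(".", 1)[0]   # drop the extension
    return stem.rsplit("_", 1)[0]        # drop the trailing index


print(label_from_filename("confusion_0.wav"))   # confusion
print(label_from_filename("confusion_12.wav"))  # confusion
```

For comparison, the fixed slice would turn `confusion_12.wav` into `confusion_`, with a trailing underscore.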


Confirm four distinct labels.

In [23]:
assert set(df["label"]) == {"confusion", "eureka", "frustration", "neutral"}
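
The set comparison confirms which labels are present but not how many clips each one has. A small sketch with `collections.Counter` checks the balance as well; the file names are rebuilt here from the listing earlier in the notebook rather than read from `df`:

```python
from collections import Counter

# Rebuild the 20 file names listed earlier: four labels, five clips each.
labels = ["confusion", "eureka", "frustration", "neutral"]
file_names = [f"{label}_{i}.wav" for label in labels for i in range(5)]

# Same rule as the notebook: drop the last six characters of each name.
counts = Counter(name[:-6] for name in file_names)
print(counts)
```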

## Save

In [24]:
OUT_FILE = "drive/My Drive/DS5899_Transformers/emotions.csv"

In [25]:
df.to_csv(path_or_buf=OUT_FILE, index=False, encoding='utf-8')