<a href="https://colab.research.google.com/github/ericphann/voice-sentiment-analysis/blob/main/speech_to_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Eric Phann / Jessica Ricks  
DSBA 6156: Applied Machine Learning

# 📝 Approach #1: Speech-to-Text / Transcription

This notebook will walk through how to transcribe audio samples, embed them, and classify the sentiment. We will then analyze the performance of the approach given some examples.

## 1. Installing dependencies

First we need to install our dependencies. We recommend using a virtual environment or Colab to better manage dependencies and storage.

In [2]:
! pip install librosa transformers torch soundfile nltk



In [3]:
import librosa
import soundfile as sf
import torch
from transformers import pipeline, Wav2Vec2ForCTC, Wav2Vec2Processor
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

We recommend using a CUDA-enabled GPU to speed up speech-to-text transcription.

In [4]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU detected:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("No GPU detected, using CPU.")

GPU detected: Tesla T4


## 2. Load & Transcribe Audio

For this experiment, we will use a pre-trained model. Here we are using [Wav2Vec2-Large-960h](https://huggingface.co/facebook/wav2vec2-large-960h).

In [6]:
# Load a pre-trained model for audio-speech transcription
transcription_processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
transcription_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h").to(device)  # Move model to GPU if available

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [33]:
# This is our function for transcribing audio

def transcribe_audio(file_path):
    # Load and process audio
    audio_input, sr = librosa.load(file_path, sr=16000)
    input_values = transcription_processor(audio_input, return_tensors="pt", padding="longest", sampling_rate=sr).input_values.to(device) # Move inputs to GPU if available

    # Predict and decode
    with torch.no_grad():
        logits = transcription_model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)

    return transcription_processor.batch_decode(predicted_ids)[0]
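
The `argmax` followed by `batch_decode` in `transcribe_audio` amounts to greedy CTC decoding: pick the most likely token per audio frame, collapse consecutive repeats, then drop the blank token. The sketch below illustrates that idea on a toy example. The vocabulary, the `blank_id`, and the function name are made up for illustration and are not the model's actual mapping.

```python
from itertools import groupby


def greedy_ctc_decode(token_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated ids, then drop blanks and map ids to characters."""
    collapsed = [token_id for token_id, _ in groupby(token_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary (assumption); the real processor has its own mapping
toy_vocab = {0: "<pad>", 1: "K", 2: "I", 3: "D", 4: "S"}

# One predicted id per audio frame, as torch.argmax would produce
frame_ids = [1, 1, 0, 2, 2, 2, 0, 3, 0, 0, 4, 4]
print(greedy_ctc_decode(frame_ids, toy_vocab))  # KIDS
```

The blank between two identical ids is what lets CTC emit a genuinely doubled letter, since only adjacent repeats are merged.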

Let's import an audio dataset from Kaggle.

In [8]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("uwrfkaggler/ravdess-emotional-speech-audio")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/uwrfkaggler/ravdess-emotional-speech-audio?dataset_version_number=1...


100%|██████████| 429M/429M [00:03<00:00, 120MB/s] 

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/uwrfkaggler/ravdess-emotional-speech-audio/versions/1


Let's test the transcribe function using an example audio file.

In [34]:
audio_file = "/root/.cache/kagglehub/datasets/uwrfkaggler/ravdess-emotional-speech-audio/versions/1/Actor_01/03-01-01-01-01-01-01.wav"  # Replace with your file
transcription = transcribe_audio(audio_file)
print("Transcription:", transcription)

Transcription: KIDS ARE TALKING BY THE DOOR


Great -- this is the transcription we expect from the example audio!  
The dataset comprises only two phrases, spoken in different ways by different actors:


*   "Kids are talking by the door"
*   "Dogs are sitting by the door"



## 3. Analyzing the Transcription

We can analyze the transcriptions in one of two ways:


1.   Raw text sentiment analysis -- simple, not so intensive
2.   Text embedding sentiment analysis -- nuanced, more intensive

To give a high-level example, take the phrase "it was not bad." The first approach might classify it as __negative__ due to the keyword "bad" dragging the overall sentiment down. However, the latter approach might successfully classify the phrase as __positive__.
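
To make the "not bad" example concrete, here is a toy comparison between a pure keyword sum and a version that looks one word back for a negator. The lexicon, the negator list, and both function names are invented for illustration. This is not how VADER or the transformer model is implemented, and VADER itself includes some negation handling, so it may well get this particular phrase right.

```python
# Hypothetical mini-lexicon for illustration only (assumption, not VADER's lexicon)
TOY_LEXICON = {"bad": -1.0, "good": 1.0, "great": 1.5, "terrible": -1.5}
NEGATORS = {"not", "never", "no"}


def keyword_score(text):
    """Sum word valences while ignoring all context."""
    return sum(TOY_LEXICON.get(word, 0.0) for word in text.lower().split())


def negation_aware_score(text):
    """Same sum, but flip a word's valence when the previous word is a negator."""
    words = text.lower().split()
    score = 0.0
    for i, word in enumerate(words):
        valence = TOY_LEXICON.get(word, 0.0)
        if i > 0 and words[i - 1] in NEGATORS:
            valence = -valence
        score += valence
    return score


phrase = "it was not bad"
print(keyword_score(phrase))         # -1.0
print(negation_aware_score(phrase))  # 1.0
```

An embedding-based classifier learns this kind of context sensitivity from data rather than from a hand-written rule.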

We will look at both approaches, but we don't expect their results to differ much. The dataset conveys sentiment through _how_ a phrase is spoken, and that is a dimension neither text-based approach can capture.



### 3. a. Raw Text

In [76]:
# NLTK Sentiment Analyzer
sia = SentimentIntensityAnalyzer()

# This is our function for raw text sentiment analysis
def sentiment_analysis(text):
    sentiment_scores = sia.polarity_scores(text)
    sentiment = 'Positive' if sentiment_scores['compound'] > 0 else 'Negative' if sentiment_scores['compound'] < 0 else 'Neutral'
    return sentiment, sentiment_scores

# Predict sentiment of the given transcription
sentiment, scores = sentiment_analysis(transcription)
print(transcription)
print(f"Sentiment: {sentiment}\nScores: {scores}")

KIDS ARE TALKING BY THE DOOR
Sentiment: Neutral
Scores: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
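
The function above treats any nonzero compound score as positive or negative. A common convention, suggested in the VADER documentation, is to leave a small neutral band of ±0.05 around zero. The helper below is a sketch of that variant; its name and the `threshold` parameter are our own, and it would not change the result for this transcription, whose compound score is exactly 0.0.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a label, with a neutral band around zero."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


print(label_from_compound(0.0))   # Neutral
print(label_from_compound(0.03))  # Neutral
print(label_from_compound(-0.4))  # Negative
```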


### 3. b. Text Embedding (Transformers)

This is how we would go about embedding the transcribed text.  
Let's use __all-MiniLM-L6-v2__ to tokenize and embed our transcription. This model is good for short sentences and paragraphs.

In [17]:
from transformers import AutoTokenizer, AutoModel
import numpy as np

# Load pre-trained model for embeddings
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
embedding_model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# This is our function for turning text into an embedding
def embed_text(text):
    encoded_input = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        model_output = embedding_model(**encoded_input)
    embeddings = model_output.last_hidden_state.mean(dim=1)  # Mean pooling over token positions
    return embeddings.squeeze().numpy()

# Embed the given transcription
embedding = embed_text(transcription)
print("Text Embedding Shape:", embedding.shape)

Text Embedding Shape: (384,)
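
`embed_text` averages over every token position, which is fine for a single sentence. If several sentences were embedded in one padded batch, the padding positions would leak into the average, so a mask-weighted mean is the usual remedy. The sketch below shows the idea with small NumPy arrays standing in for the model output and the tokenizer's attention mask; the function name and the toy values are assumptions for illustration.

```python
import numpy as np


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


# Toy batch: 2 sentences, 3 token positions, 2 dimensions.
# The last position of the second sentence is padding and must be ignored.
tokens = np.array([[[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
                   [[2.0, 2.0], [4.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 1],
                 [1, 1, 0]])

pooled = masked_mean_pool(tokens, mask)
print(pooled)  # both rows average to 3.0 in each dimension
```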


From here we would train a classification model (e.g., LogReg, SVM, NN) using these embeddings as input features. However, that is outside the scope of this experiment.  
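
For completeness, a minimal sketch of what that training step could look like with scikit-learn. The features here are random, synthetic 384-dimensional vectors standing in for `embed_text` outputs, and the labels are invented, so the numbers it prints say nothing about RAVDESS. Only the workflow is the point.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_per_class, dim = 100, 384

# Synthetic stand-ins for sentence embeddings (assumption): two shifted Gaussians
X = np.vstack([rng.normal(-0.5, 1.0, (n_per_class, dim)),
               rng.normal(0.5, 1.0, (n_per_class, dim))])
y = np.array(["Negative"] * n_per_class + ["Positive"] * n_per_class)

# Hold out a stratified test split so the score is not measured on training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"Held-out accuracy on synthetic data: {accuracy:.2f}")
```

With real data, `X` would be the stacked embeddings of the transcriptions and `y` the sentiment labels.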

We can just use a transformer pipeline to do all of the processing steps (tokenizing/embedding) for us.  
Let's use **DistilBERT base uncased finetuned SST-2**!

In [77]:
from transformers import pipeline

# Load pre-trained sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", return_all_scores=True)

# Predict sentiment directly
result = sentiment_pipeline(transcription)
print(transcription)
print(result)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


KIDS ARE TALKING BY THE DOOR
[[{'label': 'NEGATIVE', 'score': 0.019331783056259155}, {'label': 'POSITIVE', 'score': 0.9806681871414185}]]




Interesting! The transformer model classified the phrase "KIDS ARE TALKING BY THE DOOR" as positive, with a score of 98%.
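
Because the pipeline was created with `return_all_scores=True`, it returns one inner list per input, holding a score for every label. To reduce that to a single prediction we can take the highest-scoring entry. The helper name `top_label` is our own; the `result` value is copied from the output above.

```python
def top_label(all_scores):
    """Return the (label, score) pair with the highest score."""
    best = max(all_scores, key=lambda entry: entry["score"])
    return best["label"], best["score"]


# Output copied from the cell above: one inner list per input text
result = [[{'label': 'NEGATIVE', 'score': 0.019331783056259155},
           {'label': 'POSITIVE', 'score': 0.9806681871414185}]]

label, score = top_label(result[0])
print(label.title(), round(score, 2))  # Positive 0.98
```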

## 4. Evaluation

First, we need to preprocess the RAVDESS dataset so we can use it to evaluate our approach for binary sentiment classification.  

At a high level, the RAVDESS dataset has actors say the same phrases with different emotions. For more information about the dataset and its file naming conventions, see the [RAVDESS page](https://zenodo.org/records/1188976).  
These emotions are coded as the third element in the file name as follows:
* 01 = neutral
* 02 = calm
* 03 = happy
* 04 = sad
* 05 = angry
* 06 = fearful
* 07 = disgust
* 08 = surprised  

So, the emotion for 02-01-06-01-02-01-12.mp4 is _fearful_.
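
The other positions in the file name follow the same convention described on the RAVDESS page: modality, vocal channel, emotion, emotional intensity, statement, repetition, and actor. The sketch below splits a name into those seven fields; the function name and the dictionary keys are our own choices.

```python
from pathlib import Path

# Field order as documented on the RAVDESS page; the key names are our own
FIELDS = ("modality", "vocal_channel", "emotion", "intensity",
          "statement", "repetition", "actor")


def parse_ravdess_name(file_name):
    """Split a RAVDESS file name into its seven two-digit codes."""
    parts = Path(file_name).stem.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(f"Unexpected RAVDESS file name: {file_name}")
    return dict(zip(FIELDS, parts))


info = parse_ravdess_name("02-01-06-01-02-01-12.mp4")
print(info["emotion"], info["actor"])  # 06 12
```

Validating the number of fields up front gives a clearer error than an `IndexError` or `KeyError` if an unrelated file ends up in the folder.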



In [19]:
def extract_emotion(file_name):
    emotion_map = {
        "01": "Neutral",
        "02": "Calm",
        "03": "Happy",
        "04": "Sad",
        "05": "Angry",
        "06": "Fearful",
        "07": "Disgust",
        "08": "Surprised",
    }
    emotion_code = file_name.split("-")[2]  # Third element in file name
    return emotion_map[emotion_code]

To be able to use these labels for sentiment analysis, we will need to classify them as positive, negative, or neutral as follows:


*   Positive: happy, surprised
*   Neutral: neutral, calm
*   Negative: sad, angry, fearful, disgust

These groupings are purely based on intuition.



In [20]:
def map_to_sentiment(emotion):
    if emotion in ["Happy", "Surprised"]:
        return "Positive"
    elif emotion in ["Sad", "Angry", "Fearful", "Disgust"]:
        return "Negative"
    else:
        return "Neutral"

Now, we will use these functions to create a list of dictionaries containing the file name, its transcription, the true "baseline" emotion (given by RAVDESS), the true "baseline" sentiment (grouped by us as above), and the _predicted_ sentiment from each of our models (raw text and embedding pipeline). We will process every audio-only speech file in the dataset.

In [35]:
import os
import pandas as pd
!pip install tqdm
from tqdm import tqdm

results = []

audio_dir = "/root/.cache/kagglehub/datasets/uwrfkaggler/ravdess-emotional-speech-audio/versions/1"
for subdir, dirs, files in tqdm(list(os.walk(audio_dir))):
    for file in files:
        if file.endswith(".wav"):
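            file_path = os.path.join(subdir, file)
            transcription = transcribe_audio(file_path)
            emotion = extract_emotion(file)
            sentiment_label = map_to_sentiment(emotion)

            # Run sentiment analysis using NLTK raw text model
            sentiment, scores = sentiment_analysis(transcription)
            predicted_sentiment_raw = sentiment

            # Run sentiment analysis using sentiment pipeline
            sentiment_result = sentiment_pipeline(transcription)[0]
            # With return_all_scores=True the pipeline returns every label's score,
            # so keep only the highest-scoring entry
            if isinstance(sentiment_result, list):
                sentiment_result = max(sentiment_result, key=lambda s: s["score"])
            predicted_sentiment_embedding = sentiment_result['label']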

            # Save results
            results.append({
                "File": file,
                "Transcription": transcription,
                "True Emotion": emotion,
                "True Sentiment": sentiment_label,
                "Predicted Sentiment (Raw Text)": predicted_sentiment_raw,
                "Predicted Sentiment (Embedding)": predicted_sentiment_embedding
            })

df = pd.DataFrame(results)
df.to_csv("ravdess_sentiment_results.csv", index=False)



100%|██████████| 50/50 [04:55<00:00,  5.92s/it]
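
The progress bar counts 50 directories, more than the 24 actor folders that RAVDESS describes. One possible explanation, which we have not verified here, is that the download contains a mirrored copy of the actor folders, in which case `os.walk` would visit each clip twice. Since RAVDESS file names are unique per clip, a sketch like the one below could keep only the first path seen for each name. The function name is hypothetical.

```python
import os


def unique_wav_files(root_dir):
    """Map each .wav file name to the first path where os.walk finds it."""
    first_seen = {}
    for subdir, _dirs, files in os.walk(root_dir):
        for name in files:
            if name.endswith(".wav"):
                # setdefault keeps the existing entry, so later duplicates are ignored
                first_seen.setdefault(name, os.path.join(subdir, name))
    return first_seen
```

Calling `len(unique_wav_files(audio_dir))` and comparing it with the number of rows in `results` would show whether any clip was counted more than once.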


In [60]:
# Need to fix the classes of the embedding to match the others
df["Predicted Sentiment (Embedding)"] = df["Predicted Sentiment (Embedding)"].apply(str.title)

Let's take a look at our table!

In [55]:
pd.set_option('display.max_rows', None)
df.head(10)

|  | File | Transcription | True Emotion | True Sentiment | Predicted Sentiment (Raw Text) | Predicted Sentiment (Embedding) |
|---|---|---|---|---|---|---|
| 0 | 03-01-02-02-01-01-03.wav | KIDS ARE TALKING BY THE DOOR | Calm | Neutral | Neutral | Positive |
| 1 | 03-01-06-02-02-02-03.wav | JOMS ARE SITTING BY THE DOOR | Fearful | Negative | Neutral | Negative |
| 2 | 03-01-08-02-01-02-03.wav | KIDS ARE TALKING BY THE DOOR | Surprised | Positive | Neutral | Positive |
| 3 | 03-01-08-02-02-02-03.wav | DOGS ARE SITTING BY THE DOOR | Surprised | Positive | Neutral | Negative |
| 4 | 03-01-08-01-01-02-03.wav | KIDS ARE TALKING BY THE DOOR | Surprised | Positive | Neutral | Positive |
| 5 | 03-01-07-01-02-02-03.wav | DOGS ARE SITTING BY THE DOOR | Disgust | Negative | Neutral | Negative |
| 6 | 03-01-05-01-01-02-03.wav | KIDS ARE TALKING BY THE DOOR | Angry | Negative | Neutral | Positive |
| 7 | 03-01-04-01-01-01-03.wav | KIDS ARE TALKING BY THE DOOR | Sad | Negative | Neutral | Positive |
| 8 | 03-01-05-02-02-01-03.wav | DOGS ARE SITTING BY THE DOOR | Angry | Negative | Neutral | Negative |
| 9 | 03-01-08-02-01-01-03.wav | KINS ARE TALKING BY THE DOOR | Surprised | Positive | Neutral | Positive |


Now, let's look at the results of each model.

In [58]:
from sklearn.metrics import classification_report

# Compare true vs. predicted raw sentiment and true vs. predicted embedding sentiment
true_sentiments = df["True Sentiment"]
predicted_sentiments_raw = df["Predicted Sentiment (Raw Text)"]
predicted_sentiment_embeddings = df["Predicted Sentiment (Embedding)"]

In [59]:
print("True vs. Predicted Sentiment (Raw Text)")
print(classification_report(true_sentiments, predicted_sentiments_raw))
print("True vs. Predicted Sentiment (Embedding)")
print(classification_report(true_sentiments, predicted_sentiment_embeddings))

True vs. Predicted Sentiment (Raw Text)
              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00      1536
     Neutral       0.20      1.00      0.33       576
    Positive       0.00      0.00      0.00       768

    accuracy                           0.20      2880
   macro avg       0.07      0.33      0.11      2880
weighted avg       0.04      0.20      0.07      2880

True vs. Predicted Sentiment (Embedding)
              precision    recall  f1-score   support

    Negative       0.54      0.57      0.55      1536
     Neutral       0.00      0.00      0.00       576
    Positive       0.27      0.45      0.34       768

    accuracy                           0.42      2880
   macro avg       0.27      0.34      0.30      2880
weighted avg       0.36      0.42      0.39      2880




Interesting...

## 5. Conclusion

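One thing to note here is that the sentiment-analysis pipeline _does not_ have a neutral class, so it can never be correct on the neutral examples. Another thing to note is that its accuracy is about 22 percentage points higher than that of the raw text approach (0.42 vs. 0.20). However, this gap mostly reflects class imbalance: the pipeline's predictions happen to overlap with the larger Negative and Positive classes, and it does not indicate any real ability to recover sentiment from the text.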

The raw text approach is labelling all examples as neutral and obtains 20% accuracy. Intuitively, this is what we expect, as all the transcriptions are (mostly) the same and the sentiment can _only_ be derived from the emotion of the audio clip--the phrases themselves are indeed neutral.
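
The class-imbalance point can be checked with a little arithmetic on the support counts from the classification reports. The sketch below computes the accuracy a model would get by always predicting a single class; the function name is our own.

```python
def constant_baseline_accuracy(support):
    """Accuracy obtained by always predicting one class, for each class."""
    total = sum(support.values())
    return {label: count / total for label, count in support.items()}


# Support counts taken from the classification reports above
support = {"Negative": 1536, "Neutral": 576, "Positive": 768}

for label, acc in constant_baseline_accuracy(support).items():
    print(f"Always predict {label}: accuracy = {acc:.2f}")
# Always predict Negative: accuracy = 0.53
# Always predict Neutral: accuracy = 0.20
# Always predict Positive: accuracy = 0.27
```

The always-Neutral figure matches the raw text result exactly, and an always-Negative guess would beat the pipeline's 0.42, which is consistent with neither text model extracting useful sentiment signal here.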

Simply put, this approach demonstrates that speech input has signal that cannot be captured in simple, normalized text. We will explore this signal in approach #2 through Mel-Frequency Cepstral Coefficients!