<a href="https://colab.research.google.com/github/ericphann/voice-sentiment-analysis/blob/main/speech_to_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Eric Phann / Jessica Ricks  
DSBA 6156: Applied Machine Learning

# 📝 Approach #1: Speech-to-Text / Transcription

This notebook will walk through how to transcribe audio samples, embed them, and classify the sentiment of those embeddings. We will then analyze the performance of the approach given some examples.

## 1. Installing dependencies

First we need to install our dependencies. We recommend using a virtual environment or Colab to better manage dependencies and storage.

In [5]:
! pip install librosa transformers torch soundfile nltk kagglehub



In [6]:
import librosa
import soundfile as sf
import torch
from transformers import pipeline, Wav2Vec2ForCTC, Wav2Vec2Processor
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

## 2. Load & Transcribe Audio

For this experiment, we will use a pre-trained model. Here we are using [Wav2Vec2-Large-960h](https://huggingface.co/facebook/wav2vec2-large-960h).

In [7]:
# Load a pre-trained model for audio-speech transcription
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
# This is our function for transcribing audio
def transcribe_audio(file_path):
    # Load the audio file
    audio_input, sr = librosa.load(file_path, sr=16000)  # Wav2Vec2 expects 16kHz audio

    # Process audio and convert to tensor
    input_values = processor(audio_input, return_tensors="pt", padding="longest", sampling_rate=sr).input_values

    # Predict transcription
    with torch.no_grad():
        logits = model(input_values).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)[0]

    return transcription
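
The last two steps of `transcribe_audio` (taking the `argmax` over the logits and calling `batch_decode`) amount to greedy CTC decoding. The sketch below is a simplified, standalone illustration of that idea rather than the library's actual implementation. The frame sequence is made up, and the function name `greedy_ctc_decode` is hypothetical; `<pad>` and `|` are the blank and word-delimiter tokens used by this checkpoint's tokenizer.

```python
from itertools import groupby

BLANK = "<pad>"   # CTC blank token in the Wav2Vec2 vocabulary
WORD_SEP = "|"    # word delimiter token in the Wav2Vec2 vocabulary

def greedy_ctc_decode(frame_tokens):
    """Collapse per-frame argmax tokens into text (simplified illustration)."""
    # 1. Merge consecutive repeats: a token held over adjacent frames counts once.
    collapsed = [token for token, _ in groupby(frame_tokens)]
    # 2. Drop blanks, which separate genuinely repeated letters (the two O's in DOOR).
    letters = [token for token in collapsed if token != BLANK]
    # 3. Turn word delimiters into spaces.
    return "".join(letters).replace(WORD_SEP, " ").strip()

# Made-up per-frame predictions, one token per audio frame.
frames = ["<pad>", "D", "D", "<pad>", "O", "O", "<pad>", "O", "R", "R", "|", "<pad>"]
decoded = greedy_ctc_decode(frames)
print(decoded)
```

Without the blank between the two runs of `O`, step 1 would merge them into a single letter, which is why CTC models emit a blank between repeated characters.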

Let's download the RAVDESS emotional speech audio dataset from Kaggle using `kagglehub`.

In [13]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("uwrfkaggler/ravdess-emotional-speech-audio")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/uwrfkaggler/ravdess-emotional-speech-audio/versions/1


Let's test the transcribe function using an example audio file.

In [14]:
import os

# Build the path from the download location returned above (replace with your own file if needed)
audio_file = os.path.join(path, "Actor_01", "03-01-01-01-01-01-01.wav")
transcription = transcribe_audio(audio_file)
print("Transcription:", transcription)

Transcription: KIDS ARE TALKING BY THE DOOR


Great! This is the transcription we expect for the example audio.
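
We know what to expect because RAVDESS encodes its labels in each file name as seven hyphen-separated fields: modality, vocal channel, emotion, intensity, statement, repetition, and actor. The sketch below assumes the code tables given in the RAVDESS documentation; the helper `parse_ravdess_filename` is hypothetical and not part of any library.

```python
from pathlib import Path

# Code tables as described in the RAVDESS dataset documentation.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}
STATEMENTS = {
    "01": "Kids are talking by the door",
    "02": "Dogs are sitting by the door",
}

def parse_ravdess_filename(file_path):
    """Split a RAVDESS file name into labelled fields (hypothetical helper)."""
    # Field order: modality-vocal_channel-emotion-intensity-statement-repetition-actor
    parts = Path(file_path).stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"Unexpected RAVDESS file name: {file_path}")
    _, _, emotion, intensity, statement, _, actor = parts
    return {
        "emotion": EMOTIONS[emotion],
        "intensity": "normal" if intensity == "01" else "strong",
        "statement": STATEMENTS[statement],
        "actor": int(actor),
    }

info = parse_ravdess_filename("Actor_01/03-01-01-01-01-01-01.wav")
print(info)
```

The emotion field also gives us a ground-truth label for each clip, which is useful when comparing predicted sentiment against the intended emotion.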

## 3. Analyzing the Transcription

We can analyze the transcriptions in one of two ways:

1. Raw text sentiment analysis: simple and computationally light
2. Text embedding sentiment analysis: more nuanced, but more computationally intensive

To give a high-level example, take the phrase "it was not bad." The first approach might classify it as __negative__ due to the keyword "bad" dragging the overall sentiment down. However, the latter approach might successfully classify the phrase as __positive__.
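
To make the "not bad" example concrete, here is a toy comparison between plain keyword scoring and scoring with a one-word negation rule. The lexicon, the negation list, and both function names are hypothetical and far simpler than a real analyzer (VADER, for instance, ships its own negation heuristics), so treat this purely as an illustration of why context matters.

```python
# Toy lexicon and negation list; both are hypothetical and much smaller than VADER's.
LEXICON = {"bad": -1.0, "good": 1.0, "great": 2.0, "terrible": -2.0}
NEGATIONS = {"not", "never", "no"}

def keyword_score(text):
    """Sum word scores while ignoring context."""
    return sum(LEXICON.get(word, 0.0) for word in text.lower().split())

def negation_aware_score(text):
    """Flip the score of the word that directly follows a negation."""
    score, negate = 0.0, False
    for word in text.lower().split():
        if word in NEGATIONS:
            negate = True
            continue
        value = LEXICON.get(word, 0.0)
        score += -value if negate else value
        negate = False  # the negation only applies to the next word
    return score

phrase = "it was not bad"
naive = keyword_score(phrase)
aware = negation_aware_score(phrase)
print(f"keyword score: {naive}, negation-aware score: {aware}")
```

The keyword scorer sees only "bad" and goes negative, while the negation-aware scorer flips that contribution and goes positive.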

We will look at both approaches, although we don't expect their results to differ much. The dataset conveys sentiment through how a sentence is _spoken_ rather than through the words themselves, and neither text-based approach can capture that dimension well.



### 3. a. Raw Text

In [16]:
# NLTK Sentiment Analyzer
sia = SentimentIntensityAnalyzer()

# This is our function for raw text sentiment analysis
def sentiment_analysis(text):
    sentiment_scores = sia.polarity_scores(text)
    compound = sentiment_scores['compound']
    # Label by the sign of the compound score; exactly zero is neutral
    if compound > 0:
        sentiment = 'Positive'
    elif compound < 0:
        sentiment = 'Negative'
    else:
        sentiment = 'Neutral'
    return sentiment, sentiment_scores

# Predict sentiment of the given transcription
sentiment, scores = sentiment_analysis(transcription)
print(f"Sentiment: {sentiment}\nScores: {scores}")

Sentiment: Neutral
Scores: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
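
The result is neutral because VADER scored every word in the sentence as neutral (`neu: 1.0`). Note that the function above labels any nonzero compound score as positive or negative. A common convention in the VADER documentation is to leave a small neutral band of ±0.05 around zero instead. The sketch below shows that variant; the function name `label_from_compound` and its `threshold` parameter are illustrative, not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Made-up compound scores to show where the boundaries fall.
labels = [label_from_compound(c) for c in (0.0, 0.02, 0.43, -0.6)]
print(labels)
```

With the band in place, a faintly positive score such as 0.02 stays neutral rather than being reported as positive.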


### 3. b. Text Embedding (Transformers)

This is how we would go about embedding the transcribed text:

In [17]:
from transformers import AutoTokenizer, AutoModel
import numpy as np

# Load pre-trained model for embeddings
# (named separately so the Wav2Vec2 `model` used by transcribe_audio is not overwritten)
embed_tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
embed_model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# This is our function for turning text into an embedding
def embed_text(text):
    encoded_input = embed_tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        model_output = embed_model(**encoded_input)
    # Mean pooling over real tokens only (the attention mask excludes padding)
    mask = encoded_input['attention_mask'].unsqueeze(-1).float()
    embeddings = (model_output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    return embeddings.squeeze().numpy()

# Embed the given transcription
embedding = embed_text(transcription)
print("Text Embedding Shape:", embedding.shape)

Text Embedding Shape: (384,)


From here, we would train a classification model (e.g., logistic regression, an SVM, or a neural network) using these embeddings as input features. However, that is outside the scope of this experiment.
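
For readers who want to see the shape of that step, here is a minimal sketch of fitting a binary logistic regression on 384-dimensional vectors. It is not part of this experiment: the data is random placeholder data standing in for real embeddings and labels, and the model is hand-rolled in NumPy only to keep the example dependency-free (in practice a library such as scikit-learn would be the usual choice). The helper name `train_logreg` and its hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: random 384-dimensional vectors standing in for real embeddings,
# drawn from two synthetic classes whose means are shifted apart.
n_per_class, dim = 100, 384
X = np.vstack([
    rng.normal(-0.2, 1.0, size=(n_per_class, dim)),
    rng.normal(+0.2, 1.0, size=(n_per_class, dim)),
])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Shuffle, then hold out 25% of the samples for testing.
order = rng.permutation(len(y))
X, y = X[order], y[order]
split = int(0.75 * len(y))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

def train_logreg(X, y, lr=0.1, epochs=200):
    """Binary logistic regression fitted with batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        w -= lr * (X.T @ (p - y)) / len(y)       # gradient of the mean log loss w.r.t. w
        b -= lr * np.mean(p - y)                 # gradient w.r.t. the bias
    return w, b

w, b = train_logreg(X_train, y_train)
predictions = (X_test @ w + b > 0).astype(int)
accuracy = float(np.mean(predictions == y_test))
print(f"Held-out accuracy: {accuracy:.2f}")
```

With real data, `X` would hold one `embed_text` output per transcription and `y` the corresponding sentiment labels.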

Instead, we can use a `transformers` pipeline, which handles all of the processing steps (tokenizing and embedding) for us:

In [22]:
from transformers import pipeline

# Load pre-trained sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

# Predict sentiment directly
result = sentiment_pipeline(transcription)
print(result)

[{'label': 'POSITIVE', 'score': 0.980668306350708}]


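Interesting! The transformer model classified the phrase "KIDS ARE TALKING BY THE DOOR" as positive, with a confidence score of about 0.98. Keep in mind that this model was fine-tuned on SST-2, which only has positive and negative labels, so it has no neutral class to fall back on and must pick one of the two.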
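
The score reported by the pipeline can be read as a probability over the model's output classes, obtained by applying a softmax to its raw logits (our understanding of the pipeline's default for a two-class model). The sketch below uses made-up logits, not values taken from the actual model, to show how a two-class softmax always commits to one label.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the two classes, in the order (NEGATIVE, POSITIVE).
labels = ["NEGATIVE", "POSITIVE"]
probs = softmax([-1.9, 2.0])
best = max(range(len(labels)), key=lambda i: probs[i])
print({"label": labels[best], "score": probs[best]})
```

Because the two probabilities must sum to 1, even a sentence with no real sentiment ends up assigned to one class, sometimes with a high score.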

## 4. Evaluation

Now let's run these two approaches over our test set to see which performs better: