# Metrics for ASR Evaluation
* Notebook by Adam Lang
* Date: 5/5/2025

# Overview
* This notebook contains code examples of various approaches to evaluating automatic speech recognition.

## 1. Confidence-based evaluation
* Many speech recognition models output a confidence score or probability distribution over possible transcripts.
* If your model provides this information, you can use it to estimate the reliability of the ASR transcript.
* You can threshold the confidence scores to filter out low-confidence transcripts or words.
* In PyTorch, you can access the output probabilities of your model and compute the confidence scores.
* For example, if your model outputs a tensor `logits` with shape `(batch_size, sequence_length, vocab_size)`, you can compute a confidence score for each position as seen below.

In [None]:
import torch
import torch.nn.functional as F

logits = ...  # output of your speech recognition model
probs = F.softmax(logits, dim=-1)  # compute probabilities
confidence_scores = probs.max(dim=-1).values  # compute confidence scores
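
The thresholding step described above can be sketched without a model. This is a minimal illustration, assuming you have already extracted one confidence value per token (for example from `confidence_scores`). The token list, the confidence values, the `0.6` threshold, and both helper functions are hypothetical and only serve the example.

```python
import math

# Hypothetical per-token confidences for one transcript (illustrative values only).
tokens = ["please", "call", "stella", "tomorrow"]
token_confidences = [0.98, 0.91, 0.42, 0.88]

def low_confidence_tokens(tokens, confidences, threshold=0.6):
    """Return (token, confidence) pairs whose confidence falls below the threshold."""
    return [(t, c) for t, c in zip(tokens, confidences) if c < threshold]

def utterance_confidence(confidences):
    """Geometric mean of token confidences, i.e. exp of the mean log-probability."""
    if not confidences:
        return 0.0
    # Clamp to avoid log(0) when a confidence is exactly zero.
    log_sum = sum(math.log(max(c, 1e-12)) for c in confidences)
    return math.exp(log_sum / len(confidences))

flagged = low_confidence_tokens(tokens, token_confidences)
score = utterance_confidence(token_confidences)
print("Low-confidence tokens:", flagged)
print("Utterance confidence:", round(score, 3))
```

The geometric mean is one possible aggregate. It penalizes a single very uncertain token more than an arithmetic mean would, which may or may not suit your use case.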

## 2. Language model-based evaluation -- Perplexity
* You can use any language model (LM) to evaluate the **fluency and coherence** of the generated transcripts.
* The concept is that a well-formed transcript should have a **low perplexity** under a LM.
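* You can use a pre-trained LM like BERT or a dedicated speech recognition LM. Note that BERT is a masked LM, so the score it yields is a pseudo-perplexity (each token is predicted from its two-sided context), whereas a causal LM gives standard perplexity.
  * Note: Depending on the length of the transcript, consider an LM with a maximum sequence length above 512 tokens (BERT base is limited to 512), such as ModernBERT. Otherwise you will have to chunk or truncate the text first.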
* In PyTorch, you can use the transformers library to load a pre-trained LM and compute the perplexity of your transcripts as seen in the code below.

In [None]:
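import torch
from transformers import BertTokenizer, BertForMaskedLM ## can use any masked LM

## load model and tokenizer (the LM head is needed to get vocabulary logits)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

## load transcript
transcript = ...  # your transcript text

## tokenize transcript (truncate to the model's maximum length)
inputs = tokenizer(transcript, return_tensors='pt', truncation=True)
input_ids = inputs['input_ids']

## BERT is a masked LM, so mask one token at a time and score the original token
nlls = []
with torch.no_grad():
    for i in range(1, input_ids.shape[1] - 1):  # skip [CLS] and [SEP]
        masked_ids = input_ids.clone()
        masked_ids[0, i] = tokenizer.mask_token_id
        logits = model(input_ids=masked_ids, attention_mask=inputs['attention_mask']).logits
        nlls.append(torch.nn.functional.cross_entropy(logits[0, i].unsqueeze(0), input_ids[0, i].unsqueeze(0)))

## calculate loss (mean negative log-likelihood over tokens)
loss = torch.stack(nlls).mean()

## calculate (pseudo-)perplexity
perplexity = torch.exp(loss)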
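
The note above about chunking long transcripts can be sketched as follows. This is a rough illustration, not a prescribed method: `chunk_token_ids` is a hypothetical helper, the `max_len` and `stride` values are arbitrary, and the integer list stands in for real token ids. With BERT you would also need to leave room for the `[CLS]` and `[SEP]` tokens in each window.

```python
def chunk_token_ids(token_ids, max_len=512, stride=256):
    """Split a list of token ids into overlapping windows of at most max_len ids."""
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    if not token_ids:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
        start += stride
    return chunks

# Stand-in for the token ids of a long transcript.
token_ids = list(range(1200))
chunks = chunk_token_ids(token_ids, max_len=512, stride=256)
print("Window lengths:", [len(c) for c in chunks])
```

Each window can then be scored separately, and the per-token losses combined (weighted by the number of scored tokens) before exponentiating. Overlapping windows give later tokens some left context at the cost of extra compute.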

## 3. SemDistance
* SemDist (Semantic Distance) is a metric used in automatic speech recognition (ASR) to assess the semantic similarity between the reference (human) transcription and the hypothesis (system) transcription.
* It measures the distance between these two sentences in an embedding space, typically using pre-trained language models like RoBERTa.
* **A lower SemDist score indicates greater semantic similarity between the reference and hypothesis.**

### Step 1: Load model from Hugging Face

In [None]:
import torch
from transformers import RobertaTokenizer, RobertaModel

## Load model and tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

### Step 2: Preprocess the reference and hypothesis transcripts
* Preprocess the reference and hypothesis transcripts by tokenizing them and converting them to PyTorch tensors.
* Important Distinctions

1. **Hypothesis**
  * The hypothesis is the transcription output generated by the speech recognition model.
2. **Reference**
  * The reference is typically considered to be the ground truth or the "correct" transcription, usually obtained through human annotation or transcription.
  * The reference is used as a benchmark to evaluate the performance of the speech recognition model.

In [None]:
## Function to preprocess transcript
def preprocess_transcript(transcript):
    inputs = tokenizer(transcript, return_tensors='pt')
    return inputs

## text reference vs. hypothesis
reference = "This is the reference transcript."
hypothesis = "This is the hypothesis transcript."


## preprocess both
reference_inputs = preprocess_transcript(reference)
hypothesis_inputs = preprocess_transcript(hypothesis)

### Step 3: Compute sentence embeddings
* Compute sentence embeddings for the reference and hypothesis transcripts using the pre-trained `RoBERTa` model.
* Note: An alternative to this would be to use a different external embedding model from `SentenceTransformers`.

In [None]:
## Compute embeddings using RoBERTa model
def compute_sentence_embedding(inputs):
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state[:, 0, :]  # Use the first token (<s>, RoBERTa's equivalent of [CLS])
    return embeddings


## Get embeddings
reference_embedding = compute_sentence_embedding(reference_inputs)
hypothesis_embedding = compute_sentence_embedding(hypothesis_inputs)

### Step 4: Calculate SemDist metric
* Calculate the SemDist metric by computing the cosine distance or L2 distance between the reference and hypothesis sentence embeddings.

In [None]:
## Function to calculate semdist metric
def calculate_semdist(embedding1, embedding2):
    # Cosine distance
    cosine_similarity = torch.nn.CosineSimilarity(dim=1)
    semdist = 1 - cosine_similarity(embedding1, embedding2)
    return semdist.item()

## use function
semdist = calculate_semdist(reference_embedding, hypothesis_embedding)
print("SemDist:", semdist)

Note: You can also use the L2 (Euclidean) distance instead, as shown below.

In [None]:
## L2 semdistance
def calculate_semdist_l2(embedding1, embedding2):
    # L2 distance
    l2_distance = torch.nn.PairwiseDistance(p=2)
    semdist = l2_distance(embedding1, embedding2)
    return semdist.item()

## If using semdist_l2 instead
semdist_l2 = calculate_semdist_l2(reference_embedding, hypothesis_embedding)
print("SemDist (L2):", semdist_l2)
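
To see how the two distance choices differ, here is a small worked example on toy vectors. It is only a sketch: the vectors are made up, and the helper functions are plain-Python stand-ins for `torch.nn.CosineSimilarity` and `torch.nn.PairwiseDistance` so the arithmetic is easy to follow.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; depends only on the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def l2_distance(u, v):
    """Euclidean distance; depends on both direction and magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

reference_vec = [1.0, 2.0, 3.0]
scaled_vec = [2.0, 4.0, 6.0]  # same direction as reference_vec, twice the length

print("Cosine distance:", cosine_distance(reference_vec, scaled_vec))
print("L2 distance:", l2_distance(reference_vec, scaled_vec))
```

The cosine distance is (numerically) zero because the vectors point the same way, while the L2 distance is not. If embedding magnitude carries no meaning for your model, cosine distance is usually the safer choice.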

### Full Script
* If you want to run this as a `.py` script instead

In [None]:
import os

## make dir (must match the directory used by %%writefile in the next cell)
os.makedirs('SemDist', exist_ok=True)

In [None]:
%%writefile SemDist/name_of_dir.py

import torch
from transformers import RobertaTokenizer, RobertaModel

## Preprocess transcript function
def preprocess_transcript(transcript, tokenizer):
    inputs = tokenizer(transcript, return_tensors='pt')
    return inputs

## Compute embeddings functions
def compute_sentence_embedding(inputs, model):
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state[:, 0, :]  # Use the first token (<s>, RoBERTa's equivalent of [CLS])
    return embeddings

## Calculate semdist function
def calculate_semdist(embedding1, embedding2):
    cosine_similarity = torch.nn.CosineSimilarity(dim=1)
    semdist = 1 - cosine_similarity(embedding1, embedding2)
    return semdist.item()

## Load tokenizer and model
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')


## Init reference and hypothesis transcripts
reference = "This is the reference transcript."
hypothesis = "This is the hypothesis transcript."


## Feed ref and hypoth to preprocess_transcript function
reference_inputs = preprocess_transcript(reference, tokenizer)
hypothesis_inputs = preprocess_transcript(hypothesis, tokenizer)


## Get embeddings for both transcripts
reference_embedding = compute_sentence_embedding(reference_inputs, model)
hypothesis_embedding = compute_sentence_embedding(hypothesis_inputs, model)


## Calculate SemDist
semdist = calculate_semdist(reference_embedding, hypothesis_embedding)
print("SemDist:", semdist)

# No Ground Truth or Reference Transcripts?!

* If you don't have a ground truth or reference transcription, you won't be able to directly compute the SemDist metric, as it requires **both the hypothesis and the reference.**

* However, there are a few potential workarounds or alternatives:

1. **Pseudo-labeling**
  * You can use another speech recognition model or a more advanced language model to generate a pseudo-reference transcription.
  * This can be used as a substitute for the ground truth.


2. **Unsupervised evaluation**
  * You can use other unsupervised evaluation metrics that don't require a ground truth, such as:
    * **Confidence-based evaluation** - Confidence scores from the ASR model you are using
    * **Language model-based evaluation** - see the BERT perplexity example above.
    * **Word or Sentence embedding-based evaluation** -- see example below

3. **Comparing multiple hypotheses**
  * If you have multiple speech recognition models or systems, you can compare their outputs (hypotheses) against each other, even without a ground truth.
  * This can help you evaluate the relative performance of the models.

4. **Self-supervised learning**
  * You can use self-supervised learning techniques, where the model is trained on unlabeled data and learns to predict its own outputs or representations.
  * Speech encoders such as wav2vec 2.0 and HuBERT, which are pre-trained on unlabeled audio, are good examples of this.

5. **Unsupervised metrics**
  * There are other unsupervised metrics that can be used to evaluate the quality of speech recognition outputs, such as:
    * perplexity
    * entropy
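
For the "comparing multiple hypotheses" workaround listed above, one simple signal is how much the systems disagree with each other. The sketch below uses a word-level edit distance for this; that choice, the `disagreement_rate` normalization, and the three example transcripts are assumptions made for illustration, not something the list above prescribes.

```python
from itertools import combinations

def word_edit_distance(words_a, words_b):
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(words_b) + 1))
    for i, a in enumerate(words_a, start=1):
        curr = [i]
        for j, b in enumerate(words_b, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def disagreement_rate(text_a, text_b):
    """Word edits between two transcripts, normalized by the longer one's length."""
    a, b = text_a.lower().split(), text_b.lower().split()
    longest = max(len(a), len(b))
    return word_edit_distance(a, b) / longest if longest else 0.0

# Hypothetical outputs of three ASR systems for the same audio.
hypotheses = {
    "system_a": "i will make the payment now",
    "system_b": "i will make a payment now",
    "system_c": "i will bake the payment",
}

for (name_a, text_a), (name_b, text_b) in combinations(hypotheses.items(), 2):
    print(name_a, "vs", name_b, round(disagreement_rate(text_a, text_b), 3))
```

Normalizing by the longer transcript keeps the score symmetric, which seems appropriate when neither transcript is a trusted reference. Audio on which the systems disagree strongly is a natural candidate for manual review.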


## Example 1 - Semantic Coherence -- using Word Embeddings

### Step 1 -  Load pre-trained word or sentence embeddings
* You can use pre-trained word embeddings like Word2Vec or GloVe, or encoder-based models from Hugging Face (e.g. SentenceTransformers, MixedBread, anything from the MTEB benchmark on Hugging Face).
* For this example, we'll use GloVe embeddings, although encoder-based embeddings such as most SentenceTransformer models are probably a better choice.

In [None]:
import torch
import numpy as np

# Load pre-trained GloVe embeddings
glove_file = 'glove.6B.100d.txt'
glove_dict = {}
with open(glove_file, 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        glove_dict[values[0]] = np.array(values[1:], dtype='float32')

# Create a PyTorch tensor for the word embeddings
embedding_dim = 100
word_embeddings = torch.zeros((len(glove_dict), embedding_dim))
word_to_idx = {}
idx = 0
for word, vector in glove_dict.items():
    word_to_idx[word] = idx
    word_embeddings[idx] = torch.from_numpy(vector)
    idx += 1

### Step 2 - Preprocess the transcript
* Preprocess the transcript by tokenizing it into individual words and converting them to indices in the word embedding dictionary.
* If you use an encoder model instead, use that model's own tokenizer.

In [None]:
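## preprocess transcript if using word embeddings.
import string

def preprocess_transcript(transcript):
    # strip surrounding punctuation so that e.g. "transcript." matches the GloVe entry "transcript"
    words = [word.strip(string.punctuation) for word in transcript.lower().split()]
    indices = []
    for word in words:
        if word in word_to_idx:  # out-of-vocabulary words are skipped
            indices.append(word_to_idx[word])
    return torch.tensor(indices, dtype=torch.long)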

## load transcript and preprocess
transcript = "This is an example transcript."
indices = preprocess_transcript(transcript)

### Step 3 - Compute Semantic coherence
* To evaluate semantic coherence, you can use various metrics such as:

1. **Average cosine similarity**
  * Compute the cosine similarity between consecutive words in the transcript and average them.
  * Note: Another method would be to summarize the transcript using an encoder-decoder model like BART or T5, then compare the cosine similarity.
  * Note: Another method would be to look at rolling window cosine similarities.

2. **Word mover's distance**
  * Compute the word mover's distance between the transcript and a reference text (e.g., a pseudo-ground truth transcript).


Here's an example of computing **average cosine similarity**:

In [None]:
## function to compute avg cosine similarity using word embeddings -- modify if using other embedding model
def compute_average_cosine_similarity(indices, word_embeddings):
    similarities = []
    for i in range(len(indices) - 1):
        word1 = word_embeddings[indices[i]]
        word2 = word_embeddings[indices[i + 1]]
        similarity = torch.nn.CosineSimilarity(dim=0)(word1, word2)
        similarities.append(similarity.item())
    return np.mean(similarities)

## apply function
average_similarity = compute_average_cosine_similarity(indices, word_embeddings)
print("Average cosine similarity:", average_similarity)
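
The rolling-window variant mentioned in the notes above could look like the sketch below. It is one possible reading of "rolling window": it compares the mean vector of each window with the mean vector of the window that follows it. The toy 2-dimensional vectors and the helper names are made up for illustration; in practice you would pass the rows of `word_embeddings[indices]`.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def rolling_window_similarities(vectors, window=2):
    """Similarity between each window's mean vector and the next window's mean vector."""
    sims = []
    for i in range(len(vectors) - 2 * window + 1):
        left = mean_vector(vectors[i:i + window])
        right = mean_vector(vectors[i + window:i + 2 * window])
        sims.append(cosine_similarity(left, right))
    return sims

# Toy embeddings: the first three words point along x, the last three along y.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.2],
               [0.0, 1.0], [0.1, 0.9], [0.2, 1.0]]

sims = rolling_window_similarities(toy_vectors, window=2)
print("Rolling similarities:", [round(s, 3) for s in sims])
```

Averaging over a window smooths out the noise of single-word comparisons, and a dip in the resulting series can hint at a topic shift or at a garbled stretch of the transcript.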

### Step 3b - Using PyTorch to compute semantic coherence
* You can also use PyTorch to compute semantic coherence by defining a custom module.

In [None]:
## Customized Semantic Coherence
class SemanticCoherence(torch.nn.Module):
    def __init__(self, word_embeddings):
        super(SemanticCoherence, self).__init__()
        self.word_embeddings = word_embeddings

    def forward(self, indices):
        embeddings = self.word_embeddings[indices]
        similarities = []
        for i in range(embeddings.shape[0] - 1):
            word1 = embeddings[i]
            word2 = embeddings[i + 1]
            similarity = torch.nn.CosineSimilarity(dim=0)(word1, word2)
            similarities.append(similarity)
        return torch.mean(torch.stack(similarities))


## run metrics and get results
semantic_coherence = SemanticCoherence(word_embeddings)
average_similarity = semantic_coherence(indices)
print("Average cosine similarity:", average_similarity.item())

## Example 2 - Semantic Coherence using SentenceTransformers
* Same idea as above, but using a Sentence-BERT (SBERT) model to embed whole sentences instead of individual words.

In [None]:
from sentence_transformers import SentenceTransformer, util
import torch
import numpy as np

# Load a Sentence Transformer model -- pick model of choice
## Consider max_sequence_length and dimensions of the transcript
model = SentenceTransformer('all-MiniLM-L6-v2')

# Define a function to compute the semantic coherence
def compute_semantic_coherence(sentences):
    # Compute the sentence embeddings
    embeddings = model.encode(sentences)

    # Compute the cosine similarity between consecutive sentences
    similarities = []
    for i in range(len(sentences) - 1):
        embedding1 = embeddings[i]
        embedding2 = embeddings[i + 1]
        similarity = util.cos_sim(embedding1, embedding2)
        similarities.append(similarity.item())

    # Compute the average semantic coherence
    average_similarity = np.mean(similarities)
    return average_similarity

# Define a transcript
transcript = "This is an example transcript. The transcript is used to demonstrate the calculation of semantic coherence. The semantic coherence is a measure of how well the sentences in the transcript are related to each other."

# Split the transcript into sentences
sentences = transcript.split('. ')

# Compute the semantic coherence
semantic_coherence = compute_semantic_coherence(sentences)
print("Semantic Coherence:", semantic_coherence)

# Define a function to compute the average semantic distance
def compute_average_semantic_distance(sentences):
    # Compute the sentence embeddings
    embeddings = model.encode(sentences)

    # Compute the cosine distance between consecutive sentences
    distances = []
    for i in range(len(sentences) - 1):
        embedding1 = embeddings[i]
        embedding2 = embeddings[i + 1]
        distance = 1 - util.cos_sim(embedding1, embedding2).item()
        distances.append(distance)

    # Compute the average semantic distance
    average_distance = np.mean(distances)
    return average_distance

# Compute the average semantic distance
average_semantic_distance = compute_average_semantic_distance(sentences)
print("Average Semantic Distance:", average_semantic_distance)
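
The cell above splits sentences with `transcript.split('. ')`, which only breaks on a period followed by a space. A slightly more forgiving splitter is sketched below. It is a heuristic, not a full sentence segmenter: `split_sentences` is a hypothetical helper, it will still mis-split abbreviations such as "Dr.", and it assumes the ASR output contains punctuation at all.

```python
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?', and drop empty pieces."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

example_transcript = (
    "This is an example transcript. Is it long? Not really! "
    "It only has four sentences."
)
example_sentences = split_sentences(example_transcript)
for sentence in example_sentences:
    print(sentence)
```

The resulting list can be passed to `compute_semantic_coherence` in place of the output of `split('. ')`.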

## Example 3 - Using an LLM for Speech
* This example uses `SpeechLLM`.
* `SpeechLLM` is a multi-modal Language Model (LLM) specifically trained to analyze and predict metadata from a speaker's turn in a conversation.
* This advanced model integrates a speech encoder to transform speech signals into meaningful speech representations. These embeddings, combined with text instructions, are then processed by the LLM to generate predictions.

The model takes a 16 kHz speech audio file as input and predicts the following:
```
SpeechActivity : if the audio signal contains speech (True/False)
Transcript : ASR transcript of the audio
Gender of the speaker (Female/Male)
Age of the speaker (Young/Middle-Age/Senior)
Accent of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
Emotion of the speaker (Happy/Sad/Anger/Neutral/Frustrated)
```
* See the repo here: https://github.com/skit-ai/SpeechLLM

### General Concept of what you could do with this
1. Use `SpeechLLM` to generate a transcript using your audio file(s).
2. Then use metrics/methods discussed in this notebook or in the repo to compare the transcription results.

In [None]:
## sample code of how to run SpeechLLM assuming you have an audio file
# Load model directly from huggingface
import torchaudio
from transformers import AutoModel
model = AutoModel.from_pretrained("skit-ai/speechllm-2B", trust_remote_code=True)

# torchaudio.load returns (waveform, sample_rate); the waveform is the audio tensor
waveform, sample_rate = torchaudio.load("path-to-audio.wav")

model.generate_meta(
    audio_path="path-to-audio.wav", #16k Hz, mono
    audio_tensor=waveform, # [Optional] either audio_path or audio_tensor directly
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Model Generation
'''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''

### Summary
* Once you have the `SpeechLLM` transcript, you can treat it as the reference and compare it against your hypothesis transcript using the metrics and methods above or those in the repo.
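
As a rough sketch of that last step, the snippet below pulls the `Transcript` field out of a generation shaped like the sample shown above and normalizes it for comparison. It assumes the model returns a string in exactly that shape, which is not verified here. A regular expression is used instead of `json.loads` because the sample ends with a trailing comma, which is not valid JSON. `extract_transcript` and `normalize` are hypothetical helpers.

```python
import re
import string

# String shaped like the sample model generation (assumed format).
generated = '''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''

def extract_transcript(generated_text):
    """Return the value of the "Transcript" field, or None if it is missing."""
    match = re.search(r'"Transcript"\s*:\s*"((?:[^"\\]|\\.)*)"', generated_text)
    return match.group(1) if match else None

def normalize(text):
    """Lowercase, remove punctuation except apostrophes, and collapse whitespace."""
    to_remove = string.punctuation.replace("'", "")
    cleaned = text.lower().translate(str.maketrans("", "", to_remove))
    return " ".join(cleaned.split())

pseudo_reference = normalize(extract_transcript(generated))
print(pseudo_reference)
```

Applying the same `normalize` step to your own ASR hypothesis before comparing keeps casing and punctuation differences from inflating the measured distance.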