# Introduction to Language and Speech Technology - ReMA (RU)
*Seminar 11 - Inference and Evaluation of ASR*

Last update: 2024/11/25

Aditya Kamlesh Parikh - @aditya.parikh@ru.nl


In this tutorial we will evaluate some fine-tuned/trained models. The first important step is to choose a relevant evaluation metric. In ASR, Word Error Rate (WER) is one of the most widely used metrics.

Adapted from *Wikipedia*:

Word error rate (WER) is a common metric of the performance of a speech recognition or machine translation system. A WER of 0 indicates that the hypothesis is identical to the reference, and higher values indicate more errors. Because insertions count as errors, WER is not capped at 1: a hypothesis with many extra words can score above 1 (more than 100%). A WER of 0.8 means that the number of word errors equals 80% of the number of words in the reference.


## Formula

The formula for WER is:

<img src="https://latex.codecogs.com/svg.image?\bg{black}&space;WER=\frac{S&plus;D&plus;I}{N}" title=" WER=\frac{S+D+I}{N}" />

Where:
- **S**: Number of substitutions (when a word in the hypothesis replaces a word in the reference).
- **D**: Number of deletions (when a word in the reference is missing in the hypothesis).
- **I**: Number of insertions (when an extra word appears in the hypothesis).
- **N**: Total number of words in the reference.
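
As a minimal sketch, the formula can be written as a small Python function. The function name `wer_from_counts` and the counts in the usage line are hypothetical and only serve as an illustration.

```python
def wer_from_counts(substitutions: int, deletions: int, insertions: int, ref_len: int) -> float:
    """Return WER = (S + D + I) / N for the given error counts."""
    if ref_len <= 0:
        raise ValueError("the reference must contain at least one word")
    return (substitutions + deletions + insertions) / ref_len


# Illustrative counts: 1 substitution and 1 insertion against 8 reference words
print(wer_from_counts(1, 0, 1, 8))  # 0.25
```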

---

## Levenshtein Distance

WER uses the Levenshtein Distance to compute the minimum number of operations (insertions, deletions, substitutions) needed to transform one sequence into another.
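
The sketch below shows one common way to compute the Levenshtein distance at word level with dynamic programming. It is an illustration under the assumption that words are separated by whitespace; the function name `word_edit_distance` is hypothetical and this is not necessarily how any particular library implements it.

```python
def word_edit_distance(reference: str, hypothesis: str) -> int:
    """Minimum number of word insertions, deletions and substitutions."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j]: distance between the reference prefix seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


print(word_edit_distance("a b c d", "a x c"))  # 2 (one substitution, one deletion)
```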

---

## Example of WER Calculation

### Reference Transcript:
`"the cat sat on the mat"`

### Hypothesis (ASR Output):
`"the cat is on mat"`

### Step 1: Align the sequences

| Reference        | the | cat | sat | on  | the | mat |
|------------------|-----|-----|-----|-----|-----|-----|
| Hypothesis       | the | cat | is  | on  |     | mat |

### Step 2: Identify the Errors
- **Substitution**: `sat` → `is` (1 substitution)
- **Deletion**: `the` (before `mat`) is missing (1 deletion)
- **Insertion**: None in this example.

### Step 3: Apply the Formula

<img src="https://latex.codecogs.com/svg.image?\bg{white}\text{WER}=\frac{S&plus;D&plus;I}{N}=\frac{1&plus;1&plus;0}{6}=0.33\,(33\%)" title="\text{WER}=\frac{S+D+I}{N}=\frac{1+1+0}{6}=0.33\,(33\%)" />

The Word Error Rate for this example is **33%**.
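
The three steps above can be sketched in plain Python: fill the edit-distance table, trace back through it to count substitutions, deletions and insertions, then apply the formula. The function name `count_errors` is hypothetical. When several alignments are equally good, this sketch prefers the diagonal move, so the split between S, D and I may differ from other tools even though the total is the same.

```python
def count_errors(reference: str, hypothesis: str) -> dict:
    """Count S, D and I along one minimum-cost word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("the reference must contain at least one word")
    rows, cols = len(ref) + 1, len(hyp) + 1
    # dist[i][j]: edit distance between ref[:i] and hyp[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    # Trace back from the bottom-right corner to the top-left corner
    counts = {"S": 0, "D": 0, "I": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            if not same:
                counts["S"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["D"] += 1
            i -= 1
        else:
            counts["I"] += 1
            j -= 1
    counts["N"] = len(ref)
    counts["WER"] = (counts["S"] + counts["D"] + counts["I"]) / len(ref)
    return counts


result = count_errors("the cat sat on the mat", "the cat is on mat")
print(result)
```

For the example sentences this gives one substitution, one deletion and no insertions over six reference words.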



To calculate WER, we can use Hugging Face's `evaluate` library (its `wer` metric relies on the `jiwer` package, which is why we install both):

In [None]:
%%capture
! pip install evaluate jiwer

In [None]:
import evaluate

# Load the WER metric
wer = evaluate.load("wer")

# Define reference and hypothesis transcripts
reference = ["the cat sat on the mat"]
hypothesis = ["the cat is on mat"]

# Calculate WER
result = wer.compute(references=reference, predictions=hypothesis)
print(f"WER: {result:.2f}")

WER: 0.33


In [None]:
# Multiple references and hypotheses
references = [
    "the cat sat on the mat",
    "hello world this is a test",
    "huggingface is awesome",
    "introduction to language and speech technology"
]
hypotheses = [
    "the cat is on mat",
    "hello this is test",
    "huggingface awesome",
    "introdution language and speech"
]

# Calculate WER for the batch
result = wer.compute(references=references, predictions=hypotheses)
print(f"Batch WER: {result:.2f}")

Batch WER: 0.38
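
Our understanding is that the batch score pools the errors over all sentences (total errors divided by total reference words) rather than averaging the per-sentence WERs. The following self-contained sketch, which uses a hypothetical helper `edit_distance` instead of the library, computes both quantities for the same four sentence pairs so the difference is visible.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Word-level Levenshtein distance between two token lists."""
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


references = [
    "the cat sat on the mat",
    "hello world this is a test",
    "huggingface is awesome",
    "introduction to language and speech technology",
]
hypotheses = [
    "the cat is on mat",
    "hello this is test",
    "huggingface awesome",
    "introdution language and speech",
]

errors = [edit_distance(r.split(), h.split()) for r, h in zip(references, hypotheses)]
lengths = [len(r.split()) for r in references]

# Pooled (corpus-level) WER: all errors divided by all reference words
pooled = sum(errors) / sum(lengths)
# Mean of the per-sentence WERs: every sentence gets the same weight
mean_of_sentences = sum(e / n for e, n in zip(errors, lengths)) / len(errors)

print(f"Pooled WER: {pooled:.3f}")
print(f"Mean sentence WER: {mean_of_sentences:.3f}")
```

Here the pooled value is 8/21 and the sentence mean is 0.375. The two happen to be close for this batch, but they can diverge when sentence lengths vary a lot.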


## Task 1
**Task 1: Evaluate Whisper Large and Whisper Tiny Models on the SpeechOcean762 Test Dataset**

Your task is to use two variants of OpenAI's Whisper ASR model, **Whisper Large** and **Whisper Tiny**, and to compute the Word Error Rate (WER) of each on the SpeechOcean762 test set. This lets you compare the performance of the two models.



You can load the SpeechOcean762 dataset from Hugging Face as we did in the previous tutorial.
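
One practical point when comparing models: Whisper typically returns text with capitalisation and punctuation, and WER counts any mismatch in surface form as an error. The sketch below is a hypothetical normalisation helper (the name `normalize_transcript` and its rules are assumptions, not part of the course material). Whatever rules you choose, apply them to both references and hypotheses before computing WER.

```python
import re


def normalize_transcript(text: str) -> str:
    """Lowercase, drop punctuation (keeping apostrophes) and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # replace punctuation with spaces
    return " ".join(text.split())          # collapse repeated whitespace


print(normalize_transcript("Hello, World! It's a test."))  # hello world it's a test
```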

In [None]:
# Add your code here

# SCLITE


**SCLITE** is a tool from the NIST Speech Recognition Scoring Toolkit (SCTK) used to evaluate the accuracy of speech recognition systems.

It compares ASR-generated transcriptions (**hypotheses**) against **reference** transcriptions to calculate key metrics such as **Word Error Rate (WER), Sentence Error Rate (SER)**, and more.
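
Sentence Error Rate is the fraction of sentences that contain at least one error. A minimal sketch, assuming whitespace tokenisation and a hypothetical function name `sentence_error_rate`:

```python
def sentence_error_rate(references: list, hypotheses: list) -> float:
    """Fraction of sentence pairs whose word sequences are not identical."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        raise ValueError("at least one sentence pair is required")
    wrong = sum(r.split() != h.split() for r, h in zip(references, hypotheses))
    return wrong / len(references)


refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat is on mat", "hello world"]
print(sentence_error_rate(refs, hyps))  # 0.5 (one of the two sentences has errors)
```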

In [None]:
!git clone https://github.com/usnistgov/SCTK.git

Cloning into 'SCTK'...
remote: Enumerating objects: 5487, done.
remote: Counting objects: 100% (344/344), done.
remote: Compressing objects: 100% (223/223), done.
remote: Total 5487 (delta 143), reused 286 (delta 117), pack-reused 5143 (from 1)
Receiving objects: 100% (5487/5487), 7.61 MiB | 13.55 MiB/s, done.
Resolving deltas: 100% (3827/3827), done.


In [None]:
%%capture
!cd SCTK && make config
!cd SCTK && make all
!cd SCTK && make check
!cd SCTK && make install
!cd SCTK && make doc

Please download the files `reference.stm` and `hypothesis.ctm` from Brightspace and upload them to your Google Colab environment.
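
For orientation, here is a hedged sketch of what records in the two formats usually look like: an STM line holds a whole segment (file, channel, speaker, begin time, end time, transcript, with an optional label field before the transcript), while a CTM line holds a single word (file, channel, begin time, duration, word). The helper names and all values below are hypothetical; check them against the actual files from Brightspace.

```python
def stm_line(file_id: str, channel: str, speaker: str,
             start: float, end: float, transcript: str) -> str:
    """One STM record: file, channel, speaker, begin time, end time, transcript."""
    return f"{file_id} {channel} {speaker} {start:.2f} {end:.2f} {transcript}"


def ctm_line(file_id: str, channel: str, start: float,
             duration: float, word: str) -> str:
    """One CTM record: file, channel, begin time, duration, a single word."""
    return f"{file_id} {channel} {start:.2f} {duration:.2f} {word}"


# Illustrative values only
print(stm_line("utt1", "A", "spk1", 0.0, 2.5, "the cat sat"))
print(ctm_line("utt1", "A", 0.0, 0.4, "the"))
```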

In [None]:
# Make sure the file names (paths) match the files you uploaded!

! /content/SCTK/bin/sclite -s -i rm -r /content/reference.stm -h /content/hypothesis.ctm -o all

sclite: 2.10 TK Version 1.3
Begin alignment of Ref File: '/content/reference.stm' and Hyp File: '/content/hypothesis.ctm'

After running SCLITE, check:

`.sys` file: For overall WER and SER.

`.pra` file: For detailed alignment showing substitutions (S), deletions (D), and insertions (I).

# Task 2

**Task 2**: You now know SCLITE and already have the reference and hypothesis transcripts of the SpeechOcean762 dataset. Use the SCLITE tool to calculate the number of insertions, deletions and substitutions.

In [None]:
# Your code here

# Task 3
**Task 3 (optional)**: Other similar Python-based tools are available on GitHub. Find such repositories and use them to analyse the transcripts.

In [None]:
# Your code here