# Installing Whisper

The commands below will install the Python packages needed to use Whisper models and evaluate the transcription results.

In [1]:
# ! pip install pexpect
! pip install git+https://github.com/openai/whisper.git # Download and install Whisper
# ! pip install openai-whisper
! pip install jiwer # library for calculating the word error rate (WER), character error rate (CER), and other metrics commonly used to evaluate automatic speech recognition (ASR) and speech-to-text systems

Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-98k5qkn1
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-98k5qkn1
  Resolved https://github.com/openai/whisper.git to commit ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->openai-whisper==20231117)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->openai-whisper==20231117)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.met
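
After installation, it can be useful to confirm that the packages are importable before running the rest of the notebook. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the sketch assumes the import names are `whisper` and `jiwer`, as used in the import cells of this notebook.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# `whisper` and `jiwer` are the import names used later in this notebook.
print("Missing packages:", missing_packages(["whisper", "jiwer"]))
```

An empty list means both packages were found; any name that is printed still needs to be installed.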

# Loading the LibriSpeech dataset

The following will load the test-clean split of the LibriSpeech corpus using torchaudio.

In [2]:
import os
import numpy as np
import matplotlib.pyplot as plt # added to plot the log-mel spectrum

try:
    import tensorflow  # required in Colab to avoid protobuf compatibility issues
except ImportError:
    pass

import torch
import pandas as pd
import whisper
import torchaudio

from tqdm.notebook import tqdm


DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

2024-05-08 20:53:09.015017: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
class LibriSpeech(torch.utils.data.Dataset):
    """
    A simple class to wrap LibriSpeech and trim/pad the audio to 30 seconds.
    It will drop the last few seconds of a very small portion of the utterances.
    """
    def __init__(self, split="test-clean", device=DEVICE):
        # Download the audio files if they are not already in the cache
        self.dataset = torchaudio.datasets.LIBRISPEECH(
            root=os.path.expanduser("~/.cache"),
            url=split,
            download=True,
        )
        self.device = device

    def __len__(self):  # number of utterances in the split
        return len(self.dataset)

    def __getitem__(self, item):
        # Fetch an individual sample from the dataset
        audio, sample_rate, text, _, _, _ = self.dataset[item]
        assert sample_rate == 16000
        audio = whisper.pad_or_trim(audio.flatten()).to(self.device)
        # First, flatten the audio: convert the 2D tensor ([channels, samples])
        # into a 1D tensor ([samples]) so it is compatible with the downstream functions.
        # Second, pad or trim: give every sample the same length, either by adding zeros
        # at the end (padding) or by cutting it to the right length (trimming).
        # Finally, .to(self.device) moves the tensor to the device (CPU or GPU) used for computation.
        mel = whisper.log_mel_spectrogram(audio)  # compute the log-mel spectrogram of the sample

        return (mel, text)
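
To make the padding and trimming step concrete, here is a minimal pure-Python sketch of the idea. It is not Whisper's implementation of `whisper.pad_or_trim`. The helper name `pad_or_trim_sketch` is hypothetical, and the target of 480,000 samples assumes 30 seconds of audio at the 16 kHz sample rate asserted in `__getitem__`.

```python
SAMPLE_RATE = 16000                      # Hz, matches the assert in __getitem__
CHUNK_SECONDS = 30                       # target clip length from the class docstring
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480000 samples


def pad_or_trim_sketch(samples, length=N_SAMPLES):
    """Return a copy of `samples` that is exactly `length` items long."""
    if len(samples) >= length:
        return list(samples[:length])                        # trim: drop the tail
    return list(samples) + [0.0] * (length - len(samples))   # pad: append zeros


short_clip = [0.1] * (5 * SAMPLE_RATE)    # a 5-second clip
long_clip = [0.1] * (35 * SAMPLE_RATE)    # a 35-second clip

print(len(pad_or_trim_sketch(short_clip)))  # 480000
print(len(pad_or_trim_sketch(long_clip)))   # 480000
```

The second case is the one the class docstring warns about: anything beyond 30 seconds is dropped.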

In [4]:
dataset = LibriSpeech("test-clean") # yields the log-mel spectrogram and the text
loader = torch.utils.data.DataLoader(dataset, batch_size=16)

In [5]:
print('data set type: ', type(dataset))
print('loader type:', type(loader))

data set type:  <class '__main__.LibriSpeech'>
loader type: <class 'torch.utils.data.dataloader.DataLoader'>


In [6]:
# Inspect dataset object
print("Attributes and methods of dataset object:")
print(dir(dataset))

# Inspect loader object
print("\nAttributes and methods of loader object:")
print(dir(loader))


Attributes and methods of dataset object:
['__add__', '__annotations__', '__class__', '__class_getitem__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__orig_bases__', '__parameters__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_is_protocol', 'dataset', 'device']

Attributes and methods of loader object:
['_DataLoader__initialized', '_DataLoader__multiprocessing_context', '_IterableDataset_len_called', '__annotations__', '__class__', '__class_getitem__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__orig_bases__', '

# Running inference on the dataset using a base Whisper model

The following will take a few minutes to transcribe all utterances in the dataset.

In [7]:
model = whisper.load_model("base.en")
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)

Model is English-only and has 71,825,408 parameters.


In [8]:
# predict without timestamps for short-form transcription
options = whisper.DecodingOptions(language="en", without_timestamps=True)

In [9]:
hypotheses = [] # Predictions
references = [] # Ground truth
test_out_mels = [] # TESTS: log-mel spectrograms
corr_tex = [] # TESTS: corresponding text

i = 0 # TESTS

for mels, texts in tqdm(loader): # tqdm adds a progress bar to loops and other iterables
    results = model.decode(mels, options) # decode the log-mel spectrograms using the selected language, here "en" (English)
    # The model tries to decode what was said in the audio file.
    # The tokenizer converts text into a vector of token IDs that the model can then work with.

    hypotheses.extend([result.text for result in results]) # extract the text attribute of every result and add it to hypotheses
    references.extend(texts) # the actual text for the audio file

    # TESTS
    # if i == 0:
    #   print(mels.shape)
    #   print(mels.dtype)
    #   print(mels)

    #   # print('mels', '\n', mels, '\n', len(mels), '\n') # a tensor filled with numbers
    #   print('texts', '\n', texts, '\n', len(texts), '\n') # long text snippets
    #   # len mels and len texts is the same len = 16

    if i % 12 == 0:
        test_out_mels.append(mels)
        corr_tex.append(texts)

    i += 1

  0%|          | 0/164 [00:00<?, ?it/s]
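
The total of `164` in the progress bar is the number of batches, not the number of utterances. As a quick check, assuming the `test-clean` split contains 2,620 utterances (a figure that is not printed in this notebook), a batch size of 16 gives:

```python
import math

n_utterances = 2620   # assumed size of LibriSpeech test-clean
batch_size = 16       # from the DataLoader above

n_batches = math.ceil(n_utterances / batch_size)
last_batch = n_utterances - (n_batches - 1) * batch_size

print(n_batches)    # 164
print(last_batch)   # 12
```

Because the `DataLoader` is created without `drop_last=True`, the final, smaller batch is still processed.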

In [10]:
# print(test_out_mels.shape())
# corr_tex


print(test_out_mels[0][0].shape)

# for i in range(len(test_out_mels)):
#   mel = test_out_mels[i]

#   plt.figure(figsize=(10, 4))
#   plt.imshow(mel[0].detach().cpu().numpy(), cmap='viridis', origin='lower', aspect='auto')#, extent=[0, 500, 0, mels.shape[1]] )
#   plt.title(f"Log Mel Spectrogram for Iteration {i}")
#   plt.xlabel("Time")
#   plt.ylabel("Mel Filter")
#   plt.colorbar(label="Amplitude (dB)")
#   # Set x-axis limit
#   plt.xlim(0, 500)

#   plt.show()


torch.Size([80, 3000])


`extend()`: This method adds every element of an iterable (such as a list or tuple) to the end of the list, one item at a time. In contrast, `append()` adds only a single element per call.


```
my_list = [1, 2, 3]
my_list.extend([4, 5, 6])
print(my_list)  # Output: [1, 2, 3, 4, 5, 6]
```
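
For contrast, here is a short sketch of what happens when `append()` is given a list: the whole list is added as a single nested element. This is why the inference loop uses `extend()` to collect the per-batch results into one flat list.

```python
flat = [1, 2, 3]
flat.extend([4, 5, 6])     # adds three separate items

nested = [1, 2, 3]
nested.append([4, 5, 6])   # adds one item, which is itself a list

print(flat)    # [1, 2, 3, 4, 5, 6]
print(nested)  # [1, 2, 3, [4, 5, 6]]
```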

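`mels` is a batch of log-mel spectrograms: with a batch size of 16, a full batch has shape `[16, 80, 3000]`, and each item has the shape `[80, 3000]` printed above.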
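
The shape `torch.Size([80, 3000])` printed above can be reproduced with simple arithmetic. The sketch below assumes Whisper's default front-end settings of 80 mel filters and a hop of 160 samples between frames. These values are not shown in this notebook, so treat them as assumptions.

```python
SAMPLE_RATE = 16000   # Hz, asserted in the dataset class
CHUNK_SECONDS = 30    # every clip is padded or trimmed to this length
HOP_LENGTH = 160      # assumed: samples between consecutive frames
N_MELS = 80           # assumed: number of mel filters
BATCH_SIZE = 16       # from the DataLoader

n_samples = SAMPLE_RATE * CHUNK_SECONDS   # 480000
n_frames = n_samples // HOP_LENGTH        # 3000

print((N_MELS, n_frames))               # (80, 3000), one spectrogram
print((BATCH_SIZE, N_MELS, n_frames))   # (16, 80, 3000), a full batch
```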



In [11]:
data = pd.DataFrame(dict(hypothesis=hypotheses, reference=references)) # build a DataFrame with the predicted text and the actual text for each audio file
data

| | hypothesis | reference |
|---|---|---|
| 0 | He hoped there would be stew for dinner, turni... | HE HOPED THERE WOULD BE STEW FOR DINNER TURNIP... |
| 1 | Stuffered into you, his belly counseled him. | STUFF IT INTO YOU HIS BELLY COUNSELLED HIM |
| 2 | After early nightfall the yellow lamps would l... | AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD L... |
| 3 | Hello Bertie, any good in your mind? | HELLO BERTIE ANY GOOD IN YOUR MIND |
| 4 | Number 10. Fresh Nelly is waiting on you. Good... | NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD ... |
| ... | ... | ... |
| 2615 | Oh, to shoot my soul's full meaning into futur... | OH TO SHOOT MY SOUL'S FULL MEANING INTO FUTURE... |
| 2616 | Then I, long tried by natural ills, received t... | THEN I LONG TRIED BY NATURAL ILLS RECEIVED THE... |
| 2617 | I love thee freely as men strive for right. I ... | I LOVE THEE FREELY AS MEN STRIVE FOR RIGHT I L... |
| 2618 | I love thee with the passion put to use, in my... | I LOVE THEE WITH THE PASSION PUT TO USE IN MY ... |


# Calculating the word error rate

Now, we use Whisper's English text normalizer to standardize the transcriptions and calculate the WER.

`jiwer` is a Python library for computing the word error rate (WER), character error rate (CER), and other metrics commonly used to evaluate automatic speech recognition (ASR) or optical character recognition (OCR) systems.


Normalizing here means text normalization: the normalizer rewrites each transcript into a standard form so that differences in casing, punctuation, or spelling conventions are not counted as recognition errors. In the table below, for example, the text is lowercased, punctuation is removed, "TEN" becomes "10", and "COUNSELLED" becomes "counseled". Both the reference and the hypothesis must be normalized in the same way for them to be comparable.

In the normalization step here, the code iterates over the `hypothesis` and `reference` columns and normalizes the text in every row.
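
To illustrate the idea of text normalization, here is a deliberately simplified normalizer. It is a sketch only and is not equivalent to Whisper's `EnglishTextNormalizer`: it just lowercases, replaces punctuation with spaces, and collapses whitespace. The function name `simple_normalize` is hypothetical.

```python
import re


def simple_normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    # Keep word characters (letters, digits, underscore), whitespace, and apostrophes.
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()


hypothesis = "Hello Bertie, any good in your mind?"
reference = "HELLO BERTIE ANY GOOD IN YOUR MIND"

print(simple_normalize(hypothesis))   # hello bertie any good in your mind
print(simple_normalize(hypothesis) == simple_normalize(reference))  # True
```

Without this step, the raw strings would differ in casing and punctuation even though the recognized words are identical.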

In [12]:
import jiwer
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

In [13]:
data["hypothesis_clean"] = [normalizer(text) for text in data["hypothesis"]] # normalize the hypotheses and the references
data["reference_clean"] = [normalizer(text) for text in data["reference"]]
data

| | hypothesis | reference | hypothesis_clean | reference_clean |
|---|---|---|---|---|
| 0 | He hoped there would be stew for dinner, turni... | HE HOPED THERE WOULD BE STEW FOR DINNER TURNIP... | he hoped there would be stew for dinner turnip... | he hoped there would be stew for dinner turnip... |
| 1 | Stuffered into you, his belly counseled him. | STUFF IT INTO YOU HIS BELLY COUNSELLED HIM | stuffered into you his belly counseled him | stuff it into you his belly counseled him |
| 2 | After early nightfall the yellow lamps would l... | AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD L... | after early nightfall the yellow lamps would l... | after early nightfall the yellow lamps would l... |
| 3 | Hello Bertie, any good in your mind? | HELLO BERTIE ANY GOOD IN YOUR MIND | hello bertie any good in your mind | hello bertie any good in your mind |
| 4 | Number 10. Fresh Nelly is waiting on you. Good... | NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD ... | number 10 fresh nelly is waiting on you good n... | number 10 fresh nelly is waiting on you good n... |
| ... | ... | ... | ... | ... |
| 2615 | Oh, to shoot my soul's full meaning into futur... | OH TO SHOOT MY SOUL'S FULL MEANING INTO FUTURE... | 0 to shoot my soul is full meaning into future... | 0 to shoot my soul is full meaning into future... |
| 2616 | Then I, long tried by natural ills, received t... | THEN I LONG TRIED BY NATURAL ILLS RECEIVED THE... | then i long tried by natural ills received the... | then i long tried by natural ills received the... |
| 2617 | I love thee freely as men strive for right. I ... | I LOVE THEE FREELY AS MEN STRIVE FOR RIGHT I L... | i love thee freely as men strive for right i l... | i love thee freely as men strive for right i l... |
| 2618 | I love thee with the passion put to use, in my... | I LOVE THEE WITH THE PASSION PUT TO USE IN MY ... | i love thee with the passion put to use in my ... | i love thee with the passion put to use in my ... |


In [14]:
wer = jiwer.wer(list(data["reference_clean"]), list(data["hypothesis_clean"])) # convert both columns to lists and compute the word error rate

print(f"WER: {wer * 100:.2f} %")

WER: 4.27 %
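
WER is the number of word-level edits (substitutions S, deletions D, and insertions I) needed to turn the reference into the hypothesis, divided by the number of reference words N, that is, WER = (S + D + I) / N. The following is a minimal sketch of that definition in plain Python. It is not jiwer's implementation, and the function names are hypothetical. It uses sentence 2 from the table above.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions
    needed to turn the sequence `ref` into the sequence `hyp`."""
    prev = list(range(len(hyp) + 1))          # distances for an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]                            # delete the first i reference items
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                  # deletion
                curr[j - 1] + 1,              # insertion
                prev[j - 1] + (r != h),       # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


ref = "stuff it into you his belly counseled him"
hyp = "stuffered into you his belly counseled him"
print(f"{word_error_rate(ref, hyp):.2%}")  # 25.00%
```

Here two edits (one substitution and one deletion) over eight reference words give 25 %. The corpus-level figure above is much lower because most sentences contain few or no errors.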


In the next step, the alignment visualization marks each error with one of the following letters:

Example
```
REF: **** short one here
HYP: shoe order one ****
        I     S        D

```

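I indicates an insertion (the word "shoe" appears in the hypothesis but has no counterpart in the reference)

S indicates a substitution (the reference word "short" is replaced by "order" in the hypothesis)

D indicates a deletion (the reference word "here" is missing from the hypothesis)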
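
As a rough illustration of how such labels can be derived, the sketch below computes a minimal word alignment with dynamic programming and labels each position as a hit (H), substitution (S), deletion (D), or insertion (I). It is not jiwer's implementation, and `align_ops` is a hypothetical name. When several minimal alignments exist, this sketch may pick different labels than jiwer does, although the total number of errors is the same.

```python
from collections import Counter


def align_ops(ref, hyp):
    """Label one minimal alignment of two word lists with 'H', 'S', 'D', 'I'."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                          # delete all i reference words
    for j in range(m + 1):
        dist[0][j] = j                          # insert all j hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = ref[i - 1] != hyp[j - 1]
            dist[i][j] = min(
                dist[i - 1][j - 1] + mismatch,  # hit or substitution
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
            )
    # Walk back from the bottom-right corner to recover one minimal path.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + mismatch:
            ops.append("S" if mismatch else "H")
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return ops[::-1]


ref = "stuff it into you his belly counseled him".split()
hyp = "stuffered into you his belly counseled him".split()
counts = Counter(align_ops(ref, hyp))
print(counts["S"], counts["D"], counts["I"])   # 1 1 0
```

For the "short one here" example above, this sketch finds three substitutions rather than one insertion, one substitution, and one deletion. Both alignments cost three edits, so the resulting WER is identical.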


In [15]:
# Get the alignment between words, and then visualize it
wer_all = jiwer.process_words(list(data["reference_clean"]), list(data["hypothesis_clean"]))
wer_all_fin = wer_all.wer

print(f"Alignment WER: {wer_all_fin * 100:.2f} %")

Alignment WER: 4.27 %


In [16]:
print(jiwer.visualize_alignment(wer_all))

sentence 1
REF: he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered  flour fattened sauce
HYP: he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flower    faten sauce
                                                                                                                                                    S        S      

sentence 2
REF:     stuff it into you his belly counseled him
HYP: stuffered ** into you his belly counseled him
             S  D                                 

sentence 7
REF: the dull light fell more faintly upon the page ***** whereon another equation began to unfold itself slowly and to spread abroad its widening tail
HYP: the dull light fell more faintly upon the page where      on another equation began to unfold itself slowly and to spread abroad its widening tail
             

In [17]:
# Character error rate
# See https://jitsi.github.io/jiwer/usage/
# With jiwer.process_words and jiwer.process_characters, you get the alignment between the reference and hypothesis.

error = jiwer.cer(list(data["reference_clean"]), list(data["hypothesis_clean"]))

# if you also want the alignment
output = jiwer.process_characters(list(data["reference_clean"]), list(data["hypothesis_clean"]))
error = output.cer
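
The character error rate follows the same idea as the WER, but the edit distance is computed over characters (including spaces) instead of words, and the result is divided by the number of reference characters. The following is a minimal sketch in plain Python, not jiwer's implementation; the function names are hypothetical. It uses the normalized sentence 2 as an example.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions
    needed to turn the sequence `ref` into the sequence `hyp`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def char_error_rate(reference, hypothesis):
    # Strings are sequences of characters, so they can be compared directly.
    return edit_distance(reference, hypothesis) / len(reference)


ref = "stuff it into you his belly counseled him"
hyp = "stuffered into you his belly counseled him"

print(edit_distance(ref, hyp))               # 4
print(len(ref))                              # 41
print(f"{char_error_rate(ref, hyp):.2%}")    # 9.76%
```

The four character edits correspond to the four positions that jiwer marks in its character-level alignment of this sentence.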

In [18]:
print(jiwer.visualize_alignment(output))

sentence 1
REF: he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flo*ur fattened sauce
HYP: he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flower fat*en** sauce
                                                                                                                                                  IS     D  DD      

sentence 2
REF: stuff* it into you his belly counseled him
HYP: stuffered into you his belly counseled him
          ISSS                                 

sentence 7
REF: the dull light fell more faintly upon the page where*on another equation began to unfold itself slowly and to spread abroad its widening tail
HYP: the dull light fell more faintly upon the page where on another equation began to unfold itself slowly and to spread abroad its widening tail
                                