# Robust Speech Recognition via Large-Scale Weak Supervision (Whisper)

The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.

<img src="figures/whisper-arch.png" title="Whisper Framework" style="width: 640px;" />
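
As a rough illustration of the 30-second chunking step, the sketch below works out the sizes involved. It assumes 16 kHz audio (the rate asserted by the dataset wrapper later in this notebook) and a hop of 160 samples between spectrogram frames, which is assumed here to match the `whisper` package's audio constants. `pad_or_trim_1d` is a hypothetical NumPy stand-in for `whisper.pad_or_trim`, not the library function itself.

```python
import numpy as np

SAMPLE_RATE = 16_000   # Hz; assumed input rate
CHUNK_SECONDS = 30     # chunk length described above
HOP_LENGTH = 160       # assumed hop between spectrogram frames (10 ms at 16 kHz)

N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # samples per 30-second chunk
N_FRAMES = N_SAMPLES // HOP_LENGTH        # spectrogram frames per chunk


def pad_or_trim_1d(audio: np.ndarray, length: int = N_SAMPLES) -> np.ndarray:
    """Hypothetical stand-in: zero-pad or truncate a 1-D waveform to `length` samples."""
    if audio.shape[0] >= length:
        return audio[:length]
    return np.pad(audio, (0, length - audio.shape[0]))


short_clip = np.ones(5 * SAMPLE_RATE, dtype=np.float32)   # 5 s of audio, gets zero-padded
long_clip = np.ones(45 * SAMPLE_RATE, dtype=np.float32)   # 45 s of audio, gets truncated

print(N_SAMPLES, N_FRAMES)
print(pad_or_trim_1d(short_clip).shape, pad_or_trim_1d(long_clip).shape)
```

Both clips come out with the same length, which is what lets utterances of different durations be stacked into a single batch.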

In [1]:
# pip install evaluate
# pip install torchaudio
# pip install transformers
# pip install numpy
# pip install tqdm

In [40]:
# !pip install -U openai-whisper
# !pip install torchaudio
# !pip install jiwer

In [2]:
import os
# Set GPU device
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Proxy for outbound downloads (environment-specific; adjust or remove as needed)
os.environ["http_proxy"] = "http://192.41.170.23:3128"
os.environ["https_proxy"] = "http://192.41.170.23:3128"

# OpenAI Library

## Loading the LibriSpeech dataset

The following will load the test-clean split of the LibriSpeech corpus using torchaudio.

In [41]:
import os
import time
import numpy as np
from tqdm.auto import tqdm

import torch
import pandas as pd
import whisper
import torchaudio

device = "cuda" if torch.cuda.is_available() else "cpu"

In [42]:
class LibriSpeech(torch.utils.data.Dataset):
    """
    A simple class to wrap LibriSpeech and trim/pad the audio to 30 seconds.
    It will drop the last few seconds of a very small portion of the utterances.
    """
    def __init__(self, split="test-clean", device=device):
        self.dataset = torchaudio.datasets.LIBRISPEECH(
            root=os.path.expanduser("~/.cache"),
            url=split,
            download=True,
        )
        self.device = device

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, item):
        audio, sample_rate, text, _, _, _ = self.dataset[item]
        assert sample_rate == 16000
        audio = whisper.pad_or_trim(audio.flatten()).to(self.device)
        mel = whisper.log_mel_spectrogram(audio)
        
        return (mel, text)

In [43]:
dataset = LibriSpeech("test-clean")
loader = torch.utils.data.DataLoader(dataset, batch_size=16)

In [44]:
import whisper

model = whisper.load_model("tiny")
result = model.transcribe("audio.mp3")
print(result["text"])


 Help me in here! Help me in here!


## Running inference on the dataset using the tiny Whisper model

Transcribing every utterance in the dataset takes a few minutes. As written, the loop below stops after the first batch, so the results that follow cover only that batch.

In [45]:
model = whisper.load_model("tiny")
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)

Model is multilingual and has 37,184,640 parameters.


In [46]:
# predict without timestamps for short-form transcription
options = whisper.DecodingOptions(language="en", without_timestamps=True)

In [47]:
hypotheses = []
references = []

for mels, texts in tqdm(loader):
    results = model.decode(mels, options)
    hypotheses.extend([result.text for result in results])
    references.extend(texts)
    break  # stop after the first batch; remove this line to evaluate the full split

  0%|          | 0/164 [00:00<?, ?it/s]

In [48]:
data = pd.DataFrame(dict(hypothesis=hypotheses, reference=references))
data

| | hypothesis | reference |
|---|---|---|
| 0 | He hoped there would be stew for dinner, turni... | HE HOPED THERE WOULD BE STEW FOR DINNER TURNIP... |
| 1 | Stuffed into you, his belly, couchled him. | STUFF IT INTO YOU HIS BELLY COUNSELLED HIM |
| 2 | After early nightfall, the yellow lamps would ... | AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD L... |
| 3 | Hey Bertie, any good in your mind? | HELLO BERTIE ANY GOOD IN YOUR MIND |
| 4 | Number 10 Fresh Nelly is waiting on you. Good ... | NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD ... |
| 5 | The music came nearer and he recalled the word... | THE MUSIC CAME NEARER AND HE RECALLED THE WORD... |
| 6 | The dull light fell more faintly upon the page... | THE DULL LIGHT FELL MORE FAINTLY UPON THE PAGE... |
| 7 | A cold lucid indifference rained in his soul. | A COLD LUCID INDIFFERENCE REIGNED IN HIS SOUL |
| 8 | The chaos in which his order extinguished itse... | THE CHAOS IN WHICH HIS ARDOUR EXTINGUISHED ITS... |
| 9 | At most, by an arms given to a beggar whose bl... | AT MOST BY AN ALMS GIVEN TO A BEGGAR WHOSE BLE... |


## Calculating the word error rate

Next, we use Whisper's English text normalizer to standardize both the hypotheses and the references, and then calculate the word error rate (WER) with `jiwer`.

In [49]:
import jiwer
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

In [50]:
data["hypothesis_clean"] = [normalizer(text) for text in data["hypothesis"]]
data["reference_clean"] = [normalizer(text) for text in data["reference"]]
data

| | hypothesis | reference | hypothesis_clean | reference_clean |
|---|---|---|---|---|
| 0 | He hoped there would be stew for dinner, turni... | HE HOPED THERE WOULD BE STEW FOR DINNER TURNIP... | he hoped there would be stew for dinner turnip... | he hoped there would be stew for dinner turnip... |
| 1 | Stuffed into you, his belly, couchled him. | STUFF IT INTO YOU HIS BELLY COUNSELLED HIM | stuffed into you his belly couchled him | stuff it into you his belly counseled him |
| 2 | After early nightfall, the yellow lamps would ... | AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD L... | after early nightfall the yellow lamps would l... | after early nightfall the yellow lamps would l... |
| 3 | Hey Bertie, any good in your mind? | HELLO BERTIE ANY GOOD IN YOUR MIND | hey bertie any good in your mind | hello bertie any good in your mind |
| 4 | Number 10 Fresh Nelly is waiting on you. Good ... | NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD ... | number 10 fresh nelly is waiting on you good n... | number 10 fresh nelly is waiting on you good n... |
| 5 | The music came nearer and he recalled the word... | THE MUSIC CAME NEARER AND HE RECALLED THE WORD... | the music came nearer and he recalled the word... | the music came nearer and he recalled the word... |
| 6 | The dull light fell more faintly upon the page... | THE DULL LIGHT FELL MORE FAINTLY UPON THE PAGE... | the dull light fell more faintly upon the page... | the dull light fell more faintly upon the page... |
| 7 | A cold lucid indifference rained in his soul. | A COLD LUCID INDIFFERENCE REIGNED IN HIS SOUL | a cold lucid indifference rained in his soul | a cold lucid indifference reigned in his soul |
| 8 | The chaos in which his order extinguished itse... | THE CHAOS IN WHICH HIS ARDOUR EXTINGUISHED ITS... | the chaos in which his order extinguished itse... | the chaos in which his ardor extinguished itse... |
| 9 | At most, by an arms given to a beggar whose bl... | AT MOST BY AN ALMS GIVEN TO A BEGGAR WHOSE BLE... | at most by an arms given to a beggar whose ble... | at most by an alms given to a beggar whose ble... |
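
To make the effect of normalization concrete, here is a deliberately simplified stand-in. `simple_normalize` is a hypothetical helper, not part of `whisper`: it only lowercases, strips punctuation, and collapses whitespace. The real `EnglishTextNormalizer` does more, as the table above shows (for example, `TEN` becomes `10` and `ARDOUR` becomes `ardor`).

```python
import re


def simple_normalize(text: str) -> str:
    """Hypothetical, simplified normalizer: lowercase, drop punctuation, squeeze spaces."""
    text = text.lower()
    # Replace anything that is not a word character, whitespace, or apostrophe with a space.
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


hypothesis = "A cold lucid indifference rained in his soul."
reference = "A COLD LUCID INDIFFERENCE REIGNED IN HIS SOUL"

print(simple_normalize(hypothesis))
print(simple_normalize(reference))
```

After this step the two strings differ only in the misrecognized word, so casing and punctuation no longer count as errors.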


In [51]:
wer = jiwer.wer(list(data["reference_clean"]), list(data["hypothesis_clean"]))

print(f"WER: {wer * 100:.2f} %")

WER: 7.41 %
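
For intuition about the number above, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below is a from-scratch illustration, not `jiwer`'s implementation. `word_edit_distance` and `corpus_wer` are hypothetical names, pooling errors over all sentences is assumed to be how `jiwer.wer` aggregates a list, and the sentences are three normalized rows from the table earlier in this section.

```python
def word_edit_distance(reference: list, hypothesis: list) -> int:
    """Levenshtein distance between two word sequences, computed row by row."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references: list, hypotheses: list) -> float:
    """Total word errors divided by total reference words, pooled over all sentences."""
    errors = 0
    ref_words = 0
    for ref, hyp in zip(references, hypotheses):
        errors += word_edit_distance(ref.split(), hyp.split())
        ref_words += len(ref.split())
    return errors / ref_words


refs = [
    "a cold lucid indifference reigned in his soul",
    "hello bertie any good in your mind",
    "stuff it into you his belly counseled him",
]
hyps = [
    "a cold lucid indifference rained in his soul",
    "hey bertie any good in your mind",
    "stuffed into you his belly couchled him",
]

print(f"WER: {corpus_wer(refs, hyps) * 100:.2f} %")
```

These three rows were picked because they contain errors, so their WER is much higher than the figure reported for the whole batch, where many utterances are transcribed perfectly.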
