# Benchmarking of SpeechBrain on Common Voice and LibriSpeech

This notebook was inspired by the model repository available at https://huggingface.co/speechbrain/asr-wav2vec2-commonvoice-en and by the SpeechBrain *Tutorial on Data Loading*, available at https://colab.research.google.com/drive/1AiVJZhZKwEI4nFGANKXEe-ffZFfvXKwH?usp=sharing#scrollTo=EnirXev8XUyt.


## Setting up the environment

In [None]:
! pip install speechbrain
! pip install datasets
! pip install 'datasets[audio]'
! pip install 'torchaudio<0.12.0'
! pip install evaluate
! pip install rouge_score
! pip install transformers
! pip install jiwer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
! pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/cu113/torch_stable.html


## Testing SpeechBrain

In [None]:
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech")

Transcribe example audio

In [None]:
asr_model.transcribe_file('speechbrain/asr-crdnn-rnnlm-librispeech/example.wav')

'THE BIRCH CANOE SLID ON THE SMOOTH PLANKS'

Transcribe audio from dataset

In [None]:
import torch
import numpy as np
from datasets import load_dataset
from datasets import Audio

#ds = load_dataset("common_voice", "en", split="train", streaming=True)
ds = load_dataset("bgstud/libri-whisper-raw", split="test", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16000)) # resample audio to 16 kHz



In [None]:
ds

<datasets.iterable_dataset.IterableDataset at 0x7f13a391fe50>

In [None]:
sample = next(iter(ds))
wavs = torch.Tensor(np.array([sample['audio']['array']]))
wavs_len = torch.Tensor([1.])

print(f"reference: {sample['sentence']}")
predictions, _ = asr_model.transcribe_batch(wavs, wavs_len)
print(f"prediction: {predictions[0]}")

reference: CONCORD RETURNED TO ITS PLACE AMIDST THE TENTS
prediction: CONCORD RETURNED TO ITS PLACE AMIDST THE TENTS
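
For a single clip the length tensor is just `[1.]`, because `transcribe_batch` takes lengths relative to the longest waveform in the batch rather than sample counts. A minimal pure-Python sketch of that convention follows; the helper name `relative_lengths` and the sample counts are made up for illustration and are not part of SpeechBrain.

```python
def relative_lengths(num_samples):
    """Scale absolute lengths so that the longest clip maps to 1.0."""
    longest = max(num_samples)
    return [n / longest for n in num_samples]

# one clip: always [1.0], whatever its duration
single = relative_lengths([52400])
# three clips of 1 s, 2 s and 4 s at 16 kHz
mixed = relative_lengths([16000, 32000, 64000])
print(single, mixed)
```

The same ratios are what `prepare_inputs` computes later for a whole batch.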


In [None]:
#import matplotlib.pyplot as plt 
#plt.title('audio signal')
#plt.plot(wavs[0])  # wavs has shape (1, num_samples)
#plt.show()

## Benchmark

Set up the evaluation metrics

In [None]:
import evaluate

bleu = evaluate.load('bleu')
rouge = evaluate.load('rouge')
wer = evaluate.load('wer')
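
Of the three metrics, WER is the usual one for speech recognition: the number of word substitutions, deletions and insertions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the code that `evaluate` runs; `word_errors` and `corpus_wer` are names invented here.

```python
def word_errors(reference, prediction):
    """Word-level Levenshtein distance between two sentences."""
    ref, hyp = reference.split(), prediction.split()
    # prev[j] = distance between the reference words seen so far
    # and the first j words of the prediction
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def corpus_wer(references, predictions):
    """Total word errors divided by total reference words."""
    errors = sum(word_errors(r, p) for r, p in zip(references, predictions))
    words = sum(len(r.split()) for r in references)
    return errors / words

refs = ["the birch canoe slid on the smooth planks", "concord returned to its place"]
hyps = ["the birch canoe slid on smooth planks", "concord returned to its place"]
print(corpus_wer(refs, hyps))
```

Because insertions count as errors, WER can exceed 1 when a prediction is much longer than its reference.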

Utility functions

In [None]:
import re
import pandas as pd

def prepare_inputs(batch):
  # make a list of signals
  wav_list = [torch.Tensor(np.array(s['audio']['array'])) for s in batch]
  # calculate absolute lengths, then lengths relative to the longest signal
  abs_wav_lens = torch.Tensor([w.shape[0] for w in wav_list])
  max_len = torch.max(abs_wav_lens)
  wav_lens = abs_wav_lens / max_len
  # zero-pad every signal on the right to the common length
  wavs = torch.vstack([torch.Tensor(np.pad(w, (0, int(max_len - w.shape[0])), 'constant')) for w in wav_list])
  return wavs, wav_lens

def normalize_text(strings):
  return [re.sub(r'[^\w\s]','', s.lower()) for s in strings]
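
`normalize_text` lowercases each string and strips every character that is neither a word character nor whitespace, so differences in case and punctuation do not count as recognition errors. A small self-contained check, repeating the function so the snippet runs on its own (the example sentences are made up):

```python
import re

def normalize_text(strings):
    # lowercase, then drop every character that is neither a word character nor whitespace
    return [re.sub(r'[^\w\s]', '', s.lower()) for s in strings]

reference = ["Concord returned to its place, amidst the tents."]
prediction = ["CONCORD RETURNED TO ITS PLACE AMIDST THE TENTS"]

raw_match = reference == prediction
normalized_match = normalize_text(reference) == normalize_text(prediction)
print(raw_match, normalized_match)
```

Apostrophes are removed as well, so `don't` becomes `dont`; this is harmless as long as predictions and references are normalized the same way.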

Benchmarking

In [None]:
num_samples = 10

# make predictions
batch = list(ds.take(num_samples))
wavs, wav_lens = prepare_inputs(batch)

predictions, _ = asr_model.transcribe_batch(wavs, wav_lens)
#predictions = normalize_text(predictions)

# evaluate
references = [s['sentence'] for s in batch]
#references = normalize_text(references)
bleu_res = bleu.compute(predictions=predictions, references=references)
rouge_res = rouge.compute(predictions=predictions, references=references)
wer_res = wer.compute(predictions=predictions, references=references)

print(f"BLEU: {bleu_res}\nROUGE: {rouge_res}\nWER: {wer_res}")

BLEU: {'bleu': 0.5065967725628453, 'precisions': [0.5242105263157895, 0.5118279569892473, 0.5010989010989011, 0.4898876404494382], 'brevity_penalty': 1.0, 'length_ratio': 1.8924302788844622, 'translation_length': 475, 'reference_length': 251}
ROUGE: {'rouge1': 0.9057172829792816, 'rouge2': 0.9015300546448088, 'rougeL': 0.9057172829792813, 'rougeLsum': 0.9057172829792814}
WER: 0.900398406374502


ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-13-09699bae0071>", line 7, in <module>
    predictions, _ = asr_model.transcribe_batch(wavs, wav_lens)
  File "/usr/local/lib/python3.7/dist-packages/speechbrain/pretrained/interfaces.py", line 598, in transcribe_batch
    predicted_tokens, scores = self.mods.decoder(encoder_out, wav_lens)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/speechbrain/decoders/seq2seq.py", line 641, in forward
    inp_tokens, memory, enc_states, enc_lens
  File "/usr/local/lib/python3.7/dist-packages/speechbrain/decoders/seq2seq.py", line 959, in forward_step
    e, hs, c, enc_states, enc_lens
  File "/usr/local/lib/python3.7/dist-packages/speechbrain/n

KeyboardInterrupt: ignored
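
The traceback ends with a `KeyboardInterrupt` raised while `transcribe_batch` was decoding. One possible way to keep each call short is to walk the streamed dataset in small fixed-size chunks and accumulate predictions and references chunk by chunk. The sketch below shows only the chunking; `chunked` is a hypothetical helper, and the generator of dicts stands in for the streamed samples.

```python
from itertools import islice

def chunked(iterable, size):
    """Yield consecutive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# stand-in for the streamed dataset: ten fake samples
fake_stream = ({"sentence": f"utterance {i}"} for i in range(10))

chunk_sizes = []
for mini_batch in chunked(fake_stream, 4):
    # in the notebook, prepare_inputs(mini_batch) and transcribe_batch would run here
    chunk_sizes.append(len(mini_batch))
print(chunk_sizes)
```

Smaller chunks also mean less zero-padding per batch, since each chunk is padded only to its own longest clip.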