# SpeechBrain + HuggingFace for Speech Recognition tasks

compiled by: [Vaibhav Srivastav](https://twitter.com/reach_vb)

For pre-reads and further materials, head over to the [ml-with-audio repo](https://github.com/Vaibhavs10/ml-with-audio).

Some important speech processing tasks:
- **Speech Recognition**: Speech-to-text ([see this tutorial](https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing)).
- **Speaker Recognition**: Speaker verification/ID ([see this tutorial](https://colab.research.google.com/drive/1UwisnAjr8nQF3UnrkIJ4abBMAWzVwBMh?usp=sharing)).
- **Speaker Diarization**: Detect who spoke when.
- **Speech Enhancement**: Noisy to clean speech ([see this tutorial](https://colab.research.google.com/drive/18RyiuKupAhwWX7fh3LCatwQGU5eIS3TR?usp=sharing)).
- **Speech Separation**: Separate overlapped speech ([see this tutorial](https://colab.research.google.com/drive/1YxsMW1KNqP1YihNUcfrjy0zUp7FhNNhN?usp=sharing)). 
- **Spoken Language Understanding**: Speech to intent/slots. 
- **Multi-microphone processing**: Combining input signals ([see this tutorial](https://colab.research.google.com/drive/1UVoYDUiIrwMpBTghQPbA6rC1mc9IBzi6?usp=sharing)).

In [1]:
%%capture
!pip install speechbrain
!pip install transformers

In [2]:
import speechbrain as sb
from speechbrain.dataio.dataio import read_audio
from IPython.display import Audio

## Let's use a pre-trained model from the HF Hub and transcribe some audio

In [3]:
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir="pretrained_models/asr-crdnn-rnnlm-librispeech")
asr_model.transcribe_file('speechbrain/asr-crdnn-rnnlm-librispeech/example.wav')

'THE BIRCH CANOE SLID ON THE SMOOTH PLANKS'

In [4]:
signal = read_audio("example.wav").squeeze()
Audio(signal, rate=16000)

## Your turn, find a model from [HF Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) and transcribe the wav file

Try both types of pretrained ASR models:

1. EncoderDecoderASR
2. EncoderASR

In [9]:
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="<Pretrained model goes here>", savedir="pretrained_models/<Pretrained model name>")
asr_model.transcribe_file('speechbrain/asr-crdnn-rnnlm-librispeech/example.wav')

'THE BIRCH CANOE SLID ON THE SMOOTH PLANKS'

## Let's take it up a notch: given a sound file with multiple speakers, how do we separate their individual voices?

In [5]:
from speechbrain.pretrained import SepformerSeparation as separator

model = separator.from_hparams(source="speechbrain/sepformer-wsj02mix", savedir='pretrained_models/sepformer-wsj02mix')
est_sources = model.separate_file(path='speechbrain/sepformer-wsj02mix/test_mixture.wav') 

In [6]:
signal = read_audio("test_mixture.wav").squeeze()
Audio(signal, rate=8000)

In [7]:
Audio(est_sources[:, :, 0].detach().cpu().squeeze(), rate=8000)

In [8]:
Audio(est_sources[:, :, 1].detach().cpu().squeeze(), rate=8000)

## Your turn, find a model from [HF Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) and separate the sounds

Look for Sepformer :)

In [None]:
from speechbrain.pretrained import SepformerSeparation as separator

model = separator.from_hparams(source="<Pretrained model goes here>", savedir='pretrained_models/<Pretrained model name>')
est_sources = model.separate_file(path='speechbrain/sepformer-wsj02mix/test_mixture.wav') 

## Alright, so far so good. Let's now check whether two audio files come from the same speaker

In [10]:
from speechbrain.pretrained import SpeakerRecognition
verification = SpeakerRecognition.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained_models/spkrec-ecapa-voxceleb")
score, prediction = verification.verify_files("speechbrain/spkrec-ecapa-voxceleb/example1.wav", "speechbrain/spkrec-ecapa-voxceleb/example2.flac")

print(prediction, score)

tensor([False]) tensor([0.1635])
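
To make the two printed values easier to read: the score is a similarity between the speaker embeddings of the two files, and the prediction says whether that score clears a decision threshold. The sketch below is plain Python, not SpeechBrain code. It assumes cosine similarity as the scoring function and a threshold of 0.25; both are assumptions about the defaults, and the function names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(emb1, emb2, threshold=0.25):
    """Return (score, decision). threshold=0.25 is an assumed cut-off."""
    score = cosine_similarity(emb1, emb2)
    return score, score > threshold


# Toy 3-dimensional "embeddings" (real speaker embeddings are much longer).
score, decision = same_speaker([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(score, decision)
```

Under these assumptions, a score of 0.1635 falls below the cut-off, which would match the `False` prediction shown above.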


In [11]:
signal = read_audio("example1.wav").squeeze()
Audio(signal, rate=16000)

In [12]:
signal = read_audio("example2.flac").squeeze()
Audio(signal, rate=16000)

Want to have more fun with pre-trained models and out-of-the-box tasks? Head over to the [SpeechBrain documentation](https://speechbrain.readthedocs.io/en/latest/API/speechbrain.pretrained.interfaces.html).

Some suggestions:

- [Speech Enhancement](https://huggingface.co/speechbrain/metricgan-plus-voicebank)
- [Command Recognition](https://huggingface.co/speechbrain/google_speech_command_xvector)
- [Spoken Language Understanding](https://huggingface.co/speechbrain/slu-timers-and-such-direct-librispeech-asr)
- [Urban Sound Classification](https://huggingface.co/speechbrain/urbansound8k_ecapa)

Send us your experiments on Twitter or Discord ;)

## Let's train an ASR model on some sample files!

In [13]:
%%capture
!git clone https://github.com/speechbrain/speechbrain.git

In [14]:
%cd speechbrain/tests/integration/neural_networks/ASR_CTC/
!python example_asr_ctc_experiment.py hyperparams.yaml 

/content/speechbrain/tests/integration/neural_networks/ASR_CTC
100% 8/8 [00:05<00:00,  1.44it/s, train_loss=12.2]
100% 2/2 [00:00<00:00,  6.79it/s]
Epoch 0 complete
Train loss: 12.19
Stage.VALID loss: 4.75
Stage.VALID PER: 90.91
100% 8/8 [00:03<00:00,  2.40it/s, train_loss=7.09]
100% 2/2 [00:00<00:00,  6.45it/s]
Epoch 1 complete
Train loss: 7.09
Stage.VALID loss: 4.40
Stage.VALID PER: 94.55
100% 8/8 [00:03<00:00,  2.43it/s, train_loss=4.73]
100% 2/2 [00:00<00:00,  6.40it/s]
Epoch 2 complete
Train loss: 4.73
Stage.VALID loss: 4.20
Stage.VALID PER: 90.91
100% 8/8 [00:03<00:00,  2.40it/s, train_loss=3.68]
100% 2/2 [00:00<00:00,  6.28it/s]
Epoch 3 complete
Train loss: 3.68
Stage.VALID loss: 4.44
Stage.VALID PER: 90.91
100% 8/8 [00:03<00:00,  2.45it/s, train_loss=3.17]
100% 2/2 [00:00<00:00,  6.63it/s]
Epoch 4 complete
Train loss: 3.17
Stage.VALID loss: 4.78
Stage.VALID PER: 90.91
100% 8/8 [00:03<00:00,  2.47it/s, train_loss=2.85]
100% 2/2 [00:00<00:00,  6.10it/s]
Epoch 5 complete
Train los
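
The `PER` column in the log is the phoneme error rate: the edit distance between the predicted and reference phoneme sequences, divided by the reference length and expressed as a percentage. Here is a minimal plain-Python sketch of that metric. The function names are hypothetical, and this is not the SpeechBrain implementation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """PER in percent; ref must be non-empty."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


reference = ["sil", "dh", "ah", "sil"]
hypothesis = ["sil", "ah", "sil"]
print(phoneme_error_rate(reference, hypothesis))
```

Because insertions count as errors, the PER can exceed 100% when the hypothesis is much longer than the reference.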

In [17]:
# The earlier cell already changed into the ASR_CTC directory, so no second %cd is needed.
!cat example_asr_ctc_experiment.py

#!/usr/bin/env/python3
"""This minimal example trains a CTC-based speech recognizer on a tiny dataset.
The encoder is based on a combination of convolutional, recurrent, and
feed-forward networks (CRDNN) that predict phonemes.  A greedy search is used on
top of the output probabilities.
Given the tiny dataset, the expected behavior is to overfit the training dataset
(with a validation performance that stays high).
"""

import pathlib
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml


class CTCBrain(sb.Brain):
    def compute_forward(self, batch, stage):
        "Given an input batch it computes the output probabilities."
        batch = batch.to(self.device)
        wavs, lens = batch.sig
        feats = self.modules.compute_features(wavs)
        feats = self.modules.mean_var_norm(feats, lens)
        x = self.mod
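
The docstring above mentions a greedy search on top of the output probabilities. For CTC, greedy decoding usually means: pick the most likely label in each frame, merge consecutive repeats, then drop the blank symbol. The following is a plain-Python sketch of that idea, not the code used in the script. The function name is hypothetical, and `blank_index=0` mirrors the value in `hyperparams.yaml`.

```python
def ctc_greedy_decode(frame_probs, blank_index=0):
    """Greedy CTC decoding for one utterance.

    frame_probs: list of per-frame probability lists, one entry per label.
    """
    # 1. Most likely label for each frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]

    # 2. Merge consecutive repeats, then 3. drop blanks.
    decoded = []
    previous = None
    for label in best_path:
        if label != previous and label != blank_index:
            decoded.append(label)
        previous = label
    return decoded


# Three labels: 0 = blank, 1 and 2 = phonemes. Best path is 0, 1, 1, 0, 1, 2.
probs = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
]
print(ctc_greedy_decode(probs))
```

Note that the blank between the two runs of label 1 is what lets the decoder emit the same phoneme twice in a row.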

In [18]:
# Still inside the ASR_CTC directory, so no %cd is needed here either.
!cat hyperparams.yaml

# Seed needs to be set at top of yaml, before objects with parameters are made
# NOTE: Seed does not guarantee replicability with CTC
seed: 1234
__set_seed: !apply:torch.manual_seed [!ref <seed>]

# Training params
N_epochs: 15
lr: 0.002
dataloader_options:
    batch_size: 1

# Special tokens and labels
blank_index: 0
num_labels: 44


# Model parameters
activation: !name:torch.nn.LeakyReLU []
dropout: 0.15
cnn_blocks: 1
cnn_channels: (16,)
cnn_kernelsize: (3, 3)
rnn_layers: 1
rnn_neurons: 128
rnn_bidirectional: True
dnn_blocks: 1
dnn_neurons: 128

compute_features: !new:speechbrain.lobes.features.MFCC

mean_var_norm: !new:speechbrain.processing.features.InputNormalization
    norm_type: global

model: !new:speechbrain.lobes.models.CRDNN.CRDNN
    input_shape: [null, null, 660]
    activation: !ref <activation>
    dropout: !ref <dr
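
A note on the `660` in `input_shape`: it is the size of each feature vector that the MFCC block hands to the model. The arithmetic below is a sketch under assumed MFCC defaults (20 coefficients, first and second derivatives appended, and a context window of 5 frames on each side). These values are assumptions, so check the documentation of your SpeechBrain version. The helper name is hypothetical.

```python
def mfcc_feature_dim(n_mfcc=20, use_deltas=True, left_frames=5, right_frames=5):
    """Feature size per frame under the assumed MFCC settings."""
    # Static coefficients plus delta and delta-delta when deltas are enabled.
    per_frame = n_mfcc * (3 if use_deltas else 1)
    # Context window: the current frame plus its left and right neighbours.
    return per_frame * (left_frames + 1 + right_frames)


print(mfcc_feature_dim())
```

With those assumptions the result is 20 × 3 × 11 = 660, which is consistent with the value in the YAML file.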

## Your turn: take the sample data and train a Seq2Seq model next

Hint: Look at the [integration tests folder](https://github.com/speechbrain/speechbrain/tree/develop/tests/integration/neural_networks) ;)