# SpeechBrain + HuggingFace for Speech Recognition tasks

compiled by: [Vaibhav Srivastav](https://twitter.com/reach_vb)

For pre-reads and further materials, head over to the [ml-with-audio repo](https://github.com/Vaibhavs10/ml-with-audio).

Some important speech processing tasks:
- **Speech Recognition**: Speech-to-text ([see this tutorial](https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing)).
- **Speaker Recognition**: Speaker verification/ID ([see this tutorial](https://colab.research.google.com/drive/1UwisnAjr8nQF3UnrkIJ4abBMAWzVwBMh?usp=sharing)).
- **Speaker Diarization**: Detect who spoke when.
- **Speech Enhancement**: Noisy to clean speech ([see this tutorial](https://colab.research.google.com/drive/18RyiuKupAhwWX7fh3LCatwQGU5eIS3TR?usp=sharing)).
- **Speech Separation**: Separate overlapped speech ([see this tutorial](https://colab.research.google.com/drive/1YxsMW1KNqP1YihNUcfrjy0zUp7FhNNhN?usp=sharing)). 
- **Spoken Language Understanding**: Speech to intent/slots. 
- **Multi-microphone processing**: Combining input signals ([see this tutorial](https://colab.research.google.com/drive/1UVoYDUiIrwMpBTghQPbA6rC1mc9IBzi6?usp=sharing)).

In [None]:
%%capture
!pip install speechbrain
!pip install transformers

In [None]:
import speechbrain as sb
from speechbrain.dataio.dataio import read_audio
from IPython.display import Audio

## Let's use a pre-trained model from the HF Hub and transcribe some speech

In [None]:
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir="pretrained_models/asr-crdnn-rnnlm-librispeech")
asr_model.transcribe_file('speechbrain/asr-crdnn-rnnlm-librispeech/example.wav')

In [None]:
signal = read_audio("example.wav").squeeze()
Audio(signal, rate=16000)
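
Once you have a transcript, a common way to judge it is the word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal, standalone version in plain Python and is not SpeechBrain's own implementation. The `word_error_rate` helper and both example strings are hypothetical and only serve as illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical strings, not the actual transcript of example.wav.
reference = "the birch canoe slid on the smooth planks"
hypothesis = "the birch canoe slid on smooth planks"

wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}")
```

To use it on the cell above, pass the string returned by `asr_model.transcribe_file(...)` as `hypothesis` along with a reference transcript you trust.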

## Your turn: find a model on the [HF Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) and transcribe the WAV file

Try both types of pretrained ASR interface:

1. EncoderDecoderASR
2. EncoderASR

In [None]:
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="<Pretrained model goes here>", savedir="pretrained_models/<Pretrained model name>")
asr_model.transcribe_file('speechbrain/asr-crdnn-rnnlm-librispeech/example.wav')

### Let's take it up a notch: given a sound file with multiple speakers, how do we separate their individual voices?

In [None]:
from speechbrain.pretrained import SepformerSeparation as separator

model = separator.from_hparams(source="speechbrain/sepformer-wsj02mix", savedir='pretrained_models/sepformer-wsj02mix')
est_sources = model.separate_file(path='speechbrain/sepformer-wsj02mix/test_mixture.wav') 

In [None]:
signal = read_audio("test_mixture.wav").squeeze()
Audio(signal, rate=8000)

In [None]:
Audio(est_sources[:, :, 0].detach().cpu().squeeze(), rate=8000)

In [None]:
Audio(est_sources[:, :, 1].detach().cpu().squeeze(), rate=8000)

## Your turn: find a model on the [HF Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) and separate the sounds

Look for Sepformer :)

In [None]:
from speechbrain.pretrained import SepformerSeparation as separator

model = separator.from_hparams(source="<Pretrained model goes here>", savedir='pretrained_models/<Pretrained model name>')
est_sources = model.separate_file(path='speechbrain/sepformer-wsj02mix/test_mixture.wav') 

## So far so good. Now let's check whether two audio files come from the same speaker

In [None]:
from speechbrain.pretrained import SpeakerRecognition
verification = SpeakerRecognition.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained_models/spkrec-ecapa-voxceleb")
score, prediction = verification.verify_files("speechbrain/spkrec-ecapa-voxceleb/example1.wav", "speechbrain/spkrec-ecapa-voxceleb/example2.flac")

print(prediction, score)
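
To build some intuition for the `score` and `prediction` printed above: speaker verification systems of this kind typically map each recording to an embedding vector, compare the two embeddings with cosine similarity, and accept the pair if the score exceeds a threshold. The sketch below illustrates that decision rule in plain Python. The three-dimensional embeddings, the `same_speaker` helper, and the `0.25` threshold are all assumptions made for illustration; real embeddings have many more dimensions, and the threshold used by the pretrained interface may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(emb1, emb2, threshold=0.25):
    """Return (score, decision); the threshold value is an illustrative assumption."""
    score = cosine_similarity(emb1, emb2)
    return score, score > threshold


# Toy embeddings (hypothetical): a and b point in similar directions, c does not.
emb_a = [0.9, 0.1, 0.3]
emb_b = [0.8, 0.2, 0.4]
emb_c = [-0.5, 0.9, -0.1]

score_ab, pred_ab = same_speaker(emb_a, emb_b)
score_ac, pred_ac = same_speaker(emb_a, emb_c)
print(pred_ab, round(score_ab, 3))
print(pred_ac, round(score_ac, 3))
```

A higher threshold makes the system stricter (fewer false accepts, more false rejects), so it is usually tuned on held-out verification trials.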

In [None]:
signal = read_audio("example1.wav").squeeze()
Audio(signal, rate=16000)

In [None]:
signal = read_audio("example2.flac").squeeze()
Audio(signal, rate=16000)

Want to have more fun with pre-trained models and out-of-the-box tasks? Head over to the [SpeechBrain documentation](https://speechbrain.readthedocs.io/en/latest/API/speechbrain.pretrained.interfaces.html).

Some suggestions:

- [Speech Enhancement](https://huggingface.co/speechbrain/metricgan-plus-voicebank)
- [Command Recognition](https://huggingface.co/speechbrain/google_speech_command_xvector)
- [Spoken Language Understanding](https://huggingface.co/speechbrain/slu-timers-and-such-direct-librispeech-asr)
- [Urban Sound Classification](https://huggingface.co/speechbrain/urbansound8k_ecapa)

Send us your experiments on Twitter or Discord ;)

## Let's train an ASR model on some sample files!

In [None]:
%%capture
!git clone https://github.com/speechbrain/speechbrain.git

In [None]:
%cd speechbrain/tests/integration/neural_networks/ASR_CTC/
!python example_asr_ctc_experiment.py hyperparams.yaml 

In [None]:
# The earlier %cd already moved us into speechbrain/tests/integration/neural_networks/ASR_CTC/,
# so repeating the relative %cd here would fail.
!cat example_asr_ctc_experiment.py

In [None]:
# Still inside speechbrain/tests/integration/neural_networks/ASR_CTC/ from the earlier %cd.
!cat hyperparams.yaml
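
The experiment above trains with a CTC objective, so the network emits one label distribution per audio frame, including a special blank label. A simple way to turn those frame-level outputs into a label sequence is greedy decoding: pick the best label per frame, merge consecutive repeats, then drop the blanks. The sketch below shows that rule in plain Python. The `greedy_ctc_collapse` helper, the toy vocabulary, and the choice of index `0` for the blank are assumptions for illustration; the integration script may decode differently.

```python
from itertools import groupby


def greedy_ctc_collapse(frame_scores, blank_index=0):
    """Greedy CTC decoding: argmax per frame, merge repeats, remove blanks."""
    # Index of the highest-scoring label in each frame.
    best_path = [
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores
    ]
    # Collapse runs of identical consecutive labels into a single label.
    collapsed = [label for label, _ in groupby(best_path)]
    # Blanks only separate labels; they are not part of the output.
    return [label for label in collapsed if label != blank_index]


# Toy vocabulary and per-frame scores (hypothetical values).
vocab = ["<blank>", "a", "b", "c"]
frame_scores = [
    [0.10, 0.70, 0.10, 0.10],  # a
    [0.10, 0.60, 0.20, 0.10],  # a again, merged with the previous frame
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.70, 0.10, 0.10],  # a, kept as a new label because a blank came before
    [0.10, 0.10, 0.70, 0.10],  # b
    [0.10, 0.10, 0.60, 0.20],  # b again, merged
    [0.90, 0.05, 0.03, 0.02],  # blank
]

decoded = greedy_ctc_collapse(frame_scores)
text = "".join(vocab[i] for i in decoded)
print(decoded, text)
```

Note how the blank frame between the two runs of `a` is what allows a repeated label to survive the merge step.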

## Your turn: take the sample data and train a Seq2Seq model next

Hint: Look at the [integration tests folder](https://github.com/speechbrain/speechbrain/tree/develop/tests/integration/neural_networks) ;)