In [1]:
%%capture
# Installing SpeechBrain via pip
BRANCH = 'develop'
!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH

Once installed, you should be able to import SpeechBrain in Python:

In [3]:
import speechbrain as sb
from speechbrain.dataio.dataio import read_audio
from IPython.display import Audio

## **Speech Recognition on Different Languages**

### *English*

In [17]:
from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir="pretrained_models/asr-crdnn-rnnlm-librispeech")
asr_model.transcribe_file('test (mp3cut.net).wav')

'NOW MY PHONE HAS BEEN USED AS A SPEAKER AWHILE AND YOU CAN SEE IF YOU SAY SOMETHING OR NOT'

In [20]:
signal = read_audio("test (mp3cut.net).wav").squeeze()
Audio(signal, rate=16000)

## **Speech Separation**

Here we show a mixture of two speakers, but we also have a state-of-the-art system for separating mixtures of three speakers. We also have models that deal with noise and reverberation. [See our HuggingFace repository](https://huggingface.co/speechbrain/).

In [None]:
from speechbrain.inference.separation import SepformerSeparation as separator

model = separator.from_hparams(source="speechbrain/sepformer-wsj02mix", savedir='pretrained_models/sepformer-wsj02mix')
est_sources = model.separate_file(path='/content/test_mixture.wav')


In [None]:
signal = read_audio("/content/test_mixture.wav").squeeze()
Audio(signal, rate=8000)

In [None]:
Audio(est_sources[:, :, 0].detach().cpu().squeeze(), rate=8000)

In [None]:
Audio(est_sources[:, :, 1].detach().cpu().squeeze(), rate=8000)

## **Speech Enhancement**
The goal of speech enhancement is to remove the noise that affects a recording.
SpeechBrain has several systems for speech enhancement. The following example is processed by SepFormer (the version trained to perform enhancement):

In [21]:
from speechbrain.inference.separation import SepformerSeparation as separator
import torchaudio

model = separator.from_hparams(source="speechbrain/sepformer-whamr-enhancement", savedir='pretrained_models/sepformer-whamr-enhancement4')
enhanced_speech = model.separate_file(path='test (mp3cut.net).wav')


RuntimeError: torchaudio_sox::load_audio_file() Expected a value of type 'str' for argument '_0' but instead found type 'PosixPath'.
Position: 0
Value: PosixPath('audio_cache/test (mp3cut.net).wav')
Declaration: torchaudio_sox::load_audio_file(str _0, int? _1, int? _2, bool? _3, bool? _4, str? _5) -> (Tensor _0, int _1)
Cast error details: Unable to cast Python instance of type <class 'pathlib.PosixPath'> to C++ type '?' (#define PYBIND11_DETAILED_ERROR_MESSAGES or compile in debug mode for details)
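
The traceback above shows a `pathlib.PosixPath` reaching a loader that only accepts `str`. One possible workaround, sketched here under assumptions, is to load the audio yourself from a plain string path and pass the resulting tensor to the model's `separate_batch` method instead of calling `separate_file`. The helper name `separate_from_path` and its `load_audio` parameter are hypothetical; in this notebook `load_audio` would be `torchaudio.load`.

```python
import os


def separate_from_path(model, load_audio, path):
    """Load audio from a plain-str path, then run the model on the tensor.

    model      -- object exposing separate_batch(waveform), e.g. the SepFormer above
    load_audio -- callable returning (waveform, sample_rate), e.g. torchaudio.load
    path       -- str or pathlib.Path pointing at the audio file
    """
    # os.fspath turns a pathlib.Path into a str and leaves a str unchanged,
    # so the loader never receives a PosixPath.
    waveform, sample_rate = load_audio(os.fspath(path))
    return model.separate_batch(waveform), sample_rate
```

In the notebook this might be called as `enhanced_speech, fs = separate_from_path(model, torchaudio.load, 'test (mp3cut.net).wav')`. The sketch does no resampling or channel mixing, so it assumes a mono file that is already at the sample rate the model expects.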

In [22]:
signal = read_audio("test (mp3cut.net).wav").squeeze()
Audio(signal, rate=16000)

In [13]:
Audio(enhanced_speech[:, :].detach().cpu().squeeze(), rate=8000)  # this SepFormer enhancement model works at 8 kHz

NameError: name 'enhanced_speech' is not defined

## **Speaker Verification**
The task here is to determine whether two utterances belong to the same speaker.

In [None]:
from speechbrain.inference.speaker import SpeakerRecognition
verification = SpeakerRecognition.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained_models/spkrec-ecapa-voxceleb")
score, prediction = verification.verify_files("/content/example1.wav", "/content/example2.flac")

print(prediction, score)

tensor([False]) tensor([0.1799])
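
The printed pair is a decision followed by a similarity score. As a rough illustration of how such a pair can relate, the sketch below scores two embedding vectors by cosine similarity and accepts the pair when the score exceeds a threshold. The function names, the toy vectors, and the 0.25 threshold are assumptions chosen for illustration, not a description of what `verify_files` does internally.

```python
import math


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.25):
    """Return (score, decision); the threshold value is an assumption."""
    score = cosine_score(emb_a, emb_b)
    return score, score > threshold


# Toy embeddings standing in for the vectors a speaker encoder would produce.
score, prediction = same_speaker([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(prediction, score)
```

A higher threshold makes the check stricter (fewer false accepts, more false rejects), so in practice it would be tuned on held-out trials.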


In [None]:
signal = read_audio("/content/example1.wav").squeeze()
Audio(signal, rate=16000)

In [None]:
signal = read_audio("/content/example2.flac").squeeze()
Audio(signal, rate=16000)

## **Speech Synthesis (Text-to-Speech)**
The goal of speech synthesis is to create a speech signal from the input text.
In the following, you can find an example with the popular [Tacotron2](https://arxiv.org/abs/1712.05884) model coupled with [HiFiGAN](https://arxiv.org/abs/2010.05646) as a vocoder:

In [None]:
import torchaudio
from speechbrain.inference.TTS import Tacotron2
from speechbrain.inference.vocoders import HIFIGAN

# Initialize TTS (Tacotron2) and vocoder (HiFiGAN)
tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir="tmpdir_tts")
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir="tmpdir_vocoder")

# Running the TTS
mel_output, mel_length, alignment = tacotron2.encode_text("This is an open-source toolkit for the development of speech technologies.")

# Running Vocoder (spectrogram-to-waveform)
waveforms = hifi_gan.decode_batch(mel_output)

In [None]:
Audio(waveforms.detach().cpu().squeeze(), rate=22050)
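
To keep the synthesized audio rather than only play it inline, the samples can be written to a WAV file. The sketch below uses only the standard library and assumes a flat sequence of floats in [-1, 1]; the helper name `write_wav` is hypothetical, and the default rate simply mirrors the 22050 Hz used for playback above.

```python
import struct
import wave


def write_wav(path, samples, sample_rate=22050):
    """Write mono float samples in [-1, 1] to `path` as 16-bit PCM."""
    frames = bytearray()
    for sample in samples:
        # Clip out-of-range values so they cannot overflow the 16-bit range.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
```

With the tensors from the cell above, a call might look like `write_wav("tts_example.wav", waveforms.detach().cpu().squeeze().tolist())`, where the filename is only an example.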

## **Other Tasks**
We support several other tasks. Click the links below to see the pretrained models and their easy-inference functions:

- [Speech Enhancement](https://huggingface.co/speechbrain/metricgan-plus-voicebank)
- [Command Recognition](https://huggingface.co/speechbrain/google_speech_command_xvector)
- [Spoken Language Understanding](https://huggingface.co/speechbrain/slu-timers-and-such-direct-librispeech-asr)
- [Urban Sound Classification](https://huggingface.co/speechbrain/urbansound8k_ecapa)

# **About SpeechBrain**
- Website: https://speechbrain.github.io/
- Code: https://github.com/speechbrain/speechbrain/
- HuggingFace: https://huggingface.co/speechbrain/


# **Citing SpeechBrain**
Please cite SpeechBrain if you use it for your research or business.

```bibtex
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
```