<a href="https://colab.research.google.com/github/enmwmak/Teaching/blob/main/EIE558/lab/2025/SB_speech_enhancement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Speech Enhancement Based on SpeechBrain

## 1. Objectives
This lab exercise teaches you how to use pre-trained speech models from SpeechBrain to reduce the noise in speech files. You will compare the denoised (enhanced) speech with the noisy speech subjectively, through listening tests, and objectively, through the Short-Time Objective Intelligibility (STOI) measure.

## 2. Prerequisites
Before starting this lab, you should learn the <a href="https://speechbrain.readthedocs.io/en/v1.0.3/tutorials/basics.html">basics</a> of SpeechBrain and read the <a href="https://speechbrain.readthedocs.io/en/v1.0.3/tutorials/tasks/speech-enhancement-from-scratch.html">tutorial</a> on speech enhancement. Once you understand what SpeechBrain is and how it performs speech enhancement, read the documentation on using its <a href="https://speechbrain.readthedocs.io/en/v1.0.3/API/speechbrain.inference.enhancement.html">pre-trained models</a> for speech enhancement. You may also want to read the paper on <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5713237">STOI</a>.

## 3. Submission
Write a report, convert it to PDF, and submit it to Blackboard before the deadline specified in Blackboard. Your report may contain the following:
<ol type="a">
  <li>Discussions on your observations, e.g., what kind of noise is difficult for the pre-trained model</li>
  <li>Waveforms and spectrograms of noisy speech corrupted by different noise types</li>
  <li>Waveforms and spectrograms of denoised speech enhanced by different pre-trained models</li>
  <li>Comparison between spectral subtraction and deep-learning-based speech enhancement</li>
  <li>The STOI scores of the noisy and enhanced speech</li>

</ol>

## 4. Prepare Colab Environment
Colab runs in a web browser. You need a Google account to use Colab. If you do not have one, visit https://support.google.com/mail/answer/56256?hl=en.

Display the Google Drive page (https://drive.google.com/drive/my-drive) in your browser. Use the “+ New” button on the left panel to create a directory structure in your Google Drive as follows: "My Drive/Learning/EIE558/Lab1"

## 5. Procedure
1. Upload this .ipynb file to your Google Drive under the "My Drive/Learning/EIE558/Lab1" folder.
2. Create a folder "My Drive/Learning/EIE558/Lab1/audio".
3. Download the .wav files from https://github.com/enmwmak/Teaching/tree/main/EIE558/lab/2025 and upload them to your "My Drive/Learning/EIE558/Lab1/audio" folder in Google Drive.
4. Execute the following cells.
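
Before executing the cells, it can help to confirm that the audio folder holds the files the later cells read. The following is a minimal sketch of such a check; the helper name `missing_audio_files` is hypothetical, and the file list is taken from the file names used in the cells of this notebook.

```python
from pathlib import Path

# File names read by the cells in this notebook
EXPECTED_WAVS = ["cleanspeech16k.wav", "noisyspeech16k.wav", "machinegun.wav"]

def missing_audio_files(audio_dir, expected=EXPECTED_WAVS):
    """Return the expected file names that are not present in audio_dir."""
    audio_dir = Path(audio_dir)
    return [name for name in expected if not (audio_dir / name).is_file()]
```

After mounting Google Drive, calling `missing_audio_files("/content/drive/MyDrive/Learning/EIE558/Lab1/audio")` should return an empty list if the uploads in Step 3 succeeded.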

In [None]:
# Check Python version (This script works on Python 3.12, Torch 2.8, and Torchaudio 2.8)
!python --version
import torch
import torchaudio
print(torch.__version__)
print(torchaudio.__version__)

In [None]:
# Mount Google Drive and change to the Lab1 folder.
# Repeat this step whenever a Colab session expires.
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)
%cd /content/drive/MyDrive/Learning/EIE558/Lab1

In [None]:
# Install SpeechBrain and pystoi (for STOI) via pip.
# Repeat this step whenever a Colab session expires.
BRANCH = 'develop'
!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH
!pip install pystoi

In [None]:
# Clone the SpeechBrain repository into your Lab1 folder.
# Skip this step if you have already cloned it, because the repository persists in your Google Drive.
%cd /content/drive/MyDrive/Learning/EIE558/Lab1
!git clone https://github.com/speechbrain/speechbrain/

In [None]:
# Define a function that plots the waveforms and spectrograms of noisy and enhanced speech
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Audio, display

def show_wav_and_spec(wavfile):
  waveform, srate = librosa.load(wavfile, sr=librosa.get_samplerate(wavfile))
  n_samples = len(waveform)
  frm_size = int(0.032 * srate)       # 32ms per frame
  frm_shift = int(0.01 * srate)       # 100Hz frame rate
  # Use a shorter FFT if the file is shorter than one frame
  n_fft = n_samples if (n_samples < frm_size) else frm_size

  # Compute magnitude spectrogram
  magspec = abs(librosa.stft(y=waveform, n_fft=n_fft, hop_length=frm_shift))

  # Plot spectrogram and waveform
  plt.figure(figsize=(10, 4))
  plt.subplot(211)
  librosa.display.specshow(librosa.amplitude_to_db(magspec), sr=srate, y_axis='linear',
                           hop_length=frm_shift)
  plt.subplot(313)
  librosa.display.waveshow(waveform, sr=srate, offset=0)
  plt.margins(x=0)
  plt.show()

  display(Audio(wavfile, autoplay=True))
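
The frame size and frame shift in `show_wav_and_spec` determine the shape of the spectrogram. The following sketch works this out without loading any audio. The helper name `stft_shape` is hypothetical, and the frame count assumes the `librosa.stft` default `center=True`.

```python
def stft_shape(n_samples, srate, frame_dur=0.032, shift_dur=0.01):
    """Return (n_freq_bins, n_frames) for an STFT with centered frames.

    Assumes centered framing (librosa.stft with center=True), for which
    the number of frames is 1 + n_samples // hop_length.
    """
    frm_size = int(frame_dur * srate)     # samples per frame (used as n_fft)
    frm_shift = int(shift_dur * srate)    # samples between frame starts
    n_bins = 1 + frm_size // 2            # non-negative frequency bins
    n_frames = 1 + n_samples // frm_shift
    return n_bins, n_frames

# One second of audio sampled at 16 kHz
print(stft_shape(n_samples=16000, srate=16000))   # (257, 101)
```

At 16 kHz, a 32 ms frame is 512 samples and a 10 ms shift is 160 samples, so each spectrogram column holds 257 frequency bins.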

In [None]:
# Download and use a spectral-masking based pre-trained model to enhance speech
%cd /content/drive/MyDrive/Learning/EIE558/Lab1/speechbrain/templates/enhancement
import torch
from speechbrain.inference.enhancement import SpectralMaskEnhancement

# Model is downloaded from the speechbrain HuggingFace repo
enhancer = SpectralMaskEnhancement.from_hparams(source="speechbrain/metricgan-plus-voicebank",
                                                savedir="tmpdir1",)
enhanced = enhancer.enhance_file("../../../audio/noisyspeech16k.wav", output_filename="../../../audio/enhanced1.wav")
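
Spectral masking scales each time-frequency bin of the noisy magnitude spectrogram by a value between 0 and 1 and resynthesizes the signal with the noisy phase. The following is a conceptual NumPy sketch of that idea with a hand-written mask; it is not SpeechBrain's implementation, and the function name `apply_spectral_mask` is hypothetical. In the pre-trained model, the mask is predicted by a neural network.

```python
import numpy as np

def apply_spectral_mask(noisy_stft, mask):
    """Scale the noisy magnitude by a [0, 1] mask and reuse the noisy phase."""
    noisy_mag = np.abs(noisy_stft)
    noisy_phase = np.angle(noisy_stft)
    enhanced_mag = np.clip(mask, 0.0, 1.0) * noisy_mag
    return enhanced_mag * np.exp(1j * noisy_phase)

# Toy example: 2 frequency bins x 2 frames
noisy_stft = np.array([[1 + 1j, 2 + 0j],
                       [0 + 3j, -4 + 0j]])
mask = np.array([[1.0, 0.5],
                 [0.0, 0.25]])
enhanced_stft = apply_spectral_mask(noisy_stft, mask)
print(np.abs(enhanced_stft))
```

A mask value of 1 keeps a bin unchanged, and a value of 0 removes it entirely.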

In [None]:
# Display the spectrogram and waveform of the noisy speech file and play the audio file
show_wav_and_spec("../../../audio/noisyspeech16k.wav")

In [None]:
# Display the spectrogram and waveform of the enhanced speech file and play the audio file
show_wav_and_spec("../../../audio/enhanced1.wav")

In [None]:
# Download and use another spectral-based pre-trained SE model to enhance speech
%cd /content/drive/MyDrive/Learning/EIE558/Lab1/speechbrain/templates/enhancement

from speechbrain.inference.enhancement import WaveformEnhancement
# Model is downloaded from the speechbrain HuggingFace repo
enhancer = WaveformEnhancement.from_hparams(source="speechbrain/mtl-mimic-voicebank",
                                            savedir="tmpdir2",)
enhanced = enhancer.enhance_file("../../../audio/noisyspeech16k.wav", output_filename="../../../audio/enhanced2.wav")

In [None]:
# Display the spectrogram and waveform of the enhanced speech file and play the audio file
show_wav_and_spec("../../../audio/enhanced2.wav")

In [None]:
# Perform objective evaluation based on STOI
from speechbrain.dataio.dataio import read_audio
from pystoi.stoi import stoi
import numpy as np
clean_wav = read_audio("../../../audio/cleanspeech16k.wav")
noisy_wav = read_audio("../../../audio/noisyspeech16k.wav")
enhan_wav = read_audio("../../../audio/enhanced2.wav")
num_smps = np.min([clean_wav.shape[0], noisy_wav.shape[0], enhan_wav.shape[0]])
clean_wav = clean_wav[:num_smps]
noisy_wav = noisy_wav[:num_smps]
enhan_wav = enhan_wav[:num_smps]
stoi_clean = stoi(clean_wav.numpy(), clean_wav.numpy(), 16000)
stoi_noisy = stoi(clean_wav.numpy(), noisy_wav.numpy(), 16000)
stoi_enhan = stoi(clean_wav.numpy(), enhan_wav.numpy(), 16000)
print(f"The STOI score of clean speech: {stoi_clean:.3f}")
print(f"The STOI score of noisy speech: {stoi_noisy:.3f}")
print(f"The STOI score of enhan speech: {stoi_enhan:.3f}")
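
The cell above truncates the three signals to a common length before scoring, because STOI compares the clean and processed signals frame by frame. The following sketch wraps that step in a reusable helper; the name `trim_to_common_length` is hypothetical.

```python
import numpy as np

def trim_to_common_length(*signals):
    """Truncate 1-D signals to the length of the shortest one."""
    n = min(len(s) for s in signals)
    return [np.asarray(s)[:n] for s in signals]

# Toy signals of different lengths
a, b, c = trim_to_common_length(np.zeros(5), np.ones(3), np.arange(4))
print(len(a), len(b), len(c))   # 3 3 3
```

Truncation only equalizes the lengths. It does not correct any time delay that an enhancement model may introduce between the clean and enhanced signals.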

In [None]:
# Add noise to clean speech at a given SNR and save the noisy speech
# as "Lab1/audio/speech+noise.wav". Change the noise file to try other noise types.
%cd /content/drive/MyDrive/Learning/EIE558/Lab1/audio
import torch
import torchaudio
import torchaudio.functional as F
snr_db = 0.0  # SNR in dB
speech, _ = torchaudio.load("./cleanspeech16k.wav")
noise, _ = torchaudio.load("./machinegun.wav")
# Truncate the noise to the speech length (the noise file must be at least as long as the speech)
noise = noise[:, : speech.shape[1]]
noisy_speech = F.add_noise(speech, noise, snr=torch.tensor([snr_db]))
torchaudio.save("./speech+noise.wav", noisy_speech, sample_rate=16000, format="wav")
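
Mixing at a target SNR amounts to scaling the noise so that the speech-to-noise power ratio equals the requested value. The following NumPy sketch illustrates the same idea as the cell above, but it is not the torchaudio implementation. The function name `mix_at_snr` is hypothetical, and the signals are synthetic.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech after scaling the noise to the target SNR (dB)."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the gain so that speech_power / (gain**2 * noise_power) = 10**(snr_db / 10)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Synthetic example: a 440 Hz tone as "speech" and white noise
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(20000)
noisy = mix_at_snr(speech, noise, snr_db=0.0)
```

At 0 dB, the scaled noise has the same average power as the speech.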

In [None]:
# Display the spectrogram and waveform of the created noisy speech file and play the audio file
show_wav_and_spec("./speech+noise.wav")

In [None]:
# Enhance the newly created noisy speech with the mtl-mimic-voicebank pre-trained model.
# Note: this overwrites audio/enhanced2.wav produced by the earlier cell.
%cd /content/drive/MyDrive/Learning/EIE558/Lab1/speechbrain/templates/enhancement

from speechbrain.inference.enhancement import WaveformEnhancement
# Model is downloaded from the speechbrain HuggingFace repo
enhancer = WaveformEnhancement.from_hparams(source="speechbrain/mtl-mimic-voicebank",
                                            savedir="tmpdir2",)
enhanced = enhancer.enhance_file("../../../audio/speech+noise.wav", output_filename="../../../audio/enhanced2.wav")

In [None]:
# Display the spectrogram and waveform of the enhanced speech file and play the audio file
show_wav_and_spec("../../../audio/enhanced2.wav")

### Code for Spectral Subtraction

In [None]:
def specsub(y, frame_size, frame_shift, sr=16000):
    Y = librosa.stft(y, n_fft=frame_size, hop_length=frame_shift)     # Short-time Fourier transform
    Ymag= np.abs(Y)         # Get magnitude (spectrogram)
    Ypha= np.angle(Y)       # Get phase

    # Assume that the beginning of the speech file contains noise only and get its average noise spectrum
    noise_dur = 0.5             # Duration at the beginning of the file considered as noise
    alpha = 2                   # Over-subtraction factor
    noise_mag = Ymag[:, 0:int(noise_dur*sr/frame_shift)]
    mean_noise_mag = np.mean(noise_mag, axis=1)
    Xmag = Ymag - alpha * mean_noise_mag.reshape((mean_noise_mag.shape[0],1))

    # Half-wave rectification: set all negative values of |Y(w)| - alpha*|B(w)| to 0
    mask = (Xmag > 0).astype(int)
    Xmag = Xmag * mask

    # Combine the enhanced magnitude with the phase of the noisy speech, then convert to the time domain using ISTFT
    X = Xmag * np.exp(1.0j * Ypha)
    x = librosa.istft(X, n_fft=frame_size, hop_length=frame_shift)
    return x, X, mask
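
The subtraction and rectification steps in `specsub` can be checked by hand on a small magnitude matrix. The following is a toy sketch using NumPy only; the function name `subtract_noise_floor` and the numbers are made up for illustration.

```python
import numpy as np

def subtract_noise_floor(mag, n_noise_frames, alpha=2.0):
    """Over-subtract the mean noise magnitude and clip negative values to 0."""
    # Average the first n_noise_frames columns, assumed to contain noise only
    noise_mean = mag[:, :n_noise_frames].mean(axis=1, keepdims=True)
    return np.maximum(mag - alpha * noise_mean, 0.0)

# Toy magnitudes: 2 frequency bins x 4 frames; the first 2 frames are noise only
mag = np.array([[1.0, 1.0, 5.0, 1.5],
                [2.0, 2.0, 3.0, 9.0]])
print(subtract_noise_floor(mag, n_noise_frames=2))
# [[0. 0. 3. 0.]
#  [0. 0. 0. 5.]]
```

With `alpha = 2`, any bin whose magnitude is below twice the noise mean is zeroed, which removes more noise but can also remove weak speech components.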

In [None]:
# Load the noisy speech and perform spectral subtraction
%cd /content/drive/MyDrive/Learning/EIE558/Lab1
frame_size = 512
frame_shift = 64
y, sr = librosa.load("audio/noisyspeech16k.wav", sr=None, mono=True) # keep the native sampling rate (sr) and convert to mono
x, X, mask = specsub(y, frame_size=frame_size, frame_shift=frame_shift, sr=sr)
print(X.shape)

In [None]:
# Function for plotting speech signal and its spectrogram
def plot_speech(x, sr=8000, frm_len=512, hop_len=256):
    X = librosa.amplitude_to_db(np.abs(librosa.stft(x, n_fft=frm_len, hop_length=hop_len)), ref=np.max)  # STFT of x
    _, ax = plt.subplots(nrows=2, sharex=True, figsize=(8,4))
    librosa.display.waveshow(x, sr=sr, ax=ax[0])
    librosa.display.specshow(X, sr=sr, n_fft=frm_len, hop_length=hop_len, x_axis='time', y_axis='linear', ax=ax[1])
    display(Audio(x, rate=sr, autoplay=True))

In [None]:
# Plot noisy speech
plot_speech(y, sr=16000, frm_len=frame_size, hop_len=frame_shift)

In [None]:
# Plot denoised speech
plot_speech(x, sr=16000, frm_len=frame_size, hop_len=frame_shift)

## References
1. Fu, Szu-Wei, et al. "MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement." International Conference on Machine Learning. PMLR, 2019.
2. Fu, Szu-Wei, et al. "MetricGAN+: An improved version of MetricGAN for speech enhancement." arXiv preprint arXiv:2104.03538 (2021).
3. Bagchi, Deblin, et al. "Spectral feature mapping with mimic loss for robust speech recognition." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
4. Taal, Cees H., et al. "An algorithm for intelligibility prediction of time–frequency weighted noisy speech." IEEE Transactions on audio, speech, and language processing 19.7 (2011): 2125-2136.
5. Rix, Antony W., et al. "Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs." 2001 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 01CH37221). Vol. 2. IEEE, 2001.