# Problem 2: Speech Denoising using Deep Learning
I am training a fully connected neural network with 2 hidden layers of 1024 units each to remove noise from audio files. The network is implemented in PyTorch.

In [0]:
import librosa
import numpy as np

In [3]:
### Mounting Google Drive to take inputs ###
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
import torch
import torch.nn.functional as F 
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Questions 2.1 - 2.7
The wav files are loaded from Google Drive using the given commands, and a Short-Time Fourier Transform (STFT) is applied to both the clean and the dirty audio. This gives two complex-valued matrices of size 513 x 2459: 513 frequency bins (n_fft = 1024 gives 1024/2 + 1 bins) by 2459 time frames. Each complex entry encodes both the magnitude and the phase of one frequency component in one frame. The network works on magnitudes only, so we take the absolute values of these matrices, which give the signal strength. The phase of the noisy input is reused later, at reconstruction time.
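
As a small illustration of the magnitude/phase split, the sketch below uses a single made-up complex value standing in for one STFT coefficient (the value and the variable names are assumptions for illustration, not taken from the audio). It shows that the absolute value discards the phase, and that magnitude and phase together recover the original number.

```python
import cmath

# Hypothetical STFT coefficient for one frequency bin in one frame
coefficient = complex(3.0, -4.0)

magnitude = abs(coefficient)       # signal strength, the part the network sees
phase = cmath.phase(coefficient)   # angle in radians, lost when taking abs()

# Magnitude and phase together rebuild the complex coefficient
rebuilt = cmath.rect(magnitude, phase)

print(magnitude)
print(abs(rebuilt - coefficient) < 1e-12)
```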

In [0]:
### Reading audio files ###
s, sr1=librosa.load('/content/gdrive/My Drive/Speech Denoising/train_clean_male.wav', sr=None)
S=librosa.stft(s, n_fft=1024, hop_length=512)
sn, sr2=librosa.load('/content/gdrive/My Drive/Speech Denoising/train_dirty_male.wav', sr=None)
X=librosa.stft(sn, n_fft=1024, hop_length=512)

In [31]:
### Checking the dimensions of the input files ###
print (S.shape)
print (X.shape)

(513, 2459)
(513, 2459)
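
The printed shape can be checked by hand. The sketch below assumes librosa's default centered framing, where the number of bins is n_fft/2 + 1 and the number of frames is 1 + n_samples // hop_length. The helper name `stft_shape` is hypothetical, and the sample count is an assumption: the true length of the training file is not shown here, so a length consistent with 2459 frames is used.

```python
def stft_shape(n_samples, n_fft=1024, hop_length=512):
    """Expected (bins, frames) of an STFT, assuming centered framing."""
    n_bins = n_fft // 2 + 1
    n_frames = 1 + n_samples // hop_length
    return n_bins, n_frames

# Assumed length: any value from 2458 * 512 up to 2459 * 512 - 1 gives 2459 frames
example_samples = 2458 * 512
shape = stft_shape(example_samples)
print(shape)
```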


In [0]:
### Taking the absolute values of the input matrices and converting them to tensors ###
abs_S = np.abs(S.T)
tensor_S = torch.from_numpy(abs_S).float()
abs_X = np.abs(X.T)
tensor_X = torch.from_numpy(abs_X).float()

# Questions 2.8 - 2.9
Defining a fully connected neural network with 1 input layer, 2 hidden layers, and 1 output layer, i.e. four linear layers of sizes 513 -> 1024 -> 1024 -> 1024 -> 513. Results with Xavier initialization were slightly better than without it, so I've used it in my model. ReLU is the activation function for all layers, including the last one, since the output is a magnitude spectrogram and must be non-negative.

In [0]:
### Input dimensions D_in = input size, D_out = output size, H is the dimensions of Hidden layers ###
D_in = 513
D_out = 513
H = 1024

In [0]:
### Creating a Network of Input layer - 2 Hidden layers - Output layer ###
class Network(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    super(Network, self).__init__()
    self.linear1 = torch.nn.Linear(D_in,H)
    torch.nn.init.xavier_uniform_(self.linear1.weight)    
    self.linear2 = torch.nn.Linear(H,H)
    torch.nn.init.xavier_uniform_(self.linear2.weight)
    self.linear3 = torch.nn.Linear(H,H)
    torch.nn.init.xavier_uniform_(self.linear3.weight)
    self.linear4 = torch.nn.Linear(H,D_out)
    torch.nn.init.xavier_uniform_(self.linear4.weight)    

  def forward(self, x):
    output1 = F.relu(self.linear1(x))
    output2 = F.relu(self.linear2(output1))
    output3 = F.relu(self.linear3(output2))
    output = F.relu(self.linear4(output3))
    return output
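
For reference, `xavier_uniform_` with its default gain of 1 draws weights uniformly from [-a, a], where a = sqrt(6 / (fan_in + fan_out)). The sketch below evaluates that bound for the layer sizes used in this network. The helper name `xavier_uniform_bound` is made up for this illustration.

```python
import math

def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Half-width a of the uniform range [-a, a] used by Xavier initialization."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))

# (in_features, out_features) of linear1 .. linear4
layer_sizes = [(513, 1024), (1024, 1024), (1024, 1024), (1024, 513)]
bounds = [xavier_uniform_bound(fan_in, fan_out) for fan_in, fan_out in layer_sizes]

for (fan_in, fan_out), bound in zip(layer_sizes, bounds):
    print(fan_in, fan_out, round(bound, 4))
```

The wider 1024 x 1024 layers get a smaller range than the first and last layers, which keeps the variance of the activations roughly comparable from layer to layer.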

Since the network is small, I train the model for 1000 epochs with a learning rate of 0.0001. Each epoch is a single full-batch update over all 2459 frames. Because denoising is essentially a regression problem, I use mean squared error as the loss function together with the Adam optimizer.
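
To make the loss concrete: with its default settings, `torch.nn.MSELoss` averages the squared differences over every element of the prediction. The sketch below reproduces that computation in plain Python on two tiny made-up magnitude matrices. The helper `mean_squared_error` and the numbers are illustrative assumptions.

```python
def mean_squared_error(predicted, target):
    """Average of squared element-wise differences over all entries."""
    squared_errors = [
        (p - t) ** 2
        for p_row, t_row in zip(predicted, target)
        for p, t in zip(p_row, t_row)
    ]
    return sum(squared_errors) / len(squared_errors)

# Made-up 2 x 2 "spectrogram" magnitudes (rows = frames, columns = bins)
predicted = [[0.5, 0.0], [1.0, 2.0]]
target = [[0.0, 0.0], [1.0, 1.0]]

loss = mean_squared_error(predicted, target)
print(loss)
```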

In [35]:
### Setting Parameters ###
epochs = 1000
learning_rate = 0.0001
model = Network(D_in, H, D_out).to(device)
lossFunction = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
print (model)

Network(
  (linear1): Linear(in_features=513, out_features=1024, bias=True)
  (linear2): Linear(in_features=1024, out_features=1024, bias=True)
  (linear3): Linear(in_features=1024, out_features=1024, bias=True)
  (linear4): Linear(in_features=1024, out_features=513, bias=True)
)
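
The size of this model can be worked out from the printed layers: each `Linear` layer holds in_features x out_features weights plus out_features biases. A quick sketch of that arithmetic (the helper name `linear_parameter_count` is hypothetical):

```python
def linear_parameter_count(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# (in_features, out_features) of linear1 .. linear4, as printed above
layers = [(513, 1024), (1024, 1024), (1024, 1024), (1024, 513)]
total_parameters = sum(linear_parameter_count(i, o) for i, o in layers)
print(total_parameters)  # 3151361
```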


In [36]:
### Training the model ###
# Move the data to the device once; every epoch is a single full-batch update
x = tensor_X.to(device)
y = tensor_S.to(device)
for epoch in range(epochs):
  y_pred = model(x)
  loss = lossFunction(y_pred, y)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()
  if epoch % 10 == 0:
    print('Epoch: ',epoch,'Loss', loss.item())

Epoch:  0 Loss 0.09541206061840057
Epoch:  10 Loss 0.0667504295706749
Epoch:  20 Loss 0.05063803121447563
Epoch:  30 Loss 0.03687821701169014
Epoch:  40 Loss 0.02631743997335434
Epoch:  50 Loss 0.01930518075823784
Epoch:  60 Loss 0.014744609594345093
Epoch:  70 Loss 0.011920355260372162
Epoch:  80 Loss 0.010011102072894573
Epoch:  90 Loss 0.00869621429592371
Epoch:  100 Loss 0.007746441755443811
Epoch:  110 Loss 0.0070236120373010635
Epoch:  120 Loss 0.006456894334405661
Epoch:  130 Loss 0.005994929000735283
Epoch:  140 Loss 0.005609826184809208
Epoch:  150 Loss 0.005279646720737219
Epoch:  160 Loss 0.004996515344828367
Epoch:  170 Loss 0.00475503271445632
Epoch:  180 Loss 0.004543407820165157
Epoch:  190 Loss 0.004359397105872631
Epoch:  200 Loss 0.004193364642560482
Epoch:  210 Loss 0.004041930660605431
Epoch:  220 Loss 0.003902291879057884
Epoch:  230 Loss 0.0037823847960680723
Epoch:  240 Loss 0.0036744815297424793
Epoch:  250 Loss 0.0035720346495509148
Epoch:  260 Loss 0.003475579

# Question 2.10
The test audio files, which contain the noise of someone eating chips, are imported. STFT is applied to them, and the absolute values are taken to obtain the magnitude spectrograms that the network expects as input.

In [0]:
### Importing the test audio files for testing ###
test1, sr1=librosa.load('/content/gdrive/My Drive/Speech Denoising/test_x_01.wav', sr=None)
test_1=librosa.stft(test1, n_fft=1024, hop_length=512)
test2, sr2=librosa.load('/content/gdrive/My Drive/Speech Denoising/test_x_02.wav', sr=None)
test_2=librosa.stft(test2, n_fft=1024, hop_length=512)

In [0]:
### Taking absolute values of the Test files and converting them to tensors ###
test_1_abs = np.abs(test_1.T)
tensor_test_1 = torch.from_numpy(test_1_abs).float()
test_2_abs = np.abs(test_2.T)
tensor_test_2 = torch.from_numpy(test_2_abs).float()

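The dirty test magnitudes, converted from NumPy matrices to Torch tensors, are passed through the model. The predicted clean magnitudes are then combined with the phase of the noisy input through a Hadamard (element-wise) product, since the network itself outputs no phase information.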

In [0]:
### Converting the outputs to NumPy arrays for further calculation ###
test_1_pred = model(tensor_test_1.to(device))
test_1_pred = test_1_pred.detach().cpu().numpy()
test_2_pred = model(tensor_test_2.to(device))
test_2_pred = test_2_pred.detach().cpu().numpy()

In [0]:
### Hadamard Product ###
test_1_spect = np.multiply((test_1/test_1_abs.T),test_1_pred.T)
test_2_spect = np.multiply((test_2/test_2_abs.T), test_2_pred.T)
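
One thing to watch in this step: dividing by the magnitude yields NaN wherever a noisy bin is exactly zero. A minimal sketch of the same phase restoration with a small floor on the denominator follows. The helper `restore_phase`, the `eps` value, and the toy arrays are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def restore_phase(noisy_spectrogram, predicted_magnitude, eps=1e-8):
    """Attach the phase of the noisy STFT to the predicted magnitudes."""
    noisy_magnitude = np.abs(noisy_spectrogram)
    # Unit-magnitude complex numbers carrying only the phase;
    # the floor avoids 0/0 in silent bins
    unit_phase = noisy_spectrogram / np.maximum(noisy_magnitude, eps)
    return unit_phase * predicted_magnitude  # element-wise (Hadamard) product

# Toy 2 x 2 example: one bin is exactly zero
noisy = np.array([[3 + 4j, 0 + 0j], [0 - 2j, 1 + 0j]])
predicted = np.array([[10.0, 1.0], [1.0, 0.5]])

cleaned = restore_phase(noisy, predicted)
print(cleaned)
```

With this guard, a bin whose noisy value is exactly zero stays zero in the output instead of becoming NaN; all other bins take the predicted magnitude and the noisy phase.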

# Questions 2.11 - 2.12
The output of the Hadamard product is a complex spectrogram, which is inverted to reconstruct clean audio for the dirty test inputs. The resulting wav files are exported to Google Drive.

In [0]:
import IPython.display as ipd

In [0]:
### Inverse Fourier Transform ###
test_1_inv_stft = librosa.istft(test_1_spect, win_length= 1024, hop_length=512)
test_2_inv_stft = librosa.istft(test_2_spect, win_length= 1024, hop_length=512)
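
A note on output length: with centered framing (assumed here to be the librosa default) and no explicit `length` argument, the inverse STFT returns hop_length * (n_frames - 1) samples, so the reconstruction can be a few samples shorter than the original recording. The sketch below works through that arithmetic. The helper names and the sample count are hypothetical, since the lengths of the test files are not shown.

```python
def stft_frame_count(n_samples, hop_length=512):
    """Number of STFT frames, assuming centered framing."""
    return 1 + n_samples // hop_length

def istft_length(n_frames, hop_length=512):
    """Samples returned by the inverse STFT when no target length is given."""
    return hop_length * (n_frames - 1)

# Assumed input length that is not a multiple of the hop size
original_samples = 100_000
frames = stft_frame_count(original_samples)
reconstructed_samples = istft_length(frames)

print(frames, reconstructed_samples, original_samples - reconstructed_samples)
```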

In [0]:
### Exporting both the test outputs as wav files to drive ###
# librosa.output.write_wav was removed in librosa 0.8; soundfile.write is its replacement
import soundfile as sf
# Each file is written with the sample rate it was loaded with
sf.write('/content/gdrive/My Drive/Speech Denoising/test_1_reconstructed.wav', test_1_inv_stft, sr1)
sf.write('/content/gdrive/My Drive/Speech Denoising/test_2_reconstructed.wav', test_2_inv_stft, sr2)

In [46]:
ipd.Audio(test_1_inv_stft, rate=sr1)

In [47]:
ipd.Audio(test_2_inv_stft, rate=sr2)

# References
I used examples from the following websites as references for implementing the neural network and reconstructing the audio:

https://pytorch.org/tutorials/beginner/pytorch_with_examples.html

https://musicinformationretrieval.com/ipython_audio.html