# Problem 2: Speech Denoising using 2D CNN
I am training a 2D CNN with two convolutional layers followed by a fully connected network (two hidden layers of 2000 units in the code below) to remove noise from audio files. The network is implemented in PyTorch.

In [0]:
import librosa
import numpy as np
import IPython.display as ipd

In [2]:
### Mounting Google Drive to take inputs ###
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
import torch
import torch.nn.functional as F 
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Questions 2.1
Importing the wav files from Google Drive with the given commands and applying the Short-Time Fourier Transform (STFT) to both the clean and dirty audio files. This gives two complex-valued matrices of size 513 x 2459. Each entry encodes the amplitude and phase of one frequency bin in one time frame. The network works on real-valued inputs, so we take the absolute values of these matrices, which gives the magnitude (signal strength) and sets the phase aside. The phase of the noisy signal is reused later when reconstructing the audio.
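As a small illustration of the magnitude/phase split described above, the sketch below uses only the Python standard library on a single made-up STFT coefficient. `n_freq_bins` and `split_polar` are hypothetical helper names, not part of librosa. It also shows where the 513 rows come from, assuming the one-sided spectrum of a real signal with `n_fft=1024`.

```python
import cmath


def n_freq_bins(n_fft):
    """Number of rows in a one-sided STFT of a real signal."""
    return 1 + n_fft // 2


def split_polar(z):
    """Split one complex STFT coefficient into (magnitude, phase in radians)."""
    return abs(z), cmath.phase(z)


# One made-up coefficient, standing in for a single entry of S or X.
z = complex(3.0, -4.0)
magnitude, phase = split_polar(z)

# The magnitude is real and non-negative; the phase is a real angle.
# Together they recover the original complex value.
rebuilt = cmath.rect(magnitude, phase)

print(n_freq_bins(1024))          # 513 rows, matching S.shape[0]
print(magnitude)                  # 5.0
print(abs(rebuilt - z) < 1e-12)   # True
```

Taking `np.abs` keeps only the first element of this pair, which is why the phase has to come from somewhere else at reconstruction time.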


In [0]:
### Reading audio files ###
s, sr1=librosa.load('/content/gdrive/My Drive/Speech Denoising/train_clean_male.wav', sr=None)
S=librosa.stft(s, n_fft=1024, hop_length=512)
sn, sr2=librosa.load('/content/gdrive/My Drive/Speech Denoising/train_dirty_male.wav', sr=None)
X=librosa.stft(sn, n_fft=1024, hop_length=512)

In [5]:
### Checking the dimensions of the input files ###
print (S.shape)
print (X.shape)

(513, 2459)
(513, 2459)


In [0]:
### Taking the absolute values of the input matrices and converting them to tensors ###
abs_S = np.abs(S.T)
tensor_S = torch.from_numpy(abs_S).float()
abs_X = np.abs(X.T)
tensor_X = torch.from_numpy(abs_X).float()

# Questions 2.2 - 2.3
Defining and generating images from the audio signal. I first prepend 19 rows to the Fourier-transformed version of the 'train_dirty_male' audio file. The question calls for 19 silent frames; here they are filled with random values from `torch.rand`. This makes it possible to create 2459 images of size 20x513, one per original frame. The output of the convolutional and fully connected layers for each image is compared with the corresponding row of the Fourier-transformed version of the 'train_clean_male' audio file.
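The framing described above can be sketched with plain Python lists on a toy signal. This is only an illustration of the indexing, with hypothetical names (`make_windows`, `context`). It pads with zero rows, which is an assumption matching the "silent frames" wording rather than the random rows used in this notebook.

```python
def make_windows(frames, context=19, pad_value=0.0):
    """Return one window of (context + 1) consecutive frames per input frame.

    Window i ends at input frame i, so it can be paired with clean frame i.
    """
    n_bins = len(frames[0])
    padding = [[pad_value] * n_bins for _ in range(context)]
    padded = padding + frames
    return [padded[i:i + context + 1] for i in range(len(frames))]


# Toy spectrogram: 5 frames with 3 frequency bins each.
toy = [[float(t)] * 3 for t in range(1, 6)]
windows = make_windows(toy, context=2)

print(len(windows))      # 5 windows, one per frame
print(len(windows[0]))   # 3 rows per window (context + 1)
print(windows[0][-1])    # [1.0, 1.0, 1.0], i.e. frame 0 is the last row of window 0
```

With `context=19` and 513 bins per frame, each window has the 20x513 shape used for the network input.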


In [7]:
### Original shape of the audio 2459x513 ###
tensor_X.shape

torch.Size([2459, 513])

The 'extra' variable below contains a 19x513 tensor, which is prepended to the original dirty dataset to form 'new_x'. The shape of this new input tensor is 2478x513 because of the 19 extra frames.

In [8]:
### Shape after adding 19 frames ###
extra = torch.rand(19,513)
new_x = torch.cat((extra, tensor_X),0)
new_x.shape

torch.Size([2478, 513])

In [0]:
# ### Input dimensions D_in = input size, D_out = output size, H is the dimensions of Hidden layers ###
# D_in = 513
# D_out = 513
# H = 1024

I have created a list called 'train' that holds all 2459 images of size 20x513, so its length is 2459. These images are then loaded into a PyTorch DataLoader, which feeds them to the network.
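Because the batch size used here equals the number of images, the loader yields a single batch per epoch. If a smaller batch size were used, each batch would need the matching slice of clean frames as its target. A minimal sketch of that index arithmetic, assuming the loader does not shuffle; `batch_slices` is a hypothetical helper, not a PyTorch function:

```python
def batch_slices(n_items, batch_size):
    """Return (start, stop) index pairs that cover n_items in order."""
    return [(start, min(start + batch_size, n_items))
            for start in range(0, n_items, batch_size)]


print(batch_slices(2459, 2459))   # [(0, 2459)]
print(batch_slices(2459, 1000))   # [(0, 1000), (1000, 2000), (2000, 2459)]
```

Batch number `i` then pairs with the clean frames in the `i`-th `(start, stop)` range.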

In [0]:
train = []
for i in range(2459):
  train.append(new_x[i:i+20])
train_data = torch.utils.data.DataLoader(train, batch_size=2459)

# Questions 2.4 - 2.5
I settled on a network with 2 convolutional layers, each with 3 kernels (output channels) of size 2x2 and stride 1. I also tried networks with kernel sizes of 8x8 and 16x16 and with 8 or 16 channels, but these reached a very low loss on the training data while performing poorly on the test dataset, which suggests overfitting. Since padding is not allowed for this homework, I used smaller kernels with a stride of 1, which gave much better results on the test data.

I also think that max pooling and a larger stride reduced the quality of the test output, so I did not use them.
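The input size of the first fully connected layer follows from the unpadded convolution arithmetic. A short sketch of that calculation in plain Python; `conv_out` is a hypothetical helper that applies the usual (N - K) / stride + 1 rule along one axis:

```python
def conv_out(size, kernel, stride=1):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


height, width, channels = 20, 513, 3

# Two stacked 2x2 convolutions with stride 1 and no padding.
for _ in range(2):
    height = conv_out(height, kernel=2)
    width = conv_out(width, kernel=2)

flat_features = channels * height * width
print(height, width, flat_features)   # 18 511 27594
```

Each 2x2 layer trims one row and one column, so the 20x513 image becomes 18x511 per channel before it is flattened.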

In [0]:
### Creating a network of 2 convolutional layers followed by fully connected layers ###
class Network_2D_CNN(torch.nn.Module):
  def __init__(self):
    super(Network_2D_CNN, self).__init__()
    # Two unpadded 2x2 convolutions: 1x20x513 -> 3x19x512 -> 3x18x511
    self.conv1 = torch.nn.Conv2d(in_channels=1,out_channels=3,kernel_size=(2,2),stride=1)
    self.conv2 = torch.nn.Conv2d(in_channels=3,out_channels=3,kernel_size=(2,2),stride=1)
    ### Fully connected input size is (N-D+1 for height)*channels*(N-D+1 for width), applied once per conv layer ###
    self.fc1 = torch.nn.Linear(18*3*511,2000)
    self.fc2 = torch.nn.Linear(2000,2000)
    self.fc3 = torch.nn.Linear(2000,513)

  def forward(self,x):
    output1 = self.conv1(x)
    output2 = self.conv2(output1)
    # Flatten everything except the batch dimension
    output = output2.view(-1, self.num_flat_features(output2))
    output = F.relu(self.fc1(output))
    output = F.relu(self.fc2(output))
    # The final ReLU keeps the predicted magnitudes non-negative
    output = F.relu(self.fc3(output))
    return output

  def num_flat_features(self, x):
    size = x.size()[1:]  # all dimensions except the batch dimension
    num_features = 1
    for s in size:
      num_features *= s
    return num_features

In [13]:
### Setting Parameters ###
epochs = 900
learning_rate = 0.0001
model = Network_2D_CNN().to(device)
lossFunction = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
print (model)

Network_2D_CNN(
  (conv1): Conv2d(1, 3, kernel_size=(2, 2), stride=(1, 1))
  (conv2): Conv2d(3, 3, kernel_size=(2, 2), stride=(1, 1))
  (fc1): Linear(in_features=27594, out_features=2000, bias=True)
  (fc2): Linear(in_features=2000, out_features=2000, bias=True)
  (fc3): Linear(in_features=2000, out_features=513, bias=True)
)


In [15]:
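for epoch in range(0,epochs+1):
  for i,images in enumerate(train_data):
    # Add a channel dimension so each 20x513 image matches Conv2d's expected input
    x = images.view(-1,1,20,513).to(device)
    y_pred = model(x)
    # Clean target frames for batch i (a single batch here, since batch_size=2459)
    y = tensor_S[i*2459:(i+1)*2459].to(device)
    loss = lossFunction(y_pred,y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
      print('Epoch: ',epoch,'Loss', loss.item())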


Epoch:  0 Loss 0.08471981436014175
Epoch:  10 Loss 0.0782671570777893
Epoch:  20 Loss 0.07248956710100174
Epoch:  30 Loss 0.06551314890384674
Epoch:  40 Loss 0.06123145669698715
Epoch:  50 Loss 0.05766981095075607
Epoch:  60 Loss 0.053854744881391525
Epoch:  70 Loss 0.049900155514478683
Epoch:  80 Loss 0.0461595393717289
Epoch:  90 Loss 0.04258827865123749
Epoch:  100 Loss 0.0406232625246048
Epoch:  110 Loss 0.03704841434955597
Epoch:  120 Loss 0.03466994687914848
Epoch:  130 Loss 0.03277180716395378
Epoch:  140 Loss 0.031007561832666397
Epoch:  150 Loss 0.029290402308106422
Epoch:  160 Loss 0.027540117502212524
Epoch:  170 Loss 0.026323087513446808
Epoch:  180 Loss 0.02494809217751026
Epoch:  190 Loss 0.024012641981244087
Epoch:  200 Loss 0.02306288108229637
Epoch:  210 Loss 0.022219985723495483
Epoch:  220 Loss 0.021156102418899536
Epoch:  230 Loss 0.02033074013888836
Epoch:  240 Loss 0.020242134109139442
Epoch:  250 Loss 0.019286787137389183
Epoch:  260 Loss 0.018274934962391853
Epo

Checking the audio file obtained from the training data. The model reconstructs a fairly clean version of the dirty audio file.
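The reconstruction keeps the phase of the noisy STFT and swaps in the predicted magnitude, i.e. the estimate is (X / |X|) multiplied by the network output. A minimal per-coefficient sketch using only the standard library; `apply_noisy_phase` is a hypothetical name, and the `eps` guard against zero-magnitude bins is an addition that is not in the notebook's code:

```python
def apply_noisy_phase(noisy_coeff, predicted_magnitude, eps=1e-8):
    """Combine a predicted magnitude with the phase of a noisy coefficient."""
    unit_phasor = noisy_coeff / max(abs(noisy_coeff), eps)
    return unit_phasor * predicted_magnitude


noisy = complex(0.0, 2.0)   # magnitude 2, phase +90 degrees
estimate = apply_noisy_phase(noisy, predicted_magnitude=0.5)

print(estimate)                      # 0.5j: same phase, new magnitude
print(apply_noisy_phase(0j, 0.5))    # 0j: a silent bin stays silent instead of becoming NaN
```

Without a guard of this kind, a bin whose noisy magnitude is exactly zero would produce a division by zero in the element-wise version.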


In [0]:
aud = y_pred.detach().cpu().numpy()
# Reuse the phase of the noisy STFT (X/|X|) and apply the predicted magnitude
train_1_spect = np.multiply((X/abs_X.T),aud.T)
train_1_inv_stft = librosa.istft(train_1_spect, win_length= 1024, hop_length=512)

In [27]:
aud.shape

(2459, 513)

In [17]:
ipd.Audio(train_1_inv_stft, rate=sr2)

# Questions 2.5 - 2.7
Importing the test audio files, which contain the noise of someone eating chips, applying STFT to them, and taking the absolute values to obtain real-valued magnitudes. The audio generated by the 2D CNN is audible but no better than the audio generated by the 1D CNN. This suggests that a network with 1D convolutions in the initial layers followed by 2D convolutional layers might be worth trying.

In [0]:
### Importing the test audio files for testing ###
test1, sr1=librosa.load('/content/gdrive/My Drive/Speech Denoising/test_x_01.wav', sr=None)
test_aud_1=librosa.stft(test1, n_fft=1024, hop_length=512)
test2, sr2=librosa.load('/content/gdrive/My Drive/Speech Denoising/test_x_02.wav', sr=None)
test_aud_2=librosa.stft(test2, n_fft=1024, hop_length=512)

In [44]:
test_aud_2.shape

(513, 380)

In [0]:
### Taking absolute values of the Test files ###
test_1_abs = np.abs(test_aud_1.T)
tensor_test_1 = torch.from_numpy(test_1_abs).float()
test_2_abs = np.abs(test_aud_2.T)
tensor_test_2 = torch.from_numpy(test_2_abs).float()

In [0]:
### Applying the same procedure to the test files as done on the training data ###
new_test_2 = torch.cat((extra, tensor_test_2),0)
test_2 = []
for i in range(380):
  test_2.append(new_test_2[i:i+20])
test_data_2 = torch.utils.data.DataLoader(test_2, batch_size=380)

new_test_1 = torch.cat((extra, tensor_test_1),0)
test_1 = []
for i in range(142):
  test_1.append(new_test_1[i:i+20])
test_data_1 = torch.utils.data.DataLoader(test_1, batch_size=142)

### Reconstructing the First test audio file

In [0]:
for i,images in enumerate(test_data_1):
  x = images.view(-1,1,20,513).to(device)
  test_1_pred = model(x)

In [0]:
test_pred = test_1_pred.detach().cpu().numpy()
test_1_spect = np.multiply((test_aud_1/test_1_abs.T), test_pred.T)
test_1_inv_stft = librosa.istft(test_1_spect, win_length= 1024, hop_length=512)

In [49]:
ipd.Audio(test_1_inv_stft, rate=sr1)  # sr1 is the sample rate of test_x_01.wav

In [0]:
### Exporting the test outputs as wav files to drive ###
# librosa.output.write_wav was removed in librosa 0.8; soundfile.write is used instead
import soundfile as sf
sf.write('/content/gdrive/My Drive/Speech Denoising/HW_2_prob_2_test_1_reconstructed.wav', test_1_inv_stft, sr1)

### Reconstructing the Second test audio file

In [0]:
for i,images in enumerate(test_data_2):
  x = images.view(-1,1,20,513).to(device)
  test_2_pred = model(x)

In [0]:
test_2_pred = test_2_pred.detach().cpu().numpy()
test_2_spect = np.multiply((test_aud_2/test_2_abs.T), test_2_pred.T)
test_2_inv_stft = librosa.istft(test_2_spect, win_length= 1024, hop_length=512)

In [55]:
ipd.Audio(test_2_inv_stft, rate=sr2)

In [0]:
### Exporting the test outputs as wav files to drive ###
# librosa.output.write_wav was removed in librosa 0.8; soundfile.write is used instead
import soundfile as sf
sf.write('/content/gdrive/My Drive/Speech Denoising/HW_2_prob_2_test_2_reconstructed.wav', test_2_inv_stft, sr2)