# Problem 1: Speech Denoising using 1D CNN
I am training a 1D CNN with two convolutional layers followed by fully connected layers of 2000 hidden units to remove noise from audio files. The network is implemented in PyTorch.

In [0]:
import librosa
import numpy as np

In [2]:
### Mounting Google Drive to take inputs ###
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
import torch
import torch.nn.functional as F 
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Questions 1.1 - 1.4
Importing the wav files from Google Drive using the given commands and applying the Short-Time Fourier Transform (STFT) to both the clean and the dirty audio files. This gives two complex-valued matrices of size 513 x 2459 (frequency bins x time frames). The STFT moves each frame of the waveform into the frequency domain, where every entry encodes both a magnitude and a phase. The network only works with real values, so we take the absolute value of each matrix, which gives the signal strength per frequency bin, and keep the phase aside for reconstruction later.
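
As a small illustration of the magnitude/phase split described above, here is a minimal sketch. The matrix `Z` is a hypothetical stand-in for an STFT output, not the real audio data, and the bin count assumes the one-sided STFT with `n_fft=1024` used in this notebook.

```python
import numpy as np

n_fft = 1024
# A one-sided STFT keeps only the non-negative frequency bins
n_bins = 1 + n_fft // 2

# Hypothetical stand-in for a small STFT matrix (not the real audio data)
Z = np.array([[3 + 4j, 1 - 1j],
              [0 + 2j, -5 + 0j]])

magnitude = np.abs(Z)   # signal strength per bin, real and non-negative
phase = np.angle(Z)     # phase angle in radians

# Magnitude and phase together recover the original complex values
Z_rebuilt = magnitude * np.exp(1j * phase)

print(n_bins)           # 513
print(magnitude[0, 0])  # 5.0
```

This is why only `np.abs` of the spectrogram is fed to the network: the phase is not discarded for good, it is reused when the denoised magnitude is turned back into audio.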


In [0]:
### Reading audio files ###
s, sr1=librosa.load('/content/gdrive/My Drive/Speech Denoising/train_clean_male.wav', sr=None)
S=librosa.stft(s, n_fft=1024, hop_length=512)
sn, sr2=librosa.load('/content/gdrive/My Drive/Speech Denoising/train_dirty_male.wav', sr=None)
X=librosa.stft(sn, n_fft=1024, hop_length=512)

In [5]:
### Checking the dimensions of the input files ###
print (S.shape)
print (X.shape)

(513, 2459)
(513, 2459)


In [0]:
### Taking the absolute values of the input matrices and converting them to tensors ###
abs_S = np.abs(S.T)
tensor_S = torch.from_numpy(abs_S).float()
abs_X = np.abs(X.T)
tensor_X = torch.from_numpy(abs_X).float()

In [7]:
tensor_X.shape

torch.Size([2459, 513])

# Questions 1.5 - 1.11
Defining a network with 2 convolutional layers followed by fully connected layers with 2000 hidden units. Results with and without Xavier initialization were almost the same, so I have not used it in my model. I am using ReLU as the activation function for all the fully connected layers, including the last one, since we want only non-negative values as output.

A kernel of size 3 (1D, so no height) gave me better results than a kernel of size 2. Each convolution with kernel size D and stride 1 shortens an input of length N to N-D+1, so the 513 bins become 511 after the first layer and 509 after the second. With 3 output channels, the input to the fully connected layer is 3 * 509 = 1527 features. I have kept the number of kernels at 3 since having more kernels didn't improve the test signal by much.

A stride of 1 with no max pooling generated better results than a stride > 1. The function 'num_flat_features' computes the number of features per sample in the convolution output, which is used to flatten it before the fully connected layers.
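
The size of the fully connected input can be checked with a few lines of arithmetic. This is a minimal sketch; `conv1d_out_len` is a hypothetical helper written for this illustration (it is not part of PyTorch) and it assumes no padding and no dilation, as in the model below.

```python
def conv1d_out_len(n, kernel_size, stride=1, padding=0):
    """Output length of a 1D convolution without dilation."""
    return (n + 2 * padding - kernel_size) // stride + 1

n_bins = 513       # frequency bins per STFT frame
out_channels = 3   # number of kernels in each convolutional layer

after_conv1 = conv1d_out_len(n_bins, kernel_size=3)       # 511
after_conv2 = conv1d_out_len(after_conv1, kernel_size=3)  # 509
flat_features = out_channels * after_conv2                # 1527

print(after_conv1, after_conv2, flat_features)
```

The same helper shows why a larger stride or pooling would change the `in_features` of the first linear layer, so that value has to be recomputed whenever the convolutional settings change.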

In [0]:
### Creating a network of 2 convolutional layers followed by 3 fully connected layers ###
class Network_1D_CNN(torch.nn.Module):
  def __init__(self):
    super(Network_1D_CNN, self).__init__()
    self.conv1 = torch.nn.Conv1d(in_channels=1,out_channels=3,kernel_size=3,stride=1)
    self.conv2 = torch.nn.Conv1d(in_channels=3,out_channels=3,kernel_size=3,stride=1)
    ### defining fully connected layers after convolutions ###
    self.fc1 = torch.nn.Linear(3*509,2000)
    self.fc2 = torch.nn.Linear(2000,2000)
    self.fc3 = torch.nn.Linear(2000,513)

  def forward(self, x):
    # output1 = F.max_pool1d(F.relu(self.conv1(x)),2)
    output1 = self.conv1(x)
    # output1 = F.max_pool1d(output1,2)
    output2 = self.conv2(output1)
    output = output2.view(-1, self.num_flat_features(output2))
    output = F.relu(self.fc1(output))
    output = F.relu(self.fc2(output))
    output = F.relu(self.fc3(output))
    return output

  def num_flat_features(self, x):
    size = x.size()[1:]
    num_features = 1
    for s in size:
        num_features *= s
    return num_features

Since the network does not contain a lot of layers, I am training the model for 1000 epochs with a learning rate of 0.0001. Since this is essentially a regression problem, I am using mean squared error as my loss function along with the Adam optimizer.

In [10]:
### Setting Parameters ###
epochs = 1000
learning_rate = 0.0001
model = Network_1D_CNN().to(device)
lossFunction = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
print (model)

Network_1D_CNN(
  (conv1): Conv1d(1, 3, kernel_size=(3,), stride=(1,))
  (conv2): Conv1d(3, 3, kernel_size=(3,), stride=(1,))
  (fc1): Linear(in_features=1527, out_features=2000, bias=True)
  (fc2): Linear(in_features=2000, out_features=2000, bias=True)
  (fc3): Linear(in_features=2000, out_features=513, bias=True)
)
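
To put the layer sizes printed above in perspective, the number of trainable parameters can be worked out by hand. This is a rough sketch; `conv1d_params` and `linear_params` are hypothetical helpers for the arithmetic, and they assume every layer has a bias term, as the printout indicates for the linear layers and as is the default for `Conv1d`.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel."""
    return in_channels * out_channels * kernel_size + out_channels

def linear_params(in_features, out_features):
    """Weight matrix plus one bias per output unit."""
    return in_features * out_features + out_features

counts = {
    "conv1": conv1d_params(1, 3, 3),
    "conv2": conv1d_params(3, 3, 3),
    "fc1": linear_params(1527, 2000),
    "fc2": linear_params(2000, 2000),
    "fc3": linear_params(2000, 513),
}
total = sum(counts.values())

for name, n in counts.items():
    print(name, n)
print("total", total)
```

Almost all of the parameters sit in the fully connected layers; the two convolutional layers contribute only 42 of them.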


In [11]:
### Training the model (full batch: every frame is used in each step) ###
### Use view to change dimensions.. Now dimensions represent [batch_size, in_channel, len] ###
# The data does not change between epochs, so reshape and move it to the device once
x = tensor_X.view(-1, 1, 513).to(device)
y = tensor_S.to(device)
for epoch in range(epochs):
  y_pred = model(x)
  loss = lossFunction(y_pred, y)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()
  if epoch % 10 == 0:
    print('Epoch: ',epoch,'Loss', loss.item())

Epoch:  0 Loss 0.09810544550418854
Epoch:  10 Loss 0.08297958225011826
Epoch:  20 Loss 0.06617357581853867
Epoch:  30 Loss 0.05639234930276871
Epoch:  40 Loss 0.04786136373877525
Epoch:  50 Loss 0.0392877571284771
Epoch:  60 Loss 0.032454896718263626
Epoch:  70 Loss 0.027367262169718742
Epoch:  80 Loss 0.023311441764235497
Epoch:  90 Loss 0.02034125290811062
Epoch:  100 Loss 0.018216770142316818
Epoch:  110 Loss 0.016503386199474335
Epoch:  120 Loss 0.015424354933202267
Epoch:  130 Loss 0.01436252985149622
Epoch:  140 Loss 0.013568366877734661
Epoch:  150 Loss 0.01291967835277319
Epoch:  160 Loss 0.012386624701321125
Epoch:  170 Loss 0.011909890919923782
Epoch:  180 Loss 0.011517227627336979
Epoch:  190 Loss 0.01112136710435152
Epoch:  200 Loss 0.010810142382979393
Epoch:  210 Loss 0.010535331442952156
Epoch:  220 Loss 0.010304056107997894
Epoch:  230 Loss 0.010003332048654556
Epoch:  240 Loss 0.009768974967300892
Epoch:  250 Loss 0.009551137685775757
Epoch:  260 Loss 0.009338494390249

# Question 1.12
Importing the test audio files, which contain the noise of someone eating chips, applying the STFT to them, and taking the absolute values to get real-valued magnitudes. The audio files generated by the 1D CNN sound better than the ones generated by the fully connected network alone.

In [0]:
### Importing the test audio files for testing ###
test1, sr1=librosa.load('/content/gdrive/My Drive/Speech Denoising/test_x_01.wav', sr=None)
test_1=librosa.stft(test1, n_fft=1024, hop_length=512)
test2, sr2=librosa.load('/content/gdrive/My Drive/Speech Denoising/test_x_02.wav', sr=None)
test_2=librosa.stft(test2, n_fft=1024, hop_length=512)

In [0]:
### Taking absolute values of the Test files and converting them to tensors ###
test_1_abs = np.abs(test_1.T)
tensor_test_1 = torch.from_numpy(test_1_abs).float()
# -1 infers the number of frames (142 here) instead of hard-coding it
tensor_test_1 = tensor_test_1.view(-1, 1, 513)
test_2_abs = np.abs(test_2.T)
tensor_test_2 = torch.from_numpy(test_2_abs).float()
# -1 infers the number of frames (380 here)
tensor_test_2 = tensor_test_2.view(-1, 1, 513)

In [0]:
### Converting the outputs to numpy arrays for further calculation ###
# Inference only, so gradient tracking is switched off
with torch.no_grad():
  test_1_pred = model(tensor_test_1.to(device)).cpu().numpy()
  test_2_pred = model(tensor_test_2.to(device)).cpu().numpy()

In [0]:
### Hadamard product: noisy phase (test / |test|) times the predicted magnitude ###
test_1_spect = np.multiply((test_1/test_1_abs.T),test_1_pred.T)
test_2_spect = np.multiply((test_2/test_2_abs.T), test_2_pred.T)
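
One caveat with dividing by the magnitude is that any bin whose magnitude is exactly zero produces NaN. A possible alternative, shown here only as a sketch and not as the method used in this notebook, is to take the phase with `np.angle`, which returns 0 for a zero entry. The function name `apply_predicted_magnitude` and the small arrays are hypothetical stand-ins for the real spectrograms.

```python
import numpy as np

def apply_predicted_magnitude(noisy_stft, predicted_magnitude):
    """Combine the phase of the noisy STFT with a predicted magnitude, elementwise."""
    # exp(1j * angle) has unit modulus everywhere; np.angle(0) is 0, so no NaN appears
    phase = np.exp(1j * np.angle(noisy_stft))
    return phase * predicted_magnitude

# Hypothetical stand-ins: one bin of the noisy spectrogram is exactly zero
noisy = np.array([[3 + 4j, 0 + 0j],
                  [0 - 2j, 1 + 1j]])
predicted = np.array([[10.0, 1.0],
                      [4.0, 0.0]])

result = apply_predicted_magnitude(noisy, predicted)
print(np.isnan(result).any())
print(np.abs(result))
```

For every non-zero bin this gives the same result as the division above, since `test / |test|` and `exp(1j * angle(test))` are the same unit-modulus number.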

In [0]:
import IPython.display as ipd

In [0]:
### Inverse Fourier Transform ###
test_1_inv_stft = librosa.istft(test_1_spect, win_length= 1024, hop_length=512)
test_2_inv_stft = librosa.istft(test_2_spect, win_length= 1024, hop_length=512)

In [0]:
### Exporting both the test outputs as wav files to drive ###
# librosa.output.write_wav was removed in librosa 0.8; soundfile.write does the same job
import soundfile as sf
# sr1 and sr2 hold the sample rates of test_x_01 and test_x_02 respectively
sf.write('/content/gdrive/My Drive/Speech Denoising/HW_2_prob1_test_1_reconstructed.wav', test_1_inv_stft, sr1)
sf.write('/content/gdrive/My Drive/Speech Denoising/HW_2_prob1_test_2_reconstructed.wav', test_2_inv_stft, sr2)

In [19]:
ipd.Audio(test_1_inv_stft, rate=sr1)

In [20]:
ipd.Audio(test_2_inv_stft, rate=sr2)