## Data Preparation and Pre-Processing

Real-world audio is often noisy and inconsistent in format, so we apply the following pre-processing steps to bring all the samples into a standard form.

- Resampling and normalization of all the audio samples: Since the sampling rates of the audio files may differ, we resample every file to a single common sampling frequency.

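- Padding and cropping: Samples are brought to a fixed length by zero-padding short clips and cropping long ones at a random offset, which also serves as a simple form of data augmentation.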

- Removing any dead samples (those with negligible frequency content) from the dataset.

- Denoising: We use the spectral subtraction method to reduce background noise. This method is based on spectral averaging and residual noise reduction, is widely used for enhancing noisy speech signals, and can remove stationary noise from the recording. The Wiener filter is another option.

- Spectrogram features: We prepare spectrogram images from the linear short-time Fourier transform (STFT) or from log-mel filter bank features in Python. Librosa is the preferred library for computing these feature representations.

- Contrast Enhancement: In our past experience, histogram equalization has worked better than capping the extreme values to mean ± 1.5 std.

- Focusing on the Low-Frequency Spectrum: The whale calls are normally found in the lower part of the spectrum, roughly 100 Hz to 200 Hz. Restricting the spectrogram to this band lets the CNN see only the relevant part of the image; the rest of the image is effectively noise (or irrelevant) for the CNN.

- Signal-to-Noise Ratio (SNR): It is essential to have a high SNR for all the audio samples. Previous Kaggle competitions have shown this to be an essential factor in improving model accuracy.

- Band-pass filtering and wavelet denoising: A band-pass filter is a simple way to remove unwanted noise outside the signal frequency band. Hydrophone data are, however, also subject to a variety of in-band noise sources; wavelet denoising is an effective method for improving SNR in environments where a wide range of noise types compete for the same subspace.

- PCEN: To improve robustness to loudness variation, per-channel energy normalization (PCEN) was found to work better than the standard log-mel frontend, even when the latter was combined with an optimized spectral subtraction process. This provided a 24% reduction in the error rate of whale call detection. It also helps reduce narrow-band noise, which is most often caused by nearby boats and the equipment itself.
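
The PCEN step in the list above can be sketched in plain NumPy. This is only an illustration of the usual formulation (a per-band smoothed energy used for gain control, followed by root compression); the function name `pcen` and the default values of `s`, `alpha`, `delta`, `r` and `eps` are assumptions chosen for the example, not settings taken from this project.

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a non-negative (n_bands, n_frames) array.

    All parameter defaults are illustrative assumptions.
    """
    E = np.asarray(E, dtype=float)
    M = np.empty_like(E)
    M[:, 0] = E[:, 0]  # initialise the smoother with the first frame
    for t in range(1, E.shape[1]):
        # First-order low-pass filter along time, independently for each band
        M[:, t] = (1.0 - s) * M[:, t - 1] + s * E[:, t]
    # Adaptive gain control (division by the smoothed energy), then root compression
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r

# Four bands carrying the same constant signal at very different loudness levels
gains = np.array([1.0, 10.0, 100.0, 1000.0])
E = gains[:, None] * np.ones((4, 50))
out = pcen(E)
print(out[:, -1])
```

Because each band is divided by its own smoothed energy, a 1000-fold difference in input level shrinks to a small difference in the output, which is the robustness to loudness described above.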

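The two contrast-enhancement options mentioned in the list can be compared with a small NumPy sketch. The helper names `clip_to_std` and `equalize_hist`, the bin count, and the synthetic input are assumptions made for illustration; the spectrogram here is random data standing in for a dB-scaled image.

```python
import numpy as np

def clip_to_std(spec, k=1.5):
    """Cap extreme values to mean +/- k standard deviations."""
    mu, sigma = spec.mean(), spec.std()
    return np.clip(spec, mu - k * sigma, mu + k * sigma)

def equalize_hist(spec, n_bins=256):
    """Map every value through the empirical CDF, giving output in [0, 1]."""
    flat = spec.ravel()
    hist, bin_edges = np.histogram(flat, bins=n_bins)
    cdf = hist.cumsum().astype(float)
    cdf /= cdf[-1]
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    # Interpolate between bin centres so the mapping is monotonic
    return np.interp(flat, centers, cdf).reshape(spec.shape)

# Synthetic stand-in for a dB-scaled spectrogram (hypothetical data)
rng = np.random.default_rng(0)
spec = rng.normal(loc=-40.0, scale=10.0, size=(64, 100))
clipped = clip_to_std(spec)
equalized = equalize_hist(spec)
print(clipped.min(), clipped.max(), equalized.min(), equalized.max())
```

Clipping only flattens the tails, whereas equalization spreads the whole value range roughly uniformly over [0, 1] while preserving the ordering of pixel intensities.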

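Restricting a spectrogram to the low-frequency band described in the list amounts to selecting the STFT rows whose centre frequencies fall inside that band. The sketch below assumes a magnitude spectrogram with `1 + n_fft // 2` rows, as produced by a real-input STFT; the function name `crop_band`, the 2000 Hz sampling rate and the random input are illustrative assumptions.

```python
import numpy as np

def crop_band(spec, sr, n_fft, f_min=100.0, f_max=200.0):
    """Keep only the rows of an STFT magnitude array that lie in [f_min, f_max] Hz."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)  # centre frequency of each STFT row
    keep = (freqs >= f_min) & (freqs <= f_max)
    return spec[keep, :], freqs[keep]

# Hypothetical example: 2000 Hz audio, n_fft=256, so bins are 7.8125 Hz apart
sr, n_fft = 2000, 256
rng = np.random.default_rng(0)
spec = rng.random((1 + n_fft // 2, 40))
band, band_freqs = crop_band(spec, sr, n_fft)
print(band.shape, band_freqs[0], band_freqs[-1])
```

Frequency resolution matters here: at 44100 Hz with `n_fft=2048` the bins are about 21.5 Hz apart, so only five rows fall between 100 Hz and 200 Hz. Resampling to a lower rate or using a larger `n_fft` gives the band more rows.
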
In [1]:
import numpy as np

import os
import shutil

import IPython
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from tqdm import tqdm_notebook
# sklearn.cross_validation was removed from scikit-learn; StratifiedKFold lives in model_selection
from sklearn.model_selection import StratifiedKFold
from scipy.io import wavfile
import wave
import glob
import csv

import librosa
import librosa.display
import scipy
from keras import losses, models, optimizers
from keras.activations import relu, softmax
from keras.callbacks import (EarlyStopping, LearningRateScheduler,
                             ModelCheckpoint, TensorBoard, ReduceLROnPlateau)
from keras.layers import (Convolution1D, Dense, Dropout, GlobalAveragePooling1D, 
                          GlobalMaxPool1D, Input, MaxPool1D, concatenate)
from keras.utils import Sequence, to_categorical

%matplotlib inline
matplotlib.style.use('ggplot')

In [None]:
'''
Reading train and test class labels of the dataset from csv file.
If we get a json file, we can easily convert it to csv and then proceed. 
'''
train = pd.read_csv("whales_classlabel.csv")
# test = pd.read_csv("whales_test.csv")

In [None]:
'''
We reduce the number of negative examples here so as to balance out the whole dataset. The other method would be 
to augment the minority classes
'''
train.head()
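
The comment in the cell above mentions reducing the number of negative examples. One way to do that is random undersampling of the majority class, sketched below with NumPy only. The function name `undersample_indices` and the label values `"neg"` and `"pos"` are assumptions for illustration; the actual label names in `whales_classlabel.csv` may differ.

```python
import numpy as np

def undersample_indices(labels, majority_label, n_keep, seed=0):
    """Return sorted row indices keeping all minority rows and n_keep majority rows."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    majority = np.flatnonzero(labels == majority_label)
    rest = np.flatnonzero(labels != majority_label)
    n_keep = min(n_keep, majority.size)  # never ask for more rows than exist
    kept = rng.choice(majority, size=n_keep, replace=False)
    return np.sort(np.concatenate([rest, kept]))

# Hypothetical labels: six negatives and two positives
labels = ["neg"] * 6 + ["pos"] * 2
keep = undersample_indices(labels, "neg", n_keep=2)
print(keep)
```

With the DataFrame from this notebook, the result could be applied as `train.iloc[undersample_indices(train.label, "neg", n_keep)]`, again assuming that `"neg"` is the majority label.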

In [None]:
print("Number of training examples=", train.shape[0], "  Number of classes=", len(train.label.unique()))

In [None]:
print(train.label.unique())

In [None]:
df = pd.read_csv('whales_classlabel.csv')
df.set_index('fname', inplace=True)

for f in df.index:
    rate, signal = wavfile.read('data_labels/' + f)
    #print(rate)
    df.at[f, 'length'] = signal.shape[0]/rate
classes = list(np.unique(df.label))
class_dist = df.groupby(['label'])['length'].mean()

fig, ax = plt.subplots()
ax.set_title('Class Distribution', y=1.08)
ax.pie(class_dist, labels = class_dist.index, autopct='%1.1f%%',
    shadow=False, startangle=90)
ax.axis('equal')
plt.show()
df.reset_index(inplace=False)

In [None]:
'''
We observe that:

The number of audio samples per category is non-uniform. The minimum number of audio samples in a category 
is 1 while the maximum is 6
'''

print('Minimum samples per category = ', min(train.label.value_counts()))
print('Maximum samples per category = ', max(train.label.value_counts()))

In [None]:
'''
We see that different samples have different durations of audio
'''
wav_list = glob.glob('data_labels/*.wav')
#fname = 'data_labels/neg_00.wav'
l = []
for path in wav_list:
    # The context manager closes each file handle once its header has been read
    with wave.open(path) as wav:
        l.append([os.path.basename(path), wav.getframerate(), wav.getnframes(),
                  wav.getnframes() / wav.getframerate()])
print("File  ", "Sampling Rate  ", "Total Samples  ", "Duration(s) " )
for dat in l:
    print(*dat)

In [None]:
fname = 'data_labels/pos_14.wav'
rate, data = wavfile.read(fname)
print("Sampling (frame) rate = ", rate)
print("Total samples (frames) = ", data.shape)
print(data)

In [None]:
plt.plot(data, '-', );

In [None]:
# Looking at first 500 frames

plt.figure(figsize=(16, 4))
plt.plot(data[:500], '.'); plt.plot(data[:500], '-');

In [None]:
'''
We see that the distribution of audio length across labels is non-uniform and has very high variance
'''
train['nframes'] = train['fname'].apply(lambda f: wave.open('data_labels/' + f).getnframes())
# test['nframes'] = test['fname'].apply(lambda f: wave.open('data_test/' + f).getnframes())

_, ax = plt.subplots(figsize=(16, 4))
sns.violinplot(ax=ax, x="label", y="nframes", data=train, scale='width')
plt.xticks(rotation=90)
plt.title('Distribution of audio frames, per label', fontsize=16)
plt.show()

In [None]:
'''
The majority of the audio files are short, and there is an outlier, as seen below.
'''
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(16,5))
train.nframes.hist(bins=100, ax=axes)
plt.suptitle('Frame Length Distribution', ha='center', fontsize='large');

In [None]:
file_path = 'data_labels/pos_10.wav'
data, sr = librosa.core.load(file_path, sr=44100,
                                        res_type='kaiser_fast')

In [None]:
# Note: waveplot was removed in librosa 0.10; use librosa.display.waveshow there
librosa.display.waveplot(data, sr=sr)

In [None]:
D = librosa.amplitude_to_db(np.abs(librosa.stft(data)), ref=np.max)
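# Pass sr so the frequency axis is labelled for 44100 Hz audio rather than the 22050 Hz default
librosa.display.specshow(D, sr=sr, y_axis='linear')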

In [None]:
librosa.display.specshow(D, sr=sr, y_axis='log')

In [None]:
librosa.display.specshow(D, sr=sr, cmap='gray_r', y_axis='linear')

In [None]:
sampling_rate = 16000
audio_duration = 4
use_mfcc = False
n_mfcc = 20
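# Note: `data` was loaded at 44100 Hz above, so this many samples spans
# audio_duration seconds only if the file is loaded with sr=sampling_rate
audio_length = sampling_rate * audio_duration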
preprocessing_fn = lambda x: x

input_length = audio_length
#print(input_length)

# Random offset / Padding
if len(data) > input_length:
    max_offset = len(data) - input_length
    offset = np.random.randint(max_offset)
    data = data[offset:(input_length+offset)]
else:
    if input_length > len(data):
        max_offset = input_length - len(data)
        offset = np.random.randint(max_offset)
    else:
        offset = 0
    data = np.pad(data, (offset, input_length - len(data) - offset), "constant")
print(data.shape)
# Normalization + Other Preprocessing
if use_mfcc:
    # Pass the signal by keyword; recent librosa versions require y=
    data = librosa.feature.mfcc(y=data, sr=sampling_rate,
                                       n_mfcc=n_mfcc)
    data = np.expand_dims(data, axis=-1)
else:
    data = preprocessing_fn(data)[:, np.newaxis]
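
The random offset and padding logic in the cell above can be wrapped in a small reusable function. This is a sketch under a few assumptions: the name `pad_or_crop` is hypothetical, it takes an explicit NumPy random generator for reproducibility, and unlike the cell it also allows the last valid offset to be drawn.

```python
import numpy as np

def pad_or_crop(signal, target_len, rng=None):
    """Randomly crop a long 1-D signal, or zero-pad a short one, to target_len samples."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(signal)
    if n > target_len:
        # Pick a random window of target_len samples
        offset = rng.integers(0, n - target_len + 1)
        return signal[offset:offset + target_len]
    # Split the required zero padding randomly between the left and right sides
    pad_total = target_len - n
    left = rng.integers(0, pad_total + 1)
    return np.pad(signal, (left, pad_total - left), mode="constant")

rng = np.random.default_rng(0)
short = np.arange(1, 6, dtype=float)  # 5 samples, will be padded
long = np.arange(20, dtype=float)     # 20 samples, will be cropped
padded = pad_or_crop(short, 8, rng)
cropped = pad_or_crop(long, 8, rng)
print(padded.shape, cropped.shape)
```

Drawing a new offset each time a clip is used means the network sees the same call at different positions in the window.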

In [None]:
data

In [None]:
# data now has shape (n, 1) after the newaxis above; squeeze it back to 1-D for plotting
librosa.display.waveplot(data.squeeze(), sr=sr)

In [None]:
librosa.display.cmap(data)

In [None]:
SAMPLE_RATE = 44100
wav, _ = librosa.core.load(file_path, sr=SAMPLE_RATE)
#wav = wav[:2*44100]

In [None]:
mfcc = librosa.feature.mfcc(y=wav, sr=SAMPLE_RATE, n_mfcc=20)
mfcc.shape

In [None]:
plt.figure(figsize=(16, 14))
plt.imshow(mfcc);

In [None]:
D = np.abs(librosa.stft(wav, n_fft=2048, win_length=2000, hop_length=500))
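# D is a linear-frequency STFT, so use a log frequency axis (not 'mel') and pass sr and hop_length for correct labels
librosa.display.specshow(librosa.amplitude_to_db(D, ref=np.max), sr=SAMPLE_RATE, hop_length=500,
                         y_axis='log', x_axis='time')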

In [None]:
def audio_norm(data):
    # Min-max scale the signal to roughly [-0.5, 0.5]; 1e-6 guards against division by zero
    max_data = np.max(data)
    min_data = np.min(data)
    data = (data-min_data)/(max_data-min_data+1e-6)
    return data-0.5


In [None]:
file_path = 'data_labels/pos_14.wav'
data, sr = librosa.core.load(file_path, sr=44100,
                                        res_type='kaiser_fast')

In [None]:
file_path = 'data_labels/pos_15.wav'
dataa, sr = librosa.core.load(file_path, sr=44100,
                                        res_type='kaiser_fast')
dataa_norm = audio_norm(dataa)

In [None]:
data_norm = audio_norm(data)

In [None]:
sampling_rate = 16000
audio_duration = 3
use_mfcc = False
n_mfcc = 20
audio_length = sampling_rate * audio_duration
preprocessing_fn = lambda x: x

input_length = audio_length
print(input_length)
print(len(data))

# Random offset / Padding
if len(data) > input_length:
    max_offset = len(data) - input_length
    offset = np.random.randint(max_offset)
    data = data[offset:(input_length+offset)]
else:
    if input_length > len(data):
        max_offset = input_length - len(data)
        offset = np.random.randint(max_offset)
    else:
        offset = 0
    data = np.pad(data, (offset, input_length - len(data) - offset), "constant")

# data = librosa.feature.mfcc(data, sr=sampling_rate,
#                                                    n_mfcc=n_mfcc)
# data = np.expand_dims(data, axis=-1)

In [None]:
plt.figure(figsize=(16, 14))
plt.subplot(4,1,1)
librosa.display.waveplot(data)
plt.subplot(4,1,2)
librosa.display.waveplot(data_norm)
plt.subplot(4,1,3)
librosa.display.waveplot(dataa)
plt.subplot(4,1,4)
librosa.display.waveplot(dataa_norm)

In [None]:
plt.figure(figsize=(16, 14))
# Both inputs are linear-frequency STFTs of 44100 Hz audio, so pass sr and hop_length
# and use a log frequency axis rather than 'mel'
D1 = np.abs(librosa.stft(data, n_fft=1024, win_length=1024, hop_length=200))
plt.subplot(2,1,1)
librosa.display.specshow(librosa.amplitude_to_db(D1, ref=np.max), sr=sr, hop_length=200,
                         y_axis='log', x_axis='time')
D2 = np.abs(librosa.stft(data_norm, n_fft=1024, win_length=1024, hop_length=200))
plt.subplot(2,1,2)
librosa.display.specshow(librosa.amplitude_to_db(D2, ref=np.max), sr=sr, hop_length=200,
                         y_axis='log', x_axis='time')