The [UrbanSound8K dataset](https://serv.cusp.nyu.edu/projects/urbansounddataset/urbansound8k.html) contains 8,732 labeled sound excerpts, each at most 4 seconds long, spread across the 10 classes listed below. Visually inspecting one example from each class helps us understand how the classes resemble and differ from one another, and which transformations might make the classification task easier for the model.

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

CLASSES = [
    'air conditioner',
    'car horn',
    'children playing',
    'dog bark',
    'drilling',
    'engine idling',
    'gun shot',
    'jackhammer',
    'siren',
    'street music'
]
    
SOUND_FILE_PATHS = [
    'Data/UrbanSound8K/audio/fold1/57320-0-0-7.wav',
    'Data/UrbanSound8K/audio/fold1/24074-1-0-3.wav',
    'Data/UrbanSound8K/audio/fold1/15564-2-0-1.wav',
    'Data/UrbanSound8K/audio/fold1/31323-3-0-1.wav',
    'Data/UrbanSound8K/audio/fold1/46669-4-0-35.wav',
    'Data/UrbanSound8K/audio/fold1/89948-5-0-0.wav',
    'Data/UrbanSound8K/audio/fold1/102305-6-0-0.wav',
    'Data/UrbanSound8K/audio/fold1/103074-7-3-2.wav',
    'Data/UrbanSound8K/audio/fold1/106905-8-0-0.wav',
    'Data/UrbanSound8K/audio/fold1/108041-9-0-4.wav'
]

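def load_sound_files(file_paths):
    """Load each audio file as a 1-D array of float samples."""
    raw_sounds = []
    for fp in file_paths:
        # By default librosa.load resamples to 22050 Hz and mixes down to mono.
        samples, _ = librosa.load(fp)
        raw_sounds.append(samples)
    return raw_sounds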
    
RAW_SOUNDS = load_sound_files(SOUND_FILE_PATHS)

def plot_waves(sound_names, raw_sounds):
    figure = plt.figure(figsize=(8, 16))
    for idx, (name, sound) in enumerate(zip(sound_names, raw_sounds)):
        plt.subplot(10, 1, idx + 1)
        # waveplot was removed in librosa 0.10; waveshow (librosa >= 0.9) replaces it.
        librosa.display.waveshow(np.array(sound), sr=22050)
        plt.title(name)
    figure.tight_layout()
    plt.show()
    
plot_waves(CLASSES, RAW_SOUNDS)

After visualizing the different classes, we can make an important observation: the gun shot example is shorter than all the other examples. This is a problem because our models expect the input shapes of our training examples to be uniform.
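
To make the shape problem concrete, here is a minimal sketch using two synthetic clips rather than files from the dataset. The clip durations and the names `short_clip`, `long_clip`, and `stack_error` are illustrative assumptions; only the 22050 Hz sampling rate matches the rest of this notebook.

```python
import numpy as np

SAMPLE_RATE = 22050  # assumed sampling rate, matching librosa's default

# Synthetic stand-ins for real recordings (hypothetical data, all zeros).
short_clip = np.zeros(int(1.5 * SAMPLE_RATE), dtype=np.float32)  # 1.5 seconds
long_clip = np.zeros(4 * SAMPLE_RATE, dtype=np.float32)          # 4 seconds

for name, clip in [("short", short_clip), ("long", long_clip)]:
    print(name, clip.shape, f"{clip.shape[0] / SAMPLE_RATE:.2f} s")

# Building a batch requires every example to have the same shape,
# so stacking clips of different lengths raises a ValueError.
stack_error = None
try:
    batch = np.stack([short_clip, long_clip])
except ValueError as err:
    stack_error = str(err)
    print("Cannot stack:", stack_error)
```

The same check, `sound.shape[0] / SAMPLE_RATE`, can be run over the loaded examples to see which ones fall short of 4 seconds.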

Also, a better representation of our data may improve the accuracy of our model. Here we will focus on spectrograms, which capture amplitude and frequency information over time, something considered essential in most analyses of acoustic information.
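
As a rough illustration of what a spectrogram contains, the sketch below computes a magnitude spectrogram with a short-time Fourier transform in plain NumPy. It is not the routine used for the plots in this notebook; the function name `magnitude_spectrogram`, the parameters `n_fft` and `hop`, and the 440 Hz test tone are all assumptions made for the example.

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft=1024, hop=512):
    """Return |STFT| with shape (n_fft // 2 + 1, n_frames).

    Assumes len(signal) >= n_fft; trailing samples that do not fill
    a whole frame are dropped.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping frames and taper each one.
    frames = np.stack([
        signal[i * hop : i * hop + n_fft] * window
        for i in range(n_frames)
    ])
    # One FFT per frame; transpose so rows are frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T

SAMPLE_RATE = 22050
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE   # one second of time stamps
tone = np.sin(2 * np.pi * 440.0 * t)       # synthetic 440 Hz sine wave

spec = magnitude_spectrogram(tone)
freqs = np.fft.rfftfreq(1024, d=1.0 / SAMPLE_RATE)  # bin centers in Hz
peak_hz = freqs[spec.mean(axis=1).argmax()]         # strongest bin on average
print(spec.shape, peak_hz)
```

Each column of `spec` describes the frequency content of one short time slice, so a steady tone shows up as a horizontal band near its frequency, to within one bin width of `SAMPLE_RATE / n_fft`.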

Below we will define a function to pad the audio files with silence when they are too short or clip them when they are too long. Then, we will plot spectrograms of the different classes to see how our dataset is affected by the transformations.

In [None]:
SAMPLE_RATE = 22050

MAX_SECS = 4

MAX_SAMPLES = SAMPLE_RATE * MAX_SECS

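def fix_raw_sounds(raw_sounds, length):
    """Pad or truncate every sound so it has exactly `length` samples."""
    new_raw_sounds = []
    for sound in raw_sounds:
        if sound.shape[0] < length:
            # Too short: append zeros (silence) at the end.
            sound = np.pad(sound, (0, length - sound.shape[0]), 'constant')
        elif sound.shape[0] > length:
            # Too long: keep only the first `length` samples.
            sound = sound[:length]
        new_raw_sounds.append(sound)
    return new_raw_sounds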

RAW_SOUNDS = fix_raw_sounds(RAW_SOUNDS, MAX_SAMPLES)

def plot_spectrograms(sound_names, raw_sounds):
    figure = plt.figure(figsize=(8, 16))
    for idx, (name, sound) in enumerate(zip(sound_names, raw_sounds)):
        plt.subplot(10, 1, idx + 1)
        plt.specgram(np.array(sound), cmap='jet', Fs=SAMPLE_RATE)
        plt.colorbar()
        plt.title(name)
        plt.xlabel("Time (s)")
        plt.ylabel("Frequency (Hz)")
    figure.tight_layout()
    plt.show()
    
plot_spectrograms(CLASSES, RAW_SOUNDS)

Now that every example has a uniform length, we can start working on the input pipeline. Along the way, we also want to track some useful statistics and save our dataset in a format better suited to ingestion by TensorFlow models.