There are two ways to build the data pipeline for the spectrograms:
- Build a dynamic pipeline that reads the `.wav` files and computes the spectrograms on the fly. This approach works with the rawest version of each spectrogram, so no information is lost. However, a naive implementation keeps the spectrograms of the whole training set in RAM, which can easily trigger `OutOfMemory` errors or, at best, slow training down drastically. We ran into this problem in our early experiments with this kind of pipeline. The `tf.data` module is extensive and offers ways around the issue, notably the `TFRecord` format, which stores the dataset on disk so that it can be easily retrieved and streamed. We have not looked into it yet because `tf.data` is a topic in its own right and not the focus of this project. Instead, we opt for the second option:
- Store the precomputed spectrograms on disk as images and simply stream the image data from disk. The drawback is that saving and reloading images can introduce compression error. Moreover, the spectrograms provided with the original Kaggle dataset were unsuitable: they had a white border, were compressed, and were stored in RGB format, whereas a spectrogram should be a single-channel image for both storage and processing. However, my experiments in [this notebook](url) led me to conclude that the spectrograms can be stored as `.png` files with no loss due to compression (there is still a loss due to rounding when casting to `uint8`).

Consequently, this notebook simply shows the process used to compute the spectrograms and save them on disk as `.png` images.
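
The rounding loss mentioned above can be bounded with a short experiment. The sketch below is only an illustration, not the code from the linked notebook: it uses random data as a stand-in for a log-scaled spectrogram (an assumption), and the helpers `quantize` and `dequantize` are hypothetical names introduced here. It checks that rounding to 8 bits moves each value by at most half a quantization step.

```python
import numpy as np

def quantize(spec):
    """Scale a non-negative array to [0, 255] and round it to uint8."""
    peak = np.max(spec)
    return np.round(spec / peak * 255).astype(np.uint8), peak

def dequantize(q, peak):
    """Map uint8 values back to the original scale."""
    return q.astype(np.float64) / 255 * peak

# Stand-in for a log-scaled spectrogram (random data, for illustration only)
rng = np.random.default_rng(0)
spec = np.log1p(rng.random((513, 128)) * 1e4)

q, peak = quantize(spec)
restored = dequantize(q, peak)

# Rounding moves each value by at most half a quantization step
max_error = np.max(np.abs(restored - spec))
half_step = peak / 255 / 2
print(max_error <= half_step + 1e-12)
```

With 256 levels, the worst-case error is therefore about 0.2% of the largest value in the spectrogram.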

# Imports

In [1]:
import os
import glob

from PIL import Image

from scipy import signal
from scipy.io import wavfile
import numpy as np

from tqdm.notebook import tqdm

# Setup

In [2]:
# STFT parameters
nfft = 1024
noverlap = 512

img_format = "png"
png_compression_level = 9

image_dir = "../res/spectrograms/"
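
To get a feel for what these STFT parameters produce, the size of each spectrogram can be worked out by hand. The sketch below assumes the default settings of `scipy.signal.spectrogram` (one-sided output, no boundary padding) and 30-second clips sampled at 22050 Hz; the sampling rate is an assumption, since it is not stated in this notebook. The helper `spectrogram_shape` is a hypothetical name. Note that the value called `nfft` above is later passed as `nperseg`.

```python
def spectrogram_shape(n_samples, nperseg, noverlap):
    """Return (frequency bins, time frames) of a one-sided spectrogram
    computed without boundary padding."""
    step = nperseg - noverlap                  # hop between consecutive windows
    n_freqs = nperseg // 2 + 1                 # one-sided spectrum of a real signal
    n_frames = (n_samples - noverlap) // step  # only complete windows are kept
    return n_freqs, n_frames

# Assumed: 30 s clips sampled at 22050 Hz (not stated in this notebook)
rate = 22050
print(spectrogram_shape(30 * rate, nperseg=1024, noverlap=512))  # (513, 1290)
```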

Create one directory per genre for the spectrograms, mirroring the structure of the original dataset:

In [5]:
for genre in os.listdir("../res/gtzan_wav/data/"):
    try:
        os.makedirs(os.path.join(image_dir, genre))
    except FileExistsError as err:
        # The directory is already there from a previous run
        print(err.__class__.__name__)

FileExistsError
FileExistsError
FileExistsError
FileExistsError
FileExistsError
FileExistsError
FileExistsError
FileExistsError
FileExistsError
FileExistsError


# Define functions

In [6]:
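def pad_or_slice(waveform, length):
    """Return `waveform` cut or extended to exactly `length` samples."""
    if len(waveform) > length:
        # Too long: keep the first `length` samples
        return waveform[:length]
    elif len(waveform) < length:
        # Too short: append a mirrored copy of the last samples
        # (assumes fewer than len(waveform) samples are missing)
        return np.concatenate((waveform, waveform[len(waveform) - length - 1: -1][::-1]))
    else:
        return waveform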

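def get_int8_spectrogram(filename, nperseg, noverlap):
    """Return the log-scaled spectrogram of a 30 s clip as an 8-bit (uint8) array."""
    # Load the audio file and force it to exactly 30 seconds
    rate, data = wavfile.read(filename)
    data = pad_or_slice(data, 30 * rate)

    # Compute the spectrogram (index 2; indices 0 and 1 are the frequency and time axes)
    Sxx = signal.spectrogram(data, rate, nperseg=nperseg, noverlap=noverlap)[2]
    # Compress the dynamic range, then rescale to [0, 255] for storage as uint8
    Sxx = np.log1p(Sxx)
    Sxx = np.round(Sxx / np.max(Sxx) * 255).astype(np.uint8)

    return Sxx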
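
The mirrored padding in `pad_or_slice` can be hard to read from the slice indices alone. The toy example below is a sketch on a five-sample array, not audio data: it evaluates the same slicing expression and compares it with `np.pad` in `"reflect"` mode, which gives the same result here.

```python
import numpy as np

waveform = np.arange(5)   # toy signal: [0 1 2 3 4]
length = 7                # target length, two samples longer

# Slice-based mirroring, same expression as in pad_or_slice
tail = waveform[len(waveform) - length - 1: -1][::-1]
manual = np.concatenate((waveform, tail))

# Same result with np.pad in "reflect" mode (the last sample is not repeated)
padded = np.pad(waveform, (0, length - len(waveform)), mode="reflect")

print(manual)   # [0 1 2 3 4 3 2]
print(padded)   # [0 1 2 3 4 3 2]
```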

# Loop over all files

In [15]:
files = glob.glob("../res/gtzan_wav/data/*/*.wav")

for file in tqdm(files):
    # The genre is the parent folder; swap the ".wav" extension for the image format
    genre, name = file.replace('\\', '/').split('/')[-2:]
    name = os.path.splitext(name)[0] + '.' + img_format

    print(genre, name, end='\r')

    spec = get_int8_spectrogram(file, nperseg=nfft, noverlap=noverlap)

    img = Image.fromarray(spec)
    # Pillow's PNG writer expects the keyword `compress_level` (0 to 9)
    img.save(os.path.join(image_dir, genre, name), format=img_format, compress_level=png_compression_level)

  0%|          | 0/1000 [00:00<?, ?it/s]

rock rock.00099.png
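
As a sanity check on the storage format, an 8-bit array can be saved as PNG and read back to confirm that nothing changes. The sketch below is a minimal illustration: it uses random data as a stand-in for a quantized spectrogram (an assumption) and an in-memory buffer instead of a file under `image_dir`.

```python
import io

import numpy as np
from PIL import Image

# Random 8-bit array standing in for a quantized spectrogram
rng = np.random.default_rng(0)
spec = rng.integers(0, 256, size=(513, 128), dtype=np.uint8)

# Save to an in-memory PNG at the highest compression level
buffer = io.BytesIO()
Image.fromarray(spec).save(buffer, format="png", compress_level=9)

# Read it back and compare pixel by pixel
buffer.seek(0)
restored = np.asarray(Image.open(buffer))

print(restored.dtype, restored.shape)
print(np.array_equal(spec, restored))
```

The compression level only trades file size against encoding time; the decoded pixels are identical at every level.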