## This notebook implements a basic preprocessing algorithm for audio files

Audio is taken as input and converted to a mel spectrogram with:
* 80 frequency bins
* Song data organized into 10 ms slices
    * So each matrix has shape (80, song length / 10 ms)
* 70 ms of empty data prepended and 70 ms appended
    * The algorithm will consider 70 ms before and 70 ms after each 10 ms time step
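
The padding described above means every 10 ms frame can be given a full window of 7 frames on each side (15 frames in total). A minimal sketch of that slicing follows; the helper name `context_windows` and the toy input are illustrative assumptions, not part of this notebook.

```python
import numpy as np

CONTEXT = 7  # frames on each side: 7 * 10 ms = 70 ms


def context_windows(padded, context=CONTEXT):
    """Slice a padded (n_bins, n_frames + 2*context) spectrogram into
    one (n_bins, 2*context + 1) window per original frame."""
    width = 2 * context + 1
    n_frames = padded.shape[1] - 2 * context
    # Window t is centered on original frame t (column t + context of the padded array)
    return np.stack([padded[:, t:t + width] for t in range(n_frames)])


# Toy stand-in for a real (80, n_frames) spectrogram
spec = np.arange(80 * 20, dtype=float).reshape(80, 20)
padded = np.c_[np.zeros((80, CONTEXT)), spec, np.zeros((80, CONTEXT))]

windows = context_windows(padded)
print(windows.shape)  # (20, 80, 15)
```

The center column of each window is the original frame, so `windows[t][:, 7]` equals `spec[:, t]`.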

This currently produces a single channel. With `n_fft = 2048` at a 44100 Hz sample rate, the STFT window length is about 46 ms (2048 / 44100 s).
By changing the `n_fft` parameter in the line `S = np.abs(librosa.stft(data_new, n_fft = 2048, hop_length = 441))`,
it will be possible to produce extra channels corresponding to window lengths of 93 ms and 23 ms.

| n_fft | window length |
| -------- | ---------- |
| 4096 | 93 ms |
| 2048 | 46 ms |
| 1024 | 23 ms |

These values are for sample rate = 44100; hop length = 441 gives the 10 ms stride.
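
The window length in milliseconds is simply `n_fft / sample_rate * 1000`, and the stride follows from the hop length in the same way. The short sketch below works through that arithmetic at 44100 Hz; the helper name `samples_to_ms` is an assumption made for illustration.

```python
SAMPLE_RATE = 44100  # Hz, the rate the audio is resampled to
HOP_LENGTH = 441     # samples between successive STFT frames


def samples_to_ms(n_samples, sample_rate=SAMPLE_RATE):
    """Convert a length in samples to milliseconds."""
    return 1000.0 * n_samples / sample_rate


for n_fft in (4096, 2048, 1024, 512):
    print(f"n_fft = {n_fft}: window = {samples_to_ms(n_fft):.1f} ms")
print(f"hop_length = {HOP_LENGTH}: stride = {samples_to_ms(HOP_LENGTH):.1f} ms")
```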

In [1]:
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Try out a song from Training Data
# data = the song data
# sr = sample rate
data, sr = librosa.load('/Users/ewais/Documents/Github/tensor-hero/Preprocessing/Training Data/Audioslave - Exploder (Chezy)/song.ogg')


In [2]:

# This is a 1D array, with as many elements as there are samples
print('The original data shape is', data.shape)
print('The original sampling rate is', sr)       # librosa.load resamples to 22050 Hz by default; pass sr=None to keep the file's native rate

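# Upsample to 44.1 kHz so that mp3 files can be handled the same way and so that
# a hop length of 441 samples is exactly 10 ms (at 22050 Hz it would be 220.5 samples)
# Keyword arguments are used because recent librosa versions no longer accept orig_sr/target_sr positionally
data_new = librosa.resample(data, orig_sr=sr, target_sr=44100)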
print('After resampling, data shape is', data_new.shape)
print('And the new sr is 44100')

# Take the STFT (Short Time Fourier Transform)
S = np.abs(librosa.stft(data_new, n_fft = 2048, hop_length = 441))
print('After taking the STFT w/ 10 ms stride, the shape of the data is', S.shape)

# Create mel filter (sr is passed by keyword, as recent librosa versions require)
melfilter = librosa.filters.mel(sr=44100, n_fft=2048, n_mels=80)
print('The shape of the mel filter is', melfilter.shape)

# Let's transform the STFT matrix to the mel filterbank, reducing the dimensionality of the columns to 80
S_filtered = np.matmul(melfilter,S)
print('The shape of the new filtered data is', S_filtered.shape)

# Take the log of the data to better represent human perception
S_filtered = librosa.amplitude_to_db(S_filtered, ref=np.max)

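# Prepend and append 7 columns of zeros (7 frames x 10 ms = 70 ms of padding before and after the song)
# Note: after amplitude_to_db with ref=np.max the values lie in [-80, 0] dB, so a 0 here is the peak level, not silence;
# pad with S_filtered.min() instead if true silence is intended
S_for_parsing = np.c_[np.zeros((80,7)), S_filtered, np.zeros((80,7))]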


#S_for_parsing = np.insert(S_for_parsing, range(np.size(S_for_parsing,1)-1,(np.size(S_for_parsing,1)+6)), 0)
print('Before appending zeros, the shape was', S_filtered.shape)
print('After appending zeros, the shape is', S_for_parsing.shape)

The original data shape is (4594459,)
The original sampling rate is 22050
After resampling, data shape is (9188918,)
And the new sr is 44100
After taking the STFT w/ 10 ms stride, the shape of the data is (1025, 20837)
The shape of the mel filter is (80, 1025)
The shape of the new filtered data is (80, 20837)
Before appending zeros, the shape was (80, 20837)
After appending zeros, the shape is (80, 20851)
