Our dataset consists of multiple time series, which may not share the same length. As per the [documentation](https://github.com/reservoirpy/reservoirpy/blob/master/tutorials/1-Getting_Started.ipynb), we represent it as a list of NumPy arrays, each of shape `(timesteps, features)`.

In our case, there is a single feature: the amplitude of the signal.
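
With its default settings, `librosa.load` returns a mono waveform as a 1-D array of shape `(timesteps,)`, so an explicit feature axis may be needed to reach the `(timesteps, features)` layout. The following is a minimal sketch of that reshape, using a synthetic sine wave as a stand-in for a loaded file; the names `waveform` and `series` are illustrative only.

```python
import numpy as np

# Synthetic stand-in for a mono waveform: 1 second sampled at 22050 Hz.
sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
waveform = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)  # shape (22050,)

# Add a trailing axis so that each timestep carries exactly one feature.
series = waveform.reshape(-1, 1)  # shape (22050, 1)
```

The reshape does not copy or alter the samples; it only changes how the array is indexed.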

Each input file name is derived from the name of its output file, which is how inputs are paired with their outputs.
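
The pairing rule can be sketched as below, assuming inputs are named `<name>.<noise_index>.wav` and the matching output is `<name>.wav`. The file names are made up for illustration, and `source_name` is a hypothetical helper, not part of the project.

```python
def source_name(input_filename: str) -> str:
    """Return the clean file name that a noisy input was derived from."""
    # "song.3.wav" -> "song" -> "song.wav"
    stem = input_filename.split(".")[0]
    return f"{stem}.wav"


# Made-up file names, for illustration only.
inputs = ["song.0.wav", "song.1.wav", "speech.0.wav"]
outputs = ["song.wav", "speech.wav"]

# Pair every noisy input with the clean output it was generated from.
pairs = [(name, source_name(name)) for name in inputs if source_name(name) in outputs]
```

Note that this convention assumes the base name itself contains no dot; a file such as `my.song.0.wav` would be matched on `my` only.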

In [4]:
import librosa
import os
from constants import GENERATED_INPUT_DIRECTORY, OUTPUT_DIR
from collections import namedtuple
import numpy as np

AudioFile = namedtuple('AudioFile', ['filename', 'waveform'])  # librosa.load resamples to 22050 Hz by default

# Ensure this notebook is run at the root of the project
input_files = os.listdir(GENERATED_INPUT_DIRECTORY)
output_files = os.listdir(OUTPUT_DIR)

# Pre-load file contents
input_files = [AudioFile(in_f, librosa.load(os.path.join(GENERATED_INPUT_DIRECTORY, in_f))[0]) for in_f in input_files]
output_files = [AudioFile(out_f, librosa.load(os.path.join(OUTPUT_DIR, out_f))[0]) for out_f in output_files]

data_input = []
data_output = []

# Input file names follow the pattern `<file>.<noise_index>.wav`
for out_file in output_files:
    # Match the output file with its associated inputs
    # (in other words, the original audio file with the versions that have noise added to it)
    base_name = out_file.filename.removesuffix('.wav')
    associated_inputs: list[AudioFile] = [af for af in input_files if af.filename.split('.')[0] == base_name]
    for ai in associated_inputs:
        data_input.append(np.array(ai.waveform))
        data_output.append(np.array(out_file.waveform))

We now have the data in the correct shape: for each index `i`, `data_input[i]` holds an input and `data_output[i]` holds the output to be predicted from it.

Let's split the dataset into a training set and a test set:

In [12]:
# Split at the midpoint: the second half is used for training, the first half for testing
sp_index = len(data_input) // 2
X_train, X_test = data_input[sp_index:], data_input[:sp_index]
Y_train, Y_test = data_output[sp_index:], data_output[:sp_index]
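
As a quick sanity check, the same split can be run on placeholder data to confirm that every series lands in exactly one subset and that inputs stay aligned with their outputs. This is only a sketch: the zero and one arrays below are assumptions standing in for the real waveforms.

```python
import numpy as np

# Placeholder series of differing lengths, standing in for the real dataset.
lengths = (5, 7, 9, 11, 13)
data_input = [np.zeros((n, 1)) for n in lengths]
data_output = [np.ones((n, 1)) for n in lengths]

# Same midpoint split as above: second half for training, first half for testing.
sp_index = len(data_input) // 2
X_train, X_test = data_input[sp_index:], data_input[:sp_index]
Y_train, Y_test = data_output[sp_index:], data_output[:sp_index]

# Every series lands in exactly one subset, and each input keeps its paired output.
assert len(X_train) + len(X_test) == len(data_input)
assert all(x.shape == y.shape for x, y in zip(X_train, Y_train))
assert all(x.shape == y.shape for x, y in zip(X_test, Y_test))
```

With an odd number of series, the training set receives the extra one.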