# Instrument recognition using the NSynth dataset
## Enzo Fragale, 18 December 2022
Documentation:
* https://www.tensorflow.org/datasets/catalog/nsynth
* https://magenta.tensorflow.org/datasets/nsynth
* https://paperswithcode.com/dataset/nsynth


## Citation:

Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and Mohammad Norouzi. "Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders." 2017.



@InProceedings{pmlr-v70-engel17a,
  title =    {Neural Audio Synthesis of Musical Notes with {W}ave{N}et Autoencoders},
  author =   {Jesse Engel and Cinjon Resnick and Adam Roberts and Sander Dieleman and Mohammad Norouzi and Douglas Eck and Karen Simonyan},
  booktitle =    {Proceedings of the 34th International Conference on Machine Learning},
  pages =    {1068--1077},
  year =     {2017},
  editor =   {Doina Precup and Yee Whye Teh},
  volume =   {70},
  series =   {Proceedings of Machine Learning Research},
  address =      {International Convention Centre, Sydney, Australia},
  month =    {06--11 Aug},
  publisher =    {PMLR},
  pdf =      {http://proceedings.mlr.press/v70/engel17a/engel17a.pdf},
  url =      {http://proceedings.mlr.press/v70/engel17a.html},
}


In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_datasets as tfds
from IPython.display import Audio
from sklearn.linear_model import LogisticRegression

Load three slices of the full NSynth train split as tf.data.Dataset objects, to be used for training, validation and testing (the regular validation and test splits cannot be used because the disk of the Colab VM fills up).

In [2]:
train_dataset, validation_dataset, test_dataset = tfds.load(
    name="nsynth/full",
    shuffle_files=True,
    read_config=tfds.ReadConfig(shuffle_seed=26),
    split=['train[:10%]', 'train[10%:12%]', 'train[12%:14%]'])
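
The percent slices can be translated into example counts. The sketch below assumes that TFDS rounds each percent boundary to the nearest example index (its default rounding, as far as we can tell) and uses the published size of the NSynth train split, 289,205 notes. `percent_split_sizes` is a hypothetical helper written for this notebook, not a TFDS function.

```python
def percent_split_sizes(num_examples, boundaries_pct):
    """Return the example count of each consecutive percent slice.

    Assumption: every percent boundary is rounded to the nearest index.
    """
    # Absolute index of each percent boundary.
    bounds = [round(num_examples * pct / 100) for pct in boundaries_pct]
    # Each slice holds the examples between two consecutive boundaries.
    return [hi - lo for lo, hi in zip(bounds[:-1], bounds[1:])]


NSYNTH_TRAIN_SIZE = 289_205  # size of the NSynth train split
sizes = percent_split_sizes(NSYNTH_TRAIN_SIZE, [0, 10, 12, 14])
print(sizes)  # [28920, 5785, 5784]
```

The first two counts agree with the first dimension of the feature arrays computed later in this notebook.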

In [3]:
train_dataset

<PrefetchDataset element_spec={'audio': TensorSpec(shape=(64000,), dtype=tf.float32, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'instrument': {'family': TensorSpec(shape=(), dtype=tf.int64, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'source': TensorSpec(shape=(), dtype=tf.int64, name=None)}, 'pitch': TensorSpec(shape=(), dtype=tf.int64, name=None), 'qualities': {'bright': TensorSpec(shape=(), dtype=tf.bool, name=None), 'dark': TensorSpec(shape=(), dtype=tf.bool, name=None), 'distortion': TensorSpec(shape=(), dtype=tf.bool, name=None), 'fast_decay': TensorSpec(shape=(), dtype=tf.bool, name=None), 'long_release': TensorSpec(shape=(), dtype=tf.bool, name=None), 'multiphonic': TensorSpec(shape=(), dtype=tf.bool, name=None), 'nonlinear_env': TensorSpec(shape=(), dtype=tf.bool, name=None), 'percussive': TensorSpec(shape=(), dtype=tf.bool, name=None), 'reverb': TensorSpec(shape=(), dtype=tf.bool, name=None), 'tempo-synced': TensorSpec(shape=

Let's look at (or rather listen to) a few examples from our dataset.

In [4]:
audio_tensors=[]
instruments=[]
ids=[]
pitches=[]
qualities=[]
velocities=[]
for element in train_dataset.take(3):
    audio_tensors.append(element["audio"])
    instruments.append(element["instrument"])
    ids.append(element["id"])
    pitches.append(element["pitch"])
    qualities.append(element["qualities"])
    velocities.append(element["velocity"])
print(audio_tensors)

[<tf.Tensor: shape=(64000,), dtype=float32, numpy=
array([-0.01809136, -0.04473087,  0.5086554 , ...,  0.        ,
        0.        ,  0.        ], dtype=float32)>, <tf.Tensor: shape=(64000,), dtype=float32, numpy=array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)>, <tf.Tensor: shape=(64000,), dtype=float32, numpy=
array([ 2.5229062e-06, -3.7287662e-06,  5.1306415e-06, ...,
       -1.6906321e-06,  6.6180382e-07, -2.7452256e-07], dtype=float32)>]


In [5]:
rate = 16000  # audio tensors have shape (64000,): 4-second monophonic
              # snippets sampled at 16 kHz (cf. documentation)

In [6]:
ids[1]

<tf.Tensor: shape=(), dtype=string, numpy=b'bass_electronic_013-033-075'>

In [7]:
# prints the id, the instrument's family, label and source, then the pitch,
# the qualities and the velocity
print(ids[0])
print(instruments[0])
print(pitches[0])
print(qualities[0])
print(velocities[0])
Audio(audio_tensors[0], rate=rate)

tf.Tensor(b'bass_synthetic_066-037-100', shape=(), dtype=string)
{'family': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'label': <tf.Tensor: shape=(), dtype=int64, numpy=107>, 'source': <tf.Tensor: shape=(), dtype=int64, numpy=2>}
tf.Tensor(37, shape=(), dtype=int64)
{'bright': <tf.Tensor: shape=(), dtype=bool, numpy=True>, 'dark': <tf.Tensor: shape=(), dtype=bool, numpy=False>, 'distortion': <tf.Tensor: shape=(), dtype=bool, numpy=True>, 'fast_decay': <tf.Tensor: shape=(), dtype=bool, numpy=False>, 'long_release': <tf.Tensor: shape=(), dtype=bool, numpy=False>, 'multiphonic': <tf.Tensor: shape=(), dtype=bool, numpy=False>, 'nonlinear_env': <tf.Tensor: shape=(), dtype=bool, numpy=False>, 'percussive': <tf.Tensor: shape=(), dtype=bool, numpy=False>, 'reverb': <tf.Tensor: shape=(), dtype=bool, numpy=False>, 'tempo-synced': <tf.Tensor: shape=(), dtype=bool, numpy=False>}
tf.Tensor(100, shape=(), dtype=int64)


We can see the different labels attached to this audio file, although TensorFlow's verbose output makes them hard to read. Notice that the id sums it all up: the instrument's family and source, a three-digit number (probably an index identifying the individual instrument), the pitch and the velocity.
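
As an illustration of that id layout, here is a minimal sketch that splits an id into its parts. `parse_note_id` is a hypothetical helper, and the `family_source_index-pitch-velocity` layout is inferred from the ids printed above rather than taken from an official parser.

```python
def parse_note_id(note_id):
    """Split an NSynth note id such as b'bass_synthetic_066-037-100'."""
    # Ids come out of TensorFlow as bytes; decode them to str first.
    if isinstance(note_id, bytes):
        note_id = note_id.decode("utf-8")
    # Split from the right so the instrument name stays in one piece.
    instrument, pitch, velocity = note_id.rsplit("-", 2)
    # The family can itself contain an underscore (e.g. "synth_lead").
    family, source, index = instrument.rsplit("_", 2)
    return {"family": family, "source": source, "index": int(index),
            "pitch": int(pitch), "velocity": int(velocity)}


parsed = parse_note_id(b"bass_synthetic_066-037-100")
print(parsed)
```

With the tensors collected above, the call would be `parse_note_id(ids[0].numpy())`.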

Let's print the instrument information in a more user-friendly way:

In [8]:
def getFamilySourceArray(instrumentsTensor):
    # get the family and source int values from the instrument dict by key,
    # rather than relying on the order of its entries
    familySourceArray = [int(instrumentsTensor["family"].numpy()),
                         int(instrumentsTensor["source"].numpy())]
    return familySourceArray

In [9]:
def printInstrumentFamilySource(instrumentFamilySourceArray):
    #utility function printing the corresponding string for int values in
    #instrument family and source array
    instrumentFamilyMap={0:"bass", 1:"brass", 2:"flute", 3:"guitar",
                         4:"keyboard", 5:"mallet", 6:"organ", 7:"reed",
                         8:"string", 9:"synth_lead", 10:"vocal"}
    instrumentSourceMap={0:"acoustic", 1:"electronic", 2:"synthetic"}
    print(instrumentFamilyMap[instrumentFamilySourceArray[0]] + ", " +
          instrumentSourceMap[instrumentFamilySourceArray[1]])

In [10]:
def printFromInstrumentsTensor(instrumentsTensor):
    printInstrumentFamilySource(getFamilySourceArray(instrumentsTensor))

In [11]:
printFromInstrumentsTensor(instruments[0])
#prints instrument's family, and source 
Audio(audio_tensors[0], rate=rate)

bass, synthetic


In [12]:
printFromInstrumentsTensor(instruments[1])
Audio(audio_tensors[1], rate=rate)

bass, electronic


In [13]:
printFromInstrumentsTensor(instruments[2])
Audio(audio_tensors[2], rate=rate)

organ, electronic


That is better :)

We can now begin our feature extraction with librosa.

In [14]:
import librosa

In [15]:
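# TODO: CQT and MFCC extractions (only the mel spectrogram is implemented)
def computeMelSpectrogram(audio):
    # mel power spectrogram with librosa's defaults (n_fft=2048,
    # hop_length=512, n_mels=128)
    rate = 16000
    melSpectrogram = librosa.feature.melspectrogram(y=audio, sr=rate)
    return melSpectrogram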

In [16]:
X_train = np.array([computeMelSpectrogram((x["audio"]).numpy()) for x in train_dataset])

In [17]:
y_train = np.array([x["instrument"]["family"].numpy() for x in train_dataset])

In [18]:
X_valid = np.array([computeMelSpectrogram((x["audio"]).numpy()) for x in validation_dataset])

In [19]:
y_valid = np.array([x["instrument"]["family"].numpy() for x in validation_dataset])
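
The cells above iterate over each dataset twice, once for the features and once for the labels. Since the files are shuffled at load time, it seems worth checking that both passes visit the examples in the same order; we have not verified this for the TFDS version used here. A minimal sketch that avoids the question by collecting both in a single pass follows. `collect_features_and_labels` is a hypothetical helper, and the toy dictionaries only imitate the structure of the dataset elements.

```python
import numpy as np


def collect_features_and_labels(examples, feature_fn):
    """Build aligned (X, y) arrays from one pass over `examples`."""
    features, labels = [], []
    for example in examples:
        # Feature and label come from the same element, so they stay paired.
        features.append(feature_fn(np.asarray(example["audio"])))
        labels.append(int(example["instrument"]["family"]))
    return np.stack(features), np.array(labels)


# Toy stand-in for the dataset: two short "recordings" with family labels.
toy_examples = [
    {"audio": np.zeros(8, dtype=np.float32), "instrument": {"family": 0}},
    {"audio": np.ones(8, dtype=np.float32), "instrument": {"family": 6}},
]
X_toy, y_toy = collect_features_and_labels(
    toy_examples, lambda audio: audio.reshape(2, 4))
print(X_toy.shape, y_toy)
```

On the real data the call would look like `collect_features_and_labels(tfds.as_numpy(train_dataset), computeMelSpectrogram)`, which also halves the number of passes over the audio.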

In [20]:
X_train.shape, X_valid.shape

((28920, 128, 126), (5785, 128, 126))
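
The last two dimensions follow from librosa's defaults: 128 mel bands, and one frame every 512 samples with centered frames. The sketch below only reproduces that arithmetic and does not call librosa. `mel_spectrogram_shape` is a hypothetical helper.

```python
def mel_spectrogram_shape(n_samples, n_mels=128, hop_length=512):
    """Expected (n_mels, n_frames) of a centered mel spectrogram."""
    # With centered frames, the signal is padded so that frames sit on
    # multiples of hop_length, which gives 1 + n_samples // hop_length frames.
    n_frames = 1 + n_samples // hop_length
    return n_mels, n_frames


shape = mel_spectrogram_shape(64000)
print(shape)                # (128, 126)
print(shape[0] * shape[1])  # 16128 values per example once flattened
```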

In [21]:
X_train[0]

array([[2.2730122e-04, 6.9461399e-05, 2.8902195e-07, ..., 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00],
       [2.1088084e-03, 5.6040171e-04, 1.1145765e-06, ..., 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00],
       [8.0479626e-03, 2.0850345e-03, 7.3984265e-06, ..., 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00],
       ...,
       [3.8798942e-05, 1.1574578e-05, 2.1949224e-07, ..., 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00],
       [8.2603074e-06, 2.3648279e-06, 1.2799580e-07, ..., 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00],
       [2.0321355e-07, 1.3413141e-07, 1.0091105e-07, ..., 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00]], dtype=float32)

In [22]:
X_train = X_train.reshape((X_train.shape[0],X_train.shape[1]*X_train.shape[2]))
X_valid = X_valid.reshape((X_valid.shape[0],X_valid.shape[1]*X_valid.shape[2]))

In [23]:
lr = LogisticRegression(max_iter=500, verbose=1)
lr.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 32.0min finished


LogisticRegression(max_iter=500, verbose=1)
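
The solver stops at the iteration limit and the warning suggests scaling the data. One possible preprocessing step, not used in this notebook and untested on NSynth here, is to log-compress the power spectrograms and then standardize each feature with statistics from the training set. `log_standardize` and its `eps` parameter are assumptions made for this sketch.

```python
import numpy as np


def log_standardize(X_train, X_other, eps=1e-10):
    """Log-compress flattened spectrograms, then standardize each feature."""
    # Power values span many orders of magnitude; the log compresses them.
    log_train = np.log(X_train + eps)
    log_other = np.log(X_other + eps)
    # Statistics come from the training set only, so nothing leaks from
    # the validation data into the preprocessing.
    mean = log_train.mean(axis=0)
    std = log_train.std(axis=0)
    std[std == 0] = 1.0  # constant features: avoid dividing by zero
    return (log_train - mean) / std, (log_other - mean) / std


# Toy data with a heavy-tailed distribution, standing in for spectrograms.
rng = np.random.default_rng(0)
toy_train = rng.random((50, 6)) ** 4
toy_valid = rng.random((10, 6)) ** 4
toy_train_scaled, toy_valid_scaled = log_standardize(toy_train, toy_valid)
print(toy_train_scaled.mean(axis=0).round(6), toy_train_scaled.std(axis=0).round(6))
```

The function allocates full copies of its inputs, which matters for an array as large as X_train.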

In [24]:
lr.score(X_valid, y_valid)

0.16076058772687987

In [25]:
lr.score(X_train, y_train)

0.40985477178423235
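
The gap between the training score (about 0.41) and the validation score (about 0.16) suggests that the model overfits. To put the validation score in context, it helps to compare it with two trivial baselines: guessing uniformly among the 11 instrument families, and always predicting the most frequent training family. The sketch below computes both; `baseline_accuracies` is a hypothetical helper and the toy labels are made up for illustration.

```python
import numpy as np


def baseline_accuracies(y_train, y_eval, n_classes=11):
    """Return (uniform-guess accuracy, majority-class accuracy on y_eval)."""
    uniform = 1.0 / n_classes
    # Always predict the most frequent class seen in the training labels.
    majority_class = np.bincount(y_train, minlength=n_classes).argmax()
    majority = float(np.mean(y_eval == majority_class))
    return uniform, majority


# Toy labels: class 0 is the most frequent in the training labels.
toy_y_train = np.array([0, 0, 0, 4, 8])
toy_y_eval = np.array([0, 4, 0, 8])
uniform_acc, majority_acc = baseline_accuracies(toy_y_train, toy_y_eval)
print(round(uniform_acc, 3), majority_acc)  # 0.091 0.5
```

Calling `baseline_accuracies(y_train, y_valid)` on the arrays above would show how far 0.16 sits above either baseline.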