# About instrument recognition using the NSynth dataset
## Enzo Fragale, 18 December 2022
Documentation:
* https://www.tensorflow.org/datasets/catalog/nsynth
* https://magenta.tensorflow.org/datasets/nsynth
* https://paperswithcode.com/dataset/nsynth


## Citation:

Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and Mohammad Norouzi. "Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders." 2017.



@InProceedings{pmlr-v70-engel17a,
  title =    {Neural Audio Synthesis of Musical Notes with {W}ave{N}et Autoencoders},
  author =   {Jesse Engel and Cinjon Resnick and Adam Roberts and Sander Dieleman and Mohammad Norouzi and Douglas Eck and Karen Simonyan},
  booktitle =    {Proceedings of the 34th International Conference on Machine Learning},
  pages =    {1068--1077},
  year =     {2017},
  editor =   {Doina Precup and Yee Whye Teh},
  volume =   {70},
  series =   {Proceedings of Machine Learning Research},
  address =      {International Convention Centre, Sydney, Australia},
  month =    {06--11 Aug},
  publisher =    {PMLR},
  pdf =      {http://proceedings.mlr.press/v70/engel17a/engel17a.pdf},
  url =      {http://proceedings.mlr.press/v70/engel17a.html},
}


In [37]:
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_datasets as tfds
from IPython.display import Audio

Load the full NSynth train split as a tf.data.Dataset and slice it 80/10/10 into training, validation and test sets. (The regular validation and test splits cannot be used here because downloading them fills the disk of the Colab VM.)

In [19]:
train_dataset, validation_dataset, test_dataset = tfds.load(
    name="nsynth/full",
    shuffle_files=True,
    read_config=tfds.ReadConfig(shuffle_seed=26),
    split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'])

In [4]:
train_dataset

<PrefetchDataset element_spec={'audio': TensorSpec(shape=(64000,), dtype=tf.float32, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'instrument': {'family': TensorSpec(shape=(), dtype=tf.int64, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'source': TensorSpec(shape=(), dtype=tf.int64, name=None)}, 'pitch': TensorSpec(shape=(), dtype=tf.int64, name=None), 'qualities': {'bright': TensorSpec(shape=(), dtype=tf.bool, name=None), 'dark': TensorSpec(shape=(), dtype=tf.bool, name=None), 'distortion': TensorSpec(shape=(), dtype=tf.bool, name=None), 'fast_decay': TensorSpec(shape=(), dtype=tf.bool, name=None), 'long_release': TensorSpec(shape=(), dtype=tf.bool, name=None), 'multiphonic': TensorSpec(shape=(), dtype=tf.bool, name=None), 'nonlinear_env': TensorSpec(shape=(), dtype=tf.bool, name=None), 'percussive': TensorSpec(shape=(), dtype=tf.bool, name=None), 'reverb': TensorSpec(shape=(), dtype=tf.bool, name=None), 'tempo-synced': TensorSpec(shape=

Let's see (or rather hear) several examples in our dataset

In [57]:
audio_tensors=[]
instruments=[]
ids=[]
pitches=[]
qualities=[]
velocities=[]
for element in train_dataset.take(3):
    audio_tensors.append(element["audio"])
    instruments.append(element["instrument"])
    ids.append(element["id"])
    pitches.append(element["pitch"])
    qualities.append(element["qualities"])
    velocities.append(element["velocity"])
print(audio_tensors)

[<tf.Tensor: shape=(64000,), dtype=float32, numpy=array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)>, <tf.Tensor: shape=(64000,), dtype=float32, numpy=
array([-0.0204345 ,  0.01125625, -0.14146294, ...,  0.        ,
        0.        ,  0.        ], dtype=float32)>, <tf.Tensor: shape=(64000,), dtype=float32, numpy=
array([-4.5825157e-08,  1.8514351e-07, -4.1045467e-07, ...,
        0.0000000e+00,  0.0000000e+00,  0.0000000e+00], dtype=float32)>]


In [58]:
rate = 16000  # audio tensors have shape (64000,): 4-second monophonic
              # snippets sampled at 16 kHz (cf. documentation)
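
The relation between tensor length, sample rate and duration can be checked with a little arithmetic. The sketch below is only an illustration; the constant and function names are hypothetical and not part of the notebook.

```python
# Values taken from the dataset description: 64000 samples at 16 kHz.
SAMPLE_RATE = 16000   # samples per second
NUM_SAMPLES = 64000   # length of each audio tensor


def duration_seconds(num_samples, sample_rate):
    """Return the duration of a clip in seconds."""
    return num_samples / sample_rate


def nyquist_hz(sample_rate):
    """Return the highest frequency representable at this sample rate."""
    return sample_rate / 2


print(duration_seconds(NUM_SAMPLES, SAMPLE_RATE))  # 4.0
print(nyquist_hz(SAMPLE_RATE))                     # 8000.0
```

The Nyquist limit of 8 kHz is worth keeping in mind when choosing frequency ranges for spectral features later on.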

In [64]:
ids[1]

<tf.Tensor: shape=(), dtype=string, numpy=b'mallet_electronic_001-056-127'>

In [67]:
print(ids[0])
print(instruments[0])
print(pitches[0])
print(qualities[0])
print(velocities[0])
#prints instrument's family, label, and source 
Audio(audio_tensors[0], rate=rate)

tf.Tensor(b'bass_synthetic_104-067-127', shape=(), dtype=string)
{'family': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'label': <tf.Tensor: shape=(), dtype=int64, numpy=145>, 'source': <tf.Tensor: shape=(), dtype=int64, numpy=2>}
tf.Tensor(67, shape=(), dtype=int64)
{'bright': <tf.Tensor: shape=(), dtype=bool, numpy=True>, 'dark': <tf.Tensor: shape=(), dtype=bool, numpy=False>, 'distortion': <tf.Tensor: shape=(), dtype=bool, numpy=True>, 'fast_decay': <tf.Tensor: shape=(), dtype=bool, numpy=False>, 'long_release': <tf.Tensor: shape=(), dtype=bool, numpy=False>, 'multiphonic': <tf.Tensor: shape=(), dtype=bool, numpy=False>, 'nonlinear_env': <tf.Tensor: shape=(), dtype=bool, numpy=False>, 'percussive': <tf.Tensor: shape=(), dtype=bool, numpy=False>, 'reverb': <tf.Tensor: shape=(), dtype=bool, numpy=False>, 'tempo-synced': <tf.Tensor: shape=(), dtype=bool, numpy=True>}
tf.Tensor(127, shape=(), dtype=int64)


The output shows every label attached to this audio file, although TensorFlow's verbose tensor formatting makes it hard to read. Notice that the id sums it all up: the instrument's family and source, a three-digit number (probably the record id), the pitch and the velocity.
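
Since the id encodes most of the labels, it can be split back into its fields. The following is a minimal sketch, assuming the layout `family_source_index-pitch-velocity` seen in the ids printed above; `NoteId`, `parse_note_id` and `midi_to_hz` are hypothetical helper names, and the meaning of the three-digit index is a guess. It also assumes the pitch is a MIDI note number, which is converted to a frequency with the standard equal-temperament formula.

```python
from typing import NamedTuple


class NoteId(NamedTuple):
    family: str
    source: str
    index: int      # assumed to be an instrument or record index
    pitch: int      # assumed to be a MIDI note number
    velocity: int


def parse_note_id(note_id):
    """Split an id such as b'bass_synthetic_104-067-127' into its fields."""
    if isinstance(note_id, bytes):
        note_id = note_id.decode("utf-8")
    # Split from the right so that families containing an underscore
    # (e.g. "synth_lead") stay in one piece.
    instrument, pitch, velocity = note_id.rsplit("-", 2)
    family_source, index = instrument.rsplit("_", 1)
    family, source = family_source.rsplit("_", 1)
    return NoteId(family, source, int(index), int(pitch), int(velocity))


def midi_to_hz(pitch):
    """Convert a MIDI note number to Hz (A4 = MIDI 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((pitch - 69) / 12)


note = parse_note_id(b"bass_synthetic_104-067-127")
print(note)
print(midi_to_hz(note.pitch))
```

With the tensors collected earlier, the bytes value would come from something like `ids[0].numpy()`.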

Let's print the instrument information in a more user-friendly way:

In [49]:
def getFamilySourceArray(instrumentsTensor):
    # Read the family and source int values by key, so the result does not
    # depend on the ordering of the keys in the instrument dictionary.
    familySourceArray = [instrumentsTensor["family"].numpy(),
                         instrumentsTensor["source"].numpy()]
    return familySourceArray

In [50]:
def printInstrumentFamilySource(instrumentFamilySourceArray):
    #utility function printing the corresponding string for int values in
    #instrument family and source array
    instrumentFamilyMap={0:"bass", 1:"brass", 2:"flute", 3:"guitar",
                         4:"keyboard", 5:"mallet", 6:"organ", 7:"reed",
                         8:"string", 9:"synth_lead", 10:"vocal"}
    instrumentSourceMap={0:"acoustic", 1:"electronic", 2:"synthetic"}
    print(instrumentFamilyMap[instrumentFamilySourceArray[0]] + ", " +
          instrumentSourceMap[instrumentFamilySourceArray[1]])

In [51]:
def printFromInstrumentsTensor(instrumentsTensor):
    printInstrumentFamilySource(getFamilySourceArray(instrumentsTensor))

In [60]:
printFromInstrumentsTensor(instruments[0])
#prints instrument's family, and source 
Audio(audio_tensors[0], rate=rate)

bass, synthetic


In [61]:
printFromInstrumentsTensor(instruments[1])
Audio(audio_tensors[1], rate=rate)

mallet, electronic


In [62]:
printFromInstrumentsTensor(instruments[2])
Audio(audio_tensors[2], rate=rate)

vocal, synthetic


That is better :)

We can now begin feature extraction with librosa.

In [38]:
import librosa

In [None]:
# TODO: CQT, MFCC and mel spectrogram extraction