In this notebook we shall try to recognize individual instruments in recorded audio files.

The dataset can be downloaded from [Zenodo](https://zenodo.org/record/1432913#.W6dPeJNKjOR).

We'll load in the pre-computed [VGGish features](https://github.com/tensorflow/models/tree/master/research/audioset) and labels, and fit a [RandomForest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) model for each of the 20 instrument classes using the pre-defined train-test splits provided in the repository.

We'll then evaluate the models we fit, and apply them to new audio signals.

## 1.0 Loading the necessary packages

In [1]:
# These dependencies are necessary for loading the data
!pip3 install pandas
!pip3 install scikit-learn
# Use a shell escape here: `os` is not imported until later in this cell
!apt-get install -y libsndfile1
!pip3 install soundfile
!pip3 install resampy

import json
import os
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report,f1_score

# Be sure to set this after downloading the dataset!
DATA_ROOT = 'data'

if not os.path.exists(DATA_ROOT):
    raise ValueError('Did you forget to set `DATA_ROOT`?')

# We need soundfile to load audio data
import soundfile as sf

# And the openmic-vggish preprocessor
import openmic.vggish

# For audio playback
from IPython.display import Audio

You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



---

## 1.1 Loading the data

The OpenMIC data is provided in a Python-friendly format as `openmic-2018.npz`.

You can load it as follows:

In [2]:
OPENMIC = np.load(os.path.join(DATA_ROOT, 'openmic-2018/openmic-2018.npz'),allow_pickle=True)

In [3]:
# What's included?
print(list(OPENMIC.keys()))

['X', 'Y_true', 'Y_mask', 'sample_key']


## Data Description
- `X`: 20000 * 10 * 128 array of VGGish features
- `Y_true`: 20000 * 20 array of *true* label probabilities
- `Y_mask`: 20000 * 20 binary mask values
- `sample_key`: 20000 array of sample key strings
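
As a quick illustration of this layout, the sketch below builds small synthetic arrays with the same axes (clips, time slices, feature dimensions) and checks their shapes. The sizes and random values are placeholders chosen for the example, not OpenMIC data.

```python
import numpy as np

# Synthetic stand-ins with the same axis layout as the OpenMIC arrays.
# n_clips is shrunk from 20000 to 4 purely for illustration.
n_clips, n_frames, n_dims, n_classes = 4, 10, 128, 20

rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(n_clips, n_frames, n_dims))
Y_true_demo = np.zeros((n_clips, n_classes))
Y_mask_demo = np.zeros((n_clips, n_classes), dtype=bool)

# One feature vector per time slice: X_demo[clip, frame] has 128 entries.
frame_vector = X_demo[2, 5]

# All arrays share their first axis, so row i always refers to the same clip.
shapes = (X_demo.shape, Y_true_demo.shape, Y_mask_demo.shape)
print(shapes, frame_vector.shape)
```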

In [4]:
X, Y_true, Y_mask, sample_key = OPENMIC['X'], OPENMIC['Y_true'], OPENMIC['Y_mask'], OPENMIC['sample_key']

In [5]:
# Feature vector for time slice index 5 of the sample at index 10 (both zero-based)
X[10, 5]

array([191,  15, 177, 107, 176, 110, 149, 138, 107, 229,  97,  54, 166,
       117,  40,  94,  37, 114, 181, 146,  78, 255,  65,  68, 113,  81,
       157, 163, 206,  89, 132,  46,  67,   0,  68, 119, 104,  39, 118,
       100, 126, 187,  24, 119, 103, 197, 103,  92, 175,  19, 171,  54,
       100, 202, 132,  47, 254,   0,  99, 211, 255,  44,  30, 205, 101,
       177,  62,  70,  61, 166, 122, 178, 171, 173, 112, 106, 106, 162,
       208, 156,  72, 255, 134,   0,  63,  92, 150, 190, 120, 189, 123,
       255,  52, 178, 250, 203,   0,  56,  75, 214, 236, 249, 104,  93,
        78,   6, 246, 210, 255, 190, 191, 217,   0, 133, 216, 115, 142,
        64,   0, 153,  93,  95,  79, 255, 187,   0,  94, 255])

In [6]:
Y_true[10],Y_mask[10]

(array([0.5   , 0.5   , 0.5   , 0.5   , 0.5   , 0.5   , 0.5   , 0.5   ,
        0.5   , 0.5   , 0.    , 0.5   , 0.8243, 0.5   , 0.5   , 0.5   ,
        0.5   , 0.5   , 0.5   , 0.5   ]),
 array([False, False, False, False, False, False, False, False, False,
        False,  True, False,  True, False, False, False, False, False,
        False, False]))
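
In the output above, only positions 10 and 12 are flagged in `Y_mask`; the remaining entries of `Y_true` sit at 0.5. The sketch below reads off the annotated instruments for that clip, using the printed values and the class indices from `class-map.json`. The helper name `annotated_labels` is hypothetical and only serves the illustration.

```python
import numpy as np

# Values copied from the Y_true[10] / Y_mask[10] output above.
y_true_row = np.full(20, 0.5)
y_true_row[10] = 0.0
y_true_row[12] = 0.8243
y_mask_row = np.zeros(20, dtype=bool)
y_mask_row[[10, 12]] = True

# Subset of the class map covering the two annotated columns.
index_to_name = {10: 'mandolin', 12: 'piano'}

def annotated_labels(y_true, y_mask, names):
    """Return {instrument: label value} for the masked-in columns only."""
    return {names[int(i)]: float(y_true[i]) for i in np.flatnonzero(y_mask)}

labels = annotated_labels(y_true_row, y_mask_row, index_to_name)
print(labels)
```

With the 0.5 threshold used later in the notebook, this clip counts as a negative example for mandolin and a positive example for piano; it contributes nothing to the other 18 classes.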

---

## 1.2 Loading class mappings for instruments

In [7]:
with open('class-map.json', 'r') as f:
    class_map = json.load(f)

In [8]:
class_map

{'accordion': 0,
 'banjo': 1,
 'bass': 2,
 'cello': 3,
 'clarinet': 4,
 'cymbals': 5,
 'drums': 6,
 'flute': 7,
 'guitar': 8,
 'mallet_percussion': 9,
 'mandolin': 10,
 'organ': 11,
 'piano': 12,
 'saxophone': 13,
 'synthesizer': 14,
 'trombone': 15,
 'trumpet': 16,
 'ukulele': 17,
 'violin': 18,
 'voice': 19}

---

## 1.3 Loading train and test sample keys

The OpenMIC-2018 dataset comes with a pre-defined train-test split based on `sample_key` entries.

In [9]:
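# `squeeze=True` is no longer accepted by read_csv in pandas 2.x;
# squeezing the single column afterwards yields the same Series
split_train = pd.read_csv('examples/split01_train.csv', 
                          header=None).squeeze('columns')
split_test = pd.read_csv('examples/split01_test.csv', 
                         header=None).squeeze('columns')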

In [10]:
print('# Train: {},  # Test: {}'.format(len(split_train), len(split_test)))

# Train: 14915,  # Test: 5085


In [11]:
train_set = set(split_train)
test_set = set(split_test)

In [12]:
train_set

{'076564_30720',
 '148958_195840',
 '138062_180480',
 '020030_1144320',
 '093416_7680',
 '129390_46080',
 '129014_80640',
 '005955_1486080',
 '135793_88320',
 '035799_46080',
 '013974_330240',
 '149618_88320',
 '058850_76800',
 '073813_199680',
 '146188_241920',
 '146736_291840',
 '094272_76800',
 '063270_241920',
 '027944_825600',
 '147937_57600',
 '115047_88320',
 '099108_65280',
 '039550_161280',
 '110538_7680',
 '150002_806400',
 '154773_495360',
 '097530_103680',
 '063596_380160',
 '033428_122880',
 '015541_34560',
 '113237_679680',
 '017464_65280',
 '143343_384000',
 '019384_180480',
 '005050_115200',
 '131868_203520',
 '067347_314880',
 '026309_84480',
 '015263_92160',
 '084973_207360',
 '116731_3840',
 '125825_138240',
 '111245_280320',
 '148439_57600',
 '112266_19200',
 '149629_503040',
 '071498_176640',
 '139367_26880',
 '055157_38400',
 '117303_357120',
 '001977_119040',
 '050367_253440',
 '060084_19200',
 '017977_407040',
 '004151_96000',
 '052439_107520',
 '103549_138240',

---

## 2.0 Split the data

Now that we have the sample keys for the training and testing examples, we need to partition the data arrays (`X`, `Y_true`, `Y_mask`).

In [13]:
# Collect the row index of each sample into either idx_train or idx_test
idx_train, idx_test = [], []

for idx, n in enumerate(sample_key):
    if n in train_set:
        idx_train.append(idx)
    elif n in test_set:
        idx_test.append(idx)
    else:
        # This should never happen, but better safe than sorry.
        raise RuntimeError('Unknown sample key={}! Abort!'.format(n))
        
# Finally, cast the idx_* arrays to numpy structures
idx_train = np.asarray(idx_train)
idx_test = np.asarray(idx_test)
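
The same partition can also be written without an explicit loop. The sketch below uses `np.isin` on hypothetical keys; the variable names ending in `_demo` and the key strings are made up for the example, and the check is slightly stricter than the loop above because it also rejects keys that appear in both splits.

```python
import numpy as np

# Hypothetical sample keys and split membership, for illustration only.
sample_key_demo = np.array(['a_1', 'b_2', 'c_3', 'd_4', 'e_5'])
train_keys = {'a_1', 'c_3', 'e_5'}
test_keys = {'b_2', 'd_4'}

# np.isin expects array-likes, so convert the sets to lists first.
in_train = np.isin(sample_key_demo, list(train_keys))
in_test = np.isin(sample_key_demo, list(test_keys))

# Every key must belong to exactly one split (XOR is True for exactly one).
if not np.all(in_train ^ in_test):
    raise RuntimeError('Some sample keys are missing from, or shared by, the splits')

idx_train_demo = np.flatnonzero(in_train)
idx_test_demo = np.flatnonzero(in_test)
print(idx_train_demo, idx_test_demo)
```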

In [14]:
# Finally, we use the split indices to partition the features, labels, and masks
X_train = X[idx_train]
X_test = X[idx_test]

Y_true_train = Y_true[idx_train]
Y_true_test = Y_true[idx_test]

Y_mask_train = Y_mask[idx_train]
Y_mask_test = Y_mask[idx_test]

In [15]:
# Print out the sliced shapes as a sanity check
print(X_train.shape)
print(X_test.shape)

(14915, 10, 128)
(5085, 10, 128)


---
## 3.0 Model Fitting


## 3.1 Using individual models to detect each instrument 

Now, we'll iterate over all the instrument classes, and fit a separate model for each one.

For each instrument, the steps are as follows:

1. Find the subset of training (and testing) data that have been annotated for the current instrument
2. Simplify the features to have one observation point per clip, instead of one point per time slice within each clip
3. Initialize a classifier
4. Fit the classifier to the training data
5. Evaluate the classifier on the test data and print a report
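
Steps 1 and 2, together with the label thresholding, can be sketched on synthetic data as follows. The array contents, the choice of column 3, and the `_demo` names are assumptions made for the example; only the masking, time-averaging, and 0.5 threshold mirror the procedure described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 6 clips, 10 time slices, 128 feature dimensions, 20 classes.
X_demo = rng.integers(0, 256, size=(6, 10, 128)).astype(float)
Y_true_demo = np.full((6, 20), 0.5)
Y_mask_demo = np.zeros((6, 20), dtype=bool)

# Pretend column 3 was annotated for clips 0, 2 and 5.
inst_num = 3
Y_mask_demo[[0, 2, 5], inst_num] = True
Y_true_demo[[0, 2, 5], inst_num] = [0.9, 0.1, 0.7]

# Step 1: keep only the clips annotated for this instrument.
annotated = Y_mask_demo[:, inst_num]
X_inst = X_demo[annotated]                      # shape (3, 10, 128)

# Step 2: one observation per clip, averaging over the time axis.
X_inst_pooled = X_inst.mean(axis=1)             # shape (3, 128)

# Binary targets: threshold the label values at 0.5.
y_inst = Y_true_demo[annotated, inst_num] >= 0.5

print(X_inst_pooled.shape, y_inst)
```

The pooled matrix and boolean targets are what a scikit-learn classifier would receive in steps 3 to 5.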


## 3.1.1 RandomForest

In [16]:
# This dictionary will hold the fitted classifier for each instrument
models = dict()
instrument_f1 = {}

for instrument in class_map:
    
    # Map the instrument name to its column number
    inst_num = class_map[instrument]
        
    # Step 1: sub-sample the data
    
    # First, we need to select down to the data for which we have annotations
    # This is what the mask arrays are for
    train_inst = Y_mask_train[:, inst_num]
    test_inst = Y_mask_test[:, inst_num]
    
    # Here, we're using the Y_mask_train array to slice out only the training examples
    # for which we have annotations for the given class
    X_train_inst = X_train[train_inst]
    
    # Step 2: simplify the data by averaging over time
    
    # Let's arrange the data for a sklearn Random Forest model 
    # Instead of having time-varying features, we'll summarize each track by its mean feature vector over time
    X_train_inst_sklearn = np.mean(X_train_inst, axis=1)
    
    # Again, we slice the labels to the annotated examples
    # We threshold the label likelihoods at 0.5 to get binary labels
    Y_true_train_inst = Y_true_train[train_inst, inst_num] >= 0.5

    
    # Repeat the above slicing and dicing but for the test set
    X_test_inst = X_test[test_inst]
    X_test_inst_sklearn = np.mean(X_test_inst, axis=1)
    Y_true_test_inst = Y_true_test[test_inst, inst_num] >= 0.5

    # Step 3.
    # Initialize a new classifier
    clf = RandomForestClassifier(max_depth=8, n_estimators=100, random_state=0)
    
    # Step 4.
    clf.fit(X_train_inst_sklearn, Y_true_train_inst)

    # Step 5.
    # Finally, we'll evaluate the model on both train and test
    Y_pred_train = clf.predict(X_train_inst_sklearn)
    Y_pred_test = clf.predict(X_test_inst_sklearn)
    
    print('-' * 52)
    print('\033[1m'+instrument+'\033[0m')
    print('\tTRAIN: ')
    print(classification_report(Y_true_train_inst, Y_pred_train))
    print('\tTEST: ')
    print(classification_report(Y_true_test_inst, Y_pred_test))
    
    # Store the classifier in our dictionary
    models[instrument] = clf
    instrument_f1[instrument] = f1_score(Y_true_test_inst, Y_pred_test,average='macro')

print('\033[1m Test F1 Scores for instruments:\033[0m ',instrument_f1)
mean_f1 = np.mean(list(instrument_f1.values()))
print('\033[1m Average Test F1:\033[0m '+str(mean_f1))

----------------------------------------------------
accordion
	TRAIN: 
              precision    recall  f1-score   support

       False       0.96      1.00      0.98      1159
        True       1.00      0.88      0.94       374

    accuracy                           0.97      1533
   macro avg       0.98      0.94      0.96      1533
weighted avg       0.97      0.97      0.97      1533

	TEST: 
              precision    recall  f1-score   support

       False       0.84      0.97      0.90       423
        True       0.77      0.32      0.45       115

    accuracy                           0.83       538
   macro avg       0.81      0.65      0.68       538
weighted avg       0.83      0.83      0.81       538

----------------------------------------------------
banjo
	TRAIN: 
              precision    recall  f1-score   support

       False       0.98      0.98      0.98      1148
        True       0.97      0.97      0.97       592

    accuracy      

----------------------------------------------------
piano
	TRAIN: 
              precision    recall  f1-score   support

       False       1.00      0.96      0.98       420
        True       0.98      1.00      0.99       885

    accuracy                           0.99      1305
   macro avg       0.99      0.98      0.99      1305
weighted avg       0.99      0.99      0.99      1305

	TEST: 
              precision    recall  f1-score   support

       False       0.96      0.85      0.90       130
        True       0.93      0.98      0.96       285

    accuracy                           0.94       415
   macro avg       0.94      0.91      0.93       415
weighted avg       0.94      0.94      0.94       415

----------------------------------------------------
saxophone
	TRAIN: 
              precision    recall  f1-score   support

       False       0.99      0.94      0.97       906
        True       0.94      0.99      0.96       830

    accuracy      
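
The value stored in `instrument_f1` is a macro-averaged F1, that is, the unweighted mean of the per-class F1 scores. As a rough check, the sketch below recomputes it from the rounded precision and recall figures in the accordion TEST report above; because the inputs are already rounded to two decimals, the result can only be expected to agree with the report to about two decimals. The helper name `f1_from_pr` is hypothetical.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values from the accordion TEST report above.
f1_false = f1_from_pr(0.84, 0.97)
f1_true = f1_from_pr(0.77, 0.32)

# Macro averaging weights both classes equally, regardless of support.
macro_f1 = (f1_false + f1_true) / 2
print(round(f1_false, 2), round(f1_true, 2), round(macro_f1, 2))
```

Because the minority `True` class counts as much as the majority class, the macro average (0.68 in the report) sits well below the 0.83 accuracy for this instrument.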

## 3.1.2 ExtraTrees Classifier

In [17]:
# This dictionary will hold the fitted classifier for each instrument
# (note: it replaces the Random Forest models stored in the previous cell)
models = dict()
instrument_f1 = {}

for instrument in class_map:
    
    # Map the instrument name to its column number
    inst_num = class_map[instrument]
        
    # Step 1: sub-sample the data
    
    # First, we need to select down to the data for which we have annotations
    # This is what the mask arrays are for
    train_inst = Y_mask_train[:, inst_num]
    test_inst = Y_mask_test[:, inst_num]
    
    # Here, we're using the Y_mask_train array to slice out only the training examples
    # for which we have annotations for the given class
    X_train_inst = X_train[train_inst]
    
    # Step 2: simplify the data by averaging over time
    
    # Let's arrange the data for a sklearn Extra Trees model 
    # Instead of having time-varying features, we'll summarize each track by its mean feature vector over time
    X_train_inst_sklearn = np.mean(X_train_inst, axis=1)
    
    # Again, we slice the labels to the annotated examples
    # We threshold the label likelihoods at 0.5 to get binary labels
    Y_true_train_inst = Y_true_train[train_inst, inst_num] >= 0.5

    
    # Repeat the above slicing and dicing but for the test set
    X_test_inst = X_test[test_inst]
    X_test_inst_sklearn = np.mean(X_test_inst, axis=1)
    Y_true_test_inst = Y_true_test[test_inst, inst_num] >= 0.5

    # Step 3.
    # Initialize a new classifier
    clf1 = ExtraTreesClassifier(n_estimators=100, random_state=0)
    
    # Step 4.
    clf1.fit(X_train_inst_sklearn, Y_true_train_inst)

    # Step 5.
    # Finally, we'll evaluate the model on both train and test
    Y_pred_train = clf1.predict(X_train_inst_sklearn)
    Y_pred_test = clf1.predict(X_test_inst_sklearn)
    
    print('-' * 52)
    print('\033[1m'+instrument+'\033[0m')
    print('\tTRAIN: ')
    print(classification_report(Y_true_train_inst, Y_pred_train))
    print('\tTEST: ')
    print(classification_report(Y_true_test_inst, Y_pred_test))
    
    # Store the classifier in our dictionary
    models[instrument] = clf1
    instrument_f1[instrument] = f1_score(Y_true_test_inst, Y_pred_test,average='macro')

print('\033[1m Test F1 Scores for instruments:\033[0m ',instrument_f1)
mean_f1 = np.mean(list(instrument_f1.values()))
print('\033[1m Average Test F1:\033[0m '+str(mean_f1))

----------------------------------------------------
accordion
	TRAIN: 
              precision    recall  f1-score   support

       False       1.00      1.00      1.00      1159
        True       1.00      1.00      1.00       374

    accuracy                           1.00      1533
   macro avg       1.00      1.00      1.00      1533
weighted avg       1.00      1.00      1.00      1533

	TEST: 
              precision    recall  f1-score   support

       False       0.84      0.97      0.90       423
        True       0.74      0.32      0.45       115

    accuracy                           0.83       538
   macro avg       0.79      0.65      0.67       538
weighted avg       0.82      0.83      0.80       538

----------------------------------------------------
banjo
	TRAIN: 
              precision    recall  f1-score   support

       False       1.00      1.00      1.00      1148
        True       1.00      1.00      1.00       592

    accuracy      

----------------------------------------------------
piano
	TRAIN: 
              precision    recall  f1-score   support

       False       1.00      1.00      1.00       420
        True       1.00      1.00      1.00       885

    accuracy                           1.00      1305
   macro avg       1.00      1.00      1.00      1305
weighted avg       1.00      1.00      1.00      1305

	TEST: 
              precision    recall  f1-score   support

       False       0.96      0.85      0.90       130
        True       0.94      0.98      0.96       285

    accuracy                           0.94       415
   macro avg       0.95      0.92      0.93       415
weighted avg       0.94      0.94      0.94       415

----------------------------------------------------
saxophone
	TRAIN: 
              precision    recall  f1-score   support

       False       1.00      1.00      1.00       906
        True       1.00      1.00      1.00       830

    accuracy      

## 3.2 Multiclass SVM Classifier

Now, we'll use a single multiclass SVM classifier to predict one instrument per clip. The steps are as follows:

1. Assign each clip a single instrument label: if the clip has annotations (according to `Y_mask`), take the annotated instrument with the highest value in `Y_true`; otherwise, take the instrument with the highest value in `Y_true` overall
2. Simplify the features to have one observation point per clip, instead of one point per time slice within each clip
3. Initialize a classifier
4. Fit the classifier to the training data
5. Evaluate the classifier on the test data and print a report
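
Step 1 collapses the multi-label annotations into a single class per clip. A vectorized sketch of that selection on synthetic arrays is shown below. It uses `np.where` and `argmax` rather than a list-based lookup, so treat it as an illustration of the idea under made-up data; tie-breaking may differ from other implementations of the same rule.

```python
import numpy as np

# Synthetic labels for 3 clips and 4 classes; 0.5 marks unannotated entries.
Y_true_demo = np.array([
    [0.5, 0.9, 0.5, 0.2],   # annotated for classes 1 and 3
    [0.5, 0.5, 0.0, 0.5],   # annotated for class 2 only (a negative label)
    [0.5, 0.5, 0.5, 0.5],   # no annotations
])
Y_mask_demo = np.array([
    [False, True, False, True],
    [False, False, True, False],
    [False, False, False, False],
])

has_annotation = Y_mask_demo.any(axis=1)

# For annotated clips, hide unannotated columns before taking the argmax.
masked_scores = np.where(Y_mask_demo, Y_true_demo, -np.inf)
best_annotated = masked_scores.argmax(axis=1)

# For clips without annotations, fall back to the argmax over all columns.
best_overall = Y_true_demo.argmax(axis=1)

y_class = np.where(has_annotation, best_annotated, best_overall)
print(y_class)
```

Note that the second clip is assigned class 2 even though its only annotation is negative (0.0). This is a consequence of the selection rule itself, and it is worth keeping in mind when reading the multiclass scores.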


In [33]:
inv_class_map= {v: k for k, v in class_map.items()}

## Average the input vectors across the time frames of each audio clip
X_train_avg = np.mean(X_train, axis=1)
X_test_avg = np.mean(X_test, axis=1)

## Create a list that stores the most likely instrument for each input
Y_train = []
for i in range(len(Y_true_train)):
    # If the clip has annotations (per Y_mask), pick the best label among the annotated instruments
    if True in Y_mask_train[i]: 
        ids = np.where(Y_mask_train[i]==True)
        Y_train.append(list(Y_true_train[i]).index(max(Y_true_train[i][ids])))
    # Otherwise, fall back to the instrument with the highest value in Y_true
    else:
        Y_train.append(list(Y_true_train[i]).index(max(Y_true_train[i])))
        
Y_test = []
for i in range(len(Y_true_test)):
    if True in Y_mask_test[i]: 
        ids = np.where(Y_mask_test[i]==True)
        Y_test.append(list(Y_true_test[i]).index(max(Y_true_test[i][ids])))
    else:
        Y_test.append(list(Y_true_test[i]).index(max(Y_true_test[i])))


Y_train = [inv_class_map[i] for i in Y_train] 
Y_test = [inv_class_map[i] for i in Y_test] 
# Step 3.
# Initialize a new classifier
clf = SVC(kernel='rbf')

# Step 4.
clf.fit(X_train_avg, Y_train)

# Step 5.
# Finally, we'll evaluate the model on both train and test
Y_pred_train = clf.predict(X_train_avg)
Y_pred_test = clf.predict(X_test_avg)

print('-' * 52)
# Label the report explicitly; `instrument` is only a leftover loop variable from the earlier cells
print('\033[1m' + 'Multiclass SVM' + '\033[0m')
print('\tTRAIN: ')
print(classification_report(Y_train, Y_pred_train))
print('\tTEST: ')
print(classification_report(Y_test, Y_pred_test))

print('\033[1m Macro-averaged test F1:\033[0m ', f1_score(Y_test, Y_pred_test, average='macro'))

----------------------------------------------------
Multiclass SVM
	TRAIN: 
                   precision    recall  f1-score   support

        accordion       0.75      0.67      0.70       704
            banjo       0.67      0.68      0.68       875
             bass       0.78      0.75      0.76       777
            cello       0.75      0.77      0.76       818
         clarinet       0.52      0.39      0.45       488
          cymbals       0.80      0.81      0.80       855
            drums       0.85      0.80      0.82       733
            flute       0.75      0.79      0.77       794
           guitar       0.81      0.88      0.84       840
mallet_percussion       0.80      0.81      0.81       768
         mandolin       0.57      0.60      0.59       598
            organ       0.80      0.78      0.79       814
            piano       0.92      0.94      0.93       880
        saxophone       0.65      0.78      0.71       865
      synthesizer       0.88      0.90

---

## 4.0 Applying the model to new data

In this section, we'll take the models trained above and apply them to audio signals, stored as OGG Vorbis files.

In [19]:
# We include a test ogg file in the openmic repository, which we can use here.
audio, rate = sf.read(os.path.join(DATA_ROOT, 'openmic-2018/audio/000/000046_3840.ogg'))

time_points, features = openmic.vggish.waveform_to_features(audio, rate)




Instructions for updating:
Please use `layer.__call__` method instead.
Instructions for updating:
Use keras.layers.flatten instead.


INFO:tensorflow:Restoring parameters from /tf/openmic/vggish/_model/vggish_model.ckpt


In [20]:
# The time_points array marks the starting time of each observation
time_points

array([0.  , 0.96, 1.92, 2.88, 3.84, 4.8 , 5.76, 6.72, 7.68, 8.64])
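
The output above shows that consecutive observations start 0.96 seconds apart, so a frame index is close to, but not exactly, a time in seconds. A small sketch of the conversion, with the 0.96 s spacing read off the printed array and `frame_to_seconds` as a hypothetical helper:

```python
import numpy as np

# Spacing of 0.96 s between frames, read off the time_points output above.
hop_seconds = 0.96
n_frames = 10

frame_starts = np.arange(n_frames) * hop_seconds

def frame_to_seconds(frame_index, hop=hop_seconds):
    """Start time (in seconds) of the frame at the given index."""
    return frame_index * hop

print(frame_starts)
print(frame_to_seconds(4))
```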

In [22]:
# Let's listen to the example
Audio(data=audio.T, rate=rate)

### Predicting the instrument in the audio

The output lists a prediction only when it differs from the prediction for the previous frame.

In [51]:

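pred = None
# Frame 0 is skipped; i is the frame index, and frames start every 0.96 s (see time_points)
for i in range(1, len(features)):
    inp = features[i].reshape(-1, 128)
    new_pred = clf.predict(inp)
    # Only report a prediction when it differs from the previous one
    if pred is None or new_pred[0] != pred[0]:
        pred = new_pred
        print(f'At {i} seconds : {pred}')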

    

At 1 seconds : ['clarinet']
At 4 seconds : ['flute']
At 5 seconds : ['organ']
At 6 seconds : ['accordion']
At 7 seconds : ['voice']
At 8 seconds : ['accordion']
At 9 seconds : ['voice']
