# OpenMIC-2018 baseline model tutorial

This notebook demonstrates how to replicate a simplified version of the baseline modeling experiment in [(Humphrey, Durand, and McFee, 2018)](http://ismir2018.ircam.fr/doc/pdfs/203_Paper.pdf).

First, make sure you [download the dataset](https://zenodo.org/record/1432913#.W6dPeJNKjOR)!

We'll load in the pre-computed [VGGish features](https://github.com/tensorflow/models/tree/master/research/audioset) and labels, and fit a [RandomForest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) model for each of the 20 instrument classes using the pre-defined train-test splits provided in the repository.

We'll then evaluate the models we fit, and show how to apply them to new audio signals.

This notebook is not meant to demonstrate state-of-the-art performance on instrument recognition.  Rather, we hope that it can serve as a starting point for building your own instrument detectors without too much effort!

In [1]:
# These dependencies are necessary for loading the data
import json
import os
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Be sure to set this after downloading the dataset!
DATA_ROOT = '/mnt/c/Users/andre/Desktop/openmic-2018'

if not os.path.exists(DATA_ROOT):
    raise ValueError('Did you forget to set `DATA_ROOT`?')

## Loading the data

The OpenMIC data is provided in a Python-friendly format as `openmic-2018.npz`.

You can load it as follows:

In [2]:
OPENMIC = np.load(os.path.join(DATA_ROOT, 'openmic-2018.npz'))

In [3]:
# What's included?
print(list(OPENMIC.keys()))

['X', 'Y_true', 'Y_mask', 'sample_key']


### What's included in the data?

- `X`: 20000 * 10 * 128 array of VGGish features
    - First index (0..19999) corresponds to the sample key
    - Second index (0..9) corresponds to the time within the clip (each time slice is 960 ms long)
    - Third index (0..127) corresponds to the VGGish feature dimensions at each time slice of the 10-second clip
    - Example: `X[40, 8]` is the 128-dimensional feature vector for the 9th time slice in the 41st example
- `Y_true`: 20000 * 20 array of *true* label probabilities
    - First index corresponds to sample key, as above
    - Second index corresponds to the label class (accordion, ..., voice)
    - Example: `Y_true[40, 4]` indicates the confidence that example #41 contains the 5th instrument
- `Y_mask`: 20000 * 20 binary mask values
    - First index corresponds to sample key
    - Second index corresponds to the label class
    - Example: `Y_mask[40, 4]` indicates whether or not we have observations for the 5th instrument for example #41
- `sample_key`: 20000 array of sample key strings
    - Example: `sample_key[40]` is the sample key for example #41

In [4]:
# It will be easier to use if we make direct variable names for everything
X, Y_true, Y_mask, sample_key = OPENMIC['X'], OPENMIC['Y_true'], OPENMIC['Y_mask'], OPENMIC['sample_key']

In [5]:
X.shape

(20000, 10, 128)

In [6]:
sample_key[40]

'000385_249600'

In [7]:
# Features for the 9th time slice of 81st example
X[80, 8]

array([192,  30, 176, 126, 208,  85,  84,  95,  69, 234,  99, 118, 166,
       150, 106,  68, 165, 156, 146, 206,  75, 210, 131,  49,  61, 218,
        92, 152, 121, 167,  62, 166, 167, 237,  22, 168, 165, 137, 178,
       132, 196,  96,  54, 166, 169, 132,  59,  27,  46, 123,  89,  47,
        58, 116,  48, 188, 157,  28,  44, 252, 248, 100,  28, 154, 147,
       148, 204, 104,  95,  67, 109, 147, 204, 146, 196, 222,  90, 255,
        94, 171,  53, 133, 202, 152,  35,  55, 231, 255,  62, 227, 168,
       192,  87, 144, 130, 255,   0,   0, 163,  75, 255, 135, 216,  68,
         0, 199,   0, 193, 254, 114,  12, 255,   0,  74, 165,   0, 201,
       246,   0, 127, 211, 218, 164,  57, 238, 176, 158, 255])

In [8]:
Y_true[40]

array([0.5    , 0.5    , 0.5    , 0.5    , 0.5    , 0.15055, 0.5    ,
       0.5    , 0.5    , 0.5    , 0.5    , 0.5    , 0.5    , 0.5    ,
       0.5    , 0.5    , 0.5    , 0.5    , 0.5    , 0.5    ])

In [9]:
Y_mask[40]

array([False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False])
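
The two rows printed above are meant to be read together: the mask says which classes were annotated for a clip, and only those entries of `Y_true` carry information (in this row, every unannotated entry shows 0.5). Here is a minimal sketch that re-types row 40 by hand so it runs without the dataset. The names `class_names`, `y_true_row`, `y_mask_row`, and `observed` are illustrative and not part of the dataset.

```python
import numpy as np

# Class names in column order, copied from the class map shown later in this notebook
class_names = ['accordion', 'banjo', 'bass', 'cello', 'clarinet', 'cymbals',
               'drums', 'flute', 'guitar', 'mallet_percussion', 'mandolin',
               'organ', 'piano', 'saxophone', 'synthesizer', 'trombone',
               'trumpet', 'ukulele', 'violin', 'voice']

# Hand-typed copies of Y_true[40] and Y_mask[40] from the outputs above
y_true_row = np.full(20, 0.5)
y_true_row[5] = 0.15055
y_mask_row = np.zeros(20, dtype=bool)
y_mask_row[5] = True

# Keep only the classes that were actually annotated for this clip
observed = {class_names[i]: float(y_true_row[i]) for i in np.flatnonzero(y_mask_row)}
print(observed)
```

For this clip only `cymbals` was annotated, with a confidence of about 0.15, so it would be treated as a negative example once labels are thresholded at 0.5.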

In [10]:
sample_key.shape

(20000,)

### Load the class map

For convenience, we provide a simple JSON object that maps class names to their column indices.


In [12]:
with open(os.path.join(DATA_ROOT, 'class-map.json'), 'r') as f:
    class_map = json.load(f)

In [13]:
class_map

{'accordion': 0,
 'banjo': 1,
 'bass': 2,
 'cello': 3,
 'clarinet': 4,
 'cymbals': 5,
 'drums': 6,
 'flute': 7,
 'guitar': 8,
 'mallet_percussion': 9,
 'mandolin': 10,
 'organ': 11,
 'piano': 12,
 'saxophone': 13,
 'synthesizer': 14,
 'trombone': 15,
 'trumpet': 16,
 'ukulele': 17,
 'violin': 18,
 'voice': 19}
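
Since the map goes from name to column index, it is often handy to have the reverse direction as well, for example to label the columns of `Y_true`. The following is a small sketch of one way to invert it. To keep it runnable on its own it uses a hand-typed subset called `class_map_subset`; `index_to_name` and `names_in_order` are likewise illustrative names.

```python
# Subset of the class map shown above, re-typed so the snippet runs on its own
class_map_subset = {'accordion': 0, 'banjo': 1, 'bass': 2, 'cello': 3}

# Invert name -> index into index -> name
index_to_name = {index: name for name, index in class_map_subset.items()}

# Ordered list of names, so that names_in_order[i] labels column i
names_in_order = [index_to_name[i] for i in range(len(index_to_name))]
print(names_in_order)
```

With the full `class_map` in place of the subset, `names_in_order[5]` would give the name for column 5 of `Y_true` and `Y_mask`.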

---

## Loading the train-test splits

OpenMIC-2018 comes with a pre-defined train-test split.  Great care was taken to ensure that this split is approximately balanced and that no artist appears on both sides of the split, so please use it!

This is done by sample key, not row number, so you will need to go through the `sample_key` array to slice the data.

In [14]:
# Let's split the data into the training and test set
# We call .squeeze("columns") to get a single Series for each, rather than a full DataFrame
# (the `squeeze=True` argument of read_csv was removed in pandas 2.0)

split_train = pd.read_csv(os.path.join(DATA_ROOT, 'partitions/split01_train.csv'),
                          header=None).squeeze("columns")
split_test = pd.read_csv(os.path.join(DATA_ROOT, 'partitions/split01_test.csv'),
                         header=None).squeeze("columns")

In [15]:
# These two tables contain the sample keys for training and testing examples
# Let's see the keys for the first five training examples
split_train.head(5)

0      000046_3840
1    000135_483840
2    000139_119040
3    000141_153600
4     000144_30720
Name: 0, dtype: object

In [16]:
# How many train and test examples do we have?  About 75%/25%
print('# Train: {},  # Test: {}'.format(len(split_train), len(split_test)))

# Train: 14915,  # Test: 5085



These sample key maps are easier to use as sets, so let's make them sets!

In [17]:
train_set = set(split_train)
test_set = set(split_test)
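
One advantage of sets is that sanity checks on the split become one-liners. The sketch below shows two checks you might run: that no key appears on both sides, and that no key is left unassigned. It uses a few keys copied from the outputs in this notebook as stand-ins for the full splits, and the names `train_keys`, `test_keys`, and `all_keys` are illustrative.

```python
# A few keys copied from the outputs in this notebook, standing in for the full splits
train_keys = {'000046_3840', '000135_483840', '000139_119040'}
test_keys = {'029771_364800', '083820_203520'}

# The two sides of the split should not share any sample key
overlap = train_keys & test_keys
assert not overlap, 'Keys in both splits: {}'.format(sorted(overlap))

# Every key in the dataset should belong to one of the two sides
all_keys = ['000046_3840', '029771_364800', '000135_483840']  # stand-in for sample_key
unassigned = [k for k in all_keys if k not in train_keys and k not in test_keys]
print(len(overlap), len(unassigned))
```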

In [18]:
test_set

{'029771_364800',
 '083820_203520',
 '127089_107520',
 '006369_3840',
 '081387_257280',
 '120282_11520',
 '041694_161280',
 '043516_88320',
 '128908_395520',
 '048786_46080',
 '046481_145920',
 '102342_180480',
 '011156_134400',
 '127465_38400',
 '032115_833280',
 '104931_334080',
 '001211_119040',
 '150625_42240',
 '020211_852480',
 '004742_130560',
 '042234_23040',
 '051614_161280',
 '074732_26880',
 '086435_107520',
 '117700_149760',
 '027987_46080',
 '130526_165120',
 '135420_42240',
 '127628_76800',
 '067363_103680',
 '059378_341760',
 '122902_184320',
 '059683_145920',
 '039677_241920',
 '047406_53760',
 '062519_264960',
 '151915_445440',
 '010610_380160',
 '029380_0',
 '014604_238080',
 '072717_69120',
 '037043_61440',
 '117390_192000',
 '033266_1451520',
 '070266_3840',
 '126377_199680',
 '055484_2626560',
 '131364_460800',
 '032370_3840',
 '017569_307200',
 '146619_0',
 '032377_1440000',
 '117679_30720',
 '000361_122880',
 '154887_238080',
 '118950_199680',
 '075013_7680',
 ...

### Split the data

Now that we have the sample keys for the training and testing examples, we need to partition the data arrays (`X`, `Y_true`, `Y_mask`).

This is a little delicate to get right.

In [19]:
# These loops go through all sample keys, and save their row numbers
# to either idx_train or idx_test
#
# This will be useful in the next step for slicing the array data
idx_train, idx_test = [], []

for idx, n in enumerate(sample_key):
    if n in train_set:
        idx_train.append(idx)
    elif n in test_set:
        idx_test.append(idx)
    else:
        # This should never happen, but better safe than sorry.
        raise RuntimeError('Unknown sample key={}! Abort!'.format(n))
        
# Finally, cast the idx_* arrays to numpy structures
idx_train = np.asarray(idx_train)
idx_test = np.asarray(idx_test)

In [20]:
idx_test

array([    7,    29,    30, ..., 19985, 19986, 19987])
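
If you prefer a vectorized version of the loop above, `np.isin` can compute the same row indices. This is a sketch on a four-key stand-in array rather than the real data, and the `*_demo` names are made up for illustration. Note that `np.isin` expects an array-like, so a Python set should be converted with `list(...)` before being passed in.

```python
import numpy as np

# Stand-ins for sample_key and the two key collections (keys copied from this notebook)
sample_key_demo = np.array(['000046_3840', '029771_364800',
                            '000135_483840', '083820_203520'])
train_keys = ['000046_3840', '000135_483840']
test_keys = ['029771_364800', '083820_203520']

# Boolean membership masks, one entry per row of the data arrays
is_train = np.isin(sample_key_demo, train_keys)
is_test = np.isin(sample_key_demo, test_keys)

# Same safety check as the loop: every row must land in exactly one split
assert np.all(is_train ^ is_test), 'Some sample keys are not in exactly one split'

# Convert the masks to row indices
idx_train_demo = np.flatnonzero(is_train)
idx_test_demo = np.flatnonzero(is_test)
print(idx_train_demo, idx_test_demo)
```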

In [21]:
# Finally, we use the split indices to partition the features, labels, and masks
X_train = X[idx_train]
X_test = X[idx_test]

Y_true_train = Y_true[idx_train]
Y_true_test = Y_true[idx_test]

Y_mask_train = Y_mask[idx_train]
Y_mask_test = Y_mask[idx_test]

In [22]:
Y_mask_test
#Y_true_test

array([[False, False, False, ..., False, False,  True],
       [False, False,  True, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False,  True, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [23]:
# Print out the sliced shapes as a sanity check
print(X_train.shape)
print(X_test.shape)

(14915, 10, 128)
(5085, 10, 128)


---
# Fit the models

Now, we'll iterate over all the instrument classes, and fit a separate `RandomForest` model for each one.

For each instrument, the steps are as follows:

1. Find the subset of training (and testing) data that have been annotated for the current instrument
2. Simplify the features to have one observation point per clip, instead of one point per time slice within each clip
3. Initialize a classifier
4. Fit the classifier to the training data
5. Evaluate the classifier on the training and test data and print a report
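
Steps 1 and 2 can be sketched in isolation as a small helper, which may make the indexing easier to follow before reading the full loop. `prepare_instrument_data` is a hypothetical function written for this illustration, not part of the `openmic` package, and the arrays are tiny synthetic stand-ins with the same layout as `X`, `Y_true`, and `Y_mask`.

```python
import numpy as np

def prepare_instrument_data(X, Y_true, Y_mask, inst_num, threshold=0.5):
    """Return (features, labels) for one instrument column.

    Hypothetical helper for illustration only.
    """
    # Step 1: keep only the clips annotated for this instrument
    annotated = Y_mask[:, inst_num]
    # Step 2: average over the time axis to get one feature vector per clip
    features = np.mean(X[annotated], axis=1)
    # Binarize the label confidences at the threshold
    labels = Y_true[annotated, inst_num] >= threshold
    return features, labels

# Tiny synthetic stand-in: 4 clips, 10 time slices, 128 features, 2 classes
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(4, 10, 128))
Y_true_demo = np.array([[0.9, 0.5], [0.1, 0.5], [0.5, 0.8], [0.7, 0.5]])
Y_mask_demo = np.array([[True, False], [True, False], [False, True], [True, False]])

features, labels = prepare_instrument_data(X_demo, Y_true_demo, Y_mask_demo, inst_num=0)
print(features.shape, labels)
```

Three of the four synthetic clips are annotated for class 0, so the helper returns three 128-dimensional vectors and three boolean labels.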


In [24]:
# This dictionary will include the classifiers for each model
models = dict()

# We'll iterate over all instrument classes, and fit a model for each one
# After training, we'll print a classification report for each instrument
for instrument in class_map:
    
    # Map the instrument name to its column number
    inst_num = class_map[instrument]
        
    # Step 1: sub-sample the data
    
    # First, we need to select down to the data for which we have annotations
    # This is what the mask arrays are for
    train_inst = Y_mask_train[:, inst_num]
    test_inst = Y_mask_test[:, inst_num]
    
    # Here, we're using the Y_mask_train array to slice out only the training examples
    # for which we have annotations for the given class
    X_train_inst = X_train[train_inst]
    
    # Step 2: simplify the data by averaging over time
    
    # Let's arrange the data for a sklearn Random Forest model 
    # Instead of having time-varying features, we'll summarize each track by its mean feature vector over time
    X_train_inst_sklearn = np.mean(X_train_inst, axis=1)
    
    # Again, we slice the labels to the annotated examples
    # We threshold the label likelihoods at 0.5 to get binary labels,
    # so each label becomes True or False depending on its probability
    Y_true_train_inst = Y_true_train[train_inst, inst_num] >= 0.5

    
    # Repeat the above slicing and dicing but for the test set
    X_test_inst = X_test[test_inst]
    X_test_inst_sklearn = np.mean(X_test_inst, axis=1)
    Y_true_test_inst = Y_true_test[test_inst, inst_num] >= 0.5

    # Step 3.
    # Initialize a new classifier
    clf = RandomForestClassifier(max_depth=8, n_estimators=100, random_state=0)
    
    # Step 4.
    clf.fit(X_train_inst_sklearn, Y_true_train_inst)

    # Step 5.
    # Finally, we'll evaluate the model on both train and test
    Y_pred_train = clf.predict(X_train_inst_sklearn)
    Y_pred_test = clf.predict(X_test_inst_sklearn)
    
    print('-' * 52)
    print(instrument)
    print('\tTRAIN')
    print(classification_report(Y_true_train_inst, Y_pred_train))
    print('\tTEST')
    print(classification_report(Y_true_test_inst, Y_pred_test))
    
    # Store the classifier in our dictionary
    models[instrument] = clf

----------------------------------------------------
accordion
	TRAIN
              precision    recall  f1-score   support

       False       0.96      1.00      0.98      1159
        True       1.00      0.88      0.94       374

   micro avg       0.97      0.97      0.97      1533
   macro avg       0.98      0.94      0.96      1533
weighted avg       0.97      0.97      0.97      1533

	TEST
              precision    recall  f1-score   support

       False       0.84      0.97      0.90       423
        True       0.77      0.32      0.45       115

   micro avg       0.83      0.83      0.83       538
   macro avg       0.81      0.65      0.68       538
weighted avg       0.83      0.83      0.81       538

----------------------------------------------------
banjo
	TRAIN
              precision    recall  f1-score   support

       False       0.98      0.98      0.98      1148
        True       0.97      0.97      0.97       592

   micro avg       0.98      0.98      0

----------------------------------------------------
piano
	TRAIN
              precision    recall  f1-score   support

       False       1.00      0.96      0.98       420
        True       0.98      1.00      0.99       885

   micro avg       0.99      0.99      0.99      1305
   macro avg       0.99      0.98      0.99      1305
weighted avg       0.99      0.99      0.99      1305

	TEST
              precision    recall  f1-score   support

       False       0.96      0.85      0.90       130
        True       0.93      0.98      0.96       285

   micro avg       0.94      0.94      0.94       415
   macro avg       0.94      0.91      0.93       415
weighted avg       0.94      0.94      0.94       415

----------------------------------------------------
saxophone
	TRAIN
              precision    recall  f1-score   support

       False       0.99      0.94      0.97       906
        True       0.94      0.99      0.96       830

   micro avg       0.96      0.96      0

---

# Applying the model to new data

In this section, we'll take the models trained above and apply them to audio signals, stored as OGG Vorbis files.

In [32]:
# We need soundfile to load audio data
import soundfile as sf

# And the openmic-vggish preprocessor
import openmic.vggish

# For audio playback
from IPython.display import Audio

In [38]:
# We include a test ogg file in the openmic repository, which we can use here.
audio, rate = sf.read(os.path.join(DATA_ROOT, 'audio/lagovostok2.ogg'))

time_points, features = openmic.vggish.waveform_to_features(audio, rate)

INFO:tensorflow:Restoring parameters from /home/andreamolgora/anaconda3/lib/python3.6/site-packages/openmic/vggish/_model/vggish_model.ckpt


In [39]:
# The time_points array marks the starting time of each observation
time_points

array([0.  , 0.96, 1.92, 2.88, 3.84, 4.8 , 5.76, 6.72, 7.68, 8.64])
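
These values match non-overlapping 960 ms slices, where slice `k` starts at `k * 0.96` seconds. The short sketch below reproduces the printed array under that assumption. How `waveform_to_features` computes the time points internally is not shown in this notebook, and `hop_seconds`, `n_slices`, and `starts` are illustrative names.

```python
import numpy as np

hop_seconds = 0.96  # slice length stated in the data description above
n_slices = 10       # number of rows in the features array for this clip

# Start time of each slice, in seconds
starts = np.arange(n_slices) * hop_seconds
print(starts)
```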

In [40]:
# The features array includes the vggish feature observations
features.shape

(10, 128)

In [41]:
# Let's listen to the example
Audio(data=audio.T, rate=rate)

In [42]:
# finally, apply the classifier

# Average over time to one observation, but keep the number of dimensions the same
# The test clip is 10 seconds long, so this is the same process as in the training step
# However, you could also apply the classifier to each frame independently to get time-varying predictions
feature_mean = np.mean(features, axis=0, keepdims=True)

for instrument in models:
    
    clf = models[instrument]
    
    print('P[{:18s}=1] = {:.3f}'.format(instrument, clf.predict_proba(feature_mean)[0,1]))

P[accordion         =1] = 0.063
P[banjo             =1] = 0.050
P[bass              =1] = 0.396
P[cello             =1] = 0.003
P[clarinet          =1] = 0.021
P[cymbals           =1] = 0.893
P[drums             =1] = 0.961
P[flute             =1] = 0.050
P[guitar            =1] = 0.695
P[mallet_percussion =1] = 0.102
P[mandolin          =1] = 0.070
P[organ             =1] = 0.021
P[piano             =1] = 0.040
P[saxophone         =1] = 0.129
P[synthesizer       =1] = 0.342
P[trombone          =1] = 0.130
P[trumpet           =1] = 0.118
P[ukulele           =1] = 0.037
P[violin            =1] = 0.095
P[voice             =1] = 0.645
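
As the comment in the cell above mentions, the same classifiers can be applied to each frame separately to get one probability per time slice. Here is a rough sketch of what that looks like. It trains a throwaway model on random data so that it runs without the dataset or the VGGish model; `X_fake`, `y_fake`, `frames_fake`, and `clf_demo` are stand-ins for the clip-level training features, labels, `features`, and one entry of `models`. Keep in mind that the models in this notebook were fit on clip-averaged features, so per-frame inputs may behave somewhat differently.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins: 40 clip-level training vectors and one 10-frame clip
X_fake = rng.normal(size=(40, 128))
y_fake = X_fake[:, 0] > 0          # arbitrary binary target, for illustration only
frames_fake = rng.normal(size=(10, 128))

clf_demo = RandomForestClassifier(max_depth=8, n_estimators=100, random_state=0)
clf_demo.fit(X_fake, y_fake)

# Pass all frames at once: one row in, one probability out per frame
frame_probs = clf_demo.predict_proba(frames_fake)[:, 1]
print(frame_probs.shape)
```

With the real data, replacing `frames_fake` by `features` and `clf_demo` by `models[instrument]` would give ten probabilities per instrument, one for each time point.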


# Wrapping up

So the predictions here are definitely not perfect, but they're a good start!

Some things you might want to try out:

1. Instead of averaging features over time, apply the classifiers to each time-step to get a time-varying instrument detector.
2. Play with the parameters of the `RandomForest` model, changing the depth and number of estimators.
3. Run the trained model on your own favorite songs!
4. Train a different model, maybe using different features!
5. Make use of label uncertainties or unlabeled data when training!
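
As a starting point for item 5, one simple option is to pass per-example weights to `fit` so that confident labels count more than uncertain ones. The weighting below (zero at a confidence of 0.5, one at a confidence of 0 or 1) is an assumption made for this sketch, not something taken from the paper, and the data are random stand-ins for the clip-level features and the `Y_true` confidences of one instrument.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for clip-level features and label confidences
X_fake = rng.normal(size=(50, 128))
confidence = rng.uniform(0.0, 1.0, size=50)

# Binary labels, thresholded as in the training loop above
labels = confidence >= 0.5

# Assumed weighting scheme: 0 at confidence 0.5, 1 at confidence 0 or 1
weights = 2.0 * np.abs(confidence - 0.5)

clf_weighted = RandomForestClassifier(max_depth=8, n_estimators=100, random_state=0)
clf_weighted.fit(X_fake, labels, sample_weight=weights)
```

Whether this helps on the real data is an open question; comparing the test reports with and without `sample_weight` would be a reasonable first experiment.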