Bootstrapping weak multi-instrument classifiers to build a development dataset with the open-mic taxonomy.
Before anything else, download the TensorFlow model parameters locally. You can do this with the bash script included.
$ cd {repo_root}
$ ./scripts/download-deps.sh
Be sure to install audioset with the -e flag, as the resulting scripts depend on this structure. In the future this might leverage something like package resources; if this is sufficiently important to you, please create an issue.
$ pip install -e .
After following the installation instructions above, you can either process a single file (via --file) or a newline-separated text list of filepaths (via --input_list):
$ cd {repo_root}
$ ./scripts/featurefy.py --file /some/audio/file.wav ./output_dir
OR
$ ls /path/to/audio/*wav > file_list.txt
$ ./scripts/featurefy.py --input_list file_list.txt ./output_dir
The original TensorFlow-friendly dump of the VGGish features for AudioSet is made freely available online. Here, VGGish features are sharded into 4096 nested tf.SequenceExamples, with each shard containing several hundred excerpts (between 300 and 1000, averaging around 900).
While this works well for TensorFlow, the format can be a bit overwrought for more Pythonic implementations, e.g. sklearn. Here, we provide a transforms_features.py script that unpacks the features into a large(ish) NumPy tensor (2.4GB) with shape [examples * time, feature], and a CSV file of metadata, tracking both labels and provenance info, with the columns [index, labels, time, video_id]. The index of this table corresponds exactly to the feature array.
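Because the metadata index lines up with the rows of the feature array, per-excerpt frames can be recovered by grouping on video_id. The following is a minimal sketch using toy stand-ins rather than the real artifacts; the header row, the `;`-separated label encoding, and the video IDs are assumptions made for illustration only.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for the real artifacts. Assumption: the CSV has a header row
# with these four columns; the label encoding shown here is invented.
metadata_csv = io.StringIO(
    "index,labels,time,video_id\n"
    "0,guitar,0.0,vid_a\n"
    "1,guitar,1.0,vid_a\n"
    "2,piano;voice,0.0,vid_b\n"
)
# One feature row per metadata row (real rows would be VGGish embeddings).
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

rows = list(csv.DictReader(metadata_csv))
# The metadata table and the feature array must have the same length.
assert len(rows) == len(features)

# Group feature-row indices by source video, preserving file order.
rows_by_video = defaultdict(list)
for row in rows:
    rows_by_video[row["video_id"]].append(int(row["index"]))

# Recover the feature frames belonging to one excerpt.
frames_vid_a = [features[i] for i in rows_by_video["vid_a"]]
```

With the real data, the same indices could be used to slice the NumPy array directly.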
These artifacts are made freely available here:
# Image of data snapshot goes here.
As a final preprocessing step to build our development set, the full AudioSet is filtered on OpenMIC instrument classes and the labels are transformed into a sparse indicator (boolean) array, for easy use with Pythonic ML frameworks, e.g. sklearn.
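The filter-and-binarize step might look roughly like the sketch below. It is not the repository's implementation: the three-class list is a hypothetical subset, `to_indicator` is an illustrative helper name, and the labels are assumed to have already been mapped to OpenMIC class names.

```python
# Hypothetical subset of the OpenMIC class list, for illustration only.
OPENMIC_CLASSES = ["guitar", "piano", "violin"]


def to_indicator(label_sets, classes=OPENMIC_CLASSES):
    """Drop examples with no known class; return (kept_indices, boolean matrix).

    Each matrix row has one boolean per class, in the order of `classes`.
    """
    kept, matrix = [], []
    for idx, labels in enumerate(label_sets):
        row = [name in labels for name in classes]
        if any(row):  # keep only examples tagged with at least one class
            kept.append(idx)
            matrix.append(row)
    return kept, matrix


# Three toy examples: the second has no instrument label and is filtered out.
examples = [{"guitar", "speech"}, {"dog"}, {"piano", "violin"}]
kept, Y = to_indicator(examples)
```

The nested list `Y` can be passed to `numpy.asarray` to obtain a boolean array in the multi-label layout that sklearn estimators accept.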
The OpenMIC subset for the unbalanced training set looks like the following, where Missing is (Expected - Actual) / Expected:
Instrument | Expected | Actual | Missing |
---|---|---|---|
guitar | 56926 | 56489 | 0.77% |
violin | 28065 | 28001 | 0.23% |
drums | 26331 | 26076 | 0.97% |
piano | 12744 | 12654 | 0.71% |
bass | 8549 | 8428 | 1.42% |
mallet_percussion | 7257 | 7128 | 1.78% |
voice | 6611 | 6549 | 0.94% |
cymbals | 5365 | 5301 | 1.19% |
ukulele | 5232 | 5172 | 1.15% |
cello | 5215 | 5148 | 1.28% |
synthesizer | 4981 | 4921 | 1.20% |
flute | 4721 | 4659 | 1.31% |
trumpet | 3771 | 3707 | 1.70% |
organ | 3578 | 3458 | 3.35% |
saxophone | 3013 | 2950 | 2.09% |
accordion | 2833 | 2772 | 2.15% |
trombone | 2731 | 2666 | 2.38% |
banjo | 2396 | 2336 | 2.50% |
mandolin | 2312 | 2252 | 2.60% |
harmonica | 2156 | 2095 | 2.83% |
clarinet | 2061 | 1998 | 3.06% |
harp | 1983 | 1921 | 3.13% |
bagpipes | 1715 | 1655 | 3.50% |
none of the above | 1841243 | 8000 | -- |
Note that we backfill the "none of the above" class with 8000 randomly sampled observations, chosen to be reasonably disjoint from the given instrument classes.
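One way to draw such a backfill is sketched below. The exact selection criterion is not stated here, so strict label disjointness is an assumption, and `sample_backfill` is a hypothetical helper rather than a function from this repository.

```python
import random


def sample_backfill(label_sets, instrument_classes, n_samples, seed=0):
    """Sample up to n_samples indices whose labels share nothing with the instrument classes."""
    instrument_classes = set(instrument_classes)
    # Candidates are examples carrying none of the instrument labels.
    candidates = [
        i for i, labels in enumerate(label_sets)
        if instrument_classes.isdisjoint(labels)
    ]
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    n = min(n_samples, len(candidates))
    return sorted(rng.sample(candidates, n))


# Toy label sets: indices 1, 2, and 4 carry no instrument label.
label_sets = [{"guitar"}, {"speech"}, {"dog"}, {"piano", "speech"}, {"wind"}]
negatives = sample_backfill(label_sets, {"guitar", "piano"}, n_samples=2)
```

For the table above, `n_samples` would be 8000 and the candidate pool would be the "none of the above" observations.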