Code for creating a dataset of MIDI ground truth
craffel Update
Add hardware section to clarify requirements.
Latest commit c79115e Oct 5, 2016

MIDI Dataset

The goal of this project is to match and align a very large collection of MIDI files to a very large collection of audio files so that the MIDI data can be used to infer ground truth information about the audio. Alternatively, this repository contains code for reproducing most of the results in [1], which describes the goals, ideas, and research behind this project in much greater detail.


  • If you're looking for a high-level overview of the techniques used in this project and the results, take a look at chapter 1 of my thesis [1].

  • This repository contains code for performing the matching; if you're looking for the "Lakh MIDI Dataset" itself (the result of using this code to match a collection of 176,581 MIDI files to the Million Song Dataset), you can find that here.

  • If you just want a tutorial on potential uses of the Lakh MIDI dataset, take a look at the Tutorial.ipynb notebook.

  • Over time, this project has undergone some restructuring; if you're looking for the version of this repository used in the experiments in [2], check this tag.


Before utilizing the code in this repository, you need to gather some data and software.


Create a folder called data in the root of this repository. In it, you need the following subdirectories:

  • clean_midi, which should contain the "clean MIDI subset", as described in section 5.2.1 of [1]. These MIDI files should live in data/clean_midi/mid. You can obtain this collection here.
  • unique_midi, which should contain LMD-full, the full collection of 176,581 files in the Lakh MIDI dataset. These MIDI files should live in data/unique_midi/mid. You can obtain this collection here.
  • uspop2002, cal10k, cal500, and msd, which should each contain audio files from each respective dataset (msd being the 7digital preview clips corresponding to the Million Song Dataset). The MP3 files should live in, e.g., data/uspop2002/mp3. Unfortunately, obtaining these MP3 files is non-trivial. If you need help tracking them down, please contact me directly.
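The layout described above can be created and checked with a few lines of Python. This is a minimal sketch, not a script from this repository: the helper names `create_data_layout` and `missing_data` are hypothetical, and it assumes that every audio dataset uses an mp3 subfolder in the same way as the data/uspop2002/mp3 example.

```python
import os

# Dataset folders named in the list above, mapped to the subfolder that holds
# their files. The "mp3" entries for cal10k, cal500, and msd are an assumption
# made by analogy with data/uspop2002/mp3.
DATA_LAYOUT = {
    "clean_midi": "mid",
    "unique_midi": "mid",
    "uspop2002": "mp3",
    "cal10k": "mp3",
    "cal500": "mp3",
    "msd": "mp3",
}


def create_data_layout(root="data"):
    """Create <root>/<dataset>/<kind> for every dataset and return the paths."""
    paths = []
    for dataset, kind in sorted(DATA_LAYOUT.items()):
        path = os.path.join(root, dataset, kind)
        if not os.path.isdir(path):
            os.makedirs(path)
        paths.append(path)
    return paths


def missing_data(root="data"):
    """Return the expected folders that do not exist or are still empty."""
    missing = []
    for dataset, kind in sorted(DATA_LAYOUT.items()):
        path = os.path.join(root, dataset, kind)
        if not os.path.isdir(path) or not os.listdir(path):
            missing.append(path)
    return missing
```

Calling `create_data_layout()` from the root of the repository and then `missing_data()` after copying files in gives a quick list of the collections that still need to be obtained.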

File lists

All of the datasets in the data subdirectory (except for unique_midi) should have a corresponding file list in the file_lists subdirectory. The only one which is not included in this repository is msd.txt; you can obtain it from the MSD directly (it's distributed with the MSD as unique_tracks.txt), or you can download it here and rename it to msd.txt.
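For orientation, here is a minimal sketch of reading a file list in the unique_tracks.txt format. It assumes each line holds a track ID, a song ID, an artist name, and a title separated by the literal string `<SEP>`, which is how the MSD distributes that file; whether the other file lists in file_lists use the same format is not stated here. The function names are hypothetical.

```python
import io

SEPARATOR = "<SEP>"


def parse_track_line(line):
    """Split one unique_tracks.txt line into its four fields."""
    # Limit the split to three separators so a title containing the
    # separator string does not produce extra fields.
    fields = line.rstrip("\r\n").split(SEPARATOR, 3)
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d: %r" % (len(fields), line))
    track_id, song_id, artist, title = fields
    return {
        "track_id": track_id,
        "song_id": song_id,
        "artist": artist,
        "title": title,
    }


def read_file_list(path):
    """Parse every non-empty line of a file list into a dictionary."""
    with io.open(path, encoding="utf-8") as f:
        return [parse_track_line(line) for line in f if line.strip()]
```

For example, `read_file_list("file_lists/msd.txt")` would return one dictionary per track, assuming the file has been downloaded and renamed as described above.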


All of the code in this repository is written for Python 2.7; it will likely need modification to work with Python 3.x. Here is a potentially incomplete list of the Python libraries used in this project:

  • numpy
  • scipy
  • librosa
  • pretty_midi
  • whoosh
  • joblib
  • deepdish
  • dhs
  • pse
  • msgpack
  • msgpack_numpy
  • lasagne
  • theano
  • sklearn
  • djitw
  • simple_spearmint
  • spearmint
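A quick way to see which of these libraries are still missing is to try importing each one. The sketch below is not part of this repository, and it assumes that each name in the list above is also the importable module name.

```python
import importlib

# Module names copied from the list above; treating them as import names is
# an assumption.
REQUIRED_MODULES = [
    "numpy", "scipy", "librosa", "pretty_midi", "whoosh", "joblib",
    "deepdish", "dhs", "pse", "msgpack", "msgpack_numpy", "lasagne",
    "theano", "sklearn", "djitw", "simple_spearmint", "spearmint",
]


def find_missing_modules(names=REQUIRED_MODULES):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

Running `find_missing_modules()` returns an empty list once everything is installed. It only checks that the import succeeds, not that the installed version is compatible with this code.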


All of this code was designed to be run on a server with 64 GB of RAM, 12 CPU cores, an NVIDIA GTX 980 Ti GPU, and plenty of hard drive space. If your own setup has fewer resources, you may need to modify some of the scripts in various places so that they use an appropriate amount of RAM, number of parallel processes, etc. In any case, please note that running all of the experiments and steps from beginning to end will take at least a few weeks of compute time.


The general structure of this repository is as follows: collections of shared utilities live in the base level; one-time-use scripts for assembling data and performing the actual MIDI-to-audio matching live in the scripts directory; and experiments for evaluating the effectiveness of different matching techniques live in experiments. Any data/results generated by running these different files are written out to a results directory. To re-run all of the experiments, matching, etc., proceed as described below.

  1. Create the Whoosh indices. This step uses the file lists to build Whoosh indices, which allow for fuzzy text matching of metadata. We use this fuzzy text matching to create training data for different matching algorithms. The indices are written out to, e.g., data/msd/index/.
  2. Match the clean_midi files by metadata. This step uses the Whoosh indices to match MIDI files from clean_midi (whose metadata is ostensibly reliable) to entries in the different audio datasets. It also takes care to group audio files which are recordings of the same song. The results are written to results/text_matches.js.
  3. Pre-compute constant-Q spectrograms for every entry in the Million Song Dataset. This saves time later on, as they are needed in several subsequent steps. They are written to data/msd/h5.
  4. Align the matched pairs. This step uses dynamic time warping (specifically the approach proposed in [3]) to align each MIDI-audio pair found by metadata matching. The results are written to results/clean_midi_aligned, and include both the aligned MIDI files in results/clean_midi_aligned/mid and "diagnostics files" in results/clean_midi_aligned/h5. The diagnostics files contain information about whether each match is truly a match (an incorrect match can be caused, e.g., by incorrect metadata or a bad transcription).
  5. Split the matches into train, validation, development, and test collections, which are used for evaluating each of the different matching approaches implemented in experiments.
  6. Generate the training data. This step inspects the results of the previous steps to find good matches and generates training data for the different matching approaches in a convenient format. It essentially produces saved constant-Q spectrograms of audio files, aligned MIDI files, unaligned MIDI files, and aligned MIDI piano rolls, in various folders in results.
  7. Run the experiments! Each subdirectory in the experiments directory corresponds to a different MIDI-audio matching technique. Each of these experiments contains at least a matching script, which uses the matching technique to match each MIDI file in either the development or test set to the MSD and writes out the results. Most of the experiments also have a script which precomputes any necessary features/representations of entries in the development and test sets. Finally, those experiments which are based on machine learning techniques also have a script which trains any models necessary for performing the matching. In short, to run each of these experiments, run whichever of these scripts it provides, finishing with the matching script. The results can be used to measure the effectiveness of each approach. There isn't a script which performs this analysis automatically, but there is a great deal of analysis in my thesis [1].
  8. To actually match the unique_midi collection to the Million Song Dataset, use the matching script. For flexibility, this script takes two command line arguments: first, a glob for the MIDI files you want to match, and second, a path to where the results should be written. To match the entire unique_midi dataset to the MSD, pass ../data/unique_midi/mid/*/*.mid as the glob and output_path as the results path. This will produce (in output_path) one file for each MIDI file processed, which lists potential matches in the MSD and the corresponding confidence scores.
  9. To assemble a collection of matched-and-aligned MIDI files, use the assembly script. This will find all MIDI-audio matches produced by the previous step which have a sufficiently high confidence score, re-align them, and write out the aligned MIDI file, along with the unaligned MIDI, MP3 file, and MSD H5, for convenience. In essence, this is how, at long last, each component of the Lakh MIDI dataset is produced.
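
To give a feel for the alignment in step 4, here is a minimal sketch of plain textbook dynamic time warping in pure Python. It is not the optimized approach from [3] and not code from this repository: the function names are hypothetical, and the two short lists of numbers stand in for sequences of feature frames (e.g. spectrogram columns of a synthesized MIDI file and of an audio recording).

```python
def pairwise_cost(a, b):
    """Absolute difference between every element of a and every element of b."""
    return [[abs(x - y) for y in b] for x in a]


def dtw(cost):
    """Return (total_cost, path) for the cheapest monotonic alignment.

    cost[i][j] is the distance between frame i of the first sequence and
    frame j of the second. The path is a list of (i, j) index pairs running
    from (0, 0) to the last frame of both sequences.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] holds the cheapest cumulative cost of reaching cell (i, j).
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = inf
            if i > 0:
                best = min(best, acc[i - 1][j])
            if j > 0:
                best = min(best, acc[i][j - 1])
            if i > 0 and j > 0:
                best = min(best, acc[i - 1][j - 1])
            acc[i][j] = cost[i][j] + best
    # Walk back from the end, always moving to the cheapest predecessor.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[n - 1][m - 1], path
```

For instance, `dtw(pairwise_cost([1, 2, 3], [1, 2, 2, 3]))` returns a total cost of 0 and the path `[(0, 0), (1, 1), (1, 2), (2, 3)]`: the second element of the first sequence is stretched to cover the repeated 2 in the second. A low total cost relative to the path length is the kind of signal that can indicate a good match, although the confidence score used in this project is described in [1] and [3] rather than here.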


References

  1. Colin Raffel. "Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching". PhD Thesis, 2016.
  2. Colin Raffel and Daniel P. W. Ellis. "Large-Scale Content-Based Matching of MIDI and Audio Files". Proceedings of the 16th International Society for Music Information Retrieval Conference, 2015.
  3. Colin Raffel and Daniel P. W. Ellis. "Optimizing DTW-Based Audio-to-MIDI Alignment and Matching". Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.
  4. Colin Raffel and Daniel P. W. Ellis. "Pruning Subsequence Search with Attention-Based Embedding". Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.