The goal of this project is to match and align a very large collection of MIDI files to a very large collection of audio files so that the MIDI data can be used to infer ground truth information about the audio. Alternatively, this repository contains code for reproducing most of the results in , which describes the goals, ideas, and research behind this project in much greater detail.
If you're looking for a high-level overview of the techniques used in this project and the results, take a look at chapter 1 of my thesis.
This repository contains code for performing the matching; if you're looking for the "Lakh MIDI Dataset" itself (the result of using this code to match a collection of 176,581 MIDI files to the Million Song Dataset), you can find that here.
If you just want a tutorial on potential uses of the Lakh MIDI dataset, take a look at the Tutorial.ipynb notebook.
Over time, this project has undergone some restructuring; if you're looking for the version of this repository used in the experiments in , check this tag.
Before utilizing the code in this repository, you need to gather some data and software.
Create a folder called data in the root of this repository. In it, you need the following subdirectories:
- clean_midi, which should contain the "clean MIDI subset", as described in section 5.2.1 of . These MIDI files should live in data/clean_midi/mid. You can obtain this collection here.
- unique_midi, which should contain the 176,581 files of the Lakh MIDI dataset (aka LMD-full). These MIDI files should live in data/unique_midi/mid. You can obtain this collection here.
- msd, which should each contain audio files from each respective dataset (msd being the 7digital preview clips corresponding to the Million Song Dataset). The MP3 files should live in, e.g., data/uspop2002/mp3. Unfortunately, obtaining these MP3 files is non-trivial. If you need help tracking them down, please contact me directly.
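As a convenience, the folder layout described above could be created with a few lines of Python. This is only a sketch and not part of the repository: the function name make_data_dirs is hypothetical, and the data/msd/mp3 path is an assumption extrapolated from the data/uspop2002/mp3 example. It should run under both Python 2.7 and Python 3.

```python
import os

# Dataset folder -> leaf folder holding the files.
# clean_midi/mid and unique_midi/mid come from the text above;
# msd/mp3 is an assumption based on the data/uspop2002/mp3 example.
LAYOUT = {
    "clean_midi": "mid",
    "unique_midi": "mid",
    "msd": "mp3",
}


def make_data_dirs(repo_root):
    """Create data/<dataset>/<leaf> under repo_root and return the paths."""
    created = []
    for dataset, leaf in sorted(LAYOUT.items()):
        path = os.path.join(repo_root, "data", dataset, leaf)
        # Only create the folder if it is not already there.
        if not os.path.isdir(path):
            os.makedirs(path)
        created.append(path)
    return created
```

Calling make_data_dirs(".") from the repository root creates any missing folders; the MIDI and MP3 files themselves still have to be obtained separately.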
All of the datasets in the data subdirectory (except for unique_midi) should have a corresponding file list in the file_lists subdirectory. The only one which is not included in this repository is msd.txt; you can obtain that from the MSD directly (it's distributed with the MSD as unique_tracks.txt) or you can also download it here and rename
All of the code in this repository is written for Python 2.7; it will likely need modification to work with Python 3.x. Here is a potentially incomplete list of the Python libraries used in this project:
All of this code was designed to be run on a server with 64 GB of RAM, 12 CPU cores, an NVIDIA GTX 980 Ti GPU, and plenty of hard drive space. If your own setup has fewer resources, you may need to modify some of the scripts in various places so that they use an appropriate amount of RAM, parallel processes, etc. In any case, please note that running all of the experiments and steps from beginning to end will take at least a few weeks of compute time.
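If you do need to scale things down, one simple way to pick a number of parallel processes is to bound it by both the core count and a rough memory budget. The helper below is a hypothetical sketch, not code from this repository: pick_n_jobs and its parameters are made-up names, and the per-process memory figure is something you would have to measure on your own machine.

```python
import multiprocessing


def pick_n_jobs(ram_gb, ram_per_job_gb, max_jobs=None):
    """Return a worker count bounded by CPU cores and a RAM budget.

    ram_gb: memory you are willing to dedicate, in GB.
    ram_per_job_gb: estimated peak memory of one worker, in GB (assumption).
    max_jobs: optional hard upper limit on the number of workers.
    """
    cores = multiprocessing.cpu_count()
    # How many workers fit in the memory budget (at least one).
    by_ram = max(1, int(ram_gb // ram_per_job_gb))
    n_jobs = min(cores, by_ram)
    if max_jobs is not None:
        n_jobs = min(n_jobs, max_jobs)
    return max(1, n_jobs)
```

For example, pick_n_jobs(16, 4) never returns more than 4 workers, regardless of how many cores are available.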
The general structure of this repository is as follows: Collections of shared utilities (whoosh_search.py) live in the base level, one-time-use scripts for assembling data and performing the actual MIDI-to-audio matching live in the scripts directory, and experiments for evaluating the effectiveness of different matching techniques live in experiments. Any data/results generated by running these different files are written out to a results directory. To re-run all of the experiments, matching, etc., proceed as described below.
- create_whoosh_indices.py. This uses the file lists to create Whoosh indices, which allow for fuzzy text matching of metadata. We use this fuzzy text matching to create training data for different matching algorithms. The indices are written out to, e.g.,
- text_match_datasets.py. This uses the Whoosh indices to match MIDI files from clean_midi (which ostensibly may have reliable metadata) to entries in the different audio datasets. It also takes care to group audio files which are recordings of the same song. The results are written to
- create_msd_cqts.py. This pre-computes constant-Q spectrograms for every entry in the Million Song Dataset, which saves time later on as we will need these for various steps throughout the process. They are written to
- align_text_matches.py. This uses dynamic time warping (specifically the approach proposed in ) to align each MIDI-audio pair found by metadata matching. The results are written to results/clean_midi_aligned, and include both the aligned MIDI files in results/clean_midi_aligned/mid and "diagnostics files" in results/clean_midi_aligned/h5. The diagnostics files contain information about whether each match is truly a match (an incorrect match can be caused, e.g., by incorrect metadata or a bad transcription).
- split_training_data.py. This splits the matches into train, validation, development, and test collections which are used for evaluating each of the different matching approaches implemented in
- create_training_data.py. This inspects the results of align_text_matches.py to find good matches and generates training data for different matching approaches in a convenient format. It essentially produces saved constant-Q spectrograms of audio files, aligned MIDI files, unaligned MIDI files, and aligned MIDI piano rolls, in various folders in
- Run the experiments! Each subdirectory in the experiments directory corresponds to a different MIDI-audio matching technique. Each of these experiments at least contains a script called match_msd.py, which uses the matching technique to match each MIDI file in either the development or test set to the MSD and writes out the results. Most of the experiments have a script called precompute.py, which precomputes any necessary features/representations of entries in the development and test set. Finally, those experiments which are based on machine learning techniques also have a script parameter_search.py which trains any models necessary for performing the matching. In short, to run each of these experiments, run parameter_search.py if it exists, run precompute.py, and finally run match_msd.py. The results can be used to measure the effectiveness of each approach. There isn't a script which performs this analysis automatically, but there is a great deal of analysis in my thesis.
- To actually match the unique_midi collection to the Million Song Dataset, use the match.py script. For flexibility, this script takes a few command line arguments: first, a glob to MIDI files you want to match, and second, a path to where to write the results. To match the entire unique_midi dataset to the MSD, call it like so: python match.py ../data/unique_midi/mid/*/*.mid output_path. This will produce (in output_path) one file for each MIDI file processed which lists potential matches in the MSD and the corresponding confidence scores.
- To assemble a collection of matched-and-aligned MIDI files, use the script assemble_aligned_matches.py. This will find all MIDI-audio matches produced by match.py which have a sufficiently high confidence score, re-align them, and write out the aligned MIDI file, along with the unaligned MIDI, MP3 file, and MSD H5, for convenience. In essence, this is how, at long last, each component of the Lakh MIDI dataset is produced.
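The alignment and re-alignment steps above rely on dynamic time warping (DTW). As a rough illustration of the idea, here is a minimal textbook DTW sketch in pure Python. It is not the implementation used in this repository, which follows the approach described in the papers listed below; the function names dtw and pairwise_cost, the absolute-difference cost, and the toy one-dimensional sequences are all assumptions made for illustration.

```python
def pairwise_cost(a, b):
    """Cost matrix of absolute differences between two 1D sequences."""
    return [[abs(x - y) for y in b] for x in a]


def dtw(cost):
    """Plain DTW over a cost matrix given as a list of lists.

    Returns (total_cost, path), where path is a list of (i, j) index
    pairs running from (0, 0) to the last entry of the matrix.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] is the cheapest cost of aligning the first i rows
    # with the first j columns; row/column 0 act as a border.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_prev
    # Backtrack from the end to recover the lowest-cost path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (acc[i - 1][j], i - 1, j),          # step in the first sequence
            (acc[i][j - 1], i, j - 1),          # step in the second sequence
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n][m], path
```

For instance, dtw(pairwise_cost([0, 1, 2, 3], [0, 1, 1, 2, 3])) aligns the repeated 1 in the second sequence to the single 1 in the first. A real MIDI-to-audio aligner compares feature frames rather than scalars, and would normally use a vectorized implementation for speed.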
- Colin Raffel. "Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching". PhD Thesis, 2016.
- Colin Raffel and Daniel P. W. Ellis. "Large-Scale Content-Based Matching of MIDI and Audio Files". Proceedings of the 16th International Society for Music Information Retrieval Conference, 2015.
- Colin Raffel and Daniel P. W. Ellis. "Optimizing DTW-Based Audio-to-MIDI Alignment and Matching". Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.
- Colin Raffel and Daniel P. W. Ellis. "Pruning Subsequence Search with Attention-Based Embedding". Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.