Home
Welcome to the Musical Corpora Register!
The purpose of this wiki is to collect links to published musical corpora, together with short explanations. We hope it is useful to students and researchers who study music. The corpora are not yet listed in any particular order. Everybody is welcome to contribute!
RS200 Pop / Rock corpus of harmonic labels
- http://rockcorpus.midside.com
- by Trevor deClercq and David Temperley
- first published in 2011
- corpus of harmonic labels for Pop / Rock songs in standard roman numeral notation
- planned to expand to all 500 songs on Rolling Stone magazine's "500 Greatest Songs of All Time" list
iReal Jazz chord sequences
- https://www.musiccognition.osu.edu/resources/
- by Yuri Broze and Daniel Shanahan
- in Humdrum format
- corpus of chord sequences of Jazz standards from Realbooks
- community-based data set
McGill Billboard Project
- http://ddmal.music.mcgill.ca/research/billboard
- by John Ashley Burgoyne, Jonathan Wild, and Ichiro Fujinaga
- An Expert Ground Truth Set for Audio Chord Recognition and Music Analysis
SUPRA (Stanford University Piano Roll Archive)
- https://supra.stanford.edu, library exhibit
- Digitized piano rolls from the Stanford Libraries, including scans, MIDI transcriptions, and audio renderings
- Expressive piano rolls (live-recorded) of professional pianists
- Currently 470 Welte-Mignon T100 (red) rolls from 1904–1920. github, supra-rw dataset
- ISMIR 2020 paper
- CC BY-NC-SA 4.0 (derivatives must be published under a similar license)
ASAP Dataset (Aligned Scores and Performances)
- https://github.com/fosfrancesco/asap-dataset
- MIDI and audio performances temporally matched to sheet music (in MusicXML and MIDI)
- Including beat, downbeat, time signature, and key signature annotations
- 1068 MIDI performances, 520 audio performances (from MAESTRO), aligned to 222 pieces
MAESTRO (by Magenta)
- https://magenta.tensorflow.org/datasets/maestro
- about 200 hours of solo piano recordings, temporally aligned with MIDI
TAVERN
- http://u.osu.edu/tavern/
- by Johanna Devaney, Claire Arthur, Nathaniel Condit-Schultz, and Kirsten Nisula
- theme and variation encodings annotated with roman numerals
- themes and variations for piano by Mozart and Beethoven, divided into 1060 phrases
Beatles Corpus
- http://isophonics.net/content/reference-annotations-beatles
- by Chris Harte
- annotations of Beatles songs
- annotated features: beats, chords, keys, and form
Yale Classical Archives Corpus
- http://ycac.yale.edu/downloads
- by Christopher White and Ian Quinn
- poster from ISMIR 2014
- pitch-class and time data from MIDI files contributed by users of http://classicalarchives.com
- data is presented using salami slices
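"Salami slicing" cuts a piece into a new vertical slice whenever the set of sounding notes changes. The following is a minimal sketch of that idea, not the corpus authors' implementation. The `(onset, offset, MIDI pitch)` note layout and the function name `salami_slices` are assumptions made for this illustration, and the sketch merges neighbouring slices with identical pitch-class content, which the corpus itself may not do.

```python
from typing import List, Tuple

# Hypothetical input layout: (onset, offset, MIDI pitch), times in beats.
Note = Tuple[float, float, int]


def salami_slices(notes: List[Note]) -> List[Tuple[float, List[int]]]:
    """Return (slice onset, sorted pitch classes) for each change in sounding notes."""
    # Every onset or offset is a boundary where the sounding set may change.
    boundaries = sorted({t for onset, offset, _ in notes for t in (onset, offset)})
    slices: List[Tuple[float, List[int]]] = []
    for start, end in zip(boundaries, boundaries[1:]):
        # A note sounds throughout the span if it covers the whole span.
        sounding = sorted({pitch % 12 for onset, offset, pitch in notes
                           if onset <= start and offset >= end})
        # Keep a slice only if its content differs from the previous slice.
        if not slices or slices[-1][1] != sounding:
            slices.append((start, sounding))
    return slices


# Toy input: C4 held for two beats while E4 and then G4 sound above it.
example = [(0.0, 2.0, 60), (0.0, 1.0, 64), (1.0, 2.0, 67)]
print(salami_slices(example))  # [(0.0, [0, 4]), (1.0, [0, 7])]
```

The toy input yields two slices: pitch classes C and E from beat 0, then C and G from beat 1.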
ELVIS project
- https://elvisproject.ca
- part of SIMSSA, the Single Interface for Music Score Searching and Analysis project
- 2852 Pieces and 3358 Movements by 164 Composers
- symbolic data in formats such as MEI, MusicXML, MIDI, and others
RAMEAU
- https://github.com/kroger/rameau
- by Pedro Kröger, Alexandre Passos, Marcos Sampaio, and Givaldo de Cidra
- the paper that describes the data set
Band-in-a-Box Jazz standards
- Band-in-a-Box files available at http://bhs.minor9.com/
- converted by Keunwoo Choi, George Fazekas, and Mark Sandler into one .txt-file for the research presented in this article
- chords of Jazz standards with time information in beats
Weimar Jazz Database
- http://jazzomat.hfm-weimar.de/dbformat/dbcontent.html
- part of the Jazzomat Research Project
- time-annotated MIDI melodies from monophonic Jazz solos
- chords and transcriptions in staff notation included
RWC (Real World Computing) Music Database
GTTM Database
- http://gttm.jp/gttm/database/
- by Masatoshi Hamanaka, Keiji Hirata, and Satoshi Tojo
- 300 8-bar phrases of monophonic melodies from western classical music
- XML format
Kostka-Payne Corpus
- http://davidtemperley.com/kp-stats/
- by David Temperley
- corpus consisting of 46 chord-analyzed excerpts in the workbook accompanying the theory textbook Tonal Harmony by Stefan Kostka and Dorothy Payne
Dutch Folk Song Database (The Meertens Tune Collections / MTC-ANN)
- http://www.liederenbank.nl/mtc/
- formats: **kern, MIDI, LilyPond, MP3
Essen Associative Code and Folksong Database
Finnish Folk Song Database
- http://esavelmat.jyu.fi/collection_download.html
- by Tuomas Eerola and Petri Toiviainen
Annotated jazz chord progression corpus
- http://jazzparser.granroth-wilding.co.uk/ParserPaper.html
- by Mark Granroth-Wilding and Mark Steedman
Verovio Humdrum Viewer Online Repertories
- http://doc.verovio.humdrum.org/repertory/
- scores in Humdrum format, directly accessible using the Verovio Humdrum Viewer
Kern Scores Music Collection
- http://kern.humdrum.org/cgi-bin/browse?l=/
- A library of virtual musical scores in the Humdrum **kern data format.
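For readers new to Humdrum, a **kern file is plain text in which tab-separated columns (spines) run in parallel, one record per line. Lines starting with `!` are comments, lines starting with `*` are interpretations, and lines starting with `=` are barlines. The sketch below pulls out only the data records. The excerpt in `KERN_TEXT` and the function name `data_records` are made up for this example, and real files contain features (spine splits, null tokens, and so on) that a dedicated Humdrum toolkit handles properly.

```python
from typing import List

# Hypothetical two-spine excerpt written for this example, not taken from the collection.
KERN_TEXT = "\n".join([
    "!!!COM: Example",
    "**kern\t**kern",
    "*M4/4\t*M4/4",
    "=1\t=1",
    "4C\t4e",
    "4D\t4f",
    "2E\t2g",
    "==\t==",
    "*-\t*-",
])


def data_records(text: str) -> List[List[str]]:
    """Return the data tokens of each record, one list of spine tokens per line."""
    records = []
    for line in text.splitlines():
        # Skip empty lines, comments (!), interpretations (*), and barlines (=).
        if not line or line[0] in "!*=":
            continue
        records.append(line.split("\t"))
    return records


print(data_records(KERN_TEXT))  # [['4C', '4e'], ['4D', '4f'], ['2E', '2g']]
```

Each inner list holds the tokens that sound together, one token per spine.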
MuseData
- http://www.musedata.org/
- krn format.
- mostly Baroque and Classical music.
Henrik Norbeck's ABC Tunes
- http://www.norbeck.nu/abc/
- by Henrik Norbeck, Stockholm, Sweden.
- A free online tune book of mostly Irish and Swedish traditional music
- Sheet music and lyrics for more than 2800 tunes in ABC format
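An ABC tune starts with header fields of the form `letter:value` (for example `X:` for the reference number, `T:` for the title, `M:` for the meter, and `K:` for the key, which closes the header), followed by the tune body. The sketch below collects those header fields. The tune in `ABC_TUNE` and the function name `abc_header` are invented for this illustration and are not taken from the tune book.

```python
from typing import Dict

# Hypothetical tune written for this example; it is not from the tune book.
ABC_TUNE = """X:1
T:Example Reel
R:reel
M:4/4
L:1/8
K:D
D2FA d2fd|e2ge c2ec|"""


def abc_header(tune: str) -> Dict[str, str]:
    """Collect the header fields of one ABC tune, up to and including K:."""
    fields: Dict[str, str] = {}
    for line in tune.splitlines():
        # Header lines look like "<letter>:<value>".
        if len(line) >= 2 and line[0].isalpha() and line[1] == ":":
            fields[line[0]] = line[2:].strip()
            if line[0] == "K":  # the key field ends the header
                break
    return fields


print(abc_header(ABC_TUNE))
```

A repeated field (such as a second `T:` line) would overwrite the first in this sketch; storing a list per field would avoid that.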
Collection of World Music Corpora
- http://compmusic.upf.edu/corpora
- Carnatic, Hindustani, Turkish-Maqam, Beijing Opera, and Arab-Andalusian
- mix of audio and symbolic formats
Harmonic analysis of Joseph Haydn's "Sun Quartets"
- https://github.com/napulen/haydn_op20_harm
- 6 Classical string quartets analyzed
- 5000+ chord annotations in the **harm syntax
- annotated by Nestor Napoles and Rafael Caro
Analyses of the Algomus group
- http://www.algomus.fr/data/
- sonata form structure and cadences (2000+ labels) of 32 Mozart string quartet movements
- S/CS/CS2 patterns, cadences, pedals (1000+ labels) of 24 Bach fugues + 12 Shostakovich fugues (op. 87, 1952)
The Digital Mozart Score Viewer
- https://dme.mozarteum.at/movi/en
- High quality MEI scores that need to be downloaded individually
Digital Edition of Mozart Piano Sonatas
- https://github.com/craigsapp/mozart-piano-sonatas
- Humdrum encodings as well as PDFs of source scans
- CC BY-NC-SA 4.0 (derivatives must be published under a similar license)
Beethoven Piano Sonatas with Functional Harmony (BPS-FH)
- https://github.com/Tsung-Ping/functional-harmony
- chord and phrase annotations for first movements of 32 piano sonatas (Excel files)
- Explaining paper: http://ismir2018.ircam.fr/doc/pdfs/178_Paper.pdf
Digital Edition of Beethoven Piano Sonatas
- https://github.com/craigsapp/beethoven-piano-sonatas
- Humdrum encodings as well as PDFs of source scans
Beethoven-Werkstatt
- https://github.com/BeethovensWerkstatt/module2/tree/dev/data/works
- roughly 20 MEI scores (July 20)
Josquin Research Project
- Jesse Rodin, Craig Sapp, Clare Bokulich
- https://josquin.stanford.edu, github
- ca. 1200 movements from ca. 1420–1520
- collected in Humdrum (on GitHub), available in many other formats
- CC BY-SA 4.0 (derivatives must be published under a similar license)
- web interface for analytic queries
Tasso in Music Project
- Emiliano Ricciardi, Craig Sapp
- https://www.tassomusic.org, github
- complete critical edition of musical settings of Torquato Tasso's poems
- ca. 750 madrigals and related genres from 1571–1649
- collected in Humdrum (on GitHub), available in many other formats
- web interface for analytic queries
- Music Encoding Conference 2020 paper
JKU Pattern Development Database
- for "Discovery of Repeated Themes & Sections" MIREX task
- http://tomcollinsresearch.net/research/data/mirex/JKUPDD-Aug2013.zip
kunstderfuge.com
- 19,300 MIDI files in total, 17,500 in "XL Zip Archive"
- requires "academic subscription"
- website info
Million Song Dataset
The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks.
- The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Sample audio can, however, be fetched from services like 7digital using code provided by the dataset's maintainers.
- The Million Song Dataset is also a cluster of complementary datasets contributed by the community:
- SecondHandSongs dataset -> cover songs
- musiXmatch dataset -> lyrics
- Last.fm dataset -> song-level tags and similarity
- Taste Profile subset -> user data
- thisismyjam-to-MSD mapping -> more user data
- tagtraum genre annotations -> genre labels
- Top MAGD dataset -> more genre labels
- Link
The Lakh MIDI Dataset
- http://colinraffel.com/projects/lmd/
- MIDI format
Art song vocal lines
- Collection of vocal lines from songs by 19th century French and German composers
- .krn format.
- Leigh Van Handel
ScoresOfScores - Lieder Encoding Project
- https://github.com/MarkGotham/ScoresOfScores
- 300 works that complete some of Leigh Van Handel's vocal lines (see "Art song vocal lines") into full songs
- xml and mscx formats.
OpenScore
- https://musescore.com/OpenScore
- Sheet music in MuseScore (mscx) format
Choral Wiki
- https://www.cpdl.org/wiki/
- Sheet music of choral music in various engraving formats
- community of music lovers, especially for Baroque and Renaissance
music21 Corpus
- http://web.mit.edu/music21/doc/about/referenceCorpus.html
- metacorpus
- formats parsed by music21
Nottingham dataset, cleaned version
Neuma
- http://neuma.huma-num.fr/
- MEI format
- 12 composers (Mozart, Haydn, Brumel, Bach, Berlioz, ...)
CMME (Computerized Mensural Music Editing)
- large collections of 16th century scores
- available on GitHub
- with a tool to translate to MusicXML
- http://www.cmme.org/
Links
- List of musical corpora (also audio): http://musicalmetacreation.org/links/corpora/. The individual links listed there will also be incorporated into this list in the future.
- List of data sets by David Meredith: http://www.titanmusic.com/data.php
SymbTr (Turkish Maqam; symbolic)
- SymbTr-scores are provided in text, MusicXML, PDF, MIDI and mu2 formats
- https://github.com/MTG/SymbTr
Johann Crueger Cantional Settings
- LilyPond files
- https://miami.uni-muenster.de/Record/c8e13273-c323-4c20-93f3-e3e6caff3224 (ZIP file)