# Data Preparation 

This notebook contains code for downloading and preprocessing data to be used in all 
sessions of the course.

Copyright-free data will be shared in the repository, so you won't need to run this 
notebook.

In [None]:
import random
import shutil
from pathlib import Path

import kagglehub

# Set seed for reproducibility
random.seed(1234)

In [5]:
# Set global variables
SAMPLE_SIZE = 100

## MIDI Dataset
As the main MIDI dataset for the course, we will use the [Lakh MIDI Dataset (LMD)](https://colinraffel.com/projects/lmd/), a collection of 176,581 unique MIDI files.
We will use the cleaned subset of the dataset, which contains 45,129 files and is 
available for download on [Kaggle](https://www.kaggle.com/datasets/imsparsh/lakh-midi-clean).

For the sake of simplicity, we will create a subset of the dataset containing only 100 files.

In [None]:
# Define variables
MIDI_DATASET_PATH = Path("./midi")

In [2]:
# Download latest version
path = kagglehub.dataset_download("imsparsh/lakh-midi-clean")

print("Path to dataset files:", path)

Downloading to /Users/andreapoltronieri/.cache/kagglehub/datasets/imsparsh/lakh-midi-clean/1.archive...
100%|██████████| 226M/226M [00:06<00:00, 37.7MB/s]
Extracting files...
Path to dataset files: /Users/andreapoltronieri/.cache/kagglehub/datasets/imsparsh/lakh-midi-clean/versions/1

In [12]:
# The dataset is organized in folders by track names. We'll iterate through all folders
# and collect MIDI files.

MIDI_DATASET_PATH.mkdir(exist_ok=True)

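# Sort the matches: rglob order depends on the filesystem, so a fixed order is
# needed for the seed to reproduce the same sample.
all_midi_files = sorted(Path(path).rglob("*.mid"))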

# Sample a subset of files
sampled_files = random.sample(all_midi_files, SAMPLE_SIZE)
for midi_file in sampled_files:
    destination = MIDI_DATASET_PATH / midi_file.name
    shutil.copy(midi_file, destination)

print(f"Sampled {SAMPLE_SIZE} MIDI files to {MIDI_DATASET_PATH.resolve()}")

Sampled 100 MIDI files to /Users/andreapoltronieri/Documents/Projects/SyMP-CM/2026/data/midi
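
The copy step above writes every sampled file into one flat folder, so two files with the same name in different source folders would silently overwrite each other, and `random.sample` raises an error if fewer files are found than requested. The following is a minimal sketch of a more defensive variant, not the method used to build the shared data. The helper name `sample_and_copy` and its naming scheme (prefixing each file with its parent folder name) are assumptions made for illustration.

```python
import random
import shutil
from pathlib import Path


def sample_and_copy(source_dir, target_dir, pattern, sample_size, seed=1234):
    """Copy a reproducible random sample of files matching `pattern`."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Sort first so the sample depends only on the seed and the set of files,
    # not on the order in which the filesystem lists them.
    candidates = sorted(source_dir.rglob(pattern))
    if len(candidates) < sample_size:
        raise ValueError(
            f"Requested {sample_size} files, found only {len(candidates)}"
        )

    rng = random.Random(seed)  # local generator, leaves the global state alone
    copied = []
    for src in rng.sample(candidates, sample_size):
        # Hypothetical naming scheme: prefix the parent folder name so that
        # equally named files from different folders do not collide.
        dest = target_dir / f"{src.parent.name}__{src.name}"
        shutil.copy(src, dest)
        copied.append(dest)
    return copied
```

With this helper, the cell above would reduce to something like `sample_and_copy(path, MIDI_DATASET_PATH, "*.mid", SAMPLE_SIZE)`. The prefix only protects against collisions between direct parent folders; deeper nesting would need a longer prefix.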


## MusicXML Dataset

As the MusicXML dataset, we will use a subset of [PDMX](https://zenodo.org/records/13763756), a large collection (~250k) of public domain MusicXML files from [musescore.com](https://musescore.com/).

We will use a small subset of the dataset containing only 100 files. The dataset is available for download on [Zenodo](https://zenodo.org/record/13763756). First, download the dataset and extract the files. Then set `ZENODO_DATASET_PATH` in the code cell below to the extracted folder that contains the `.mxl` files.

In [None]:
# Set variables for MusicXML dataset
ZENODO_DATASET_PATH = Path("/home/must/Documents/PDMX/mxl")

MUSICXML_DATASET_PATH = Path("./musicxml")

In [12]:
MUSICXML_DATASET_PATH.mkdir(exist_ok=True)

# Iterate over all folders and subfolders to collect MusicXML files (sorted so
# that the seeded sample is reproducible across machines)
all_musicxml_files = sorted(ZENODO_DATASET_PATH.rglob("*.mxl"))
print(
    f"Found {len(all_musicxml_files)} MusicXML files in {ZENODO_DATASET_PATH.resolve()}"
)

# Sample a subset of files
sampled_musicxml_files = random.sample(all_musicxml_files, SAMPLE_SIZE)

for musicxml_file in sampled_musicxml_files:
    destination = MUSICXML_DATASET_PATH / musicxml_file.name
    shutil.copy(musicxml_file, destination)

print(f"Sampled {SAMPLE_SIZE} MusicXML files to {MUSICXML_DATASET_PATH.resolve()}")

Found 254035 MusicXML files in /home/must/Documents/PDMX/mxl
Sampled 100 MusicXML files to /home/must/Projects/SyMP-CM/2026/data/musicxml
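
Because the PDMX folder has to be downloaded manually, the hard-coded path above will differ on every machine. One possible way to handle this is sketched below: read the location from an environment variable and fail early with a readable message if the folder is missing. The helper `resolve_dataset_dir` and the variable name `PDMX_MXL_DIR` are hypothetical and not part of the notebook or of PDMX.

```python
import os
from pathlib import Path


def resolve_dataset_dir(default, env_var="PDMX_MXL_DIR"):
    """Return the dataset folder, preferring an environment variable."""
    # `PDMX_MXL_DIR` is a hypothetical variable name chosen for this sketch.
    candidate = Path(os.environ.get(env_var, default)).expanduser()
    if not candidate.is_dir():
        raise FileNotFoundError(
            f"Dataset folder not found: {candidate}. "
            f"Download and extract PDMX, then set {env_var} or edit the default."
        )
    return candidate
```

It could then be used as `ZENODO_DATASET_PATH = resolve_dataset_dir("/home/must/Documents/PDMX/mxl")`, which keeps the original path as the fallback.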


## ABC Dataset

As the ABC notation dataset, we will use the [Nottingham dataset](http://abc.sourceforge.net/NMD/), a collection of folk tunes in ABC notation format.

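For simplicity, we'll download the dataset from [Kaggle](https://www.kaggle.com/datasets/tishyatripathi/nottingham-music-dataset). The Kaggle version ships the tunes as a single text file rather than as individual files, so we split it into tunes before sampling.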

In [19]:
# Set variables for ABC dataset
ABC_DATASET_PATH = Path("./abc")

In [16]:
# Download latest version
path = kagglehub.dataset_download("tishyatripathi/nottingham-music-dataset")

print("Path to dataset files:", path)

Path to dataset files: /home/must/.cache/kagglehub/datasets/tishyatripathi/nottingham-music-dataset/versions/2


In [20]:
# Parse the dataset, which is a single .txt file containing all tunes
nottingham_txt_file = Path(path) / "input_revised.txt"

# Read the file and split tunes by "X:" which indicates the start of a new tune
with open(nottingham_txt_file, "r") as f:
    content = f.read()
tunes = content.split("\nX:")[1:]  # Skip the first empty split
print(f"Found {len(tunes)} tunes in the Nottingham dataset.")

Found 338 tunes in the Nottingham dataset.
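
Splitting on `"\nX:"` and discarding the first piece assumes that the file begins with something other than a tune. If the file instead starts directly with an `X:` line, the first tune ends up in the discarded piece. Whether that applies to `input_revised.txt` depends on how the file begins, which is not shown here. As a hedged alternative, the sketch below splits at every line that starts with `X:` (the reference-number field that opens each ABC tune) and keeps every piece that is actually a tune. The function name `split_abc_tunes` is an assumption for illustration.

```python
import re


def split_abc_tunes(content):
    """Split an ABC tune book into tunes, one per `X:` reference field."""
    # Zero-width split at every line that starts with "X:" (Python 3.7+).
    chunks = re.split(r"(?m)^(?=X:)", content)
    # Keep only chunks that are real tunes; this drops any preamble before
    # the first "X:" line without dropping the first tune itself.
    return [chunk.strip() + "\n" for chunk in chunks if chunk.startswith("X:")]
```

Unlike the cell above, the returned tunes keep their `X:` prefix, so they could be written to disk as they are.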


In [21]:
# Randomly sample a subset of tunes
sampled_tunes = random.sample(tunes, SAMPLE_SIZE)

# Save sampled tunes to individual .abc files
ABC_DATASET_PATH.mkdir(exist_ok=True)

for i, tune in enumerate(sampled_tunes):
    abc_file = ABC_DATASET_PATH / f"tune_{i + 1}.abc"
    with open(abc_file, "w") as f:
        f.write("X:" + tune)  # Add back the "X:" prefix
print(f"Sampled {SAMPLE_SIZE} ABC files to {ABC_DATASET_PATH.resolve()}")

Sampled 100 ABC files to /home/must/Projects/SyMP-CM/2026/data/abc
