# Data Preparation 

This notebook contains code for downloading and preprocessing data to be used in all 
sessions of the course.

Copyright-free data will be shared in the repository, so you won't need to run this 
notebook.

In [11]:
import random
import shutil
from pathlib import Path

import kagglehub

# Set seed for reproducibility
random.seed(1234)

In [None]:
# Set global variables
SAMPLE_SIZE = 100

## MIDI Dataset
As the main MIDI dataset for the course, we will use the [Lakh MIDI Dataset (LMD)](https://colinraffel.com/projects/lmd/), a collection of 176,581 unique MIDI files.
We will use the cleaned subset of the dataset, which contains 45,129 files and is 
available for download on [Kaggle](https://www.kaggle.com/datasets/imsparsh/lakh-midi-clean).

For the sake of simplicity, we will create a random subset of the dataset containing only 100 files (the value of `SAMPLE_SIZE`).
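
A seeded sample is only repeatable if the list it draws from is always in the same order, and file listings can come back in a different order on different machines. The sketch below shows one way to make the draw independent of that order. `sample_paths` is a hypothetical helper written for illustration, and the file names are made up.

```python
import random
from pathlib import Path


def sample_paths(paths, k, seed=1234):
    """Return up to k paths, chosen reproducibly regardless of input order."""
    ordered = sorted(paths)      # fix the order before sampling
    rng = random.Random(seed)    # local RNG, leaves the global random state untouched
    return rng.sample(ordered, min(k, len(ordered)))


# Toy example with made-up file names
candidates = [Path(f"artist_{i}/song_{i}.mid") for i in range(10)]
subset = sample_paths(candidates, k=3)
print(subset)
```

Clamping `k` to the number of available files avoids the `ValueError` that `random.sample` raises when asked for more items than the list contains.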

In [None]:
# Define variables
MIDI_DATASET_PATH = Path("./midi")

In [None]:
# Download latest version
path = kagglehub.dataset_download("imsparsh/lakh-midi-clean")

print("Path to dataset files:", path)

Downloading to /Users/andreapoltronieri/.cache/kagglehub/datasets/imsparsh/lakh-midi-clean/1.archive...
100%|██████████| 226M/226M [00:06<00:00, 37.7MB/s]
Extracting files...
Path to dataset files: /Users/andreapoltronieri/.cache/kagglehub/datasets/imsparsh/lakh-midi-clean/versions/1
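
Before sampling, it can be useful to check what the download actually contains. Pattern matching on file extensions may be case-sensitive depending on the platform, so files ending in `.MID` or `.midi` could be missed by a `*.mid` pattern. The sketch below counts files by lowercased extension. `count_by_suffix` is a hypothetical helper written for illustration, not part of `kagglehub`.

```python
from collections import Counter
from pathlib import Path


def count_by_suffix(root):
    """Count all files under root, grouped by lowercased file extension."""
    return Counter(
        p.suffix.lower() for p in Path(root).rglob("*") if p.is_file()
    )
```

Calling `count_by_suffix(path)` on the downloaded folder would show whether extensions other than `.mid` are present.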


In [12]:
# The dataset is organized in folders by track names. We'll iterate through all folders
# and collect MIDI files.

MIDI_DATASET_PATH.mkdir(parents=True, exist_ok=True)

# Sort the paths so that the seeded sample does not depend on filesystem order
all_midi_files = sorted(Path(path).rglob("*.mid"))

# Sample a subset of files
sampled_files = random.sample(all_midi_files, SAMPLE_SIZE)
for midi_file in sampled_files:
    # Note: files sharing a name in different folders overwrite each other here
    destination = MIDI_DATASET_PATH / midi_file.name
    shutil.copy(midi_file, destination)

print(f"Sampled {len(sampled_files)} MIDI files to {MIDI_DATASET_PATH.resolve()}")

Sampled 100 MIDI files to /Users/andreapoltronieri/Documents/Projects/SyMP-CM/2026/data/midi
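
Because the copy step flattens the folder structure, two sampled files with the same name in different folders would end up at the same destination, leaving fewer files than were sampled. A minimal sketch of one way to avoid this is given below. `copy_with_unique_names` is a hypothetical helper, and prefixing the parent folder name is just one possible naming scheme.

```python
import shutil
from pathlib import Path


def copy_with_unique_names(files, dest_dir):
    """Copy files into dest_dir, prefixing the parent folder name on a name clash."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    used = set()    # file names already written during this call
    copied = []
    for src in files:
        src = Path(src)
        name = src.name
        if name in used:
            # Same file name seen before: disambiguate with the source folder
            name = f"{src.parent.name}_{name}"
        used.add(name)
        target = dest_dir / name
        shutil.copy(src, target)
        copied.append(target)
    return copied
```

With the variables defined in this notebook, the call would be `copy_with_unique_names(sampled_files, MIDI_DATASET_PATH)`. Comparing the number of files in the destination folder against `SAMPLE_SIZE` is a quick way to confirm that nothing was overwritten.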


## MusicXML Dataset