<a href="https://colab.research.google.com/github/atsigman/data_pipeline_tutorial/blob/main/music_data_pipeline_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Music Data Pipeline Tutorial**

First, install the `music_data_pipeline` package directly from the tutorial repo:

In [1]:


!pip install git+https://github.com/atsigman/data_pipeline_tutorial.git@main

Collecting git+https://github.com/atsigman/data_pipeline_tutorial.git@main
  Cloning https://github.com/atsigman/data_pipeline_tutorial.git (to revision main) to /tmp/pip-req-build-7yuqk4h2
  Running command git clone --filter=blob:none --quiet https://github.com/atsigman/data_pipeline_tutorial.git /tmp/pip-req-build-7yuqk4h2
  Resolved https://github.com/atsigman/data_pipeline_tutorial.git to commit 06b9547072e2d0bbcc84b91260804bc8b051592e
  Preparing metadata (setup.py) ... done
Collecting torch==2.6.0 (from music-data-pipeline==0.1)
  Downloading torch-2.6.0-cp312-cp312-manylinux1_x86_64.whl.metadata (28 kB)
Collecting torchaudio==2.6.0 (from music-data-pipeline==0.1)
  Downloading torchaudio-2.6.0-cp312-cp312-manylinux1_x86_64.whl.metadata (6.6 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch==2.6.0->music-data-pipeline==0.1)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==

Next, download the text and audio data with `gdown` (mounting Google Drive does not always work):

In [2]:
!pip install -q gdown
!gdown 1duHFr4-O12aRQZbZtBvc0NVr_tUwiNoQ -O /content/dpt_data.zip

Downloading...
From (original): https://drive.google.com/uc?id=1duHFr4-O12aRQZbZtBvc0NVr_tUwiNoQ
From (redirected): https://drive.google.com/uc?id=1duHFr4-O12aRQZbZtBvc0NVr_tUwiNoQ&confirm=t&uuid=7898964e-4ca7-4e9c-8dee-f3518054f376
To: /content/dpt_data.zip
100% 111M/111M [00:01<00:00, 65.3MB/s]


Unzip the data archive and save to the appropriate subdirectory:

In [3]:
import os
import zipfile

In [4]:
ZIP_PATH = "/content/dpt_data.zip"   # location of the zip in Colab VM
EXTRACT_TO = "/content"
DATA_DIR = "/content/data"
SUBDIR_NAME = os.path.basename(ZIP_PATH)[:-4]

In [5]:
def unzip_file() -> None:
  """
  Unzips the archive to EXTRACT_TO and renames the extracted subdirectory to DATA_DIR.
  """
  import shutil

  # Unzip:
  with zipfile.ZipFile(ZIP_PATH, "r") as zip_ref:
    zip_ref.extractall(EXTRACT_TO)

  # Rename the extracted subdir to "data":
  if SUBDIR_NAME != "data":
    original_dir = os.path.join(EXTRACT_TO, SUBDIR_NAME)
    if os.path.exists(DATA_DIR):
      shutil.rmtree(DATA_DIR)  # remove a stale copy if it already exists
    os.rename(original_dir, DATA_DIR)  # rename whether or not DATA_DIR existed

  print(f"✅ Audio and text metadata extracted to: {DATA_DIR}")




In [6]:
unzip_file()

✅ Audio and text metadata extracted to: /content/data


In [7]:
# Imports:

import ast
import json
import pandas as pd
import random
import uuid

import torch

from typing import Tuple

from torch.utils.data import Dataset, Subset, DataLoader
from torchaudio.transforms import MFCC

from music_data_pipeline.audio_dataset import AudioDataset
from music_data_pipeline.util.pipeline_utils import (
  validate_prune_data,
  find_similar_audio,
  add_silent_regions,
  chunk_audio,
  tokenize_metadata,
  extract_blacklisted_genres,
)

Now we are ready to explore the dataset and construct a data pipeline!

# **I. Data Preprocessing Pipeline**

The goal of this module will be to analyse and preprocess the audio data and text metadata.

As we shall see, it may be necessary to a) prune the metadata in the event of invalid entries, b) add metadata, and/or c) generate new audio files.

Ultimately, the input data CSV will be converted to a dictionary, which will be saved as a JSON in the `/content/data` directory.

(In the "real world", this data would be stored to a DB, but for the sake of simplicity, we will just serialise it to a file in this tutorial.)

Let's begin by reading in and inspecting the dataset CSV:

## **A. EDA/Dataframe Operations**

In [8]:
df = pd.read_csv(os.path.join(DATA_DIR, "input_data.csv"))

In [9]:
df.shape

(106, 7)

In [9]:
df.columns

Index(['track_id', 'artist', 'album_title', 'genres', 'track_title', 'tempo',
       'audio_path'],
      dtype='object')

In [10]:
df.isna().sum()

| | 0 |
|---|---|
| track_id | 0 |
| artist | 0 |
| album_title | 0 |
| genres | 0 |
| track_title | 0 |
| tempo | 0 |
| audio_path | 1 |


In [9]:
df.head()

| | track_id | artist | album_title | genres | track_title | tempo | audio_path |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Alexander Sigman | VURT Cycle | ['Experimental', 'Contemporary Classical'] | dlxsf | 92 | data/audio/000001.mp3 |
| 1 | 193 | Ed Askew | Blue Piano | ['Folk'] | Here With You | 49 | data/audio/000193.mp3 |
| 2 | 207 | John Cage | Cage Classics | ['Experimental'] | 4'33 | 63 | |
| 3 | 1197 | Mount Eerie | Seven New Songs | ['Folk'] | My Burning | 60 | data/audio/001197.mp3 |
| 4 | 1683 | The Sounds of Taraab | Zanzibar, New York | ['International'] | Mapenzi Matamu | 100 | data/audio/001683.mp3 |


OK, so it looks as though there are 106 samples, and 7 columns (features). 1 audio path is missing.

What do you notice about the data structures for each column?

The first necessary manipulation: now that the data directory lives under `/content`, `/content/` should be prepended to each `audio_path`. (The missing path is replaced with an empty string.)





In [10]:
df["audio_path"] = df["audio_path"].apply(lambda x: "/content/" + x if isinstance(x, str) else "")

In [14]:
df.head()

| | track_id | artist | album_title | genres | track_title | tempo | audio_path |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Alexander Sigman | VURT Cycle | ['Experimental', 'Contemporary Classical'] | dlxsf | 92 | /content/data/audio/000001.mp3 |
| 1 | 193 | Ed Askew | Blue Piano | ['Folk'] | Here With You | 49 | /content/data/audio/000193.mp3 |
| 2 | 207 | John Cage | Cage Classics | ['Experimental'] | 4'33 | 63 | |
| 3 | 1197 | Mount Eerie | Seven New Songs | ['Folk'] | My Burning | 60 | /content/data/audio/001197.mp3 |
| 4 | 1683 | The Sounds of Taraab | Zanzibar, New York | ['International'] | Mapenzi Matamu | 100 | /content/data/audio/001683.mp3 |


**1. Blacklist Flag Column**

This is just a repository for any "warnings" about entries
that will assist with training data filtering downstream.

In [11]:
df["blacklist_flags"] = [[] for _ in range(len(df))]

In [11]:
df.head()

| | track_id | artist | album_title | genres | track_title | tempo | audio_path | blacklist_flags |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Alexander Sigman | VURT Cycle | ['Experimental', 'Contemporary Classical'] | dlxsf | 92 | data/audio/000001.mp3 | [] |
| 1 | 193 | Ed Askew | Blue Piano | ['Folk'] | Here With You | 49 | data/audio/000193.mp3 | [] |
| 2 | 207 | John Cage | Cage Classics | ['Experimental'] | 4'33 | 63 | | [] |
| 3 | 1197 | Mount Eerie | Seven New Songs | ['Folk'] | My Burning | 60 | data/audio/001197.mp3 | [] |
| 4 | 1683 | The Sounds of Taraab | Zanzibar, New York | ['International'] | Mapenzi Matamu | 100 | data/audio/001683.mp3 | [] |


**2. Add `_id` Column**

Assign each sample a unique `_id`:

In [12]:
df["_id"] = [str(uuid.uuid1()) for _ in range(len(df))]

In [13]:
df.columns

Index(['track_id', 'artist', 'album_title', 'genres', 'track_title', 'tempo',
       'audio_path', 'blacklist_flags', '_id'],
      dtype='object')

**3. Convert "genres" values from string to list**

In [13]:
df["genres"] = df["genres"].apply(ast.literal_eval)

**4. Convert the DataFrame to a List of Dictionaries**

For all subsequent operations, the data should be a list of dictionaries, one per entry. (This also avoids dealing with the idiosyncrasies of pandas.)

In [14]:
entries = df.to_dict(orient="records")

In [24]:
type(entries)

list

In [25]:
entries[0]

{'track_id': 1,
 'artist': 'Alexander Sigman',
 'album_title': 'VURT Cycle',
 'genres': ['Experimental', 'Contemporary Classical'],
 'track_title': 'dlxsf',
 'tempo': 92,
 'audio_path': '/content/data/audio/000001.mp3',
 'blacklist_flags': [],
 '_id': '8f94abea-01f1-11f1-b94e-0242ac1c000c'}

## **B. Data Validation and Pruning**

As this preprocessing pipeline is positioned upstream, and given the limited dataset size, it would be best to take a conservative approach to making executive data filtering decisions.

So let's consider: under which conditions is a given entry simply not usable as training data?


1.   No audio filepath (remember: we found one such example)
2.   Absolutely no relevant metadata
3.   Duplicate audio path (put a pin in this for later...)
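
The three conditions above can be sketched as a single filtering pass. This is only an illustration and not the implementation of `validate_prune_data`: the helper name `prune_entries` and the choice of `METADATA_KEYS` are assumptions made for the example.

```python
from typing import Any, Dict, List

# Assumed set of fields that count as "relevant metadata" (hypothetical choice).
METADATA_KEYS = ("artist", "album_title", "genres", "track_title")


def prune_entries(entries: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drops entries with no audio path, no metadata, or a repeated audio path."""
    seen_paths = set()
    kept = []
    for entry in entries:
        path = entry.get("audio_path")
        # 1. No audio filepath (missing, non-string, or empty string).
        if not isinstance(path, str) or not path:
            continue
        # 2. Absolutely no relevant metadata.
        if not any(entry.get(key) for key in METADATA_KEYS):
            continue
        # 3. Duplicate audio path: keep only the first occurrence.
        if path in seen_paths:
            continue
        seen_paths.add(path)
        kept.append(entry)
    return kept


sample = [
    {"audio_path": "/content/data/audio/000001.mp3", "artist": "Alexander Sigman"},
    {"audio_path": "", "artist": "John Cage"},
    {"audio_path": "/content/data/audio/000001.mp3", "artist": "Alexander Sigman"},
    {"audio_path": "/content/data/audio/000193.mp3"},
]
print(len(prune_entries(sample)))  # 1
```

Only the first sample entry survives: the second has no path, the third repeats a path, and the fourth has no metadata.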

In [15]:
entries = validate_prune_data(entries)

In [27]:
print(f"{len(entries)} remaining entries")

105 remaining entries


## **C. Duplicate Detection**




In [16]:
entries = find_similar_audio(entries)

Computing embeddings: 100%|██████████| 105/105 [00:20<00:00,  5.01it/s]
Duplicate detection: 100%|██████████| 105/105 [00:00<00:00, 2737.17it/s]

Deleting 2 redundant entries...
103 remaining entries.
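
The progress bars above suggest an embed-then-compare approach. As a rough sketch of the comparison step only, assuming each track has already been reduced to a fixed-length embedding vector, near-duplicates can be found by thresholding pairwise cosine similarity. The function names and the `0.99` threshold are hypothetical and are not taken from `find_similar_audio`.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicate_pairs(
    embeddings: List[Sequence[float]], threshold: float = 0.99
) -> List[Tuple[int, int]]:
    """Returns index pairs (i, j), i < j, whose similarity meets the threshold."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy embeddings: the third vector points in the same direction as the first.
toy_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [2.0, 0.0, 0.0]]
print(find_duplicate_pairs(toy_embeddings))  # [(0, 2)]
```

The pairwise loop is quadratic in the number of tracks, which is fine for a dataset of this size but would need an approximate nearest-neighbour index at scale.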





## **D. Text Tokenization**


In [17]:
entries = tokenize_metadata(entries)

Tokenizing metadata: 100%|██████████| 103/103 [00:00<00:00, 34135.06it/s]


In [20]:
# Example entry:
entries[0]

{'track_id': 1,
 'artist': 'alexander sigman',
 'album_title': 'vurt cycle',
 'genres': "['experimental', 'contemporary classical']",
 'track_title': 'dlxsf',
 'tempo': 92,
 'audio_path': '/content/data/audio/000001.mp3',
 'blacklist_flags': [],
 '_id': '9d5c5514-01ee-11f1-b55a-0242ac1c000c',
 'duration': 734.864}
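
The example entry shows that the text fields have been lowercased. A minimal sketch of that kind of normalization is given below; `lowercase_metadata` is a hypothetical helper, and `tokenize_metadata` may well do more than this.

```python
from typing import Any, Dict


def lowercase_metadata(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Returns a copy of the entry with string and list-of-string values lowercased."""
    normalized = {}
    for key, value in entry.items():
        if key in ("audio_path", "_id"):
            normalized[key] = value  # paths and identifiers are left untouched
        elif isinstance(value, str):
            normalized[key] = value.strip().lower()
        elif isinstance(value, list):
            normalized[key] = [
                v.strip().lower() if isinstance(v, str) else v for v in value
            ]
        else:
            normalized[key] = value  # numbers etc. pass through unchanged
    return normalized


example = {
    "artist": "Ed Askew",
    "genres": ["Folk"],
    "tempo": 49,
    "audio_path": "/content/data/audio/000193.mp3",
}
print(lowercase_metadata(example))
```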

## **E. Blacklisted Genre Extraction**

In [18]:
entries = extract_blacklisted_genres(entries)

In [45]:
# Entries for which "bad_genre" flag exists:
bad_genre_entries = [e for e in entries if "bad_genre" in e["blacklist_flags"]]
print(bad_genre_entries)

[{'track_id': 133528, 'artist': 'walker j. sheldon', 'album_title': 'collected narratives', 'genres': ['electronic', 'podcast', 'audiobook'], 'track_title': 'two plus two makes crazy', 'tempo': 60, 'audio_path': '/content/data/audio/133528.mp3', 'blacklist_flags': ['bad_genre', 'bad_genre'], '_id': '7327bb58-01c1-11f1-8e66-0242ac1c000c', 'duration': 217.447}]
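
Note that the flagged entry carries `'bad_genre'` twice, presumably once per offending genre. If a single flag per entry is preferred, the check can be made idempotent. The sketch below assumes a blacklist of `podcast` and `audiobook`, inferred only from the entry above; both the set and the helper name `flag_bad_genres` are hypothetical.

```python
from typing import Any, Dict, List

# Assumed blacklist, inferred from the flagged entry above (hypothetical).
BLACKLISTED_GENRES = {"podcast", "audiobook"}


def flag_bad_genres(entries: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Adds a single "bad_genre" flag to entries with any blacklisted genre."""
    for entry in entries:
        genres = {g.lower() for g in entry.get("genres", [])}
        flags = entry.setdefault("blacklist_flags", [])
        # Append the flag at most once, even if several genres match
        # or the function is run repeatedly.
        if genres & BLACKLISTED_GENRES and "bad_genre" not in flags:
            flags.append("bad_genre")
    return entries


toy_entries = [
    {"genres": ["electronic", "podcast", "audiobook"], "blacklist_flags": []},
    {"genres": ["folk"], "blacklist_flags": []},
]
flag_bad_genres(flag_bad_genres(toy_entries))  # running twice is harmless
print([e["blacklist_flags"] for e in toy_entries])  # [['bad_genre'], []]
```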


## **F. Chunk/Segment Entries with Long Audio Tracks**

In [19]:
entries = chunk_audio(entries)

In [47]:
entries[-6:]

[{'track_id': 133528,
  'artist': 'walker j. sheldon',
  'album_title': 'collected narratives',
  'genres': ['electronic', 'podcast', 'audiobook'],
  'track_title': 'two plus two makes crazy',
  'tempo': 60,
  'audio_path': '/content/data/audio/133528_0.wav',
  'blacklist_flags': ['bad_genre', 'bad_genre'],
  '_id': '7327bb58-01c1-11f1-8e66-0242ac1c000c',
  'duration': 180.0,
  'partition': 0,
  'start_sec': 0},
 {'track_id': 1,
  'artist': 'alexander sigman',
  'album_title': 'vurt cycle',
  'genres': ['experimental', 'contemporary classical'],
  'track_title': 'dlxsf',
  'tempo': 92,
  'audio_path': '/content/data/audio/000001_1.wav',
  'blacklist_flags': [],
  '_id': '015493b4-01c3-11f1-8e66-0242ac1c000c',
  'duration': 180.0,
  'partition': 1,
  'start_sec': 180},
 {'track_id': 1,
  'artist': 'alexander sigman',
  'album_title': 'vurt cycle',
  'genres': ['experimental', 'contemporary classical'],
  'track_title': 'dlxsf',
  'tempo': 92,
  'audio_path': '/content/data/audio/000001_
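
The chunked entries above have a `duration` of 180.0 and `start_sec` values in multiples of 180, which suggests fixed 180-second partitions. The boundary arithmetic can be sketched as follows; `plan_chunks` is a hypothetical helper, and how `chunk_audio` treats a short final chunk is not shown in the output above.

```python
from typing import List, Tuple


def plan_chunks(duration: float, chunk_sec: int = 180) -> List[Tuple[int, int, float]]:
    """Returns (partition, start_sec, chunk_duration) for each chunk of a track."""
    chunks = []
    start = 0
    partition = 0
    while start < duration:
        # The last chunk holds whatever remains of the track.
        chunk_duration = round(min(float(chunk_sec), duration - start), 3)
        chunks.append((partition, start, chunk_duration))
        start += chunk_sec
        partition += 1
    return chunks


# The first track in the dataset has a duration of 734.864 seconds:
for chunk in plan_chunks(734.864):
    print(chunk)
```

For this track the plan contains four full chunks and one remainder of about 15 seconds.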

## **G. Silent Region Detection**

In [20]:
entries = add_silent_regions(entries)

Silent region detection: 100%|██████████| 108/108 [00:13<00:00,  8.00it/s]


In [33]:
silent_region_entries = [e for e in entries if e["silent_regions"]]
len(silent_region_entries)

4
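
One common way to find silent regions is to threshold the frame-wise RMS energy of the waveform. The sketch below works on a plain list of samples and is not the method used by `add_silent_regions`; the function name, frame length, threshold, and minimum silence duration are all assumed values for illustration.

```python
import math
from typing import List, Sequence, Tuple


def detect_silent_regions(
    samples: Sequence[float],
    sample_rate: int,
    frame_len: int = 1024,
    rms_threshold: float = 0.01,
    min_silence_sec: float = 2.0,
) -> List[Tuple[float, float]]:
    """Returns (start_sec, end_sec) spans whose frame RMS stays below the threshold."""
    regions = []
    region_start = None
    n_frames = math.ceil(len(samples) / frame_len)
    # One extra iteration yields an empty frame, which closes any open region.
    for frame_idx in range(n_frames + 1):
        frame = samples[frame_idx * frame_len : (frame_idx + 1) * frame_len]
        is_silent = bool(frame) and math.sqrt(
            sum(s * s for s in frame) / len(frame)
        ) < rms_threshold
        t = min(frame_idx * frame_len, len(samples)) / sample_rate
        if is_silent and region_start is None:
            region_start = t
        elif not is_silent and region_start is not None:
            if t - region_start >= min_silence_sec:
                regions.append((round(region_start, 3), round(t, 3)))
            region_start = None
    return regions


# Toy signal at 1 kHz: 1 s of sound, 3 s of silence, 1 s of sound.
toy_signal = [0.5] * 1000 + [0.0] * 3000 + [0.5] * 1000
print(detect_silent_regions(toy_signal, sample_rate=1000, frame_len=100))  # [(1.0, 4.0)]
```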

In [50]:
silent_region_entries

[{'track_id': 1,
  'artist': 'alexander sigman',
  'album_title': 'vurt cycle',
  'genres': ['experimental', 'contemporary classical'],
  'track_title': 'dlxsf',
  'tempo': 92,
  'audio_path': '/content/data/audio/000001_0.wav',
  'blacklist_flags': [],
  '_id': '7327297c-01c1-11f1-8e66-0242ac1c000c',
  'duration': 180.0,
  'partition': 0,
  'start_sec': 0,
  'silent_regions': [(2.74, 7.43),
   (10.879, 14.036),
   (24.404, 26.645),
   (38.51, 42.574),
   (127.501, 130.891),
   (2.74, 7.43),
   (10.879, 14.036),
   (24.404, 26.645),
   (38.51, 42.574),
   (127.501, 130.891)]},
 {'track_id': 133528,
  'artist': 'walker j. sheldon',
  'album_title': 'collected narratives',
  'genres': ['electronic', 'podcast', 'audiobook'],
  'track_title': 'two plus two makes crazy',
  'tempo': 60,
  'audio_path': '/content/data/audio/133528_0.wav',
  'blacklist_flags': ['bad_genre', 'bad_genre'],
  '_id': '7327bb58-01c1-11f1-8e66-0242ac1c000c',
  'duration': 180.0,
  'partition': 0,
  'start_sec': 0,
 

## **H. Serialize to JSON**

In [21]:
with open("/content/data/training_data.json", "w") as f:
  json.dump(entries, f, indent=4)

In [2]:
# (Run this cell in case of runtime disconnection, etc., or just to validate serialization):
with open("/content/data/training_data.json", "r") as f:
  entries = json.load(f)
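
One side effect of the round trip is worth knowing: JSON has no tuple type, so the `(start, end)` tuples in `silent_regions` come back as lists. The short example below uses a made-up entry to show this, along with a conversion back to tuples in case downstream code expects them.

```python
import json

# Made-up entry with the same structure as the silent_regions field above.
entry = {"_id": "example", "silent_regions": [(2.74, 7.43), (10.879, 14.036)]}
restored = json.loads(json.dumps(entry))

print(restored["silent_regions"])  # [[2.74, 7.43], [10.879, 14.036]]

# Convert back to tuples if downstream code relies on them:
restored["silent_regions"] = [tuple(r) for r in restored["silent_regions"]]
```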

# **II. Audio Dataset/DataLoader**

Now that we have preprocessed the training data, the next step is to determine how data samples will be varied, transformed, and represented during model training and validation.

## **A. AudioDataset**
Let's construct an AudioDataset. Each `__getitem__()` call returns a waveform tensor and a `TextCondition`.

 The only required argument is the list of entries (training data collection).


In [22]:
ds = AudioDataset(entries)

Pre-filter: 108 entries.
Post-filter: 104 entries.


To inspect output properties, let's print the shape of the audio tensor and the `TextCondition` for one `__getitem__()` call. (Note that audio is mixed down to mono, as the dataset *may* contain a combination of mono and stereo files.)





In [24]:
audio, cond = ds[0]
print(f"Audio tensor shape: {audio.shape}, Text condition: {cond}")

Audio tensor shape: torch.Size([1, 160000]), Text condition: TextCondition(artist='alexander sigman', album_title='vurt cycle', track_title='dlxsf', genres=['experimental', 'contemporary classical'], description='A music track by the artist Alexander Sigman from the album Vurt Cycle at tempo 92 in the experimental genre.', tempo=92)


There is a script in `tests` called `test_dataset.py`, which iterates through the dataset and prints/collects any errors, but we can (partially) replicate the functionality here:

In [28]:
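def validate_dataset(ds: Dataset, max_idx: int = 10) -> None:
  """
  Iterates through the first max_idx samples. Prints audio shapes and text conditions,
  and reports any exceptions or channel-count mismatches.
  """
  n_errors = 0
  for i in range(min(max_idx, len(ds))):
    try:
      audio, cond = ds[i]
      print(f"{i}: Audio tensor shape: {audio.shape}, Text condition: {cond}")
      if audio.shape[0] != 1:
        n_errors += 1
        print(f"n_channel mismatch at index {i}: should be mono.")
    except Exception as e:
      n_errors += 1
      print(f"Error at index {i}: {e}")

  # Only report success if no problems were found:
  if n_errors == 0:
    print("Dataset tests passed!")
  else:
    print(f"Dataset tests failed with {n_errors} error(s).")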



In [29]:
validate_dataset(ds)

0: Audio tensor shape: torch.Size([1, 160000]), Text condition: TextCondition(artist='alexander sigman', album_title='vurt cycle', track_title='dlxsf', genres=['experimental', 'contemporary classical'], description='A music track by the artist Alexander Sigman from the album Vurt Cycle at tempo 92 in the experimental genre.', tempo=92)
1: Audio tensor shape: torch.Size([1, 160000]), Text condition: TextCondition(artist='ed askew', album_title='blue piano', track_title='here with you', genres=['folk'], description='A music track by the artist Ed Askew from the album Blue Piano at 49 BPM best categorized as folk.', tempo=49)
2: Audio tensor shape: torch.Size([1, 160000]), Text condition: TextCondition(artist='mount eerie', album_title='seven new songs', track_title='my burning', genres=['folk'], description='A music track by the artist Mount Eerie from the album Seven New Songs at 60 BPM in the folk genre.', tempo=60)
3: Audio tensor shape: torch.Size([1, 160000]), Text condition: TextCo

## **B. AudioDataset with Transform**

There is an option to apply an audio transform to the input. Let's experiment with an MFCC transform.

In [30]:
mfcc_transform = MFCC(
  sample_rate=44100,
  n_mfcc=13,
  melkwargs={"n_fft": 2048, "hop_length": 512, "n_mels": 23}
)


In [31]:
mfcc_ds = AudioDataset(entries, transform=mfcc_transform)

Pre-filter: 108 entries.
Post-filter: 104 entries.


In [32]:
mfcc_spect, cond = mfcc_ds[0]
print(f"Audio tensor shape: {mfcc_spect.shape}, Text condition: {cond}")

Audio tensor shape: torch.Size([1, 13, 313]), Text condition: TextCondition(artist='alexander sigman', album_title='vurt cycle', track_title='dlxsf', genres=['experimental', 'contemporary classical'], description='A music track by the artist Alexander Sigman from the album Vurt Cycle at tempo 92 in the experimental genre.', tempo=92)
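
The output shape `[1, 13, 313]` can be checked by hand: one channel, `n_mfcc` coefficients, and one frame per hop. The calculation below assumes the transform pads the signal so that frames are centered (torchaudio's default `center=True`), in which case the frame count is `1 + n_samples // hop_length`.

```python
n_samples = 160_000   # waveform length from the earlier shape [1, 160000]
hop_length = 512      # from melkwargs
n_mfcc = 13           # number of MFCC coefficients requested

# With centered (padded) framing, the frame count is 1 + n_samples // hop_length.
n_frames = 1 + n_samples // hop_length
print((1, n_mfcc, n_frames))  # (1, 13, 313)
```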


**Dataset TODOs:**


1.   Support silent region-aware cropping
2.   Chain augmentations (i.e., select > 1)
3.   Other audio representations?
     *   Discrete (precomputed?) codebooks?
     *   Continuous or discretized embeddings? (Precomputed, or extracted on-the-fly?)












## **C. Train/Test Split**
The Dataset will be split into training and validation subsets. (Be sure that samples are randomly assigned to each.)


In [33]:
def train_test_split(ds: Dataset, split_ratio: float = 0.8) -> Tuple[Subset, Subset]:
  """
  Randomly splits a Dataset into training and validation subsets.
  split_ratio is the fraction of samples assigned to the training subset.
  """
  all_idxs = range(len(ds))
  train_len = int(split_ratio * len(ds))  # use the argument, not a hard-coded 0.8
  train_idxs = random.sample(all_idxs, k=train_len)
  train_ds = Subset(ds, train_idxs)

  # The remaining indices form the validation subset:
  val_idxs = sorted(set(all_idxs) - set(train_idxs))
  val_ds = Subset(ds, val_idxs)

  return train_ds, val_ds


In [34]:
train_ds, val_ds = train_test_split(ds)

In [35]:
print(f"Train dataset size: {len(train_ds)}")
print(f"Validation dataset size: {len(val_ds)}")

Train dataset size: 83
Validation dataset size: 21
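
Because the split relies on the global `random` state, the train and validation subsets differ from run to run. If a reproducible split is wanted, one option is to shuffle the indices with a dedicated seeded generator. The sketch below uses only the standard library; `split_indices` and its `seed` parameter are hypothetical names.

```python
import random
from typing import List, Tuple


def split_indices(
    n: int, split_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffles range(n) with a seeded RNG and splits it into train/validation indices."""
    idxs = list(range(n))
    random.Random(seed).shuffle(idxs)  # local RNG: the global random state is untouched
    train_len = int(split_ratio * n)
    return idxs[:train_len], idxs[train_len:]


train_idxs, val_idxs = split_indices(104)
print(len(train_idxs), len(val_idxs))  # 83 21
```

The two index lists can then be passed to `Subset` in the same way as in the function above.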


## **D. DataLoaders**

In order to load data in batches on the CPU and hand it to the training loop efficiently (without loading the whole dataset into RAM), a `DataLoader` is recommended.

Let's create one for each `Dataset`.

**Collate function**

For correct batching, a `collate_fn` must be defined. As the audio samples are tensors of uniform dimensions, they can be stacked. The `TextCondition`s are not tensors, however, and should simply be collected into a list:

In [36]:
def collate_fn(data):
    audio, conditions = zip(*data)
    return torch.stack(audio), list(conditions)

**Train and Validation DataLoaders**

The main difference between the training and validation DataLoader kwargs is that, for the validation set, shuffling data is not required. For the purposes of Colab notebook execution, be sure to set `num_workers` to 0.





In [39]:
train_dl = DataLoader(train_ds, batch_size=4, collate_fn=collate_fn, num_workers=0, pin_memory=False, shuffle=True)
val_dl = DataLoader(val_ds, batch_size=4, collate_fn=collate_fn, num_workers=0, pin_memory=False, shuffle=False)

In [41]:
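def validate_dataloader(dl: DataLoader, max_iter: int = 5) -> None:
  """Prints the dimensions of up to max_iter batches."""
  # Iterate over the DataLoader itself: next(iter(dl)) would restart it on every call.
  for i, (audio_batch, cond_batch) in enumerate(dl):
    if i >= max_iter:
      break
    print(f"Audio dims: {audio_batch.shape}, Text batch size: {len(cond_batch)}")

  print("DataLoader tests passed!")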

In [42]:
validate_dataloader(val_dl)

Audio dims: torch.Size([4, 1, 160000]), Text batch size: 4
Audio dims: torch.Size([4, 1, 160000]), Text batch size: 4
Audio dims: torch.Size([4, 1, 160000]), Text batch size: 4
Audio dims: torch.Size([4, 1, 160000]), Text batch size: 4
Audio dims: torch.Size([4, 1, 160000]), Text batch size: 4
DataLoader tests passed!


# **Conclusion/Next Steps**

With the initialized DataLoaders, one could run a model training/validation loop over some number of epochs, reading samples from CPU in batches and transferring them to GPU.

For each epoch, a unique crop and augmentation type will be selected for a given sample. This both expands the training dataset and teaches the model to generate to and from any valid onset, and to be robust to imperceptible (or scarcely perceptible) but computationally distinct representations of the "same" input. (As discussed earlier, the applied augmentations should not degrade the audio samples: this is "ground truth" audio, and the model should be steered towards generating high-quality outputs.)

In the "real world", the dataset would be vastly larger (in most cases, 20K hours or more), with greater variance in input duration and data source, but the data pipeline stages and Dataset design process covered in this tutorial would be fundamentally the same.

Please feel free to experiment with your own data, pipeline enhancements, and Dataset modifications!