<a href="https://colab.research.google.com/github/atsigman/data_pipeline_tutorial/blob/main/music_data_pipeline_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Music Data Pipeline Tutorial**

First, install the `music_data_pipeline` package directly from the tutorial repo (after uninstalling any earlier copy):

In [2]:
!pip uninstall -y music_data_pipeline


In [3]:


!pip install  git+https://github.com/atsigman/data_pipeline_tutorial.git@main

Collecting git+https://github.com/atsigman/data_pipeline_tutorial.git@main
  Cloning https://github.com/atsigman/data_pipeline_tutorial.git (to revision main) to /tmp/pip-req-build-w8sg7sz9
  Running command git clone --filter=blob:none --quiet https://github.com/atsigman/data_pipeline_tutorial.git /tmp/pip-req-build-w8sg7sz9
  Resolved https://github.com/atsigman/data_pipeline_tutorial.git to commit d814421be925cf6f13ec072cdfab0365db1ca6cd
  Preparing metadata (setup.py) ... done
Collecting torch==2.6.0 (from music-data-pipeline==0.1)
  Downloading torch-2.6.0-cp312-cp312-manylinux1_x86_64.whl.metadata (28 kB)
Collecting torchaudio==2.6.0 (from music-data-pipeline==0.1)
  Downloading torchaudio-2.6.0-cp312-cp312-manylinux1_x86_64.whl.metadata (6.6 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch==2.6.0->music-data-pipeline==0.1)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==

Next, download the text and audio data with `gdown` (mounting Google Drive does not always work):

In [4]:
!pip install -q gdown
!gdown 1duHFr4-O12aRQZbZtBvc0NVr_tUwiNoQ -O /content/dpt_data.zip

Downloading...
From (original): https://drive.google.com/uc?id=1duHFr4-O12aRQZbZtBvc0NVr_tUwiNoQ
From (redirected): https://drive.google.com/uc?id=1duHFr4-O12aRQZbZtBvc0NVr_tUwiNoQ&confirm=t&uuid=ce783f88-e62e-4e7b-a818-8b7d5a8722c6
To: /content/dpt_data.zip
100% 111M/111M [00:01<00:00, 65.4MB/s] 


Unzip the data archive and save to the appropriate subdirectory:

In [5]:
import os
import zipfile

In [6]:
ZIP_PATH = "/content/dpt_data.zip"   # location of the zip in Colab VM
EXTRACT_TO = "/content"
DATA_DIR = "/content/data"
SUBDIR_NAME = os.path.basename(ZIP_PATH)[:-4]

In [7]:
def unzip_file() -> None:
  """
  Unzips the archive into EXTRACT_TO and renames the extracted subdirectory to DATA_DIR.
  """
  import shutil

  os.makedirs(DATA_DIR, exist_ok=True)

  # Unzip:
  with zipfile.ZipFile(ZIP_PATH, "r") as zip_ref:
    zip_ref.extractall(EXTRACT_TO)

  # Rename the extracted subdir to "data":
  if SUBDIR_NAME != "data":
    original_dir = os.path.join(EXTRACT_TO, SUBDIR_NAME)
    if os.path.exists(DATA_DIR):
      shutil.rmtree(DATA_DIR)  # remove if already exists
    # Rename regardless of whether DATA_DIR existed beforehand:
    os.rename(original_dir, DATA_DIR)

  print(f"✅ Audio and text metadata extracted to: {DATA_DIR}")




In [8]:
unzip_file()

✅ Audio and text metadata extracted to: /content/data


In [29]:
# Imports:

import json
import pandas as pd
import random
import uuid

from torch.utils.data import Subset, DataLoader

from music_data_pipeline.audio_dataset import AudioDataset
from music_data_pipeline.util.pipeline_utils import (
  validate_prune_data,
  find_similar_audio,
  add_silent_regions,
  chunk_audio,
  tokenize_metadata,
  extract_blacklisted_genres,
)

Now we are ready to explore the dataset and construct a data pipeline!

# **I. Data Preprocessing Pipeline**

The goal of this module will be to analyse and preprocess the audio data and text metadata.

As we shall see, it may be necessary to a) prune the metadata in the event of invalid entries, b) add metadata, and/or c) generate new audio files.

Ultimately, the input data CSV will be converted to a dictionary, which will be saved as a JSON in the `/content/data` directory.

(In the "real world", this data would be stored to a DB, but for the sake of simplicity, we will just serialise it to a file in this tutorial.)

Let's begin by reading in and inspecting the dataset CSV:

## **A. EDA/Dataframe Operations**

In [10]:
df = pd.read_csv(os.path.join(DATA_DIR, "input_data.csv"))

In [11]:
df.shape

(106, 7)

In [12]:
df.columns

Index(['track_id', 'artist', 'album_title', 'genres', 'track_title', 'tempo',
       'audio_path'],
      dtype='object')

In [13]:
df.isna().sum()

track_id       0
artist         0
album_title    0
genres         0
track_title    0
tempo          0
audio_path     1


In [14]:
df.head()

| | track_id | artist | album_title | genres | track_title | tempo | audio_path |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Alexander Sigman | VURT Cycle | ['Experimental', 'Contemporary Classical'] | dlxsf | 92 | data/audio/000001.mp3 |
| 1 | 193 | Ed Askew | Blue Piano | ['Folk'] | Here With You | 49 | data/audio/000193.mp3 |
| 2 | 207 | John Cage | Cage Classics | ['Experimental'] | 4'33 | 63 | |
| 3 | 1197 | Mount Eerie | Seven New Songs | ['Folk'] | My Burning | 60 | data/audio/001197.mp3 |
| 4 | 1683 | The Sounds of Taraab | Zanzibar, New York | ['International'] | Mapenzi Matamu | 100 | data/audio/001683.mp3 |


OK, so it looks as though there are 106 samples and 7 columns (features), and one audio path is missing.

What do you notice about the data structures for each column?

The first necessary manipulation: now that the data directory lives under `/content`, `/content/` should be prepended to each `audio_path`. A missing path (NaN) is replaced with an empty string.





In [15]:
df["audio_path"] = df["audio_path"].apply(lambda x: "/content/" + x if isinstance(x, str) else "")

In [16]:
df.head()

| | track_id | artist | album_title | genres | track_title | tempo | audio_path |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Alexander Sigman | VURT Cycle | ['Experimental', 'Contemporary Classical'] | dlxsf | 92 | /content/data/audio/000001.mp3 |
| 1 | 193 | Ed Askew | Blue Piano | ['Folk'] | Here With You | 49 | /content/data/audio/000193.mp3 |
| 2 | 207 | John Cage | Cage Classics | ['Experimental'] | 4'33 | 63 | |
| 3 | 1197 | Mount Eerie | Seven New Songs | ['Folk'] | My Burning | 60 | /content/data/audio/001197.mp3 |
| 4 | 1683 | The Sounds of Taraab | Zanzibar, New York | ['International'] | Mapenzi Matamu | 100 | /content/data/audio/001683.mp3 |


**1. Blacklist Flag Column**

This is just a repository for any "warnings" about entries
that will assist with training data filtering downstream.

In [17]:
df["blacklist_flags"] = [[] for _ in range(len(df))]
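
The list comprehension matters here, because it gives every row its own empty list. A minimal sketch of the difference, using plain Python lists rather than the dataframe (the variable names `independent` and `shared` are illustrative only):

```python
# Each iteration of the comprehension builds a new list object,
# so appending a flag to one row leaves the others untouched.
independent = [[] for _ in range(3)]
independent[0].append("bad_genre")
print(independent)  # [['bad_genre'], [], []]

# Multiplying a one-element list repeats the SAME inner list,
# so a flag added to one row shows up in every row.
shared = [[]] * 3
shared[0].append("bad_genre")
print(shared)  # [['bad_genre'], ['bad_genre'], ['bad_genre']]
```
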

In [18]:
df.head()

| | track_id | artist | album_title | genres | track_title | tempo | audio_path | blacklist_flags |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Alexander Sigman | VURT Cycle | ['Experimental', 'Contemporary Classical'] | dlxsf | 92 | /content/data/audio/000001.mp3 | [] |
| 1 | 193 | Ed Askew | Blue Piano | ['Folk'] | Here With You | 49 | /content/data/audio/000193.mp3 | [] |
| 2 | 207 | John Cage | Cage Classics | ['Experimental'] | 4'33 | 63 | | [] |
| 3 | 1197 | Mount Eerie | Seven New Songs | ['Folk'] | My Burning | 60 | /content/data/audio/001197.mp3 | [] |
| 4 | 1683 | The Sounds of Taraab | Zanzibar, New York | ['International'] | Mapenzi Matamu | 100 | /content/data/audio/001683.mp3 | [] |


**2. Add `_id` Column**

Assign each sample a unique `_id`:

In [19]:
df["_id"] = [str(uuid.uuid1()) for _ in range(len(df))]

In [20]:
df.columns

Index(['track_id', 'artist', 'album_title', 'genres', 'track_title', 'tempo',
       'audio_path', 'blacklist_flags', '_id'],
      dtype='object')

**3. Convert the Dataframe to a List of Dictionaries**

For all subsequent operations, the data should be in dictionary format. (This also avoids dealing with the idiosyncratic quirks of pandas.)

In [21]:
entries = df.to_dict(orient="records")

In [22]:
type(entries)

list

In [23]:
entries[0]

{'track_id': 1,
 'artist': 'Alexander Sigman',
 'album_title': 'VURT Cycle',
 'genres': "['Experimental', 'Contemporary Classical']",
 'track_title': 'dlxsf',
 'tempo': 92,
 'audio_path': '/content/data/audio/000001.mp3',
 'blacklist_flags': [],
 '_id': '927a998a-007c-11f1-a7d0-0242ac1c000c'}
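
Note that `genres` is a string that merely looks like a list, because the CSV stores it as text. If a real list is needed downstream, the standard-library `ast.literal_eval` can parse it safely. A minimal sketch, in which the `parse_genres` helper is hypothetical and not part of `music_data_pipeline`:

```python
import ast


def parse_genres(raw: str) -> list[str]:
    """Parse a stringified list such as "['Folk']" into a real list."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    return [str(g) for g in parsed] if isinstance(parsed, list) else []


genres = parse_genres("['Experimental', 'Contemporary Classical']")
print(genres)       # ['Experimental', 'Contemporary Classical']
print(len(genres))  # 2
```
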

## **B. Data Validation and Pruning**

As this preprocessing pipeline sits upstream of training, and given the limited dataset size, it is best to filter conservatively: only entries that are clearly unusable should be removed at this stage.

So let's consider: under which conditions is a given entry simply not usable as training data?


1.   No audio filepath (remember: we found one such example)
2.   Absolutely no relevant metadata
3.   Duplicate audio path (put a pin in this for later...)

In [24]:
entries = validate_prune_data(entries)

In [25]:
print(f"{len(entries)} remaining entries")

105 remaining entries
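
The three conditions listed above can be sketched as follows. This is not the implementation of `validate_prune_data`, whose internals are not shown here; the `prune_entries` name and the choice of which fields count as "relevant metadata" are assumptions made for illustration.

```python
def prune_entries(entries: list[dict]) -> list[dict]:
    """Drop entries that match any of the three 'unusable' conditions."""
    text_fields = ("artist", "album_title", "genres", "track_title")  # assumed
    seen_paths = set()
    kept = []
    for entry in entries:
        path = entry.get("audio_path")
        # 1. No audio filepath (empty string, or a non-string such as NaN).
        if not isinstance(path, str) or not path:
            continue
        # 2. Absolutely no relevant metadata.
        has_metadata = any(
            isinstance(entry.get(f), str) and entry[f].strip() for f in text_fields
        )
        if not has_metadata:
            continue
        # 3. Duplicate audio path: keep only the first occurrence.
        if path in seen_paths:
            continue
        seen_paths.add(path)
        kept.append(entry)
    return kept


sample = [
    {"audio_path": "/content/data/audio/000193.mp3", "artist": "Ed Askew"},
    {"audio_path": "", "artist": "John Cage"},                               # no path
    {"audio_path": "/content/data/audio/000193.mp3", "artist": "Ed Askew"},  # duplicate
    {"audio_path": "/content/data/audio/001197.mp3"},                        # no metadata
]
print(len(prune_entries(sample)))  # 1
```
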


## **C. Duplicate Detection**




In [26]:
entries = find_similar_audio(entries)

Computing embeddings: 100%|██████████| 105/105 [00:22<00:00,  4.76it/s]
Duplicate detection: 100%|██████████| 105/105 [00:00<00:00, 4594.22it/s]

Deleting 2 redundant entries...
103 remaining entries.
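
The progress bars suggest a two-stage process: compute one embedding per track, then compare embeddings to find near-identical audio. The embedding model and the similarity criterion used by `find_similar_audio` are not shown in this tutorial, so the sketch below only illustrates the general idea with cosine similarity on toy vectors. The `find_duplicates` helper and the `0.98` threshold are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(embeddings: dict[str, list[float]], threshold: float = 0.98) -> set[str]:
    """Return the ids of entries that nearly match an earlier entry."""
    ids = list(embeddings)
    redundant = set()
    for i, first in enumerate(ids):
        if first in redundant:
            continue  # already marked; do not use it as a reference
        for second in ids[i + 1:]:
            if cosine_similarity(embeddings[first], embeddings[second]) >= threshold:
                redundant.add(second)
    return redundant


toy = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [2.0, 0.0, 0.0],  # same direction as "a"
}
print(find_duplicates(toy))  # {'c'}
```

Comparing every pair is quadratic in the number of entries, which is fine for about a hundred tracks but would call for an approximate nearest-neighbour index on a larger dataset.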





## **D. Text Tokenization**


In [27]:
entries = tokenize_metadata(entries)

Tokenizing metadata: 100%|██████████| 103/103 [00:00<00:00, 36642.35it/s]


In [28]:
# Example entry:
entries[0]

{'track_id': 1,
 'artist': 'alexander sigman',
 'album_title': 'vurt cycle',
 'genres': "['experimental', 'contemporary classical']",
 'track_title': 'dlxsf',
 'tempo': 92,
 'audio_path': '/content/data/audio/000001.mp3',
 'blacklist_flags': [],
 '_id': '927a998a-007c-11f1-a7d0-0242ac1c000c',
 'duration': 734.864}
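
Comparing this entry with the one printed before validation, the visible effect on the text fields is that they are now lowercase. A minimal sketch of that kind of normalization follows; the `normalize_text_fields` helper and its field list are assumptions, and `tokenize_metadata` itself may well do more than this.

```python
TEXT_FIELDS = ("artist", "album_title", "genres", "track_title")  # assumed field list


def normalize_text_fields(entry: dict) -> dict:
    """Return a copy of the entry with its text fields stripped and lowercased."""
    normalized = dict(entry)
    for field in TEXT_FIELDS:
        value = normalized.get(field)
        if isinstance(value, str):
            normalized[field] = value.strip().lower()
    return normalized


entry = {"artist": "Alexander Sigman", "album_title": "VURT Cycle", "tempo": 92}
print(normalize_text_fields(entry))
# {'artist': 'alexander sigman', 'album_title': 'vurt cycle', 'tempo': 92}
```
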

## **E. Blacklisted Genre Extraction**

In [30]:
entries = extract_blacklisted_genres(entries)

In [31]:
# Entries for which "bad_genre" flag exists:
bad_genre_entries = [e for e in entries if "bad_genre" in e["blacklist_flags"]]
print(bad_genre_entries)

[{'track_id': 133528, 'artist': 'walker j. sheldon', 'album_title': 'collected narratives', 'genres': "['electronic', 'podcast', 'audiobook']", 'track_title': 'two plus two makes crazy', 'tempo': 60, 'audio_path': '/content/data/audio/133528.mp3', 'blacklist_flags': ['bad_genre'], '_id': '927ae5f2-007c-11f1-a7d0-0242ac1c000c', 'duration': 217.447}]
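
The flagged entry carries the genres `podcast` and `audiobook`, which are plausibly not music. The actual blacklist used by `extract_blacklisted_genres` is not shown, so the sketch below uses a hypothetical `BLACKLISTED_GENRES` set and a hypothetical `flag_bad_genres` helper to illustrate how such a flag could be set without deleting the entry.

```python
import ast

BLACKLISTED_GENRES = {"podcast", "audiobook"}  # hypothetical blacklist


def flag_bad_genres(entry: dict) -> dict:
    """Append "bad_genre" to blacklist_flags if any genre is blacklisted."""
    try:
        genres = ast.literal_eval(entry.get("genres", "[]"))
    except (ValueError, SyntaxError):
        genres = []
    flags = entry.setdefault("blacklist_flags", [])
    if (
        isinstance(genres, list)
        and BLACKLISTED_GENRES.intersection(genres)
        and "bad_genre" not in flags
    ):
        flags.append("bad_genre")
    return entry


ok = flag_bad_genres({"genres": "['folk']", "blacklist_flags": []})
bad = flag_bad_genres(
    {"genres": "['electronic', 'podcast', 'audiobook']", "blacklist_flags": []}
)
print(ok["blacklist_flags"])   # []
print(bad["blacklist_flags"])  # ['bad_genre']
```

Flagging rather than pruning keeps the decision reversible: downstream code can choose whether to exclude flagged entries.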


## **F. Chunk/Segment Entries with Long Audio Tracks**

In [32]:
entries = chunk_audio(entries)

In [33]:
entries[-6:]

[{'track_id': 133528,
  'artist': 'walker j. sheldon',
  'album_title': 'collected narratives',
  'genres': "['electronic', 'podcast', 'audiobook']",
  'track_title': 'two plus two makes crazy',
  'tempo': 60,
  'audio_path': '/content/data/audio/133528_0.wav',
  'blacklist_flags': ['bad_genre'],
  '_id': '927ae5f2-007c-11f1-a7d0-0242ac1c000c',
  'duration': 180.0,
  'partition': 0,
  'start_sec': 0},
 {'track_id': 1,
  'artist': 'alexander sigman',
  'album_title': 'vurt cycle',
  'genres': "['experimental', 'contemporary classical']",
  'track_title': 'dlxsf',
  'tempo': 92,
  'audio_path': '/content/data/audio/000001_1.wav',
  'blacklist_flags': [],
  '_id': '11619eb4-007e-11f1-a7d0-0242ac1c000c',
  'duration': 180.0,
  'partition': 1,
  'start_sec': 180},
 {'track_id': 1,
  'artist': 'alexander sigman',
  'album_title': 'vurt cycle',
  'genres': "['experimental', 'contemporary classical']",
  'track_title': 'dlxsf',
  'tempo': 92,
  'audio_path': '/content/data/audio/000001_2.wav',
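
The chunked entries above have a `duration` of 180.0 seconds, consecutive `partition` indices, and `start_sec` offsets in steps of 180, which suggests that long tracks are cut into fixed-length segments. A minimal sketch of the boundary arithmetic only (no audio is read or written) follows. The `chunk_spans` helper is hypothetical, and how `chunk_audio` treats a short final segment is an assumption.

```python
def chunk_spans(duration: float, max_len: int = 180) -> list[dict]:
    """Split `duration` seconds into consecutive spans of at most `max_len` seconds."""
    spans = []
    start = 0
    partition = 0
    while start < duration:
        length = min(float(max_len), duration - start)
        spans.append(
            {"partition": partition, "start_sec": start, "duration": round(length, 3)}
        )
        start += max_len
        partition += 1
    return spans


# Duration of track 1 as reported in the example entry.
for span in chunk_spans(734.864):
    print(span)
```

Under these assumptions a 734.864-second track yields five spans, the last one shorter than the rest.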

## **G. Silent Region Detection**

In [35]:
entries = add_silent_regions(entries)

Silent region detection: 100%|██████████| 108/108 [00:15<00:00,  6.88it/s]


In [36]:
silent_region_entries = [e for e in entries if e["silent_regions"]]
len(silent_region_entries)

4
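
Each `silent_regions` value is a list of `(start_sec, end_sec)` spans. One common way to obtain such spans is to look for runs of samples whose amplitude stays below a threshold for a minimum length of time. The sketch below shows that idea on a toy signal. It is not the implementation of `add_silent_regions`; the `find_silent_regions` helper and both default parameters are hypothetical.

```python
def find_silent_regions(
    samples: list[float],
    sample_rate: int,
    threshold: float = 0.01,
    min_silence_sec: float = 2.0,
) -> list[tuple[float, float]]:
    """Return (start_sec, end_sec) spans where |amplitude| stays below threshold."""
    regions = []
    run_start = None
    # A loud sentinel sample closes a silent run that reaches the end of the signal.
    for i, value in enumerate(samples + [1.0]):
        if abs(value) < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if (i - run_start) / sample_rate >= min_silence_sec:
                regions.append((run_start / sample_rate, i / sample_rate))
            run_start = None
    return regions


# Toy signal at 10 samples per second: 1 s of sound, 3 s of silence, 1 s of sound.
signal = [0.5] * 10 + [0.0] * 30 + [0.5] * 10
print(find_silent_regions(signal, sample_rate=10))  # [(1.0, 4.0)]
```
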

In [37]:
silent_region_entries

[{'track_id': 1,
  'artist': 'alexander sigman',
  'album_title': 'vurt cycle',
  'genres': "['experimental', 'contemporary classical']",
  'track_title': 'dlxsf',
  'tempo': 92,
  'audio_path': '/content/data/audio/000001_0.wav',
  'blacklist_flags': [],
  '_id': '927a998a-007c-11f1-a7d0-0242ac1c000c',
  'duration': 180.0,
  'partition': 0,
  'start_sec': 0,
  'silent_regions': [(2.74, 7.43),
   (10.879, 14.036),
   (24.404, 26.645),
   (38.51, 42.574),
   (127.501, 130.891),
   (2.74, 7.43),
   (10.879, 14.036),
   (24.404, 26.645),
   (38.51, 42.574),
   (127.501, 130.891)]},
 {'track_id': 133528,
  'artist': 'walker j. sheldon',
  'album_title': 'collected narratives',
  'genres': "['electronic', 'podcast', 'audiobook']",
  'track_title': 'two plus two makes crazy',
  'tempo': 60,
  'audio_path': '/content/data/audio/133528_0.wav',
  'blacklist_flags': ['bad_genre'],
  '_id': '927ae5f2-007c-11f1-a7d0-0242ac1c000c',
  'duration': 180.0,
  'partition': 0,
  'start_sec': 0,
  'silent_

## **H. Serialize to JSON**

In [38]:
with open("/content/data/training_data.json", "w") as f:
  json.dump(entries, f, indent=4)
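
One detail to keep in mind when reading the file back: JSON has no tuple type, so the `(start, end)` tuples in `silent_regions` are written as arrays and come back as lists. A small round-trip sketch on an in-memory entry (the `entry` dictionary is a shortened illustration, not a full record):

```python
import json

entry = {"track_id": 1, "silent_regions": [(2.74, 7.43), (10.879, 14.036)]}

restored = json.loads(json.dumps(entry))
print(restored["silent_regions"])  # [[2.74, 7.43], [10.879, 14.036]]

# Convert back to tuples if downstream code expects them.
regions = [tuple(r) for r in restored["silent_regions"]]
print(regions)  # [(2.74, 7.43), (10.879, 14.036)]
```
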