# 0. Data Acquisition

This notebook handles:
1. Downloading MTG-Jamendo metadata from GitHub
2. Applying labeling rules for our 4 target classes
3. Creating artist-disjoint train/val/test splits
4. Smoke testing the Jamendo API
5. Downloading audio previews
6. Saving the final manifest

**Target Classes:**
- House
- Techno
- Trance
- Drum and Bass

In [1]:
# Imports
import sys
sys.path.insert(0, '..')

from pathlib import Path
import pandas as pd

from src import config
from src.data_utils import (
    download_mtg_metadata,
    load_genre_metadata,
    build_labeled_dataset,
    artist_disjoint_split,
    JamendoClient,
    smoke_test_jamendo,
    download_audio_batch,
    save_manifest,
    load_manifest
)

## 1. Download MTG-Jamendo Metadata

In [2]:
# Download metadata files from GitHub
genre_path, meta_path = download_mtg_metadata()
print(f"\nGenre TSV: {genre_path}")
print(f"Meta TSV: {meta_path}")

Genre TSV already exists: c:\Users\adria\Desktop\zum_project\notebooks\..\data\raw\autotagging_genre.tsv
Downloading raw.meta.tsv...
Meta TSV: 7.85MB [00:00, 35.4MB/s]

Genre TSV: c:\Users\adria\Desktop\zum_project\notebooks\..\data\raw\autotagging_genre.tsv
Meta TSV: c:\Users\adria\Desktop\zum_project\notebooks\..\data\raw\raw.meta.tsv

In [3]:
# Load genre metadata
genre_df = load_genre_metadata(genre_path)
print(f"\nShape: {genre_df.shape}")
print(f"\nColumns: {genre_df.columns.tolist()}")
genre_df.head()

Loaded 55215 tracks from genre metadata

Shape: (55215, 6)

Columns: ['track_id', 'artist_id', 'album_id', 'path', 'duration', 'tags']


Unnamed: 0,track_id,artist_id,album_id,path,duration,tags
0,track_0000214,artist_000014,album_000031,14/214.mp3,124.6,[genre---punkrock]
1,track_0000215,artist_000014,album_000031,15/215.mp3,151.4,[genre---metal]
2,track_0000216,artist_000014,album_000031,16/216.mp3,234.9,[genre---metal]
3,track_0000217,artist_000014,album_000031,17/217.mp3,127.9,[genre---punkrock]
4,track_0000218,artist_000014,album_000031,18/218.mp3,180.7,[genre---punkrock]


In [4]:
# Show sample tags
print("Sample tags:")
for i in range(5):
    print(f"  {genre_df.iloc[i]['track_id']}: {genre_df.iloc[i]['tags']}")

Sample tags:
  track_0000214: ['genre---punkrock']
  track_0000215: ['genre---metal']
  track_0000216: ['genre---metal']
  track_0000217: ['genre---punkrock']
  track_0000218: ['genre---punkrock']


## 2. Apply Labeling Rules

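- Match each track's tags against the synonyms for our 4 target classes
- Keep only tracks that match exactly 1 target class (additional non-target tags are allowed)
- Cap at 1000 samples per class; fewer than 300 in any class aborts the build unless `allow_low_counts=True`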
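
The rules above can be sketched in plain Python. This is a minimal illustration, not the actual `build_labeled_dataset` implementation: the helper names `match_label` and `cap_per_class` are hypothetical, tags are assumed to use the `genre---<name>` form seen in the metadata, and the synonym table is abbreviated.

```python
import random
from collections import defaultdict

# Abbreviated synonym table (assumption: mirrors config.TAG_SYNONYMS).
TAG_SYNONYMS = {
    "house": ["house"],
    "techno": ["techno"],
    "trance": ["trance"],
    "drum_and_bass": ["drumandbass", "dnb"],
}
TAG_PREFIX = "genre---"


def match_label(tags):
    """Return the single target class matched by `tags`, or None.

    A track qualifies only if exactly one target class matches; tags
    outside the target classes (e.g. genre---dance) are ignored.
    """
    names = {t[len(TAG_PREFIX):] for t in tags if t.startswith(TAG_PREFIX)}
    matched = [label for label, syns in TAG_SYNONYMS.items() if names & set(syns)]
    return matched[0] if len(matched) == 1 else None


def cap_per_class(labeled, max_per_class, seed):
    """Randomly keep at most `max_per_class` track IDs per class.

    `labeled` is an iterable of (track_id, label) pairs.
    """
    by_label = defaultdict(list)
    for track_id, label in labeled:
        by_label[label].append(track_id)
    rng = random.Random(seed)
    capped = {}
    for label, ids in by_label.items():
        capped[label] = rng.sample(ids, max_per_class) if len(ids) > max_per_class else ids
    return capped


print(match_label(["genre---dance", "genre---house"]))   # house
print(match_label(["genre---house", "genre---techno"]))  # None
print(match_label(["genre---punkrock"]))                 # None
```

Dropping tracks that match two target classes keeps the task single-label, at the cost of discarding some usable audio.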

In [5]:
# Show labeling configuration
print("Target classes:", config.TARGET_CLASSES)
print("\nTag synonyms:")
for label, synonyms in config.TAG_SYNONYMS.items():
    print(f"  {label}: {synonyms}")
print(f"\nMax per class: {config.MAX_PER_CLASS}")
print(f"Min per class: {config.MIN_PER_CLASS}")

Target classes: ['house', 'techno', 'trance', 'drum_and_bass']

Tag synonyms:
  house: ['house']
  techno: ['techno']
  trance: ['trance']
  drum_and_bass: ['drum and bass', 'drum & bass', 'drum_and_bass', 'dnb', "drum'n'bass", 'drum n bass', 'drumandbass']

Max per class: 1000
Min per class: 300


In [6]:
# Build labeled dataset
# Set allow_low_counts=True if you want to proceed even with <300 samples per class
labeled_df = build_labeled_dataset(
    genre_df,
    max_per_class=config.MAX_PER_CLASS,
    min_per_class=config.MIN_PER_CLASS,
    seed=config.RANDOM_SEED,
    allow_low_counts=True  # Set to False to abort if counts too low
)
labeled_df.head()


Filtered to 4434 single-label tracks

Class distribution before capping:
  house: 1594
  techno: 1402
  trance: 1006
  drum_and_bass: 432

Final class distribution:
  house: 1000
  techno: 1000
  trance: 1000
  drum_and_bass: 432

Total: 3432 tracks


Unnamed: 0,track_id,artist_id,album_id,duration,label,label_idx,tags_raw
0,track_1283297,artist_416786,album_153040,294.0,house,0,genre---dance;genre---house
1,track_1353216,artist_490851,album_159538,180.0,house,0,genre---easylistening;genre---house;genre---lo...
2,track_1371096,artist_490499,album_160413,181.1,house,0,genre---dance;genre---electronic;genre---house
3,track_0986496,artist_428046,album_116107,352.2,house,0,genre---dance;genre---electronic;genre---funk;...
4,track_1420555,artist_497202,album_165686,211.0,house,0,genre---chillout;genre---house;genre---reggae


## 3. Create Artist-Disjoint Split

In [7]:
# Create train/val/test split by artist
manifest_df = artist_disjoint_split(
    labeled_df,
    train_ratio=config.TRAIN_RATIO,
    val_ratio=config.VAL_RATIO,
    seed=config.RANDOM_SEED
)
manifest_df.head()


Split statistics:
  train: 2493 tracks, 408 artists
    house: 616
    techno: 726
    trance: 818
    drum_and_bass: 333
  val: 445 tracks, 87 artists
    house: 145
    techno: 160
    trance: 98
    drum_and_bass: 42
  test: 494 tracks, 88 artists
    house: 239
    techno: 114
    trance: 84
    drum_and_bass: 57


Unnamed: 0,track_id,artist_id,album_id,duration,label,label_idx,tags_raw,split
0,track_1283297,artist_416786,album_153040,294.0,house,0,genre---dance;genre---house,train
1,track_1353216,artist_490851,album_159538,180.0,house,0,genre---easylistening;genre---house;genre---lo...,test
2,track_1371096,artist_490499,album_160413,181.1,house,0,genre---dance;genre---electronic;genre---house,train
3,track_0986496,artist_428046,album_116107,352.2,house,0,genre---dance;genre---electronic;genre---funk;...,train
4,track_1420555,artist_497202,album_165686,211.0,house,0,genre---chillout;genre---house;genre---reggae,val


In [8]:
# Verify no artist leakage
train_artists = set(manifest_df[manifest_df['split'] == 'train']['artist_id'])
val_artists = set(manifest_df[manifest_df['split'] == 'val']['artist_id'])
test_artists = set(manifest_df[manifest_df['split'] == 'test']['artist_id'])

print(f"Train artists: {len(train_artists)}")
print(f"Val artists: {len(val_artists)}")
print(f"Test artists: {len(test_artists)}")
print(f"\nOverlap train-val: {len(train_artists & val_artists)}")
print(f"Overlap train-test: {len(train_artists & test_artists)}")
print(f"Overlap val-test: {len(val_artists & test_artists)}")

Train artists: 408
Val artists: 87
Test artists: 88

Overlap train-val: 0
Overlap train-test: 0
Overlap val-test: 0
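
For reference, one way to obtain the zero-overlap property checked above is to shuffle the unique artist IDs and assign whole artists to a split. The sketch below illustrates that general technique under stated assumptions; it is not the code of `artist_disjoint_split`, and the function name `split_by_artist` and the toy rows are hypothetical. When ratios are applied to artists rather than tracks, per-split track counts and class balance can drift from the nominal ratios.

```python
import random


def split_by_artist(rows, train_ratio, val_ratio, seed):
    """Assign each row a split so that no artist spans two splits.

    `rows` is a list of dicts with an 'artist_id' key; a new list of dicts
    with an added 'split' key is returned.
    """
    artists = sorted({r["artist_id"] for r in rows})  # sort for determinism
    random.Random(seed).shuffle(artists)

    n_train = int(len(artists) * train_ratio)
    n_val = int(len(artists) * val_ratio)
    split_of = {}
    for i, artist in enumerate(artists):
        if i < n_train:
            split_of[artist] = "train"
        elif i < n_train + n_val:
            split_of[artist] = "val"
        else:
            split_of[artist] = "test"
    return [{**r, "split": split_of[r["artist_id"]]} for r in rows]


# Toy data: 20 artists with 3 tracks each (hypothetical IDs).
rows = [
    {"track_id": f"track_{a:02d}_{t}", "artist_id": f"artist_{a:02d}"}
    for a in range(20)
    for t in range(3)
]
manifest = split_by_artist(rows, train_ratio=0.7, val_ratio=0.15, seed=42)

# Collect the artists seen in each split and confirm they do not overlap.
artists_in = {
    s: {r["artist_id"] for r in manifest if r["split"] == s}
    for s in ("train", "val", "test")
}
print({s: len(a) for s, a in artists_in.items()})
print("leakage:", bool(artists_in["train"] & artists_in["test"]))
```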


## 4. Smoke Test Jamendo API

Before downloading all audio, test that the API works with a few sample track IDs.
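
A smoke test of this kind can be sketched as below. This is a hypothetical outline, not the `smoke_test_jamendo` implementation: `numeric_id`, `smoke_test`, and the injected `lookup` callable are illustrative names, and the stub catalog stands in for a real API call. It assumes the numeric Jamendo ID is the part after the `track_` prefix with leading zeros dropped, which is consistent with the `trackid=` values in the smoke test output.

```python
def numeric_id(track_id):
    """Convert an MTG-Jamendo ID such as 'track_0714710' to the integer 714710."""
    prefix = "track_"
    if not track_id.startswith(prefix):
        raise ValueError(f"unexpected track id: {track_id!r}")
    return int(track_id[len(prefix):])


def smoke_test(track_ids, lookup):
    """Return True only if `lookup` yields a non-empty audio URL for every ID.

    `lookup` is any callable mapping a numeric ID to a URL string or None;
    in real use it would wrap the API client (assumption).
    """
    all_ok = True
    for track_id in track_ids:
        url = lookup(numeric_id(track_id))
        ok = bool(url)
        print(f"{track_id}: {'PASSED' if ok else 'FAILED'}")
        all_ok = all_ok and ok
    return all_ok


# Stub lookup for illustration: pretends one track is unavailable.
fake_catalog = {714710: "https://example.org/714710.mp3", 669129: None}
passed = smoke_test(["track_0714710", "track_0669129"], fake_catalog.get)
print("all passed:", passed)  # all passed: False
```

Injecting the lookup keeps the check testable offline; the real client can be swapped in without changing the loop.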

In [9]:
# Select 3 random track IDs for smoke test
sample_tracks = manifest_df.sample(3, random_state=42)['track_id'].tolist()
print("Sample track IDs for smoke test:")
for t in sample_tracks:
    print(f"  {t}")

Sample track IDs for smoke test:
  track_0714710
  track_0669129
  track_1248252


In [10]:
# Run smoke test
client = JamendoClient()
passed = smoke_test_jamendo(sample_tracks, client)

if not passed:
    raise RuntimeError("Smoke test failed! Check API configuration before proceeding.")


JAMENDO API SMOKE TEST

Testing track_id: track_0714710
  ✓ PASSED: Audio URL accessible
    Name: Chaotic Grid
    URL: https://prod-1.storage.jamendo.com/?trackid=714710&format=mp...

Testing track_id: track_0669129
  ✓ PASSED: Audio URL accessible
    Name: Nightlounge (Plant of things Remix)
    URL: https://prod-1.storage.jamendo.com/?trackid=669129&format=mp...

Testing track_id: track_1248252
  ✓ PASSED: Audio URL accessible
    Name: Cheetah Chase
    URL: https://prod-1.storage.jamendo.com/?trackid=1248252&format=m...

✓ ALL SMOKE TESTS PASSED



## 5. Download Audio Previews

**This will take a while!**

Audio files are cached, so re-running will skip already downloaded files.
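
Caching of this kind typically amounts to checking for the target file before fetching. The sketch below is an assumption about how such a cache could work, not the code of `download_audio_batch`; `download_cached`, the `fetch` callable, and the `.mp3` file naming are hypothetical.

```python
import tempfile
from pathlib import Path


def download_cached(track_id, output_dir, fetch):
    """Return the local path for `track_id`, fetching only if it is missing.

    `fetch` is a callable returning the audio bytes for a track ID, or None
    on failure (hypothetical stand-in for an HTTP request).
    """
    path = Path(output_dir) / f"{track_id}.mp3"
    if path.exists() and path.stat().st_size > 0:
        return path  # cache hit: skip the download
    data = fetch(track_id)
    if not data:
        return None  # leave no file behind for failed downloads
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".part")
    tmp.write_bytes(data)  # write to a temporary name first ...
    tmp.replace(path)      # ... so an interrupted run never caches a partial file
    return path


calls = []


def fake_fetch(track_id):
    """Record each call and return dummy bytes instead of real audio."""
    calls.append(track_id)
    return b"fake audio bytes"


with tempfile.TemporaryDirectory() as tmp_dir:
    first = download_cached("track_0714710", tmp_dir, fake_fetch)
    second = download_cached("track_0714710", tmp_dir, fake_fetch)
    print(first == second, len(calls))  # True 1
```

The second call returns the same path without invoking `fake_fetch` again, which is the behaviour that makes re-running the download cell cheap.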

In [11]:
# Add filepath column (initially empty)
manifest_df['filepath'] = ''

# Download audio
# This updates the filepath column with local paths
manifest_df = download_audio_batch(
    manifest_df,
    output_dir=config.AUDIO_DIR,
    client=client
)

In [12]:
# Check download status
downloaded = manifest_df['filepath'].apply(lambda x: Path(x).exists() if x else False)
print(f"Successfully downloaded: {downloaded.sum()} / {len(manifest_df)}")

# Filter to only downloaded tracks
manifest_df = manifest_df[downloaded].reset_index(drop=True)
print(f"\nFinal manifest size: {len(manifest_df)}")

Successfully downloaded: 2517 / 3432

Final manifest size: 2517


## 6. Save Manifest

In [13]:
# Save manifest to CSV
save_manifest(manifest_df)
print(f"\nSaved to: {config.PROCESSED_DATA_DIR / 'manifest.csv'}")

Saved manifest to: c:\Users\adria\Desktop\zum_project\notebooks\..\data\processed\manifest.csv

Saved to: c:\Users\adria\Desktop\zum_project\notebooks\..\data\processed\manifest.csv


In [14]:
# Final summary
print("\n" + "="*60)
print("DATA ACQUISITION COMPLETE")
print("="*60)
print(f"\nTotal tracks: {len(manifest_df)}")
print("\nClass distribution:")
for label in config.TARGET_CLASSES:
    count = (manifest_df['label'] == label).sum()
    print(f"  {label}: {count}")
print("\nSplit distribution:")
for split in ['train', 'val', 'test']:
    count = (manifest_df['split'] == split).sum()
    print(f"  {split}: {count}")
print("="*60)


DATA ACQUISITION COMPLETE

Total tracks: 2517

Class distribution:
  house: 597
  techno: 750
  trance: 863
  drum_and_bass: 307

Split distribution:
  train: 1919
  val: 344
  test: 254
