# Genre Mapping: Spotify Genres → 15 Training Genres

This notebook documents how we assigned a genre label to each artist and how we mapped Spotify’s long-tail genre taxonomy
into a small set of learnable conditioning genres for a genre-conditioned MIDI generation model.

## Why this is needed
Spotify genres are fine-grained and long-tailed (e.g., "classic rock", "italo dance", "norteño-sax").  
For conditional generative modeling, we want a small number of coarse genres that have enough examples to be learnable.

We therefore:
1) retrieved artist genres using the Spotify API
2) inspected the distribution of genres in our dataset
3) created a keyword-based mapping to broad genres (`genre_broad`)
4) applied a hard whitelist to get final training genres (`genre_final`) with everything else mapped to `other`

In [1]:
import os
import json
import pandas as pd
import numpy as np

In [2]:
pd.set_option("display.max_rows", 200)
pd.set_option("display.max_colwidth", 100)

DATA_ROOT = "../dataset"          # folder with artist subfolders (if needed)
ARTIST_GENRES_CSV = "../artist_genres.csv"
INDEX_PARQUET = "../data/index.parquet"

## 1) Retrieving genres from Spotify

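For each artist name (derived from the dataset folder names), we queried Spotify’s search endpoint with `type="artist"`.
Spotify returns a list of genres per artist; among the top 10 matches, we kept the genres of the most popular one.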

Important: **Never commit Spotify credentials**. Use environment variables.

In [22]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

client_id = os.getenv("SPOTIFY_CLIENT_ID")
client_secret = os.getenv("SPOTIFY_CLIENT_SECRET")

assert client_id and client_secret, "Set SPOTIFY_CLIENT_ID and SPOTIFY_CLIENT_SECRET in your environment."

sp = spotipy.Spotify(
    client_credentials_manager=SpotifyClientCredentials(
        client_id=client_id,
        client_secret=client_secret
    )
)

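def get_artist_genres(artist_name: str) -> list:
    """Return the Spotify genre tags of the most popular artist matching `artist_name`."""
    results = sp.search(
        q=f"artist:{artist_name}",
        type="artist",
        limit=10
    )

    artists = results["artists"]["items"]
    if not artists:
        return []

    # Keep the most popular of the (up to 10) candidates. This can pick the
    # wrong artist when several artists share a similar name.
    best = max(artists, key=lambda a: a["popularity"])

    return best.get("genres", [])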

# Example
artist = "drake"
genres = get_artist_genres(artist)
print(f"{artist} → {genres}")

drake → ['rap']
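
The artist names come from folder names such as `VOF_de_Kunst` or `Wilson,Jackie`, which are not in the form a search query would normally use. The notebook does not show how the names were cleaned, so the helper below (`folder_to_artist_name`, a hypothetical name) is only a minimal sketch of one plausible normalization: underscores become spaces, and a single `Surname,Given` comma is flipped.

```python
def folder_to_artist_name(folder: str) -> str:
    """Hypothetical helper: turn a dataset folder name into a search-friendly artist name."""
    name = folder.replace("_", " ").strip()
    # "Surname,Given" -> "Given Surname"; names without exactly one comma are left as-is.
    if name.count(",") == 1:
        last, first = (part.strip() for part in name.split(","))
        if first and last:
            name = f"{first} {last}"
    return " ".join(name.split())  # collapse repeated whitespace


for folder in ["VOF_de_Kunst", "Wilson,Jackie", "The_Escape_Club"]:
    print(folder, "->", folder_to_artist_name(folder))
```

Names written surname-first with an underscore (for example `Morricone_Ennio`) cannot be detected this way and would need a manual override.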


In [5]:
gdf = pd.read_csv(ARTIST_GENRES_CSV)
gdf.head()

Unnamed: 0  artist           genre
0           VOF_de_Kunst     nederpop
1           Eubie_Blake      ragtime
2           Wilson,Jackie    northern soul
3           The_Escape_Club  Unknown
4           Morricone_Ennio  soundtrack


In [6]:
# The genre column may be a single label or a list-like string. We'll normalize it into a list.

import ast

def parse_genre_cell(cell):
    if pd.isna(cell):
        return []
    if isinstance(cell, list):
        return [str(x).strip().lower() for x in cell]
    s = str(cell).strip()
    if not s:
        return []
    # list-like?
    if s.startswith("[") and s.endswith("]"):
        try:
            arr = json.loads(s)
            if isinstance(arr, list):
                return [str(x).strip().lower() for x in arr]
        except Exception:
            pass
        try:
            arr = ast.literal_eval(s)
            if isinstance(arr, list):
                return [str(x).strip().lower() for x in arr]
        except Exception:
            pass
    # separators
    if "|" in s:
        return [x.strip().lower() for x in s.split("|") if x.strip()]
    if "," in s:
        parts = [x.strip().lower() for x in s.split(",") if x.strip()]
        if len(parts) > 1:
            return parts
    return [s.lower()]

gdf["genre_list"] = gdf["genre"].apply(parse_genre_cell)

# How many unique raw Spotify genre tags exist?
all_raw = [g for lst in gdf["genre_list"].tolist() for g in lst]
raw_counts = pd.Series(all_raw).value_counts()

print("Unique raw Spotify genre tags:", raw_counts.shape[0])
raw_counts.head(30)

Unique raw Spotify genre tags: 258


unknown                      684
nederpop                      51
doo-wop                       50
eurodance                     45
schlager                      44
italian singer-songwriter     43
classic rock                  42
disco                         40
motown                        36
soft rock                     33
new wave                      30
country                       26
classical                     24
christmas                     24
rockabilly                    23
blues                         21
jazz                          21
synthpop                      20
folk rock                     20
smooth jazz                   20
r&b                           20
glam metal                    19
ragtime                       19
new jack swing                18
glam rock                     18
progressive rock              18
big band                      17
italo dance                   17
adult standards               16
philly soul                   16
Name: count, dtype: int64

## 2) Why we map raw Spotify genres to a smaller set

Spotify genres are:
- extremely fine-grained
- long-tailed (many labels appear only a handful of times)
- inconsistent across similar artists

A conditional music model needs labels that are:
- stable
- coarse enough to learn
- supported by many examples

So we map raw tags to a **small set of broad genres**.
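
To put a number on the "long-tailed" claim, one option is to measure how many tags fall below a minimum count. The sketch below uses toy counts and a hypothetical cutoff of 5; in this notebook you would pass `raw_counts` instead. The function name `tail_summary` is illustrative.

```python
import pandas as pd


def tail_summary(counts: pd.Series, min_count: int = 5) -> dict:
    """Summarize how much of a tag distribution sits below `min_count` (hypothetical cutoff)."""
    rare = counts[counts < min_count]
    return {
        "n_tags": int(counts.shape[0]),
        "n_rare_tags": int(rare.shape[0]),
        # Fraction of distinct tags that are rare.
        "share_of_tags_rare": float(rare.shape[0] / counts.shape[0]),
        # Fraction of all tag occurrences that belong to rare tags.
        "share_of_occurrences_rare": float(rare.sum() / counts.sum()),
    }


# Toy counts for illustration only, not taken from the dataset.
toy_counts = pd.Series({"tag_a": 42, "tag_b": 40, "tag_c": 3, "tag_d": 1, "tag_e": 2})
summary = tail_summary(toy_counts, min_count=5)
print(summary)
```

A distribution is long-tailed in the sense used here when most distinct tags are rare but they account for only a small share of occurrences.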

In [7]:
BROAD_GENRE_MAPPING = {
    "rock": [
        "classic rock", "soft rock", "progressive rock", "folk rock",
        "hard rock", "punk", "grunge", "new wave",
        "alternative rock", "indie rock", "garage rock",
        "rock",
    ],
    "pop": [
        "italian singer-songwriter",
        "synthpop", "electropop", "dance pop", "pop rock",
        "schlager", "doo-wop",
        "pop",
    ],
    "hip-hop": [
        "memphis rap", "gangster rap", "southern hip hop",
        "hip hop", "trap", "drill", "grime", "g-funk",
        "rap",
    ],
    "electronic": [
        "eurodance", "italo dance", "happy hardcore", "bassline",
        "electronic", "edm", "house", "techno", "dubstep", "trance",
        "synthwave", "ambient", "new age", "idm", "downtempo",
        "trip hop", "big beat", "gabber", "uk garage",
    ],
    "jazz": [
        "adult standards", "easy listening",
        "smooth jazz", "jazz fusion", "bebop", "hard bop",
        "big band", "swing music",
        "jazz",
    ],
    "classical": [
        "baroque", "chamber", "orchestra", "symphony", "neoclassical",
        "opera", "choral", "early music", "classical",
        "gregorian chant",
    ],
    "metal": [
        "deathcore", "black metal", "death metal", "doom metal",
        "industrial metal", "nu metal",
        "metal",
    ],
    "r&b": [
        "motown", "neo soul", "new jack swing", "quiet storm",
        "r&b", "soul", "funk", "blues",
        "gospel",
    ],
    "country": [
        "traditional country", "honky tonk", "texas country", "red dirt",
        "americana", "country",
    ],
    "folk": [
        "singer-songwriter", "acoustic", "traditional music",
        "bluegrass", "celtic", "folk",
    ],
    "latin": [
        "corrido", "banda", "norteño", "vallenato", "bolero", "son cubano",
        "trova", "forró", "duranguense", "cuarteto", "sierreño",
        "latin", "salsa", "bossa nova", "reggaeton", "latin pop",
        "bachata", "merengue", "cumbia", "sertanejo", "tejano",
    ],
    "disco": ["nu disco", "italo disco", "post-disco", "hi-nrg", "disco"],
    "world": [
        "world", "chanson", "flamenco", "calypso", "gnawa", "dangdut",
        "variété française", "canzone napoletana", "exotica",
    ],
    "holiday": ["christmas", "holiday"],
}

# Build keyword pairs longest-first
keyword_pairs = []
for broad, kws in BROAD_GENRE_MAPPING.items():
    for kw in kws:
        keyword_pairs.append((kw.lower(), broad))
keyword_pairs.sort(key=lambda x: len(x[0]), reverse=True)

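def map_genres_to_broad(genre_list):
    """Map a list of raw tags to one broad genre via substring keyword matching."""
    if not genre_list:
        return "unknown"
    # The first raw tag that contains any keyword decides the label. Within a
    # tag, longer keywords win (e.g. "folk rock" maps to rock, not folk).
    for raw in genre_list:
        g = raw.lower()
        for kw, broad in keyword_pairs:
            if kw in g:
                return broad
    # preserve first raw genre if nothing matches (audit-friendly);
    # this is also how the literal tag "unknown" passes through unchanged
    return genre_list[0].lower()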

gdf["genre_broad"] = gdf["genre_list"].apply(map_genres_to_broad)
gdf["genre_broad"].value_counts().head(25)

genre_broad
unknown                684
rock                   329
pop                    262
r&b                    171
jazz                   125
electronic             111
classical               66
disco                   61
latin                   51
country                 48
metal                   45
world                   43
hip-hop                 36
folk                    33
holiday                 24
ragtime                 19
musicals                15
reggae                  12
polka                    8
neue deutsche welle      8
ska                      4
children's music         4
ccm                      3
comedy                   3
lullaby                  2
Name: count, dtype: int64
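
Because matching is by substring, it helps to see which keyword actually fired for a given tag. The sketch below uses a small subset of the keyword table and a hypothetical helper, `explain_mapping`, to show both intended hits (`nederpop` matches `pop`) and a side effect of the longest-first rule: a tag such as `chamber pop`, if it occurred, would map to classical because `chamber` is longer than `pop`.

```python
# Small subset of the keyword table, for illustration only.
DEMO_MAPPING = {
    "rock": ["folk rock", "rock"],
    "pop": ["pop"],
    "classical": ["chamber", "classical"],
    "folk": ["folk"],
}

# Same construction as in the notebook: (keyword, broad) pairs, longest keyword first.
demo_pairs = sorted(
    ((kw, broad) for broad, kws in DEMO_MAPPING.items() for kw in kws),
    key=lambda pair: len(pair[0]),
    reverse=True,
)


def explain_mapping(raw_tag, pairs=demo_pairs):
    """Return the (keyword, broad_genre) pair that fires for `raw_tag`, or None."""
    tag = raw_tag.lower()
    for kw, broad in pairs:
        if kw in tag:
            return kw, broad
    return None


for tag in ["folk rock", "nederpop", "chamber pop", "ragtime"]:
    print(f"{tag!r}: {explain_mapping(tag)}")
```

Running a helper like this over every distinct raw tag gives a compact audit table of keyword hits that can be reviewed by hand.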

In [8]:
TRAIN_GENRES = {
    "rock",
    "pop",
    "classical",
    "r&b",
    "electronic",
    "hip-hop",
    "jazz",
    "metal",
    "latin",
    "world",
    "disco",
    "country",
    "folk",
    "holiday",
}

def finalize_genre(broad):
    b = str(broad).strip().lower()
    return b if b in TRAIN_GENRES else "other"

gdf["genre_final"] = gdf["genre_broad"].apply(finalize_genre)
gdf["genre_final"].value_counts()

genre_final
other         791
rock          329
pop           262
r&b           171
jazz          125
electronic    111
classical      66
disco          61
latin          51
country        48
metal          45
world          43
hip-hop        36
folk           33
holiday        24
Name: count, dtype: int64
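
In the counts above, 684 of the 791 `other` artists are `unknown`, so `other` mixes "no Spotify genre" with genuine but rare genres. A quick way to inspect that mix is to break `other` down by `genre_broad`. The sketch below uses a toy frame; in this notebook the same expression would be applied to `gdf`.

```python
import pandas as pd

# Toy frame for illustration; the rows are made up.
demo = pd.DataFrame({
    "genre_broad": ["rock", "unknown", "unknown", "ragtime", "polka", "pop", "ragtime"],
    "genre_final": ["rock", "other", "other", "other", "other", "pop", "other"],
})

# Which broad labels were collapsed into `other`, and how often?
other_breakdown = (
    demo.loc[demo["genre_final"] == "other", "genre_broad"]
    .value_counts()
)
print(other_breakdown)
```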

## 3) How we picked the final 15 training genres

We chose a small, learnable label set based on:
1) Frequency: genres should have enough examples to learn stable conditioning embeddings.
2) Musical interpretability: genres should be intuitive to users (UI dropdown).
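3) Coverage: the 14 named genres cover about 64% of artists (about 93% of artists with a known Spotify genre) and about 76% of the rows in `index.parquet`.
4) Simplicity: micro-genres, ambiguous tags, and artists with no Spotify genre (`unknown`) are collapsed into `other`.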

Final training genres:
- rock, pop, classical, r&b, electronic, hip-hop, jazz, metal,
  latin, world, disco, country, folk, holiday, other
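
The notebook does not state a numeric frequency cutoff. As a sanity check on the frequency criterion, the sketch below applies a hypothetical threshold of 20 artists (`MIN_ARTISTS`, an assumption) to the `genre_broad` counts shown earlier; on those counts it happens to select the same 14 named genres as the whitelist.

```python
# Artist counts per broad genre, copied from the `genre_broad` output in this notebook
# (labels with fewer than 8 artists are omitted for brevity).
broad_counts = {
    "unknown": 684, "rock": 329, "pop": 262, "r&b": 171, "jazz": 125,
    "electronic": 111, "classical": 66, "disco": 61, "latin": 51,
    "country": 48, "metal": 45, "world": 43, "hip-hop": 36, "folk": 33,
    "holiday": 24, "ragtime": 19, "musicals": 15, "reggae": 12,
    "polka": 8, "neue deutsche welle": 8,
}

MIN_ARTISTS = 20  # hypothetical cutoff, not stated in the notebook

# Keep labels with enough artists; `unknown` is not a genre, so it is excluded.
selected = {g for g, n in broad_counts.items() if n >= MIN_ARTISTS and g != "unknown"}
print(sorted(selected))
```

This only shows that the whitelist is consistent with a simple support threshold; interpretability and simplicity, the other criteria listed above, are not captured by it.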

In [10]:
out = {
    "train_genres": sorted(TRAIN_GENRES) + ["other"],
    "broad_genre_mapping": BROAD_GENRE_MAPPING,
    "note": "Keyword-based broad mapping + hard whitelist to 15 final training genres.",
}

with open("../data/genre_mapping_decisions.json", "w") as f:
    json.dump(out, f, indent=2)

print("Saved: data/genre_mapping_decisions.json")

Saved: data/genre_mapping_decisions.json
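
Downstream code can read this file to build the label vocabulary for conditioning. The sketch below is one possible way to do that, not the project's actual loader: `build_genre_vocab` and `encode_genre` are hypothetical names, and the demo writes a tiny stand-in file rather than reading the real one.

```python
import json
import os
import tempfile


def build_genre_vocab(path):
    """Load the saved decisions file and return (genre_to_id, id_to_genre)."""
    with open(path, encoding="utf-8") as f:
        decisions = json.load(f)
    genres = decisions["train_genres"]
    genre_to_id = {g: i for i, g in enumerate(genres)}
    return genre_to_id, genres


def encode_genre(label, genre_to_id):
    """Map a label to its id, falling back to `other` for anything outside the vocabulary."""
    return genre_to_id.get(str(label).strip().lower(), genre_to_id["other"])


# Self-contained demo with a shortened genre list (assumption: same JSON layout as above).
demo = {"train_genres": ["jazz", "pop", "rock", "other"]}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "genre_mapping_decisions.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(demo, f)
    genre_to_id, id_to_genre = build_genre_vocab(demo_path)

print(encode_genre("Rock", genre_to_id), encode_genre("polka", genre_to_id))
```

Reading the list from the file, rather than hard-coding it, keeps the ids stable as long as the saved `train_genres` order is unchanged.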


In [11]:
# Optional: If you already built data/index.parquet, show final distribution there
try:
    idx = pd.read_parquet(INDEX_PARQUET)
    display(idx["genre_final"].value_counts())
except Exception as e:
    print("index.parquet not found or not readable yet:", e)

genre_final
rock          5601
other         4148
pop           2081
electronic     885
r&b            799
classical      670
jazz           563
metal          492
hip-hop        459
world          382
disco          322
holiday        245
country        225
folk           185
latin          175
Name: count, dtype: int64
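
This notebook does not show how `genre_final` was attached to `index.parquet`. One plausible approach, assuming the index has one row per file and an `artist` column (both assumptions, as is the `midi_path` column), is a left join from the index onto the per-artist table, with unmatched artists falling back to `other`.

```python
import pandas as pd

# Toy stand-ins; artist names, paths, and column names are hypothetical.
artist_genres = pd.DataFrame({
    "artist": ["artist_a", "artist_b"],
    "genre_final": ["rock", "classical"],
})
index = pd.DataFrame({
    "artist": ["artist_a", "artist_b", "artist_b", "artist_c"],
    "midi_path": ["a.mid", "b.mid", "c.mid", "d.mid"],
})

# Left join keeps every file; `validate` guards against duplicate artists in the genre table.
merged = index.merge(artist_genres, on="artist", how="left", validate="many_to_one")
merged["genre_final"] = merged["genre_final"].fillna("other")
print(merged["genre_final"].value_counts())
```

A join like this would also explain why the per-row distribution differs from the per-artist one: artists contribute different numbers of files.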

## Summary

- We used the Spotify API to retrieve a list of genre tags per artist.
- Spotify tags are long-tailed and too fine-grained for conditional generative modeling.
- We mapped raw tags to broad categories using substring keyword matching (`genre_broad`).
- We then applied a hard whitelist of 14 genres, mapping everything else to `other`, to create a small, learnable conditioning space of 15 labels (`genre_final`).
- All decisions and mappings are saved to `data/genre_mapping_decisions.json` for reproducibility.
