# Abstract

Vinyl is a sequential deep learning model (a recurrent neural network, or RNN) that classifies the genre (or genres) of a song based on its groove, where the groove is defined as the musical qualities of the notes in a song that are most closely related to its rhythm and beat.

I collect data from Spotify for playlists from close to 3,000 musical genres, and obtain audio analysis data for 100-200 songs per genre. By training a specialized RNN on a sequence of each note played in each song, I build a model that is able to identify the genres most strongly represented by a song. 

Genres are not well defined at a fine level of detail, and this methodology is intended to use the vague boundaries between musical genres to produce a map or genealogy of the evolution of musical styles. This is useful for understanding the history and development of music, and for helping listeners to explore new musical styles that should be similar to their tastes.

## Problem Statement
- Who are your customers?
- What is the problem?
- What solution do you propose?

## Work Plan
- What data will you collect?
- What models can you use to analyze it?
- How will you know that your models work?

# Obtain the Data

To build this model, I pull data from [Every Noise At Once](http://everynoise.com/engenremap.html), a website that maps out close to 3,000 genres of music in a space roughly characterized by instrumentation that trends from organic to electric on the vertical axis, and a musical quality that ranges from dense and atmospheric to spiky and bouncy along the horizontal axis.

Each of these genres has its own page that contains a word cloud of popular artists in the genre, as well as links to Spotify playlists with 100-200 songs that represent the genre's style. I scrape Every Noise to collect a list of playlist URIs for each genre, and then I use the Spotify API to collect audio analysis files for the songs in each playlist.
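
As a rough sketch of the parsing involved in this step, the snippet below derives a genre label from a genre-page URL and a playlist id from a playlist link. The example URLs and the helper names `genre_label` and `playlist_id` are hypothetical; they assume genre pages named `engenremap-<genre>.html` and playlist links whose last path segment is the playlist id.

```python
def genre_label(url):
    """Return the genre slug from a URL like .../engenremap-<genre>.html."""
    page = url.split('/')[-1]          # e.g. 'engenremap-deephouse.html'
    return page.split('-')[-1].split('.')[0]


def playlist_id(links, owner='thesoundsofspotify'):
    """Return the id of the first playlist link that mentions `owner`."""
    matches = [link.split('/')[-1] for link in links if owner in link]
    if not matches:
        raise ValueError(f"No playlist link found for {owner!r}.")
    return matches[0]


# Hypothetical inputs, for illustration only.
example_page = 'http://everynoise.com/engenremap-deephouse.html'
example_links = [
    'https://open.spotify.com/user/someone/playlist/AAA111',
    'https://open.spotify.com/user/thesoundsofspotify/playlist/BBB222',
]

label = genre_label(example_page)
pid = playlist_id(example_links)
print(label, pid)
```

Raising an error when no matching link is found is a choice made for this sketch; it makes a genre page without a matching playlist fail with a clear message rather than an `IndexError`.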

*After completing this step, be sure to edit `references/data_dictionary` to include descriptions of where you obtained your data and what information it contains.*

## Map the Data Pipeline


## Build the Pipeline Tools

Make sure these steps are reproducible by code. Put some thought into the directory structures and filepaths you are using to save your data, so it's easy to load files you need.

In [54]:
import os, pickle, re, requests
import pandas as pd
from bs4 import BeautifulSoup

import functools

# FUNCTIONS TO GET GENRES AND PLAYLIST LINKS FROM EVERY-NOISE-AT-ONCE

def load_or_make(creator):
    """
    Loads data that is pickled at filepath if filepath exists;
    otherwise, calls creator(*args, **kwargs) to create the data 
    and pickle it at filepath.
    Returns the data in either case.
    
    Inputs:
    - filepath: path to where data is / should be stored
    - creator: function to create data if it is not already pickled
    - *args, **kwargs: arguments passed to creator()
    
    Outputs:
    - item: the data that is stored at filepath
    """
    @functools.wraps(creator)
    def cached_creator(filepath, *args, **kwargs):
        if os.path.isfile(filepath):
            with open(filepath, 'rb') as pkl:
                item = pickle.load(pkl)
        else:
            item = creator(*args, **kwargs)
            with open(filepath, 'wb') as pkl:
                pickle.dump(item, pkl)
        return item
    return cached_creator

@load_or_make
def scrape_all_links(domain, index, target_pattern):
    """
    Scrapes a website and compiles a list of urls that match a target pattern.
    
    Inputs: 
    - domain: domain of the website you want to scrape
    - index: path to the page that you want to scrape from `domain`
    - target_pattern: regex that specifies the types of links you want to collect
    
    Outputs:
    - target_urls: list of all the links on domain/index that match target_pattern
    """
    main_page = '/'.join(['http:/', domain, index])
    response = requests.get(main_page)

    if response.status_code != 200:
        raise ConnectionError(f"Failed to connect to {main_page}.")

    soup = BeautifulSoup(response.text, "lxml")

    target_regex = re.compile(target_pattern)
    target_urls = ['/'.join(['http:/', domain, x['href']])
                    for x in soup.find_all('a', {'href':target_regex})]

    return target_urls

@load_or_make
def scrape_links_from_each_page(urls, target_pattern, labeler=(lambda x:x)):
    """
    Loops over a list of urls and finds links that match a target pattern from each page.
    
    Inputs:
    - urls: the list of urls to scrape links from
    - target_pattern: regex that specifies the types of links you want to collect
    - labeler: function that parses a url and returns a label for that page
    
    Outputs:
    - links: a dictionary with key/value pairs {url_label:[scraped_links]}
    """
    links = {}

    for url in urls:
        response = requests.get(url)
        label = labeler(url)

        if response.status_code != 200:
            raise ConnectionError(f"Failed to connect to {url}.")

        soup = BeautifulSoup(response.text, "lxml")

        target_regex = re.compile(target_pattern)
        target_urls = [x['href'] for x in soup.find_all('a', {'href':target_regex})]

        links[label] = target_urls
    
    return links



In [102]:
# FUNCTIONS TO GET PLAYLIST METADATA FROM SPOTIFY

def get_tags(track):
    '''
    Parse metadata for a spotify track
    From a user_playlist json file, a track can be found via:
        user_playlist['tracks']['items'][i]
    '''
    tags =  {
        'id': track['id'],
        'album': track['album']['name'],
        'track': track['track_number'],
        'title': track['name'],
        'artist': track['artists'][0]['name'],
        'duration': int(track['duration_ms']/1000),
        'preview_mp3': track['preview_url'],
        'is_explicit': track['explicit'],
        'isrc_number': track['external_ids'].get('isrc', ''),
        'release_date': track['album']['release_date']
    }
    if track['album']['images']:
        tags['cover_art_url'] = track['album']['images'][0]['url']
    return tags

def build_metadata_df(tracks, client):
    """Build a DataFrame of track tags merged with Spotify audio features."""
    # read tags from the playlist JSON, skipping items that have no track data
    metadata = [get_tags(item['track']) for item in tracks['items'] if item['track']]
    metadata_df = pd.DataFrame(metadata)
    # add more features from the tracks' audio features JSON
    features = client.audio_features(list(metadata_df['id']))
    features_df = pd.DataFrame(features)
    metadata_df = pd.merge(metadata_df, features_df)

    return metadata_df

def download_playlist_metadata(user, pid, pname, client):
    # get metadata for playlist 'pname' by 'user'
    results = client.user_playlist(user, pid, fields="tracks,next")
    tracks = results['tracks']

    all_dfs = []
    batch_df = build_metadata_df(tracks, client)
    all_dfs.append(batch_df)

    while tracks['next']:
        tracks = client.next(tracks)
        batch_df = build_metadata_df(tracks, client)
        all_dfs.append(batch_df)
    metadata = pd.concat(all_dfs)
    metadata.reset_index(drop=True, inplace=True)

    return metadata


def parse_sos_pid(playlists):
    return [x.split('/')[-1] for x in playlists if 'thesoundsofspotify' in x][0]

def download_all_genres_metadata(genre_playlists, client):
    for k,v in genre_playlists.items():
        pname = k
        filepath = f'../data/interim/genre_metadata/{pname}_metadata.tsv'
        if os.path.isfile(filepath):
            continue
        pid = parse_sos_pid(v)
        metadata = download_playlist_metadata('thesoundsofspotify', pid, pname, client)
        metadata.to_csv(filepath, sep='\t', index=False)

## Run the Pipeline

Set this up so that you won't need to download datasets that you already have on your computer when you re-run the pipeline.
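
A minimal sketch of the "skip work that is already on disk" idea, using only the standard library. The decorator mirrors the pattern of `load_or_make` above, but the names `cache_to_pickle`, `slow_download`, and `calls` are hypothetical and exist only for this illustration.

```python
import functools
import os
import pickle
import tempfile


def cache_to_pickle(creator):
    """Wrap `creator` so its result is pickled at `filepath` and reused."""
    @functools.wraps(creator)
    def cached(filepath, *args, **kwargs):
        if os.path.isfile(filepath):
            with open(filepath, 'rb') as pkl:
                return pickle.load(pkl)
        item = creator(*args, **kwargs)
        with open(filepath, 'wb') as pkl:
            pickle.dump(item, pkl)
        return item
    return cached


calls = []  # records each time the expensive step actually runs


@cache_to_pickle
def slow_download(n):
    calls.append(n)
    return list(range(n))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'numbers.pkl')
    first = slow_download(path, 3)    # runs the creator and writes the pickle
    second = slow_download(path, 3)   # loads the pickle; the creator is skipped
```

Note that the cache is keyed only by `filepath`: calling the wrapped function with different arguments but the same path returns the previously pickled result, so the file has to be deleted to force a refresh.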

In [105]:
import os
import spotipy.oauth2 as oauth2
from dotenv import load_dotenv

load_dotenv('.env')


def generate_token():
    """ Generate the token. Please respect these credentials :) """
    credentials = oauth2.SpotifyClientCredentials(
        client_id=os.getenv("SPOTIPY_CLIENT_ID"),
        client_secret=os.getenv("SPOTIPY_CLIENT_SECRET"))
    token = credentials.get_access_token()
    return token

token=generate_token()

In [4]:
import spotipy.util as util

username='djconxn'

# Cache needs to be clear to load a new token.
try:
    os.remove(f".cache-{username}")
except FileNotFoundError:
    # no cached token to clear
    pass


In [5]:
import spotipy

def run_data_pipeline():
    """
    - scrape genre page urls from everynoise.com/engenremap.html,
        save as a list in ../data/raw/everynoise_genre_urls.pkl
        
    - scrape genre playlist urls from each genre page on everynoise.com,
        save as a dictionary in ../data/raw/thesoundsofspotify_playlist_urls.pkl
        
    - download playlist metadata for each playlist from Spotify,
        save as TSV files in ../data/interim/genre_metadata/[genre]_metadata.tsv
        
    - download audio_analysis files for each song in a list of playlists
        (not necessarily all playlists because there are 100s of 1000s in the full set)
        save as audio_analysis dictionaries in ../data/raw/audio_analysis/[song_uri].pkl
    
    TODO: include a progress indicator?
    """
    genre_urls = scrape_all_links(
        '../data/raw/everynoise_genre_urls.pkl',
        domain='everynoise.com', 
        index='engenremap.html', 
        target_pattern='engenremap-[a-z]*')
    
    genre_playlists = scrape_links_from_each_page(
        '../data/raw/thesoundsofspotify_playlist_urls.pkl',
        urls=genre_urls,
        labeler=(lambda url: url.split('/')[-1].split('-')[-1].split('.')[0]),
        target_pattern='open.spotify.com')
    
    sp = spotipy.Spotify(auth=token)
    
    download_all_genres_metadata(genre_playlists, sp)
    

    

In [107]:
len(genre_playlists)

2911

In [108]:
run_data_pipeline()

# Scrub the Data

So far I have:
- a directory of `genre_metadata` TSV files, whose `id` column maps to the filenames in...
- a directory full of preview mp3s for songs in each of the genre playlists.

I want to convert the audio data from these mp3s into features that I can use to train a model to classify new songs as representing one of the genres in my training set.

For my MVP model, I'll work with [MFCCs](https://musicinformationretrieval.com/mfcc.html) (mel-frequency cepstral coefficients), a small set of features (usually about 10-20) that concisely describe the overall shape of a spectral envelope. In music information retrieval (MIR), they are often used to describe timbre.
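
To make "the overall shape of a spectral envelope" more concrete, here is a simplified NumPy sketch of the usual MFCC recipe: frame the signal, take the power spectrum, pool it into mel bands, take the log, and apply a DCT. This is an illustration under my own assumptions, not librosa's implementation (which differs in details such as scaling and framing); the function names and the synthetic 440 Hz tone are hypothetical.

```python
import numpy as np


def hz_to_mel(freq_hz):
    return 2595.0 * np.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters with edges evenly spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bin_hz = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, bin_hz.size))
    for i in range(n_mels):
        low, center, high = edges_hz[i:i + 3]
        rising = (bin_hz - low) / (center - low)
        falling = (high - bin_hz) / (high - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def simple_mfcc(y, sr, n_mfcc=12, n_fft=2048, hop=512, n_mels=40):
    """Return an array of shape (n_frames, n_mfcc)."""
    n_frames = 1 + (len(y) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([y[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2          # power spectrum per frame
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sr).T  # pool FFT bins into mel bands
    log_mel = np.log(mel_energy + 1e-10)                      # compress the dynamic range
    # DCT-II over the mel bands; low-order terms capture the envelope's broad shape.
    k = np.arange(n_mfcc)[:, None]
    n = np.arange(n_mels)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))
    return log_mel @ dct_basis.T


sr = 22050
t = np.arange(sr) / sr                    # one second of audio
tone = np.sin(2 * np.pi * 440.0 * t)      # synthetic 440 Hz tone
mfcc = simple_mfcc(tone, sr)
print(mfcc.shape)
```

Each row is one short time frame and each column is one coefficient, which is the same layout the notebook gets by transposing librosa's output.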


In [2]:
import os
import random
import librosa
import numpy as np
import pandas as pd

## Load Data

In [3]:
# get paths for all genre metadataframes
genre_metadata_dir = "../data/interim/genre_metadata"
sample_mp3_dir = "../data/raw/mp3s"

In [4]:
def get_genre_songids(genre_metadata_dir, limit=None):
    genre_songids_dict = {}
    files = os.listdir(genre_metadata_dir)
    for x in files:
        genre = x.replace('_metadata.tsv', '')
        if genre == x:
            # filename did not contain _metadata.tsv
            continue
        metadata_path = os.path.join(genre_metadata_dir, x)
        songids = list(pd.read_csv(metadata_path, sep='\t').dropna()['id'])
        if limit:
            songids = songids[:limit]
        # shuffle the list of ids for safer train/test splits
        random.shuffle(songids)
        genre_songids_dict[genre] = songids
    return genre_songids_dict


In [5]:
genre_songids_dict = get_genre_songids(genre_metadata_dir)  #, limit=10)

FileNotFoundError: [Errno 2] No such file or directory: '../data/interim/genre_metadata'

## Verify Data Integrity

## Engineer Features

In [None]:
# End goal: dataframe with 
# 12 feature columns, a label column, and an id column
# Perform train/test split on label+id columns, (must convert id to int)
# then use id split to extract train/test matrices from dataframe
genre_features.tail()

In [None]:
def get_mfcc_from_mp3(songid, *args, **kwargs):
    filepath = os.path.join(sample_mp3_dir, songid) + ".mp3"
    x, sr = librosa.load(filepath)
    # pass audio and sample rate by keyword; recent librosa versions require it
    mfcc = pd.DataFrame(librosa.feature.mfcc(*args, y=x, sr=sr, **kwargs).T)
    mfcc['id'] = songid
    return mfcc

def collect_genre_mfccs(genre_songids_dict, n_mfcc=12):
    genre_mfccs = []
    for genre in genre_songids_dict.keys():
        mfccs = [get_mfcc_from_mp3(x, n_mfcc=n_mfcc)
                 for x in genre_songids_dict[genre]]
        genre_mfcc_df = pd.concat(mfccs)
        genre_mfcc_df['genre'] = genre
        genre_mfccs.append(genre_mfcc_df)
    genre_mfccs = pd.concat(genre_mfccs)
    genre_mfccs.reset_index(inplace=True, drop=True)
    return genre_mfccs


In [None]:
genre_features = collect_genre_mfccs(genre_songids_dict)
#test_genre_features = collect_genre_mfccs(test_genre_songpaths_dict)

*Before moving on to exploratory analysis, write down some notes about challenges encountered while working with this data that might be helpful for anyone else (including yourself) who may work through this later on.*

# Explore the Data

*Before you start exploring the data, write out your thought process about what you're looking for and what you expect to find. Take a minute to confirm that your plan actually makes sense.*

*Calculate summary statistics and plot some charts to give you an idea what types of useful relationships might be in your dataset. Use these insights to go back and download additional data or engineer new features if necessary. Not now though... remember we're still just trying to finish the MVP!*

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import librosa.display
import IPython.display as ipd  # provides ipd.Audio for in-notebook playback

## Inspect Raw Data

In [None]:
# load 3 random songs
songid_opera = random.choice(genre_songids_dict['opera'])
filename_opera = os.path.join(sample_mp3_dir, songid_opera) + ".mp3"
x_opera, sr_opera = librosa.load(filename_opera)

songid_techno = random.choice(genre_songids_dict['techno'])
filename_techno = os.path.join(sample_mp3_dir, songid_techno) + ".mp3"
x_techno, sr_techno = librosa.load(filename_techno)

songid_kpop = random.choice(genre_songids_dict['kpop'])
filename_kpop = os.path.join(sample_mp3_dir, songid_kpop) + ".mp3"
x_kpop, sr_kpop = librosa.load(filename_kpop)

In [None]:
# plot waveform
plt.figure(figsize=(14, 5))
librosa.display.waveshow(x_opera, sr=sr_opera)  # waveshow replaces waveplot in librosa >= 0.10

In [None]:
ipd.Audio(x_opera, rate=sr_opera)

In [None]:
# plot waveform
plt.figure(figsize=(14, 5))
librosa.display.waveshow(x_techno, sr=sr_techno)  # waveshow replaces waveplot in librosa >= 0.10

In [None]:
ipd.Audio(x_techno, rate=sr_techno)

In [None]:
# plot waveform
plt.figure(figsize=(14, 5))
librosa.display.waveshow(x_kpop, sr=sr_kpop)  # waveshow replaces waveplot in librosa >= 0.10

In [None]:
ipd.Audio(x_kpop, rate=sr_kpop)

## Inspect Features for Predictive Patterns

In a regression model, we are looking for clear relationships between our features and targets.

In a classification model, we are looking for features that separate the population into distinct distributions.
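
One rough way to put a number on "separates the population into distinct distributions" is an effect size between two genres for a single feature. The sketch below uses a Cohen's d style score (difference of means over the pooled standard deviation, assuming equal group sizes); the function name and the feature values are made up for illustration.

```python
from statistics import mean, stdev


def separation_score(values_a, values_b):
    """Absolute difference of means divided by the pooled standard deviation."""
    pooled_sd = ((stdev(values_a) ** 2 + stdev(values_b) ** 2) / 2) ** 0.5
    return abs(mean(values_a) - mean(values_b)) / pooled_sd


# Made-up feature values for two genres, for illustration only.
separated = separation_score([1.0, 1.2, 0.8, 1.1], [3.0, 3.1, 2.9, 3.2])
overlapping = separation_score([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
print(f"separated: {separated:.2f}, overlapping: {overlapping:.2f}")
```

A large score suggests the feature alone pulls the two genres apart; a score well below 1 suggests the distributions mostly overlap.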

In [None]:
sns.pairplot(genre_features, hue='genre')

- *What did you learn about your data?*
- *Does it look like there are clear patterns and relationships among your features that will allow you to make good predictions?*
- *Which features do you think will be most helpful?*

# Model the Data

*Describe the algorithms that you are considering. How do they work? Why are they good choices for this data and problem space?*

*What nuances in the data will you have to be aware of in order to avoid introducing bias to your model? What steps will you need to take to prevent overfitting? What risks are there for data leakage?*

In [None]:
import sklearn.svm  # import the submodule explicitly so sklearn.svm.LinearSVC resolves
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

## Train Test Split

In [None]:
# Since we are taking many observations from each song, 
# we don't want songs to appear in both the train and test set.
ids_genres = genre_features[['id', 'genre']].drop_duplicates()

id_train, id_test, y_train, y_test = train_test_split(
    ids_genres['id'], ids_genres['genre'], stratify=ids_genres['genre'])

train_select = genre_features['id'].isin(id_train)

X_train = genre_features[train_select]
X_test = genre_features[~train_select]
genre_train = genre_features[train_select]['genre']
genre_test = genre_features[~train_select]['genre']
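
The idea behind this split, sketched with the standard library only: split at the song level, genre by genre, and then assign every frame to whichever side its song landed on. The function name `split_song_ids`, the 25% test fraction, and the toy song ids are assumptions for this example, not part of the pipeline above.

```python
import random
from collections import defaultdict


def split_song_ids(song_genres, test_fraction=0.25, seed=0):
    """Split song ids into disjoint train/test sets, genre by genre."""
    rng = random.Random(seed)
    ids_by_genre = defaultdict(list)
    for song_id, genre in song_genres.items():
        ids_by_genre[genre].append(song_id)

    train_ids, test_ids = set(), set()
    for genre in sorted(ids_by_genre):
        ids = sorted(ids_by_genre[genre])
        rng.shuffle(ids)
        n_test = max(1, round(len(ids) * test_fraction))
        test_ids.update(ids[:n_test])
        train_ids.update(ids[n_test:])
    return train_ids, test_ids


# Made-up song ids and genres; each song contributes several frames.
song_genres = {f'opera_{i}': 'opera' for i in range(8)}
song_genres.update({f'techno_{i}': 'techno' for i in range(8)})
frames = [(song_id, frame) for song_id in song_genres for frame in range(5)]

train_ids, test_ids = split_song_ids(song_genres)
train_frames = [f for f in frames if f[0] in train_ids]
test_frames = [f for f in frames if f[0] in test_ids]
```

Checking that the two id sets are disjoint is a cheap guard against leakage, since frames from the same song are highly correlated.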

## Preprocessing

### Label Encoding

In [None]:
genre_encoder = LabelEncoder()
genre_encoder.fit(genre_features['genre'])
y_train = genre_encoder.transform(genre_train)
y_test = genre_encoder.transform(genre_test)

### Feature Scaling

In [None]:
feature_scaler = StandardScaler()
feature_scaler.fit(X_train.iloc[:,:-2])
X_train_scaled = feature_scaler.transform(X_train.iloc[:,:-2])
X_test_scaled = feature_scaler.transform(X_test.iloc[:,:-2])

## Build and Train Model

In [None]:
model_linear_svc = sklearn.svm.LinearSVC(max_iter=10000)
model_linear_svc.fit(X_train_scaled, y_train)

_Write down any thoughts you may have about working with these algorithms on this data. What looks to have been the most successful design choices? What pain points are you running into? What other ideas do you want to try out as you iterate on this pipeline?_

## Predict and Score

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
score = model_linear_svc.score(X_test_scaled, y_test)
score

## Inspect Errors

In [None]:
y_pred = model_linear_svc.predict(X_test_scaled)
conf_mat = confusion_matrix(y_test, y_pred)

conf_df = pd.DataFrame(conf_mat, index=genre_encoder.classes_+'_true', 
                       columns=genre_encoder.classes_+'_pred')

sns.heatmap(conf_df.div(conf_df.sum(axis=1), axis=0), annot=True)
plt.yticks(rotation=0)

In [None]:
# We have over 1000 MFCC frames per song, and one prediction for each frame.
# Convert the frame-level predictions to a single song-level prediction:
# the most commonly predicted genre across that song's frames.
results = pd.DataFrame({'true':y_test, 'pred':y_pred, 
                        'id':genre_features[~train_select]['id'],
                        'genre':genre_features[~train_select]['genre']})
modes = results.groupby(['true', 'id', 'genre']).agg(pd.Series.mode)

conf_mat = confusion_matrix(modes.index.codes[0], modes.pred)

conf_df = pd.DataFrame(conf_mat, index=genre_encoder.classes_+'_true', 
                       columns=genre_encoder.classes_+'_pred')

sns.heatmap(conf_df.div(conf_df.sum(axis=1), axis=0), annot=True)
plt.yticks(rotation=0)
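
The per-song majority vote can also be sketched without pandas, which makes the tie-handling explicit. This is an illustrative alternative, not the code used above: `vote_per_song` is a hypothetical helper and the predictions are made up. Unlike `pd.Series.mode`, which returns every tied value, `Counter.most_common(1)` returns a single label.

```python
from collections import Counter, defaultdict


def vote_per_song(song_ids, frame_predictions):
    """Return {song_id: most common predicted label across its frames}."""
    votes = defaultdict(Counter)
    for song_id, label in zip(song_ids, frame_predictions):
        votes[song_id][label] += 1
    # most_common(1) keeps the first-seen label when counts are tied
    return {song_id: counts.most_common(1)[0][0] for song_id, counts in votes.items()}


# Made-up frame-level predictions for two songs.
song_ids = ['a', 'a', 'a', 'b', 'b', 'b', 'b']
frame_predictions = ['kpop', 'techno', 'kpop', 'techno', 'techno', 'opera', 'techno']
song_predictions = vote_per_song(song_ids, frame_predictions)
print(song_predictions)
```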

In [None]:
modes = modes.reset_index()
kpop_pred_techno = modes[(modes['true']==0) & (modes['pred']==2)]['id'].iloc[0]
techno_pred_kpop = modes[(modes['true']==2) & (modes['pred']==0)]['id'].iloc[0]

In [None]:
songid_techno = techno_pred_kpop
filename_techno = os.path.join(sample_mp3_dir, songid_techno) + ".mp3"
x_techno, sr_techno = librosa.load(filename_techno)

songid_kpop = kpop_pred_techno
filename_kpop = os.path.join(sample_mp3_dir, songid_kpop) + ".mp3"
x_kpop, sr_kpop = librosa.load(filename_kpop)

In [None]:
# is techno, classified as k-pop
ipd.Audio(x_techno, rate=sr_techno)

In [None]:
# is k-pop, classified as techno
ipd.Audio(x_kpop, rate=sr_kpop)

# iNterpret the Model

_Write up the things you learned, and how well your model performed. Be sure to address the model's strengths and weaknesses. What types of data does it handle well? What types of observations tend to give it a hard time? What future work might you, or someone reading this, want to do, building on the lessons learned and tools developed in this project?_

## Strengths and Weaknesses

## What Else Can We Do?

- try new models, using cross-validation
- many observations per song: stacked model to classify from output probabilities
- engineer new audio features (check librosa docs, other audio analysis or DSP libraries, mir.com tutorials)
- engineer feature on MFCC series, create one observation per song
- use deep learning sequence models
- topic modeling
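
As one possible starting point for the "one observation per song" idea in the list above, the sketch below collapses frame-level features into per-song means and standard deviations with pandas. The toy DataFrame, the choice of statistics, and the flattened column names are assumptions for illustration; integer column names mimic the MFCC columns produced earlier.

```python
import pandas as pd

# Made-up frame-level features: two coefficients for two songs.
frames = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b', 'b'],
    'genre': ['opera'] * 3 + ['techno'] * 3,
    0: [1.0, 2.0, 3.0, 10.0, 10.0, 10.0],
    1: [0.0, 0.0, 3.0, 5.0, 7.0, 9.0],
})

feature_cols = [0, 1]
per_song = frames.groupby(['id', 'genre'])[feature_cols].agg(['mean', 'std'])
# Flatten the (feature, statistic) column MultiIndex into names like '0_mean'.
per_song.columns = [f'{col}_{stat}' for col, stat in per_song.columns]
per_song = per_song.reset_index()
print(per_song)
```

With one row per song, the train/test split no longer needs special handling to keep a song's frames together.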

Other features:
- https://librosa.github.io/librosa/generated/librosa.core.load.html
- https://musicinformationretrieval.com/energy.html
- https://librosa.github.io/librosa/generated/librosa.feature.tempogram.html