# Genre Classification - Data Collection

This project aims to predict the genre of songs from their audio features and lyrical content. Audio features, along with other track-level information, are obtained from [Spotify](https://spotify.com). Since the Spotify API doesn't offer genre information, I turned to [The Sounds of Spotify](https://open.spotify.com/user/thesoundsofspotify) to build the dataset of songs and genre labels used to train the predictive models. More about this music genre project can be found [here](https://artists.spotify.com/blog/how-spotify-discovers-the-genres-of-tomorrow). Lyrics are obtained from [Genius](https://genius.com).

## Contents
* [Imports](#Imports)
* [API Connections](#API)
* [Helper Functions](#Helpers)
* [Data Download](#DataDownload)
  * [Pop](#Pop)
  * [R&B](#R&B)
  * [Hip Hop](#HipHop)
  * [Latin](#Latin)
  * [Reggae](#Reggae)
  * [EDM](#EDM)
  * [Indie](#Indie)
  * [Rock](#Rock)
  * [Metal](#Metal)
  * [Country](#Country)
  * [Jazz](#Jazz)
  * [Classical](#Classical)
  * [Concatenation](#Concatenation)
  * [Collect Lyrics](#Lyrics)
* [Data Definitions](#DataDefinitions)

### Imports<a id='Imports'></a>

Importing required packages

In [None]:
import os
import re
import math
from datetime import datetime, timedelta, date
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup
from spotify_client import *
import lyricsgenius

### API Connections<a id='API'></a>

Connecting to the Spotify API, which will be used to fetch the songs in each of the Sounds of Spotify playlists as well as their audio features.

In [None]:
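# Credentials are read from environment variables rather than hardcoded in the notebook
# (the variable names below are an assumption; use whatever names your environment defines)
sp_client_id = os.environ['SPOTIFY_CLIENT_ID']
sp_client_secret = os.environ['SPOTIFY_CLIENT_SECRET']
sp = SpotifyAPI(sp_client_id, sp_client_secret)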

Connecting to the Genius API, which will be used to fetch each song's lyrics.

In [None]:
# Credentials are read from environment variables rather than hardcoded in the notebook
# (the variable names below are an assumption); only the access token is used by lyricsgenius
ge_client_id = os.environ.get('GENIUS_CLIENT_ID')
ge_client_secret = os.environ.get('GENIUS_CLIENT_SECRET')
gen = lyricsgenius.Genius(os.environ['GENIUS_ACCESS_TOKEN'])

### Helper Functions<a id='Helpers'></a>

Functions to collect, for a major genre, the tracks in each subgenre's playlist along with their audio features and lyrics.

In [None]:
def get_playlist_id(subgenre):
    '''
    Returns the ID for a given subgenre's Spotify playlist
    '''
    playlist_id = str()
    query = 'The Sound of ' + subgenre
    playlists = sp.search(query, search_type='playlist')['playlists']
    for playlist in playlists['items']:
        if query.lower() == playlist['name'].lower():
            playlist_id = playlist['id']
            break
    if (subgenre == 'viral pop'): playlist_id = '0tAsyMQoefUL8DWNn6xkAk'
    if (subgenre == 'classic country pop'): playlist_id = '6lOCvTH6vW5Jc7oyryNom4'
    if (subgenre == 'neoclassicism'): playlist_id = '1qJiG40Pdhyt3Mxslpk41M'
    return playlist_id

def get_track_features(track_id):
    '''
    Returns audio features of a given track, or None if the request fails
    '''
    feature_names = ['danceability', 'energy', 'loudness', 'speechiness',
                     'acousticness', 'instrumentalness', 'liveness', 'valence',
                     'tempo', 'mode']
    try:
        audio_features = sp.get_audio_features(track_id)
        return [audio_features[name] for name in feature_names]
    except Exception:
        # the request failed or returned an incomplete response (e.g. a timeout)
        print('Timeout occurred')
        return None

def get_tracks(genre, subgenre):
    '''
    Returns the playlist songs for a given subgenre of a major genre
    '''
    rows = []
    playlist_id = get_playlist_id(subgenre)
    playlist_tracks = sp.get_playlist_tracks(playlist_id)['items']
    for item in playlist_tracks:
        track = item['track']
        attributes = [track['id'], track['name'], track['artists'][0]['name'],
                      track['album']['name'], track['album']['release_date'],
                      genre, subgenre,
                      pd.to_numeric(track['duration_ms']), pd.to_numeric(track['popularity'])]
        features = get_track_features(track['id'])
        if features is None:
            # retry once before giving up on this track
            features = get_track_features(track['id'])
        if features is None:
            # skip tracks whose audio features could not be fetched
            continue
        rows.append(attributes + features)
    # DataFrame.append was removed in pandas 2.0, so the frame is built in one step
    return pd.DataFrame(rows, columns=cols)

def get_track_lyrics(df):
    '''
    Fills the lyrics_raw column of df with each song's lyrics using the Genius API
    '''
    for i in np.arange(0, len(df), 1):
        # cleaning the track name for our search query
        track_name = df.loc[i, 'track'].split("-", 1)[0].split("(", 1)[0]
        artist = df.loc[i, 'artist']
        print(track_name, artist)
        # skipping the track if it's instrumental
        if (df.loc[i, 'instrumentalness'] > 0.5):
            df.loc[i, 'lyrics_raw'] = 'Instrumental'
            continue
        try:
            # making a GET request through our API endpoint to fetch the track's lyrics
            song = gen.search_song(track_name, artist)
            df.loc[i, 'lyrics_raw'] = song.lyrics
        except Exception:
            # no match was found (song is None) or the request failed
            df.loc[i, 'lyrics_raw'] = None

    return df
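
The title cleaning inside `get_track_lyrics` can be looked at in isolation. The sketch below wraps the same two `split` calls in a hypothetical helper, `clean_track_name`, which is not part of the notebook. It also strips surrounding whitespace, which the notebook version does not do. The example titles are only illustrative.

```python
def clean_track_name(raw_name):
    """Drop everything after the first '-' or '(' and trim whitespace.

    Hypothetical helper mirroring the split logic in get_track_lyrics;
    the .strip() call is an addition for illustration.
    """
    return raw_name.split("-", 1)[0].split("(", 1)[0].strip()


examples = ["Hey Jude - Remastered 2015", "Stay (with Justin Bieber)", "Anti-Hero"]
cleaned = [clean_track_name(name) for name in examples]
print(cleaned)
```

Suffixes such as remaster notes and featured-artist parentheses are removed, but a title that itself contains a hyphen is truncated at the hyphen ("Anti-Hero" becomes "Anti"), so some searches may still miss.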

### Data Download<a id='DataDownload'></a>

Collecting the audio features and lyrics of songs from each subgenre within each major genre (retrieved from [here](http://everynoise.com/everynoise1d.cgi?scope=all)). There are twelve subgenre playlists per major genre including the major genre itself, and each playlist contains one hundred songs.
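
Since every major genre is expected to have exactly twelve subgenre playlists, a quick check on each list before downloading can catch typos early. The following is a minimal sketch under that assumption; `check_subgenres` is a hypothetical helper, not part of the notebook. It also shows how a missing comma between two string literals silently merges them into one entry.

```python
def check_subgenres(genre, subgenres, expected=12):
    """Return a list of problems found in one genre's subgenre list."""
    problems = []
    if len(subgenres) != expected:
        problems.append(f"{genre}: expected {expected} subgenres, found {len(subgenres)}")
    duplicates = {sg for sg in subgenres if subgenres.count(sg) > 1}
    if duplicates:
        problems.append(f"{genre}: duplicated entries {sorted(duplicates)}")
    return problems


# Python concatenates adjacent string literals, so the missing comma below
# turns 'ska revival' and 'rock steady' into the single entry 'ska revivalrock steady'.
toy = ['ska', 'ska revival'
       'rock steady']
print(check_subgenres('toy', toy, expected=3))
```

If every playlist is found and holds one hundred songs, twelve genres with twelve playlists each would yield 12 × 12 × 100 = 14,400 rows before any deduplication.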

In [None]:
cols = ['track_id', 'track', 'artist', 'album', 'release_date', 'genre', 'subgenre',
        'duration_ms', 'popularity', 'danceability', 'energy', 'loudness', 'speechiness',
        'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'mode']
df = pd.DataFrame(columns=cols)

#### Pop<a id='Pop'></a>

In [None]:
pop = pd.DataFrame()
pop_subgenres = ['pop', 'pop dance', 'dance pop', 'post-teen pop', 'electropop',
                 'social media pop', 'viral pop', 'boy band', 'girl group',
                 'indie cafe pop', 'tropical house', 'neo mellow']

In [None]:
for sg in pop_subgenres:
    pop = pd.concat([pop, get_tracks('pop', sg)], ignore_index=True)
    
pop.head()

#### R&B<a id='R&B'></a>

In [None]:
rnb = pd.DataFrame()
rnb_subgenres = ['r&b', 'urban contemporary', 'hip pop', 'neo soul',
                 'new jack swing', 'new jack smooth', 'deep smooth r&b',
                 'quiet storm', 'funk', 'soul', 'pop r&b', 'alternative r&b']

In [None]:
for sg in rnb_subgenres:
    rnb = pd.concat([rnb, get_tracks('r&b', sg)], ignore_index=True)
    
rnb.head()

#### Hip Hop<a id='HipHop'></a>

In [None]:
hiphop = pd.DataFrame()
hiphop_subgenres = ['hip hop', 'rap', 'pop rap', 'trap', 'melodic rap',
                    'alternative hip hop', 'gangster rap', 'hardcore hip hop', 'boom bap',
                    'conscious hip hop', 'underground hip hop', 'old school hip hop']

In [None]:
for sg in hiphop_subgenres:
    hiphop = pd.concat([hiphop, get_tracks('hip hop', sg)], ignore_index=True)
    
hiphop.head()

#### Latin<a id='Latin'></a>

In [None]:
latin = pd.DataFrame()
latin_subgenres = ['latin', 'latin pop', 'tropical', 'reggaeton', 'reggaeton flow',
                   'latin hip hop', 'trap latino', 'latin alternative',
                   'bachata', 'ranchera', 'mariachi', 'salsa']

In [None]:
for sg in latin_subgenres:
    latin = pd.concat([latin, get_tracks('latin', sg)], ignore_index=True)
    
latin.head()

#### Reggae<a id='Reggae'></a>

In [None]:
reggae = pd.DataFrame()
reggae_subgenres = ['reggae', 'roots reggae', 'dub', 'ska', 'ska revival',
                    'rock steady', 'lovers rock', 'modern reggae', 'early reggae',
                    'reggae fusion', 'dancehall', 'old school dancehall']

In [None]:
for sg in reggae_subgenres:
    reggae = pd.concat([reggae, get_tracks('reggae', sg)], ignore_index=True)
    
reggae.head()

#### EDM<a id='EDM'></a>

In [None]:
edm = pd.DataFrame()
edm_subgenres = ['edm', 'pop edm', 'electronic trap', 'dubstep', 'brostep',
                 'electro house', 'progressive electro house', 'complextro',
                 'house', 'progressive house', 'big room', 'deep house']

In [None]:
for sg in edm_subgenres:
    edm = pd.concat([edm, get_tracks('edm', sg)], ignore_index=True)
    
edm.head()

#### Indie<a id='Indie'></a>

In [None]:
indie = pd.DataFrame()
indie_subgenres = ['indie pop', 'indie poptimism', 'lo-fi', 'stomp and holler',
                   'indie folk', 'shimmer pop', 'indietronica', 'chillwave',
                   'indie rock', 'modern rock', 'modern alternative rock', 'dance-punk']

In [None]:
for sg in indie_subgenres:
    sg_songs = get_tracks('indie', sg)
    indie = pd.concat([indie, sg_songs], ignore_index=True)
    
indie.head()

#### Rock<a id='Rock'></a>

In [None]:
rock = pd.DataFrame()
rock_subgenres = ['rock', 'classic rock', 'mellow gold', 'permanent wave',
                  'album rock', 'soft rock', 'hard rock', 'art rock', 'pop rock',
                  'heartland rock', 'alternative rock', 'psychedelic rock']

In [None]:
for sg in rock_subgenres:
    rock = pd.concat([rock, get_tracks('rock', sg)], ignore_index=True)
    
rock.head()

#### Metal<a id='Metal'></a>

In [None]:
metal = pd.DataFrame()
metal_subgenres = ['metal', 'alternative metal', 'nu metal',
                   'speed metal', 'death metal', 'glam metal',
                   'black metal', 'power metal', 'neo classical metal',
                   'thrash metal', 'old school thrash', 'crossover thrash']

In [None]:
for sg in metal_subgenres:
    metal = pd.concat([metal, get_tracks('metal', sg)], ignore_index=True)
    
metal.head()

#### Country<a id='Country'></a>

In [None]:
country = pd.DataFrame()
country_subgenres = ['country', 'contemporary country', 'country pop', 'classic country pop',
                     'country road', 'country rock', 'modern country rock', 'country dawn',
                     'outlaw country', 'redneck', 'country rap', 'nashville sound']

In [None]:
for sg in country_subgenres:
    country = pd.concat([country, get_tracks('country', sg)], ignore_index=True)
    
country.head()

#### Jazz<a id='Jazz'></a>

In [None]:
jazz = pd.DataFrame()
jazz_subgenres = ['jazz', 'cool jazz', 'soul jazz', 'bebop', 'hard bop',
                  'contemporary post-bop', 'contemporary jazz', 'big band',
                  'swing', 'jazz fusion', 'free jazz', 'avant-garde jazz']

In [None]:
for sg in jazz_subgenres:
    jazz = pd.concat([jazz, get_tracks('jazz', sg)], ignore_index=True)
    
jazz.head()

#### Classical<a id='Classical'></a>

In [None]:
classical = pd.DataFrame()
classical_subgenres = ['classical', 'classical era', 'early music', 'renaissance', 'baroque',
                       'late romantic era', 'post-romantic era', 'early modern classical',
                       'avant-garde', 'neoclassicism', 'contemporary classical', 'impressionism']

In [None]:
for sg in classical_subgenres:
    classical = pd.concat([classical, get_tracks('classical', sg)], ignore_index=True)
    
classical.head()

#### Concatenation<a id='Concatenation'></a>

Combining the songs from all of the major genres into a single dataframe and converting the numeric columns.

In [None]:
# ignore_index gives the combined frame a unique 0..n-1 index
df = pd.concat([pop, rnb, hiphop, latin, reggae, edm, indie, rock, metal, country, jazz, classical],
               ignore_index=True)
df.release_date = pd.to_datetime([rd.split('-')[0] for rd in df.release_date])
for col in list(df.columns[7:]):
    df[col] = pd.to_numeric(df[col])
df.info()
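
The release-date line above keeps only the year of each date. As far as I know, Spotify reports album release dates with year, month, or day precision, so the raw strings can look like `'1969'`, `'1969-09'`, or `'1969-09-26'`. The sketch below, on made-up sample values, shows one way to reduce all three forms to a year. It illustrates the idea and is not the notebook's exact code.

```python
import pandas as pd

# Made-up sample values covering the three precisions
release_dates = pd.Series(['1969', '1969-09', '1969-09-26'])

# Keep only the leading four-digit year so that all precisions are comparable
years = release_dates.str.slice(0, 4).astype(int)

# Parsing the bare year yields January 1st of that year
as_dates = pd.to_datetime(years.astype(str), format='%Y')

print(years.tolist())
print(as_dates.dt.strftime('%Y-%m-%d').tolist())
```

Because the month and day are discarded, the resulting `release_date` values all fall on January 1st, so the column is only meaningful at year resolution.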

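Rescaling speechiness to the [0, 1] range with min-max scaling and adding release year and duration features. Note that after rescaling, the speechiness values no longer line up with the 0.33 and 0.66 thresholds quoted in the data definitions.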

In [None]:
df['speechiness'] = (df.speechiness - df.speechiness.min()) / (df.speechiness.max() - df.speechiness.min())
df['release_year'] = df.release_date.apply(lambda x: x.year)
df['duration_min'] = df.duration_ms/(1000*60)
df['duration_minsec'] = df.duration_min.apply(lambda x: str(math.floor(x))+' m, '+str(math.floor((x-math.floor(x))*60))+' s')
df.head()
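
To make the two transformations concrete, here is a small sketch on a toy dataframe with made-up values. The helper `format_duration` is hypothetical. It uses integer arithmetic via `divmod` instead of the float-based expression in the cell above, which can in principle round a second down because of floating-point error.

```python
import pandas as pd

# Toy values, not real track data
toy = pd.DataFrame({'speechiness': [0.03, 0.10, 0.45],
                    'duration_ms': [215000, 61000, 359999]})

# Min-max scaling: (x - min) / (max - min), so the column spans exactly [0, 1]
s = toy['speechiness']
toy['speechiness_scaled'] = (s - s.min()) / (s.max() - s.min())


def format_duration(duration_ms):
    """Format milliseconds as 'M m, S s' using integer arithmetic."""
    minutes, seconds = divmod(int(duration_ms) // 1000, 60)
    return f"{minutes} m, {seconds} s"


toy['duration_minsec'] = toy['duration_ms'].apply(format_duration)
print(toy)
```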

Saving the dataframe to a CSV file

In [None]:
df.to_csv('data/tracks.csv', index=False)

#### Collect Lyrics<a id='Lyrics'></a>

Collecting the lyrics of each song in our dataframe from [genius.com](https://genius.com).

In [None]:
df = pd.read_csv('data/tracks.csv')
df['lyrics_raw'] = None
df.head()

In [None]:
df = get_track_lyrics(df)
df.info()
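
After the lyrics pass, each row's `lyrics_raw` is either the string `'Instrumental'`, missing, or actual lyrics. A short summary of these three outcomes can show how much of the dataset has usable text. This is a minimal sketch on a toy dataframe; `lyrics_status` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy column with one instrumental track, one failed lookup, and two hits
toy = pd.DataFrame({'lyrics_raw': ['Instrumental', None, 'La la la', 'Some words']})


def lyrics_status(value):
    """Classify one lyrics_raw entry as 'missing', 'instrumental', or 'found'."""
    if pd.isna(value):  # covers both None and the NaN produced by a CSV round trip
        return 'missing'
    if value == 'Instrumental':
        return 'instrumental'
    return 'found'


status_counts = toy['lyrics_raw'].apply(lyrics_status).value_counts()
print(status_counts)
```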

Saving the dataframe, now including lyrics, to a CSV file

In [None]:
df.to_csv('data/tracks.csv', index=False)

### Data Definitions<a id='DataDefinitions'></a>

Definitions of the columns in the audio features dataframe and the lyrics dataframe.

In [None]:
af_definitions = pd.DataFrame(['Unique identifier', 'Track name', 'Artist name', 'Album name',
                              'Date released', 'Major genre', 'Subgenre of the major genre',
                              'The duration of the track in milliseconds',
                               'The popularity of the track on Spotify',
                               'How suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.',
                              'A perceptual measure of intensity and activity on a scale of 0.0 to 1.0 . Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.',
                              'The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.',
                              'Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.',
                              'A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.',
                              'Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.',
                              'Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.',
                              'A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). ',
                              'The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.',
                              'Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.',
                              'Year released', 'The duration of the track in minutes', 'The duration of the track in minutes and seconds'],
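                                  # `af` is not defined in this notebook; the 22 definitions match
                                  # the audio feature columns plus the three derived columns
                                  index=cols + ['release_year', 'duration_min', 'duration_minsec'],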
                                  columns=['definition'])

lyrics_definitions = pd.DataFrame(['Unique identifier', 'Track name', 'Artist name', 'Song Lyrics'],
                                  index=['track_id', 'track', 'artist', 'lyrics'],
                                  columns=['definition'])
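
Because the definitions above are matched to columns purely by position, a reordered column silently pairs with the wrong text. One alternative is to key each definition by column name and build the table from that mapping. The sketch below uses a three-column stand-in for the real column list, and the names `definitions`, `columns`, and `toy_definitions` are illustrative only.

```python
import pandas as pd

# Definitions keyed by column name, so ordering can no longer mismatch
definitions = {
    'track_id': 'Unique identifier',
    'genre': 'Major genre',
    'subgenre': 'Subgenre of the major genre',
}

# Stand-in for the real column list
columns = ['track_id', 'genre', 'subgenre']

# Fail loudly if any column lacks a definition
missing = [c for c in columns if c not in definitions]
assert not missing, f"No definition for: {missing}"

toy_definitions = pd.DataFrame({'definition': [definitions[c] for c in columns]},
                               index=columns)
print(toy_definitions)
```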