## **Back** to the Future: Evolution of Music Moods from 1992 to Present - Data Collection

The purpose of this notebook is data collection. For this project we will aggregate data from annual Billboard Year-End Hot 100 lists via Wikipedia, and various track-related data points from the Spotify API. The output of this notebook will be a .csv file from each data source that we can use in additional notebooks for further data cleaning, analysis, and visualization.

Steps in this notebook:
<br>
- [Initial Imports & Installation](#first-bullet)
<br>
- [Billboard Data Collection - Top 100 Charts since 1992](#second-bullet)
<br>
- [Spotify Data Collection - Track attributes and mood data](#third-bullet)

### **Initial Imports & Installation**<a class="anchor" id="first-bullet"></a>

In [None]:
import warnings
import pandas as pd 
import numpy as np
warnings.filterwarnings("ignore", category=DeprecationWarning) 

In [None]:
#Install Wikipedia API
!pip install wikipedia

#Install Spotify API
!pip install spotipy

# Install BeautifulSoup: a library to parse HTML documents
!pip install beautifulsoup4

# install Requests: a library to handle api requests
!pip install requests



In [None]:
#imports for Wikipedia / Billboard data
import requests 
import wikipedia
from bs4 import BeautifulSoup 

#imports for Spotify API
import os
import spotipy
import spotipy.util as util
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.oauth2 as oauth2

### **Billboard Data Collection - Top 100 Charts since 1992**<a class="anchor" id="second-bullet"></a>
*Scraping Wikipedia Tables for Billboard Hot 100 Year-End Data*

Wikipedia uses a standard format for all Billboard Hot 100 Year-End charts. This made it easy for us to streamline collecting the data for each year we wanted to analyze. The result we're looking for is a single data file of all Year-End Hot 100 songs from 1992 to 2022 (31 annual charts).

In [None]:

#function to parse table data from the Billboard Year-End Hot 100 wiki page for a given year
from io import StringIO

def get_YearEnd_Hot100(year):
    #Finding url of page that contains chart data
    wikipage = wikipedia.page('Billboard Year-End Hot 100 singles of {}'.format(year))
    response = requests.get(wikipage.url)
    response.raise_for_status()  #fail early on a bad HTTP response

    #Locating the first "wikitable" on the page, which holds the chart
    soup = BeautifulSoup(response.text, 'html.parser')
    billboardtable = soup.find('table', {'class': "wikitable"})

    #read_html returns a list of dataframes, so take the first one;
    #wrapping the markup in StringIO avoids the warning newer pandas
    #versions raise when given a literal HTML string
    df = pd.read_html(StringIO(str(billboardtable)))[0]
    df["Billboard Year"] = year

    return df

In [None]:
#setup a loop to grab data for each year 1992-2022
last_30 = [2022 - i for i in range(31)]
songs_dfs = []

for year in last_30:
    songs_df=get_YearEnd_Hot100(year)
    songs_dfs.append(songs_df)

billboard_df=pd.concat(songs_dfs)

billboard_df.columns

Index(['No.', 'Title', 'Artist(s)', 'Billboard Year', '№'], dtype='object')

In [None]:
#Resulting dataframe has two "No." ranking columns
#need to replace NaN values with corresponding value in the other column. 
billboard_df.reset_index()
billboard_df["No."] = billboard_df["No."].fillna(billboard_df['№'])
del billboard_df['№']

billboard_df

Unnamed: 0,No.,Title,Artist(s),Billboard Year
0,1.0,"""Heat Waves""",Glass Animals,2022
1,2.0,"""As It Was""",Harry Styles,2022
2,3.0,"""Stay""",The Kid Laroi and Justin Bieber,2022
3,4.0,"""Easy on Me""",Adele,2022
4,5.0,"""Shivers""",Ed Sheeran,2022
...,...,...,...,...
95,96.0,"""I Will Remember You""",Amy Grant,1992
96,97.0,"""We Got a Love Thang""",CeCe Peniston,1992
97,98.0,"""Let's Get Rocked""",Def Leppard,1992
98,99.0,"""They Want EFX""",Das EFX,1992
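
When yearly tables label the same column differently, `pd.concat` keeps both labels and fills the gaps with NaN, which is why both `No.` and `№` appear above. A minimal sketch with made-up rows (the toy dataframes `chart_a` and `chart_b` are illustrative only, not real chart data) shows one way the two ranking columns can be coalesced:

```python
import pandas as pd

# Two toy "year" tables: one labels the rank column "No.", the other "№"
chart_a = pd.DataFrame({"No.": [1, 2], "Title": ["Song A", "Song B"]})
chart_b = pd.DataFrame({"№": [1, 2], "Title": ["Song C", "Song D"]})

# Stacking them produces both rank columns, each half empty
combined = pd.concat([chart_a, chart_b], ignore_index=True)

# Fill the gaps in "No." from "№", then drop the now-redundant column
combined["No."] = combined["No."].fillna(combined["№"])
combined = combined.drop(columns="№")

print(combined)
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on a column selection, keeps the behaviour the same across pandas versions.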


In [None]:
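#Splitting Artist(s) column into a main artist and any additional credited artists
def split_artist(artists):
    #separators are tried in priority order; " and " is matched with surrounding
    #spaces so that names that merely contain "and " (e.g. "Band ") are not split
    for separator in ("featuring", ",", "(", " and "):
        if separator in artists:
            first, rest = artists.split(separator, 1)
            #strip the whitespace left behind around the separator
            return [first.strip(), rest.strip()]
    return [artists, None]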
    


billboard_df[["Artist_1","Artist_2"]]= [split_artist(x) for x in billboard_df["Artist(s)"]]
billboard_df

Unnamed: 0,No.,Title,Artist(s),Billboard Year,Artist_1,Artist_2
0,1.0,"""Heat Waves""",Glass Animals,2022,Glass Animals,
1,2.0,"""As It Was""",Harry Styles,2022,Harry Styles,
2,3.0,"""Stay""",The Kid Laroi and Justin Bieber,2022,The Kid Laroi,Justin Bieber
3,4.0,"""Easy on Me""",Adele,2022,Adele,
4,5.0,"""Shivers""",Ed Sheeran,2022,Ed Sheeran,
...,...,...,...,...,...,...
95,96.0,"""I Will Remember You""",Amy Grant,1992,Amy Grant,
96,97.0,"""We Got a Love Thang""",CeCe Peniston,1992,CeCe Peniston,
97,98.0,"""Let's Get Rocked""",Def Leppard,1992,Def Leppard,
98,99.0,"""They Want EFX""",Das EFX,1992,Das EFX,
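
Splitting an artist credit on a connecting word is sensitive to whether the word is matched as a bare substring or as a whole word. The sketch below is illustrative only: `split_on_word` is a hypothetical helper, not part of this notebook, and the artist strings are just example inputs.

```python
import re

def split_on_word(artists, word="and"):
    """Split once on `word` only when it stands alone between spaces."""
    parts = re.split(rf"\s+{re.escape(word)}\s+", artists, maxsplit=1)
    if len(parts) == 2:
        return [parts[0].strip(), parts[1].strip()]
    return [artists, None]

# A plain substring split cuts through a name that contains "and "
naive = "The Band Perry".split("and ", 1)

# The whole-word split leaves that name intact but still separates a duo
kept = split_on_word("The Band Perry")
duo = split_on_word("The Kid Laroi and Justin Bieber")

print(naive, kept, duo)
```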


In [None]:
#convert to csv to combine with mood info and retain year
billboard_df.to_csv("/data/workspace_files/billboard_songs.csv", index=False)

### **Spotify Data Collection - Track Attributes and Mood Data**<a class="anchor" id="third-bullet"></a>
*Accessing Spotify API*

Spotipy is the Python client we used for accessing the Spotify API. Below we have set up the required authentication; the credentials are stored separately, in environment variables, for security. We've also defined our target markets for the Spotify data.

In [None]:
### Accessing Spotify API
market = [ "AD", "AR", "AT", "AU", "BE", "BG", "BO", "BR", "CA", "CH", "CL", "CO", "CR", "CY", 
      "CZ", "DE", "DK", "DO", "EC", "EE", "ES", "FI", "FR", "GB", "GR", "GT", "HK", "HN", "HU", 
      "ID", "IE", "IS", "IT", "JP", "LI", "LT", "LU", "LV", "MC", "MT", "MX", "MY", "NI", "NL", 
      "NO", "NZ", "PA", "PE", "PH", "PL", "PT", "PY", "SE", "SG", "SK", "SV", "TH", "TR", "TW", 
      "US", "UY", "VN" ]

credentials = SpotifyClientCredentials(
        client_id=os.environ["CLIENT_ID"],
        client_secret=os.environ["CLIENT_SECRET"])

#passing the credentials manager lets spotipy request and refresh access tokens itself
spotify = spotipy.Spotify(client_credentials_manager=credentials)
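
Because the credentials live in environment variables, a missing variable surfaces as a bare `KeyError`. The following is a minimal sketch of a friendlier check; `load_spotify_credentials` is a hypothetical helper, and it assumes the same `CLIENT_ID` and `CLIENT_SECRET` variable names used above.

```python
import os

def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret), or raise with a readable message."""
    required = ("CLIENT_ID", "CLIENT_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return env["CLIENT_ID"], env["CLIENT_SECRET"]

# Example with a plain dict standing in for os.environ (placeholder values)
client_id, client_secret = load_spotify_credentials(
    {"CLIENT_ID": "my-id", "CLIENT_SECRET": "my-secret"}
)
```

Accepting the environment as a parameter keeps the check easy to try out without touching real secrets.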

In [None]:
#define a function to query spotify api for song ids given a dataframe of songs
#returns a list of dicts

def get_song_id(df):
    song_ids = []
    df["titleArtist"]=df["Title"]+" "+df["Artist_1"]
    songs = df["titleArtist"].tolist()
    for song in songs:
        track_search = spotify.search(song, type="track", market=market, limit=1)
        items = track_search["tracks"]["items"]
        if not items:
            #no match on Spotify for this title/artist query, so skip it
            continue
        track = items[0]
        song_ids.append({"Title": track["name"],
                         "Main Artist": track["artists"][0]["name"],
                         "ID": track["id"]})

    return song_ids

song_id_dict = get_song_id(billboard_df)
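
The titles scraped from Wikipedia carry their surrounding quotation marks (for example `"Heat Waves"`), and those marks end up in the search string. As a hedged sketch, a small hypothetical helper such as `build_query` could tidy the text before it is sent; whether this changes the matches Spotify returns is not something we tested here.

```python
def build_query(title, artist):
    """Build a plain search string from a chart title and its main artist."""
    # Remove surrounding whitespace and straight or curly double quotes
    clean_title = title.strip().strip('"“”')
    return f"{clean_title} {artist.strip()}"

query = build_query('"Heat Waves"', "Glass Animals ")
print(query)  # Heat Waves Glass Animals
```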

In [None]:
#define a function that returns mood data for a given list of song dicts
#returns a list of dicts

song_mood_list = []
song_segments = []

def get_mood_detail(song_id_dicts):
    for s in song_id_dicts:
        audio_features = spotify.audio_features(s["ID"])

       
        if audio_features[0] is not None:
            energy = audio_features[0]["energy"]
            loudness = audio_features[0]["loudness"]
            valence = audio_features[0]["valence"]
            tempo = audio_features[0]["tempo"]
            danceability = audio_features[0]["danceability"]
            key = audio_features[0]["key"]
        
            song_mood_list.append({"Title": s["Title"],
                                   "Main Artist":s["Main Artist"],
                                   "Energy": energy,
                                   "Loudness": loudness,
                                   "Valence": valence,
                                   "Tempo": tempo,
                                   "Danceability": danceability,
                                   "Pitch/Key": key
                                   }
                                  )
        else:
            song_mood_list.append({"Title": s["Title"],
                                   "Main Artist":s["Main Artist"],
                                   "Energy": None,
                                   "Loudness": None,
                                   "Valence": None,
                                   "Tempo": None,
                                   "Danceability": None,
                                   "Pitch/Key": None
                                   }
                                  )

    
    moods_df = pd.DataFrame.from_records(song_mood_list)
    
    return moods_df

mood_df = get_mood_detail(song_id_dict)
mood_df.head()

Unnamed: 0,Title,Main Artist,Energy,Loudness,Valence,Tempo,Danceability,Pitch/Key
0,Heat Waves,Glass Animals,0.525,-6.9,0.531,80.87,0.761,11.0
1,As It Was,Harry Styles,0.731,-5.338,0.662,173.93,0.52,6.0
2,STAY (with Justin Bieber),The Kid LAROI,0.764,-5.484,0.478,169.928,0.591,1.0
3,Easy On Me,Adele,0.366,-7.519,0.13,141.981,0.604,5.0
4,Shivers,Ed Sheeran,0.859,-2.724,0.822,141.02,0.788,2.0
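
The function above makes one request per song. Spotify's audio-features endpoint accepts up to 100 track IDs per request, so the roughly 3,100 lookups could be grouped into a few dozen calls. The sketch below only shows the batching step; `chunked` is a hypothetical helper and the IDs are placeholders.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder IDs for illustration only
track_ids = [f"id_{n}" for n in range(250)]
batch_sizes = [len(batch) for batch in chunked(track_ids)]
print(batch_sizes)  # [100, 100, 50]

# With a live client, each batch could then be sent in a single call, e.g.:
# features = spotify.audio_features(batch)
```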


### **Evaluating Mood Metrics to Determine Overall Music Mood**

Metric definitions and values returned as defined by Spotify:
- Valence is a measure of musical positiveness, returned as a value from 0.0 to 1.0.
- Energy measures the intensity and activity of a song, taking into account dynamic range, perceived loudness, timbre, onset rate, and entropy. It is returned as a value from 0.0 to 1.0.
- Pitch / Key is the key the track is in. The key is returned as a value from -1 (unknown) to 11 that maps to standard Pitch Class notation.
- Loudness is the overall loudness of a song in decibels (dB), averaged over the whole track. Values typically range from -60 to 0 dB.
- Tempo is the speed of the song in beats per minute (BPM), the same unit used for a human heart rate. Frantic, excited, or energetic songs tend to have a high tempo. Calm, depressing, or sad songs tend to have a low tempo.
- Danceability is how suitable a song is for dancing. It considers tempo, rhythm, beat, and regularity and is returned as a value from 0.0 to 1.0.
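
The integer key values follow standard Pitch Class notation (0 = C, 1 = C♯/D♭, and so on up to 11 = B). A small sketch of that mapping, where `key_name` is a hypothetical helper that is not used elsewhere in this notebook:

```python
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]

def key_name(key):
    """Translate a Spotify key value into a note name; None if unknown."""
    # Missing values arrive as None or NaN (NaN is the only value != itself)
    if key is None or key != key:
        return None
    key = int(key)
    return PITCH_CLASSES[key] if 0 <= key <= 11 else None

print(key_name(11.0))  # B
print(key_name(-1))    # None
```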

<br>
In our evaluation we considered multiple models of mood classification based on these metrics.
<br>
<br>
**For Hindi Music - Referenced Bhar(2014)**

| Mood of Hindi Music        | Valence | Pitch/Key | Energy | Loudness | Danceability |
|-------------|---------|-------|--------|-----------|-------|
| Happy       | VHigh   | High | High   | Med       |  Med |
| Exuberant   | VHigh   | High  | VHigh   | High      |  High |
| Energetic   | High    |  Med  | VHigh   | VHigh     | VHigh |
| Frantic     | Low     |  Low  | High    | High      | High |
| Anxious/Sad | VLow    |  Low | Med    | Med       | Med  |
| Depression  | VLow    |  Low  | VLow    | Low       |  Low  |
| Calm        | Med     |  Med  | VLow    | VLow      |  VLow |
| Contentment | Med     | High | Low    | Low       |  Low |

<br>
<br>
**For Western Music - Derived from Bhar(2014)**

| Mood        | Mean Intensity | Mean Timbre | Mean Pitch (Hz) | Mean Rhythm (BPM) |
|-------------|----------------|-------------|-----------------|-------------------|
| Happy       | 0.2055         | 0.4418      | 967.47          | 209.01            |
| Exuberant   | 0.317          | 0.4265      | 611.94          | 177.7             |
| Energetic   | 0.4564         | 0.319       | 381.65          | 163.14            |
| Frantic     | 0.2827         | 0.6376      | 239.78          | 189.03            |
| Sad         | 0.2245         | 0.1572      | 95.654          | 137.23            |
| Depression  | 0.1177         | 0.228       | 212.65          | 122.65            |
| Calm        | 0.0658         | 0.1049      | 383.49          | 72.23             |
| Contentment | 0.1482         | 0.2114      | 756.65          | 101.73            |


<br>
<br>
Intensity and timbre in the Western music table above are given as normalized values, so they should simply be viewed relative to one another.

Pitch is given as a frequency in Hz, number of cycles per second, and rhythm is given as a number of beats per minute.

The models above were difficult to translate into Spotify metrics, and a large number of emotional tags makes mood classification complex. For this reason, most existing research on music mood classification divides moods according to a two-dimensional emotion model with two axes: valence (negative/positive) and arousal (low/high).

The updated model from 2017 proved to be more accurate at classifying music moods. For example, happy is positive valence with medium arousal; calm is low arousal with neutral valence.



**The 2-D emotion moods, derived from Munoz-De-Escalona (2017)**

| Mood        | Valence | Energy/Arousal |
|-------------|---------|-------|
| Alert      | Med   | High | 
| Excited   | High   | High  |
| Happy   | High    |  Med  |
| Relaxed    | Med    |  Med | 
| Calm | Med    |  Low | 
| Sad  | Low    |  Low  |
| Depression | Low    |  Med  |
| Afraid | Low     |  High  | 


<br>
This final model is what we used for evaluating our data. Our final step is to add an additional column with this mood calculation for each track.

In [None]:
#using bounds, determine levels for each metric

arousal_bins = [0.0, 1/3, 2/3, 1.0]    #Arousal/Energy
valence_bins=[0.0, 1/3, 2/3 , 1.0]              #Valence

name = ["Low", "Med", "High" ]

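#include_lowest=True keeps a value of exactly 0.0 in the "Low" bin instead of NaN
mood_df["EL"] = pd.cut(mood_df["Energy"], arousal_bins, labels=name, include_lowest=True)
mood_df["VL"] = pd.cut(mood_df["Valence"], valence_bins, labels=name, include_lowest=True)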
mood_df

Unnamed: 0,Title,Main Artist,Energy,Loudness,Valence,Tempo,Danceability,Pitch/Key,EL,VL
0,Heat Waves,Glass Animals,0.525,-6.900,0.531,80.870,0.761,11.0,Med,Med
1,As It Was,Harry Styles,0.731,-5.338,0.662,173.930,0.520,6.0,High,Med
2,STAY (with Justin Bieber),The Kid LAROI,0.764,-5.484,0.478,169.928,0.591,1.0,High,Med
3,Easy On Me,Adele,0.366,-7.519,0.130,141.981,0.604,5.0,Med,Low
4,Shivers,Ed Sheeran,0.859,-2.724,0.822,141.020,0.788,2.0,High,High
...,...,...,...,...,...,...,...,...,...,...
3095,I Will Remember You,Amy Grant,0.658,-6.432,0.336,177.625,0.513,0.0,Med,Med
3096,We Got A Love Thang,CeCe Peniston,0.644,-10.470,0.764,120.012,0.669,0.0,Med,High
3097,Let's Get Rocked,Def Leppard,0.888,-6.763,0.526,91.989,0.564,5.0,High,Med
3098,They Want EFX,Das EFX,0.459,-12.840,0.597,98.454,0.755,5.0,Med,Med
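
By default `pd.cut` builds right-closed intervals, so a value sitting exactly on an inner edge goes to the lower bin, and a value equal to the lowest edge falls outside every bin unless `include_lowest=True` is passed. A short sketch with made-up values (not taken from the dataset) illustrates both cases:

```python
import pandas as pd

bins = [0.0, 1/3, 2/3, 1.0]
labels = ["Low", "Med", "High"]

# Made-up values chosen to sit on and between the bin edges
values = pd.Series([0.0, 0.2, 1/3, 0.5, 1.0])

default_cut = pd.cut(values, bins, labels=labels)
inclusive_cut = pd.cut(values, bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"value": values,
                    "default": default_cut,
                    "include_lowest": inclusive_cut}))
```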


In [None]:
#now assign each mood based on mood matrix

#lookup from (energy level, valence level) to mood, following the 2-D emotion table
MOOD_MATRIX = {
    ("High", "Med"): "alert",
    ("High", "Low"): "afraid",
    ("Low", "Low"): "sad",
    ("Med", "Low"): "depressed",
    ("Low", "Med"): "calm",
    ("Med", "Med"): "relaxed",
    ("Med", "High"): "happy",
    ("High", "High"): "excited",
}

def mood_category(row):
    #combinations missing from the matrix (or NaN levels) fall back to "None"
    return MOOD_MATRIX.get((row["EL"], row["VL"]), "None")
    

mood_df["Mood"] = mood_df.apply(mood_category, axis=1)
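
The 2-D model names eight moods, while three energy levels and three valence levels give nine combinations. A quick sketch that enumerates the combinations (the dictionary simply restates the pairs handled by `mood_category`) shows which one falls through to `"None"`:

```python
from itertools import product

levels = ["Low", "Med", "High"]

# (energy level, valence level) -> mood, restating the pairs in mood_category
mood_by_levels = {
    ("High", "Med"): "alert",
    ("High", "Low"): "afraid",
    ("Low", "Low"): "sad",
    ("Med", "Low"): "depressed",
    ("Low", "Med"): "calm",
    ("Med", "Med"): "relaxed",
    ("Med", "High"): "happy",
    ("High", "High"): "excited",
}

# Every (energy, valence) pair that has no mood assigned
unassigned = [combo for combo in product(levels, levels)
              if combo not in mood_by_levels]
print(unassigned)  # [('Low', 'High')]
```

Tracks with low energy and high valence therefore receive the string `"None"` as their mood, which is worth keeping in mind when counting moods in later notebooks.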

In [None]:
#convert the dataframe to a csv for export and further analysis
mood_df.to_csv("/data/workspace_files/92_22_data.csv", index=False)