# Music Valence Prediction
Spotify data | Gradient boosting

***

## Project description

### The objective

The goal of this task is to develop a Python-based module that predicts the valence of newly released pop songs. Two kinds of input are considered:
- the audio data (e.g. .wav files)
- the song lyrics

Publicly available datasets can be used for training and testing.

### Audio features

For model training, the following audio features will be used:
- Mel Frequency Cepstral Coefficients
- Mel Spectrogram
- Spectral Contrast
- Root Mean Square
- Chroma Vector
- Tonal Centroid Features
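
To make one of these features concrete, here is a minimal pure-Python sketch of frame-wise Root Mean Square energy. The function name `frame_rms` and the default `frame_length` and `hop_length` values are illustrative assumptions, not settings taken from this project; in practice an audio library would compute this and the other features.

```python
import math


def frame_rms(samples, frame_length=2048, hop_length=512):
    """Return the RMS energy of each full frame of `samples`."""
    values = []
    # Slide a window of `frame_length` samples forward by `hop_length` each step.
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frame = samples[start:start + frame_length]
        values.append(math.sqrt(sum(x * x for x in frame) / frame_length))
    return values


# A constant signal has the same RMS in every frame.
rms = frame_rms([0.5] * 10, frame_length=4, hop_length=2)
```

Signals shorter than one frame produce an empty list, since only full frames are considered in this sketch.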

### The dataset

This project uses data from Spotify.

The music attribute descriptions are taken from this article on Medium:
[Spotify Music Data Analysis: Part 3](https://medium.com/analytics-vidhya/spotify-music-data-analysis-part-3-9097829df16e)

<a id="music-attributes"></a>
### Music Attributes

**Tempo:** The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

**Energy:** Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. The higher the value, the more energetic the song.

**Danceability:** Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. The value ranges from 0.0 to 1.0. The higher the value, the more suitable the song is for dancing.

**Loudness:** The overall loudness of a track in decibels (dB), averaged across the entire track. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range from -60 to 0 dB. The higher the value, the louder the song.

**Valence:** A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
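
The ranges described above can serve as a quick sanity check on scraped values. The sketch below is only an illustration: `ATTRIBUTE_RANGES` and `out_of_range` are hypothetical names, the loudness bounds are typical rather than strict, and tempo is left out because no range is stated for it.

```python
# Ranges as described in the attribute list above; tempo has no stated range.
ATTRIBUTE_RANGES = {
    "energy": (0.0, 1.0),
    "danceability": (0.0, 1.0),
    "valence": (0.0, 1.0),
    "loudness": (-60.0, 0.0),  # typical bounds in dB, not a hard limit
}


def out_of_range(track):
    """Return the names of attributes whose values fall outside their range."""
    problems = []
    for name, (low, high) in ATTRIBUTE_RANGES.items():
        value = track.get(name)
        # Missing values are skipped here; they are handled separately.
        if value is not None and not (low <= value <= high):
            problems.append(name)
    return problems


# Hypothetical track values, for illustration only.
ok_track = {"energy": 0.8, "danceability": 0.7, "valence": 0.55, "loudness": -5.2}
bad_track = {"energy": 1.3, "danceability": 0.7, "valence": 0.55, "loudness": -5.2}
```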

*** 

## Imports and settings

In [41]:
import requests
import re
import os
from os import listdir

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as BS
from tqdm import tqdm

import matplotlib.pyplot as plt
import seaborn as sns

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials 

In [2]:
plt.style.use("seaborn-v0_8")
sns.set_style("dark", {"axes.facecolor": "0.95"})
sns.set_palette("mako")

plt.rcParams["figure.figsize"] = (9, 4)
%config InlineBackend.figure_format = "retina"

***

## Data scraping

### Top artists and songs

Create a list of the 1000 most popular artists using data from the [Spotify Chart History](https://kworb.net/spotify/artists.html) web page. 

In [3]:
chart_hist_URL = "https://kworb.net/spotify/artists.html"

req = requests.get(chart_hist_URL)
req.status_code

200

In [4]:
# re-encode the page text to bytes so that artist names are displayed properly
content = req.text.encode("latin-1")

soup = BS(content, 'lxml')
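
The re-encoding step is easier to follow with a small round trip. This sketch assumes the page is served as UTF-8 but was decoded as Latin-1, which is what the `encode("latin-1")` call suggests; the artist name is just an example containing a non-ASCII character.

```python
name = "Beyoncé"  # example name with a non-ASCII character

raw_bytes = name.encode("utf-8")             # bytes a UTF-8 page would send
garbled = raw_bytes.decode("latin-1")        # result of a wrong Latin-1 decode
recovered_bytes = garbled.encode("latin-1")  # Latin-1 maps each char back to one byte
recovered = recovered_bytes.decode("utf-8")  # decoding correctly restores the name
```

Because Latin-1 maps every byte to exactly one character, encoding the garbled text back to Latin-1 recovers the original bytes, which the parser can then decode correctly. With `requests`, reading `req.content` or setting `req.encoding = "utf-8"` before accessing `req.text` are alternative ways to get the same result.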

In [5]:
# find the top artists table
tabular_data = soup.find_all('table', attrs={"id":"spotifyartistindex"})
len(tabular_data) # check how many found

1

In [6]:
# keep only the first 1000 rows (without header)
table_rows = tabular_data[0].find_all("tr")[1:1001]
len(table_rows)

1000

In [7]:
# get the artist names and number of streams
artists = []
streams = []

for row in table_rows:
    
    artist = row.find("a")
    artist = re.search("(?<=>).+(?=</a>)", str(artist)).group()
    artists.append(artist)
    
    num_streams = row.find_all("td")[-1]
    num_streams = re.search("(?<=<td>).+(?=</td>)", str(num_streams)).group()
    num_streams = int("".join(re.split(",", num_streams)))
    streams.append(num_streams)
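
As a worked example of the extraction above, the sketch below parses a single row using only the standard library. The sample markup in `row_html` is hand-written and the real page layout may differ; `parse_row` is a hypothetical helper name.

```python
import re

# Hand-written sample row; the real page markup may differ.
row_html = (
    '<tr><td><a href="artist/example.html">Bad Bunny</a></td>'
    "<td>33,481,443,936</td></tr>"
)


def parse_row(html):
    """Extract (artist name, stream count) from one table row."""
    # The artist name is the text of the first link in the row.
    name = re.search(r"<a[^>]*>(.+?)</a>", html).group(1)
    # Plain-text cells only; the last one holds the stream count.
    cells = re.findall(r"<td>([^<]+)</td>", html)
    streams = int(cells[-1].replace(",", ""))
    return name, streams


artist, num_streams = parse_row(row_html)
```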

In [8]:
# create a dataframe
top_artists = pd.DataFrame(dict(artist=artists, num_of_srtreams=streams))
top_artists.head()

| | artist | num_of_srtreams |
|---|---|---|
| 0 | Bad Bunny | 33481443936 |
| 1 | Drake | 27276371805 |
| 2 | Ed Sheeran | 20113241185 |
| 3 | Justin Bieber | 20109642835 |
| 4 | J Balvin | 19123851607 |


***

### Spotify: obtain available tracks

#### Login items

In [9]:
# read Spotify login items
login_items = pd.read_csv("spotify_data/spotify_login_items.csv", index_col=0)
login_items = login_items.squeeze() # convert to a series

In [10]:
# authentication
credentials = SpotifyClientCredentials(client_id=login_items.client_id,
                                       client_secret=login_items.client_secret)

sp = spotipy.Spotify(client_credentials_manager=credentials)

#### Get artist URIs

In [11]:
for i in tqdm(range(len(top_artists))):

    artist_name = top_artists.loc[i, "artist"]
    try:
        result = sp.search(q="artist:" + artist_name, type="artist", limit=1)
        artist_URI = result["artists"]["items"][0]["uri"]
        top_artists.loc[i, "artist_uri"] = artist_URI
    except Exception:
        # no match or a failed request: mark the artist as not found
        top_artists.loc[i, "artist_uri"] = None

100%|███████████████████████████████████████| 1000/1000 [02:31<00:00,  6.62it/s]
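
The nested lookup in the loop fails when a search returns no items. A small helper can make that case explicit instead of relying on an exception. This is a sketch: `first_artist_uri` is a hypothetical name, and the stub dictionaries only mimic the `["artists"]["items"][0]["uri"]` structure indexed above.

```python
def first_artist_uri(search_response):
    """Return the URI of the first artist in a search response, or None."""
    items = search_response.get("artists", {}).get("items", [])
    return items[0]["uri"] if items else None


# Stub responses shaped like the nested structure indexed in the loop above.
found = {"artists": {"items": [{"uri": "spotify:artist:4q3ewBCX7sLwd24euuV69X"}]}}
empty = {"artists": {"items": []}}
```

With such a helper, the `try` block would only need to guard the network call itself.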


In [12]:
top_artists = top_artists.dropna().reset_index(drop=True)

length = len(top_artists)
print(f"Artists found: {length}")

top_artists.head()

Artists found: 971


| | artist | num_of_srtreams | artist_uri |
|---|---|---|---|
| 0 | Bad Bunny | 33481443936 | spotify:artist:4q3ewBCX7sLwd24euuV69X |
| 1 | Drake | 27276371805 | spotify:artist:3TVXtAsR1Inumwj472S9r4 |
| 2 | Ed Sheeran | 20113241185 | spotify:artist:6eUKZXaKkcviH0Ku9w2n3V |
| 3 | Justin Bieber | 20109642835 | spotify:artist:1uNFoZAHBGtllmzznpCI3s |
| 4 | J Balvin | 19123851607 | spotify:artist:1vyhD5VmyZ7KMfW5gqLgo5 |


#### Get 10 songs for each artist

In [85]:
data = pd.DataFrame()
idx = 0

for i in tqdm(range(length)):

    artist_URI = top_artists.loc[i, "artist_uri"]
    tracks = sp.artist_top_tracks(artist_URI)["tracks"]

    # an artist may have fewer than 10 top tracks
    for track in tracks[:10]:
        data.loc[idx, "artist"] = top_artists.loc[i, "artist"]
        data.loc[idx, "artist_uri"] = artist_URI
        data.loc[idx, "track_name"] = track["name"]
        data.loc[idx, "track_id"] = track["id"]
        data.loc[idx, "track_uri"] = track["uri"]
        data.loc[idx, "preview_url"] = track["preview_url"]
        idx += 1

data.info()

100%|█████████████████████████████████████████| 971/971 [03:23<00:00,  4.77it/s]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9607 entries, 0 to 9606
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   artist       9607 non-null   object
 1   artist_uri   9607 non-null   object
 2   track_name   9607 non-null   object
 3   track_id     9607 non-null   object
 4   track_uri    9607 non-null   object
 5   preview_url  7814 non-null   object
dtypes: object(6)
memory usage: 783.4+ KB
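
Filling a dataframe cell by cell is slow for thousands of rows. An alternative, sketched below under the assumption that each track is a dictionary with the keys used above, is to collect plain dictionaries first. `track_rows` is a hypothetical helper and the stub tracks use made-up IDs.

```python
def track_rows(artist, artist_uri, tracks, limit=10):
    """Build one row (a dict) per track, keeping at most `limit` tracks."""
    rows = []
    for track in tracks[:limit]:
        rows.append({
            "artist": artist,
            "artist_uri": artist_uri,
            "track_name": track["name"],
            "track_id": track["id"],
            "track_uri": track["uri"],
            "preview_url": track.get("preview_url"),  # may be None
        })
    return rows


# Stub tracks with made-up IDs, for illustration only.
tracks = [
    {"name": "Song A", "id": "id_a", "uri": "spotify:track:id_a",
     "preview_url": None},
    {"name": "Song B", "id": "id_b", "uri": "spotify:track:id_b",
     "preview_url": "https://example.com/b.mp3"},
]
rows = track_rows("Example Artist", "spotify:artist:example", tracks)
```

The rows from all artists can then be concatenated into one list and passed to `pd.DataFrame` in a single step.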





#### Music attributes

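Get [music attributes](#music-attributes) for the tracks in the resulting dataset. The output below shows 7814 rows, so the tracks without a `preview_url` appear to have been dropped, and the index reset, before this step.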

In [98]:
music_attributes = ["tempo", "energy", "danceability", "loudness", "valence"]

for i in tqdm(range(data.shape[0])):

    track_uri = data.loc[i, "track_uri"]
    # the returned entry may be None when no audio features are available
    audio_features = sp.audio_features(track_uri)[0]
    for attribute in music_attributes:
        if audio_features is None:
            data.loc[i, attribute] = None
        else:
            data.loc[i, attribute] = audio_features[attribute]

data.info()

100%|███████████████████████████████████████| 7814/7814 [15:17<00:00,  8.51it/s]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7814 entries, 0 to 7813
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   artist        7814 non-null   object 
 1   artist_uri    7814 non-null   object 
 2   track_name    7814 non-null   object 
 3   track_id      7814 non-null   object 
 4   track_uri     7814 non-null   object 
 5   preview_url   7814 non-null   object 
 6   tempo         7811 non-null   float64
 7   energy        7811 non-null   float64
 8   danceability  7811 non-null   float64
 9   loudness      7811 non-null   float64
 10  valence       7811 non-null   float64
dtypes: float64(5), object(6)
memory usage: 671.6+ KB
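
The loop above sends one request per track, which took about 15 minutes. If the endpoint accepts several track URIs per call (an assumption worth checking against the API documentation, including the maximum batch size), the tracks can be split into batches first. The batch size of 100 and the placeholder URIs below are illustrative only.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


track_uris = [f"spotify:track:{n}" for n in range(250)]  # placeholder URIs
batches = list(chunked(track_uris, 100))
```

Each batch could then be passed to `sp.audio_features(batch)` in place of a single URI, with the results matched back to rows by position.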





In [103]:
# look at the list of excluded artists
dropped_out = top_artists[
                    ~top_artists["artist"]
                    .isin(data["artist"])
             ]["artist"]
print("Dropped out total: ", len(dropped_out))
dropped_out.head()

Dropped out total:  53


10      Billie Eilish
13         Juice WRLD
20    Imagine Dragons
28       Shawn Mendes
34          21 Savage
Name: artist, dtype: object

#### Download audio

In [10]:
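os.makedirs("spotify_data/mp3_files", exist_ok=True)

for i in tqdm(range(data.shape[0])):

    url = data.loc[i, "preview_url"]
    track_id = data.loc[i, "track_id"]
    file_path = f"spotify_data/mp3_files/{track_id}.mp3"

    try:
        r = requests.get(url, timeout=2)
        r.raise_for_status()  # do not save error pages as audio
        with open(file_path, "wb") as f:
            f.write(r.content)
        data.loc[i, "local_file_path"] = file_path
    except requests.RequestException:
        # leave local_file_path empty for failed downloads
        pass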

100%|█████████████████████████████████████| 7811/7811 [1:00:33<00:00,  2.15it/s]
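
Since the same track can appear under several artists, a download step can skip files that already exist. The sketch below separates the file handling from the HTTP call so it can be tried without network access. `save_preview` and its `fetch` parameter are hypothetical names, and `fake_fetch` stands in for a real request.

```python
import tempfile
from pathlib import Path


def save_preview(url, track_id, out_dir, fetch):
    """Save the bytes returned by fetch(url) as out_dir/<track_id>.mp3.

    Returns the file path, or None if there is nothing to save.
    """
    if not url:
        return None
    path = Path(out_dir) / f"{track_id}.mp3"
    if path.exists():  # already downloaded, e.g. a track shared by two artists
        return path
    try:
        content = fetch(url)
    except OSError:  # network-level failures
        return None
    if not content:
        return None
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return path


calls = []


def fake_fetch(url):
    """Stand-in for a real HTTP request; records each call."""
    calls.append(url)
    return b"fake audio bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = save_preview("https://example.com/a.mp3", "abc", tmp, fake_fetch)
    second = save_preview("https://example.com/a.mp3", "abc", tmp, fake_fetch)
    saved_bytes = first.read_bytes()
```

With `requests`, `fetch` could be a small function that calls `requests.get(url, timeout=...)`, then `raise_for_status()`, and returns `.content`; the exceptions raised by `requests` derive from `OSError`, so the `except` clause above would catch them.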


In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7811 entries, 0 to 7810
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   artist           7811 non-null   object 
 1   artist_uri       7811 non-null   object 
 2   track_name       7811 non-null   object 
 3   track_id         7811 non-null   object 
 4   track_uri        7811 non-null   object 
 5   preview_url      7811 non-null   object 
 6   tempo            7811 non-null   float64
 7   energy           7811 non-null   float64
 8   danceability     7811 non-null   float64
 9   loudness         7811 non-null   float64
 10  valence          7811 non-null   float64
 11  local_file_path  7811 non-null   object 
dtypes: float64(5), object(7)
memory usage: 732.4+ KB


In [42]:
# check how many files were downloaded
downloaded = [s[:-4] for s in listdir("spotify_data/mp3_files/")]
len(downloaded)

6801
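
A file count alone does not show which tracks lack a file or which files match no track. Comparing the two sets makes that visible. This is a minimal sketch: `missing_and_extra` is a hypothetical helper, and the temporary directory with empty files only simulates the download folder.

```python
import tempfile
from pathlib import Path


def missing_and_extra(track_ids, mp3_dir):
    """Compare expected track IDs with the .mp3 file stems found on disk."""
    on_disk = {p.stem for p in Path(mp3_dir).glob("*.mp3")}
    expected = set(track_ids)
    # (IDs without a file, files without a matching ID)
    return sorted(expected - on_disk), sorted(on_disk - expected)


with tempfile.TemporaryDirectory() as tmp:
    # Simulate a download folder with two expected files and one stray file.
    for name in ("a.mp3", "b.mp3", "stray.mp3"):
        (Path(tmp) / name).write_bytes(b"")
    missing, extra = missing_and_extra(["a", "b", "c"], tmp)
```

In this notebook the call would take `data["track_id"]` and the `spotify_data/mp3_files/` folder as arguments.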

In [43]:
# see how many tracks are duplicates
data.duplicated(subset=["track_name","track_id","local_file_path"]).sum()

1011

In [45]:
# drop duplicated tracks from the dataframe
data.drop_duplicates(subset=["track_name","track_id","local_file_path"], 
                     inplace=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6800 entries, 0 to 7810
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   artist           6800 non-null   object 
 1   artist_uri       6800 non-null   object 
 2   track_name       6800 non-null   object 
 3   track_id         6800 non-null   object 
 4   track_uri        6800 non-null   object 
 5   preview_url      6800 non-null   object 
 6   tempo            6800 non-null   float64
 7   energy           6800 non-null   float64
 8   danceability     6800 non-null   float64
 9   loudness         6800 non-null   float64
 10  valence          6800 non-null   float64
 11  local_file_path  6800 non-null   object 
dtypes: float64(5), object(7)
memory usage: 690.6+ KB


In [47]:
# data = pd.read_csv("spotify_data/songs_data.csv")
# data.info()

In [46]:
# data = data.dropna().reset_index(drop=True)

# save for further use
data.to_csv("spotify_data/songs_data.csv", index=False)

Now we have data and audio files for 6800 songs by the artists with the most streams.

***