# Music Valence Prediction
Spotify data | Gradient boosting

***

## Project description

### The objective

The goal of this task is to develop a Python-based module that predicts the valence of newly released pop songs. Two kinds of input are to be used:
- the audio data (e.g. .wav files)
- the song lyrics

Publicly available datasets can be used for training and testing.

### Audio features

For model training, the following audio features will be used:
- Mel Frequency Cepstral Coefficients
- Mel Spectrogram
- Spectral Contrast
- Root Mean Square
- Chroma Vector
- Tonal Centroid Features

### The dataset

In this project, data from Spotify will be used.

*** 

## Imports and configurations

In [1]:
import requests
import re

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as BS
from tqdm import tqdm

import matplotlib.pyplot as plt
import seaborn as sns

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials 

In [2]:
plt.style.use("seaborn-v0_8")
sns.set_style("dark", {"axes.facecolor": "0.95"})
sns.set_palette("mako")

plt.rcParams["figure.figsize"] = (9, 4)
%config InlineBackend.figure_format = "retina"

***

## Data scraping

### Top artists and songs

Create a list of the 1,000 most popular artists using data from the [Spotify Chart History](https://kworb.net/spotify/artists.html) web page.

In [3]:
chart_hist_URL = "https://kworb.net/spotify/artists.html"

req = requests.get(chart_hist_URL)
req.status_code

200

In [4]:
# re-encode the page text as Latin-1 bytes so that artist names display correctly
content = req.text.encode("latin-1")

soup = BS(content, 'lxml')
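
The re-encoding step in cell 4 is the usual repair for UTF-8 text that was decoded as Latin-1, which `requests` does when a `text/html` response declares no charset. Whether that is what happens for this page is an assumption; the sketch below only demonstrates the round trip on a sample name that is not taken from the scraped data.

```python
# a name containing a non-ASCII character, as UTF-8 bytes
original = "Beyoncé"
raw_bytes = original.encode("utf-8")

# decoding those bytes as Latin-1 garbles the accented letter
garbled = raw_bytes.decode("latin-1")

# encoding the garbled text back to Latin-1 restores the original bytes,
# which can then be decoded correctly as UTF-8
recovered = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(recovered)
```

An alternative with the same effect would be to set `req.encoding = "utf-8"` before reading `req.text`, assuming the page really is UTF-8.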

In [5]:
# find the top artists table 
tabular_data = soup.find_all('table', attrs={"id":"spotifyartistindex"})
len(tabular_data) # check how many found

1

In [6]:
# keep only the first 1000 rows (without header)
table_rows = tabular_data[0].find_all("tr")[1:1001]
len(table_rows)

1000

In [7]:
# get the artist names and number of streams
artists = []
streams = []

for row in table_rows:
    
    artist = row.find("a")
    artist = re.search("(?<=>).+(?=</a>)", str(artist)).group()
    artists.append(artist)
    
    # the last cell holds the stream count with comma separators
    num_streams = row.find_all("td")[-1]
    num_streams = re.search("(?<=<td>).+(?=</td>)", str(num_streams)).group()
    num_streams = int(num_streams.replace(",", ""))
    streams.append(num_streams)
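
The same extraction can be sketched without regular expressions by reading the text of each tag with BeautifulSoup's `get_text`. The HTML below is a hypothetical two-row sample, not a copy of the kworb.net page, and the built-in `html.parser` is used here only to keep the example free of the `lxml` dependency.

```python
from bs4 import BeautifulSoup

# hypothetical sample; the real table may have more columns
sample_html = """
<table id="spotifyartistindex">
  <tr><th>Artist</th><th>Streams</th></tr>
  <tr><td><a href="artist/a.html">Bad Bunny</a></td><td>33,446,817,351</td></tr>
  <tr><td><a href="artist/b.html">Drake</a></td><td>27,268,738,872</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", attrs={"id": "spotifyartistindex"})
rows = table.find_all("tr")[1:]  # skip the header row

parsed = []
for row in rows:
    name = row.find("a").get_text(strip=True)           # link text is the artist name
    count = row.find_all("td")[-1].get_text(strip=True)  # last cell is the stream count
    parsed.append((name, int(count.replace(",", ""))))

print(parsed)
```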

In [8]:
# create a dataframe
top_artists = pd.DataFrame(dict(artist=artists, num_of_srtreams=streams))
top_artists.head()

| | artist | num_of_srtreams |
|---|---|---|
| 0 | Bad Bunny | 33446817351 |
| 1 | Drake | 27268738872 |
| 2 | Ed Sheeran | 20109467130 |
| 3 | Justin Bieber | 20106414174 |
| 4 | J Balvin | 19122158631 |


***

### Spotify ...

#### Login items

In [9]:
# read Spotify login items
login_items = pd.read_csv("spotify_data/spotify_login_items.csv", index_col=0)
login_items = login_items.squeeze() # convert to a series

In [10]:
# authentication
credentials = SpotifyClientCredentials(client_id=login_items.client_id,
                                       client_secret=login_items.client_secret)

sp = spotipy.Spotify(client_credentials_manager=credentials)
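
As an alternative to a CSV file, the credentials can be kept in environment variables. Spotipy's `SpotifyClientCredentials` falls back to `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` when no arguments are passed. The helper below, `read_credentials`, is a hypothetical name used only to illustrate checking those variables before authenticating.

```python
import os


def read_credentials(environ=None):
    """Return (client_id, client_secret) from the environment (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    names = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return tuple(environ[name] for name in names)


# placeholder values for illustration, not real credentials
demo_env = {"SPOTIPY_CLIENT_ID": "demo-id", "SPOTIPY_CLIENT_SECRET": "demo-secret"}
client_id, client_secret = read_credentials(demo_env)
```

The returned pair can be passed to `SpotifyClientCredentials(client_id=..., client_secret=...)` exactly as in cell 10.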

#### Get artists' URIs

In [11]:
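for i in tqdm(range(len(top_artists))):
    
    artist_name = top_artists.loc[i, "artist"]
    try:
        # take the first search hit as the artist's URI
        results = sp.search(q="artist:" + artist_name, type="artist", limit=1)
        artist_URI = results["artists"]["items"][0]["uri"]
        top_artists.loc[i, "artist_uri"] = artist_URI
    except Exception:
        # no match or failed request: mark as missing so the row can be dropped
        top_artists.loc[i, "artist_uri"] = None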

100%|███████████████████████████████████████| 1000/1000 [01:47<00:00,  9.31it/s]
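
The lookup in cell 11 can also be written as a small function that returns `None` when the search has no hits, which avoids relying on an exception for the "not found" case. `find_artist_uri` and `FakeClient` are hypothetical names; the fake client only mimics the shape of the search response used above so that the sketch runs offline.

```python
def find_artist_uri(client, name):
    """Return the URI of the first search hit for `name`, or None (hypothetical helper)."""
    results = client.search(q="artist:" + name, type="artist", limit=1)
    items = results["artists"]["items"]
    return items[0]["uri"] if items else None


class FakeClient:
    """Offline stand-in for spotipy.Spotify, for illustration only."""

    def __init__(self, catalogue):
        self.catalogue = catalogue

    def search(self, q, type, limit):
        name = q[len("artist:"):]          # strip the "artist:" field filter
        uri = self.catalogue.get(name)
        items = [{"uri": uri}] if uri else []
        return {"artists": {"items": items}}


client = FakeClient({"Drake": "spotify:artist:3TVXtAsR1Inumwj472S9r4"})
found = find_artist_uri(client, "Drake")
missing = find_artist_uri(client, "No Such Artist")
print(found, missing)
```

With the real client, `find_artist_uri(sp, artist_name)` would take the place of the two lines inside the `try` block.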


In [12]:
top_artists = top_artists.dropna().reset_index(drop=True)

length = len(top_artists)
print(f"Artists found: {length}")

top_artists.head()

Artists found: 971


| | artist | num_of_srtreams | artist_uri |
|---|---|---|---|
| 0 | Bad Bunny | 33446817351 | spotify:artist:4q3ewBCX7sLwd24euuV69X |
| 1 | Drake | 27268738872 | spotify:artist:3TVXtAsR1Inumwj472S9r4 |
| 2 | Ed Sheeran | 20109467130 | spotify:artist:6eUKZXaKkcviH0Ku9w2n3V |
| 3 | Justin Bieber | 20106414174 | spotify:artist:1uNFoZAHBGtllmzznpCI3s |
| 4 | J Balvin | 19122158631 | spotify:artist:1vyhD5VmyZ7KMfW5gqLgo5 |


#### Get 10 songs for each artist

In [41]:
data = pd.DataFrame()
idx = 0

for i in tqdm(range(length)):
    
    artist_URI = top_artists.loc[i, "artist_uri"]
    tracks = sp.artist_top_tracks(artist_URI)["tracks"]
    
    # slicing handles artists with fewer than 10 top tracks
    for track in tracks[:10]:
        data.loc[idx, "artist"] = top_artists.loc[i, "artist"]
        data.loc[idx, "artist_uri"] = artist_URI
        data.loc[idx, "track_name"] = track["name"]
        data.loc[idx, "track_id"] = track["id"]
        data.loc[idx, "url"] = track["preview_url"]  # may be None
        idx += 1
        
data.info()

100%|█████████████████████████████████████████| 971/971 [02:54<00:00,  5.55it/s]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9607 entries, 0 to 9606
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   artist      9607 non-null   object
 1   artist_uri  9607 non-null   object
 2   track_name  9607 non-null   object
 3   track_id    9607 non-null   object
 4   url         8654 non-null   object
dtypes: object(5)
memory usage: 708.4+ KB
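
Growing a DataFrame one cell at a time is slow for thousands of rows. A common alternative is to collect plain dictionaries and build the frame once at the end. The sketch below shows the collection step only; `tracks_to_records` is a hypothetical helper and the sample tracks are made-up values that mimic the fields used above.

```python
def tracks_to_records(artist, artist_uri, tracks, limit=10):
    """Flatten up to `limit` track dicts into row dicts (hypothetical helper)."""
    return [
        {
            "artist": artist,
            "artist_uri": artist_uri,
            "track_name": track["name"],
            "track_id": track["id"],
            "url": track["preview_url"],  # None when no preview is available
        }
        for track in tracks[:limit]
    ]


# made-up response fragment for illustration
sample_tracks = [
    {"name": "Song A", "id": "id-a", "preview_url": "https://example.com/a.mp3"},
    {"name": "Song B", "id": "id-b", "preview_url": None},
]

records = tracks_to_records("Some Artist", "spotify:artist:example", sample_tracks)
print(len(records), records[0]["track_name"])
```

Extending one list with the records of every artist and then calling `pd.DataFrame(records)` would give the same five columns as `data`.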





In [43]:
# drop tracks without a preview URL (the only column with missing values)
data = data.dropna().reset_index(drop=True)

# save for further use
data.to_csv("spotify_data/songs_dataframe.csv", index=False)