# Predicting Song Popularity
***
<img src= '../IMAGES/header.jpg' width=10000 />

## Introduction
***
In the age of technology, it has become much easier for artists to upload their music to streaming platforms and gain popularity. It seems that almost every day artists come out with new songs, but generating a lot of music does not necessarily mean the tracks will be popular. I wanted to understand what constitutes a popular song. The popular music streaming service Spotify has an API that allows access to several of its databases. One of the datasets describes audio features of thousands of tracks dating back to the 1920s. Reviewing which aspects of these audio features make a song popular can help artists create pieces that their audience will enjoy. The objective of this analysis was to build classification models that predict a song's popularity from various audio features obtained from the Spotify API, in hopes of helping artists gain popularity.

## Overview of the Data
***
Spotify is one of the most popular music streaming services around. It has an immense collection of songs dating back to 1921. I obtained a dataset from the [Kaggle website](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks) which contains over 175,000 songs from the years 1921-2020. This data was obtained from the Spotify API, and I used the API to obtain further data on songs newly added in 2021. Spotify characterizes each of these songs with 13 audio features and also assigns each song a popularity score ranging from 0-100.

The dataset contained:
* 170,000+ tracks
* 30,000+ artists
* 17 track audio_features

### Audio Features and their descriptions obtained from [Spotify API website](https://developer.spotify.com/documentation/web-api/reference/#endpoint-get-audio-features)

#### Content
The "data.csv" file contains more than 170,000 songs collected from the Spotify Web API. Data grouped by artist, year, or genre is also available in the data section.

#### Primary:
- id 
    - Id of track generated by Spotify
    
#### Numerical:
- acousticness (Ranges from 0 to 1)
    - A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- danceability (Ranges from 0 to 1)
    - Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- energy (Ranges from 0 to 1)
    - Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
- duration_ms (Integer typically ranging from 200k to 300k)
    - The duration of the track in milliseconds.
- instrumentalness (Ranges from 0 to 1)
    - Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- valence (Ranges from 0 to 1)
    - A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
- popularity (Ranges from 0 to 100)
    - The popularity of the album. The value will be between 0 and 100, with 100 being the most popular. The popularity is calculated from the popularity of the album’s individual tracks.
- tempo (Float typically ranging from 50 to 150)
    - The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
- liveness (Ranges from 0 to 1)
    - Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
- loudness (Float typically ranging from -60 to 0)
    - The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
- speechiness (Ranges from 0 to 1)
    - Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
- year 
    - Ranges from 1921 to 2020

#### Dummy:
- mode (0 = Minor, 1 = Major)
    - Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
- explicit (0 = No explicit content, 1 = Explicit content)

#### Categorical:
- key (All keys on octave encoded as values ranging from 0 to 11)
    - The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.
- artists (List of artists mentioned)
    - The artists of the album. Each artist object includes a link in href to more detailed information about the artist.
- release_date (Date of release mostly in yyyy-mm-dd format, however precision of date may vary)
    - The date the album was first released, for example “1981-12-15”. Depending on the precision, it might be shown as “1981” or “1981-12”.
- name 
    - The name of the album. In case of an album takedown, the value may be an empty string.
    
## Exploratory Data Analysis
***

### Popularity Distribution
I first took a look at the distribution of the popularity scores. I noticed that a majority of the songs in this dataset are not that popular. Since I used this column to create a binary column for classification, I determined a good threshold would be a value of 35.

<img src= '../IMAGES/dist.png' width=700>

### Top 10 most popular tracks
With such an interesting dataset at my disposal, I wanted to see which tracks were the top 10 most popular on Spotify.

<img src = '../IMAGES/top10.png'>

### Top 20 most popular artists

I also wanted to know which artists were the most popular.
<img src='../IMAGES/top20.png'>

### Time series analysis of audio features over time
I was interested to see how these audio features changed over time, so I performed a time series analysis.

<img src='../IMAGES/ts.png'>


### Time series analysis of popularity over time
I also wanted to see how popularity looked over time. Most songs from the 1920s to the early 1950s did not receive high popularity ratings. That makes sense: few people using Spotify listen to music from the 1920s through the late 1940s.

<img src='../IMAGES/pop_ts.png'>


### Audio Features Distribution

I wanted to look at the distribution of each individual audio feature. Judging by some of these features, it looks like performing linear regression may be difficult.

<img src='../IMAGES/af_dist.png'>

### Heatmap
Lastly, I wanted to take a look at the correlation between all the audio features to see if there was any possible multicollinearity. Year and popularity were very highly correlated.

<img src='../IMAGES/hm.png'>
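
A correlation check like this one can be sketched as follows. This is a minimal illustration on a small synthetic table, not the project's data; the column values and the 0.7 cutoff are assumptions chosen only for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the audio-feature table (all values are made up).
rng = np.random.default_rng(0)
n = 500
energy = rng.uniform(0, 1, n)
toy = pd.DataFrame({
    "energy": energy,
    # loudness is built to track energy, so this pair should be flagged
    "loudness": -60 + 55 * energy + rng.normal(0, 3, n),
    "tempo": rng.uniform(50, 150, n),
})

corr = toy.corr()

# Keep only the upper triangle so each feature pair is listed once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

threshold = 0.7  # assumed cutoff for "highly correlated"
high_corr_pairs = pairs[pairs.abs() > threshold]
print(high_corr_pairs)
```

Pairs that pass the cutoff are candidates for dropping or combining before fitting a linear model, since strongly correlated predictors make individual coefficients harder to interpret.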

# Classification Models
***

In [6]:
%load_ext autoreload
%autoreload 2
from NOTEBOOKS.functions import *


In [None]:
# Import necessary libraries/packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

import itertools
from sklearn import metrics

from sklearn.model_selection import (train_test_split, GridSearchCV,
                                     RandomizedSearchCV, cross_val_score)

from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
                              
from sklearn.metrics import mean_squared_error
from statsmodels.tools.eval_measures import rmse

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
RandomForestRegressor)

from sklearn.metrics import (classification_report, confusion_matrix, 
                             plot_confusion_matrix, precision_score, 
                             accuracy_score, recall_score, f1_score, roc_curve, 
                             auc)

from scipy.special import logit

plt.style.use('seaborn')

import shap
shap.initjs()

from alibi.explainers import KernelShap

from sklearn.feature_extraction.text import TfidfVectorizer

***
# Preprocessing data
***

In [None]:
# Load dataset and create pd dataframes
raw_df = pd.read_csv('DATA/data.csv')

In [None]:
raw_df.describe().round(2)

In [None]:
raw_df.shape

## Popularity Distribution
Using classification models, I want to predict the popularity of a song given the features of the data set. This data set includes a column for song popularity, which ranges from 0-100, with 100 being the most popular. I will plot the distribution of these scores and mark a popularity threshold of 35. Any song with a popularity < 35 will be deemed unpopular (0) and any song with a popularity >= 35 will be deemed popular (1).

In [None]:
fig = plt.figure(figsize=(10,5))
sns.set(style="darkgrid") 
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(raw_df['popularity'], label="Popularity", bins='auto',
             kde=True, stat='density')
plt.xlabel("Popularity")
plt.ylabel("Density")
plt.title("Distribution of Popularity Scores")
plt.axvline(35)
plt.show()

## Create categorical (binary) target
To create the target variable for this classification analysis, I use a popularity of 35 as the threshold value. In this step, I create a new binary column named "popular". If the song popularity is greater than or equal to 35, the song is classified as popular (1). Otherwise, the song is not popular (0). I will build other models that have different threshold values and compare model performance.

In [None]:
raw_df['popular'] = (raw_df['popularity'] >= 35).astype('int')
raw_df['popular'].value_counts(normalize=True)

In [None]:
raw_df.head()

In [None]:
# Save raw dataframe with the 'popular' column as a CSV file in the DATA folder
raw_df.to_csv('DATA/raw_df.csv') 

## Make a new dataframe with necessary information

In [None]:
df = raw_df[['valence', 'year', 'acousticness', 'danceability', 'duration_ms',
             'energy', 'instrumentalness', 'liveness', 'loudness', 
          'speechiness', 'tempo', 'key', 'popular']]

In [None]:
df.head()

In [None]:
df.info()

# Logistic Regression Models
***

## LR Model 1: Baseline model

### Define X and y

In [None]:
X = df[['valence', 'acousticness', 'danceability', 'duration_ms',
             'energy', 'instrumentalness', 'liveness', 'loudness', 
          'speechiness', 'tempo', 'key']]

y = df['popular']

### Train Test Split

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

### Standardize train and test sets

In [None]:
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns,
                       index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test),columns=X.columns,
                      index=X_test.index)

In [None]:
X_train.describe()

### Instantiate classifier and fit

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

### Predict

In [None]:
pred = logreg.predict(X_test)

### Summary Plot Mean absolute error of each feature

In [None]:
plot_shap(logreg, X_train)

X-axis: whether a feature value pushes the prediction toward the positive outcome (popular) or the negative outcome (not popular).

The newer songs are often more popular.
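
To make the x-axis reading concrete: for a linear model whose features are treated as independent, the SHAP value of a feature for one observation is the coefficient times that feature's deviation from its mean, measured in log-odds. The sketch below computes this by hand on synthetic data. It does not reproduce `plot_shap` (whose implementation is not shown here), and every name in it is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 2))  # two standardized toy features
# The label depends positively on feature 0 and negatively on feature 1.
logits = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]
y_toy = (logits + rng.logistic(size=1000) > 0).astype(int)

model = LogisticRegression().fit(X_toy, y_toy)

# Linear-model SHAP values on the log-odds scale, assuming independent features:
# phi[i, j] = coef[j] * (x[i, j] - mean(x[:, j]))
phi = model.coef_[0] * (X_toy - X_toy.mean(axis=0))

# Additivity: the base value plus a row's SHAP values equals the model's log-odds.
base_value = model.decision_function(X_toy).mean()
reconstructed = base_value + phi.sum(axis=1)

# Mean absolute SHAP value per feature gives a global importance ranking.
mean_abs_shap = np.abs(phi).mean(axis=0)
print(mean_abs_shap)
```

A positive value in `phi` means that feature value moved the prediction toward "popular" for that row, and a negative value moved it toward "not popular".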

### Model Coefficients
"Generally, positive coefficients make the event more likely and negative coefficients make the event less likely. An estimated coefficient near 0 implies that the effect of the predictor is small."

“For every one-unit increase in [X variable], the odds that the observation is in (y class) are [coefficient] times as large as the odds that the observation is not in (y class) when all other variables are held constant.”
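
When the model reports raw log-odds coefficients, as scikit-learn's `LogisticRegression` does in `coef_`, the multiplier on the odds is the exponentiated coefficient. A minimal sketch with made-up coefficient values (assumptions for illustration, not results from this notebook):

```python
import math

# Hypothetical log-odds coefficients for standardized features
# (illustrative values only, not output from the models in this notebook).
coefficients = {"danceability": 0.40, "acousticness": -0.90, "key": 0.0}

# exp(coefficient) is the odds ratio for a one-unit increase in the feature.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.2f}")
```

Because the features here are standardized, a "one-unit increase" corresponds to one standard deviation of the original feature. A ratio above 1 raises the odds of the popular class, a ratio below 1 lowers them, and a ratio of exactly 1 leaves them unchanged.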

In [None]:
find_coeffs(logreg, X_train, X).style.background_gradient(cmap='coolwarm')

### Model Performance

In [None]:
model_performance(logreg, X_train, X_test, y_train, y_test, pred)

### ROC Curve and AUC
"ROC is a probability curve and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. Higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s."
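
As a rough sketch of how these quantities can be computed with scikit-learn, the example below fits a logistic regression on synthetic data and builds the ROC curve from predicted probabilities. The data and variable names are assumptions for illustration; the notebook's own `roc_auc` helper is not shown here and may work differently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the popular / not popular target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(2000, 3))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25,
                                          random_state=42)
clf = LogisticRegression().fit(X_tr, y_tr)

# Use predicted probabilities for the positive class, not hard 0/1 labels,
# so the curve is traced across every decision threshold.
scores = clf.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)
roc_auc_value = auc(fpr, tpr)
print(f"AUC: {roc_auc_value:.3f}")
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation of the two classes.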

In [None]:
roc_auc(logreg, X_train, X_test, y_train, y_test)

The AUC looks fairly good but could be better. The ROC curve could also hug the top-left corner more closely.

***
## LR Model 2: LogisticRegressionCV

### Instantiate classifier and fit model

In [None]:
logregcv = LogisticRegressionCV()
logregcv.fit(X_train, y_train)

### Predict

In [None]:
pred = logregcv.predict(X_test)

### Summary Plot and Mean absolute error 

In [None]:
plot_shap(logregcv, X_train)

### Model  coefficients

In [None]:
find_coeffs(logregcv, X_train, X).style.background_gradient(cmap='coolwarm')

### Model Performance

In [None]:
model_performance(logregcv, X_train, X_test, y_train, y_test, pred)

### ROC Curve and AUC

In [None]:
roc_auc(logregcv, X_train, X_test, y_train, y_test)

***
## LR Model 3: GridSearchCV

### Instantiate classifier

In [None]:
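# liblinear supports both the 'l1' and 'l2' penalties searched below;
# the default lbfgs solver does not support 'l1'
logreg = LogisticRegression(solver='liblinear')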

### Create Parameter Grid 

In [None]:
log_param_grid = {
    'penalty' : ['l1', 'l2'],
    'C' : [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

### Instantiate GridSearchCV and fit

In [2]:
gs_log = GridSearchCV(logreg, log_param_grid, cv=3, return_train_score=True,
                      n_jobs=-1)


In [None]:
gs_log.fit(X_train, y_train)

### Best parameters

In [None]:
print("Best Parameter Combination Found During Grid Search:")
gs_log.best_params_

### Predict

In [None]:
pred = gs_log.predict(X_test)

### Summary Plot and Mean absolute error 

In [None]:
logreg_gs = LogisticRegression(C= 0.01, penalty= 'l2')

In [None]:
logreg_gs.fit(X_train, y_train)

In [None]:
plot_shap(logreg_gs, X_train)

### Model Performance

In [None]:
model_performance(gs_log, X_train, X_test, y_train, y_test, pred)

### ROC Curve and AUC

In [None]:
roc_auc(gs_log, X_train, X_test, y_train, y_test)

***
# Decision Trees Models
***

## DT Model 1: Baseline DecisionTree Model

### Instantiate classifier and fit model

In [None]:
dtree_clf = DecisionTreeClassifier(max_depth=10) 
dtree_clf.fit(X_train, y_train)

### Predict

In [None]:
pred = dtree_clf.predict(X_test)

### Summary Plot Mean absolute error of each feature

In [None]:
plot_shap_tree(dtree_clf, X_train, X)

### Model Performance

In [None]:
model_performance(dtree_clf, X_train, X_test, y_train, y_test, pred)

### Feature Importances

In [None]:
plot_feature_importances(dtree_clf, X_train, X)

### ROC Curve and AUC

In [None]:
label = 'Baseline DecisionTrees Model'

roc_dt_rf(y_test, pred, label=label)

***
## DT Model 2: Bagged DecisionTree

### Instantiate classifier and fit

In [None]:
bagged_tree =  BaggingClassifier(DecisionTreeClassifier(max_depth=5))
bagged_tree.fit(X_train, y_train)

### Predict

In [None]:
pred = bagged_tree.predict(X_test)

### Summary Plot Mean absolute error of each feature

In [None]:
#plot_shap_tree(bagged_tree, X_train, X)

### Model Performance

In [3]:
model_performance(bagged_tree, X_train, X_test, y_train, y_test, pred)


### ROC Curve and AUC

In [None]:
label = 'Bagged DecisionTrees Model'

roc_dt_rf(y_test, pred, label=label)

***
## DT Model 3: DecisionTree GridSearch 

### Instantiate classifier

In [None]:
dtree_model = DecisionTreeClassifier() 

### Create Parameter Grid

In [None]:
dt_param_grid = {
     'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [2, 5, 6]
}

### Instantiate GridSearchCV and fit

In [None]:
dt_grid_search = GridSearchCV(dtree_model, dt_param_grid, cv=3,
                              return_train_score=True, n_jobs=-1)

In [None]:
dt_grid_search.fit(X_train, y_train)

### Best parameters

In [None]:
print("Best Parameter Combination Found During Grid Search:")
dt_grid_search.best_params_

### Predict

In [None]:
pred = dt_grid_search.predict(X_test)

### Summary Plot Mean absolute error of each feature

In [None]:
dtree_gs = DecisionTreeClassifier(criterion='gini',
                                  max_depth= 10,
                                  min_samples_leaf= 2,
                                  min_samples_split= 2)

In [None]:
dtree_gs.fit(X_train, y_train)

In [None]:
plot_shap_tree(dtree_gs, X_train, X)

### Model Performance

In [None]:
model_performance(dt_grid_search, X_train, X_test, y_train, y_test, pred)

### ROC Curve and AUC

In [None]:
label = 'DecisionTrees GridSearchCV Model'

roc_dt_rf(y_test, pred, label=label)

***
# Random Forests Models
***

## RF Model 1: Baseline Model

### Instantiate classifier and fit

In [None]:
forest = RandomForestClassifier(max_depth=10)


In [None]:
forest.fit(X_train, y_train)

### Predict

In [None]:
pred = forest.predict(X_test)

### Summary Plot Mean absolute error of each feature

In [None]:
plot_shap_tree(forest, X_train, X)

### Model Performance

In [None]:
model_performance(forest, X_train, X_test, y_train, y_test, pred)

### ROC Curve and AUC

In [None]:
label = 'Baseline RandomForests Model'
roc_dt_rf(y_test, pred, label=label)

***
## RF Model 2: GridSearchCV Model

### Instantiate classifier

In [None]:
rforest_model = RandomForestClassifier()

### Create Parameter Grid

In [None]:
rf_param_grid = {
    'n_estimators': [10, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 5, 10],
    'min_samples_split': [5, 10],
    'min_samples_leaf': [3, 6]
}

### Instantiate GridSearchCV and fit

In [None]:
rf_grid_search = GridSearchCV(rforest_model, rf_param_grid, cv=3,
                            return_train_score=True, n_jobs=-1)

In [None]:
rf_grid_search.fit(X_train, y_train)

### Best parameters

In [4]:
print("Best Parameter Combination Found During Grid Search:")
rf_grid_search.best_params_


### Predict

In [None]:
pred = rf_grid_search.predict(X_test)

### Summary Plot Mean absolute error of each feature

In [None]:
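# Refit a random forest (not a single decision tree) for the SHAP plot;
# n_estimators is left at its default here
rtree_gs = RandomForestClassifier(criterion='entropy',
                                  max_depth=10,
                                  min_samples_leaf=6,
                                  min_samples_split=5)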

In [None]:
rtree_gs.fit(X_train, y_train)

In [None]:
plot_shap_tree(rtree_gs, X_train, X)

### Model Performance

In [None]:
model_performance(rf_grid_search, X_train, X_test, y_train, y_test, pred)

### ROC Curve and AUC

In [None]:
label = 'RandomForests GridSearchCV Model'

roc_dt_rf(y_test, pred, label=label)

*** 
### Conclusions
The Baseline Random Forests model performed the best of all models at an accuracy of 77.68%.

For the Logistic Regression models, acousticness, valence, danceability, loudness, and speechiness were ranked to be the five most important features.

As for the Decision Trees and Random Forests models, acousticness, loudness, duration_ms, valence, and danceability were ranked to be the five most important features.

All best models agree on the following: 
* When the level of acousticness of a track is low, it has a positive shap value and is more likely to be "popular"
* When the level of loudness of a track is high, it has a positive shap value and is more likely to be "popular"
* When the level of valence of a track is low, it has a positive shap value and is more likely to be "popular"
* When the level of danceability of a track is high, it has a positive shap value and is more likely to be "popular"
* When the level of speechiness of a track is high, it has a negative shap value and is less likely to be "popular"
* key and tempo are the least important features

For an artist who wants to create popular music, I would recommend creating songs with low acousticness, high loudness, low valence, and high danceability.

### Problems I ran into
I ran into several issues while modeling. The main one was that the Decision Trees and Random Forests models took a long time to fit, especially with GridSearchCV. To lower the processing time, I made the trees shallower by setting a max_depth of 10. This limit made it hard to improve my models significantly and could have reduced the accuracy of my results.
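
One common way to cut tuning time, sketched here as a suggestion rather than something done in this notebook, is to sample a fixed number of parameter combinations with `RandomizedSearchCV` instead of trying every combination. The synthetic data, the parameter values, and `n_iter=5` below are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem so the sketch runs quickly.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

param_distributions = {
    "n_estimators": [10, 50, 100],
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 3, 6],
}

# n_iter caps the number of sampled combinations (5 of the 216 possible),
# so the search costs 5 * cv fits instead of 216 * cv.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,
    cv=3,
    random_state=42,
)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

With a fixed budget of fits, deeper trees (including `max_depth=None`) can be included in the search space without the cost growing with the size of the grid.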


### Recommendations to improve models / Future Work
While modeling, I realized that recently added songs would not have high popularity scores, since popularity is based on the number of listens a song gets. I believe that including the date and time when a song was uploaded to Spotify would improve the models.

I used a popularity threshold of 35. I would like to apply the same modeling techniques with different threshold values to see whether that improves the models' predictions.
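
A first step toward comparing thresholds could be to check how the class balance shifts with each candidate value. The sketch below uses a handful of made-up popularity scores, and the candidate thresholds are assumptions for illustration.

```python
import pandas as pd

# Toy popularity scores (made-up values standing in for the real column).
popularity = pd.Series([0, 5, 12, 20, 28, 34, 35, 41, 50, 63, 70, 88])

# Share of tracks labelled popular (1) under each candidate threshold.
candidate_thresholds = [25, 35, 50]  # assumed values for illustration
positive_share = {
    t: float((popularity >= t).astype(int).mean()) for t in candidate_thresholds
}
print(positive_share)
```

Because the share of positive labels changes with the threshold, accuracy alone is not directly comparable across thresholds; precision, recall, and AUC would be worth reporting alongside it.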

Most songs from 1920s - early 1950s did not receive high popularity ratings. When you think about it, that makes sense. Many people using Spotify are not really listening to music from the 1920s - late 1940s. I would like to see if removing those songs would improve the model.
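
Removing older songs amounts to a filter on the `year` column before the target is built. A minimal sketch on made-up rows, where the 1950 cutoff is an assumption for illustration:

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up).
toy = pd.DataFrame({
    "year": [1925, 1938, 1949, 1955, 1972, 1990, 2005, 2019],
    "popularity": [2, 0, 5, 18, 36, 47, 55, 71],
})

cutoff_year = 1950  # assumed cutoff
recent = toy[toy["year"] >= cutoff_year]

# Share of popular tracks (popularity >= 35) before and after filtering.
share_all = float((toy["popularity"] >= 35).mean())
share_recent = float((recent["popularity"] >= 35).mean())
print(len(toy), len(recent), share_all, share_recent)
```

Filtering changes the class balance as well as the sample size, so any model comparison would need to be rerun on the filtered data rather than judged against the earlier scores.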