# Using Classification Models to Predict Song Popularity
***

<img src= 'IMAGES/header.jpg' width=10000 />

## Introduction
***
In the age of technology, it has become much easier for artists to upload their music to streaming platforms and gain popularity. It seems that almost every day, artists come out with new songs, but generating a lot of music does not necessarily mean your tracks will be popular. I wanted to understand what constitutes a popular song. The popular music streaming service, Spotify, has an API that allows access to several of their databases. One of the datasets examines audio features of thousands of tracks dating back to the 1920s. Reviewing the aspects of these audio features that make a song popular can help artists create pieces that their audience will enjoy. The objective of this analysis was to build classification models that could predict a song’s popularity given various audio features obtained from the Spotify API, in hopes of helping artists gain popularity.

## Overview of the Data
***

Spotify is one of the most popular music streaming services around. They have an immense collection of songs and podcasts. I obtained a dataset from the [Kaggle website](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks), which contains songs released between 1921 and 2020. This data was obtained from the Spotify API. Spotify characterizes each of these songs with 13 audio features and also assigns each song a popularity score ranging from 0–100.

The dataset contained:
* 170,000+ tracks
* 30,000+ artists
* Track audio features

### Audio Features and their descriptions obtained from [Spotify API website](https://developer.spotify.com/documentation/web-api/reference/#endpoint-get-audio-features)

#### Content
The "data.csv" file contains more than 170,000 songs collected from the Spotify Web API. Data grouped by artist, year, or genre is also available in the data section.

#### Primary:
- id 
    - Id of track generated by Spotify
    
#### Numerical:
- acousticness (Ranges from 0 to 1)
    - A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- danceability (Ranges from 0 to 1)
    - Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- energy (Ranges from 0 to 1)
    - Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
- duration_ms (Integer typically ranging from 200k to 300k)
    - The duration of the track in milliseconds.
- instrumentalness (Ranges from 0 to 1)
    - Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- valence (Ranges from 0 to 1)
    - A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
- popularity (Ranges from 0 to 100)
    - The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. Spotify calculates it mostly from the number of plays a track has had and how recent those plays are.
- tempo (Float typically ranging from 50 to 150)
    - The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
- liveness (Ranges from 0 to 1)
    - Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
- loudness (Float typically ranging from -60 to 0)
    - The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
- speechiness (Ranges from 0 to 1)
    - Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
- year 
    - Ranges from 1921 to 2020

#### Dummy:
- mode (0 = Minor, 1 = Major)
    - Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
- explicit (0 = No explicit content, 1 = Explicit content)

#### Categorical:
- key (All keys on octave encoded as values ranging from 0 to 11)
    - The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.
- artists (List of artists mentioned)
    - The artists of the track. Each artist object includes a link in href to more detailed information about the artist.
- release_date (Date of release mostly in yyyy-mm-dd format, however precision of date may vary)
    - The date the album was first released, for example “1981-12-15”. Depending on the precision, it might be shown as “1981” or “1981-12”.
- name 
    - The name of the track. In case of an album takedown, the value may be an empty string.

## Exploratory Data Analysis
***

### Popularity Distribution
I first looked at the distribution of the popularity scores. Most of the songs in this dataset are not very popular. Because I used this column to create the binary target for classification, I chose a threshold of 35 to separate popular songs from the rest.

<img src= 'IMAGES/dist.png' width=700>
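
As a rough illustration of that step, the sketch below turns a popularity score into a binary target with pandas. The tiny DataFrame and the column name `popular` are made up for the example, and treating a score of exactly 35 as popular (`>=`) is an assumption, since the write-up does not say which side of the threshold is inclusive.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the "popularity" column of data.csv.
df = pd.DataFrame({"popularity": [0, 12, 34, 35, 36, 80]})

THRESHOLD = 35  # threshold used in this analysis

# Assumption: a score at or above the threshold counts as popular (1).
df["popular"] = (df["popularity"] >= THRESHOLD).astype(int)

# Share of each class, useful for checking how balanced the target is.
print(df["popular"].value_counts(normalize=True))
```

Checking the class shares right after binarizing is a quick way to see how a different threshold would change the balance between the two classes.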

### Top 10 most popular tracks
With such an exciting dataset at my disposal, I wanted to see the top 10 most popular tracks on Spotify.

<img src = 'IMAGES/top10.png'>

### Top 20 most popular artists
I also wanted to know which artists were the most popular.

<img src='IMAGES/top20_artists.png'>

### Time series analysis of audio features over time
I was interested to see how these audio features changed over time, so I performed a time series analysis.

<img src='IMAGES/ts.png'>


### Time series analysis of popularity over time
I also wanted to see how popularity looked over time. Most songs from the 1920s to the early 1950s did not receive high popularity ratings, which makes sense: many people using Spotify are not listening to music from that era.

<img src='IMAGES/pop_ts.png'>

### Audio Features Distribution

I wanted to look at the distribution of each individual audio feature. Judging by some of these features, it looks like performing linear regression may be difficult.

<img src='IMAGES/af_dist.png'>

### Heatmap
Lastly, I wanted to look at the correlation between all the audio features to check for possible multicollinearity. Year and popularity were strongly positively correlated.

<img src='IMAGES/hm.png'>
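
A minimal sketch of this kind of check is shown below. It uses synthetic data built so that popularity rises with year (only to mimic the pattern described above, not the real dataset), and the 0.7 cutoff for flagging a pair is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data: popularity is constructed to increase with year.
year = rng.integers(1921, 2021, size=500)
popularity = np.clip((year - 1921) * 0.6 + rng.normal(0, 10, size=500), 0, 100)
energy = rng.uniform(0, 1, size=500)  # unrelated to the other two columns
df = pd.DataFrame({"year": year, "popularity": popularity, "energy": energy})

corr = df.corr()  # pairwise Pearson correlations

# List each pair of columns whose absolute correlation exceeds the cutoff.
CUTOFF = 0.7  # arbitrary cutoff chosen for this sketch
pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > CUTOFF
]
print(pairs)
```

The same `corr` matrix is what a heatmap like the one above visualizes; listing the pairs simply makes the strongest relationships explicit.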

## Classification Models
***

I built several classification models to compare their accuracies: a baseline LogisticRegression model, a LogisticRegressionCV model, a baseline DecisionTree model, a bagged decision tree, and a baseline RandomForest model. I also ran GridSearchCV on each model to find the best parameters and compared the resulting accuracies.
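
The comparison can be sketched roughly as follows with scikit-learn. This is not the project's actual code: the data comes from `make_classification` as a stand-in for the audio features, and the scaling step, split size, and hyperparameters are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the audio features (X) and the binary popularity target (y).
X, y = make_classification(n_samples=2000, n_features=13, n_informative=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# One entry per model family mentioned above; settings are illustrative only.
models = {
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "LogisticRegressionCV": make_pipeline(StandardScaler(), LogisticRegressionCV(cv=5, max_iter=1000)),
    "DecisionTree": DecisionTreeClassifier(max_depth=10, random_state=42),
    "BaggedDecisionTree": BaggingClassifier(
        DecisionTreeClassifier(max_depth=10), n_estimators=25, random_state=42
    ),
    "RandomForest": RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42),
}

# Fit each model on the training split and record accuracy on the held-out split.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)

for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {acc:.3f}")
```

Scoring every model on the same held-out split keeps the accuracies directly comparable.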

### Best Logistic Regression Model

Overall, the Logistic Regression models all performed about the same. The one that performed the best was the LogisticRegressionCV model.
**The LogisticRegressionCV model had an accuracy of 75.7%**, which is not much higher than the other models. Below is the confusion matrix produced from that model.

<img src= 'IMAGES/cfm.png' >

This model had some trouble predicting song popularity. The confusion matrix contains both false negatives and false positives, and I calculated the misclassification rate to be 0.245, so roughly one in four songs was labeled incorrectly. **The AUC value calculated from the ROC curve for this model was 0.82**. The closer the AUC is to 1, the better the model is at predicting 0s as 0s and 1s as 1s.
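
For reference, the metrics mentioned above can be written out in a few lines of plain Python. The confusion-matrix counts and scores below are illustrative only, not the values from this project, and in practice `sklearn.metrics` would normally be used instead of hand-written helpers.

```python
def accuracy_and_error(tn, fp, fn, tp):
    """Return (accuracy, misclassification rate) from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    return accuracy, 1.0 - accuracy


def roc_auc(y_true, y_score):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is fine for small examples
    but slow on large datasets.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative counts only, NOT the confusion matrix from this project.
acc, err = accuracy_and_error(tn=50, fp=10, fn=15, tp=25)
print(acc, err)  # 0.75 0.25

# Illustrative labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 0.75
```

The misclassification rate is simply one minus the accuracy, while the AUC depends only on how the predicted scores rank the two classes, not on any single cutoff.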


I wanted to get more information about the audio features within the dataset and how they affect my models. Using the SHAP Python package, I created plots that would help me better understand my models. SHAP uses Shapley values to help explain machine learning models. More information on the package can be found at the [SHAP documentation website](https://shap.readthedocs.io/en/latest/index.html). Below is a simplified summary plot created for this model. More information about reading these plots can be found [here](https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d).

<img src="IMAGES/sumplot.png">

***The summary plot above ranks acousticness, valence, danceability, loudness, and speechiness as the five most important features.*** For this plot, feature importances are displayed by ranking the variables in descending order. ***The top features contribute more to the model than the lower features, meaning they have higher predictive power.***

Below is a more elaborate summary plot.

<img src= 'IMAGES/logreg_sumplot.png' >

A couple of things to note about the summary plot above: 
* Feature importances are displayed by ranking the variables in descending order
* The horizontal position of a point indicates whether the effect of that value is associated with a higher or lower prediction
* The color demonstrates whether the variable is high (red) or low (blue) for that observation

The summary plot summarizes the following:
* When the level of acousticness of a track is low, it has a positive SHAP value and is more likely to be "popular"
* When the level of valence of a track is low, it has a positive SHAP value and is more likely to be "popular"
* When the level of danceability of a track is high, it has a positive SHAP value and is more likely to be "popular"
* When the level of loudness of a track is high, it has a positive SHAP value and is more likely to be "popular"
* When the level of speechiness of a track is high, it has a negative SHAP value and is less likely to be "popular"
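
To show where the signs in the list above come from, here is a small NumPy sketch of SHAP values for a linear model. When features are treated as independent, the SHAP value of feature j for one observation reduces to the coefficient times the feature's deviation from its mean, measured in log-odds for logistic regression. The data, coefficients, and intercept below are hypothetical, and the actual plots were produced with the SHAP package, which may have been configured differently.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["acousticness", "danceability", "loudness"]

X = rng.normal(size=(200, 3))      # synthetic feature matrix
coef = np.array([-1.2, 0.8, 0.5])  # hypothetical coefficients, NOT the fitted values
intercept = -0.3                   # hypothetical intercept

# Linear-model SHAP values under feature independence:
# phi[i, j] = coef[j] * (X[i, j] - mean of feature j)
phi = coef * (X - X.mean(axis=0))

# The base value is the model's log-odds output at the feature means.
base_value = intercept + X.mean(axis=0) @ coef
log_odds = intercept + X @ coef

# Additivity: base value plus a row's SHAP values recovers that row's prediction.
print(np.allclose(base_value + phi.sum(axis=1), log_odds))

# Global importance is the mean absolute SHAP value per feature.
importance = np.abs(phi).mean(axis=0)
for j in np.argsort(importance)[::-1]:
    print(f"{feature_names[j]:14s} {importance[j]:.3f}")
```

With a negative coefficient, as assumed here for acousticness, an above-average value gives a negative SHAP value and a below-average value gives a positive one, which is the pattern the list above describes.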

Below are the calculated coefficients for the LogisticRegressionCV model.

<img src='IMAGES/coeff.png'>

These coefficients confirm the interpretations of the plots above. A positive coefficient pushes a prediction toward class 1 (popular), and a negative coefficient pushes it toward class 0 (not popular). An estimated coefficient near 0 implies that the effect of the predictor is small.

A song with high danceability is more likely to be predicted as popular, since danceability has a large positive coefficient, and the deep red color shows that it is highly impactful to the model. The same is true of loudness.

However, as we get to tempo, duration_ms, and key, we can see that their values are closer to zero, meaning that their impact on the model is low. This matches the summary plots above, where these audio features are ranked lowest for feature importance.

Acousticness has a large negative coefficient and a deep blue color, meaning that it is very impactful to the model and pushes predictions toward not popular. The same is true of valence and speechiness.
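
One way to make the coefficient reading above more concrete is to convert each coefficient into an odds ratio by exponentiating it. The values below are hypothetical and chosen only to illustrate a large positive, a near-zero, and a large negative coefficient; the fitted values are the ones shown in the image above.

```python
import math

# Hypothetical coefficients for illustration, NOT the fitted values.
coefficients = {"danceability": 0.9, "tempo": 0.02, "acousticness": -1.1}

# exp(beta) is the multiplicative change in the odds of class 1 (popular)
# for a one-unit increase in the feature, holding the others fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, beta in sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True):
    direction = "raises" if beta > 0 else "lowers"
    print(f"{name:13s} beta={beta:+.2f}  odds ratio={odds_ratios[name]:.2f}  "
          f"({direction} the odds of class 1)")
```

An odds ratio above 1 corresponds to a positive coefficient, below 1 to a negative one, and a value close to 1 to a coefficient near zero. Note that the size of "one unit" depends on how each feature was scaled before fitting.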


*** 
### Conclusions
The LogisticRegressionCV model performed the best of all models, with an accuracy of 75.7% and a misclassification rate of 0.245.

For this model, acousticness, valence, danceability, loudness, and speechiness were ranked to be the five most important features. These features are the most impactful to the model performance.

Most Impactful Audio Features on Model
1. Acousticness
2. Valence
3. Danceability
4. Loudness
5. Speechiness


Summary of Top Audio Features:
* When the level of acousticness of a track is low, it is more likely to be popular
* When the level of valence of a track is low, it is more likely to be popular
* When the level of danceability of a track is high, it is more likely to be popular
* When the level of loudness of a track is high, it is more likely to be popular
* When the level of speechiness of a track is high, it is less likely to be "popular"


### Problems I ran into
I ran into several issues while modeling. The main one was that the Decision Tree and Random Forest models took a long time to fit, especially with GridSearchCV. To lower the processing time, I made the trees shallower by setting max_depth to 10. That cap made it hard to improve these models significantly and may well have limited the accuracy of my results.
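
A rough sketch of that trade-off with scikit-learn is shown below. The data is synthetic, and the grid values, fold count, and use of `n_jobs=-1` on the forest are assumptions for the example rather than the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the audio features and binary popularity target.
X, y = make_classification(n_samples=1000, n_features=13, n_informative=6, random_state=0)

# A deliberately small grid: capping max_depth at 10 and keeping the number of
# combinations low limits the total number of fits.
param_grid = {
    "max_depth": [5, 10],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    # n_jobs=-1 lets the forest build its trees in parallel on all cores.
    RandomForestClassifier(random_state=0, n_jobs=-1),
    param_grid,
    cv=3,                # fewer folds means fewer fits per combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The total number of fits is the number of parameter combinations times the number of folds (here 4 × 3 = 12), so trimming either one shortens the search. `RandomizedSearchCV` is another option when the grid grows too large to search exhaustively.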


### Recommendations 
For an artist who wants to create popular music, I would recommend songs with little to no acoustic instrumentation and more electronic beats, since the data shows that highly acoustic songs tend not to be popular. Songs that are sad or gloomy tend to be more popular with listeners, so try writing about heartbreak or betrayal (more depressing topics). Listeners enjoy music that they can dance to, so create songs that have a fast tempo, strong beats, and a stable rhythm. Popular songs have lyrics, but not too many, so an artist should create songs with simple, repetitive lyrics.


### Future Work
While modeling, I realized that recently added songs would not have high popularity scores, since popularity is based on the number of listens a song gets. Taking into account the date and time when a song was uploaded to Spotify could improve the models.

I used a popularity threshold of 35, and I would like to use the same modeling techniques on a different threshold value to see if that improves model predictions.

Most songs from the 1920s to the early 1950s did not receive high popularity ratings, which makes sense: many people using Spotify are not listening to music from that era. I would like to see if removing those songs would improve the model.