# Song popularity on Spotify

Data Camp | M2 Data Sciences 2021-2022

Barrère Elise, Chouaieb Youssef, Cortés Carlos, Denis Matthieu, Haoula Narjes, Milliot Emma 

## Table of Contents
* [Introduction](#intro)
    * [Objective](#obj)
    * [The data](#data)
    * [Metrics](#metrics)

<img src="spotify_image.jpg" width="700">

## Introduction
<a class="anchor" id="intro"></a>
Music occupies a central place in the lives of many. It has been shown that the average American hears or listens to music up to 27 hours per week. The online music streaming market has been booming since the 2010s, and the major player in this market is Spotify. With 406 million monthly active users and 180 million premium subscribers in 2021, and more than 50 million songs available on the platform, the service is hugely popular.


The Swedish company is extremely keen on data, and there is little doubt that its lead over its competitors is largely due to its understanding of customers and its almost magical recommendation system. Without being a record label, Spotify calls the shots on the popularity of songs and artists, because 30% of its users listen to the personalized playlists generated automatically by the platform.

What makes a song popular? How can we describe songs? How can we know what a user will like listening to? It seems that Spotify knows the answers to all these questions and is willing to share part of its knowledge with us.


### Objective
<a class="anchor" id="obj"></a>

Our objective here is to predict the popularity of a song based on its audio features.

### The data
<a class="anchor" id="data"></a>

Spotify provides an API that gives access to a wide range of information about the tracks available on the platform. The data used for this project was collected exclusively through this API. An example Jupyter notebook is provided [here](https://github.com/e-barrere/Spotify-Song-Popularity-Prediction/blob/master/use_api_model.ipynb) to show the procedure employed.

The data is composed of factual and technical details about songs. Each song is indexed by a unique identifier, and several features are retrieved for it. These features are described below (technical descriptions are taken from the [Spotify Web API reference](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features)):

- **Release year/date**
- **Artists** : referred to by both name and ID
- **Song title** : referred to by both name and ID
- **Genre** : all of the possible genres are listed on this [website](https://everynoise.com/everynoise1d.cgi?scope=mainstream%20only&vector=popularity)
- **Danceability** : describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- **Energy** : A perceptual measure of intensity and activity, from 0.0 to 1.0.
- **Key** : The key the track is in. Integers map to pitches using standard Pitch Class notation.
- **Loudness** : The overall loudness of a track in decibels (dB).
- **Mode** : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
- **Speechiness** : Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer the attribute value is to 1.0.
- **Acousticness** : A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- **Instrumentalness** : Predicts whether a track contains no vocals. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- **Liveness** : Detects the presence of an audience in the recording. Higher values indicate a higher probability that the track was performed live.
- **Valence** : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.
- **Tempo** : The overall estimated tempo of a track in beats per minute (BPM). 
- **Duration_ms** : The duration of the track in milliseconds.
- **Time_signature** : An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). 
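
To make the encoded features above more concrete, here is a minimal sketch of how `key`, `mode` and `duration_ms` could be turned into human-readable values. The helper names `describe_key` and `duration_minutes` are hypothetical and are not part of the challenge code. The mapping assumes standard Pitch Class notation (0 = C, 1 = C♯/D♭, and so on) and the convention that `-1` means no key was detected.

```python
# Pitch classes in standard Pitch Class notation: index 0 is C, index 11 is B.
PITCH_CLASSES = [
    "C", "C#/Db", "D", "D#/Eb", "E", "F",
    "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B",
]


def describe_key(key: int, mode: int) -> str:
    """Return a readable label such as 'A minor' from the key and mode features.

    Hypothetical helper: assumes key is -1 when no key was detected,
    and that mode is 1 for major and 0 for minor.
    """
    if key == -1:
        return "unknown key"
    if not 0 <= key <= 11:
        raise ValueError(f"key must be in [-1, 11], got {key}")
    if mode not in (0, 1):
        raise ValueError(f"mode must be 0 (minor) or 1 (major), got {mode}")
    modality = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {modality}"


def duration_minutes(duration_ms: int) -> float:
    """Convert a duration in milliseconds to minutes."""
    return duration_ms / 60_000


print(describe_key(9, 0))        # A minor
print(duration_minutes(210_000)) # 3.5
```

Decoding these columns is only meant for inspection. For modelling, the integer `key` is a categorical feature, so its numeric order carries no meaning.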

The aim of this challenge is to predict the popularity of a song. Let's dive into the popularity metric. 

Popularity is an integer score from 0 to 100, also computed by Spotify. It is a combination of recent stream counts and other factors such as the number of playlists the track is in, save rate, skip rate, and share rate. Popularity is measured at a given time $t$; in this challenge it refers to the popularity of songs in March 2022.

### Metrics
<a class="anchor" id="metrics"></a>

For a regression task, three scores are commonly used: the **Mean Squared Error (MSE)**, the **Root Mean Squared Error (RMSE)** and the **Mean Absolute Error (MAE)**.

The MSE computes the average squared Euclidean distance between the data $y_i$ and the prediction $\hat{y}_i$:

$$
    MSE(y, \hat{y}) = \frac{1}{N}\sum_{i=1}^N (y_i - \hat{y}_i)^2
$$

The RMSE is the square root of the MSE. It has the advantage of being in the same unit as the predictions and the data:

$$
    RMSE(y, \hat{y}) = \sqrt{ MSE(y, \hat{y})} = \sqrt{\frac{1}{N}\sum_{i=1}^N (y_i - \hat{y}_i)^2}
$$

The MAE computes the average Manhattan distance between the data $y_i$ and the prediction $\hat{y}_i$:

$$
    MAE(y, \hat{y}) = \frac{1}{N}\sum_{i=1}^N |y_i - \hat{y}_i| \enspace .
$$

The score used in the challenge is the **RMSE**.
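
The three formulas above can be sketched in a few lines of plain Python. This is an illustrative implementation written directly from the definitions, not the challenge's official scoring code. The function names and the popularity values below are made up for the example.

```python
import math


def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared differences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean Absolute Error: average of the absolute differences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


# Made-up popularity scores (0 to 100) for four songs.
y_true = [50, 70, 30, 90]
y_pred = [45, 75, 30, 80]

print(mse(y_true, y_pred))   # 37.5
print(rmse(y_true, y_pred))  # about 6.12
print(mae(y_true, y_pred))   # 5.0
```

In this example the errors are 5, -5, 0 and 10. The single error of 10 accounts for most of the MSE (100 out of a total of 150), which illustrates why the RMSE penalizes large mistakes more heavily than the MAE.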