# Predicting Song Popularity
***

<img src= 'IMAGES/header.jpg' width=10000 />

## Objective
***
The objective of this analysis was to build regression and classification models that predict the popularity of a song from its audio features.

## Overview of the Data
***
Spotify is one of the most popular music streaming services, with an immense collection of songs dating back to 1921. I obtained a dataset from [Kaggle](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks) that contains over 175,000 songs released between 1921 and 2020. The data was originally collected through the Spotify API, and I used the same API to add songs that were newly released in 2021. Spotify characterizes each of these songs with 13 audio features and also assigns each song a popularity score ranging from 0 to 100.

The dataset contained:
* 170,000+ tracks
* 30,000+ artists
* 17 track audio_features

### Audio Features and their descriptions obtained from [Spotify API website](https://developer.spotify.com/documentation/web-api/reference/#endpoint-get-audio-features)

#### Content
The "data.csv" file contains more than 170,000 songs collected from the Spotify Web API. The data section also provides the same data grouped by artist, year, or genre.

#### Primary:
- id 
    - Id of track generated by Spotify
#### Numerical:
- acousticness (Ranges from 0 to 1)
    - A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- danceability (Ranges from 0 to 1)
    - Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- energy (Ranges from 0 to 1)
    - Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
- duration_ms (Integer typically ranging from 200k to 300k)
    - The duration of the track in milliseconds.
- instrumentalness (Ranges from 0 to 1)
    - Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- valence (Ranges from 0 to 1)
    - A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
- popularity (Ranges from 0 to 100)
    - The popularity of the track. The value will be between 0 and 100, with 100 being the most popular.
- tempo (Float typically ranging from 50 to 150)
    - The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
- liveness (Ranges from 0 to 1)
    - Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
- loudness (Float typically ranging from -60 to 0)
    - The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
- speechiness (Ranges from 0 to 1)
    - Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
- year 
    - Ranges from 1921 to 2020
#### Dummy:
- mode (0 = Minor, 1 = Major)
    - Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
- explicit (0 = No explicit content, 1 = Explicit content)
#### Categorical:
- key (All keys on octave encoded as values ranging from 0 to 11)
    - The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.
- artists (List of artists mentioned)
    - The artists who performed the track. Each artist object includes a link in href to more detailed information about the artist.
- release_date (Date of release mostly in yyyy-mm-dd format, however precision of date may vary)
    - The date the album was first released, for example “1981-12-15”. Depending on the precision, it might be shown as “1981” or “1981-12”.
- name 
    - The name of the track.
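
As a rough illustration of how the encoded fields above can be read, here is a minimal sketch in plain Python. The helper names `describe_key` and `speechiness_band` are hypothetical and not part of the dataset or the Spotify API; the pitch names follow the Pitch Class mapping described above (written with `#` and `b` instead of ♯ and ♭), and the band labels are my own shorthand for the speechiness ranges quoted from Spotify.

```python
# Pitch Class notation: index 0 is C, 1 is C#/Db, ..., 11 is B.
PITCH_CLASSES = [
    "C", "C#/Db", "D", "D#/Eb", "E", "F",
    "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B",
]


def describe_key(key: int, mode: int) -> str:
    """Return a label such as 'C#/Db minor' for a (key, mode) pair."""
    if not 0 <= key <= 11:
        raise ValueError(f"key must be between 0 and 11, got {key}")
    if mode not in (0, 1):
        raise ValueError(f"mode must be 0 (minor) or 1 (major), got {mode}")
    return f"{PITCH_CLASSES[key]} {'major' if mode == 1 else 'minor'}"


def speechiness_band(value: float) -> str:
    """Bucket a speechiness score using the 0.33 and 0.66 cut-offs."""
    if value > 0.66:
        return "mostly spoken word"
    if value >= 0.33:
        return "music and speech"
    return "mostly music"


print(describe_key(0, 1))
print(describe_key(1, 0))
print(speechiness_band(0.5))
```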

## Exploratory Data Analysis
***

### Top 10 most popular tracks

With such an interesting dataset at my disposal, I first wanted to see which 10 tracks were the most popular on Spotify.

<img src = 'IMAGES/top10_tracks.png'>

### Top 20 most popular artists
I also wanted to know which artists were the most popular.
<img src='IMAGES/top20_artists.png'>

### Time series analysis of audio features over time

I was interested in how these audio features changed over time, so I performed a time series analysis.

<img src='IMAGES/ts_audio.png'>
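
One simple way to look at features over time is to average each feature per release year. The sketch below assumes a pandas DataFrame with the `year` column and audio feature columns described above; the function name `yearly_feature_means` and the four toy rows are made up for illustration, and this is not necessarily how the plot above was produced.

```python
import pandas as pd


def yearly_feature_means(df: pd.DataFrame, features: list) -> pd.DataFrame:
    """Average each audio feature per release year (index is the year)."""
    return df.groupby("year")[features].mean()


# Toy rows for illustration only; the real data comes from data.csv.
tracks = pd.DataFrame({
    "year": [1921, 1921, 2020, 2020],
    "energy": [0.2, 0.4, 0.7, 0.9],
    "acousticness": [0.9, 0.7, 0.2, 0.4],
})

trend = yearly_feature_means(tracks, ["energy", "acousticness"])
print(trend)
```

Calling `trend.plot()` on the result would draw one line per feature, provided matplotlib is installed.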

### Popularity Distribution
I took a look at the distribution of the popularity scores and noticed that the majority of the songs in this dataset are not very popular. Since I used this column to create a binary column for classification, I determined that a popularity score of 35 would be a good threshold.

<img src= 'IMAGES/pop_dist.png' width=1000>
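
The binary column can be derived from `popularity` in a single comparison. This is a minimal sketch under two assumptions that the text above does not settle: the new column is called `is_popular` (a hypothetical name), and a score of exactly 35 counts as popular.

```python
import pandas as pd

POPULARITY_THRESHOLD = 35  # threshold chosen from the distribution above


def add_popularity_label(df: pd.DataFrame,
                         threshold: int = POPULARITY_THRESHOLD) -> pd.DataFrame:
    """Return a copy of df with a 0/1 column marking popular tracks."""
    labeled = df.copy()
    # Assumption: scores equal to the threshold are labeled as popular.
    labeled["is_popular"] = (labeled["popularity"] >= threshold).astype(int)
    return labeled


# Toy scores for illustration only.
tracks = pd.DataFrame({"popularity": [0, 34, 35, 80]})
labeled = add_popularity_label(tracks)
print(labeled)
```

Checking `labeled["is_popular"].mean()` on the full dataset shows the share of tracks in the positive class, which is worth knowing before training a classifier.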

### Audio Features Distribution

I wanted to look at the distribution of each individual audio feature. Judging by the shape of some of these distributions, fitting a linear regression may be difficult.

<img src='IMAGES/af_dist.png'>

### Heatmap
Lastly, I wanted to take a look at the correlation between all the audio features to check for possible multicollinearity. It doesn't seem like many variables are very highly correlated.

<img src='IMAGES/heatmap.png'>
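
Alongside the heatmap, the same check can be done numerically by listing the feature pairs whose absolute Pearson correlation exceeds a cut-off. The sketch below is illustrative only: the function name `high_correlation_pairs`, the 0.8 cut-off, and the toy values are all assumptions rather than figures from this analysis.

```python
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """List (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()  # Pearson by default
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Toy values for illustration only; not taken from the dataset.
tracks = pd.DataFrame({
    "energy": [0.1, 0.4, 0.6, 0.9],
    "loudness": [-30.0, -18.0, -11.0, -4.0],
    "valence": [0.5, 0.2, 0.9, 0.4],
})

pairs = high_correlation_pairs(tracks)
print(pairs)
```

An empty list at the chosen cut-off would support the reading of the heatmap that multicollinearity is not a major concern.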

## EDA C