# __What Makes a Song Popular?__
<img src="https://drive.google.com/uc?id=1Ln3Fp-PUu_AUuKY1Wf_oSZ87uxLWAM67" width="600">

## Motivations
Our team is composed of music lovers: Travis plays the guitar, Bryan plays the drums, and Aidas is probably a great singer (TBD)... We're basically a band. This is what made the idea of analyzing musical data so intriguing. We all enjoy music, but it was the quantifiable data found within each wavelength that caught our attention.

## Objective
Through our analysis, we aim to identify the song features that contribute most to a **song's popularity**. In this project, we used the third-party library [Spotipy](https://spotipy.readthedocs.io/en/2.16.1/) to interact with the [Spotify API](https://developer.spotify.com/documentation/web-api/), which gives us access to Spotify's vast database of songs as well as the metadata tied to each song.
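
As a rough illustration of that workflow, the sketch below pulls tracks and a few of their audio features through a Spotipy client. The helper names (`make_client`, `fetch_rows`), the `year:` search query, and the selected columns are assumptions made for illustration, not the exact code used in this project. `make_client` expects your own Spotify client ID and secret.

```python
def make_client(client_id, client_secret):
    """Create a Spotipy client using the client-credentials flow."""
    # Imported here so the rest of the sketch can be read without Spotipy installed.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    auth = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
    return spotipy.Spotify(auth_manager=auth)


def fetch_rows(sp, year, limit=50):
    """Return one dict per track: basic metadata merged with its audio features."""
    # Search for tracks released in the given year (one page of results).
    page = sp.search(q=f"year:{year}", type="track", limit=limit)
    tracks = page["tracks"]["items"]
    if not tracks:
        return []

    # One batched call returns audio features in the same order as the ids.
    features = sp.audio_features([t["id"] for t in tracks])

    rows = []
    for track, feats in zip(tracks, features):
        row = {
            "id": track["id"],
            "artist": track["artists"][0]["name"],
            "track_name": track["name"],
            "popularity": track["popularity"],
            "year": year,
        }
        # Audio features can be missing for a track; those rows keep only metadata.
        if feats:
            row.update({k: feats[k] for k in ("danceability", "energy", "valence", "tempo")})
        rows.append(row)
    return rows
```

Rows that come back without audio features would show up as null values in the feature columns once the rows are loaded into a table.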

### Spotify's Popularity Metric

It is important to note that Spotify's popularity rating is based on the total number of plays a track has compared to other tracks, as well as how recent those plays are. Our data does not include the number of times a song has been played. Instead, our aim is to find ways to **predict** the popularity of a song without taking the number of plays into consideration.

# Data Dictionary

Field|Data type|Description
-------|---------|---------
__id__ | Integer | Primary key (unique ID for each song)
__artist_song__ | String | Composite primary key formed by concatenating the `artist` and `track_name` fields
__pop_category__ | Category | Categorical field used to bucket songs by popularity. Categories include: Low, Mid, High, Very High
__artist__ | String | Artist name
__track_name__ | String | Name of the song
__tempo__ | Integer | Tempo of the song in beats per minute
__time_signature__ | String | Time signature of the song
__year__ | Integer | Year the song was released
__duration_mins__ | String | Duration of the song in minutes
__popularity__ | Float | The popularity rating is based on the total number of plays compared to other tracks, as well as how recent those plays are. Measured as a value from 0 to 100.
__acousticness__ | Float | A confidence measure from 0 to 100 of whether the track is acoustic. 100 represents high confidence the track is acoustic.
__danceability__ | Float | Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0 is least danceable and 100 is most danceable.
__energy__ | Float | A perceptual measure of intensity and activity, from 0 to 100. Typically, energetic tracks feel fast, loud, and noisy.
__instrumentalness__ | Float | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 100, the greater the likelihood that the track contains no vocal content.
__liveness__ | Float | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 80 indicates a strong likelihood that the track is live.
__loudness__ | Float | Measure from 0 to 100 of how loud a song is, based on its loudness in decibels
__speechiness__ | Float | Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer the value is to 100. Values above 66 describe tracks that are probably made entirely of spoken words.
__valence__ | Float | Measure from 0 to 100 of the mood of a song (100 = happy, 0 = sad)

__Unused fields:__

Field|Description
------|------
__url__|URL of the Spotify song page
__track_href__|A link to the Web API endpoint providing full details of the track.
__type__|"audio_feature" tag
__uri__|Unique identifier for a song within the Spotify platform
__analysis_url__|An HTTP URL to access the full audio analysis of this track. An access token is required to access this data.
__key__|Musical key of the song
__mode__|Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
__duration_ms__|Duration of the song in milliseconds

# Raw Data Overview
------------

## Data Description

<img src="https://drive.google.com/uc?export=view&id=1qWFdUmFLy-Lvy0bHKvhA7La_6G4myRoA" width="800">

## Null Data

Field              |Missing Values
-------------------|---------------
artist             | 0
genre              | 0
id                 | 0
popularity         | 0
track_name         | 0
year               | 0
acousticness       | 2
analysis_url       | 2
danceability       | 2
duration_ms       | 2
energy             | 2
instrumentalness   | 2
key                | 2
liveness           | 2
loudness           | 2
mode               | 2
speechiness        | 2
tempo              | 2
time_signature     | 2
track_href         | 2
type               | 2
uri                | 2
valence            | 2

# Pre-processing
----------

<u>__Goals__</u>: 
- Remove/clean 
- Normalize  
- Transform 
- Apply categorical tags

## Data removal, normalization & index setting
In this step we dropped 7 fields that were deemed unnecessary for our analysis, removed rows with null data, generated a unique composite key (_Artist_ + _Track Name_), and normalized all song features to values between 0 and 100.
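
A minimal sketch of this step is shown below, assuming the data sits in a pandas DataFrame. The function name, the `" - "` separator in the composite key, and the use of min-max scaling are assumptions for illustration; the project may have rescaled features differently.

```python
import pandas as pd


def preprocess(df, drop_cols, feature_cols):
    """Drop unused fields and null rows, add a composite key, rescale features to 0-100."""
    out = df.drop(columns=drop_cols).dropna().copy()

    # Composite key: artist and track name joined into one label (separator is assumed).
    out["artist_song"] = out["artist"] + " - " + out["track_name"]
    out = out.set_index("artist_song")

    # Assumed min-max scaling: (x - min) / (max - min) * 100 maps each feature to 0-100.
    # A column with a single distinct value would divide by zero and needs special handling.
    for col in feature_cols:
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo) * 100
    return out
```

Calling `preprocess(raw_df, unused_fields, song_features)` with hypothetical lists of column names returns a new frame indexed by the composite key and leaves the raw data untouched.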

## Genre tagging
This step focuses on generalizing the genre tagging for each song to aid our analysis. The original genre tags pulled for each song were very specific and difficult to use, and each song carried several of them. Our approach was to pull only the first genre in each song's list and then apply the below code to generalize the tagging. By generalizing the genre tags for each song, we are able to run more efficient aggregations throughout our analysis.

--------------------------------------

<u>Example of raw pull for Genre</u>:

Song|Genre
----|----
ABC | ['pop','emo pop', '2000s pop']
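
To make the idea concrete, here is a small sketch of one way to collapse a raw genre list like the one above into a broad genre. The keyword mapping and the `"Other"` fallback are hypothetical and only illustrate the approach; the project's own mapping may differ.

```python
import ast

# Hypothetical keyword -> broad genre mapping, checked in order.
GENRE_KEYWORDS = [
    ("hip hop", "Hip Hop"),
    ("rap", "Hip Hop"),
    ("country", "Country"),
    ("reggae", "Reggae"),
    ("rock", "Rock"),
    ("pop", "Pop"),
]


def generalize_genre(raw):
    """Take the first tag from a raw genre list and collapse it to a broad genre."""
    # The raw value may arrive as a list or as the string representation of a list.
    tags = ast.literal_eval(raw) if isinstance(raw, str) else raw
    if not tags:
        return "Other"
    first = tags[0].lower()
    # The first matching keyword wins, so more specific keywords go earlier in the list.
    for keyword, broad in GENRE_KEYWORDS:
        if keyword in first:
            return broad
    return "Other"
```

With this mapping, the example row for song ABC would be tagged `Pop`, because only its first tag, `'pop'`, is considered.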





## Popularity classification
This step is focused on generating popularity classification tags with the following logic:
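```
# Bucket a 0-100 popularity score into an ordered category label.
# Chained comparisons keep each range explicit: 60 <= popularity < 75, and so on.
if popularity < 60:
  pop_category = '01 - low'
elif 60 <= popularity < 75:
  pop_category = '02 - mid'
elif 75 <= popularity < 90:
  pop_category = '03 - high'
elif popularity >= 90:
  pop_category = '04 - very high'
```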
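
Applied to a whole column at once, the same thresholds can be expressed with `np.select`, which the Machine Learning section names as the function used to create this field. The sketch below is one possible version; the function name `add_pop_category` is an assumption.

```python
import numpy as np
import pandas as pd


def add_pop_category(df):
    """Return a copy of df with a pop_category column (thresholds 60, 75, 90)."""
    pop = df["popularity"]
    conditions = [pop < 60, pop < 75, pop < 90]
    labels = ["01 - low", "02 - mid", "03 - high"]

    out = df.copy()
    # np.select uses the label of the first condition that is True for each row;
    # rows matching none of them (popularity >= 90) receive the default label.
    out["pop_category"] = np.select(conditions, labels, default="04 - very high")
    return out
```

The numeric prefixes in the labels keep the categories in the intended order when they are sorted alphabetically, for example on a chart axis.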

<img src="https://drive.google.com/uc?export=view&id=1BZuLQXOAkJG68rWcvvKuavaeP1hauKqD" width="450" height="300">

## Outlier Removal
In this section we use boxplots to examine the extent to which outliers are present in the dataset. After finding outliers, we implemented a cleaning method that caps them at a boundary value. A function was created with the following steps:
* Determine the 0.25 quantile (first quartile) of the data within a given column using np.quantile. We call this variable 'Q1'.
* Determine the 0.75 quantile (third quartile) of the data within a given column using np.quantile. We call this variable 'Q3'.
* Calculate the IQR as: IQR = Q3 - Q1
* Calculate an upper bound: UB = Q3 + IQR * 1.5
* Calculate a lower bound: LB = Q1 - IQR * 1.5
* Use np.where() to replace high and low outliers with the upper bound and lower bound, respectively.
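
The steps above can be sketched as a single function. The name `cap_outliers` is an assumption, and this version works on one column of values at a time.

```python
import numpy as np


def cap_outliers(values):
    """Replace values outside the 1.5 * IQR bounds with the nearest bound."""
    values = np.asarray(values, dtype=float)
    q1 = np.quantile(values, 0.25)
    q3 = np.quantile(values, 0.75)
    iqr = q3 - q1
    upper = q3 + iqr * 1.5
    lower = q1 - iqr * 1.5

    # Cap high outliers at the upper bound, then low outliers at the lower bound.
    capped = np.where(values > upper, upper, values)
    capped = np.where(capped < lower, lower, capped)
    return capped
```

For the values `[1, 2, 3, 4, 100]`, Q1 is 2 and Q3 is 4, so the upper bound is 7 and the value 100 is replaced by 7. Because outliers are capped and not dropped, this step does not change the row count.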


<img src="https://drive.google.com/uc?export=view&id=1Fw4Ba9boDJn4MX-zWK_Zdm6RNVMmVlt_" width="450" height="350">

<img src="https://drive.google.com/uc?export=view&id=1czUbTsYDlNvniHjFTHdX46V2b1o9N_sc" width="450" height="350">


## Post-Preprocessing Data Description

<img src="https://drive.google.com/uc?export=view&id=14L0nO-cEnZv4idUaqJuKs7L49h0SBRfP" width="800">



## Preprocessing Shape Impact

Field|Raw Data|Processed Data|Change
-----|---|-------|----------
Row Count|13,404|12,571|-833
Column Count|26|19|-7


# Begin Analysis
------------
## Test Correlations Between Features
1. __High__ correlation between a predictor and the target is __preferred__ - we use the correlation as a measure of _relatedness_ between the two:
  - If the absolute correlation is above 0.3 = __Good Correlation__
  - If the absolute correlation is above 0.5 = __Strong Correlation__


2. __Low__ correlation between a pair of predictors is __preferred__ - including a pair of highly correlated predictors in the same model causes _multicollinearity_, which can negatively impact model performance.
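
Both checks can be read off a single correlation matrix. The sketch below assumes a pandas DataFrame and Pearson correlation; the function name and the 0.5 cut-off used to flag collinear predictor pairs are assumptions.

```python
import pandas as pd


def correlation_report(df, target, predictor_threshold=0.3, collinear_threshold=0.5):
    """Return predictors related to the target, and predictor pairs related to each other."""
    # Only numeric columns can be correlated.
    corr = df.select_dtypes("number").corr()

    # Check 1: predictors whose absolute correlation with the target clears the threshold.
    with_target = corr[target].drop(target)
    useful = with_target[with_target.abs() > predictor_threshold]

    # Check 2: predictor pairs that are strongly correlated with each other.
    predictors = [c for c in corr.columns if c != target]
    collinear = [
        (a, b, corr.loc[a, b])
        for i, a in enumerate(predictors)
        for b in predictors[i + 1:]
        if abs(corr.loc[a, b]) > collinear_threshold
    ]
    return useful, collinear
```

A predictor that appears in `useful` but also in a `collinear` pair is a candidate for dropping one of the two features before modelling.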



<img src="https://drive.google.com/uc?export=view&id=1TL22YyzXk7XfU2z_X9wtNrqyP68p9EnF" width="800">

## Correlation of Features vs Popularity

<img src="https://drive.google.com/uc?export=view&id=1w_H_5EvVnDyGVq4OB2_Bnp8NjyETJV6_" width="250" height="300">

## Test correlation of relevant features against popularity

Feature|Correlation to Popularity
-------|-----------
Speechiness|Positive
Valence|Positive
Danceability|Positive
Liveness|Negative
Acousticness|Negative
duration_mins|Negative
Loudness|Neutral
Energy|Neutral
Tempo|Neutral





<img src="https://drive.google.com/uc?export=view&id=16b9GO2PK9K-oksaBaPJWJ2YBS3b55Gel" width="400" height="300">
<img src="https://drive.google.com/uc?export=view&id=1ONC5VI5PO6UxcFFunP8GNrtzzQ990I4u" width="400" height="300">
<img src="https://drive.google.com/uc?export=view&id=1e7L11_2iPZsERzSG--jjZFjB4pp907jV" width="400" height="300">
<img src="https://drive.google.com/uc?export=view&id=19Y-z8eyDHDnfTuKnXDc_tmHXNdc3PXwS" width="400" height="300">
<img src="https://drive.google.com/uc?export=view&id=1uYXiBcsX2ZaRYTH9_0py5svagM7RPE3B" width="400" height="300">
<img src="https://drive.google.com/uc?export=view&id=19MGCK0SG-nLW4UZCBZ4auy-FbkdR3i5k" width="400" height="300">
<img src="https://drive.google.com/uc?export=view&id=1MI7y_4ob69FqR42ZyCTQNSSAjre54O7K" width="400" height="300">
<img src="https://drive.google.com/uc?export=view&id=1o3VK6TmyYYPX7duvwLFI-PGkRu1_iOSu" width="400" height="300">
<img src="https://drive.google.com/uc?export=view&id=1e5FrpZjeh_x2A0TZ5L4fKnWoS_05-ePV" width="400" height="300">


# Feature Correlations by Genre

After our initial analysis of how popularity correlates with the various features within our dataset, we decided to go one dimension deeper and look at the same correlations aggregated by genres. 

Our hypothesis was that some features would correlate more strongly with popularity when songs are compared only to similar types of music. For example, we thought that Rock or Country songs would show a strong correlation between popularity and instrumentalness.

The only notable changes in correlation were seen within the Reggae genre, but this genre contains only 70 samples in our data set, while other genres hold more than 2,000 samples. With so few samples, those differences are not reliable. This brings us to the conclusion that our initial correlation analysis is not improved by grouping our data by genre.
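
A per-genre breakdown of this kind can be sketched as follows, assuming a pandas DataFrame with `genre` and `popularity` columns. The function name and the `min_samples` cut-off are assumptions; reporting the sample count next to each correlation makes small genres easy to spot.

```python
import pandas as pd


def popularity_corr_by_genre(df, features, min_samples=100):
    """Correlate each feature with popularity within every genre, with sample counts."""
    rows = {}
    for genre, group in df.groupby("genre"):
        # Skip genres too small to give a stable correlation estimate.
        if len(group) < min_samples:
            continue
        corr = group[features].corrwith(group["popularity"])
        corr["n_samples"] = len(group)
        rows[genre] = corr
    # One row per genre, one column per feature plus the sample count.
    return pd.DataFrame(rows).T
```

With the default cut-off, a genre holding 70 samples would be left out of the table instead of being compared against genres with thousands of songs.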




<img src="https://drive.google.com/uc?export=view&id=1_v8VLIYjVJoSvbN0APMFpFfTJKF7eaz4" width="500" height="400">


<img src="https://drive.google.com/uc?export=view&id=1jh8JJ-tkqxsgNIvDaJaQSICaVmuyZr-p" width="600" height="500">


# Machine Learning

The 3 features most highly correlated with popularity were selected and exported to a CSV file for machine learning. There were two targets. For the prediction algorithms, the continuous feature, popularity, was targeted. For the classification algorithm, the popularity category feature, which was created using np.select, was targeted.
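
The selection and export could look roughly like the sketch below. It assumes a pandas DataFrame that already holds the `popularity` and `pop_category` columns, and ranks features by absolute Pearson correlation; the function name and output path are hypothetical.

```python
import pandas as pd


def export_top_features(df, path, n=3):
    """Write the n features most correlated with popularity, plus both targets, to CSV."""
    numeric = df.select_dtypes("number")
    # Rank by absolute correlation so strong negative relationships count too.
    strength = numeric.corr()["popularity"].drop("popularity").abs()
    top = strength.sort_values(ascending=False).head(n).index.tolist()

    # Keep the continuous target and the categorical target alongside the predictors.
    df[top + ["popularity", "pop_category"]].to_csv(path, index=False)
    return top
```

The returned list records which features were exported, which is useful when interpreting the model results later.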

The following machine learning widget was used:

[Wolfram Machine Learning for Small Datasets](https://www.wolframcloud.com/obj/philip/ai)

Several prediction algorithms were utilized, and the results were as follows:

## Performance Matrix - Prediction
| Algorithm | Mean Avg Dev (MAD) |
| -------------- | -------------- |
| Linear Regression | 6.63 |
| Random Forest | 7.01 |
| Grad Boosted Tree | 7.65 |
| Decision Tree | 7.74 |
| Nearest Neighbors | 7.79 |
| Neural Network | 7.80 |
| Naive (Average) | 8.06 |
| Gaussian Process | 8.14 |
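
To show how a score in this table can be read, the sketch below assumes MAD is the mean absolute difference between predicted and actual popularity, and that the Naive (Average) row is a model that always predicts the average popularity. Both function names are hypothetical, and the widget may compute its metric differently.

```python
import numpy as np


def mean_abs_deviation(actual, predicted):
    """Mean of |actual - predicted|, in popularity points."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


def naive_baseline_mad(train_target, test_target):
    """Score a model that always predicts the training-set average."""
    guess = np.mean(train_target)
    return mean_abs_deviation(test_target, np.full(len(test_target), guess))
```

Under these assumptions, a lower MAD means predictions land closer to the true popularity, and any algorithm scoring worse than the naive baseline adds no value over guessing the average.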

<img src="https://drive.google.com/uc?export=view&id=1Zg46IHQPLqEmQpJqnpadLgPemOWvDiHv" width="600" height="300">

### Machine Learning Performance over iterations
In this section we look at how each algorithm "learns" as it iterates through the dataset. This is shown by plotting the running MAD against the iteration.

<img src="https://drive.google.com/uc?export=view&id=14vnzylp_I42yOobj9skPSiHcp043LxsF" width="500" height="400">

## Classification Results

Additionally, a classification algorithm was used to predict the popularity category (very high, high, mid, low). The decision tree algorithm was used, and it predicted correctly __80.6%__ of the time. Below you can see the prediction accuracy plotted against the iteration.

<img src="https://drive.google.com/uc?export=view&id=1kgsys0sH3MohgS591jTgr45tr_1kTO02" width="500" height="400">

# Challenges
**Pulling Data** - Since we chose to build our own data set instead of using a pre-made one, we spent some time learning how to use the Spotify API, adapting the functions within the Spotipy library to our needs, and understanding the features that we have access to.

**Correlation Analysis** - Analyzing the correlation between each pair of features can be a time-consuming task, and the results are hard to present when the analysis is spread across many cells in the notebook. We overcame this challenge by using a correlation table that aggregates all of the findings into one image.



# Conclusion
------------------

From our analysis, we can draw certain general conclusions (in addition to our machine learning findings). By focusing our study on the various quantifiable features of songs, we can conclude that:

Higher values of danceability, energy, and loudness translate to songs being perceived as "happier" (high Valence values). From our analysis, we also saw that the most popular songs have _high_ Valence values compared to the rest, which allows us to make the observation that __up-beat danceable songs that make people feel happy have the highest chance of becoming popular__.

- The __pop genre__ usually has these characteristics, which is consistent with our result showing pop as the genre with the highest average popularity rating.

## __Secret Formula:__
__IF -->__ <img src="https://drive.google.com/uc?export=view&id=1FDQeAbmtVOe3JJBAU4RGbTEjbSTzLEdW" width="200" height="175"> __THEN -->__  <img src="https://drive.google.com/uc?export=view&id=1viPiBsCpMepbHmP-jbJKlRTmASUWzWQW" width="200" height="175">



## Potential Next Steps
Building off our general conclusion, we could expand the analysis by examining the _key_ of each song in relation to Valence values. This could be a valuable addition because songs in "major" keys tend to have a happier feeling, and vice versa for songs in "minor" keys.

After further understanding the implications behind Valence values, we could compare songs with high valence values against those with low valence values in terms of how much radio time they get, how many playlists they are included on, and how often they get played at parties.

These extra steps could help solidify our understanding of why happier songs tend to have higher popularity ratings.



# Analysis Links
- [Spotify API](https://drive.google.com/open?id=1O_8MlMrTGuMUKN-UABRfWFISvj_50FTe)
- [Data Analysis](https://drive.google.com/open?id=1M46CWkHCB4BlqqIhrtbFN3bfX3ieMiHk)


# Sources
- Spotify API documentation: https://developer.spotify.com/documentation/web-api/
- Spotipy documentation: https://spotipy.readthedocs.io/en/2.16.1/
