# Jupyter Notebook - The Outliers

## [View on GitHub](https://github.com/edgeslab/cs418-project-the-outliers/blob/master/spotify-analysis/SpotifyAnalysis.ipynb)

## [Demo](https://spotify-outliers.appspot.com)

## Authors

Siham Hussein: shusse6@uic.edu, seehamrun@

Fatima Qarni: fqarni2@uic.edu, qarni@

Zaynab Almoujahed: zalmou2@uic.edu, zalmoujahed@ 

Amit Panthi: apanth2@uic.edu, apanth2@

David Qiao: dqiao4@uic.edu, chowsterr@


## Project Overview and Introduction

Music is a highly impactful medium that affects our daily lives. It engages the senses, evokes emotions, and is key to human growth. However, not every song is created equal: different situations call for different soundtracks.

The goal of the project is to create a Spotify playlist analyzer. We analyze Spotify's top-ranking playlists to see which musical qualities are most prevalent. The analyzer should both show trends among playlists over the years and across countries, and predict which year or country a user's playlist is closest to.

## Data Overview 

Spotify provides us with playlists that contain the top tracks of the year (auto or user generated) per time period/region. Along with this, Spotify also keeps a number of stats pertaining to each song (danceability, loudness, energy, etc). Concatenating these stats will be the basis of our study.

### Data Sources 

We used the Spotify API to gather information about the top 50 playlists from 60+ countries, as well as the global yearly top hits playlists for various years.

## Data Cleaning

We obtained the data via the Spotify Web API, which uses OAuth2 to authorize requests. We then used several endpoints to get information about the playlist tracks. All the code for retrieving this data is in [spotify_api.py](spotify_api.py).

Obtaining a playlist involved six steps:
1. Obtain a request token.
2. Find the playlist IDs of the playlists we were interested in via the Spotify web application.
3. Use the Spotify API to obtain the overall Playlist Object (`spotify_api.get_playlist()`).
4. Parse the Playlist Object to extract its tracks, retrieving more tracks if the playlist is large (`spotify_api.get_playlist_tracks`).
5. For each track from the above, call the Spotify API to obtain the "track features" (`spotify_api.get_audio_features`).
6. While getting the track features, call the Spotify API again to retrieve the song's artist, and once more to retrieve that artist's genre. This was the only way to get a song's genre, since it is not available otherwise.

These steps are wrapped into one function, `spotify_api.get_playlist_audio_features()`, which takes the playlist ID and an access token (populated from `spotify_api.get_access_token()`) and returns a list of those features. The list is then written to a CSV file using `spotify_api.export_to_csv`.
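
To make the paging and export steps concrete, here is a minimal sketch of how they could fit together. It is not the contents of `spotify_api.py`: the names `collect_track_ids`, `batched`, and `write_features_csv` are hypothetical, and `fetch_page` is an assumed callable standing in for an authenticated GET request. The sketch relies on two properties of the Spotify Web API: paging objects carry `items` and a `next` URL, and the audio features endpoint accepts at most 100 track IDs per call.

```python
import csv
import io

AUDIO_FEATURES_BATCH = 100  # the audio-features endpoint accepts at most 100 IDs per call


def collect_track_ids(first_page, fetch_page):
    """Follow a paging object's `next` links and return every track ID found."""
    track_ids = []
    page = first_page
    while page is not None:
        for item in page["items"]:
            track = item.get("track")
            if track and track.get("id"):  # skip removed or local tracks
                track_ids.append(track["id"])
        next_url = page.get("next")
        page = fetch_page(next_url) if next_url else None
    return track_ids


def batched(ids, size=AUDIO_FEATURES_BATCH):
    """Yield successive slices of `ids`, each small enough for one request."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def write_features_csv(rows, handle):
    """Write a list of feature dicts to an open text handle as CSV."""
    fieldnames = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Fake pages stand in for API responses so the sketch runs offline.
later_pages = {
    "page-2": {"items": [{"track": {"id": "c"}}, {"track": None}], "next": None},
}
first = {"items": [{"track": {"id": "a"}}, {"track": {"id": "b"}}], "next": "page-2"}

ids = collect_track_ids(first, later_pages.get)
batch_sizes = [len(batch) for batch in batched(list(range(250)))]

out = io.StringIO()
write_features_csv([{"id": "a", "energy": 0.578, "danceability": 0.735}], out)
```

Passing the fetch function in as an argument keeps the paging logic independent of how the token is obtained, which also makes it easy to exercise without network access.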


## Exploratory Data Analysis

<u> __Structure__ </u>: The data we have is structured around the audio features of each track. For each track we can obtain the following object:

```
{
  "duration_ms" : 255349,
  "key" : 5,
  "mode" : 0,
  "time_signature" : 4,
  "acousticness" : 0.514,
  "danceability" : 0.735,
  "energy" : 0.578,
  "instrumentalness" : 0.0902,
  "liveness" : 0.159,
  "loudness" : -11.840,
  "speechiness" : 0.0461,
  "valence" : 0.624,
  "tempo" : 98.002,
  "id" : "06AKEBrKUckW0KREUWRnvT",
  "uri" : "spotify:track:06AKEBrKUckW0KREUWRnvT",
  "track_href" : "https://api.spotify.com/v1/tracks/06AKEBrKUckW0KREUWRnvT",
  "analysis_url" : "https://api.spotify.com/v1/audio-analysis/06AKEBrKUckW0KREUWRnvT",
  "type" : "audio_features"
}
```

This data tells us the different features of a track that we can use to perform categorization later on. Given a playlist, we have written several API Access functions that will retrieve these audio features for all of the tracks on the playlist.

<u> __Scope__ </u>: Spotify has thousands of automatically generated playlists and even more user-created ones. We are limiting the scope of this project to tracks from the "Top Hits" playlists for different countries and tracks from Spotify's "Yearly Hits" playlists. The yearly hits are the global hits for each year that we have data for.

<u> __Granularity__ </u>: The data we have available is at per-track granularity. While we do have the data grouped by countries and years, the underlying CSVs all contain the same information per track. We could obtain more data about an individual song, such as its [audio analysis](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-analysis/), but that is out of scope for now.

<u> __Temporality__</u>: The data describes each individual track. As the underlying Spotify API learns, some of these values may change, but that is really only the case for newer songs; the features of older songs are not expected to change much. The only sense of "time" that we have is when a track was first introduced. There are no specific "timestamps" in the data.

<u> __Faithfulness__</u>: According to PTDS, a dataset is "faithful" if we believe it accurately captures reality. The data that we are looking at consists mostly of scores from 0 to 1 for various musical features, alongside a few values on their own scales, such as loudness and tempo. None of these values are entered manually or have other dependencies; they are provided by Spotify from their internal heuristics and analysis tools, so we believe this data is as faithful as we can get.

<u> __Hypothesis__</u>: We started off comparing general trends of features across countries, but that proved uninformative: we were focusing on just two specific terms, and it wasn't providing any meaningful insights. However, when we looked at how features correlate with each other, some obvious trends emerged country-wide. More explanations are below each plot.
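
As a rough illustration of what "how features correlate with each other" can mean in pandas, the sketch below computes per-year feature means and a feature-to-feature correlation matrix. The rows are made-up toy values standing in for the real CSV, and the variable names are assumptions for this example only.

```python
import pandas as pd

# Toy rows standing in for the real per-track data; the values are made up.
tracks = pd.DataFrame({
    "year": [1920, 1920, 1980, 1980, 2018, 2018],
    "danceability": [0.60, 0.70, 0.50, 0.55, 0.70, 0.80],
    "energy": [0.20, 0.30, 0.60, 0.70, 0.75, 0.80],
    "acousticness": [0.95, 0.90, 0.30, 0.25, 0.15, 0.10],
})

features = ["danceability", "energy", "acousticness"]

# Average each feature per year: the quantity a yearly trend plot would show.
yearly_means = tracks.groupby("year")[features].mean()

# Pearson correlation between every pair of features across all tracks.
correlations = tracks[features].corr()

print(yearly_means)
print(correlations.round(2))
```

With real data, a strongly negative entry in `correlations` (for instance between energy and acousticness) would point at a pair of features worth plotting together.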

In [9]:

```
import visualization
import pandas as pd

all_data = visualization.get_csv('topTracksYearsCSV/AllYearsTopTracks.csv')
df = pd.DataFrame(all_data)
df.head()
```

| Unnamed: 0.1 | Unnamed: 0 | acousticness | analysis_url | danceability | duration_ms | energy | genre | id | instrumentalness | key | ... | loudness | mode | speechiness | tempo | time_signature | track_href | type | uri | valence | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.191 | https://api.spotify.com/v1/audio-analysis/2dpa... | 0.687 | 214290 | 0.792 | brostep | 2dpaYNEQHiRxtZbfNsse99 | 0.0 | 5 | ... | -2.749 | 1 | 0.0452 | 100.015 | 4 | https://api.spotify.com/v1/tracks/2dpaYNEQHiRx... | audio_features | spotify:track:2dpaYNEQHiRxtZbfNsse99 | 0.671 | 2018 |
| 1 | 1 | 0.153 | https://api.spotify.com/v1/audio-analysis/4w8n... | 0.841 | 212500 | 0.798 | dance pop | 4w8niZpiMy6qz1mntFA5uM | 3e-06 | 1 | ... | -4.206 | 0 | 0.229 | 95.948 | 4 | https://api.spotify.com/v1/tracks/4w8niZpiMy6q... | audio_features | spotify:track:4w8niZpiMy6qz1mntFA5uM | 0.591 | 2018 |
| 2 | 2 | 0.354 | https://api.spotify.com/v1/audio-analysis/7dt6... | 0.68 | 231267 | 0.563 | pop | 7dt6x5M1jzdTEt8oCbisTK | 0.0 | 10 | ... | -5.843 | 1 | 0.0454 | 145.028 | 4 | https://api.spotify.com/v1/tracks/7dt6x5M1jzdT... | audio_features | spotify:track:7dt6x5M1jzdTEt8oCbisTK | 0.374 | 2018 |
| 3 | 3 | 0.00513 | https://api.spotify.com/v1/audio-analysis/2xLM... | 0.834 | 312820 | 0.73 | pop | 2xLMifQCjDGFmkHkpNLD9h | 0.0 | 8 | ... | -3.714 | 1 | 0.222 | 155.008 | 4 | https://api.spotify.com/v1/tracks/2xLMifQCjDGF... | audio_features | spotify:track:2xLMifQCjDGFmkHkpNLD9h | 0.446 | 2018 |
| 4 | 4 | 0.934 | https://api.spotify.com/v1/audio-analysis/0u2P... | 0.351 | 200186 | 0.296 | electropop | 0u2P5u6lvoDfwTYjAADbn4 | 0.0 | 4 | ... | -10.109 | 0 | 0.0333 | 115.284 | 4 | https://api.spotify.com/v1/tracks/0u2P5u6lvoDf... | audio_features | spotify:track:0u2P5u6lvoDfwTYjAADbn4 | 0.12 | 2018 |


## Visualizations
_Note: all our visualization code is in the visualization.py file, but all of the plots below are linked to the corresponding pages in our Data Studio project. This allows us more flexibility in the types of plots we show and more interactivity. The project is located at:_ [Spotify Outliers Visualizations Datastudio](https://datastudio.google.com/u/0/reporting/1gd-SVGexyjQwbE6JHMxl_oK0pAM07sBZ/page/ZNEo). _There are 4 pages (the navigation bar is under the title)._

## Energy and Danceability 

![energy_danceability_year_datastudio](images/energy_danceability_year_datastudio.png)

Our initial idea was that energy and danceability are linked: as songs become more danceable, they should have higher energy. This turned out to be mostly true; however, the average for one was always slightly lower than the average for the other. What was interesting was that between the 1920s and 2000s, energy increased as danceability decreased. Even more interesting was how distinct the music of the Roaring Twenties was from the other decades in terms of these features, which is easier to see in our dot plot.

For higher resolution and interactivity, see this chart on [Google Data Studio](https://datastudio.google.com/u/0/reporting/1gd-SVGexyjQwbE6JHMxl_oK0pAM07sBZ/page/ZNEo). _NOTE: We used Data Studio for easier visualizations; however, the same plot can be produced in pandas by calling this function:_

In [None]:

```
import visualization
visualization.create_yearly_pointplot('danceability', 'energy')
```

## Acousticness VS Energy

![acousticness_energy_year_datastudio](images/acousticness_energy_year_datastudio.png)

When we plotted acousticness, we could immediately see an interesting drop from the 1920s to the 1950s and beyond. This got us thinking about what happened during that time. By the early 1960s, rock and roll was overtaking the jazz and blues scene of the 1920s in popularity, with rap following in later decades. When we plotted energy on the same graph, the trend was even clearer. Acoustic music isn't considered high energy, and the population's interests were shifting toward tracks that are more danceable and have higher energy, which shows up as acousticness going down over the years.

For higher resolution and interactivity, see this chart on [Google Data Studio](https://datastudio.google.com/u/0/reporting/1gd-SVGexyjQwbE6JHMxl_oK0pAM07sBZ/page/A5xo). _NOTE: We used Data Studio for easier visualizations; however, the same plot can be produced in pandas by calling this function:_

In [None]:

```
import visualization
visualization.create_yearly_pointplot('acousticness', 'energy')
```

## Distinct Genres Per Year

![distinct_genres_year_datastudio](images/distinct_genres_year_datastudio.png)
_For interaction, view on_ [Google Data Studio](https://datastudio.google.com/u/0/reporting/1gd-SVGexyjQwbE6JHMxl_oK0pAM07sBZ/page/7Cyo).

We had originally thought we would see some sort of trend in the genres that were listened to over the years. However, we saw that the number of genres the population listened to remained quite large. In fact, when we looked into this, it turned out that most of the genre labels were [Adult Standards](https://en.wikipedia.org/wiki/Adult_standards), which looks to be a generic genre that encompasses popular music aimed at an older audience. Drilling down into the top genres per year, however, we see that Adult Standards is popular only from the 1920s to the 1960s. Rock starts climbing in the 1960s and remains the top genre until the year 2000, when pop starts climbing and various forms of pop music appear in the list of top genres.

Also, if you drill down into the actual distribution of the genres per year, you can see that some, like 'contemporary country', have shifted a lot in popularity. Country music only appears in the data from 2000 onward and has since had its ups and downs. In addition, we can see that the Adult Standards genre has become less and less popular over the years and disappears from the list in 2010. For more interesting findings, visit our [Songs Per Genre interactive visualization on Google Data Studio](https://datastudio.google.com/u/0/reporting/1gd-SVGexyjQwbE6JHMxl_oK0pAM07sBZ/page/Mjyo).

## Machine Learning Analysis

For this portion of the project, we created two different classifiers: one for year prediction and one for genre prediction.

To do the classification, we took the AllYearsTopTracks CSV file, which contains the top tracks for the years that we analyzed, along with a column containing the year each song was from. 80% of this data was randomly selected as training data and the remaining 20% was used as testing data. We then ran each of the different types of classifiers, compared their accuracy, and selected the most accurate one.

The features that we used in analysis were: acousticness, danceability, energy, genre, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, valence, and year.
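
The 80/20 split and classifier comparison described above could look roughly like the following scikit-learn sketch. It runs on synthetic data from `make_classification` rather than the project's CSV, and the set of candidates is an assumption: only SVC and Random Forest are named in this report, and k-NN is added purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the audio-feature matrix and its labels.
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=6, n_classes=3, random_state=0
)

# Randomly hold out 20% of the rows for testing, train on the other 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate models; k-NN is an illustrative extra, not taken from the report.
candidates = {
    "SVC": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(),
}

# Fit each model and record its accuracy on the held-out rows.
accuracy = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in candidates.items()
}
best = max(accuracy, key=accuracy.get)
print(accuracy, best)
```

Fixing `random_state` makes the split repeatable, so accuracy differences between runs reflect the models rather than a different random hold-out set.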

### Year Prediction

![year_accuracy](images/year_accuracy.png)

From this analysis, we found that SVC was the best classifier for the problem. However, its accuracy is still only about 43%, which was reflected in the results.

In [3]:

```
import classification
import GetTopYearly

year = classification.predict_playlist_year("37i9dQZEVXcIEsENDsr0sr")
print("The predicted year for the Siham's Discover Playlist was: " + str(year))
```

```
The predicted year for the Siham's Discover Playlist was: 2000
```


We created a classifier to predict the year in which most of the songs in a playlist were popular.
We chose a playlist of 2018 top tracks, and the predicted year was 2010.
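
One way to turn per-track predictions into a single playlist year is a majority vote. The sketch below assumes that approach; it is not necessarily how `classification.predict_playlist_year` works, and `most_common_year` is a hypothetical helper.

```python
from collections import Counter


def most_common_year(track_year_predictions):
    """Return the year predicted for the largest number of tracks."""
    if not track_year_predictions:
        raise ValueError("playlist has no tracks to vote with")
    counts = Counter(track_year_predictions)
    # most_common(1) returns a list with one (year, count) pair.
    year, _ = counts.most_common(1)[0]
    return year


# Hypothetical per-track predictions for a six-track playlist.
per_track = [2010, 2018, 2010, 2000, 2010, 2018]
playlist_year = most_common_year(per_track)
print(playlist_year)  # 2010
```

With a per-track classifier that is right less than half the time, a vote like this can still land on a year that few of the tracks actually come from, which is consistent with the mismatch noted above.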

### Genre Prediction

![genre_accuracy](images/genre_accuracy.jpg)

We found that Random Forest was the best classifier for genres, with a low accuracy of 35%, but as seen below, it ended up being more accurate than we assumed it would be.

We ran the classifier on some songs to see what the prediction would be:

In [5]:

```
genre = classification.predict_user_song_genre("7FGq80cy8juXBCD2nrqdWU")

print("The predicted genre for Halsey's Eastside (which is a pop song) was: " + str(genre))
```

```
The predicted genre for Halsey's Eastside (which is a pop song) was: dance pop
```

We found that even though the accuracy score for genres was lower, the genre predictions still fell in the same family of genres. For example, pop and dance pop are very similar.

## Results

From these ML analyses, we learned that each genre has its own characteristic levels of the features we looked at, which is what makes those features significant.
We also learned that the year classifier wasn't as accurate because the songs for each year weren't as distinctive as we thought they would be. A couple of years are very distinctive, but there isn't as much variability in the 2000s as we expected.

Our ML prediction tools allow a user to see how their playlists and songs compare to other songs by predicting the year or genre they belong to.

We also found several interesting trends over the course of the project in how the different features we looked at correlate with each other.