In [None]:
#Run this cell to import the necessary libraries
import pandas as pd
import numpy as np
from datascience import *

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Project 2: Spotify

## The Data Science Life Cycle - Table of Contents

<a href='#section 0'>Background Knowledge</a>

<a href='#subsection 1a'>Formulating a question or problem</a> 

<a href='#subsection 1b'>Acquiring and preparing data</a>

<a href='#subsection 1c'>Conducting exploratory data analysis</a>

<a href='#subsection 1d'>Using prediction and inference to draw conclusions</a>
<br><br>

### Background Knowledge <a id='section 0'></a>


If you listen to music, chances are you use Spotify, Apple Music, or a similar streaming service. In this new era of the music industry, streaming services curate playlists, recommend new artists, and measure success by the number of streams rather than the number of albums sold. The way these streaming services do this is (you guessed it) data!

Spotify, like many other companies, hires many full-time data scientists to analyze all the incoming user data and use it to make predictions and recommendations for users. If you're interested, feel free to check out [Spotify's Engineering Page](https://engineering.atspotify.com/) for more information!

<img src="images/spotify.png" width = 700/>

<center><a href=https://hrblog.spotify.com/2018/02/08/amping-up-diversity-inclusion-at-spotify/>Image Reference</a></center> 

# The Data Science Life Cycle <a id='section 1'></a>

## Formulating a Question or Problem <a id='subsection 1a'></a>
It is important to ask questions that will be informative and can be answered using the data. There are many different questions we could ask about music data. For example, there are many artists who want to find out how to get their music on Spotify's Discover Weekly playlist in order to gain exposure. Similarly, users love to see their *Spotify Wrapped* listening reports at the end of each year.

<div class="alert alert-warning">
<b>Question:</b> Recall the questions you developed with your group on Tuesday. Write down that question below, and try to add on to it with the context from the articles from Wednesday. Think about what data you would need to answer your question. You can review the articles on the bCourses page under Module 4.3.
   </div>

**Original Question(s):** *here*


**Updated Question(s):** *here*


**Data you would need:** *here*


## Acquiring and Cleaning Data <a id='subsection 1b'></a>

We'll be looking at song data from Spotify. You can find the raw data [here](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-01-21). We've cleaned up the datasets a bit, and we will be investigating the popularity and the qualities of songs from this dataset.

The following table, `spotify`, contains a list of tracks identified by their unique song ID along with attributes about that track.

Here are the descriptions of the columns for your reference. (We will not be using all of these fields):

|Variable Name   | Description |
|--------------|------------|
|`track_id` | Song unique ID |
|`track_name` | Song Name |
|`track_artist`| Song Artist |
|`track_popularity` | Song Popularity (0-100) where higher is better |
|`track_album_id`| Album unique ID |
|`track_album_name` | Song album name |
|`track_album_release_date`| Date when album released |
|`playlist_name`| Name of playlist |
|`playlist_id`| Playlist ID |
|`playlist_genre`| Playlist genre |
|`playlist_subgenre`| Playlist subgenre |
|`danceability`| Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
|`energy`| Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
|`key`| The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
|`loudness`|  The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
|`mode`|  Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
|`speechiness`|  Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
|`acousticness`|  A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
|`instrumentalness`| Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
|`liveness`| Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
|`valence`| A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
|`tempo`| The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
|`duration_ms`| Duration of song in milliseconds |
|`creation_year`| Year when album was released |




In [None]:
spotify = Table.read_table('data/spotify.csv')
spotify.show(10)

<div class="alert alert-info">
<b>Question:</b> It's important to evaluate our data source. What do you know about the source? What motivations do they have for collecting this data? What data is missing?
   </div>

*Insert answer here*

<div class="alert alert-info">
<b>Question:</b> Do you see any missing (nan) values? Why might they be there?
   </div>

*Insert answer here*

<div class="alert alert-info">
<b>Question:</b> We want to learn more about the dataset. First, how many total rows are in this table? What does each row represent?
    
   </div>

In [None]:
total_rows = ...
total_rows

*Insert answer here*

## Conducting Exploratory Data Analysis <a id='subsection 1c'></a>

Visualizations help us to understand what the dataset is telling us. We will be using bar charts, scatter plots, and line plots to try to answer questions like the following:
> What audio features make a song popular and which artists have these songs? How have features changed over time?

### Part 1: We'll start by looking at the length of songs using the `duration_ms` column.

Right now, the `duration_ms` column contains the length of each song in milliseconds. However, that's not a common measurement when describing the length of a song; more often, we use minutes and seconds. Using array arithmetic, we can find the length of each song in seconds and in minutes. There are 1000 milliseconds in a second, and 60 seconds in a minute. First, we will convert milliseconds to seconds.
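
As a minimal sketch of the array arithmetic described above, here is the same conversion on a small made-up array rather than the real `duration_ms` column. The values and the `example_` names are hypothetical and are not part of the assignment.

```python
import numpy as np

# Hypothetical song lengths in milliseconds (not taken from the dataset).
example_ms = np.array([180000, 240000, 210000])

# Dividing an array by a number divides every element of the array.
example_seconds = example_ms / 1000      # 1000 milliseconds per second
example_minutes = example_seconds / 60   # 60 seconds per minute

print(example_seconds)
print(example_minutes)
```

The same pattern applies to an array of any length, which is why no loop is needed.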


In [None]:
#Access the duration column as an array.
duration = ...
duration

In [None]:
#Divide the milliseconds by 1000
duration_seconds = ...
duration_seconds

In [None]:
#Now convert duration_seconds to minutes.
duration_minutes = ...
duration_minutes 

<div class="alert alert-info">
<b>Question:</b> How would we find the average duration (in minutes) of the songs in this dataset?
   </div>

In [None]:
avg_song_length_mins = ...
avg_song_length_mins

Now, we can add in the duration for each song (in minutes) by adding a column to our `spotify` table called `duration_min`. Run the following cell to do so.

In [None]:
#This cell will add the duration in minutes column we just created to our dataset.
spotify = spotify.with_columns('duration_min', duration_minutes)
spotify

### Artist Comparison

Let's see if we can find any meaningful difference in average song length between different artists.

<div class="alert alert-success">
    <b>Note: </b>Now that we have the duration of each song in minutes, you can compare the average song length between two artists. Below is an example!
   </div>

In [None]:
sam_smith = spotify.where("track_artist", are.equal_to("Sam Smith"))
sam_smith

In [None]:
sam_smith_mean = ...
sam_smith_mean

In [None]:
#In this cell, choose an artist you want to look at.
artist_name = ...
artist_name

In [None]:
#In this cell, choose another artist you want to compare it to.
artist_name_2 = ...
artist_name_2

This exercise was just one example of how you can play around with data and answer questions.

### Top Genres and Artists
In this section, we are interested in the categorical information in our dataset, such as the playlist each song comes from or the genre. There are almost 33,000 songs in our dataset, so let's do some investigating. What are the most popular genres? We can figure this out by grouping by the playlist genre.
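
Conceptually, grouping by a column and counting amounts to tallying how many rows share each unique value. The sketch below illustrates that idea with Python's standard `Counter` on a small made-up list. It is not how the `datascience` library implements `group`, and the genre labels are hypothetical.

```python
from collections import Counter

# Hypothetical genre labels, one per song (not taken from the dataset).
example_genres = ["pop", "rock", "pop", "rap", "pop", "rock"]

# Tally how many times each unique label appears.
example_counts = Counter(example_genres)

# most_common() lists the labels from most to least frequent.
print(example_counts.most_common())
```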

<div class="alert alert-info">
<b>Question:</b> How can we group our data by unique genres?
   </div>

In [None]:
genre_counts = spotify.group('...')
genre_counts

<div class="alert alert-info">
<b>Question:</b> In our dataset, it looks like the most popular genre is EDM. Make a bar chart below to show how the other genres compare.
   </div>

In [None]:
genre_counts.barh('...', '...')

Notice that it was difficult to analyze the above bar chart because the data wasn't sorted first. Let's sort our data and make a new bar chart so that it is much easier to make comparisons.

In [None]:
genre_counts_sorted = genre_counts.sort('...', descending = ...)
genre_counts_sorted

In [None]:
genre_counts_sorted.barh('...', '...')

<div class="alert alert-info">
<b>Question:</b> Was this what you expected? Which genre did you think would be the most popular?
   </div>

*Insert answer here.*

<div class="alert alert-info">
<b>Question:</b> Let's take a look at all the artists in the dataset. We can take a look at the top 25 artists based on the number of songs they have in our dataset. We'll follow a similar method as we did when grouping by genre above. First, we will group our data by artist and sort by count.
   </div>

In [None]:
#Here, we will group and sort in the same line.

artists_grouped = spotify.group('...').sort('...', ...)
artists_grouped

In [None]:
top_artists = artists_grouped.take(np.arange(0, 25))
top_artists

In [None]:
top_artists.barh('track_artist', '...')

<div class="alert alert-info">
<b>Question:</b> What do you notice about the top 25 artists in our dataset?
   </div>

*insert answer here*

### Playlist Popularity

In our dataset, each song is listed as belonging to a particular playlist, and each song is given a "popularity score", called the `track_popularity`. Using the `track_popularity`, we can calculate an *aggregate popularity* for each playlist, which is just the sum of all the popularity scores for the songs on the playlist.

In order to create this aggregate popularity score, we need to group our data by playlist, and sum all of the popularity scores. First, we will create a subset of our `spotify` table using the `select` method. This lets us create a table with only the relevant columns we want. In this case, we only care about the name of the playlist and the popularity of each track. Keep in mind that each row still represents one track, even though we no longer have the track title in our table.
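
To make the idea of "group by playlist, then sum" concrete, here is a minimal plain-Python sketch on made-up rows. The playlist names and scores are hypothetical, and the loop only illustrates the concept; the notebook itself uses the `group` method.

```python
from collections import defaultdict

# Hypothetical (playlist_name, track_popularity) rows, one per track.
example_rows = [
    ("Playlist A", 80),
    ("Playlist B", 50),
    ("Playlist A", 60),
    ("Playlist B", 70),
    ("Playlist B", 30),
]

# Add each track's popularity to the running total for its playlist.
example_totals = defaultdict(int)
for playlist, popularity in example_rows:
    example_totals[playlist] += popularity

print(dict(example_totals))
```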

In [None]:
spotify_subset = spotify.select(['playlist_name', 'track_popularity'])
spotify_subset

<div class="alert alert-success">
<b>Note:</b> By grouping, we can get the number of songs from each playlist.
   </div>

In [None]:
playlists = spotify_subset.group('playlist_name')
playlists

<div class="alert alert-info">
    <b>Question:</b> We can use the <code>group</code> method again, this time passing <code>sum</code> as the second argument, <code>collect</code>, which tells <code>group</code> to sum the values in each group rather than count the rows. This results in a table with the total aggregate popularity of each playlist.
   </div>

In [None]:
#Run this cell.
total_playlist_popularity = spotify_subset.group('playlist_name', collect = sum)
total_playlist_popularity

Similar to when we found duration in minutes, we can once again use the `column` method to access just the `track_popularity sum` column, and add it to our playlists table using the `with_column` method.

In [None]:
agg_popularity = total_playlist_popularity.column('track_popularity sum')
playlists = playlists.with_column('aggregate_popularity', agg_popularity)
playlists

<div class="alert alert-info">
<b>Question:</b> Do you think that the most popular playlist would be the one with the highest aggregate_popularity score, or the one with the highest number of songs? We can sort our playlists table and compare the outputs.
   </div>

In [None]:
playlists.sort('...', descending=True)

<div class="alert alert-info">
<b>Question:</b> Now sort by aggregate popularity.
   </div>

In [None]:
playlists.sort('...', descending=True)

Comparing these two outputs shows us that the "most popular playlist" depends on how we judge popularity. If we have a playlist that has only a few songs, but each of those songs is really popular, should that playlist be higher on the popularity rankings? By way of calculation, playlists with more songs will tend to have a higher aggregate popularity, since more popularity values are being added together. We want a metric that will let us judge the actual quality and popularity of a playlist, not just how many songs it has.

In order to take into account the number of songs on each playlist, we can calculate the "average popularity" of the songs on the playlist, that is, the aggregate popularity divided by the number of songs. We can do this by dividing `aggregate_popularity` by `count`. Remember, since the columns are just arrays, we can use array arithmetic to calculate these values.
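
The difference between the two metrics can be seen in a small sketch with made-up numbers. The playlists and values below are hypothetical: one playlist has more songs and a higher total, yet a lower average.

```python
import numpy as np

# Hypothetical playlists (not taken from the dataset).
example_names = np.array(["Playlist A", "Playlist B"])
example_aggregate = np.array([140, 150])  # summed popularity scores
example_count = np.array([2, 3])          # number of songs on each playlist

# Element-wise division pairs each total with its own song count.
example_average = example_aggregate / example_count

# np.argmax returns the position of the largest value in an array.
print(example_names[np.argmax(example_aggregate)])
print(example_names[np.argmax(example_average)])
```

Under these assumed numbers, the playlist that ranks first by total is not the one that ranks first by average.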

In [None]:
#Run this cell to get the average.
avg_popularity = playlists.column('aggregate_popularity') / playlists.column('count')

In [None]:
#Now add it to the playlists table.
playlists = playlists.with_column('average_popularity', avg_popularity)
playlists

Let's see if our "most popular playlist" changes when we judge popularity by the average popularity of the songs on a playlist.

In [None]:
playlists.sort('average_popularity', descending=True)

Looking at the table above, we notice that 8 of the top 10 most popular playlists by the `average_popularity` metric are playlists with fewer than 100 songs. Just because a playlist has a lot of songs, or a high aggregate popularity, doesn't mean that the average popularity of a song on that playlist is high. Our new metric of `average_popularity` lets us rank playlists so that the size of a playlist has no direct effect on its overall score. We can visualize the top 25 playlists by average popularity in a bar chart.

In [None]:
top_25_playlists = playlists.sort('average_popularity', descending=True).take(np.arange(25))
top_25_playlists.barh('...', '...')

Creating a new metric like `average_popularity` helps us more accurately and fairly measure the popularity of a playlist. 

We saw before when looking at the top 25 artists that they were all male. Now looking at the top playlists, we see that the current landscape of popular playlists and music may have an effect on the artists that are popular. For example, RapCaviar is the second most popular playlist, and generally there tend to be fewer female rap artists than male. This shows that the current landscape of popular music can affect the types of artists topping the charts.

## Using prediction and inference to draw conclusions <a id='subsection 1d'></a>

Now that we have some experience making these visualizations, let's look at visualizations that others have built to analyze Spotify data using more complex techniques.

[Streaming Dashboard](https://public.tableau.com/profile/vaibhavi.gaekwad#!/vizhome/Spotify_15858686831320/Dashboard1)

[Audio Analysis Visualizer](https://developer.spotify.com/community/showcase/spotify-audio-analysis/)

Music and culture are very intertwined so it's interesting to look at when songs are released and what is popular during that time. In this last exercise, you will be looking at the popularity of artists and tracks based on the dates you choose.

Let's look back at the first five rows of our `spotify` table once more.

In [None]:
spotify.show(5)

<div class="alert alert-info">
    <b>Question:</b> Fill in the following cell to filter the data by the <code>creation_year</code> you choose.
   </div>

In [None]:
#Fill in the year as an integer.
by_year = spotify.where("creation_year", are.equal_to(...))
by_year

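Based on the dataset you have now, use previous techniques to find what was most popular during that year. First, group by what you want to look at, for example, artist, playlist, or track. Keep in mind that sorting by `count` ranks each group by how many rows it has in the dataset, not by its `track_popularity` score.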
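
The filter, group, and sort steps can be sketched in plain Python on made-up rows. The artists, years, and the `chosen_year` value below are hypothetical, and this is only an illustration of the idea rather than the `datascience` code you will write.

```python
from collections import Counter

# Hypothetical (track_artist, creation_year) rows, not taken from the dataset.
example_tracks = [
    ("Artist X", 2018),
    ("Artist Y", 2019),
    ("Artist X", 2019),
    ("Artist Y", 2019),
    ("Artist Z", 2019),
]

chosen_year = 2019  # hypothetical choice

# Step 1: keep only the rows from the chosen year.
in_year = [artist for artist, year in example_tracks if year == chosen_year]

# Step 2: count rows per artist, then list them from most to fewest.
example_ranking = Counter(in_year).most_common()
print(example_ranking)
```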

In [None]:
your_grouped = by_year.group("...")
pop_track = your_grouped.sort("...", descending = True)
pop_track

In [None]:
pop_track.take(np.arange(25)).barh("...", "count")

<div class="alert alert-info">
<b>Question:</b> Finally, use this cell if you want to look at the popularity of a track released on a specific date. It's very similar to the process above.
   </div>

In [None]:
by_date = spotify.where("track_album_release_date", are.equal_to("..."))
your_grouped = by_date.group("...")
pop_track = your_grouped.sort("count", descending = True)
pop_track.take(np.arange(10)).barh("track_artist", "count")

<div class="alert alert-info">
<b>Question:</b> Tell us something interesting about this data.
   </div>

*Insert answer here.*

Notebook Authors: Alleanna Clark, William Furtado