![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fdata-science-and-artificial-intelligence&branch=main&subPath=analysis/06-data-analysis-spotify.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Data Analysis with Spotify Data

[Spotify](https://en.wikipedia.org/wiki/Spotify), an audio streaming platform, has a huge database of songs and information about them.

*Optional: If you'd like to create your own data sets, you can check out the [Getting Spotify Data notebook](06b-getting-spotify-data.ipynb).*

[![what is a Viz](https://img.youtube.com/vi/pGntmcy_HX8/0.jpg)](https://www.youtube.com/watch?v=pGntmcy_HX8)

`Run` the cell below to import a dataset of about 40,000 songs that has been [exported from Spotify](https://developer.spotify.com/documentation/web-api).

In [None]:
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/data-science-and-artificial-intelligence/spotify.csv')
data

### Column Descriptions

From https://developer.spotify.com/documentation/web-api/reference/

|Column|Description|
|-|-|
|track|The name of the track.|
|artist|The person or group the track is credited to.|
|track_id|The [Spotify ID](https://developer.spotify.com/documentation/web-api/#spotify-uris-and-ids) of the track.|
|danceability|How suitable a track is for dancing. A value of 0.0 is least danceable and 1.0 is most danceable.|
|energy|A perceptual measure of intensity and activity that ranges from 0 to 1. Typically, energetic tracks feel fast, loud, and noisy.|
|key|The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D.|
|loudness|The average loudness of a track in decibels (dB). Values typically range between -60 and 0 dB.|
|mode|The modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.|
|speechiness|Indicates the presence of spoken words in a track. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech while below 0.33 most likely represent music and other non-speech-like tracks.|
|acousticness|A confidence measure indicating whether the track is acoustic. A value of 1 represents the highest confidence.|
|instrumentalness|Predicts whether a track contains no vocals. The closer the value is to 1, the greater the likelihood that the track contains no vocal content.|
|liveness|Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.|
|valence|A measure to describe the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).|
|tempo|The overall estimated tempo (speed or pace) of a track in beats per minute (BPM).|
|duration_ms|The duration of the track in [milliseconds](https://en.wikipedia.org/wiki/Millisecond).|
|time_signature|An estimated overall time signature of a track. The time signature is a notational convention to specify how many beats are in each bar (or measure).|
|chorus_hit|The approximate start time of the chorus, in seconds.|
|sections|The number of sections in the track. Sections are defined by large variations in rhythm or timbre.|
|popularity|A value between 0 and 1, with 1 being the most popular.|
|release_date|The date the track was released.|
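
Several of these columns are numeric codes rather than readable labels. As a minimal sketch (not part of the original dataset or the Spotify API), the hypothetical helper functions `key_name` and `speech_category` below translate `key`, `mode`, and `speechiness` values into words using the rules in the table. Two assumptions are made here: any `key` outside 0 to 11 is treated as "unknown", and values of exactly 0.33 or 0.66 are placed in the middle category, since the table does not say which side they belong to.

```python
# Pitch Class notation: index 0 is C, and each step up is one semitone.
PITCH_CLASSES = ['C', 'C♯/D♭', 'D', 'D♯/E♭', 'E', 'F',
                 'F♯/G♭', 'G', 'G♯/A♭', 'A', 'A♯/B♭', 'B']

def key_name(key, mode):
    """Turn numeric `key` and `mode` values into a label such as 'D major'."""
    if key not in range(12):
        return 'unknown'  # assumption: values outside 0-11 mean no key is available
    scale = 'major' if mode == 1 else 'minor'  # from the table: 1 is major, 0 is minor
    return f'{PITCH_CLASSES[key]} {scale}'

def speech_category(speechiness):
    """Bucket a `speechiness` value using the thresholds from the table."""
    if speechiness > 0.66:
        return 'mostly speech'
    if speechiness >= 0.33:  # assumption: the boundary values count as mixed
        return 'music and speech'
    return 'mostly music'

print(key_name(2, 1))
print(key_name(9, 0))
print(speech_category(0.05))
```

With the dataframe loaded above, something like `data['key'].map(dict(enumerate(PITCH_CLASSES)))` should produce a column of pitch names in the same way.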

We can create some new columns from these, such as:

* `duration_s`: duration in seconds
* `release_year`: just the year that the track was released
* `link`: a link to the track on Spotify

We'll also convert the `release_date` column from text to a datetime type so that it can be plotted on a time axis. Scroll to the far right of the dataframe to see the `release_date` column and the new columns.

In [None]:
data['duration_s'] = data['duration_ms']/1000
data['release_year'] = data['release_date'].str[:4].astype(int)
data['link'] = 'https://open.spotify.com/track/' + data['track_id']
data['release_date'] = pd.to_datetime(data['release_date'], format='%Y-%m-%d')
data
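
To see what each of those lines does without scrolling through 40,000 rows, here is a small sketch that applies the same steps to a two-row dataframe. The rows are made up for illustration, and the track IDs are placeholders rather than real Spotify IDs.

```python
import pandas as pd

# Made-up rows for illustration only; the track IDs are placeholders.
sample = pd.DataFrame({
    'track_id': ['abc123', 'def456'],
    'duration_ms': [210000, 185500],
    'release_date': ['1999-03-02', '2015-11-20'],
})

# Milliseconds to seconds: divide by 1000.
sample['duration_s'] = sample['duration_ms'] / 1000
# Take the first four characters of the date text and turn them into a whole number.
sample['release_year'] = sample['release_date'].str[:4].astype(int)
# Join the base address and the track ID to build a link.
sample['link'] = 'https://open.spotify.com/track/' + sample['track_id']
# Convert the date text to a datetime type.
sample['release_date'] = pd.to_datetime(sample['release_date'], format='%Y-%m-%d')

print(sample.dtypes)
```

The order matters: `.str[:4]` only works while `release_date` is still text, so the year is extracted before the column is converted. After the conversion, `sample['release_date'].dt.year` would be another way to get the year.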

***
## Visualization 1 ##

A scatter plot of song lengths over the years.

In [None]:
import plotly.express as px
px.scatter(data, x='release_date', y='duration_s', hover_data=['artist', 'track', 'link'], title='Spotify Song Duration Over Time')

***
## Visualization 2 ##


This scatter plot compares two columns:

`energy`: *A perceptual measure of intensity and activity that ranges from 0 to 1. Typically, energetic tracks feel fast, loud, and noisy*

**and**

`danceability`: *How suitable a track is for dancing. A value of 0.0 is least danceable and 1.0 is most danceable*


We'll also colour the points by `loudness` and set the visualization `height` to `800` so it's a little larger.

In [None]:
px.scatter(data, x='energy', y='danceability', color='loudness', hover_data=['artist', 'track'], title='Danceability versus Energy', height=800)

***
## Visualization 3: Billions Club Playlist ##

If you would instead like to use a data set containing only the songs that have been played on Spotify more than a billion times, you can use data from the [Billions Club Playlist](https://open.spotify.com/playlist/37i9dQZF1DX7iB3RCnBnN4).

In [None]:
bc = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/data-science-and-artificial-intelligence/spotify-billions-club.csv')
bc

Then you can create columns and visualizations using the `bc` dataframe instead of using the `data` one.

In [None]:
bc['duration_s'] = bc['duration_ms']/1000
bc['release_year'] = bc['release_date'].str[:4].astype(int)
bc = bc.rename(columns={'track_href': 'link'})
bc['release_date'] = pd.to_datetime(bc['release_date'], format='%Y-%m-%d')

px.scatter(bc, x='energy', y='danceability', color='loudness', hover_data=['artist', 'track'], title='Danceability versus Energy for Billions Club Songs')

---

<span style="color:#663399">Your **assignment** is to create at least three Spotify visualizations using data from different columns, and for each visualization write:</span>
* <span style="color:#663399">We created this visualization because</span>
* <span style="color:#663399">This visualization shows</span>
* <span style="color:#663399">Something interesting we learned from or noticed in this visualization is</span>

_Hint: Use the visualization code above or from previous notebooks to create your own visualizations, then change the column names to match the Spotify data set._

---

The [next notebook](07-primary-data.ipynb) will introduce you to recording and using your own (primary) data.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)