![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fdata-science-and-artificial-intelligence&branch=main&subPath=06-data-analysis-spotify.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Data Analysis with Spotify Data

[Spotify](https://en.wikipedia.org/wiki/Spotify), an audio streaming platform, has a huge database of songs and information about them.

Run the cell below to import a dataset of about 40,000 songs that has been [exported from Spotify](https://developer.spotify.com/documentation/web-api).

In [None]:
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/data-science-and-artificial-intelligence/spotify.csv')
data

### Column Descriptions

From https://developer.spotify.com/documentation/web-api/reference/

|Value|Description|
|-|-|
|track|The name of the track.|
|artist|The person or group the track is credited to.|
|track_id|The [Spotify ID](https://developer.spotify.com/documentation/web-api/#spotify-uris-and-ids) of the track.|
|danceability|How suitable a track is for dancing. A value of 0.0 is least danceable and 1.0 is most danceable.|
|energy|A perceptual measure of intensity and activity that ranges from 0 to 1. Typically, energetic tracks feel fast, loud, and noisy.|
|key|The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D|
|loudness|The average loudness of a track in decibels (dB). Values typically range between -60 and 0 dB.|
|mode|The modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.|
|speechiness|Indicates the presence of spoken words in a track. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, while values below 0.33 most likely represent music and other non-speech-like tracks.|
|acousticness|A confidence measure indicating whether the track is acoustic. A value of 1 represents the highest confidence.|
|instrumentalness|Predicts whether a track contains no vocals. The closer the value is to 1, the greater the likelihood that the track contains no vocal content.|
|liveness|Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.|
|valence|A measure to describe the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).|
|tempo|The overall estimated tempo (speed or pace) of a track in beats per minute (BPM).|
|duration_ms|The duration of the track in [milliseconds](https://en.wikipedia.org/wiki/Millisecond).|
|time_signature|An estimated overall time signature of a track. The time signature is a notational convention to specify how many beats are in each bar (or measure).|
|chorus_hit|The approximate start time of the chorus, in seconds.|
|sections|The number of sections in the track. Sections are defined by large variations in rhythm or timbre.|
|popularity|A value between 0 and 1, with 1 being the most popular.|
|release_date|The date the track was released.|
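The integer codes in the `key` and `mode` columns can be translated into readable labels. The following is a minimal sketch on a small made-up sample rather than the full dataset; the column names `key_name` and `mode_name` are hypothetical and not part of the original data.

```python
import pandas as pd

# Pitch class names for key values 0 to 11 (standard pitch class notation).
PITCH_CLASSES = ['C', 'C♯/D♭', 'D', 'D♯/E♭', 'E', 'F',
                 'F♯/G♭', 'G', 'G♯/A♭', 'A', 'A♯/B♭', 'B']

# A tiny made-up sample that uses the same column names as the dataset.
sample = pd.DataFrame({'key': [0, 1, 2, 9], 'mode': [1, 0, 1, 0]})

# Map the integer codes to labels; any code outside 0 to 11 becomes NaN.
sample['key_name'] = sample['key'].map(dict(enumerate(PITCH_CLASSES)))
sample['mode_name'] = sample['mode'].map({1: 'major', 0: 'minor'})
print(sample)
```

The same two `map` calls should work on the full `data` dataframe, assuming its `key` and `mode` columns hold integers as described in the table.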

We can create some new columns from these, such as:

* `duration_s`: the duration in seconds
* `release_year`: the year the track was released
* `link`: a link to the track on Spotify

We'll also convert the `release_date` column to a date instead of a string.

In [None]:
data['duration_s'] = data['duration_ms']/1000
data['release_year'] = data['release_date'].str[:4].astype(int)
data['link'] = 'https://open.spotify.com/track/' + data['track_id']
data['release_date'] = pd.to_datetime(data['release_date'], format='%Y-%m-%d')
data
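Once a column holds dates rather than strings, the year can also be read through the `.dt` accessor instead of slicing the text. This is a small sketch with made-up dates, assuming every value follows the `YYYY-MM-DD` pattern.

```python
import pandas as pd

# Made-up release dates stored as strings, as in the raw CSV.
dates = pd.Series(['1999-03-15', '2010-11-02'])

# Parse the strings into datetime values, then read the year from each one.
parsed = pd.to_datetime(dates, format='%Y-%m-%d')
years = parsed.dt.year
print(years.tolist())
```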

Now let's visualize the song lengths over the years.

In [None]:
import plotly.express as px
px.scatter(data, x='release_date', y='duration_s', hover_data=['artist', 'track', 'link'], title='Song Duration Over Time')
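With tens of thousands of points, a trend can be hard to see in a scatter plot. One possible approach, sketched here on a made-up table that reuses the `release_year` and `duration_s` column names, is to average the duration for each year first.

```python
import pandas as pd

# Made-up rows standing in for the real dataset.
toy_songs = pd.DataFrame({
    'release_year': [1990, 1990, 2005, 2005, 2020],
    'duration_s': [250.0, 270.0, 230.0, 210.0, 180.0],
})

# Average song length for each release year.
mean_duration = toy_songs.groupby('release_year', as_index=False)['duration_s'].mean()
print(mean_duration)
```

The resulting dataframe could then be plotted with something like `px.line(mean_duration, x='release_year', y='duration_s')`.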

We can also check whether there is a relationship between `energy` and `danceability`. We'll colour the points by `loudness` and set the visualization `height` to `800` so it's a little larger.

In [None]:
px.scatter(data, x='energy', y='danceability', color='loudness', hover_data=['artist', 'track', 'link'], title='Danceability versus Energy', height=800)
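A scatter plot suggests a relationship visually, and a correlation coefficient can put a number on it. The sketch below uses made-up values and the pandas `corr` method, which computes the Pearson correlation by default. Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.

```python
import pandas as pd

# Made-up values standing in for the real energy and danceability columns.
toy_features = pd.DataFrame({
    'energy': [0.2, 0.4, 0.6, 0.8],
    'danceability': [0.3, 0.5, 0.4, 0.7],
})

# Pearson correlation between the two columns.
r = toy_features['energy'].corr(toy_features['danceability'])
print(round(r, 2))
```

On the real data this would be `data['energy'].corr(data['danceability'])`.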

---

<span style="color:#663399">Your **assignment** is to create at least three visualizations using Spotify data, and for each visualization write:</span>
* <span style="color:#663399">We created this visualization because</span>
* <span style="color:#663399">This visualization shows</span>
* <span style="color:#663399">Something interesting we learned from or noticed in this visualization is</span>

---

The [next notebook](07-data-logging.ipynb) will introduce you to recording and using your own (primary) data.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)

#### Getting More Data Using the Spotify API (Optional and Advanced)

An [application programming interface](https://en.wikipedia.org/wiki/API) is a set of commands to access data from another system. The [Spotify Web API](https://developer.spotify.com/documentation/web-api/) allows us to get information about songs, albums, and artists.

If you want to retrieve more data and have a [Spotify account](https://www.spotify.com/us/signup), you can sign in to the [Developers Dashboard](https://developer.spotify.com/dashboard/login). From the Dashboard, click the `CREATE AN APP` button, type a name and description, and then click `CREATE`. Clicking on your new app in the Dashboard will show you the `Client ID` and `CLIENT SECRET` that you can paste into the code cell below.

In [None]:
CLIENT_ID = 'PASTE_YOUR_CLIENT_ID_HERE'
CLIENT_SECRET = 'PASTE_YOUR_CLIENT_SECRET_HERE'

if CLIENT_ID != 'PASTE_YOUR_CLIENT_ID_HERE': # make sure you've pasted in your client ID and client secret
    import requests
    headers = None
    try:
        # exchange the client ID and secret for a temporary access token
        auth_response = requests.post('https://accounts.spotify.com/api/token', {'grant_type':'client_credentials', 'client_id':CLIENT_ID, 'client_secret':CLIENT_SECRET})
        auth_response_data = auth_response.json()
        access_token = auth_response_data['access_token']
        headers = {'Authorization':'Bearer {token}'.format(token=access_token)}
    except Exception:
        print('Could not get an access token. Check the client ID and secret you pasted into this cell')

    def get_track_info(track_id):
        # look up general information about a track
        try:
            r = requests.get('https://api.spotify.com/v1/tracks/' + track_id, headers=headers)
            info = r.json()
        except Exception:
            print('Error with track id:', track_id)
            info = None
        return info

    def get_track_features(track_id):
        # look up the audio features (danceability, energy, tempo, etc.) of a track
        try:
            r = requests.get('https://api.spotify.com/v1/audio-features/' + track_id, headers=headers)
            info = r.json()
        except Exception:
            print('Error with track id:', track_id)
            info = None
        return info

    if headers is not None: # only report success if we received an access token
        print('Spotify API setup complete')

Then you can get information and audio features for the songs in a playlist using the `playlist_id`.

For example, the Billions Club playlist URL is https://open.spotify.com/playlist/37i9dQZF1DX7iB3RCnBnN4 so the `playlist_id` is `37i9dQZF1DX7iB3RCnBnN4`.
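If you copy a playlist link from Spotify, the ID can also be pulled out with code rather than by hand. This is a minimal sketch using Python's standard `urllib.parse` module; the helper name `playlist_id_from_url` is hypothetical, and it assumes the ID is the last part of the URL path.

```python
from urllib.parse import urlparse

def playlist_id_from_url(url):
    """Return the last path segment of a Spotify playlist URL."""
    path = urlparse(url).path  # the query string, if any, is not part of the path
    return path.rstrip('/').split('/')[-1]

url = 'https://open.spotify.com/playlist/37i9dQZF1DX7iB3RCnBnN4'
print(playlist_id_from_url(url))
```

Because `urlparse` separates the query string from the path, a link that ends in something like `?si=...` should give the same ID.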

In [None]:
playlist_id = '37i9dQZF1DX7iB3RCnBnN4'

if CLIENT_ID != 'PASTE_YOUR_CLIENT_ID_HERE':
    tracks = []
    for x in range(4):  # it only returns 100 tracks at a time
        offset = x*100
        r = requests.get('https://api.spotify.com/v1/playlists/' + playlist_id + '/tracks?offset=' + str(offset), headers=headers)
        for item in r.json()['items']:
            tracks.append([item['track']['artists'][0]['name'], item['track']['name'], item['track']['id']])
    pl = pd.DataFrame(tracks, columns=['artist', 'track', 'id'])

    track_features = {}
    for row in pl.itertuples():
        print(row[1], row[2]) # artist and track
        track_id = row[3]
        features = get_track_features(track_id)
        track_features[track_id] = features
    tf = pd.DataFrame(track_features).T

    from IPython.display import clear_output, display
    clear_output() # remove the progress printout

    playlist = pd.merge(pl, tf, on='id') # merge the dataframes
    display(playlist) # a bare `playlist` inside an if block would not be displayed
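
The merge step combines two dataframes by matching rows that share the same `id`. Here is a small sketch with made-up rows, assuming the default behaviour of `pd.merge` (an inner join), which keeps only the ids that appear in both tables.

```python
import pandas as pd

# Made-up rows standing in for the playlist and audio-feature tables.
names = pd.DataFrame({'id': ['a1', 'b2', 'c3'], 'track': ['Song A', 'Song B', 'Song C']})
features = pd.DataFrame({'id': ['a1', 'b2'], 'tempo': [120.0, 95.5]})

# 'c3' has no matching row in features, so it is left out of the result.
merged = pd.merge(names, features, on='id')
print(merged)
```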