Set up imports

In [7]:
import os
import pandas as pd
import json
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import re
import matplotlib.pyplot as plt
import requests

# load client credentials using .env file
# SPOTIPY_CLIENT_ID=YOUR_CLIENT_ID
# SPOTIPY_CLIENT_SECRET=YOUR_CLIENT_SECRET
%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


### Let's get started by preprocessing our data
First, we need to load the data from the JSON files provided with the dataset.

In [8]:
DATA_DIR = 'spotify_million_playlist_dataset/data/'

playlists = []
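for file in sorted(os.scandir(DATA_DIR), key=lambda e: e.name):
    print("processing slice: " + file.name)
    # use a context manager so the file handle is closed after reading
    with open(file.path) as f:
        data = json.load(f)
    playlists.append(pd.DataFrame(data['playlists']))
    # only load the first slice for now; remove this break to load every slice
    break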

processing slice: mpd.slice.0-999.json


Let's clean up our list of playlists a little bit by combining them into a single pandas DataFrame.

In [9]:
playlists_frame = pd.concat(playlists)
print(playlists_frame.head())

               name collaborative  pid  modified_at  num_tracks  num_albums  \
0        Throwbacks         false    0   1493424000          52          47   
1  Awesome Playlist         false    1   1506556800          39          23   
2           korean          false    2   1505692800          64          51   
3               mat         false    3   1501027200         126         107   
4               90s         false    4   1401667200          17          16   

   num_followers                                             tracks  \
0              1  [{'pos': 0, 'artist_name': 'Missy Elliott', 't...   
1              1  [{'pos': 0, 'artist_name': 'Survivor', 'track_...   
2              1  [{'pos': 0, 'artist_name': 'Hoody', 'track_uri...   
3              1  [{'pos': 0, 'artist_name': 'Camille Saint-Saën...   
4              2  [{'pos': 0, 'artist_name': 'The Smashing Pumpk...   

   num_edits  duration_ms  num_artists description  
0          6     11532414           37       

### Now we can begin our analysis (sort of):

We'll start by querying the Spotify API with the Spotipy package to get some more details
on the songs that are contained within each playlist. Spotify calculates various data for each song
such as the time signature, tempo, timbre, etc.

In [10]:
#authorize our API session with credentials stored in the environment variables

auth_manager = SpotifyClientCredentials()
sp = spotipy.Spotify(auth_manager=auth_manager)
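
Since the credentials come from the environment, it can help to check that they are actually set before making any requests. Here is a minimal sketch of such a check; `missing_credentials` and `REQUIRED_VARS` are hypothetical names made up for this example, and the variable names are the ones from the `.env` comment above.

```python
import os

# the variable names listed in the .env comment above
REQUIRED_VARS = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("missing environment variables: " + ", ".join(missing))
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.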

Spotify provides us with two different types of song data. One is called the [audio features](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/),
and the other is the [audio analysis](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-analysis).
Essentially, the features are Spotify's interpretation of the audio analysis: they are higher-level attributes
like the 'danceability' and 'liveness' of a song. The audio analysis is every single piece of data
that Spotify was able to calculate from the song's sound signature.

The audio features sound like they might be a little easier to handle, so we'll define a function to query
that data first.

In [None]:
def processSongFeatures(playlist, sp):
    songs = playlist['tracks']

    # get features for all songs
    song_features = []
    song_ids = []
    for song in songs:
        # get song id
        song_id = re.sub('spotify:track:', '', song['track_uri'])
        song_ids.append(song_id)
        print('processing: ' + song['track_name'])
        features = sp.audio_features(song_id)[0]
        song_features.append(features)

    # convert features into dataframe by song id
    features_by_id = pd.DataFrame(song_features, index=song_ids)
    features_by_id.index.name = 'song_id'
    print(features_by_id.head())

    # export data (disabled for now); note the playlist id column is 'pid', an integer
    # export_dir = playlist['name'] + '-' + str(playlist['pid'])
    # os.makedirs(export_dir, exist_ok=True)
    # features_by_id.to_csv(os.path.join(export_dir, 'features.csv'))

    return features_by_id
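
Requesting features one song at a time means one API call per track. As far as I know, Spotipy's `audio_features` also accepts a list of ids (up to 100 per call), so a batched variant might look like the sketch below. `chunked` and `fetch_features` are hypothetical helpers for illustration, and the 100-id limit is an assumption worth checking against the Spotipy docs.

```python
def chunked(items, size=100):
    """Yield successive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_features(sp, song_ids, batch_size=100):
    """Hypothetical helper: collect audio features for many ids in batches."""
    features = {}
    for batch in chunked(song_ids, batch_size):
        # assumed behavior: one entry per requested id, None when a track has no features
        for song_id, feat in zip(batch, sp.audio_features(batch)):
            if feat is not None:
                features[song_id] = feat
    return features
```

The result is a dictionary keyed by song id, which can be turned into a table with `pd.DataFrame.from_dict(features, orient='index')`.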

Next we can work on handling the audio analysis data, which is a little more complicated. We'll start
by extracting the list of songs in a playlist from our dataframe of playlists. Using Spotipy,
we query the API for the audio analysis of each song and store it in a dictionary keyed by song id, so we can associate
each song with its analysis.

In [12]:
# select just the first playlist (for testing purposes)
playlist = playlists_frame.iloc[0]
print(playlist)

MAX_RETRIES = 5
# get the songs from the first playlist
songs = playlist['tracks']

analyses = {}
# collect analyses of songs, dict of (song id : analysis)
for song in songs:
    song_id = re.sub('spotify:track:', '', song['track_uri'])
    # sometimes the API request times out; we'll skip the song if it exceeds the max retries
    for request_attempt in range(MAX_RETRIES):
        print('analyzing: ' + song['track_name'])
        try:
            a = sp.audio_analysis(track_id=song_id)
        except requests.exceptions.ReadTimeout:
            print('request to Spotify timed out for: ' + song['track_name'])
        else:
            # the request succeeded, stop retrying
            break
    else:
        # every attempt timed out, move on to the next song
        continue
    # drop the metadata block, which we don't need
    a.pop('meta', None)
    analyses[song_id] = a

name                                                    Throwbacks
collaborative                                                false
pid                                                              0
modified_at                                             1493424000
num_tracks                                                      52
num_albums                                                      47
num_followers                                                    1
tracks           [{'pos': 0, 'artist_name': 'Missy Elliott', 't...
num_edits                                                        6
duration_ms                                               11532414
num_artists                                                     37
description                                                    NaN
Name: 0, dtype: object
analyzing: Lose Control (feat. Ciara & Fat Man Scoop)
analyzing: Toxic
analyzing: Crazy In Love
analyzing: Rock Your Body
analyzing: It Wasn't Me
analyzing: Yeah!
analyzing:
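
The retry loop above tries again immediately after a timeout. One possible refinement is to wait a little longer between attempts. The sketch below is a generic helper, not part of Spotipy; `call_with_retries` and its parameters are names invented for this example.

```python
import time


def call_with_retries(func, max_retries=5, base_delay=1.0,
                      exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception wait and retry. Returns None if every attempt fails."""
    for attempt in range(max_retries):
        try:
            return func()
        except exceptions:
            if attempt < max_retries - 1:
                # wait 1x, 2x, 4x, ... the base delay before the next attempt
                sleep(base_delay * 2 ** attempt)
    return None
```

In the loop above it could be used as `a = call_with_retries(lambda: sp.audio_analysis(track_id=song_id), exceptions=(requests.exceptions.ReadTimeout,))`, skipping the song whenever `a` is `None`.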

Before moving on, let's check out what an analysis looks like so we can decide how we want to
structure our data.

In [20]:
# remove an (id, analysis) pair from the dictionary
analysis_pair = analyses.popitem()

# get the parsed JSON response
# (don't name this variable `json`, that would shadow the json module imported earlier)
analysis = analysis_pair[1]
# because the analysis is a parsed JSON response (a dictionary), we can iterate over it like this
for category, value in analysis.items():
    print(category, value)

# don't forget to add the pair back
analyses[analysis_pair[0]] = analysis_pair[1]

track {'num_samples': 4593750, 'duration': 208.33333, 'sample_md5': '', 'offset_seconds': 0, 'window_seconds': 0, 'analysis_sample_rate': 22050, 'analysis_channels': 1, 'end_of_fade_in': 0.18009, 'start_of_fade_out': 202.34448, 'loudness': -3.533, 'tempo': 106.004, 'tempo_confidence': 0.779, 'time_signature': 4, 'time_signature_confidence': 1.0, 'key': 8, 'key_confidence': 0.387, 'mode': 1, 'mode_confidence': 0.535, 'codestring': 'eJxVmwuWIyuMRLfiJQDJd_8b63sDl-2eOXPmQaUzQUiKUKB-xmi1zFJf5fWMc3af51Vnea2x6xiztFfv-1Vr3fv03l7rmY7a2E97XrW07qOtnXJ4rG1eM5_TCo_WwbOtPm2e-qp7MCjntKcz2furn7KeMet-tf6017PqOs_hj3Pv1-x1zb5bebWz-GJxPX2NF297f3GMslnzYE0svJy-GO7O0lspc7RnvHo5fHRV_nvxX50_v-piL21Nhqt0H3FztbAet_2c2dn28xrNfa86e6lsbbiZ85Q2-mjnxQJY6Wl8imcnBmGHe63W6mDxpTg8mKOt13RDd1ixz8SObAhTzIdZlvto1ufZhU-s0XyEt_bJb9cu_vapbZY9XuyCIWt8chjb_Y3GZuvAjrsNTVXaGW0_67XdL-_jkAb73VPjjDPX3GzV9bx47hQO8rQ674nMh6fOwDR7c55PWf3FEfhLTuZMlscn2Eebi8cf7I8XsEre3Nvxu7U8GouvPqtslzSx1qiHMzraApu390oWu8J6GOxvKbV2rVLxjKK9al0a7Gm7PYWXtYK9e

That's kind of messy, so here's the [API reference](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-analysis/)
for the structure of an audio analysis.

TL;DR: the response contains categories like `track` (a special one),
`bars`, `beats`, etc. `track` is special because it contains more general data such as the `duration`
and `time_signature`; these fields aren't in the other categories. The categories `bars`, `beats`,
and `tatums` are all composed of [time interval objects](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-analysis/#time-interval-object),
which are just dictionaries with the start, the duration, and how confident Spotify is that the data is right.

This brings us to `sections` and `segments`. `sections` is composed of [section objects](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-analysis/#section-object),
which are basically the same as the other categories, just with more key-value pairs.
`segments` is where it gets more interesting (and where I got stuck). `segments` is composed
of the aptly named [segment objects](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-analysis/#section-object).
Each one is another dictionary, with a caveat: some of the values are arrays rather than single numbers
as before. At first, I thought that this nesting would be a problem, but I found that pandas takes care
of it on its own, keeping each array intact as a single cell value.
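
To see how the nested arrays end up in a table, here is a small sketch that runs `json_normalize` on two made-up segment objects. The field names follow the segment description in the API reference, but the values are invented for illustration.

```python
import pandas as pd

# made-up segment objects (values are illustrative, not real API output)
segments = [
    {"start": 0.0, "duration": 0.25, "confidence": 0.9,
     "pitches": [0.1, 0.7, 0.2], "timbre": [42.0, -3.5, 11.2]},
    {"start": 0.25, "duration": 0.31, "confidence": 0.6,
     "pitches": [0.3, 0.4, 0.9], "timbre": [40.1, 5.0, -7.8]},
]

frame = pd.json_normalize(segments)

# scalar fields become numeric columns; array-valued fields are held in object columns
print(frame.dtypes)

# an array-valued cell can still be indexed like a normal sequence
print(frame.loc[0, "timbre"][0])
```

Each row is one segment, and the array-valued fields stay together in a single cell, so nothing needs to be flattened by hand.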


#### Moving on...
Now we have a dictionary mapping each song to its analysis JSON response. Let's loop through everything
and parse it into pandas DataFrames using a handy function called `json_normalize`. We're also gonna
save the song ids for later to make creating our final DataFrame a little easier.

The catch here is that we're gonna split everything up by each analysis category, and then by song id.

In [1]:
ids = []
category_frames = {}
# iterate through all songs and their analyses
for song_id, a in analyses.items():
    print('processing song: ' + song_id)
    ids.append(song_id)
    # collect the analysis data for each category in the songs' analysis
    for category, vals in a.items():
        print('processing category: ' + category)
        norm = pd.json_normalize(vals)
        # each category maps to a list of dataframes (one per song);
        # setdefault creates the list the first time a category is seen
        category_frames.setdefault(category, []).append(norm)

print(category_frames)


Finally, we're getting a little closer to our final pandas DataFrame with a two-level MultiIndex (category, then song id).
We're going to use the `ids` list we created earlier to concatenate the dataframes by their song ids.
This forms the second, inner level of the index; pandas keeps each frame's original row number as an extra innermost level.

In [None]:
category_tables = {}
# concatenate the list of frames in each category to make tables based on song id
for cat_name, frame_list in category_frames.items():
    category_tables[cat_name] = pd.concat(frame_list, keys=ids)
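
To make the effect of `keys` more concrete, here is a toy sketch with two made-up per-song frames for a single category. The ids `song_a` and `song_b` are placeholders, not real Spotify ids.

```python
import pandas as pd

# two made-up per-song frames for one category (values are illustrative)
frames = [
    pd.DataFrame({"start": [0.0, 0.5], "confidence": [0.9, 0.8]}),
    pd.DataFrame({"start": [0.0, 0.4, 0.9], "confidence": [0.7, 0.6, 0.5]}),
]
song_ids = ["song_a", "song_b"]  # placeholder ids

# keys adds an outer index level, so every row is labeled with its song id
table = pd.concat(frames, keys=song_ids)
print(table)

# select all rows belonging to one song
print(table.loc["song_b"])
```

The order of `keys` has to match the order of the frames, which is why the song ids were collected in the same loop as the frames.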

Then we can concatenate again, this time by category name, to create the outermost level of the index.

In [None]:
# combine the dictionary of categories and their lists of songs
# pd.concat accepts a dictionary directly and uses its keys as the outer index level
category_tables_frame = pd.concat(category_tables)

Our final pandas dataframe has a structure like this:

| level=0   |     level=1           |     start    | ... | timbre            |
| :-: | :-: | :-: | :-: | --- |
| segments  | 3ELm3eyRhR4tF1ncqzMQEV|   252.15601  | ... | [list]   |
|    -       |           -            |   258.15601  | ... | [list]   |
|     -      |           -            |   259.15601  | ... | [list]   |
|      -     |           -            |   261.15601  | ... | [list]   |
|        -   | 2pbxqEYiXJTvFsybGGgSAi|   237.02356  | ... | [list]   |
|       -    |           -            |   248.15601  | ... | [list]   |
|      -     |           -            |   249.15601  | ... | [list]   |
|      -     |           -            |   251.15601  | ... | [list]   |
| tatums  | 3ELm3eyRhR4tF1ncqzMQEV|   252.15601  | ... | NaN   |
|    -       |           -            |   258.15601  | ... | NaN   |
|     -      |           -            |   259.15601  | ... | NaN   |
|      -     |           -            |   261.15601  | ... | NaN   |
|        -   | 2pbxqEYiXJTvFsybGGgSAi|   237.02356  | ... | NaN   |
|       -    |           -            |   248.15601  | ... | NaN  |
|      -     |           -            |   249.15601  | ... | NaN   |
|      -     |           -            |   251.15601  | ... | NaN   |

We can query the table using `DataFrame.groupby()`, for example by grouping on one of the index levels.
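
As a rough sketch of what such queries might look like, the example below builds a tiny stand-in for `category_tables_frame` from made-up values and then queries it. The index level names `category`, `song_id`, and `row` are assumptions added here for readability; the frame built above leaves its levels unnamed.

```python
import pandas as pd

# tiny stand-in for category_tables_frame (made-up values, placeholder song ids)
bars = pd.concat(
    [pd.DataFrame({"start": [0.0, 2.0]}), pd.DataFrame({"start": [0.0, 1.5, 3.0]})],
    keys=["song_a", "song_b"],
)
beats = pd.concat(
    [pd.DataFrame({"start": [0.0, 0.5, 1.0]}), pd.DataFrame({"start": [0.0, 0.4]})],
    keys=["song_a", "song_b"],
)
frame = pd.concat({"bars": bars, "beats": beats})
frame.index.names = ["category", "song_id", "row"]  # assumed names, for readability

# number of rows per (category, song)
counts = frame.groupby(level=["category", "song_id"]).size()
print(counts)

# all beats for a single song
print(frame.loc[("beats", "song_a")])

# one song across all categories
print(frame.xs("song_b", level="song_id"))
```

Grouping by index level works for aggregate questions, while `.loc` and `.xs` are handy for pulling out a single category or song.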