Set up imports

In [6]:
import os
import pandas as pd
import json
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import re
import matplotlib.pyplot as plt
import requests

%load_ext dotenv
%dotenv

### Let's get started by preprocessing our data
First, we need to load the data from the provided JSON files.

In [3]:
DATA_DIR = 'spotify_million_playlist_dataset/data/'

playlists = []
for file in sorted(os.scandir(DATA_DIR), key=lambda e: e.name):
    print("processing slice: " + file.name)
    # use a context manager so each slice file is closed once it has been parsed
    with open(file.path) as f:
        data = json.load(f)
    playlists.append(pd.DataFrame(data['playlists']))
    break  # stop after the first slice for now

processing slice: mpd.slice.0-999.json
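
Sorting by name orders the slice files lexicographically, so a name such as `mpd.slice.10000-10999.json` would sort before `mpd.slice.2000-2999.json`. If slice order matters once more than one file is loaded, a numeric sort key is one option. The following is a minimal sketch, assuming every file follows the `mpd.slice.START-END.json` pattern seen in the output above; `slice_start`, `load_playlists`, and the `limit` parameter are hypothetical names, not part of the dataset tooling.

```python
import json
import os
import re


def slice_start(filename):
    """Return the first playlist id encoded in a name like 'mpd.slice.1000-1999.json'."""
    match = re.search(r"slice\.(\d+)-(\d+)\.json$", filename)
    if match is None:
        raise ValueError(f"unexpected slice file name: {filename}")
    return int(match.group(1))


def load_playlists(data_dir, limit=None):
    """Load playlist dicts from up to `limit` slice files, in numeric slice order."""
    names = sorted(
        (n for n in os.listdir(data_dir) if n.endswith(".json")),
        key=slice_start,
    )
    if limit is not None:
        names = names[:limit]

    playlists = []
    for name in names:
        # The context manager closes each file as soon as it has been parsed.
        with open(os.path.join(data_dir, name), encoding="utf-8") as f:
            playlists.extend(json.load(f)["playlists"])
    return playlists
```

Calling `load_playlists(DATA_DIR, limit=1)` would read only the first slice, much like the `break` in the loop above.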


Let's tidy up our list of playlist frames by combining them into a single pandas DataFrame.

In [4]:
playlists_frame = pd.concat(playlists)
print(playlists_frame.head())

               name collaborative  pid  modified_at  num_tracks  num_albums  \
0        Throwbacks         false    0   1493424000          52          47   
1  Awesome Playlist         false    1   1506556800          39          23   
2           korean          false    2   1505692800          64          51   
3               mat         false    3   1501027200         126         107   
4               90s         false    4   1401667200          17          16   

   num_followers                                             tracks  \
0              1  [{'pos': 0, 'artist_name': 'Missy Elliott', 't...   
1              1  [{'pos': 0, 'artist_name': 'Survivor', 'track_...   
2              1  [{'pos': 0, 'artist_name': 'Hoody', 'track_uri...   
3              1  [{'pos': 0, 'artist_name': 'Camille Saint-Saën...   
4              2  [{'pos': 0, 'artist_name': 'The Smashing Pumpk...   

   num_edits  duration_ms  num_artists description  
0          6     11532414           37       
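
Each entry in the `tracks` column is a list of dictionaries, which is awkward to query directly. One possible way to get one row per track is to explode the column and expand the dictionaries into columns. This is only a sketch on a toy frame: the `toy` data below is made up for illustration and carries just a few of the real fields.

```python
import pandas as pd

# Toy stand-in for playlists_frame; the track values are invented for illustration.
toy = pd.DataFrame({
    "pid": [0, 1],
    "name": ["Throwbacks", "Awesome Playlist"],
    "tracks": [
        [{"pos": 0, "track_uri": "spotify:track:aaa", "track_name": "A"},
         {"pos": 1, "track_uri": "spotify:track:bbb", "track_name": "B"}],
        [{"pos": 0, "track_uri": "spotify:track:ccc", "track_name": "C"}],
    ],
})

# One row per (playlist, track) pair, with a fresh 0..n-1 index.
exploded = toy[["pid", "tracks"]].explode("tracks", ignore_index=True)

# Expand each track dict into its own columns and keep the playlist id alongside.
tracks = pd.concat(
    [exploded[["pid"]], pd.json_normalize(exploded["tracks"].tolist())],
    axis=1,
)
print(tracks)
```

The same two steps should apply to `playlists_frame`, provided every `tracks` entry is a non-empty list of dictionaries.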

### Now we can begin our analysis (sort of):

We'll start by querying the Spotify API with the Spotipy package to get more details
on the songs contained in each playlist. Spotify computes various attributes for each song,
such as the time signature, tempo, and timbre.

In [None]:
# authorize our API session; SpotifyClientCredentials reads SPOTIPY_CLIENT_ID
# and SPOTIPY_CLIENT_SECRET from the environment variables

auth_manager = SpotifyClientCredentials()
sp = spotipy.Spotify(auth_manager=auth_manager)
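
When called without arguments, `SpotifyClientCredentials` looks for `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` in the environment, which is what the `%dotenv` magic is expected to populate from a `.env` file. A quick check before authorizing can make a missing variable easier to spot. This is a small sketch; `missing_credentials` is a hypothetical helper, not part of Spotipy.

```python
import os

# Environment variables that spotipy's client-credentials flow reads by default.
REQUIRED_VARS = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("missing credentials: " + ", ".join(missing))
```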

Spotify provides us with two different types of song data. One is called the [audio features](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/),
and the other is the [audio analysis](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-analysis).
Essentially, the features are Spotify's interpretation of the audio analysis: they are higher-level attributes
like the 'danceability' and 'liveness' of a song. The audio analysis is every single piece of data
that Spotify was able to calculate from the song's sound signature.

The audio features sound like they might be a little easier to handle, so we'll define a function to query
that data first.

In [None]:
def processSongFeatures(playlist, sp):
    songs = playlist['tracks']

    # get features for all songs
    song_features = []
    song_ids = []
    for song in songs:
        # get song id
        song_id = re.sub('spotify:track:', '', song['track_uri'])
        song_ids.append(song_id)
        print('processing: ' + song['track_name'])
        features = sp.audio_features(song_id)[0]
        song_features.append(features)

    # convert features into dataframe by song id
    features_by_id = pd.DataFrame(song_features, index=song_ids)
    features_by_id.index.name = 'song_id'
    print(features_by_id.head())

    # export data (disabled for now; the playlist key is 'pid', an integer, not 'id')
    """export_dir = playlist['name'] + '-' + str(playlist['pid'])
    if not os.path.isdir(export_dir):
        os.mkdir(export_dir)

    features_by_id.to_csv(export_dir + '/features.csv')"""
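
The function above sends one request per song. The Spotify endpoint behind `audio_features` accepts up to 100 track IDs per request, and Spotipy's `audio_features` takes a list of IDs, so batching is one way to cut down the number of calls. The following is a sketch rather than a drop-in replacement: `track_id_from_uri`, `chunked`, and `fetch_audio_features` are hypothetical helper names, and `sp` is assumed to be the authorized client created earlier.

```python
def track_id_from_uri(track_uri):
    """Return the id part of a URI such as 'spotify:track:<id>'."""
    return track_uri.split(":")[-1]


def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_audio_features(sp, track_ids, batch_size=100):
    """Collect audio features for all ids, one request per batch."""
    features = []
    for batch in chunked(track_ids, batch_size):
        # One API call covers the whole batch; results come back in request order.
        features.extend(sp.audio_features(batch))
    return features
```

With the playlist structure used above, this could be called as `fetch_audio_features(sp, [track_id_from_uri(s['track_uri']) for s in playlist['tracks']])`. Entries for tracks that Spotify cannot resolve may come back as `None`, so it is worth filtering those out before building the DataFrame.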
