# Spotify Playlist - Data Preprocessing

This notebook preprocesses data for use with machine learning methods. It starts from a challenge dataset of 10,000 Spotify playlists, extracts the potentially cluttered ones, and retrieves features for each track in a playlist from the Spotify API, using the track ID. The processed playlists are then saved to a JSON file, which is used in a separate notebook.

*Note: A potentially cluttered playlist is defined here as any playlist with 50 or more tracks.*

The first step is to import the necessary libraries and set up the credentials for connecting to the Spotify API.

In [1]:
import pandas as pd
import json
import time
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Placeholders: replace with the credentials of your own Spotify app
client_id = 'your-client-id'
client_secret = 'your-client-secret'

# Credentials to access the Spotify music data
manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=manager)
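
Hard-coded credentials are easy to leak when a notebook is shared. The following is a minimal sketch, assuming the credentials have been exported as the environment variables `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` before the notebook starts. The helper `load_credentials` is hypothetical and not part of the original notebook.

```python
import os

def load_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables."""
    # Assumption: both variables were exported before the notebook was started
    env = os.environ if env is None else env
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None

# Usage (not run here):
# client_id, client_secret = load_credentials()
```

The returned pair can then be passed to `SpotifyClientCredentials` in place of the placeholder strings.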

The following cells define the necessary functions and demonstrate how each one operates on a sample playlist.

In [2]:
#Read json data and convert to DataFrame
def get_dataset():
    playlists_set = pd.read_json('challenge_set.json')
    return playlists_set
playlists_set = get_dataset()
playlists_set

Unnamed: 0,date,version,playlists,name,description
0,2018-01-16 08:47:28.198015,v1,"{'name': 'spanish playlist', 'num_holdouts': 1...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
1,2018-01-16 08:47:28.198015,v1,"{'name': 'Groovin', 'num_holdouts': 48, 'pid':...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
2,2018-01-16 08:47:28.198015,v1,"{'name': 'uplift', 'num_holdouts': 40, 'pid': ...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
3,2018-01-16 08:47:28.198015,v1,"{'name': 'WUBZ', 'num_holdouts': 27, 'pid': 10...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
4,2018-01-16 08:47:28.198015,v1,"{'name': 'new', 'num_holdouts': 41, 'pid': 100...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
...,...,...,...,...,...
9995,2018-01-16 08:47:28.198015,v1,"{'name': 'Playlist 2015', 'num_holdouts': 20, ...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
9996,2018-01-16 08:47:28.198015,v1,"{'name': 'Workout', 'num_holdouts': 24, 'pid':...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
9997,2018-01-16 08:47:28.198015,v1,"{'name': 'Girlz', 'num_holdouts': 16, 'pid': 1...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
9998,2018-01-16 08:47:28.198015,v1,"{'name': 'let's get lost', 'num_holdouts': 35,...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018


In [3]:
#Extract a sample playlist
def get_playlist_from_dataset(playlists_set, index):
    playlists = playlists_set.playlists
    playlist = pd.DataFrame(playlists[index])
    return playlist
playlist = get_playlist_from_dataset(playlists_set, 8000)
playlist

Unnamed: 0,name,num_holdouts,pid,num_tracks,tracks,num_samples
0,bangerz,57,1018569,157,"{'pos': 1, 'artist_name': 'Kanye West', 'track...",100
1,bangerz,57,1018569,157,"{'pos': 2, 'artist_name': 'Young Thug', 'track...",100
2,bangerz,57,1018569,157,"{'pos': 3, 'artist_name': 'Kodak Black', 'trac...",100
3,bangerz,57,1018569,157,"{'pos': 4, 'artist_name': 'Kodak Black', 'trac...",100
4,bangerz,57,1018569,157,"{'pos': 6, 'artist_name': 'Kodak Black', 'trac...",100
...,...,...,...,...,...,...
95,bangerz,57,1018569,157,"{'pos': 149, 'artist_name': 'Meek Mill', 'trac...",100
96,bangerz,57,1018569,157,"{'pos': 150, 'artist_name': 'Meek Mill', 'trac...",100
97,bangerz,57,1018569,157,"{'pos': 151, 'artist_name': 'Meek Mill', 'trac...",100
98,bangerz,57,1018569,157,"{'pos': 154, 'artist_name': 'Meek Mill', 'trac...",100


In [4]:
#Preprocessing data - keep only potentially cluttered playlists (50 or more tracks)
def filter_dataset(playlists_set, min_tracks=50):
    # Each entry of the 'playlists' column is a dict; its 'tracks' list holds the tracks
    keep = playlists_set['playlists'].apply(lambda p: len(p['tracks']) >= min_tracks)
    # Build a new DataFrame instead of dropping rows from the input while looping over it
    result_set = playlists_set[keep].reset_index(drop=True)
    return result_set, len(result_set)
playlists_set, set_size = filter_dataset(playlists_set)
playlists_set

Unnamed: 0,date,version,playlists,name,description
0,2018-01-16 08:47:28.198015,v1,"{'name': 'April', 'num_holdouts': 63, 'pid': 1...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
1,2018-01-16 08:47:28.198015,v1,"{'name': 'Other', 'num_holdouts': 73, 'pid': 1...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
2,2018-01-16 08:47:28.198015,v1,"{'name': 'Classic', 'num_holdouts': 83, 'pid':...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
3,2018-01-16 08:47:28.198015,v1,"{'name': '2016', 'num_holdouts': 61, 'pid': 10...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
4,2018-01-16 08:47:28.198015,v1,"{'name': 'Party songs ', 'num_holdouts': 100, ...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
...,...,...,...,...,...
1995,2018-01-16 08:47:28.198015,v1,"{'name': 'Indie', 'num_holdouts': 83, 'pid': 1...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
1996,2018-01-16 08:47:28.198015,v1,"{'name': 'The Collection', 'num_holdouts': 96,...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
1997,2018-01-16 08:47:28.198015,v1,"{'name': 'Main', 'num_holdouts': 111, 'pid': 1...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
1998,2018-01-16 08:47:28.198015,v1,"{'name': '💃🏻', 'num_holdouts': 70, 'pid': 1049...",build/challenge/challenge_set.json,the challenge set for the RecSys Challenge 2018
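
The same threshold can be applied to the raw JSON before it is loaded into pandas. This is a minimal sketch, assuming the file has a top-level `playlists` list whose entries each carry a `tracks` list, as the outputs above suggest. The function name `filter_cluttered` and the sample data are made up for illustration.

```python
MIN_TRACKS = 50  # threshold from the note at the top of this notebook

def filter_cluttered(challenge, min_tracks=MIN_TRACKS):
    """Return the playlists that list at least `min_tracks` tracks."""
    return [p for p in challenge["playlists"] if len(p["tracks"]) >= min_tracks]

# Tiny synthetic stand-in for challenge_set.json (not real data)
sample = {
    "playlists": [
        {"name": "short", "tracks": [{"pos": i} for i in range(10)]},
        {"name": "long", "tracks": [{"pos": i} for i in range(60)]},
    ]
}

kept = filter_cluttered(sample)
print([p["name"] for p in kept])  # ['long']
```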


In [5]:
#Obtain track IDs for each track in the playlist
def get_track_ids(playlist):
    track_uris = []
    for i in range(0, len(playlist)):
        track_uris.append(playlist['tracks'][i]['track_uri'])
    track_ids = []
    for uri in track_uris:
        track_ids.append(uri.split(':')[-1])
    return track_ids
track_ids = get_track_ids(playlist)
track_ids

['1Wsbr1d2BouNGk2q92mIj7',
 '20dP2DaMHIAmwWAbp7peSr',
 '34oWbFBfGEElvgO0a5c9V4',
 '5v7kaZNsnyByrSJOfO8gKq',
 '32nztUVOEvvlUtKBufJuzq',
 '6vFpwoagZKaQveINvGrbFK',
 '4X5f3vT8MRuXF68pfjNte5',
 '7odIekt1GqLVEAAWdnd9mJ',
 '2KpCpk6HjXXLb7nnXoXA5O',
 '2Kbxfq8wkjMCno1QJe4yyw',
 '22DKsoYFV5npPXmnPpXL7i',
 '6gBFPUFcJLzWGx4lenP6h2',
 '1Ci4wASMY4xtKVMeHA6Sd5',
 '0m1KYWlT6LhFRBDVq9UNx4',
 '63OmVzNwoMNn6hT7p5vwvX',
 '2AGUFka8kBWCM47h5uTlDb',
 '177WEvlLsCc0FzCTWslawr',
 '5mPSyjLatqB00IkPqRlbTE',
 '0N3W5peJUQtI4eyR6GJT5O',
 '42GcjriRK6srwHkfbkBqVl',
 '2UOYzhusMTypF7oAQwksCj',
 '5GhJq5J9ZWIEDZdyw7EWzt',
 '26oF6WjkOjDIRK9YsdZp2l',
 '6BbINUfGabVyiNFJpQXn3x',
 '3NJG6vMH1ZsectZkocMEm0',
 '2gZUPNdnz5Y45eiGxpHGSc',
 '7lL3MvFWFFSD25pBz72Agj',
 '1NMYfyq8eE4qUCj2SXvrJd',
 '0pSBuHjILhNEo55xK1zrRt',
 '3sgL7CN65BzZq3qqG9SkJZ',
 '3AIgVHfr3FnXJvcdwgP7Km',
 '2bjwRfXMk4uRgOD9IBYl9h',
 '0xl1w2q4VLojeXp4JfazPL',
 '7qqOmEfeTPgjBsm1Qm98eg',
 '4XkOcWt0C2JX1s2RXybosk',
 '0Fv5N0cHBsl4bzCbollCAS',
 '5NQbUaeTEOGdD6hHcre0dZ',
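
`get_track_ids` keeps whatever follows the last colon of each URI. A slightly stricter variant is sketched below, assuming the URIs have the form `spotify:track:<id>`. The helper `track_id_from_uri` is hypothetical and rejects anything that is not a track URI instead of silently returning a fragment of it.

```python
def track_id_from_uri(uri):
    """Extract the ID from a URI of the form 'spotify:track:<id>'."""
    # Unpacking raises ValueError if the URI does not have exactly three parts
    scheme, kind, track_id = uri.split(":")
    if scheme != "spotify" or kind != "track":
        raise ValueError(f"Not a Spotify track URI: {uri!r}")
    return track_id

print(track_id_from_uri("spotify:track:1Wsbr1d2BouNGk2q92mIj7"))
```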
 

In [6]:
#Obtain track features based on a track ID
def get_track_features(track_id):
    # Two API requests per track: metadata and audio features
    meta = sp.track(track_id)
    # audio_features returns a list; take the entry for this track
    features = sp.audio_features(track_id)[0]

    # meta
    name = meta['name']
    album = meta['album']['name']
    # Note: this is the album's first listed artist, which can differ from the track's artist
    artist = meta['album']['artists'][0]['name']
    release_date = meta['album']['release_date']
    length = meta['duration_ms']
    popularity = meta['popularity']
    ids = meta['id']

    # features
    acousticness = features['acousticness']
    danceability = features['danceability']
    energy = features['energy']
    instrumentalness = features['instrumentalness']
    liveness = features['liveness']
    valence = features['valence']
    loudness = features['loudness']
    speechiness = features['speechiness']
    tempo = features['tempo']
    key = features['key']
    time_signature = features['time_signature']

    # The order of the values must match the order of the column names
    track = [name, album, artist, ids, release_date, popularity, length, danceability, acousticness,
            energy, instrumentalness, liveness, valence, loudness, speechiness, tempo, key, time_signature]
    columns = ['name','album','artist','id','release_date','popularity','length','danceability','acousticness','energy','instrumentalness',
                'liveness','valence','loudness','speechiness','tempo','key','time_signature']
    return track, columns
track, columns = get_track_features(track_ids[0])
track, columns

(['Pt. 2',
  'The Life Of Pablo',
  'Kanye West',
  '1Wsbr1d2BouNGk2q92mIj7',
  '2016-06-10',
  68,
  130293,
  0.674,
  0.552,
  0.752,
  0,
  0.787,
  0.236,
  -4.073,
  0.333,
  145.076,
  11,
  4],
 ['name',
  'album',
  'artist',
  'id',
  'release_date',
  'popularity',
  'length',
  'danceability',
  'acousticness',
  'energy',
  'instrumentalness',
  'liveness',
  'valence',
  'loudness',
  'speechiness',
  'tempo',
  'key',
  'time_signature'])
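
`get_track_features` reads the first entry of the audio-features response directly, so it would fail if that entry were missing. The sketch below assumes the response list can contain `None` for a track that has no audio analysis; verify this against the API you are using. The helper `pick_features` and the constant `FEATURE_KEYS` are hypothetical.

```python
FEATURE_KEYS = ["danceability", "energy", "valence", "loudness"]

def pick_features(feature_list, keys=FEATURE_KEYS):
    """Return the selected feature values, or None placeholders if the entry is missing."""
    entry = feature_list[0] if feature_list else None
    if entry is None:
        # Keep the row shape stable so the track can still be added to the DataFrame
        return [None] * len(keys)
    return [entry.get(key) for key in keys]

# Values taken from the sample track shown above
print(pick_features([{"danceability": 0.674, "energy": 0.752,
                      "valence": 0.236, "loudness": -4.073}]))
print(pick_features([None]))
```

Rows with placeholder values can later be dropped or imputed, depending on what the downstream notebook expects.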

In [7]:
#Create a DataFrame of all the playlist tracks and their respective features
def create_playlist_dataframe(track_ids):
    tracks = []
    for track_id in track_ids:
        time.sleep(.5)
        track, columns = get_track_features(track_id)
        tracks.append(track)
    df1 = pd.DataFrame(tracks,columns=columns)
    return df1
df1 = create_playlist_dataframe(track_ids)
df1

Unnamed: 0,name,album,artist,id,release_date,popularity,length,danceability,acousticness,energy,instrumentalness,liveness,valence,loudness,speechiness,tempo,key,time_signature
0,Pt. 2,The Life Of Pablo,Kanye West,1Wsbr1d2BouNGk2q92mIj7,2016-06-10,68,130293,0.674,0.5520,0.752,0.0,0.7870,0.236,-4.073,0.3330,145.076,11,4
1,pick up the phone,Birds In The Trap Sing McKnight,Travis Scott,20dP2DaMHIAmwWAbp7peSr,2016-09-16,70,252256,0.711,0.1140,0.739,0.0,0.2260,0.430,-3.804,0.1290,136.919,7,4
2,No Flockin',No Flockin',Kodak Black,34oWbFBfGEElvgO0a5c9V4,2015-11-13,72,165290,0.943,0.0673,0.595,0.0,0.0839,0.815,-8.372,0.1910,117.532,5,4
3,Skrt,Skrt,Kodak Black,5v7kaZNsnyByrSJOfO8gKq,2016-02-12,62,224864,0.901,0.5850,0.352,0.0,0.1220,0.199,-10.038,0.0916,111.065,2,4
4,Vibin In This Bih,Lil Big Pac,Kodak Black,32nztUVOEvvlUtKBufJuzq,2016-06-10,0,169926,0.710,0.0171,0.579,0.0,0.1050,0.305,-5.724,0.2120,163.992,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Fall Thru,Wins & Losses,Meek Mill,7dK9JJXrbDFc29n23BqUso,2017-07-21,47,222780,0.558,0.3840,0.808,0.0,0.2600,0.630,-4.284,0.3670,157.418,9,4
96,Never Lose (feat. Lihtz Kamraz),Wins & Losses,Meek Mill,2C4XQsZFeze4yMatFzY9M2,2017-07-21,34,236520,0.759,0.0306,0.792,0.0,0.1420,0.542,-3.259,0.1770,130.119,1,4
97,Glow Up,Wins & Losses,Meek Mill,2N3LQk27fglVbVqQ8zA6lQ,2017-07-21,44,210346,0.720,0.0823,0.783,0.0,0.1210,0.604,-4.097,0.1290,76.085,1,4
98,Ball Player (feat. Quavo),Wins & Losses,Meek Mill,6pVSAHKEKOWxDcfr8haIGe,2017-07-21,40,251706,0.724,0.0457,0.827,0.0,0.0848,0.200,-4.211,0.1290,144.973,11,4
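
`create_playlist_dataframe` fetches one track at a time and sleeps half a second between tracks, which is slow for playlists of 50 or more tracks. The Spotify API also offers multi-track endpoints, so the IDs could be sent in batches. The sketch below shows only the batching step. The helper `chunked` is hypothetical, and the batch size of 50 is an assumption that should be checked against the current API limits.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

print(list(chunked(["a", "b", "c", "d", "e"], 2)))  # [['a', 'b'], ['c', 'd'], ['e']]

# Hypothetical use with the client from this notebook (not run here):
# for batch in chunked(track_ids, 50):
#     metas = sp.tracks(batch)["tracks"]
#     feats = sp.audio_features(batch)
```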


In [8]:
#Extract only necessary features from the df
def extract_features(df1):
    df = df1[['name','album','artist','id','release_date','popularity','danceability','energy','valence','loudness']]
    return df
df = extract_features(df1)
df

Unnamed: 0,name,album,artist,id,release_date,popularity,danceability,energy,valence,loudness
0,Pt. 2,The Life Of Pablo,Kanye West,1Wsbr1d2BouNGk2q92mIj7,2016-06-10,68,0.674,0.752,0.236,-4.073
1,pick up the phone,Birds In The Trap Sing McKnight,Travis Scott,20dP2DaMHIAmwWAbp7peSr,2016-09-16,70,0.711,0.739,0.430,-3.804
2,No Flockin',No Flockin',Kodak Black,34oWbFBfGEElvgO0a5c9V4,2015-11-13,72,0.943,0.595,0.815,-8.372
3,Skrt,Skrt,Kodak Black,5v7kaZNsnyByrSJOfO8gKq,2016-02-12,62,0.901,0.352,0.199,-10.038
4,Vibin In This Bih,Lil Big Pac,Kodak Black,32nztUVOEvvlUtKBufJuzq,2016-06-10,0,0.710,0.579,0.305,-5.724
...,...,...,...,...,...,...,...,...,...,...
95,Fall Thru,Wins & Losses,Meek Mill,7dK9JJXrbDFc29n23BqUso,2017-07-21,47,0.558,0.808,0.630,-4.284
96,Never Lose (feat. Lihtz Kamraz),Wins & Losses,Meek Mill,2C4XQsZFeze4yMatFzY9M2,2017-07-21,34,0.759,0.792,0.542,-3.259
97,Glow Up,Wins & Losses,Meek Mill,2N3LQk27fglVbVqQ8zA6lQ,2017-07-21,44,0.720,0.783,0.604,-4.097
98,Ball Player (feat. Quavo),Wins & Losses,Meek Mill,6pVSAHKEKOWxDcfr8haIGe,2017-07-21,40,0.724,0.827,0.200,-4.211


Now that we have demonstrated on a sample playlist how the data is retrieved, filtered, and processed, the same procedure is applied to other playlists from the dataset. The results are saved to a new JSON file, ready for the algorithms to be applied.

In [9]:
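#Append a playlist to an existing JSON file
#The file must already exist and contain the keys "playlists" (a list) and "size" (a count)
def write_json(new_data, filename='playlists.json'):
    with open(filename,'r+') as file:
        # load existing data into a dict
        file_data = json.load(file)
        # append new_data to the stored playlists and update the count
        file_data["playlists"].append(new_data)
        file_data["size"] = file_data["size"] + 1
        # move back to the start of the file before rewriting it
        file.seek(0)
        # convert back to json
        json.dump(file_data, file, indent = 2)
        # discard any leftover bytes from the previous contents
        file.truncate()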
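
Because `write_json` opens the file in `'r+'` mode and reads the `playlists` and `size` keys, the file has to exist with that structure before the first call. A minimal sketch of such an initialisation step follows. The function name `init_playlists_file` is hypothetical.

```python
import json
import os

def init_playlists_file(filename="playlists.json"):
    """Create the JSON file with an empty container unless it already exists."""
    if not os.path.exists(filename):
        with open(filename, "w") as file:
            json.dump({"playlists": [], "size": 0}, file, indent=2)

# Usage (not run here): call once before the first write_json call
# init_playlists_file()
```

Skipping the step when the file already exists keeps previously saved playlists intact if the notebook is re-run.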

In [11]:
#Apply the entire procedure to the first 200 playlists of the filtered dataset
for i in range(0, 200):
    playlist = get_playlist_from_dataset(playlists_set, i)
    track_ids = get_track_ids(playlist)
    df1 = create_playlist_dataframe(track_ids)
    df = extract_features(df1)
    jsonDF = json.loads(df.to_json(orient="columns"))
    write_json(jsonDF)
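
With `orient="columns"`, each saved playlist is a dict that maps a column name to a dict of row index (as a string) to value. The sketch below shows one way the follow-up notebook might turn such an entry back into per-track records without pandas. The helper `columns_to_rows` is hypothetical, and the sample holds only two columns of the first two tracks shown earlier.

```python
def columns_to_rows(playlist):
    """Turn {column: {row_index: value}} into a list of per-track dicts."""
    if not playlist:
        return []
    # Row indices are stored as strings, so sort them numerically
    row_keys = sorted(next(iter(playlist.values())), key=int)
    return [{col: values[key] for col, values in playlist.items()} for key in row_keys]

saved = {
    "name": {"0": "Pt. 2", "1": "pick up the phone"},
    "energy": {"0": 0.752, "1": 0.739},
}
rows = columns_to_rows(saved)
print(rows[1])  # {'name': 'pick up the phone', 'energy': 0.739}
```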