# Step 1a | Retrieve Track Features from Spotify API

Retrieving track-related features from the Spotify API proved to be very time-consuming, so I decided to split the task into chunks. I am running the same process in parallel on Jupyter Lab and on Google Colab (with GPU and high-RAM enabled).

At the end, I will concatenate all the chunks of data into one final dataframe on Jupyter Lab, where I will complete my pipeline.
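
As a rough illustration of the chunking idea, the sketch below computes slice boundaries that cover a list without gaps or overlaps. The helper name `chunk_bounds` and the chunk size of 10,000 are assumptions made for this example; the slices actually used further down in this notebook were chosen by hand and differ from these.

```python
def chunk_bounds(n_items, chunk_size):
    """Return (start, stop) index pairs that cover range(n_items) without gaps or overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, n_items))
            for start in range(0, n_items, chunk_size)]


# 36,978 is the number of album URIs loaded later in this notebook;
# the chunk size is an illustrative choice, not the one used for the real runs.
bounds = chunk_bounds(36978, 10000)
print(bounds)
# [(0, 10000), (10000, 20000), (20000, 30000), (30000, 36978)]
```

Each `(start, stop)` pair can then be used as `albums_uri_list[start:stop]`, and the per-chunk dataframes combined with `pd.concat` once every chunk has been retrieved.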

In [None]:
# mount drive
from google.colab import drive
drive.mount('/content/gdrive',force_remount=True) # force remount to ensure new directories are included

Mounted at /content/gdrive


In [None]:
# check GPU is enabled
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [None]:
#!pip install spotipy

In [None]:
# imports (duplicate and unused imports removed)
import pickle

import pandas as pd
from pandas.errors import MergeError
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

In [None]:
with open('/content/gdrive/MyDrive/Colab_spotify/albums_uri_list.pkl', 'rb') as f:
  albums_uri_list = pickle.load(f)

In [None]:
len(albums_uri_list)

36978

In [None]:
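# retrieve client id and client secret from a txt file saved on Google Drive
file = '/content/gdrive/MyDrive/Colab_spotify/spotify_creds.txt'

with open(file, 'r') as f:
    lines = f.read().splitlines()  # separate name so the file handle is not shadowed
    # each line has the form 'label:value'; strip stray whitespace around the value
    cid = lines[0].split(':', 1)[1].strip()
    secret = lines[1].split(':', 1)[1].strip()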

client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)

# initiate spotipy client
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
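
The credentials file is read as two `label:value` lines. A slightly more defensive parser might look like the sketch below; the function name `parse_creds` and the labels in the example are hypothetical, since the notebook only shows that the value follows the first colon on each line.

```python
def parse_creds(lines):
    """Parse 'label:value' lines into a (client_id, client_secret) tuple."""
    values = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        label, sep, value = line.partition(':')
        if not sep:
            raise ValueError(f"expected 'label:value', got {line!r}")
        values.append(value.strip())
    if len(values) < 2:
        raise ValueError("expected two credential lines")
    return values[0], values[1]


# The labels and values below are made up for illustration.
example_id, example_secret = parse_creds(["client_id: abc123", "client_secret: def456"])
print(example_id, example_secret)
# abc123 def456
```

Raising an error on a malformed file makes a bad credentials file fail here rather than later as an authentication error from the API.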

In [None]:
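# re-create the client with a longer timeout and automatic retries;
# the credentials must be passed again, otherwise the client is unauthenticated
# and the functions below (which use `sp`) never see these settings
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager,
                     requests_timeout=100, retries=10)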

In [None]:
# function to extract tracks by album
def get_tracks_from_albums(list_of_albums_uri):
    """Return one dataframe with metadata and audio features for every track of the given albums."""

    album_frames = []  # one dataframe per album, concatenated once at the end

    for album_uri in list_of_albums_uri:
        # album URIs have the form 'spotify:album:<id>'
        album_id = album_uri.split(':')[2]
        album_results = sp.album(album_id)

        # note: the album payload only contains its first page of tracks
        current_album_tracks = [item['id'] for item in album_results['tracks']['items']]
        if not current_album_tracks:
            continue  # nothing to look up for this album

        track_meta = {'id': [], 'album_name': [], 'track_name': [],
                      'artist_name': [], 'explicit': [], 'track_popularity': []}

        for track_id in current_album_tracks:
            # get track's metadata
            meta = sp.track(track_id)

            track_meta['id'].append(track_id)
            track_meta['album_name'].append(meta['album']['name'])
            track_meta['track_name'].append(meta['name'])
            # join multiple artists into one comma-separated string
            track_meta['artist_name'].append(', '.join(artist['name'] for artist in meta['artists']))
            # explicit: lyrics could be considered offensive or unsuitable for children
            track_meta['explicit'].append(meta['explicit'])
            track_meta['track_popularity'].append(meta['popularity'])

        # build a track metadata df
        track_meta_df = pd.DataFrame.from_dict(track_meta)

        # find track features and aggregate them in a df;
        # the API returns None for tracks without audio features, so drop those entries
        track_features = sp.audio_features(current_album_tracks)
        track_features = [features for features in track_features if features is not None]
        if not track_features:
            continue  # skip albums without any audio features
        track_features_df = pd.DataFrame(track_features)

        # duration_ms is the duration of the track in milliseconds (1 minute = 60,000 ms)
        #track_features_df['duration_mins'] = track_features_df['duration_ms']/60000

        try:
            # combine metadata and track features into one dataframe (joined on the shared 'id' column)
            tracks_df = track_meta_df.merge(track_features_df)
        except MergeError:
            continue  # skip this album instead of re-appending the previous album's rows

        album_frames.append(tracks_df)

    # a single concat at the end avoids copying the growing dataframe on every iteration
    return pd.concat(album_frames) if album_frames else pd.DataFrame()

In [None]:
tracks_chunk_5 = get_tracks_from_albums(albums_uri_list[10000:20000])
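
Most of the running time comes from issuing one `sp.track` request per track. A possible speed-up, sketched below and not used for the results in this notebook, is to request metadata and audio features in batches through spotipy's `tracks` and `audio_features` calls. The helper names `batched` and `fetch_tracks_batched` are hypothetical, and the batch sizes of 50 and 100 are my understanding of the per-request limits of the two endpoints, so they should be checked against the current API documentation.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_tracks_batched(client, track_ids, meta_batch=50, features_batch=100):
    """Fetch metadata and audio features with one request per batch of ids.

    `client` is assumed to expose `tracks(ids)` and `audio_features(ids)`
    like a spotipy.Spotify instance. Returns one dict per complete track.
    """
    meta_by_id = {}
    for batch in batched(track_ids, meta_batch):
        for track in client.tracks(batch)['tracks']:
            if track is not None:  # unknown ids come back as None
                meta_by_id[track['id']] = track

    features_by_id = {}
    for batch in batched(track_ids, features_batch):
        for features in client.audio_features(batch):
            if features is not None:  # None when no features are available
                features_by_id[features['id']] = features

    rows = []
    for track_id in track_ids:
        meta = meta_by_id.get(track_id)
        features = features_by_id.get(track_id)
        if meta is None or features is None:
            continue  # skip tracks with incomplete data
        row = {
            'id': track_id,
            'album_name': meta['album']['name'],
            'track_name': meta['name'],
            'artist_name': ', '.join(artist['name'] for artist in meta['artists']),
            'explicit': meta['explicit'],
            'track_popularity': meta['popularity'],
        }
        # add the audio features, keeping the metadata columns above untouched
        row.update({key: value for key, value in features.items() if key not in row})
        rows.append(row)
    return rows
```

Calling `pd.DataFrame(fetch_tracks_batched(sp, track_ids))` would then give a dataframe with the same columns as the one built by `get_tracks_from_albums`, at roughly one metadata request per 50 tracks instead of one per track.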

In [None]:
tracks_chunk_5

Unnamed: 0,id,album_name,track_name,artist_name,explicit,track_popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,uri,track_href,analysis_url,duration_ms,time_signature
0,6mJjdx37IVJ0W115uP49DS,Gettin' Out The Good Stuff,Every Time I Get Around You,David Lee Murphy,False,27,0.752,0.591,2,-10.564,1,0.0312,0.035000,0.000000,0.0257,0.8360,130.675,audio_features,spotify:track:6mJjdx37IVJ0W115uP49DS,https://api.spotify.com/v1/tracks/6mJjdx37IVJ0...,https://api.spotify.com/v1/audio-analysis/6mJj...,206280,4
1,5NdQX6XqsuGyzDH3Lm1xaV,Gettin' Out The Good Stuff,The Road You Leave Behind,David Lee Murphy,False,29,0.781,0.431,4,-12.026,1,0.0303,0.133000,0.000000,0.0693,0.4450,123.034,audio_features,spotify:track:5NdQX6XqsuGyzDH3Lm1xaV,https://api.spotify.com/v1/tracks/5NdQX6XqsuGy...,https://api.spotify.com/v1/audio-analysis/5NdQ...,233507,4
2,5SxZbOZbLwmHLlOCd8cztu,Gettin' Out The Good Stuff,She's Really Something To See,David Lee Murphy,False,11,0.574,0.312,2,-13.367,1,0.0251,0.214000,0.000000,0.3370,0.2820,148.118,audio_features,spotify:track:5SxZbOZbLwmHLlOCd8cztu,https://api.spotify.com/v1/tracks/5SxZbOZbLwmH...,https://api.spotify.com/v1/audio-analysis/5SxZ...,238347,4
3,1sTtxkHk8tB6elREzttYyc,Gettin' Out The Good Stuff,Genuine Rednecks,David Lee Murphy,False,21,0.680,0.598,2,-10.969,1,0.0316,0.134000,0.000009,0.2270,0.5610,136.133,audio_features,spotify:track:1sTtxkHk8tB6elREzttYyc,https://api.spotify.com/v1/tracks/1sTtxkHk8tB6...,https://api.spotify.com/v1/audio-analysis/1sTt...,254240,4
4,41CfHwWCuRp0XVZgPlfuKj,Gettin' Out The Good Stuff,100 Years Too Late,David Lee Murphy,False,8,0.552,0.279,7,-10.976,1,0.0270,0.354000,0.000000,0.1030,0.1870,135.890,audio_features,spotify:track:41CfHwWCuRp0XVZgPlfuKj,https://api.spotify.com/v1/tracks/41CfHwWCuRp0...,https://api.spotify.com/v1/audio-analysis/41Cf...,250840,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9,4m7UgJ3yDs75kn2ptsuCca,Tesla,Feels Good,"Flux Pavilion, Tom Cane",False,20,0.482,0.757,4,-3.333,0,0.0568,0.004200,0.084300,0.2000,0.2540,124.998,audio_features,spotify:track:4m7UgJ3yDs75kn2ptsuCca,https://api.spotify.com/v1/tracks/4m7UgJ3yDs75...,https://api.spotify.com/v1/audio-analysis/4m7U...,228857,4
10,3P4pVY3MSsWOzq286bmvgr,Tesla,Who Wants to Rock,"Flux Pavilion, Riff Raff",False,37,0.812,0.883,9,-3.840,1,0.1560,0.002900,0.028800,0.1410,0.5390,105.023,audio_features,spotify:track:3P4pVY3MSsWOzq286bmvgr,https://api.spotify.com/v1/tracks/3P4pVY3MSsWO...,https://api.spotify.com/v1/audio-analysis/3P4p...,219350,4
11,1AjXklmJCpU1NdvjbeLlBq,Tesla,I Got Something,Flux Pavilion,False,20,0.644,0.936,0,-4.300,1,0.0371,0.000239,0.686000,0.1800,0.2100,100.063,audio_features,spotify:track:1AjXklmJCpU1NdvjbeLlBq,https://api.spotify.com/v1/tracks/1AjXklmJCpU1...,https://api.spotify.com/v1/audio-analysis/1AjX...,241880,4
12,1nWGLxs3Gpynin4EAiT7Uo,Tesla,Ironheart,"Flux Pavilion, BullySongs",False,23,0.430,0.835,1,-4.593,1,0.0355,0.002930,0.000453,0.0446,0.2150,145.052,audio_features,spotify:track:1nWGLxs3Gpynin4EAiT7Uo,https://api.spotify.com/v1/tracks/1nWGLxs3Gpyn...,https://api.spotify.com/v1/audio-analysis/1nWG...,202056,4


In [None]:
# pickle to save partial results
tracks_chunk_5.to_pickle('/content/gdrive/MyDrive/Colab_spotify/track_chunk_5.pkl')

In [None]:
tracks_chunk_6 = get_tracks_from_albums(albums_uri_list[25000:])

In [None]:
tracks_chunk_6

Unnamed: 0,id,album_name,track_name,artist_name,explicit,track_popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,uri,track_href,analysis_url,duration_ms,time_signature
0,39ZZHQG2E7h5SpJDvUrkDo,Respighi: Ancient Airs and Dances,"Ancient Airs And Dances, Suite No.1, [P. 109]:...","Ottorino Respighi, Boston Symphony Orchestra, ...",False,13,0.3140,0.0643,2,-21.547,1,0.0388,0.955000,0.911000,0.1380,0.3110,120.344,audio_features,spotify:track:39ZZHQG2E7h5SpJDvUrkDo,https://api.spotify.com/v1/tracks/39ZZHQG2E7h5...,https://api.spotify.com/v1/audio-analysis/39ZZ...,181720,4
1,7liVHOZmyeuMOI80YlNmH4,Respighi: Ancient Airs and Dances,"Ancient Airs And Dances, Suite No.1, [P. 109]:...","Ottorino Respighi, Boston Symphony Orchestra, ...",False,12,0.2800,0.0824,2,-19.060,1,0.0346,0.961000,0.918000,0.1570,0.3210,93.352,audio_features,spotify:track:7liVHOZmyeuMOI80YlNmH4,https://api.spotify.com/v1/tracks/7liVHOZmyeuM...,https://api.spotify.com/v1/audio-analysis/7liV...,202200,4
2,3p97y6YitJ4WiDFoH6Dp0U,Respighi: Ancient Airs and Dances,"Ancient Airs And Dances, Suite No.1, [P. 109]:...","Ottorino Respighi, Boston Symphony Orchestra, ...",False,11,0.0925,0.0274,11,-33.148,0,0.0428,0.935000,0.785000,0.0919,0.0597,85.937,audio_features,spotify:track:3p97y6YitJ4WiDFoH6Dp0U,https://api.spotify.com/v1/tracks/3p97y6YitJ4W...,https://api.spotify.com/v1/audio-analysis/3p97...,321760,3
3,3KdLaJMFiFEQ3ovSLe616W,Respighi: Ancient Airs and Dances,"Ancient Airs And Dances, Suite No.1, [P. 109]:...","Ottorino Respighi, Boston Symphony Orchestra, ...",False,11,0.4610,0.2210,2,-17.682,1,0.0387,0.938000,0.874000,0.1130,0.5770,123.535,audio_features,spotify:track:3KdLaJMFiFEQ3ovSLe616W,https://api.spotify.com/v1/tracks/3KdLaJMFiFEQ...,https://api.spotify.com/v1/audio-analysis/3KdL...,223333,4
4,4gng0wpvWsn4JQHryblaiF,Respighi: Ancient Airs and Dances,"Ancient Airs And Dances, Suite No.2, [P. 138]:...","Ottorino Respighi, Boston Symphony Orchestra, ...",False,10,0.3030,0.0145,7,-22.625,1,0.0384,0.974000,0.914000,0.0569,0.2760,95.919,audio_features,spotify:track:4gng0wpvWsn4JQHryblaiF,https://api.spotify.com/v1/tracks/4gng0wpvWsn4...,https://api.spotify.com/v1/audio-analysis/4gng...,256827,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13,6VP69eiW8k8OVUjS2lBYKy,Rise of the Blood Legion- The Best of (Chapter 1),Adrenalize,In This Moment,False,30,0.4430,0.9600,6,-3.933,1,0.0872,0.000236,0.000000,0.1310,0.4320,132.190,audio_features,spotify:track:6VP69eiW8k8OVUjS2lBYKy,https://api.spotify.com/v1/tracks/6VP69eiW8k8O...,https://api.spotify.com/v1/audio-analysis/6VP6...,255653,4
14,3GrFAJiT7FWRBKuLqVHy3b,Rise of the Blood Legion- The Best of (Chapter 1),It is Written,In This Moment,True,19,0.2990,0.7620,1,-15.699,1,0.9140,0.055900,0.377000,0.6070,0.1570,80.819,audio_features,spotify:track:3GrFAJiT7FWRBKuLqVHy3b,https://api.spotify.com/v1/tracks/3GrFAJiT7FWR...,https://api.spotify.com/v1/audio-analysis/3GrF...,30227,4
15,326QxhmInOvbgbaNzeHxRz,Rise of the Blood Legion- The Best of (Chapter 1),Burn,In This Moment,False,27,0.3580,0.8190,4,-5.256,1,0.0606,0.002710,0.003870,0.1050,0.0452,146.093,audio_features,spotify:track:326QxhmInOvbgbaNzeHxRz,https://api.spotify.com/v1/tracks/326QxhmInOvb...,https://api.spotify.com/v1/audio-analysis/326Q...,284693,4
16,5hamZeMqIFSZbEgQ2kQ9IN,Rise of the Blood Legion- The Best of (Chapter 1),Whore,In This Moment,True,33,0.4740,0.8840,6,-3.845,0,0.1010,0.025400,0.000000,0.0783,0.5240,179.559,audio_features,spotify:track:5hamZeMqIFSZbEgQ2kQ9IN,https://api.spotify.com/v1/tracks/5hamZeMqIFSZ...,https://api.spotify.com/v1/audio-analysis/5ham...,245960,4


In [None]:
# pickle to save partial results
tracks_chunk_6.to_pickle('/content/gdrive/MyDrive/Colab_spotify/track_chunk_6.pkl')
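
Because each chunk takes a long time to retrieve, it can help to make the save step double as a checkpoint, so that re-running the notebook after a disconnect reloads finished chunks instead of fetching them again. The sketch below shows one way to do this with the standard `pickle` module; the helper name `load_or_compute` is an assumption for illustration and is not part of the original pipeline.

```python
import os
import pickle


def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it first if it is missing."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)

    result = compute()

    # write to a temporary file first, then rename, so an interrupted
    # run does not leave a truncated pickle behind
    tmp_path = path + '.tmp'
    with open(tmp_path, 'wb') as f:
        pickle.dump(result, f)
    os.replace(tmp_path, path)
    return result


# Hypothetical usage with the chunk retrieved above:
# tracks_chunk_6 = load_or_compute(
#     '/content/gdrive/MyDrive/Colab_spotify/track_chunk_6.pkl',
#     lambda: get_tracks_from_albums(albums_uri_list[25000:]))
```

Passing the retrieval as a function means the expensive API calls only run when no saved file exists yet.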

In [None]:
# concatenate two chunks before exporting them to jupyter lab 
tracks_colab_df = pd.concat([tracks_chunk_5,tracks_chunk_6])

In [None]:
# check that the concatenation kept every record from both chunks
if len(tracks_colab_df) == len(tracks_chunk_6)+len(tracks_chunk_5):
  print("Merge executed correctly")
else:
  print("Not all records have been included in the merged dataframe")

Merge executed correctly


In [None]:
# pickle tracks chunk retrieved on colab for use on jupyter lab
tracks_colab_df.to_pickle('/content/gdrive/MyDrive/Colab_spotify/track_colab.pkl')

I will then save the pickle file in my **local directory**, where the Jupyter notebooks for this project are stored, so that I can complete my pipeline on Jupyter Lab.