# Notebook 2: Get Additional Music Data from Spotify API

### Introduction

Now that I have the list of roughly 1,000 albums, I can use the Spotify API to collect the additional data I need. First I'll get the album IDs and the artist genres, which I'll use later to validate my PCA results. Then I'll get the track list for each album and the audio features for each track.

You'll need a Spotify token of your own. I keep a separate file called credentials.py in the same folder; it uses the util module from the spotipy library to create a token from my username, client_id, client_secret, redirect_uri, and scope. More details are available under "Becoming a Spotify Developer" here: https://towardsdatascience.com/get-your-spotify-streaming-history-with-python-d5a208bbcbd3

In [1]:
import pandas as pd
import pickle
import sys
import requests
from credentials import token

sys.setrecursionlimit(1000000) #to allow pickling

### Read in Critics DataFrame

The critics dataframe contains information on the artist and album that I'll need to extract, format, and feed into the Spotify API.

In [2]:
with open('../data/critics_df_all.pickle', 'rb') as read_file:
    critics = pickle.load(read_file)

In [3]:
critics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1093 entries, 0 to 1092
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Album_Title    1093 non-null   object
 1   Artist_Name    1093 non-null   object
 2   Critic_Rating  1093 non-null   object
 3   User_Rating    1093 non-null   object
 4   Release_Date   1093 non-null   object
dtypes: object(5)
memory usage: 42.8+ KB


In [4]:
album_title = critics.Album_Title
artist_name = critics.Artist_Name

In [5]:
# Spotify search queries use field filters of the form 'album:<title> artist:<name>'.
q_list = [f'album:{album} artist:{artist}' for album, artist in zip(album_title, artist_name)]
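
To see what the search endpoint actually receives, the sketch below builds one query string in the same `album:... artist:...` format and URL-encodes it with the standard library. The helper name `build_query` is hypothetical and not part of this notebook; `requests` should apply an equivalent encoding when the query is passed through `params`.

```python
from urllib.parse import urlencode

def build_query(album, artist):
    """Format one album/artist pair using Spotify's search field filters."""
    return f'album:{album} artist:{artist}'

q = build_query('SMiLE', 'Brian Wilson')
# Colons become %3A and spaces become + in the encoded query string.
encoded = urlencode([('q', q), ('type', 'album'), ('market', 'US')])
print(q)        # album:SMiLE artist:Brian Wilson
print(encoded)  # q=album%3ASMiLE+artist%3ABrian+Wilson&type=album&market=US
```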

In [58]:
# with open('../data/album_artist_list.pickle', 'wb') as to_write:
#     pickle.dump(q_list, to_write)

### Get Album IDs for Reviewed Albums

In [8]:
# with open('../data/album_artist_list.pickle', 'rb') as read_file:
#     q_list = pickle.load(read_file)

In [6]:
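def get_info(q, token, type = 'album', query = 'id'):
    """Search Spotify for q and return the `query` field of the first hit."""
    headers = {
        'Accept': 'application/json',
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {token}',
    }
    params = [
        ('q', q),
        ('type', type),
        ('market', 'US')
    ]
    json_key = type + 's'  # results are nested under 'albums', 'artists', ...
    try:
        response = requests.get('https://api.spotify.com/v1/search',
                                headers = headers, params = params, timeout = 5)
        results = response.json()
        return results[json_key]['items'][0][query]
    # Network failures, error payloads, and empty result lists all count as "not found".
    except (requests.RequestException, ValueError, KeyError, IndexError):
        return "None Found"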
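
The lookup in `get_info` depends on how the search response is nested: results sit under a plural key such as `albums`, then `items`, then the first element. The sketch below walks that nesting on hand-written dictionaries (not real API output) and returns `None` instead of raising when a level is missing. The helper name `first_item_field` and the sample payloads are assumptions made for illustration.

```python
def first_item_field(payload, item_type='album', field='id'):
    """Return `field` from the first search hit, or None if there is no hit."""
    items = payload.get(item_type + 's', {}).get('items', [])
    return items[0].get(field) if items else None

# Hand-written payloads that mimic the nesting of a search response.
hit = {'albums': {'items': [{'id': 'abc123', 'name': 'Example Album'}]}}
miss = {'albums': {'items': []}}
error = {'error': {'status': 401, 'message': 'The access token expired'}}

print(first_item_field(hit))    # abc123
print(first_item_field(miss))   # None
print(first_item_field(error))  # None
```

Returning `None` rather than the string "None Found" is a design choice of this sketch only; it makes missing values easier to filter with `isna()` later.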

In [7]:
album_ids = pd.DataFrame([q_list, [get_info(i, token) for i in q_list]]).T
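
The comprehension above sends one search request per album, more than a thousand in a row, so the API may answer some of them with HTTP 429 (rate limited) and a `Retry-After` header. Below is a minimal sketch of a retry wrapper, assuming the wait time is given in seconds. `send_with_retry` and `FakeResponse` are hypothetical names used only for this demonstration; the notebook itself does not retry.

```python
import time

def send_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out."""
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # Retry-After is given in seconds; fall back to 1 if it is missing.
            sleep(int(response.headers.get('Retry-After', 1)))
    return response

class FakeResponse:
    """Stand-in for requests.Response, used only for this demonstration."""
    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}

# Simulate one rate-limited reply followed by a success, recording the waits.
replies = iter([FakeResponse(429, {'Retry-After': '2'}), FakeResponse(200)])
waits = []
result = send_with_retry(lambda: next(replies), sleep=waits.append)
print(result.status_code, waits)  # 200 [2]
```

In real use, `send` could be something like `lambda: requests.get(url, headers=headers, params=params, timeout=5)`.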

In [8]:
album_ids.columns = ['album_artist', 'album_id']

In [9]:
album_ids

| | album_artist | album_id |
|---|---|---|
| 0 | album:Ten Freedom Summers artist:Wadada Leo Smith | None Found |
| 1 | album:Fetch the Bolt Cutters artist:Fiona Apple | 0fO1KemWL2uCCQmM22iKlj |
| 2 | album:SMiLE artist:Brian Wilson | 4Uc6YCjpfyjj02rZfg2EUv |
| 3 | album:Van Lear Rose artist:Loretta Lynn | 3mheNcbxiCqs3EcN5DcCye |
| 4 | album:To Pimp A Butterfly artist:Kendrick Lamar | 7ycBtnsMtyVbbwTfJwRjSP |
| ... | ... | ... |
| 1088 | album:Music Tapes for Clouds & Tornadoes artis... | 1Cm8AoA6lAX80LvdNuoEro |
| 1089 | album:Goths artist:The Mountain Goats | 6VTTkMIKHhmFsZkKXsvS5I |
| 1090 | album:Stubborn Persistent Illusions artist:Do ... | 1wrLF6seLRorRM7Khq6RJX |
| 1091 | album:Severant artist:Kuedo | 4E68d3pPsJlzNqVbR1amZP |


### Get Artist Genre Data

In [10]:
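def get_artist_id(q, token):
    """Return the genre list of the first artist matching q.

    Despite its name, this returns genres, not an ID. It is equivalent to
    get_info(q, token, type = 'artist', query = 'genres'), which the next cell uses.
    """
    headers = {
        'Accept': 'application/json',
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {token}',
    }
    params = [
        ('q', q),
        ('type', 'artist'),
        ('market', 'US')
    ]
    try:
        response = requests.get('https://api.spotify.com/v1/search',
                                headers = headers, params = params, timeout = 5)
        results = response.json()
        return results['artists']['items'][0]['genres']
    except (requests.RequestException, ValueError, KeyError, IndexError):
        return "None Found"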

In [11]:
# Name the Series directly; rename(..., inplace = True) is documented to return None.
album_genre = pd.concat([album_ids, pd.Series([get_info(i, token, type = 'artist', query = 'genres')
                                               for i in artist_name], name = 'genre')], axis = 1)

In [12]:
album_genre.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1093 entries, 0 to 1092
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   album_artist  1093 non-null   object
 1   album_id      1093 non-null   object
 2   genre         1093 non-null   object
dtypes: object(3)
memory usage: 25.7+ KB


In [13]:
# This drops all instances where the album from Metacritic is not on Spotify, as well as 
# instances when the same album shows up twice. 
album_genre = album_genre[album_genre.album_id != 'None Found'].drop_duplicates(subset = 'album_id').reset_index(drop = True)

In [14]:
album_genre.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 992 entries, 0 to 991
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   album_artist  992 non-null    object
 1   album_id      992 non-null    object
 2   genre         992 non-null    object
dtypes: object(3)
memory usage: 23.4+ KB


In [57]:
# with open('../data/album_genre_critic.pickle', 'wb') as to_write:
#     pickle.dump(album_genre, to_write)

### Get Track List for each Album ID

In [4]:
# with open('../data/album_genre_critic.pickle', 'rb') as read_file:
#     album_genre = pickle.load(read_file)

In [15]:
def get_tracklist(album_id, token):
    """Return a DataFrame of (album_id, track_id) rows for one album."""
    headers = {
        'Accept': 'application/json',
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {token}',
    }
    get_url = 'https://api.spotify.com/v1/albums/' + album_id + '/tracks'

    # No `limit` is passed, so only the first page (20 tracks by default) comes back.
    response = requests.get(get_url, headers=headers, params=None)
    items = response.json()['items']
    track_ids = [item['id'] for item in items]
    return pd.DataFrame({'album_id': [album_id] * len(track_ids), 'track_id': track_ids},
                        columns = ['album_id', 'track_id'])
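
The album-tracks endpoint returns its results in pages (20 items by default), and each page carries a `next` URL that is null on the last page. `get_tracklist` reads only the first page, so here is a hedged sketch of how every page could be followed. The function `collect_track_ids`, the injected `fetch_page` callable, and the fake pages are all hypothetical and stand in for real requests.

```python
def collect_track_ids(fetch_page, first_url):
    """Follow the `next` links of a paged response and gather every track ID."""
    track_ids = []
    url = first_url
    while url:
        page = fetch_page(url)
        track_ids.extend(item['id'] for item in page['items'])
        url = page.get('next')  # None on the last page
    return track_ids

# Two hand-written pages standing in for a 3-track album fetched 2 tracks at a time.
fake_pages = {
    'page1': {'items': [{'id': 't1'}, {'id': 't2'}], 'next': 'page2'},
    'page2': {'items': [{'id': 't3'}], 'next': None},
}
print(collect_track_ids(fake_pages.__getitem__, 'page1'))  # ['t1', 't2', 't3']
```

Against the real API, `fetch_page` could be `lambda url: requests.get(url, headers=headers).json()`, with `first_url` set to the album's `/tracks` URL.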

In [16]:
validalbum_trackid = pd.concat([get_tracklist(i, token)  for i in album_genre['album_id']])

In [17]:
album_track_df = album_genre.merge(validalbum_trackid, how = 'inner', on = 'album_id')

In [58]:
# with open('../data/album_track_df.pickle', 'wb') as to_write:
#     pickle.dump(album_track_df, to_write)

### Get Track Features for All Tracks

In [None]:
# with open('../data/album_track_df.pickle', 'rb') as read_file:
#     album_track_df = pickle.load(read_file)

In [18]:
tracks = album_track_df['track_id']

In [19]:
def get_features(track_list, token):
    """Return a DataFrame of audio features for the given track IDs."""
    headers = {
        'Accept': 'application/json',
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {token}',
    }
    ids = ','.join(track_list)  # the endpoint takes a comma-separated list of IDs
    params = [
        ('ids', ids),
        ('market', 'US')
    ]
    response = requests.get('https://api.spotify.com/v1/audio-features/', headers=headers, params=params)
    features = response.json()
    return pd.DataFrame(features['audio_features'])

In [20]:
# The audio-features endpoint accepts at most 100 IDs per request, so fetch in chunks of 50
# and line each chunk of features up with the matching rows of album_track_df.
# (The name `all` shadows the Python built-in; it is kept because later cells use it.)
chunk_size = 50
all = pd.concat(
    [pd.concat([album_track_df.iloc[start:start + chunk_size].reset_index(drop = True),
                get_features(tracks.iloc[start:start + chunk_size], token)], axis = 1)
     for start in range(0, len(album_track_df), chunk_size)]
).reset_index(drop = True)
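
The batching idea in the cell above can be isolated into a small helper. This is a minimal sketch using placeholder IDs; the name `chunked` is hypothetical and not part of the notebook.

```python
import math

def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

track_ids = [f'track{i}' for i in range(7)]   # placeholder IDs
batches = list(chunked(track_ids, 3))
print([len(b) for b in batches])              # [3, 3, 1]
print(','.join(batches[-1]))                  # track6

# With the 12,166 tracks reported below and batches of 50:
print(math.ceil(12166 / 50))                  # 244
```

Because the last slice is simply shorter, no separate remainder step is needed and the total row count never has to be hard-coded.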

In [21]:
all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12166 entries, 0 to 12165
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   album_artist      12166 non-null  object 
 1   album_id          12166 non-null  object 
 2   genre             12166 non-null  object 
 3   track_id          12166 non-null  object 
 4   danceability      12166 non-null  float64
 5   energy            12166 non-null  float64
 6   key               12166 non-null  int64  
 7   loudness          12166 non-null  float64
 8   mode              12166 non-null  int64  
 9   speechiness       12166 non-null  float64
 10  acousticness      12166 non-null  float64
 11  instrumentalness  12166 non-null  float64
 12  liveness          12166 non-null  float64
 13  valence           12166 non-null  float64
 14  tempo             12166 non-null  float64
 15  type              12166 non-null  object 
 16  id                12166 non-null  object

In [66]:
# with open('../data/full.pickle', 'wb') as to_write:
#     pickle.dump(all, to_write)

### Explore all features

In [21]:
# with open('../data/full.pickle', 'rb') as read_file:
#     all = pickle.load(read_file)

In [22]:
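all.album_id.value_counts().value_counts()
# No album has more than 20 tracks here. Note that 20 is also the default page size of the
# album-tracks endpoint, so albums with more than 20 tracks are likely truncated at 20.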

11    176
10    156
12    145
13     93
20     66
14     65
9      64
15     51
16     41
8      29
17     25
7      19
18     18
19     12
6      12
4       9
5       8
1       2
2       1
Name: album_id, dtype: int64
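
As a sanity check, the counts above can be reconciled with the earlier `info()` outputs by plain arithmetic. The sketch copies the histogram by hand and recomputes the number of albums, the number of tracks, and the share of albums sitting exactly at 20 tracks. Since 20 is also the default page size of the album-tracks endpoint, that last bucket may include albums that were cut off.

```python
# (tracks per album, number of albums), copied from the value counts above
counts = [(11, 176), (10, 156), (12, 145), (13, 93), (20, 66), (14, 65), (9, 64),
          (15, 51), (16, 41), (8, 29), (17, 25), (7, 19), (18, 18), (19, 12),
          (6, 12), (4, 9), (5, 8), (1, 2), (2, 1)]

n_albums = sum(n for _, n in counts)
n_tracks = sum(size * n for size, n in counts)
at_cap = dict(counts)[20]

print(n_albums)                           # 992
print(n_tracks)                           # 12166
print(round(100 * at_cap / n_albums, 1))  # 6.7
```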

In [23]:
all.describe()

| | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 12166.0 | 12166.0 | 12166.0 | 12166.0 | 12166.0 | 12166.0 | 12166.0 | 12166.0 | 12166.0 | 12166.0 | 12166.0 | 12166.0 | 12166.0 |
| mean | 0.494388 | 0.583324 | 5.221437 | -9.25675 | 0.6759 | 0.087871 | 0.342467 | 0.22844 | 0.200862 | 0.404076 | 119.49533 | 245989.7 | 3.856403 |
| std | 0.182532 | 0.259203 | 3.608929 | 4.710791 | 0.468057 | 0.11014 | 0.344426 | 0.33827 | 0.162778 | 0.245823 | 30.425833 | 130360.4 | 0.514495 |
| min | 0.0 | 0.0 | 0.0 | -60.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4120.0 | 0.0 |
| 25% | 0.365 | 0.38525 | 2.0 | -11.44325 | 0.0 | 0.0337 | 0.0184 | 2.9e-05 | 0.101 | 0.198 | 95.753 | 184136.5 | 4.0 |
| 50% | 0.504 | 0.6 | 5.0 | -8.208 | 1.0 | 0.0447 | 0.206 | 0.007795 | 0.128 | 0.379 | 118.6045 | 228307.0 | 4.0 |
| 75% | 0.625 | 0.804 | 9.0 | -6.05325 | 1.0 | 0.0835 | 0.66675 | 0.447 | 0.257 | 0.58675 | 139.33025 | 282876.5 | 4.0 |
| max | 0.985 | 1.0 | 11.0 | 0.606 | 1.0 | 0.956 | 0.996 | 0.999 | 1.0 | 0.985 | 220.217 | 4277994.0 | 5.0 |


The ranges of values for danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, and valence are all plausible.  
No track reaches the 1.0 ceiling for danceability (max 0.985), speechiness (0.956), acousticness (0.996), instrumentalness (0.999), or valence (0.985), though some come very close. Loudness runs from -60 dB to 0.606 dB.  
About 68% of all tracks are in a major key (mean mode of 0.676).  
Tracks skew toward lower valence (median 0.379).  
A few tracks are mostly spoken word (max speechiness 0.956), but the majority are not speechy (median 0.045).  
Most tracks are not live (median liveness 0.128).  
Tempo ranges from 0 bpm to 220 bpm. The upper end is plausible; a tempo of 0 presumably means that no tempo could be detected for the track.  
Mean track duration is about 4.1 minutes and the median is about 3.8 minutes, with a minimum of about 4 seconds and a maximum of about 71 minutes, so at least one track is a very long outlier.  
It is not clear what a time signature of 0 means (possibly that none could be detected), but most tracks are in 4/4 time.
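
Because `duration_ms` is in milliseconds, the duration figures are easier to read after conversion. The sketch below converts the `duration_ms` statistics from the `describe()` output into minutes and seconds. The helper name `ms_to_min_sec` is hypothetical.

```python
def ms_to_min_sec(ms):
    """Convert milliseconds to an 'M:SS' string, rounding to the nearest second."""
    total_seconds = round(ms / 1000)
    minutes, seconds = divmod(total_seconds, 60)
    return f'{minutes}:{seconds:02d}'

# duration_ms statistics copied from the describe() output above
stats = {'mean': 245989.7, 'min': 4120.0, '50%': 228307.0, 'max': 4277994.0}
for name, ms in stats.items():
    print(name, ms_to_min_sec(ms))
# mean 4:06
# min 0:04
# 50% 3:48
# max 71:18
```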