### The goal of this Jupyter notebook is to find genres for a large number of artists. Spotify does not provide a bulk list of the artists in its library, so we have to find a way to pull artists and their genres ourselves. I accomplished this as follows:

#### 1) Pull my list of artists from the top 200 stream numbers into a dataframe (737 artists)
#### 2) Perform an api search on each artist name to extract the artist's spotify api id and url
#### 3) Using the api ids from 2), perform a search to extract artists related to the artists in the top 200 list. Combine into one master dataframe (~4.5k artists)
#### 4) Perform an api call using the ids from the master artist list to extract the genres for each artist
#### 5) Save our new master list of artists and their corresponding genres (as a pickle file)
#### 6) Take the master list of artists and pull each artist's most popular songs
#### 7) In a final api call, take the giant list of most popular songs (from 6) and pull the features of each song (danceability, energy, key etc.)
#### 8) Store the final list of songs and song features in a pickle file for analysis

In [1]:
#importing modules
import pandas as pd
import csv
import spotipy
import spotipy.util as util
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.oauth2 as oauth2
import numpy as np
import os
from API_KEYS import grab_api_keys
import pickle
from collections import defaultdict

In [55]:
#IMPORTANT - song URLs are too long for the default column width, so an adjustment to pandas display settings is required
pd.set_option("display.max_colwidth", 10000)

#grabbing API keys from hidden file
CLIENT_ID, SECRET_CLIENT_ID = grab_api_keys()

In [56]:
#setting up authentication credentials
credentials = oauth2.SpotifyClientCredentials(
        client_id=CLIENT_ID,
        client_secret=SECRET_CLIENT_ID)

#creating temp token for api pull
token = credentials.get_access_token()
spotify = spotipy.Spotify(auth=token)

In [4]:
#setting up cwd for saving files
cwd = os.getcwd()
print (cwd)

C:\Users\abels\Desktop\spotify_scrape\Data


In [5]:
#reading combined csv files to pandas dataframe
df = pd.read_csv(cwd + r'\final_csv_files\Combined_Top_200_Stream_Numbers.csv')

In [6]:
#Using a mask to remove duplicate headers from csv files
df['Streams'] = pd.to_numeric(df['Streams'], errors='coerce')
mask = (~np.isnan(df['Streams']))
df = df[mask]
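The mask above works because a repeated header row carries the text "Streams" where a number should be, so it cannot be parsed as numeric. A minimal standard-library sketch of the same idea is below. The helper name `drop_repeated_headers` and the sample stream counts are made up for illustration; only the `Artist` and `Streams` column names come from the notebook.

```python
import csv
import io

def drop_repeated_headers(csv_text):
    """Return rows whose 'Streams' field parses as a number."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            row['Streams'] = float(row['Streams'])
        except (TypeError, ValueError):
            continue  # a repeated header row such as "Artist,Streams"
        rows.append(row)
    return rows

# Hypothetical sample: two data rows with a header row repeated in between.
sample = "Artist,Streams\nDrake,100\nArtist,Streams\nDJ Snake,250\n"
clean_rows = drop_repeated_headers(sample)
```

The pandas version in the cell above does the same thing in bulk: `errors='coerce'` turns unparseable values into NaN, and the mask drops those rows.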

In [7]:
#extracting a list of unique artists
unique_artists = df['Artist'].drop_duplicates().to_frame()

In [8]:
unique_artists.head()

Unnamed: 0,Artist
1,The Weeknd
2,The Chainsmokers
3,DJ Snake
4,Clean Bandit
5,Drake


In [9]:
#setting up a column to house the artist's name as returned by the api
unique_artists['api_artist'] = ""

In [10]:
#setting up a column to house the artist's api url
unique_artists['api_url'] = ""

In [11]:
#resetting index so I can make pandas calls to 'Artist'
unique_artists = unique_artists.reset_index()
unique_artists = unique_artists.drop('index', axis=1)
unique_artists.head()

Unnamed: 0,Artist,api_artist,api_url
0,The Weeknd,,
1,The Chainsmokers,,
2,DJ Snake,,
3,Clean Bandit,,
4,Drake,,


In [12]:
#Creating a for loop to search the spotify api in order to pull the artist's api name as well as the artist url
for x in range(len(unique_artists)):
    artist_name = unique_artists.at[x, 'Artist']
    placeholder = spotify.search(artist_name)
    try:
        for item in placeholder['tracks']['items']:
            album_artist = item['album']['artists'][0]
            #only accept an exact match on the artist name
            if artist_name == album_artist['name']:
                unique_artists.at[x,'api_artist'] = album_artist['name']
                unique_artists.at[x,'api_url'] = album_artist['external_urls']['spotify']
    except (KeyError, IndexError):
        print (placeholder)

In [13]:
#ensuring the api pull worked correctly
unique_artists.head()

Unnamed: 0,Artist,api_artist,api_url
0,The Weeknd,The Weeknd,https://open.spotify.com/artist/1Xyo4u8uXC1ZmMpatF05PJ
1,The Chainsmokers,The Chainsmokers,https://open.spotify.com/artist/69GGBxA162lTqCwzJG5jLp
2,DJ Snake,DJ Snake,https://open.spotify.com/artist/540vIaP2JwjQb9dm3aArA4
3,Clean Bandit,Clean Bandit,https://open.spotify.com/artist/6MDME20pz9RveH9rEXvrOM
4,Drake,Drake,https://open.spotify.com/artist/3TVXtAsR1Inumwj472S9r4


In [14]:
#extracting the api_id from the api_url (the id starts after the 32-character prefix 'https://open.spotify.com/artist/')
unique_artists['artist_api_id'] = unique_artists['api_url'].str[32:]
unique_artists['genres'] = np.nan
unique_artists['genres'] = unique_artists['genres'].astype('object')
unique_artists.head()

Unnamed: 0,Artist,api_artist,api_url,artist_api_id,genres
0,The Weeknd,The Weeknd,https://open.spotify.com/artist/1Xyo4u8uXC1ZmMpatF05PJ,1Xyo4u8uXC1ZmMpatF05PJ,
1,The Chainsmokers,The Chainsmokers,https://open.spotify.com/artist/69GGBxA162lTqCwzJG5jLp,69GGBxA162lTqCwzJG5jLp,
2,DJ Snake,DJ Snake,https://open.spotify.com/artist/540vIaP2JwjQb9dm3aArA4,540vIaP2JwjQb9dm3aArA4,
3,Clean Bandit,Clean Bandit,https://open.spotify.com/artist/6MDME20pz9RveH9rEXvrOM,6MDME20pz9RveH9rEXvrOM,
4,Drake,Drake,https://open.spotify.com/artist/3TVXtAsR1Inumwj472S9r4,3TVXtAsR1Inumwj472S9r4,
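The slice at position 32 relies on every URL starting with exactly `https://open.spotify.com/artist/`. As a hedged alternative, the sketch below parses the URL instead of counting characters. The function name `artist_id_from_url` is hypothetical and not part of the notebook.

```python
from urllib.parse import urlparse

def artist_id_from_url(url):
    """Return the id from a '/artist/<id>' URL path, or '' if the path has another shape."""
    path = urlparse(url).path              # e.g. '/artist/1Xyo4u8uXC1ZmMpatF05PJ'
    parts = [p for p in path.split('/') if p]
    if len(parts) == 2 and parts[0] == 'artist':
        return parts[1]
    return ''

example_id = artist_id_from_url('https://open.spotify.com/artist/1Xyo4u8uXC1ZmMpatF05PJ')
```

On a dataframe this could be applied with `unique_artists['api_url'].map(artist_id_from_url)`. Rows where the search found no url would come back as an empty string, the same as with the slice.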


In [26]:
#creating an empty list to house a large number of artists
#in the code below, I create a function to pull the artists related to the artists in the unique_artists dataframe
#this will create a master list of artists for my final api pull
artist_list = []

In [65]:
#creating function to pull related artists from unique_artist df
#related artist names are appended to the global artist_list
def grab_related(list_):
    for h in range(len(list_)):
        try:
            related_artist = spotify.artist_related_artists(list_.iloc[h]['artist_api_id'])
            for j in related_artist['artists']:
                if j['name'] not in artist_list:
                    artist_list.append(j['name'])
        except Exception:
            print(1) #printing 1 to help determine how many api calls failed

In [66]:
#pulling all related artists
grab_related(unique_artists)
artist_list = list(set(artist_list))

1
1
1
...
(50 lines of "1" in total, one per failed api call)
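Each printed 1 marks a failed call, but it does not say which artist failed, and the `not in artist_list` check gets slower as the list grows. The sketch below shows one possible variation, assuming the lookup is passed in as a callable. `collect_related` and `fake_fetch` are hypothetical names; with the real client, `spotify.artist_related_artists` could be passed as `fetch_related`.

```python
def collect_related(artist_ids, fetch_related):
    """Gather unique related-artist names and the ids whose lookup failed.

    fetch_related is any callable that takes an artist id and returns a dict
    shaped like {'artists': [{'name': ...}, ...]}.
    """
    seen = set()      # fast membership checks
    names = []        # preserves first-seen order
    failed = []       # ids to retry or inspect later
    for artist_id in artist_ids:
        try:
            response = fetch_related(artist_id)
        except Exception:
            failed.append(artist_id)
            continue
        for artist in response['artists']:
            if artist['name'] not in seen:
                seen.add(artist['name'])
                names.append(artist['name'])
    return names, failed

# Stand-in for the real api call, used only to exercise the helper.
def fake_fetch(artist_id):
    if artist_id == '':
        raise ValueError('empty artist id')
    return {'artists': [{'name': 'Dido'}, {'name': 'Kylie Minogue'}]}

related_names, failed_ids = collect_related(['abc', '', 'def'], fake_fetch)
```

Keeping the failed ids makes it possible to check whether the failures line up with the artists whose search in the earlier step returned no url.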


In [67]:
#length of final artist list
len(artist_list)

4978

In [72]:
#confirming pull worked correctly
artist_list[0:15]

['Dido',
 'James TW',
 'Justin Stone',
 'Natasha Bedingfield',
 'GAMMAL',
 'Valee',
 'BÖ',
 'Chris Norman',
 'Majk',
 'The Tremeloes',
 'Julius LaRosa',
 'Tegan and Sara',
 'Kylie Minogue',
 'Mike Perry',
 'Fonky Family']

In [73]:
#there were some data value issues with the api pull. Some api results came back as whitespace or float values (presumably because the artist has a numeric name)
#this function cleans up those issues in the artist list
#it drops any NaN values and removes float values and names of one character or less (such as whitespace)
def format_artists(list_):
    pdartist_list = pd.DataFrame(list_, columns=['name'])
    pdartist_list = pdartist_list.dropna()
    pdartist_list = pdartist_list[pdartist_list['name'].map(type) != float]
    pdartist_list = pdartist_list[pdartist_list['name'].map(len) > 1]
    return list(pdartist_list['name'])

In [74]:
#formatting the list
artist_list = format_artists(artist_list)

In [75]:
#ensuring the formatting worked as intended
artist_list[0:15]

['Dido',
 'James TW',
 'Justin Stone',
 'Natasha Bedingfield',
 'GAMMAL',
 'Valee',
 'BÖ',
 'Chris Norman',
 'Majk',
 'The Tremeloes',
 'Julius LaRosa',
 'Tegan and Sara',
 'Kylie Minogue',
 'Mike Perry',
 'Fonky Family']
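For reference, the same cleaning rule can be written without pandas. This is only a sketch: `clean_names` is a hypothetical helper, the sample list is made up, and unlike `format_artists` it also strips surrounding whitespace before checking the length.

```python
def clean_names(names):
    """Keep only string names longer than one character after stripping whitespace."""
    cleaned = []
    for name in names:
        if not isinstance(name, str):
            continue            # drops NaN and other non-string values
        name = name.strip()
        if len(name) > 1:
            cleaned.append(name)
    return cleaned

# Made-up sample mixing valid names with the kinds of bad values described above.
sample_names = ['Dido', float('nan'), ' ', 'BÖ', 'X', ' Majk ']
cleaned = clean_names(sample_names)
```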

In [76]:
#main function for the dataset
#using the formatted master list of artists, perform an api search on each artist
#extract the artist's api url (the api id is parsed from the url later)
def extract_artists(artist_list):
    append_list = []
    for name in artist_list:
        placeholder = spotify.search(name)
        try:
            for item in placeholder['tracks']['items']:
                album_artist = item['album']['artists'][0]
                if name == album_artist['name']:
                    append_list.append({'Artist': name, 'api_url': album_artist['external_urls']['spotify']})
                    break #stop at the first exact match
        except (KeyError, IndexError):
            print (placeholder)
    return append_list
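The function above only accepts an exact, case-sensitive match against the first album artist of each track, so artists whose name is returned with different casing or spacing are dropped. The sketch below shows a more forgiving match, run against a hand-built response so it needs no api call. `find_artist_url` and `fake_response` are hypothetical, and the response only mimics the fields the notebook reads.

```python
def find_artist_url(name, search_response):
    """Return the Spotify URL of the first album artist whose name matches, else None."""
    wanted = name.strip().casefold()
    for item in search_response.get('tracks', {}).get('items', []):
        # Unlike the notebook, this checks every album artist, not only the first one.
        for artist in item.get('album', {}).get('artists', []):
            if artist.get('name', '').strip().casefold() == wanted:
                return artist.get('external_urls', {}).get('spotify')
    return None

# Hand-built stand-in for a search response.
fake_response = {'tracks': {'items': [
    {'album': {'artists': [
        {'name': 'DIDO',
         'external_urls': {'spotify': 'https://open.spotify.com/artist/2mpeljBig2IXLXRAFO9AAs'}}
    ]}}
]}}

url = find_artist_url('Dido', fake_response)
missing = find_artist_url('Nobody', fake_response)
```

Whether looser matching is desirable depends on the analysis, since it can also pair a name with the wrong artist.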

In [77]:
#NOTE about naming convention - On my first go, I used test as a variable name to see if things would work.
#After determining the api pulls worked, I did not want to rerun the api call because of the number of calls made,
#so I kept the name test as the main variable. In an ideal world, this would have a more descriptive name

#extracting the api information from the cleaned artist_list
test = []
test = extract_artists(artist_list)
print (test)

[{'Artist': 'Dido', 'api_url': 'https://open.spotify.com/artist/2mpeljBig2IXLXRAFO9AAs'}, {'Artist': 'James TW', 'api_url': 'https://open.spotify.com/artist/0B3N0ZINFWvizfa8bKiz4v'}, {'Artist': 'Justin Stone', 'api_url': 'https://open.spotify.com/artist/5Vu87j6vCvfwc7FNVGnBwk'}, {'Artist': 'Natasha Bedingfield', 'api_url': 'https://open.spotify.com/artist/7o95ZoZt5ZYn31e9z1Hc0a'}, {'Artist': 'GAMMAL', 'api_url': 'https://open.spotify.com/artist/3O6DpqAKwn7L1KS9s9x0w5'}, {'Artist': 'Valee', 'api_url': 'https://open.spotify.com/artist/4hRL2QmahOYxXNmNKtG1AI'}, {'Artist': 'Chris Norman', 'api_url': 'https://open.spotify.com/artist/2Pawr6MMX9VBIQ9oUHg7jc'}, {'Artist': 'Majk', 'api_url': 'https://open.spotify.com/artist/1Ld3ajH4r4DZd5Ey4zLk5X'}, {'Artist': 'The Tremeloes', 'api_url': 'https://open.spotify.com/artist/2VL8sLEJ6lHEwocjc1pN9w'}, {'Artist': 'Julius LaRosa', 'api_url': 'https://open.spotify.com/artist/7eYjNm8FZWh2E7MPw7uHaJ'}, {'Artist': 'Tegan and Sara', 'api_url': 'https://open

In [78]:
#converting output to a dataframe
testdf = pd.DataFrame.from_dict(test)

In [79]:
#confirming the extraction worked
testdf.head()

Unnamed: 0,Artist,api_url
0,Dido,https://open.spotify.com/artist/2mpeljBig2IXLXRAFO9AAs
1,James TW,https://open.spotify.com/artist/0B3N0ZINFWvizfa8bKiz4v
2,Justin Stone,https://open.spotify.com/artist/5Vu87j6vCvfwc7FNVGnBwk
3,Natasha Bedingfield,https://open.spotify.com/artist/7o95ZoZt5ZYn31e9z1Hc0a
4,GAMMAL,https://open.spotify.com/artist/3O6DpqAKwn7L1KS9s9x0w5


In [80]:
#checking how many artists remain from original artist list (after formatting and cleaning)
len(testdf)

4665

In [81]:
#extracting api id number from api url
testdf['artist_api_id'] = testdf['api_url'].str[32:]
testdf['genres'] = np.nan
testdf['genres'] = testdf['genres'].astype('object')
testdf.head()

Unnamed: 0,Artist,api_url,artist_api_id,genres
0,Dido,https://open.spotify.com/artist/2mpeljBig2IXLXRAFO9AAs,2mpeljBig2IXLXRAFO9AAs,
1,James TW,https://open.spotify.com/artist/0B3N0ZINFWvizfa8bKiz4v,0B3N0ZINFWvizfa8bKiz4v,
2,Justin Stone,https://open.spotify.com/artist/5Vu87j6vCvfwc7FNVGnBwk,5Vu87j6vCvfwc7FNVGnBwk,
3,Natasha Bedingfield,https://open.spotify.com/artist/7o95ZoZt5ZYn31e9z1Hc0a,7o95ZoZt5ZYn31e9z1Hc0a,
4,GAMMAL,https://open.spotify.com/artist/3O6DpqAKwn7L1KS9s9x0w5,3O6DpqAKwn7L1KS9s9x0w5,


In [59]:
#this function performs an api call using the api id
#the purpose of this function is to find related artists to our top 200 list of artists
#artist_list is expected to be a list of artist api ids
def find_related_artists(artist_list):
    temp_list = []
    for h in range(len(artist_list)):
        try:
            related_artist = spotify.artist_related_artists(artist_list[h])
            for j in related_artist['artists']:
                if j['name'] not in temp_list:
                    temp_list.append(j['name'])
        except Exception:
            print (artist_list[h])
    return temp_list

In [83]:
#converting the genre column to an object type (I will be storing lists in this column)
testdf['genres'] = testdf['genres'].astype(object)

In [84]:
#from our artist master list, extract the genres for each artist. Store in dataframe
#the api call sits inside the try so that one failed call is recorded as 'error' instead of stopping the loop
for x in range(len(testdf)):
    if testdf.iloc[x]['artist_api_id'] != '':
        try:
            placeholder = spotify.artist(testdf.iloc[x]['artist_api_id'])
            testdf.at[x,'genres'] = placeholder['genres']
        except Exception:
            testdf.at[x,'genres'] = 'error'

retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
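The "retrying" messages suggest the one-call-per-artist loop runs into rate limiting. spotipy also has a `Spotify.artists` method that takes several ids in one call; as far as I know the Web API caps this at 50 ids per request, but that limit should be checked against the current documentation. The sketch below batches the ids under that assumption. `fetch_genres_in_batches` and `fake_fetch_artists` are hypothetical names, and the fake only mimics the response fields used here.

```python
def fetch_genres_in_batches(artist_ids, fetch_artists, batch_size=50):
    """Map artist id -> list of genres, requesting batch_size ids per call.

    fetch_artists is any callable that takes a list of ids and returns a dict
    shaped like {'artists': [{'id': ..., 'genres': [...]}, ...]}.
    """
    genres_by_id = {}
    for start in range(0, len(artist_ids), batch_size):
        batch = artist_ids[start:start + batch_size]
        response = fetch_artists(batch)
        for artist in response['artists']:
            if artist is not None:      # defensive: skip entries the api could not resolve
                genres_by_id[artist['id']] = artist.get('genres', [])
    return genres_by_id

# Stand-in for the real api call; it records each batch it receives.
calls = []
def fake_fetch_artists(ids):
    calls.append(list(ids))
    return {'artists': [{'id': i, 'genres': ['pop']} for i in ids]}

ids = ['id%d' % n for n in range(120)]
genres = fetch_genres_in_batches(ids, fake_fetch_artists)
```

With the real client, the call might look like `fetch_genres_in_batches(list(testdf['artist_api_id']), spotify.artists)`, which would cut roughly 4.6k requests down to under a hundred.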


In [86]:
#check to ensure pull worked
testdf

Unnamed: 0,Artist,api_url,artist_api_id,genres
0,Dido,https://open.spotify.com/artist/2mpeljBig2IXLXRAFO9AAs,2mpeljBig2IXLXRAFO9AAs,"[dance pop, new wave pop, pop, pop rock]"
1,James TW,https://open.spotify.com/artist/0B3N0ZINFWvizfa8bKiz4v,0B3N0ZINFWvizfa8bKiz4v,"[australian pop, dance pop, indie cafe pop, neo mellow, pop]"
2,Justin Stone,https://open.spotify.com/artist/5Vu87j6vCvfwc7FNVGnBwk,5Vu87j6vCvfwc7FNVGnBwk,"[deep underground hip hop, indie pop rap]"
3,Natasha Bedingfield,https://open.spotify.com/artist/7o95ZoZt5ZYn31e9z1Hc0a,7o95ZoZt5ZYn31e9z1Hc0a,"[dance pop, europop, folk-pop, neo mellow, pop, pop rock, post-teen pop, r&b]"
4,GAMMAL,https://open.spotify.com/artist/3O6DpqAKwn7L1KS9s9x0w5,3O6DpqAKwn7L1KS9s9x0w5,"[indie cafe pop, swedish pop]"
5,Valee,https://open.spotify.com/artist/4hRL2QmahOYxXNmNKtG1AI,4hRL2QmahOYxXNmNKtG1AI,"[chicago rap, trap music, underground hip hop, vapor trap]"
6,Chris Norman,https://open.spotify.com/artist/2Pawr6MMX9VBIQ9oUHg7jc,2Pawr6MMX9VBIQ9oUHg7jc,"[classic uk pop, italian disco]"
7,Majk,https://open.spotify.com/artist/1Ld3ajH4r4DZd5Ey4zLk5X,1Ld3ajH4r4DZd5Ey4zLk5X,"[albanian hip hop, albanian pop]"
8,The Tremeloes,https://open.spotify.com/artist/2VL8sLEJ6lHEwocjc1pN9w,2VL8sLEJ6lHEwocjc1pN9w,"[brill building pop, british invasion, bubblegum pop, classic uk pop, folk rock, merseybeat, nederpop, rock-and-roll, rockabilly]"
9,Julius LaRosa,https://open.spotify.com/artist/7eYjNm8FZWh2E7MPw7uHaJ,7eYjNm8FZWh2E7MPw7uHaJ,[]


In [95]:
#total artists
len(artist_list)

4977

In [101]:
#pickle our final artist/genre dataset for analysis
with open(cwd + r"\final_csv_files\artists_master.p", "wb") as f:
    pickle.dump(testdf, f)

In [102]:
#extract a unique list of artist ids
artists_ids = testdf['artist_api_id']
artists_ids.head()

0    2mpeljBig2IXLXRAFO9AAs
1    0B3N0ZINFWvizfa8bKiz4v
2    5Vu87j6vCvfwc7FNVGnBwk
3    7o95ZoZt5ZYn31e9z1Hc0a
4    3O6DpqAKwn7L1KS9s9x0w5
Name: artist_api_id, dtype: object

In [237]:
#function to pull each artist's most popular tracks
#for all artists in our master list, store the track metadata keyed by track id
#(the song features themselves are pulled in a later step)
def pull_top_tracks(artist_ids):
    append_list = {}
    for artist_id in artist_ids:
        placeholder = spotify.artist_top_tracks(artist_id)
        #analysis_placeholder = spotify.audio_features(placeholder['tracks'][0]['id'])
        for track in placeholder['tracks']:
            append_list[track['id']] = {'artist_id': artist_id, 'name': track['name'],
                                        'explicit': track['explicit'],
                                        'popularity': track['popularity'],
                                        'track_id': track['id'],
                                        'Artist_name': track['artists'][0]['name']}
    return append_list

In [244]:
#create dict to house api calls
song_list_dict = {}

In [247]:
#pull top songs
song_list_dict = pull_top_tracks(artists_ids)

In [248]:
#api call worked - about 44,000 songs and their metadata (name, explicit flag, popularity, artist) have been loaded in the dictionary
len(song_list_dict)

44146

In [249]:
#printing out dictionary to ensure formatting is good
song_list_dict

{'751gBcu62kORDelX7FV0mM': {'artist_id': '2mpeljBig2IXLXRAFO9AAs',
  'name': 'Thank You',
  'explicit': False,
  'popularity': 58,
  'track_id': '751gBcu62kORDelX7FV0mM',
  'Artist_name': 'Dido'},
 '3adnLFXKO5rC1lhUNSeg3N': {'artist_id': '2mpeljBig2IXLXRAFO9AAs',
  'name': 'White Flag',
  'explicit': False,
  'popularity': 52,
  'track_id': '3adnLFXKO5rC1lhUNSeg3N',
  'Artist_name': 'Dido'},
 '5kj1AhJxSUums4ddBEaMhT': {'artist_id': '2mpeljBig2IXLXRAFO9AAs',
  'name': 'Here With Me',
  'explicit': False,
  'popularity': 50,
  'track_id': '5kj1AhJxSUums4ddBEaMhT',
  'Artist_name': 'Dido'},
 '2Y1nYcVwVnPOn1FHZ0dc5L': {'artist_id': '2mpeljBig2IXLXRAFO9AAs',
  'name': 'Take You Home',
  'explicit': False,
  'popularity': 57,
  'track_id': '2Y1nYcVwVnPOn1FHZ0dc5L',
  'Artist_name': 'Dido'},
 '1ujIGAJ2sp9ZXJVnZJxbQa': {'artist_id': '2mpeljBig2IXLXRAFO9AAs',
  'name': 'Hurricanes',
  'explicit': False,
  'popularity': 55,
  'track_id': '1ujIGAJ2sp9ZXJVnZJxbQa',
  'Artist_name': 'Dido'},
 '35b9

In [250]:
#save dictionary as a pickle file
with open(cwd + r"\final_csv_files\dict_object.p", "wb") as f:
    pickle.dump(song_list_dict, f)

In [256]:
#loading the giant song list to a dataframe
giant_song_list = pd.DataFrame.from_dict(song_list_dict,orient='index').reset_index()

In [257]:
#inspecting
giant_song_list.head()

Unnamed: 0,index,artist_id,name,explicit,popularity,track_id,Artist_name
0,000TiSS4vK5su0MkoFyQbd,1YzDKK9gJRBqqkL5wxQQTa,Tenebre,False,53,000TiSS4vK5su0MkoFyQbd,Sercho
1,000xYdQfIZ4pDmBGzQalKU,3qvcCP2J0fWi0m0uQDUf6r,"Eu, Você, O Mar e Ela",False,52,000xYdQfIZ4pDmBGzQalKU,Luan Santana
2,002DUjoJzO3NqA4w3mTA2i,1FyYqlTR8CuFH7eRGd0tpe,5gether,False,18,002DUjoJzO3NqA4w3mTA2i,2gether
3,002r1ZwqA9IL2pWtJMOs9f,07d5etnpjriczFBB8pxmRe,Worryin' Bout Me,True,49,002r1ZwqA9IL2pWtJMOs9f,BJ The Chicago Kid
4,003F0rm5lqxcmhvJPKgfaJ,3MRynBsyLGzv3IQ9Fip6hO,El Remedio,False,43,003F0rm5lqxcmhvJPKgfaJ,Ana Guerra


In [260]:
#changing name of 'index' to 'track_id' (note: this leaves two columns named 'track_id', both holding the same ids)
giant_song_list.columns.values[0]='track_id' 

In [261]:
giant_song_list.head()

Unnamed: 0,track_id,artist_id,name,explicit,popularity,track_id.1,Artist_name
0,000TiSS4vK5su0MkoFyQbd,1YzDKK9gJRBqqkL5wxQQTa,Tenebre,False,53,000TiSS4vK5su0MkoFyQbd,Sercho
1,000xYdQfIZ4pDmBGzQalKU,3qvcCP2J0fWi0m0uQDUf6r,"Eu, Você, O Mar e Ela",False,52,000xYdQfIZ4pDmBGzQalKU,Luan Santana
2,002DUjoJzO3NqA4w3mTA2i,1FyYqlTR8CuFH7eRGd0tpe,5gether,False,18,002DUjoJzO3NqA4w3mTA2i,2gether
3,002r1ZwqA9IL2pWtJMOs9f,07d5etnpjriczFBB8pxmRe,Worryin' Bout Me,True,49,002r1ZwqA9IL2pWtJMOs9f,BJ The Chicago Kid
4,003F0rm5lqxcmhvJPKgfaJ,3MRynBsyLGzv3IQ9Fip6hO,El Remedio,False,43,003F0rm5lqxcmhvJPKgfaJ,Ana Guerra


In [262]:
#creating pickle file of the song dictionary (it is converted back into a dataframe after reloading)
with open(cwd + r"\final_csv_files\song_dataframe_master.p", "wb") as f:
    pickle.dump(song_list_dict, f)

In [264]:
#creating a unique list of track ids
track_list = giant_song_list['track_id']

In [15]:
#I have two functions here
#the first one (below) makes the api calls individually, one call per track id in the list passed in
def pull_song_features(song_list):
    song_feature_dict = {}
    for x in range(len(song_list)):
        song_feature_dict[song_list[x]] = spotify.audio_features(song_list[x])
    return song_feature_dict

In [91]:
#the following function pulls the song features
#songs are passed as a list and an api call is done on all track ids in the list
#this was done because making an api call on 44000 songs would cause the api to timeout
def pull_song_features_v2(song_list):
    song_feature_dict = {}
    api_result = spotify.audio_features(tracks=song_list)
    for x in range(len(api_result)):
        try:
            if api_result[x] is not None:
                song_feature_dict.update({song_list[x] : api_result[x]})
        except:
            print (api_result[x])
    return song_feature_dict
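`pull_song_features_v2` pairs each result with the id at the same position in `song_list`. Each feature record also carries its own `id` field (visible in the output further down), so the dictionary can be keyed from the record itself. A minimal sketch, with `index_features_by_id` as a hypothetical helper and a hand-built result list trimmed to a few fields:

```python
def index_features_by_id(api_result):
    """Build {track_id: features} from a list of feature dicts, skipping None entries."""
    return {features['id']: features
            for features in api_result
            if features is not None}

# Hand-built stand-in for an audio_features response.
fake_result = [
    {'id': '000TiSS4vK5su0MkoFyQbd', 'danceability': 0.717, 'energy': 0.646},
    None,  # a track for which no features came back
    {'id': '000xYdQfIZ4pDmBGzQalKU', 'danceability': 0.509, 'energy': 0.803},
]
features_by_id = index_features_by_id(fake_result)
```

Keying on the record's own id avoids relying on the response order matching the request order.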

In [5]:
#load song_list_dict from a pickle
cwd = os.getcwd()
song_list_dict = pickle.load(open(cwd + r"\final_csv_files\song_dataframe_master.p", "rb"))

In [7]:
#convert song list dictionary to pandas dataframe
giant_song_list2 = pd.DataFrame.from_dict(song_list_dict,orient='index').reset_index()
giant_song_list2.columns.values[0]='track_id' 
giant_song_list2.head()

Unnamed: 0,track_id,artist_id,name,explicit,popularity,track_id.1,Artist_name
0,000TiSS4vK5su0MkoFyQbd,1YzDKK9gJRBqqkL5wxQQTa,Tenebre,False,53,000TiSS4vK5su0MkoFyQbd,Sercho
1,000xYdQfIZ4pDmBGzQalKU,3qvcCP2J0fWi0m0uQDUf6r,"Eu, Você, O Mar e Ela",False,52,000xYdQfIZ4pDmBGzQalKU,Luan Santana
2,002DUjoJzO3NqA4w3mTA2i,1FyYqlTR8CuFH7eRGd0tpe,5gether,False,18,002DUjoJzO3NqA4w3mTA2i,2gether
3,002r1ZwqA9IL2pWtJMOs9f,07d5etnpjriczFBB8pxmRe,Worryin' Bout Me,True,49,002r1ZwqA9IL2pWtJMOs9f,BJ The Chicago Kid
4,003F0rm5lqxcmhvJPKgfaJ,3MRynBsyLGzv3IQ9Fip6hO,El Remedio,False,43,003F0rm5lqxcmhvJPKgfaJ,Ana Guerra


In [8]:
#inspecting shape
giant_song_list2.shape

(44146, 7)

In [9]:
#creating an empty list to house the list of songs for the final api call
track_list2 = []

In [10]:
giant_song_list2.head()

Unnamed: 0,track_id,artist_id,name,explicit,popularity,track_id.1,Artist_name
0,000TiSS4vK5su0MkoFyQbd,1YzDKK9gJRBqqkL5wxQQTa,Tenebre,False,53,000TiSS4vK5su0MkoFyQbd,Sercho
1,000xYdQfIZ4pDmBGzQalKU,3qvcCP2J0fWi0m0uQDUf6r,"Eu, Você, O Mar e Ela",False,52,000xYdQfIZ4pDmBGzQalKU,Luan Santana
2,002DUjoJzO3NqA4w3mTA2i,1FyYqlTR8CuFH7eRGd0tpe,5gether,False,18,002DUjoJzO3NqA4w3mTA2i,2gether
3,002r1ZwqA9IL2pWtJMOs9f,07d5etnpjriczFBB8pxmRe,Worryin' Bout Me,True,49,002r1ZwqA9IL2pWtJMOs9f,BJ The Chicago Kid
4,003F0rm5lqxcmhvJPKgfaJ,3MRynBsyLGzv3IQ9Fip6hO,El Remedio,False,43,003F0rm5lqxcmhvJPKgfaJ,Ana Guerra


In [39]:
#pulling unique list of songs
track_list2 = giant_song_list2.iloc[:,0:1]

In [40]:
type(track_list2)

pandas.core.frame.DataFrame

In [42]:
#converting to a series
track_list2 = track_list2['track_id']

In [43]:
track_list2.head()

0    000TiSS4vK5su0MkoFyQbd
1    000xYdQfIZ4pDmBGzQalKU
2    002DUjoJzO3NqA4w3mTA2i
3    002r1ZwqA9IL2pWtJMOs9f
4    003F0rm5lqxcmhvJPKgfaJ
Name: track_id, dtype: object

In [44]:
#converting series to list - api call uses list when passing multiple song ids
track_list2 = track_list2.tolist()

In [45]:
track_list2

['000TiSS4vK5su0MkoFyQbd',
 '000xYdQfIZ4pDmBGzQalKU',
 '002DUjoJzO3NqA4w3mTA2i',
 '002r1ZwqA9IL2pWtJMOs9f',
 '003F0rm5lqxcmhvJPKgfaJ',
 '003FTlCpBTM4eSqYSWPv4H',
 '003jGIUhWE9ZW1TOqal5pl',
 '004S8bMhFQjnbuqvdh6W71',
 '004skCQeDn1iLntSom0rRr',
 '005JLaUYDk1wv8aYAZW0rs',
 '007IfzN6xST7Q7CbTvXN2l',
 '007UFyVZAdZIaqpxPsWYdq',
 '008gS3ob2GZv3e9fkWQ1k7',
 '009G1RDIr3UgPrFzOPJPfb',
 '00CmjeeHvAVKvx3tcIiZTy',
 '00Do6AxXoO7o8VvRPvvRKZ',
 '00EXQo4oF8LyQMa8byKvGM',
 '00ElBScr0ogElyOeYC70vD',
 '00EuNz6D7recXyjU3wJEUg',
 '00EvDpGR2lpTr4t0fFhEEA',
 '00FRRwuaJP9KimukvLQCOz',
 '00GOPLxW4PGQuUYdPJh8K1',
 '00HIh9mVUQQAycsQiciWsh',
 '00I0pcNkN3IOX3fsYbaB4N',
 '00Ia46AgCNfnXjzgH8PIKH',
 '00LI9j2NghSFCatUnnlTDs',
 '00Lc0cL8BsE7EIxp1bml7e',
 '00LhOqmEJG93SZ2f0XxgTb',
 '00LsEOcaYrZ3yxM9mOzavR',
 '00Mb3DuaIH1kjrwOku9CGU',
 '00Mhg8ZfH0Qcn4oL5X5t4p',
 '00NxCtvQTy3wkSvpaKEyKe',
 '00PESLcezJekwcaHOWGNku',
 '00PLtXXER1XcTRZvs3LioS',
 '00RI7b6oZDjx6IQC2eH6bh',
 '00S35gEf40z03JTJgvQMqi',
 '00St1bd1fxYD9shL3XNgmk',
 

In [48]:
len(track_list2)

44146

In [49]:
#chunking track list into lists of lists. each (smaller) list is what is passed to the api call
track_list3 = np.array_split(track_list2,883)

In [54]:
#size of the last small list
len(track_list3[882])

49
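`np.array_split` with 883 sections spreads the 44,146 ids as evenly as possible, which gives 879 chunks of 50 ids and 4 chunks of 49 (879 × 50 + 4 × 49 = 44,146). If the goal is instead a fixed maximum per request, a plain slicing helper makes the limit explicit. This is a sketch: `chunked` is a hypothetical name and the ids are placeholders.

```python
def chunked(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Placeholder ids standing in for the 44,146 real track ids.
fake_ids = ['track%d' % n for n in range(44146)]
chunks = chunked(fake_ids, 50)
```

Here every chunk holds 50 ids except the last, which holds the remaining 46. As far as I know the audio features endpoint accepts up to 100 ids per request, so a larger `size` may be possible; check the current api documentation before relying on that.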

In [92]:
#dictionary to house the song features
#for loop to pull the song features for the 44k songs
song_features = {}
for x in track_list3:
    song_features.update(pull_song_features_v2(x))

retrying ...1secs


In [93]:
song_features.keys()

dict_keys(['000TiSS4vK5su0MkoFyQbd', '000xYdQfIZ4pDmBGzQalKU', '002DUjoJzO3NqA4w3mTA2i', '002r1ZwqA9IL2pWtJMOs9f', '003F0rm5lqxcmhvJPKgfaJ', '003FTlCpBTM4eSqYSWPv4H', '003jGIUhWE9ZW1TOqal5pl', '004S8bMhFQjnbuqvdh6W71', '004skCQeDn1iLntSom0rRr', '005JLaUYDk1wv8aYAZW0rs', '007IfzN6xST7Q7CbTvXN2l', '007UFyVZAdZIaqpxPsWYdq', '008gS3ob2GZv3e9fkWQ1k7', '009G1RDIr3UgPrFzOPJPfb', '00CmjeeHvAVKvx3tcIiZTy', '00Do6AxXoO7o8VvRPvvRKZ', '00EXQo4oF8LyQMa8byKvGM', '00ElBScr0ogElyOeYC70vD', '00EuNz6D7recXyjU3wJEUg', '00EvDpGR2lpTr4t0fFhEEA', '00FRRwuaJP9KimukvLQCOz', '00GOPLxW4PGQuUYdPJh8K1', '00HIh9mVUQQAycsQiciWsh', '00I0pcNkN3IOX3fsYbaB4N', '00Ia46AgCNfnXjzgH8PIKH', '00LI9j2NghSFCatUnnlTDs', '00Lc0cL8BsE7EIxp1bml7e', '00LhOqmEJG93SZ2f0XxgTb', '00LsEOcaYrZ3yxM9mOzavR', '00Mb3DuaIH1kjrwOku9CGU', '00Mhg8ZfH0Qcn4oL5X5t4p', '00NxCtvQTy3wkSvpaKEyKe', '00PESLcezJekwcaHOWGNku', '00PLtXXER1XcTRZvs3LioS', '00RI7b6oZDjx6IQC2eH6bh', '00S35gEf40z03JTJgvQMqi', '00St1bd1fxYD9shL3XNgmk', '00SzIFHBVTpRonBTHnHuok', '

In [94]:
#load the dictionary into a dataframe
song_features_df = pd.DataFrame.from_dict(song_features,orient='index')

In [95]:
#inspect output
song_features_df.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
000TiSS4vK5su0MkoFyQbd,0.717,0.646,6,-7.692,0,0.107,0.237,0.0,0.127,0.0966,80.022,audio_features,000TiSS4vK5su0MkoFyQbd,spotify:track:000TiSS4vK5su0MkoFyQbd,https://api.spotify.com/v1/tracks/000TiSS4vK5su0MkoFyQbd,https://api.spotify.com/v1/audio-analysis/000TiSS4vK5su0MkoFyQbd,159250,4
000xYdQfIZ4pDmBGzQalKU,0.509,0.803,0,-6.743,1,0.04,0.684,0.000539,0.463,0.651,166.018,audio_features,000xYdQfIZ4pDmBGzQalKU,spotify:track:000xYdQfIZ4pDmBGzQalKU,https://api.spotify.com/v1/tracks/000xYdQfIZ4pDmBGzQalKU,https://api.spotify.com/v1/audio-analysis/000xYdQfIZ4pDmBGzQalKU,187119,4
002DUjoJzO3NqA4w3mTA2i,0.783,0.88,2,-4.927,1,0.237,0.256,0.0,0.105,0.54,113.932,audio_features,002DUjoJzO3NqA4w3mTA2i,spotify:track:002DUjoJzO3NqA4w3mTA2i,https://api.spotify.com/v1/tracks/002DUjoJzO3NqA4w3mTA2i,https://api.spotify.com/v1/audio-analysis/002DUjoJzO3NqA4w3mTA2i,251267,4
002r1ZwqA9IL2pWtJMOs9f,0.668,0.778,10,-4.912,0,0.0991,0.352,0.0,0.561,0.531,126.919,audio_features,002r1ZwqA9IL2pWtJMOs9f,spotify:track:002r1ZwqA9IL2pWtJMOs9f,https://api.spotify.com/v1/tracks/002r1ZwqA9IL2pWtJMOs9f,https://api.spotify.com/v1/audio-analysis/002r1ZwqA9IL2pWtJMOs9f,235751,4
003F0rm5lqxcmhvJPKgfaJ,0.683,0.676,1,-6.688,0,0.147,0.159,0.0,0.0726,0.434,98.992,audio_features,003F0rm5lqxcmhvJPKgfaJ,spotify:track:003F0rm5lqxcmhvJPKgfaJ,https://api.spotify.com/v1/tracks/003F0rm5lqxcmhvJPKgfaJ,https://api.spotify.com/v1/audio-analysis/003F0rm5lqxcmhvJPKgfaJ,180933,4


In [96]:
#reset index so track id is no longer index
song_features_df = song_features_df.reset_index()

In [98]:
#rename index column to 'track_id'
song_features_df = song_features_df.rename(columns={'index': 'track_id'})
song_features_df.head()

Unnamed: 0,track_id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,000TiSS4vK5su0MkoFyQbd,0.717,0.646,6,-7.692,0,0.107,0.237,0.0,0.127,0.0966,80.022,audio_features,000TiSS4vK5su0MkoFyQbd,spotify:track:000TiSS4vK5su0MkoFyQbd,https://api.spotify.com/v1/tracks/000TiSS4vK5su0MkoFyQbd,https://api.spotify.com/v1/audio-analysis/000TiSS4vK5su0MkoFyQbd,159250,4
1,000xYdQfIZ4pDmBGzQalKU,0.509,0.803,0,-6.743,1,0.04,0.684,0.000539,0.463,0.651,166.018,audio_features,000xYdQfIZ4pDmBGzQalKU,spotify:track:000xYdQfIZ4pDmBGzQalKU,https://api.spotify.com/v1/tracks/000xYdQfIZ4pDmBGzQalKU,https://api.spotify.com/v1/audio-analysis/000xYdQfIZ4pDmBGzQalKU,187119,4
2,002DUjoJzO3NqA4w3mTA2i,0.783,0.88,2,-4.927,1,0.237,0.256,0.0,0.105,0.54,113.932,audio_features,002DUjoJzO3NqA4w3mTA2i,spotify:track:002DUjoJzO3NqA4w3mTA2i,https://api.spotify.com/v1/tracks/002DUjoJzO3NqA4w3mTA2i,https://api.spotify.com/v1/audio-analysis/002DUjoJzO3NqA4w3mTA2i,251267,4
3,002r1ZwqA9IL2pWtJMOs9f,0.668,0.778,10,-4.912,0,0.0991,0.352,0.0,0.561,0.531,126.919,audio_features,002r1ZwqA9IL2pWtJMOs9f,spotify:track:002r1ZwqA9IL2pWtJMOs9f,https://api.spotify.com/v1/tracks/002r1ZwqA9IL2pWtJMOs9f,https://api.spotify.com/v1/audio-analysis/002r1ZwqA9IL2pWtJMOs9f,235751,4
4,003F0rm5lqxcmhvJPKgfaJ,0.683,0.676,1,-6.688,0,0.147,0.159,0.0,0.0726,0.434,98.992,audio_features,003F0rm5lqxcmhvJPKgfaJ,spotify:track:003F0rm5lqxcmhvJPKgfaJ,https://api.spotify.com/v1/tracks/003F0rm5lqxcmhvJPKgfaJ,https://api.spotify.com/v1/audio-analysis/003F0rm5lqxcmhvJPKgfaJ,180933,4


In [99]:
#save the song features dataframe to a pickle file for final analysis
with open(cwd + r"\final_csv_files\song_feature_masterdf.p", "wb") as f:
    pickle.dump(song_features_df, f)
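The pickle files written in this notebook are read back later with `pickle.load`. The sketch below shows a save and load round trip that closes the file handles and builds the path with `os.path.join`, so it does not depend on Windows-style backslashes. `save_pickle`, `load_pickle` and the `demo.p` file name are hypothetical, and the sample dictionary is a trimmed copy of one entry from the song dictionary above.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Write obj to path, closing the file when done."""
    with open(path, 'wb') as handle:
        pickle.dump(obj, handle)

def load_pickle(path):
    """Read back an object written by save_pickle."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)

sample = {'751gBcu62kORDelX7FV0mM': {'name': 'Thank You', 'popularity': 58}}

# A temporary directory keeps the demo from touching the real data folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.p')
    save_pickle(sample, demo_path)
    restored = load_pickle(demo_path)
```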