# Spotify API Data Collection (for Machine Learning)  

### In this notebook, we will fetch music data from the Spotify API using the spotipy library to be used in a machine learning model  

The Spotify API allows us to collect data about music on Spotify, including track metadata and audio features  
Searching for music returns songs that are currently popular or have recently been popular  
Note that searching by an earlier year does not return the tracks that were popular in that year; it returns tracks released in that year that are popular now  
To find tracks that were popular in earlier years, it would likely be easier to use Billboard charts for the desired timeframe

In [1]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
import requests

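This creates a spotipy client authenticated with the Spotify Developer Account credentials  
The Spotify Developer ID and Secret are stored in environment variables on the system (spotipy reads `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` when `SpotifyClientCredentials()` is called without arguments)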

In [2]:
sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())
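
As a minimal sketch, the check below reports which credential variables are unset before a client is created  
The helper name `missing_credentials` is hypothetical and not part of spotipy; the variable names are the ones spotipy reads by default

```python
import os

# Environment variables that spotipy's SpotifyClientCredentials reads by default
REQUIRED_VARS = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
```

Running this before creating the client makes a missing variable visible as a short message instead of an authentication error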

## Data Collection

We will store the names of features we want to collect in a list and create dictionaries to temporarily hold the data  
The metadata and analysis lists and dictionaries are separated to make inserting data easier, as the data is gathered in two separate steps  
We will be looking at up to 2000 songs from each year from 2014 to 2020, as the Spotify API only allows searches to access the first 2000 results  
Data is gathered in increments of 50, the maximum allowed by the Spotify API
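
The paging arithmetic can be sketched as follows, assuming 2000 results per year and 50 results per request as described above  
`page_offsets` is a hypothetical helper used only for illustration

```python
def page_offsets(num_tracks, search_limit):
    """Offsets for paging through num_tracks results, search_limit at a time."""
    return list(range(0, num_tracks, search_limit))


offsets = page_offsets(2000, 50)
print(len(offsets))              # number of search requests per year
print(offsets[:3], offsets[-1])  # first few offsets and the last one
```

Each offset is passed to one search request, so a full year takes 40 requests, the last one starting at offset 1950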

In [11]:
features_list = ['artist', 'name', 'id', 'release_date', 'popularity', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']
metadata_list = features_list[:5]
analysis_list = features_list[5:]

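features = {feature : [] for feature in features_list}
# metadata and analysis reuse the same list objects as features,
# so appending through either one also fills features
metadata = {feature : features[feature] for feature in metadata_list}
analysis = {feature : features[feature] for feature in analysis_list}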

start_year = 2014
end_year = 2020
search_limit = 50
num_tracks = 2000

for year in range(start_year, end_year + 1):
    for i in range(0, num_tracks, search_limit):
        print("\r", "year {}, iteration {}".format(year, i // search_limit + 1), end="")
        try:
            results = sp.search(q='year:{}'.format(year), limit=search_limit, type='track', offset=i)
        except (requests.HTTPError, spotipy.SpotifyException):
            # Skip this page; falling through would reuse the previous page's results
            continue
        items = results['tracks']['items']
        if not items:
            # No more results for this year at this offset
            continue
        current_tracks = [t['id'] for t in items]
        for t in items:
            for k, v in metadata.items():
                if k == 'artist':
                    # Only the first listed artist is stored
                    v.append(t['artists'][0]['name'])
                elif k == 'release_date':
                    v.append(t['album'][k])
                else:
                    v.append(t[k])
        try:
            feature_results = sp.audio_features(current_tracks)
        except (requests.HTTPError, spotipy.SpotifyException):
            # Keep the columns aligned: treat every track on this page as missing
            feature_results = [None] * len(current_tracks)
        for feature in feature_results:
            for k, v in analysis.items():
                # Entries are None for tracks without audio features; store -1 as a sentinel
                v.append(feature[k] if feature is not None else -1)

 year 2020, iteration 40
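
The loop above writes -1 when a track has no audio features  
A minimal sketch of that idea, using made-up sample data instead of API responses; `extract_features` and `ANALYSIS_KEYS` are hypothetical names, not part of spotipy

```python
ANALYSIS_KEYS = ["danceability", "energy", "key"]


def extract_features(feature_results, keys=ANALYSIS_KEYS, missing=-1):
    """Build one column per key, writing `missing` where an entry is None."""
    columns = {k: [] for k in keys}
    for feature in feature_results:
        for k in keys:
            columns[k].append(missing if feature is None else feature[k])
    return columns


# Made-up response: the second track has no audio features
sample = [
    {"danceability": 0.7, "energy": 0.72, "key": 7},
    None,
]
columns = extract_features(sample)
print(columns["key"])
```

Because every analysis column of such a row holds -1, a later filter on a single column, such as `df.key >= 0`, is enough to remove it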

## Data Storage

Once the dictionaries are filled, the data will be moved to a pandas dataframe, which offers more functionality and makes analysis easier  
The songs are sorted by popularity to see which songs are currently the most popular  

Some songs have a popularity of 0, likely because Spotify's popularity algorithm only takes recent streaming data into account  
Old songs with a popularity of 0 may have been popular earlier, but they are no longer streamed as much  
Some songs are too new to have a popularity value yet, so they also have a popularity of 0  

We will filter out songs with popularity 0, songs with missing audio features, new songs, and duplicate songs  
As of September 29, 2020, a new song is defined as one that was released on or after September 18, 2020
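
The release-date cutoff can be applied as a plain string comparison because ISO-formatted dates sort in chronological order  
A minimal sketch with made-up dates; `is_before_cutoff` is a hypothetical helper  
One assumption worth checking against the real data: Spotify can also report release dates with only year or month precision (for example `2014` or `2014-06`), and such strings compare by prefix

```python
CUTOFF = "2020-09-18"


def is_before_cutoff(release_date, cutoff=CUTOFF):
    """True if an ISO date string sorts strictly before the cutoff."""
    return release_date < cutoff


for d in ["2020-09-17", "2020-09-18", "2020-10-01", "2014", "2020"]:
    print(d, is_before_cutoff(d))
```

A year-only value such as `2020` sorts before `2020-09-18`, so a track dated that way passes the filter regardless of its actual release day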

In [12]:
df = pd.DataFrame(features)
# Filter out songs with popularity 0, songs with erroneous data, very new songs, and duplicate songs
df = (
    df[(df.popularity != 0) & (df.key >= 0) & (df.release_date < '2020-09-18')]
    .sort_values(by='popularity', ascending=False)
    .drop_duplicates(subset=['artist', 'name'])
    .reset_index(drop=True)
)
df

,artist,name,id,release_date,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,24kGoldn,Mood (feat. Iann Dior),3tjFYV6RSFtuktYl3ZtYcq,2020-07-24,100,0.700,0.72200,7,-3.558,0,0.0369,0.221000,0.000000,0.2720,0.756,90.989,140526,4
1,Cardi B,WAP (feat. Megan Thee Stallion),4Oun2ylbjFKMPTiaSbbCih,2020-08-07,99,0.935,0.45400,1,-7.509,1,0.3750,0.019400,0.000000,0.0824,0.357,133.073,187541,4
2,The Weeknd,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,2020-03-20,97,0.514,0.73000,1,-5.934,1,0.0598,0.001460,0.000095,0.0897,0.334,171.005,200040,4
3,DaBaby,ROCKSTAR (feat. Roddy Ricch),7ytR5pFWmSjzHJIeQkgog4,2020-04-17,97,0.746,0.69000,11,-7.956,1,0.1640,0.247000,0.000000,0.1010,0.497,89.977,181733,4
4,BTS,Dynamite,0t1kP63rueHleOhQkYSXFY,2020-08-28,97,0.746,0.76500,6,-4.410,0,0.0993,0.011200,0.000000,0.0936,0.737,114.044,199054,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11088,Dionne Warwick,"That's What Friends Are For (with Elton John, ...",1cx2NcSnJYdr3kvHWmX7fF,2014-03-03,45,0.683,0.28400,0,-15.207,0,0.0314,0.333000,0.000000,0.0608,0.207,120.233,257360,4
11089,Seven Lions,Worlds Apart,4jdnm77ZmM83Td3fjGUkzw,2014-06-03,45,0.415,0.72300,0,-8.013,0,0.0430,0.017300,0.000155,0.1210,0.209,139.935,376242,4
11090,Death From Above 1979,Trainwreck 1979 - Moulder Mix with in/out fades,06vOVdH94mIEjIgIhHdhdO,2014-09-09,45,0.522,0.93400,5,-5.030,0,0.0617,0.000021,0.003990,0.2530,0.304,135.052,226693,4
11091,Wade Bowen,Sun Shines on a Dreamer,36cmw1oeo5fkzOMTBMRxY6,2014-10-28,45,0.486,0.66700,0,-6.448,1,0.0283,0.160000,0.000016,0.1240,0.255,94.867,253771,4


After filtering, the lowest popularity value is in the mid-40s, so there is a mix of popular and semi-popular songs, but no unpopular songs  
This dataset is not ideal to see what differentiates popular songs from unpopular ones, but hopefully it will still be viable for ML

In [13]:
print(min(df.popularity))

44


We will store the collected data in a CSV file for later use

In [14]:
df.to_csv('spotify-learning.csv',index=False)