# Spotify API Data Collection  

### In this notebook, we will fetch music data from the Spotify API using the spotipy library  

The Spotify API lets us collect data about music on Spotify, including track metadata and audio features  
Searching for music returns songs that are currently popular or have recently been popular  
Note that searching for an earlier period does not return the tracks that were popular at that time, but tracks released in that period that are popular now  
To find tracks that were popular in earlier years, it would likely be easier to use Billboard charts for the desired timeframe

In [1]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
import requests

This creates a spotipy client authenticated with Spotify Developer account credentials  
The client ID and secret are stored in environment variables on the system (by default, spotipy reads `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET`)

In [2]:
sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())
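
Because the client above takes its credentials from the environment, a missing variable only shows up as an authentication error later. The following is a minimal sketch of a pre-flight check, assuming the default variable names that spotipy reads; the helper `missing_credentials` is a hypothetical name introduced here for illustration and is not part of spotipy.

```python
import os

# Environment variable names that spotipy's SpotifyClientCredentials reads by default
REQUIRED_VARS = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")


def missing_credentials(environ=os.environ):
    """Return the names of required variables that are unset or empty (hypothetical helper)."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_credentials()
if missing:
    print("Set these environment variables before creating the client:", ", ".join(missing))
```

Running this check before creating the client makes a configuration problem visible immediately instead of during the first API request.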

## Data Collection

We will store the names of the features we want to collect in a list and create dictionaries to temporarily hold the data  
The metadata and analysis dictionaries refer to the same lists as the main features dictionary; they are kept separate to make inserting data easier, as the data is gathered in two separate steps

In [3]:
features_list = ['artist', 'name', 'id', 'release_date', 'popularity', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']
metadata_list = features_list[:5]
analysis_list = features_list[5:]

features = {feature : [] for feature in features_list}
metadata = {feature : features[feature] for feature in metadata_list}
analysis = {feature : features[feature] for feature in analysis_list}
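
The cell above relies on the three dictionaries sharing the same list objects, so a value appended through `metadata` or `analysis` also appears in `features`. A small sketch with a shortened feature list and invented sample values shows this behavior:

```python
# Shortened feature list, for illustration only
features_list = ['artist', 'name', 'danceability', 'energy']
metadata_list = features_list[:2]
analysis_list = features_list[2:]

features = {feature: [] for feature in features_list}
# These dictionaries store references to the lists in `features`, not copies
metadata = {feature: features[feature] for feature in metadata_list}
analysis = {feature: features[feature] for feature in analysis_list}

# Sample values invented for this sketch
metadata['artist'].append('Example Artist')
analysis['energy'].append(0.5)

shares_list = metadata['artist'] is features['artist']
print(shares_list)
print(features)
```

Because of this sharing, `features` can be passed to pandas directly once the loop finishes, with no merge step.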

We will be looking at up to 2000 songs released from 2018 to 2020, as the Spotify API only allows searches to access the first 2000 results  
Data is gathered in increments of 50, the maximum number of results the Spotify API returns per search request

In [4]:
start_year = 2018
end_year = 2020
search_limit = 50
num_tracks = 2000

for i in range(0, num_tracks, search_limit):
    print("\r", "iteration {}".format(i // search_limit), end="")
    try:
        results = sp.search(q='year:{}-{}'.format(start_year, end_year), limit=search_limit, type='track', offset=i)
        tracks = results['tracks']['items']
        # Fetch audio features before storing anything so metadata and analysis stay aligned
        feature_results = sp.audio_features([t['id'] for t in tracks]) if tracks else []
    except (requests.HTTPError, spotipy.SpotifyException):
        # Skip this batch entirely if either request fails
        continue
    for t, feature in zip(tracks, feature_results):
        # audio_features returns None for tracks without analysis data
        if feature is None:
            continue
        for k, v in metadata.items():
            if k == 'artist':
                v.append(t['artists'][0]['name'])
            elif k == 'release_date':
                v.append(t['album'][k])
            else:
                v.append(t[k])
        for k, v in analysis.items():
            v.append(feature[k])

 iteration 39
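
The final counter value of 39 follows from the paging arithmetic: 2000 results fetched 50 at a time means 40 requests, numbered 0 through 39. A short sketch that reproduces only the offsets, without calling the API:

```python
search_limit = 50   # results per request
num_tracks = 2000   # total results to page through

# Offsets passed to each search request
offsets = list(range(0, num_tracks, search_limit))

num_requests = len(offsets)                       # 40
last_offset = offsets[-1]                         # 1950
last_iteration = last_offset // search_limit      # 39

print(num_requests, last_offset, last_iteration)  # prints "40 1950 39"
```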

## Data Storage

Once the dictionaries are filled, the data will be moved to a pandas DataFrame, which makes filtering, sorting, and analysis easier  
The songs are sorted by popularity to see which songs are currently the most popular  

Some songs have a popularity of 0, likely because Spotify's popularity algorithm only takes recent streaming data into account  
Old songs with a popularity of 0 may have been popular earlier, but they aren't streamed as much anymore  
Some songs are too new to have a popularity value yet, so they also have a popularity of 0  

We will filter out songs with popularity 0 and new songs  
As of August 10, 2020, a new song is defined as one that was released on or after August 1, 2020

In [5]:
df = pd.DataFrame(features)
# Filter out songs with popularity 0 and very new songs
df = df[(df.popularity != 0) & (df.release_date < '2020-08-01')].sort_values(by='popularity', ascending=False).reset_index(drop=True)
df

| | artist | name | id | release_date | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DaBaby | ROCKSTAR (feat. Roddy Ricch) | 7ytR5pFWmSjzHJIeQkgog4 | 2020-04-17 | 100 | 0.746 | 0.690 | 11 | -7.956 | 1 | 0.1640 | 0.24700 | 0.000000 | 0.1010 | 0.497 | 89.977 | 181733 | 4 |
| 1 | The Weeknd | Blinding Lights | 0VjIjW4GlUZAMYd2vXMi3b | 2020-03-20 | 99 | 0.514 | 0.730 | 1 | -5.934 | 1 | 0.0598 | 0.00146 | 0.000095 | 0.0897 | 0.334 | 171.005 | 200040 | 4 |
| 2 | Jawsh 685 | Savage Love (Laxed - Siren Beat) | 1xQ6trAsedVPCdbtDAmk0c | 2020-06-11 | 97 | 0.767 | 0.481 | 0 | -8.520 | 0 | 0.0803 | 0.23400 | 0.000000 | 0.2690 | 0.761 | 150.076 | 171375 | 4 |
| 3 | Harry Styles | Watermelon Sugar | 6UelLqGlWMcVH1E5c4H7lY | 2019-12-13 | 96 | 0.548 | 0.816 | 0 | -4.209 | 1 | 0.0465 | 0.12200 | 0.000000 | 0.3350 | 0.557 | 95.390 | 174000 | 4 |
| 4 | Topic | Breaking Me | 3H7ihDc1dqLriiWXwsc2po | 2019-12-19 | 95 | 0.789 | 0.720 | 8 | -5.652 | 0 | 0.2180 | 0.22300 | 0.000000 | 0.1290 | 0.664 | 122.031 | 166794 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1899 | J Balvin | Sigo Extrañándote | 5Uc9brIj5A76d4TYpLJt94 | 2020-07-31 | 41 | 0.872 | 0.865 | 9 | -4.248 | 1 | 0.2440 | 0.17400 | 0.000034 | 0.1590 | 0.935 | 93.035 | 202373 | 4 |
| 1900 | J Balvin | Rojo | 5Gl3YVEjwOOQmLTsA2ajS7 | 2020-07-31 | 40 | 0.608 | 0.610 | 11 | -3.862 | 1 | 0.1500 | 0.14100 | 0.000048 | 0.0872 | 0.377 | 172.305 | 150853 | 4 |
| 1901 | J Balvin | Brillo | 614Z2GSe3D7ckYkGGgTZag | 2020-07-31 | 40 | 0.524 | 0.384 | 9 | -10.049 | 0 | 0.3600 | 0.89000 | 0.000000 | 0.0964 | 0.768 | 145.947 | 159573 | 4 |
| 1902 | J Balvin | Azul | 7aQiyjp2K1wWeS7p5mXoOU | 2020-07-31 | 39 | 0.842 | 0.828 | 6 | -2.495 | 0 | 0.0696 | 0.10100 | 0.001790 | 0.0579 | 0.653 | 94.019 | 206373 | 4 |
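
The date filter above compares `release_date` values as strings rather than as dates. This works because ISO-formatted dates sort lexicographically in chronological order. One caveat is that the Spotify API can report a release date with only year or month precision (indicated by the album's `release_date_precision` field), and a shorter string always compares as earlier than a longer string it prefixes. A small sketch with invented sample dates shows which values pass the cutoff:

```python
cutoff = '2020-08-01'

# Sample values invented for illustration; the last two have year and month precision
release_dates = ['2020-07-31', '2020-08-01', '2020-08-07', '2019', '2020-08']

# Plain string comparison, as in the DataFrame filter
kept = [d for d in release_dates if d < cutoff]

print(kept)  # prints "['2020-07-31', '2019', '2020-08']"
```

Under these assumptions, a month-precision value such as `'2020-08'` passes the filter even though it refers to August 2020, so a stricter filter would need to take the precision into account.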


After filtering, the lowest popularity value is 39, so there is a mix of popular and semi-popular songs, but not unpopular songs  
This dataset is not ideal to see what differentiates popular songs from unpopular ones, but hopefully it will still tell us about characteristics of popular songs and what makes them more successful than semi-popular songs

In [6]:
print(df.popularity.min())

39


We will store our collected data in a CSV file for later use

In [7]:
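import os

# Create the output directory if it does not exist, and build the path portably instead of hard-coding a backslash
os.makedirs('data', exist_ok=True)
df.to_csv(os.path.join('data', 'spotify.csv'), index=False)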

And we're done.