#Generating Spotify Track Data
##In this notebook, I collect track details for 80,000 Spotify tracks through the Spotify Web API; these tracks will be used for our analysis. See Spotify_Clean_Data.ipynb for next steps.

##Package Setup

In [None]:
# Docs
# https://spotipy.readthedocs.io/en/2.7.1/#
# https://console.firebase.google.com/project/spotify-f1cf5/database/firestore/data~2F
# https://developer.spotify.com/documentation/web-api/

In [None]:
!pip install spotipy

In [None]:
import spotipy
import numpy as np
import pandas as pd
from spotipy.oauth2 import SpotifyClientCredentials
from google.colab import files

In [None]:
# FIREBASE IMPLEMENTATION

#!pip install firebase
#!pip install firestore
# uploaded = files.upload() # upload cred
# import os
# os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/content/spotify-f1cf5-firebase-adminsdk-xxrd4-763e07c601.json'
#from google.cloud import firestore
#import firebase_admin
#from firebase_admin import credentials
# cred = credentials.Certificate('/content/spotify-f1cf5-firebase-adminsdk-xxrd4-763e07c601.json')
# firebase_admin.initialize_app(cred, {
#     'databaseURL': 'https://spotify-f1cf5.firebaseio.com'
# })

##Authenticating and initializing lists

In [None]:
# Upload a JSON file named spotify_credentials.json of the form
# {"client_id": "<your client id>", "client_secret": "<your client secret>"}
uploaded = files.upload()
import json

with open('spotify_credentials.json') as json_file:
    creds = json.load(json_file)


c_id = creds['client_id']
c_secret = creds['client_secret']

# Authenticate with Spotify and hand the credentials manager to Spotipy
client_credentials_manager = SpotifyClientCredentials(client_id=c_id, client_secret=c_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

Saving spotify_credentials.json to spotify_credentials (1).json
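The cell above expects the uploaded file to hold a JSON object with the keys client_id and client_secret. As a minimal sketch (the helper name load_credentials and the placeholder values are assumptions for illustration, not part of the original notebook), the file can be checked for both keys before it is passed to SpotifyClientCredentials:

```python
import json
import os
import tempfile

REQUIRED_KEYS = ('client_id', 'client_secret')

def load_credentials(path):
    """Hypothetical helper: load the credentials JSON and check its keys."""
    with open(path) as json_file:
        creds = json.load(json_file)
    missing = [k for k in REQUIRED_KEYS if k not in creds]
    if missing:
        raise KeyError(f"credentials file is missing: {missing}")
    return creds

# Demo with a throwaway file holding placeholder values (not real credentials)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'spotify_credentials.json')
    with open(demo_path, 'w') as f:
        json.dump({'client_id': 'YOUR_CLIENT_ID',
                   'client_secret': 'YOUR_CLIENT_SECRET'}, f)
    demo_creds = load_credentials(demo_path)
```

Note that the output above shows the upload being saved as "spotify_credentials (1).json", so the file actually opened is the earlier upload with the unsuffixed name. Iterating over the keys of `uploaded` is one way to find the name a fresh upload received.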


In [None]:
# Initializing lists for basic track details
artist_name = []
artist_id = []
album_name = []
album_id = []
track_name = []
track_id = []
track_pop = []
track_year = []
track_spotify_genre = []

# Initializing lists for track audio features
key = []
acousticness = []
danceability = []
duration_ms = []
energy = []
instrumentalness = []
liveness = []
loudness = []
mode = []
speechiness = []
tempo = []
time_signature = []
valence = []

# Initializing list for artist genres
artist_genre = []


##Calling Spotify API to get track details
In this section, I begin requesting data from the Spotify API using the Spotipy library. I iterate over 8 popular genres and, for each genre, over the 10 release years from 2011 through 2020. For each of these 80 genre/year subsets, I page through the first 1,000 search results, 50 at a time, which produces a total of 80,000 songs. The sp.search() method only returns basic info about a track (artist, album, popularity, etc.), so this is just the first call we have to make to get track info.

In [None]:
# Get collection of song ids and basic track details - 50 tracks per call
for genre in ['classical', 'country', 'hip-hop', 'house', 'indie', 'pop', 'r&b', 'rock']:
    for year in range(2011, 2021):
        query = "genre:" + genre + " year:" + str(year)
        for i in range(0, 1000, 50):  # 20 pages of 50 results per genre/year query
            track_results = sp.search(q=query, type='track', limit=50, offset=i)
            for item in track_results['tracks']['items']:
                # artist fields are taken from the album's first listed artist
                artist_name.append(item['album']['artists'][0]['name'])
                artist_id.append(item['album']['artists'][0]['id'])
                track_name.append(item['name'])
                track_id.append(item['id'])
                track_pop.append(item['popularity'])
                track_year.append(year)
                track_spotify_genre.append(genre)
                album_name.append(item['album']['name'])
                album_id.append(item['album']['id'])
len(track_name)

80000
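The total of 80,000 follows directly from the loop bounds. A small sketch, using only the genres, years, and page size from the cell above (the variable names here are illustrative), enumerates the planned requests without calling the API:

```python
from itertools import product

genres = ['classical', 'country', 'hip-hop', 'house', 'indie', 'pop', 'r&b', 'rock']
years = range(2011, 2021)      # 2011 through 2020
offsets = range(0, 1000, 50)   # 20 pages per genre/year query
page_size = 50

# One search request per (genre, year, offset) combination
requests_planned = list(product(genres, years, offsets))
max_tracks = len(requests_planned) * page_size

print(len(requests_planned), max_tracks)  # 1600 80000
```

The figure is an upper bound on what the loop can collect: a query that matches fewer than 1,000 tracks would contribute fewer rows, and the same track could in principle appear under more than one genre.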

While we now have the basic track details, we still need to get the audio features of each track we collected. Here, we call sp.audio_features() on batches of 50 track IDs and record the results. Tracks for which Spotify returns no audio features get NaN in every feature column.

In [None]:
# Get audio features (high-level audio analysis) for each track - 50 tracks per call

for i in range(0,len(track_id),50):
  track_features = sp.audio_features(track_id[i:i+50]) # returns features of the next 50 tracks
  for j in range(0,len(track_features)): # iterate over the tracks in this batch
    if track_features[j] is None:  # track has no audio features from Spotify
      track_features[j] = {}       # empty dict, so every .get() below falls back to NaN
    key.append(track_features[j].get('key', np.nan))
    acousticness.append(track_features[j].get('acousticness', np.nan))
    danceability.append(track_features[j].get('danceability', np.nan))
    duration_ms.append(track_features[j].get('duration_ms', np.nan))
    energy.append(track_features[j].get('energy', np.nan))
    instrumentalness.append(track_features[j].get('instrumentalness', np.nan))
    liveness.append(track_features[j].get('liveness', np.nan))
    loudness.append(track_features[j].get('loudness', np.nan))
    mode.append(track_features[j].get('mode', np.nan))
    speechiness.append(track_features[j].get('speechiness', np.nan))
    tempo.append(track_features[j].get('tempo', np.nan))
    time_signature.append(track_features[j].get('time_signature', np.nan))
    valence.append(track_features[j].get('valence', np.nan))
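The batching and missing-value handling in the cell above can be tried out without network access. The sketch below is an illustration only: chunked and fake_audio_features are hypothetical helpers, and the stand-in merely imitates the list-with-None shape that the cell above handles.

```python
import math

def chunked(seq, size):
    """Yield consecutive slices of seq with at most `size` items each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

def fake_audio_features(ids):
    """Hypothetical stand-in for sp.audio_features: None for ids ending in 'x'."""
    return [None if i.endswith('x') else {'id': i, 'tempo': 120.0} for i in ids]

ids = ['a1', 'b2x', 'c3', 'd4', 'e5']   # made-up track ids
tempos = []
for batch in chunked(ids, 2):
    for features in fake_audio_features(batch):
        features = features or {}       # a missing track becomes an empty dict
        tempos.append(features.get('tempo', math.nan))
```

Because every track appends exactly one value, `tempos` stays aligned with `ids` even when a track has no features, which is what lets the lists be combined column by column later.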

In [None]:
# Get artist genres - 50 artists per call
for i in range(0,len(artist_id),50):
  artist_features = sp.artists(artist_id[i:i+50]) # returns details of the next 50 artists
  for j in range(0,len(artist_features['artists'])): # iterate over the artists in this batch
    if artist_features['artists'][j] is None: # artist not found
      artist_features['artists'][j] = {}
    artist_genre.append(artist_features['artists'][j].get('genres', np.nan))
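Since many tracks share an artist, the artist_id list contains repeats, and each repeat is looked up again. One possible variation, sketched here with a hypothetical stand-in for sp.artists and made-up IDs, is to look up each distinct artist once and map the genres back onto the per-track list:

```python
artist_id_demo = ['A', 'B', 'A', 'C', 'B']   # made-up ids, one per track

# dict.fromkeys drops repeats while keeping first-seen order
unique_ids = list(dict.fromkeys(artist_id_demo))

def fake_artists(ids):
    """Hypothetical stand-in returning the {'artists': [...]} shape used above."""
    return {'artists': [{'id': i, 'genres': ['genre-of-' + i]} for i in ids]}

genres_by_artist = {}
for start in range(0, len(unique_ids), 50):
    response = fake_artists(unique_ids[start:start + 50])
    for artist in response['artists']:
        if artist is not None:
            genres_by_artist[artist['id']] = artist.get('genres', [])

# Map back so the result has one entry per track, like artist_genre
artist_genre_demo = [genres_by_artist.get(i) for i in artist_id_demo]
```

This is a design alternative rather than what the notebook does; the per-track result has the same length as the input, so it would slot into the same DataFrame column.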

##Outputting data for next steps
Finally, we combine the lists into a single dataframe and export it to a CSV file for next steps. See Spotify_Clean_Data.ipynb for data cleaning steps.

In [None]:
df = pd.DataFrame({"track_id":track_id,
                   "track_name":track_name,
                   "track_year":track_year,
                   "track_spotify_genre":track_spotify_genre,
                   "art_name":artist_name,
                   "art_id":artist_id,
                   "art_genre":artist_genre,
                   "alb_name":album_name,
                   "alb_id":album_id,
                   "track_pop":track_pop,
                   "key":key,
                   "acousticness":acousticness,
                   "danceability":danceability,
                   "duration_ms":duration_ms,
                   "energy":energy,
                   "instrumentalness":instrumentalness,
                   "liveness":liveness,
                   "loudness":loudness,
                   "mode":mode,
                   "speechiness":speechiness,
                   "tempo":tempo,
                   "time_signature":time_signature,
                   "valence":valence
                   })
#track_dict = df.to_dict(orient='records')
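Building a dataframe from a dict of lists requires every list to have the same length, so a failed or partial API loop shows up as an error at this step. A quick sketch for locating the offending column beforehand, using made-up stand-in data rather than the real lists:

```python
# Made-up stand-ins for the notebook's lists; 'art_genre' is deliberately short
columns = {
    'track_id': ['t1', 't2', 't3'],
    'tempo': [120.0, 98.5, 101.2],
    'art_genre': [['pop'], ['rock']],
}

lengths = {name: len(values) for name, values in columns.items()}
expected = lengths['track_id']

# Any column whose length differs from track_id would break the dataframe
mismatched = {name: n for name, n in lengths.items() if n != expected}
print(mismatched)  # {'art_genre': 2}
```

An empty result means the lists line up and can be combined as in the cell above.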

In [None]:
df.to_csv('full_trackset.csv')
files.download('full_trackset.csv')

In [None]:
#for i, item in enumerate(track_dict):  
#  firestore.Client().collection('tracks').document(str(i)).set(item)