# Create Spotify Song Pool

## General Description:
This Jupyter notebook creates a pool of unique songs obtained through Spotify's API. It requests data for every unique song found in a subset of Spotify's Million Playlist Dataset. The subset consists of 10,000 playlists, which contain approximately 170,000 unique songs; these are written to a pool CSV file. We assume this pool is large enough for recommending songs to a playlist. Song and artist features are added to the pool so that they can be used as a source of information for recommendation.


<hr style="height:2pt">

In [1]:
import pandas as pd
import time
import json
import numpy as np
import re

We import a set of functions we created to make the notebook code easier to read. These functions, stored in a .py file called "spotify_api_function_set", wrap Spotipy, a library that communicates with the Spotify API. The Spotipy documentation can be found here (https://spotipy.readthedocs.io/en/latest/). Note that the functions are specific to this project (see the EDA section for a list of the functions inside this .py file).

In [2]:
import spotify_api_function_set as sps #imports set of functions created to use spotify API

We load a subset of 10,000 playlists from the 1 Million Playlist Dataset from Spotify using the json library.

In [3]:
path = 'data'
file_names = ["mpd.slice.0-999", "mpd.slice.1000-1999", "mpd.slice.2000-2999",
              "mpd.slice.3000-3999", "mpd.slice.4000-4999", "mpd.slice.5000-5999",
              "mpd.slice.6000-6999", "mpd.slice.7000-7999", "mpd.slice.8000-8999", "mpd.slice.9000-9999"]

spotify_playlist = []
for file in file_names:
    with open(path+"/"+file+".json", "r") as fd:
        plylist_temp = json.load(fd)
        plylist_temp = plylist_temp.get('playlists')
        spotify_playlist = spotify_playlist + plylist_temp
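
The slice names follow a regular pattern, so the list could also be generated instead of typed out. This is a minimal sketch, assuming the slices are contiguous and each holds 1,000 playlists; the helper name `slice_names` is ours and is not part of the project code.

```python
def slice_names(n_slices, slice_size=1000):
    """Build MPD slice file names such as 'mpd.slice.0-999'."""
    return [
        f"mpd.slice.{start}-{start + slice_size - 1}"
        for start in range(0, n_slices * slice_size, slice_size)
    ]

# Ten slices of 1,000 playlists each cover playlists 0 through 9999.
file_names = slice_names(10)
print(file_names[0], file_names[-1])
```

Changing the argument to `slice_names` is then enough to load a smaller or larger subset.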

We define the number of playlists we wish to use as a source for song pool generation. In this case we will use all 10,000 playlists. From here, for each playlist, we extract each song's Uniform Resource Identifier (URI) and each song's artist URI so we can use it later with Spotify's API.

In [5]:
N = 10000 #Number of playlists to request

track_uri = []
artist_uri = []

for i in range(N):
    track_id = sps.get_playlist_n(spotify_playlist[i], feature = 'track_uri', n_playlist = i)
    artist_id = sps.get_playlist_n(spotify_playlist[i], feature = 'artist_uri', n_playlist = i)  
    
    track_uri.extend(track_id)
    artist_uri.extend(artist_id)
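
The body of `sps.get_playlist_n` is not shown in this notebook. As a rough illustration of the extraction step, the sketch below assumes each playlist is a dict with a `tracks` list whose entries carry `track_uri` and `artist_uri` fields, as in the Million Playlist Dataset. The function `extract_feature` and the toy URIs are hypothetical.

```python
def extract_feature(playlist, feature):
    """Return the given per-track field for every track in one playlist."""
    return [track[feature] for track in playlist.get("tracks", [])]

# Toy playlist mimicking the dataset layout (the URIs are made up).
toy_playlist = {
    "name": "example",
    "tracks": [
        {"track_uri": "spotify:track:A", "artist_uri": "spotify:artist:X"},
        {"track_uri": "spotify:track:B", "artist_uri": "spotify:artist:Y"},
    ],
}

track_ids = extract_feature(toy_playlist, "track_uri")
artist_ids = extract_feature(toy_playlist, "artist_uri")
print(track_ids, artist_ids)
```

Because both lists are built from the same `tracks` list, position `k` in `track_ids` and `artist_ids` always refers to the same song.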

Since we expect many songs to be repeated from playlist to playlist, we store the track and artist URIs in a pandas dataframe in order to drop any duplicates based on track URIs.

In [6]:
data = [np.array(track_uri).T, np.array(artist_uri).T]
data = np.transpose(data)
temp_df = pd.DataFrame(data)
temp_df.columns = ['track_uri', 'artist_uri']

We check the length of the dataframe containing all songs extracted from the 10,000 playlists. There are currently 664,712 track entries in the dataframe, including repeats.

In [7]:
len(temp_df)

664712

Dropping duplicated songs reduces the pool to 170,089 unique songs. We do this before requesting API information in order to prevent unnecessary requests.

In [7]:
temp_df = temp_df.drop_duplicates(subset='track_uri') #Remove duplicates
len(temp_df)

170089
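
By default, pandas' `drop_duplicates` keeps the first row seen for each key. The same idea can be sketched without pandas using a dict, which preserves insertion order. The helper `drop_duplicate_tracks` and the toy data are illustrative only, not the project's code.

```python
def drop_duplicate_tracks(track_uris, artist_uris):
    """Keep the first (track, artist) pair seen for each track URI."""
    first_seen = {}
    for track, artist in zip(track_uris, artist_uris):
        first_seen.setdefault(track, artist)  # later repeats are ignored
    return list(first_seen.keys()), list(first_seen.values())

tracks = ["t1", "t2", "t1", "t3", "t2"]
artists = ["a1", "a2", "a1", "a3", "a2"]
unique_tracks, unique_artists = drop_duplicate_tracks(tracks, artists)
print(len(tracks), "->", len(unique_tracks))
```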

In [8]:
track_uri = list(temp_df.track_uri)
artist_uri = list(temp_df.artist_uri)
sp = sps.create_spotipy_obj() #create spotify object to use to request songs

We request the song and artist features provided by Spotify's API for all unique songs found in the 10,000-playlist subset. We time the request to get a sense of speed. Note that this code took us approximately 22 minutes to run. Feel free to use a smaller playlist subset (N above) to test the code first.

In [9]:
start_time = time.time()
t_features, a_features = sps.get_all_features(track_uri, artist_uri, sp)
print("--- %s seconds ---" % (time.time() - start_time))

--- 1293.6080300807953 seconds ---
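
The internals of `sps.get_all_features` are not shown here. One common way to keep the number of requests low is to send URIs in batches, since Spotify's Web API accepts several IDs per call (up to 100 for audio features and up to 50 for artists). The sketch below assumes that approach; `chunked`, `fetch_in_batches`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real API call so the example runs offline.

```python
import time

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fetch_in_batches(uris, fetch_batch, batch_size):
    """Call `fetch_batch` once per chunk and concatenate the results."""
    results = []
    for batch in chunked(uris, batch_size):
        results.extend(fetch_batch(batch))
    return results

def fake_fetch(batch):
    """Stand-in for an API call: returns one record per requested URI."""
    return [{"uri": uri} for uri in batch]

uris = [f"spotify:track:{i}" for i in range(250)]
start_time = time.perf_counter()
features = fetch_in_batches(uris, fake_fetch, batch_size=100)
elapsed = time.perf_counter() - start_time
n_requests = len(list(chunked(uris, 100)))
print(f"{len(features)} items in {n_requests} requests, {elapsed:.4f} s")
```

With a real client, `fetch_batch` would be replaced by a function that calls the API for one batch and returns a list of results in the same order as the input.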


In [43]:
data = [np.array(t_features).T, np.array(a_features).T]
data = np.transpose(data)
feature_pd = pd.DataFrame(data)
feature_pd.columns = ['t_features', 'a_features']

Before proceeding any further, we check whether Spotify returned any NoneType objects and drop them. When we ran the code, only one song came back as a NoneType object, so our pool was reduced by one song.

In [61]:
feature_pd = feature_pd.dropna()
t_features = list(feature_pd.t_features)
a_features = list(feature_pd.a_features)
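
The effect of `dropna` here is to remove every position where either the track entry or the artist entry is missing, so the two lists stay aligned. A minimal pure-Python sketch of the same filtering follows; the helper `drop_missing` and the toy records are illustrative assumptions.

```python
def drop_missing(track_features, artist_features):
    """Drop positions where either the track or the artist entry is None."""
    kept = [
        (t, a)
        for t, a in zip(track_features, artist_features)
        if t is not None and a is not None
    ]
    kept_tracks = [t for t, _ in kept]
    kept_artists = [a for _, a in kept]
    return kept_tracks, kept_artists

toy_tracks = [{"tempo": 120.0}, None, {"tempo": 95.5}]
toy_artists = [{"name": "A"}, {"name": "B"}, {"name": "C"}]
clean_tracks, clean_artists = drop_missing(toy_tracks, toy_artists)
print(len(toy_tracks), "->", len(clean_tracks))
```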

We create a pandas dataframe containing the unique songs with their features and categorize each song into a genre, just as we did during data exploration and preparation. We also timed this step: for the 10,000 playlists, it took about 4.5 minutes.

In [62]:
songs_df = sps.create_song_df(t_features, a_features, list(range(len(t_features))))

In [65]:
start_time = time.time()
songs_df_unique = sps.genre_generator(songs_df)
print("--- %s seconds ---" % (time.time() - start_time))

--- 265.5861220359802 seconds ---
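
The logic inside `sps.genre_generator` is described in the EDA section and is not reproduced here. Purely as an illustration of how a list of fine-grained artist genres could be collapsed into one coarse label, the sketch below uses keyword matching with `re`. The patterns, their priority order, and the name `coarse_genre` are all assumptions, not the project's actual rules.

```python
import re

# Hypothetical coarse genres and the keywords that map to them.
# Order matters: the first matching pattern wins.
GENRE_PATTERNS = [
    ("rap", re.compile(r"\b(rap|hip hop|trap)\b")),
    ("rock", re.compile(r"\b(rock|metal|punk)\b")),
    ("pop", re.compile(r"\bpop\b")),
]

def coarse_genre(artist_genres, default="other"):
    """Return the first coarse genre whose pattern matches any artist genre."""
    text = " | ".join(g.lower() for g in artist_genres)
    for label, pattern in GENRE_PATTERNS:
        if pattern.search(text):
            return label
    return default

print(coarse_genre(["dance pop", "pop rap"]))
print(coarse_genre(["dance pop"]))
print(coarse_genre(["classical"]))
```

Because the first match wins, an artist tagged with several genres is assigned according to the order of the patterns, which is a design choice worth revisiting if the label distribution looks skewed.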


We clean the data a bit further by keeping only the columns needed for the pool and dropping the rest.

In [66]:
cols = ['song_uri', 'duration_ms', 'time_signature', 'key', 'tempo',
       'energy', 'mode', 'loudness', 'speechiness', 'danceability',
       'acousticness', 'instrumentalness', 'valence', 'liveness',
       'artist_followers', 'artist_name', 'artist_popularity', 'artist_uri','genre']
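# Columns present in the dataframe but not wanted in the pool.
# Set difference (rather than ^) avoids a KeyError if a wanted column is absent.
drop = set(songs_df_unique.columns) - set(cols)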

In [67]:
pool_df = songs_df_unique.drop(drop, axis=1)
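
When choosing columns to drop with set operations, the symmetric difference (`^`) and the plain difference (`-`) agree only if every wanted column is actually present. The toy example below, with made-up column names, shows where they diverge.

```python
wanted = {"song_uri", "tempo", "genre"}
available = {"song_uri", "tempo", "album_uri"}  # 'genre' is missing here

sym_diff = wanted ^ available  # elements in exactly one of the two sets
to_drop = available - wanted   # present in the data but not wanted

print(sorted(sym_diff))
print(sorted(to_drop))
```

Passing a name that is not an existing column to `DataFrame.drop` raises a `KeyError` by default, so the plain difference is the safer choice when the column list may be out of date.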

We check the first few rows to confirm that the pool looks as expected.

In [68]:
pool_df.head()

| | song_uri | duration_ms | time_signature | key | tempo | energy | mode | loudness | speechiness | danceability | acousticness | instrumentalness | valence | liveness | artist_followers | artist_uri | artist_name | artist_popularity | genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | spotify:track:0UaMYEvWZi0ZqiDOoHU3YI | 226864 | 4 | 4 | 125.461 | 0.813 | 0 | -7.105 | 0.121 | 0.904 | 0.0311 | 0.00697 | 0.81 | 0.0471 | 909185 | spotify:artist:2wIVse2owClT7go1WT98tk | Missy Elliott | 76 | rap |
| 1 | spotify:track:6I9VzXrHxO9rA9A5euc8Ak | 198800 | 4 | 5 | 143.04 | 0.838 | 0 | -3.914 | 0.114 | 0.774 | 0.0249 | 0.025 | 0.924 | 0.242 | 5455441 | spotify:artist:26dSoYclwsYLMAKD3tpOr4 | Britney Spears | 82 | pop |
| 2 | spotify:track:0WqIKmW4BTrj3eJFmnCKMv | 235933 | 4 | 2 | 99.259 | 0.758 | 0 | -6.583 | 0.21 | 0.664 | 0.00238 | 0.0 | 0.701 | 0.0598 | 16678709 | spotify:artist:6vWDO969PvNqNYHIOW5v0m | Beyoncé | 87 | pop |
| 3 | spotify:track:1AWQoqb9bSvzTjaLralEkT | 267267 | 4 | 4 | 100.972 | 0.714 | 0 | -6.055 | 0.14 | 0.891 | 0.202 | 0.000234 | 0.818 | 0.0521 | 7341126 | spotify:artist:31TPClRtHm23RisEBtV3X7 | Justin Timberlake | 83 | rap |
| 4 | spotify:track:1lzr43nnXAijIGYnCT8M8H | 227600 | 4 | 0 | 94.759 | 0.606 | 1 | -4.596 | 0.0713 | 0.853 | 0.0561 | 0.0 | 0.654 | 0.313 | 1044532 | spotify:artist:5EvFsr3kj42KNv97ZEnqij | Shaggy | 74 | rap |


Finally, we store the pool in the specified path. We drop the index because it isn't necessary.

In [69]:
pool_df.to_csv(path+'/'+'big_song_pool.csv', index=False)