## Data Collection
I will use three data sources to build a library of song data and analyze my listening preferences. First, I will load the JSON files that contain my extended Spotify listening history. Next, I will pull features for each artist and track in that history from the Spotify API. Lastly, I will collect a library of the same features for thousands of tracks, which I can compare against a user's listening history to surface recommendations.

1- Spotify Listening History
<br>2- Kaggle Song List
<br>3- Spotify API

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing

import time
import pickle

#### Load in Datasets

In [2]:
#load kaggle data
kaggle_df = pd.read_csv('../data/data.csv')

#load extended listening history
extended = pd.read_json('../data/endsong_0.json')
extended1 = pd.read_json('../data/endsong_1.json')
extended2 = pd.read_json('../data/endsong_2.json')
extended = pd.concat([extended, extended1, extended2])

#remove rows with null track ids
extended = extended[extended['spotify_track_uri'].notnull()]
extended.reset_index(drop = True, inplace = True)

#pull track ids into usable format by stripping the 14-character 'spotify:track:' prefix
extended['track_id'] = extended['spotify_track_uri'].str[14:]
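
The track ID is the last segment of a URI of the form `spotify:track:<id>`, which is why dropping the first 14 characters works. Below is a minimal sketch of the same extraction that checks the URI type instead of relying on the prefix length. `track_id_from_uri` is a hypothetical helper name and the sample URIs are illustrative only.

```python
def track_id_from_uri(uri):
    """Return the ID part of a 'spotify:track:<id>' URI, or None for any other URI."""
    parts = uri.split(':')
    # A track URI has exactly three colon-separated parts: scheme, type, id
    if len(parts) == 3 and parts[0] == 'spotify' and parts[1] == 'track':
        return parts[2]
    return None

uri = 'spotify:track:7jgBDJZfhiHnIhekjFUrsX'
print(track_id_from_uri(uri))                    # same value as uri[14:]
print(track_id_from_uri('spotify:episode:abc'))  # not a track, so None
```

Returning `None` for non-track URIs would let podcast episodes be filtered out with the same null check used for missing URIs.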

### Spotify API

In [5]:
#import spotipy and use credentials to authenticate through spotify api
import spotipy

#!ln -s ../config.py config.py 
import config


from spotipy.oauth2 import SpotifyClientCredentials
client_credentials_manager = SpotifyClientCredentials(client_id=config.cid, client_secret=config.secret)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

In [6]:
def track_features(data):
    #audio feature columns to keep, in output order
    cols = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
            'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
            'duration_ms', 'time_signature']
    rows = []
    
    for t in data:
        try:
            results = sp.audio_features(tracks = t)
            #build the whole row before storing it so a failed lookup never leaves a partial row
            row = {'trackID': results[0]['id']}
            row.update({c: results[0][c] for c in cols})
            rows.append(row)
        except Exception:
            #skip tracks with no audio features or a failed request
            pass
    
    return pd.DataFrame(rows, columns = ['trackID'] + cols)
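
`track_features` sends one request per track. The audio features endpoint accepts up to 100 IDs per call, so the same data could be pulled with far fewer requests. The sketch below shows the chunking idea only. `chunked`, `fetch_in_chunks`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `sp.audio_features` so the example runs without credentials.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fetch_in_chunks(track_ids, fetch, size=100):
    """Call `fetch` once per chunk and collect the non-empty results."""
    rows = []
    for chunk in chunked(track_ids, size):
        # Tracks without audio features come back as None, so drop those
        rows.extend(r for r in fetch(chunk) if r is not None)
    return rows

# Stand-in for sp.audio_features (assumption: one dict or None per requested id)
def fake_fetch(ids):
    return [{'id': i, 'tempo': 120.0} if i != 'missing' else None for i in ids]

ids = ['t%d' % n for n in range(250)] + ['missing']
rows = fetch_in_chunks(ids, fake_fetch)
print(len(rows))
```

In the notebook, the collected dictionaries could be passed straight to `pd.DataFrame(rows)`.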

### Recommendations Library
I need to build a library of random tracks from the Spotify API to compare against my history and pick recommendations from.

<br> I will take two main approaches here: 
<br> 1) Pull random playlists from every Spotify category, then pull the tracks from each of those playlists.
<br> 2) Identify artists related to my streaming history and pull the top songs from these artists. 

#### Random Playlists & Tracks from Each Spotify Category
Get Random Playlists - Done
<br> Get tracks and features from playlists - Done
<br> Get artist data - TBD

In [3]:
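#return the ids of up to 20 playlists per Spotify category, starting at playlist offset `row`
def get_playlist_ids(row):
    category_id = []
    cat_playlist = []
    playlist_ids = []

    #fetch the category ids one at a time, starting from category offset 1
    for i in range(1, 55):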
        category_id.append(sp.categories(limit=1, offset = i)['categories']['items'][0]['id'])
    
    category_id.remove('0JQ5DAqbMKFRNXsIvgZF9A')

    for a in category_id:
        cat_playlist.append(sp.category_playlists(category_id = a, offset = row)['playlists']['items'])
     
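    for x in range(0,53):
        for y in range(0,20):
            try:
                playlist_ids.append(cat_playlist[x][y]['id'])
            except (IndexError, TypeError):
                #the category returned fewer than 20 playlists (or an empty entry), so store a 0 placeholder
                playlist_ids.append(0)

    #dedupe the ids; the 0 placeholder is removed once the results for all offsets are combined
    playlist_ids = set(playlist_ids)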
    
    return playlist_ids

In [13]:
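def get_track_ids(playlist_ids):
    #playlist_ids is a single playlist id; page through all of its tracks
    results = sp.playlist_tracks(playlist_ids)
    t = results['items']
    ids = []
    
    while results['next']:
        results = sp.next(results)
        t.extend(results['items'])

    for s in t:
        try:
            ids.append(s['track']['id'])
        except (KeyError, TypeError):
            #entry without track data; store a 0 placeholder that is removed later
            ids.append(0)
        
    return ids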
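
`get_track_ids` relies on the paging pattern of the API: each response carries a `next` field that is empty on the last page. The sketch below isolates that loop. `collect_track_ids` is a hypothetical helper, and the two fake pages stand in for the responses of `sp.playlist_tracks` and `sp.next`; their shape is an assumption based on the fields used above.

```python
def collect_track_ids(first_page, get_next):
    """Walk a paginated result and return the track IDs from every page."""
    ids = []
    page = first_page
    while page is not None:
        for item in page['items']:
            track = item.get('track')
            # Unavailable tracks can appear as None; skip them
            if track and track.get('id'):
                ids.append(track['id'])
        # `next` is None on the last page, which ends the loop
        page = get_next(page) if page['next'] else None
    return ids

# Two fake pages standing in for real API responses
page2 = {'items': [{'track': {'id': 'c'}}, {'track': None}], 'next': None}
page1 = {'items': [{'track': {'id': 'a'}}, {'track': {'id': 'b'}}], 'next': 'page2'}
print(collect_track_ids(page1, lambda page: page2))
```

Skipping empty entries inside the loop keeps one bad item from discarding the rest of the playlist.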

In [45]:
def get_track_artist(tracks):
    results = sp.tracks(tracks)
    #track objects only carry a simplified artist (id and name), so collect each track's primary artist id
    artist_ids = [s['artists'][0]['id'] for s in results['tracks'] if s]
    
    #genres and followers are only on full artist objects, so look those up separately
    #sp.artists accepts up to 50 ids per request
    rows = []
    for a in sp.artists(artist_ids)['artists']:
        rows.append({'artist_id': a['id'], 'artistName': a['name'],
                     'genre': a['genres'], 'followers': a['followers']['total']})
        
    return pd.DataFrame(rows, columns = ['artist_id', 'artistName', 'genre', 'followers'])

In [4]:
offset = [0,20,40,60,80,100]
func_ids = [get_playlist_ids(x) for x in offset]

In [5]:
#combine the playlist ids from every offset into one set
playlist_ids = set().union(*func_ids)

#drop the 0 placeholder; discard does not raise if it is absent
playlist_ids.discard(0)

playlist_ids.remove('37i9dQZF1DWVztgMIUG66M') 

In [30]:
batches = np.array_split(list(playlist_ids), 5)

In [31]:
len(batches[0])

471

In [39]:
#My requests weren't processing, so I separated them into smaller batches and ran them at different times of day to avoid rate limiting
ids = [get_track_ids(x) for x in batches[0]]

ids1 = [get_track_ids(x) for x in batches[1]]

ids2 = [get_track_ids(x) for x in batches[2]]

ids3 = [get_track_ids(x) for x in batches[3]]

ids4 = [get_track_ids(x) for x in batches[4]]
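
Running batches by hand at different times is one way around rate limits. Another common option is to retry a failed call after an increasing wait. The sketch below is a generic version of that idea and is not what this notebook used. `call_with_backoff` and `flaky` are hypothetical names, and catching every exception is a simplification.

```python
import time

def call_with_backoff(fn, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call `fn`, retrying with exponentially growing waits when it raises."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits of 1s, 2s, 4s, ...

# Demo: a call that fails twice, then succeeds
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('rate limited')
    return 'ok'

# Record the waits instead of sleeping so the demo finishes instantly
waits = []
result = call_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

In the notebook, the wrapped call could be something like `lambda: get_track_ids(x)` for each playlist ID.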

In [5]:
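#Flattening lists of track ids into sets
ids = set().union(*ids)
ids1 = set().union(*ids1)
ids2 = set().union(*ids2)
ids3 = set().union(*ids3)
ids4 = set().union(*ids4)

#drop the 0 placeholder; discard does not raise if it is absent
ids.discard(0)
ids1.discard(0)
ids2.discard(0)
ids3.discard(0)
ids4.discard(0)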
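
Each `get_track_ids` call returns a list, so every batch is a list of lists that has to be flattened before deduplication. A minimal sketch of that step using `itertools.chain` follows. `flatten_ids` is a hypothetical helper and the sample data is made up.

```python
from itertools import chain

def flatten_ids(nested, placeholder=0):
    """Flatten a list of ID lists into a set, dropping the failure placeholder."""
    flat = set(chain.from_iterable(nested))
    # discard() is a no-op when the placeholder is absent; remove() would raise KeyError
    flat.discard(placeholder)
    return flat

nested = [['a', 'b'], ['b', 0], ['c']]
print(sorted(flatten_ids(nested)))
```

`chain.from_iterable` walks the inner lists in a single pass, whereas `sum(nested, [])` builds a new list for every inner list it adds.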

In [8]:
%store ids 
%store ids1 
%store ids2 
%store ids3 
%store ids4

Stored 'ids' (set)
Stored 'ids1' (set)
Stored 'ids2' (set)
Stored 'ids3' (set)
Stored 'ids4' (set)


In [15]:
%store -r ids 
%store -r ids1 
%store -r ids2 
%store -r ids3 
%store -r ids4

In [16]:
#Splitting the data into smaller batches for rate limiting
ids_batches = np.array_split(list(ids), 36)
ids1_batches = np.array_split(list(ids1), 36)
ids2_batches = np.array_split(list(ids2), 36)
ids3_batches = np.array_split(list(ids3), 36)
ids4_batches = np.array_split(list(ids4), 36)

In [43]:
#run once before the first batch, then leave commented out so later runs keep appending
# library_feats = []
# counter = 0

In [34]:
for x in range(34,36):
    #pause between batches (but not before the first one) to stay under the rate limit
    if counter > 0:
        time.sleep(30)
    library_feats.append(track_features(ids4_batches[x]))
    counter += 1
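
The same pause-between-batches loop appears several times in this notebook. A small helper could capture it once. The sketch below is one possible shape. `run_batches` is a hypothetical name, and the `sleep` parameter exists only so the demo can run without waiting.

```python
import time

def run_batches(batches, fn, pause=30, sleep=time.sleep):
    """Apply `fn` to each batch, pausing `pause` seconds between consecutive calls."""
    results = []
    for i, batch in enumerate(batches):
        if i > 0:
            sleep(pause)  # no pause before the first batch
        results.append(fn(batch))
    return results

# Demo with a recording sleep function so nothing actually waits
pauses = []
out = run_batches([[1, 2], [3], [4, 5]], fn=len, pause=30, sleep=pauses.append)
print(out, pauses)
```

With this helper, a call such as `run_batches(ids4_batches[34:36], track_features)` would replace the loop and its counter.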

In [77]:
len(library_feats)

178

In [11]:
#%store library_feats

In [2]:
%store -r library_feats

In [5]:
random_track_feats = pd.concat(library_feats)
random_track_feats.drop_duplicates(subset = 'trackID', inplace = True)
random_track_feats

index,trackID,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,7jgBDJZfhiHnIhekjFUrsX,0.663,0.736,6,-5.519,1,0.0283,0.2080,0.000037,0.1140,0.480,114.969,210240,4
1,0hcW7qPqWWJc8189OaqQvX,0.722,0.327,0,-9.612,0,0.0260,0.2990,0.000661,0.0549,0.266,112.249,318467,4
2,1RI3FXTNkgNqhEIawWkeDS,0.260,0.465,4,-8.093,1,0.0333,0.7800,0.000000,0.3330,0.324,191.624,193802,4
3,6JLcz9UGiVxAmEZXlCucn5,0.855,0.746,3,-3.777,0,0.2320,0.4020,0.000000,0.1460,0.375,95.026,321987,4
4,1ciemDCppxQbYhXzqMoBV0,0.717,0.866,5,-4.740,1,0.0434,0.1890,0.000000,0.0640,0.787,130.053,121860,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
920,3HOv0kTPDcaCDeIzO0Tzbm,0.641,0.336,0,-10.667,1,0.0551,0.7980,0.000877,0.1040,0.272,123.748,168492,4
921,45tHS9JkfVPnrK2kfQEX2r,0.448,0.514,9,-9.587,1,0.0258,0.0631,0.000605,0.0939,0.207,78.174,292973,4
923,1EHWTyklqY1X4UGRylJTU3,0.817,0.506,5,-6.996,0,0.1660,0.1220,0.000000,0.0846,0.807,151.975,154998,4
924,0w2MKVScVo0xCPQHJfdcgz,0.436,0.653,9,-6.235,1,0.1300,0.0671,0.000000,0.1970,0.425,94.455,216000,4


In [6]:
play_batches = np.array_split(random_track_feats['trackID'], 3500)
art_feat = []
counter = 0

In [7]:
len(play_batches[0])

38

In [None]:
for x in range(0,100):
    try:
        #pause before every request to stay under the rate limit
        time.sleep(5)
        art_feat.append(get_track_artist(play_batches[x]))
    except Exception:
        #skip batches whose request fails
        pass

In [37]:
len(art_feat)

102

#### Get Top Songs from Artists Related to Streaming History
Artists Data - Done
<br> Top Songs Per Artists - TBD

In [18]:
def related_artists(artist_ids):
    #artist_ids is a single artist id
    results = sp.artist_related_artists(artist_ids)
    rows = []
    for s in results['artists']:
        try:
            rows.append({'artist_id': s['id'], 'artistName': s['name'],
                         'genre': s['genres'], 'followers': s['followers']['total']})
        except (KeyError, TypeError):
            #skip artists with missing fields so the columns stay aligned
            pass
    
    return pd.DataFrame(rows, columns = ['artist_id', 'artistName', 'genre', 'followers'])

In [23]:
len(history['artist_id'].unique())

636

In [24]:
rel_art = [related_artists(x) for x in history['artist_id'].unique()]

#note: this assignment reuses the name of the related_artists function,
#so the function must be redefined before it can be called again
related_artists = pd.concat(rel_art, axis = 0)

related_artists.drop_duplicates(subset = 'artist_id', inplace = True)

related_artists.reset_index(inplace = True, drop = True)

related_artists.head()

In [30]:
%store related_artists

Stored 'related_artists' (DataFrame)


In [32]:
related_artists

index,artist_id,artistName,genre,followers
0,1A9o3Ljt67pFZ89YtPPL5X,Snoh Aalegra,"[alternative r&b, neo soul, r&b, scandinavian ...",881683.0
1,3Y7RZ31TRPVadSFVy1o8os,H.E.R.,"[pop, r&b, rap, urban contemporary]",5661164.0
2,30DhU7BDmF4PH0JVhu8ZRg,Sabrina Claudio,"[pop, r&b]",1443420.0
3,5aMIbwZQvP2MHPMVC5zCGj,ODIE,"[canadian contemporary r&b, indie r&b]",241438.0
4,3tlXnStJ1fFhdScmQeLpuG,Brent Faiyaz,"[dmv rap, hip hop, pop, r&b, rap]",2889860.0
...,...,...,...,...
5928,2YEnrpAWWaNRFumgde1lLH,Oden & Fatzo,[disco house],16982.0
5929,4pSMnAlD8JVEW3eZDuaQH8,Anish Kumar,[],6794.0
5930,1uF7AFfGahplhiaHEy9NNl,Loods,"[australian dance, disco house]",10598.0
5931,4W991QdgKWX4TO864ypInA,Eats Everything,"[deep disco house, house, raw techno]",80982.0


In [None]:
# search for all tracks from related artists and artists from the random generated playlist
# pull more tracks using offset from random playlist generator
# goal will be to get 500k tracks
# once I have 500k tracks ids pull the track features that are needed for modeling 

#### Pulling in the Genre and Followers for Kaggle Dataset - Done

In [9]:
#Using a different function as the kaggle data doesn't include artist ID
def kaggle_artist_features(artist):
    #search by name with the artist: field filter and keep only the top match
    results = sp.search(q=f'artist:{artist}', type='artist', limit=1)
    t = results['artists']['items']
    ids = []
    name = []
    genre = []
    followers = []
    
    try:
        for s in t: 
            ids.append(s['id'])
            genre.append(s['genres'])
            followers.append(s['followers']['total'])
            name.append(s['name'])   
    except:
        ids.append(0)
        
    art_feat = pd.DataFrame(ids, columns = ['artist_id'])
    art_feat['artists'] = artist
    art_feat['artistName'] = name
    art_feat['genre'] = genre
    art_feat['followers'] = followers
    
    return art_feat

In [90]:
#I ran into issues trying to pull artist data so I had to run API requests in smaller batches
#I also had to filter out strings that were too long to be searched

artist_batches = np.array_split(kaggle_df['artists'].loc[(kaggle_df['artists'].str.len() < 200)].unique(), 150)

kaggle_artists = []

In [17]:
for x in range(140,150):
    #pause before each batch to stay under the rate limit
    time.sleep(30)
    kaggle_artists.append([kaggle_artist_features(a) for a in artist_batches[x]])

In [71]:
#Condensing the artists data into one dataframe with the track data from kaggle_df
#each batch is a list of one-artist dataframes, so concatenate within each batch first
batch_df = [pd.concat(batch) for batch in kaggle_artists]

kaggle_artist_df = pd.concat(batch_df)

kaggle = pd.merge(left = kaggle_df, right = kaggle_artist_df, on = 'artists', how = 'left')

#### Extended Streaming History

In [19]:
#redefines get_track_artist for the listening history so that the track id and name are kept
#relies on artist_features, which is defined in the next cell
def get_track_artist(tracks):
    results = sp.tracks(tracks)
    t = results['tracks']
    ids = []
    track_id = list(tracks)
    track_name = []
    
    try:
        for s in t: 
            #use the first listed artist of each track
            ids.append(s['artists'][0]['id'])
            track_name.append(s['name']) 
    except (KeyError, IndexError, TypeError):
        ids.append(0)
    
    df = artist_features(ids)
    df['track_id'] = track_id
    df['trackName'] = track_name
    
    return df

In [7]:
def artist_features(artist):
    results = sp.artists(artist)
    t = results['artists']
    ids = []
    artist_id = [x for x in artist]
    name = []
    genre = []
    popularity = []
    followers = []
    
    
    try:
        for s in t: 
            ids.append(s['id'])
            genre.append(s['genres'])
            popularity.append(s['popularity'])
            followers.append(s['followers']['total'])   
    except:
        ids.append(0)
        
    art_feat = pd.DataFrame(ids, columns = ['artist_id'])
    art_feat['artist_id'] = artist_id
    art_feat['genre'] = genre
    art_feat['popularity'] = popularity
    art_feat['followers'] = followers
    
    return art_feat

In [None]:
ext_batches = np.array_split(extended['track_id'], 1000)
art_feat = []
counter = 0

In [155]:
for x in range(900,1000):
    #pause before each request to stay under the rate limit
    time.sleep(5)
    art_feat.append(get_track_artist(ext_batches[x]))

In [32]:
#Pulling the track features for each song in extended history
ext_batches = np.array_split(extended['track_id'], 100)
track_feat = []
counter = 0

In [58]:
for x in range(95,100):
    #pause before each batch to stay under the rate limit
    time.sleep(30)
    track_feat.append(track_features(ext_batches[x]))

In [185]:
#Merge the track and artist features pulled from the Spotify API
track_df = pd.concat(track_feat)
art_df = pd.concat(art_feat)
track_df.drop_duplicates(inplace = True)
art_df.drop_duplicates(subset = 'track_id', inplace = True)

extended_spotify = pd.merge(track_df, art_df, how = 'left', left_on = 'trackID', right_on ='track_id')
extended = pd.merge(extended, extended_spotify, how = 'left', left_on = 'track_id', right_on = 'trackID')

In [194]:
#save the dataframes as CSVs so that other notebooks can read them
kaggle.to_csv('../data/kaggle.csv')
history.to_csv('../data/streaminghistory.csv')
extended.to_csv('../data/extendedhistory.csv')

### User Input Functions
First I will build a function where a user can input a track and artist and receive features associated with the song and artist. I will pickle and use this function in later notebooks when serving recommendations.

In [1]:
#User Input Functions
def get_users_track(artist, track):
    
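    #field filters take the form field:value and are separated by a space
    #indexing with [0] raises IndexError if the search returns no match
    results = sp.search(q=f"artist:{artist} track:{track}", type="track", limit =1)['tracks']['items'][0]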
    artist_id = results['album']['artists'][0]['id']
    track_id = results['id']
    trackName = results['name']
    
    #get artist and track features
    user_artists = artist_features([artist_id])
    user_track = track_features([track_id])
    user_table = pd.concat([user_artists, user_track], axis = 1)
    user_table.drop(columns = ['trackID','artist_id', 'duration_ms', 'time_signature'], inplace = True)
    user_table.index = [trackName]
        
    return user_table
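
The search call in `get_users_track` combines two field filters, `artist:` and `track:`, in one query string. A minimal sketch of assembling such a string is shown below. `build_track_query` is a hypothetical helper, and the format assumes each filter is written as `field:value` with a space between filters.

```python
def build_track_query(artist, track):
    """Build a search string that filters on both artist and track name."""
    # Strip stray whitespace so user input does not break the field:value form
    return f"artist:{artist.strip()} track:{track.strip()}"

query = build_track_query(' billie eilish', 'happier than ever ')
print(query)
```

Keeping the query construction in one place makes it easy to check that the artist and the track end up in the right filter.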

In [None]:
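#note: pickle stores a function by reference (its module and name), not its code,
#so the notebook that loads this file must define or import get_users_track itself
with open('../models/get_users_track.pkl', 'wb') as f:
    pickle.dump(get_users_track, f)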
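
Pickling a function does not save its body. The sketch below illustrates this with a standard library function rather than `get_users_track`, so it runs anywhere: the payload contains only the module and function name, and loading it looks the function up again by that name.

```python
import math
import pickle

payload = pickle.dumps(math.sqrt)
# The payload holds the module and function name, not the function body
print(b'math' in payload, b'sqrt' in payload)

restored = pickle.loads(payload)
# Loading performs a lookup by name, so it returns the very same object
print(restored is math.sqrt)

# A function without an importable name cannot be pickled at all
try:
    pickle.dumps(lambda x: x)
except (pickle.PicklingError, AttributeError) as err:
    print('lambda failed:', type(err).__name__)
```

For sharing `get_users_track` across notebooks, moving it and its helpers into a small `.py` module that each notebook imports may be a simpler alternative.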

In [18]:
#arguments are (artist, track)
sample = get_users_track('billie eilish', 'happier than ever')

In [20]:
%store sample

Stored 'sample' (DataFrame)
